eScience 2019
Friday, September 27 • 12:00pm - 12:30pm
SDM: A Scientific Dataset Delivery Platform

Illyoung Choi (University of Arizona), Jude Nelson (Blockstack PBC), Larry Peterson (Open Networking Foundation), and John Hartman (University of Arizona)

Scientific computing is becoming more data-centric and more collaborative, so increasingly large datasets are being transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is increasingly difficult, and data transfer time can account for a significant portion of a workflow's overall running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access does not require staging datasets to local file systems before computing on them, and it enables computation and I/O to overlap. SDM also offers a simple interface for locating and accessing datasets. To validate the usefulness of SDM, we ran realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM outperforms existing data access methods when configured with a CDN. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods. Its performance is even comparable to local storage: its elapsed time is only 9% longer than with local HDD storage and 18% longer than with local SSD storage. Together, its performance and ease of use make SDM an attractive platform for running scientific workflows on remote datasets.
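
The difference between staging and on-demand access can be illustrated with a minimal Python sketch. This is not SDM's interface or implementation: `CHUNKS`, `fetch_chunk`, `process`, `staged`, and `on_demand` are hypothetical names, the "remote dataset" is an in-memory list, and the one-chunk prefetch is just one simple way to overlap computation with I/O.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

# Stand-in for a remote dataset split into chunks (assumption for illustration).
CHUNKS = [bytes([i]) * 1024 for i in range(8)]


def fetch_chunk(index):
    """Hypothetical remote read; a real client would issue a network request."""
    return CHUNKS[index]


def process(chunk):
    """Placeholder computation over one chunk."""
    return hashlib.sha256(chunk).hexdigest()


def staged(num_chunks):
    """Staging: copy the whole dataset locally, then compute on it."""
    local_copy = [fetch_chunk(i) for i in range(num_chunks)]
    return [process(chunk) for chunk in local_copy]


def on_demand(num_chunks):
    """On-demand: fetch chunks as they are needed, prefetching the next
    chunk in a background thread while the current one is processed."""
    results = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # wait for the chunk needed now
            if i + 1 < num_chunks:
                # Start the next read before computing on this chunk.
                pending = pool.submit(fetch_chunk, i + 1)
            results.append(process(chunk))
    return results


if __name__ == "__main__":
    # Both strategies produce the same results; only the timing differs.
    print(staged(len(CHUNKS)) == on_demand(len(CHUNKS)))
```

Both functions compute the same values. In the on-demand version the first result is available after one chunk has arrived rather than after the whole dataset has been copied, and each later read runs while the previous chunk is being processed.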

Friday September 27, 2019 12:00pm - 12:30pm
Macaw Room