
Tuesday, September 24
 

8:30am

Curricula and Teaching Methods in Cloud Computing, Big Data, and Data Science (DTW 2019)
The emergence of Data Science technologies that combine Cloud Computing, Big Data, and Data Analytics as specialized fields in computing is motivating the development of new teaching methods and course designs to provide education in the techniques and technologies needed to extract knowledge from large datasets in virtualized environments. The current literature lacks well-articulated learning resources for beginners that integrate the administrative, programming, and algorithm design aspects of these related domains. We believe it is important to allow students, researchers, and professionals to understand cross-domain aspects of these challenges before they embark on further exploration of these fields. A small number of high-quality contributions will be presented during the workshop. At the end of the workshop, a forum discussion is planned to debate future directions of curricula and teaching methods in Data Science, Big Data, and Cloud Computing.

Visit https://github.com/EDISONcommunity/EDSF/wiki/(1)-DTW2019-Data-Teaching-Workshop-September-2019,-San-Diego for more information.

----

Data Science Model Curriculum Implementation for Various Types of Big Data Infrastructure Courses
Tomasz Wiktorski (University of Stavanger, Norway), Yuri Demchenko (University of Amsterdam, The Netherlands), and Oleg Chertov (National Technical University of Ukraine)

Teaching DevOps and Cloud Based Software Engineering in University Curricula
Yuri Demchenko (University of Amsterdam), Zhiming Zhao (University of Amsterdam), Jayachander Surbiryala (University of Stavanger), Spiros Koulouzis (University of Amsterdam), Zeshun Shi (University of Amsterdam), Xiaofeng Liao (University of Amsterdam), and Jelena Gordiyenko (Agile Telecom)

EDISON Data Science Framework (EDSF) Extension to Address Transversal Skills Required by Emerging Industry 4.0 Transformation
Yuri Demchenko (University of Amsterdam), Tomasz Wiktorski (University of Stavanger), Juan Cuadrado Gallego (University of Alcala), and Steve Brewer (University of Southampton)

Tuesday September 24, 2019 8:30am - 12:00pm
Boardroom East

8:30am

Platform-Driven e-Infrastructure Innovations (EINFRA)
Addressing emerging grand challenges in scientific research, health, engineering, or global consumer services necessitates dramatic increases in responsive supercomputing and extreme data capacities. The half-day workshop Platform-Driven e-Infrastructure Innovations (EINFRA) addresses projects and use cases that deal with extreme data or computing challenges; participants will benefit from the workshop by discussing commonalities and differences among their approaches.

Visit https://www.process-project.eu/workshops for more information.

----

Transkribus. A Platform for Automated Text Recognition and Searching of Historical Documents
Sebastian Colutto (University of Innsbruck), Philip Kahle (University of Innsbruck), Guenter Hackl (University of Innsbruck), and Guenter Muehlberger (University of Innsbruck)

Unlocking the LOFAR LTA
Hanno Spreeuw (Netherlands eScience Center), Souley Madougou (Netherlands eScience Center), Ronald Van Haren (Netherlands eScience Center), Berend Weel (Netherlands eScience Center), Adam Belloum (University of Amsterdam), and Jason Maassen (Netherlands eScience Center)

European HPC Landscape
Florian Berberich (PRACE aisbl and Jülich Supercomputing Center, Forschungszentrum Jülich GmbH), Janina Liebmann (Jülich Supercomputing Center, Forschungszentrum Jülich GmbH), Jean-Philippe Nominé (ETP4HPC and Commissariat à l'énergie atomique et aux énergies alternatives), Oriol Pineda (PRACE aisbl and Barcelona Supercomputing Center), Philippe Segers (Grand équipement national de calcul intensif), and Veronica Teodor (Jülich Supercomputing Center, Forschungszentrum Jülich GmbH)

Reference Exascale Architecture
Martin Bobák (Slovak Academy of Sciences), Ladislav Hluchy (Slovak Academy of Sciences), Adam Belloum (University of Amsterdam), Reginald Cushing (University of Amsterdam), Jan Meizner (AGH University of Science and Technology), Piotr Nowakowski (AGH University of Science and Technology), Viet Tran (Slovak Academy of Sciences), Ondrej Habala (Slovak Academy of Sciences), Jason Maassen (Netherlands eScience Center), Balázs Somosköi (Lufthansa Systems), Mara Graziani (University of Applied Sciences, Western Switzerland (HES-SO)), Matti Heikkurinen (University of Applied Sciences, Western Switzerland (HES-SO)), Maximilian Höb (Ludwig-Maximilians Universität), and Jan Schmidt (Ludwig-Maximilians Universität)

The AllScale API
Philipp Gschwandtner (University of Innsbruck), Herbert Jordan (University of Innsbruck), Peter Thoman (University of Innsbruck), and Thomas Fahringer (University of Innsbruck)

ESiWACE: On European Infrastructure Efforts for Weather and Climate Modeling at Exascale
Philipp Neumann (German Climate Computing Center) and Joachim Biercamp (German Climate Computing Center)

Tuesday September 24, 2019 8:30am - 12:00pm
Macaw Room

8:30am

Advanced Knowledge Technologies for Science in a FAIR World (AKTS)
A new wave of knowledge technologies is sparking innovation in eScience, including the emergence of large knowledge graphs created from text extraction and crowdsourcing, the rise of Wikidata as a nexus for core entities and resources in science, the proposed open knowledge networks of scientific content that include provenance and natural interfaces, and the advent of web-scale semantic dataset search using standard schemas. Given that semantics and ontologies have enabled many scientific advances, these new knowledge technologies offer exciting possibilities that will be discussed at this workshop.

Visit https://www.isi.edu/ikcap/akts/akts2019 for more information.

----

Describing datasets in Wikidata
Denny Vrandecic (Google)

Making Data FAIR Requires More than Just Principles: We Need Knowledge Technologies
Mark Musen (Stanford University)

Iterative Document Retrieval via Deep Learning Approaches for Biomedical Question Answering
Ibrahim Burak Ozyurt (UC San Diego) and Jeffrey Grethe (UC San Diego)

Incorporating New Concepts Into the Scientific Variables Ontology
Maria Stoica (University of Colorado, Boulder) and Scott Peckham (University of Colorado, Boulder)

Tuesday September 24, 2019 8:30am - 5:00pm
Boardroom West

8:30am

Research Objects 2019 (RO 2019)
Scholarly Communication has evolved significantly, with increasing focus on Open Research, FAIR data sharing, and community-developed open source methods. The concepts of authorship and citation are changing, as researchers increasingly reuse and evolve common software tools and datasets. Yet with growing cloud compute power and open platforms available, reproducibility of computational analyses becomes more challenging, and it is not yet commonly included in peer review.

While recent advances in scientific workflows and provenance capture systems have improved on this situation, Research Objects propose a way to package, describe, publish, archive, explore and understand digital research outputs by reusing existing Web standards and formats. In this workshop we will explore recent advancements in Research Objects and research data packaging, and attempt to address the challenges remaining to increase Research Object uptake with data providers, researchers, infrastructures, publishers and other stakeholders.

Visit https://researchobject.github.io/ro2019 for more information.

----

Reproducibility by Other Means: Transparent Research Objects
Timothy McPhillips (University of Illinois at Urbana-Champaign), Craig Willis (University of Illinois at Urbana-Champaign), Michael R. Gryk (University of Illinois at Urbana-Champaign), Santiago Nunez-Corrales (University of Illinois at Urbana-Champaign), and Bertram Ludascher (University of Illinois at Urbana-Champaign)

Interactivity, Distributed Workflows, and Thick Provenance: A Review of Challenges Confronting Digital Humanities Research Objects
Katrina Fenlon (University of Maryland, College of Information Studies)

Application of BagIt-Serialized Research Object Bundles for Packaging and Re-Execution of Computational Analyses
Kyle Chard (University of Chicago), Niall Gaffney (University of Texas at Austin), Matthew B. Jones (University of California at Santa Barbara), Kacper Kowalik (University of Illinois at Urbana-Champaign), Bertram Ludäscher (University of Illinois at Urbana-Champaign), Timothy McPhillips (University of Illinois at Urbana-Champaign), Jarek Nabrzyski (University of Notre Dame), Victoria Stodden (University of Illinois at Urbana-Champaign), Ian Taylor (University of Notre Dame), Thomas Thelen (University of California at Santa Barbara), Matthew J. Turk (University of Illinois at Urbana-Champaign), and Craig Willis (University of Illinois at Urbana-Champaign)

Data Quality Issues in Current Nanopublications
Imran Asif (Heriot-Watt University, Edinburgh), Jessica Chen-Burger (Heriot-Watt University, Edinburgh), and Alasdair J. G. Gray (Heriot-Watt University, Edinburgh)

Tuesday September 24, 2019 8:30am - 5:00pm
Cockatoo Room

10:00am

Break
Tuesday September 24, 2019 10:00am - 10:30am
Foyer

12:00pm

Lunch
Tuesday September 24, 2019 12:00pm - 1:00pm
Bay Front Lawn

1:00pm

Bridging from Concepts to Data and Computation for eScience (BC2DC'19)
Research addressing global challenges federates a growing diversity of disciplines, requires sustained contributions from many autonomous organizations and builds on heterogeneous evolving computational platforms. Scientific knowledge is scattered across cloud-based services, local storage, and in source code targeting specific architectures and computational contexts. Concepts reflected in disparate sources are hardly computer-communicable and computer-actionable across or even within disciplines. This workshop focuses on platform-driven and domain-specific developments that aim to unify underlying platforms, clouds, data, computational resources and concepts in order to empower research developers to deliver increasingly complex eScience systems.

Visit https://bc2dc.github.io for more information.

----

Active Provenance for Data-Intensive Workflows: Engaging Users and Developers
Alessandro Spinuso (Koninklijk Nederlands Meteorologisch Instituut), Malcolm Atkinson (University of Edinburgh), and Federica Magnoni (Istituto Nazionale Geofisica e Vulcanologia)

Modeling and Matching Digital Data Marketplace Policies
Sara Shakeri (University of Amsterdam), Valentina Maccatrozzo (Netherlands eScience Center), Lourens Veen (Netherlands eScience Center), Rena Bakhshi (Netherlands eScience Center), Leon Gommans (University of Amsterdam), Cees de Laat (University of Amsterdam), and Paola Grosso (University of Amsterdam)

DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud
Iraklis Klampanos (NCSR "Demokritos"), Athanasios Davvetas (NCSR "Demokritos"), André Gemünd (Fraunhofer SCAI), Malcolm Atkinson (University of Edinburgh), Antonios Koukourikos (NCSR "Demokritos"), Rosa Filgueira (University of Edinburgh), Amrey Krause (University of Edinburgh), Alessandro Spinuso (KNMI), Angelos Charalambidis (NCSR "Demokritos"), Federica Magnoni (INGV), Emanuele Casarotti (INGV), Christian Pagé (CERFACS), Mike Lindner (KIT), Andreas Ikonomopoulos (NCSR "Demokritos"), and Vangelis Karkaletsis (NCSR "Demokritos")

Ease Access to Climate Simulations for Researchers: IS-ENES Climate4Impact
Christian Pagé (Université de Toulouse, CNRS), Wim Som de Cerff (Koninklijk Nederlands Meteorologisch Instituut), Maarten Plieger (Koninklijk Nederlands Meteorologisch Instituut), Alessandro Spinuso (Koninklijk Nederlands Meteorologisch Instituut), and Xavier Pivan (Université de Toulouse, CNRS)

Managing Scientific Literature with Software from the PORTAL-DOORS Project
Shiladitya Dutta (Brain Health Alliance), Pooja Kowshik (Brain Health Alliance), Adarsh Ambati (Brain Health Alliance), Sathvik Nori (Brain Health Alliance), S. Koby Taswell (Brain Health Alliance), and Carl Taswell (Brain Health Alliance)

Towards a Computer-Interpretable Actionable Formal Model to Encode Data Governance Rules
Rui Zhao (University of Edinburgh) and Malcolm Atkinson (University of Edinburgh)

Towards a New Paradigm for Programming Scientific Workflows
Reginald Cushing (University of Amsterdam), Onno Valkering (University of Amsterdam), Adam Belloum (University of Amsterdam), and Cees de Laat (University of Amsterdam)

Bridging Concepts and Practice in eScience via Simulation-Driven Engineering
Rafael Ferreira da Silva (University of Southern California), Henri Casanova (University of Hawaii), Ryan Tanaka (University of Hawaii), and Frédéric Suter (IN2P3 Computing Center, CNRS)

Tuesday September 24, 2019 1:00pm - 5:00pm
Macaw Room

1:00pm

Using Amazon Web Services (AWS) for Data Analytics (registration required for account set-up)
Amazon Web Services (AWS) is offering a half-day tutorial covering the following topics: an introduction to AWS foundational services used in research, such as Amazon EC2, Amazon S3, auto scaling, and Amazon EC2 Spot; Jupyter Notebooks on AWS for data analytics; and CloudFormation templates and machine learning using AWS. Hands-on labs will be led by Dr. Sanjay Padhi, Head of AWS Research, and Randy Ridgley, Principal Solutions Architect. Please bring a laptop.

View the agenda and register at https://awsesciencetutorial.splashthat.com.


Tuesday September 24, 2019 1:00pm - 5:00pm
Boardroom East

2:30pm

Break
Tuesday September 24, 2019 2:30pm - 3:00pm
Foyer

5:00pm

eScience Welcome Reception and Gateways Poster Session
This poster session features Gateways conference attendees, and provides an opportunity for eScience attendees to meet and mingle with Gateways participants and each other.

Tuesday September 24, 2019 5:00pm - 7:00pm
Kon Tiki Room & Foyer
 
Wednesday, September 25
 

8:30am

Joint Welcome (including Gateways attendees)
Speakers

Ilkay Altintas

SDSC/UC San Diego

Katherine Lawrence

U of Michigan/Science Gateways Community Institute
I help people creating advanced digital resources for research and education connect their projects with helpful services, expertise, and information.


Wednesday September 25, 2019 8:30am - 9:00am
Kon Tiki Room

9:00am

Keynote: Randy Olson on "Narrative is Everything: The ABT Framework and Narrative Evolution"
Speakers

Randy Olson

Randy Olson Productions
Randy Olson is a scientist-turned-filmmaker who left a tenured professorship of marine biology (PhD Harvard University) to attend USC Cinema School, then worked in and around Hollywood for 25 years. He wrote and directed the documentary feature film “Flock of Dodos: The Evolution-Intelligent Design Circus.”


Wednesday September 25, 2019 9:00am - 10:00am
Kon Tiki Room

10:00am

Break
Wednesday September 25, 2019 10:00am - 10:30am
Foyer

10:30am

SOMOSPIE: A Modular SOil MOisture SPatial Inference Engine Based on Data-Driven Decisions
Danny Rorabaugh (University of Tennessee), Mario Guevara (University of Delaware), Ricardo Llamas (University of Delaware), Joy Kitson (University of Delaware), Rodrigo Vargas (University of Delaware), and Michela Taufer (University of Tennessee)

The current availability of soil moisture data over large areas comes from satellite remote sensing technologies (i.e., radar-based systems), but these data have coarse resolution and often exhibit large spatial information gaps. Where data are too coarse or sparse for a given need (e.g., precision agriculture), one can leverage machine-learning techniques coupled with other sources of environmental information (e.g., topography) to generate gap-free information at a finer spatial resolution (i.e., increased granularity). To this end, we develop a spatial inference engine consisting of modular stages for processing spatial environmental data, generating predictions with machine-learning techniques, and analyzing these predictions. We demonstrate the functionality of this approach and the effects of data processing choices via multiple prediction maps over a United States ecological region with a highly diverse soil moisture profile (i.e., the Middle Atlantic Coastal Plains). The relevance of our work derives from a pressing need to improve the spatial representation of soil moisture for applications in environmental sciences (e.g., ecological niche modeling, carbon monitoring systems, and other Earth system models) and precision agriculture (e.g., optimizing irrigation practices and other land management decisions).
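
A rough sketch of the gap-filling idea described in this abstract, using a k-nearest-neighbors regressor from scikit-learn; this is a generic stand-in rather than the SOMOSPIE code, and all arrays are synthetic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

# Synthetic stand-ins: coarse-resolution samples with known soil moisture,
# each described by (lat, lon, elevation) covariates.
coarse_coords = rng.uniform(0, 1, (500, 3))
coarse_moisture = 0.3 * coarse_coords[:, 2] + rng.normal(0, 0.02, 500)

# Fine-resolution points where satellite data are missing or too coarse.
fine_coords = rng.uniform(0, 1, (5000, 3))

# Distance-weighted KNN is one simple data-driven choice for such an engine.
model = KNeighborsRegressor(n_neighbors=10, weights="distance")
model.fit(coarse_coords, coarse_moisture)
fine_moisture = model.predict(fine_coords)  # gap-free, finer-granularity map
```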

Speakers

Danny Rorabaugh

University of Tennessee



Wednesday September 25, 2019 10:30am - 11:00am
Macaw Room

10:30am

defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data
Rosa Filgueira (University of Edinburgh), Michael Jackson (University of Edinburgh), Anna Roubickova (University of Edinburgh), Amrey Krause (University of Edinburgh), Ruth Ahnert (Queen Mary University of London), Tessa Hauswedell (University College London), Julianne Nyhan (University College London), David Beavan (The Alan Turing Institute), Timothy Hobson (The Alan Turing Institute), Mariona Coll Ardanuy (The Alan Turing Institute), Giovanni Colavizza (The Alan Turing Institute), James Hetherington (The Alan Turing Institute), and Melissa Terras (University of Edinburgh)

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The tool has been successfully evaluated using five different large-scale historical text datasets and two computing environments, Cray Urika-GX and Eddie, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface in a consistent way, without any HPC environment-specific requirements.
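
To picture the kind of parallel query defoe runs, here is a toy PySpark sketch (generic Spark code, not defoe's actual API) that counts keyword-matching articles per year:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("keyword-by-year").getOrCreate()

# Hypothetical input: one record per article as (year, text) pairs,
# already extracted from the collection's XML.
articles = spark.sparkContext.parallelize([
    (1850, "the gold rush begins"),
    (1851, "gold found again"),
    (1851, "local weather report"),
])

counts = (articles
          .filter(lambda rec: "gold" in rec[1])   # keep matching articles
          .map(lambda rec: (rec[0], 1))
          .reduceByKey(lambda a, b: a + b)        # aggregate per year
          .collect())
print(sorted(counts))  # [(1850, 1), (1851, 1)]
```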

Speakers

Rosa Filgueira

University of Edinburgh


Wednesday September 25, 2019 10:30am - 11:00am
Cockatoo Room

10:30am

Data Analysis and Sharing with the ENES Climate Analytics Service (ECAS)
The ENES Climate Analytics Service (ECAS) is a new service from the EOSC-hub project. It enables scientific end-users to perform data analysis experiments on large volumes of climate data by exploiting a PID-enabled, server-side, and parallel approach. It aims to provide a paradigm shift for the ENES community, with a strong focus on data-intensive analysis, provenance management, and server-side approaches, as opposed to current approaches that are mostly client-based, sequential, and have limited or missing end-to-end analytics workflow/provenance capabilities. Furthermore, the integrated data analytics service enables basic data provenance tracking by establishing PID support through the whole chain, thereby improving reusability, traceability, and reproducibility.

The objective of the tutorial is to present ECAS and its processing and data management capabilities to potential future users. Attendees will learn about the ECAS software stack (Jupyter, Ophidia, and others) and how to use the different integrated software packages. Besides the processing capabilities, the tutorial also covers data/workflow sharing with other researchers or with broader community experts. This is enabled through integrated Cloud-based services like B2DROP and B2SHARE.

The tutorial will be divided into a teaching as well as a practical hands-on training part and includes:
  1. presentations on the theoretical and technical background of ECAS. This covers the data cube concept and its operations (e.g., subset extraction, reduction, aggregation; see the sketch below). Furthermore, we provide an introduction to the Ophidia framework, the component of ECAS for processing multidimensional data.
  2. tutorials and training materials with hands-on Jupyter notebooks. Participants will have the opportunity to dive into the ECAS software stack and learn how to manipulate multidimensional data through real-world use cases from the climate domain.
ECAS is hosted on two sites: at DKRZ and at CMCC. Only a prior registration is required to use the service.
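
The data cube operations from item 1 can be sketched generically; the snippet below uses xarray as a stand-in, whereas the tutorial's actual exercises run on ECAS's Ophidia stack:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a climate data cube: daily temperature on a
# small lat/lon grid for one year.
cube = xr.DataArray(
    15 + 10 * np.random.rand(366, 4, 8),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2000-01-01", periods=366, freq="D"),
            "lat": np.linspace(-45, 45, 4),
            "lon": np.linspace(0, 315, 8)},
    name="tas",
)

subset = cube.sel(lat=slice(-45, 0))         # subset extraction
monthly = subset.resample(time="M").mean()   # reduction along time
series = monthly.mean(dim=("lat", "lon"))    # spatial aggregation
```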


Wednesday September 25, 2019 10:30am - 2:30pm
Boardroom East

10:30am

Creating Reproducible Experimentation Workflows with Popper: A Hands-on, Bring Your Own Code Tutorial
Current approaches to scientific research require activities that take up much time but do not actually advance our scientific understanding. For example, researchers and their students spend countless hours reformatting data and writing code to attempt to reproduce previously published research. What if the scientific community could find a better way to create and publish workflows, data, and models to minimize the amount of time spent “reinventing the wheel”? Popper is an experimentation protocol and CLI tool for implementing scientific exploration pipelines following a DevOps approach, allowing researchers to generate work that is easy to reproduce and extend.

Modern open source software development communities have created tools that make it easier to manage large codebases, allowing them to deal with high levels of complexity, not only in terms of managing code changes but across the entire ecosystem needed to deliver changes to software in an agile, rapidly changing environment. These practices and tools are collectively referred to as DevOps. The Popper experimentation protocol repurposes DevOps practices in the context of scientific explorations so that researchers can leverage existing tools and technologies to maintain and publish scientific analyses that are easy to reproduce.

In the first part of this tutorial, we will briefly introduce DevOps and give an overview of best practices. We will then show how these practices can be repurposed for carrying out scientific explorations and illustrate using some examples. The second part of the course will be devoted to hands-on experiences with the goal of walking the audience through the usage of the Popper CLI tool.

Participants will need a laptop with internet access. Participants are welcome to bring their own code/data for the exercises and should go over these setup instructions prior to the course: https://popperized.github.io/swc-lesson/setup.html



Wednesday September 25, 2019 10:30am - 6:15pm
Rousseau West Room

11:00am

The International Forest Risk Model (INFORM): A Method for Assessing Supply Chain Deforestation Risk with Imperfect Data
Neil Caithness (University of Oxford), Cécile Lachaux (Man & Nature), and David C. H. Wallom (University of Oxford)

A method is presented for quantifiably estimating the deforestation risk exposure of agricultural Forest Risk Commodities in commercial supply chains. The model consists of a series of equations applied to end-to-end data representing quantitative descriptors of the supply chain and its effect on deforestation. A robust penalty is included for historical deforestation, with a corresponding reward for reductions in the rate of deforestation. INFORM is a method for data analysis that answers a particular question for any Forest Risk Commodity in a supply chain: what is its cumulative deforestation risk exposure? To illustrate the methodology, we describe and calculate a case study of a livestock producer in France who sources soya-based animal feed from Brazil and wishes to document the deforestation risk associated with the product. Building on this example, we discuss the future applicability of INFORM within emerging supply-chain transparency initiatives, including clear shortcomings of the method and how it may also be used to motivate the production of better data by those subject to its analysis.

Speakers

David C. H. Wallom

University of Oxford



Wednesday September 25, 2019 11:00am - 11:30am
Macaw Room

11:00am

Understanding a Rapidly Expanding Refugee Camp Using Convolutional Neural Networks and Satellite Imagery
Susanne Benz (UC San Diego), Hogeun Park (UC San Diego), Jiaxin Li (UC San Diego), Daniel Crawl (UC San Diego), Jessica Block (UC San Diego), Mai Nguyen (UC San Diego), and Ilkay Altintas (UC San Diego)

In summer 2017, close to one million Rohingya, an ethnic minority group in Myanmar, fled to Bangladesh due to the persecution of Muslims. This large influx of refugees settled around existing refugee camps. Because of this dramatic expansion, the newly established Kutupalong-Balukhali expansion site lacked basic infrastructure and public services. While Non-Governmental Organizations (NGOs) such as the Refugee Relief and Repatriation Commissioner (RRRC) conducted a series of counting exercises to understand the demographics of the refugees, our understanding of camp formation is still limited. Since household-type surveys are time-consuming and do not entail geo-information, we propose to use a combination of high-resolution satellite imagery and machine learning (ML) techniques to assess the spatiotemporal dynamics of the refugee camp. Four Very-High-Resolution (VHR) images (i.e., WorldView-2) are analyzed to compare the camp pre- and post-influx. Using deep learning and unsupervised learning, we organized the satellite image tiles of a given region into geographically relevant categories. Specifically, we used a pre-trained convolutional neural network (CNN) to extract features from the image tiles, followed by cluster analysis to segment the extracted features into similar groups. Our results show that the size of the built-up area increased significantly from 0.4 km2 in January 2016 and 1.5 km2 in May 2017 to 8.9 km2 in December 2017 and 9.5 km2 in February 2018. Through the benefits of unsupervised machine learning, we further detected the densification of the refugee camp over time and were able to display its heterogeneous structure. The developed method is scalable and applicable to rapidly expanding settlements across various regions, and is thus a useful tool to enhance our understanding of the structure of refugee camps, enabling the allocation of resources for humanitarian needs to the most vulnerable populations.
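
A minimal sketch of the tile-clustering recipe described above (pre-trained CNN features followed by k-means), assuming torchvision's ResNet-18 and a hypothetical tiles/ directory; the paper's actual pipeline and parameters may differ:

```python
import glob
import torch
from torchvision import models, transforms
from sklearn.cluster import KMeans
from PIL import Image

# Pre-trained CNN with its classification head removed, leaving a
# 512-dimensional feature extractor.
cnn = models.resnet18(pretrained=True)
cnn.fc = torch.nn.Identity()
cnn.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def features(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return cnn(img).squeeze(0).numpy()

# Hypothetical directory of satellite image tiles (needs >= n_clusters files).
tiles = sorted(glob.glob("tiles/*.png"))
X = [features(p) for p in tiles]

# Unsupervised grouping of tiles into geographically similar categories.
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)
```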

Speakers

Susanne Benz

UC San Diego



Wednesday September 25, 2019 11:00am - 11:30am
Cockatoo Room

11:00am

Scalable Performance Awareness for In Situ Scientific Applications
Matthew Wolf (Oak Ridge National Laboratory), Jong Choi (Oak Ridge National Laboratory), Greg Eisenhauer (Georgia Institute of Technology), Stéphane Ethier (Princeton Plasma Physics Laboratory), Kevin Huck (University of Oregon), Scott Klasky (Oak Ridge National Laboratory), Jeremy Logan (Oak Ridge National Laboratory), Allen Malony (University of Oregon), Chad Wood (University of Oregon), Julien Dominski (Princeton Plasma Physics Laboratory), and Gabriele Merlo (University of Texas, Austin)

Part of the promise of exascale computing and the next generation of scientific simulation codes is the ability to bring together time and spatial scales that have traditionally been treated separately. This enables creating complex coupled simulations and in situ analysis pipelines, encompassing such things as "whole device" fusion models or the simulation of cities from sewers to rooftops. Unfortunately, the HPC analysis tools that have been built up over the preceding decades are ill-suited to the debugging and performance analysis of such computational ensembles. In this paper, we present a new vision for performance measurement and understanding of HPC codes, MONitoring Analytics (MONA). MONA is designed to be a flexible, high-performance monitoring infrastructure that can perform monitoring analysis in place or in transit by embedding analytics and characterization directly into the data stream, without relying upon delivering all monitoring information to a central database for post-processing. It addresses the trade-offs between the prohibitively expensive capture of all performance characteristics and not capturing enough to detect the features of interest. We demonstrate several uses of MONA: capturing and indexing multi-executable performance profiles to enable later processing; extracting performance primitives to enable the generation of customizable benchmarks and performance skeletons; and extracting communication and application behaviors to enable better control and placement for current and future runs of the science ensemble. Relevant performance results for a MONA system built from ADIOS and SOSFlow technologies are provided for DOE science applications and leadership machines.

Speakers

Matthew Wolf

Oak Ridge National Laboratory


Wednesday September 25, 2019 11:00am - 11:30am
Boardroom West

11:30am

ForestEyes Project: Can Citizen Scientists Help Rainforests?
Fernanda Beatriz Jordan Rojas Dallaqua (ICT/UNIFESP), Álvaro Luiz Fazenda (ICT/UNIFESP), and Fabio Augusto Faria (ICT/UNIFESP)

Scientific projects involving volunteers in analyzing data, collecting data, and contributing their own computational resources, known as Citizen Science (CS), have become more popular due to advances in information and communication technology (ICT). In the literature, many CS projects have been proposed to involve citizens in different knowledge domains such as astronomy, chemistry, mathematics, and physics. In this work, a CS project called ForestEyes proposes to track deforestation in rainforests by asking volunteers to analyze and classify remote sensing images. These manually classified data are used as input for training a pattern classifier, which then labels new remote sensing images. The ForestEyes project was created on the Zooniverse.org CS platform, and early campaigns with remote sensing images from the Brazilian Legal Amazon (BLA) were performed to assess the quality of the volunteers' responses. The effectiveness results were processed and compared to an oracle classification (PRODES -- the Amazon Deforestation Monitoring Project). Within 2.5 weeks of launch, 2050 tasks were completed with more than 35,000 answers from 383 volunteers (117 anonymous and 266 registered users). In the performed experiments, we show that volunteers can achieve excellent effectiveness in the remote sensing image classification task on the ForestEyes project. Furthermore, these results show that CS might be a powerful tool for quickly obtaining a large amount of high-quality labeled data.
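
For illustration, volunteer answers for a tile can be combined by majority vote before training the classifier; a minimal sketch (the project's actual aggregation rules may differ):

```python
from collections import Counter

def aggregate(votes):
    """Combine one tile's volunteer answers into a label and an agreement score."""
    label, n = Counter(votes).most_common(1)[0]
    return label, n / len(votes)

# E.g., three volunteers classify the same remote sensing tile:
print(aggregate(["deforested", "deforested", "forest"]))  # ('deforested', 0.666...)
```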


Wednesday September 25, 2019 11:30am - 12:00pm
Macaw Room

11:30am

Social Media Intelligence and Learning Environment: an Open Source Framework for Social Media Data Collection, Analysis and Curation
Chen Wang (University of Illinois at Urbana-Champaign), Luigi Marini (University of Illinois at Urbana-Champaign), Chieh-Li Chin (University of Illinois at Urbana-Champaign), Nickolas Vance (University of Illinois at Urbana-Champaign), Curtis Donelson (University of Illinois at Urbana-Champaign), Pascal Meunier (Purdue University), and Joseph T. Yun (University of Illinois at Urbana-Champaign)


Social Media Intelligence and Learning Environment (SMILE) is an open source framework bringing cutting-edge computational models for social media data to social science researchers and students with any level of programming and computational expertise. Many existing social media analysis tools require programming knowledge, charge a fee, or are closed source, making it challenging for social science researchers to apply existing and new methods to social media data. SMILE provides a user-friendly web interface through which researchers can perform a wide spectrum of research tasks, ranging from social media data collection to natural language processing, text classification, social network analysis, and the generation of human-readable outputs and visualizations. SMILE has adopted several technologies to support its needs. Its data service leverages the GraphQL language to provide an efficient and succinct API for clients to communicate with a heterogeneous collection of social media APIs, including Twitter and Reddit. SMILE implements a microservices design and utilizes Amazon AWS services, such as Lambda and Batch for computation, S3 for data storage, and Elasticsearch for a Twitter streaming database, which makes it more portable, economical, and resilient. Analysis outputs can be shared with the larger community using Clowder, an open source data management system that supports curation of long-tail data and metadata. SMILE is one of the main applications deployed as a standalone tool within the Social Media Macroscope (SMM), a science gateway based on the HUBzero platform.
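
Since SMILE's data service speaks GraphQL, a client query might look like the sketch below; the endpoint URL and field names here are hypothetical stand-ins, not SMILE's published schema:

```python
import requests

ENDPOINT = "https://smile.example.org/graphql"  # hypothetical deployment

query = """
query RecentPosts($keyword: String!) {
  redditPosts(keyword: $keyword, limit: 10) {
    title
    score
  }
}
"""

resp = requests.post(ENDPOINT,
                     json={"query": query, "variables": {"keyword": "escience"}})
resp.raise_for_status()
print(resp.json())  # the server returns only the requested fields
```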

Speakers

Chen Wang

University of Illinois at Urbana-Champaign

Nickolas Vance

University of Illinois at Urbana-Champaign

Curtis Donelson

University of Illinois at Urbana-Champaign


Wednesday September 25, 2019 11:30am - 12:00pm
Cockatoo Room

11:30am

ENVRI-FAIR - Interoperable Environmental FAIR Data and Services for Society, Innovation and Research
Andreas Petzold (Forschungszentrum Jülich GmbH), Ari Asmi (University of Helsinki), Alex Vermeulen (Lund University), Gelsomina Pappalardo (CNR Institute of Methodologies for Environmental Analysis), Daniele Bailo (Istituto Nazionale di Geofisica e Vulcanologia), Dick Schaap (MARIS B.V.), Helen M. Glaves (British Geological Survey), Ulrich Bundke (Forschungszentrum Jülich GmbH), and Zhiming Zhao (University of Amsterdam)

ENVRI-FAIR is the connection of the Cluster of European Environmental Research Infrastructures (ENVRI) to the European Open Science Cloud (EOSC). The overarching goal of ENVRI-FAIR is that, by the end of the project, all participating RIs will have built a set of FAIR data services that enhances the efficiency and productivity of researchers, supports innovation, enables data- and knowledge-based decisions, and connects the ENVRI Cluster to the EOSC. This goal is reached by: (1) well-defined community policies and standards for all steps of the data life cycle, aligned with wider European policies as well as with international developments; (2) sustainable, transparent, and auditable data services at each participating RI, for each step of the data life cycle, compliant with the FAIR principles; (3) a focus on implementing prototypes for testing pre-production services at each RI, with the catalogue of prepared services defined for each RI independently, depending on its maturity; and (4) the complete set of thematic data services and tools provided by the ENVRI cluster exposed under the EOSC catalogue of services.

Speakers

Zhiming Zhao

University of Amsterdam



Wednesday September 25, 2019 11:30am - 12:00pm
Boardroom West

12:00pm

Lunch
Wednesday September 25, 2019 12:00pm - 1:00pm
Bay Front Lawn

1:00pm

Data Identification and Process Monitoring for Reproducible Earth Observation Research
Bernhard Gößwein (Vienna University of Technology), Tomasz Miksa (Vienna University of Technology & SBA Research), Andreas Rauber (Vienna University of Technology), and Wolfgang Wagner (Vienna University of Technology)

Earth observation researchers use specialised computing services for satellite image processing offered by various data backends. The source of the data is often the same, for example the Sentinel-2 satellites operated by the European Space Agency, but how the data are pre-processed, corrected, updated, and later analysed may differ among the backends.

Backends often lack mechanisms for data versioning; for example, data corrections are not tracked. Furthermore, the evolving software stack used for data processing remains a black box to researchers. Researchers have no means to identify why executions of the same code deliver different results. This hinders the reproducibility of earth observation experiments.

In this paper, we present how the infrastructure of existing earth observation data backends can be modified to support reproducibility. The proposed extensions are based on the Research Data Alliance recommendations on data identification and on the VFramework for automated process provenance documentation. We implemented these extensions at the Earth Observation Data Centre, a partner in the openEO consortium. We evaluated the solution on a variety of usage scenarios, also providing performance and storage measures to evaluate the impact of the modifications. The results indicate that reproducibility can be supported with minimal performance and storage overhead.
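
The RDA-style data identification used here rests on recording enough to re-identify a dynamic subset: the query, an execution timestamp, and a checksum of the result. A minimal sketch with illustrative field names:

```python
import hashlib
from datetime import datetime, timezone

def identify_subset(query: str, result_bytes: bytes) -> dict:
    """Record what is needed to re-identify a dynamic-data subset:
    the query itself, when it ran, and a checksum of what it returned."""
    return {
        "query": query,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "result_sha256": hashlib.sha256(result_bytes).hexdigest(),
    }

record = identify_subset("collection=SENTINEL2&bbox=16.3,48.1,16.6,48.3&band=B04",
                         b"...subset payload...")
# Re-running the query later and comparing checksums reveals whether the
# underlying data were corrected or updated in the meantime.
```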

Speakers

Bernhard Gößwein

Vienna University of Technology

Tomasz Miksa

Vienna University of Technology & SBA Research



Wednesday September 25, 2019 1:00pm - 1:30pm
Macaw Room

1:00pm

Custom Execution Environments with Containers in Pegasus-Enabled Scientific Workflows
Karan Vahi (University of Southern California), Mats Rynge (University of Southern California), George Papadimitriou (University of Southern California), Duncan Brown (Syracuse University), Rajiv Mayani (University of Southern California), Rafael Ferreira da Silva (University of Southern California), Ewa Deelman (University of Southern California), Anirban Mandal (University of North Carolina), Eric Lyons (University of Massachusetts at Amherst), and Michael Zink (University of Massachusetts at Amherst)

Science reproducibility is a cornerstone feature in scientific workflows. In most cases, this has been implemented as a way to exactly reproduce the computational steps taken to reach the final results. While these steps are often completely described, including the input parameters, datasets, and codes, the environment in which these steps are executed is only described at a higher level with endpoints and operating system name and versions. Though this may be sufficient for reproducibility in the short term, systems evolve and are replaced over time, breaking the underlying workflow reproducibility. A natural solution to this problem is containers, as they are well defined, have a lifetime independent of the underlying system, and can be user-controlled so that they can provide custom environments if needed. This paper highlights some unique challenges that may arise when using containers in distributed scientific workflows. Further, this paper explores how the Pegasus Workflow Management System implements container support to address such challenges.
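
As a flavor of what attaching a container to a workflow looks like, here is a minimal sketch using the newer Pegasus 5.x Python API; the paper predates that API, so details may differ from what it describes:

```python
from Pegasus.api import Container, Job, Transformation, TransformationCatalog, Workflow

# Declare a container image and attach it to a transformation, so the
# job's code always runs in the same user-controlled environment.
container = Container("py-env", Container.DOCKER, image="docker://python:3.8-slim")

analyze = Transformation("analyze", site="local", pfn="/opt/bin/analyze.py",
                         is_stageable=False, container=container)

tc = TransformationCatalog().add_containers(container).add_transformations(analyze)

wf = Workflow("containerized-example")
wf.add_jobs(Job(analyze).add_args("--input", "data.txt"))
wf.add_transformation_catalog(tc)
wf.write("workflow.yml")  # serialized workflow ready for planning
```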

Speakers

Karan Vahi

University of Southern California



Wednesday September 25, 2019 1:00pm - 1:30pm
Boardroom West

1:30pm

A Hybrid Algorithm for Mineral Dust Detection Using Satellite Data
Peichang Shi (University of Maryland), Qianqian Song (University of Maryland), Janita Patwardhan (University of Maryland), Zhibo Zhang (University of Maryland), Jianwu Wang (University of Maryland), and Aryya Gangopadhyay (University of Maryland)

Mineral dust, defined as aerosol originating from the soil, can have various harmful effects on the environment and human health. The detection of dust, and particularly of incoming dust storms, may help prevent some of these negative impacts. In this paper, using satellite observations from the Moderate Resolution Imaging Spectroradiometer (MODIS) and the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations (CALIPSO), we compared several machine learning algorithms to traditional physical models and evaluated their performance for mineral dust detection. Based on the comparison results, we propose a hybrid algorithm that integrates the physical model with the data mining model, which achieved the best accuracy among all the methods. Further, we ranked the different channels of MODIS data based on the importance of the band wavelengths in dust detection. Our model also showed the quantitative relationships between dust and the different band wavelengths.
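
One generic way to hybridize a physical model with a data-driven one, in the spirit of this abstract, is to feed the physical model's output to the classifier as an extra feature. A synthetic-data sketch, not the paper's exact algorithm:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: five MODIS-like band radiances per pixel, plus a
# brightness-temperature difference (BTD), a classic physical dust index.
bands = rng.normal(0, 1, (2000, 5))
btd = bands[:, 0] - bands[:, 1] + rng.normal(0, 0.5, 2000)
dust = (btd < -0.3).astype(int)        # stand-in for CALIPSO-derived labels

# Hybrid feature set: raw channels plus the physical model's dust flag.
physical_flag = (btd < 0).astype(int)
X = np.column_stack([bands, physical_flag])

clf = GradientBoostingClassifier()
print(cross_val_score(clf, X, dust, cv=3).mean())
```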

Speakers

Jianwu Wang

University of Maryland


Wednesday September 25, 2019 1:30pm - 2:00pm
Macaw Room

1:30pm

SciInc: A Container Runtime for Incremental Recomputation
Andrew Youngdahl (DePaul University), Dai-Hai Ton-That (DePaul University), and Tanu Malik (DePaul University)

Reviewing a computational experiment by repeating it and verifying its results is a time-consuming task. A proper review often entails iteratively assessing the impact of changed arguments and datasets upon the results of a computation. Altering subsets of inputs, however, repeats all computational steps of an experiment, even if some steps are not impacted by the changed input. Minimizing redundant computations through partial recomputation and memoization is a promising incremental recomputation approach to improve review efficiency.

Current container technology, commonly used for sharing and reviewing experiments in new environments, does not provide support for incremental recomputation. In this paper we present SciInc, a container runtime system that, given a computation, efficiently repeats iterative computations by reusing partial results which are identical in both the repeat and the original. The runtime maintains an in-memory versioned provenance trace of the computation, and uses the trace to detect and adjust changes via a memoization-capable change propagation algorithm. Using a novel checkpoint/restore mechanism, we show how incremental recomputation can be achieved within the container runtime without modifying programs or introducing new software stacks. We choose lightweight data structures for storing and implementing the trace to maintain the invariant of reproducible computation within the container runtime. To determine the effectiveness of change propagation and memoization, we compare against popular container technology and incremental recomputation methods using published data analysis experiments.
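
The memoization at the core of this approach can be sketched generically: key each step by a content hash of its inputs and reuse the stored result when the key is unchanged. SciInc itself works at the container-runtime level via provenance traces and checkpoint/restore; this is only a toy illustration:

```python
import hashlib
import json
import os
import pickle

CACHE = ".memo_cache"  # hypothetical on-disk cache

def key(step, inputs):
    """Content-address a step by its name and the full value of its inputs."""
    blob = step + json.dumps(inputs, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_step(step, inputs, compute):
    """Recompute only when this (step, inputs) combination was never seen."""
    os.makedirs(CACHE, exist_ok=True)
    path = os.path.join(CACHE, key(step, inputs))
    if os.path.exists(path):               # unchanged step: reuse old result
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute(**inputs)             # changed step: run and store
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Changing only `threshold` re-runs this step; repeating it does not.
filtered = run_step("filter", {"data": [1, 5, 9], "threshold": 4},
                    lambda data, threshold: [x for x in data if x > threshold])
```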

Speakers

Tanu Malik

DePaul University


Wednesday September 25, 2019 1:30pm - 2:00pm
Boardroom West

2:00pm

Workflow Design Analysis for High Resolution Satellite Image Analysis
Ioannis Paraskevakos (Rutgers University), Matteo Turilli (Rutgers University), Bento Collares Gonçalves (Stony Brook, NY), Heather Lynch (Stony Brook, NY), and Shantenu Jha (Rutgers University and Brookhaven National Laboratory)

Ecological sciences are using imagery from a variety of sources to monitor and survey populations and ecosystems. Very High Resolution (VHR) satellite imagery provides an effective dataset for large-scale surveys. Convolutional Neural Networks have successfully been employed to analyze such imagery and detect large animals. As the datasets increase in volume, O(TB), and number of images, O(1k), utilizing High Performance Computing (HPC) resources becomes necessary. In this paper, we investigate task-parallel, data-driven workflow designs to support imagery analysis pipelines with heterogeneous tasks on HPC. We analyze the capabilities of each design when processing a dataset of 3,000 VHR satellite images totaling 4 TB. We experimentally model the execution time of the tasks of the image processing pipeline. We perform experiments to characterize the resource utilization, total time to completion, and overheads of each design. Based on the model, overhead, and utilization analysis, we show which design approach is best suited to scientific pipelines with similar characteristics.
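
The task-parallel pattern under study can be pictured with a toy Python sketch; the paper's designs run on HPC middleware, so this is only a conceptual stand-in:

```python
from concurrent.futures import ProcessPoolExecutor

# Stand-in stages of a heterogeneous imagery pipeline.
def tile(image):
    return [f"{image}:tile{i}" for i in range(4)]    # split image into tiles

def detect(tiles):
    return [t for t in tiles if t.endswith("0")]     # CNN inference stand-in

def aggregate(hits):
    return len(hits)                                 # summarize detections

def process(image):
    return aggregate(detect(tile(image)))

if __name__ == "__main__":
    images = [f"scene_{i}.tif" for i in range(8)]    # hypothetical VHR scenes
    with ProcessPoolExecutor(max_workers=4) as pool: # task-parallel over images
        print(list(pool.map(process, images)))
```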

Speakers

Ioannis Paraskevakos

Rutgers University


Wednesday September 25, 2019 2:00pm - 2:30pm
Macaw Room

2:00pm

Usage Patterns of Wideband Display Environments In e-Science Research, Development and Training
Jason Leigh (University of Hawaii at Manoa), Dylan Kobayashi (University of Hawaii at Manoa), Nurit Kirshenbaum (University of Hawaii at Manoa), Troy Wooton (University of Hawaii at Manoa), Alberto Gonzalez (University of Hawaii at Manoa), Luc Renambot (University of Illinois at Chicago), Andrew Johnson (University of Illinois at Chicago), Maxine Brown (University of Illinois at Chicago), Andrew Burks (University of Illinois at Chicago), Krishna Bharadwaj (University of Illinois at Chicago), Arthur Nishimoto (University of Illinois at Chicago), Lance Long (University of Illinois at Chicago), Jason Haga (National Institute of Advanced Industrial Science and Technology), John Burns (University of Hawaii at Hilo), Francis Cristobal (University of Hawaii at Hilo), Jared McLean (University of Hawaii at Hilo), Roberto Pelayo (University of Hawaii at Hilo), and Mahdi Belcaid (University of Hawaii at Manoa)

SAGE (the Scalable Adaptive Graphics Environment) and its successor SAGE2 (the Scalable Amplified Group Environment) are operating systems for managing content across wideband display environments. This paper documents the prevalent usage patterns of SAGE-enabled display walls in support of the e-Science enterprise, based on nearly 15 years of observations of the SAGE community. These patterns will help guide e-Science users and cyberinfrastructure developers on how best to leverage wideband display walls, and the types of software services that could be provided in the future.

Speakers

Jason Leigh

University of Hawaii at Manoa

Jason Haga

National Institute of Advanced Industrial Science and Technology


Wednesday September 25, 2019 2:00pm - 2:30pm
Boardroom West

2:30pm

Break
Wednesday September 25, 2019 2:30pm - 3:00pm
Foyer

3:00pm

SATVAM: Toward an IoT Cyber-Infrastructure for Low-Cost Urban Air Quality Monitoring
Yogesh Simmhan (Indian Institute of Science), Srijith Nair (Indian Institute of Science), Sumit Monga (Indian Institute of Science), Ravi Sahu (Indian Institute of Technology), Kuldeep Dixit (Indian Institute of Technology), Ronak Sutaria (Indian Institute of Technology), Brijesh Mishra (Indian Institute of Technology), Anamika Sharma (Indian Institute of Technology), Anand SVR (Indian Institute of Science), Malati Hegde (Indian Institute of Science), Rajesh Zele (Indian Institute of Technology), and Sachchida N. Tripathi (Indian Institute of Technology)

Air pollution is a public health emergency in large cities. The availability of commodity sensors and the advent of the Internet of Things (IoT) enable the deployment of city-wide networks of thousands of low-cost real-time air quality monitors to help manage this challenge. This needs to be supported by an IoT cyber-infrastructure for reliable and scalable data acquisition from the edge to the cloud. The low accuracy of such sensors also motivates the need for data-driven calibration models that can accurately predict the science variables from the raw sensor signals. Here, we offer our experiences with designing and deploying such an IoT software platform and calibration models, and validate them through a pilot field deployment in two mega-cities, Delhi and Mumbai. Our edge data service is able to even out the differential bandwidths between the sensing devices and the cloud repository, and to recover from transient failures. Our analytical models reduce the errors of the sensors from a best case of 63% using the factory baseline to as low as 21%.
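
A data-driven calibration model of the kind described can be sketched as a regressor mapping raw sensor signals to reference readings; all data below are synthetic stand-ins, not SATVAM's actual models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-ins: raw low-cost sensor signals (PM signal, temperature,
# relative humidity) and co-located reference-grade PM2.5 readings.
raw = rng.uniform(0, 1, (1000, 3))
reference = 20 + 80 * raw[:, 0] * (1 + 0.4 * raw[:, 2]) + rng.normal(0, 3, 1000)

X_train, X_test, y_train, y_test = train_test_split(raw, reference, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

pred = model.predict(X_test)
mape = np.mean(np.abs(pred - y_test) / np.abs(y_test)) * 100
print(f"mean absolute percentage error: {mape:.1f}%")
```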

Speakers

Yogesh Simmhan

Indian Institute of Science



Wednesday September 25, 2019 3:00pm - 3:30pm
Macaw Room

3:00pm

Comprehensible Control for Researchers and Developers Facing Data Challenges
Malcolm Atkinson (University of Edinburgh), Rosa Filgueira (University of Edinburgh), Iraklis Klampanos (National Centre for Scientific Research Demokritos), Antonis Koukourikos (National Centre for Scientific Research Demokritos), Amrey Krause (University of Edinburgh), Federica Magnoni (Istituto Nazionale di Geofisica e Vulcanologia), Christian Pagé (Université de Toulouse, CNRS), Andreas Rietbrock (Karlsruhe Institute of Technology), and Alessandro Spinuso (Koninklijk Nederlands Meteorologisch Instituut)

We present DARE's strategy for delivering capabilities to application communities sharing data, archival services, computational resources, and established working practices by mapping their work onto distributed and heterogeneous e-Infrastructures. We adopt abstractions presenting the concepts of concern for each group of experts, to help them interact with precision while avoiding complexity. These abstract concepts are mapped to incrementally improved engineering solutions while preserving their semantics. Consequently, one of the expert groups we support well is the research developers meeting the evolving needs of application domains. Our platform simultaneously supports the delivery of unchanging services and innovation, giving users control over when to adopt innovations, and we deliver incentives to promote that adoption. The quality of collaborating communities' work is validated using pervasive persistent provenance services. We report initial experience with two applications, computational seismology and climate-impact analysis, and show that the strategy is feasible and sustainable by introducing the architecture.

Speakers

Malcolm Atkinson

University of Edinburgh



Wednesday September 25, 2019 3:00pm - 3:30pm
Boardroom West

3:00pm

Connect your Research Data with Collaborators and Beyond
Data is an integral part of scientific research. With rapid growth in data collection and generation capability and the increasingly collaborative nature of research activities, data management and data sharing have become central and key to accomplishing research goals. Researchers today have a variety of solutions at their disposal, from local storage to Cloud-based storage. However, most of these solutions focus on hierarchical file and folder organization. While such an organization is pervasively used and quite useful, it relegates contextual information about the data, such as descriptions and collaborative notes, to external systems. This spread of information into different silos impedes the flow of research activities.

In this tutorial, we will introduce and provide hands-on experience with the SeedMeLab platform, which provides web-based data management and data sharing cyberinfrastructure. SeedMeLab enables research groups to manage, share, search, visualize, and present their data in a web-based environment using an access-controlled, branded, and customizable website they own and control. It supports storing and viewing data in a familiar tree hierarchy, but also supports formatted annotations, lightweight visualizations, and threaded comments on any file or folder. The system can easily be extended and customized to support metadata, job parameters, and other domain- and project-specific contextual items. The software is open source and available as an extension to the popular Drupal content management system.

For more information visit: http://SeedMeLab.org

Speakers

Amit Chourasia

SDSC/UC San Diego

David Nadeau

SDSC/UC San Diego


Wednesday September 25, 2019 3:00pm - 6:15pm
Boardroom East

3:30pm

Toward a Dynamic Network-Centric Distributed Cloud Platform for Scientific Workflows: A Case Study for Adaptive Weather Sensing
Eric Lyons (University of Massachusetts Amherst), George Papadimitriou (University of Southern California), Cong Wang (University of North Carolina at Chapel Hill), Komal Thareja (University of North Carolina at Chapel Hill), Paul Ruth (University of North Carolina at Chapel Hill), J. J. Villalobos (Rutgers Discovery Informatics Institute), Ivan Rodero (Rutgers Discovery Informatics Institute), Ewa Deelman (University of Southern California), Michael Zink (University of Massachusetts Amherst), and Anirban Mandal (University of North Carolina at Chapel Hill)

Computational science today depends on complex, data-intensive applications operating on datasets from a variety of scientific instruments. A major challenge is the integration of data into the scientist's workflow. Recent advances in dynamic, networked cloud resources provide the building blocks to construct reconfigurable, end-to-end infrastructure that can increase scientific productivity. However, applications have not adequately taken advantage of these advanced capabilities. In this work, we have developed a novel network-centric platform that enables high-performance, adaptive data flows and coordinated access to distributed cloud resources and data repositories for atmospheric scientists. We demonstrate the effectiveness of our approach by evaluating time-critical, adaptive weather sensing workflows, which utilize advanced networked infrastructure to ingest live weather data from radars and compute data products used for timely response to weather events. The workflows are orchestrated by the Pegasus workflow management system and were chosen because of their diverse resource requirements. We show that our approach results in timely processing of Nowcast workflows under different infrastructure configurations and network conditions. We also show how workflow task clustering choices affect the throughput of an ensemble of Nowcast workflows, with improved turnaround times. Additionally, we find that using our network-centric platform, powered by advanced Layer-2 networking techniques, results in faster, more reliable data throughput, makes cloud resources easier to provision, and makes the workflows easier to configure for operational use and automation.

Speakers

Eric Lyons

University of Massachusetts Amherst

George Papadimitriou

University of Southern California



Wednesday September 25, 2019 3:30pm - 4:00pm
Macaw Room

4:00pm

The Evolution of Bits and Bottlenecks in a Scientific Workflow Trying to Keep Up with Technology: Accelerating 4D Image Segmentation Applied to NASA Data
Scott Sellars (University of California San Diego), John Graham (University of California San Diego), Dmitry Mishin (University of California San Diego), Kyle Marcus (University of California San Diego), Ilkay Altintas (University of California San Diego), Thomas DeFanti (University of California San Diego), Larry Smarr (University of California San Diego), Camille Crittenden (University of California, Berkeley), Frank Wuerthwein (University of California San Diego), Joulien Tatar (University of California Irvine), Phu Nguyen (University of California Irvine), Eric Shearer (University of California Irvine), Soroosh Sorooshian (University of California Irvine), and F. Martin Ralph (Center for Western Weather and Water Extremes)

In 2016, a team of earth scientists directly engaged a team of computer scientists to identify cyberinfrastructure (CI) approaches that would speed up an earth science workflow. This paper describes the evolution of that workflow as the two teams bridged CI and an image segmentation algorithm to do large-scale earth science research. The Pacific Research Platform (PRP) and the Cognitive Hardware and Software Ecosystem Community Infrastructure (CHASE-CI) resources were used to decrease the earth science workflow's wall-clock time significantly, from 19.5 days to 53 minutes. The improvement in wall-clock time comes from the use of network appliances, improved image segmentation, deployment of a containerized workflow, and an increase in CI experience and training for the earth scientists. This paper presents a description of the evolving innovations used to improve the workflow, the bottlenecks identified within each workflow version, and the improvements made in each version of the workflow over a three-year period.

Speakers

Scott Sellars

University of California San Diego


Wednesday September 25, 2019 4:00pm - 4:30pm
Macaw Room

4:30pm

Break
Wednesday September 25, 2019 4:30pm - 4:45pm
Foyer

4:45pm

The Future of Swedish e-Science: SeRC 2.0
Erwin Laure (SeRC & KTH), Olivia Eriksson (SeRC & KTH), Erik Lindahl (SeRC & KTH), and Dan Henningson (SeRC & KTH)

Since 2010, the Swedish e-Science Research Centre (SeRC) has been funding and coordinating e-Science activities across a broad spectrum of scientific disciplines. After an initial five-year phase that produced outstanding results, SeRC is increasingly focusing on fostering interactions between disciplines and has created so-called Multidisciplinary Collaborative Programs (MCPs), in which domain researchers collaborate with e-Science methods and tool developers and e-Infrastructure providers. In this paper we give an overview of the initial phase of SeRC and present the new programs that started operating in 2019.

Speakers

Erwin Laure

SeRC & KTH



Wednesday September 25, 2019 4:45pm - 5:03pm
Cockatoo Room

4:45pm

Out-of-the-box Reproducibility: A Survey of Machine Learning Platforms
Richard Juul Isdahl (Norwegian University of Science and Technology) and Odd Erik Gundersen (Norwegian University of Science and Technology)

Even machine learning experiments, which are conducted entirely on computers, are not necessarily reproducible. An increasing number of both open-source and commercial machine learning platforms are being developed that help address this problem. However, there is no standard for assessing and comparing which features are required to fully support reproducibility. We propose a quantitative method that alleviates this problem. Based on the proposed method, we assess and compare the current state-of-the-art machine learning platforms for how well they support making empirical results reproducible. Our results show that BEAT and FloydHub have the best support for reproducibility, with CodaLab and Kaggle as close contenders. The most commonly used machine learning platforms provided by the big tech companies have poor support for reproducibility.
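
As an illustration of what such a quantitative method could look like (a minimal sketch of our own; the feature names and weights are invented, not the paper's criteria), a platform's score can be computed as the weighted fraction of reproducibility features it supports:

    # Hypothetical reproducibility features and importance weights.
    FEATURES = {
        "code versioning": 2.0,
        "data versioning": 2.0,
        "environment capture": 1.5,
        "experiment logging": 1.0,
        "result comparison": 1.0,
    }

    def reproducibility_score(supported):
        """Weighted share of supported features, in [0, 1]."""
        total = sum(FEATURES.values())
        return sum(w for f, w in FEATURES.items() if f in supported) / total

    print(reproducibility_score({"code versioning", "experiment logging"}))  # 0.4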

Speakers

Odd Erik Gundersen

Norwegian University of Science and Technology



Wednesday September 25, 2019 4:45pm - 5:15pm
Macaw Room

4:45pm

BBBlockchain: Blockchain-Based Participation in Urban Development
Robert Muth (Technische Universität Berlin), Kerstin Eisenhut (Technische Universität Berlin), Jochen Rabe (Technische Universität Berlin), and Florian Tschorsch (Technische Universität Berlin)

Urban development processes often suffer from mistrust amongst different stakeholder groups. The lack of transparency within complex and long-term planning processes and the limited scope for co-creation and joint decision-making constitute a persistent problem for successful participation in urban planning. Civic technology has the potential to remedy this predicament.

With BBBlockchain, we propose a blockchain-based participation platform, which is able to address all layers of participation. In the development of the platform, we focus on two key aspects: How to increase transparency and how to introduce enhanced co-decision-making. To this end, we exploit the immutable nature of blockchains and effectively offer a platform that excludes monopolistic control over information. The decision-making process is governed by smart contracts implementing, for example, timestamping of planning documents, opinion polls, and the management of a participatory budget. Our architecture and prototypes show the operational capabilities of this approach in a series of use cases for urban development.

Speakers

Robert Muth

Technische Universität Berlin


Wednesday September 25, 2019 4:45pm - 5:15pm
Boardroom West

5:03pm

Love, Money, Fame, Nudge
Love, Money, Fame, Nudge. These are the four levers we have to encourage scientists to use our e-Infrastructures. They are the same incentives we use for e-Scientists to use and develop e-Infrastructures. We use them to encourage contribution to our FAIR Research [1] and Data Commons [2] and adoption of our software platforms. We use them to encourage researchers to become skilled in FAIR data stewardship and produce reproducible papers. They are key to persuading institutions to professionalise research software engineering [3]. They underpin pretty much everything we are doing and hope to do.

Of all of these, Nudge [4] is the one lever that is the most robust, the most likely to get embedded and the most likely to get sustained. Nudge is getting the job done by using stealth and side effects. Nudge is looking at what we have or what is out there and tweaking it – like bioschemas.org has done in life sciences using schema.org. The e-infrastructures that have really affected research practice include Google (search, docs, scholar), cloud computing, and containerisation.

Nudge is the ramp that takes you the “last mile” [5] (or is that the “first mile”?) to the fancy e-Infrastructure from the not-so-fancy spreadsheet. Nudge technologies include Jupyter notebooks, RStudio, OpenRefine and even Galaxy workflows.

Nudge is also the least likely to lead to a paper in IEEE e-Science. It may not even get funded, being “merely useful”. It’s not likely to win the prize as a vision.

If the future of e-Science is that there is no e-Science, just Science for everyone, the long tail included, then maybe we should ask ourselves “how do we make e-Science invisible?” and that could translate to asking ourselves, in everything we are doing, “what’s the nudge?”.

References
  1. Wilkinson MD, Dumontier M, et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3. https://doi.org/10.1038/sdata.2016.18
  2. Grossman RL, Heath A, Murphy M, Patterson M, Wells W (2016) A Case for Data Commons: Toward Data Science as a Service. Computing in Science & Engineering 18(5):10–20.
  3. Society of Research Software Engineering (2019) https://society-rse.org/
  4. Thaler RH, Sunstein CR (2008) Nudge: Improving Decisions about Health, Wealth, and Happiness. Yale University Press. ISBN 978-0-14-311526-7.
  5. Koureas D, Arvanitidis C (2016) Community engagement: The ‘last mile’ challenge for European research e-infrastructures. Research Ideas and Outcomes 2: e9933. https://doi.org/10.3897/rio.2.e9933

Speakers


Wednesday September 25, 2019 5:03pm - 5:21pm
Cockatoo Room

5:15pm

dislib: Large Scale High Performance Machine Learning in Python
Javier Álvarez Cid-Fuentes (Barcelona Supercomputing Center), Salvi Solà (Barcelona Supercomputing Center), Pol Álvarez (Barcelona Supercomputing Center), Alfred Castro-Ginard (Dept. Física Quàntica i Astrofísica, Institut de Ciències del Cosmos (ICCUB), Universitat de Barcelona (IEEC-UB)), and Rosa M. Badia (Barcelona Supercomputing Center)

During the last years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This offers great potential to computational science, especially in research fields that deal with large amounts of data, such as genomics, earth sciences, and astrophysics. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard to use by non-experts, and are difficult to set up in high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses the issues of other existing libraries. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger, than other popular distributed machine learning libraries, such as MLlib. We also show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.
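
For flavor, here is a minimal usage sketch of dislib's scikit-learn-like interface as we understand it from the project's documentation (the block size and random data are arbitrary choices of ours). Note that the script must be launched through the PyCOMPSs runtime (e.g. with runcompss) for the computation to actually be distributed:

    import numpy as np

    import dislib as ds
    from dislib.cluster import KMeans

    # Partition a 10,000 x 2 dataset into 1,000 x 2 blocks managed by PyCOMPSs.
    x = ds.array(np.random.random((10000, 2)), block_size=(1000, 2))

    km = KMeans(n_clusters=3)
    km.fit(x)                         # clustering runs as parallel PyCOMPSs tasks
    labels = km.predict(x).collect()  # gather the distributed labels to the driver
    print(labels[:10])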

Speakers

Javier Álvarez Cid-Fuentes

Barcelona Supercomputing Center



Wednesday September 25, 2019 5:15pm - 5:45pm
Macaw Room

5:15pm

A Framework for Model Search Across Multiple Machine Learning Implementations
Yoshiki Takahashi (Tokyo Institute of Technology), Masato Asahara (NEC System Platform Research Laboratories), and Kazuyuki Shudo (Tokyo Institute of Technology)

Several recently devised machine learning (ML) algorithms have shown improved accuracy for various predictive problems. Model searches, which explore candidate ML algorithms and hyperparameter values to find an optimal configuration for the target problem, play a critical role in such improvements. During a model search, data scientists typically use multiple ML implementations to construct several predictive models; however, it takes significant time and effort to employ multiple ML implementations due to the need to learn how to use them, prepare input data in several different formats, and compare their outputs. Our proposed framework addresses these issues by providing a simple and unified coding method. It has been designed with the following two attractive features: (i) new machine learning implementations can be added easily via common interfaces between the framework and ML implementations, and (ii) it can be scaled to handle large model configuration search spaces via profile-based scheduling. The results of our evaluation indicate that, with our framework, implementers need only write 55-144 lines of code to add a new ML implementation. They also show that ours was the fastest framework for the HIGGS dataset, and the second fastest for the SECOM dataset.
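
The common-interface idea can be sketched as follows (a hypothetical illustration with names of our choosing, not the framework's actual API): each ML implementation is wrapped behind the same train/predict contract, so the model search loop can treat implementations and hyperparameter settings interchangeably.

    from abc import ABC, abstractmethod

    class MLImplementation(ABC):
        """Common interface between the framework and each ML implementation."""
        @abstractmethod
        def train(self, X, y, hyperparams): ...
        @abstractmethod
        def predict(self, X): ...

    class SklearnForest(MLImplementation):
        """Example wrapper around one concrete implementation."""
        def train(self, X, y, hyperparams):
            from sklearn.ensemble import RandomForestClassifier
            self.model = RandomForestClassifier(**hyperparams).fit(X, y)
        def predict(self, X):
            return self.model.predict(X)

    def model_search(impls, grid, X, y, X_val, y_val, score):
        """Exhaustive search over (implementation, hyperparameter) pairs."""
        best = None
        for impl in impls:
            for hp in grid:
                impl.train(X, y, hp)
                s = score(y_val, impl.predict(X_val))
                if best is None or s > best[0]:
                    best = (s, type(impl).__name__, hp)
        return best

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=400, random_state=0)
    grid = [{"n_estimators": n} for n in (10, 100)]
    print(model_search([SklearnForest()], grid, X[:300], y[:300],
                       X[300:], y[300:], accuracy_score))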

Speakers

Yoshiki Takahashi

Tokyo Institute of Technology

Kazuyuki Shudo

Tokyo Institute of Technology



Wednesday September 25, 2019 5:15pm - 5:45pm
Boardroom West

5:21pm

What is eScience, and where does it go from here?
In the 15th year of the eScience conference, it’s time to look back to the start and look forward to where we should head. In this talk, I will discuss what eScience has meant, what I think it now means, and why our current names are problematic. There’s even a small chance I’ll propose a better name. In addition, no matter what we call eScience, there are some aspects we can improve, specifically communication and credit. I’ll talk about specific problems with these aspects as well as potential improvements.

Speakers


Wednesday September 25, 2019 5:21pm - 5:39pm
Cockatoo Room

5:39pm

Discussion
Wednesday September 25, 2019 5:39pm - 5:57pm
Cockatoo Room

5:45pm

Recognition of Frog Chorusing with Acoustic Indices and Machine Learning
Hongxiao Gan (Queensland University of Technology), Jinglan Zhang (Queensland University of Technology), Michael Towsey (Queensland University of Technology), Anthony Truskinger (Queensland University of Technology), Debra Stark (The University of Queensland), Berndt van Rensburg (The University of Queensland), Yuefeng Li (Queensland University of Technology), and Paul Roe (Queensland University of Technology)

This research explores the recognition of choruses of two co-existing sibling frog species in long-duration field recordings using false-colour spectrograms and acoustic indices. Acid frogs are a group of endemic frogs that are particularly sensitive to habitat change and competition from other species. Wallum Sedgefrogs (Litoria olongburensis) are the most threatened acid frog species, facing habitat loss and degradation across much of their distribution, in addition to further pressures associated with anecdotally recognised competition from their sibling species, the Eastern Sedgefrogs (Litoria fallax). Monitoring the calling behaviours of these two species is essential for informing L. olongburensis management and protection, and for obtaining ecological information about the process and implications of their competition. Since their habitats can easily be disturbed by human activity and their body size is very small, passive acoustic monitoring is a good method to monitor their activities. However, after accumulating months of field recordings, listening through the recordings to identify the two species is a time-consuming and labour-intensive task. There is therefore a high demand for automated acoustic pattern and species recognition tools to efficiently navigate months of recordings and identify target species. Our research provides insight into how to choose acoustic features to efficiently recognise species from field-collected recordings at a larger scale. The experimental results show that these techniques are useful for identifying choruses of the two competing frog species, with an accuracy of 76.7% on identifying four acoustic patterns (the presence or absence of each of the two species).

Speakers

Hongxiao Gan

Queensland University of Technology



Wednesday September 25, 2019 5:45pm - 6:15pm
Macaw Room

5:45pm

Enhanced Interactive Parallel Coordinates using Machine Learning and Uncertainty Propagation for Engineering Design
Wiktor Piotrowski (University of Cambridge), Timoleon Kipouros (University of Cambridge), and P John Clarkson (University of Cambridge)

The design process of an engineering system requires thorough consideration of varied specifications, each with a potentially large number of dimensions. The sheer volume of data, as well as its complexity, can overwhelm the designer and obscure vital information. Visualisation of big data can mitigate the issue of information overload, but static displays can suffer from overplotting. To tackle the issues of overplotting and cluttered data, we present an interactive, touch-screen-capable visualisation toolkit that combines Parallel Coordinates and Scatter Plot approaches for managing multidimensional engineering design data.

As engineering projects require a multitude of varied software to handle the various aspects of the design process, the combined datasets often do not have an underlying mathematical model. We address this issue by enhancing our visualisation software with Machine Learning methods which also facilitate further insights into the data.

Furthermore, the various software tools within the engineering design cycle produce information at different levels of fidelity (accuracy and trustworthiness) and at different speeds. The induced uncertainty is modelled in the synthetic dataset and presented in an interactive way. This paper describes a new visualisation software package and demonstrates its functionality on a complex aircraft systems design dataset.

Speakers

Timoleon Kipouros

University of Cambridge


Wednesday September 25, 2019 5:45pm - 6:15pm
Boardroom West

5:57pm

ICT to Support the Transformation of Science in the Roaring Twenties
The way science is done is profoundly changing. Machine Learning and Artificial Intelligence are now applied in most of the sciences to process data and understand (or not) the observed phenomena. This talk will address recent research directions and results with respect to data and data exchange to feed the AI-ML layer.

Speakers

Cees de Laat

Prof. dr. ir. Cees de Laat chairs the System and Network Engineering (SNE) laboratory at the Faculty of Science at University of Amsterdam. The SNE lab conducts research on leading-edge computer systems of all scales, ranging from global-scale systems and networks to embedded devices...



Wednesday September 25, 2019 5:57pm - 6:15pm
Cockatoo Room
 
Thursday, September 26
 

7:45am

8:45am

Conference Welcome
Speakers

Rajesh Gupta

UC San Diego


Thursday September 26, 2019 8:45am - 9:00am
Kon Tiki Room

9:00am

Keynote: Manish Parashar on "Exploring the Future of Facilities-based, Data-Driven Science"
Large-scale experimental and observational facilities, individually and collectively, provide new opportunities for data-driven research across a wide range of science and engineering domains. These facilities provide shared-use infrastructure, instrumentation, and data products that are openly accessible to a broad community of researchers and educators. However, as these facilities grow in scale and provide increasing volumes of data and data products, effectively using them has become a significant challenge. In this talk, I will explore new opportunities enabled by these facilities as well as the new challenges presented. I will also explore how the cyberinfrastructure continuum, from the edge to extreme-scale systems, can be harnessed to support end-to-end data-driven workflows. Specifically, I will explore approaches for intelligent data delivery, in-transit data processing and edge-core integration. This research is part of the Continuum Computing project at the Rutgers Discovery Informatics Institute.

Speakers

Manish Parashar

Rutgers University
Manish Parashar is Distinguished Professor of Computer Science at Rutgers University. He is also the founding Director of the Rutgers Discovery Informatics Institute (RDI2). He is currently on an IPA appointment at the National Science Foundation where he is serving as Office Director...



Thursday September 26, 2019 9:00am - 10:00am
Kon Tiki Room

10:00am

Break
Thursday September 26, 2019 10:00am - 10:30am
Foyer

10:30am

Future Vision of e-Science Based on the Insights Gained Through Experiences
About 20 years have passed since the term “e-Science” was coined. Since then, the rapid development of new information technologies has greatly impacted the research methods and lifecycle of the scientific enterprise. State-of-the-art technologies such as big data analytics, IoT, AI, and robotics may solve currently impossible problems; however, there still remain fundamental problems that must be solved to make the vision of e-Science a reality. In this talk, the future vision of e-Science will be presented based on insights gained through past research such as Grid and Cloud computing, which aimed to build on-demand, dynamic virtual infrastructures based on requirements created by application usage. Through this examination, fundamental problems will be identified and mapped to the current e-Science landscape. Understanding these issues will lead to the solutions that must be achieved in order to make the vision of e-Science a reality.

Speakers

Yoshio Tanaka

National Institute of Advanced Industrial Science and Technology (AIST), Japan



Thursday September 26, 2019 10:30am - 10:48am
Toucan Room

10:30am

Photon Propagation using GPUs by the IceCube Neutrino Observatory
Dmitry Chirkin (University of Wisconsin-Madison), Juan Carlos Díaz-Vélez (University of Wisconsin-Madison), Claudio Kopper (Michigan State University), Alexander Olivas (University of Maryland), Benedikt Riedel (University of Wisconsin-Madison), Martin Rongen (RWTH Aachen University), David Schultz (University of Wisconsin-Madison), and Jakob van Santen (Deutsches Elektronen-Synchrotron-Zeuthen)

The IceCube Neutrino Observatory is a cubic-kilometer neutrino detector located at the South Pole, designed to detect high-energy astrophysical neutrinos. To thoroughly understand the detected neutrinos and their properties, the detector response to simulated signal and background has to be modeled using Monte Carlo techniques. An integral part of these studies is modeling the optical properties of the ice the observatory is built into. The propagation of individual photons from particles produced by neutrino interactions in the ice can be greatly accelerated using graphics processing units (GPUs). In this paper, we describe how we perform the photon propagation using GPUs and the physical properties of the ice we need to consider.
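
The core of such a propagation kernel can be sketched as follows (a toy model of ours, not IceCube's production code; the scattering and absorption lengths are invented): each photon repeatedly samples an exponential scattering length until it has travelled its sampled absorption distance. On a GPU the same per-photon loop runs as one thread per photon; vectorized numpy stands in for that here.

    import numpy as np

    rng = np.random.default_rng(0)

    def propagate(n_photons, scatter_len=25.0, absorb_len=100.0):
        """Return the total path length travelled by each photon (in metres)."""
        absorb_at = rng.exponential(absorb_len, n_photons)  # distance to absorption
        travelled = np.zeros(n_photons)
        alive = np.ones(n_photons, dtype=bool)
        while alive.any():
            step = rng.exponential(scatter_len, n_photons)  # distance to next scatter
            travelled[alive] += step[alive]
            alive &= travelled < absorb_at                  # absorbed photons stop
        return travelled

    print(propagate(100_000).mean())  # slightly above absorb_len for this toy ice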

Speakers

David Schultz

University of Wisconsin-Madison



Thursday September 26, 2019 10:30am - 11:00am
Macaw Room

10:30am

Quality-Aware Human-Machine Text Extraction for Biocollections using Ensembles of OCRs
Icaro Alzuru (University of Florida), Rhiannon Stephens (Australian Museum), Andréa Matsunaga (Advanced Computing and Information Systems Laboratory), Maurício Tsugawa (Advanced Computing and Information Systems Laboratory), Paul Flemons (Australian Museum), and José A.B. Fortes (University of Florida)

Information Extraction (IE) from the text in images is affected by the output quality of the text recognition process. Misspelled or missing text may propagate errors or even preclude IE. The low confidence in automated methods makes some IE projects rely exclusively on human work (crowdsourcing). That is the case for biological collections (biocollections), where the metadata (Darwin Core terms) found in digitized labels are transcribed by citizen scientists. In this paper, we present an approach to reduce the number of crowdsourcing tasks required to obtain the transcription of the text found in biocollections' images. By using an ensemble of Optical Character Recognition (OCR) engines (OCRopus, Tesseract, and the Google Cloud OCR), our approach identifies the lines and characters that, with high probability, are correct. This allows dedicating man-hours to the transcription of only low-confidence fragments of text. The number of lines to transcribe is also reduced through hybrid human-machine crowdsourcing: the output of the ensemble of OCRs is used as the first "human" transcription of the redundant crowdsourcing process. Our approach was tested on six biocollections (2,966 images), reducing the number of crowdsourcing tasks by 76% (58% due to lines accepted by the ensemble of OCRs and about 18% due to accelerated convergence when using hybrid crowdsourcing). The automatically extracted text presented a character error rate of 0.001 (0.1%).
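
A much-simplified sketch of the line-level ensemble idea (our illustration; the real pipeline's normalization and acceptance rules are more elaborate): a line is accepted automatically when at least two of the three engines agree on it after normalization, and only the remaining lines are sent to the crowd.

    from collections import Counter

    def normalize(line):
        return " ".join(line.lower().split())

    def triage(per_engine_lines):
        """per_engine_lines: one list of recognized lines per OCR engine,
        aligned by line index. Returns (auto-accepted, for-the-crowd)."""
        accepted, to_crowd = [], []
        for lines in zip(*per_engine_lines):
            counts = Counter(normalize(l) for l in lines)
            text, votes = counts.most_common(1)[0]
            (accepted if votes >= 2 else to_crowd).append(text)
        return accepted, to_crowd

    ocropus   = ["Musa acuminata Colla", "Queensland, 1903"]
    tesseract = ["Musa acuminata Colla", "Queens1and, 1903"]
    gcloud    = ["Musa acuminata Co11a", "Queensl4nd, 19O3"]
    accepted, to_crowd = triage([ocropus, tesseract, gcloud])
    print(accepted)  # one normalized line on which two engines agreed
    print(to_crowd)  # one low-confidence line left for human transcription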

Speakers

Icaro Alzuru

University of Florida



Thursday September 26, 2019 10:30am - 11:00am
Kon Tiki Room

10:30am

Pegasus Scientific Workflows with Containers
Workflows are a key technology for enabling complex scientific computations. They capture the interdependencies between processing steps in data analysis and simulation pipelines as well as the mechanisms to execute those steps reliably and efficiently. Workflows can capture complex processes to promote sharing and reuse, and also provide provenance information necessary for the verification of scientific results and scientific reproducibility. Application containers such as Docker and Singularity are increasingly becoming a preferred way for bundling user application code with complex dependencies, to be used during workflow execution.

Pegasus is being used in a number of scientific domains doing production-grade science. In 2016 the LIGO gravitational wave experiment used Pegasus to analyze instrumental data and confirm the first detection of a gravitational wave. The Southern California Earthquake Center (SCEC), based at USC, uses a Pegasus-managed workflow infrastructure called CyberShake to generate hazard maps for the Southern California region. In March 2017, SCEC conducted a CyberShake study on the DOE systems ORNL Titan and NCSA BlueWaters. Overall, the study required 450,000 node-hours of computation across the two systems. Pegasus is also being used in astronomy, bioinformatics, civil engineering, climate modeling, earthquake science, molecular dynamics and other complex analyses.

The goal of the tutorial is to introduce the benefits of modeling pipelines in a portable way using scientific workflows with application containers. We will examine the workflow lifecycle at a high level, along with the issues and challenges associated with various steps in the workflow lifecycle, such as creation, execution, monitoring, and debugging. Through hands-on exercises, we will model an application pipeline, bundle the application codes in containers, and execute the pipeline on distributed computing infrastructures. The attendees will leave the tutorial with knowledge of how to implement their own computations using containers and workflows.

Speakers

Karan Vahi

University of Southern California

Mats Rynge

USC Information Sciences Institute



Thursday September 26, 2019 10:30am - 2:30pm
Cockatoo Room

10:48am

Knowledge as Infrastructure
Knowledge networks that encode curated information about data, tools, processes, and the science itself will be essential cyberinfrastructure for the future. Recognizing the importance of knowledge graphs, the US National Science Foundation has embraced the creation of an Open Knowledge Network through its newly announced Convergence Accelerator pilot.

The Convergence Accelerator pilot started with a singular vision: to identify areas of research where investment in convergent approaches (those bringing together people across disciplines to solve problems) has the potential to yield high-benefit results. The Convergence Accelerator seeks to expand and refine NSF's efforts to support fundamental scientific exploration through partnerships that potentially include stakeholders from industry, government, nonprofits and other sectors.

Tracing the evolution of eScience topics from the conference's establishment in 2005, one finds topics such as web services, service-oriented architecture, grid computing, high-performance computing, data management and digital repositories, dataflow, middleware, and cloud computing. In the future, eScience will be assisted by AI: intelligent agents, smart instruments, smart tutors, machine learning and deep learning for data interpretation, and the like. Knowledge graphs are essential to the success of such AI and are, therefore, central to eScience.

This talk will introduce challenges and opportunities of knowledge networks and will provide an overview of the NSF Convergence Accelerator for the Open Knowledge Network.

Speakers

Chaitan Baru

Senior Science Advisor, National Science Foundation


Thursday September 26, 2019 10:48am - 11:06am
Toucan Room

11:00am

Simulating Data Access Profiles of Computational Jobs in Data Grids
Volodimir Begy (CERN, University of Vienna), Joeri Hermans (University of Liège), Martin Barisits (CERN), Mario Lassnig (CERN), and Erich Schikuta (University of Vienna)

The data access patterns of applications running in computing grids are changing due to the recent proliferation of high-speed local and wide area networks. Data-intensive jobs are no longer strictly required to run at the computing sites where the respective input data are located. Instead, jobs may access the data employing arbitrary combinations of data placement, stage-in and remote data access. These data access profiles exhibit partially non-overlapping throughput bottlenecks. This fact can be exploited to minimize the time jobs spend waiting for input data. In this work we present a novel grid computing simulator, which puts a heavy emphasis on the various data access profiles. Its purpose is to enable reproducible performance studies of data access patterns. The fundamental assumptions underlying our simulator are justified by empirical experiments performed in the Worldwide LHC Computing Grid (WLCG) at CERN. We demonstrate how to calibrate the simulator parameters in accordance with the true system using posterior inference with likelihood-free Markov Chain Monte Carlo. Thereafter, we validate the simulator's output against authentic production workloads from WLCG, demonstrating its remarkable accuracy.
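
The calibration idea can be illustrated with a deliberately tiny likelihood-free (ABC-style) MCMC sketch (ours, far simpler than the paper's setup; the stand-in simulator, tolerance and step size are invented): parameters are proposed, the simulator is run, and a proposal is accepted when its simulated summary statistic lands close enough to the observed one.

    import random

    random.seed(0)

    def simulator(theta):
        """Stand-in for the grid simulator: returns one summary statistic."""
        return 2.0 * theta + random.gauss(0.0, 0.5)

    def abc_mcmc(observed, n_steps=20_000, eps=0.3, step=0.2):
        """Flat prior + symmetric proposal: accept when the simulation matches."""
        theta, samples = 1.0, []
        for _ in range(n_steps):
            proposal = theta + random.gauss(0.0, step)
            if abs(simulator(proposal) - observed) < eps:
                theta = proposal  # accepted: simulator output close to observation
            samples.append(theta)
        return samples

    samples = abc_mcmc(observed=4.0)
    print(sum(samples) / len(samples))  # posterior mean, ~2.0 for this toy model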

Speakers

Volodimir Begy

CERN, University of Vienna



Thursday September 26, 2019 11:00am - 11:30am
Macaw Room

11:00am

Active Learning Yields Better Training Data for Scientific Named Entity Recognition
Roselyne Tchoua (University of Chicago), Aswathy Ajith (University of Chicago), Zhi Hong (University of Chicago), Logan Ward (Argonne National Laboratory), Kyle Chard (University of Chicago), Debra Audus (National Institute of Standards and Technology), Shrayesh Patel (University of Chicago), Juan de Pablo (University of Chicago), and Ian Foster (Argonne National Laboratory)

Despite significant progress in natural language processing, machine learning models require substantial expert-annotated training data to perform well in tasks such as named entity recognition (NER) and entity relations extraction. Furthermore, NER is often more complicated when working with scientific text. For example, in polymer science, chemical structure may be encoded using nonstandard naming conventions, the same concept can be expressed using many different terms (synonymy), and authors may refer to polymers with ad-hoc labels. These challenges, which are not unique to polymer science, make it difficult to generate training data, as specialized skills are needed to label text correctly. We have previously designed PolyNER, a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. PolyNER facilitates a labeling process that is otherwise tedious and expensive. Here, we use active learning to efficiently obtain more annotations from experts and improve performance. PolyNER requires just five hours of expert time to achieve discrimination capacity comparable to that of a state-of-the-art chemical natural language processing toolkit, highlighting the potential for human-computer partnership in domain-specific scientific NER.
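
A generic uncertainty-sampling loop in the spirit of the active learning step (an illustration of ours, not PolyNER itself; the synthetic features stand in for candidate-entity word vectors): each round, the examples the current classifier is least sure about are sent to the expert for labeling.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning(X, y_oracle, n_seed=20, n_rounds=5, batch=10):
        """y_oracle plays the expert: it is only read for queried examples."""
        rng = np.random.default_rng(0)
        labeled = list(rng.choice(len(X), n_seed, replace=False))
        for _ in range(n_rounds):
            clf = LogisticRegression(max_iter=1000).fit(X[labeled], y_oracle[labeled])
            proba = clf.predict_proba(X)[:, 1]
            uncertainty = -np.abs(proba - 0.5)  # highest when proba is near 0.5
            pool = np.setdiff1d(np.arange(len(X)), labeled)
            queries = pool[np.argsort(uncertainty[pool])[-batch:]]
            labeled.extend(queries)             # "expert" labels the queried items
        return clf

    X = np.random.default_rng(1).normal(size=(500, 8))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # synthetic entity / non-entity labels
    model = active_learning(X, y)
    print(model.score(X, y))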

Speakers

Roselyne Tchoua

University of Chicago


Thursday September 26, 2019 11:00am - 11:30am
Kon Tiki Room

11:06am

Underpinning eScience with Translational Computer Science
Given the increasingly pervasive role and growing importance of computing and data in all aspects of science and society, fundamental advances in computer science and their translation to the real world have become essential. Consequently, there may be benefits to formalizing translational computer science (TCS) to complement the traditional foundational and applied modes of computer science research, as has been done for translational medicine. TCS has the potential to accelerate the impact of computer science research overall. In this paper we discuss the attributes of TCS and formally define it. We enumerate a number of roadblocks that have limited its adoption to date and sketch a path forward.

Speakers

Thursday September 26, 2019 11:06am - 11:24am
Toucan Room

11:24am

Current Challenges for eScience
The technology and application landscape is changing faster than ever, creating new challenges for eScience. In this talk I will present a few challenges that the community needs to overcome very soon to reap the full benefits of these technologies for eScience.

Speakers

Vipin Chaudhary

SUNY Buffalo and NSF
Vipin Chaudhary is the SUNY Empire Innovation Professor of Computer Science and Engineering at SUNY Buffalo, and the co-founder of the Center for Computational and Data-Enabled Science and Engineering. He is currently on an IPA appointment at the National Science Foundation where...


Thursday September 26, 2019 11:24am - 11:42am
Toucan Room

11:30am

Towards Exascale: Measuring the Energy Footprint of Astrophysics HPC Simulations
Giuliano Taffoni (INAF - OATs), Luca Tornatore (INAF - OATs), David Goz (INAF - OATs), Antonio Ragagnin (INAF - OATs), Sara Bertocco (INAF - OATs), Igor Coretti (INAF - OATs), Manolis Marazakis (FORTH - Foundation For Research & Technology), Fabien Chaix (FORTH - Foundation For Research & Technology), Manolis Plumidis (FORTH - Foundation For Research & Technology Hellas), Manolis Katevenis (FORTH Foundation For Research & Technology Hellas), Renato Panchieri (EnginSoft S.p.A. (EnginSoft)), and Gino Perna (EnginSoft S.p.A. (EnginSoft))

The increasing amount of data produced in astronomy by observational studies and the size of the theoretical problems to be tackled in the near future push the need for HPC (High Performance Computing) resources towards the Exascale. The HPC sector is undergoing a profound phase of transition, in which one of the toughest challenges to cope with is energy efficiency, one of the main obstacles to achieving Exascale.

Since ideal peak performance is unlikely to be achieved in realistic scenarios, the aim of this work is to give some insights into the energy consumption of contemporary architectures running real scientific applications in an HPC context.

We use two state-of-the-art applications from the astrophysical domain, which we optimized in order to fully exploit the underlying hardware: a direct N-body code and a semi-analytical code for cosmic structure formation simulations.

For these two applications, we quantitatively evaluate the impact of computation on the energy consumption when running on three different systems: one that represents the present of current HPC systems (an Intel-based cluster), one that (possibly) represents the future of HPC systems (a prototype of an Exascale supercomputer), and a micro-cluster based on Arm MPSoCs.

We provide a comparison of the time-to-solution, energy-to-solution and energy delay product (EDP) metrics, for different software configurations.

ARM-based HPC systems have lower energy consumption, albeit running approximately 10 times slower. Since future Exascale systems must dramatically lower their idle energy consumption, we find that such a system can rival Intel-based ones provided that (i) a high-performance multi-hierarchy interconnect is available and (ii) the physics codes are properly re-designed.
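
For reference, the energy metrics the paper compares reduce to two short formulas: energy-to-solution E = P_avg x T, and the energy delay product EDP = E x T^w, where the exponent w weights the delay. The sketch below applies them to made-up numbers, not the paper's measurements:

    def energy_to_solution(avg_power_w, time_s):
        return avg_power_w * time_s                      # joules

    def edp(avg_power_w, time_s, delay_exponent=1):
        return energy_to_solution(avg_power_w, time_s) * time_s ** delay_exponent

    # Hypothetical nodes: a fast x86 node vs. a low-power ARM node running ~10x slower.
    for name, power_w, time_s in [("x86", 350.0, 100.0), ("ARM", 40.0, 1000.0)]:
        print(f"{name}: E = {energy_to_solution(power_w, time_s):>8.0f} J, "
              f"EDP = {edp(power_w, time_s):.2e} J*s")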

Speakers

Luca Tornatore

INAF - OATs



Thursday September 26, 2019 11:30am - 12:00pm
Macaw Room

11:30am

Reliability-Aware and Graph-Based Approach for Rank Aggregation of Biological Data
Pierre Andrieu (Université Paris-Sud, CNRS, Université Paris-Saclay), Bryan Brancotte (Institut Pasteur), Laurent Bulteau (Université Paris-Est Marne-la-Vallée, CNRS), Sarah Cohen-Boulakia (Université Paris-Sud, CNRS, Université Paris-Saclay), Alain Denise (Université Paris-Sud, CNRS, Université Paris-Saclay), Adeline Pierrot (Université Paris-Sud, CNRS, Université Paris-Saclay), and Stéphane Vialette (Université Paris-Est Marne-la-Vallée, CNRS)

Huge amounts of biological data are available in public databases and can be queried through portals using keyword queries, returning ranked lists of answers. However, properly querying such portals remains difficult, since various formulations of the same query can be considered (e.g., using synonyms of the initial keyword). Consequently, users have to manually combine several lists of hundreds of answers into one list.

Rank aggregation techniques are particularly well-fitted to this context as they take in a set of ranked elements (rankings) and provide a consensus, that is, a single ranking which is the "closest" to the input rankings. However, the problem of rank aggregation is NP-hard in most cases, and using an exact algorithm is currently not possible for more than a few dozen elements. A plethora of heuristics have thus been proposed, whose behaviours are, by essence, difficult to anticipate: given a set of input rankings, one cannot guarantee how far from an exact solution the consensus ranking provided by a heuristic will be.

The two challenges we tackle in this paper are the following: (i) providing an approach based on a pre-processing step that decomposes large data sets into smaller ones on which high-quality algorithms can be run, and (ii) providing information to users on the reliability of the positions of elements in the consensus ranking produced. Our approach not only rests on mathematical bases, offering guarantees on the computed result, but has also been implemented in a real system available to the life science community and tested on various real use cases.
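
To fix ideas, here is a minimal rank-aggregation example (generic, not the paper's decomposition algorithm): a Borda-style average-position heuristic produces a consensus, and the Kendall tau distance, the quantity exact algorithms minimize, scores how close it is to the input rankings.

    from itertools import combinations

    def borda(rankings):
        """Consensus heuristic: order items by average position across rankings."""
        items = rankings[0]
        total_pos = {x: sum(r.index(x) for r in rankings) for x in items}
        return sorted(items, key=lambda x: total_pos[x])

    def kendall_tau(r1, r2):
        """Number of item pairs the two rankings order differently."""
        pos = {x: i for i, x in enumerate(r2)}
        return sum(1 for a, b in combinations(r1, 2) if pos[a] > pos[b])

    rankings = [["g1", "g2", "g3", "g4"],   # answers for one query formulation
                ["g2", "g1", "g3", "g4"],   # ... for a synonym of the keyword
                ["g1", "g3", "g2", "g4"]]
    consensus = borda(rankings)
    print(consensus, sum(kendall_tau(consensus, r) for r in rankings))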

Speakers

Pierre Andrieu

Université Paris-Sud, CNRS, Université Paris-Saclay



Thursday September 26, 2019 11:30am - 12:00pm
Kon Tiki Room

11:42am

A Reference Architecture for eScience as a Solution Science for Multidisciplinary Problems
Over the last two decades, eScience has emerged and matured into a multidisciplinary field for translating computer science innovations for use in scientific problem solving. Although eScience uses techniques from other related areas including Big Data, data science, informatics, supercomputing and computational science, it is mainly concerned with the development and application of methods and technologies to compose solutions for the scientific, societal and educational challenges of our time. This paper discusses a typical team science ecosystem for eScience, and a reference architecture for end-to-end research lifecycle support that is distilled from a number of eScience projects and can be used to guide future applications.

Speakers

Ilkay Altintas

SDSC/UC San Diego



Thursday September 26, 2019 11:42am - 12:00pm
Toucan Room

12:00pm

Lunch
Thursday September 26, 2019 12:00pm - 1:00pm
Bay Front Lawn

1:00pm

Computing in the Continuum: Harnessing a Pervasive Data Ecosystem
The exponential growth of digital data sources enabled by the IoT, coupled with the ubiquity of non-trivial computational power for processing this data at the edges, in the core, and in between, has the potential to fundamentally transform our ability to understand and manage our lives and our environment. However, despite tremendous advances in technology, this vision remains largely unrealized: while our capacity for generating data is expanding dramatically, our ability to manage, manipulate and analyze this data, to transform it into knowledge and understanding in a timely manner, and to integrate it with practice has not kept pace. In this talk I will explore computing in the continuum, a paradigm that opportunistically leverages loosely connected resources and services to process data in-situ and in-transit in order to extract timely insights that are actionable. Using examples from our work as part of the CometCloud project, I will present research challenges and some initial solutions towards realizing this paradigm.

Speakers

Manish Parashar

Rutgers University
Manish Parashar is Distinguished Professor of Computer Science at Rutgers University. He is also the founding Director of the Rutgers Discovery Informatics Institute (RDI2). He is currently on an IPA appointment at the National Science Foundation where he is serving as Office Director...



Thursday September 26, 2019 1:00pm - 1:18pm
Toucan Room

1:00pm

Evaluation of Pilot Jobs for Apache Spark Applications on HPC Clusters
Valerie Hayot-Sasson (Concordia University) and Tristan Glatard (Concordia University)

Big Data is becoming prominent throughout many scientific fields and, as a result, scientific communities are seeking Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Spark were primarily designed to be executed on dedicated commodity infrastructures. As Big Data frameworks cannot leverage HPC schedulers directly, they must be executed on an overlay cluster atop an HPC allocation. This is problematic, as the application resource requirements needed by the HPC scheduler may not be known by the user. Pilot scheduling strategies have been developed to address the limitations of traditional HPC batch job schedulers. Pilot schedulers, such as HTCondor and DIRAC, decouple resource provisioning from task scheduling, thereby enabling efficient resource utilization through dynamic scheduling. This paper evaluates the benefits of pilot-scheduling strategies over traditional batch submission on HPC clusters with overlay Apache Spark clusters. We evaluate the overall speedup brought by pilot-scheduling strategies across four increasing resource configurations. Overall, we find that there is little benefit to using pilot-scheduling strategies, though they can bring a 2x speedup when system queuing times are very long. However, these occurrences are rare; generally, pilots have approximately the same makespan as batch submission. Although makespan differences were found to be mostly due to queuing times, pilots did not appear to have any advantage in this regard, potentially due to system scheduling policies. Regardless, pilots may still be useful when application wall times are underestimated. This remains to be investigated.

Speakers

Valerie Hayot-Sasson

Concordia University


Thursday September 26, 2019 1:00pm - 1:30pm
Kon Tiki Room


1:18pm

Scientific Applications and Heterogeneous Architectures – Data Analytics and the Intersection of HPC and Edge Computing
This talk discusses two emerging trends in computing (the convergence of data generation and analytics, and the emergence of edge computing) and how these trends can impact heterogeneous applications. Next-generation supercomputers, with their extremely heterogeneous resources and dramatically higher performance than current systems, will generate more data than we need or even can handle. At the same time, more and more data is generated at the “edge,” requiring computing and storage to move closer and closer to data sources. The coordination of data generation and analysis across the spectrum of heterogeneous systems, including supercomputers, cloud computing, and edge computing, adds additional layers of heterogeneity to applications' workflows. More importantly, the coordination can neither rely on the manual, centralized approaches that predominate in HPC today, nor be delegated exclusively to commercial Clouds. This talk presents case studies of heterogeneous applications in precision medicine and precision farming that expand scientists' workflows beyond the supercomputing center and shed the exclusive reliance on large-scale simulations for the sake of scientific discovery.

Speakers

Michela Taufer

The University of Tennessee


Thursday September 26, 2019 1:18pm - 1:36pm
Toucan Room

1:30pm

Profit Optimization for Splitting and Sampling Based Resource Management in Big Data Analytics-as-a-Service Platforms in Cloud Computing Environments
Yali Zhao (The University of Melbourne), Athanasios Vasilakos (Lulea University of Technology), James Bailey (The University of Melbourne), and Richard Sinnott (The University of Melbourne)

Exploring optimal big data analytics solutions for problem solving in various application domains is becoming an ever-important research area. Big data Analytics-as-a-Service (AaaS) platforms offer online AaaS to various domains in a pay-per-use model. Big data analytics incurs expensive costs and lengthy processing times due to large-scale computing requirements. To tackle the cost and time challenges of big data processing, we focus on proposing automatic and efficient resource management algorithms that maximize profits and minimize times while guaranteeing Service Level Agreements (SLAs) on the Quality of Service (QoS) requirements of queries. For query processing constrained by tight deadlines and limited budgets, the proposed algorithms enable data splitting and sampling based resource scheduling for parallel and approximate processing that significantly reduces data processing times and resource costs. We formulate the multi-objective resource scheduling optimization problem to maximize profits for AaaS platforms and minimize query response times. We design extensive experiments for algorithm performance evaluation. The results show that our proposed scheduling algorithms outperform state-of-the-art algorithms: they improve query admission rates, maximize profits, minimize query times, provide elastic and automatic large-scale resource configurations to minimize resource costs, and deliver timely, cost-effective, and reliable AaaS with SLA guarantees.

Speakers

Yali Zhao

The University of Melbourne


Thursday September 26, 2019 1:30pm - 2:00pm
Kon Tiki Room

1:36pm

Understanding ML Driven HPC: Applications and Infrastructure
Shantenu Jha (Rutgers University and Brookhaven National Laboratory) and Geoffrey Fox (Indiana University)

We recently outlined the vision of "Learning Everywhere", which captures the possibility and impact of coupling learning methods and traditional HPC methods. A primary driver of such coupling is the promise that Machine Learning (ML) will give major performance improvements for traditional HPC simulations. Motivated by this potential, the "ML around HPC" class of integration is of particular significance. In a related follow-up paper, we provided an initial taxonomy for integrating learning around HPC methods. In this paper, which is part of the Learning Everywhere series, we discuss "how" learning methods and HPC simulations are being integrated to enhance the effective performance of computations. This paper identifies several modes (substitution, assimilation, and control) in which learning methods integrate with HPC simulations, and provides representative applications in each mode. It also discusses some open research questions that we hope will motivate and clear the ground for ML around HPC benchmarks.

Speakers

Shantenu Jha

Rutgers University and Brookhaven National Laboratory

Geoffrey Fox

Indiana University



Thursday September 26, 2019 1:36pm - 1:54pm
Toucan Room


1:54pm

Learning Everywhere: A Taxonomy for the Integration of Machine Learning and Simulations
Speakers

Geoffrey Fox

Indiana University

Shantenu Jha

Rutgers University and Brookhaven National Laboratory



Thursday September 26, 2019 1:54pm - 2:12pm
Toucan Room

2:00pm

On Distributed Information Composition in Big Data Systems
Haifa AlQuwaiee (New Jersey Institute of Technology), Songlin He (New Jersey Institute of Technology), Chase Wu (New Jersey Institute of Technology), Qiang Tang (New Jersey Institute of Technology), and Xuewen Shen (New Jersey Institute of Technology)

Modern big data computing systems, exemplified by Hadoop, employ parallel processing based on distributed storage. The results produced by parallel tasks, such as computing modules in scientific workflows or reducers in the MapReduce framework, are typically distributed across different data nodes. However, most existing systems do not provide a mechanism to composite such distributed information, as required by many big data applications. We construct analytical cost models and formulate a Distributed Information Composition problem in Big Data Systems, referred to as DIC-BDS, to aggregate multiple datasets stored as data blocks in the Hadoop Distributed File System (HDFS) using a composition operator of specific complexity to produce one final output. We rigorously prove that DIC-BDS is NP-complete, and propose two heuristic algorithms: Fixed-window Distributed Composition Scheme (FDCS) and Dynamic-window Distributed Composition Scheme with Delay (DDCS-D). We conduct extensive experiments in Google clouds with various composition operators of commonly considered degrees of complexity, including O(n), O(n log n), and O(n^2). Our experimental results show the performance superiority of the proposed solutions over existing methods. Specifically, FDCS outperforms all other algorithms in comparison with a composition operator of complexity O(n) or O(n log n), while DDCS-D achieves the minimum total composition time with a composition operator of complexity O(n^2). The proposed algorithms provide an additional level of data processing for efficient information aggregation in existing workflow and big data systems.
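
As an illustration of the fixed-window idea behind FDCS (a sketch of ours, not the paper's algorithm): distributed partial results are combined a fixed number at a time, so no single composition step has to touch all blocks at once.

    from functools import reduce

    def fixed_window_compose(blocks, combine, window=4):
        """Repeatedly combine up to `window` blocks until one result remains."""
        while len(blocks) > 1:
            blocks = [reduce(combine, blocks[i:i + window])
                      for i in range(0, len(blocks), window)]
        return blocks[0]

    def merge_counts(a, b):
        """An O(n) composition operator: merging partial word counts."""
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merged.get(key, 0) + value
        return merged

    blocks = [{"hadoop": i, "hdfs": 1} for i in range(10)]  # 10 per-block results
    print(fixed_window_compose(blocks, merge_counts))       # {'hadoop': 45, 'hdfs': 10}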

Speakers

Haifa AlQuwaiee

New Jersey Institute of Technology



Thursday September 26, 2019 2:00pm - 2:30pm
Kon Tiki Room

2:12pm

Serverless Science for Simple, Scalable, and Shareable Scholarship
Kyle Chard (University of Chicago; Argonne National Laboratory) and Ian Foster (University of Chicago; Argonne National Laboratory)

Speakers

Kyle Chard

University of Chicago; Argonne National Laboratory



Thursday September 26, 2019 2:12pm - 2:30pm
Toucan Room

2:30pm

Poster Session & Break
Accepted posters are listed below with their poster numbers.

2. Accelerating Scientific Discovery with SCAIGATE Science Gateway
Chao Jiang (University of Florida), David Ojika (University of Florida), Bhavesh Patel (Dell EMC), Ann Gordon-Ross (University of Florida), and Herman Lam (University of Florida)

14. The Engagement and Performance Operations Center: EPOC
Edward Moynihan (Indiana University), Jennifer Schopf (Indiana University), and Jason Zurawski (Lawrence Berkeley National Lab)

41. A Review of Scalable Machine Learning Frameworks
Saba Amiri (Universiteit van Amsterdam, Netherlands), Sara Salimzadeh (Universiteit van Amsterdam, Netherlands) and Adam S. Z. Belloum (Universiteit van Amsterdam, Netherlands)

64. Streaming Graph Ingestion with Resource-Aware Buffering and Graph Compression
Subhasis Dasgupta (University of California San Diego), Aditya Bagchi (RKMV Educational and Research Institute), and Amarnath Gupta (University of California San Diego)

90. Predicting Eating Events in Free Living Individuals
Jiue-An Yang (University of California San Diego), Jiayi Wang (University of California San Diego), Supun Nakandala (University of California San Diego), Arun Kumar (University of California San Diego), and Marta M. Jankowska (University of California San Diego)

96. Streaming Workflows on Edge Devices to Process Sensor Data on a Smart Manufacturing Platform
Prakashan Korambath (University of California, Los Angeles), Haresh Malkani (University of California, Los Angeles), and Jim Davis (University of California, Los Angeles)

102. Sharing and Archiving Data Science Course Projects to Support Pedagogy for Future Cohorts
Stephanie Labou (University of California San Diego), Ho Jung Yoo (University of California San Diego), David Minor (University of California San Diego), and Ilkay Altintas (San Diego Supercomputer Center, University of California San Diego)

103. Enabling Server-Based Computing and FAIR Data Sharing with the ENES Climate Analytics Service
Sofiane Bendoukha (German Climate Computing Center (DKRZ)), Tobias Weigel (German Climate Computing Center (DKRZ)), Sandro Fiore (Euro-Mediterranean Center on Climate Change (CMCC)), and Donatello Elia (Euro-Mediterranean Center on Climate Change (CMCC))

104. Expanding Library Resources for Data and Compute-Intensive Education and Research
Stephanie Labou (University of California San Diego) and Reid Otsuji (University of California San Diego)

105. Enabling Transparent Access to Heterogeneous Architectures for IS-ENES climate4impact using the DARE Platform
Christian Pagé (Université de Toulouse, CNRS), Wim Som de Cerff (Koninklijk Nederlands Meteorologisch Instituut), Maarten Plieger (Koninklijk Nederlands Meteorologisch Instituut), Alessandro Spinuso (Koninklijk Nederlands Meteorologisch Instituut), and Xavier Pivan (Université de Toulouse, CNRS)

106. Support for HTCondor high-Throughput Computing Workflows in the REANA Reusable Analysis Platform
Rokas Mačiulaitis (CERN), Paul Brenner (University of Notre Dame), Scott Hampton (University of Notre Dame), Michael D. Hildreth (University of Notre Dame), Kenyi Paolo Hurtado Anampa (University of Notre Dame), Irena Johnson (University of Notre Dame), Cody Kankel (University of Notre Dame), Jan Okraska (CERN), Diego Rodriguez (CERN), and Tibor Šimko (CERN)

107. Effective Digital Object Access and Sharing Over a Networked Environment using DOIP and NDN
Cas Fahrenfort (University of Amsterdam) and Zhiming Zhao (University of Amsterdam)

108. Contextual Linking between Workflow Provenance and System Performance Logs
Elias el Khaldi Ahanach (University of Amsterdam), Spiros Koulouzis (University of Amsterdam), and Zhiming Zhao (University of Amsterdam)

109. A Historical Big Data Analysis to Understand the Social Construction of Juvenile Delinquency in the United States
Sandeep Puthanveetil Satheesan (University of Illinois at Urbana-Champaign), Alan B. Craig (Extended Collaborative Support Services, Extreme Science and Engineering Discovery Environment), and Yu Zhang (State University of New York at Brockport)

110. Workflow Automation in Liquid Chromatography Mass Spectrometry
Reinhard Gentz (Lawrence Berkeley National Laboratory), Hector García Martín (Joint BioEnergy Institute), Edward Baidoo (Joint BioEnergy Institute), and Sean Peisert (Lawrence Berkeley National Laboratory)

111. A Vision towards Future eScience
Shinji Shimojo (Osaka University) and Susumu Date (Osaka University)

112. HUBzero© Goes OneSciencePlace: The Next Community-Driven Steps For Providing Software-as-a-Service
David Benham (Purdue University) and Sandra Gesing (University of Notre Dame)

Thursday September 26, 2019 2:30pm - 3:30pm
Kon Tiki Room & Foyer

3:30pm

Transparency by Design in eScience Research
Beth Plale (Indiana University)

Transparency in science is an attitude of openness and sharing of information, both of which are fundamental to the progress of science and to effective functioning of the research enterprise. Transparency by design in eScience is to be transparent by default (by design) in our practices, methodologies, and research results. A commitment to transparency is a reminder of the worthy work that we do, and the important role that we as scientists have to improve and protect life on this earth.

Speakers

Beth Plale

Indiana University



Thursday September 26, 2019 3:30pm - 3:48pm
Toucan Room

3:30pm

Dynamic Sizing of Continuously Divisible Jobs for Heterogeneous Resources
Nicholas Hazekamp (University of Notre Dame), Benjamin Tovar (University of Notre Dame), and Douglas Thain (University of Notre Dame)

Many scientific applications operate on large datasets that can be partitioned and operated on concurrently. The existing approaches for concurrent execution generally rely on data which is statically partitioned. This static partitioning can lock performance in a sub-optimal configuration, leading to higher execution time and an inability to respond to dynamic resources.

We present the Continuously Divisible Job abstraction, which allows statically defined applications to have their component tasks dynamically sized in response to system behaviour. The Continuously Divisible Job abstraction defines a simple interface that dictates how work can be recursively divided, executed, and merged. Implementing this abstraction allows scientific applications to leverage dynamic job coordinators for execution. We also propose the Virtual File abstraction, which allows read-only subsets of large files to be treated as separate files.

In exploring the Continuously Divisible Job abstraction, two applications were implemented using the Continuously Divisible Job interface: a bioinformatics application and a high-energy physics event analysis. These were tested using an abstract job interface and several job coordinators. Comparing these against a previous static partitioning implementation, we show comparable or better performance without having to make static decisions or implement complex dynamic application handling.
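
The divide/execute/merge contract can be sketched as follows (method names are our guesses from the abstract, not the authors' actual interface), together with a trivial coordinator that divides jobs until the pieces match a size it chooses at run time:

    from abc import ABC, abstractmethod

    class DivisibleJob(ABC):
        """Continuously Divisible Job contract: divide, execute, merge."""
        @abstractmethod
        def size(self): ...
        @abstractmethod
        def divide(self):
            """Split this job into two smaller jobs."""
        @abstractmethod
        def execute(self):
            """Run this job and return its result."""
        @abstractmethod
        def merge(self, a, b):
            """Combine two sub-results into one."""

    class RangeSum(DivisibleJob):
        """Toy job: sum the integers in [lo, hi)."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
        def size(self):
            return self.hi - self.lo
        def divide(self):
            mid = (self.lo + self.hi) // 2
            return RangeSum(self.lo, mid), RangeSum(mid, self.hi)
        def execute(self):
            return sum(range(self.lo, self.hi))
        def merge(self, a, b):
            return a + b

    def coordinate(job, max_size):
        """A trivial coordinator; a real one would size pieces dynamically."""
        if job.size() <= max_size:
            return job.execute()
        left, right = job.divide()
        return job.merge(coordinate(left, max_size), coordinate(right, max_size))

    print(coordinate(RangeSum(0, 1_000_000), max_size=100_000))  # 499999500000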

Speakers

Nicholas Hazekamp

University of Notre Dame



Thursday September 26, 2019 3:30pm - 4:00pm
Kon Tiki Room


3:48pm

Towards Building a Converged Infrastructure for Multi-Disciplinary eScience
Despite significant methodological success, eScience, an approach that utilizes computational and often data operations to explore, analyze and find solutions to science problems, has mostly remained stovepiped. While new solutions have indeed advanced their respective science disciplines, there has been surprisingly little cross-fertilization of methods, models, and infrastructure across discipline boundaries. One reason for this is that every discipline needs its own custom problem formulation, model representation, and solution strategies, which are hard to generalize to other disciplines. In this talk, we present the idea that future eScience will be backed by a "semantic model commons": a repository of semantically annotated mathematical and computational models and associated computational modules that can be shared, repurposed, and customized by eScientists who work in different science communities. To illustrate the possibilities, we consider the phenomenon of material, disease and information propagation and the forces that control, accelerate, hinder and alter the dynamics of propagation, and sketch how a converged model infrastructure can support this cross-discipline eScience and yet benefit each discipline.

Speakers

Amarnath Gupta

University of California San Diego



Thursday September 26, 2019 3:48pm - 4:06pm
Toucan Room

4:00pm

Characterizing In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-Generation Supercomputers
Michela Taufer (The University of Tennessee), Stephen Thomas (The University of Tennessee), Michael Wyatt (The University of Tennessee), Tu Mai Anh Do (University of Southern California), Loïc Pottier (University of Southern California), Rafael Ferreira da Silva (University of Southern California), Harel Weinstein (Cornell University), Michel A. Cuendet (Cornell University; Lausanne University Hospital), Trilce Estrada (University of New Mexico), and Ewa Deelman (University of Southern California)

Molecular Dynamics (MD) simulations executed on state-of-the-art supercomputers are producing data at a rate faster than it can be written out to disk. In situ and in transit analysis of data produced by MD simulations reduces the original volume of information by several orders of magnitude, thereby alleviating the negative impact of the I/O bottleneck. This work focuses on characterizing the impact of in situ and in transit analytics on overall MD workflow performance, and on the capability to capture rapid, rare events in the simulated molecular system. The MD simulation and analysis processes share data via remote direct memory access (RDMA) using DataSpaces. Our metrics of interest are the time spent waiting in I/O (or frames lost) by the MD simulation, and the idle time of the analysis. We measure these metrics for a diverse set of molecular systems, characterize their trends for in situ and in transit configurations, and model which frames are dropped and which are analyzed for a real use case. The insights gained from this study are generally applicable to in situ and in transit workflows that require optimization of parameters to minimize loss in workflow performance and analytic accuracy.

Speakers

Thursday September 26, 2019 4:00pm - 4:30pm
Kon Tiki Room

4:06pm

Cyberinfrastructure Center of Excellence Pilot: Connecting Large Facilities Cyberinfrastructure
Ewa Deelman (University of Southern California), Anirban Mandal (University of North Carolina at Chapel Hill), Valerio Pascucci (University of Utah, Salt Lake City), Susan Sons (Indiana University), Jane Wyngaard (University of Notre Dame), Charles Vardeman (University of Notre Dame), Steve Petruzza (University of Utah), Ilya Baldin (University of North Carolina), Laura Christopherson (University of North Carolina), Ryan Mitchell (University of Southern California), Loic Pottier (University of Southern California), Mats Rynge (University of Southern California), Erik Scott (University of North Carolina), Karan Vahi (University of Southern California), Marina Kogan (University of Utah), Jasmine Mann (University of Southern California), Tom Gulbransen (Battelle Ecology, Inc.), Daniel Allen (Battelle Ecology, Inc.), David Barlow (Battelle Ecology, Inc.), Santiago Bonarrigo (Battelle Ecology, Inc.), Chris Clark (Battelle Ecology, Inc.), Leslie Goldman (Battelle Ecology, Inc.), Tristan Goulden (Battelle Ecology, Inc.), Phil Harvey (Battelle Ecology, Inc.), David Hulsander (Battelle Ecology, Inc.), Steve Jacobs (Battelle Ecology, Inc.), Christine Laney (Battelle Ecology, Inc.), Ivan Lobo-Padilla (Battelle Ecology, Inc.), Jeremy Sampson (Battelle Ecology, Inc.), John Staarmann (Battelle Ecology, Inc.), and Steve Stone (Battelle Ecology, Inc.)

The National Science Foundation's Large Facilities are major, multi-user research facilities that operate and manage sophisticated and diverse research instruments and platforms (e.g., large telescopes, interferometers, distributed sensor arrays) that serve a variety of scientific disciplines, from astronomy and physics to geology and biology and beyond. Large Facilities are increasingly dependent on advanced cyberinfrastructure (i.e., computing, data, and software systems; networking; and associated human capital) to enable the broad delivery and analysis of facility-generated data. These cyberinfrastructure tools enable scientists and the public to gain new insights into fundamental questions about the structure and history of the universe, the world we live in today, and how our environment may change in the coming decades.

This talk describes a pilot project that aims to develop a model for a Cyberinfrastructure Center of Excellence (CI CoE) that facilitates community building and knowledge sharing and that disseminates and applies best practices and innovative solutions for facility CI.

Speakers

Ewa Deelman

University of Southern California


Thursday September 26, 2019 4:06pm - 4:24pm
Toucan Room

4:15pm

Exascale Simulations in Astrophysics
Speakers

Mike Norman

San Diego Supercomputer Center, UC San Diego



Thursday September 26, 2019 4:15pm - 5:00pm
Macaw Room

4:24pm

The Future of Scientific Observation: Artificial Intelligence at the Edge
Scientific instruments have changed dramatically over the last century. We have moved from hand-crafted telescopes and thermometers to advanced digital sensors that can continuously provide data from the planet’s most remote locations. Recent advances in machine learning technology are leading to another significant shift in sensor science and scientific instrumentation — software-defined sensors. By embedding parallel computation with high-bandwidth sensors, a new kind of intelligent, software-defined edge computing sensor is possible. This disruptive new approach to scientific observation is part of an evolving computing continuum that analyzes data in situ and uses HPC to model, predict, and learn. In this presentation, the Waggle Platform for building distributed edge sensors, developed at Argonne National Laboratory, and the Array of Things project at the University of Chicago will be discussed in the context of edge computing, machine learning, and the computing continuum.

Speakers

Pete Beckman

Pete Beckman is the co-director of the Northwestern University/Argonne Institute for Science and Engineering. During the past 25 years, his research has been focused on software and architectures for large-scale parallel and distributed computing systems. For the DOE’s Exascale... Read More →


Thursday September 26, 2019 4:24pm - 4:42pm
Toucan Room

4:30pm

SPARCS: Stream-Processing Architecture Applied in Real-Time Cyber-Physical Security
Reinhard Gentz (Lawrence Berkeley National Laboratory), Sean Peisert (Lawrence Berkeley National Laboratory), Joshua Boverhof (Lawrence Berkeley National Laboratory), and Daniel Gunter (Lawrence Berkeley National Laboratory)

In this paper, we showcase a complete, end-to-end, fault-tolerant, bandwidth- and latency-optimized architecture for real-time utilization of data from multiple sources. The architecture supports the collection, transport, storage, processing, and display of both raw data and analytics, and can be utilized for a wide variety of applications ranging from automation/control to monitoring and security. We propose a practical, hierarchical design that allows easy addition and reconfiguration of software and hardware components, while utilizing local processing of data at the sensor or field-site level to reduce latency and upstream bandwidth requirements. The system supports multiple fail-safe mechanisms to guarantee the delivery of sensor data. We describe the application of this architecture to cyber-physical security (CPS) by supporting security monitoring of an electric distribution grid, through the collection and analysis of distribution-grid-level phasor measurement unit (PMU) data, as well as Supervisory Control And Data Acquisition (SCADA) communication in the control area network.
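
A minimal sketch of the hierarchical, fail-safe idea follows, with invented names and a hypothetical `uplink.send` transport; it is meant to illustrate local processing plus buffered delivery, not the actual SPARCS code.

```python
import queue
import time

class FieldSiteNode:
    """Toy field-site node: summarize raw sensor samples locally (reducing
    upstream bandwidth) and buffer results so nothing is lost if the uplink
    fails. Names and structure are illustrative only."""
    def __init__(self, uplink):
        self.uplink = uplink
        self.buffer = queue.Queue()  # local fail-safe buffer

    def ingest(self, samples):
        # Local processing at the field site: ship a small summary, not raw data.
        summary = {"t": time.time(), "n": len(samples),
                   "mean": sum(samples) / len(samples), "max": max(samples)}
        self.buffer.put(summary)
        self.flush()

    def flush(self):
        # Deliver buffered records upstream; on failure, keep them for retry.
        while not self.buffer.empty():
            record = self.buffer.queue[0]  # peek without removing
            try:
                self.uplink.send(record)   # hypothetical transport call
                self.buffer.get()
            except ConnectionError:
                break  # uplink down: records stay buffered until next flush
```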

Speakers

Reinhard Gentz

Lawrence Berkeley National Laboratory


Thursday September 26, 2019 4:30pm - 5:00pm
Kon Tiki Room

4:42pm

Discussion
Thursday September 26, 2019 4:42pm - 5:00pm
Toucan Room

5:00pm

Break
Thursday September 26, 2019 5:00pm - 5:15pm
Foyer

5:15pm

Keynote: Maryann Martone on "Neuroscience as an open, FAIR and citable discipline"
The launch of several international large brain projects indicates that we are still far from understanding the brain at even a basic level, let alone being able to intervene meaningfully in most degenerative, psychiatric and traumatic brain disorders. Such projects reflect the idea that neuroscience needs to be placed on a more data-rich, computational footing to address the inherent complexity of the nervous system. But should we just be looking towards big science to produce comprehensive and integrated data and tools? What about the thousands of studies conducted by individual investigators and small teams, the so-called “long tail data”? How does the regular practice of neuroscience need to change to address grand challenges in brain science?

Across the breadth of academia, researchers are defining new modes of scholarship designed to take advantage of 21st-century technology for linking and distributing information. Principles, best practices and tools for networked scholarship are emerging. Chief among these is the move towards open science: making the products of research as open as possible to ensure their broadest use. Second is the increased recognition that research outputs should include not only journal articles and books but also data, tools and workflows. Third, research outputs should be FAIR: Findable, Accessible, Interoperable and Reusable, the characteristics required for making digital objects maximally useful for both humans and machines. FAIR encompasses the agreement upon and use of community standards for data exchange. Finally, citation and credit systems must be redesigned to reflect the broadening of scientific output.

In this presentation, I will discuss the community and technical infrastructure for moving neuroscience towards an open, FAIR and citable science, highlighting our experiences in building and maintaining the Neuroscience Information Framework and other related projects. I will also provide an example of work underway in the spinal cord Injury community to come together around the sharing and integration of long tail data.

Speakers

Maryann Martone

UC San Diego
Maryann Martone received her BA from Wellesley College in Biological Psychology and Ancient Greek and her Ph.D. in Neuroscience from the University of California, San Diego. She is a Professor Emerita at UCSD, but still maintains an active laboratory and currently serves as the... Read More →


Thursday September 26, 2019 5:15pm - 6:15pm
Kon Tiki Room

6:30pm

Luau Dinner and Entertainment
Thursday September 26, 2019 6:30pm - 8:30pm
Bay Front Lawn
 
Friday, September 27
 

7:45am

8:45am

Conference Welcome



Speakers

Adam Belloum

Universiteit van Amsterdam


Friday September 27, 2019 8:45am - 9:00am
Toucan, Macaw, Cockatoo Rooms

9:00am

Keynote: Dieter Kranzlmüller on "Environmental Computing on SuperMUC-NG - A Partnership between Computer and Domain Sciences"
As a leadership-class computing facility, the mission of the Leibniz Supercomputing Centre (LRZ) is to enable significant achievement and advancement in science with powerful computational resources, including dedicated hardware and software coupled with computational science expertise for specific research domains. One such focus area is environmental science, which is highly relevant for our daily lives and society. With LRZ’s latest supercomputer SuperMUC-NG and its combination of a powerful general-purpose architecture with integrated data science and AI capabilities, users from the environmental sciences can increase the size and resolution of their models while reducing the necessary computing time. The major factor of success, however, is the partnership between the domain scientists and the computational specialists at LRZ, which covers every step of discovery, from dedicated training through the entire scientific workflow during production runs. This talk introduces the LRZ partnership model and highlights a number of example use cases from environmental computing.

Speakers

Dieter Kranzlmüller

Ludwig-Maximilians-Universität München
Prof. Dieter Kranzlmueller is Chairman of the Board of Directors at Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities. In 2008 Dieter Kranzlmueller joined the Board of Directors at LRZ and became a full professor of computer science at the Chair for... Read More →



Friday September 27, 2019 9:00am - 10:00am
Toucan, Macaw, Cockatoo Rooms

10:00am

Break
Friday September 27, 2019 10:00am - 10:30am
Foyer

10:30am

A Vision towards Future eScience
Today, scientific research heavily depends on the digital world. Almost the entire process of scientific research, including data acquisition, data analysis, and visualization, is now conducted in the digital world. Today's scientific research also requires global collaboration among scientists and IT researchers, each of whom works at a different organization. In this paper the authors describe the future vision of eScience as well as challenges and expectations for the eScience community.

Speakers

Friday September 27, 2019 10:30am - 10:48am
Cockatoo Room

10:30am

Timing is Everything: Identifying Diverse Interaction Dynamics in Scenario and Non-Scenario Meetings
Chreston Miller (Virginia Tech) and Christa Miller (Virginia Tech)

In this paper we explore the use of temporal patterns to define interaction dynamics between different kinds of meetings. Meetings occur on a daily basis and include different behavioral dynamics between participants, such as floor shifts and intense dialog. These dynamics can tell a story of the meeting and provide insight into how participants interact. We focus our investigation on defining diversity metrics to compare the interaction dynamics of scenario and non-scenario meetings. These metrics may be able to provide insight into the similarities and differences between scenario and non-scenario meetings. We observe that certain interaction dynamics can be identified through temporal patterns of speech intervals, i.e., when a participant is talking. We apply the principles of Parallel Episodes to identify moments of speech overlap, e.g., interaction “bursts”, and introduce Situated Data Mining, an approach for identifying repeated behavior patterns based on situated context. Applying these algorithms provides an overview of certain meeting dynamics and defines metrics for meeting comparison and diversity of interaction. We tested our approach on a subset of the AMI corpus and developed three diversity metrics to describe similarities and differences between meetings. These metrics also give the researcher an overview of interaction dynamics and highlight points of interest for analysis.
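
A toy version of detecting overlap "bursts" from speech intervals is shown below; it illustrates the idea of mining temporal overlap, not the paper's actual Parallel Episodes or Situated Data Mining algorithms.

```python
def overlap_bursts(intervals_a, intervals_b, min_overlap=0.5):
    """Find moments where two participants' speech intervals (start, end in
    seconds) overlap for at least min_overlap seconds -- a crude stand-in
    for detecting interaction 'bursts'."""
    bursts = []
    for a0, a1 in intervals_a:
        for b0, b1 in intervals_b:
            lo, hi = max(a0, b0), min(a1, b1)
            if hi - lo >= min_overlap:
                bursts.append((lo, hi))
    return sorted(bursts)

# Two speakers; one clear overlap between 5.0s and 6.0s
print(overlap_bursts([(0, 6.0), (9, 12)], [(5.0, 8.0)]))  # [(5.0, 6.0)]
```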

Speakers

Chreston Miller

Virginia Tech

Christa Miller

Virginia Tech



Friday September 27, 2019 10:30am - 11:00am
Toucan Room

10:30am

OKG-Soft: An Open Knowledge Graph with Machine Readable Scientific Software Metadata
Daniel Garijo (University of Southern California), Maximiliano Osorio (University of Southern California), Deborah Khider (University of Southern California), Varun Ratnakar (University of Southern California), and Yolanda Gil (University of Southern California)

Software is crucial for understanding, reusing and reproducing scientific results. Software is often stored in code repositories, which may contain the human-readable instructions necessary to use and set it up. However, a significant amount of time is usually required to understand how to invoke a piece of software, prepare data for its execution, and reuse it in combination with other software. In this paper we introduce OKG-Soft, an open knowledge graph that describes software in a machine-readable manner, together with a framework to annotate, query, explore and curate software metadata. OKG-Soft emphasizes the ability to compose software, proposing an ontology to describe the inputs and outputs of software components and their expected variables. We demonstrate the usefulness of OKG-Soft with two applications in the environmental and social sciences: a tool for exploring software models and a portal that exploits the contents of the graph to combine climate, hydrology, agriculture and economic models.
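
In the spirit of the paper's machine-readable descriptions, a hypothetical record and composability check might look as follows; the property names and repository URL are illustrative and do not reproduce the actual OKG-Soft ontology.

```python
# A hypothetical machine-readable software description; the vocabulary below
# is invented for illustration, not the project's actual ontology.
model_record = {
    "@type": "SoftwareComponent",
    "name": "hydrology-runoff-model",
    "codeRepository": "https://example.org/repo/runoff",  # placeholder URL
    "input": [
        {"name": "precipitation", "variable": "daily rainfall",
         "unit": "mm/day", "format": "CSV"},
    ],
    "output": [
        {"name": "streamflow", "variable": "river discharge",
         "unit": "m3/s", "format": "CSV"},
    ],
}

# Machine-readable inputs/outputs are what make composition checkable: a
# downstream model can be chained if some upstream output matches its input.
def composable(upstream: dict, downstream: dict) -> bool:
    outs = {(o["variable"], o["unit"]) for o in upstream["output"]}
    return any((i["variable"], i["unit"]) in outs for i in downstream["input"])
```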

Speakers

Daniel Garijo

University of Southern California

Deborah Khider

University of Southern California

Yolanda Gil

University of Southern California



Friday September 27, 2019 10:30am - 11:00am
Macaw Room

10:48am

What is Neo-Informatics?
The discipline of informatics, generically cast as the science and engineering of information systems within a socio-technical framework, originated in the middle of the last century and has undergone generational adaptations as computer hardware, networks and software have evolved. Within the "eScience" era of the last two decades, discipline-specific fields of informatics have flourished, such as bioinformatics, geoinformatics, astroinformatics and many more. In fact, there may be few fields of study that have not added an informatics sub-field. Over the same time, efforts at systematizing the common (or core, i.e. discipline-neutral) aspects of informatics have been successful: use cases, human-centered design, iterative approaches, and information models are some of the key elements. However, new pressures are being placed on the functional and non-functional requirements of information systems, whose underlying data are high-dimensional, heterogeneous, sparse and of uncertain quality. Demands arise from renewed attention to machine learning, neural networks and artificial intelligence in general, whose methods as implemented in software libraries produce results to be assessed and interpreted (often leading to decisions) by "humans-in-the-loop". Informatics, revisited, is a possible answer. This presentation features some history of informatics, recent successes grounded in, and around, mineral-informatics, and offers ideas for new directions, with the goal of advancing eScience in general.

Speakers

Friday September 27, 2019 10:48am - 11:06am
Cockatoo Room

11:00am

Multi-model Investigative Exploration of Social Media Data with BOUTIQUE: A Case Study in Public Health
Junan Guo (University of California San Diego), Subhasis Dasgupta (University of California San Diego), and Amarnath Gupta (University of California San Diego)

We present our experience with a data science problem in Public Health, where researchers use social media (Twitter) to determine whether the public shows awareness of HIV prevention measures offered by Public Health campaigns. To help the researcher, we develop an "investigative exploration" system called BOUTIQUE that allows a user to perform a multi-step visualization and exploration of data through a dashboard interface. Unique features of BOUTIQUE include its ability to handle heterogeneous types of data provided by a polystore, and its ability to use computation as part of the investigative exploration process. In this paper, we present the design of the BOUTIQUE middleware and walk through an investigation process for a real-life problem.

Speakers

Junan Guo

University of California San Diego

Subhasis Dasgupta

University of California San Diego

Amarnath Gupta

University of California San Diego



Friday September 27, 2019 11:00am - 11:30am
Toucan Room

11:00am

Efficient Runtime Capture of Multiworkflow Data Using Provenance
Renan Souza (COPPE/UFRJ & IBM Research), Leonardo Azevedo (IBM Research), Raphael Thiago (IBM Research), Elton Soares (IBM Research), Marcelo Nery (IBM Research), Marco A. S. Netto (IBM Research), Emilio Vital (IBM Research), Renato Cerqueira (IBM Research), Patrick Valduriez (Inria & U. Montpellier), and Marta Mattoso (COPPE/UFRJ)

Computational Science and Engineering (CSE) projects are typically developed by multidisciplinary teams. Despite being part of the same project, each team manages its own workflows, using specific execution environments and data processing tools. Analyzing the data processed by all workflows globally is a core task in a CSE project. However, this analysis is hard because the data generated by these workflows are not integrated. In addition, since these workflows may take a long time to execute, data analysis needs to be done at runtime to reduce the cost and time of the CSE project. A typical solution in scientific data analysis is to capture and relate the data in a provenance database while the workflows run, thus allowing for data analysis at runtime. However, the main problem is that such data capture competes with the running workflows, adding significant overhead to their execution. To mitigate this problem, we introduce in this paper a system called ProvLake, which adopts design principles for efficient distributed data capture from the workflows. While capturing the data, ProvLake logically integrates and ingests them into a provenance database ready for analysis at runtime. We validated ProvLake by implementing a real use case in the O&G industry involving four workflows that process 5 TB datasets for a deep learning classifier. Compared with Komadu, the closest solution that meets our goals, our approach enables runtime multiworkflow data analysis with much smaller overhead, as low as 0.1%.
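
The low overhead rests on a general design principle: the workflow never blocks on the provenance store. A generic sketch of asynchronous, batched capture (not ProvLake's actual API) is:

```python
import queue
import threading

class ProvenanceCapture:
    """Generic sketch of low-overhead runtime capture: the workflow enqueues
    small provenance records and returns immediately; a background thread
    batches them into the provenance store. Not ProvLake's actual API."""
    def __init__(self, store):
        self.q = queue.Queue()
        self.store = store
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, workflow: str, task: str, data: dict):
        # O(1) and non-blocking from the workflow's point of view.
        self.q.put({"workflow": workflow, "task": task, **data})

    def _drain(self):
        while True:
            batch = [self.q.get()]  # block until at least one record exists
            while not self.q.empty() and len(batch) < 100:
                batch.append(self.q.get_nowait())
            self.store.insert_many(batch)  # hypothetical bulk-ingest call
```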

Speakers

Renan Souza

COPPE/UFRJ & IBM Research



Friday September 27, 2019 11:00am - 11:30am
Macaw Room

11:06am

The Future of eScience
Speakers

Dieter Kranzlmüller

Ludwig-Maximilians-Universität München
Prof. Dieter Kranzlmueller is Chairman of the Board of Directors at Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities. In 2008 Dieter Kranzlmueller joined the Board of Directors at LRZ and became a full professor of computer science at the Chair for... Read More →



Friday September 27, 2019 11:06am - 11:24am
Cockatoo Room

11:24am

eScience 2050: A Look Back
Dennis Gannon (Indiana University)

This is a look back at the period of eScience from 2019 to 2050. Of course, as it is being published in 2019, it is clearly a work of science fiction, but it is based on technology trends that seem relatively clear. Specifically, we consider the impact of four themes on eScience: the explosion of AI as an eScience enabler, quantum computing as a service in the cloud, DNA data storage in the cloud, and neuromorphic computing.

Speakers

Dennis Gannon

Indiana University



Friday September 27, 2019 11:24am - 11:42am
Cockatoo Room

11:30am

Increasing Life Science Resources Re-Usability using Semantic Web Technologies
Marine Louarn (INSERM & Univ Rennes, Inria, CNRS, IRISA), Fabrice Chatonnet (INSERM, Univ Rennes, CHU Rennes, EFS), Xavier Garnier (Univ Rennes, Inria, CNRS, IRISA), Thierry Fest (INSERM, Univ Rennes, CHU Rennes, EFS), Anne Siegel (Univ Rennes, Inria, CNRS, IRISA), and Olivier Dameron (Univ Rennes, Inria, CNRS, IRISA)

In life sciences, current standardization and integration efforts are directed towards reference data and knowledge bases. However, the results of original studies are generally provided in non-standardized, study-specific formats. In addition, the only formalization of analysis pipelines is often limited to textual descriptions in method sections.

Both situations impair the reproducibility of results, their maintenance and their reuse in advancing other studies. Semantic Web technologies have proven their efficiency in facilitating the integration and reuse of reference data and knowledge bases.

We hypothesize that Semantic Web technologies also facilitate reproducibility and reuse of life sciences studies involving pipelines that compute associations between entities according to intermediary relations and dependencies.

In order to assess this hypothesis, we considered a case-study in systems biology (http://regulatorycircuits.org), which provides tissue-specific regulatory interaction networks to elucidate perturbations across complex diseases.

Our approach consisted in surveying the complete set of provided supplementary data to reveal the underlying structure between the biological entities described in the data. We used this structure to integrate the data with Semantic Web technologies and formalized the Regulatory Circuits analysis pipeline as SPARQL queries. The result was a dataset of 335,429,988 triples, over which two SPARQL queries were sufficient to extract each single tissue-specific regulatory network.
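
To illustrate the flavor of such queries, a minimal rdflib example follows; the file name and predicate vocabulary are invented for the example and differ from the actual Regulatory Circuits schema.

```python
from rdflib import Graph

g = Graph()
g.parse("regulatory_circuits.ttl")  # hypothetical Turtle export of the triples

# Invented predicates for illustration; the real schema differs.
query = """
PREFIX ex: <http://example.org/regcircuits#>
SELECT ?tf ?gene WHERE {
    ?interaction ex:transcriptionFactor ?tf ;
                 ex:targetGene ?gene ;
                 ex:tissue ?t .
    ?t ex:name "CD19+ B cells" .
}
"""
# Each row is one edge of the tissue-specific regulatory network.
for tf, gene in g.query(query):
    print(tf, "->", gene)
```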

Speakers

Marine Louarn

INSERM & Univ Rennes, Inria, CNRS, IRISA



Friday September 27, 2019 11:30am - 12:00pm
Toucan Room

11:30am

AdaptLidarTools: A Full-Waveform Lidar Processing Suite
Ravi Shankar (Boise State University), Nayani Ilangakoon (Boise State University), Aaron Orenstein (Treasure Valley Math and Science Center), Floriana Ciaglia (Boise State University), Nancy Glenn (Boise State University), and Catherine Olschanowsky (Boise State University)

AdaptLidarTools is a software package that processes full-waveform lidar data. Full-waveform lidar is an active remote sensing technique in which a laser beam is emitted towards a target and the backscattered energy is recorded as a near-continuous waveform. A collection of waveforms from airborne lidar can capture landscape characteristics in three dimensions. Specific to vegetation, the extracted echoes and echo properties from the waveforms can provide scientists with its structural (height, volume, layers of canopy, among others) and functional (leaf area index, diversity) characteristics. The discrete waveforms are transformed into georeferenced 2D rasters (images). The georeferencing orients the raster on a map and allows scientists to correlate field-based observations for validation of the waveform observations. AdaptLidarTools provides an extensible, open-source framework that processes the waveforms and supports multiple processing methods.

AdaptLidarTools is designed to explore new methods for fitting full-waveform lidar signals and to maximize the information in the waveforms for vegetation applications. The toolkit explores first differencing, complementary to Gaussian fitting, for faster processing of full-waveform lidar signals and for handling increasingly large volumes of full-waveform lidar datasets. AdaptLidarTools takes approximately 30 minutes to derive a raster of a given echo property from a raw waveform file of 1 GB. The toolkit generates first-order echo properties such as position, amplitude, and pulse width, as well as properties such as rise time, fall time, and backscattered cross section that current proprietary and open-source tools do not generate. The derived echo properties are delivered as georeferenced raster files of a given spatial resolution that can be viewed and processed by most remote sensing data processing software.
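
A toy version of the first-differencing idea is sketched below, assuming a simple peak-plus-threshold rule; the toolkit's actual detector is more elaborate.

```python
def first_difference_echoes(waveform, threshold=5):
    """Toy first-differencing echo detector: an echo is flagged where the
    discrete derivative turns from rising to falling (a local peak) and the
    amplitude exceeds a noise threshold. Gaussian fitting would instead fit
    a curve around each peak; this is the cheaper alternative explored."""
    diffs = [b - a for a, b in zip(waveform, waveform[1:])]
    echoes = []
    for i in range(1, len(diffs)):
        if diffs[i - 1] > 0 >= diffs[i] and waveform[i] > threshold:
            echoes.append((i, waveform[i]))  # (sample position, amplitude)
    return echoes

# A waveform with two echoes (peaks at samples 3 and 8)
print(first_difference_echoes([0, 1, 4, 9, 4, 1, 2, 6, 11, 6, 2]))
# [(3, 9), (8, 11)]
```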

Speakers

Ravi Shankar

Boise State University



Friday September 27, 2019 11:30am - 12:00pm
Macaw Room

11:42am

Discussion
Friday September 27, 2019 11:42am - 12:00pm
Cockatoo Room

12:00pm

Data Encoding in Lossless Prediction-Based Compression Algorithms
Ugur Cayoglu (Karlsruhe Institute of Technology (KIT)), Frank Tristram (Karlsruhe Institute of Technology (KIT)), Jörg Meyer (Karlsruhe Institute of Technology (KIT)), Jennifer Schröter (Karlsruhe Institute of Technology (KIT)), Tobias Kerzenmacher (Karlsruhe Institute of Technology (KIT)), Peter Braesicke (Karlsruhe Institute of Technology (KIT)), and Achim Streit (Karlsruhe Institute of Technology (KIT))

The increase in compute power and more sophisticated simulation models with higher-resolution output have triggered the need for compression algorithms for scientific data, and several such algorithms are currently under development. Most of them use prediction-based compression methods, in which each value is predicted and the residual with respect to the true value is saved to disk. Two forms of residual calculation are currently established: exclusive-or and numerical difference. In this paper we summarize both techniques and show their strengths and weaknesses. We show that shifting the prediction and true value to a binary number with certain properties results in a better compression ratio with minimal additional computational cost. This gain in compression ratio enables the use of less sophisticated prediction algorithms to achieve the same or better compression ratio with higher throughput during compression and decompression. Further, we introduce a new encoding scheme that achieves an 8% increase in compression ratio on average compared to the state of the art.
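
The two established residual schemes can be illustrated on IEEE-754 doubles. The sketch below assumes a trivial last-value predictor; it shows the mechanics of the two residual forms, not the paper's proposed shifting scheme.

```python
import struct

def bits(x: float) -> int:
    """Reinterpret an IEEE-754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_residual(pred: float, true: float) -> int:
    # Exclusive-or: leading bits shared by prediction and truth cancel to
    # zeros, which the entropy coder then stores cheaply.
    return bits(pred) ^ bits(true)

def diff_residual(pred: float, true: float) -> int:
    # Numerical difference of the bit patterns (sign handled by the encoder).
    return abs(bits(true) - bits(pred))

# A last-value predictor on a smooth series leaves small residuals either way:
prev, curr = 23.51, 23.52
print(f"xor : {xor_residual(prev, curr):064b}")
print(f"diff: {diff_residual(prev, curr):064b}")
```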

Speakers

Ugur Cayoglu

Karlsruhe Institute of Technology (KIT)



Friday September 27, 2019 12:00pm - 12:30pm
Toucan Room

12:00pm

SDM: A Scientific Dataset Delivery Platform
Illyoung Choi (University of Arizona), Jude Nelson (Blockstack PBC), Larry Peterson (Open Networking Foundation), and John Hartman (University of Arizona)

Scientific computing is becoming more data-centric and more collaborative, which means increasingly large datasets are being transferred across the Internet. Transferring these datasets efficiently and making them accessible to scientific workflows is an increasingly difficult task, and the data transfer time can be a significant portion of the overall workflow running time. This paper presents SDM (Syndicate Dataset Manager), a scientific dataset delivery platform. Unlike general-purpose data transfer tools, SDM offers on-demand access to remote scientific datasets. On-demand access doesn't require staging datasets to local file systems prior to computing on them, and it also enables overlapping computation and I/O. In addition, SDM offers a simple interface for users to locate datasets and access them. To validate the usefulness of SDM, we performed realistic metagenomic sequence analysis workflows on remote genomic datasets. In general, SDM outperforms existing data access methods when configured with a CDN. With warm CDN caches, SDM completes the workflow 17-20% faster than staging methods, and its performance is even comparable to local storage: only a 9% longer elapsed time than local HDD storage and an 18% longer elapsed time than local SSD storage. Together, its performance and ease of use make SDM an attractive platform for performing scientific workflows on remote datasets.
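
The performance difference comes largely from overlapping computation with I/O. A schematic contrast follows, with hypothetical `fetch_chunks`/`process` callables standing in for the dataset transport and the workflow step:

```python
def staged(fetch_chunks, process):
    """Staging: transfer the entire dataset to local storage, then compute."""
    local = list(fetch_chunks())      # all I/O completes before any compute
    for chunk in local:
        process(chunk)

def on_demand(fetch_chunks, process):
    """On-demand access in the spirit of SDM: compute on each chunk as it
    arrives, so computation overlaps the remaining I/O (illustration only)."""
    for chunk in fetch_chunks():      # generator: I/O and compute interleave
        process(chunk)

def fetch_chunks():
    for i in range(3):
        yield f"chunk-{i}"            # stands in for a remote read

on_demand(fetch_chunks, print)
```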

Speakers

Illyoung Choi

University of Arizona



Friday September 27, 2019 12:00pm - 12:30pm
Macaw Room

12:00pm

Toward an Elastic Data Transfer Infrastructure
Joaquin Chung (Argonne National Laboratory), Zhengchun Liu (Argonne National Laboratory), Rajkumar Kettimuthu (Argonne National Laboratory), and Ian Foster (Argonne National Laboratory)

Data transfer over wide area networks is an integral part of many science workflows. These workflows move data produced at experimental, observational and/or computational facilities to geographically distributed resources for analysis, sharing, and storing. Enhancements in the data transfer infrastructure of universities have improved the performance of data transfers for a number of users and science workflows. Despite these positive developments, our previous analyses of approximately 40 billion GridFTP command logs totaling 3.3 exabytes and 4.8 million transfer logs collected by the Globus transfer service from 2014/01/01 to 2018/01/01 show that data transfer nodes (DTNs) are completely idle (i.e., no transfers) 94.3% of the time. Furthermore, 80% of the DTNs are active less than 6% of the time. Motivated by the opportunity to optimize the architecture of data transfer infrastructure, we developed an elastic architecture for data transfer in which the DTNs expand and shrink based on demand. Our architecture is composed of agents that monitor resource utilization at bare-metal nodes and an orchestrator that decides when to provision or deprovision resources. Our results show that our elastic DTI can save up to 98% of resources compared with a typical DTN deployment, while experiencing only minimal overhead (∼1%).
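
The elastic behavior can be summarized as a control loop that resizes the DTN pool to current demand; the thresholds and sizing rule below are assumptions for illustration, not the paper's actual policy.

```python
def reconcile(active_dtns: int, queued_transfers: int,
              min_dtns: int = 1, per_dtn_capacity: int = 4) -> int:
    """Toy elastic control loop: return how many DTNs to add (positive) or
    release (negative) so the pool matches demand. Sizing rule is illustrative."""
    needed = max(min_dtns, -(-queued_transfers // per_dtn_capacity))  # ceil division
    return needed - active_dtns

# DTNs idle most of the time means the loop usually returns a shrink decision:
print(reconcile(active_dtns=4, queued_transfers=2))   # -3: release 3 nodes
print(reconcile(active_dtns=1, queued_transfers=10))  # +2: provision 2 nodes
```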

Speakers

Rajkumar Kettimuthu

Argonne National Laboratory



Friday September 27, 2019 12:00pm - 12:30pm
Cockatoo Room