escience2019 has ended
Wednesday, September 25 • 5:15pm - 5:45pm
dislib: Large Scale High Performance Machine Learning in Python

Javier Álvarez Cid-Fuentes (Barcelona Supercomputing Center), Salvi Solà (Barcelona Supercomputing Center), Pol Álvarez (Barcelona Supercomputing Center), Alfred Castro-Ginard (Dept. Física Quàntica i Astrofísica, Institut de Ciències del Cosmos (ICCUB), Universitat de Barcelona (IEEC-UB)), and Rosa M. Badia (Barcelona Supercomputing Center)

In recent years, machine learning has proven to be an extremely useful tool for extracting knowledge from data. This offers great potential for computational science, especially in research fields that deal with large amounts of data, such as genomics, earth sciences, and astrophysics. At the same time, Python has become one of the most popular programming languages among researchers due to its high productivity and rich ecosystem. Unfortunately, existing machine learning libraries for Python do not scale to large data sets, are hard for non-experts to use, and are difficult to set up on high performance computing clusters. These limitations have prevented scientists from exploiting the full potential of machine learning in their research. In this paper, we present and evaluate dislib, a distributed machine learning library built on top of the PyCOMPSs programming model that addresses these issues. In our evaluation, we show that dislib can be up to 9 times faster, and can process data sets up to 16 times larger, than other popular distributed machine learning libraries, such as MLlib. In addition, we show how dislib can be used to reduce the computation time of a real scientific application from 18 hours to 17 minutes.


Javier Álvarez Cid-Fuentes

Barcelona Supercomputing Center

Wednesday September 25, 2019 5:15pm - 5:45pm PDT
Macaw Room