Richard Juul Isdahl (Norwegian University of Science and Technology) and Odd Erik Gundersen (Norwegian University of Science and Technology)
Even machine learning experiments, which are conducted entirely on computers, are not necessarily reproducible. A growing number of open-source and commercial machine learning platforms are being developed to help address this problem. However, there is no standard method for assessing and comparing which features are required to fully support reproducibility. We propose a quantitative method that alleviates this problem. Based on the proposed method, we assess and compare current state-of-the-art machine learning platforms on how well they support making empirical results reproducible. Our results show that BEAT and FloydHub have the best support for reproducibility, with CodaLab and Kaggle as close contenders. The most commonly used machine learning platforms, provided by the big tech companies, have poor support for reproducibility.