Wednesday, September 25 • 10:30am - 11:00am
defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

Rosa Filgueira (University of Edinburgh), Michael Jackson (University of Edinburgh), Anna Roubickova (University of Edinburgh), Amrey Krause (University of Edinburgh), Ruth Ahnert (Queen Mary University of London), Tessa Hauswedell (University College London), Julianne Nyhan (University College London), David Beavan (The Alan Turing Institute), Timothy Hobson (The Alan Turing Institute), Mariona Coll Ardanuy (The Alan Turing Institute), Giovanni Colavizza (The Alan Turing Institute), James Hetherington (The Alan Turing Institute), and Melissa Terras (University of Edinburgh)

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It runs text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The tool has been successfully evaluated using five large-scale historical text datasets in two computing environments, Cray Urika-GX and Eddie, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.


