Thursday, September 26 • 10:30am - 11:00am
Quality-Aware Human-Machine Text Extraction for Biocollections using Ensembles of OCRs

Icaro Alzuru (University of Florida), Rhiannon Stephens (Australian Museum), Andréa Matsunaga (Advanced Computing and Information Systems Laboratory), Maurício Tsugawa (Advanced Computing and Information Systems Laboratory), Paul Flemons (Australian Museum), and José A.B. Fortes (University of Florida)

Information Extraction (IE) from text in images depends on the output quality of the text recognition process. Misspelled or missing text may propagate errors or even preclude IE. Low confidence in automated methods leads some IE projects to rely exclusively on human work (crowdsourcing). This is the case for biological collections (biocollections), where the metadata (Darwin Core terms) found in digitized labels are transcribed by citizen scientists. In this paper, we present an approach to reduce the number of crowdsourcing tasks required to transcribe the text found in biocollections’ images. By using an ensemble of three Optical Character Recognition (OCR) engines (OCRopus, Tesseract, and the Google Cloud OCR), our approach identifies the lines and characters that are correct with high probability. This allows human effort to be dedicated to transcribing only the low-confidence fragments of text. The number of lines to transcribe is further reduced through hybrid human-machine crowdsourcing: the output of the ensemble of OCRs is used as the first "human" transcription in the redundant crowdsourcing process. Our approach was tested on six biocollections (2,966 images), reducing the number of crowdsourcing tasks by 76% (58% due to lines accepted by the ensemble of OCRs and about 18% due to accelerated convergence when using hybrid crowdsourcing). The automatically extracted text had a character error rate of 0.001 (0.1%).
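
As a rough illustration of the two ideas the abstract names (accepting a line when the OCR engines agree, and scoring output by character error rate), the following is a minimal Python sketch. It is not the authors' method: the abstract does not say how agreement is measured, so the exact-match vote after whitespace normalization, the `min_votes` threshold, and all function names here are assumptions made for illustration. The character error rate is computed in the usual way, as edit distance divided by the length of the reference text.

```python
from collections import Counter


def normalize(line: str) -> str:
    """Collapse runs of whitespace so trivial spacing differences do not block agreement."""
    return " ".join(line.split())


def accept_by_agreement(candidates, min_votes=2):
    """Return the line text that at least `min_votes` engines agree on, else None.

    `candidates` holds one transcription of the same line per OCR engine.
    Exact-match voting is an assumption; the abstract does not specify the rule.
    """
    if not candidates:
        return None
    counts = Counter(normalize(c) for c in candidates)
    text, votes = counts.most_common(1)[0]
    # Reject empty strings: agreement on "nothing recognized" is not a transcription.
    return text if votes >= min_votes and text else None


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between hypothesis and reference, divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under these assumptions, a line on which two of the three engines agree would be accepted automatically, while a line with three different outputs would return `None` and be routed to crowdsourcing.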


