Fast Parsing of the Kaggle CORD-19 Dataset
The Question
How do you convert 68,000 JSON files into a single CSV file, in a reproducible and traceable way, in one minute, on a laptop?
Introduction
The popular “COVID-19 Open Research Dataset Challenge (CORD-19)” on Kaggle asks participants to extract information from a dataset composed of a large number of scientific papers. While the Natural Language Processing (NLP) part of this project is the main focus of the Kaggle challenge, this post covers the important pre-processing of the dataset.
The complete source code is available on GitHub.
Pre-Processing
The Kaggle dataset is composed of JSON files, one file per paper. In total there are more than 68,000 files spread over several directories. The pre-processing proposed here converts all these files into one Comma Separated Values (CSV) file. (Actually, the separator could be any character.) Various NLP pre-processing steps, such as conversion to lowercase and tokenisation, can also be carried out in the process.
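To make the transformation concrete, here is a minimal plain-Python sketch of how a single paper could be flattened into one record. It is not the method from the repository; the field names are assumed from the CORD-19 JSON schema, and the tokenisation is deliberately naive (lowercase plus whitespace splitting):

```python
import json

def parse_paper(path):
    """Flatten one CORD-19 JSON file into a single flat record.

    The field names ("paper_id", "metadata", "body_text") are assumed
    from the CORD-19 JSON schema and may differ in other dataset versions.
    """
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    title = doc.get("metadata", {}).get("title", "")
    # Join all body paragraphs, then lowercase and tokenise on whitespace.
    body = " ".join(p.get("text", "") for p in doc.get("body_text", []))
    tokens = body.lower().split()
    return {
        "paper_id": doc.get("paper_id", ""),
        "title": title.lower(),
        "body": " ".join(tokens),
    }
```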
Why does this matter? Mainly because
- reading a single CSV file is faster than reading 68,000 JSON files from different directories.
- the resulting CSV file, and thus all further processing, becomes independent of the input data format, which may change.
- tokenisation, conversion to lowercase, etc. are carried out once and separated from all further processing, which is faster.
- a CSV file can be manipulated by standard shell tools such as grep for visualisation and validation tasks.
A single file is also easier to share.
In addition, we want the pre-processing to be fully reproducible and transparent. It should be possible to run it on several different versions of the dataset and of the algorithms without difficulty.
The Code
The source code is available here. It uses the Accelerator data processor from eBay for parallel and reproducible computing.
It is all very simple. There is only one custom method, dev/a_import.py, which reads all JSON files in parallel, parses them, and writes them to a common dataset. The method is called once for every directory containing JSON files, and all resulting datasets are chained. Finally, the bundled csvexport method is used to convert the resulting dataset chain into a CSV-like file. This method has options for selecting the separator, quoting, and more.
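For readers who want the general shape of the pipeline without installing the Accelerator, here is a rough plain-Python stand-in. The paths and column choices are hypothetical, and it has none of the Accelerator's dataset chaining or reuse of pre-computed results; it simply walks the directories, parses the files in parallel, and writes one CSV:

```python
import csv
import json
from multiprocessing import Pool
from pathlib import Path

# Hypothetical locations; adjust to wherever the Kaggle dataset is unpacked.
INPUT_ROOT = Path("CORD-19")
OUTPUT_CSV = Path("cord19.csv")

def parse_one(path):
    """Parse a single JSON file into a (paper_id, lowercased title) row."""
    with open(path, encoding="utf-8") as fh:
        doc = json.load(fh)
    return doc.get("paper_id", ""), doc.get("metadata", {}).get("title", "").lower()

def main():
    files = sorted(INPUT_ROOT.rglob("*.json"))
    with Pool() as pool, open(OUTPUT_CSV, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=",")  # the separator could be any character
        writer.writerow(["paper_id", "title"])
        # imap preserves input order, so repeated runs give identical output.
        for row in pool.imap(parse_one, files, chunksize=256):
            writer.writerow(row)

if __name__ == "__main__":
    main()
```

The Accelerator provides what this stand-in lacks: one dataset per input directory with the datasets chained together, and reuse of already computed results when the pipeline is re-run.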
Example Execution Time
On a 2018 high-performance laptop (HP EliteBook 840 G5), the complete processing takes 61 seconds.
(A 2009 Lenovo workstation will run this in 17 seconds, at half the price of the laptop!)
About the Accelerator
The Accelerator from eBay is designed for fast processing of large datasets on a single machine. Programs are written in plain Python, and thanks to its fast parallel processing and clever reuse of pre-computed results, execution times are typically in the range of a few seconds per task. Ideas can therefore be tried out with very little overhead, making the Accelerator a perfect tool for fast exploration of large datasets.
Additional Resources
- The Accelerator’s Homepage (exax.org)
- The Accelerator on GitHub (exaxorg)
- The Accelerator on PyPI
- Reference Manual