Urd Example - Part 1: Plain Import of a List of Files

Urd is the Norse goddess of the past. It is also the name of the Accelerator’s database that keeps track of built jobs and their dependencies.

This short example shows how to import a sequence of files into a chained dataset. The dataset can then be subject to further processing.

This is the first part of a series of three posts intended to show how to design build scripts for automated importing of files. The other posts are Part 2: Let Urd Keep Track of Datasets and Part 3: Appending New Columns to an Existing Dataset Chain. In this part, the build script is very basic and it does not make use of any of Urd’s advanced features. Using Urd for keeping track of things is the topic of the other posts in this series.

Background

It is common that a project is based on a set of input files. Sometimes these files could be arranged in order, for example by date, sometimes they do not have such a relation. Sometimes the number of files is fixed, sometimes more files will be added to the project later. The Accelerator is built to handle all these different cases while providing transparency and reproducibility.

This post looks at the basic case – how to import a sequence of files in a given order, and create a dataset chain which makes the data from all files available using the same simple interface.

A Build Script for Importing Files

Assume that all files in a list like the one below is to be imported.

files = (
    'yellow_tripdata_2009-01.csv',
    'yellow_tripdata_2009-02.csv',
    'yellow_tripdata_2009-03.csv',
)

(These are, in fact, the first three files in the NYC City Taxi dataset. In an upcoming post, we’ll provide an example build script that imports all the taxi files.)

It is straightforward to import these files in a loop, like this

def main(urd):
    previous = None
    for filename in files:
        importjob = urd.build('csvimport', filename=filename, previous=previous)
        previous = importjob

    # Now, "importjob" is a job object that can be used to access the whole chain of Datasets.

The csvimport method will be called once for each of the files in the files list. The method will read the file and create a dataset from the files’ contents.

The value returned by the urd.build()-call is a job object corresponding to the built csvimport job. The job object is also used as a reference to the dataset(s) created in the job.

To create a chain of datasets, we insert the job object of the previous import job into the next, using the previous-parameter. This basically creates a linked list of jobs (and datasets).

If new files are added at a later time, these will be imported and chained as well.

Running the Build Script.

Build scripts are named build.py or build_<something>.py. Run them with ax run or ax run <something>. Add option --fullpath to print the full path names of all job directories.

Below is a typical build script output

dev.build_importchain
        -  csvimport                                     MAKE  /zbd/workdirs/dev/dev-0
	              10.6 seconds
        -  csvimport                                     MAKE  /zbd/workdirs/dev/dev-1
	              10.0 seconds
        -  csvimport                                     MAKE  /zbd/workdirs/dev/dev-2
	              10.9 seconds

Inspection

There are two commands available for looking at various aspects of datasets: dsinfo and dsgrep.

Dataset Information

dsinfo is used to get information about a dataset or dataset chain.

To investigate the default dataset (and its chain -c) of job dev-2, run

% ax dsinfo dev-2 -c

which will return something similar to

dev-2/default
    Method: csvimport
    Filename: /zbd/data/nyctaxi2019/yellow_tripdata_2009-03.csv
    Previous: dev-1
    Columns:
        dropoff_datetime    bytes
        dropoff_latitude    bytes
        dropoff_longitude   bytes
        fare_amount         bytes
        mta_tax             bytes
        passenger_count     bytes
        payment_type        bytes
        pickup_datetime     bytes
        pickup_latitude     bytes
        pickup_longitude    bytes
        rate_code           bytes
        store_and_fwd_flag  bytes
        surcharge           bytes
        tip_amount          bytes
        tolls_amount        bytes
        total_amount        bytes
        trip_distance       bytes
        vendor_id           bytes
    18 columns
    14,387,371 lines
    Chain length 3, from dev-0 to dev-2
          0: dev-0/default (14,092,413)
	  1: dev-1/default (13,380,122)
	  2: dev-2/default (14,387,371)
    41,859,906 total lines in chain

Here, it shows that the full name of the dataset is dev-2/default, and that it is chained (linked) to dev-1 using the previous attribute. The bottom lines show that there are three datasets in the chain, together with the number of lines in each dataset. (Note that all columns are untyped, i.e. are of type bytes. Ideally, the dataset should be typed using the dataset_type method. More about this in an upcoming post.)

Try ax dsinfo --help for more options.

List and Find Data in a Dataset

The dsgrep command is similar to grep, and it can be used to find and show data in a dataset.

% ax dsgrep <regexp> <dataset> <columns>

The <columns> option is optional, and <regexp> is an extended grep regular expression.

The dsgrep command can be used to print all data in a dataset. This is done using the regular expression “.” that matches any character.

% ax dsgrep . dev-2 | head

(The pipe to head is used to limit the number of output lines to ten.)

It is also possible to limit search to one or a subset of all columns

% ax dsgrep . dev-2 tip_amount | head

that will print ten lines from the tip_amount column.

Again, use the --help option to get a list of all options.

In the next post, we will show how to use Urd to keep track of imported files and jobs in general.

Additional Resources

The Accelerator’s Homepage (exax.org)
The Accelerator on Github/eBay
The Accelerator on PyPI
Reference Manual