Urd Example - Part 1: Plain Import of a List of Files
Urd is the Norse goddess of the past. It is also the name of the Accelerator’s database that keeps track of built jobs and their dependencies.
This short example shows how to import a sequence of files into a chained dataset. The dataset can then be subject to further processing.
This is the first part of a series of three posts intended to show how to design build scripts for automated importing of files. The other posts are Part 2: Let Urd Keep Track of Datasets and Part 3: Appending New Columns to an Existing Dataset Chain. In this part, the build script is very basic and it does not make use of any of Urd’s advanced features. Using Urd for keeping track of things is the topic of the other posts in this series.
It is common for a project to be based on a set of input files. Sometimes these files can be arranged in order, for example by date; sometimes they have no such relation. Sometimes the number of files is fixed; sometimes more files will be added to the project later. The Accelerator is built to handle all these different cases while providing transparency and reproducibility.
This post looks at the basic case – how to import a sequence of files in a given order, and create a dataset chain which makes the data from all files available using the same simple interface.
A Build Script for Importing Files
Assume that all files in a list like the one below are to be imported.
files = (
    'yellow_tripdata_2009-01.csv',
    'yellow_tripdata_2009-02.csv',
    'yellow_tripdata_2009-03.csv',
)
(These are, in fact, the first three files in the NYC Taxi dataset. In an upcoming post, we’ll provide an example build script that imports all the taxi files.)
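When the file names follow a regular pattern, as the monthly taxi files do, the tuple can be generated instead of typed by hand. The following is a minimal sketch, assuming the `yellow_tripdata_YYYY-MM.csv` naming shown above; the helper name `monthly_files` is hypothetical and not part of the Accelerator.

```python
def monthly_files(year, first_month, last_month, prefix='yellow_tripdata'):
    """Return file names for consecutive months of one year, in date order.

    Hypothetical helper for illustration only; it just formats strings and
    does not check that the files exist.
    """
    return tuple(
        '%s_%04d-%02d.csv' % (prefix, year, month)
        for month in range(first_month, last_month + 1)
    )


files = monthly_files(2009, 1, 3)
for filename in files:
    print(filename)
```

Keeping the result a plain tuple of strings means it can be used in place of the hand-written `files` tuple without other changes.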
It is straightforward to import these files in a loop, like this:
def main(urd):
    previous = None
    for filename in files:
        importjob = urd.build('csvimport', filename=filename, previous=previous)
        previous = importjob
    # Now, "importjob" is a job object that can be used to access the whole chain of Datasets.
The csvimport method will be called once for each of the files in the files list. The method will read the file and create a dataset from the file’s contents.
The value returned by the urd.build() call is a job object corresponding to the built csvimport job. The job object is also used as a reference to the dataset(s) created in the job.
To create a chain of datasets, we insert the job object of the previous import job into the next, using the previous parameter. This basically creates a linked list of jobs (and datasets).
If new files are added at a later time, these will be imported and chained as well.
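The linked-list structure can be illustrated without the Accelerator at all. The sketch below is a toy model: `ToyJob`, `toy_build_chain` and `walk_chain` are hypothetical names invented here, and the `dev-N` job names only mimic the job ids seen in this post. It shows nothing more than how passing each job as `previous` to the next one lets the last job reach the whole chain.

```python
class ToyJob:
    """Stand-in for a job object; it only models the 'previous' link."""

    def __init__(self, name, filename, previous=None):
        self.name = name
        self.filename = filename
        self.previous = previous


def toy_build_chain(files):
    """Mimic the build loop: each new job points back at the one before it."""
    previous = None
    for number, filename in enumerate(files):
        job = ToyJob('dev-%d' % number, filename, previous=previous)
        previous = job
    return previous  # the last job, or None if 'files' is empty


def walk_chain(job):
    """Follow the 'previous' links back to the start; return names oldest first."""
    names = []
    while job is not None:
        names.append(job.name)
        job = job.previous
    return names[::-1]


files = (
    'yellow_tripdata_2009-01.csv',
    'yellow_tripdata_2009-02.csv',
    'yellow_tripdata_2009-03.csv',
)
last = toy_build_chain(files)
print(walk_chain(last))
```

Only the last job needs to be kept around, since every earlier job is reachable from it. This is why the final `importjob` in the build script is enough to access the complete chain.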
Running the Build Script
Build scripts are named build_<something>.py. Run them with ax run or ax run <something>. Use --fullpath to print the full path names of all job directories.
Below is a typical build script output:
dev.build_importchain
 - csvimport MAKE /zbd/workdirs/dev/dev-0 10.6 seconds
 - csvimport MAKE /zbd/workdirs/dev/dev-1 10.0 seconds
 - csvimport MAKE /zbd/workdirs/dev/dev-2 10.9 seconds
There are two commands available for looking at various aspects of datasets: dsinfo and dsgrep.
The dsinfo command is used to get information about a dataset or dataset chain.
To investigate the default dataset of job dev-2 and its chain (-c), run
% ax dsinfo dev-2 -c
which will return something similar to
dev-2/default
    Method: csvimport
    Filename: /zbd/data/nyctaxi2019/yellow_tripdata_2009-03.csv
    Previous: dev-1
    Columns:
        dropoff_datetime    bytes
        dropoff_latitude    bytes
        dropoff_longitude   bytes
        fare_amount         bytes
        mta_tax             bytes
        passenger_count     bytes
        payment_type        bytes
        pickup_datetime     bytes
        pickup_latitude     bytes
        pickup_longitude    bytes
        rate_code           bytes
        store_and_fwd_flag  bytes
        surcharge           bytes
        tip_amount          bytes
        tolls_amount        bytes
        total_amount        bytes
        trip_distance       bytes
        vendor_id           bytes
    18 columns
    14,387,371 lines
    Chain length 3, from dev-0 to dev-2
        0: dev-0/default (14,092,413)
        1: dev-1/default (13,380,122)
        2: dev-2/default (14,387,371)
    41,859,906 total lines in chain
Here, it shows that the full name of the dataset is dev-2/default and that it is chained (linked) to dev-1 using the previous attribute. The bottom lines show that there are three datasets in the chain, together with the number of lines in each dataset. (Note that all columns are untyped, i.e. are of type bytes. Ideally, the dataset should be typed using the dataset_type method. More about this in an upcoming post.)
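As a quick sanity check, the chain total reported by dsinfo is simply the sum of the line counts of the individual datasets. The sketch below redoes that arithmetic in plain Python, with the numbers copied from the dsinfo output in this post; it does not call the Accelerator.

```python
# Line counts per dataset, copied from the dsinfo output.
chain_lines = [
    ('dev-0/default', 14092413),
    ('dev-1/default', 13380122),
    ('dev-2/default', 14387371),
]

total = sum(lines for _, lines in chain_lines)

for position, (name, lines) in enumerate(chain_lines):
    print('%d: %s (%s)' % (position, name, format(lines, ',')))
print('%s total lines in chain' % format(total, ','))
```

The last line printed is `41,859,906 total lines in chain`, which agrees with the dsinfo output.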
See ax dsinfo --help for more options.
List and Find Data in a Dataset
The dsgrep command is similar to grep, and it can be used to find and show data in a dataset.
% ax dsgrep <regexp> <dataset> <columns>
The <columns> argument is optional, and <regexp> is an extended regular expression.
The dsgrep command can be used to print all data in a dataset. This is done using the regular expression “.”, which matches any character.
% ax dsgrep . dev-2 | head
(The pipe to head is used to limit the number of output lines to ten.)
It is also possible to limit the search to one column, or to a subset of all columns:
% ax dsgrep . dev-2 tip_amount | head
This will print ten lines from the tip_amount column.
Again, use the --help option to get a list of all options.
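To make the matching idea concrete, here is a toy model of a grep over rows, written with Python's standard `re` module. It is not how ax dsgrep is implemented, and the real command's matching rules and output format may differ. The function name `toy_dsgrep` and the three sample rows are made up for illustration; `islice` plays the role of the pipe to `head`.

```python
import re
from itertools import islice


def toy_dsgrep(pattern, rows, columns=None):
    """Yield the selected columns of each row where the pattern matches
    in at least one of those columns. Toy model only."""
    regexp = re.compile(pattern)
    for row in rows:
        selected = {name: row[name] for name in (columns or row)}
        if any(regexp.search(value) for value in selected.values()):
            yield selected


# Made-up sample rows; not real taxi data.
rows = [
    {'vendor_id': 'A', 'tip_amount': '0'},
    {'vendor_id': 'B', 'tip_amount': '2.5'},
    {'vendor_id': 'C', 'tip_amount': ''},
]

# "." matches any character, so every row with a non-empty value is shown.
for row in islice(toy_dsgrep('.', rows), 10):
    print(row)

# Restricting the search to one column only looks at (and shows) that column.
for row in islice(toy_dsgrep('.', rows, ['tip_amount']), 10):
    print(row)
```

In this toy model the third row is skipped by the second search because its `tip_amount` is empty, so “.” has no character to match.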
In the next post, we will show how to use Urd to keep track of imported files and jobs in general.
The Accelerator’s Homepage (exax.org)
The Accelerator on Github/eBay
The Accelerator on PyPI