Urd Example - Part 1: Plain Import of a List of Files
[Last updated 2021-10-12]
Urd is the Norse goddess of the past. It is also the name of the Accelerator’s database that keeps track of built jobs and their dependencies.
This short example shows how to import a sequence of files into a chained dataset. The dataset can then be subject to further processing.
This is the first part of a series of three posts intended to show how to design build scripts for automated importing of files. The other posts are Part 2: Let Urd Keep Track of Datasets and Part 3: Appending New Columns to an Existing Dataset Chain. In this part, the build script is very basic and it does not make use of any of Urd’s advanced features. Using Urd for keeping track of things is the topic of the other posts in this series.
Background
It is common that a project is based on a set of input files. Sometimes these files could be arranged in order, for example by date, sometimes they do not have such a relation. Sometimes the number of files is fixed, sometimes more files will be added to the project later. The Accelerator is built to handle all these different cases while providing transparency and reproducibility.
This post looks at the basic case – how to import a sequence of files in a given order, and create a dataset chain which makes the data from all files available using the same simple interface.
A Build Script for Importing Files
Assume that all files in a list like the one below is to be imported.
files = (
'yellow_tripdata_2009-01.csv',
'yellow_tripdata_2009-02.csv',
'yellow_tripdata_2009-03.csv',
)
(These are, in fact, the first three files in the NYC City Taxi dataset. In an upcoming post, we’ll provide an example build script that imports all the taxi files.)
It is straightforward to import these files in a loop, like this
def main(urd):
previous = None
for filename in files:
importjob = urd.build('csvimport', filename=filename, previous=previous)
previous = importjob
# Now, "importjob" is a job object that can be used to access the whole chain of Datasets.
The csvimport
method will be called once for each of the files in
the files
list. The method will read the file and create a dataset
from the files’ contents.
The value returned by the urd.build()
-call is a job object
corresponding to the built csvimport
job. The job object is also
used as a reference to the dataset(s) created in the job.
To create a chain of datasets, we insert the job object of the
previous import job into the next, using the previous
-parameter.
This basically creates a linked list of jobs (and datasets).
If new files are added at a later time, these will be imported and chained as well.
Running the Build Script.
Build scripts are named build.py
or
build_<something>.py
. Run them with ax run
or ax run <something>
.
Add option --fullpath
to print the full path names of all job directories.
Below is a typical build script output
dev.build_importchain
- csvimport MAKE /zbd/workdirs/dev/dev-0
10.6 seconds
- csvimport MAKE /zbd/workdirs/dev/dev-1
10.0 seconds
- csvimport MAKE /zbd/workdirs/dev/dev-2
10.9 seconds
Inspection
There are three commands available for looking at various aspects of
datasets: ds
, cat
, and grep
.
Dataset Information
ds
is used to get information about a dataset or dataset chain.
To investigate the default dataset (and its chain -c
) of job dev-2
, run
% ax ds dev-2 -c
which will return something similar to
dev-2/default
Method: csvimport
Filename: /zbd/data/nyctaxi2019/yellow_tripdata_2009-03.csv
Previous: dev-1
Columns:
dropoff_datetime bytes
dropoff_latitude bytes
dropoff_longitude bytes
fare_amount bytes
mta_tax bytes
passenger_count bytes
payment_type bytes
pickup_datetime bytes
pickup_latitude bytes
pickup_longitude bytes
rate_code bytes
store_and_fwd_flag bytes
surcharge bytes
tip_amount bytes
tolls_amount bytes
total_amount bytes
trip_distance bytes
vendor_id bytes
18 columns
14,387,371 lines
Chain length 3, from dev-0 to dev-2
0: dev-0/default (14,092,413)
1: dev-1/default (13,380,122)
2: dev-2/default (14,387,371)
41,859,906 total lines in chain
Here, it shows that the full name of the dataset is dev-2/default
,
and that it is chained (linked) to dev-1
using the previous
attribute. The bottom lines show that there are three datasets in the
chain, together with the number of lines in each dataset. (Note
that all columns are untyped, i.e. are of type bytes
. Ideally, the
dataset should be typed using the dataset_type
method. More about
this in an upcoming post.)
Try ax ds --help
for more options.
List and Find Data in a Dataset
The grep
command is similar to the grep
shell command, and it can
be used to find and show data in a dataset.
% ax grep <regexp> <dataset> <columns>
The <columns>
option is optional, and <regexp>
is an extended
grep
regular expression.
% ax grep 4 dev-2 passenger_count
The cat
command can be used to print all data (or a specified set
of columns) in dataset(s).
% ax cat dev-2 | head
(The pipe to head
is used to limit the number of output lines to
ten.)
Again, use the --help
option to get a list of all options.
In the next post, we will show how to use Urd to keep track of imported files and jobs in general.
Additional Resources
The Accelerator’s Homepage (exax.org)
The Accelerator on Github/exaxorg
The Accelerator on PyPI
Reference Manual