Urd Example - Part 3: Appending New Columns to an Existing Dataset Chain

[Last updated 2021-10-12]

Urd is the Norse goddess of the past. It is also the name of the Accelerator’s database that keeps track of built jobs and their dependencies.

In this post, we show how to use Urd to append a new column to a dataset chain.

This is the third and last part of a series of three posts. The other posts are Part 1: Plain Import of a List of Files and Part 2: Let Urd Keep Track of Datasets.

Two Build Scripts

Again, we are working on these input files with timestamps

files = (
    ('2009-01', 'yellow_tripdata_2009-01.csv'),
    ('2009-02', 'yellow_tripdata_2009-02.csv'),
    ('2009-03', 'yellow_tripdata_2009-03.csv'),
)

Below are two build scripts.

The first build script, doing the importing, was presented in the previous post. Nothing new here.

def main(urd):
    last_created = urd.peek_latest('import').timestamp
    for ts, filename in files:
        if ts > last_created:
            urd.begin('import', ts)
            previous = urd.latest('import').joblist.get(-1)
            urd.build('csvimport', filename=filename, previous=previous)
            urd.finish('import')

The second build script makes sure that the append method is run for every job created by the first build script. Furthermore, all append jobs are chained in the same way as the csvimport jobs.

def main(urd):
    last_appended = urd.peek_latest('append').timestamp
    for ts in urd.since('import', last_appended):
        urd.begin('append', ts)
        parent = urd.get('import', ts).joblist.get('csvimport')
        previous = urd.latest('append').joblist.get(-1)
        urd.build('append', data='this is the time: %s' % (ts,), parent=parent, previous=previous)
        urd.finish('append')

Detailed explanation:

The urd.peek_latest('append') call fetches the latest session in the append list without recording a dependency, and urd.since('import', last_appended) returns the timestamps of all import sessions more recent than that, so only datasets that do not already have the new column are processed. For each such timestamp, urd.get('import', ts) fetches the corresponding import session, and the csvimport job in its joblist becomes the parent of the new append job, while the latest append job becomes its previous, extending the chain.

Note that both the parent (found by urd.get()) and the previous (found by urd.latest()) sessions will be added as dependencies to the running append session.
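
To make the example concrete, here is a minimal sketch of what the append method itself could look like. This is not from the original post: the file name a_append.py, the column name new_col, and its ascii type are all assumptions, but the pattern of writing only the new column on top of a parent dataset follows the Accelerator's datasetwriter API.

# a_append.py -- hypothetical sketch of the append method.
# The column name "new_col" and the "ascii" type are assumptions.

options = dict(data='')            # string passed from the build script
datasets = ('parent', 'previous',)

def prepare(job):
    # With parent= set, only the new column is stored; all existing
    # columns are inherited from the parent dataset.
    dw = job.datasetwriter(parent=datasets.parent,
                           previous=datasets.previous)
    dw.add('new_col', 'ascii')
    return dw

def analysis(sliceno, prepare_res):
    dw = prepare_res
    # Write one value per existing row in this slice.
    for _ in range(datasets.parent.lines[sliceno]):
        dw.write(options.data)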

Inspection

Now we have two Urd lists

% ax urd
<user>/import
<user>/append

And here are all sessions stored in them

% ax curl import/since/0
2009-01
2009-02
2009-03

% ax curl append/since/0
2009-01
2009-02
2009-03

Looking at the latest job in the append list (slightly compacted)

% ax urd append
timestamp: 2009-03
caption  :
deps     : <user>/append/2009-02
           <user>/import/2009-03
JobList(
   [  0] append : dev-5
)

Also, have a look at the latest append dataset

% ax ds :append:

Some Notes

When New Files are Added, Everything Just Works!

Running the import build script with new files added will, for each new file, import the file, chain the dataset to the existing chain, and add a new session to Urd with the new timestamp.

Running the append build script will, in a similar fashion, append a new column to all datasets in the import chain that do not already have the new column.

In a more advanced setup, a cron job could be set up to regularly and incrementally import and process any files and data that are new since the last run.
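
As a sketch, such a cron entry could look like the one below, assuming the two build scripts are named build_import.py and build_append.py (so that ax run import and ax run append pick them up); the project path is a placeholder.

# hypothetical crontab entry: import and append new data every night at 02:00
0 2 * * * cd /path/to/project && ax run import && ax run append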

Do More than One Job in a List: dataset_type

A typical import build script will do more than just csvimport. It will also type the imported dataset using dataset_type, and perhaps apply sorting and/or hash partitioning using dataset_sort and dataset_hashpart. In addition, custom methods may be applied to perform special operations on the imported data. In the end, an import Urd session is composed of several jobs, of which the last one contains the dataset used for further processing.
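
As an illustration, here is a hedged sketch of an import build script extended with a dataset_type step. The column name Passenger_Count and the number type are assumptions, not taken from the actual input files.

def main(urd):
    last_created = urd.peek_latest('import').timestamp
    for ts, filename in files:
        if ts > last_created:
            urd.begin('import', ts)
            previous = urd.latest('import').joblist
            # import the file, chained as before
            imp = urd.build('csvimport', filename=filename,
                            previous=previous.get('csvimport'))
            # type one of the columns (name and type are assumptions)
            urd.build('dataset_type', source=imp,
                      previous=previous.get('dataset_type'),
                      column2type={'Passenger_Count': 'number'})
            urd.finish('import')

Here, the dataset_type job is the last job in the session, so its dataset is the one to use for further processing.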

Separation into Build Scripts and Workdirs

Separation is a good thing. It makes sense to have separate build scripts for different tasks, such as importing and processing. It also makes sense to let the different build scripts store jobs in different workdirs. The import workdir is written once by the import build script and then read many times. Other workdirs may contain more exploratory jobs and may be erased from time to time. Using separate workdirs ensures that each part of the processing occupies its own space in the storage system, making it possible to, say, erase out-of-date data analysis jobs without affecting the import jobs.
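
For reference, here is a sketch of how several workdirs could be declared in the Accelerator's configuration file. The paths are placeholders, and the exact layout of your accelerator.conf may differ.

# excerpt from accelerator.conf (paths are placeholders)
workdirs:
    import /data/workdirs/import
    dev /data/workdirs/dev

# new jobs are built in the target workdir;
# jobs in all listed workdirs remain accessible
target workdir: dev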

Additional Resources

The Accelerator’s Homepage (exax.org)
The Accelerator on Github/exaxorg
The Accelerator on PyPI
Reference Manual