Urd Example - Part 3: Appending New Columns to an Existing Dataset Chain
[Last updated 2021-10-12]
Urd is the Norse goddess of the past. It is also the name of the Accelerator’s database that keeps track of built jobs and their dependencies.
In this post, we show how to use Urd to append a new column to a dataset chain.
This is the third and last part of a series of three posts. The other posts are Part 1: Plain Import of a List of Files and Part 2: Let Urd Keep Track of Datasets.
Two Build Scripts
Again, we are working on these input files with timestamps

files = (
    ('2009-01', 'yellow_tripdata_2009-01.csv'),
    ('2009-02', 'yellow_tripdata_2009-02.csv'),
    ('2009-03', 'yellow_tripdata_2009-03.csv'),
)
Below are two build scripts.
The first build script will import a list of files into a dataset chain and store the jobs in a list in Urd.
The second build script will run a method that appends a column to the full chain of import jobs, one job at a time, creating a new chained dataset with an additional column.
The first build script, doing the importing, was presented in the previous post. Nothing new here.
def main(urd):
    last_created = urd.peek_latest('import').timestamp
    for ts, filename in files:
        if ts > last_created:
            urd.begin('import', ts)
            previous = urd.latest('import').joblist.get(-1)
            urd.build('csvimport', filename=filename, previous=previous)
            urd.finish('import')
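The incremental check on the `if ts > last_created` line works because zero-padded 'YYYY-MM' strings sort chronologically when compared as plain strings. A minimal pure-Python sketch of that filtering, independent of the Accelerator:

```python
# Zero-padded 'YYYY-MM' timestamps compare chronologically as plain strings,
# which is what the build script's "ts > last_created" test relies on.
files = (
    ('2009-01', 'yellow_tripdata_2009-01.csv'),
    ('2009-02', 'yellow_tripdata_2009-02.csv'),
    ('2009-03', 'yellow_tripdata_2009-03.csv'),
)

def pending_imports(files, last_created):
    """Return the (ts, filename) pairs newer than the last imported timestamp."""
    return [(ts, fn) for ts, fn in files if ts > last_created]

# Only 2009-02 and 2009-03 remain to be imported.
print(pending_imports(files, '2009-01'))
```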
The second build script will make sure that the append method is run on all jobs created by the first build script. Furthermore, the append jobs will be chained in the same way as the import jobs.
1  def main(urd):
2      last_appended = urd.peek_latest('append').timestamp
3      for ts in urd.since('import', last_appended):
4          urd.begin('append', ts)
5          parent = urd.get('import', ts).joblist.get('csvimport')
6          previous = urd.latest('append').joblist.get(-1)
7          urd.build('append', data='this is the time: %s' % (ts,), parent=parent, previous=previous)
8          urd.finish('append')
Line 2. Like in the previous import example, we find the timestamp of the latest (i.e. newest) append job that we’ve done so far.
Line 3. Loop over all timestamps in the import list that are newer than the latest append job. We have not created append jobs for these yet.
Lines 4 and 8 delimit the append Urd session for each timestamp from the for-loop.
In line 5, we find the csvimport job from the import Urd list with the same timestamp as the ongoing Urd session. We will append a column to this job's dataset.
In line 6, we find the previous append job by fetching the latest append Urd session and selecting the last job in its joblist. (There is only one.)
Finally, line 7 builds the chained append job, which takes the csvimport job as parent. All columns in the csvimport job will now also be exposed in the append job, together with any new columns created by the append method.
Note that both the parent (found by urd.get()) and the previous (found by urd.latest()) sessions will be added as dependencies to the current append session.
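To make the chaining concrete, here is a small pure-Python mock of the links the two build scripts create. The dicts and names here are illustrative only, not the Accelerator's real API:

```python
# Illustrative mock of the linking performed by the two build scripts.
# Each "job" is just a dict; in the real Accelerator these are built jobs.
import_jobs = {}   # ts -> csvimport job
append_jobs = {}   # ts -> append job

previous_import = None
previous_append = None
for ts in ('2009-01', '2009-02', '2009-03'):
    # First build script: csvimport jobs chained via previous.
    import_jobs[ts] = {'method': 'csvimport', 'previous': previous_import}
    previous_import = import_jobs[ts]
    # Second build script: append jobs chained via previous,
    # each tied to its import job via parent.
    append_jobs[ts] = {'method': 'append',
                       'parent': import_jobs[ts],
                       'previous': previous_append}
    previous_append = append_jobs[ts]

# The 2009-03 append job points at the 2009-03 csvimport (parent)
# and at the 2009-02 append job (previous), mirroring the deps shown by Urd.
assert append_jobs['2009-03']['parent'] is import_jobs['2009-03']
assert append_jobs['2009-03']['previous'] is append_jobs['2009-02']
```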
Now we have two Urd lists

% ax urd
<user>/import
<user>/append
And here are all sessions stored in them

% ax curl import/since/0
2009-01
2009-02
2009-03
% ax curl append/since/0
2009-01
2009-02
2009-03
Looking at the latest session in the append list (slightly compacted)

% ax urd append
timestamp: 2009-03
caption  :
deps     : <user>/append/2009-02
           <user>/import/2009-03
JobList(
   [  0]  append : dev-5
)
Also, have a look at the latest dataset in the append chain

% ax ds :append:
When New Files are Added, Everything Just Works!
Running the import build script again after new files have been added will, for each new file, import the file, chain the dataset to the existing chain, and add a new session to Urd with the new timestamp. The append build script will, in a similar fashion, append a new column to all datasets in the import chain that do not already have the new column.
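The catch-up behaviour provided by urd.since() can be sketched in plain Python. This is a simplified model, not the real implementation:

```python
# Simplified model of how the append script catches up with the import list:
# every import session newer than the latest append session is pending.
import_list = ['2009-01', '2009-02', '2009-03', '2009-04']  # 2009-04 newly imported
append_list = ['2009-01', '2009-02', '2009-03']

def since(timestamps, last):
    """Timestamps strictly newer than 'last', in order (cf. urd.since())."""
    return [ts for ts in timestamps if ts > last]

last_appended = append_list[-1] if append_list else '0'
# Only '2009-04' still needs an append job.
print(since(import_list, last_appended))
```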
In a more advanced setup, a cron job could be set up to regularly, and in an incremental fashion, import and process files and data that are new since the last run.
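Such a cron setup could look like the sketch below. The project path and build script names are hypothetical; the sketch assumes the two build scripts can be launched with "ax run" from the project directory:

```
# m h dom mon dow  command   (sketch: path and script names are hypothetical)
0 3 * * *  cd /path/to/project && ax run import && ax run append
```

Because both scripts are incremental, rerunning them when nothing is new is cheap: they find no timestamps newer than the latest session and build nothing.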
Do More than One Job in a List: csvimport + dataset_type
A typical import build script will do more than just import. It will also type the imported dataset using dataset_type, and perhaps apply sorting and/or hash partitioning using dataset_hashpart. In addition, custom methods may be applied to do some special operations on the imported data. In the end, an import Urd session will be composed of several jobs, of which the last one is the one that will contain the dataset used for further processing.
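The "last job in the session carries the dataset" convention can be modelled with a tiny pure-Python mock. The class, the column name, and the options below are illustrative stand-ins, not the Accelerator's real API:

```python
# Toy stand-in for an Urd session composed of several jobs (illustrative only).
class MockSession:
    def __init__(self):
        self.joblist = []

    def build(self, method, **options):
        """Record a 'job' in the session and return it."""
        job = {'method': method, 'options': options}
        self.joblist.append(job)
        return job

# A typical import session: import the file, then type the dataset.
session = MockSession()
imp = session.build('csvimport', filename='yellow_tripdata_2009-01.csv')
typ = session.build('dataset_type', source=imp,
                    column2type={'Passenger_Count': 'number'})  # illustrative column

# The last job in the session is the one used for further processing.
assert session.joblist[-1] is typ
```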
Separation into Build Scripts and Workdirs
Separation is a good thing. It makes sense to have separate build scripts for various tasks, such as the tasks of importing and processing. It also makes sense to let the different build scripts store jobs in different workdirs. The import workdir will be written once by the import build script, and it will then be used many times. Other workdirs may contain more exploratory jobs, and may be erased from time to time. Using separate workdirs ensures that each part of the processing occupies its own space in the storage system, making it possible to, say, erase out-of-date data analysis jobs without affecting the import jobs.
The Accelerator’s Homepage (exax.org)
The Accelerator on Github/eBay
The Accelerator on PyPI