Job Objects: The Keys to Data and Results
In Accelerator terms, a job is an executed program. It contains everything input to, as well as everything generated by, a program run, and it is stored on disk. The job object is used to represent a job, and it is a keeper of links to datasets, files, parameters, and metadata.
The job object is used to pass data and files between jobs, thereby creating dependencies, and between jobs and build scripts. Job objects are easy to use, and bring a set of functions that make common tasks straightforward to write and comprehend. This post gives an overview of most of the job object’s possibilities.
Job Objects are Returned from Build Calls
The first place we encounter job objects is in build scripts. The Accelerator uses build scripts to execute jobs. Jobs are stored on disk, so that they can be used by other jobs and re-used if the same build call is run again. Here is an example of a simple build script:
    def main(urd):
        job = urd.build('csvimport', filename='mydata.csv')
This build script will create a job containing everything related to the execution of the method csvimport with the option filename='mydata.csv'. When execution finishes, the call returns a job object representing the job.
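The re-use of jobs across identical build calls can be pictured with a toy model. The sketch below is not the Accelerator's implementation: MiniUrd, its executions counter, and the job id format are hypothetical stand-ins, and it assumes that a job is identified by nothing more than the method name and its options.

```python
import json


class MiniUrd:
    """Hypothetical stand-in for urd; not the Accelerator's implementation."""

    def __init__(self):
        self._jobs = {}      # build-call fingerprint -> job id
        self.executions = 0  # how many times a method actually "ran"

    def build(self, method, **options):
        # Assumption: method name plus options are enough to identify a job.
        key = json.dumps([method, options], sort_keys=True)
        if key not in self._jobs:
            self.executions += 1  # a real system would execute the method here
            self._jobs[key] = 'test-{}'.format(len(self._jobs))
        return self._jobs[key]


urd = MiniUrd()
first = urd.build('csvimport', filename='mydata.csv')
again = urd.build('csvimport', filename='mydata.csv')     # same call: job is re-used
other = urd.build('csvimport', filename='otherdata.csv')  # new option value: new job
```

In this model the second call returns the stored job without executing anything, which is the behaviour the build script relies on when it is run repeatedly.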
Job Objects Represent Either Existing Jobs or the Currently Running Job
To be specific, job objects come in two very similar flavours. One is used to reference existing jobs, for example in build scripts as shown in the example above, and one is used inside running jobs to represent the running job itself. The vast majority of functionality is available in both places, but functions that only make sense in one of the places are unavailable in the other. The difference is fairly obvious when working on a project, so it will not be mentioned again here.
1. Job References
The most common use of the job object in a build script is to make the job available to new jobs, like in this example:
    job_import = urd.build('csvimport', filename='mydata.csv')
    job_analyse = urd.build('analyse', source=job_import)
job_analyse has access to everything related to job_import, since it is using job_import as an input parameter.
The Accelerator’s Dataset is a data structure for tabular data that is designed for efficient parallel processing.
There is a tight connection between jobs and datasets. Datasets are created by, and stored in, jobs.
Are There any Datasets in the Job?
This will return a list of all datasets in a job:
all_datasets = job.datasets
How do I get a Reference to a Named Dataset?
To get a reference to a particular named dataset, use job.dataset():
    ds = job.dataset(name)
An empty name corresponds to the default dataset.
How do I Create a Dataset in a Job?
Datasets are created using job.datasetwriter(). Creating datasets is outside the scope of this post, but here’s a sketch:
    def prepare(job):
        dw = job.datasetwriter(name='movies')
        dw.add('moviename', 'unicode')
        return dw

    def analysis(sliceno, prepare_res):
        dw = prepare_res
        ...
        dw.write(data)
See the manual or other examples or posts for more information.
There are three ways for a job to store data: as datasets, files, or using the return value, where the return value is actually a special case of a file.
How to Load and Store Data in Files
To simplify the coding experience, the Accelerator has built in support for storing and retrieving data in Python’s pickle format or in JSON format.
A running job can use job.save() to store data to a file in a job. The data will be serialised using Python’s pickle format. For example:
    def synthesis(job):
        data = ...
        job.save(data)  # using default name "result.pickle"
        job.save(data, 'anothername.pickle')
        return data  # store data as "result.pickle"
Note the last line: the data returned from synthesis is automatically stored in the file result.pickle.
Loading data from a job is equally simple, using
job.load(), like this:
    data = job.load('anothername.pickle')
    data = job.load()  # to load "result.pickle"
A common example would be to run a job and then present its output
somehow. Assume that the
analyse job returns some data using the
return statement, then we could do:
    job_analyse = urd.build('analyse', source=job_import)
    data = job_analyse.load()
    # do something with data here, for example
    print(data)
The power of using the pickle format is that it is possible to store and retrieve complex datatypes between jobs and build scripts without having to consider formatting or parsing!
The JSON equivalents are called job.json_save() and job.json_load(). (The Accelerator typically uses JSON format internally for metadata files and pickle format for everything else.)
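The practical difference between the two formats can be seen with the standard library alone. The sketch below does not use the Accelerator at all; it only shows why pickle suits arbitrary Python objects while JSON is limited to simpler structures. The example data is made up for illustration.

```python
import json
import pickle
from collections import Counter
from datetime import date

# A "complex" value: a set of dates and a Counter, nested in a dict.
data = {
    'seen': {date(2020, 1, 1), date(2020, 1, 2)},
    'counts': Counter('abca'),
}

# Pickle round-trips the object with its types intact.
restored = pickle.loads(pickle.dumps(data))

# JSON has no representation for sets or dates, so serialising fails.
try:
    json.dumps(data)
    json_ok = True
except TypeError:
    json_ok = False
```

After the round trip, restored compares equal to data and its 'counts' entry is still a Counter, so no formatting or parsing code is needed on either side.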
It should be mentioned that job.open(), as well as the load and save functions, can work on parallel files, i.e. files that are written in parallel, independently in each slice/process, by one job and read back in parallel in another job. See the manual for more information!
How can I Find the Names of all Files Stored in a Job?
all_files = job.files()
This will return a set of all files created by the job. This function takes an optional pattern argument that is used to filter the file names:
only_the_text_files = job.files('*.txt')
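The effect of such a pattern can be sketched with the standard library. This is only an illustration: filter_files is a hypothetical helper, and the sketch assumes glob-style matching, as the '*.txt' example suggests. It makes no claim about how the Accelerator matches patterns internally.

```python
from fnmatch import fnmatchcase


def filter_files(files, pattern='*'):
    """Hypothetical helper: keep the names that match a glob-style pattern."""
    return {name for name in files if fnmatchcase(name, pattern)}


# Made-up file names standing in for the contents of a job.
all_files = {'result.pickle', 'notes.txt', 'plot.png', 'log.txt'}
only_the_text_files = filter_files(all_files, '*.txt')
everything = filter_files(all_files)
```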
So, How does the Job Know what Files it Contains?
job.save(), as well as its JSON counterpart, will register its files automatically with the running job. If files are generated by some other means, they can be registered manually, like this:
    def synthesis(job):
        ...
        image.save(filename)  # image is a PIL Image object; saves a file using the Python Imaging Library
        job.register_file(filename)
Alternatively, the extended job.open(), a wrapper for plain open(), will register the file automatically:
    def synthesis(job):
        ...
        with job.open(filename, 'wb') as fh:
            fh.write(mydata)
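The idea of a job keeping track of its own files, either through a wrapped open() or through manual registration, can be sketched as follows. MiniJob is a hypothetical stand-in written for this illustration; it is not the Accelerator's job object, and it assumes a file is registered only after a write has completed without error.

```python
import os
import tempfile
from contextlib import contextmanager


class MiniJob:
    """Hypothetical stand-in that remembers which files it contains."""

    def __init__(self, path):
        self.path = path
        self._files = set()

    def register_file(self, filename):
        # Manual registration, for files written by other means.
        self._files.add(filename)

    @contextmanager
    def open(self, filename, mode='r'):
        with open(os.path.join(self.path, filename), mode) as fh:
            yield fh
        # Reached only if the block above finished without an exception.
        if any(flag in mode for flag in 'wax+'):
            self._files.add(filename)

    def files(self):
        return set(self._files)


with tempfile.TemporaryDirectory() as tmp:
    job = MiniJob(tmp)
    with job.open('out.bin', 'wb') as fh:  # registered automatically
        fh.write(b'mydata')
    with open(os.path.join(tmp, 'manual.txt'), 'w') as fh:  # plain open()
        fh.write('written by other means')
    job.register_file('manual.txt')        # so it is registered by hand
    with job.open('out.bin', 'rb') as fh:
        content = fh.read()
    registered = job.files()
```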
How to get the absolute filename to a file in a job?
If the name of the file is known,
absname = job.filename(name)
will return the full pathname to the file.
There are several ways to inspect a job.
Use the Accelerator Board web server to inspect jobs and datasets from your web browser. Launch it from the command line and connect to it using a web browser. (It can also be configured in accelerator.conf to start automatically, listening on any port of choice.)
Use the ax job <job> command to get information on a specific job.
All job metadata that makes sense to a human is stored in JSON format, so it is possible to navigate to a job directory and inspect the files directly.
Here is an example using the board server
The result directory is a place where “global” results and findings can be put. We use soft links that point to files in jobs, instead of putting actual files there. The reason is that if a file is copied from a job to the result directory, the connection between the file and its origin (the job) is gone. A link, unlike a copy, shows the actual location of the original file (which is in the job that created it). So the link maintains transparency.
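The difference between a link and a copy can be demonstrated with plain operating system calls. The sketch below uses only the standard library on a system that supports symbolic links; the directory names and the file content are made up, and it does not reproduce how the Accelerator creates its links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Made-up layout: one job directory and one result directory.
    job_dir = os.path.join(tmp, 'workdir', 'test-0')
    result_dir = os.path.join(tmp, 'results')
    os.makedirs(job_dir)
    os.makedirs(result_dir)

    original = os.path.join(job_dir, 'plot.png')
    with open(original, 'wb') as fh:
        fh.write(b'fake image bytes')

    # Put a soft link, not a copy, in the result directory.
    link = os.path.join(result_dir, 'plot.png')
    os.symlink(original, link)

    is_link = os.path.islink(link)
    origin = os.readlink(link)            # reveals where the file really lives
    points_into_job = origin == original
    with open(link, 'rb') as fh:          # reading through the link works as usual
        same_content = fh.read() == b'fake image bytes'
```

Reading the link target is enough to find the job that produced the file, which a copied file could not tell us.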
Making a File in a Job Visible in the Result Directory
The preferred and easiest way to create links in the result directory
is to use
job.link_result(), like this:
    job = urd.build('plotsomething', source=mydataset)
    job.link_result('plot.png')
This will create a soft link in the result directory pointing to the file in the job. The name of the link can be set using the linkname argument:
    job = urd.build('plotsomething', source=mydataset)
    job.link_result('plot.png', linkname='distribution_of_taxicabs_on_mondays.png')
linkname= is optional.
6. File references
Passing jobs as input to other jobs is, as we’ve seen, straightforward. But sometimes we only want to pass specific files from one job to another. There are several ways to do this:
1. Using the job object only
We pass the job object representing an existing job with files to a new job. In this case, the new job is informed about the job that created and holds the files, but it will not know the names of the actual files.
If there is only one file, and we use the convention to name the file result.pickle, we can load the contents of the file using job.load(). But if there is more than one file, or the file name is not always the same, the new job does not know how to resolve the situation. One solution is to hard code the file names in the new job’s source code. This is ok in many cases, but it is not a generic solution.
2. Pass the filename as parameter (but please don’t)
We pass the full pathname of the file as a parameter to the new job. In this case, the new job can read the file directly, but the information about which job generated the file is gone. This breaks transparency and should be avoided.
3. Pass the file name together with a reference to the job
The job.withfile() function is used for that, like this:
    job = urd.build('create_two_files')
    urd.build('checkfile', thefile=job.withfile('fileA'))
    urd.build('checkfile', thefile=job.withfile('fileB'))
The create_two_files job (implicitly) creates the files fileA and fileB. The checkfile method is then built twice, once for each of the filenames in the “source” job.
Inside the method (in this case checkfile), it looks something like this:
    from accelerator import JobWithFile

    options = dict(thefile=JobWithFile())

    def synthesis():
        filename = options.thefile.filename()
        # or
        data = options.thefile.load()
        # or
        data = options.thefile.load_json()
        # or
        with options.thefile.open('rb') as fh:
            data = fh.read()
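The design choice behind the third alternative, passing the file name together with its job, can be sketched without the Accelerator. FileRef and checkfile below are hypothetical names used only for this illustration; FileRef is a plain named tuple and not the Accelerator's JobWithFile.

```python
import os
import tempfile
from collections import namedtuple


class FileRef(namedtuple('FileRef', 'jobdir name')):
    """Hypothetical reference that keeps a file name together with its origin."""

    def filename(self):
        return os.path.join(self.jobdir, self.name)

    def open(self, mode='r'):
        return open(self.filename(), mode)


def checkfile(thefile):
    # The consumer hard codes no file name; it uses whatever it was given.
    with thefile.open('rb') as fh:
        return len(fh.read())


with tempfile.TemporaryDirectory() as jobdir:
    # Stand-in for a job that creates two files.
    for name, payload in (('fileA', b'aaa'), ('fileB', b'bbbbb')):
        with open(os.path.join(jobdir, name), 'wb') as fh:
            fh.write(payload)

    # The same consumer is run once per file.
    sizes = [checkfile(FileRef(jobdir, name)) for name in ('fileA', 'fileB')]

    # Unlike a bare pathname, the reference still says where the file came from.
    ref = FileRef(jobdir, 'fileA')
    origin_known = ref.jobdir == jobdir
```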
7. Miscellaneous Functions
Miscellaneous Functions, Available for Existing Jobs
These functions are available on existing jobs to investigate a job’s output, input parameters, and execution times (covering both parallel and serial parts of the program in detail).
    job.output()  # A string containing everything the job has written to stdout and stderr!
    job.params    # A dict containing all the job's parameters
    job.post      # Execution times for all parts of the job
Miscellaneous Functions, Available for Running Jobs
Information on the location of the input and result directories is available to the running job object.
The reference to the
input_directory is useful when writing data
import methods. Storing data in the
input_directory, specified by
the Accelerator’s configuration file, makes it possible to import data
without absolute file path dependencies. This makes it much easier to
port a project to a different system.
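The portability argument can be made concrete with a small sketch. The helper input_filename and the directory names below are hypothetical and chosen for this example; the point is only that the build call names a relative file, while each system's configuration supplies the directory.

```python
import posixpath


def input_filename(input_directory, *parts):
    """Hypothetical helper: resolve a data file relative to the input directory."""
    return posixpath.join(input_directory, *parts)


# The same relative file name, resolved on two differently configured systems.
on_laptop = input_filename('/home/me/data', 'mydata.csv')
on_server = input_filename('/mnt/shared/input', 'mydata.csv')
```

The project code refers to 'mydata.csv' in both cases, so moving it to another system means changing the configuration and nothing else.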
Miscellaneous Functions, Common for both Existing and Running Jobs
These are only mentioned for completeness:
    job.method   # name of the job's method (i.e. corresponding source filename)
    job.path     # absolute path of job directory
    job.workdir  # name of workdir where the job is stored
    job.number   # the integer part of the jobid
8. Job Chains
Similar to dataset chains, it is possible to create chains of jobs. A job chain is created using the jobs.previous input parameter. In a method:
    jobs = ('previous',)

    def synthesis():
        ...
and in the build script
    previous = None
    for ix in range(10):
        previous = urd.build('thejob', previous=previous)
This creates a job chain that is ten jobs long. The job.chain() function will return a list of all jobs in the chain.
list_of_jobs_in_chain = job.chain()
Similar to dataset chains, this function takes some options, see documentation for details.
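Conceptually, a chain is a linked list that is followed backwards through the previous references. The sketch below illustrates that idea with a hypothetical ChainJob class. It is not the Accelerator's implementation, and both the oldest-first ordering and the length option are assumptions made for this example.

```python
class ChainJob:
    """Hypothetical stand-in for a job that may point to a previous job."""

    def __init__(self, name, previous=None):
        self.name = name
        self.previous = previous

    def chain(self, length=-1):
        # Walk backwards via .previous, then return the jobs oldest first.
        # length=-1 means "the whole chain"; otherwise keep the newest `length` jobs.
        jobs = []
        job = self
        while job is not None and length != 0:
            jobs.append(job)
            job = job.previous
            length -= 1
        jobs.reverse()
        return jobs


# Mirror the build script loop: ten jobs, each pointing to the one before it.
previous = None
for ix in range(10):
    previous = ChainJob('thejob-{}'.format(ix), previous=previous)

whole_chain = [job.name for job in previous.chain()]
last_three = [job.name for job in previous.chain(length=3)]
```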
9. Relation to Other Classes
The graph below shows a simplified version of how the different classes used by the Accelerator are connected. The classes at the top are used in build scripts, and classes at the bottom in running jobs.
Note that only a small subset of all available class member functions is shown in order to keep the image simple, and the graph shows only the most relevant classes.
The intention of this post is to show the various functions and most common use cases for the job object. The job object provides streamlined solutions to common situations that reduce the amount of code to write (and maintain). For more details and practical examples, see the references below.
The Accelerator’s Homepage (exax.org)
The Accelerator on Github/eBay
The Accelerator on PyPI