Calculating DAGMan delay
From CEDPS
Background
When Pegasus uses Condor, there are queueing (and other?) delays between the time a job's dependencies, here called parents, complete and the time the job itself starts. To measure these delays, we need to combine the timing information in the kickstart invocation records and Condor jobstate logs with the dependency information from Condor ".dag" files.
Procedure
Input files
- invocation records [*.out.NNN]
- jobstate.log
- *.dag [parent/child condor job relationships]
Output files
- invocations.bp
- jobstate.bp
- dag.bp
Parse invocation records
Use a script to loop through every invocation file. From each filename, take the part after the path but before the .out.NNN suffix as the "comp.id", and the /PIDxx/ directory as the "p.id". Both are used to correlate records across the input files. This will be slow because the interpreter has to be run once per file (my sample SCEC dataset has 15455 such files). Putting metadata in filenames like this almost always ends up being a hassle for processing tools.
./process_invocations.sh
#!/bin/sh
# clear output file
: > invocations.bp
# get a list of invocation files
find . -name "*.out.001" > invocation_records
# parse each invocation file
while read -r f; do
    # strip the directory and the .out.001 suffix to get the comp.id
    i=`basename "$f" .out.001`
    # use the PIDxx directory name as the p.id
    j=`echo "$f" | grep -E -o "PID[0-9]{1,3}/" | cut -f1 -d'/'`
    echo "$f" "comp.id=$i" "p.id=$j"
    # parse the file and append results
    nl_parser -m kickstart -p one_event=true "$f" -p comp.id="$i" -p p.id="$j" >> invocations.bp
done < invocation_records
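As a rough illustration of the filename convention described above, the sketch below derives both identifiers from a single path. The function name ids_from_path and the sample path are hypothetical, and unlike the script above it accepts any number of digits after "PID".

```shell
#!/bin/sh
# Hypothetical helper: print "comp.id=<id> p.id=<PIDxx>" for one invocation path.
ids_from_path() {
    path=$1
    # comp.id: file name with the directory and the trailing .out.NNN removed
    base=${path##*/}
    comp_id=${base%.out.*}
    # p.id: first path component of the form PID<digits>
    p_id=$(printf '%s\n' "$path" | tr '/' '\n' | grep -E '^PID[0-9]+$' | head -n 1)
    printf 'comp.id=%s p.id=%s\n' "$comp_id" "$p_id"
}

# Made-up path, for illustration only
ids_from_path ./run/PID12/ID000042.out.001
```

The parameter expansions avoid spawning basename for every file, which may help a little when there are thousands of invocation records.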
Parse Condor jobstate logs
#!/bin/sh
OFILE=jobstate.bp
# clear output file
: > "$OFILE"
# parse each file
find . -name jobstate.log | \
while read -r f; do
    echo "$f"
    # use PIDxx part of the path as identifier
    i=`echo "$f" | grep -o "PID[0-9]*"`
    # parse the file and append results
    nl_parser -m jobstate "$f" -p p.id="$i" >> "$OFILE"
done
Parse Condor DAG files
#!/bin/sh
OFILE=dag.bp
# clear output file
: > "$OFILE"
# parse each file
find . -name "*.dag" | grep PID | \
while read -r f; do
    echo "$f"
    # use the file name without .dag as identifier; this only matches
    # the PIDxx used above if the DAG file is named after it
    i=`basename "$f" .dag`
    # parse the file and append results
    nl_parser -m condor_dag "$f" -p p.id="$i" >> "$OFILE"
done
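For orientation, the dependency information in a DAGMan input file lives on lines of the form "PARENT p1 p2 CHILD c1 c2". The sketch below lists the parent/child pairs such lines imply. It is not the condor_dag parser module, just an awk approximation; the function name dag_edges and the sample DAG are made up.

```shell
#!/bin/sh
# Print one "parent child" pair per dependency edge found in DAGMan input.
# Reads the files named as arguments, or stdin if none are given.
dag_edges() {
    awk '
        toupper($1) == "PARENT" {
            # locate the CHILD keyword that separates parents from children
            split_at = 0
            for (i = 2; i <= NF; i++)
                if (toupper($i) == "CHILD") { split_at = i; break }
            if (split_at == 0) next
            # every parent is paired with every child on the line
            for (p = 2; p < split_at; p++)
                for (c = split_at + 1; c <= NF; c++)
                    print $p, $c
        }' "$@"
}

# Made-up three-node DAG, for illustration only
dag_edges <<'EOF'
JOB A a.sub
JOB B b.sub
JOB C c.sub
PARENT A B CHILD C
EOF
```

Each printed pair corresponds to one edge between a parent job and a child job.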
Load files into the database
- Create the database. This is not strictly necessary right now, but it will be soon, once the semantics of "--create" change to mean "create the tables" and not "create the tables and the database".
db=scec
user=dang
host=localhost
mysql -e "create database $db"
mysql -e "grant all on ${db}.* to '$user'@'$host'"
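- Concatenate all files into one, so that nl_loader isn't reading from stdin. The three files are named explicitly so that a leftover all.bp from an earlier run is not matched by a "*.bp" glob.
cat invocations.bp jobstate.bp dag.bp > all.bp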
- Run the loader on the input file. The "--drop" argument should be replaced by "--create" when the semantic change noted above is implemented.
nl_loader -u mysql://$host -p db=$db --no-unique --drop --restore=`pwd`/loader.state -i all.bp
Queries
This is not finished yet. The basic idea, though, is to build the tables by degrees so I can understand how to optimize each stage.
dagman_queries.conf
[myquery]
desc="work in progress"
query = """
--
-- Edges
--
create table edges (id integer auto_increment primary key, parent varchar(50), child varchar(50), index(parent), index(child));
-- 6 sec
insert into edges(parent,child)
select i1.value, i2.value
from event as e1
join ident as i1 on e1.id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
where i1.name = 'comp.parent' and i2.name = 'comp.child' and e1.name = 'condor.dag.edge';
--
-- Node terminated time
--
create table term (node varchar(50), time double, index(node));
-- sec
insert into term(node,time)
select i1.value, max(e1.time)
from ident as i1
join event as e1 on e1.id = i1.e_id
where i1.name = 'comp' and e1.name = 'pegasus.jobstate.job_terminated'
group by i1.value;
--
-- Submit time
--
create table submit (node varchar(50), time double, index(node));
-- 1.72 sec
insert into submit(node,time)
select i1.value, min(e1.time)
from ident as i1
join event as e1 on e1.id = i1.e_id
where i1.name = 'comp' and e1.name = 'pegasus.jobstate.submit'
group by i1.value;
--
-- Parent-child term
--
create table pcterm (parent varchar(50), child varchar(50), time double);
-- 2.02 sec
insert into pcterm(parent, child, time)
select edges.parent, edges.child, max(term.time)
from edges
join term on parent = node
group by child;
--
-- Dagman delay
--
select child, submit.time - pcterm.time
from pcterm
join submit on pcterm.child = submit.node;
"""
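To make the intent of these queries concrete: for each child, the delay is its earliest submit time minus the latest termination time among its parents. Here is a minimal sketch of the same calculation done in memory with awk on a few made-up records. The three-column input format (edge/term/submit lines), the function name dagman_delay, and the sample values are all assumptions for illustration; none of them is produced by nl_parser or the database.

```shell
#!/bin/sh
# Hypothetical input on stdin, one record per line:
#   edge <parent> <child> | term <node> <time> | submit <node> <time>
# Output: "<child> <delay>" per child, sorted by child name.
dagman_delay() {
    awk '
        $1 == "edge"   { parents[$3] = parents[$3] " " $2 }
        # keep the latest termination time per node
        $1 == "term"   { t = $3 + 0; if (!($2 in term) || t > term[$2]) term[$2] = t }
        # keep the earliest submit time per node
        $1 == "submit" { t = $3 + 0; if (!($2 in submit_t) || t < submit_t[$2]) submit_t[$2] = t }
        END {
            for (c in parents) {
                if (!(c in submit_t)) continue
                n = split(parents[c], p, " ")
                found = 0
                for (i = 1; i <= n; i++) {
                    if (!(p[i] in term)) continue
                    if (!found || term[p[i]] > latest) latest = term[p[i]]
                    found = 1
                }
                # delay = child submit time - latest parent termination time
                if (found) printf "%s %g\n", c, submit_t[c] - latest
            }
        }' | sort
}

# Made-up sample: C waits on A and B, D waits on C
dagman_delay <<'EOF'
edge A C
edge B C
edge C D
term A 100
term B 130
term C 200
submit C 142
submit D 205.5
EOF
```

With this sample, C's delay is measured from B (the later of its two parents), which is the role the pcterm table plays in the SQL version.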
To run the query:
db=scec
host=localhost
nl_dbquery -u mysql://$host -d $db -c dagman_queries.conf -q1
Experimental results
Conclusions
TBD
