Calculating DAGMan delay

From CEDPS

Background

When Pegasus runs jobs through Condor, there are queueing (and possibly other) delays between the time a job's dependencies, here called parents, complete and the time the job itself starts. To measure this delay, we need to combine the timing information in the kickstart invocation records and Condor jobstate logs with the dependency information from the Condor ".dag" files.

Procedure

Input files

  1. invocation records [*.out.NNN]
  2. jobstate.log
  3. *.dag [parent/child condor job relationships]

Output files

  1. invocations.bp
  2. jobstate.bp
  3. dag.bp

Parse invocation records

Use a script to loop over every invocation file. From each filename, take the part after the directory path but before the .out.NNN suffix as the "comp.id", and the /PIDxx/ directory as the "p.id". Both identifiers are used to correlate records across the input files. This is slow because the parser (and its interpreter) has to be started once per file; my sample SCEC dataset has 15455 such files. Putting metadata in filenames like this almost always ends up being a hassle for processing tools.

./process_invocations.sh

#!/bin/sh
# clear output file
: > invocations.bp
# get a list of invocation files
find . -name "*.out.001" > invocation_records
# parse each invocation file
while read -r f; do
    # strip the directory and the .out.001 suffix to get the comp.id
    i=$(basename "$f" .out.001)
    # use the PIDxx directory name (without the trailing slash) as the p.id
    j=$(echo "$f" | grep -E -o "PID[0-9]{1,3}/" | cut -f1 -d'/')
    echo "$f" "comp.id=$i" "p.id=$j"
    # parse the file and append results
    nl_parser -m kickstart -p one_event=true "$f" -p comp.id="$i" -p p.id="$j" >> invocations.bp
done < invocation_records
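
To sanity-check the identifier extraction before running the full loop, the sketch below applies the same basename/grep logic to a single path. The function name invocation_ids and the sample path are made up for illustration; only the .out.NNN suffix and the PIDxx directory convention come from the description above.

```sh
#!/bin/sh
# Hypothetical helper: print "<comp.id> <p.id>" for one invocation record path.
invocation_ids() {
    f=$1
    # drop the directory, then the trailing .out.NNN suffix
    comp=$(basename "$f" | sed 's/\.out\.[0-9][0-9]*$//')
    # first PIDxx path component, without the trailing slash
    pid=$(printf '%s\n' "$f" | grep -E -o 'PID[0-9]{1,3}/' | head -n 1 | cut -d/ -f1)
    printf '%s %s\n' "$comp" "$pid"
}

# made-up example path
invocation_ids ./run1/PID12/some_job_ID000042.out.001
# prints: some_job_ID000042 PID12
```

If the second field comes out empty, the path has no PIDxx directory in it and the record cannot be correlated with the jobstate and DAG data.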

Parse Condor jobstate logs

#!/bin/sh
OFILE=jobstate.bp
# clear output file
: > "$OFILE"
# parse each file
find . -name jobstate.log | \
while read -r f; do
    echo "$f"
    # use PIDxx part as identifier
    i=$(echo "$f" | grep -o "PID[0-9]*")
    # parse the file and append results
    nl_parser -m jobstate "$f" -p p.id="$i" >> "$OFILE"
done

Parse Condor DAG files

#!/bin/sh
OFILE=dag.bp
# clear output file
: > "$OFILE"
# parse each file
find . -name "*.dag" | grep PID | \
while read -r f; do
    echo "$f"
    # use the base name of the .dag file as the PIDxx identifier
    i=$(basename "$f" .dag)
    # parse the file and append results
    nl_parser -m condor_dag "$f" -p p.id="$i" >> "$OFILE"
done
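
For a quick look at the dependency information without going through the parser, the sketch below lists parent/child pairs straight from DAG text. It relies only on the standard DAGMan "PARENT p1 p2 ... CHILD c1 c2 ..." line syntax; the function name dag_edges and the job names are made up, and this is not what nl_parser's condor_dag module does internally.

```sh
#!/bin/sh
# Hypothetical helper: print one "parent child" pair per line.
# Reads the files given as arguments, or stdin if there are none.
dag_edges() {
    awk 'toupper($1) == "PARENT" {
        # locate the CHILD keyword on this line
        k = 0
        for (i = 2; i <= NF; i++)
            if (toupper($i) == "CHILD") { k = i; break }
        if (!k) next
        # every parent listed before CHILD pairs with every child after it
        for (p = 2; p < k; p++)
            for (c = k + 1; c <= NF; c++)
                print $p, $c
    }' "$@"
}

# made-up DAG fragment
printf 'JOB a a.sub\nJOB b b.sub\nJOB c c.sub\nPARENT a b CHILD c\n' | dag_edges
# prints:
# a c
# b c
```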

Load files into the database

  • Create the database. This is not strictly necessary right now, but soon will be, when the semantics of "--create" change from "create the tables and the database" to "create the tables" only.
db=scec
user=dang
host=localhost
mysql -e "create database $db"
mysql -e "grant all on ${db}.* to '$user'@'$host'"
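  • Concatenate the three output files, so that nl_loader isn't reading from stdin. The files are named explicitly because a "*.bp" glob would also match all.bp itself on a second run.
cat invocations.bp jobstate.bp dag.bp > all.bp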
  • Run the loader on the input file. The "--drop" argument should be replaced by "--create" when the semantic change noted above is implemented.
nl_loader -u mysql://$host -p db=$db --no-unique --drop --restore=`pwd`/loader.state -i all.bp

Queries

This is not finished yet. The basic idea, though, is to build the tables in stages, so that I can understand how to optimize each stage.

dagman_queries.conf

[myquery]
desc="work in progress"
query = """
--
-- Edges
--
create table edges (id integer auto_increment primary key, parent varchar(50), child varchar(50), index(parent), index(child));
-- 6 sec
insert into edges(parent,child) select i1.value, i2.value 
from event as e1 
join ident as i1 on e1.id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
where i1.name = 'comp.parent' and
i2.name = 'comp.child'
and e1.name = 'condor.dag.edge';
--
-- Node terminated time
--
create table term (node varchar(50), time double, index(node));
--  sec
insert into term(node,time) select i1.value, max(e1.time)
from ident as i1 join event as e1 on e1.id = i1.e_id
where 
i1.name = 'comp' and
e1.name = 'pegasus.jobstate.job_terminated'
group by i1.value;
--
-- Submit time
--
create table submit (node varchar(50), time double, index(node));
-- 1.72 sec
insert into submit(node,time) select i1.value, min(e1.time)
from ident as i1 join event as e1 on e1.id = i1.e_id
where
i1.name = 'comp' and
e1.name =  'pegasus.jobstate.submit'
group by i1.value;
--
-- Parent-child term
--
create table pcterm (parent varchar(50), child varchar(50), time double);
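-- 2.02 sec
-- note: edges.parent is not in the group by, so each row holds an arbitrary one
-- of the child's parents (and MySQL rejects the statement if ONLY_FULL_GROUP_BY
-- is enabled); only child and time are used below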
insert into pcterm(parent, child, time)
select edges.parent, edges.child, max(term.time) from
edges join term on parent = node
group by child;
--
-- Dagman delay
--
select child, submit.time - pcterm.time
from pcterm
join submit on pcterm.child = submit.node;
"""

To run the query:

db=scec 
host=localhost
nl_dbquery -u mysql://$host -d $db -c dagman_queries.conf -q1
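
To make the arithmetic of the final select concrete, here is a small awk sketch, independent of the database, that computes the same quantity on toy data: for each child, the earliest submit time minus the latest termination time among its parents. The record layout (edge/term/submit lines), the function name dagman_delay, and the numbers are all invented for illustration.

```sh
#!/bin/sh
# Hypothetical helper. Input records on stdin, one per line:
#   edge <parent> <child>
#   term <node> <time>
#   submit <node> <time>
# Output: "<child> <delay>" for every child with a submit time and
# at least one parent that has a termination time.
dagman_delay() {
    awk '
        $1 == "edge"   { parents[$3] = parents[$3] " " $2 }
        # keep the latest termination time per node
        $1 == "term"   { t = $3 + 0; if (!($2 in term) || t > term[$2]) term[$2] = t }
        # keep the earliest submit time per node
        $1 == "submit" { t = $3 + 0; if (!($2 in submit) || t < submit[$2]) submit[$2] = t }
        END {
            for (child in parents) {
                if (!(child in submit)) continue
                n = split(parents[child], p, " ")
                found = 0
                for (i = 1; i <= n; i++) {
                    if (!(p[i] in term)) continue
                    if (!found || term[p[i]] > last) { last = term[p[i]]; found = 1 }
                }
                if (found) print child, submit[child] - last
            }
        }' | sort
}

# made-up data: c depends on a and b; b finishes last, at t=130
printf 'edge a c\nedge b c\nterm a 100\nterm b 130\nsubmit c 142\n' | dagman_delay
# prints: c 12
```

Here the delay for c is measured from t=130, when its last parent terminated, not from t=100, which is the role the pcterm table plays in the query.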

Experimental results

Conclusions

TBD
