Comparison of Kickstart and DAGMan logs
From CEDPS
Overview
From 6/5/08 meeting notes: "Ewa mentioned that comparing Dagman logs and Kickstart records may provide interesting performance insights since Dagman runs 'clusters' of jobs and the comparative performance of those clusters will indicate how good their clustering algorithm is."
Steps:
- Get logs from a run including both DAGMan and Kickstart
- Write/modify parsers so this data can be converted to NetLogger format
- Load the data into a database
- Write queries / R programs to compare performance of DAGMan clusters
- Post and explain results
Get logs from a run including both DAGMan and Kickstart
grolsch.lbl.gov:/scratch/logfiles/pegasus/condor-jobstatelog
Write/modify parsers so this data can be converted to NetLogger format
The parser is called "jobstate", but the input logs have some issues. From an email exchange between Keith Beattie and Gaurang Mehta, 6/13/08:
- Keith: Note that these have one TAILSTATD/DAGMAN STARTED pair but two FINISHED pairs, with the first FINISHED pair followed by a line which does not start with a timestamp. We're guessing that the first try failed, was re-submitted (or some such) and overwrote part - but not all - of the beginning of the jobstate.log file. Perhaps this is a bug in your tailstatd process?
- Gaurang: I think this is true. We need to do a rollover or append operation in the tailstatd code. We just have not got round to doing this. If you want you can always run tailstatd (its in $PEGASUS_HOME/bin) on the dagman.out save the jobstate.log file and then run it again on subsequent .dag.rescue.dagman.out and so on. (Though i think i never provided the dagman.out file)..
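The symptom Keith describes, a line that does not start with a timestamp, can be scanned for before running the parser. This is a minimal sketch, assuming every well-formed jobstate.log line begins with a numeric timestamp (the exact line layout is not shown on this page). The function name `find_untimestamped` and the sample lines are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list lines (prefixed with their line number) that do
# NOT begin with a digit, i.e. lines lacking a leading numeric timestamp.
# Assumption: well-formed jobstate.log lines start with a numeric timestamp.
# Note: blank lines are reported too, since they do not start with a digit.
find_untimestamped() {
    grep -n -v '^[0-9]' "$1"
}

# Made-up sample lines, only to show the shape of the check.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1200000000 EVENT_A
1200000100 EVENT_B
NT_B leftover fragment
1200000200 EVENT_B
EOF

find_untimestamped "$sample"   # prints: 3:NT_B leftover fragment
rm -f "$sample"
```

Any line it reports marks a spot where the file was probably overwritten and should be inspected by hand before loading.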
Load the data into a database
The data is organized into directories named CyberShake_<Site-Code>_<Num>. We want to add a keyword/value pair for each Site and Num, so we need to process them separately.
- To find the sites:
find . -name "CyberShake_*_*" -type d | cut -f2 -d_ | sort | uniq
LBP
PAS
USC
WNGC
- Merge each site/num into a single file, then load this file into the database, using the db name 'netlogger_jan08'
- This is all bundled up into a little shell script 'load_jobstate.sh':
#!/bin/sh
# vars
ofile=jobstate-all.bplog
DB=netlogger_jan08
# merge jobstate logs (skipped if the merged file already exists and is non-empty)
if test -s "$ofile"
then
    echo "$ofile exists, skipping parse and merge"
else
    for site in `find . -name "CyberShake_*_*" -type d | cut -f2 -d_ | sort | uniq`
    do
        echo "site $site"
        for f in `find CyberShake_${site}_* -name jobstate.log`
        do
            # run number = third "_"-separated field, up to the first "/"
            sn=`echo "$f" | cut -d_ -f3 | cut -d'/' -f1`
            echo "file $f ( $site $sn )"
            nl_parser -m jobstate -p "sc.id=$site" -p "sn.id=$sn" "$f" >> "$ofile"
        done
    done
fi
# clear DB
for tbl in attr dn event ident text ; do
    mysql -e "use $DB ; delete from $tbl"
done
# load jobstate logs
nl_loader -v -C -i "$ofile" -u mysql://localhost \
    -p read_default_file=~/.my.cnf -p db=$DB
- BTW, a quick way to check progress is to run 'wc -l' on the log and compare the result with the row count of the event table:
mysql -e "use netlogger_jan08; select count(id) from event"
- You also need to merge the kickstart logs and load them into the same database. Unfortunately, this takes the same rigmarole, because the site/num from the jobstate directory names must match the site/num from the kickstart directory names. In general, you'd need a variation on the above shell script. But in this case we'll just load one file, so we can do this by hand:
find CyberShake_LBP_35/20080110T135251-0800 -name "merge_scec-*.out" | xargs -L 100 \
    nl_parser -m pegasus -p one_event=1 -p sc.id=LBP -p sn.id=35 > CyberShake_LBP_35.bplog
nl_loader -u mysql://localhost -p read_default_file="~/.my.cnf" -p db=netlogger_jan08 -i CyberShake_LBP_35.bplog
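Both the jobstate and the kickstart loads take their site code and run number from the directory name, so a variation of the script needs that extraction in one place. Here is a minimal sketch using only POSIX parameter expansion, assuming directory names always follow CyberShake_<Site-Code>_<Num> and that site codes contain no underscore. The function name `site_and_num` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<site> <num>" for a path whose first
# component is CyberShake_<Site-Code>_<Num>.
# Assumption: the site code itself contains no underscore.
site_and_num() {
    path=${1#./}             # drop a leading "./" as printed by find
    dir=${path%%/*}          # keep only the first path component
    rest=${dir#CyberShake_}  # strip the fixed prefix, leaving "<site>_<num>"
    site=${rest%%_*}         # text before the first underscore
    num=${rest#*_}           # text after the first underscore
    echo "$site $num"
}

site_and_num "./CyberShake_LBP_35/20080110T135251-0800/jobstate.log"   # prints: LBP 35
```

The two values can then be passed as the sc.id and sn.id parameters, in the same way the commands above pass them by hand.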
Write queries / R programs to compare performance of DAGMan clusters
Basic info. queries:
- event names: select name from event group by name;
- identifier names: select name from ident group by name;
Some exploratory queries:
- list of job execution events for LBP_35:
select * from event
join ident as i1 on id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35'
and event.name = 'pegasus.jobstate.execute';
- list of condor identifiers for same:
select i3.value from event
join ident as i1 on id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
join ident as i3 on i3.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35' and i3.name = 'condor'
and event.name = 'pegasus.jobstate.execute';
- identifier and duration for each condor (sub-)job:
-- use above query to make temp table for 'start' events
create temporary table condorids
select i3.e_id, event.time, i3.value
from event
join ident as i1 on id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
join ident as i3 on i3.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35' and i3.name = 'condor'
and event.name = 'pegasus.jobstate.execute';
-- make a second temp table for 'end' events
create temporary table condorids2
select i3.e_id, event.time, i3.value
from event
join ident as i1 on id = i1.e_id
join ident as i2 on i1.e_id = i2.e_id
join ident as i3 on i3.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35' and i3.name = 'condor'
and (event.name = 'pegasus.jobstate.job_terminated' or event.name = 'pegasus.job_disconnected');
-- join the two tables to get job times
select c1.value as 'jobid', c2.time - c1.time
from condorids as c1
join condorids2 as c2 on c1.value = c2.value;
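The final join pairs each start event with an end event that carries the same condor identifier, then subtracts the two timestamps. The same pairing can be sketched outside the database, for example to spot-check a few rows. The example below assumes two whitespace-separated text files of `<jobid> <time>` rows with numeric times. This is a hypothetical export format, not something the NetLogger tools are known to produce as-is, and the function name `job_durations` and the sample values are made up.

```shell
#!/bin/sh
# Hypothetical helper: print "<jobid> <end - start>" for every end row whose
# job id also appears in the start file.
# $1 = file of start rows, $2 = file of end rows, each "<jobid> <time>".
# Assumption: the start file is non-empty (the NR == FNR idiom relies on it).
job_durations() {
    awk '
        NR == FNR   { start[$1] = $2; next }          # first file: remember start time per job id
        $1 in start { print $1, $2 - start[$1] }      # second file: emit id and elapsed time
    ' "$1" "$2"
}

# Made-up sample values, only to show the shape of the result.
starts=$(mktemp); ends=$(mktemp)
printf '%s\n' 'jobA 1000' 'jobB 1005' > "$starts"
printf '%s\n' 'jobA 1042' 'jobB 1010' > "$ends"
job_durations "$starts" "$ends"
# prints:
# jobA 42
# jobB 5
rm -f "$starts" "$ends"
```

Unlike the SQL join, which returns every start/end combination for a job id, this sketch keeps only the last start row per id, so a job with several execute events yields fewer rows here than in the database.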
Post and explain results
TBD
