Comparison of Kickstart and DAGMan logs

From CEDPS



Overview

From 6/5/08 meeting notes: "Ewa mentioned that comparing Dagman logs and Kickstart records may provide interesting performance insights since Dagman runs 'clusters' of jobs and the comparative performance of those clusters will indicate how good their clustering algorithm is."

Steps:

  1. Get logs from a run including both DAGMan and Kickstart
  2. Write/modify parsers so this data can be converted to NetLogger format
  3. Load the data into a database
  4. Write queries / R programs to compare performance of DAGMan clusters
  5. Post and explain results

Get logs from a run including both DAGMan and Kickstart

grolsch.lbl.gov:/scratch/logfiles/pegasus/condor-jobstatelog

Write/modify parsers so this data can be converted to NetLogger format

The parser is the "jobstate" module of nl_parser (invoked below as -m jobstate), but the jobstate.log input has some problems. From an email exchange between Keith Beattie and Gaurang Mehta, 6/13/08:

  • Keith: Note that these have one TAILSTATD/DAGMAN STARTED pair but two FINISHED pairs, with the first FINISHED pair followed by a line which does not start with a timestamp. We're guessing that the first try failed, was re-submitted (or some such) and overwrote part - but not all - of the beginning of the jobstate.log file. Perhaps this is a bug in your tailstatd process?
  • Gaurang: I think this is true. We need to do a rollover or append operation in the tailstatd code. We just have not got round to doing this. If you want, you can always run tailstatd (it's in $PEGASUS_HOME/bin) on the dagman.out, save the jobstate.log file, and then run it again on the subsequent .dag.rescue.dagman.out, and so on. (Though I think I never provided the dagman.out file.)

Load the data into a database

The data is organized into directories named CyberShake_<Site-Code>_<Num>. We want to add a keyword/value pair for each Site and Num, so we need to process them separately.

  • To find the sites:
find . -name "CyberShake_*_*" -type d | cut -f2 -d_  | sort | uniq
LBP
PAS
USC
WNGC
  • Merge each site/num into a single file, then load this file into the database, using the db name 'netlogger_jan08'
  • This is all bundled up into a little shell script 'load_jobstate.sh':
#!/bin/sh
# vars
ofile=jobstate-all.bplog
DB=netlogger_jan08
# merge jobstate logs (skipped if the merged file already exists and is non-empty)
if test -s "$ofile"
then
    echo "$ofile exists, skipping parse and merge"
else
    for site in `find . -name "CyberShake_*_*" -type d | cut -f2 -d_ | sort | uniq`
    do
        echo "site $site"
        for f in `find CyberShake_${site}_* -name jobstate.log`
        do
            # run number is the third '_'-separated field of the directory name
            sn=`echo "$f" | cut -d_ -f3 | cut -d'/' -f1`
            echo "file $f ( $site $sn )"
            nl_parser -m jobstate -p "sc.id=$site" -p "sn.id=$sn" "$f" >> "$ofile"
        done
    done
fi
# clear DB
for tbl in attr dn event ident text ; do
    mysql -e "use $DB ; delete from $tbl"
done
# load the merged jobstate log
nl_loader -v -C -i "$ofile" -u mysql://localhost \
   -p read_default_file=~/.my.cnf -p db=$DB
  • A quick way to check how the load is going is to run 'wc -l' on the merged log and compare the result with the number of rows in the event table:
mysql -e "use netlogger_jan08; select count(id) from event"
  • You also need to merge the kickstart logs and load them into the same database. Unfortunately, this means going through the same rigmarole, because the site/num from the jobstate directory names must match the site/num from the kickstart directory names. In general, you'd need a variation on the above shell script. But in this case we'll just load one file, so we can do this by hand:
find CyberShake_LBP_35/20080110T135251-0800 -name "merge_scec-*.out" | xargs -L 100 \
   nl_parser -m pegasus -p one_event=1 -p sc.id=LBP -p sn.id=35 > CyberShake_LBP_35.bplog 
nl_loader -u mysql://localhost -p read_default_file="~/.my.cnf" -p db=netlogger_jan08 -i CyberShake_LBP_35.bplog
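
The "variation on the above shell script" for kickstart logs might look something like the following sketch. It assumes every run directory is named CyberShake_<Site-Code>_<Num> and holds merge_scec-*.out files somewhere beneath it, as CyberShake_LBP_35 does. The function names and the DRY_RUN switch are illustrative only and are not part of NetLogger; the nl_parser and nl_loader options are the ones already used by hand above.

```shell
#!/bin/sh
# Sketch: parse and load kickstart logs for each CyberShake_<Site>_<Num> directory.
# Assumptions (not from the original notes): the helper names, the DRY_RUN
# switch, and that every run directory contains merge_scec-*.out files.
DB=${DB:-netlogger_jan08}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print what would be done

# Print "site num" for a directory such as ./CyberShake_LBP_35
site_num() {
    base=`basename "$1"`
    echo "$base" | awk -F_ '{ print $2, $3 }'
}

# Parse one run directory into its own .bplog file and load it.
load_kickstart_dir() {
    dir=$1
    set -- `site_num "$dir"`
    site=$1; num=$2
    out="CyberShake_${site}_${num}.bplog"
    if [ "$DRY_RUN" = 1 ]; then
        echo "parse $dir -> $out (sc.id=$site sn.id=$num)"
        echo "load $out -> $DB"
        return 0
    fi
    find "$dir" -name "merge_scec-*.out" | xargs -L 100 \
        nl_parser -m pegasus -p one_event=1 -p "sc.id=$site" -p "sn.id=$num" > "$out"
    # $HOME is spelled out so the path does not depend on tilde expansion
    nl_loader -u mysql://localhost -p read_default_file="$HOME/.my.cnf" \
        -p db="$DB" -i "$out"
}

# Walk all run directories under the given root (default: current directory).
load_all_kickstart() {
    for d in `find "${1:-.}" -name "CyberShake_*_*" -type d | sort`
    do
        load_kickstart_dir "$d"
    done
}
```

With DRY_RUN left at 1, calling load_all_kickstart . only lists the directories, output files and site/num tags it would use, which is a cheap way to confirm the tags match those given to the jobstate logs before anything is written to the database. Setting DRY_RUN=0 runs the real commands.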

Write queries / R programs to compare performance of DAGMan clusters

Basic info. queries:

  • event names: select name from event group by name;
  • identifier names: select name from ident group by name;

Some exploratory queries:

  • list of job execution events for LBP_35:
select * from event 
  join ident as i1 on id = i1.e_id  
  join ident as i2 on i1.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35'
      and event.name = 'pegasus.jobstate.execute';
  • list of condor identifiers for same:
select i3.value
from event
  join ident as i1 on id = i1.e_id
  join ident as i2 on i1.e_id = i2.e_id
  join ident as i3 on i3.e_id = i2.e_id
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn' and i2.value = '35'
      and i3.name = 'condor' and event.name = 'pegasus.jobstate.execute';
  • identifier and duration for each condor (sub-)job:
-- use above query to make temp table for 'start' events
create temporary table condorids 
select i3.e_id, event.time, i3.value 
from event 
join ident as i1 on id = i1.e_id  
join ident as i2 on i1.e_id = i2.e_id  
join ident as i3 on i3.e_id = i2.e_id 
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn'  and i2.value = '35' and i3.name = 'condor' 
      and event.name = 'pegasus.jobstate.execute';
-- make a second temp table for 'end' events
create temporary table condorids2 
select i3.e_id, event.time, i3.value 
from event 
join ident as i1 on id = i1.e_id  
join ident as i2 on i1.e_id = i2.e_id  
join ident as i3 on i3.e_id = i2.e_id 
where i1.name = 'sc' and i1.value = 'LBP' and i2.name = 'sn'  and i2.value = '35' and i3.name = 'condor' 
      and (event.name = 'pegasus.jobstate.job_terminated' or event.name = 'pegasus.job_disconnected');
-- join the two tables on the condor id to get each job's elapsed time (end minus start)
select c1.value as 'jobid', c2.time - c1.time as 'duration'
from condorids as c1 
join condorids2 as c2 on c1.value = c2.value;
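
To compare runs or clusters, it may help to reduce the per-job durations to a few summary numbers. The sketch below assumes the statements above (both temporary tables and the final select) are saved together in one file, hypothetically named job_durations.sql, and run through the mysql client in batch mode so that each output line is "jobid<TAB>duration". The function name summarize_durations is illustrative.

```shell
#!/bin/sh
# Sketch: summarize "jobid<TAB>duration" lines read from standard input.
# Assumption: durations are in whatever unit event.time uses.
summarize_durations() {
    awk -F'\t' '
        NF >= 2 {
            d = $2 + 0
            if (n == 0 || d < min) min = d
            if (n == 0 || d > max) max = d
            sum += d; n++
        }
        END {
            if (n == 0) { print "no jobs"; exit }
            printf "jobs=%d min=%.1f mean=%.1f max=%.1f\n", n, min, sum / n, max
        }'
}

# Example (job_durations.sql is a hypothetical file holding the SQL above):
#   mysql -B -N netlogger_jan08 < job_durations.sql | summarize_durations
```

Note that if a condor id has more than one matching end event, the join produces one row per start/end pair, so the job count reported here can exceed the number of distinct jobs.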

Post and explain results

TBD
