PDSF logs
From CEDPS
Parsing the "reporting" logs
Dan 06:54, 11 June 2008 (UTC)
Background
On PDSF, users can voluntarily state that they are consuming a given resource. The sum of these claims to a resource for all running jobs is recorded in periodic snapshots.
The goal is to get users to claim what they are actually using, by comparing the claimed resource consumption with the actual resource stats from Ganglia.
The first step toward that goal is parsing and inspecting the resource consumption claims. The rest of this page shows how this was done with the NetLogger parser/loader and some R code.
Setup
Run on: Mac OS X laptop (2.4GHz Intel Core 2 Duo), MySQL 5.0.45
Data set
$ wc -l reporting
286632 reporting
$ grep -e 'global' -e 'host_consumable' reporting | wc -l
25676
Parse data (~10 sec)
$ nl_parser -m sge_rpt < reporting > reporting.bplog
Load data (4 min 26 sec)
$ mysql -e 'create database sge_rpt'
$ nl_loader -C -p database=sge_rpt -u mysql://localhost -p read_default_file=~/.my.cnf -v -i reporting.bplog
Analyze data from R
Query
> library(RMySQL)
> con <- dbConnect(MySQL(), dbname="sge_rpt")
> dbGetQuery(con, "select count(*) from event")
  count(*)
1   403425
> dbGetQuery(con, "select count(*) from attr where name = 'rsrc' and value like 'dv%'")
  count(*)
1   232275
# note: due to the three joins on attr, this takes a couple of minutes. need to think about this..
> data <- dbGetQuery(con, "select time, a1.value as 'resource', a2.value as 'value', a3.value as 'limit'
+   from event
+   join attr as a1 on event.id = a1.e_id
+   join attr as a2 on event.id = a2.e_id
+   join attr as a3 on event.id = a3.e_id
+   where a1.name = 'rsrc' and a1.value like 'dv%'
+   and a2.name = 'val' and a3.name = 'limit'")
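One possible way to sidestep the triple self-join is to fetch the attr rows once and pivot them in R. The following is only a sketch and has not been timed against the real database: pivot_attrs is a hypothetical helper, the toy rows are made up, and it assumes each event has at most one attr row per name. Only the column names (id, time, e_id, name, value) are taken from the query above.

```r
# Hypothetical helper: turn long (e_id, name, value) rows into one row per event.
# Assumes at most one row per (e_id, name) pair.
pivot_attrs <- function(attrs, keep = c("rsrc", "val", "limit")) {
  attrs <- attrs[attrs$name %in% keep, c("e_id", "name", "value")]
  wide <- reshape(attrs, idvar = "e_id", timevar = "name", direction = "wide")
  # reshape() names the new columns value.<name>; strip the prefix
  names(wide) <- sub("^value\\.", "", names(wide))
  rownames(wide) <- NULL
  wide
}

# Made-up stand-ins for "select e_id, name, value from attr"
# and "select id, time from event".
attrs <- data.frame(
  e_id  = c(1, 1, 1, 2, 2, 2),
  name  = c("rsrc", "val", "limit", "rsrc", "val", "limit"),
  value = c("danteio", "3", "100", "eliza6io", "0", "50"),
  stringsAsFactors = FALSE
)
events <- data.frame(id = c(1, 2), time = c(1213056000, 1213056060))

wide <- pivot_attrs(attrs)
# Attach the event time, matching event.id to attr.e_id
data <- merge(events, wide, by.x = "id", by.y = "e_id")
```

With real data the two data frames would come from two plain dbGetQuery calls with no joins; whether that is actually faster than the SQL join would need to be measured.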
Manipulate for plotting
> sec2date
function(x) ISOdatetime(1970,1,1,0,0,0, tz='GMT') + x
> data$ts <- sec2date(data$time)
> names(data)
[1] "time" "resource" "value" "limit" "ts"
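As a quick sanity check on the timestamp conversion, the sec2date helper can be compared with base R's as.POSIXct. This is a small sketch: sec2date_alt is a hypothetical name, and the sample timestamp is made up rather than taken from the log.

```r
# Same definition as on this page
sec2date <- function(x) ISOdatetime(1970, 1, 1, 0, 0, 0, tz = "GMT") + x

# Hypothetical alternative using as.POSIXct with an explicit origin
sec2date_alt <- function(x) as.POSIXct(x, origin = "1970-01-01", tz = "GMT")

t0 <- 1213056000  # made-up sample: seconds since the epoch
a <- sec2date(t0)
b <- sec2date_alt(t0)

# Both should describe the same instant
same_instant <- (a == b)
day <- format(a, "%Y-%m-%d", tz = "GMT")
```

Either form should work for the time column; pinning tz to GMT keeps the plotted axis independent of the laptop's local time zone.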
# make numeric values really numbers
> data$value <- as.numeric(data$value)
> data$limit <- as.numeric(data$limit)
# find non-zeros
> sums <- aggregate(data$value, by=list(data$resource), sum)
> sums[sums$x > 0,]
Group.1 x
1 danteio 103070
22 eliza11io 41389
23 eliza12io 2860201
24 eliza13io 281020
29 eliza6io 1284391
30 hpssio 11
31 projectio 265264
> nz.resource <- sums[sums$x > 0,'Group.1']
# keep only the rows for resources with a non-zero total
> d <- data[data$resource %in% nz.resource,]
# drop unused factor levels
> d$resource <- factor(d$resource)
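The filtering steps above can be tried on a toy data frame without touching the database. This is a minimal sketch with made-up values; "zeroio" is a hypothetical resource name standing in for a resource whose claims sum to zero.

```r
# Made-up sample in the same shape as the query result (values arrive as strings)
data <- data.frame(
  resource = c("danteio", "danteio", "hpssio", "zeroio", "zeroio"),
  value    = c("5", "7", "1", "0", "0"),
  stringsAsFactors = TRUE
)

# Convert via character first: as.numeric() on a factor returns level codes
data$value <- as.numeric(as.character(data$value))

# Total claim per resource, then keep the resources with a non-zero total
sums <- aggregate(data$value, by = list(data$resource), sum)
nz.resource <- sums[sums$x > 0, "Group.1"]
d <- data[data$resource %in% nz.resource, ]

# Re-create the factor so the unused level does not produce an empty panel
d$resource <- factor(d$resource)
```

In newer versions of R, droplevels(d) does the same job as the last line for every factor column at once.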
Plot
> library(lattice)
# plot with different colors for each resource
> xyplot(value ~ sec2date(time), d, groups=resource, type='l', main="PDSF I/O consumption, 2008-06-10", xlab="time", ylab="consumption (units)", auto.key=TRUE)
# plot with different sub-panels for each resource
> xyplot(value ~ sec2date(time)|resource, d, type='h', main="PDSF I/O consumption, 2008-06-10", xlab="time", ylab="consumption (units)", auto.key=TRUE)


