PDSF logs

From CEDPS

Parsing the "reporting" logs

Dan 06:54, 11 June 2008 (UTC)

Background

On PDSF, users can voluntarily state that they are consuming a given resource. The sum of these claims to a resource for all running jobs is recorded in periodic snapshots.

The goal is to get users to claim what they are actually using, by comparing the claimed resource consumption with the actual resource stats from Ganglia.

The first step toward that goal is parsing and inspecting the resource consumption claims. The rest of this page shows how this was done with the NetLogger parser/loader and some R code.

Setup

Run on: Mac OS X laptop (2.4 GHz Intel Core 2 Duo), MySQL 5.0.45

Data set

$ wc -l reporting
  286632 reporting
$ grep -e 'global' -e 'host_consumable' reporting | wc -l
   25676

Parse data (~10 sec)

$ nl_parser -m sge_rpt < reporting > reporting.bplog

Load data (4 min 26 sec)

$ mysql -e 'create database sge_rpt'
$ nl_loader -C -p database=sge_rpt -u mysql://localhost -p read_default_file=~/.my.cnf -v -i reporting.bplog

Analyze data from R

Query

> library(RMySQL)
> con <- dbConnect(MySQL(),dbname="sge_rpt")
> dbGetQuery(con, "select count(*) from event")
  count(*)
1   403425
> dbGetQuery(con, "select count(*) from attr where name = 'rsrc' and value like 'dv%'")
  count(*)
1   232275
# note: the three self-joins on attr make this query take a couple of minutes; this needs rethinking
> data <- dbGetQuery(con, "select time, a1.value as 'resource', a2.value as 'value', a3.value as 'limit' from event join attr as a1 on event.id = a1.e_id join attr as a2 on event.id = a2.e_id join attr as a3 on event.id = a3.e_id where a1.name = 'rsrc' and a1.value like 'dv%' and a2.name = 'val' and a3.name = 'limit'")
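One possible way around the slow self-join is to fetch the relevant attr rows in a single query and pivot them in R instead. The following is only a sketch under assumptions: small made-up data frames stand in for the query results, the column names (id, time, e_id, name, value) follow the tables used in the query above, and the rows themselves are invented for illustration. It has not been timed against the real database.

```r
# Toy stand-in for: select e_id, name, value from attr
#                   where name in ('rsrc', 'val', 'limit')
# (rows invented for illustration only)
attrs <- data.frame(
  e_id  = c(1, 1, 1, 2, 2, 2),
  name  = c("rsrc", "val", "limit", "rsrc", "val", "limit"),
  value = c("eliza6io", "5", "100", "projectio", "7", "200"),
  stringsAsFactors = FALSE
)

# Toy stand-in for: select id, time from event
events <- data.frame(id = c(1, 2), time = c(1213056000, 1213056060))

# Pivot: one row per event, one column per attribute name
wide <- reshape(attrs, idvar = "e_id", timevar = "name", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Attach the event time; the result has columns id, time, rsrc, val, limit
data2 <- merge(events, wide, by.x = "id", by.y = "e_id")
```

The trade-off is that more rows cross the wire and the pivot happens in memory, so this only helps if the join, not the transfer, is the bottleneck.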

Manipulate for plotting

> sec2date
 function(x) ISOdatetime(1970,1,1,0,0,0, tz='GMT') + x
> data$ts <- sec2date(data$time)
> names(data)
[1] "time"     "resource" "value"    "limit"    "ts"    
# make numeric values really numbers
> data$value <- as.numeric(data$value)
> data$limit <- as.numeric(data$limit)
# find non-zeros
> sums <- aggregate(data$value, by=list(data$resource), sum)
> sums[sums$x > 0,]
    Group.1       x
1    danteio  103070
22 eliza11io   41389
23 eliza12io 2860201
24 eliza13io  281020
29  eliza6io 1284391
30    hpssio      11
31 projectio  265264
> nz.resource <- sums[sums$x > 0,'Group.1']
# keep only the rows for resources with a non-zero total
> d <- data[data$resource %in% nz.resource,]
# drop unused factor levels
> d$resource <- factor(d$resource)
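The filtering steps above can be tried on a small example without the database. This is a minimal sketch with made-up data (the rows are invented; only the column names follow the query result), and it uses droplevels() as an alternative to re-calling factor().

```r
# Toy data, invented for illustration; values arrive as strings, as from the query
data <- data.frame(
  time     = c(1213056000, 1213056060, 1213056000, 1213056060),
  resource = factor(c("eliza6io", "eliza6io", "hpssio", "hpssio")),
  value    = c("3", "4", "0", "0"),
  stringsAsFactors = FALSE
)

# Make the values numeric, then total them per resource
data$value <- as.numeric(data$value)
sums <- aggregate(data$value, by = list(data$resource), sum)

# Keep only rows whose resource has a non-zero total
nz.resource <- sums[sums$x > 0, "Group.1"]
d <- data[data$resource %in% nz.resource, ]

# Subsetting keeps the old factor levels; droplevels() removes the unused ones
levels.before <- levels(d$resource)
d$resource <- droplevels(d$resource)
levels.after <- levels(d$resource)
```

Without the last step, calls such as table(d$resource) would still list the all-zero resources with a count of 0.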

Plot

# plot with different colors for each resource (xyplot comes from the lattice package)
> library(lattice)
> xyplot(value ~ sec2date(time), d, groups=resource, type='l', main="PDSF I/O consumption, 2008-06-10", xlab="time", ylab="consumption (units)", auto.key=TRUE)

I/O Node Consumption with colors for each

# plot with different sub-panels for each resource
> xyplot(value ~ sec2date(time)|resource, d, type='h', main="PDSF I/O consumption, 2008-06-10", xlab="time", ylab="consumption (units)", auto.key=TRUE)

I/O Node Consumption with a panel for each
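In lattice, the two kinds of plot differ in how the resource variable enters the call: groups= overlays all resources in one panel with one color each, while a "| resource" term in the formula draws one sub-panel per resource. A minimal sketch with made-up data (rows invented for illustration; as.POSIXct is used here in place of sec2date):

```r
library(lattice)

# Toy data, invented for illustration
d <- data.frame(
  time     = rep(1213056000 + 60 * 0:4, times = 2),
  resource = factor(rep(c("eliza6io", "projectio"), each = 5)),
  value    = c(1, 3, 2, 5, 4, 0, 2, 2, 1, 3)
)
d$ts <- as.POSIXct(d$time, origin = "1970-01-01", tz = "GMT")

# One panel, one colored line per resource, with a legend
p.groups <- xyplot(value ~ ts, d, groups = resource, type = "l",
                   auto.key = TRUE)

# One sub-panel per resource, drawn as vertical bars
p.panels <- xyplot(value ~ ts | resource, d, type = "h")
```

At the interactive prompt, typing p.groups or p.panels draws the plot; inside a script, wrap them in print().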
