Log Database Use Cases

From CEDPS

CEDPS Log Database Use Cases

Version 1 of the CEDPS Log Database assumes a single database per site. Future versions will provide a distributed overlay for all site Log Databases.

Here are some sample queries that should be supported.

Note: all queries below are restricted to a given time window.

From a GOC admin:

  • find log messages for jobs from VO=Atlas running at site=FNAL
  • find log messages related to service=condor, user=Joe, site=Indiana
  • find log messages for user=Joe
  • find log messages with status=error
  • find log messages where event=*authn* and status=error
  • find log messages where the time between start/end events is more than 3X the baseline
  • find start events that have no matching end event
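
As an illustration of the last query, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the site database. The table name log_event, its columns, and the convention that paired events share a GUID and end in ".start" / ".end" are assumptions made for this example, not part of the CEDPS design.

```python
import sqlite3

# Hypothetical schema: one row per log message.
SCHEMA = """
CREATE TABLE log_event (
    ts     REAL,  -- event time, seconds since the epoch
    event  TEXT,  -- event name, e.g. 'job.start' or 'job.end'
    guid   TEXT,  -- identifier shared by a start/end pair
    dn     TEXT,
    vo     TEXT,
    site   TEXT,
    status TEXT
)
"""

def find_unmatched_starts(conn, t_min, t_max):
    """Return (guid, event, ts) for start events in [t_min, t_max) with no end event."""
    sql = """
        SELECT s.guid, s.event, s.ts
        FROM log_event AS s
        LEFT JOIN log_event AS e
               ON e.guid = s.guid
              AND e.event = replace(s.event, '.start', '.end')
        WHERE s.event LIKE '%.start'
          AND s.ts >= ? AND s.ts < ?
          AND e.guid IS NULL
        ORDER BY s.ts
    """
    return conn.execute(sql, (t_min, t_max)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO log_event VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (100.0, "job.start", "g1", "/CN=Joe", "Atlas", "FNAL", "ok"),
        (160.0, "job.end",   "g1", "/CN=Joe", "Atlas", "FNAL", "ok"),
        (200.0, "job.start", "g2", "/CN=Joe", "Atlas", "FNAL", "ok"),
    ],
)
unmatched = find_unmatched_starts(conn, 0, 1000)
print(unmatched)
```

The left join keeps every start event and leaves the end-event columns NULL where no partner exists, so the "IS NULL" test selects exactly the unmatched starts. Attribute filters such as VO=Atlas or site=FNAL would be added as further conditions in the where clause.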

From a User (i.e., all of these relate to logs for the user's DN):

  • find log messages for all my jobs
  • find log messages with status=error
  • find log messages where event=*authn* and status=error
  • find log messages where time intervals are more than 3X the baseline, where the baseline is computed from historical data in the log database
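
The baseline query could be sketched as follows, again with sqlite3 standing in for the real database. The log_event table, the ".start" / ".end" naming, and the choice of the mean historical duration per event type as the baseline are all assumptions for this example; the source does not say how the baseline is defined.

```python
import sqlite3

SLOW_PAIRS_SQL = """
WITH pairs AS (
    -- one row per matched start/end pair, with its duration
    SELECT s.guid, s.event, s.ts AS start_ts, e.ts - s.ts AS duration
    FROM log_event AS s
    JOIN log_event AS e
      ON e.guid = s.guid
     AND e.event = replace(s.event, '.start', '.end')
    WHERE s.event LIKE '%.start'
),
baseline AS (
    -- assumed baseline: mean duration of pairs that started before the window
    SELECT event, AVG(duration) AS avg_dur
    FROM pairs
    WHERE start_ts < :t_min
    GROUP BY event
)
SELECT p.guid, p.event, p.duration, b.avg_dur
FROM pairs AS p
JOIN baseline AS b ON b.event = p.event
WHERE p.start_ts >= :t_min AND p.start_ts < :t_max
  AND p.duration > :factor * b.avg_dur
ORDER BY p.start_ts
"""

def slow_pairs(conn, t_min, t_max, factor=3.0):
    """Return pairs in [t_min, t_max) that took more than factor x the baseline."""
    params = {"t_min": t_min, "t_max": t_max, "factor": factor}
    return conn.execute(SLOW_PAIRS_SQL, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_event (ts REAL, event TEXT, guid TEXT)")
conn.executemany(
    "INSERT INTO log_event VALUES (?, ?, ?)",
    [
        (0.0, "job.start", "g1"), (10.0, "job.end", "g1"),     # history: 10 s
        (20.0, "job.start", "g2"), (30.0, "job.end", "g2"),    # history: 10 s
        (100.0, "job.start", "g3"), (120.0, "job.end", "g3"),  # 20 s, within 3X
        (130.0, "job.start", "g4"), (175.0, "job.end", "g4"),  # 45 s, over 3X
    ],
)
slow = slow_pairs(conn, 100, 200)
print(slow)
```

Restricting the query to one user would add a condition on the DN column to the pairs subquery.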

From a VO:

  • what sites had connection attempts for a given user DN
  • what data files were accessed most
  • which user moved the largest amount of data in my VO
  • find all logs where job manager status=killed (i.e., jobs that were killed for running too long)
  • which user submitted the most jobs (Gratia is a better choice for this, but maybe it should be supported?)

From a site admin:

  • what was the average GridFTP transfer speed on server=gridftp.lbl.gov
  • what are the top 10 fastest/slowest sites receiving GridFTP transfers from my site
  • what is the distribution of job run times on CE=myComputeElement
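
The first two site-admin queries are plain aggregates. The sketch below assumes a hypothetical transfer table with one row per completed GridFTP transfer, and takes "average speed" to mean the mean of the per-transfer rates; both are assumptions for illustration only.

```python
import sqlite3

def avg_speed(conn, server, t_min, t_max):
    """Mean per-transfer rate (bytes/second) for one server in [t_min, t_max)."""
    sql = """
        SELECT AVG(nbytes * 1.0 / seconds)
        FROM transfer
        WHERE server = ? AND ts >= ? AND ts < ?
    """
    return conn.execute(sql, (server, t_min, t_max)).fetchone()[0]

def ranked_sites(conn, server, t_min, t_max, limit=10, fastest=True):
    """Destination sites ordered by mean transfer rate, fastest or slowest first."""
    order = "DESC" if fastest else "ASC"  # fixed keywords only, never user input
    sql = f"""
        SELECT dest_site, AVG(nbytes * 1.0 / seconds) AS rate
        FROM transfer
        WHERE server = ? AND ts >= ? AND ts < ?
        GROUP BY dest_site
        ORDER BY rate {order}
        LIMIT ?
    """
    return conn.execute(sql, (server, t_min, t_max, limit)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transfer (ts REAL, server TEXT, dest_site TEXT, nbytes INTEGER, seconds REAL)"
)
conn.executemany(
    "INSERT INTO transfer VALUES (?, ?, ?, ?, ?)",
    [
        (10.0, "gridftp.lbl.gov", "FNAL", 1000, 10.0),
        (20.0, "gridftp.lbl.gov", "Indiana", 4000, 10.0),
        (30.0, "gridftp.lbl.gov", "FNAL", 1000, 10.0),
    ],
)
mean_rate = avg_speed(conn, "gridftp.lbl.gov", 0, 100)
top_sites = ranked_sites(conn, "gridftp.lbl.gov", 0, 100)
print(mean_rate, top_sites)
```

A volume-weighted average (total bytes divided by total seconds) would be an equally reasonable reading of the requirement and gives different numbers when transfer sizes vary.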

Query output format

To start with, we assume a command-line tool that uses the grid proxy certificate to connect to a web service wrapper for MySQL. The tool should be able to output the following:

  • CEDPS "Best Practice" format (i.e., name=value pairs)
  • CSV format
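
A minimal sketch of the two output formats, given one result row as a Python dict. The field names and the rule of quoting values that contain spaces are assumptions for this example; the CEDPS "Best Practice" document is the authority on the exact name=value syntax.

```python
import csv
import io

def to_name_value(record):
    """Format one record as space-separated name=value pairs."""
    parts = []
    for name, value in record.items():
        text = str(value)
        if " " in text:
            # Assumed quoting rule: wrap values containing spaces in double quotes.
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{name}={text}")
    return " ".join(parts)

def to_csv(records, fields):
    """Format records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

record = {
    "ts": "2008-01-01T00:00:00Z",
    "event": "job.start",
    "dn": "/CN=Joe Smith",
    "status": "error",
}
nv_line = to_name_value(record)
csv_text = to_csv([record], ["ts", "event", "dn", "status"])
print(nv_line)
print(csv_text, end="")
```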

Required functionality

Supporting the above queries efficiently requires the following functionality:

  • support for wildcard queries on DNs and event names
  • performance baselines for many types (all?) of start/end pairs
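
One way to support the wildcard syntax used above (for example event=*authn*) is to translate it into an SQL LIKE pattern. The helper name wildcard_to_like and the log_event table are hypothetical, and sqlite3 is used here only as a stand-in; escape handling differs slightly between database engines.

```python
import sqlite3

def wildcard_to_like(pattern):
    """Translate a '*' wildcard pattern into a LIKE pattern with '\\' as escape."""
    # Escape characters that LIKE would otherwise treat as wildcards.
    escaped = pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escaped.replace("*", "%")

def find_events(conn, event_pattern, status):
    """Return event names matching the wildcard pattern with the given status."""
    sql = (
        "SELECT event FROM log_event "
        "WHERE event LIKE ? ESCAPE '\\' AND status = ? "
        "ORDER BY event"
    )
    rows = conn.execute(sql, (wildcard_to_like(event_pattern), status))
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_event (event TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO log_event VALUES (?, ?)",
    [
        ("gram.authn.start", "error"),
        ("gridftp.authn.end", "error"),
        ("gram.job.start", "ok"),
        ("gram.authn.end", "ok"),
    ],
)
authn_errors = find_events(conn, "*authn*", "error")
print(authn_errors)
```

Patterns with a leading wildcard generally cannot use an ordinary index, which is one reason wildcard support is listed as functionality that needs explicit attention.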

The following functionality is assumed to be provided directly by SQL:

  • restriction of the result based on boolean combinations of attribute/value pairs (where clause)
  • equijoin, and other types of joins (join)
  • elimination of duplicate results in a given result set (distinct operator)

The following may need to be done at the implementation level:

  • data indexed by time, DN, VO, event
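
As a rough illustration, the indexing requirement could translate into one index per listed attribute. The table and index names are hypothetical, and a real deployment might prefer composite indexes (for example time plus DN) depending on the query mix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_event (ts REAL, event TEXT, dn TEXT, vo TEXT, status TEXT)"
)

# One single-column index per attribute named in the requirement.
for column in ("ts", "dn", "vo", "event"):
    conn.execute(f"CREATE INDEX idx_log_event_{column} ON log_event ({column})")

# List the indexes that now exist on the table.
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(log_event)"))
print(index_names)
```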

Other issues

How to handle Distributed Queries?

Assume the central log database holds ONLY the following:

  • start/end events for jobs and gridftp transfers that include a DN and a GUID
  • URL of site archive for this event
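
Under that assumption, a distributed query could start with a lookup in the central database to find which site archives hold events for a DN, then query only those sites. The sketch below covers the lookup step only; the central_event table, its columns, and the example URLs are hypothetical.

```python
import sqlite3

def archives_for_dn(conn, dn, t_min, t_max):
    """Return the distinct site archive URLs holding events for dn in [t_min, t_max)."""
    sql = """
        SELECT DISTINCT archive_url
        FROM central_event
        WHERE dn = ? AND ts >= ? AND ts < ?
        ORDER BY archive_url
    """
    return [row[0] for row in conn.execute(sql, (dn, t_min, t_max))]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE central_event (ts REAL, event TEXT, dn TEXT, guid TEXT, archive_url TEXT)"
)
conn.executemany(
    "INSERT INTO central_event VALUES (?, ?, ?, ?, ?)",
    [
        (10.0, "job.start", "/CN=Joe", "g1", "https://logs.site-a.example/archive"),
        (50.0, "job.end", "/CN=Joe", "g1", "https://logs.site-a.example/archive"),
        (60.0, "gridftp.transfer.start", "/CN=Joe", "g2", "https://logs.site-b.example/archive"),
        (70.0, "job.start", "/CN=Ann", "g3", "https://logs.site-c.example/archive"),
    ],
)
joe_archives = archives_for_dn(conn, "/CN=Joe", 0, 100)
print(joe_archives)
```

The same lookup also answers the VO question of which sites had activity for a given user DN, at least for jobs and GridFTP transfers.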