Brown-bag NERSC seminar notes
Background
The brown-bag seminars are announced and open to all NERSC staff. In fact, they are open to any LBNL staff.
Date: 9/6/2007
Presenter: Brian Tierney
Notes: Dan Gunter
Discussion notes
Questions ("Q:"), answers ("A:"), and comments ("C:"), with "B:" marking where Brian jumped back in. Some of these indicate directions for research, but most simply show things that need to be explained carefully (for future reference).
- Q: When you collect logs into a central spot, do you then assume they are publicly accessible?
- A: Partially, yes, but this will be explained more later.
- Q: What do we do with all the logs that don't conform to this format?
- A: We have converters for a few formats, but there is still the problem that, many times, the start and end of an activity are not both logged.
- Q: What is syslog-ng being used for here? Does it convert the logs for you?
- A: Syslog-ng is there mostly to move the logs around.
- Q: What about parallel jobs?
- A: Our focus is more on grid/distributed types of jobs.
- Q: Are event names types or instances?
- A: Types.
- Q: [re: stripping out sensitive fields] I thought the purpose was debugging -- how can you do that if you strip out things like the user name?
- A: Exactly! My personal view is to keep it. But some sites, esp. in Europe, have very strict policies about exporting certain types of information.
- C: Within OSG lots of the troubleshooting is actually performed by people on-site.
- Q: Instead of shipping logs upstream, what about having a query interface that controls access?
- A: In theory that would work, but in practice there are difficulties. You may end up needing to send the query to thousands of hosts since the location of the job is by design not known to the user.
- Q: So, how does this work now (debugging problems)?
- C: The ticket system is used. The user files a trouble ticket with OSG, then OSG forwards it to the site, then the person on site deals with it. It is mostly a non-automated process.
- C: Most of the failures we see on-site come from users who are running their jobs for the first time. Once they get set up, things tend to work pretty well.
- B: Remember we're trying to discuss "soft" failures here, as well.
- Q: What about end events that eventually do arrive?
- A: Right now, we ignore them.
- C: But these are new information. You should probably log them, too.
- C: This kind of analysis is something that could be done in the database (as opposed to in real-time).
- Q: Will you generate a list of keywords that people could use?
- A: We find people don't want to be told what keywords to use.
- C: But it couldn't hurt to have the list, even if most people ignore it.
- Q: At what level do you envision this happening? What about 20K node jobs?
- A: Not at that level, although we do have this "summarizer" thing.
- Q: What about an XML format? Wouldn't this be extensible?
- A: We've toyed with that, but decided this was simpler to generate and parse.
- Q: Is the parser specific to those few programs you mentioned?
- A: Those are the only parsers we've written, but the framework is designed for extensibility.
- C: What I'm really interested in is the query interface.
- A: That's the part we haven't done yet.
- C: For example, the kind of thing I need to know is: if someone's job supposedly took too long, do the resources used by that job correlate with any known failures?
- B: What log files are you guys using now?
- C: Batch scheduler logs are important. (PBS logs are ASCII; LoadLeveler logs are binary, with an API.)
- C: We have a database of job logs going back for years.
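Several points above (activities whose start or end is never logged, end events that arrive late, and doing this analysis offline in a database) can be illustrated with a small sketch. It is not the presenter's implementation. It assumes a hypothetical one-line "name=value" log format with fields named ts, event, and id, and event names ending in ".start" or ".end"; none of these details were given in the talk. The 5-minute timeout is likewise an arbitrary choice.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split a 'name=value name=value' line into a dict (values without spaces)."""
    fields = {}
    for token in line.split():
        name, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no '='
            fields[name] = value
    return fields


def pair_events(lines, timeout=timedelta(minutes=5)):
    """Classify activities as completed, late, or missing an end event.

    Field names (ts, event, id) and the .start/.end suffixes are assumptions
    made for this sketch, not a documented format.
    """
    starts = {}     # (activity, id) -> start timestamp
    completed = {}  # (activity, id) -> duration, end seen within the timeout
    late = {}       # (activity, id) -> duration, end seen after the timeout
    for line in lines:
        rec = parse_line(line)
        if "ts" not in rec or "event" not in rec:
            continue  # not in the assumed format; a converter would handle it
        ts = datetime.fromisoformat(rec["ts"])
        activity, _, phase = rec["event"].rpartition(".")
        key = (activity, rec.get("id", ""))
        if phase == "start":
            starts[key] = ts
        elif phase == "end" and key in starts:
            duration = ts - starts.pop(key)
            # Keep late ends as new information instead of ignoring them.
            (late if duration > timeout else completed)[key] = duration
    missing = list(starts)  # starts for which no end was ever seen
    return completed, late, missing
```

Because the function works on a complete list of lines, it corresponds to the after-the-fact (database-style) analysis suggested in the comments rather than to real-time detection.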
Action items
In further discussion with David Skinner and Richard Gerber, we decided:
- NERSC should use syslog-ng to start collecting logs in a consistent way
- One way to make sending to syslog-ng easy is to have a command-line utility
- This is something we (CEDPS) will develop and deliver
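As a rough idea of what such a command-line utility might look like, here is a minimal sketch using Python's standard logging.handlers.SysLogHandler. It is not the utility CEDPS planned to deliver. The "event=... name=value" message layout, the logger name "sendlog", the --host and --port options, and the assumption that syslog-ng is configured to accept UDP messages on that port are all illustrative.

```python
import argparse
import logging
import logging.handlers


def format_event(event, fields):
    """Render an event name plus 'name=value' strings as one log line."""
    parts = ["event=" + event]
    for item in fields:
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError("expected name=value, got %r" % item)
        parts.append(name + "=" + value)
    return " ".join(parts)


def main(argv):
    """Parse arguments and send a single message to a syslog daemon over UDP."""
    parser = argparse.ArgumentParser(description="Send one event to syslog")
    parser.add_argument("event", help="event name, e.g. job.start")
    parser.add_argument("fields", nargs="*", metavar="name=value")
    parser.add_argument("--host", default="localhost")  # assumed collector host
    parser.add_argument("--port", type=int, default=514)  # standard syslog UDP port
    args = parser.parse_args(argv)

    handler = logging.handlers.SysLogHandler(address=(args.host, args.port))
    logger = logging.getLogger("sendlog")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    try:
        logger.info(format_event(args.event, args.fields))
    finally:
        logger.removeHandler(handler)
        handler.close()
```

A wrapper script would call main(sys.argv[1:]), so that a batch job could run something like "sendlog job.start id=42" at the beginning and "sendlog job.end id=42" at the end.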
