Data

Goals of the CEDPS Data Services Area

  • Develop tools and techniques for reliable, high-performance, secure, and policy-driven placement of data within a distributed science environment
    • Data placement and distribution services that implement different data distribution and placement behaviors
      • Data mirroring
      • Data staging for computations
      • Flexible, policy-driven data placement
    • Enhanced GridFTP Data Transfer Service (formerly called Managed Object Placement Service or MOPS)
      • Space management, including space allocation
      • Bandwidth management
      • Connections management
      • Enhancements for Lots of Small Files (pipelining, concurrency)

Enhanced GridFTP Data Transfer Service (ANL Team)

Overview of Accomplishments and Plans

Overview of GridFTP, May 2009 (PDF)

GridFTP Feature development funded by the CEDPS Project

  • Performance Enhancements
    • Data set partitioned into lots of small files
    • UDT as an alternative transport protocol
  • Improved resource management
    • Memory management
    • Space usage enforcement
  • Improved Scalability
    • Striping
    • Load balancing
    • Dynamic backends
  • Improved troubleshooting
  • Better Usability
    • GWFTP
    • GridFTP GUI
  • Other features
    • Multicasting
    • Overlay routing
    • Information provider
    • GUMS authorization

Application communities that work with CEDPS and use GridFTP

  • Advanced Photon Source (APS) at ANL
  • Spallation Neutron Source (SNS) at ORNL
  • Argonne Leadership Computing Facility (ALCF) at ANL
  • Earth System Grid (ESG)
  • Compact Muon Solenoid (CMS) physics application at Fermilab
  • Open Science Grid (OSG)
  • Leadership Computing Facility (LCF) at ORNL
  • Scientific Data Management Center (SDM) at LBNL
  • Institute for Ultra-Scale Visualization (UltraVis)
  • TeraGrid (TG)
  • Laser Interferometer Gravitational Wave Observatory Project (LIGO)

GridFTP Software and Documentation

Earlier work under CEDPS

Data Replication, Mirroring and Placement (ISI Team)

The original proposal focused heavily on policy-driven data placement. Based on our discussions with DOE application communities and a better understanding of their use cases, we have worked in two directions: development of simple tools for data mirroring and priority-based data transfers, and research on data placement considerations for scientific workflows and on policy-driven data placement.

Data Mirroring Tool

Based on our interactions with DOE applications, we decided to focus on developing a simple, lightweight tool for mirroring data between a source directory and a destination directory.

  • Ensures that destination directory is in sync with source directory
    • Ensures that files already present at destination and whose sources are unchanged are not transferred again
  • Recursively scans source and destination files or directories
  • Evaluates file consistency based on optional checks
    • file existence, modification timestamp, file size, checksum
  • Produces list of file source-destination pairs that require synchronization
  • Interoperates with existing tools from the Globus Toolkit
    • including widely used globus-url-copy for file transfer functionality
  • Provides intuitive, simple command-line user interface
    • similar to well-known synchronization tools (e.g., rsync)
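
The scan-and-compare behavior listed above can be illustrated with a short Python sketch. This is not the CEDPS mirroring tool or its interface: the function names (`plan_mirror`, `needs_sync`, `file_checksum`), the MD5 default, and the rule that a newer source timestamp triggers a copy are all assumptions made for illustration, and the sketch works on local directories only. It produces the list of source-destination pairs that a transfer tool could then act on.

```python
import hashlib
import os
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never read into memory whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_sync(src, dst, check_size=True, check_mtime=True, check_checksum=False):
    """Return True if dst is missing or fails any enabled consistency check."""
    if not dst.exists():
        return True
    src_stat, dst_stat = src.stat(), dst.stat()
    if check_size and src_stat.st_size != dst_stat.st_size:
        return True
    # Assumption: a source modified after the destination copy is out of date.
    if check_mtime and src_stat.st_mtime > dst_stat.st_mtime:
        return True
    if check_checksum and file_checksum(src) != file_checksum(dst):
        return True
    return False


def plan_mirror(src_root, dst_root, **checks):
    """Recursively scan src_root and list (source, destination) pairs to copy."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = Path(dirpath) / name
            dst = dst_root / src.relative_to(src_root)
            if needs_sync(src, dst, **checks):
                pairs.append((src, dst))
    return sorted(pairs)
```

Calling `plan_mirror("/data/src", "/data/mirror", check_checksum=True)` (hypothetical paths) returns only the files that are missing or differ, so unchanged files are not transferred again.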


Bulk Transfer Service (BuTrS)

The Bulk Transfer Service (BuTrS) is a lightweight, simple, efficient utility for bulk data transfer with priorities.

  • Lightweight
    • Distributed in a single Java archive
    • Minimal third-party dependencies
    • One-step installation (add the jar to the classpath) with no configuration required
  • Simple
    • Zero administrative overhead (no containers, daemons, or databases)
    • Conventional, easy-to-use Java API (no RPC overhead)
    • Familiar callback-style notification of transfer results
  • Efficient
    • Concurrent transfers on all pairwise combinations of transfer endpoints
    • Low constant overhead, linear-time algorithms for inserting, deleting, and reprioritizing bulk transfers
    • In-memory data structures to avoid database or file IO overhead
  • The BuTrS software and documentation are available here
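
To make the idea of prioritized bulk transfers with callback notification concrete, here is a minimal Python sketch. It is not the BuTrS Java API and does not reproduce its algorithms: it swaps in a standard binary heap with lazy invalidation, and every name (`TransferQueue`, `run_all`, the callback signature) is hypothetical. The `transfer` argument stands in for whatever function actually moves the data.

```python
import heapq
import itertools


class TransferQueue:
    """In-memory priority queue of transfers (lower number = served first)."""

    def __init__(self):
        self._heap = []                    # entries: [priority, seq, key, callback]
        self._live = {}                    # key -> its current heap entry
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def __len__(self):
        return len(self._live)

    def insert(self, source, destination, priority, callback=None):
        key = (source, destination)
        if key in self._live:
            self.delete(source, destination)
        entry = [priority, next(self._counter), key, callback]
        self._live[key] = entry
        heapq.heappush(self._heap, entry)

    def delete(self, source, destination):
        entry = self._live.pop((source, destination))
        entry[2] = None  # mark as stale; skipped when it reaches the top

    def reprioritize(self, source, destination, priority):
        callback = self._live[(source, destination)][3]
        self.insert(source, destination, priority, callback)

    def pop(self):
        while self._heap:
            _priority, _seq, key, callback = heapq.heappop(self._heap)
            if key is not None:
                del self._live[key]
                return key, callback
        raise KeyError("no pending transfers")


def run_all(queue, transfer):
    """Drain the queue in priority order, reporting each result via callback."""
    while len(queue):
        (source, destination), callback = queue.pop()
        try:
            transfer(source, destination)
            error = None
        except Exception as exc:  # report the failure instead of aborting the batch
            error = exc
        if callback is not None:
            callback(source, destination, error)
```

Because everything lives in ordinary in-memory structures, this sketch needs no daemon or database, which mirrors the design goal stated above. A production service would additionally run transfers concurrently.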


Research: Data Placement and Scientific Workflows

Pre-staging Data For Workflows

Initial work with the ISI and Wisconsin teams examined pre-staging data for scientific workflow execution and found that it can have a significant impact on workflow performance.

  • Paper: "Data Placement for Scientific Applications in Distributed Environments," Ann Chervenak, Ewa Deelman, Miron Livny, Mei-Hui Su, Rob Schuler, Shishir Bharathi, Gaurang Mehta, Karan Vahi, 8th IEEE/ACM International Conference on Grid Computing (Grid 2007), Austin, TX, 2007. Paper (PDF), Presentation (PDF)


Characterization of Scientific Workflows

Student research: Shishir Bharathi, USC PhD student (expected to graduate in summer 2009), along with the Pegasus workflow project team

  • Extensive recent research in Workflow Systems
  • Comparison of systems requires good benchmarks
  • Objectives:
    • Characterize workflows from a variety of scientific domains
      • Includes SCEC, LIGO, Montage, Genome, SIPHT
    • Identify basic workflow structures common to many applications
    • Use characterization to generate synthetic but realistic workflows
      • Range of scale for number of jobs, data sizes
  • Paper: “Characterization of Scientific Workflows,” Shishir Bharathi, Ann Chervenak, Ewa Deelman, Gaurang Mehta, Mei-Hui Su, Karan Vahi, WORKS08 Workshop, November 2008.

Data Placement Strategies and their Impact on Scientific Workflows

Student research: Shishir Bharathi

  • Strategies employed to stage data in and out of compute resources can have a significant impact on the overall execution of a scientific workflow
  • We study the relationships between:
    • data placement services that perform the staging and
    • workflow managers that control the release of computational jobs
  • Define a framework that classifies data staging strategies by their degree of interaction with the workflow manager, in three modes:
    • decoupled
    • loosely-coupled
    • tightly-coupled
  • Simulation studies that investigate the effect of data staging on scientific workflows
    • Variety of data placement algorithms, ranging from simple pre-staging of data to heuristics and genetic algorithms
  • Paper: “Data Staging Strategies and Their Impact on the Execution of Scientific Workflows,” Shishir Bharathi, Ann Chervenak, to appear in the DADC 2009 Workshop.

Research: Policy-Driven Data Placement

Student research: Muhammad Ali Amer, USC PhD student, and Sara Alspaugh, undergraduate summer intern from the University of Virginia

  • Interested in enforcing VO-level policies for how data should be replicated and disseminated
    • Compact Muon Solenoid experiment: tiered data distribution
    • Laser Interferometer Gravitational Wave Observatory: query-based data distribution and replication
    • UK QCD experiment: maintain multiple copies of each file
  • Integrated open source rule engine (Drools) with existing tools for distributed data management
  • Paper: “Policy Based Data Placement for Scientific Virtual Organizations”, Muhammad Ali Amer, Ann Chervenak, Sara Alspaugh, submitted to Grid2009 Conference, October 2009.
  • Poster: "Policy-Driven Data Management for Distributed Scientific Collaborations Using a Rule Engine", (poster paper), Sara Alspaugh, Ann Chervenak, Ewa Deelman, Supercomputing (SC08) Conference, Austin, Texas, November 2008. Received Best Undergraduate Student Poster award in ACM Student Poster competition.
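
As an illustration of one such policy, "maintain multiple copies of each file", the following Python sketch computes the replication actions needed to bring every file up to a minimum replica count. It is a plain function standing in for a rule engine, not the Drools-based system described above; the catalog layout, the function name `replication_actions`, and the least-loaded-site heuristic are all assumptions.

```python
def replication_actions(catalog, sites, min_copies):
    """List (file, target_site) copies needed so each file has min_copies replicas.

    catalog maps a file name to the set of sites that already hold a replica.
    Targets are chosen from the least-loaded sites (an illustrative heuristic).
    """
    load = {site: 0 for site in sites}
    for holders in catalog.values():
        for site in holders:
            load[site] = load.get(site, 0) + 1

    actions = []
    for name in sorted(catalog):
        holders = set(catalog[name])
        missing = max(min_copies - len(holders), 0)
        # Candidate targets: sites without a replica, least loaded first.
        candidates = sorted(
            (site for site in sites if site not in holders),
            key=lambda site: (load[site], site),
        )
        for site in candidates[:missing]:
            actions.append((name, site))
            load[site] += 1
    return actions
```

The returned list only describes what should be copied where. Executing the copies would be left to an existing data transfer tool.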

Storage Allocation and Request Matching (Wisconsin Team)

Overview of Accomplishments for Storage Allocation and Request Matching

Overview, April 2009 (PDF)

LotMan

LotMan is a lightweight storage allocation tool.

  • LotMan provides administrators with an easy-to-set-up tool for controlling the allocation of disk space
  • GridFTP plug-in for LotMan
    • Allows GridFTP administrators to control storage space usage on a per-user basis
    • GridFTP can prevent a transfer from starting if it knows ahead of time (via the LotMan plug-in) that sufficient storage is not available
  • The LotMan software has been integrated into the Virtual Data Toolkit (VDT) and is available for download
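
The admission check described above, refusing a transfer up front when the user's allocation cannot hold it, can be sketched in a few lines of Python. This is not LotMan's interface or the GridFTP plug-in API: `SpaceLedger`, `InsufficientSpace`, and the method names are hypothetical, and the sketch assumes the size of the incoming file is known before the transfer starts.

```python
class InsufficientSpace(Exception):
    """Raised when a user's allocation cannot hold the requested bytes."""


class SpaceLedger:
    """Tracks per-user disk allocations and the space currently reserved."""

    def __init__(self):
        self._quota = {}  # user -> allocated bytes
        self._used = {}   # user -> reserved bytes

    def set_allocation(self, user, nbytes):
        self._quota[user] = nbytes
        self._used.setdefault(user, 0)

    def available(self, user):
        # Users with no allocation have zero space available.
        return self._quota.get(user, 0) - self._used.get(user, 0)

    def reserve(self, user, nbytes):
        """Call before a transfer starts; raises instead of letting it begin."""
        if nbytes > self.available(user):
            raise InsufficientSpace(
                f"{user} needs {nbytes} bytes, {self.available(user)} available"
            )
        self._used[user] = self._used.get(user, 0) + nbytes

    def release(self, user, nbytes):
        """Return space, for example after a file is deleted."""
        self._used[user] = max(self._used.get(user, 0) - nbytes, 0)
```

A server hook would call `reserve` with the announced file size and start the transfer only if no exception is raised.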

Lease Manager and Stork Data Placement Scheduler

The Lease Manager is a flexible tool for matching resources with requests.

  • Can be used to manage many different types of “counted” resources, such as license usage and job / resource matches
  • Provides two-way match making, so that both the resource and the request can specify requirements of the other
  • The state of leases can be persisted
  • In the event of a crash, the lease manager can continue from where it left off
  • The Lease Manager can be used by Stork Data Placement Scheduler to provide dynamic matching of data transfer jobs:
    • Resources can specify the number of transfers they support, as well as required attributes of the transfer job
    • The transfer job can restrict itself to specific resources or attributes of resources
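
Two-way matchmaking over a counted resource can be sketched as follows. This is an illustrative Python model, not the Lease Manager's interface or its matching language: the `Resource`, `Request`, and `LeaseManager` classes are hypothetical, requirements are expressed as plain Python predicates, and lease persistence is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Resource:
    name: str
    slots: int                        # how many concurrent leases it supports
    attrs: dict                       # attributes advertised to requests
    accepts: Callable[[dict], bool]   # requirement on the request's attributes


@dataclass
class Request:
    attrs: dict                       # attributes advertised to resources
    accepts: Callable[[dict], bool]   # requirement on the resource's attributes


class LeaseManager:
    def __init__(self, resources):
        self._resources = list(resources)
        self._in_use = {r.name: 0 for r in self._resources}

    def acquire(self, request):
        """Lease the first resource where both sides accept each other."""
        for resource in self._resources:
            free = resource.slots - self._in_use[resource.name]
            if (free > 0
                    and resource.accepts(request.attrs)
                    and request.accepts(resource.attrs)):
                self._in_use[resource.name] += 1
                return resource.name
        return None  # no match right now; the caller may retry later

    def release(self, name):
        """Give back one lease on the named resource."""
        if self._in_use[name] > 0:
            self._in_use[name] -= 1
```

For a data transfer job, a resource might advertise `{"protocol": "gsiftp"}` with two slots and accept only jobs below some size, while the job accepts only resources with that protocol. Both conditions must hold before a lease is granted.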

Checksum Verification and Economics of Storage Management (Fermi Team)

Checksum Verification Capability for dCache GridFTP Implementation

Overview of Checksum Verification Capability, May 2009 (PDF)

  • Checksum verification capability was integrated with the dCache GridFTP server
  • Used by CMS application scientists, who are heavy users of Fermilab’s dCache GridFTP implementation
    • They move large quantities of data between CERN and Fermilab (a Tier 1 site)
    • The implementation serves CMS data sets to physicists throughout the United States
  • CMS uses the capability to verify checksums on approximately 10 terabytes of data downloaded to Fermilab per day
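
The general idea of post-transfer checksum verification can be sketched in Python. This does not show the dCache implementation or its configuration. Adler-32 is used here purely as an illustrative algorithm, and the function names are hypothetical. The file is read in chunks so that multi-gigabyte files do not need to fit in memory.

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file incrementally."""
    value = 1  # Adler-32 starts from 1
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def verify_transfer(path, expected_checksum):
    """Return True if the received file matches the sender's checksum."""
    return adler32_of_file(path) == expected_checksum
```

In practice the sender's checksum would be supplied alongside the transfer, and a mismatch would trigger a retransfer rather than silently keeping a corrupted file.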

dCache release with checksum support

Economics of Storage Management
