Data Success Stories
SSH GridFTP doubles data transfer rate for GFDL
Because the SSH-authenticated GridFTP developed by the CEDPS team works with existing infrastructure, NOAA was able to deploy it quickly. The data rates achieved with SSH-authenticated globus-url-copy are roughly double what GFDL was achieving with other high-speed data movement tools (100 MB/s vs. 50 MB/s), at a fraction of the system resource utilization.
CEDPS work enables automated data movement within APS and improves performance by a factor of 5
The CEDPS team developed a *GridFTP server for Cygwin*, a Unix-like environment for Windows, which enables the Advanced Photon Source (APS) to automate data movement between its data acquisition machine (running Windows) and its HPC cluster.
Data acquired at the beamline is moved to the HPC cluster within APS for processing. The data acquisition machine at the APS beamline runs Windows, and the HPC cluster runs Linux. Previously, this data movement was manual, and APS was getting a data rate of 23 MB/s using a Windows-native protocol. Now the data movement is automated through a control script that performs a third-party GridFTP transfer between the acquisition machine and the HPC cluster.
With GridFTP, the data rates are significantly better: about 110 MB/s. Both raw data and processed data are stored in the Lustre parallel file system at the HPC cluster. The Lustre file system is accessible to both the internal GridFTP server running on an HPC node and the public GridFTP server running in the DMZ.
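A control script of the kind described above might assemble a server-to-server (third-party) transfer roughly as follows. This is only a sketch, not the APS script: the host names, paths, and the `build_third_party_transfer` helper are hypothetical, and the stream count is an arbitrary illustrative value. It relies on `globus-url-copy` performing a third-party transfer when both the source and the destination are `gsiftp://` URLs.

```python
import shlex


def build_third_party_transfer(src_host, src_path, dst_host, dst_path,
                               parallel_streams=4):
    """Build a globus-url-copy command line for a server-to-server transfer.

    Both endpoints are remote GridFTP servers, so the data flows directly
    between them rather than through the machine that runs the script.
    """
    src_url = f"gsiftp://{src_host}{src_path}"
    dst_url = f"gsiftp://{dst_host}{dst_path}"
    # -vb prints transfer performance; -p sets the number of parallel streams.
    return ["globus-url-copy", "-vb", "-p", str(parallel_streams),
            src_url, dst_url]


# Hypothetical endpoints, for illustration only.
cmd = build_third_party_transfer(
    "daq.example.org", "/data/scan_0001.h5",
    "hpc.example.org", "/lustre/raw/scan_0001.h5",
)
print(shlex.join(cmd))
```

A real script would pass `cmd` to something like `subprocess.run(cmd, check=True)` once valid credentials are in place.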
CEDPS also developed a *concurrency feature for GridFTP* to speed up the transfer of data sets consisting of many small files. This new feature has enabled APS users to move data over the network rather than mailing it on hard disks and DVDs. APS users move terabytes of data from APS to Australia at a rate 30x faster than standard FTP.
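The idea behind concurrency for many small files can be illustrated with a local analogy: several files are in flight at once, so the fixed per-file overhead overlaps instead of adding up. The sketch below copies files between local directories with a thread pool. It is not GridFTP's implementation, and the `copy_files_concurrently` name and its default of four workers are assumptions made for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_files_concurrently(src_dir, dst_dir, concurrency=4):
    """Copy every regular file directly under src_dir into dst_dir,
    handling up to `concurrency` files at the same time.

    Returns the number of files copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    files = [p for p in sorted(src_dir.iterdir()) if p.is_file()]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for all copies and re-raises the first error, if any.
        copied = list(pool.map(lambda p: shutil.copy2(p, dst_dir / p.name),
                               files))
    return len(copied)
```

For example, `copy_files_concurrently("frames/", "staging/", concurrency=8)` would keep up to eight copies running at once.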
Internal and external data movement at ALCF
The features developed by CEDPS are deployed at the Argonne Leadership Computing Facility (ALCF) for both internal and external data movement.
The *SSH security for GridFTP*, developed by CEDPS as an alternative to GSI security in GridFTP, is used in the deployment for external access. The new features developed by CEDPS for *HPSS-enabled GridFTP*, such as support for multiple classes of service, are used for internal data movement to archive data on HPSS.
Rapid data movement accelerates world-class science
ASCR research connects DOE labs at unprecedented speeds. Using GridFTP technology developed by the ASCR Center for Enabling Distributed Petascale Science, staff at the National Energy Research Scientific Computing Center (NERSC) and the Oak Ridge Leadership Computing Facility (OLCF) have established a service that allows data to be moved between the two sites at 200 megabytes per second, more than 20 times faster than previously possible. The new capability is enhancing collaboration and reducing time to discovery for researchers in fields as diverse as astrophysics and computational chemistry.
To answer the puzzle of what makes the half-life of 14C so long, scientists at ORNL and their collaborators need to move 40 terabytes (40 trillion bytes) of data from NERSC to OLCF for each of the nuclei they study. GridFTP enables them to move this data in under 3 days rather than several months.
“Rapid data transfer allows me to spend more time on the science, not the logistics of getting to the science,” says Hai Ah Nam, ORNL.
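The figures in this story can be checked with a little arithmetic. The sketch below computes how long 40 terabytes takes at 200 MB/s. The 10 MB/s comparison rate is an assumption: it is the upper bound implied by "more than 20 times faster", not a rate stated in the source, so the earlier transfers would have taken at least that long.

```python
SECONDS_PER_DAY = 86_400
MB = 10**6   # decimal megabyte
TB = 10**12  # decimal terabyte ("40 trillion bytes")


def transfer_days(size_bytes, rate_bytes_per_s):
    """Return the time in days to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


data_set = 40 * TB
days_at_200 = transfer_days(data_set, 200 * MB)
# Assumed bound: 200 MB/s divided by 20; the real earlier rate was lower.
days_at_10 = transfer_days(data_set, 10 * MB)

print(f"At 200 MB/s: {days_at_200:.2f} days")
print(f"At 10 MB/s (assumed upper bound on the old rate): {days_at_10:.1f} days")
```

At 200 MB/s the transfer takes about 2.3 days, consistent with "under 3 days"; at the assumed 10 MB/s bound it would take about 46 days.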
CMS data movement: Checksum verification at Fermilab
Application scientists from the CMS (Compact Muon Solenoid) high energy physics project move large quantities of data between CERN and Fermilab (a Tier 1 site). In turn, Fermilab serves CMS data sets to physicists throughout the United States.
The CEDPS team implemented a checksum verification capability in the dCache GridFTP server. CMS uses this capability to verify checksums on approximately 10 terabytes per day of data downloaded to Fermilab from CERN. The use of this capability has substantially improved the reliability of CMS data transfers.
In addition, dCache checksum failures have been used to identify hardware for preemptive replacement or maintenance, which has improved the availability of the hardware infrastructure.
When this capability was initially enabled, its use revealed several deficiencies in a partner storage system in Europe. This new checksum functionality triggered further development in that storage system. As a result, this work produced higher reliability for end-to-end, trans-Atlantic transfers of terabytes of data per day.
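The kind of check described in this section can be sketched as follows: compute a checksum over the received file and compare it with the value reported for the source. This is a minimal illustration, not the dCache implementation. Adler-32 is chosen here only as an example, since the text does not say which algorithm is used, and the function names are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks
    so that large files do not have to fit in memory."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def verify_transfer(local_path, expected_checksum):
    """Return True if the received file matches the checksum
    reported for the source copy."""
    return adler32_of_file(local_path) == expected_checksum
```

A caller would typically retry the transfer, or flag the storage hardware involved, whenever `verify_transfer` returns `False`.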
NERSC and GridFTP
The National Energy Research Scientific Computing Center (NERSC) uses GridFTP as its primary high-performance interface for wide-area access to large datasets. NERSC provides GridFTP access to all major systems, including its flagship 40,000-core Cray XT4 system, its HPSS mass storage, and smaller compute clusters. High-performance, striped GridFTP transfers with tunable buffer sizes are critical for many time-sensitive data transfers.
“GridFTP is currently the only tool that meets these needs while offering excellent data transfer performance, overall reliability and support. NERSC views GridFTP as a crucial technology for facilitating scientific collaborations across multiple sites,” says Shreyas Cholia of NERSC.
Many projects use GridFTP to access NERSC, including Open Science Grid communities. The STAR project uses GridFTP via the BeStMan Storage Resource Manager (SRM). Other projects using GridFTP at NERSC include the Earth System Grid, the Planck Cosmic Microwave Background application and the ALICE high energy physics application.
DES Data Management effort relies on GridFTP for its data processing, data access, and archive systems
In astronomy, innovative new detectors are generating unprecedented volumes of high-quality data.
The Dark Energy Survey Data Management (DES Data Management) team relies on GridFTP as the underlying file transfer protocol for rapidly transferring large quantities of raw and reduced data between a collection of heterogeneous sites.
The project needs to transfer data:
- between TeraGrid sites (NCSA ⇔ SDSC),
- between NCSA and Fermilab (a DOE site),
- from DESDM team dedicated servers to shared supercomputer platforms, and
- from parallel file systems to tape archives.
“For all our use cases, GridFTP has provided the flexible and high-performance solution required by the DESDM project,” says Dr. Greg Daues of NCSA.
CEDPS tools enable the Ultravis project to move data 30-45x faster
The SciDAC Institute for Ultra-Scale Visualization (Ultravis) at UC Davis addresses the upcoming petascale visualization challenges facing computational science and engineering. The scientists involved in this SciDAC project need to move large volumes of data across the country and overseas.
The Ultravis team has transferred gigabytes of data between ORNL and UC Davis at an initial rate of 9 MB/s using the SSH GridFTP developed by the CEDPS team. The new SSH GridFTP server is deployed at OLCF. Previously, the Ultravis team used scp and rsync to transfer the data and obtained data rates of 200-300 KB/s. With SSH GridFTP, they are seeing a 30-45x improvement in performance.







