How to Deal with 10 Petabytes of Data a Year? CERN's New Grid

CERN's new Large Hadron Collider, which will come online this summer, is expected to generate 10 petabytes of data a year: roughly 1% of the world's entire data output. To deal with this data, CERN is using grid technology with a fiber optic network that links 55,000 servers in 11 global data centers at speeds that are 10,000 times faster than a normal broadband connection.
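To put the 10-petabyte figure in perspective, here is a rough back-of-the-envelope calculation of the average rate at which that much data would have to be moved if it were spread evenly over a year. It is only a sketch: it assumes decimal units (1 petabyte = 10^15 bytes) and a 365-day year, and the function name is made up for illustration. Real collider output is bursty, so peak rates would be higher than this average.

```python
# Back-of-the-envelope: average rate needed to move 10 PB in one year.
# Assumptions (not from the article): decimal units, 365-day year.

PETABYTE = 10**15                      # bytes, decimal definition
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def sustained_rate_bytes_per_second(petabytes_per_year: float) -> float:
    """Average bytes per second needed to move the given yearly volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR


rate = sustained_rate_bytes_per_second(10)
print(f"{rate / 10**6:.0f} MB/s, or {rate * 8 / 10**9:.2f} Gbit/s")
```

Under these assumptions the average works out to a little over 300 megabytes per second, around the clock.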

CERN's GridCafé Web site provides a concise, clear introduction to CERN's grid and to grid technology in general.

Read more about it at "Coming Soon: Superfast Internet."

iRODS Version 1.0: Data Grids, Digital Libraries, Persistent Archives, and Real-Time Data Systems

The Data-Intensive Computing Environments group at the San Diego Supercomputer Center has released version 1.0 of the open-source iRODS (Integrated Rule-Oriented Data System), which can be used to support data grids, digital libraries, persistent archives, and real-time data systems.

Here's an excerpt from the press release:

"iRODS is an innovative data grid system that incorporates and moves beyond ten years of experience in developing the widely used Storage Resource Broker (SRB) technology," said Reagan Moore, director of the DICE group at SDSC. "iRODS equips users to handle the full range of distributed data management needs, from extracting descriptive metadata and managing their data to moving it efficiently, sharing data securely with collaborators, publishing it in digital libraries, and finally archiving data for long-term preservation. . . ."

"You can start using it as a single user who only needs to manage a small stand-alone data collection," said Arcot Rajasekar, who leads the iRODS development team. "The same system lets you grow into a very large federated collaborative system that can span dozens of sites around the world, with hundreds or thousands of users and numerous data collections containing millions of files and petabytes of data—it’s a true full-scale distributed data system." A petabyte is one million gigabytes, about the storage capacity of 10,000 of today’s PCs. . . .

Version 1.0 of iRODS is supported on Linux, Solaris, Macintosh, and AIX platforms, with Windows coming soon. The iRODS Metadata Catalog (iCAT) will run on either the open source PostgreSQL database (which can be installed via the iRODS install package) or Oracle. And iRODS is easy to install—just answer a few questions and the install package automatically sets up the system.

Under the hood, the iRODS architecture stores data on one or more servers, which may be widely separated geographically; keeps track of system and user-defined information describing the data with the iRODS Metadata Catalog (iCAT); and offers users access through clients (currently a command line interface and Web client, with more to come). As directed by iRODS rules, the system can process data where it is stored using applications called "micro-services" executed on the remote server, making possible smaller and more targeted data transfers.
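The architecture described in the excerpt (storage servers, a metadata catalog, and rule-driven micro-services that run where the data lives) can be illustrated with a toy model. This is a minimal sketch, not the iRODS API: every class, function, and path name below is hypothetical, and everything runs in one process rather than across a network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# All names here are hypothetical illustrations, not real iRODS calls.


@dataclass
class StorageServer:
    """A server that holds file contents and can run code next to them."""
    name: str
    files: Dict[str, bytes] = field(default_factory=dict)

    def run_microservice(self, path: str, service: Callable[[bytes], object]) -> object:
        # The service executes where the data is stored; only its
        # (usually much smaller) result travels back to the caller.
        return service(self.files[path])


@dataclass
class Catalog:
    """Stand-in for a metadata catalog: logical path -> location and metadata."""
    locations: Dict[str, StorageServer] = field(default_factory=dict)
    metadata: Dict[str, Dict[str, object]] = field(default_factory=dict)

    def register(self, path: str, server: StorageServer, data: bytes, **user_metadata: object) -> None:
        server.files[path] = data
        self.locations[path] = server
        # System metadata (size) plus whatever the user supplies.
        self.metadata[path] = {"size": len(data), **user_metadata}


def count_lines(data: bytes) -> int:
    """Example micro-service: count the lines in a stored object."""
    return len(data.splitlines())


def apply_rule(catalog: Catalog, path: str, service: Callable[[bytes], object]) -> object:
    """Toy rule: find where the data is stored, process it there, record the result."""
    server = catalog.locations[path]
    result = server.run_microservice(path, service)
    catalog.metadata[path][service.__name__] = result
    return result


catalog = Catalog()
server_a = StorageServer("server-a")
catalog.register("/zone/home/alice/notes.txt", server_a, b"one\ntwo\nthree\n", owner="alice")

line_count = apply_rule(catalog, "/zone/home/alice/notes.txt", count_lines)
print(line_count, catalog.metadata["/zone/home/alice/notes.txt"])
```

The point of the design is that the client receives only the line count and the updated catalog entry, never the file itself, which is what makes the "smaller and more targeted data transfers" mentioned above possible.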

Humanities Cyberinfrastructure: The TextGrid Project

The humanities-oriented TextGrid Project is part of the larger German D-Grid initiative.

Here's an excerpt from the About TextGrid page:

TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts. It thus forms a cornerstone in the emerging e-Humanities. . . .

Despite modern information technology and a clear thrust towards collaboration, text scientists still mostly work in local systems and project-oriented applications. Current initiatives lack integration with already existing text corpora, and they remain unconnected to resources such as dictionaries, lexica, secondary literature and tools. . . .

Integrated tools that satisfy the specific requirements of text sciences could transform the way scholars process, analyse, annotate, edit and publish text data. Working towards this vision, TextGrid aims at building a virtual workbench based on e-Science methods.

The installation of a grid-enabled architecture is obvious for two reasons. On the one hand, past and current initiatives for digitising and accessioning texts already accrued a considerable data volume, which exceeds multiple terabytes. Grids are capable of handling these data volumes. Also the dispersal of the community as well as the scattering of resources and tools call for establishing a Community Grid. This establishes a platform for connecting the experts and integrating the initiatives worldwide. The TextGrid community is equipped with a set of powerful software tools based on existing solutions and embracing the grid paradigm.