How to Deal with 10 Petabytes of Data a Year? CERN’s New Grid

CERN's new Large Hadron Collider, which will come online this summer, is expected to generate 10 petabytes of data a year, roughly 1% of the world's entire data output. To deal with this data, CERN is using grid technology with a fiber-optic network that links 55,000 servers in 11 global data centers at speeds 10,000 times faster than a normal broadband connection.
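For a sense of scale, here is a rough back-of-envelope sketch of the average data rate that 10 petabytes a year implies. It assumes decimal units (1 petabyte = 10^15 bytes), a 365-day year, and perfectly even output across the year; none of these assumptions come from CERN, and real detector output will be far burstier.

```python
# Back-of-envelope: average data rate implied by 10 petabytes per year.
# Assumptions (not from CERN): 1 PB = 10**15 bytes, a 365-day year,
# and data produced at a perfectly even rate.
PETABYTE = 10**15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def average_rate_bytes_per_second(petabytes_per_year: float) -> float:
    """Return the average number of bytes per second for a yearly volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR


rate = average_rate_bytes_per_second(10)
print(f"{rate / 10**6:.0f} MB/s, or {rate * 8 / 10**9:.2f} Gbit/s")
# prints: 317 MB/s, or 2.54 Gbit/s
```

In other words, the yearly total works out to a sustained average of a few hundred megabytes every second.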

CERN's GridCafé Web site provides a concise, clear introduction to CERN's grid and to grid technology in general.

Read more about it at "Coming Soon: Superfast Internet."

Essays from the Core Functions of the Research Library in the 21st Century Meeting

The Council on Library and Information Resources has released essays from its recent Core Functions of the Research Library in the 21st Century meeting.

Here's an excerpt from the meeting home page that lists the essays:

"The Future of the Library in the Research University," by Paul Courant

"Accelerating Learning and Discovery: Refining the Role of Academic Librarians," by Andrew Dillon

"A New Value Equation Challenge: The Emergence of eResearch and Roles for Research Libraries," by Richard E. Luce

"Co-teaching: The Library and Me," by Stephen G. Nichols

"Groundskeepers to Gatekeepers: How to Change Faculty Perceptions of Librarians and Ensure the Future of the Research Library," by Daphnee Rentfrow

"The Research Library in the 21st Century: Collecting, Preserving, and Making Accessible Resources for Scholarship," by Abby Smith

"The Role of the Library in 21st Century Scholarly Publishing," by Kate Wittenberg

"Leveraging Digital Technologies in Service to Culture and Society: The Role of Libraries as Collaborators," by Lee Zia

SEASR (Software Environment for the Advancement of Scholarly Research)

The Andrew W. Mellon Foundation-funded SEASR (Software Environment for the Advancement of Scholarly Research) project is building digital humanities cyberinfrastructure.

Here's an excerpt about the project from its home page:

What can SEASR do for scholars?

  • help scholars to access existing large data stores more readily
  • provide scholars with enhanced data synthesis and query analysis: from focused data retrieval and data integration, to intelligent human-computer interactions for knowledge access, to semantic data enrichment, to entity and relationship discovery, to knowledge discovery and hypothesis generation
  • empower collaboration among scholars by enhancing and innovating virtual research environments

What kind of innovations does SEASR provide for the humanities?

  • a complete, fully integrated, state-of-the-art software environment for managing structured and unstructured data and analyzing digital libraries, repositories and archives, as well as educational platforms
  • an open source, end-to-end software system that enables researchers to develop, evolve, and maintain data interoperability, evaluation, analysis, and visualization

Read more about it at "Placing SEASR within the Digital Library Movement."

RAD Lab: Cloud Computing Made Easy

The RAD Lab (Reliable Adaptive Distributed Systems Laboratory) is working to "enable one person to invent and run the next revolutionary IT service, operationally expressing a new business idea as a multi-million-user service over the course of a long weekend."

Read more about it at "RAD Lab Technical Vision" and "Trying to Figure Out How to Put a Google in Every Data Center."

iRODS Version 1.0: Data Grids, Digital Libraries, Persistent Archives, and Real-Time Data Systems

The Data-Intensive Computing Environments group at the San Diego Supercomputer Center has released version 1.0 of the open-source iRODS (Integrated Rule-Oriented Data System), which can be used to support data grids, digital libraries, persistent archives, and real-time data systems.

Here's an excerpt from the press release:

"iRODS is an innovative data grid system that incorporates and moves beyond ten years of experience in developing the widely used Storage Resource Broker (SRB) technology," said Reagan Moore, director of the DICE group at SDSC. "iRODS equips users to handle the full range of distributed data management needs, from extracting descriptive metadata and managing their data to moving it efficiently, sharing data securely with collaborators, publishing it in digital libraries, and finally archiving data for long-term preservation. . . ."

"You can start using it as a single user who only needs to manage a small stand-alone data collection," said Arcot Rajasekar, who leads the iRODS development team. "The same system lets you grow into a very large federated collaborative system that can span dozens of sites around the world, with hundreds or thousands of users and numerous data collections containing millions of files and petabytes of data—it’s a true full-scale distributed data system." A petabyte is one million gigabytes, about the storage capacity of 10,000 of today’s PCs. . . .

Version 1.0 of iRODS is supported on Linux, Solaris, Macintosh, and AIX platforms, with Windows coming soon. The iRODS Metadata Catalog (iCAT) will run on either the open source PostgreSQL database (which can be installed via the iRODS install package) or Oracle. And iRODS is easy to install—just answer a few questions and the install package automatically sets up the system.

Under the hood, the iRODS architecture stores data on one or more servers, which may be widely separated geographically; keeps track of system and user-defined information describing the data with the iRODS Metadata Catalog (iCAT); and offers users access through clients (currently a command line interface and Web client, with more to come). As directed by iRODS rules, the system can process data where it is stored using applications called "micro-services" executed on the remote server, making possible smaller and more targeted data transfers.
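The architecture described in the excerpt (data on servers, a metadata catalog, and rules that trigger "micro-services" where the data lives) can be illustrated with a toy model. This is only a sketch of the general idea: every class, function, path, and server name below is hypothetical, and none of it is the iRODS API or its rule language.

```python
# Toy model of a rule-oriented data system. All names are hypothetical;
# this is NOT the iRODS API, only an illustration of the idea.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    path: str                      # logical name in the collection
    server: str                    # where the bytes are stored
    content: bytes
    metadata: Dict[str, int] = field(default_factory=dict)


# A "micro-service" here is just a function that runs next to the data.
MicroService = Callable[[DataObject], None]


def extract_size(obj: DataObject) -> None:
    obj.metadata["size_bytes"] = len(obj.content)


def count_lines(obj: DataObject) -> None:
    obj.metadata["line_count"] = obj.content.count(b"\n")


@dataclass
class Rule:
    condition: Callable[[DataObject], bool]   # when the rule applies
    actions: List[MicroService]               # what to run if it does


class Catalog:
    """Stands in for a metadata catalog: tracks objects and their metadata."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.objects: Dict[str, DataObject] = {}

    def ingest(self, obj: DataObject) -> None:
        # Apply every matching rule at ingest time, then register the object.
        for rule in self.rules:
            if rule.condition(obj):
                for action in rule.actions:
                    action(obj)
        self.objects[obj.path] = obj

    def query(self, key: str, value: int) -> List[str]:
        # Only paths come back to the client, not the stored bytes.
        return sorted(p for p, o in self.objects.items()
                      if o.metadata.get(key) == value)


rules = [
    Rule(lambda o: True, [extract_size]),
    Rule(lambda o: o.path.endswith(".txt"), [count_lines]),
]
catalog = Catalog(rules)
catalog.ingest(DataObject("/zone/a.txt", "server-a", b"one\ntwo\n"))
catalog.ingest(DataObject("/zone/b.bin", "server-b", b"\x00\x01"))
print(catalog.query("line_count", 2))  # prints: ['/zone/a.txt']
```

The point of the sketch is the last step: because the processing ran where the data is stored, the client's query moves only a short list of names rather than the files themselves.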

Broadband in the U.S.: Mission Accomplished?

The U.S. National Telecommunications and Information Administration will shortly release a report, Networked Nation: Broadband in America, that critics say presents too optimistic a picture of broadband access in the U.S. Read more about it at "Study: U.S. Broadband Goal Nearly Reached."

Meanwhile, EDUCAUSE has released A Blueprint for Big Broadband: An EDUCAUSE White Paper, which states: "The United States is facing a crisis in broadband connectivity."

Here's an excerpt from the EDUCAUSE report's "Executive Summary":

While other nations are preparing for the future, the United States is not. Most developed nations are deploying "big broadband" networks (100 Mbps) that provide faster connections at cheaper prices than those available in the United States. Japan has already announced a national commitment to build fiber networks to every home and business, and countries that have smaller economies and more rural territory than the United States (e.g., Finland, Sweden, and Canada) have better broadband services available.

Why is the United States so far behind? The failure of the United States to keep pace is the direct result of our failure to adopt a national broadband policy. The United States has taken a deregulatory approach under the assumption that the market will build enough capacity to meet the demand. While these steps may have had some positive influence, they are not sufficient. . . .

For these reasons, this paper proposes the creation of a new federal Universal Broadband Fund (UBF) that, together with matching funds from the states and the private and/or public sector, should be used to build open, big broadband networks of at least 100 Mbps (scalable upwards to 1 Gbps) to every home and business by 2012. U.S. state governors and foreign heads of state have found the resources to subsidize broadband deployment; the U.S. federal government should as well.

Humanities Cyberinfrastructure: The TextGrid Project

The humanities-oriented TextGrid Project is part of the larger German D-Grid initiative.

Here's an excerpt from the About TextGrid page:

TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts. It thus forms a cornerstone in the emerging e-Humanities. . . .

Despite modern information technology and a clear thrust towards collaboration, text scientists still mostly work in local systems and project-oriented applications. Current initiatives lack integration with already existing text corpora, and they remain unconnected to resources such as dictionaries, lexica, secondary literature and tools. . . .

Integrated tools that satisfy the specific requirements of text sciences could transform the way scholars process, analyse, annotate, edit and publish text data. Working towards this vision, TextGrid aims at building a virtual workbench based on e-Science methods.

The installation of a grid-enabled architecture is obvious for two reasons. On the one hand, past and current initiatives for digitising and accessioning texts already accrued a considerable data volume, which exceeds multiple terabytes. Grids are capable of handling these data volumes. Also the dispersal of the community as well as the scattering of resources and tools call for establishing a Community Grid. This establishes a platform for connecting the experts and integrating the initiatives worldwide. The TextGrid community is equipped with a set of powerful software tools based on existing solutions and embracing the grid paradigm.

Spiro Reviews Last Year's Key Digital Humanities Developments

In a series of three interesting posts, Lisa Spiro, Director of the Digital Media Center and the Educational Technology Research and Assessment Cooperative at Rice University's Fondren Library, has reviewed 2007's major digital humanities developments: "Digital Humanities in 2007 [Part 1 of 3]," "Digital Humanities in 2007 [Part 2 of 3]," and "Digital Humanities in 2007 [Part 3 of 3]."

Towards the Australian Data Commons: A Proposal for an Australian National Data Service

The Australian eResearch Infrastructure Council has released Towards the Australian Data Commons: A Proposal for an Australian National Data Service.

Here's an excerpt from the "Overview":

This paper is designed to encourage, inform and ultimately summarise the discussions around the appropriate strategic and technical descriptions of the Australian National Data Service; to fill in the outline in the Platforms for Collaboration investment plan.

To do so, the paper:

  • introduces the Australian National Data Service (ANDS) and the driving forces behind its creation;
  • provides a rationale for the services that ANDS will provide, and the programs through which the services will be offered; and
  • describes in detail the ANDS programs.

Part One (Background) provides a brief summary of the reasons to focus on data management, as well as an overview of ANDS, and identifies some issues associated with implementation.

Part Two (Rationale) sets out the systemic issues associated with achieving a research data commons, and provides the resultant rationale for the services that ANDS will offer [and] the programs that they will be delivered through.

Part Three (Detailed Descriptions of ANDS Programs) sets out in detail the Aim, Focus, Service Beneficiaries, Products and Community Engagement activities for each of the ANDS Programs.

Digital Library Federation Forum for NSF DataNet Grant Proposals

The Digital Library Federation has established a forum for those who want to collaborate on, or get further information about, the NSF's Sustainable Digital Data Preservation and Access Network Partners (DataNet) grant program. Participation in the forum is open, but registration is required.

Cyberscholarship Report

The School of Information Sciences at the University of Pittsburgh has released The Future of Scholarly Communication: Building the Infrastructure for Cyberscholarship. Report of a Workshop Held in Phoenix, Arizona, April 17 to 19, 2007, Sponsored by the National Science Foundation and the Joint Information Systems Committee.

Here's an excerpt from the "Summary of Conclusions and Recommendations" section:

  • The widespread availability of digital content is creating opportunities for new forms of research and scholarship that are qualitatively different from the traditional way of using academic publications and research data. We call this "cyberscholarship". . . .
  • The widespread availability of content in digital formats provides an infrastructure for novel forms of research. To support cyberscholarship it must be captured, managed, and preserved in ways that are significantly different from conventional methods. . . .
  • Development of the infrastructure requires coordination at a national and international level. . . . In the United States, since there is no single agency with this mission, we recommend a coordinating committee of the appropriate federal agencies. . . .
  • Development of the content infrastructure requires a blend of research – both discipline-specific and in the enabling computer science – and implementation. . . .
  • We propose a seven year timetable for implementation of the infrastructure. The first three years will emphasize a set of prototypes, followed by implementation of a coordinated group of systems and services.

NSF Solicits Grant Proposals for up to $20 Million for Dataset Access and Preservation

The National Science Foundation's Office of Cyberinfrastructure has announced the availability of grants to U.S. academic institutions under its Sustainable Digital Data Preservation and Access Network Partners (DataNet) program.

Here's an excerpt from the solicitation:

Science and engineering research and education are increasingly digital and increasingly data-intensive. Digital data are not only the output of research but provide input to new hypotheses, enabling new scientific insights and driving innovation. Therein lies one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams. This solicitation addresses that challenge by creating a set of exemplar national and global data research infrastructure organizations (dubbed DataNet Partners) that provide unique opportunities to communities of researchers to advance science and/or engineering research and learning.

The new types of organizations envisioned in this solicitation will integrate library and archival sciences, cyberinfrastructure, computer and information sciences, and domain science expertise to:

  • provide reliable digital preservation, access, integration, and analysis capabilities for science and/or engineering data over a decades-long timeline;
  • continuously anticipate and adapt to changes in technologies and in user needs and expectations;
  • engage at the frontiers of computer and information science and cyberinfrastructure with research and development to drive the leading edge forward; and
  • serve as component elements of an interoperable data preservation and access network.

By demonstrating feasibility, identifying best practices, establishing viable models for long term technical and economic sustainability, and incorporating frontier research, these exemplar organizations can serve as the basis for rational investment in digital preservation and access by diverse sectors of society at the local, regional, national, and international levels, paving the way for a robust and resilient national and global digital data framework.

These organizations will provide:

  • a vision and rationale that meet critical data needs, create important new opportunities and capabilities for discovery, innovation, and learning, improve the way science and engineering research and education are conducted, and guide the organization in achieving long-term sustainability;
  • an organizational structure that provides for a comprehensive range of expertise and cyberinfrastructure capabilities, ensures active participation and effective use by a wide diversity of individuals, organizations, and sectors, serves as a capable partner in an interoperable network of digital preservation and access organizations, and ensures effective management and leadership; and
  • activities to provide for the full data management life cycle, facilitate research as resource and object, engage in computer science and information science research critical to DataNet functions, develop new tools and capabilities for learning that integrate research and education at all levels, provide for active community input and participation in all phases and all aspects of Partner activities, and include a vigorous and comprehensive assessment and evaluation program.

Potential applicants should note that this program is not intended to support narrowly-defined, discipline-specific repositories. . . .

Award Information

Anticipated Type of Award: Cooperative Agreement

Estimated Number of Awards: 5 — Two to three awards are anticipated in each of two review cycles (one review cycle for fiscal year FY2008 awards and one for FY2009) for a total of five awards, contingent on the quality of proposals received and pending the availability of funds. Each award is limited to a total of up to $20,000,000 (direct plus indirect costs) for up to 5 years. The initial term of each award is expected to be 5 years with the potential at NSF's sole discretion for one terminal renewal for another 5 years, subject to performance and the availability of funds. Such performance is to include serving the needs of the relevant science and engineering research and education communities and catalyzing new opportunities for progress. If a second five-year award is made, NSF funding is expected to decrease in each successive year of the award as the Partner transitions to a sustainable economic model with other sources of support. The actual amount of the annual decrease in NSF support will be established through the cooperative agreement. Note that the maximum period NSF will support a DataNet Partner is 10 years.

Anticipated Funding Amount: $100,000,000 — Up to $100,000,000 over a five year period is expected to be available contingent on the quality of proposals received and pending the availability of funds.
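The award figures above are consistent with one another, as a quick check shows. The sketch below uses only the numbers quoted in the solicitation; the per-year figure assumes each award is spent evenly over its five-year term, which the solicitation does not state.

```python
# Sanity check of the DataNet award figures quoted above.
AWARD_CAP_USD = 20_000_000   # maximum per award (direct plus indirect costs)
AWARDS = 5                   # estimated number of awards
TERM_YEARS = 5               # initial term of each award

total = AWARD_CAP_USD * AWARDS
# Assumption: even spending over the term (not stated in the solicitation).
per_award_per_year = AWARD_CAP_USD / TERM_YEARS

print(f"total: ${total:,}; average per award per year: ${per_award_per_year:,.0f}")
# prints: total: $100,000,000; average per award per year: $4,000,000
```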

Two EDUCAUSE Live! Podcasts: Cyberinfrastructure and Digital Libraries

Two EDUCAUSE Live! Podcasts have been released: