A Guide to Distributed Digital Preservation

The MetaArchive Cooperative has released A Guide to Distributed Digital Preservation.

Here's an excerpt from the announcement:

This volume is devoted to the broad topic of distributed digital preservation, a still-emerging field of practice for the cultural memory arena. Replication and distribution hold out the promise of indefinite preservation of materials without degradation, but establishing effective organizational and technical processes to enable this form of digital preservation is daunting. Institutions need practical examples of how this task can be accomplished in manageable, low-cost ways.

This guide is written with a broad audience in mind that includes librarians, archivists, scholars, curators, technologists, lawyers, and administrators. Readers may use this guide to gain both a philosophical and practical understanding of the emerging field of distributed digital preservation, including how to establish or join a network.
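The replication model behind distributed preservation networks like MetaArchive's can be illustrated with a toy audit: several independent nodes hold copies of the same object, and a damaged copy reveals itself by disagreeing with the majority digest. This is an illustrative sketch only; the function names and simple majority-vote rule are invented here, not MetaArchive's or LOCKSS's actual protocol.

```python
import hashlib
from collections import Counter

def fixity(content: bytes) -> str:
    """Digest one replica's content (SHA-256)."""
    return hashlib.sha256(content).hexdigest()

def audit(replicas: dict[str, bytes]) -> dict:
    """Compare digests across nodes: the majority digest is taken as
    authoritative, and dissenting nodes are flagged for repair."""
    digests = {node: fixity(data) for node, data in replicas.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    damaged = [node for node, d in digests.items() if d != majority]
    return {"majority_digest": majority, "repair_needed": damaged}

replicas = {
    "node-a": b"archival object v1",
    "node-b": b"archival object v1",
    "node-c": b"archival object v1 (bit rot)",  # simulated corruption
}
result = audit(replicas)  # node-c would be repaired from a good copy
```

The point of distributing copies is exactly this: no single node has to be trusted, because damage at one site is detectable and repairable from the others.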

International Internet Preservation Consortium Launches Web Archives Registry

The International Internet Preservation Consortium has launched a web archives registry.

Here's an excerpt from the announcement:

The registry offers a single point of access to a comprehensive overview of member web archiving efforts and outputs. Twenty-one archives from around the world are currently included; updates will be added as additional archives are made accessible by IIPC members.

In addition to a detailed description of each web archive, the following information is included:

  • Collecting institution
  • Start date
  • Archive interface language(s)
  • Access methods (URL search, keyword search, full text search, thematic, etc.)
  • Harvesting methods (National domain, event, thematic, etc.)
  • Access restrictions

The registry was put in place by the IIPC Access Working Group, which focuses on initiatives, procedures and tools required to provide immediate and future access to archived web material. The registry will also provide a basis for IIPC to explore integrated access and search in the future.
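The fields listed above amount to a simple record structure. A sketch of one registry entry as a Python dataclass (attribute names and the sample values are illustrative, not the IIPC registry's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class WebArchiveEntry:
    """One registry record carrying the fields listed above.
    Attribute names are illustrative, not the IIPC registry schema."""
    name: str
    collecting_institution: str
    start_date: str                                # e.g. "1996"
    interface_languages: list[str] = field(default_factory=list)
    access_methods: list[str] = field(default_factory=list)      # URL, keyword, full text, thematic
    harvesting_methods: list[str] = field(default_factory=list)  # national domain, event, thematic
    access_restrictions: str = "none stated"

entry = WebArchiveEntry(
    name="Example National Web Archive",           # hypothetical archive
    collecting_institution="Example National Library",
    start_date="2005",
    interface_languages=["en"],
    access_methods=["URL search", "full text search"],
    harvesting_methods=["national domain"],
)
```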

Library of Congress Launches Digital Preservation Podcast Series

The Library of Congress has launched a digital preservation podcast series.

Here's an excerpt from the press release:

The Library of Congress presents a new podcast series, featuring interviews with prominent digital preservation practitioners and thought leaders. These podcasts offer a chance to hear experts talk about their lessons learned and goals for future projects.

The debut podcasts are interviews with Patricia Cruse and Martin Halbert. Cruse is the director of the University of California Curation Center, formerly known as the California Digital Library's Digital Preservation Program. She talks about her professional achievements and personal interest in making government information widely available to the public. Halbert is the newly appointed dean of libraries at the University of North Texas and one of the co-founders of the MetaArchive Cooperative. In his podcast he talks about institutional collaboration and how pooling resources helped build large-scale online resources such as the Trans-Atlantic Slave Trade Database.

The podcasts are available on the Library of Congress website and by subscription through iTunesU.

Center for Research Libraries Certifies Portico as Trustworthy Digital Repository

The Center for Research Libraries has certified Portico as a trustworthy digital repository.

Here's an excerpt from the announcement:

This month the Center for Research Libraries (CRL) announced the completion of an audit of the Portico digital repository and its certification as a trustworthy digital repository. Portico is the first digital preservation service to undergo this independent audit and the only service to be certified at this time. . . .

The nine-month audit process was an extremely positive and valuable one for Portico. It confirmed that the majority of our practices conform to the Trustworthy Repositories Audit and Certification Checklist (TRAC) and other metrics developed by CRL through its analyses of digital repositories. It also identified for us several areas for continued improvement as well as ways in which we can enhance the service for CRL member libraries and others. We look forward to continuing to report to CRL on these issues in the years ahead to ensure we continue to meet certification requirements and the expectations of CRL libraries, our other partner libraries, and our participating publishers.

We invite you to review the background information about CRL's Certification and Assessment of Digital Repositories Program (http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories) as well as the public audit report on Portico published by the CRL Certification Advisory Panel (http://www.crl.edu/archiving-preservation/digital-archives/certification-and-assessment-digital-repositories/portico).

Data Dimensions: Disciplinary Differences in Research Data Sharing, Reuse and Long Term Viability

The Digital Curation Centre has released Data Dimensions: Disciplinary Differences in Research Data Sharing, Reuse and Long Term Viability: A Comparative Review Based on Sixteen Case Studies.

Here's an excerpt:

This synthesis study, commissioned by the Digital Curation Centre from Key Perspectives Ltd, forms a major output from the DCC SCARP Project, which investigated attitudes and approaches to data deposit, sharing and reuse, curation and preservation, over a range of research fields in differing disciplines. The aim was to investigate research practitioners’ perspectives and practices in caring for their research data, and the methods and tools they use to that end. Objectives included identification and promotion of ‘good practice’ in the selected research domains, as expressed in DCC tools and resources. The approach combined case study methods with a survey of the literature relevant to digital curation in the selected fields. . . .

This synthesis report (which drew on the SCARP case studies plus a number of others, identified in the Appendix) identifies factors that help us understand how curation practices in research groups differ in disciplinary terms. This provides a backdrop to different digital curation approaches. However, the case studies illustrate that "the discipline" is too broad a level to understand data curation practices or requirements. The diversity of data types, working methods, curation practices and content skills found even within specialised domains means that requirements should be defined at this or even a finer-grained level, such as the research group.

Report and Recommendations from the Scholarly Publishing Roundtable

The Scholarly Publishing Roundtable has released the Report and Recommendations from the Scholarly Publishing Roundtable.

Here's an excerpt from the press release:

An expert panel of librarians, library scientists, publishers, and university academic leaders today called on federal agencies that fund research to develop and implement policies that ensure free public access to the results of the research they fund "as soon as possible after those results have been published in a peer-reviewed journal."

The Scholarly Publishing Roundtable was convened last summer by the U.S. House Committee on Science and Technology, in collaboration with the White House Office of Science and Technology Policy (OSTP). Policymakers asked the group to examine the current state of scholarly publishing and seek consensus recommendations for expanding public access to scholarly journal articles.

The various communities represented in the Roundtable have been working to develop recommendations that would improve public access without curtailing the ability of the scientific publishing industry to publish peer-reviewed scientific articles.

The Roundtable’s recommendations, endorsed in full by the overwhelming majority of the panel (12 out of 14 members), "seek to balance the need for and potential of increased access to scholarly articles with the need to preserve the essential functions of the scholarly publishing enterprise," according to the report. . . .

The Roundtable identified a set of principles viewed as essential to a robust scholarly publishing system, including the need to preserve peer review, the necessity of adaptable publishing business models, the benefits of broader public access, the importance of archiving, and the interoperability of online content.

In addition, the group affirmed the high value of the "version of record" for published articles and of all stakeholders' contributions to sustaining the best possible system of scholarly publishing during a time of tremendous change and innovation.

To implement its core recommendation for public access, the Roundtable recommended the following:

  1. Agencies should work in full and open consultation with all stakeholders, as well as with OSTP, to develop their public access policies. Agencies should establish specific embargo periods between publication and public access.
  2. Policies should be guided by the need to foster interoperability.
  3. Every effort should be made to have the Version of Record as the version to which free access is provided.
  4. Government agencies should extend the reach of their public access policies through voluntary collaborations with non-governmental stakeholders.
  5. Policies should foster innovation in the research and educational use of scholarly publications.
  6. Government public access policies should address the need to resolve the challenges of long-term digital preservation.
  7. OSTP should establish a public access advisory committee to facilitate communication among government and nongovernment stakeholders.

Read more about it at "Scholarly Publishing Roundtable Releases Report and Recommendations" and "Scholarly Publishing Roundtable Releases Report to Congress."

Research Data: Unseen Opportunities

The Canadian Association of Research Libraries has released Research Data: Unseen Opportunities.

Here's an excerpt from the press release:

The purpose of the toolkit is to enable research library directors to raise awareness of the issues of data management with administrators and researchers on campus.

Data are valuable assets that in some cases have an unlimited potential for reuse. The awareness toolkit underscores the need to ensure that research data are managed throughout the data lifecycle so that they are understandable and usable.

"This is a very timely document," says Marnie Swanson (University of Victoria), Chair of the CARL Data Management Sub-Committee. "More than ever, data are a critical component of the research endeavor and this toolkit will help libraries raise awareness in the scholarly community of the importance of data stewardship."

Research Data: Unseen Opportunities provides readers with a general understanding of the current state of research data in Canada and internationally. It is organized into eight sections: The Big Picture; Major Benefits of Data Management; Current Context; Case Studies; Gaps in Data Stewardship in Canada; Data Management Policies in Canada; Responses to Faculty/Administrative Concerns; What Can Be Done on Campus?

Insight into Digital Preservation of Research Output in Europe

PARSE.Insight (INSIGHT into issues of Permanent Access to the Records of Science in Europe) has released Insight into Digital Preservation of Research Output in Europe.

Here's an excerpt:

This report . . . describes the results of the surveys conducted by PARSE.Insight to gain insight into research in Europe. Major surveys were held within three stakeholder domains: research, publishing and data management. In total, almost 2,000 people responded; they provided us with interesting insights into the current state of affairs in the digital preservation of research data (including publications), the outlook for data preservation, data sharing, the roles and responsibilities of stakeholders in research, and the funding of research.

UC Berkeley Media Vault Program Progress Report

The University of California, Berkeley's Media Vault Program has posted a progress report.

Here's an excerpt:

Media Vault Program partners offer a number of specialized tools to help campus researchers manage their materials. These include:

  • WebGenDL (UCB Library Systems) — the library's internal system for managing, creating, preserving and discovering digital library content. These tools are aimed primarily at mature, publishable sets of materials, rather than the broader context of research data
  • UC3 Curation Micro-services — a set of low barrier tools for full lifecycle enrichment of objects (e.g., identity, fixity, replication, annotation). The first few will be rolled out publicly in January 2010. These are presented not as a user interface, but rather as behind-the-scenes services
  • Sakai 3 — the next-generation version of the platform that powers the Berkeley campus's bSpace application. Due in 2011, Sakai 3 will include a range of social tools to help users extend and disseminate their materials
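The fixity function named among the micro-services above — record a digest for every object at ingest, then re-check later to detect silent corruption — can be sketched in a few lines. The function names here are illustrative, not the UC3 Curation Micro-services API.

```python
import hashlib

def make_manifest(objects: dict[str, bytes]) -> dict[str, str]:
    """Record a SHA-256 digest for every object at ingest time."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in objects.items()}

def check_fixity(objects: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the names whose current digest no longer matches the manifest."""
    current = make_manifest(objects)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)

objects = {"report.pdf": b"%PDF-1.4 ...", "data.csv": b"a,b\n1,2\n"}
manifest = make_manifest(objects)          # taken at ingest
objects["data.csv"] = b"a,b\n1,3\n"        # simulate silent corruption
damaged = check_fixity(objects, manifest)  # -> ["data.csv"]
```

Presenting such checks as behind-the-scenes services rather than a user interface, as the report describes, lets any front end (a repository, Sakai, a desktop tool) reuse the same small functions.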

To augment these services, and to handle use cases beyond their scope, the MVP team examined a number of potential platforms. . . .

Of these candidates, Alfresco stands out as the most functional, out-of-the-box solution. With a little customization, it can be readied for user testing. Therefore, the MVP team has selected it as the basis of its next round of discussions with stakeholders, partners and prospective users.

Read more about Alfresco at the AlfrescoWiki.

Presentations from the 5th International Digital Curation Conference

Presentations from the 5th International Digital Curation Conference are now available. (Thanks to the Digital Curation Blog, which has provided extensive coverage of the conference.)

File Formats for Preservation

The Digital Preservation Coalition has released File Formats for Preservation.

Here's an excerpt:

File formats are the principal means of encoding information content in any computing environment. Preserving intellectual content requires a firm grasp of the file formats used to create, store and disseminate it, and ensuring that they remain fit for purpose. There are several significant pronouncements on preservation file formats in the literature. These have generally emanated from either preservation institutions or research projects and usually take one of three approaches:

  • recommendations for submitting material to digital repositories
  • recommendations or policies for long term preservation
  • proposals, plans for and technical documentation of existing registries to store attributes of formats.

More recently, attention has broadened to pay specific attention to the significant properties of the intellectual objects that are the subject of preservation. This Technology Watch Report has been written to provide an overview of these developments in context by comparative review and analysis to assist repository managers and the preservation community more widely. It aims to provide a guide and critique to the current literature, and place it in the context of a wider professional knowledge and research base.

Data Preservation in High Energy Physics

The ICFA DPHEP International Study Group has self-archived Data Preservation in High Energy Physics in arXiv.org.

Here's an excerpt:

Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. At the same time, HEP has no coherent strategy for data preservation and re-use. An inter-experimental Study Group on HEP data preservation and long-term analysis was convened at the end of 2008 and held two workshops, at DESY (January 2009) and SLAC (May 2009). This document is an intermediate report to the International Committee for Future Accelerators (ICFA) of the reflections of this Study Group.

Closing the Digital Curation Gap Project

The Institute of Museum and Library Services has awarded $249,623 to the University of North Carolina Chapel Hill School of Information and Library Science for the Closing the Digital Curation Gap project.

Here's an excerpt from the press release:

Scientists, researchers, and scholars across the world generate vast amounts of digital data, but the scientific record and the documentary heritage created in digital form are at risk—from technology obsolescence, from the fragility of digital media, and from the lack of baseline practices for managing and preserving digital data. The University of North Carolina Chapel Hill (UNC-CH) School of Information and Library Science, working with the Institute of Museum and Library Services (IMLS) and partners in the United Kingdom (U.K.), is collaborating on the Closing the Digital Curation Gap (CDCG) project to establish baseline practices for the storage, maintenance, and preservation of digital data to help ensure their enhancement and continuing long-term use. Because digital curation, or the management and preservation of digital data over the full life cycle, is of strategic importance to the library and archives fields, IMLS is funding the project through a cooperative agreement with UNC-CH. U.K. partners include the Joint Information Systems Committee (JISC), which supports innovation in digital technologies in U.K. colleges and universities, and its funded entities, the Strategic Content Alliance (SCA) and the Digital Curation Centre (DCC).

Well-curated data can be made accessible for a variety of audiences. For example, the data gathered by the Sloan Digital Sky Survey (www.sdss.org) at the Apache Point Observatory in New Mexico is available to professional astronomers worldwide as well as to schoolchildren, teachers, and citizen scientists through its Galaxy Zoo project. Galaxy Zoo, now in its second version, invites citizen scientists to assist in classifying over a million galaxies (www.galaxyzoo.org). With good preservation techniques, this data will be available into the future to provide documentation of the sky as it currently appears.

Data and information science researchers have already developed many viable applications, models, strategies, and standards for the long term care of digital objects. This project will help bridge a significant gap between the progress of digital curation research and development and the professional practices of archivists, librarians, and museum curators. Project partners will develop guidelines for digital curation practices, especially for staff in small to medium-sized cultural heritage institutions where digital assets are most at risk. Larger institutions will also benefit. To develop baseline practices, a working group will establish and support a network of digital curation practitioners, researchers, and educators through face-to-face meetings, web-based communication, and other communication tools. Project staff will also use surveys, interviews, and case studies to develop a plan for ongoing development of digital curation frameworks, guidance, and best practices. The team will also promote roles that various organizations can play and identify future opportunities for collaboration.

As part of this project, the Digital Curation Manual, which is maintained by the DCC, will be updated and expanded (www.dcc.ac.uk/resource/curation-manual/chapters) and the Digital Curation Exchange web portal will receive support (http://digitalcurationexchange.org). Through these efforts, the CDCG project will lay the foundation that will inform future training, education, and practice. The project's research, publications, practical tool integration, and outreach and training efforts will be of value to organizations charged with maintaining digital assets over the long term.

"Memento: Time Travel for the Web"

Herbert Van de Sompel, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, and Harihar Shankar have self-archived "Memento: Time Travel for the Web" in arXiv.org.

Here's an excerpt:

The Web is ephemeral. Many resources have representations that change over time, and many of those representations are lost forever. A lucky few manage to reappear as archived resources that carry their own URIs. For example, some content management systems maintain version pages that reflect a frozen prior state of their changing resources. Archives recurrently crawl the web to obtain the actual representation of resources, and subsequently make those available via special-purpose archived resources. In both cases, the archival copies have URIs that are protocol-wise disconnected from the URI of the resource of which they represent a prior state. Indeed, the lack of temporal capabilities in the most common Web protocol, HTTP, prevents getting to an archived resource on the basis of the URI of its original. This turns accessing archived resources into a significant discovery challenge for both human and software agents, which typically involves following a multitude of links from the original to the archival resource, or of searching archives for the original URI. This paper proposes the protocol-based Memento solution to address this problem, and describes a proof-of-concept experiment that includes major servers of archival content, including Wikipedia and the Internet Archive. The Memento solution is based on existing HTTP capabilities applied in a novel way to add the temporal dimension. The result is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.
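The temporal content negotiation Memento proposes works through an `Accept-Datetime` request header sent to a "TimeGate", which answers with a snapshot of the resource and a `Memento-Datetime` header stating when that snapshot was captured. A minimal sketch of the header handling using only the standard library (a protocol illustration, not a full Memento client):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def timegate_request_headers(when: datetime) -> dict[str, str]:
    """Headers a Memento client would send to a TimeGate to ask for the
    state of a resource at a given moment."""
    return {"Accept-Datetime": format_datetime(when.astimezone(timezone.utc),
                                               usegmt=True)}

def memento_datetime(response_headers: dict[str, str]) -> datetime:
    """Read the archive's Memento-Datetime response header, which states
    when the returned snapshot was captured."""
    return parsedate_to_datetime(response_headers["Memento-Datetime"])

headers = timegate_request_headers(
    datetime(2009, 6, 1, 12, 0, tzinfo=timezone.utc))
# headers["Accept-Datetime"] == "Mon, 01 Jun 2009 12:00:00 GMT"
```

Because the extra dimension rides on ordinary HTTP headers, clients that ignore `Accept-Datetime` simply get the current representation, which is what makes the approach deployable alongside the existing Web.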

Read more about it at "Time-Travelling Browsers Navigate the Web's Past" and the Memento project website.

"The Practice and Perception of Web Archiving in Academic Libraries and Archives"

Lisa A. Gregory's master's thesis, "The Practice and Perception of Web Archiving in Academic Libraries and Archives," is available from the School of Information and Library Science at the University of North Carolina at Chapel Hill.

Here's an excerpt:

In order to dig deeper into possible reasons behind archivists’ and librarians’ reluctance to archive Web sites, the study described here asks professionals to reveal their Web archiving experiences as well as the information sources they consult regarding archiving Web sites. Specifically, the following two research questions are addressed: Are librarians and archivists at institutions of higher education currently engaged in or considering archiving Web sites? What sources do these professionals consult for information about Web archiving?

Towards Repository Preservation Services. Final Report from the JISC Preserv 2 Project

Steve Hitchcock, David Tarrant, and Les Carr have self-archived Towards Repository Preservation Services. Final Report from the JISC Preserv 2 Project in the ECS EPrints Repository.

Here's the abstract:

Preserv 2 investigated the preservation of data in digital institutional repositories, focussing in particular on managing storage, data and file formats. Preserv 2 developed the first repository storage controller, which will be a feature of EPrints version 3.2 software (due 2009). Plugin applications that use the controller have been written for Amazon S3 and Sun cloud services among others, as well as for local disk storage. In a breakthrough application Preserv 2 used OAI-ORE to show how data can be moved between two repository software platforms with quite distinct data models, from an EPrints repository to a Fedora repository. The largest area of work in Preserv 2 was on file format management and an 'active' preservation approach. This involves identifying file formats, assessing the risks posed by those formats and taking action to obviate the risks where that could be justified. These processes were implemented with reference to a technical registry, PRONOM from The National Archives (TNA), and DROID (digital record object identification service), also produced by TNA. Preserv 2 showed we can invoke a current registry to classify the digital objects and present a hierarchy of risk scores for a repository. Classification was performed using the Preserv2 EPrints preservation toolkit. This 'wraps' DROID in an EPrints repository environment. This toolkit will be another feature available for EPrints v3.2 software. The result of file format identification can indicate a file is at risk of becoming inaccessible or corrupted. Preserv 2 developed a repository interface to present formats by risk category. Providing risk scores through the live PRONOM service was shown to be feasible. Spin-off work is ongoing to develop format risk scores by compiling data from multiple sources in a new linked data registry.
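The identify-then-assess approach described above can be illustrated with a toy signature matcher in the spirit of DROID and PRONOM. The signature table and risk categories below are a hand-picked sample for illustration, not PRONOM data or the Preserv 2 toolkit's actual scores.

```python
# Signature table: leading bytes -> (PRONOM-style PUID, format name).
SIGNATURES = {
    b"%PDF": ("fmt/14", "PDF"),
    b"\x89PNG\r\n\x1a\n": ("fmt/11", "PNG"),
    b"GIF89a": ("fmt/4", "GIF 89a"),
}
# Hand-assigned risk categories for the sake of the example.
RISK = {"fmt/14": "low", "fmt/11": "low", "fmt/4": "medium"}

def identify(data: bytes):
    """Return (puid, name) for the first matching signature, else None."""
    for magic, fmt in SIGNATURES.items():
        if data.startswith(magic):
            return fmt
    return None

def risk_report(files: dict[str, bytes]) -> dict[str, str]:
    """Classify each file; unidentified formats are treated as highest risk."""
    report = {}
    for name, data in files.items():
        hit = identify(data)
        report[name] = RISK[hit[0]] if hit else "unknown"
    return report

report = risk_report({"a.pdf": b"%PDF-1.4 ...", "b.dat": b"\x00\x01"})
# -> {"a.pdf": "low", "b.dat": "unknown"}
```

Consulting a live external registry rather than a baked-in table, as Preserv 2 did with PRONOM, means the risk assessment can evolve as community knowledge about formats changes.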

Papers from the European Research Area 2009 Conference

Papers from the European Research Area 2009 Conference are now available.

Digital Videos: Presentations from Access 2009 Conference

Presentations from the Access 2009 Conference are now available. Digital videos and presentation slides (if available) are synched.

Here's a quick selection:

  1. Dan Chudnov, "Repository Development at the Library of Congress"
  2. Cory Doctorow, "Copyright vs Universal Access to All Human Knowledge and Groups Without Cost: The State of Play in the Global Copyfight"
  3. Mark Jordan & Brian Owen, "COPPUL's LOCKSS Private Network / Software Lifecycles & Sustainability: a PKP and reSearcher Update"
  4. Dorothea Salo, "Representing and Managing the Data Deluge"
  5. Roy Tennant, "Inspecting the Elephant: Characterizing the Hathi Trust Collection"

Johns Hopkins University Sheridan Libraries' Data Conservancy Project Funded by $20 Million NSF Grant

The Johns Hopkins University Sheridan Libraries' Data Conservancy project has been funded by a $20 million NSF grant.

Here's an excerpt from the press release:

The Johns Hopkins University Sheridan Libraries have been awarded $20 million from the National Science Foundation (NSF) to build a data research infrastructure for the management of the ever-increasing amounts of digital information created for teaching and research. The five-year award, announced this week, was one of two for what is being called "data curation."

The project, known as the Data Conservancy, involves individuals from several institutions, with Johns Hopkins University serving as the lead and Sayeed Choudhury, Hodson Director of the Digital Research and Curation Center and associate dean of university libraries, as the principal investigator. In addition, seven Johns Hopkins faculty members are associated with the Data Conservancy, including School of Arts and Sciences professors Alexander Szalay, Bruce Marsh, and Katalin Szlavecz; School of Engineering professors Randal Burns, Charles Meneveau, and Andreas Terzis; and School of Medicine professor Jef Boeke. The Hopkins-led project is part of a larger $100 million NSF effort to ensure preservation and curation of engineering and science data.

Beginning with the life, earth, and social sciences, project members will develop a framework to more fully understand data practices currently in use and arrive at a model for curation that allows ease of access both within and across disciplines.

"Data curation is not an end but a means," said Choudhury. "Science and engineering research and education are increasingly digital and data-intensive, which means that new management structures and technologies will be critical to accommodate the diversity, size, and complexity of current and future data sets and streams. Our ultimate goal is to support new ways of inquiry and learning. The potential for the sharing and application of data across disciplines is incredible. But it’s not enough to simply discover data; you need to be able to access it and be assured it will remain available."

The Data Conservancy grant represents one of the first awards related to the Institute of Data Intensive Engineering and Science (IDIES), a collaboration between the Krieger School of Arts and Sciences, the Whiting School of Engineering, and the Sheridan Libraries. . . .

In addition to the $20 million grant announced today, the Libraries received a $300,000 grant from NSF to study the feasibility of developing, operating and sustaining an open access repository of articles from NSF-sponsored research. Libraries staff will work with colleagues from the Council on Library and Information Resources (CLIR) and the University of Michigan Libraries to explore the potential for the development of a repository (or set of repositories) similar to PubMedCentral, the open-access repository that features articles from NIH-sponsored research. This grant for the feasibility study will allow Choudhury's group to evaluate how to integrate activities under the framework of the Data Conservancy and will result in a set of recommendations for NSF regarding an open access repository.

Indiana University Bloomington Media Preservation Survey

Indiana University Bloomington has released its Media Preservation Survey.

Here's an excerpt:

The survey task force recommends a number of actions to facilitate the time-critical process of rescuing IUB’s audio, video, and film media.

  • Appoint a campus-wide task force to advise
    • the development of priorities for preservation action
    • the development of a campus-wide preservation plan
    • how units can leverage resources for the future
  • Create a centralized media preservation and digitization center that will serve the entire campus, using international standards for preservation transfer. As part of the planning for this center, hire a
    • media preservation specialist
    • film archivist
  • Develop special funding for the massive and rapid digitization of the treasures of IU over the next 10 years.
  • Create a centralized physical storage space appropriate for film, video, and audio.
  • Provide archival appraisal and control across campus to
    • assure quality of digitization for preservation
    • oversee plans for maintaining original media
  • Develop cataloging services for special collections to improve intellectual control to
    • accelerate research opportunities
    • improve access.

"Digital Preservation: Logical and Bit-Stream Preservation Using Plato, EPrints and the Cloud"

Adam Field, David Tarrant, Andreas Rauber, and Hannes Kulovits have self-archived their "Digital Preservation: Logical and Bit-Stream Preservation Using Plato, EPrints and the Cloud" presentation on the ECS EPrints Repository.

Here's an excerpt from the abstract:

This tutorial shows attendees the latest facilities in the EPrints open source repository platform for dealing with preservation tasks in a practical and achievable way, and new mechanisms for integrating the repository with the cloud and the user desktop, in order to be able to offer a trusted and managed storage solution to end users. . . .

The benefit of this tutorial is the grounding of digital curation advice and theory into achievable good practice that delivers helpful services to end users for their familiar personal desktop environments and new cloud services.