Insight into Digital Preservation of Research Output in Europe

PARSE.Insight (INSIGHT into issues of Permanent Access to the Records of Science in Europe) has released Insight into Digital Preservation of Research Output in Europe.

Here's an excerpt:

This report . . . describes the results of the surveys conducted by PARSE.Insight to gain insight into research in Europe. Major surveys were held within three stake-holder domains: research, publishing and data management. In total, almost 2,000 people responded; they provided us with interesting insights in the current state of affairs in digital preservation of digital research data (including publications), the outlook of data preservation, data sharing, roles & responsibilities of stakeholders in research and funding of research.

UC Berkeley Media Vault Program Progress Report

The University of California, Berkeley's Media Vault Program has posted a progress report.

Here's an excerpt:

Media Vault Program partners offer a number of specialized tools to help campus researchers manage their materials. These include:

  • WebGenDL (UCB Library Systems) — the library's internal system for managing, creating, preserving and discovering digital library content. These tools are aimed primarily at mature, publishable sets of materials, rather than the broader context of research data
  • UC3 Curation Micro-services — a set of low barrier tools for full lifecycle enrichment of objects (e.g., identity, fixity, replication, annotation). The first few will be rolled out publicly in January 2010. These are presented not as a user interface, but rather as behind-the-scenes services
  • Sakai 3 — the next-generation version of the platform that powers the Berkeley campus's bSpace application. Due in 2011, Sakai 3 will include a range of social tools to help users extend and disseminate their materials

To augment these services, and to handle use cases beyond their scope, the MVP team examined a number of potential platforms. . . .

Of these candidates, Alfresco stands out as the most functional, out-of-the-box solution. With a little customization, it can be readied for user testing. Therefore, the MVP team has selected it as the basis of its next round of discussions with stakeholders, partners and prospective users.

Read more about Alfresco at the AlfrescoWiki.

Presentations from the 5th International Digital Curation Conference

Presentations from the 5th International Digital Curation Conference are now available. (Thanks to the Digital Curation Blog, which has provided extensive coverage of the conference.)

First Day

Day Two

File Formats for Preservation

The Digital Preservation Coalition has released File Formats for Preservation.

Here's an excerpt:

File formats are the principal means of encoding information content in any computing environment. Preserving intellectual content requires a firm grasp of the file formats used to create, store and disseminate it, and ensuring that they remain fit for purpose. There are several significant pronouncements on preservation file formats in the literature. These have generally emanated from either preservation institutions or research projects and usually take one of three approaches:

  • recommendations for submitting material to digital repositories
  • recommendations or policies for long term preservation or
  • proposals, plans for and technical documentation of existing registries to store attributes of formats.

More recently, attention has broadened to pay specific attention to the significant properties of the intellectual objects that are the subject of preservation. This Technology Watch Report has been written to provide an overview of these developments in context by comparative review and analysis to assist repository managers and the preservation community more widely. It aims to provide a guide and critique to the current literature, and place it in the context of a wider professional knowledge and research base.

Data Preservation in High Energy Physics

The ICFA DPHEP International Study Group has self-archived Data Preservation in High Energy Physics.

Here's an excerpt:

Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. At the same time, HEP has no coherent strategy for data preservation and re-use. An inter-experimental Study Group on HEP data preservation and long-term analysis was convened at the end of 2008 and held two workshops, at DESY (January 2009) and SLAC (May 2009). This document is an intermediate report to the International Committee for Future Accelerators (ICFA) of the reflections of this Study Group.

Closing the Digital Curation Gap Project

The Institute of Museum and Library Services has awarded $249,623 to the University of North Carolina Chapel Hill School of Information and Library Science for the Closing the Digital Curation Gap project.

Here's an excerpt from the press release:

Scientists, researchers, and scholars across the world generate vast amounts of digital data, but the scientific record and the documentary heritage created in digital form are at risk—from technology obsolescence, from the fragility of digital media, and from the lack of baseline practices for managing and preserving digital data. The University of North Carolina Chapel Hill (UNC-CH) School of Information and Library Science, working with the Institute of Museum and Library Services (IMLS) and partners in the United Kingdom (U.K.), are collaborating on the Closing the Digital Curation Gap (CDCG) project to establish baseline practices for the storage, maintenance, and preservation of digital data to help ensure their enhancement and continuing long-term use. Because digital curation, or the management and preservation of digital data over the full life cycle, is of strategic importance to the library and archives fields, IMLS is funding the project through a cooperative agreement with UNC-CH. U.K. partners include the Joint Information Systems Committee (JISC), which supports innovation in digital technologies in U.K. colleges and universities, and its funded entities, the Strategic Content Alliance (SCA) and the Digital Curation Centre (DCC).

Well-curated data can be made accessible for a variety of audiences. For example, the data gathered by the Sloan Digital Sky Survey at the Apache Point Observatory in New Mexico is available to professional astronomers worldwide as well as to schoolchildren, teachers, and citizen scientists through its Galaxy Zoo project. Galaxy Zoo, now in its second version, invites citizen scientists to assist in classifying over a million galaxies. With good preservation techniques, this data will be available into the future to provide documentation of the sky as it currently appears.

Data and information science researchers have already developed many viable applications, models, strategies, and standards for the long term care of digital objects. This project will help bridge a significant gap between the progress of digital curation research and development and the professional practices of archivists, librarians, and museum curators. Project partners will develop guidelines for digital curation practices, especially for staff in small to medium-sized cultural heritage institutions where digital assets are most at risk. Larger institutions will also benefit. To develop baseline practices, a working group will establish and support a network of digital curation practitioners, researchers, and educators through face-to-face meetings, web-based communication, and other communication tools. Project staff will also use surveys, interviews, and case studies to develop a plan for ongoing development of digital curation frameworks, guidance, and best practices. The team will also promote roles that various organizations can play and identify future opportunities for collaboration.

As part of this project, the Digital Curation Manual, which is maintained by the DCC, will be updated and expanded and the Digital Curation Exchange web portal will receive support. Through these efforts, the CDCG project will lay the foundation that will inform future training, education, and practice. The project's research, publications, practical tool integration, and outreach and training efforts will be of value to organizations charged with maintaining digital assets over the long term.

"Memento: Time Travel for the Web"

Herbert Van de Sompel, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, and Harihar Shankar have self-archived "Memento: Time Travel for the Web."

Here's an excerpt:

The Web is ephemeral. Many resources have representations that change over time, and many of those representations are lost forever. A lucky few manage to reappear as archived resources that carry their own URIs. For example, some content management systems maintain version pages that reflect a frozen prior state of their changing resources. Archives recurrently crawl the web to obtain the actual representation of resources, and subsequently make those available via special-purpose archived resources. In both cases, the archival copies have URIs that are protocol-wise disconnected from the URI of the resource of which they represent a prior state. Indeed, the lack of temporal capabilities in the most common Web protocol, HTTP, prevents getting to an archived resource on the basis of the URI of its original. This turns accessing archived resources into a significant discovery challenge for both human and software agents, which typically involves following a multitude of links from the original to the archival resource, or of searching archives for the original URI. This paper proposes the protocol-based Memento solution to address this problem, and describes a proof-of-concept experiment that includes major servers of archival content, including Wikipedia and the Internet Archive. The Memento solution is based on existing HTTP capabilities applied in a novel way to add the temporal dimension. The result is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.

Read more about it at "Time-Travelling Browsers Navigate the Web's Past" and the Memento project website.
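
To make the idea of adding a temporal dimension to HTTP more concrete, here is a minimal Python sketch of the request side of datetime negotiation. It only builds the request headers and makes no network call. The `Accept-Datetime` header name is taken from the later Memento specification (RFC 7089) and is an assumption here; the proof-of-concept described in the paper may have used a different name. The function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def datetime_negotiation_headers(when: datetime) -> dict[str, str]:
    """Build headers asking for a resource as it existed around `when`.

    Assumption: the header is called Accept-Datetime, as in RFC 7089.
    """
    # HTTP dates are expressed in GMT, so normalise to UTC first.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = datetime_negotiation_headers(
    datetime(2009, 11, 10, tzinfo=timezone.utc)
)
print(headers)
```

A client would send these headers with a request for the original URI, and a Memento-aware server or archive could then point it to the archived copy closest to the requested date.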

"The Practice and Perception of Web Archiving in Academic Libraries and Archives"

Lisa A. Gregory's Master's thesis, "The Practice and Perception of Web Archiving in Academic Libraries and Archives," is available from the School of Information and Library Science at the University of North Carolina at Chapel Hill.

Here's an excerpt:

In order to dig deeper into possible reasons behind archivists’ and librarians’ reluctance to archive Web sites, the study described here asks professionals to reveal their Web archiving experiences as well as the information sources they consult regarding archiving Web sites. Specifically, the following two research questions are addressed: Are librarians and archivists at institutions of higher education currently engaged in or considering archiving Web sites? What sources do these professionals consult for information about Web archiving?

Towards Repository Preservation Services. Final Report from the JISC Preserv 2 Project

Steve Hitchcock, David Tarrant, and Les Carr have self-archived Towards Repository Preservation Services. Final Report from the JISC Preserv 2 Project in the ECS EPrints Repository.

Here's the abstract:

Preserv 2 investigated the preservation of data in digital institutional repositories, focussing in particular on managing storage, data and file formats. Preserv 2 developed the first repository storage controller, which will be a feature of EPrints version 3.2 software (due 2009). Plugin applications that use the controller have been written for Amazon S3 and Sun cloud services among others, as well as for local disk storage. In a breakthrough application Preserv 2 used OAI-ORE to show how data can be moved between two repository softwares with quite distinct data models, from an EPrints repository to a Fedora repository. The largest area of work in Preserv 2 was on file format management and an 'active' preservation approach. This involves identifying file formats, assessing the risks posed by those formats and taking action to obviate the risks where that could be justified. These processes were implemented with reference to a technical registry, PRONOM from The National Archives (TNA), and DROID (digital record object identification service), also produced by TNA. Preserv 2 showed we can invoke a current registry to classify the digital objects and present a hierarchy of risk scores for a repository. Classification was performed using the Preserv2 EPrints preservation toolkit. This 'wraps' DROID in an EPrints repository environment. This toolkit will be another feature available for EPrints v3.2 software. The result of file format identification can indicate a file is at risk of becoming inaccessible or corrupted. Preserv 2 developed a repository interface to present formats by risk category. Providing risk scores through the live PRONOM service was shown to be feasible. Spin-off work is ongoing to develop format risk scores by compiling data from multiple sources in a new linked data registry.
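
The "present formats by risk category" step in the abstract can be illustrated with a rough Python sketch. Everything here is an assumption made for illustration: the format names, the 1 to 5 risk scores, the category thresholds, and the function names are hypothetical and are not taken from PRONOM, DROID, or the Preserv2 toolkit. In a real setup the format names would come from an identification tool and the scores from a registry.

```python
from collections import defaultdict

# Hypothetical scores (1 = low risk, 5 = high risk); NOT real PRONOM data.
ILLUSTRATIVE_RISK_SCORES = {
    "PDF/A-1b": 1,
    "Plain text": 1,
    "Word 97-2003 document": 3,
    "WordStar document": 5,
}


def risk_category(score):
    """Map a numeric score to a coarse category (thresholds are assumed)."""
    if score is None:
        return "unknown"
    if score <= 2:
        return "low"
    if score <= 3:
        return "medium"
    return "high"


def group_by_risk(identified_formats):
    """Group file names by the risk category of their identified format."""
    groups = defaultdict(list)
    for filename, format_name in identified_formats.items():
        score = ILLUSTRATIVE_RISK_SCORES.get(format_name)
        groups[risk_category(score)].append(filename)
    return dict(groups)


report = group_by_risk({
    "thesis.pdf": "PDF/A-1b",
    "notes.doc": "Word 97-2003 document",
    "draft.ws": "WordStar document",
    "scan.xyz": "Unrecognised format",
})
print(report)
```

Files whose format has no score end up in an "unknown" group, which a repository manager would probably want to review by hand.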

Papers from the European Research Area 2009 Conference

Papers from the European Research Area 2009 Conference are now available.

Here's a selection from the "Open Access and Preservation" session:

Digital Videos: Presentations from Access 2009 Conference

Presentations from the Access 2009 Conference are now available. Digital videos and presentation slides (if available) are synched.

Here's a quick selection:

  1. Dan Chudnov, "Repository Development at the Library of Congress"
  2. Cory Doctorow, "Copyright vs Universal Access to All Human Knowledge and Groups Without Cost: The State of Play in the Global Copyfight"
  3. Mark Jordan & Brian Owen, "COPPUL's LOCKSS Private Network / Software Lifecycles & Sustainability: a PKP and reSearcher Update"
  4. Dorothea Salo, "Representing and Managing the Data Deluge"
  5. Roy Tennant, "Inspecting the Elephant: Characterizing the Hathi Trust Collection"

Johns Hopkins University Sheridan Libraries' Data Conservancy Project Funded by $20 Million NSF Grant

The Johns Hopkins University Sheridan Libraries' Data Conservancy project has been funded by a $20 million NSF grant.

Here's an excerpt from the press release:

The Johns Hopkins University Sheridan Libraries have been awarded $20 million from the National Science Foundation (NSF) to build a data research infrastructure for the management of the ever-increasing amounts of digital information created for teaching and research. The five-year award, announced this week, was one of two for what is being called "data curation."

The project, known as the Data Conservancy, involves individuals from several institutions, with Johns Hopkins University serving as the lead and Sayeed Choudhury, Hodson Director of the Digital Research and Curation Center and associate dean of university libraries, as the principal investigator. In addition, seven Johns Hopkins faculty members are associated with the Data Conservancy, including School of Arts and Sciences professors Alexander Szalay, Bruce Marsh, and Katalin Szlavecz; School of Engineering professors Randal Burns, Charles Meneveau, and Andreas Terzis; and School of Medicine professor Jef Boeke. The Hopkins-led project is part of a larger $100 million NSF effort to ensure preservation and curation of engineering and science data.

Beginning with the life, earth, and social sciences, project members will develop a framework to more fully understand data practices currently in use and arrive at a model for curation that allows ease of access both within and across disciplines.

"Data curation is not an end but a means," said Choudhury. "Science and engineering research and education are increasingly digital and data-intensive, which means that new management structures and technologies will be critical to accommodate the diversity, size, and complexity of current and future data sets and streams. Our ultimate goal is to support new ways of inquiry and learning. The potential for the sharing and application of data across disciplines is incredible. But it’s not enough to simply discover data; you need to be able to access it and be assured it will remain available."

The Data Conservancy grant represents one of the first awards related to the Institute of Data Intensive Engineering and Science (IDIES), a collaboration between the Krieger School of Arts and Sciences, the Whiting School of Engineering, and the Sheridan Libraries. . . .

In addition to the $20 million grant announced today, the Libraries received a $300,000 grant from NSF to study the feasibility of developing, operating and sustaining an open access repository of articles from NSF-sponsored research. Libraries staff will work with colleagues from the Council on Library and Information Resources (CLIR), and the University of Michigan Libraries to explore the potential for the development of a repository (or set of repositories) similar to PubMedCentral, the open-access repository that features articles from NIH-sponsored research. This grant for the feasibility study will allow Choudhury's group to evaluate how to integrate activities under the framework of the Data Conservancy and will result in a set of recommendations for NSF regarding an open access repository.

Indiana University Bloomington Media Preservation Survey

Indiana University Bloomington has released its Media Preservation Survey.

Here's an excerpt:

The survey task force recommends a number of actions to facilitate the time-critical process of rescuing IUB’s audio, video, and film media.

  • Appoint a campus-wide taskforce to advise
    • the development of priorities for preservation action
    • the development of a campus-wide preservation plan
    • how units can leverage resources for the future
  • Create a centralized media preservation and digitization center that will serve the entire campus, using international standards for preservation transfer. As part of the planning for this center, hire a
    • media preservation specialist
    • film archivist
  • Develop special funding for the massive and rapid digitization of the treasures of IU over the next 10 years.
  • Create a centralized physical storage space appropriate for film, video, and audio.
  • Provide archival appraisal and control across campus to
    • assure quality of digitization for preservation
    • oversee plans for maintaining original media
  • Develop cataloging services for special collections to improve intellectual control to
    • accelerate research opportunities
    • improve access.

"Digital Preservation: Logical and Bit-Stream Preservation Using Plato, EPrints and the Cloud"

Adam Field, David Tarrant, Andreas Rauber, and Hannes Kulovits have self-archived their "Digital Preservation: Logical and Bit-Stream Preservation Using Plato, EPrints and the Cloud" presentation on the ECS EPrints Repository.

Here's an excerpt from the abstract:

This tutorial shows attendees the latest facilities in the EPrints open source repository platform for dealing with preservation tasks in a practical and achievable way, and new mechanisms for integrating the repository with the cloud and the user desktop, in order to be able to offer a trusted and managed storage solution to end users. . . .

The benefit of this tutorial is the grounding of digital curation advice and theory into achievable good practice that delivers helpful services to end users for their familiar personal desktop environments and new cloud services.

Digital Preservation: Life2 Final Project Report

JISC has released Life2 Final Project Report.

Here's an excerpt:

LIFE Model v2 outlines a fully-revised lifecycle model taking into account feedback from user groups, the Case Studies and the wider digital preservation community.

Generic Preservation Model (GPM) summarises the update to the preservation model with an accompanying spreadsheet. This model allows institutions to estimate potential digital preservation costs for their collections. The GPM fits into the updated LIFE Model.

An Economic Evaluation of LIFE was written by economist Bo-Christer Björk on the approach used for both the first and second phases of LIFE. This independent review validates the LIFE approach for lifecycle costing.

The SHERPA DP Case Study outlines the mapping of the repository services that CeRch provides to the LIFE Model. The SHERPA-LEAP Case Study maps three very different HE repositories to the LIFE Model. Goldsmiths University of London, Royal Holloway University of London and UCL (University College London) each provide exemplars of varying collections. Each institution’s repository is at a different stage of development.

The Newspapers Case Study successfully maps both analogue and digital newspaper collections to the LIFE Model. This success means that LIFE could be developed into a fully-compatible predictive tool across both analogue and digital collections, allowing for comparison both throughout the lifecycles of a collection and across different types of collections.

SHERPA DP2: Developing Services for Archiving and Preservation in a Distributed Environment—Final Report

JISC has released SHERPA DP2: Developing Services for Archiving and Preservation in a Distributed Environment—Final Report.

Here's an excerpt:

The SHERPA DP2 project (2007-2009) was a two year project funded by the JISC under the Digital Preservation and Records Management Programme. The project was led by the Centre for e-Research at King's College London (formerly the Executive of the Arts and Humanities Data Service), which is working with several institutions to develop a preservation service that will cater for the requirements of a diverse range of digital resources and web-based resources. In summary, the project has the following objectives:

  1. Extend and refine the OAIS-based Shared Services model created for the initial SHERPA DP project to accommodate the requirements of different Content Providers and varied collaborative methods.
  2. Produce a set of services that will assist with the capture and return of research data stored in distributed locations, building upon existing software tools.
  3. Expand upon the work processes and software tools developed for SHERPA DP(1) and SOAPI to cater for the curation and preservation of increasingly diverse resource types.

Digital Preservation: Media Vault Program Interim Report

The Media Vault Program has released Media Vault Program Interim Report.

Here's an excerpt:

All major studies and reports on the sustainability of digital resources point to a multitude of barriers that can be clustered into four factors:

Economic: Who owns the problem, and who benefits from the solutions? Who pays for the services, long-term preservation, development, and curation? . . . .

Technical: Simple services are needed, but they are not simple to build, implement and support in our complex environment. Successful structures that can support digital scholarship must account for user needs, emerging technologies/file formats, adverse working contexts (fieldwork, offline, multi-platform), and should be supported at the enterprise scale. . . .

Political/Organizational: . . . . there are good reasons for the various service provider organizations to innovate on their own, but there is much to gain from working together on common goals and milestones. In fact, where communities have succeeded in softening the boundaries between content producers and consumers, supporters and beneficiaries, significant successes have been achieved. . . .

Social: We live in interesting times . . . and the prevalence of cheap/stolen media has produced an expectation that things should be always available, conveniently packaged, and free. Where some organizations, such as the Long Now Foundation, are hoping to "provide counterpoint to today's 'Faster/cheaper' mind set and promote 'slower/better' thinking," it may be up to those of us who care deeply about the persistence of research data to step up as the seas continue to change.

Harvard University Library Launched Web Archive Collection Service (WAX)

The Harvard University Library has launched its Web Archive Collection Service (WAX).

Here's an excerpt from the press release:

WAX began as a pilot project in July 2006, funded by the University's Library Digital Initiative (LDI) to address the management of web sites by collection managers for long-term archiving. It was the first LDI project specifically oriented toward preserving "born-digital" material. . . .

During the pilot, we explored the legal terrain and implemented several methods of mitigating risks. We investigated various technologies and developed work flow efficiencies for the collection managers and the technologists. We analyzed and implemented the metadata and deposit requirements for long term preservation in our repository. We continue to look at ways to ease the labor intensive nature of the QA process, to improve display as the software matures and to assess additional requirements for long term preservation. . . .

WAX was built using several open source tools developed by the Internet Archive and other International Internet Preservation Consortium (IIPC) members. These IIPC tools include the Heritrix web crawler; the Wayback index and rendering tool; and the NutchWAX index and search tool. WAX also uses Quartz open source job scheduling software from OpenSymphony.

In February 2009, the pilot public interface was launched and announced to the University community. WAX has now transitioned to a production system supported by the University Library's central infrastructure.

English-Language Summary of A Future for Our Digital Memory: Permanent Access to Information in the Netherlands

The Netherlands Coalition for Digital Preservation has released an English-language summary of A Future for Our Digital Memory: Permanent Access to Information in the Netherlands.

Here's an excerpt:

In order to underpin its strategy, the NCDD decided to first build a detailed picture of the current situation in the public sector in the Netherlands. Can institutions or domains be identified which have successfully risen to the challenge of digital preservation and permanent access? Which categories of data are in danger of being lost? How can the risks be managed? This so-called National Digital Preservation Survey was funded by the Ministry of Education, Culture and Science.

After some preliminary consultancy work it was decided that the survey would best be carried out by researchers with both knowledge of the issues involved in digital preservation and of the three sectors, which were identified as: scholarly communications, government & archives, and culture & heritage. A team of three researchers was recruited from among NCDD member staff, with the NCDD coordinator leading the project. The initial objective, to conduct a statistically relevant quantitative survey, had to be abandoned early in the project. The field to be surveyed was vast and varied, and some of the target groups were quite unfamiliar with the specifics of digital preservation, making online surveys unproductive. Therefore, the research team decided on a methodology of (some seventy) semi-structured interviews with knowledgeable stakeholders, adding relevant information from both Dutch and foreign published sources. Five interviews were held with major private sector parties to establish whether the private sector has best practices to offer for the public sector to emulate.

Digital Preservation: Alpha Prototype of JHOVE2 Released

An alpha prototype of JHOVE2 is now available. JHOVE2 is a tool for the characterization (i.e., identification, validation, feature extraction, and assessment) of digital objects that is used for digital library and digital preservation purposes.

Here's an excerpt from the announcement:

An alpha prototype version of JHOVE2 is now available for download and evaluation (v. 0.5.2, 2009-08-05). Distribution packages (in zip and tar.gz form) are available on the JHOVE2 public wiki. The new JHOVE2 architecture reflected in this prototype is described in the attached architectural overview. . . .

The prototype supports the following features:

  • Appropriate recursive processing of directories and Zip files.
  • High performance buffered I/O using the Java nio package.
  • Message digesting for the following algorithms: Adler-32, CRC-32, MD2, MD5, SHA-1, SHA-256, SHA-384, SHA-512
  • Results formatted as JSON, text (name/value pairs), and XML.
  • Use of the Spring Inversion-of-Control container for flexible module configuration.
  • A complete UTF-8 module.
  • A minimally functional Shapefile module.
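
As a rough illustration of the message digesting feature in the list above, the following Python sketch computes most of the listed digests for a byte string using only the standard library. It is not JHOVE2 code (JHOVE2 is a Java tool), and the function name is hypothetical. MD2 is left out because Python's `hashlib` does not guarantee support for it.

```python
import hashlib
import zlib


def compute_digests(data: bytes) -> dict[str, str]:
    """Return hex digests of `data` for several common fixity algorithms."""
    # Adler-32 and CRC-32 are 32-bit checksums, shown as 8 hex digits.
    digests = {
        "Adler-32": format(zlib.adler32(data), "08x"),
        "CRC-32": format(zlib.crc32(data), "08x"),
    }
    # Cryptographic hash functions provided by hashlib.
    for label, name in [
        ("MD5", "md5"),
        ("SHA-1", "sha1"),
        ("SHA-256", "sha256"),
        ("SHA-384", "sha384"),
        ("SHA-512", "sha512"),
    ]:
        digests[label] = hashlib.new(name, data).hexdigest()
    return digests


for algorithm, value in compute_digests(b"example content").items():
    print(f"{algorithm}: {value}")
```

Storing such digests alongside an object lets a repository check later whether the bytes have changed. For large files, reading in chunks and updating each hash incrementally would avoid loading everything into memory.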

OCLC Presentations on Digital Curation and Web-scale Management Services

Below are streaming video OCLC presentations from ALA Annual 2009 on digital curation and Web-scale Management Services.

  • Integrating Technical Services and Preservation Workflows: "Mainstreaming Digital Resources. After an introduction from Geri Bunker Ingram of OCLC, Amy Rudersdorf (Director, Digital Information Management Program, The State Library of North Carolina) discusses integrating a whole host of systems into a digital curation workflow, including OCLC's Connexion tools, Digital Archive, WorldCat, Digital Collection Gateway and CONTENTdm."
  • OCLC Web-scale Management Services: "Presentation by Andrew Pace, OCLC Executive Director for Networked Library Services, ALA Annual 2009. Web-scale cooperative library management services, network-level tools for managing library collections through circulation and delivery, print and licensed acquisitions, and license management. These services complement existing OCLC Web-scale services, such as cataloging, resource sharing, and integrated discovery."