OAI-PMH – Page 2 – DigitalKoans

Using the Open Archives Initiative Protocol for Metadata Harvesting

Libraries Unlimited has released Using the Open Archives Initiative Protocol for Metadata Harvesting by Timothy W. Cole and Muriel Foulonneau.

Here’s an excerpt from the publisher’s description:

Through a series of case studies, Cole and Foulonneau guide the reader through the process of conceiving, implementing and maintaining an OAI-compliant repository. Its applicability to both institutional archives and discipline based aggregators are covered, with equal attention paid to the technical and organizational aspects of creating and maintaining such repositories.

Compound Information Objects: An OAI-ORE Perspective

The Open Archives Initiative—Object Reuse and Exchange has released Compound Information Objects: An OAI-ORE Perspective by Carl Lagoze and Herbert Van de Sompel.

Here’s an excerpt from the document’s "Introduction and Motivation" section:

In summary, the web architecture expresses the notion of linked URI-identified resources. Information systems can leverage this architecture to publish the components of a compound object and thereby make them available to web clients and services. But due to the absence of commonly accepted standards, the notion of an identified compound object with a distinct boundary and typed relationships among its component resources is lost.

The absence of these standards affects the functionality of a number of existing and possible web services and applications. Crawler-based search engines might be more useful if the granularity of their result sets corresponded to compound objects (a book or chapter, in this example) rather than individual resources (single pages). The ranking algorithms of these search engines might improve if the links among the components of a compound object were treated differently than links to the object as a whole, or if the number of in-links to the various component resources was accumulated to the level of the compound object instead of counted separately. Citation analysis systems would also benefit from a mechanism for citing the compound object itself, rather than arbitrary parts of the object. Finally, a standard for representing compound objects might enable a new class of "whole object" services such as "preserve a compound object".

Wednesday’s OAI5 Presentations

Presentations from Wednesday’s sessions of the 5th Workshop on Innovations in Scholarly Communication in Geneva are now available.

Here are a few highlights from this major conference:

MESUR: Metrics from Scholarly Usage of Resources (PowerPoint): "The two-year MESUR project, funded by the Andrew W. Mellon Foundation, aims to define and validate a range of usage-based impact metrics, and issue guidelines with regards to their characteristics and proper application. The MESUR project is constructing a large-scale semantic model of the scholarly community that seamlessly integrates a wide range of bibliographic, citation and usage data."
OAI Object Re-Use and Exchange (PowerPoint): "In this presentation, we will give an overview of the current activities, including: defining the problem of compound documents within the web architecture, enumerating and exploring several use cases, and identifying likely adopters of OAI-ORE."
OpenDOAR Policy Tools and Applications (RealVideo): "OpenDOAR has developed a set of policy generator tools for repository administrators and is contacting administrators to advocate policy development."
State of OAI-PMH (PowerPoint): "The OAI-PMH was released in 2001 and stabilized at v2.0 in 2002. Since then there has been steady growth in adoption of the protocol. Support for the OAI-PMH is assumed for base-level interoperability between institutional repositories, and is also provided for many other collections of scholarly material. I will review the current landscape and reflect on some milestones and issues."

(You may want to download PowerPoint Viewer 2007 if you don’t have PowerPoint 2007.)

Summary of PerX Project Findings About OAI-PMH and Repository Metadata Challenges

Roderick A. MacLeod has posted a useful summary of some of the key documents and findings of the PerX (Pilot Engineering Repositories Xsearch) project on JISC-REPOSITORIES. He notes: "These documents may help to dispel possible myths concerning the ease of service provision, ease of reharvesting metadata, surfacing digital repository content in third part services, etc."

Here’s an excerpt from the project’s About page that describes it:

PerX is a two-year (June 2005-May 2007) JISC Digital Repositories Programme project, to develop a pilot service which provides subject resource discovery across a series of repositories of interest to the engineering learning and research community. This pilot will then be used as a test-bed to explore the practical issues that would be encountered when considering the possibility of a full scale subject resource discovery service.

(Prior posting about PerX.)

Wildfire Institutional Repository Software

One of the interesting findings of my brief investigation of open access repository software by country was the heavy use of Wildfire in the Netherlands.

Wildfire was created by Henk Druiven, University of Groningen, and it is used by over 70 repositories. It runs on a PHP, MySQL, and Apache platform.

Here is a brief description from In Between.

Wildfire is the software our library uses for our OAI compatible repositories. It is a flexible system for setting up a large number of repositories that at the same time allows them to be aggregated in groups. A group acts like yet another repository with its own harvest address and user interface.

There are several descriptive documents about Wildfire, but most are not in English.

Recent Object Reuse and Exchange (ORE) Documents

In a previous posting, I discussed the Open Archives Initiative’s Object Reuse and Exchange (ORE) project. ORE is worth watching closely.

Two new documents were released this January:

"Report of the January 2007 ORE-TC Meeting," which is: "A detailed report of the results of the meeting of OAI-ORE Technical Committee describing features and requirements of the ORE model and its context in the Web Architecture."
"Open Repositories 2007," which is: "A presentation describing OAI-ORE and progress based on the January 2007 ORE Technical Committee Meeting."

OAIster Hits 10,000,000 Records

Excerpt from the press release:

We live in an information-driven world—one in which access to good information defines success. OAIster’s growth to 10 million records takes us one step closer to that goal.

Developed at the University of Michigan’s Library, OAIster is a collection of digital scholarly resources. OAIster is also a service that continually gathers these digital resources to remain complete and fresh. As global digital repositories grow, so do OAIster’s holdings.

Popular search engines don’t have the holdings OAIster does. They crawl web pages and index the words on those pages. It’s an outstanding technique for fast, broad information from public websites. But scholarly information, the kind researchers use to enrich their work, is generally hidden from these search engines.

OAIster retrieves these otherwise elusive resources by tapping directly into the collections of a variety of institutions using harvesting technology based on the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. These can be images, academic papers, movies and audio files, technical reports, books, as well as preprints (unpublished works that have not yet been peer reviewed). By aggregating these resources, OAIster makes it possible to search across all of them and return the results of a thorough investigation of complete, up-to-date resources. . . .

OAIster is good news for the digital archives that contribute material to open-access repositories. "[OAIster has demonstrated that]. . . OAI interoperability can scale. This is good news for the technology, since the proliferation is bound to continue and even accelerate," says Peter Suber, author of the SPARC Open Access Newsletter. As open-access repositories proliferate, they will be supported by a single, well-managed, comprehensive, and useful tool.

Scholars will find that searching in OAIster can provide better results than searching in web search engines. Roy Tennant, User Services Architect at the California Digital Library, offers an example: "In OAIster I searched ‘roma’ and ‘world war,’ then sorted by weighted relevance. The first hit nailed my topic—the persecution of the Roma in World War II. Trying ‘roma world war’ in Google fails miserably because Google apparently searches ‘Rome’ as well as ‘Roma.’ The ranking then makes anything about the Roma people drop significantly, and there is nothing in the first few screens of results that includes the word in the title, unlike the OAIster hit."

OAIster currently harvests 730 repositories from 49 countries on 6 continents. In three years, it has more than quadrupled in size and increased from 6.2 million to 10 million in the past year. OAIster is a project of the University of Michigan Digital Library Production Service.

ScientificCommons.org: Access to Over 13 Million Digital Documents

ScientificCommons.org is an initiative of the Institute for Media and Communications Management at the University of St. Gallen. It indexes both metadata and full-text from global digital repositories. It uses OAI-PMH to identify relevant documents. The full-text documents are in PDF, PowerPoint, RTF, Microsoft Word, and Postscript formats. After being retrieved from their original repository, the documents are cached locally at ScientificCommons.org. It has indexed about 13 million documents from over 800 repositories.

Here are some additional features from the About ScientificCommons.org page:

Identification of authors across institutions and archives: ScientificCommons.org identifies authors and assigns them their scientific publications across various archives. Additionally the social relations between the authors will be extracted and displayed. . . .

Semantic combination of scientific information: ScientificCommons.org structures and combines the scientific data to knowledge areas with Ontology’s. Lexical and statistical methods are used to identify, extract and analyze keywords. Based on this processes ScientificCommons.org classifies the scientific data and uses it e.g. for navigational and weighting purposes.

Personalization services: ScientificCommons.org offers the researchers the possibilities to inform themselves about new publications via our RSS Feed service. They can customize the RSS Feed to a special discipline or even to personalized list of keywords. Furthermore ScientificCommons.org will provide an upload service. Every researcher can upload his publication directly to ScientificCommons.org and assign already existing publications at ScientificCommons.org to his own researcher profile.

DLF/NSDL OAI Best Practices Wiki

The Digital Library Federation and NSDL OAI and Shareable Metadata Best Practices Working Group’s OAI Best Practices Wiki has a number of resources relevant to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and related metadata issues.

The Tools and Strategies for Using and Enhancing/Extending the OAI Protocol section is of particular interest. It includes information about OAI-PMH data provider and service provider registries, software solutions and packages, and static repositories and gateways; metadata management and added value tools as well as OAI and character validation tools; and using SRU/W, collection description schema, and NSDL safe transforms.

Is OAI-PMH Too Labor-Intensive?

OAI-PMH permits metadata harvesting from disciplinary archives, institutional repositories, and other digital archives. This allows the creation of specialized search services using this harvested metadata. OAI-PMH is a key technology for the open access movement, but does it require too much human intervention?
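To make the harvesting step concrete, here is a minimal sketch of what a harvester does: build a `ListRecords` request and pull identifiers and titles out of the XML that comes back. The base URL, the sample response, and the helper names (`build_request`, `parse_records`) are my own illustrative assumptions, not taken from any particular repository or toolkit; the verb, the `metadataPrefix` argument, and the namespaces are the standard OAI-PMH 2.0 ones. No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_request(base_url, verb, params=None):
    """Return an OAI-PMH request URL (hypothetical helper)."""
    query = {"verb": verb}
    query.update(params or {})
    return base_url + "?" + urlencode(query)


# A hand-written sample response; a real one would come from the repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
        <datestamp>2007-04-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Eprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract (identifier, [titles]) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        results.append((identifier, titles))
    return results


url = build_request(
    "http://repository.example.org/oai",  # assumed base URL
    "ListRecords",
    {"metadataPrefix": "oai_dc"},
)
records = parse_records(SAMPLE_RESPONSE)
```

The protocol part really is this small. The labor comes afterwards, when the harvested records turn out to be inconsistent from one data provider to the next.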

An interesting message on JISC-REPOSITORIES by Santy Chumbe, Technical Officer of the PerX project, suggests that it may. He says:

We have learned that in despite of its relative simplicity, an OAI-PMH service can be harder to implement and maintain than expected. We have spent a lot of effort harvesting, normalising and maintaining metadata obtained from OAI data providers. In particular the issue of metadata quality is an important factor here. A summary of our experiences dealing with OAI-PMH can be found at http://eprints.rclis.org/archive/00006394. . . . A final report outlining the maintenance issues involved in the project is in progress but the experience gained suggests that successful ongoing maintenance of OAI targets would require a mixture of automated and manual approaches and that the level of ongoing maintenance is high.

STARGATE Final Report and Tools

The STARGATE project has issued its final report. Here’s a brief summary of the project from the Executive Summary:

STARGATE (Static Repository Gateway and Toolkit) was funded by the Joint Information Systems Committee (JISC) and is intended to demonstrate the ease of use of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) Static Repository technology, and the potential benefits offered to publishers in making their metadata available in this way. This technology offers a simpler method of participating in many information discovery services than creating fully-fledged OAI-compliant repositories. It does this by allowing the infrastructure and technical support required to participate in OAI-based services to be shifted from the data provider (the journal) to a third party and allows a single third party gateway provider to provide intermediation for many data providers (journals).

To support its work, the project developed tools and supporting documentation, which can be found below:

OAI’s Object Reuse and Exchange Initiative

The Open Archives Initiative has announced its Object Reuse and Exchange (ORE) initiative:

Object Reuse and Exchange (ORE) will develop specifications that allow distributed repositories to exchange information about their constituent digital objects. These specifications will include approaches for representing digital objects and repository services that facilitate access and ingest of these representations. The specifications will enable a new generation of cross-repository services that leverage the intrinsic value of digital objects beyond the borders of hosting repositories. . . . its real importance lies in the potential for these distributed repositories and their contained objects to act as the foundation of a new digitally-based scholarly communication framework. Such a framework would permit fluid reuse, refactoring, and aggregation of scholarly digital objects and their constituent parts—including text, images, data, and software. This framework would include new forms of citation, allow the creation of virtual collections of objects regardless of their location, and facilitate new workflows that add value to scholarly objects by distributed registration, certification, peer review, and preservation services. Although scholarly communication is the motivating application, we imagine that the specifications developed by ORE may extend to other domains.

OAI-ORE is being funded by the Andrew W. Mellon Foundation for a two-year period.

Presentations from the Augmenting Interoperability across Scholarly Repositories meeting are a good source of further information about the thinking behind the initiative as is the "Pathways: Augmenting Interoperability across Scholarly Repositories" preprint.

Will You Only Harvest Some?

The Digital Library for Information Science and Technology has announced DL-Harvest, an OAI-PMH service provider that harvests and makes searchable metadata about information science materials from the following archives and repositories:

ALIA e-prints
arXiv
Caltech Library System Papers and Publications
DLIST
Documentation Research and Training Centre
DSpace at UNC SILS
E-LIS
Metadata of LIS Journals
OCLC Research Publications
OpenMED@NIC
WWW Conferences Archive

DL-Harvest is a much needed, innovative discipline-based search service. Big kudos to all involved.

DLIST also just announced the formation of an advisory board.

The following musings, inspired by the DL-Harvest announcement, are not intended to detract from the fine work that DLIST is doing or from the very welcome addition of DL-Harvest to their service offerings.

Discipline-focused metadata can be harvested relatively easily from OAI-PMH-compliant systems that are organized along disciplinary lines (e.g., the entire archive/repository is discipline-based or an organized subset is discipline-based). No doubt these are very rich, primary veins of discipline-specific information, but how about the smaller veins and nuggets that are hard to identify and harvest because they are in systems or subsets that focus on another discipline?

Here’s an example. An economist, who is not part of a research center or other group that might have its own archive, writes extensively about the economics of the scholarly publishing business. This individual’s papers end up in the economics department section of his or her institutional repository and in EconWPA. They are highly relevant to librarians and information scientists, but will their metadata records be harvested via OAI-PMH for use in services like DL-Harvest, given that they are in the wrong conceptual bins (e.g., the wrong set, in the case of the IR)?
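A rough sketch of the problem, with invented records: a harvester that asks only for a library and information science set never sees the economist's paper, while a keyword filter over everything would catch it. The set names, identifiers, subjects, and helper functions below are all hypothetical. The `set` request argument and the colon-separated set hierarchy (a record in `econ:macro` is also in `econ`) are standard OAI-PMH.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Record:
    identifier: str
    set_specs: list
    subjects: list = field(default_factory=list)


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL, optionally restricted to one set."""
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        query["set"] = set_spec
    return base_url + "?" + urlencode(query)


def in_set(record, set_spec):
    """True if the record belongs to set_spec or one of its sub-sets."""
    return any(
        s == set_spec or s.startswith(set_spec + ":") for s in record.set_specs
    )


def matches_keywords(record, keywords):
    """True if any keyword appears in any subject (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    return any(k in subj.lower() for subj in record.subjects for k in wanted)


# Invented records standing in for one institutional repository.
RECORDS = [
    Record("oai:ir.example.edu:101", ["lis"], ["Digital libraries"]),
    Record("oai:ir.example.edu:202", ["econ"], ["Economics of scholarly publishing"]),
    Record("oai:ir.example.edu:303", ["econ:macro"], ["Monetary policy"]),
]

# Set-based harvest: only the record filed under the "lis" set.
by_set = [r.identifier for r in RECORDS if in_set(r, "lis")]

# Keyword filter over all records: also finds the economist's paper.
by_keyword = [
    r.identifier
    for r in RECORDS
    if matches_keywords(r, ["scholarly publishing", "digital libraries"])
]
```

The catch is that the keyword approach means harvesting the whole repository and trusting its subject metadata, which is exactly where the quality and maintenance worries come in.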

Coleman et al. point to one solution in their intriguing "Integration of Non-OAI Resources for Federated Searching in DLIST, an Eprints Repository" paper. But (lots of hand waving here), even if automatic metadata extraction were an easy and simple way to supplement conventional OAI-PMH harvesting, the bottom-line question is: how good is good enough? In other words, what’s an acceptable level of accuracy for the automatic metadata extraction? (I won’t even bring up the dreaded "controlled vocabulary" notion.)

No doubt this problem falls under the 80/20 Rule, and the 20 is most likely in the low-hanging fruit OAI-PMH-wise, but wouldn’t it be nice to have more fruit?

More on OhioLINK’s Digital Resource Commons

David F. Kohl has self-archived a PowerPoint presentation about the DRC at E-LIS. It’s called "Cooperating Beyond the ‘Buying Club’: Digital Resource Commons (DRC): Making the Impossible Possible in Ohio."

To quote from the abstract:

Each institution can ‘brand’ itself in the system and may host a discrete and customized interface to all of its content. To the end user it will appear as an institutional resource as if it were hosted on your own servers. There will also be a collective OhioLINK level branding and ability for searches to retrieve across the institutional collections. . . . You will have complete control of your own content and how it is accessed. Multi-tiered security levels will allow your content to be shared only to the extent desired. . . .

Alternatively content can be restricted to an individual department, to an institution, or to the OhioLINK membership. Each institution can set its own policies governing the content in its repositories. Likewise custom workflows can be established to make the most of the personnel involved in each project and expedite the content creation and capture process. The service will include robust and flexible cataloging tools to aid in the creation of records that can be searched and browsed effectively by all types of users. Catalog records can be exported in international standard XML formats such as the Open Archives Initiative Protocol for Metadata Harvesting. Through OhioLINK’s unique collaboration with the Ohio Supercomputer Center your content is stored on enterprise class servers and storage networks. . . . A huge storage area network allows virtually unlimited storage space on our disks. . . . Programming or system administration skills and experience are not required. The system is flexible and adaptable and provides services superior to ‘DSpace’ and ‘ContentDM’ without the associated costs.

OhioLINK’s Digital Resource Commons

Peter Murray, Assistant Director of Multimedia Systems at OhioLINK, recently posted a job announcement on LITA-L (I’d link, but given the way ALA safeguards access to its lists, it’s simply impossible) that brought to my attention a bold OhioLINK project called the Digital Resource Commons, which is part of an even bolder project called the Ohio Digital Commons for Education. The quote from the job ad below describes the Digital Resource Commons. An earlier part of the ad indicates that Fedora will be used as the DRC’s platform.

OhioLINK’s Digital Resource Commons (DRC) is an Ohio Board of Regents-funded project to create a federated repository service that ingests, preserves, presents, and mediates administration of the educational and research materials of participating institutions. With the capability to store and deliver a virtually unlimited variety of digital file types and formats (including text, data sets, image, audio, video, streaming video, multimedia presentations, animations, etc.) the DRC is positioned to capture digital content from student and faculty researchers as it is produced and return it to users of the DRC upon request. The DRC offers wide and flexible control to member institutions and the communities within institution to define how content is added, preserved, and displayed to repository users. With federated community administration features, lead contacts at member institutions can create communities and delegate up to a complete subset of their privileges within the system to the editors/moderators of those new communities. The ability to scope and brand content to a particular community and institution is offered while retaining the ability to search for content across the entire repository. As both an Open Archives Initiative Data Provider and Service Provider, the DRC is positioned to become the premier point for the discovery of knowledge by and about Ohio’s scholars. In conjunction with the other parts of the Ohio Board of Regents grant funding, the DRC is one piece of a larger effort to build the Ohio Digital Commons for Education—a powerful vision for the future of learning and research in the state of Ohio.

The quote below from the DRC Web site describes the Ohio Digital Commons for Education.

The Digital Resource Commons is one of three projects funded by an Ohio Board of Regents Technology Initiatives grant collectively called the Ohio Digital Commons for Education (ODCE). The three components—this resource repository, the state-wide licensing and development of course management systems (WebCT and Blackboard), and a common access control mechanism (Shibboleth)—combine to offer a powerful vision for learning and research for the state of Ohio.

Impressive. As Daniel Hudson Burnham said: "Make no little plans; they have no magic to stir men’s blood and probably themselves will not be realized."

New OAI-PMH Guidelines

The Open Archives Initiative has issued Conveying Rights Expressions about Metadata in the OAI-PMH Framework, a new Implementation Guidelines document aimed at clarifying the important issue of how to express rights information about harvested metadata in OAI-PMH.

From the document:

Data providers might want to associate rights expressions with the metadata to indicate how it may be used, shared, and modified after it has been harvested. This specification defines how rights information pertaining to the metadata should be included in responses to OAI-PMH requests. The described technique:

Is based on delivering rights expressions that apply to metadata included in OAI-PMH responses. It uses the optional containers that have been defined as part of the OAI-PMH specification. As a result, no changes to the protocol are made, and compatibility with all existing OAI-PMH implementations is maintained.

Is not tied to any particular rights expression language. This document makes use of Creative Commons and GNU licenses, but the use of these specific languages is for illustrative purposes only.
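For a sense of how a harvester might read such rights information, here is a small sketch that looks inside a record's optional `about` container. The sample record is hand-written, and the `rights` namespace and `rightsReference` element are reproduced from my recollection of the guidelines, so treat them as assumptions and check the document itself before relying on them. The license URL is only an example value, and `rights_references` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    # Assumed namespace for the rights container; verify against the guidelines.
    "rights": "http://www.openarchives.org/OAI/2.0/rights/",
}

# Hand-written sample record with a rights expression in its about container.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1</identifier>
    <datestamp>2007-04-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Eprint</dc:title>
    </oai_dc:dc>
  </metadata>
  <about>
    <rights xmlns="http://www.openarchives.org/OAI/2.0/rights/">
      <rightsReference ref="http://creativecommons.org/licenses/by/2.5/"/>
    </rights>
  </about>
</record>"""


def rights_references(record_xml):
    """Return the ref URLs of rights expressions attached to one record."""
    record = ET.fromstring(record_xml)
    path = "oai:about/rights:rights/rights:rightsReference"
    return [el.get("ref") for el in record.iterfind(path, NS)]


refs = rights_references(SAMPLE_RECORD)
```

Because `about` is optional, a harvester that ignores it keeps working unchanged, which is the compatibility point the guidelines make.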

Essential reading for OAI-PMH geeks.