In "Google's Book Search: A Disaster for Scholars," Geoffrey Nunberg examines the limitations of Google Book Search's metadata, which he calls "a train wreck: a mishmash wrapped in a muddle wrapped in a mess."
Category: Metadata
Authority Control for Repositories: Names Project: Final Report
JISC has released the Names Project: Final Report.
Here's an excerpt:
The Names Project began in July 2007. It was funded to investigate requirements for a name authority service for UK repositories. Prototype name authority software has been developed as part of this work and a number of connections have been made with UK stakeholders and with international projects working in a similar space.
Plugins to Import E-Print Metadata from arXiv into an EPrints Repository
The IncReASe (Increasing Repository Content through Automation and Services) project has released four plugins to facilitate importing e-print metadata from arXiv into an EPrints repository.
Here's an excerpt from the plugins' Web page:
Potentially, content in arXiv could provide a "quick win" for repository population. No arXiv depositor we have talked with to date has objected to our importing their work into WRRO [White Rose Research Online]. From discussions with arXiv users, we are assuming that local deposit in WRRO with a "push" of data to arXiv may be difficult to achieve—we'd need to demonstrate some clear benefit to the depositor. arXiv serves its community well. A more likely model may be that arXiv users continue to deposit as now but IRs "harvest" data from arXiv (or perhaps arXiv will develop a facility to push material into local IRs).
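To make the harvesting model concrete, here's a minimal sketch of pulling Dublin Core records from arXiv's public OAI-PMH interface (the endpoint URL and set name follow arXiv's standard conventions, but treat this as an illustration of the technique, not the IncReASe plugins themselves):

```python
# Minimal OAI-PMH harvest of Dublin Core records from arXiv.
# Illustrative sketch only -- not the IncReASe EPrints plugins.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "http://export.arxiv.org/oai2"
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest(set_spec="cs", max_records=5):
    """Fetch a few oai_dc records from one arXiv set, yielding (title, creators)."""
    url = f"{OAI}?verb=ListRecords&metadataPrefix=oai_dc&set={set_spec}"
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for i, record in enumerate(tree.iter("{http://www.openarchives.org/OAI/2.0/}record")):
        if i >= max_records:
            break
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        creators = [c.text for c in record.findall(".//dc:creator", NS)]
        yield title, creators

if __name__ == "__main__":
    for title, creators in harvest():
        print(title, "--", "; ".join(creators))
```

A real importer would also page through OAI-PMH resumption tokens and map the records into EPrints' own schema.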
"Saying What We Do—Doing What We Say: Preservation Issues (Metadata and Otherwise) in Institutional Repositories"
Sarah L. Shreeves has self-archived her presentation "Saying What We Do—Doing What We Say: Preservation Issues (Metadata and Otherwise) in Institutional Repositories" in IDEALS.
Streamline Integrating Repository Function with Work Practice: Tools to Facilitate Personal E-Administration, Final Report v1.3
JISC has released Streamline Integrating Repository Function with Work Practice: Tools to Facilitate Personal E-Administration, Final Report v1.3.
Here's an excerpt:
The tools developed include an automatic metadata generation tool that completes as much of the metadata as possible, from documentation associated with a learning object, including suggesting key words to the user; and resource discovery tools, which recommend additional resources based on closeness of objects to the original search results. In addition, we contributed to a variety of widgets, developed with the PERSoNA project, to demonstrate the use of social networking tools to promote sharing of resources through the repository.
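The report doesn't reproduce the tool's internals, but keyword suggestion of this kind often starts as a simple term-frequency heuristic over the associated documentation. Here's a toy sketch of that idea (my illustration, not the Streamline code):

```python
# Naive keyword suggestion by term frequency, ignoring common stopwords.
# A sketch of the general technique only -- not the Streamline tool itself.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for",
             "on", "with", "that", "this", "are", "as", "by", "be", "it"}

def suggest_keywords(text, n=5):
    """Return the n most frequent non-stopword terms as candidate keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

print(suggest_keywords(
    "Repository metadata generation: the tool suggests metadata keywords "
    "from documentation associated with a learning object."))
```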
“RKBExplorer: Repositories, Linked Data and Research Support”
Hugh Glaser, Ian Millard, and Les Carr have self-archived "RKBExplorer: Repositories, Linked Data and Research Support" in the ECS EPrints Repository.
Here's an excerpt:
RKBExplorer (http://rkbexplorer.com/) is a system for publishing Linked Data to Semantic Web standards, also providing a browser that allows users to explore this interlinked Web of Data, primarily in the domain of scientific endeavour. As part of the activity, we have harvested the metadata from a number of the larger ePrints repositories into http://eprints.rkbexplorer.com, and republished it as Linked Data. This allows the RKBExplorer browser to present a unified view of these repositories and related data from other sources such as dblp and dbpedia (a Semantic Web version of Wikipedia). Users can thus investigate concepts related to the ePrints people and articles, such as related people, projects and institutions.
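Linked Data published this way can be consumed with any RDF toolkit. Here's a minimal sketch using the third-party rdflib package against a DBpedia resource (an illustration of consuming Linked Data generally, not of RKBExplorer's own interfaces):

```python
# Fetch and inspect a Linked Data resource with rdflib (pip install rdflib).
# Illustrates consuming Linked Data generally -- not RKBExplorer's own API.
from rdflib import Graph

g = Graph()
# DBpedia publishes an RDF description of each resource at a stable URI;
# rdflib negotiates an RDF serialization and parses it into a graph.
g.parse("http://dbpedia.org/resource/Semantic_Web")

# Print a few (predicate, object) pairs describing the resource.
for i, (s, p, o) in enumerate(g):
    if i >= 10:
        break
    print(p, "->", o)
```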
“Repurposing ProQuest Metadata for Batch Ingesting ETDs into an Institutional Repository”
Shawn Averkamp and Joanna Lee have published "Repurposing ProQuest Metadata for Batch Ingesting ETDs into an Institutional Repository" in the latest issue of the Code4Lib Journal.
Here's an excerpt:
This article describes the workflow used by the University of Iowa Libraries to populate their institutional repository and their catalog with the data collected by ProQuest UMI Dissertation Publishing during the submission of students' theses and dissertations. Repurposing the metadata from ProQuest allowed the University of Iowa Libraries to streamline the process for ingesting theses and dissertations into their institutional repository. The article includes a discussion of the benefits and limitations of the workflow described.
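As a rough illustration of the crosswalk at the heart of such a workflow, here's a sketch mapping ProQuest-style ETD metadata to simple Dublin Core. The element names follow the general ProQuest pattern but should be checked against the actual schema; this is not the Iowa code:

```python
# Sketch of crosswalking ProQuest-style ETD metadata to simple Dublin Core.
# Element names below (DISS_title etc.) follow the general ProQuest pattern
# but are illustrative; check the actual schema before relying on them.
import xml.etree.ElementTree as ET

PROQUEST_XML = """
<DISS_submission>
  <DISS_description>
    <DISS_title>Example Thesis Title</DISS_title>
  </DISS_description>
  <DISS_authorship>
    <DISS_author><DISS_surname>Doe</DISS_surname><DISS_fname>Jane</DISS_fname></DISS_author>
  </DISS_authorship>
</DISS_submission>
"""

def to_dublin_core(xml_text):
    """Map a ProQuest-style submission record to a flat Dublin Core dict."""
    root = ET.fromstring(xml_text)
    return {
        "dc:title": root.findtext(".//DISS_title"),
        "dc:creator": "{}, {}".format(
            root.findtext(".//DISS_surname"), root.findtext(".//DISS_fname")),
        "dc:type": "thesis",
    }

print(to_dublin_core(PROQUEST_XML))
```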
Vocabulary Mapping Framework Announced
A cooperative project, the Vocabulary Mapping Framework, is mapping major metadata standards (CIDOC CRM, DCMI, DDEX, DOI, FRBR, MARC21, LOM, ONIX, and RDA).
Here's an excerpt from the press release:
The new vocabulary is not intended as a replacement for any existing standards, but as an aid to interoperability, whether automatic or human-mediated. The expanded Framework will include mappings of terms from code lists or allowed value sets in the existing standards to the RDA/ONIX vocabulary, enabling the computation of "best fit" mappings between any pairing of standards. . . .
The work will result in:
- a mapping of vocabularies from the source standards to support the building of crosswalks and transformations between any of them;
- a definitive reference set which editors can draw on when creating and developing standards;
- a downloadable RDF/OWL ontology to support the interchange of metadata content between these major standards, which will be useful to enable automated reuse of metadata from different sources and schemas, to improve the quality and access and reduce the cost of metadata;
- a governance scheme to oversee further development.
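The "best fit" computation rests on routing every term through the hub vocabulary instead of maintaining pairwise crosswalks between all the standards. Here's a toy sketch of that pivot idea (the term mappings are invented shorthand; the real VMF matrix is far richer):

```python
# Toy pivot-vocabulary mapping: each standard maps its terms to a hub
# vocabulary, and pairwise crosswalks are derived by composing the maps.
# Terms below are invented shorthand; the real VMF matrix is far richer.

TO_HUB = {
    "marc21": {"100": "creator"},   # MARC 100 (main entry, personal name)
    "onix": {"A01": "creator"},     # ONIX contributor role A01 (by author)
}
FROM_HUB = {
    "dcmi": {"creator": "dc:creator"},
}

def crosswalk(term, source, target):
    """Map a term from one standard to another via the hub vocabulary."""
    hub_concept = TO_HUB[source][term]
    return FROM_HUB[target][hub_concept]

print(crosswalk("100", "marc21", "dcmi"))   # -> dc:creator
print(crosswalk("A01", "onix", "dcmi"))     # -> dc:creator
```

With n standards this needs only 2n maps to and from the hub, rather than n(n-1) pairwise crosswalks.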
Creating Catalogues: Bibliographic Records in a Networked World
The Research Information Network has released Creating Catalogues: Bibliographic Records in a Networked World.
Here's an excerpt from the announcement:
Against this background the RIN report: Creating Catalogues: bibliographic records in a networked world, is a very timely overview of the whole process of bibliographic record production for printed and electronic books, and for scholarly journals and journal articles. This report follows the production of these data from publisher through a range of intermediaries to the end user. Whilst there are pressures to make these data more freely available, each player in the process has its own motivations and business models in creating, adding to, using or re-using bibliographic data, all of which need to be considered.
We find that there would be considerable benefits if libraries, along with other organisations in the supply chain, were to operate more at the network level but that there are significant barriers in the way of making significant moves in that direction.
Creating Catalogues cannot attempt to solve all the problems in the way of making bibliographic data more freely available for re-use and innovation, or of eliminating wasteful duplication of effort. Our objective is to clarify the key issues and to stimulate debate on possible ways forward. Creating Catalogues provides a number of key recommendations and the RIN will work with the academic library community and other key stakeholders in the supply chain to raise awareness and understanding of the issues raised in this report, of the benefits to be achieved by moving to new models, and of how we might overcome the barriers to achieving them.
Webcast: FRBR: Things You Should Know. . .
The Library of Congress has released the FRBR: Things You Should Know. . . Webcast presented by Barbara Tillett.
Here's an excerpt from the description:
This presentation for non-catalogers is intended to present basic concepts and benefits of using the FRBR conceptual model (Functional Requirements for Bibliographic Records) in resource discovery systems.
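For readers new to the model, FRBR's Group 1 entities (Work, Expression, Manifestation, Item) form a simple hierarchy, which can be sketched as plain data structures (my illustration, not anything from the webcast):

```python
# The FRBR Group 1 hierarchy as plain data classes: a Work is realized
# through Expressions, embodied in Manifestations, exemplified by Items.
from dataclasses import dataclass, field

@dataclass
class Item:            # a single physical or digital copy
    location: str

@dataclass
class Manifestation:   # a particular publication (edition, format)
    format: str
    items: list = field(default_factory=list)

@dataclass
class Expression:      # a particular realization (e.g., a translation)
    language: str
    manifestations: list = field(default_factory=list)

@dataclass
class Work:            # the abstract intellectual creation
    title: str
    expressions: list = field(default_factory=list)

hamlet = Work("Hamlet", [
    Expression("en", [Manifestation("print", [Item("Main Library, PR2807")])]),
    Expression("fr", [Manifestation("ebook", [Item("ebook platform")])]),
])
print(hamlet.title, "has", len(hamlet.expressions), "expressions")
```

A FRBR-aware discovery system can then collocate all editions and translations of a work under one result instead of scattering them.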
IR Deposit Using Embedded Document Metadata: Deposit Plait: Final Report
JISC has released Deposit Plait: Final Report.
Here's an excerpt:
The aim of the Deposit Plait project was to examine the potential for easing the deposit of journal articles into institutional repositories by making use of any metadata embedded within the document properties of the document being deposited. . . .
The first stage of the project was to see how easy it is to extract this metadata. The target file formats that the project worked with were the Open Document Format (as created by OpenOffice), OpenXML (as created by Microsoft Office 2007), and .doc files (as created by versions of Microsoft Office from 97 to 2003). There are standard open source software libraries that can extract both standard and custom metadata fields from each of these file formats.
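For OpenXML files, for example, the core document properties live in a docProps/core.xml part inside the .docx zip container, so extraction needs only the standard library. Here's a sketch of that first stage (an illustration, not the Deposit Plait code):

```python
# Extract embedded Dublin Core-style properties from an OpenXML (.docx)
# file: the package is a zip whose docProps/core.xml part holds the
# title, creator, etc. A sketch, not the Deposit Plait implementation.
import sys
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def docx_metadata(path):
    """Return the title/creator/subject embedded in a .docx file."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", default="", namespaces=NS),
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
        "subject": root.findtext("dc:subject", default="", namespaces=NS),
    }

if __name__ == "__main__":
    print(docx_metadata(sys.argv[1]))
```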
The second stage of the project was to see how easy it is to use extracted metadata as search terms in order to search for a more complete metadata record. Where the item being deposited into the repository has been in existence for some time (a 'retrospective deposit'), the metadata found can be used to perform a search. Different search methods were implemented as examples, including using search APIs and screen scraping from search services. Whilst the method works well, there are the normal licensing issues to consider, including whether licences cover the user for this type of metadata re-use.
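Any bibliographic search API can play that second-stage role. As one illustration of the pattern (not a service the report names), here's a sketch querying CrossRef's present-day public REST API with an extracted title:

```python
# Use an extracted title as a search key against a bibliographic API to
# find a fuller metadata record. CrossRef's public REST API is used here
# purely as an example of the pattern the project describes.
import json
import urllib.parse
import urllib.request

def search_by_title(title):
    """Return the best-matching CrossRef work record for a title, if any."""
    url = ("https://api.crossref.org/works?rows=1&query.bibliographic="
           + urllib.parse.quote(title))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    items = data["message"]["items"]
    return items[0] if items else None

match = search_by_title("Repurposing ProQuest Metadata for Batch Ingesting ETDs")
if match:
    print(match.get("DOI"), "--", match.get("title"))
```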
The project concluded by creating an online demonstration system. In contrast to a normal repository deposit, where the user enters metadata and then uploads a file, this system requires the user to first upload a file. The metadata is extracted, and the user is allowed to choose which (one or more) of the fields to use as the basis of a search. The search is then initiated and matching records returned. The user can then pick and choose fields from the results to 'plait' together their final metadata record.
Library of Congress Makes ID.LOC.GOV Authorities and Vocabularies Service Publicly Available
The Library of Congress has made its ID.LOC.GOV authorities and vocabularies service publicly available.
Here's an excerpt from the announcement:
The Library of Congress has opened its ID.LOC.GOV web service, Authorities and Vocabularies, with the Library of Congress Subject Headings (LCSH) as the initial offering. The primary goal of this service is to enable machines to programmatically access data at the Library of Congress but the web interface also provides simple user access. We view this service as a step toward exposing and interconnecting vocabulary and thesaurus data via URLs. For LCSH, we are fortunate to have been able to link terms to a similar service provided in Europe for RAMEAU, a French subject heading vocabulary closely coordinated with LCSH.
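Here's a minimal sketch of that programmatic access: resolving an LCSH heading by label and fetching a machine-readable description. The /label/ lookup path and JSON content negotiation reflect the service's conventions as I understand them; verify against the current ID.LOC.GOV documentation before depending on them:

```python
# Look up an LCSH heading on ID.LOC.GOV by label, then read a JSON
# description of the concept. The /label/ path and JSON content
# negotiation reflect the service's conventions as I understand them;
# check the current documentation before relying on the specifics.
import json
import urllib.parse
import urllib.request

def lcsh_lookup(label):
    """Resolve an LCSH label to its concept URI and machine-readable data."""
    url = ("https://id.loc.gov/authorities/subjects/label/"
           + urllib.parse.quote(label))
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:   # follows the redirect
        return resp.url, json.load(resp)

uri, data = lcsh_lookup("Metadata")
print(uri)
```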
OCLC Releases Networking Names Report
OCLC has released the Networking Names report.
Here's an excerpt from the press release:
This report identifies the necessary components of a "Cooperative Identities Hub" that would address the problem space in the research community and have the most impact across different target audiences.
The fifteen members of the RLG Partnership Networking Names Advisory Group developed fourteen use case scenarios around academic libraries and scholars, archivists and archival users, and institutional repositories that provide the context in which different communities would benefit from aggregating information about persons and organizations, corporate and government bodies, and families, and making it available on a network level.
The report summarizes the group's recommendations on the functions and attributes needed to support the use case scenarios.
Advancing the State of the Art in Distributed Digital Libraries: Accomplishments of and Lessons Learned from the Digital Library Federation Aquifer Metadata Working Group
DRAFT: TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices
DRAFT: TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices is now available for comment until May 6, 2009.
Here's an excerpt from the comment survey:
The revised "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices," currently in draft form, contain updated versions of the widely adopted encoding 'levels'—from fully automated conversion to content analysis and scholarly encoding. They also contain a substantially revised section on the TEI Header, designed to support interoperability between text collections and the use of complementary metadata schemas such as MARC and METS. The new Guidelines also reflect an organizational shift. Originally authored by the DLF-sponsored TEI Task Force, the current revision work is a partnership between members of the Task Force and the TEI Libraries SIG. As a result of this partnership, responsibility for the Guidelines will migrate to the SIG, allowing closer work with the TEI Consortium as a whole and a stronger basis for advocating for the needs of libraries in future TEI releases.
OECD: We Need Publishing Standards for Datasets and Data Tables
OECD has released We Need Publishing Standards for Datasets and Data Tables.
Here's an excerpt:
Datasets are a significant part of the scholarly record and are being published more and more frequently, either formally or informally. Many publishers are beginning to link to them from their journals and authors are trying to cite them in their articles. Librarians would like a way to manage them alongside other publications. In short, they need to be integrated into the scholarly information system so that authors, readers and librarians can use, find and manage them as easily as they do working papers, journal articles and books.
In this paper, OECD is proposing some standards for citing and bibliographic management of datasets and data tables. OECD is currently building a new online publishing platform which will host working papers, journals, books, tables and datasets. Due to be launched in mid-2009, this platform will use the standards proposed above. Librarians will be offered MARC 21 records for datasets, alongside records for OECD books and periodicals. Users of the platform will be invited to download citations for datasets and tables in a form compatible with popular bibliographic management systems. All the DOIs for the datasets and tables will be deposited with CrossRef, ready for other publishers to use.
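Once dataset DOIs are deposited with CrossRef, the DOI system's general machinery applies to them; for example, a formatted citation can be requested from the resolver by content negotiation. A sketch (using CrossRef's well-known test DOI, not an OECD dataset DOI):

```python
# Ask the DOI resolver for a formatted citation via content negotiation,
# a general feature of the CrossRef/DataCite DOI infrastructure. The DOI
# below is CrossRef's well-known test DOI, purely for illustration.
import urllib.request

def format_citation(doi, style="apa"):
    """Fetch a human-readable citation for a DOI in the given CSL style."""
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": f"text/x-bibliography; style={style}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(format_citation("10.5555/12345678"))
```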
OCLC: A Symposium for Publishers and Librarians
OCLC has released presentations and other documents related to its recent event, A Symposium for Publishers and Librarians.
Here's an excerpt from the symposium report:
On March 18th and 19th representatives from libraries, the publisher supply chain and organizations supporting these communities met at OCLC's Conference Center in Dublin, Ohio to discuss metadata needs, practices, lifecycle and economics across the communities and to explore opportunities for change.
“Name Authority Control in Institutional Repositories”
Dorothea Salo has self-archived "Name Authority Control in Institutional Repositories" in MINDS@UW.
Here's an excerpt:
Neither the standards nor the software underlying institutional repositories anticipated performing name authority control on widely disparate metadata from highly unreliable sources. Without it, though, both machines and humans are stymied in their efforts to access and aggregate information by author. Many organizations are awakening to the problems and possibilities of name authority control, but without better coordination, their efforts will only confuse matters further. Local heuristics-based name-disambiguation software may help those repository managers who can implement it. For the time being, however, most repository managers can only control their own name lists as best they can after deposit while they advocate for better systems and services.
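A local heuristics-based approach usually begins with nothing fancier than normalizing name strings so obvious variants collapse to a single key. Here's a toy sketch of that first step (real disambiguation needs much more, e.g., affiliations, dates, and coauthor networks):

```python
# Toy name-key heuristic: normalize author strings so obvious variants
# ("John A. Smith", "Smith, John A.", "smith, j a") collapse to one key.
# Real disambiguation needs far more evidence than the string itself.
import re

def name_key(raw):
    """Reduce a name string to a crude surname:initials matching key."""
    raw = raw.strip().lower()
    if "," in raw:                      # "surname, forenames"
        surname, forenames = [p.strip() for p in raw.split(",", 1)]
    else:                               # "forenames surname"
        parts = raw.split()
        surname, forenames = parts[-1], " ".join(parts[:-1])
    initials = "".join(w[0] for w in re.findall(r"[a-z]+", forenames))
    return f"{surname}:{initials}"

for variant in ["John A. Smith", "Smith, John A.", "smith, j a"]:
    print(variant, "->", name_key(variant))
```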
CrossRef’s Geoffrey Bilder on Author Identifiers
Gobbledygook has interviewed CrossRef's Geoffrey Bilder about author identifiers.
Here's an excerpt:
Of course, lots of the same issues can be raised with CrossRef, right? What guarantees that CrossRef won't become evil and co-opt all of our identities? This, of course, is the big fear underlying the knee-jerk reaction against "centralized systems" in favor of "distributed systems". The problem with this, as I mentioned in the FriendFeed thread, is that my personal and unfashionable observation is that "distributed" begets "centralized." For every distributed service created, we've then had to create a centralized service to make it usable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.). This gets us back to square one and makes me think the real issue is: how do you make the centralized system that eventually emerges accountable? This is, of course, a social issue more than a technical issue, and involves making sure that whatever entity emerges has clearly defined data portability policies and a "living will" that attempts to guarantee that the service can be run in perpetuity, even if by another organization. For the record, I don't think adopting the slogan "don't be evil" is enough ;).
“On the Communication of Scientific Results: The Full-Metadata Format”
Moritz Riede, Rico Schueppel, Kristian O. Sylvester-Hvid, et al. have self-archived "On the Communication of Scientific Results: The Full-Metadata Format" in arXiv.
Here's an excerpt:
In this paper, we introduce a scientific format for text-based data files, which facilitates storing and communicating tabular data sets. The so-called Full-Metadata Format builds on the widely used INI-standard and is based on four principles: readable self-documentation, flexible structure, fail-safe compatibility, and searchability. As a consequence, all metadata required to interpret the tabular data are stored in the same file, allowing for the automated generation of publication-ready tables and graphs and the semantic searchability of data file collections. The Full-Metadata Format is introduced on the basis of three comprehensive examples. The complete format and syntax are given in the appendix.
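Since the format builds on the INI standard, a standard config parser gets you surprisingly far. Here's a rough sketch of the single-file idea (the section and key names are invented for illustration; the paper defines the actual FMF syntax):

```python
# A rough sketch of the single-file idea: INI-style metadata sections
# followed by the tabular data itself. Section and key names here are
# invented for illustration; the paper defines the real FMF syntax.
import configparser

SAMPLE = """\
[*reference]
title: Solar cell IV measurement
creator: Jane Doe

[*data definitions]
voltage: V [V]
current: I [mA]

[*data]
0.0\t12.1
0.1\t11.8
"""

# Split off the data block, then let configparser handle the metadata.
header, _, table = SAMPLE.partition("[*data]\n")
config = configparser.ConfigParser()
config.read_string(header)

metadata = dict(config["*reference"])
columns = list(config["*data definitions"].keys())
rows = [tuple(map(float, line.split())) for line in table.strip().splitlines()]

print(metadata["title"], columns, rows)
```

Because the column definitions and provenance travel with the numbers, the file remains interpretable on its own, which is the format's central point.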
Indiana University Digital Library Program Releases IN Harmony Sheet Music Cataloging Tool
The Indiana University Digital Library Program has released the IN Harmony Sheet Music Cataloging Tool.
Here's an excerpt from the tool's page:
The IN Harmony Sheet Music Cataloging Tool is an open source tool developed by the Indiana University Digital Library Program with funding from the Institute of Museum and Library Services as part of the IN Harmony: Sheet Music From Indiana project. It has been designed to assist libraries, archives, museums, and individual collectors in describing their sheet music collections in a robust and standards-based way. This is a production system of the Indiana University Digital Library Program and was used to catalog more than 10,000 pieces of sheet music for the IN Harmony project.
The tool collects descriptive metadata about sheet music and exports it in the MODS, simple Dublin Core, and OAI-PMH Static Repository formats.
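As an illustration of the simplest of those export targets, a simple Dublin Core record can be serialized in a few lines (a generic sketch, not the tool's exporter; the sample values describe a real Paul Dresser song of the sort IN Harmony catalogs):

```python
# Build a simple (unqualified) Dublin Core XML record -- a generic
# sketch of the simplest export target, not the IN Harmony exporter.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def dc_record(fields):
    """Serialize (element, value) pairs as a simple Dublin Core record."""
    root = ET.Element("record")
    for name, value in fields:
        el = ET.SubElement(root, f"{{{DC}}}{name}")
        el.text = value
    return ET.tostring(root, encoding="unicode")

print(dc_record([
    ("title", "My Gal Sal"),
    ("creator", "Dresser, Paul"),
    ("date", "1905"),
    ("type", "notated music"),
]))
```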
“Aligning METS with the OAI-ORE Data Model”
Jerome P. McDonough has made "Aligning METS with the OAI-ORE Data Model" available in IDEALS.
Here's an excerpt:
The Open Archives Initiative Object Reuse and Exchange (OAI-ORE) specifications provide a flexible set of mechanisms for transferring complex data objects between different systems. In order to serve as an exchange syntax, OAI-ORE must be able to support the import of information from localized data structures serving various communities of practice. In this paper, we examine the Metadata Encoding & Transmission Standard (METS) and the issues that arise when trying to map from a localized structural metadata schema into the OAI-ORE data model and serialization syntaxes.
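On the ORE side, the target of such a mapping is essentially an ore:Aggregation that ore:aggregates the object's component resources, described by a resource map. Here's a minimal sketch with the third-party rdflib package (a generic resource map, not the paper's METS alignment):

```python
# Build a minimal OAI-ORE resource map with rdflib (pip install rdflib):
# an ore:Aggregation aggregating two component resources. A generic
# illustration of the ORE data model, not McDonough's METS mapping.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("ore", ORE)

rem = URIRef("http://example.org/rem/1")          # the resource map
agg = URIRef("http://example.org/aggregation/1")  # the aggregation

g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, URIRef("http://example.org/files/page1.tif")))
g.add((agg, ORE.aggregates, URIRef("http://example.org/files/page2.tif")))

print(g.serialize(format="turtle"))
```

The difficulty McDonough examines is that METS structural maps carry nested, ordered hierarchy that this flat aggregation model does not directly express.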
Special Issue of Library Trends on Institutional Repositories
The latest issue of Library Trends (57, no. 2, Fall 2008) is about institutional repositories.
Here are the articles (links are to article preprints):
- "Introduction: Institutional Repositories: Current State and Future"
- "Innkeeper at the Roach Motel"
- "Institutional Repositories in the UK: The JISC Approach"
- "Strategies for Institutional Repository Development: A Case Study of Three Evolving Initiatives"
- "Perceptions and Experiences of Staff in the Planning and Implementation of Institutional Repositories"
- "Institutional Repositories and Research Data Curation in a Distributed Environment"
- "At the Watershed: Preparing for Research Data Management and Stewardship at the University of Minnesota Libraries"
- "Case Study in Data Curation at Johns Hopkins University"
- "Describing Scholarly Works with Dublin Core: A Functional Approach"
- "The 'Wealth of Networks' and Institutional Repositories: MIT, DSpace, and the Future of the Scholarly Commons"
- "Leveraging Short-term Opportunities to Address Long-term Obligations: A Perspective on Institutional Repositories and Digital Preservation Programs"
- "Shedding Light on the Dark Data in the Long Tail of Science"
Andy Powell on Persistent URIs and Digital Repositories
In “How Uncool? Repository URIs. . .,” Andy Powell analyzes URI structure in 107 repositories to determine whether their items’ URIs are likely to be persistent.
Here's an excerpt:
So what is an uncool URI? An uncool URI is one that is unlikely to be persistent, typically because the person who first assigned it didn’t think hard enough about likely changes in organisational structure, policy or technology and the impact that changes in those areas might have on the persistence of the URI into the future.
Automatic Metadata Generation for Repositories: MetaTools: Final Report
JISC has released MetaTools: Final Report.
Here's an excerpt from the announcement:
Automatic metadata generation has sometimes been posited as a solution to the 'metadata bottleneck' that repositories and portals are facing as they struggle to provide resource discovery metadata for a rapidly growing number of new digital resources. Unfortunately there is no registry or trusted body of documentation that rates the quality of metadata generation tools or identifies the most effective tool(s) for any given task.
The aim of the first stage of the project was to remedy this situation by developing a framework for evaluating tools used for the purpose of generating Dublin Core metadata. . . .
A test program was then implemented using metrics from the framework. It evaluated the quality of metadata generated from 1) Web pages (html) and 2) scholarly works (pdf) by four of the more widely-known metadata generation tools—Data Fountains, DC-dot, SamgI, and the Yahoo! Term Extractor. . . .
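Tools of this kind typically start from whatever explicit hints a page offers, its <title> and <meta> tags, before falling back to content analysis. Here's a minimal sketch of that first step (a generic illustration, not one of the evaluated tools):

```python
# Generate rudimentary Dublin Core from an HTML page's explicit hints
# (<title> and <meta name=...> tags) with the standard-library parser.
# A generic first step, not any of the tools the project evaluated.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.dc = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() in (
                "description", "keywords", "author"):
            self.dc[attrs["name"].lower()] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.dc["title"] = self.dc.get("title", "") + data

page = """<html><head><title>MetaTools Report</title>
<meta name="author" content="UKOLN">
<meta name="keywords" content="metadata, repositories"></head></html>"""

p = MetaExtractor()
p.feed(page)
print(p.dc)
```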
It was found that the output from Data Fountains was generally superior to that of the other tools that the project tested. But the output from all of the tools was considered disappointing and markedly inferior to the quality of metadata that Tonkin and Muller report PaperBase has extracted from scholarly works. Overall, the prospects for generating high-quality metadata for scholarly works appear to be brighter because of their more predictable layout. . . .
In the third stage of the project, SOAP and RESTful Web Service interfaces were developed for three metadata generation tools—Data Fountains, SamgI, and Kea. This had a dual purpose. Firstly, the creation of an optimal metadata record usually requires the merging of output from several tools, each of which, until now, had to be invoked separately because of the ad hoc nature of their interfaces. As Web services, they will be available for use in a network such as the Web with well-defined interfaces that are implementation-independent. These services will be exposed for use by clients without them having to be concerned with how the service will execute their requests. Repositories should be able to plug them into their own cataloguing environments and experiment with automatic metadata generation under more 'real-life' circumstances than hitherto. Secondly, and more importantly (in view of the relatively poor quality of current tools), they enabled the project to experiment with the use of a high-level ontology for describing metadata generation tools.