The Digital Library Federation has published Future Directions in Metadata Remediation for Metadata Aggregators.
Here's an excerpt:
With support from The Gladys Krieble Delmas Foundation, the Digital Library Federation embarked on a project to inventory existing tools and services for metadata mapping, remediation, and enhancement. Once identified, tools were evaluated for general applicability across digital library and other cultural heritage environments.
The results of the research show that a handful of tools are usable as-is, but many tools need more work to be generally applicable in a variety of environments, and significant development would be required to create a robust and well-defined set of metadata remediation services. Key points of note:
- Relatively few tools are available that can work directly on metadata records rather than full text, and those that are available need to be customized for each aggregator.
- Workable tools are available for date normalization, and also for normalizing and matching coordinates to U.S. geographic names.
- A statistical topic model program for subject clustering has been developed.
- Both named entity and topical keyword extraction remain problematic, with a fairly high percentage of errors.
- Authority files may be used to break up pre-coordinated Library of Congress subject strings into topical, name, geographic, temporal, and genre facets to improve searching.
- Mappings between different thesauri, which should allow for better search processing in aggregations containing multiple subject vocabularies, are still under development.
- Infrastructure for work collocation, appropriate to aggregators with significant published materials, is still underdeveloped and will probably need to wait for the widespread adoption of the new standard for resource description, Resource Description and Access (RDA).
- Unambiguous identifiers for entities such as names and works would be useful when the community infrastructure is developed, but are not yet supported by most metadata formats.
- Unambiguous, machine-actionable rights statements are also an area where the community infrastructure is still under development.
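
To make the point about pre-coordinated subject strings more concrete, here is a minimal Python sketch of the faceting idea. It is not taken from the report: the function name facet_subject_string and the tiny TOY_AUTHORITY table are hypothetical stand-ins for a real authority file lookup, and the sketch assumes subdivisions are separated by double hyphens, as they usually are in displayed Library of Congress subject strings.

```python
from collections import defaultdict

# Hypothetical, hand-made stand-in for an authority file lookup.
# A real service would consult name, subject, and genre authority records.
TOY_AUTHORITY = {
    "united states": "geographic",
    "history": "topical",
    "civil war, 1861-1865": "temporal",
    "personal narratives": "genre",
    "lincoln, abraham, 1809-1865": "name",
    "correspondence": "genre",
}


def facet_subject_string(subject, authority=TOY_AUTHORITY):
    """Split a pre-coordinated subject string on "--" and group its parts by facet.

    Parts that are not found in the authority table are collected under
    "unmatched" instead of being guessed at.
    """
    facets = defaultdict(list)
    for part in subject.split("--"):
        # Trim whitespace and a trailing full stop (this naive step would
        # also strip the final period of an abbreviation).
        term = part.strip().rstrip(".")
        if not term:
            continue
        facet = authority.get(term.lower(), "unmatched")
        facets[facet].append(term)
    return dict(facets)


if __name__ == "__main__":
    example = "United States -- History -- Civil War, 1861-1865 -- Personal narratives."
    for facet, terms in facet_subject_string(example).items():
        print(f"{facet}: {terms}")
```

Sending unknown terms to an "unmatched" bucket keeps the sketch honest about what the lookup table does and does not cover, so a gap in the authority data shows up as a gap in the output.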