Will You Only Harvest Some?

The Digital Library for Information Science and Technology has announced DL-Harvest, an OAI-PMH service provider that harvests and makes searchable metadata about information science materials from the following archives and repositories:

ALIA e-prints
arXiv
Caltech Library System Papers and Publications
DLIST
Documentation Research and Training Centre
DSpace at UNC SILS
E-LIS
Metadata of LIS Journals
OCLC Research Publications
OpenMED@NIC
WWW Conferences Archive

DL-Harvest is a much needed, innovative discipline-based search service. Big kudos to all involved.

DLIST also just announced the formation of an advisory board.

The following musings, inspired by the DL-Harvest announcement, are not intended to detract from the fine work that DLIST is doing or from the very welcome addition of DL-Harvest to their service offerings.

Discipline-focused metadata can be harvested relatively easily from OAI-PMH-compliant systems that are organized along disciplinary lines (e.g., the entire archive/repository is discipline-based or an organized subset is discipline-based). No doubt these are very rich, primary veins of discipline-specific information, but how about the smaller veins and nuggets that are hard to identify and harvest because they are in systems or subsets that focus on another discipline?
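To make the "organized subset" case concrete: in OAI-PMH, selective harvesting of a subset is requested through the optional set argument of a ListRecords request. Here is a minimal Python sketch that only builds such a request URL. The base URL and the set name "lis" are made up for illustration; a real repository advertises its own sets in response to a ListSets request.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL.

    verb, metadataPrefix, set, and from are arguments defined by OAI-PMH;
    base_url and the values passed in below are hypothetical.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Restrict the harvest to one set (e.g., a discipline-based subset).
        params["set"] = set_spec
    if from_date:
        # Optional lower bound on the record datestamp for incremental harvests.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository and set name, for illustration only.
url = list_records_url("https://repository.example.edu/oai", set_spec="lis")
print(url)
# https://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc&set=lis
```

This works well only when the repository has already defined a set that matches the discipline the harvester cares about.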

Here’s an example. An economist, who is not part of a research center or other group that might have its own archive, writes extensively about the economics of the scholarly publishing business. This individual’s papers end up in the economics department section of his or her institutional repository and in EconWPA. They are highly relevant to librarians and information scientists, but will their metadata records be harvested via OAI-PMH for use in services like DL-Harvest, given that they sit in the wrong conceptual bins (e.g., the wrong set in the case of the IR)?
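One rough way to picture a finer-grained approach is to harvest a whole set and then keep only the records whose Dublin Core subject fields match discipline keywords. The Python sketch below does that against a small hand-written ListRecords response. The sample records, identifiers, and keyword list are all invented for illustration, and naive substring matching on dc:subject is an assumption here, not a description of how DL-Harvest or any other service works.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical keyword list standing in for "relevant to LIS".
KEYWORDS = {"scholarly publishing", "libraries", "open access"}

# Invented sample: two records from an "economics" set.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example:econ-1</identifier>
        <setSpec>economics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Pricing in Journal Markets</dc:title>
          <dc:subject>Economics of scholarly publishing</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example:econ-2</identifier>
        <setSpec>economics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Labor Supply Elasticities</dc:title>
          <dc:subject>Labor economics</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def relevant_records(xml_text, keywords=KEYWORDS):
    """Return identifiers of records whose dc:subject mentions any keyword."""
    root = ET.fromstring(xml_text)
    hits = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        subjects = [
            (s.text or "").strip().lower()
            for s in record.iterfind(".//dc:subject", NS)
        ]
        # Case-insensitive substring match; crude, but enough for a sketch.
        if any(k in subject for subject in subjects for k in keywords):
            hits.append(identifier)
    return hits


print(relevant_records(SAMPLE))
# ['oai:example:econ-1']
```

Even in this toy case the result depends entirely on authors or depositors having supplied usable subject terms, which is exactly where the approach gets hard in practice.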

Coleman et al. point to one solution in their intriguing "Integration of Non-OAI Resources for Federated Searching in DLIST, an Eprints Repository" paper. But (lots of hand waving here), if automatic metadata extraction were an easy and simple way to supplement conventional OAI-PMH harvesting, the bottom-line question is: how good is good enough? In other words, what’s an acceptable level of accuracy for the automatic metadata extraction? (I won’t even bring up the dreaded "controlled vocabulary" notion.)

No doubt this problem falls under the 80/20 Rule, and the 20 is most likely in the low-hanging fruit, OAI-PMH-wise, but wouldn’t it be nice to have more fruit?

One thought on “Will You Only Harvest Some?”

I want to comment on so many of the points you raise (very stimulating thoughts) but I’ll limit myself to responding to one of your musings for now :).

Yes, the economist’s papers could be harvested (at a finer-grained level and when relevant to LIS) irrespective of the broad IR, but this is not a trivial problem to solve, as we found out when we wanted to harvest only certain subjects from arXiv, for example. If I’m not clear, do try some of the open access aggregator services based on OAI-PMH and I think you’ll see what I mean. Also take a look at the subjects listed in DLIST. We’ve taken a lot of heat for this mix of broad and narrow subjects, but there’s a bit of method to our madness, as I hope time will show.

Delighted by your interest and appreciation and will look forward to more inspired musings. Thanks. A
