JISC has released Deposit Plait: Final Report.
Here's an excerpt:
The aim of the Deposit Plait project was to examine potential for easing the deposit of journal articles into institutional repositories by making use of any metadata embedded within the document properties of the document being deposited. . . .
The first stage of the project was to see how easy it is to extract this metadata. The target file formats that the project worked with were the Open Document Format (as created by OpenOffice), OpenXML (as created by Microsoft Office 2007), and .doc files (as created by versions of Microsoft Office from 97 to 2003). There are standard open source software libraries that can extract both standard and custom metadata fields from each of these file formats.
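As an illustration of this first stage (not taken from the report), here is a minimal Python sketch that reads the standard document properties from an OpenXML (.docx) file using only the standard library. It assumes the core properties part sits at its conventional location, docProps/core.xml; the function name extract_core_properties and the choice of fields are hypothetical, and custom properties, ODF, and legacy .doc files would need separate handling.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an OpenXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical selection of fields that might be useful as search terms.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
    "created": "dcterms:created",
}


def extract_core_properties(source):
    """Return the non-empty core properties of a .docx file as a dict.

    `source` is a path or a binary file-like object. Assumption: the
    properties live at docProps/core.xml, the conventional location.
    """
    with zipfile.ZipFile(source) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}  # the package carries no core properties part

    root = ET.fromstring(xml_bytes)
    result = {}
    for name, path in FIELDS.items():
        element = root.find(path, NS)
        # Skip fields that are missing or contain only whitespace.
        if element is not None and element.text and element.text.strip():
            result[name] = element.text.strip()
    return result
```

Calling extract_core_properties("article.docx") on a real file would return whichever of these fields the authoring software filled in, which in practice is often only a subset.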
The second stage of the project was to see how easy it is to use extracted metadata as search terms in order to search for a more complete metadata record. Where the item being deposited into the repository has been in existence for some time (a 'retrospective deposit'), the metadata found can be used to perform a search. Different search methods were implemented as examples, including using search APIs, and screen scraping from search services. Whilst the method works fine, there are the normal licensing issues to consider, in particular whether licences cover the user for this type of metadata re-use.
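To make the second stage concrete, the following Python sketch turns a chosen subset of extracted fields into a search request URL. It is an assumption-laden illustration rather than the project's code: the endpoint https://search.example.org/api/records is a placeholder, not a service named in the report, and build_search_url and its parameter names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a placeholder, not a service named in the report.
SEARCH_ENDPOINT = "https://search.example.org/api/records"


def build_search_url(metadata, chosen_fields, endpoint=SEARCH_ENDPOINT):
    """Build a GET URL from the selected, non-empty metadata fields."""
    params = [
        (field, metadata[field])
        for field in chosen_fields
        if metadata.get(field)  # ignore fields that are absent or empty
    ]
    if not params:
        raise ValueError("no usable metadata fields were selected")
    return endpoint + "?" + urlencode(params)
```

A real search service would define its own parameter names and response format, and its terms of use would determine whether this kind of automated querying is permitted.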
The project concluded by creating an online demonstration system. In contrast to a normal repository deposit where the user enters metadata, and then uploads a file, this system requires the user to first upload a file. The metadata is extracted, and the user is allowed to choose which (one or more) of the fields to use as the basis of a search. The search is then initiated and matching records returned. The user can then pick and choose fields from the results to 'plait' together their final metadata record.
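The 'plaiting' step can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not the demonstration system's implementation: records are modelled as plain dicts, and the function plait_record and the shape of its choices argument are hypothetical.

```python
def plait_record(extracted, candidates, choices):
    """Assemble a final metadata record from per-field user choices.

    extracted  -- dict of metadata taken from the uploaded file
    candidates -- list of dicts, one per matching search result
    choices    -- maps a field name to the index of the search result to
                  take it from, or to None to keep the extracted value
    """
    record = {}
    for field, source in choices.items():
        origin = extracted if source is None else candidates[source]
        # Only copy fields the chosen source actually provides.
        if field in origin:
            record[field] = origin[field]
    return record


extracted = {"title": "deposit plait", "creator": "A. Author"}
candidates = [
    {"title": "Deposit Plait: Final Report", "year": "2008"},
    {"title": "Another Report", "publisher": "Example Press"},
]
choices = {"title": 0, "creator": None, "publisher": 1, "year": 0}
final_record = plait_record(extracted, candidates, choices)
```

Here the title and year come from the first search result, the publisher from the second, and the creator from the file's own properties, so the final record combines strands from three sources.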