Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR Grant Awarded

The NSF Division of Information & Intelligent Systems has awarded a grant to the Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive for their "Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR" proposal.

Here's an excerpt from the award abstract:

The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections. To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features. When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion. The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.