Researchers at Penn State's College of Information Sciences and Technology's Cyber-Infrastructure Lab have developed open source software called TableSeer that can find, extract, search, and rank table data from PDF files. Source code will be available at the project's close.
Here's an extract from the press release:
Tables are an important data resource for researchers. In a search of 10,000 documents from journals and conferences, the researchers found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.
But while some software can identify and extract tables from text, existing software cannot search for tables across documents. That means scientists and scholars must manually browse documents in order to find tables-a time-consuming and cumbersome process.
TableSeer automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table.
In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats. . . .
Information on TableSeer can be found in a paper, "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries," by Ying Liu, Kun Bai, Mitra and Giles of the Penn State College of Information Sciences and Technology.