Search Engines and Discovery Systems – Page 12

"Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web"

Duncan Hull, Steve R. Pettifer, and Douglas B. Kell have published "Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web" in PLoS Computational Biology.

Here's the abstract:

Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as 'thought in cold storage,' and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.

Google Now Fully Indexes Scanned Documents Using OCR

Google has announced that it is now using OCR (Optical Character Recognition) to fully index scanned PDF files (these files contain text in digital images).

New Tutorial: Internet for Image Searching

TASI has launched a new online tutorial, Internet for Image Searching. It includes sites that provide images that can be freely used.

Google Newspaper Digitization Project Announced

Google has announced a newspaper digitization project that will "make more old newspapers accessible and searchable online by partnering with newspaper publishers to digitize millions of pages of news archives."

Read more about it at "Bringing History Online, One Newspaper at a Time."

SRU Open Search: Open Source Customizable Interface for Displaying SRU-Formatted XML

The Institute for Research and Innovation in Social Services at the University of Strathclyde has released SRU Open Search, an open source customizable interface for displaying SRU-formatted XML.

Here are some features selected from a more comprehensive list:

Bookmarkable pages, so you can share a page of results via email

Share items via social bookmarking sites (Delicious, Digg, Google)

Featured audio highlighting—inline mp3 player via flash

Featured content highlighting . . .

Visualisation of search terms via pie chart, tag cloud & tree map . . .

Portable version of search so users can add to their own site

Browser search plugin for Firefox & Internet Explorer (inc Auto Suggest)
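
Here's a minimal sketch of the kind of SRU searchRetrieve request that an interface like this sits on top of, and of why a results page can be bookmarked: the whole search state lives in the URL. The parameter names (operation, version, query, startRecord, maximumRecords) come from the SRU specification; the base URL, the CQL query, and the function name are hypothetical and are not taken from SRU Open Search itself.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, start_record=1, maximum_records=10, version="1.1"):
    """Build an SRU searchRetrieve URL for a CQL query.

    base_url is a hypothetical SRU endpoint; the parameter names are the
    standard SRU request parameters.
    """
    params = [
        ("operation", "searchRetrieve"),
        ("version", version),
        ("query", cql_query),
        ("startRecord", str(start_record)),
        ("maximumRecords", str(maximum_records)),
    ]
    return base_url + "?" + urlencode(params)


# Second page of results (records 11-20) for a hypothetical title search.
page_two_url = build_sru_url(
    "http://example.org/sru",
    'dc.title = "social work"',
    start_record=11,
)
print(page_two_url)
```

Because startRecord and maximumRecords are part of the URL, sharing the link by email reproduces the same page of results.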

Solr Search Engine Plug-In for Fedora Released

The DRAMA team has released a Solr plug-in for Fedora.

Here's a description of Solr from its home page:

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.
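
Here's a minimal sketch of a request to the HTTP API that the description mentions, asking for JSON output with faceting and hit highlighting turned on. The parameters (q, wt, rows, facet, facet.field, hl, hl.fl) are standard Solr query parameters; the host, port, and the field names "subject" and "title" are assumptions for illustration and have nothing to do with the DRAMA plug-in's actual configuration.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_field=None, highlight_field=None, rows=10):
    """Return a Solr select URL that requests a JSON response."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_field:
        # Faceted search: ask Solr to count matches per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        # Hit highlighting: ask Solr for snippets from this field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr instance and field names.
solr_url = build_solr_query(
    "http://localhost:8983/solr",
    "digital library",
    facet_field="subject",
    highlight_field="title",
)
print(solr_url)
```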

Coverage of the Demise of Microsoft's Mass Digitization Project

Microsoft's decision to end its Live Search Books program, which provided important funding for the Open Content Alliance, has been widely covered by newspapers, blogs, and other information sources.

Here's a selection of articles and posts: "Books Scanning to be Publicly Funded," "'It Ain’t Over Till It's Over': Impact of the Microsoft Shutdown," "Microsoft Abandons Live Search Books/Academic Scan Plan," "Microsoft Burns Book Search—Lacks 'High Consumer Intent,'" "Microsoft Shuts Down Two of Its Google 'Wannabe’s': Live Search Books and Live Search Academic," "Microsoft Will Shut Down Book Search Program," "Microsoft's Book-Search Project Has a Surprise Ending," "Post-Microsoft, Libraries Mull Digitization," "Publishers Surprised by Microsoft Move," "Why Killing Live Book Search Is Good for the Future of Books," and "Without Microsoft, British Library Keeps on Digitizing."

National Science Digital Library NCore Team Releases NSDL Search, MediaWiki Extensions, and WordPress MU Plug-Ins

The National Science Digital Library NCore team has released three applications:

Generic, open source version of NSDL Search
MediaWiki extensions that are used to support the NSDL Wiki
WordPress MU plug-ins that are used to support Expert Voices

Terrier 2.1 Released: Open Source Search Software for Large Document Collections

The University of Glasgow Department of Computing Science has released Terrier 2.1, an open source search engine written in Java that is designed to handle large document collections.

Google Book Search Book Viewability API Released

Google has released the Google Book Search Book Viewability API.

Here's an excerpt from the API home page:

The Google Book Search Book Viewability API enables developers to:

Link to Books in Google Book Search using ISBNs, LCCNs, and OCLC numbers

Know whether Google Book Search has a specific title and what the viewability of that title is

Generate links to a thumbnail of the cover of a book

Generate links to an informational page about a book

Generate links to a preview of a book

Read more about it at "Book Info Where You Need It, When You Need It."
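
Here's a minimal sketch of how a lookup by ISBN, LCCN, or OCLC number might be put together. It assumes the request form I recall from the API documentation (a bibkeys list plus jscmd=viewapi on books.google.com) and a response wrapped in a JavaScript variable assignment; the function names, the identifiers, and the sample response body are all illustrative, and the field names in that sample should be checked against the API home page.

```python
import json
from urllib.parse import urlencode


def build_viewability_url(identifiers, callback=None):
    """Build a lookup URL from (scheme, value) pairs such as ("ISBN", "0451526538")."""
    bibkeys = ",".join(f"{scheme}:{value}" for scheme, value in identifiers)
    params = [("bibkeys", bibkeys), ("jscmd", "viewapi")]
    if callback:
        params.append(("callback", callback))
    # Keep ":" and "," readable in the bibkeys list instead of percent-encoding them.
    return "https://books.google.com/books?" + urlencode(params, safe=":,")


def parse_viewability_response(body):
    """Extract the JSON object from a 'var name = {...};' style response body."""
    start = body.index("{")
    end = body.rindex("}") + 1
    return json.loads(body[start:end])


lookup_url = build_viewability_url([("ISBN", "0451526538"), ("OCLC", "36792831")])
print(lookup_url)

# Made-up sample body, for illustration only (not real API output).
sample_body = 'var _GBSBookInfo = {"ISBN:0451526538": {"bib_key": "ISBN:0451526538", "preview": "noview"}};'
book_info = parse_viewability_response(sample_body)
print(book_info["ISBN:0451526538"]["preview"])
```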

Digital Library Federation ILS and Discovery Systems Draft Report

The Digital Library Federation's ILS and Discovery Systems working group has issued a Draft Recommendation investigating issues related to integrated library system and discovery system integration.

Here's an excerpt from the "Introduction":

This document is the (DRAFT) report of that group. It gives technical recommendations for integrating the ILS with external discovery applications. This report includes

A summary of a survey of the needs and discovery applications implemented and desired by libraries in DLF (and other similar libraries).

A high-level summary of specific abstract functions that discovery applications need to be able to invoke on ILS's and/or their data to support desired discovery applications, as well as outgoing services from ILS software to other applications.

Recommendations for concrete bindings for these functions (i.e. specific protocols, APIs, data standards, etc.) that can be used with future and/or existing ILS's. Producing a complete concrete binding and reference implementation is beyond the scope of this small, short-term group; but we hope to provide sufficient requirements and details that others can produce appropriate bindings and implementations.

Practical recommendations to encourage libraries, ILS developers, and discovery application developers to expeditiously integrate discovery systems with the ILS and other sources of bibliographic metadata.

Summa: A Federated Search System

Statsbiblioteket is developing Summa, a federated search system.

Birte Christensen-Dalsgaard, Director of Development, discusses Summa and other topics in a new podcast (CNI Podcast: An Interview with Birte Christensen-Dalsgaard, Director of Development at the State and University Library, Denmark).

Here's an excerpt from the podcast abstract:

Summa is an open source system implementing modular, service-based architecture. It is based on the fundamental idea "free the content from the proprietary library systems," where the discovery layer is separated from the business layer. In doing so, any Internet technology can be used without the limitations traditionally set by proprietary library systems, and there is the flexibility to integrate or to be integrated into other systems. A first version of a Fedora—Summa integration has been developed.

A white paper is available that examines the system in more detail.

Columbia University and Microsoft Book Digitization Project

The Columbia University Libraries have announced that they will work with Microsoft to digitize a "large number of books" that are in the public domain.

Here's an excerpt from the press release:

Columbia University and Microsoft Corp. are collaborating on an initiative to digitize a large number of books from Columbia University Libraries and make them available to Internet users. With the support of the Open Content Alliance (OCA), publicly available print materials in Columbia Libraries will be scanned, digitized, and indexed to make them readily accessible through Live Search Books. . . .

Columbia University Libraries is playing a key role in book selection and in setting quality standards for the digitized materials. Microsoft will digitize selected portions of the Libraries’ great collections of American history, literature, and humanities works, with the specific areas to be decided mutually by Microsoft and Columbia during the early phase of the project.

Microsoft will give the Library high-quality digital images of all the materials, allowing the Library to provide worldwide access through its own digital library and to share the content with non-commercial academic initiatives and non-profit organizations.

Read more about it at "Columbia University Joins Microsoft Scan Plan."

Wikia Search Debuts to Pundits’ Criticism

An alpha version of Wikia's open source Wikia Search has gone public, but the consensus seems to be that this user-tuned search engine has a long way to go to compete with the likes of Google.

Google Gives Wikipedia a Lump of Knol for Xmas

According to "Encouraging People to Contribute Knowledge," Google has launched Knol, a Wikipedia competitor, in test mode.

Here's an excerpt from the posting:

Earlier this week, we [Google] started inviting a selected group of people to try a new, free tool that we are calling "knol", which stands for a unit of knowledge. Our goal is to encourage people who know a particular subject to write an authoritative article about it. . . . .

A knol on a particular topic is meant to be the first thing someone who searches for this topic for the first time will want to read. The goal is for knols to cover all topics, from scientific concepts, to medical information, from geographical and historical, to entertainment, from product information, to how-to-fix-it instructions. Google will not serve as an editor in any way, and will not bless any content. . . . For many topics, there will likely be competing knols on the same subject. . . .

Knols will include strong community tools. People will be able to submit comments, questions, edits, additional content, and so on. Anyone will be able to rate a knol or write a review of it. Knols will also include references and links to additional information. At the discretion of the author, a knol may include ads.

Columbia University Libraries and Bavarian State Library Become Google Book Search Library Partners

Both the Columbia University Libraries and Bavarian State Library have joined the Google Book Search Library Project.

Here are the announcements:

Bavarian State Library (in German)
Columbia University Libraries

University of Michigan Libraries Make over 100,000 Records for Digitized Books Available for Harvesting

The University of Michigan Libraries have made over 100,000 metadata records from its MBooks collection available for OAI-PMH harvesting. The records are for digitized books in the public domain.

Here's an excerpt from the announcement:

The University of Michigan Library is pleased to announce that records from our MBooks collection are available for OAI harvesting. The MBooks collection consists of materials digitized by Google in partnership with the University of Michigan.

http://quod.lib.umich.edu/cgi/o/oai/oai?verb=Identify

Only records for MBooks available in the public domain are exposed. We have split these into sets containing public domain items according to U.S. copyright law, and public domain items worldwide. There are currently over 100,000 records available for harvesting. We anticipate having 1 million records available when the entire U-M collection has been digitized by Google.
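
Here's a minimal sketch of the requests a harvester would send to the endpoint quoted above. The verbs (Identify, ListSets, ListRecords) and the oai_dc metadata prefix are defined by the OAI-PMH protocol; the set name is a placeholder, since the announcement doesn't give the actual set identifiers (a ListSets request would list them). The sketch only builds URLs and doesn't check whether the endpoint still responds.

```python
from urllib.parse import urlencode

# Base URL taken from the announcement above.
BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"


def oai_request_url(verb, **arguments):
    """Build an OAI-PMH request URL for a verb and its arguments."""
    params = [("verb", verb)] + sorted(arguments.items())
    return BASE_URL + "?" + urlencode(params)


# Describe the repository (same request as the URL in the announcement).
identify_url = oai_request_url("Identify")

# Discover the available sets, e.g. U.S. versus worldwide public domain.
list_sets_url = oai_request_url("ListSets")

# Harvest Dublin Core records from one set; "example_set" is a placeholder.
list_records_url = oai_request_url(
    "ListRecords", metadataPrefix="oai_dc", set="example_set"
)

for url in (identify_url, list_sets_url, list_records_url):
    print(url)
```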

Paul Courant on Michigan’s Mass Digitization Project with Google

In "On Being in Bed with Google," Paul N. Courant, University Librarian and Dean of Libraries at the University of Michigan, vigorously rebuts arguments against research libraries participating in the Google Books Library Project.

Here's an excerpt:

Since 2005, Siva Vaidhyanathan has been making and refining the argument that libraries should be digitizing their collections independently, without corporate financing or participation, and that those who don’t are failing to uphold their responsibility to the public. "Libraries should not be relinquishing their core duties to private corporations for the sake of expediency."

"Expediency" is a bit of a dirty word. Vaidhyanathan’s phrase suggests that good people don’t do things simply because they are "expedient." But I view large-scale digitization as expeditious. We have a generation of students who will not find valuable scholarly works unless they can find them electronically. At the rate that OCA is digitizing things (and I say the more the merrier and the faster the better) that generation will be dandling great-grandchildren on its knees before these great collections can be found electronically. At Michigan, the entire collection of bound print will be searchable, by anyone in the world, about when children born today start kindergarten.

Update on the British Library/Microsoft Digitization Project

In his Information Today article "Progress Report: The British Library and Microsoft Digitization Partnership," Jim Ashling provides an update on the progress that the British Library and Microsoft have made in their project to digitize about 100,000 books for access in Live Search Books.

Here's an excerpt from the article:

Unlike previous BL digitization projects where material had been selected on an item-by-item basis, the sheer size of this project made such selectivity impossible. Instead, the focus is on English-language material, collected by the BL during the 19th century. . . .

Scanning produces high-resolution images (300 dpi) that are then transferred to a suite of 12 computers for OCR (optical character recognition) conversion. The scanners, which run 24/7, are specially tuned to deal with the spelling variations and old-fashioned typefaces used in the 1800s. The process creates multiple versions including PDFs and OCR text for display in the online services, as well as an open XML file for long-term storage and potential conversion to any new formats that may become future standards. In all, the data will amount to 30 to 40 terabytes. . . .

Obviously, then, an issue exists here for a collection of 19th-century literature when some authors may have lived beyond the late 1930s [British/EU law gives authors a copyright term of life plus 70 years]. An estimated 40 percent of the titles are also orphan works. Those two issues mean that item-by-item copyright checking would be an unmanageable task. Estimates for the total time required to check on the copyright issues involved vary from a couple of decades to a couple of hundred years. The BL’s approach is to use two databases of authors to identify those who were still living in 1936 and to remove their work from the collection before scanning. That, coupled with a wide publicity to encourage any rights holders to step forward, may solve the problem.
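
Here's a minimal worked example of the life-plus-70 arithmetic behind the cutoff described above. It assumes that the term runs to the end of the calendar year 70 years after the author's death, so a work enters the public domain on 1 January of the following year; the function names are mine, and the sketch ignores orphan works and every other complication the article mentions.

```python
def first_public_domain_year(death_year, term_years=70):
    """First calendar year in which the author's works are out of copyright."""
    # Protection lasts through the end of (death_year + term_years).
    return death_year + term_years + 1


def is_public_domain(death_year, as_of_year, term_years=70):
    """True if works by an author who died in death_year are free in as_of_year."""
    return as_of_year >= first_public_domain_year(death_year, term_years)


# An author who died in 1936 is protected through the end of 2006.
print(first_public_domain_year(1936))   # 2007
# An author who died in 1940 is still protected in 2008.
print(is_public_domain(1940, 2008))     # False
```

Under that assumption, an author who died in 1936 passes out of copyright at the start of 2007, which is consistent with the article's concern about authors who lived beyond the late 1930s.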

Yale Will Work with Microsoft to Digitize 100,000 Books

The Yale University Library and Microsoft will work together to digitize 100,000 English-language out-of-copyright books, which will be made available via Microsoft’s Live Search Books.

Here’s an excerpt from the press release:

The Library and Microsoft have selected Kirtas Technologies to carry out the process based on their proven excellence and state-of-the art equipment. The Library has successfully worked with Kirtas previously, and the company will establish a digitization center in the New Haven area. . . .

The project will maintain rigorous standards established by the Yale Library and Microsoft for the quality and usability of the digital content, and for the safe and careful handling of the physical books. Yale and Microsoft will work together to identify which of the approximately 13 million volumes held by Yale’s 22 libraries will be digitized. Books selected for digitization will remain available for use by students and researchers in their physical form. Digital copies of the books will also be preserved by the Yale Library for use in future academic initiatives and in collaborative scholarly ventures.

German Publishers Just Say No to Google Book Search: Libreka Launched at Frankfurt Book Fair

German publishers who want to retain control of their content have a new alternative to Google Book Search: Libreka, a full-text search engine that initially has about 8,000 books from publishers who opted-in for inclusion. Searchers retrieve book titles and cover images, but no content.

Source: "German Publishers Offer Alternative to Google Books." Deutsche Welle, 11 October 2007.

Comment on Preservation in the Age of Large-Scale Digitization

CLIR seeks comments on Preservation in the Age of Large-Scale Digitization by Oya Rieger. The deadline is 10/5/07.

LibraryFind 0.8.2 Released

The Oregon State University Libraries have released LibraryFind 0.8.2.

Here’s an excerpt from the CODE4LIB announcement:

LibraryFind is metasearch software written in Ruby-on-Rails. It allows libraries to provide a unified search solution to their users, letting library users search across both licensed collections and local collections. LibraryFind is open source software (licensed under the GPL), and is free to download and use. More information on LibraryFind can be found at http://libraryfind.org.

Amazon and Google E-Book Developments

Amazon is expected to release a wireless e-book reader called Kindle this October. It's anticipated to be priced in the $400-$500 range.

Also in the fall, Google is expected to offer charged access to the complete contents of digital books, with pricing to be determined by publishers.

Source: Stone, Brad. "Are Books Passé? Envisioning the Next Chapter for Electronic Books." The New York Times, 6 September 2007, C1, C9.

Z39.50 Target Directory

The Z39.50 Target Directory from Index Data includes both Z39.50- and SRU/SRW-enabled systems.

It can be searched or browsed by name.

DigitalKoans provides news and commentary on digital copyright, digital curation, digital repository, open access, research data management, scholarly communication, and other digital information issues. It is also available via an RSS feed.

A Digital Scholarship publication. Digital Scholarship is a noncommercial publisher and it accepts no advertising. Charles W. Bailey, Jr. is the publisher of Digital Scholarship.