Summa: A Federated Search System

Statsbiblioteket is developing Summa, a federated search system.

Birte Christensen-Dalsgaard, Director of Development, discusses Summa and other topics in a new podcast (CNI Podcast: An Interview with Birte Christensen-Dalsgaard, Director of Development at the State and University Library, Denmark).

Here's an excerpt from the podcast abstract:

Summa is an open source system implementing modular, service-based architecture. It is based on the fundamental idea "free the content from the proprietary library systems," where the discovery layer is separated from the business layer. In doing so, any Internet technology can be used without the limitations traditionally set by proprietary library systems, and there is the flexibility to integrate or to be integrated into other systems. A first version of a Fedora—Summa integration has been developed.

A white paper is available that examines the system in more detail.

Columbia University and Microsoft Book Digitization Project

The Columbia University Libraries have announced that they will work with Microsoft to digitize a "large number of books" that are in the public domain.

Here's an excerpt from the press release:

Columbia University and Microsoft Corp. are collaborating on an initiative to digitize a large number of books from Columbia University Libraries and make them available to Internet users. With the support of the Open Content Alliance (OCA), publicly available print materials in Columbia Libraries will be scanned, digitized, and indexed to make them readily accessible through Live Search Books. . . .

Columbia University Libraries is playing a key role in book selection and in setting quality standards for the digitized materials. Microsoft will digitize selected portions of the Libraries’ great collections of American history, literature, and humanities works, with the specific areas to be decided mutually by Microsoft and Columbia during the early phase of the project.

Microsoft will give the Library high-quality digital images of all the materials, allowing the Library to provide worldwide access through its own digital library and to share the content with non-commercial academic initiatives and non-profit organizations.

Read more about it at "Columbia University Joins Microsoft Scan Plan."

Wikia Search Debuts to Pundits’ Criticism

An alpha version of Wikia's open source Wikia Search has gone public, but the consensus seems to be that this user-tuned search engine has a long way to go to compete with the likes of Google.

Read more about it at "Jimmy Wales Argues That His Wikia Needs More Time," "Wiki Citizens Taking on a New Area: Searching," "Wikia Launching Human-Powered Search," "Wikia Search Alpha Preview Leaves Much to Be Desired," "Wikia Search Is A Complete Letdown," and "Wikia Search—Miles Behind the Competition."

Google Gives Wikipedia a Lump of Knol for Xmas

According to "Encouraging People to Contribute Knowledge," Google has launched Knol, a Wikipedia competitor, in test mode.

Here's an excerpt from the posting:

Earlier this week, we [Google] started inviting a selected group of people to try a new, free tool that we are calling "knol", which stands for a unit of knowledge. Our goal is to encourage people who know a particular subject to write an authoritative article about it. . . . .

A knol on a particular topic is meant to be the first thing someone who searches for this topic for the first time will want to read. The goal is for knols to cover all topics, from scientific concepts, to medical information, from geographical and historical, to entertainment, from product information, to how-to-fix-it instructions. Google will not serve as an editor in any way, and will not bless any content. . . . For many topics, there will likely be competing knols on the same subject. . . .

Knols will include strong community tools. People will be able to submit comments, questions, edits, additional content, and so on. Anyone will be able to rate a knol or write a review of it. Knols will also include references and links to additional information. At the discretion of the author, a knol may include ads.

Read more about it at "Google to Wikipedia: 'Knol' Thine Enemy," "Google's Knol: No Wikipedia Killer," "Google's 'Knols' Aren't a Threat to Wikipedia," "Google's Know-It-All Project," and "Google's Units of Knowledge May Raise Conflict of Interest."

Columbia University Libraries and Bavarian State Library Become Google Book Search Library Partners

Both the Columbia University Libraries and Bavarian State Library have joined the Google Book Search Library Project.

Update on the British Library/Microsoft Digitization Project

In his Information Today article "Progress Report: The British Library and Microsoft Digitization Partnership," Jim Ashling provides an update on the progress that the British Library and Microsoft have made in their project to digitize about 100,000 books for access in Live Search Books.

Here's an excerpt from the article:

Unlike previous BL digitization projects where material had been selected on an item-by-item basis, the sheer size of this project made such selectivity impossible. Instead, the focus is on English-language material, collected by the BL during the 19th century. . . .

Scanning produces high-resolution images (300 dpi) that are then transferred to a suite of 12 computers for OCR (optical character recognition) conversion. The scanners, which run 24/7, are specially tuned to deal with the spelling variations and old-fashioned typefaces used in the 1800s. The process creates multiple versions including PDFs and OCR text for display in the online services, as well as an open XML file for long-term storage and potential conversion to any new formats that may become future standards. In all, the data will amount to 30 to 40 terabytes. . . .

Obviously, then, an issue exists here for a collection of 19th-century literature when some authors may have lived beyond the late 1930s [British/EU law gives authors a copyright term of life plus 70 years]. An estimated 40 percent of the titles are also orphan works. Those two issues mean that item-by-item copyright checking would be an unmanageable task. Estimates for the total time required to check on the copyright issues involved vary from a couple of decades to a couple of hundred years. The BL’s approach is to use two databases of authors to identify those who were still living in 1936 and to remove their work from the collection before scanning. That, coupled with a wide publicity to encourage any rights holders to step forward, may solve the problem.
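The cutoff arithmetic in the excerpt can be sketched in a few lines of Python. This is only an illustration, not the BL's actual procedure: it assumes the life-plus-70 term runs to the end of the 70th calendar year after the author's death, treats an unknown death year as "possibly still living in 1936," and uses hypothetical author records and function names.

```python
COPYRIGHT_TERM_YEARS = 70  # British/EU term: author's life plus 70 years
CUTOFF_YEAR = 1936         # year used in the BL's author check, per the excerpt


def first_public_domain_year(death_year):
    """First year in which the author's works are out of copyright.

    Assumption: protection lasts to the end of the 70th calendar year after death.
    """
    return death_year + COPYRIGHT_TERM_YEARS + 1


def keep_for_scanning(death_year):
    """Keep a title only if the author is known to have died before the cutoff.

    An unknown death year (None) is treated as "possibly still living in 1936".
    """
    return death_year is not None and death_year < CUTOFF_YEAR


# Hypothetical sample records: (author label, death year or None if unknown)
authors = [("Author A", 1870), ("Author B", 1935), ("Author C", 1940), ("Author D", None)]
kept = [name for name, died in authors if keep_for_scanning(died)]

print(kept)                            # ['Author A', 'Author B']
print(first_public_domain_year(1935))  # 2006
```

Under these assumptions, every author who passes the 1936 check was already out of copyright by 2006, which is why a single database lookup per author can stand in for item-by-item checking.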

Yale Will Work with Microsoft to Digitize 100,000 Books

The Yale University Library and Microsoft will work together to digitize 100,000 English-language out-of-copyright books, which will be made available via Microsoft’s Live Search Books.

Here’s an excerpt from the press release:

The Library and Microsoft have selected Kirtas Technologies to carry out the process based on their proven excellence and state-of-the-art equipment. The Library has successfully worked with Kirtas previously, and the company will establish a digitization center in the New Haven area. . . .

The project will maintain rigorous standards established by the Yale Library and Microsoft for the quality and usability of the digital content, and for the safe and careful handling of the physical books. Yale and Microsoft will work together to identify which of the approximately 13 million volumes held by Yale’s 22 libraries will be digitized. Books selected for digitization will remain available for use by students and researchers in their physical form. Digital copies of the books will also be preserved by the Yale Library for use in future academic initiatives and in collaborative scholarly ventures.

German Publishers Just Say No to Google Book Search: Libreka Launched at Frankfurt Book Fair

German publishers who want to retain control of their content have a new alternative to Google Book Search: Libreka, a full-text search engine that initially has about 8,000 books from publishers who opted in for inclusion. Searchers retrieve book titles and cover images, but no content.

Source: "German Publishers Offer Alternative to Google Books." Deutsche Welle, 11 October 2007.

LibraryFind 0.8.2 Released

The Oregon State University Libraries have released LibraryFind 0.8.2.

Here’s an excerpt from the CODE4LIB announcement:

LibraryFind is metasearch software written in Ruby-on-Rails. It allows libraries to provide a unified search solution to their users, letting library users search across both licensed collections and local collections. LibraryFind is open source software (licensed under the GPL), and is free to download and use. More information on LibraryFind can be found at http://libraryfind.org.

Amazon and Google E-Book Developments

Amazon is expected to release a wireless e-book reader called Kindle this October. It is anticipated to be priced in the $400-$500 range.

Also in the fall, Google is expected to offer paid access to the complete contents of digital books, with pricing to be determined by publishers.

Source: Stone, Brad. "Are Books Passé? Envisioning the Next Chapter for Electronic Books." The New York Times, 6 September 2007, C1, C9.

AltLaw.org Launch

The Columbia Law School and the University of Colorado Law School have launched AltLaw.org.

Here's a quote from the press release:

AltLaw.org contains nearly 170,000 decisions dating back to the early 1990s from the U.S. Supreme Court and Federal Appellate courts. The site’s creators, Columbia Law School’s Timothy Wu and Stuart Sierra, and University of Colorado Law School’s Paul Ohm, said the site’s database will grow over time. . . .

Wu said he envisions AltLaw.org being used by many groups—journalists, the public, lawyers who want to avoid the hundreds of dollars per hour in fees for proprietary law databases, and legal scholars who need quick and searchable access to cases at home or on the road. One of the assets to AltLaw.org’s design is that it is fast and simple to use, Wu said.

Ohm wrote the thousands of lines of code that download cases to AltLaw.org from more than a dozen court websites each night. He said the data comes from the courts themselves, and AltLaw.org is designed as an extremely open platform so that others can take the raw material and use it in various ways.

"This is what we call the 'law commons' part of the design," Ohm said. "The touchstone of AltLaw.org is openness, and this means that not only will users be able to search cases, but they'll also be able to make copies of all of the cases in our database to reuse or remix in any way that they'd like."

Google Scholar Digitization Program

According to the article "Changes at Google Scholar: A Conversation with Anurag Acharya," Google Scholar has begun a small-scale, targeted journal digitization effort.

Here's a quote from the article:

Representing another effort to reach currently inaccessible content, Google Scholar now has its own digitization program. “It’s a small program,” said Acharya. “We mainly look for journals that would otherwise never get digitized. Under our proposal, we will digitize and host journal articles with the provision that they must be openly reachable in collaboration with publishers, fully downloadable, and fully readable. Once you get out of the U.S. and Western European space into the rest of the world, the opportunities to get and digitize research are very limited. They are often grateful for the help. It gives us the opportunity to get that country’s material or make that scholarly society more visible.”

Source: Quint, Barbara. "Changes at Google Scholar: A Conversation with Anurag Acharya." NewsBreaks, 27 August 2007.

Welcome to the DRM Zone: Case in Point, the Google Video Store

If you have ever purchased or rented a video from the Google Video Store, it will cease to function on August 15, 2007. That's because the Google Video Store is being shut down and, along with it, Google's associated DRM system.

Customers will get credits in Google Checkout for what they spent on Google Video Store products, but not cash refunds, meaning that they must buy merchandise available via that service to recoup their losses. Of course, this does not compensate purchasers for the inconvenience of having to replace their videos (assuming that they can).

This fiasco underlines a key problem with DRM: it doesn't just restrict access, it restricts access using proprietary technologies, and, with few exceptions, those technologies cannot be legally circumvented under U.S. law.

Source: Fisher, Ken. "Google Selleth Then Taketh Away, Proving the Need for DRM Circumvention." Ars Technica, 12 August 2007.

TableSeer: Searching and Ranking PDF Table Data

Researchers at Penn State's College of Information Sciences and Technology's Cyber-Infrastructure Lab have developed open source software called TableSeer that can find, extract, search, and rank table data from PDF files. Source code will be available at the project's close.

Here's an extract from the press release:

Tables are an important data resource for researchers. In a search of 10,000 documents from journals and conferences, the researchers found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.

But while some software can identify and extract tables from text, existing software cannot search for tables across documents. That means scientists and scholars must manually browse documents in order to find tables, a time-consuming and cumbersome process.

TableSeer automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table.

In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats. . . .

Information on TableSeer can be found in a paper, "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries," by Ying Liu, Kun Bai, Prasenjit Mitra, and C. Lee Giles of the Penn State College of Information Sciences and Technology.
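The column-name-based search mentioned in the excerpt can be illustrated with a toy Python sketch. This is not TableSeer's code or API, which had not been released at the time of the post. The `TableRecord` type, the `search_by_column` function, and the sample records are hypothetical, and the sketch does simple substring matching without any ranking.

```python
from dataclasses import dataclass


@dataclass
class TableRecord:
    """Hypothetical metadata extracted from one table in a PDF."""
    document: str
    title: str
    columns: list
    footnote: str = ""


def search_by_column(records, column_query):
    """Return the tables that have a column whose name contains the query."""
    q = column_query.strip().lower()
    return [r for r in records if any(q in c.lower() for c in r.columns)]


# Hypothetical sample index of extracted tables
records = [
    TableRecord("paper-1.pdf", "Table 1. Reaction yields",
                ["Compound", "Yield (%)", "Temperature (K)"]),
    TableRecord("paper-2.pdf", "Table 3. Dataset sizes",
                ["Dataset", "Documents", "Tables"]),
]

hits = search_by_column(records, "yield")
print([r.document for r in hits])  # ['paper-1.pdf']
```

A real system would also need to index table titles and footnotes and to rank the matches, as the press release describes.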

Cornell Joins Google Books Library Project

The Cornell University Library has joined the Google Books Library Project.

Here's an excerpt from the press release:

Google will digitize up to 500,000 works from Cornell University Library and make them available online using Google Book Search. As a result, materials from the library’s exceptional collections will be easily accessible to students, scholars and people worldwide, supporting the library’s long-standing commitment to make its collections broadly available.

“Research libraries today are integral partners in the academic enterprise through their support of research, teaching and learning. They also serve a public good by enhancing access to the works of the world's best minds,” said Interim University Librarian Anne R. Kenney. “As a major research library, Cornell University Library is pleased to join its peer institutions in this partnership with Google. The outcome of this relationship is a significant reduction in the time and effort associated with providing scholarly full-text resources online.”

Materials from Mann Library, one of 20 member libraries that comprise Cornell University Library, will be digitized as part of the agreement. Mann’s collections include some of the following subject areas: biological sciences, natural resources, plant, animal and environmental sciences, applied economics, management and public policy, human development, textiles and apparel, nutrition and food science. . . .

Cornell is the 27th institution to join the Google Book Search Library Project, which digitizes books from major libraries and makes it possible for Internet users to search their collections online. Over the next six years, Cornell will provide Google with public domain and copyrighted holdings from its collections. If a work has no copyright restrictions, the full text will be available for online viewing. For books protected by copyright, users will just get the basic background (such as the book’s title and the author’s name), at most a few lines of text related to their search and information about where they can buy or borrow a book. Cornell University Library will work with Google to choose materials that complement the contributions of the project’s other partners. In addition to making the materials available through its online search service, Google will also provide Cornell with a digital copy of all the materials scanned, which will eventually be incorporated into the university’s own digital library.

Towards Telesophy: Federating All the World’s Knowledge

A video is now available of Bruce Schatz, Director of the CANIS (Community Architectures for Network Information Systems) Laboratory at the University of Illinois at Urbana-Champaign, delivering a speech at Google on July 11th titled Towards Telesophy: Federating All the World’s Knowledge.

Here’s an excerpt from the presentation’s abstract:

Central archives partially survived the transition from a million repositories to a billion, but distributed indexing is necessary to scale to a trillion repositories in the next generation. Supporting scalable semantics requires divide-and-conquer to capture local context as an approximation to global meaning. Concept switches in the Interspace are the analogue of packet switches in the Internet, since user interaction is at the level of logical spaces rather than physical networks. This talk will describe the research technologies and trends creating the global infrastructure, with suggestions for hero experiments and hints at the new world of the near future.

VuFind 0.5 Beta Released

Villanova University's Falvey Memorial Library has released VuFind 0.5 Beta. This open-source software operates in conjunction with Voyager OPACs (more drivers being developed), and it is powered by Solr.

Here's an excerpt from the project's home page:

VuFind is a library resource portal designed and developed for libraries by libraries. The goal of VuFind is to enable your users to search and browse through all of your library's resources by replacing the traditional OPAC to include:

  • Catalog Records
  • Digital Library Items
  • Institutional Repository
  • Institutional Bibliography
  • Other Library Collections and Resources

VuFind is completely modular so you can implement just the basic system, or all of the components. And since it's open source, you can modify the modules to best fit your need or you can add new modules to extend your resource offerings.

CIC’s Digitization Contract with Google

Library Journal Academic Newswire has published a must-read article ("Questions Emerge as Terms of the CIC/Google Deal Become Public") about the Committee on Institutional Cooperation’s Google Book Search Library Project contract.

The article includes quotes from Peter Brantley, Digital Library Federation Executive Director, from his "Monetizing Libraries" posting about the contract (another must-read piece).

Here’s an excerpt from Brantley’s posting:

In other words—pretty much, unless Google ceases business operations, or there is a legal ruling or agreement with publishers that expressly permits these institutions (excepting Michigan and Wisconsin which have contracts of precedence) to receive digitized copies of In-Copyright material, it will be held in escrow until such time as it becomes public domain.

That could be a long wait. . . .

In an article early this year in The New Yorker, "Google’s Moon Shot," Jeffrey Toobin discusses possible outcomes of the antagonism this project has generated between Google and publishers. Paramount among them, in his mind, is a settlement. . . .

A settlement between Google and publishers would create a barrier to entry in part because the current litigation would not be resolved through court decision; any new entrant would be faced with the unresolved legal issues and required to re-enter the settlement process on their own terms. That, beyond the costs of mass digitization itself, is likely to deter almost any other actor in the market.

Google Library Project Adds Committee on Institutional Cooperation (CIC)

The Google Book Search Library Project has an important new participant—the Committee on Institutional Cooperation (CIC). The CIC members are the University of Chicago, the University of Illinois, Indiana University, the University of Iowa, the University of Michigan, Michigan State University, the University of Minnesota, Northwestern University, Ohio State University, Pennsylvania State University, Purdue University, and the University of Wisconsin-Madison. As many as 10 million volumes will be digitized from the collections of these major research libraries.

Here’s an excerpt from the CIC press release:

This partnership between our 12 member universities and Google is unprecedented. What makes this work so exciting is that we will literally open the pages of millions of books that have been assembled on our library shelves over more than a century. In literally seconds, we’ll be able to browse across the content of thousands of volumes, searching for words or phrases, and making links across those texts that would have taken weeks or months or years of dedicated and scrupulous analysis. It is an extraordinary effort, blending the efforts and aspirations of librarians, university administrators, and scholars from across 12 world-class research universities. And our corporate partner possesses unparalleled expertise in creating and opening the digital world to coherent and comprehensive searching.

The effort is not entirely without controversy—no great undertaking ever is. But our universities believe strongly in the power of information to change the world, and in preserving, protecting and extending access to information. We have carefully weighed and considered the intellectual property issues and believe that our effort is firmly within the guidelines of current copyright law, while providing some flexibility as those laws are tested in the new digital environment in the coming years.

Lawsuit Aside, McGraw-Hill Uses Google Book Search

According to an article in Network World, McGraw-Hill uses Google Book Search on its Web site in spite of the fact that it is suing Google over the product.

How can this be? McGraw-Hill participates in the Google Book Search Partner Program, which gives publishers control over access to their digitized books, but, at the same time, it objects to Google’s efforts to scan and make available copies of its books in libraries without its permission.

Source: Perez, Juan Carlos. "Google’s Book Search Available in Publisher Sites." Network World, 1 June 2007.

Stanford’s President and the Google Book Search Library Project

The Wall Street Journal ran a lengthy article about the personal finances of John L. Hennessy, president of Stanford University, today ("The Golden Touch of Stanford’s President"). It kicks off by noting that Hennessy made $1 million last November that didn’t come from Stanford.

The last seven paragraphs are of interest, since they discuss Stanford’s relationship to the Google Book Search Library Project. The Executive Director of the Authors Guild says that Hennessy’s Google holdings are a "great concern" and there "seems to be both a personal and institutional profit motive here." The Stanford general counsel indicates that Hennessy was not part of discussions about the Google Book Search Library Project. Another issue is Google’s $2 million gift to the Stanford Law School’s Center for Internet and Society to promote fair use. Lawrence Lessig denies that the Google gift had any "quid pro quo" implications, and the former Law School Dean indicates that Hennessy had no part in the Google gift.

Concerns about potential conflict of interest may be fueled by Hennessy’s $11 million gains from sale of Google stock and use of stock options, his current Google stock holdings valued at $2.3 million, and his Google stock options that may be worth $15.8 million if exercised.

Source: Hechinger, John, and Rebecca Buckman. "The Golden Touch of Stanford’s President." The Wall Street Journal, 24 February 2007, A1, A10.