OAIster Hits 10,000,000 Records

Excerpt from the press release:

We live in an information-driven world—one in which access to good information defines success. OAIster’s growth to 10 million records takes us one step closer to that goal.

Developed at the University of Michigan’s Library, OAIster is a collection of digital scholarly resources. OAIster is also a service that continually gathers these digital resources to remain complete and fresh. As global digital repositories grow, so do OAIster’s holdings.

Popular search engines don’t have the holdings OAIster does. They crawl web pages and index the words on those pages. It’s an outstanding technique for fast, broad information from public websites. But scholarly information, the kind researchers use to enrich their work, is generally hidden from these search engines.

OAIster retrieves these otherwise elusive resources by tapping directly into the collections of a variety of institutions using harvesting technology based on the Open Archives Initiative (OAI) Protocol for Metadata Harvesting. These can be images, academic papers, movies and audio files, technical reports, books, as well as preprints (unpublished works that have not yet been peer reviewed). By aggregating these resources, OAIster makes it possible to search across all of them and return the results of a thorough investigation of complete, up-to-date resources. . . .

OAIster is good news for the digital archives that contribute material to open-access repositories. "[OAIster has demonstrated that]. . . OAI interoperability can scale. This is good news for the technology, since the proliferation is bound to continue and even accelerate," says Peter Suber, author of the SPARC Open Access Newsletter. As open-access repositories proliferate, they will be supported by a single, well-managed, comprehensive, and useful tool.

Scholars will find that searching in OAIster can provide better results than searching in web search engines. Roy Tennant, User Services Architect at the California Digital Library, offers an example: "In OAIster I searched ‘roma’ and ‘world war,’ then sorted by weighted relevance. The first hit nailed my topic—the persecution of the Roma in World War II. Trying ‘roma world war’ in Google fails miserably because Google apparently searches ‘Rome’ as well as ‘Roma.’ The ranking then makes anything about the Roma people drop significantly, and there is nothing in the first few screens of results that includes the word in the title, unlike the OAIster hit."

OAIster currently harvests 730 repositories from 49 countries on 6 continents. In three years, it has more than quadrupled in size and increased from 6.2 million to 10 million in the past year. OAIster is a project of the University of Michigan Digital Library Production Service.
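
As a rough illustration of the harvesting the press release describes, here is a minimal Python sketch of an OAI-PMH ListRecords exchange. The repository address and the sample response are hypothetical, and this is not OAIster's actual harvester; it only shows the general shape of a request and how a harvester might read titles and the resumption token from a Dublin Core response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used in OAI-PMH responses that carry Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # In the protocol, resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (titles, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.findall(".//dc:title", NS)]
    token_el = root.find(".//oai:resumptionToken", NS)
    # An empty or missing token means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token


# Hypothetical response, trimmed to the elements the parser looks at.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example e-print</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical repository address, used only to show the request shape.
print(list_records_url("https://repository.example.org/oai"))
print(parse_list_records(SAMPLE_RESPONSE))
```

A real harvester would fetch each URL over HTTP and keep requesting with the returned token until none comes back; that loop is omitted here so the sketch runs without a network connection.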

SEPW and SEPB Now Searchable Using a Google Custom Search Engine

The Scholarly Electronic Publishing Weblog is now searchable using a Google Custom Search Engine. The new search box is near the bottom of the Weblog’s home page.

The Scholarly Electronic Publishing Bibliography is also now searchable using a Google Custom Search Engine, which will be incorporated into a future version of SEPB. Only the bibliography sections of the document are searchable using this method (e.g., SEPW and SEPR are excluded).

Keep in mind that a search retrieves the titles of bibliography section files or Weblog archive files, with a single representative search result shown for each file. To see all hits, click on the cached page, which highlights the retrieved search term(s) in yellow.

For those who might be interested in including these Google Custom Search Engines in their Web pages, see "Code for Bailey’s Google Search Engines."

New OA Google Custom Search Engines

I’ve enhanced Open Access Update with four new Google Custom Search Engines:

  1. Open Access Mailing Lists (these are lists that have general discussion of OA topics)
  2. Open Access Serials
  3. Open Access Weblogs
  4. Open Access Wikis

The indexed works contain significant information about open access topics and are freely available.

See Open Access Update for details about the included works.

A Simple Search Hit Comparison for Google Scholar, OAIster, and Windows Live Academic Search

Given that Windows Live Academic Search’s content is limited to computer science, electrical engineering, and physics journals and conferences, a direct comparison of it with other search engines is somewhat difficult.

Although its limitations should be clearly recognized, the following simple experiment in comparing the number of hits for Google Scholar, OAIster (a search engine that indexes open access literature, such as e-prints), and Windows Live Academic Search may help to shed some light on their differences. (Note that OAIster does not typically include content directly provided by commercial publishers, although it does include e-prints for a large number of articles published in academic journals.)

The search is for: "OAI-PMH" (entered without quotes).

"OAI-PMH" is, of course, the Open Archives Initiative Protocol for Metadata Harvesting. This is a highly specific search, where many, but not all, hits should fall within the subjects covered by Windows Live Academic Search. A major area that might not be covered is library and information science literature.

To get a better feel for the baseline published literature about OAI-PMH, let’s first do some searching for that term in specialized commercial databases.

  • ACM Digital Library (description): 51 hits.
  • Engineering Village 2 (description): 66 hits.
  • Information Science & Technology Abstracts (description): 36 hits.
  • Library Literature & Information Science Index/Full Text (description): 13 hits.

Now, the search engines in question (the links for the below search engine names are for the search, not the search engine):

  • Google Scholar: 542 hits.
  • OAIster: 180 hits.
  • Windows Live Academic Search: 74 hits.

So, what have we learned? Windows Live Academic Search has a somewhat higher number of hits than the selected commercial databases and, if adjusted downward for publisher versions only (see below), is on the high end. This suggests that it covers the toll-based published literature very well. However, it has a significantly lower number of hits than OAIster and Google Scholar, suggesting that its coverage of open access literature may be weaker than Google Scholar’s and is quite likely weaker than OAIster’s.

Of the 74 hits for the "OAI-PMH" search in Windows Live Academic Search, 54 (73%) were "published versions" (i.e., publisher-supplied works); 20 (27%) were not (i.e., e-prints). Scanning the "Results by Institution" sidebar, it appears that 100% of OAIster’s 180 hits were from open access sources; I didn’t check them all. I didn’t try to break down the 542-hit Google Scholar search result, which has a mix of toll-based and open access materials, although it would be quite interesting to do so. It should be clear that a sample of one search term is a very crude measure (and that this posting won’t grace the pages of JASIST anytime soon).
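
The arithmetic behind these figures is easy to check. The short Python sketch below uses only the hit counts reported in this posting; the variable and function names are just for illustration.

```python
# Hit counts for the "OAI-PMH" search, as reported in this posting.
hits = {
    "Google Scholar": 542,
    "OAIster": 180,
    "Windows Live Academic Search": 74,
}
published_versions = 54  # publisher-supplied works in Windows Live Academic Search
eprints = 20             # everything else in that result set

wlas_total = hits["Windows Live Academic Search"]


def percent(part, whole):
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)


print(percent(published_versions, wlas_total))  # 73
print(percent(eprints, wlas_total))             # 27

# How each result set compares in size with Windows Live Academic Search.
for name, count in hits.items():
    print(f"{name}: {count} hits, {count / wlas_total:.1f} times the Windows Live count")
```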

Of course, this simple experiment tells us nothing about the presence of duplicate entries for the same work in search result sets, which could be important for a meaningful open access comparison. Consider, for example, this group of 11 hits for "A Scalable Architecture for Harvest-Based Digital Libraries—The ODU/Southampton Experiments" from the Google "OAI-PMH" search.

Nor does it tell us the number of items that are not journal articles (or e-prints for them) or conference papers.

An apples-to-apples comparison would adjust for useless duplicates and non-journal/conference literature. (But, of course, it would be quite useful if Windows Live Academic Search had non-journal/conference literature such as technical reports in it.)

However, given the small hit sets, it would not be impossible for someone else to do a deeper analysis on the duplicate entry question and some other tractable questions.
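
For anyone inclined to try, one plausible starting point for the duplicate question is to group hits by a normalized form of the title. The Python sketch below assumes the hit titles have already been collected into a list; the sample list is made up for illustration (two variant spellings of the title mentioned above plus one invented title), and real result sets would likely need fuzzier matching than this.

```python
import re
import unicodedata
from collections import Counter


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def count_by_title(titles):
    """Map each normalized title to the number of hits that share it."""
    return Counter(normalize_title(t) for t in titles)


# Hypothetical hit list for illustration only.
sample_hits = [
    "A Scalable Architecture for Harvest-Based Digital Libraries: The ODU/Southampton Experiments",
    "A scalable architecture for harvest-based digital libraries - the ODU/Southampton experiments",
    "Another Hypothetical Paper about OAI-PMH",
]

counts = count_by_title(sample_hits)
print(f"{len(sample_hits)} hits, {len(counts)} distinct titles")
for title, n in counts.items():
    if n > 1:
        print(f"{n} hits share the title: {title}")
```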

Windows Live Academic Search Is Up

The beta version of Windows Live Academic Search is now available: http://academic.live.com/. It appears to me that the system is under a very heavy load, so you may want to wait a bit before giving it a test drive.

The Windows Live Academic Search development team now has a Weblog. The official press release is now available. A list of participants in Microsoft’s MSN Search Champs V4, some of whom gave Microsoft detailed feedback about Windows Live Academic Search, is also available.

Windows Live Academic Search Overview

The home page provides a search box, brief overview, an explanation of search results, and an FAQ.

The system’s indexed content is limited to Computer Science, Electrical Engineering, and Physics journals and conferences, including: "6 million records from approximately 4300 journals and 2000 conferences." Here is a list of works indexed. Relevance is determined by: (1) "quality of match of the search term with the content of the paper", and (2) "authoritativeness of the paper." Citation count is not being used in the ranking algorithm at this time.

Interface

The interface has the following key features:

  • Search box: My understanding is that all MSN search commands work, but I have not tested this.
  • Slider bar: Expands or restricts the amount of information shown for each hit in the search results (left side of screen).
  • Search results:
    • Author and title information in hits are linked (blue links).
    • Other hit links include search the Web for the item, CiteSeer citations (if available), show abstract, and hide abstract (grey links).
  • Sort by: You can sort search results by relevance, date (oldest), date (newest), author, journal, and conference. The last three sorts provide a header that precedes the listed search results: for example, John Doe (2).
  • "+add to Live.com": Adds the search to your Windows Live page. Three clickable buttons appear above the stored search on that page: Web, News, Feeds. When one is clicked, the search is repeated in the appropriate information source (e.g., RSS feeds).
  • Preview pane: The right side of the screen is used to display the fielded abstract, BibTeX formatted abstract, or EndNote formatted abstract.

Highlights from the Home Page FAQ

  • More content?: "We are not ready to provide a detailed timeline on when we will have a comprehensive index by subject."
  • OpenURL: "You will be able to click on the link to your library Open URL resolver to determine the availability of full text access."
  • Preferences: "While the version of Academic search does not have a preference page, future versions will have that functionality."

Highlights from Windows Live Academic Information: Librarians

  • OpenURL: "If Academic search can identify that a user is affiliated with your institution, appropriate search results will be accompanied by a link to your OpenURL resolver vendor. We request that you work with your link resolver company and give them permission to provide the necessary information about your institution to us."
  • RSS feeds for searches: "When a new article related to that search is posted, they [researchers] are alerted instantly via an RSS feed."
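
The documentation quoted above does not say how Academic Search constructs its resolver links. As a general illustration only, the Python sketch below builds an OpenURL 1.0 (Z39.88-2004) link in key/encoded-value form for a journal article. The resolver address is hypothetical, and the article details come from the Lavoie, Connaway, and Dempsey entry in the Google Print bibliography on this Weblog.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def journal_openurl(resolver_base, atitle, jtitle, aulast, date,
                    volume=None, issue=None):
    """Build an OpenURL 1.0 link (key/encoded-value form) for an article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.atitle": atitle,
        "rft.jtitle": jtitle,
        "rft.aulast": aulast,
        "rft.date": date,
    }
    if volume:
        params["rft.volume"] = volume
    if issue:
        params["rft.issue"] = issue
    return resolver_base + "?" + urlencode(params)


# Hypothetical institutional resolver address.
RESOLVER = "https://resolver.example.edu/openurl"

link = journal_openurl(
    RESOLVER,
    atitle="Anatomy of Aggregate Collections: The Example of Google Print for Libraries",
    jtitle="D-Lib Magazine",
    aulast="Lavoie",
    date="2005",
    volume="11",
    issue="9",
)
print(link)

# The resolver reads the citation back out of the query string.
print(parse_qs(urlsplit(link).query)["rft.jtitle"])
```

The resolver, not the search engine, decides what to do with the citation, which is why the same link can lead to different full-text sources at different institutions.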

Highlights from Windows Live Academic Information: Publishers

  • Participation: "Talk to Crossref about the Crossref/Academic search partnership to receive information on the program and instructions on how to initiate participation."
  • Search Results: "Therefore, search results from journals that are indexed from publishers and are published articles will always be marked ‘Published Version’. This ensures that users know which result is the official version. If there are many instances of the same article (from other sources such as a web-crawl), we will always link the search result heading to the version on the publisher’s site."
  • Abstract information: "We require that non-subscribers at least see an abstract of the paper when they land on your site from our search results page." However, it also says:

    "Academic search provides for three levels of display in the preview pane:

    • Full abstract
    • First 140 characters from abstract
    • Nothing from the abstract

    Publishers can choose any of these options for their content."

Microsoft’s Windows Live Academic Search

Microsoft will be releasing Windows Live Academic Search shortly (I was recently told Wednesday; the blog buzz is saying tomorrow).

As is typical with such software projects, the team is doing some last minute tweaking before release. So, I won’t try to describe the system in any detail at this point, except to say that it integrates access to published articles with e-prints and other open access materials, it provides a reference export capability, there’s a cool optional two-pane view (short bibliographic information on the left; full bibliographic information and abstract on the right), and it supports search "macros" (user-written search programs).

What I will say is this: Microsoft made a real effort to get significant, honest input from the librarian and publisher communities during the development process. I know, because, now that the nondisclosure agreement has been lifted, I can say that I was one of the librarians who provided such input on an unpaid basis. I was very impressed by how carefully the development team listened to what we had to say, how sharp and energetic they were, how they really got the Web 2.0 concept, and how deeply committed they were to creating the best product possible. Having read Microserfs, I had a very different mental picture of Microsoft than the reality I encountered.

Needless to say, there were lively exchanges of views between librarians and publishers when open access issues were touched upon. My impression is that the team listened to both sides and tried to find the happy middle ground.

When it’s released, Windows Live Academic Search won’t be the perfect answer to your open access search engine dreams (what system is?), and Microsoft knows that there are missing pieces. But I think it will give Google Scholar a run for its money. I, for one, heartily welcome it, and I think it’s a good base to build upon, especially if Microsoft continues to solicit and seriously consider candid feedback from the library and publisher communities (and it appears that it will).

An Important Partial Win for Google and Privacy

U.S. District Court Judge James Ware ruled on Friday that Google does not have to turn over 5,000 search queries to the Justice Department; however, it does have to turn over 50,000 random Web URLs.

The Google Blog posting ("Google Wins!") was ecstatic, stating that:

This is a victory for both online rights activists and users of Google. Google may not always be perfect, but this time they stood up for what is right.

According to an article in Red Herring ("Judge Limits US Data Hunt"):

The government’s subpoena originally told Google it must turn over massive amounts of data in two broad categories: all the URLs available on the company’s search engine as of last July 31, and all search queries entered into Google’s search engine during June and July of 2005. That likely would have included tens of millions of data points.

A San Francisco Chronicle article ("Google Must Divulge Data Judge Cuts Amount of Info Company Has to Give Feds") noted that:

Google, along with privacy advocates, argued that sometimes users can reveal personal information in search queries, including their Social Security Numbers. Or they can suggest the sexual preferences of public officials or use inflammatory phrases such as "bomb-making equipment," which would pique the interest of law enforcement. The privacy advocates said that the Justice Department couldn’t be trusted with access to such sensitive data, despite the administration’s promises to use the queries only for its online pornography case.

Judge Ware expressed concern about the impact of search-term disclosure on Google due to user privacy issues:

The expectation of privacy by some Google users may not be reasonable, but may nonetheless have an appreciable impact on the way in which Google is perceived, and consequently the frequency with which users use Google. Such an expectation does not rise to the level of privilege, but does indicate that there is a potential burden as to Google’s loss of goodwill if Google is forced to disclose search queries to the government.

The Google Print Controversy: A Bibliography

Update: See the Google Book Search Bibliography, Version 2 for the latest bibliography.

This bibliography presents selected English-language electronic works about Google Print that are freely available on the Internet. It has a special focus on the legal issues associated with this project. Page numbers for print/electronic publications are not included unless they are mentioned in the electronic version.

Association of American Publishers. "Google Library Project Raises Serious Questions for Publishers and Authors."

Association of Learned and Professional Society Publishers. "Google Print for Libraries—ALPSP Position Statement."

Authors Guild. "Authors Guild Sues Google, Citing 'Massive Copyright Infringement'."

Band, Jonathan. "The Google Print Library Project: A Copyright Analysis." ARL: A Bimonthly Report on Research Library Issues and Actions from ARL, CNI, and SPARC, no. 242 (2005): 6-9.

Banks, Marcus A. "The Excitement of Google Scholar, the Worry of Google Print." Biomedical Digital Libraries 2 (Article 2 2005).

Battelle, John. "The AAP/Google Lawsuit: Much More At Stake." John Battelle's Searchblog, 20 October 2005.

Blankenhorn, Dana. "Economic Lesson of Google Print." Moore's Lore, 21 October 2005.

Chafkin, Max. "Google Scrambles to Defend 'Google Print for Libraries' Initiative." The Book Standard, 21 October 2005.

Coleman, Mary Sue. "Riches We Must Share . . ." The Washington Post, 22 October 2005, A21.

Crawford, Susan. "Why Google Is Right." Susan Crawford Blog, 21 September 2005.

Drummond, David. "Why We Believe in Google Print." Google Blog, 19 October 2005.

DW staff. "German Publishers Warm to Google Library." Deutsche Welle, 20 October 2005.

Felten, Edward W. "Google Print, Damages and Incentives." Freedom to Tinker, 23 September 2005.

Finkelstein, Seth. "Google Print Is Not Copyright's Enemy-Of-My-Enemy-Is-My-Friend." Infothought, 23 September 2005.

Google. "Google Checks Out Library Books."

———. "Google Print."

———. "Information for Publishers about the Library Project."

Google, and University Library, University of Michigan. "Cooperative Agreement."

Graham, Jefferson. "Google Print Project Inspires Fans, Fears." USA Today, 17 October 2005.

Helm, Burt. "For Google, Another Stormy Chapter." BusinessWeek, 22 September 2005.

———. "A Google Project Pains Publishers." BusinessWeek, 23 May 2005.

———. "Google's Escalating Book Battle." BusinessWeek, 20 October 2005.

———. "Google's Plan Doesn't Scan." BusinessWeek, 12 August 2005.

———. "A New Page in Google's Books Fight." BusinessWeek, 22 June 2005.

Hof, Rob. "Lawsuit Against Google Print: The End of the Internet?" The Tech Beat, 21 October 2005.

Keegan, Victor. "A Bookworm's Delight." The Guardian, 21 October 2005.

Lavoie, Brian, Lynn Silipigni Connaway, and Lorcan Dempsey. "Anatomy of Aggregate Collections: The Example of Google Print for Libraries." D-Lib Magazine 11, no. 9 (2005).

Lessig, Lawrence. "Google Sued." Lessig Blog, 22 September 2005.

Marco, Meghann. "So, My Publisher Is Sueing Google. . ." MeghannMarco.com, 19 October 2005.

Markoff, John, and Edward Wyatt. "Google Is Adding Major Libraries to Its Database." The New York Times, 14 December 2004.

Mathes, Adam. "The Point of Google Print." Google Blog, 19 October 2005.

O'Reilly, Tim. "Google Library vs. Publishers." O'Reilly Radar, 13 August 2005.

Patry, William. "Google Revisited." The Patry Copyright Blog, 23 September 2005.

———. "Google, the Second Suit and Second Copy." The Patry Copyright Blog, 21 October 2005.

Petit, C. E. "Author's Guild v. Google: A Skeptical Analysis." Scrivener's Error: Warped Weft, 2005.

Pickering, Bobby. "Google Clarifies Print Differences in Europe." Information World Review, 18 October 2005.

Quilter, Laura. "Google & Not-for-Profit Libraries." Derivative Work, 13 August 2005.

Quint, Barbara. "CORRECTIONS: Google Print Not All I Said It Was." Information Today NewsBreaks & the Weekly News Digest, 29 August 2005.

———. "Google and Research Libraries Launch Massive Digitization Project." Information Today NewsBreaks & the Weekly News Digest, 20 December 2004.

———. "Google Library Project Hit by Copyright Challenge from University Presses." Information Today NewsBreaks & the Weekly News Digest, 31 May 2005.

———. "Google Slows Library Project to Accommodate Publishers." Information Today NewsBreaks & the Weekly News Digest, 15 August 2005.

———. "Google's Library Project: Questions, Questions, Questions." Information Today NewsBreaks & the Weekly News Digest, 27 December 2004.

———. "The Other Shoe Drops: Google Print Sued for Copyright Violation." Information Today NewsBreaks & the Weekly News Digest, 3 October 2005.

Raff, Andrew. "Google, Publishers, Copies and 'Being Evil'." IPTAblog, 21 September 2005.

Slater, Derek. "Google Print Commentary Round-Up." A Copyfighter's Musings, 20 October 2005.

Smith, Adam M. "Making Books Easier to Find." Google Blog, 11 August 2005.

Suber, Peter. "Does Google Library Violate Copyright?" SPARC Open Access Newsletter, no. 90 (2005).

Sullivan, Danny. "Forget Google Print Copyright Infringement; Search Engines Already Infringe." Search Engine Watch, 25 May 2005.

———. "Indexing Versus Caching & How Google Print Doesn't Reprint." Search Engine Watch, 21 October 2005.

Taylor, Nick. ". . . But Not at Writers' Expense." The Washington Post, 22 October 2005, A21.

Thompson, Bill. "Defending Google's Licence to Print." BBC News, 10 October 2005.

University Library, University of Michigan. "UM Library/Google Digitization Partnership FAQ, August 2005."

Vaidhyanathan, Siva. "Google Avoids Copyright Meltdown." SIVACRACY.NET: Opinions, Rants, and Obsessions of Siva Vaidhyanathan and his Friends and Family, 12 August 2005.

———. "On the Essense of Libraries and Fair Use." SIVACRACY.NET: Opinions, Rants, and Obsessions of Siva Vaidhyanathan and his Friends and Family, 18 August 2005.

———. "'Steal This Book'." On the Media, 30 September 2005.

———. "Why I Think Google's Library Plan was Out of Bounds." SIVACRACY.NET: Opinions, Rants, and Obsessions of Siva Vaidhyanathan and his Friends and Family, 13 August 2005.

von Lohmann, Fred. "Authors Guild Sues Google." Deep Links, 20 September 2005.

Wentworth, Donna. "Google Print Is as Google Print Does." Copyfight, 15 August 2005.

Wilkin, John P., and Reginald Carr. "Google's Library Digitization Project: Reports from Michigan and Oxford."

Wojcicki, Susan. "Google Print and the Authors Guild." Google Blog, 20 September 2005.

Wu, Tim. "Leggo My Ego." Slate, 17 October 2005.

Wyatt, Edward. "Google Opens 8 Sites in Europe, Widening Its Book Search Effort." The New York Times, 18 October 2005.

Selected by Librarians' Index to the Internet

Google Print Controversy Heats Up

Lots of ink (real and virtual) on Google Print and the AAUP’s recent resistance (all from Open Access News):

"Forget Google Print Copyright Infringement; Search Engines Already Infringe," SearchEngineWatch

"From Gutenberg to Google: Five Views on the Search-Engine Company’s Project to Digitize Library Books," The Chronicle of Higher Education (requires subscription)

"Google Books under Fire," The Register

"Google Library Project Hit by Copyright Challenge from University Presses," Information Today Newsbreaks

"Google Print Goes Live," InternetNews

"A Google Project Pains Publishers," Business Week

"Google This: ‘Copyright Law,’" Business Week

"Google’s Scan Plan Hits More Bumps," Forbes

"Publishers Lay into Google Print," ZDNet UK

"The University Press Assn.’s Objections," Business Week

"University-Press Group Raises Questions About Google’s Library-Scanning Project," The Chronicle of Higher Education