Open Source Software – Page 12

TableSeer: Searching and Ranking PDF Table Data

Researchers at Penn State's College of Information Sciences and Technology's Cyber-Infrastructure Lab have developed open source software called TableSeer that can find, extract, search, and rank table data from PDF files. Source code will be available at the project's close.

Here's an extract from the press release:

Tables are an important data resource for researchers. In a search of 10,000 documents from journals and conferences, the researchers found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.

But while some software can identify and extract tables from text, existing software cannot search for tables across documents. That means scientists and scholars must manually browse documents in order to find tables, a time-consuming and cumbersome process.

TableSeer automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table.

In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats. . . .

Information on TableSeer can be found in a paper, "TableSeer: Automatic Table Metadata Extraction and Searching in Digital Libraries," by Ying Liu, Kun Bai, Prasenjit Mitra, and C. Lee Giles of the Penn State College of Information Sciences and Technology.
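To make the column-name-based search idea concrete, here is a minimal sketch. It assumes the tables have already been extracted into simple records; the `ExtractedTable` class, its fields, and the sample data are hypothetical and are not taken from TableSeer itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractedTable:
    """Hypothetical record for one table pulled out of a PDF."""
    document: str
    title: str
    columns: List[str]
    footnotes: List[str] = field(default_factory=list)


def search_by_column(tables, query):
    """Return tables with a column name containing the query (case-insensitive)."""
    needle = query.strip().lower()
    return [t for t in tables if any(needle in c.lower() for c in t.columns)]


# Made-up sample data, for illustration only.
tables = [
    ExtractedTable("paper-a.pdf", "Table 1. Melting points",
                   ["Compound", "Melting point (K)"]),
    ExtractedTable("paper-b.pdf", "Table 2. Yields",
                   ["Catalyst", "Yield (%)"]),
]

matches = search_by_column(tables, "yield")
print([t.document for t in matches])  # ['paper-b.pdf']
```

A real system would also index titles and footnotes and rank the hits; this sketch only filters on column names.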

UNIX Ruling: An Open Source Victory

In a blow to the SCO Group, Dale A. Kimball, a judge in the U.S. District Court for the District of Utah Central District, has ruled that Novell owns the disputed copyright to the UNIX operating system. The judge also ruled that SCO must drop its suits against IBM Corp. and Sequent as well as pay Novell part of its licensing fees from Sun and Microsoft.

Here's an excerpt from "Novell Wins Right to Unix, Dismissing SCO":

The ruling is good news for organizations that use open-source software products, said Jim Zemlin, executive director of the Linux Foundation. "From the perspective of someone who is adopting open-source solutions to run in the enterprise, it proves to them that the industry is going to defend the platform, and that when organizations attack it from a legal perspective, that the industry collectively will defend it," he said.

Here's an excerpt from "Judge Says Unix Copyrights Belong to Novell":

"The court's ruling has cut out the core of SCO's case and, as a result, eliminates SCO's threat to the Linux community based upon allegations of copyright infringement of Unix," said Joe LaSala, Novell's senior vice president and general counsel.

Sources: Gohring, Nancy. "Novell Wins Right to Unix, Dismissing SCO." InfoWorld, 10 August 2007; Markoff, John. "Judge Says Unix Copyrights Belong to Novell." The New York Times, 11 August 2007.

CommentPress 1.0 Theme Released: Paragraph-Level Commenting in WordPress

After a year-and-a-half of development effort, the Institute for the Future of the Book has released the open-source CommentPress 1.0 theme for WordPress, which allows paragraph-level comments that are displayed side-by-side with the associated paragraph.

Here’s an excerpt from the announcement:

This little tool is the happy byproduct of a year and a half spent hacking WordPress to see whether a popular net-native publishing form, the blog, which, most would agree, is very good at covering the present moment in pithy, conversational bursts but lousy at handling larger, slow-developing works requiring more than chronological organization—whether this form might be refashioned to enable social interaction around long-form texts. Out of this emerged a series of publishing experiments loosely grouped under the heading "networked books." . . .

In the course of our tinkering, we achieved one small but important innovation. Placing the comments next to rather than below the text turned out to be a powerful subversion of the discussion hierarchy of blogs, transforming the page into a visual representation of dialog, and re-imagining the book itself as a conversation. Several readers remarked that it was no longer solely the author speaking, but the book as a whole (author and reader, in concert). . . .

We can imagine a number of possibilities:

— scholarly contexts: working papers, conferences, annotation projects, journals, collaborative glosses
— educational: virtual classroom discussion around readings, study groups
— journalism/public advocacy/networked democracy: social assessment and public dissection of government or corporate documents, cutting through opaque language and spin (like our version of the Iraq Study Group Report, or a copy of the federal budget, or a Walmart press release)
— creative writing: workshopping story drafts, collaborative storytelling
— recreational: social reading, book clubs

VuFind 0.5 Beta Released

Villanova University's Falvey Memorial Library has released VuFind 0.5 Beta. This open-source software operates in conjunction with Voyager OPACs (more drivers being developed), and it is powered by Solr.

Here's an excerpt from the project's home page:

VuFind is a library resource portal designed and developed for libraries by libraries. The goal of VuFind is to enable your users to search and browse through all of your library's resources by replacing the traditional OPAC to include:

Catalog Records

Digital Library Items

Institutional Repository

Institutional Bibliography

Other Library Collections and Resources

VuFind is completely modular so you can implement just the basic system, or all of the components. And since it's open source, you can modify the modules to best fit your need or you can add new modules to extend your resource offerings.

Metadata Extraction Tool Version 3.2

The National Library of New Zealand has released version 3.2 of its open-source Metadata Extraction Tool.

Written in Java and XML, the Metadata Extraction Tool has a Windows interface, and it runs under UNIX in command line mode. Batch processing is supported.

Here’s an excerpt from the project home page:

The Tool builds on the Library’s work on digital preservation, and its logical preservation metadata schema. It is designed to:

automatically extract preservation-related metadata from digital files

output that metadata in a standard format (XML) for use in preservation activities. . . .

The Metadata Extract Tool includes a number of ‘adapters’ that extract metadata from specific file types. Extractors are currently provided for:

Images: BMP, GIF, JPEG and TIFF.

Office documents: MS Word (version 2, 6), Word Perfect, Open Office (version 1), MS Works, MS Excel, MS PowerPoint, and PDF.

Audio and Video: WAV and MP3.

Markup languages: HTML and XML.

If a file type is unknown the tool applies a generic adapter, which extracts data that the host system ‘knows’ about any given file (such as size, filename, and date created).
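The generic adapter idea can be sketched in a few lines. This is an illustration only, not the Tool's own code (which is written in Java): the function names and XML element names are made up, and it reports the last-modified time rather than the creation date, because creation time is not available portably.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def generic_metadata(path):
    """Collect the facts the file system knows about a file of any type."""
    info = os.stat(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        # Last-modified time stands in for "date created" (an assumption).
        "modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc).isoformat(),
    }


def to_xml(record):
    """Serialize the record as a flat XML fragment (element names are hypothetical)."""
    root = ET.Element("file")
    for key, value in record.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Demonstrate on a throwaway file with an unrecognized extension.
with tempfile.NamedTemporaryFile(suffix=".unknown", delete=False) as handle:
    handle.write(b"12345")
    sample_path = handle.name

record = generic_metadata(sample_path)
os.remove(sample_path)
print(to_xml(record))
```

Format-specific adapters would add fields such as image dimensions or audio duration on top of this baseline.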

steve: The Art Museum Tagging Project

The steve project has developed open source tagging software for museums called steve tagger that runs on Linux, Macintosh, and Windows platforms (see the Steve Tagger 1.0 Install Guide). You can see how the tagging works at their live system site.

Here’s an excerpt from the About Steve pages that describes the project:

"Steve" is a collaborative research project exploring the potential for user-generated descriptions of the subjects of works of art to improve access to museum collections and encourage engagement with cultural content. We are a group of volunteers, primarily from art museums, who share a common interest in improving access to our collections. We are concerned about barriers to public access to online museum information. Participation in steve is open to anyone with a contribution to make to developing our collective knowledge, whether they formally represent a museum or not.

You can find out more about steve from the November 2006 "Social Tagging and Folksonomy: steve.museum and Access to Art" presentation and from other project documents on the Reference page.

Archivists’ Toolkit 1.1 Beta (v. 1.0.19) Released

The Archivists’ Toolkit 1.1 Beta (v. 1.0.19) has been released. This version can connect Archivists’ Toolkit clients to MySQL, MS SQLServer, and Oracle database backends.

For more information on the Archivists’ Toolkit, see the "Archivists’ Toolkit Beta 1.1 Released" DigitalKoans posting.

Index Data Releases Open Source Pazpar2 Z39.50 Client

Index Data has released Version 1.0.1 of Pazpar2, an open source Z39.50 client.

Here’s an excerpt from the press release:

Pazpar2 . . . can be viewed either as a high-performance metasearching middleware or a Z39.50 client with a webservice interface, depending on your perspective and needs. It is a fairly compact C program—a resident daemon—that incorporates the best we know how to do in terms of providing high performance, user-oriented federated searching. . . .

One cool thing it does is search many databases in parallel, and do it fast, without unduly loading up the user interface. . . It retrieves a set of records from each target, and performs merging, deduplication, ranking/sorting, and pulls browse facets from them. . . .

It doesn’t know anything about data models, so you can handle exotic data sources if you need to. . . you use XSLT to normalize data into an internal model—we provide examples for MARC21 and a DC-esque internal model, and configure ranking, facets, sorting, etc., from that. . . .
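The parallel search, merge, deduplicate, and rank steps described in the excerpt can be sketched as follows. This is a toy model, not Pazpar2's code (Pazpar2 is a C daemon): the in-memory "targets", the merge key, and the ranking rule are all assumptions chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Stand-in targets; a real deployment would query remote Z39.50 databases.
TARGETS = {
    "library-a": [{"title": "Digital Libraries", "author": "Arms"},
                  {"title": "Metadata Basics", "author": "Lee"}],
    "library-b": [{"title": "digital libraries ", "author": "Arms"},
                  {"title": "Search Engines", "author": "Croft"}],
}


def search_target(name, query):
    """Pretend to query one target; tag each hit with its source."""
    q = query.lower()
    return [dict(r, source=name) for r in TARGETS[name]
            if q in r["title"].lower()]


def merge_key(record):
    """Records with the same normalized title and author count as duplicates."""
    return (record["title"].strip().lower(), record["author"].strip().lower())


def federated_search(query):
    # Query every target in parallel; map() keeps results in target order.
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(lambda n: search_target(n, query), TARGETS))

    merged = {}
    for records in result_sets:
        for record in records:
            entry = merged.setdefault(merge_key(record), {
                "title": record["title"].strip(),
                "author": record["author"],
                "sources": [],
            })
            entry["sources"].append(record["source"])

    # Rank: records held by more targets first, then alphabetically by title.
    return sorted(merged.values(),
                  key=lambda e: (-len(e["sources"]), e["title"].lower()))


def author_facet(results):
    """Count merged records per author, as a simple browse facet."""
    return Counter(e["author"] for e in results)


results = federated_search("digital")
print(results)
```

In Pazpar2 the normalization step is handled by XSLT into an internal model; the `merge_key` function plays that role here in a much cruder way.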

Towards an Open Source Repository and Preservation System

The UNESCO Memory of the World Programme, with the support of the Australian Partnership for Sustainable Repositories, has published Towards an Open Source Repository and Preservation System: Recommendations on the Implementation of an Open Source Digital Archival and Preservation System and on Related Software Development.

Here’s an excerpt from the Executive Summary and Recommendations:

This report defines the requirements for a digital archival and preservation system using standard hardware and describes a set of open source software which could be used to implement it. There are two aspects of this report that distinguish it from other approaches. One is the complete or holistic approach to digital preservation. The report recognises that a functioning preservation system must consider all aspects of a digital repository: Ingest, Access, Administration, Data Management, Preservation Planning and Archival Storage, including storage media and management software. Secondly, the report argues that, for simple digital objects, the solution to digital preservation is relatively well understood, and that what is needed are affordable tools, technology and training in using those systems.

An assumption of the report is that there is no ultimate, permanent storage media, nor will there be in the foreseeable future. It is instead necessary to design systems to manage the inevitable change from system to system. The aim and emphasis in digital preservation is to build sustainable systems rather than permanent carriers. . . .

The way open source communities, providers and distributors achieve their aims provides a model on how a sustainable archival system might work, be sustained, be upgraded and be developed as required. Similarly, many cultural institutions, archives and higher education institutions are participating in the open source software communities to influence the direction of the development of those softwares to their advantage, and ultimately to the advantage of the whole sector.

A fundamental finding of this report is that a simple, sustainable system that provides strategies to manage all the identified functions for digital preservation is necessary. It also finds that for simple discrete digital objects this is nearly possible. This report recommends that UNESCO supports the aggregation and development of an open source archival system, building on, and drawing together existing open source programs.

This report also recommends that UNESCO participates through its various committees, in open source software development on behalf of the countries, communities, and cultural institutions, who would benefit from a simple, yet sustainable, digital archival and preservation system. . . .

Report on Chemistry Teaching/Research Data and Institutional Repositories

The JISC-funded SPECTRa project has released Project SPECTRa (Submission, Preservation and Exposure of Chemistry Teaching and Research Data): JISC Final Report, March 2007.

Here’s an excerpt from the Executive Summary:

Project SPECTRa’s principal aim was to facilitate the high-volume ingest and subsequent reuse of experimental data via institutional repositories, using the DSpace platform, by developing Open Source software tools which could easily be incorporated within chemists’ workflows. It focussed on three distinct areas of chemistry research—synthetic organic chemistry, crystallography and computational chemistry.

SPECTRa was funded by JISC’s Digital Repositories Programme as a joint project between the libraries and chemistry departments of the University of Cambridge and Imperial College London, in collaboration with the eBank UK project. . . .

Surveys of chemists at Imperial and Cambridge investigated their current use of computers and the Internet and identified specific data needs. The survey’s main conclusions were:

Much data is not stored electronically (e.g. lab books, paper copies of spectra)

A complex list of data file formats (particularly proprietary binary formats) being used

A significant ignorance of digital repositories

A requirement for restricted access to deposited experimental data

Distributable software tool development using Open Source code was undertaken to facilitate deposition into a repository, guided by interviews with key researchers. The project has provided tools which allow for the preservation aspects of data reuse. All legacy chemical file formats are converted to the appropriate Chemical Markup Language scheme to enable automatic data validation, metadata creation and long-term preservation needs. . . .

The deposition process adopted the concept of an "embargo repository" allowing unpublished or commercially sensitive material, identified through metadata, to be retained in a closed access environment until the data owner approved its release. . . .

Among the project’s findings were the following:

it has integrated the need for long-term management of experimental chemistry data with the maturing technology and organisational capability of digital repositories;

scientific data repositories are more complex to build and maintain than are those designed primarily for text-based materials;

the specific needs of individual scientific disciplines are best met by discipline-specific tools, though this is a resource-intensive process;

institutional repository managers need to understand the working practices of researchers in order to develop repository services that meet their requirements;

IPR issues relating to the ownership and reuse of scientific data are complex, and would benefit from authoritative guidance based on UK and EU law.
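The "embargo repository" concept mentioned in the excerpt boils down to an access rule driven by metadata. Here is a minimal sketch of that rule; the `Deposit` class and function names are hypothetical and do not come from the SPECTRa tools or DSpace.

```python
from dataclasses import dataclass


@dataclass
class Deposit:
    """Hypothetical deposited dataset with an embargo flag in its metadata."""
    identifier: str
    owner: str
    embargoed: bool = True  # new deposits start in the closed environment


def can_view(deposit, user):
    """Embargoed material is visible only to its owner."""
    return (not deposit.embargoed) or user == deposit.owner


def release(deposit, user):
    """Only the data owner may approve release to open access."""
    if user != deposit.owner:
        raise PermissionError("only the data owner can release a deposit")
    deposit.embargoed = False


spectrum = Deposit("nmr-0001", owner="alice")
print(can_view(spectrum, "bob"))   # False
release(spectrum, "alice")
print(can_view(spectrum, "bob"))   # True
```

A production repository would tie this to its own authorization layer and record who approved the release and when.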

Archivists’ Toolkit Beta 1.1 Released

The Archivists’ Toolkit Beta 1.1 has been released for testing by interested parties.

Here’s a description of the Archivists’ Toolkit from the project’s home page:

Key Features:

Integrated support for managing archival materials from acquisition through processing:

Recording repository information

Tracking sources / donors

Recording accessions

Basic authority control for names and topical subjects

Describing archival resources and digital objects

Managing location information

Customizable interface:

Modify field labels

Establish default values for fields and notes where boilerplate text is used

Customize searchable fields and record browse lists

Ingest of legacy data in multiple formats: EAD 2002, MARC XML, and tab delimited accession data

Rapid data entry interface for creating container lists quickly

Management of user accounts, with a range of permission levels to control access to data

Tracking of database records, including username and date of record creation and most recent edit

Generation of over 30 different administrative and descriptive reports, such as acquisition statistics, accession records, shelf lists, subject guides, etc.

Export EAD 2002, MARC XML, METS, MODS, and Dublin Core

Support for desktop or networked, single- or multi-repository installations

DSpace Executive Director Appointed

Michele Kimpton, formerly of the Internet Archive, has been appointed the Executive Director of the newly formed DSpace nonprofit organization.

Here’s an excerpt from the announcement:

I am happy to report that we are making good progress on establishing the new non-profit organization, and I would like to take this opportunity to announce that Michele Kimpton has accepted the position as Executive Director for the organization. The DSpace non-profit corporation will initially provide organizational, legal and financial support for the DSpace open source software project. Prior to joining DSpace, Michele Kimpton was one of the founding Directors at Internet Archive, in charge of Web archiving technology and services. . . .

Michele developed an organization within Internet Archive to help support and fund open source software and web archiving programs, so she comes to us with a lot of experience in both open source software and long-term digital curation. Her organization worked primarily with National Libraries and Archives around the world, so she is familiar with large, widely diverse and distributed communities. Michele was one of the co-founders of the IIPC (International Internet Preservation Consortium, netpreserve.org), whose mission is to work collaboratively to develop tools, standards and processes for archiving and preservation of web material.

The DSpace non-profit corporation is in the final stages of completing filing status as a not-for-profit corporation of Massachusetts. By summer 2007 we expect to have this legal entity in place, and a complete Board of Directors. Both MIT and Hewlett Packard have provided the start up funding to establish the organization over the next several years. . . .

Fez 1.3 Released

Christiaan Kortekaas has announced on the fedora-commons-users list that Fez 1.3 is now available from SourceForge.

Here’s a summary of key changes from his message:

Primary XSDs for objects based on MODS instead of DC (can still handle your existing DC objects though)

Download statistics using apache logs and GeoIP

Object history logging (premis events)

Shibboleth support

Fulltext indexing (pdf only)

Import and Export of workflows and XSDs

Sanity checking to help make sure required external dependencies are working

OAI provider that respects FezACML authorisation rules

For further information on Fez, see the prior post "Fez+Fedora Repository Software Gains Traction in US."

E-Journal: A Drupal-Based E-Journal Publishing System

Roman Chyla has developed E-Journal, an e-journal management and publishing system based upon the popular open-source Drupal content management system.

Here is a description from the E-Journal site:

This module allows you to create and control own electronic journals in Drupal—you can set up as many journals as you want, add authors and editors. Module gives you issue management and provides list of vocabularies (to browse) and archive of published articles. This module is more sophisticated than epublish.module and was inspired by Open Journal System. Our workflow is not so rigid though and because of the Drupal platform, you can do much more with e-journal than with OJS – potentially ;-).

An example journal that uses E-Journal is Ikaros.

(Prior postings about e-journal management and publishing systems.)

Fez+Fedora Repository Software Gains Traction in US

The February 2007 issue of Sustaining Repositories reports that more US institutions are using or investigating a combination of Fez and Fedora (see the quote below):

Fez programmers at the University of Queensland (UQ) have been gratified by a surge in international interest in the Fez software. Emory University Libraries are building a Fez repository for electronic theses. Indiana University Libraries are also testing Fez+Fedora to see whether to replace their existing DSpace installation. The Colorado Alliance of Research Libraries (http://www.coalliance.org/) is using Fez+Fedora for their Alliance Digital Repository. Also in the US, the National Science Digital Library is using Fez+Fedora for their Materials Science Digital Library (http://matdl.org/repository/index.php).

Oregon State University Libraries Release LibraryFind Metasearch Software

The Oregon State University Libraries have released version 0.7 of LibraryFind, which is open source metasearch software.

LibraryFind features noted in the press release include:

2-click user workflow (one click to find, one click to get)

Integrated OpenURL resolver

2-tiered caching system to improve search response time

Customizable user interface

According to the installation instructions, the software requires Ruby 1.8.4 and Rails 1.1.6.

MIT’s SIMILE Project

MIT’s Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE) project is producing a variety of open source software packages that will be of interest to librarians and others, such as Piggy Bank, "a Firefox extension that turns your browser into a mashup platform, by allowing you to extract data from different web sites and mix them together."

Here is an overview of the SIMILE project from the About SIMILE page:

SIMILE is a joint project conducted by the MIT Libraries and MIT Computer Science and Artificial Intelligence Laboratory. SIMILE seeks to enhance inter-operability among digital assets, schemata/vocabularies/ontologies, metadata, and services. A key challenge is that the collections which must inter-operate are often distributed across individual, community, and institutional stores. We seek to be able to provide end-user services by drawing upon the assets, schemata/vocabularies/ontologies, and metadata held in such stores.

SIMILE will leverage and extend DSpace, enhancing its support for arbitrary schemata and metadata, primarily through the application of RDF and semantic web techniques. The project also aims to implement a digital asset dissemination architecture based upon web standards. The dissemination architecture will provide a mechanism to add useful "views" to a particular digital artifact (i.e. asset, schema, or metadata instance), and bind those views to consuming services.

You can get a more detailed overview of the project from the SIMILE grant proposal and from other project documents.

There is a SIMILE blog and a Wiki. There are also three mailing lists.

Fedora 2.2 Released

The Fedora Project has released version 2.2 of Fedora.

From the announcement:

This is a significant release of Fedora that includes a complete repackaging of the Fedora source and binary distribution so that Fedora can now be installed as a standalone web application (.war) in any web container. This is a first step in positioning Fedora to fit within a standard "enterprise system" environment. A new installer application makes it easy to set up and run Fedora. Fedora now uses Servlet Filters for authentication. To support digital object integrity, the Fedora repository can now be configured to calculate and store checksums for datastream content. This can be done globally, or on selected datastreams. The Fedora API also provides the ability to check content integrity based on checksums. The RDF-based Resource Index has been tuned for better performance. Also, a new high-performing triplestore, backed by Postgres, has been developed that can be plugged into the Resource Index. Fedora contains many other enhancements and bug fixes.
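The checksum feature follows a familiar pattern: compute a digest when content is stored, then recompute and compare it later. Here is a rough sketch of that pattern with a global default and per-datastream overrides. The `POLICY` structure, function names, and datastream IDs are illustrative assumptions, not Fedora configuration settings or API calls.

```python
import hashlib

# Hypothetical policy: a global default (None means no checksum) plus
# overrides for selected datastreams. Not actual Fedora settings.
POLICY = {"default": None, "overrides": {"DC": "sha256"}}


def algorithm_for(datastream_id):
    """Pick the checksum algorithm for a datastream, falling back to the default."""
    return POLICY["overrides"].get(datastream_id, POLICY["default"])


def compute_checksum(content, algorithm):
    """Hex digest of the content using the named hashlib algorithm."""
    return hashlib.new(algorithm, content).hexdigest()


def store(datastream_id, content):
    """Store content together with its checksum, if the policy asks for one."""
    algorithm = algorithm_for(datastream_id)
    checksum = compute_checksum(content, algorithm) if algorithm else None
    return {"id": datastream_id, "content": content,
            "algorithm": algorithm, "checksum": checksum}


def is_intact(record):
    """True or False when a checksum exists; None when there is nothing to check."""
    if record["checksum"] is None:
        return None
    return compute_checksum(record["content"], record["algorithm"]) == record["checksum"]


record = store("DC", b"<dc:title>Example</dc:title>")
print(is_intact(record))           # True
record["content"] += b"corrupted"  # simulate bit rot
print(is_intact(record))           # False
```

Setting `POLICY["default"]` to an algorithm name would correspond to the "globally" option in the announcement; the overrides correspond to "selected datastreams."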

Learning Commons Publishes "Copyright, Copyleft and Everything in Between"

The South African Learning Commons has published a multimedia introduction to copyright, open content, and open source issues for kids.

It is available for Linux, Mac, and Windows computers, and it is under the Creative Commons Attribution Share-Alike South Africa license.

Under the Hood of PLoS ONE: The Open Source TOPAZ E-Publishing System

PLoS is building its innovative PLoS ONE e-journal, which will incorporate both traditional and open peer review, using the open source TOPAZ software. (For a detailed description of the PLoS ONE peer review process, check out "ONE for All: The Next Step for PLoS.")

What is TOPAZ? Its Web site doesn’t provide specifics, but "PLoS ONE—Technical Background" by Richard Cave does:

The core of TOPAZ is a digital information repository called Fedora (Flexible Extensible Digital Object Repository Architecture). Fedora is an Open Source content management application that supports the creation and management of digital objects. The digital objects contain metadata to express internal and external relationships in the repository, like articles in a journal or the text, images and video of an article. This relationship metadata can also be searched using semantic web query languages. Fedora is jointly developed by Cornell University’s computer science department and the University of Virginia Libraries.

The metastore Kowari will be used with Fedora to support Resource Description Framework (RDF) http://en.wikipedia.org/wiki/Resource_Description_Framework metadata within the repository.

The PLoS ONE web interface will be built with AJAX. Client-side APIs will create the community features (e.g. annotations, discussion threads, ratings, etc.) for the website. As more new features are available on the TOPAZ architecture, we will launch them on PLoS ONE.

There was a TOPAZ Wiki at PLoS. It’s gone, but its pages are still cached by Google. The Wiki suggests that TOPAZ is likely to support Atom/RSS feeds, full-text search, and OAI-PMH among other possible features.

For information about other open source e-journal publishing systems, see "Open Source Software for Publishing E-Journals."

Open Source Software for Publishing E-Journals

Want to publish an open access journal, but you don’t want to license a commercial journal management system, develop your own system, or do it all by tedious HTML hand-coding? Here’s summary information about two existing open source e-journal management systems (and one emerging system) that may do the trick.

HyperJournal

"HyperJournal is a software application that facilitates the administration of academic journals on the Web. Conceived for researchers in the Humanities and designed according to an intuitive and elegant layout, it permits the installation, personalization, and administration of a dedicated Web site at extremely low cost and without the need for special IT-competence. HyperJournal can be used not only to establish an online version of an existing paper periodical, but also to create an entirely new, solely electronic journal."
Overview
Documentation
Download

Open Journal Systems, Public Knowledge Project

"Open Journal Systems (OJS) is a journal management and publishing system that has been developed by the Public Knowledge Project through its federally funded efforts to expand and improve access to research. OJS assists with every stage of the refereed publishing process, from submissions through to online publication and indexing. Through its management systems, its finely grained indexing of research, and the context it provides for research, OJS seeks to improve both the scholarly and public quality of refereed research."
Open Journal Systems (Overview)
FAQ
OJS Technical Reference
Download

DPubS (Digital Publishing System), Cornell University Library (In development)

"DPubS’ ground-breaking software system will enable publishers to cost-effectively organize, deliver, present and publish scholarly journals, monographs, conference proceedings, and other common and evolving means of academic discourse."
About DPubS
FAQ

Postscript: Peter Suber suggests adding several other software packages, including: