DSpace How-To Guide

Tim Donohue, Scott Phillips, and Dorothea Salo have published DSpace How-To Guide: Tips and Tricks for Managing Common DSpace Chores (Now Serving DSpace 1.4.2 and Manakin 1.1).

This 55-page booklet, which is under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License, will be a welcome addition to the virtual bookshelves of institutional repository managers struggling with the mysteries of DSpace.

Report on Chemistry Teaching/Research Data and Institutional Repositories

The JISC-funded SPECTRa project has released Project SPECTRa (Submission, Preservation and Exposure of Chemistry Teaching and Research Data): JISC Final Report, March 2007.

Here’s an excerpt from the Executive Summary:

Project SPECTRa’s principal aim was to facilitate the high-volume ingest and subsequent reuse of experimental data via institutional repositories, using the DSpace platform, by developing Open Source software tools which could easily be incorporated within chemists’ workflows. It focussed on three distinct areas of chemistry research—synthetic organic chemistry, crystallography and computational chemistry.

SPECTRa was funded by JISC’s Digital Repositories Programme as a joint project between the libraries and chemistry departments of the University of Cambridge and Imperial College London, in collaboration with the eBank UK project. . . .

Surveys of chemists at Imperial and Cambridge investigated their current use of computers and the Internet and identified specific data needs. The survey’s main conclusions were:

  • Much data is not stored electronically (e.g. lab books, paper copies of spectra)
  • A complex list of data file formats (particularly proprietary binary formats) being used
  • A significant ignorance of digital repositories
  • A requirement for restricted access to deposited experimental data

Distributable software tool development using Open Source code was undertaken to facilitate deposition into a repository, guided by interviews with key researchers. The project has provided tools which allow for the preservation aspects of data reuse. All legacy chemical file formats are converted to the appropriate Chemical Markup Language scheme to enable automatic data validation, metadata creation and long-term preservation needs. . . .

The deposition process adopted the concept of an "embargo repository" allowing unpublished or commercially sensitive material, identified through metadata, to be retained in a closed access environment until the data owner approved its release. . . .

Among the project’s findings were the following:

  • it has integrated the need for long-term management of experimental chemistry data with the maturing technology and organisational capability of digital repositories;
  • scientific data repositories are more complex to build and maintain than are those designed primarily for text-based materials;
  • the specific needs of individual scientific disciplines are best met by discipline-specific tools, though this is a resource-intensive process;
  • institutional repository managers need to understand the working practices of researchers in order to develop repository services that meet their requirements;
  • IPR issues relating to the ownership and reuse of scientific data are complex, and would benefit from authoritative guidance based on UK and EU law.

SWORD (Simple Web-service Offering Repository Deposit) Project

Led by UKOLN, the JISC SWORD (Simple Web-service Offering Repository Deposit) Project is developing "a prototype ‘smart deposit’ tool" to "facilitate easier and more effective population of repositories."

Here’s an excerpt from the project plan:

The effective and efficient population of repositories is a key concern for the repositories community. Deposit is a crucial step in the repository workflow; without it a repository has no content and can fulfill no further function. Currently most repositories exist in a fairly linear context, accepting deposits from a single interface and putting them into a single repository. Further deployment of repositories, encouraged by JISC and other funders, means that this situation is changing and we are beginning to see an increasingly complex and dynamic ecology of interactions between repositories and other services and systems. By and large developers are not creating repository systems and software from scratch, rather they are considering how repositories interface with other applications within institutions and the wider information landscape. A single repository, or multiple repositories, might interact with other components, such as VLEs, authoring tools, packaging tools, name authority services, classification services and research systems. In terms of content, resources may be deposited in a repository by both human and software agents, e.g. packaging tools that push content into repositories or a drag-and-drop desktop tool. The type of resource being deposited will also influence the choice of deposit mechanism. If the resources are complex packaged objects then a web service will need to support the ingest of multiple packaging standards.

There is currently no standard mechanism for accepting content into repositories, yet there already exists a stable and widely implemented service for harvesting metadata from repositories (OAI-PMH—Open Archives Initiative Protocol for Metadata Harvesting). This project will implement a similarly open protocol or specification for deposit. By taking a similar approach, the project and the resulting protocol and implementations will gain easier acceptance by a community already familiar with the OAI-PMH.

This project aims to develop a Simple Web-service Offering Repository Deposit (SWORD)—a lightweight deposit protocol that will be implemented as a simple web service within EPrints, DSpace, Fedora and IntraLibrary and tested against a prototype ‘smart deposit’ tool. The project plans to take forward the lightweight protocol originally formulated by a small group working within the Digital Repositories Programme (the ‘Deposit API’ work). The project is aligned with the Object Reuse and Exchange (ORE) Mellon-funded two-year project by the Open Archives Initiative, which commenced in October 2006. Members of the SWORD project team are represented on its Technical and Liaison Committees. . . . The SWORD project is not attempting to duplicate work being done by ORE, but seeks to build on existing work to support UK-specific requirements whilst feeding into the ongoing ORE project.
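
For readers less familiar with the harvesting protocol that the excerpt uses as its model, an OAI-PMH request is an ordinary HTTP GET whose query string carries a `verb` and its arguments. The following is a minimal sketch that only builds such a request URL. The base URL and the helper name `oai_pmh_url` are made up for illustration, and nothing here represents the SWORD deposit protocol itself, which the project plan describes as still to be developed.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; a real repository
# publishes its own OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai/request"


def oai_pmh_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb, **arguments}
    return base_url + "?" + urlencode(query)


# ListRecords with the Dublin Core metadata format (oai_dc), which
# every OAI-PMH repository is required to support.
list_records_url = oai_pmh_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(list_records_url)
```

A deposit protocol modeled on this approach would presumably be similarly lightweight, but its actual requests are not specified in the excerpt.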

DSpace Executive Director Appointed

Michele Kimpton, formerly of the Internet Archive, has been appointed the Executive Director of the newly formed DSpace nonprofit organization.

Here’s an excerpt from the announcement:

I am happy to report that we are making good progress on establishing the new non-profit organization, and I would like to take this opportunity to announce that Michele Kimpton has accepted the position as Executive Director for the organization. The DSpace non-profit corporation will initially provide organizational, legal and financial support for the DSpace open source software project. Prior to joining DSpace, Michele Kimpton was one of the founding Directors at Internet Archive, in charge of Web archiving technology and services. . . .

Michele developed an organization within Internet Archive to help support and fund open source software and web archiving programs, so she comes to us with a lot of experience in both open source software and long-term digital curation. Her organization worked primarily with National Libraries and Archives around the world, so she is familiar with large, widely diverse and distributed communities. Michele was one of the co-founders of the IIPC (International Internet Preservation Consortium, netpreserve.org), whose mission is to work collaboratively to develop tools, standards and processes for archiving and preservation of web material.

The DSpace non-profit corporation is in the final stages of completing filing status as a not-for-profit corporation of Massachusetts. By summer 2007 we expect to have this legal entity in place, and a complete Board of Directors. Both MIT and Hewlett Packard have provided the start up funding to establish the organization over the next several years. . . .

Open Access Repository Software Use By Country

Based on data from the OpenDOAR Charts service, here is a snapshot of the open access repository software that is in use in the top five countries that offer such repositories.

The countries are abbreviated in the table header row as follows: US = United States, DE = Germany, UK = United Kingdom, AU = Australia, and NL = Netherlands. The number in parentheses is the reported number of repositories in that country.

Read the country percentages downward in each column (they do not total to 100% across the rows).

Excluding "unknown" or "other" systems, the highest in-country percentage is marked with an asterisk (*).

Software/Country   US (248)   DE (109)   UK (93)    AU (50)    NL (44)
Bepress            17%        0%         2%         6%         0%
Cocoon             0%         0%         1%         0%         0%
CONTENTdm          3%         0%         2%         0%         0%
CWIS               1%         0%         0%         0%         0%
DARE               0%         0%         0%         0%         2%
Digitool           0%         0%         1%         0%         0%
DSpace             18%        4%         22%        14%        14%
eDoc               0%         2%         0%         0%         0%
ETD-db             4%         0%         0%         0%         0%
Fedora             0%         0%         0%         2%         0%
Fez                0%         0%         0%         2%         0%
GNU EPrints        19%*       8%         46%*       22%*       0%
HTML               2%         4%         4%         4%         0%
iTor               0%         0%         0%         0%         5%
Milees             0%         2%         0%         0%         0%
MyCoRe             0%         2%         0%         0%         0%
OAICat             0%         0%         0%         2%         0%
Open Repository    0%         0%         3%         0%         2%
OPUS               0%         43%*       2%         0%         0%
Other              6%         7%         2%         2%         0%
PORT               0%         0%         0%         0%         2%
Unknown            31%        28%        18%        46%        23%
Wildfire           0%         0%         0%         0%         52%
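
The column-wise reading of the table can be checked mechanically. The sketch below copies the table's figures into a string, parses each row (software names may contain spaces, so the five percentages are split off from the right), and finds the leading named system per country, leaving out "Unknown" and "Other". The function and variable names are illustrative choices, not part of any OpenDOAR tool.

```python
COUNTRIES = ["US", "DE", "UK", "AU", "NL"]
EXCLUDED = {"Unknown", "Other"}

# Figures copied from the table above: name, then one percentage per country.
ROWS = """\
Bepress 17% 0% 2% 6% 0%
Cocoon 0% 0% 1% 0% 0%
CONTENTdm 3% 0% 2% 0% 0%
CWIS 1% 0% 0% 0% 0%
DARE 0% 0% 0% 0% 2%
Digitool 0% 0% 1% 0% 0%
DSpace 18% 4% 22% 14% 14%
eDoc 0% 2% 0% 0% 0%
ETD-db 4% 0% 0% 0% 0%
Fedora 0% 0% 0% 2% 0%
Fez 0% 0% 0% 2% 0%
GNU EPrints 19% 8% 46% 22% 0%
HTML 2% 4% 4% 4% 0%
iTor 0% 0% 0% 0% 5%
Milees 0% 2% 0% 0% 0%
MyCoRe 0% 2% 0% 0% 0%
OAICat 0% 0% 0% 2% 0%
Open Repository 0% 0% 3% 0% 2%
OPUS 0% 43% 2% 0% 0%
Other 6% 7% 2% 2% 0%
PORT 0% 0% 0% 0% 2%
Unknown 31% 28% 18% 46% 23%
Wildfire 0% 0% 0% 0% 52%
"""


def parse_rows(text):
    """Map each software name to a {country: percentage} dict."""
    table = {}
    for line in text.strip().splitlines():
        # Split from the right so multi-word names stay in one piece.
        name, *values = line.rsplit(maxsplit=len(COUNTRIES))
        table[name] = {c: int(v.rstrip("%")) for c, v in zip(COUNTRIES, values)}
    return table


def leaders(table):
    """Return the top named system and its percentage for each country."""
    result = {}
    for country in COUNTRIES:
        named = {n: row[country] for n, row in table.items() if n not in EXCLUDED}
        top = max(named, key=named.get)
        result[country] = (top, named[top])
    return result


TABLE = parse_rows(ROWS)
LEADERS = leaders(TABLE)
for country, (name, pct) in LEADERS.items():
    print(f"{country}: {name} ({pct}%)")
```

Run against these figures, the leaders are GNU EPrints in the US, UK, and Australia, OPUS in Germany, and Wildfire in the Netherlands.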

Snapshot Data from OpenDOAR Charts

OpenDOAR has introduced OpenDOAR Charts, a nifty new service that allows users to create and view charts that summarize data from its database of open access repositories.

Here’s what a selection of the default charts show today. Only double-digit percentage results are discussed.

  • Repositories by continent: Europe is the leader with 49% of repositories. North America places second with 33%.
  • Repositories by country: In light of the above, it is interesting that the US leads the pack with 29% of repositories. Germany (13%) and the UK (11%) follow.
  • Repository software: After the 28% of unknown software, EPrints takes the number two slot (21%), followed by DSpace (19%).
  • Repository types: By far, institutional repositories are the leader at 79%. Disciplinary repositories follow (13%).
  • Content types: ETDs lead (53%), followed by unpublished reports/working papers (48%), preprints/postprints (37%), conference/workshop papers (35%), books/chapters/sections (31%), multimedia/av (20%), postprints only (17%), bibliographic references (16%), special items (15%), and learning objects (13%).

This is a great service; however, I’d suggest that University of Nottingham consider licensing it under a Creative Commons license so that snapshot charts could be freely used (at least for noncommercial purposes).

MIT’s SIMILE Project

MIT’s Semantic Interoperability of Metadata and Information in unLike Environments (SIMILE) project is producing a variety of open source software packages that will be of interest to librarians and others. One example is Piggy Bank, "a Firefox extension that turns your browser into a mashup platform, by allowing you to extract data from different web sites and mix them together."

Here is an overview of the SIMILE project from the About SIMILE page:

SIMILE is a joint project conducted by the MIT Libraries and MIT Computer Science and Artificial Intelligence Laboratory. SIMILE seeks to enhance inter-operability among digital assets, schemata/vocabularies/ontologies, metadata, and services. A key challenge is that the collections which must inter-operate are often distributed across individual, community, and institutional stores. We seek to be able to provide end-user services by drawing upon the assets, schemata/vocabularies/ontologies, and metadata held in such stores.

SIMILE will leverage and extend DSpace, enhancing its support for arbitrary schemata and metadata, primarily through the application of RDF and semantic web techniques. The project also aims to implement a digital asset dissemination architecture based upon web standards. The dissemination architecture will provide a mechanism to add useful "views" to a particular digital artifact (i.e. asset, schema, or metadata instance), and bind those views to consuming services.

You can get a more detailed overview of the project from the SIMILE grant proposal and from other project documents.

There is a SIMILE blog and a Wiki. There are also three mailing lists.

Notre Dame Institutional Digital Repository Phase I Final Report

The University of Notre Dame Libraries have issued a report about their year-long institutional repository pilot project. There is an abbreviated HTML version and a complete PDF version.

From the Executive Summary:

Here is the briefest of summaries regarding what we did, what we learned, and where we think future directions should go:

  1. What we did—In a nutshell we established relationships with a number of content groups across campus: the Kellogg Institute, the Institute for Latino Studies, Art History, Electrical Engineering, Computer Science, Life Science, the Nanovic Institute, the Kaneb Center, the School of Architecture, FTT (Film, Television, and Theater), the Gigot Center for Entrepreneurial Studies, the Institute for Scholarship in the Liberal Arts, the Graduate School, the University Intellectual Property Committee, the Provost’s Office, and General Counsel. Next, we collected content from many of these groups, "cataloged" it, and saved it into three different computer systems: DigiTool, ETD-db, and DSpace. Finally, we aggregated this content into a centralized cache to provide enhanced browsing, searching, and syndication services against the content.
  2. What we learned—We essentially learned four things: 1) metadata matters, 2) preservation now, not later, 3) the IDR requires dedicated people with specific skills, 4) copyright raises the largest number of questions regarding the fulfillment of the goals of the IDR.
  3. Where we are leaning in regards to recommendations—The recommendations take the form of a "Chinese menu" of options, and the options are grouped into "meals." We recommend the IDR continue and include: 1) continuing to do the Electronic Theses & Dissertations, 2) writing and implementing metadata and preservation policies and procedures, 3) taking the Excellent Undergraduate Research to the next level, and 4) continuing to implement DigiTool. There are quite a number of other options, but they may be deemed too expensive to implement.

Results from the DSpace Community Survey

DSpace conducted an informal survey of its open source community in October 2006. Here are some highlights:

  • The vast majority of respondents (77.6%) used or planned to use DSpace for a university IR.
  • The majority of systems were in production (53.4%); pilot testing was second (35.3%).
  • Preservation and interoperability were the highest priority system features (61.2% each), followed by search engine indexing (57.8%) and open access to refereed articles (56.9%). (Percentage of respondents who rated these features "very important.") Only 5.2% thought that OA to refereed articles was unimportant.
  • The most common types of current IR content were refereed scholarly articles and theses/dissertations (55.2% each), followed by other (48.6%) and grey literature (47.4%).
  • The most popular types of content that respondents were planning to add to their IRs were datasets (53.4%), followed by audio and video (46.6% each).
  • The most frequently used type of metadata was customized Dublin Core (80.2%), followed by XML metadata (13.8%).
  • The most common update pattern was to regularly migrate to new versions; however it took a "long time to merge in my customizations/configuration" (44.8%).
  • The most common types of modification were minor cosmetics (34.5%), new features (26.7%), and significant user interface customization (21.6%).
  • Only 30.2% were totally comfortable with editing/customizing DSpace; 56.9% were somewhat comfortable and 12.9% were not comfortable.
  • Plug-in use is light: for example, 11.2% use SRW/U, 8.6% use Manakin, and 5.2% use TAPIR (ETDs).
  • The most desired feature for the next version is a more easily customized user interface (17.5%), closely followed by improved modularity (16.7%).

For information about other recent institutional repository surveys, see "ARL Institutional Repositories SPEC Kit" and "MIRACLE Project’s Institutional Repository Survey."

MIRACLE Project’s Institutional Repository Survey

The MIRACLE (Making Institutional Repositories A Collaborative Learning Environment) project at the University of Michigan’s School of Information presented a paper at JCDL 2006 titled "Nationwide Census of Institutional Repositories: Preliminary Findings."

MIRACLE’s sample population was 2,147 library directors at four-year US colleges and universities. The paper presents preliminary findings from 273 respondents.

Respondents characterized their IR activities as: "(1) implementation of an IR (IMP), (2) planning & pilot testing an IR software package (PPT), (3) planning only (PO), or (4) no planning to date (NP)."

Of the 273 respondents, "28 (10%) have characterized their IR involvement as IMP, 42 (15%) as PPT, 65 (24%) as PO, and 138 (51%) as NP."

The top-ranked benefits of having an IR were: "capturing the intellectual capital of your institution," "better service to contributors," and "longtime preservation of your institution’s digital output." The bottom-ranked benefits were "reducing user dependence on your library’s print collection," "providing maximal access to the results of publicly funded research," and "an increase in citation counts to your institution’s intellectual output."

On the question of IR staffing, the survey found:

Generally, PPT and PO decision-makers envision the library sharing operational responsibility for an IR. Decision-makers from institutions with full-fledged operational IRs choose responses that show library staff bearing the burden of responsibility for the IR.

Of those with operational IRs who identified their IR software, the survey found that they were using: "(1) 9 for Dspace, (2) 5 for bePress, (3) 4 for ProQuest’s Digital Commons, (4) 2 for local solutions, and (5) 1 each for Ex Libris’ DigiTools and Virginia Tech’s ETD." Of those who were pilot testing software: "(1) 17 for DSpace, (2) 9 for OCLC’s ContentDM, (3) 5 for Fedora, (4) 3 each for bePress, DigiTool, ePrints, and Greenstone, (5) 2 each for Innovative Interfaces, Luna, and ETD, and (6) 1 each for Digital Commons, Encompass, a local solution, and Opus."

In terms of number of documents in the IRs, by far the largest percentages were for less than 501 documents (IMP, 41%; and PPT, 67%).

The preliminary results also cover other topics, such as content recruitment, investigative decision-making activities, IR costs, and IR system features.

It is interesting to see how these preliminary results compare to those of the ARL Institutional Repositories SPEC Kit. For example, when asked "What are the top three benefits you feel your IR provides?," the ARL survey respondents said:

  1. Enhance visibility and increase dissemination of institution’s scholarship: 68%
  2. Free, open, timely access to scholarship: 46%
  3. Preservation of and long-term access to institution’s scholarship: 36%
  4. Preservation and stewardship of digital content: 36%
  5. Collecting, organizing assets in a central location: 24%
  6. Educate faculty about copyright, open access, scholarly communication: 8%

Digital University/Library Presses, Part 5: Internet-First University Press

Established in January 2004, Cornell University’s Internet-First University Press is described as follows:

These materials are being published as part of a new approach to scholarly publishing. The manuscripts and videos are freely available from this Internet-First University Press repository within DSpace at Cornell University.

These online materials are available on an open access basis, without fees or restrictions on personal use. All mass reproduction, even for educational or not-for-profit use, requires permission and license.

There are Internet-First University Press DSpace collections for books and articles, multimedia and videos, and undergraduate scholarly publications. There is a print-on-demand option for books and articles.

There are DSpace sub-communities for journals and symposia, workshops, and conferences. One e-journal is published by Internet-First University Press, the CIGR E-Journal (most current volume dated 2005). A print journal, Engineering Quarterly, has been digitized and made available.

There appears to be no further information about the Internet-First University Press at its DSpace site; however, the "Internet-First Publishing Project at Cornell Offers New and Old Books Free Online or to Be Printed on Demand" press release provides further background information.

ARL Institutional Repositories SPEC Kit

The Institutional Repositories SPEC Kit is now available from the Association of Research Libraries (ARL). This document presents the results of a thirty-eight-question survey of 123 ARL members in early 2006 about their institutional repositories practices and plans. The survey response rate was 71% (87 out of 123 ARL members responded). The front matter and nine-page Executive Summary are freely available. The document also presents detailed question-by-question results, a list of respondent institutions, representative documents from institutions, and a bibliography. It is 176 pages long.

Here is the bibliographic information: University of Houston Libraries Institutional Repository Task Force. Institutional Repositories. SPEC Kit 292. Washington, DC: Association of Research Libraries, 2006. ISBN: 1-59407-708-8.

The members of the University of Houston Libraries Institutional Repository Task Force who authored the document were Charles W. Bailey, Jr. (Chair); Karen Coombs; Jill Emery (now at UT Austin); Anne Mitchell; Chris Morris; Spencer Simons; and Robert Wright.

The creation of a SPEC Kit is a highly collaborative process. SPEC Kit Editor Lee Anne George and other ARL staff worked with the authors to refine the survey questions, mounted the Web survey, analyzed the data in SPSS, created a preliminary summary of survey question responses, and edited and formatted the final document. Given the amount of data that the survey generated, this was no small task. The authors would like to thank the ARL team for their hard work on the SPEC Kit.

Although the Executive Summary is much longer than the typical one (over 5,100 words vs. about 1,500 words), it should not be mistaken for a highly analytic research article. Its goal was to try to describe the survey’s main findings, which was quite challenging given the amount of survey data available. The full data is available in the "Survey Questions and Responses" section of the SPEC Kit.

Here are some quick survey results:

  • Thirty-seven ARL institutions (43% of respondents) had an operational IR (we called these respondents implementers), 31 (35%) were planning one by 2007, and 19 (22%) had no IR plans.
  • Looked at from the perspective of all 123 ARL members, 30% had an operational IR and, by 2007, that figure may reach 55%.
  • The mean cost of IR implementation was $182,550.
  • The mean annual IR operation cost was $113,543.
  • Most implementers did not have a dedicated budget for either start-up costs (56%) or ongoing operations (52%).
  • The vast majority of implementers identified first-level IR support units that had a library reporting line vs. one that had a campus IT or other campus unit reporting line.
  • DSpace was by far the most commonly used system: 20 implementers used it exclusively and 3 used it in combination with other systems.
  • Proquest DigitalCommons (or the Bepress software it is based on) was the second choice of implementers: 7 implementers used this system.
  • While 28% of implementers have made no IR software modifications to enhance its functionality, 22% have made frequent changes to do so and 17% have made major modifications to the software.
  • Only 41% of implementers had no review of deposited documents. While review by designated departmental or unit officials was the most common method (35%), IR staff reviewed documents 21% of the time.
  • In a check-all-that-apply question, 60% of implementers said that IR staff entered simple metadata for authorized users and 57% said that they enhanced such data. Thirty-one percent said that they cataloged IR materials completely using local standards.
  • In another check-all-that-apply question, implementers clearly indicated that IR and library staff use a variety of strategies to recruit content: 83% made presentations to faculty and others, 78% identified and encouraged likely depositors, 78% had library subject specialists act as advocates, 64% offered to deposit materials for authors, and 50% offered to digitize materials and deposit them.
  • The most common digital preservation arrangement for implementers (47%) was to accept any file type, but only preserve specified file types using data migration and other techniques. The next most common arrangement (26%) was to accept and preserve any file type.
  • The mean number of digital objects in implementers’ IRs was 3,844.
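
The all-member percentages quoted above follow directly from counts given in this post (123 ARL members, 87 respondents, 37 operational IRs, 31 planned by 2007). The short sketch below reproduces that arithmetic. The variable names are illustrative, and the 2007 projection assumes, as the "may reach" wording implies, that every planning respondent follows through and that no non-respondent has an IR.

```python
ARL_MEMBERS = 123
RESPONDENTS = 87
OPERATIONAL = 37
PLANNING_BY_2007 = 31


def pct(part, whole):
    """Percentage of part in whole, rounded to the nearest whole number."""
    return round(100 * part / whole)


response_rate = pct(RESPONDENTS, ARL_MEMBERS)
operational_share = pct(OPERATIONAL, ARL_MEMBERS)
# Assumption: all planners launch an IR and non-respondents have none.
projected_share = pct(OPERATIONAL + PLANNING_BY_2007, ARL_MEMBERS)

print(f"Response rate: {response_rate}%")
print(f"Operational IRs among all members: {operational_share}%")
print(f"Projected by 2007: {projected_share}%")
```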