ARL Releases E-Science Survey Preliminary Results and Resources

The Association of Research Libraries has released preliminary results and resources from an e-science survey of its members.

Here's an excerpt from the press release:

The Association of Research Libraries (ARL) E-Science Working Group surveyed ARL member libraries in the fall of 2009 to gather data on the state of engagement with e-science issues. An overview of initial survey findings was presented by E-Science Working Group Chair Wendy Lougee, University Librarian, McKnight Presidential Professor, University of Minnesota Libraries, at the October ARL Membership Meeting. Lougee's briefing explored contrasting approaches among research institutions, particularly in regard to data management. The briefing also summarized survey findings on topics such as library services, organizational structures, staffing patterns and staff development, and involvement in research grants, along with perspectives on pressure points for service development. To better explicate the findings, Lougee reviewed specific cases of activities at six research institutions. . . .

A full report of the survey findings is being prepared and will be published in 2010 by ARL through its Occasional Papers series.

Open Science at Web-Scale: Optimising Participation and Predictive Potential

JISC has released Open Science at Web-Scale: Optimising Participation and Predictive Potential.

Here's an excerpt:

This report has attempted to draw together and synthesise evidence and opinion from a wide range of sources. Examples of data-intensive science at extremes of scale and complexity, which enable forecasting and predictive assertions, have been described, together with compelling exemplars where an open and participative culture is transforming science practice. It is perhaps worth noting that the pace of change in this area is such that it has been a challenging piece to compose and, at best, it can only serve as a subjective snapshot of a very dynamic data space. . . .

The perspective of openness as a continuum is helpful in positioning the range of behaviours and practices observed in different disciplines and contexts. By separating the twin aspects of openness (access and participation), we can begin to understand the full scope and potential of the open science vision. Whilst a listing of the perceived values and benefits of open science is given, further work is required to provide substantive and tangible evidence to justify and support these assertions. Available evidence suggests that transparent data sharing and data re-use are far from commonplace. The peer production approaches to data curation which have been described are really in their infancy but offer considerable promise as scalable models which could be migrated to other disciplines. The more radical open notebook science methodologies are currently on the "fringe" and it is not clear whether uptake and adoption will grow in other disciplines and contexts.

Duke, NC State, and UNC Data Sharing Cloud Computing Project Launched

Duke University, North Carolina State University, and the University of North Carolina at Chapel Hill have launched a two-year project to share digital data.

Here's an excerpt from the press release:

An initiative that will determine how Triangle area universities access, manage, and share ever-growing stores of digital data launched this fall with funding from the Triangle Universities Center for Advanced Studies, Inc. (TUCASI).

The two-year TUCASI data-Infrastructure Project (TIP) will deploy a federated data cyberinfrastructure—or data cloud—that will manage and store digital data for Duke University, NC State University, UNC Chapel Hill, and the Renaissance Computing Institute (RENCI) and allow the campuses to more seamlessly share data with each other, with national research projects, and with private sector partners in Research Triangle Park and beyond.

RENCI and the Data Intensive Cyber Environments (DICE) Center at UNC Chapel Hill manage the $2.7 million TIP. The provosts, heads of libraries and chief information officers at the three campuses signed off on the project just before the start of the fall semester.

"The TIP focuses on federation, sharing and reuse of information across departments and campuses without having to worry about where the data is physically stored or what kind of computer hardware or software is used to access it," said Richard Marciano, TIP project director, and also professor at UNC's School of Information and Library Science (SILS), executive director of the DICE Center, and a chief scientist at RENCI. "Creating infrastructure to support future Triangle collaboratives will be very powerful."

The TIP includes three components—classroom capture, storage, and future data and policy, which will be implemented in three phases. In phase one, each campus and RENCI will upgrade their storage capabilities and a platform-independent system for capturing and sharing classroom lectures and activities will be developed. . . .

In phase two, the TIP team will develop policies and practices for short- and long-term data storage and access. Once developed, the policies and practices will guide the research team as it creates a flexible, sustainable digital archive, which will connect to national repositories and national data research efforts. Phase three will establish policies for adding new collections to the TIP data cloud and for securely sharing research data, a process that often requires various restrictions. "Implementation of a robust technical and policy infrastructure for data archiving and sharing will be key to maintaining the Triangle universities' position as leaders in data-intensive, collaborative research," said Kristin Antelman, lead researcher for the future data and policy working group and associate director for the Digital Library at NC State.

The tasks of the TIP research team will include designing a model for capturing, storing and accessing course content, determining best practices for search and retrieval, and developing mechanisms for sharing archived content among the TIP partners, across the Triangle area and with national research initiatives. Campus approved social media tools, such as YouTube and iTunesU, will be integrated into the system.

The Fourth Paradigm: Data-Intensive Scientific Discovery

Microsoft Research has released The Fourth Paradigm: Data-Intensive Scientific Discovery.

Of particular interest is the "Scholarly Communication" chapter.

Here are some selections from that chapter:

  • "Jim Gray’s Fourth Paradigm and the Construction of the Scientific Record," Clifford Lynch
  • "Text in a Data-Centric World," Paul Ginsparg
  • "All Aboard: Toward a Machine-Friendly Scholarly Communication System," Herbert Van de Sompel and Carl Lagoze
  • "I Have Seen the Paradigm Shift, and It Is Us," John Wilbanks

Digital Videos: Presentations from Access 2009 Conference

Presentations from the Access 2009 Conference are now available. Digital videos and presentation slides (if available) are synched.

Here's a quick selection:

  1. Dan Chudnov, "Repository Development at the Library of Congress"
  2. Cory Doctorow, "Copyright vs Universal Access to All Human Knowledge and Groups Without Cost: The State of Play in the Global Copyfight"
  3. Mark Jordan & Brian Owen, "COPPUL's LOCKSS Private Network / Software Lifecycles & Sustainability: a PKP and reSearcher Update"
  4. Dorothea Salo, "Representing and Managing the Data Deluge"
  5. Roy Tennant, "Inspecting the Elephant: Characterizing the Hathi Trust Collection"

Johns Hopkins University Sheridan Libraries' Data Conservancy Project Funded by $20 Million NSF Grant

The Johns Hopkins University Sheridan Libraries' Data Conservancy project has been funded by a $20 million NSF grant.

Here's an excerpt from the press release:

The Johns Hopkins University Sheridan Libraries have been awarded $20 million from the National Science Foundation (NSF) to build a data research infrastructure for the management of the ever-increasing amounts of digital information created for teaching and research. The five-year award, announced this week, was one of two for what is being called "data curation."

The project, known as the Data Conservancy, involves individuals from several institutions, with Johns Hopkins University serving as the lead and Sayeed Choudhury, Hodson Director of the Digital Research and Curation Center and associate dean of university libraries, as the principal investigator. In addition, seven Johns Hopkins faculty members are associated with the Data Conservancy, including School of Arts and Sciences professors Alexander Szalay, Bruce Marsh, and Katalin Szlavecz; School of Engineering professors Randal Burns, Charles Meneveau, and Andreas Terzis; and School of Medicine professor Jef Boeke. The Hopkins-led project is part of a larger $100 million NSF effort to ensure preservation and curation of engineering and science data.

Beginning with the life, earth, and social sciences, project members will develop a framework to more fully understand data practices currently in use and arrive at a model for curation that allows ease of access both within and across disciplines.

"Data curation is not an end but a means," said Choudhury. "Science and engineering research and education are increasingly digital and data-intensive, which means that new management structures and technologies will be critical to accommodate the diversity, size, and complexity of current and future data sets and streams. Our ultimate goal is to support new ways of inquiry and learning. The potential for the sharing and application of data across disciplines is incredible. But it’s not enough to simply discover data; you need to be able to access it and be assured it will remain available."

The Data Conservancy grant represents one of the first awards related to the Institute of Data Intensive Engineering and Science (IDIES), a collaboration between the Krieger School of Arts and Sciences, the Whiting School of Engineering, and the Sheridan Libraries. . . .

In addition to the $20 million grant announced today, the Libraries received a $300,000 grant from NSF to study the feasibility of developing, operating and sustaining an open access repository of articles from NSF-sponsored research. Libraries staff will work with colleagues from the Council on Library and Information Resources (CLIR) and the University of Michigan Libraries to explore the potential for the development of a repository (or set of repositories) similar to PubMed Central, the open-access repository that features articles from NIH-sponsored research. This grant for the feasibility study will allow Choudhury's group to evaluate how to integrate activities under the framework of the Data Conservancy and will result in a set of recommendations for NSF regarding an open access repository.

"Empirical Study of Data Sharing by Authors Publishing in PLoS Journals"

Caroline J. Savage and Andrew J. Vickers have published "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" in PLoS One.

Here's an excerpt:

We requested data from ten investigators who had published in either PLoS Medicine or PLoS Clinical Trials. All responses were carefully documented. In the event that we were refused data, we reminded authors of the journal's data sharing guidelines. If we did not receive a response to our initial request, a second request was made. Following the ten requests for raw data, three investigators did not respond, four authors responded and refused to share their data, two email addresses were no longer valid, and one author requested further details. A reminder of PLoS's explicit requirement that authors share data did not change the reply from the four authors who initially refused. Only one author sent an original data set. . . .

We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

eSciDoc Infrastructure Version 1.1 Released

Version 1.1 of the eSciDoc Infrastructure has been released.

Here's an excerpt from the announcement:

  • Improved Ingest with support for pre-set states (e.g., ingest objects in status 'released'). Ingest performance has been improved significantly.
  • Support for user preferences added
  • Group policies extend the existing authorization options and allow for better support of collaborative working environments
  • Support for Japanese character sets in full-text and metadata searches, including the extraction of Japanese text from PDF documents
  • Support for OAI-PMH with dynamic sets based on filters
  • Improved and extended functionality for the Admin Tool, which now comes with a web-based GUI
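The OAI-PMH support mentioned above can be illustrated with a short, hedged sketch: building a ListRecords request for a set and pulling record identifiers out of a response. The endpoint URL and set name below are hypothetical examples, not eSciDoc's actual configuration.

```python
# Sketch of OAI-PMH harvesting with a set filter, in the spirit of the
# "dynamic sets based on filters" feature noted in the release announcement.
# The base URL and set name are invented for illustration.
from urllib.parse import urlencode
from xml.etree import ElementTree

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL restricted to one set."""
    query = urlencode({
        "verb": "ListRecords",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"

def record_identifiers(response_xml):
    """Extract the record identifiers from a ListRecords response body."""
    root = ElementTree.fromstring(response_xml)
    return [
        header.findtext(f"{OAI_NS}identifier")
        for header in root.iter(f"{OAI_NS}header")
    ]

url = list_records_url("http://repository.example.org/oai", "escidoc_releases")
print(url)
```

A harvester would fetch that URL, feed the XML body to `record_identifiers`, and then follow `resumptionToken` elements for paging, which this sketch omits.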

Here's a brief description of the eSciDoc Core Services, which are part of a larger software suite (see the General Concepts page for further information):

The eSciDoc Core Services form a middleware for e-Research applications. The Core Services encapsulate a repository (Fedora Commons) and implement a broad range of commonly used functionalities. The service-oriented architecture fosters the creation of autonomous services, which can be re-used independently from the rest of the infrastructure. The multi-disciplinary nature of the existing Solutions built on top of the Core Services ensures the coverage of a broad range of generic and discipline-specific requirements.

“Adding eScience Assets to the Data Web”

Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, and Pete Johnston have self-archived "Adding eScience Assets to the Data Web" on arXiv.org.

Here's an excerpt:

Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and networked-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative – Object Reuse and Exchange (OAI-ORE) project. The OAI-ORE specifications are based on the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. Therefore, their incorporation into the cyberinfrastructure that supports eScholarship will ensure the integration of the products of scholarly research into the Data Web.
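To make the idea of an OAI-ORE aggregation concrete, here is a minimal sketch of a resource map serialized as RDF/XML using only the standard library. The URIs are hypothetical, and a real resource map carries further metadata (creator, modification date, the describing map's own URI) per the ORE specifications.

```python
# Minimal OAI-ORE resource map sketch: an Aggregation that ore:aggregates
# two Web resources (e.g., an article and its underlying dataset).
# All URIs below are invented for illustration.
from xml.etree import ElementTree

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"
ElementTree.register_namespace("rdf", RDF)
ElementTree.register_namespace("ore", ORE)

def resource_map(aggregation_uri, aggregated_uris):
    """Describe an ore:Aggregation and the resources it aggregates."""
    root = ElementTree.Element(f"{{{RDF}}}RDF")
    agg = ElementTree.SubElement(root, f"{{{RDF}}}Description",
                                 {f"{{{RDF}}}about": aggregation_uri})
    # Type the node as an ORE Aggregation.
    ElementTree.SubElement(agg, f"{{{RDF}}}type",
                           {f"{{{RDF}}}resource": f"{ORE}Aggregation"})
    # One ore:aggregates statement per constituent resource.
    for uri in aggregated_uris:
        ElementTree.SubElement(agg, f"{{{ORE}}}aggregates",
                               {f"{{{RDF}}}resource": uri})
    return ElementTree.tostring(root, encoding="unicode")

doc = resource_map("http://example.org/aggregation/1",
                   ["http://example.org/article.pdf",
                    "http://example.org/dataset.csv"])
print(doc)
```

The point of the model is that the aggregation itself gets a URI, so the compound object—not just its parts—can be cited, harvested, and linked into the Data Web.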

Australian National Data Service Launches Two Research Data Services

The Australian National Data Service has launched two research data services: Identify My Data and Register My Data.

Here's an excerpt from the announcement:

The Register My Data services allow you to register descriptions of your research data. These descriptions are then published in a number of discovery environments. The first of these is the Research Data Australia gateway (to be launched by ANDS in July) which aspires to include any Australian publicly funded data relevant to research and enable innovative cross-disciplinary re-use. Data descriptions registered with ANDS are also fed into other data discovery portals in Australia and internationally, including the big search engines such as Google. The Identify My Data services allocate persistent identifiers to data. These identifiers enable continuity of access even when the location of the data on the internet changes.
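The value of persistent identifiers described above—continuity of access when data moves—comes down to a level of indirection: the identifier stays fixed while a registry's location record is updated. The toy resolver below illustrates the idea; the identifier scheme and URLs are invented, and ANDS's production services rest on established schemes such as Handles rather than this sketch.

```python
# Toy illustration of persistent-identifier resolution: the identifier is
# stable, and only the registry's location entry changes when data moves.
# Identifiers and URLs here are hypothetical.

class IdentifierRegistry:
    """Map persistent identifiers to their current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        # Registering an existing identifier updates its location.
        self._locations[pid] = url

    def resolve(self, pid):
        return self._locations[pid]

registry = IdentifierRegistry()
registry.register("hdl:102.100/0001", "http://data.example.edu.au/old/dataset1")
# The dataset is migrated; the identifier is untouched, only the record changes.
registry.register("hdl:102.100/0001", "http://archive.example.edu.au/dataset1")
print(registry.resolve("hdl:102.100/0001"))
```

Because citations and links carry the identifier rather than the location, they keep working across migrations for as long as the registry is maintained.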

Curating Atmospheric Data for Long Term Use: Infrastructure and Preservation Issues for the Atmospheric Sciences Community

The Digital Curation Centre has released Curating Atmospheric Data for Long Term Use: Infrastructure and Preservation Issues for the Atmospheric Sciences Community, SCARP Case Study No. 2.

Here's an excerpt:

DCC SCARP aims to understand disciplinary approaches to data curation by substantial case studies based on an immersive approach. As part of the SCARP project we engaged with a number of archives, including the British Atmospheric Data Centre, the World Data Centre Archive at the Rutherford Appleton Laboratory and the European Incoherent Scatter Scientific Association (EISCAT). We developed a preservation analysis methodology which is discipline independent in application but none the less capable of identifying and drawing out discipline specific preservation requirements and issues. In this case study report we present the methodology along with its application to the Mesospheric Stratospheric Tropospheric (MST) radar dataset, which is currently supported by and accessed through the British Atmospheric Data Centre. We suggest strategies for the long term preservation of the MST data and make recommendations for the wider community.

Keeping Research Data Safe 2: The Identification of Long-lived Digital Datasets for the Purposes of Cost Analysis: Project Plan

Charles Beagrie has released Keeping Research Data Safe 2: The Identification of Long-lived Digital Datasets for the Purposes of Cost Analysis: Project Plan.

Here's an excerpt from the project home page:

The Keeping Research Data Safe 2 project commenced on 31 March 2009 and will complete in December 2009. The project will identify and analyse sources of long-lived data and develop longitudinal data on associated preservation costs and benefits. We believe these outcomes will be critical to developing preservation costing tools and cost benefit analyses for justifying and sustaining major investments in repositories and data curation.

DISC-UK DataShare Project: Final Report

JISC has released DISC-UK DataShare Project: Final Report.

Here's an excerpt:

The DISC-UK DataShare Project was funded from March 2007 to March 2009 as part of JISC's Repositories and Preservation programme, Repositories Enhancement strand. It was led by EDINA and Edinburgh University Data Library in partnership with the University of Oxford and the University of Southampton. The project built on the existing informal collaboration of UK data librarians and data managers who formed DISC-UK (Data Information Specialists Committee–UK).

This project has brought together the distinct communities of data support staff in universities and institutional repository managers in order to bridge gaps and exploit the expertise of both to advance the current provision of repository services for accommodating datasets, and thus to explore new pathways to assist academics at our institutions who wish to share their data over the Internet. The project's overall aim was to contribute to new models, workflows and tools for academic data sharing within a complex and dynamic information environment which includes increased emphasis on stewardship of institutional knowledge assets of all types; new technologies to enhance e-Research; new research council policies and mandates; and the growth of the Open Access / Open Data movement.

With three institutions taking part plus the London School of Economics as an associate partner, a range of exemplars have emerged from the establishment of institutional data repositories and related services. Part of the variety in the exemplars is a result of the different repository platforms used by the three project partners: DSpace (Edinburgh DataShare), ePrints (e-Prints Soton) and Fedora (Oxford University Research Archive, ORA)–all open source software. LSE took another route and is using the distributed Dataverse repository network for data, linking to publications in LSE Research Online. Also, different approaches were taken in setting up the repositories. All three institutions had an existing, well-used institutional repository, but two chose to incorporate datasets within the same system as the publications, and one (Edinburgh DataShare) was a paired repository exclusively for datasets, designed to interoperate with the publications repository (Edinburgh Research Archive). The approach took a major turn midway through the project when an apparent solution to the problem of lack of voluntary deposits arose, in the form of the advent of the Data Audit Framework. Edinburgh participated as a partner in the DAF Development project which created the methodology for the framework, and also won a bid to carry out its own DAF Implementation project. Later, the other two partners conducted their own versions of the data audit framework under the auspices of the DataShare project.

A number of scoping activities were carried out by the partners with the goal of informing repository enhancement as well as broader dissemination. These included a State-of-the-Art Review to determine what had been learned by previous repository projects in the UK that had forayed into the data arena. This resulted in a list of benefits of and barriers to deposit of datasets by researchers to inform our outreach activities. A Data Sharing Continuum diagram was developed to illustrate where the projects were aiming to fit into the curation landscape, and the range of curation steps that could be taken, from simple backup to online visualization. Later on, a specialized metadata schema (the Data Documentation Initiative, or DDI) was explored in terms of how it might be incorporated into repository systems, though repository development in this area was not taken up. Instead, a dataset application profile was developed based on qualified Dublin Core (dcterms). This was implemented in the Edinburgh DataShare repository and adapted by Southampton for their next release. The project wished to explore wider issues with open data and web publishing, and therefore produced two briefing papers to do with data mashups, on numeric data and geospatial data. Finally, the project staff and consultant distilled what they had learned in terms of policy development for data repositories in a training guide. A number of peer-reviewed posters, papers, and articles were written by DISC-UK members about various aspects of the project during the period.
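A dataset application profile based on qualified Dublin Core (dcterms), as the report describes, can be sketched roughly as follows. The field choices and values below are illustrative only, not the project's actual profile.

```python
# Rough sketch of a dataset description using qualified Dublin Core
# (dcterms) elements. The specific fields and sample values are invented
# for illustration; a real application profile constrains which dcterms
# elements are used and how.
from xml.etree import ElementTree

DCTERMS = "http://purl.org/dc/terms/"
ElementTree.register_namespace("dcterms", DCTERMS)

def dataset_record(fields):
    """Serialize a dict of dcterms element -> value as a simple XML record."""
    root = ElementTree.Element("record")
    for name, value in fields.items():
        el = ElementTree.SubElement(root, f"{{{DCTERMS}}}{name}")
        el.text = value
    return ElementTree.tostring(root, encoding="unicode")

record = dataset_record({
    "title": "Example survey dataset",
    "creator": "A. Researcher",
    "type": "Dataset",
    "issued": "2009-06-01",
    "license": "http://opendatacommons.org/licenses/odbl/",
})
print(record)
```

Reusing dcterms rather than a discipline-specific schema such as DDI is the design trade-off the report points at: less descriptive depth, but records that generic repository software can index and exchange without modification.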

Key conclusions were that 1) data management motivation is a better bottom-up driver for researchers than data sharing, but is not sufficient to create culture change; 2) data librarians, data managers, and data scientists can help bridge communication between repository managers and researchers; and 3) institutional repositories can improve the impact of sharing data over the internet.

Digital Preservation: PARSE.Insight Project Reports on First Year Achievements

In "Annual Review Year 1: Goals and Achievements," the PARSE.Insight (Permanent Access to the Records of Science in Europe) project reports on its first-year achievements. This post includes links to a number of longer documents, including the PARSE.Insight Deliverable D2.1 Draft Roadmap.

Here's an excerpt from the PARSE.Insight Deliverable D2.1 Draft Roadmap:

The purpose of this document is to provide an overview and initial details of a number of specific components, both technical and non-technical, which would be needed to supplement existing and already planned infrastructures for science data. The infrastructure components presented here are aimed at bridging the gaps between islands of functionality, developed for particular purposes, often by other European projects, whether separated by discipline or time. Thus the infrastructure components are intended to play a general, unifying role in science data. While developed in the context of a European wide infrastructure, there would be great advantages for these types of infrastructure components to be available much more widely.

U.S. Federal Government Launches Data.gov

The U.S. Federal Government has launched Data.gov.

Here's an excerpt from the home page:

The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. Although the initial launch of Data.gov provides a limited portion of the rich variety of Federal datasets presently available, we invite you to actively participate in shaping the future of Data.gov by suggesting additional datasets and site enhancements to provide seamless access and use of your Federal data.

Read more about it at "Data.gov Launched by Federal Government"; "Data.gov Launches to Mixed Reviews"; and "Data.gov Now Live; Looks Nice But Short on Data."

Dryad Repository Gets $2.18 Million Grant from the National Science Foundation

The Dryad Repository has received a $2.18 million grant from the National Science Foundation.

Here's an excerpt from the press release:

The repository, called Dryad, is designed to archive data that underlie published findings in evolutionary biology, ecology and related fields and allow scientists to access and build on each other’s findings.

The grant recipients are:

The National Evolutionary Synthesis Center and the Metadata Research Center have been developing Dryad in coordination with a large group of journals and societies in evolutionary biology and ecology. With the new grant, the additional team members are contributing to the development of the repository. . . .

Currently, a tremendous amount of information underlying published research findings is lost, researchers say. The lack of data sharing and preservation makes it impossible for the data to be examined or re-used by future investigators.

Dryad addresses these shortcomings and allows scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, integrate data across studies and look for trends through statistical meta-analysis.

"The Dryad project seeks to enable scientists to generate new knowledge using existing data," said Kathleen Smith, Ph.D., principal investigator for the grant, a biology professor at Duke and director of the National Evolutionary Synthesis Center. "The key to Dryad in our view is making data deposition a routine and easy part of the publication process."

Digital Repositories Roadmap Review: Towards a Vision for Research and Learning in 2013

JISC has released Digital Repositories Roadmap Review: Towards a Vision for Research and Learning in 2013.

Here's an excerpt from the announcement:

The review is structured into two parts. Firstly it makes a number of recommendations targeted at the JISC Executive. The review then goes on to identify a number of milestones of relevance to the wider community that might act as a measure of progress towards the wider vision of enhanced scholarly communication. Achievement of these milestones would be assisted by JISC through its community work and funding programmes. The review addresses repositories for research outputs, research data and learning materials in separate sections.


CLARION (Chemical Laboratory Repository In/Organic Notebooks) Project Funded

JISC has funded the CLARION (Chemical Laboratory Repository In/Organic Notebooks) project.

Here's an excerpt from the announcement:

So an important part of CLARION will be developing the means for working with scientists to expose their data at the appropriate time. CLARION will expand to include a variety of spectral data, both from central analytical services and from individual labs. Another key aspect of CLARION is that we shall be integrating it with a commercial electronic laboratory notebook (eLNb). We're in the process of evaluating offerings and expect to make an announcement soon. This will be a key opportunity to see how feasible it is to integrate a standard system with the needs of a departmental repository. The protocols may be harder, but we'll have the experience from the crystallography and spectroscopy. An important aspect is that we are keen to develop the Open Data idea globally, and we'd be very interested to hear from other groups who are doing – or thinking of doing – similar things.

Infrastructure Planning and Data Curation: A Comparative Study of International Approaches to Enabling the Sharing of Research Data

JISC has released Infrastructure Planning and Data Curation: A Comparative Study of International Approaches to Enabling the Sharing of Research Data.

Here's an excerpt from the announcement:

The current methods of storing research data are as diverse as the disciplines that generate them and are necessarily driven by the myriad ways in which researchers need to subsequently access and exploit the information they contain. Institutional repositories, data centres, and all other methods of storing data have to exist within an infrastructure that enables researchers to access and exploit the data, and variant models for this infrastructure can be conceptualised. Discussion of effective infrastructures for curating data is taking place at all levels, wherever research is reliant on the long-term stewardship of digital material. JISC has commissioned this study to survey the different national agendas that are addressing variant infrastructure models, to inform developments within the UK and to facilitate an internationally integrated approach to data curation.

The study of data sharing initiatives in the OECD countries confirmed the traditional perception that the policy instruments are clustered more in the upper end of the stakeholder taxonomy – i.e., at the level of national and research funding organisations – whereas the services and practical tools are being developed by organisations at the lower end of the taxonomy. Despite the differences that exist between countries in terms of the models used for research funding, as well as the levels at which decisions are taken, there is agreement on the expected strata of responsibility for applying the instruments of data sharing. This supports the structure of stakeholder taxonomy used in the study.

Draft Roadmap for Science Data Infrastructure

PARSE.Insight has released Draft Roadmap for Science Data Infrastructure.

Here's an excerpt from the announcement:

The draft roadmap provides an overview and initial details of a number of specific components, both technical and non-technical, which would be needed to supplement existing and already planned infrastructures for scientific data. The infrastructure components are aimed at bridging the gaps between islands of functionality, developed for particular purposes, often by other European projects. Thus the infrastructure components are intended to play a general, unifying role in scientific data. While developed in the context of a Europe-wide infrastructure, there would be great advantages for these types of infrastructure components to be available much more widely.

DCC Releases "Database Archiving"

The Digital Curation Centre has released a new briefing paper on "Database Archiving."

Here's an excerpt:

Database archiving is usually seen as a subset of data archiving. In a computational context, data archiving means to store electronic documents, data sets, multimedia files, and so on, for a period of time. The primary goal is to maintain the data in case it is later requested for some particular purpose. Complying with government regulations on data preservation is, for example, a main driver behind data archiving efforts. Database archiving focuses on archiving data that are maintained under the control of a database management system and structured under a database schema, e.g., a relational database.