The Fourth Paradigm: Data-Intensive Scientific Discovery

Microsoft Research has released The Fourth Paradigm: Data-Intensive Scientific Discovery.

Of particular interest is the "Scholarly Communication" chapter.

Here are some selections from that chapter:

  • "Jim Gray’s Fourth Paradigm and the Construction of the Scientific Record," Clifford Lynch
  • "Text in a Data-Centric World," Paul Ginsparg
  • "All Aboard: Toward a Machine-Friendly Scholarly Communication System," Herbert Van de Sompel and Carl Lagoze
  • "I Have Seen the Paradigm Shift, and It Is Us," John Wilbanks

Digital Videos: Presentations from Access 2009 Conference

Presentations from the Access 2009 Conference are now available. Digital videos are synchronized with the presentation slides where slides are available.

Here's a quick selection:

  1. Dan Chudnov, "Repository Development at the Library of Congress"
  2. Cory Doctorow, "Copyright vs Universal Access to All Human Knowledge and Groups Without Cost: The State of Play in the Global Copyfight"
  3. Mark Jordan & Brian Owen, "COPPUL's LOCKSS Private Network / Software Lifecycles & Sustainability: a PKP and reSearcher Update"
  4. Dorothea Salo, "Representing and Managing the Data Deluge"
  5. Roy Tennant, "Inspecting the Elephant: Characterizing the Hathi Trust Collection"

Johns Hopkins University Sheridan Libraries' Data Conservancy Project Funded by $20 Million NSF Grant

The Johns Hopkins University Sheridan Libraries' Data Conservancy project has been funded by a $20 million NSF grant.

Here's an excerpt from the press release:

The Johns Hopkins University Sheridan Libraries have been awarded $20 million from the National Science Foundation (NSF) to build a data research infrastructure for the management of the ever-increasing amounts of digital information created for teaching and research. The five-year award, announced this week, was one of two for what is being called "data curation."

The project, known as the Data Conservancy, involves individuals from several institutions, with Johns Hopkins University serving as the lead and Sayeed Choudhury, Hodson Director of the Digital Research and Curation Center and associate dean of university libraries, as the principal investigator. In addition, seven Johns Hopkins faculty members are associated with the Data Conservancy, including School of Arts and Sciences professors Alexander Szalay, Bruce Marsh, and Katalin Szlavecz; School of Engineering professors Randal Burns, Charles Meneveau, and Andreas Terzis; and School of Medicine professor Jef Boeke. The Hopkins-led project is part of a larger $100 million NSF effort to ensure preservation and curation of engineering and science data.

Beginning with the life, earth, and social sciences, project members will develop a framework to more fully understand data practices currently in use and arrive at a model for curation that allows ease of access both within and across disciplines.

"Data curation is not an end but a means," said Choudhury. "Science and engineering research and education are increasingly digital and data-intensive, which means that new management structures and technologies will be critical to accommodate the diversity, size, and complexity of current and future data sets and streams. Our ultimate goal is to support new ways of inquiry and learning. The potential for the sharing and application of data across disciplines is incredible. But it’s not enough to simply discover data; you need to be able to access it and be assured it will remain available."

The Data Conservancy grant represents one of the first awards related to the Institute of Data Intensive Engineering and Science (IDIES), a collaboration between the Krieger School of Arts and Sciences, the Whiting School of Engineering, and the Sheridan Libraries. . . .

In addition to the $20 million grant announced today, the Libraries received a $300,000 grant from NSF to study the feasibility of developing, operating and sustaining an open access repository of articles from NSF-sponsored research. Libraries staff will work with colleagues from the Council on Library and Information Resources (CLIR), and the University of Michigan Libraries to explore the potential for the development of a repository (or set of repositories) similar to PubMedCentral, the open-access repository that features articles from NIH-sponsored research. This grant for the feasibility study will allow Choudhury's group to evaluate how to integrate activities under the framework of the Data Conservancy and will result in a set of recommendations for NSF regarding an open access repository.

"Empirical Study of Data Sharing by Authors Publishing in PLoS Journals"

Caroline J. Savage and Andrew J. Vickers have published "Empirical Study of Data Sharing by Authors Publishing in PLoS Journals" in PLoS One.

Here's an excerpt:

We requested data from ten investigators who had published in either PLoS Medicine or PLoS Clinical Trials. All responses were carefully documented. In the event that we were refused data, we reminded authors of the journal's data sharing guidelines. If we did not receive a response to our initial request, a second request was made. Following the ten requests for raw data, three investigators did not respond, four authors responded and refused to share their data, two email addresses were no longer valid, and one author requested further details. A reminder of PLoS's explicit requirement that authors share data did not change the reply from the four authors who initially refused. Only one author sent an original data set. . . .

We received only one of ten raw data sets requested. This suggests that journal policies requiring data sharing do not lead to authors making their data sets available to independent investigators.

eSciDoc Infrastructure Version 1.1 Released

Version 1.1 of the eSciDoc Infrastructure has been released.

Here's an excerpt from the announcement:

  • Improved Ingest with support for pre-set states (e.g., ingest objects in status 'released'). Ingest performance has been improved significantly.
  • Support for user preferences added
  • Group policies extend the existing authorization options and allow for better support of collaborative working environments
  • Support for Japanese character sets in full-text and metadata searches, including the extraction of Japanese text from PDF documents
  • Support for OAI-PMH with dynamic sets based on filters
  • Improved and extended functionality for the Admin Tool, which now comes with a web-based GUI
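
As a rough illustration of the OAI-PMH item above: in the protocol, a harvester selects a set by passing the `set` argument on a `ListRecords` request. The sketch below only builds such a request URL and makes no network call. The base URL and set name are hypothetical placeholders, not actual eSciDoc endpoints or set names.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # "set" restricts the harvest to records belonging to one set.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository address and set name, for illustration only.
url = list_records_url("https://repository.example.org/oai", set_spec="released-items")
print(url)
```

A repository that offers filter-based dynamic sets would presumably expose each filter under its own set name, so the harvester side would look the same as for static sets.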

Here's a brief description of the eSciDoc Core Services, which are part of a larger software suite (see the General Concepts page for further information):

The eSciDoc Core Services form a middleware for e-Research applications. The Core Services encapsulate a repository (Fedora Commons) and implement a broad range of commonly used functionalities. The service-oriented architecture fosters the creation of autonomous services, which can be re-used independently from the rest of the infrastructure. The multi-disciplinary nature of the existing Solutions built on top of the Core Services ensure the coverage of a broad range of generic and discipline-specific requirements.

"Adding eScience Assets to the Data Web"

Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, and Pete Johnston have self-archived "Adding eScience Assets to the Data Web" on arXiv.org.

Here's an excerpt:

Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and networked-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative – Object Reuse and Exchange (OAI-ORE) project. The OAI-ORE specifications are based on the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. Therefore, their incorporation into the cyberinfrastructure that supports eScholarship will ensure the integration of the products of scholarly research into the Data Web.
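
To make the idea of describing an aggregation a little more concrete, here is a minimal sketch that models a resource map as plain subject-predicate-object triples. It uses the `describes` and `aggregates` terms from the OAI-ORE vocabulary; the function name and all example.org URIs are hypothetical, and a real resource map would carry further metadata and be serialized as RDF/XML or Atom.

```python
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map(rem_uri, aggregation_uri, resource_uris):
    """Return (subject, predicate, object) triples for a minimal resource map."""
    # The resource map describes exactly one aggregation.
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    # The aggregation lists each of its constituent Web resources.
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


# Hypothetical URIs for a paper and its supporting dataset.
triples = resource_map(
    "https://example.org/rem/1",
    "https://example.org/aggregation/1",
    ["https://example.org/paper.pdf", "https://example.org/dataset.csv"],
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```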

Australian National Data Service Launches Two Research Data Services

The Australian National Data Service has launched two research data services: Identify My Data and Register My Data.

Here's an excerpt from the announcement:

The Register My Data services allow you to register descriptions of your research data. These descriptions are then published in a number of discovery environments. The first of these is the Research Data Australia gateway (to be launched by ANDS in July) which aspires to include any Australian publicly funded data relevant to research and enable innovative cross-disciplinary re-use. Data descriptions registered with ANDS are also fed into other data discovery portals in Australia and internationally, including the big search engines such as Google. The Identify My Data services allocate persistent identifiers to data. These identifiers enable continuity of access even when the location of the data on the internet changes.
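
The point about persistent identifiers can be sketched in a few lines: the identifier stays fixed while the location it resolves to is updated. This is a toy in-memory model under assumed names (`IdentifierRegistry`, `mint`, `relocate`, `resolve`), not the ANDS service interface, and the identifier and URLs are made up.

```python
class IdentifierRegistry:
    """Toy resolver: a stable identifier maps to the data's current URL."""

    def __init__(self):
        self._locations = {}

    def mint(self, identifier, url):
        """Register a new identifier pointing at the data's first location."""
        if identifier in self._locations:
            raise ValueError("identifier already exists: " + identifier)
        self._locations[identifier] = url

    def relocate(self, identifier, new_url):
        """Record that the data moved; the identifier itself never changes."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the current location for an identifier."""
        return self._locations[identifier]


registry = IdentifierRegistry()
registry.mint("example-id/0001", "https://old-host.example.org/data/survey.csv")
registry.relocate("example-id/0001", "https://new-host.example.org/archive/survey.csv")
print(registry.resolve("example-id/0001"))
```

Anyone citing `example-id/0001` keeps a working reference after the move, because only the registry entry changed.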

Curating Atmospheric Data for Long Term Use: Infrastructure and Preservation Issues for the Atmospheric Sciences Community

The Digital Curation Centre has released Curating Atmospheric Data for Long Term Use: Infrastructure and Preservation Issues for the Atmospheric Sciences Community, SCARP Case Study No. 2.

Here's an excerpt:

DCC SCARP aims to understand disciplinary approaches to data curation by substantial case studies based on an immersive approach. As part of the SCARP project we engaged with a number of archives, including the British Atmospheric Data Centre, the World Data Centre Archive at the Rutherford Appleton Laboratory and the European Incoherent Scatter Scientific Association (EISCAT). We developed a preservation analysis methodology which is discipline independent in application but none the less capable of identifying and drawing out discipline specific preservation requirements and issues. In this case study report we present the methodology along with its application to the Mesospheric Stratospheric Tropospheric (MST) radar dataset, which is currently supported by and accessed through the British Atmospheric Data Centre. We suggest strategies for the long term preservation of the MST data and make recommendations for the wider community.

Keeping Research Data Safe 2: The Identification of Long-lived Digital Datasets for the Purposes of Cost Analysis: Project Plan

Charles Beagrie has released Keeping Research Data Safe 2: The Identification of Long-lived Digital Datasets for the Purposes of Cost Analysis: Project Plan.

Here's an excerpt from the project home page:

The Keeping Research Data Safe 2 project commenced on 31 March 2009 and will complete in December 2009. The project will identify and analyse sources of long-lived data and develop longitudinal data on associated preservation costs and benefits. We believe these outcomes will be critical to developing preservation costing tools and cost benefit analyses for justifying and sustaining major investments in repositories and data curation.

DISC-UK DataShare Project: Final Report

JISC has released DISC-UK DataShare Project: Final Report.

Here's an excerpt:

The DISC-UK DataShare Project was funded from March 2007-March 2009 as part of JISC's Repositories and Preservation programme, Repositories Enhancement strand. It was led by EDINA and Edinburgh University Data Library in partnership with the University of Oxford and the University of Southampton. The project built on the existing informal collaboration of UK data librarians and data managers who formed DISC-UK (Data Information Specialists Committee–UK).

This project has brought together the distinct communities of data support staff in universities and institutional repository managers in order to bridge gaps and exploit the expertise of both to advance the current provision of repository services for accommodating datasets, and thus to explore new pathways to assist academics at our institutions who wish to share their data over the Internet. The project's overall aim was to contribute to new models, workflows and tools for academic data sharing within a complex and dynamic information environment which includes increased emphasis on stewardship of institutional knowledge assets of all types; new technologies to enhance e-Research; new research council policies and mandates; and the growth of the Open Access / Open Data movement.

With three institutions taking part plus the London School of Economics as an associate partner, a range of exemplars have emerged from the establishment of institutional data repositories and related services. Part of the variety in the exemplars is a result of the different repository platforms used by the three project partners: DSpace (Edinburgh DataShare), ePrints (e-Prints Soton) and Fedora (Oxford University Research Archive, ORA)–all open source software. LSE took another route and is using the distributed Dataverse repository network for data, linking to publications in LSE Research Online. Also, different approaches were taken in setting up the repositories. All three institutions had an existing, well-used institutional repository, but two chose to incorporate datasets within the same system as the publications, and one (Edinburgh DataShare) was a paired repository exclusively for datasets, designed to interoperate with the publications repository (Edinburgh Research Archive). The approach took a major turn midway through the project when an apparent solution to the problem of lack of voluntary deposits arose, in the form of the advent of the Data Audit Framework. Edinburgh participated as a partner in the DAF Development project which created the methodology for the framework, and also won a bid to carry out its own DAF Implementation project. Later, the other two partners conducted their own versions of the data audit framework under the auspices of the DataShare project.

A number of scoping activities were carried out by the partners with the goal of informing repository enhancement as well as broader dissemination. These included a State-of-the-Art-Review to determine what had been learned by previous repository projects in the UK that had forayed into the data arena. This resulted in a list of benefits and barriers to deposit of datasets by researchers to inform our outreach activities. A Data Sharing Continuum diagram was developed to illustrate where the projects were aiming to fit into the curation landscape, and the range of curation steps that could be taken, from simple backup to online visualization. Later on, a specialized metadata schema was explored (Data Documentation Initiative or DDI) in terms of how it might be incorporated into repository systems, though repository development in this area was not taken up. Instead, a dataset application profile was developed based on qualified Dublin Core (dcterms). This was implemented in the Edinburgh DataShare repository and adapted by Southampton for their next release. The project wished to explore wider issues with open data and web publishing, and therefore produced two briefing papers to do with data mashups–on numeric data and geospatial data. Finally, the project staff and consultant distilled what it had learned in terms of policy development for data repositories in a training guide. A number of peer reviewed posters, papers, and articles were written by DISC-UK members about various aspects of the project during the period.

Key conclusions were that 1) Data management motivation is a better bottom-up driver for researchers than data sharing but is not sufficient to create culture change, 2) Data librarians, data managers and data scientists can help bridge communication between repository managers & researchers, and 3) IRs can improve impact of sharing data over the internet.

Digital Preservation: PARSE.Insight Project Reports on First Year Achievements

In "Annual Review Year 1: Goals and Achievements," the PARSE.Insight (Permanent Access to the Records of Science in Europe) Project reports on its first year achievements. This post includes links to a number of longer documents, including the PARSE.Insight Deliverable D2.1 Draft Roadmap.

Here's an excerpt from the PARSE.Insight Deliverable D2.1 Draft Roadmap:

The purpose of this document is to provide an overview and initial details of a number of specific components, both technical and non-technical, which would be needed to supplement existing and already planned infrastructures for science data. The infrastructure components presented here are aimed at bridging the gaps between islands of functionality, developed for particular purposes, often by other European projects, whether separated by discipline or time. Thus the infrastructure components are intended to play a general, unifying role in science data. While developed in the context of a European wide infrastructure, there would be great advantages for these types of infrastructure components to be available much more widely.

U.S. Federal Government Launches Data.gov

The U.S. Federal Government has launched Data.gov.

Here's an excerpt from the home page:

The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. Although the initial launch of Data.gov provides a limited portion of the rich variety of Federal datasets presently available, we invite you to actively participate in shaping the future of Data.gov by suggesting additional datasets and site enhancements to provide seamless access and use of your Federal data.

Read more about it at "Data.gov Launched by Federal Government"; "Data.gov Launches to Mixed Reviews"; and "Data.gov Now Live; Looks Nice But Short on Data."

Dryad Repository Gets $2.18 Million Grant from the National Science Foundation

The Dryad Repository has received a $2.18 million grant from the National Science Foundation.

Here's an excerpt from the press release:

The repository, called Dryad, is designed to archive data that underlie published findings in evolutionary biology, ecology and related fields and allow scientists to access and build on each other’s findings.

The grant recipients are:

The National Evolutionary Synthesis Center and the Metadata Research Center have been developing Dryad in coordination with a large group of Journals and Societies in evolutionary biology and ecology. With the new grant, the additional team members are contributing to the development of the repository. . . .

Currently, a tremendous amount of information underlying published research findings is lost, researchers say. The lack of data sharing and preservation makes it impossible for the data to be examined or re-used by future investigators.

Dryad addresses these shortcomings and allows scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, integrate data across studies and look for trends through statistical meta-analysis.

"The Dryad project seeks to enable scientists to generate new knowledge using existing data," said Kathleen Smith, Ph.D., principal investigator for the grant, a biology professor at Duke and director of the National Evolutionary Synthesis Center. "The key to Dryad in our view is making data deposition a routine and easy part of the publication process."

Digital Repositories Roadmap Review: Towards a Vision for Research and Learning in 2013

JISC has released Digital Repositories Roadmap Review: Towards a Vision for Research and Learning in 2013.

Here's an excerpt from the announcement:

The review is structured into two parts. Firstly it makes a number of recommendations targeted at the JISC Executive. The review then goes on to identify a number of milestones of relevance to the wider community that might act as a measure of progress towards the wider vision of enhanced scholarly communication. Achievement of these milestones would be assisted by JISC through its community work and funding programmes. The review addresses repositories for research outputs, research data and learning materials in separate sections.

DigitalKoans

CLARION (Chemical Laboratory Repository In/Organic Notebooks) Project Funded

JISC has funded the CLARION (Chemical Laboratory Repository In/Organic Notebooks) project.

Here's an excerpt from the announcement:

So an important part of CLARION will be developing the means for working with scientists to expose their data at the appropriate time. CLARION will expand to include a variety of spectral data, both from central analytical services and from individual labs. Another key aspect of CLARION is that we shall be integrating it with a commercial electronic laboratory notebook (eLNb). We're in the process of evaluating offerings and expect to make an announcement soon. This will be a key opportunity to see how feasible it is to integrate a standard system with the needs of a departmental repository. The protocols may be harder but we'll have the experience from the crystallography and spectroscopy. An important aspect is that we are keen to develop the Open Data idea globally and we'd be very interested to hear from other groups who are doing – or thinking of doing – similar things.

Infrastructure Planning and Data Curation: A Comparative Study of International Approaches to Enabling the Sharing of Research Data

JISC has released Infrastructure Planning and Data Curation: A Comparative Study of International Approaches to Enabling the Sharing of Research Data.

Here's an excerpt from the announcement:

The current methods of storing research data are as diverse as the disciplines that generate them and are necessarily driven by the myriad ways in which researchers need to subsequently access and exploit the information they contain. Institutional repositories, data centres and all other methods of storing data have to exist within an infrastructure that enables researchers to access and exploit the data, and variant models for this infrastructure can be conceptualised. Discussion of effective infrastructures for curating data is taking place at all levels, wherever research is reliant on the long-term stewardship of digital material. JISC has commissioned this study to survey the different national agendas that are addressing variant infrastructure models, to inform developments within the UK and for facilitating an internationally integrated approach to data curation.

The study of data sharing initiatives in the OECD countries confirmed the traditional perception that the policy instruments are clustered more in the upper end of the stakeholder taxonomy – i.e. at the level of national and research funding organisations whereas the services and practical tools are being developed by organisations at the lower end of the taxonomy. Despite the differences that exist between countries in terms of the models used for research funding, as well as the levels at which decisions are taken, there is agreement on the expected strata of responsibility for applying the instruments of data sharing. This supports the structure of stakeholder taxonomy used in the study.

Draft Roadmap for Science Data Infrastructure

PARSE.Insight has released Draft Roadmap for Science Data Infrastructure.

Here's an excerpt from the announcement:

The draft roadmap provides an overview and initial details of a number of specific components, both technical and non-technical, which would be needed to supplement existing and already planned infrastructures for scientific data. The infrastructure components are aimed at bridging the gaps between islands of functionality, developed for particular purposes, often by other European projects. Thus the infrastructure components are intended to play a general, unifying role in scientific data. While developed in the context of a Europe-wide infrastructure, there would be great advantages for these types of infrastructure components to be available much more widely.

DCC Releases "Database Archiving"

The Digital Curation Centre has released a new briefing paper on "Database Archiving."

Here's an excerpt:

Database archiving is usually seen as a subset of data archiving. In a computational context, data archiving means to store electronic documents, data sets, multimedia files, and so on, for a period of time. The primary goal is to maintain the data in case it is later requested for some particular purpose. Complying with government regulations on data preservation is, for example, a main driver behind data archiving efforts. Database archiving focuses on archiving data that are maintained under the control of a database management system and structured under a database schema, e.g., a relational database.
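
One simple way to picture archiving data together with its schema is a plain-text SQL dump. The sketch below is an illustration of the general idea, not a method taken from the briefing paper. It uses SQLite from Python's standard library as a stand-in for a relational database, and the `samples` table is invented for the example.

```python
import sqlite3

# Build a small example database (stand-in for a production system).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, label TEXT)")
source.executemany("INSERT INTO samples (label) VALUES (?)", [("alpha",), ("beta",)])
source.commit()

# iterdump() yields SQL statements that recreate both schema and rows,
# so the archive is plain text that does not depend on the binary file format.
archive_sql = "\n".join(source.iterdump())

# Restoring: replay the archived statements into a fresh database.
restored = sqlite3.connect(":memory:")
restored.executescript(archive_sql)
rows = restored.execute("SELECT id, label FROM samples ORDER BY id").fetchall()
print(rows)
```

A real archive would also need to record context that a dump does not capture, such as documentation of what the tables and columns mean.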

Rufus Pollock on Open Data and Licensing

In "Open Data Openness and Licensing," Rufus Pollock, a Cambridge University economist, tackles the question of whether open research data should be licensed.

Here's an excerpt:

Over the last couple of years there has been substantial discussion about the licensing (or not) of (open) data and what "open" should mean. In this debate there are two distinct, but related, strands:

  1. Some people have argued that licensing is inappropriate (or unnecessary) for data.
  2. Disagreement about what "open" should mean. Specifically: does openness allow for attribution and share-alike "requirements" or should "open" data mean "public domain" data?

These points are related because arguments for the inappropriateness of licensing data usually go along the lines: data equates to facts over which no monopoly IP rights can or should be granted; as such all data is automatically in the public domain and hence there is nothing to license (and worse "licensing" amounts to an attempt to "enclose" the public domain).

However, even those who think that open data can/should only be public domain data still agree that it is reasonable and/or necessary to have some set of community "rules" or "norms" governing usage of data. Therefore, the question of what requirements should be allowed for "open" data is a common one, whatever one's stance on the PD question.

ARL Report: Current Models of Digital Scholarly Communication

The Association of Research Libraries has released Current Models of Digital Scholarly Communication by Nancy L. Maron and K. Kirby Smith, plus a database of associated examples.

Here's an excerpt from the press release:

In the spring of 2008, ARL engaged Ithaka’s Strategic Services Group to conduct an investigation into the range of online resources valued by scholars, paying special attention to those projects that are pushing beyond the boundaries of traditional formats and are considered innovative by the faculty who use them. The networked digital environment has enabled the creation of many new kinds of works, and many of these resources have become essential tools for scholars conducting research, building scholarly networks, and disseminating their ideas and work, but the decentralized distribution of these new-model works has made it difficult to fully appreciate their scope and number.

Ithaka’s findings are based on a collection of resources identified by a volunteer field team of over 300 librarians at 46 academic institutions in the US and Canada. Field librarians talked with faculty members on their campuses about the digital scholarly resources they find most useful and reported the works they identified. The authors evaluated each resource gathered by the field team and conducted interviews of project leaders of 11 representative resources. Ultimately, 206 unique digital resources spanning eight formats were identified that met the study’s criteria.

The study’s innovative qualitative approach yielded a rich cross-section of today’s state of the art in digital scholarly resources. The report profiles each of the eight genres of resources, including discussion of how and why the faculty members reported using the resources for their work, how content is selected for the site, and what financial sustainability strategies the resources are employing. Each section draws from the in-depth interviews to provide illustrative anecdotes and representative examples.

Highlights from the study’s findings include:

  • While some disciplines seem to lend themselves to certain formats of digital resource more than others, examples of innovative resources can be found across the humanities, social sciences, and scientific/technical/medical subject areas.

  • Of all the resources suggested by faculty, almost every one that contained an original scholarly work operates under some form of peer review or editorial oversight.

  • Some of the resources with greatest impact are those that have been around a long while.

  • While some resources serve very large audiences, many digital publications—capable of running on relatively small budgets—are tailored to small, niche audiences.

  • Innovations relating to multimedia content and Web 2.0 functionality appear in some cases to blur the lines between resource types.

  • Projects of all sizes—especially open-access sites and publications—employ a range of support strategies in the search for financial sustainability.

Presentations from the Oxford Institutional and National Services for Research Data Management Workshop

Presentations from the Institutional and National Services for Research Data Management Workshop at the Oxford Said Business School are now available.

Here's a selection: