Software Curation and Preservation

"Software Preservation after the Internet"

Software preservation must consider knowledge management as a key challenge. We suggest a conceptualization of software preservation approaches that are available at different stages of the software lifecycle and can support memory institutions to assess the current state of software items in their collection, the capabilities of their infrastructure, and completeness and applicability of knowledge that is required to successfully steward the collection.

https://tinyurl.com/8y9svs7x

"Where Is All the Research Software? An Analysis of Software in UK Academic Repositories"

This research examines the prevalence of research software as independent records of output within UK academic institutional repositories (IRs). There has been a steep decline in numbers of research software submissions to the UK’s Research Excellence Framework from 2008 to 2021, but there has been no investigation into whether and how the official academic IRs have affected the low return rates. In what we believe to be the first such census of its kind, we queried the 182 online repositories of 157 UK universities. Our findings show that the prevalence of software within UK Academic IRs is incredibly low. Fewer than 28% contain software as recognised academic output. Of greater concern, we found that over 63% of repositories do not currently record software as a type of research output and that several Universities appeared to have removed software as a defined type from default settings of their repository. We also explored potential correlations, such as being a member of the Russell group, but found no correlation between these metadata and prevalence of records of software. Finally, we discuss the implications of these findings with regards to the lack of recognition of software as a discrete research output in institutions, despite the opposite being mandated by funders, and we make recommendations for changes in policies and operating procedures.

https://doi.org/10.7717/peerj-cs.1546

Software Metadata Recommended Format Guide, Version 1.1.0

The Software Metadata Recommended Format Guide (SMRF) describes and represents metadata elements identified by the Software Preservation Network that are appropriate to describe software materials in the context of a wide range of collections. SMRF aims to be adaptable, so that it can be used in different contexts and systems across libraries, museums, archives, and repositories. It is not meant to be exhaustive; instead SMRF is meant to provide a framework for cultural institutions and collections to determine which metadata to capture, and how to capture it, for their own collections

https://tinyurl.com/mr3hecys

"How Does Mandated Code-Sharing Change Peer Review?"

At the end of the year-long trial period, code sharing had risen from 53% in 2019 to 87% for 2021 articles submitted after the policy went into effect. Evidence in hand, the journal Editors-in-Chief decided to make code sharing a permanent feature of the journal. Today, the sharing rate is 96%.

https://tinyurl.com/5n9yh9yj

"Towards Research Software-ready Libraries"

Software is increasingly acknowledged as valid research output. Academic libraries adapt to this change to become research software-ready. Software publication and citation are key areas in this endeavor. We present and discuss the current state of the practice of software publication and software citation, and discuss four areas of activity that libraries engage in: (1) technical infrastructure, (2) training and support, (3) software management and curation, (4) policies.

https://doi.org/10.1515/abitech-2023-0031

"Computational Reproducibility of Jupyter Notebooks from Biomedical Publications"

Jupyter notebooks facilitate the bundling of executable code with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows. The reproducibility of computational aspects of research is a key component of scientific reproducibility but has not yet been assessed at scale for Jupyter notebooks associated with biomedical publications. We address computational reproducibility at two levels: First, using fully automated workflows, we analyzed the computational reproducibility of Jupyter notebooks related to publications indexed in PubMed Central. We identified such notebooks by mining the articles full text, locating them on GitHub and re-running them in an environment as close to the original as possible. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications. Second, this study represents a reproducibility attempt in and of itself, using essentially the same methodology twice on PubMed Central over two years. Out of 27271 notebooks from 2660 GitHub repositories associated with 3467 articles, 22578 notebooks were written in Python, including 15817 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 10388 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 1203 notebooks ran through without any errors, including 879 that produced results identical to those reported in the original notebook and 324 for which our results differed from the originally reported ones. Running the other notebooks resulted in exceptions. We zoom in on common problems, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.

https://arxiv.org/abs/2308.07333

More about Jupyter notebooks.

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include code, interactive widgets, plots, narrative text, equations, images and even video! The Jupyter name comes from 3 programming languages: Julia, Python, and R. It is a popular tool for literate programming. Donald Knuth first defined literate programming as a script, notebook, or computational document that contains an explanation of the program logic in a natural language (e.g. English or Mandarin), interspersed with snippets of macros and source code, which can be compiled and rerun. You can think of it as an executable paper!

"Code Sharing Increases Citations, but Remains Uncommon"

Overall, R code was only available in 49 of the 1001 papers examined (4.9%) (Figure 1). When included, code was most often in the Supplemental Information (41%), followed by Github (20%), Figshare (6%), or other repositories (33%). Open-access publications were 70% more likely to include code than closed access publications (7.21% vs. 4.22%, X² = 4.442, p < 0.05). Code-sharing was estimated to increase at 0.5% / year, but this trend was not significant (p=0.11). The year of 2021 and 2022 showed a shift towards more frequent sharing, but the percentage of code-sharing has been consistently below 15% over the past decade (Figure 1).

We found papers including code disproportionately impact the literature (Figure 2), and accumulate citations faster (i.e., a marginally significant year-by-code-inclusion interaction; p = 0.0863). Further, we found a significant interaction between Open Access and code inclusion (p = 0.0265), with publications meeting both Open Science criteria (i.e., open code and open access) having highest overall predicted citation rates (Figure 2). For example, Open Science papers are expected to receive more than doubled citations (96.25 vs. 36.89) in year 13 post-publication compared with fully closed papers (Figure 2).

https://doi.org/10.21203/rs.3.rs-3222221/v1

"Care to Share? Experimental Evidence on Code Sharing Behavior in the Social Sciences"

Transparency and peer control are cornerstones of good scientific practice and entail the replication and reproduction of findings. The feasibility of replications, however, hinges on the premise that original researchers make their data and research code publicly available. This applies in particular to large-N observational studies, where analysis code is complex and may involve several ambiguous analytical decisions. To investigate which specific factors influence researchers’ code sharing behavior upon request, we emailed code requests to 1,206 authors who published research articles based on data from the European Social Survey between 2015 and 2020. In this preregistered multifactorial field experiment, we randomly varied three aspects of our code request’s wording in a 2x4x2 factorial design: the overall framing of our request (enhancement of social science research, response to replication crisis), the appeal why researchers should share their code (FAIR principles, academic altruism, prospect of citation, no information), and the perceived effort associated with code sharing (no code cleaning required, no information). Overall, 37.5% of successfully contacted authors supplied their analysis code. Of our experimental treatments, only framing affected researchers’ code sharing behavior, though in the opposite direction we expected: Scientists who received the negative wording alluding to the replication crisis were more likely to share their research code. Taken together, our results highlight that the availability of research code will hardly be enhanced by small-scale individual interventions but instead requires large-scale institutional norms.

https://doi.org/10.1371/journal.pone.0289380

"Policy Recommendations to Ensure That Research Software Is Openly Accessible and Reusable"

There is now an opportunity to expand US federal policies in similar ways and align their research software sharing aspects across agencies.

To do this, we recommend:

As part of their updated policy plans submitted in response to the 2022 OSTP memo, US federal agencies should, at a minimum, articulate a pathway for developing guidance on research software sharing, and, at a maximum, incorporate research software sharing requirements as a necessary extension of any data sharing policy and a critical strategy to make data truly FAIR (as these principles have been adapted to apply to research software [12]).

As part of sharing requirements, federal agencies should specify that research software should be deposited in trusted, public repositories that maximize discovery, collaborative development, version control, long-term preservation, and other key elements of the National Science and Technology Council’s "Desirable Characteristics of Data Repositories for Federally Funded Research" [13], as adapted to fit the unique considerations of research software.

US federal agencies should encourage grantees to use non-proprietary software and file formats, whenever possible, to collect and store data. We realize that for some research areas and specialized techniques, viable non-proprietary software may not exist for data collection. However, in many cases, files can be exported and shared using non-proprietary formats or scripts can be provided to allow others to open files.

Consistent with the US Administration’s approach to cybersecurity [<14], federal agencies should provide clear guidance on measures grantees are expected to undertake to ensure the security and integrity of research software. This guidance should encompass the design, development, dissemination, and documentation of research software. Examples include the National Institute of Standards and Technology’s secure software development framework and Linux Foundation’s open source security foundation.

As part of the allowable costs that grantees can request to help them meet research sharing requirements, US federal agencies should include reasonable costs associated with developing and maintaining research software needed to maximize data accessibility and reusability for as long as it is practical. Federal agencies should ensure that such costs are additive to proposal budgets, rather than consuming funds that would otherwise go to the research itself.

US federal agencies should encourage grantees to apply licenses to their research software that facilitate replication, reuse, and extensibility, while balancing individual and institutional intellectual property considerations. Agencies can point grantees to guidance on desirable criteria for distribution terms and approved licenses from the Open Source Initiative.

In parallel with the actions listed above that can be immediately incorporated into new public access plans, US federal agencies should also explore long-term strategies to elevate research software to co-equal research outputs and further incentivize its maintenance and sharing to improve research reproducibility, replicability, and integrity.

https://doi.org/10.1371/journal.pbio.3002204

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

Replayed: Essential Writings on Software Preservation and Game Histories

Since the early 2000s, Henry Lowood has led or had a key role in numerous initiatives devoted to the preservation and documentation of virtual worlds, digital games, and interactive simulations, establishing himself as a major scholar in the field of game studies. . . . Replayed consolidates Lowood’s far-flung and significant publications on these subjects into a single volume.

https://www.press.jhu.edu/books/title/12805/replayed

"Advancing Software Citation Implementation (Software Citation Workshop 2022)"

Software is foundationally important to scientific and social progress, however, traditional acknowledgment of the use of others’ work has not adapted in step with the rapid development and use of software in research. This report outlines a series of collaborative discussions that brought together an international group of stakeholders and experts representing many communities, forms of labor, and expertise. Participants addressed specific challenges about software citation that have so far gone unresolved. The discussions took place in summer 2022 both online and in-person and involved a total of 51 participants. The activities described in this paper were intended to identify and prioritize specific software citation problems, develop (potential) interventions, and lay out a series of mutually supporting approaches to address them. The outcomes of this report will be useful for the GLAM (Galleries, Libraries, Archives, Museums) community, repository managers and curators, research software developers, and publishers.

https://arxiv.org/abs/2302.07500v1

"A Sustainable Infrastructure Concept for Improved Accessibility, Reusability, and Archival of Research Software "

Research software is an integral part of most research today and it is widely accepted that research software artifacts should be accessible and reproducible. However, the sustainable archival of research software artifacts is an ongoing effort. We identify research software artifacts as snapshots of the current state of research and an integral part of a sustainable cycle of software development, research, and publication. We develop requirements and recommendations to improve the archival, access, and reuse of research software artifacts based on installable, configurable, extensible research software, and sustainable public open-access infrastructure. The described goal is to enable the reuse and exploration of research software beyond published research results, in parallel with reproducibility efforts, and in line with the FAIR principles for data and software. Research software artifacts can be reused in varying scenarios. To this end, we design a multi-modal representation concept supporting multiple reuse scenarios. We identify types of research software artifacts that can be viewed as different modes of the same software-based research result, for example, installation-free configurable browser-based apps to containerized environments, descriptions in journal publications and software documentation, or source code with installation instructions. We discuss how the sustainability and reuse of research software are enhanced or enabled by a suitable archive infrastructure. Finally, at the example of a pilot project at the University of Stuttgart, Germany—a collaborative effort between research software developers and infrastructure providers—we outline practical challenges and experiences

https://arxiv.org/abs/2301.12830

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

A Preservationist’s Guide to the DMCA Exemption for Software Preservation, 2nd Edition

In late 2021, the Library of Congress adopted several exemptions to the Digital Millennium Copyright Act (DMCA) provision prohibiting circumvention of technological measures that control access to copyrighted works. In other words, they created a set of exceptions to the general legal rule against cracking digital locks on things like DVDs, software, and video games. The exemptions are set out in regulations published by the Copyright Office. They went into effect on October 28, 2021 and last until October 28th, 2024. This guide is intended to help preservationists determine whether their activities are protected by the new exemptions. It includes important updates to the first edition to reflect changes in the rule to allow offsite access to non-game software, along with a few other technical changes.

https://doi.org/10.5281/zenodo.7328908

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"How Often Do Cancer Researchers Make Their Data and Code Available and What Factors Are Associated with Sharing"

One in five studies declared data were publicly available (59/306, 19%, 95% CI: 15–24%). However, when data availability was investigated this percentage dropped to 16% (49/306, 95% CI: 12–20%), and then to less than 1% (1/306, 95% CI: 0–2%) when data were checked for compliance with key FAIR principles. While only 4% of articles that used inferential statistics reported code to be available (10/274, 95% CI: 2–6%), the odds of reporting code to be available were 5.6 times higher for researchers who shared data.

https://doi.org/10.1186/s12916-022-02644-2

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

Supporting Software Preservation Services in Research and Memory Organizations

Supporting Software Preservation Services in Research and Memory Organizationsidentifies concepts, skill sets, barriers, and future directions related to software preservation work. Although definitions of "software" can vary across preservation contexts, the study found that there appears to be wide support for inter-organizational collaboration in software preservation. The report includes 13 recommendations for broadening representation in the field, defining the field, networking and community building, informal and formal learning, and implementing shared infrastructures and model practices.

https://cutt.ly/4NJHcoF

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"Who Writes Scholarly Code?"

This paper presents original research about the behaviours, histories, demographics, and motivations of scholars who code, specifically how they interact with version control systems locally and on the Web. By understanding patrons through multiple lenses—daily productivity habits, motivations, and scholarly needs—librarians and archivists can tailor services for software management, curation, and long-term reuse, raising the possibility for long-term reproducibility of a multitude of scholarship.

http://www.ijdc.net/article/view/839

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

Paywall: "A Perspective on Computational Research Support Programs in the Library: More than 20 Years of Data from Stanford University Libraries"

Presentation of data is a major component to academic research. However, programming languages, computational tools, and methods for exploring and analyzing data can be time consuming and frustrating to learn and finding help with these stages of the broader research process can be daunting. In this work, we highlight the impacts that computational research support programs housed in library contexts can have for fulfilling gaps in student, staff, and faculty research needs. The archival history of one such organization, Software and Services for Data Science (SSDS) in the Stanford University Cecil H. Green Library, is used to outline challenges faced by social sciences and humanities researchers from the 1980s to the present day.

https://doi.org/10.1177/09610006221124619

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"Introducing the FAIR Principles for Research Software"

The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.

https://doi.org/10.1038/s41597-022-01710-x

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |

"Nine Best Practices for Research Software Registries and Repositories"

Scientific software registries and repositories improve software findability and research transparency, provide information for software citations, and foster preservation of computational methods in a wide range of disciplines. Registries and repositories play a critical role by supporting research reproducibility and replicability, but developing them takes effort and few guidelines are available to help prospective creators of these resources. To address this need, the FORCE11 Software Citation Implementation Working Group convened a Task Force to distill the experiences of the managers of existing resources in setting expectations for all stakeholders. In this article, we describe the resultant best practices which include defining the scope, policies, and rules that govern individual registries and repositories, along with the background, examples, and collaborative work that went into their development.

https://doi.org/10.7717/peerj-cs.1023

| Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship |