Data and Text Mining – DigitalKoans

"Text Analysis of Archival Finding Aids: Collection Scoping and Beyond"

In this study, we examine the suitability of text analysis as a method for analyzing collection scope strengths across a repository’s physical archival holdings. We apply a tool for text analysis called Leximancer to analyze a corpus of archival finding aids to explore topical coverage. Leximancer results were highly aligned with the baseline subject heading analysis that we performed, but the concepts, themes, and co-occurring topic pairs surfaced by Leximancer suggest areas of collection strength and potential focus for new acquisitions. We discuss the potential applications of text analysis for internal library use including collection development, as well as potential implications for wider description, discovery, and access. Text analysis can accurately surface topical strengths and directly lead to insights that can inform future acquisition decisions and archival collection development policies.

https://tinyurl.com/mr45f8e7
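The abstract above mentions co-occurring topic pairs surfaced from a corpus of finding aids. As a rough illustration of that general idea only, here is a minimal Python sketch that counts how many documents contain each pair of terms. It is not Leximancer's method (the study does not describe its internals here), and the stop list, function names, and sample finding-aid snippets are all invented for this example.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical stop list, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "on", "from", "with"}


def tokenize(text):
    """Lowercase the text and return the set of non-stopword terms it contains."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def cooccurrence(docs):
    """Count, for each unordered term pair, how many documents contain both terms."""
    pairs = Counter()
    for doc in docs:
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        terms = sorted(tokenize(doc))
        pairs.update(combinations(terms, 2))
    return pairs


# Invented sample snippets standing in for finding-aid descriptions.
finding_aids = [
    "Correspondence and photographs on labor unions in the textile industry",
    "Photographs of textile mills and labor strikes",
    "Letters from a botanist on alpine plants",
]

pair_counts = cooccurrence(finding_aids)
top_pairs = pair_counts.most_common(3)
for pair, count in top_pairs:
    print(pair, count)
```

In this toy corpus the pairs shared by the first two snippets (labor, photographs, textile) rank highest, which is the kind of signal a collection-scoping analysis might inspect. A real analysis would need a proper stop list, stemming or concept grouping, and a far larger corpus.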

Paywall: "Copyright and Text and Data Mining: Is the Current Legislation Sufficient and Adequate?"

This paper presents the basic aspects of legislation applicable to text and data mining activities. It offers a detailed comparative analysis of the norms of the main jurisdictions that have regulated them to date [Japan, UK, US, and EU] highlighting in each case the positive and negative aspects.

https://doi.org/10.1353/pla.2024.a931775

"TDM & AI Rights Reserved? Fair Use & Evolving Publisher Copyright Statements"

Earlier this year, we noticed that some academic publishers have revised the copyright notices on their websites to state they reserve rights to text and data mining (TDM) and AI training (for example, see the website footers for Elsevier and Wiley). . . . SPARC asked Kyle K. Courtney, Director of Copyright and Information Policy for Harvard Library, to address key questions regarding these revised copyright statements and the continuing viability of fair use justifications for TDM.

https://tinyurl.com/4prkfbb3

"Developing Text and Data Mining (TDM) Support within a University Research Library"

The introduction of the text and data mining (TDM) exception in 2014 led to researchers asking for support from staff within Library Services at the University of Birmingham. An initial involvement with a funded corpus linguistics project fostered an effective partnership between the Copyright and Licensing Team and the University’s Research Infrastructure Team. This case study traces the TDM journey that Library Services has subsequently undertaken. The article will look at how staff in Copyright and Licensing and the Research Skills Team identified the original service gap. It will also look at issues impacting on supporting TDM and the results of a TDM survey that was sent to researchers. It concludes with a reflection on how the service might evolve in the future — from the creation and availability of TDM datasets, to the skills development of both librarians and the university communities they support, and the impact artificial intelligence (AI) developments might have on TDM practices.

https://doi.org/10.1629/uksg.646

"Fair Use Rights to Conduct Text and Data Mining and Use Artificial Intelligence Tools Are Essential for UC Research and Teaching"

The UC Libraries invest more than $60 million each year licensing systemwide electronic content needed by scholars for these and other studies. (Indeed, the $60 million figure represents license agreements made at the UC systemwide and multi-campus levels. But each individual campus also licenses electronic resources, adding millions more in total expenditures.) Our libraries secure campus access to a broad range of digital resources including books, scientific journals, databases, multimedia resources, and other materials. In doing so, the UC Libraries must negotiate licensing terms that ensure scholars can make both lawful and comprehensive use of the materials the libraries have procured. Increasingly, however, publishers and vendors are presenting libraries with content license agreements that attempt to preclude, or charge additional and unsupportable fees for, fair uses like training AI tools in the course of conducting TDM. . . .

If the UC Libraries are unable to protect these fair uses, UC scholars will be at the mercy of publishers aggregating and controlling what may be done with the scholarly record. Further, UC scholars’ pursuit of knowledge will be disproportionately stymied relative to academic colleagues in other global regions, given that a large proportion of other countries preclude contractual override of research exceptions.

Indeed, in more than forty countries—including all those within the European Union (EU)—publishers are prohibited from using contracts to abrogate exceptions to copyright in non-profit scholarly and educational contexts. Article 3 of the EU’s Directive on Copyright in the Digital Single Market preserves the right for scholars within research organizations and cultural heritage institutions (like those researchers at UC) to conduct TDM for scientific research, and further proscribes publishers from invalidating this exception by license agreements (see Article 7). Moreover, under AI regulations recently adopted by the European Parliament, copyright owners may not opt out of having their works used in conjunction with artificial intelligence tools in TDM research—meaning copyrighted works must remain available for scientific research that is reliant on AI training, and publishers cannot override these AI training rights through contract. Publishers are thus obligated to—and do—preserve fair use-equivalent research exceptions for TDM and AI within the EU, and can do so in the United States, too. . . .

In all events, adaptable licensing language can address publishers’ concerns by reiterating that the licensed products may be used with AI tools only to the extent that doing so would not: i. create a competing or commercial product or service for use by third parties; ii. unreasonably disrupt the functionality of the subscribed products; or iii. reproduce or redistribute the subscribed products for third parties. In addition, license agreements can require commercially reasonable security measures (as also required in the EU) to extinguish the risk of content dissemination beyond permitted uses. In sum, these licensing terms can replicate the research rights that are unequivocally reserved for scholars elsewhere.

https://tinyurl.com/4fvpdz35

"Licensing Challenges Associated With Text and Data Mining: How Do We Get Our Patrons What They Need?"

Today’s researchers expect to be able to complete text and data mining (TDM) work on many types of textual data. But they are often blocked more by contractual limitations on what data they can use, and how they can use it, than they are by what data may be available to them. This article lays out the different types of TDM processes currently in use, the issues that may block researchers from being able to do the work they would like, and some possible solutions.

https://doi.org/10.31274/jlsc.15530

Research Data Publication and Citation Bibliography | Research Data Sharing and Reuse Bibliography | Research Data Curation and Management Bibliography | Digital Scholarship

"UC Berkeley Library and Internet Archive Co-directing Project to Help Text Data Mining Researchers Navigate Cross-Border Legal and Ethical Issues"

https://cutt.ly/ZXzHWmu

Paywall: "Text Mining for Type of Research Classification"

https://doi.org/10.1080/01639374.2021.1998281

"CADRE: A Cloud-Based Data Service for Big Bibliographic Data"

https://dl.acm.org/doi/abs/10.1145/3459637.3481898

CADRE: Collaborative Archive & Data Research Environment

Pamela Samuelson: "Text and Data Mining of In-Copyright Works: Is It Legal?"

https://cutt.ly/URMrNf7

"Exploring Data Mining: Facets and Emerging Trends"

https://doi.org/10.1108/DLP-08-2020-0078

Transforming Library Services for Computational Research with Text Data: Environmental Scan, Stakeholder Perspectives, and Recommendations for Libraries

https://cutt.ly/XQDKsT5

"Building Legal Literacies for Text Data Mining"

https://cutt.ly/Lm9ImUs

Paywall: "Application of Text Mining Techniques on Scholarly Research Articles: Methods and Tools"

https://doi.org/10.1080/13614533.2021.1918190

"ODDPub—a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications"

http://doi.org/10.5334/dsj-2020-042

Paywall: "Collaborative Digital Research: Case Study of Text Mining a Corpus of Academic Journals"

https://doi.org/10.1080/13614533.2020.1819352

Christine L. Borgman: "Whose Text, Whose Mining, and to Whose Benefit?"

https://escholarship.org/uc/item/3682b9j6

"Library Receives $1M Mellon Grant to Experiment with Digital Collections as Big Data"

https://www.loc.gov/item/prn-19-098/?loclr=ealn

Dahlgren Memorial Library: "Text Mining for Clinical Support"

https://doi.org/10.5195/jmla.2019.758

Carl Malamud: "The Plan to Mine the World’s Research Papers"

https://www.nature.com/articles/d41586-019-02142-1

"The Fate of Text and Data Mining in the European Copyright Overhaul"

https://www.eff.org/deeplinks/2018/04/text-and-data-mining-european-copyright-overhaul

"Releasing 1.8 Million Open Access Publications from Publisher Systems for Text and Data Mining"

Petr Knoth, Nancy Pontika and Lucas Anastasiou have published "Releasing 1.8 Million Open Access Publications from Publisher Systems for Text and Data Mining" in LSE Impact of Social Sciences.

Here's an excerpt:

Text and data mining offers an opportunity to improve the way we access and analyse the outputs of academic research. But the technical infrastructure of the current scholarly communication system is not yet ready to support TDM to its full potential, even for open access outputs. To address this problem, Petr Knoth, Nancy Pontika and Lucas Anastasiou have developed the CORE Publisher Connector, a toolkit service designed to assist text miners in accessing content through a single machine interface. The Connector aims to solve the heterogeneity among publisher APIs and assist text miners with data collection, provide a centralised point of access to all openly available scientific publications, and provide a high-performance, constantly updated access interface.
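To make the "single machine interface" idea in the excerpt concrete, here is a minimal Python sketch of an adapter layer over sources that return data in different shapes. It is not the CORE Publisher Connector or its API: every class name, record format, and sample record below is hypothetical, and no network calls are made.

```python
from abc import ABC, abstractmethod


class PublisherSource(ABC):
    """Hypothetical adapter: wraps one publisher's API behind a common method."""

    @abstractmethod
    def fetch(self, query):
        """Return a list of dicts with 'title', 'full_text', and 'source' keys."""


class DictStylePublisher(PublisherSource):
    """Stand-in for a publisher whose (invented) API yields dicts with 'name' and 'body'."""

    def __init__(self, records):
        self._records = records

    def fetch(self, query):
        q = query.lower()
        return [
            {"title": r["name"], "full_text": r["body"], "source": "dict-style"}
            for r in self._records
            if q in r["body"].lower()
        ]


class TupleStylePublisher(PublisherSource):
    """Stand-in for a publisher whose (invented) API yields (title, text) tuples."""

    def __init__(self, rows):
        self._rows = rows

    def fetch(self, query):
        q = query.lower()
        return [
            {"title": title, "full_text": text, "source": "tuple-style"}
            for title, text in self._rows
            if q in text.lower()
        ]


class Connector:
    """Single access point: queries every source and merges the normalized records."""

    def __init__(self, sources):
        self._sources = list(sources)

    def search(self, query):
        results = []
        for source in self._sources:
            results.extend(source.fetch(query))
        return results


# Invented sample data standing in for two heterogeneous publisher systems.
connector = Connector([
    DictStylePublisher([{"name": "Paper One", "body": "Text mining of open access articles."}]),
    TupleStylePublisher([("Paper Two", "A survey of data mining tools."),
                         ("Paper Three", "Archival description practices.")]),
])
hits = connector.search("mining")
for hit in hits:
    print(hit["source"], hit["title"])
```

The point of the design is that a text miner writes code against one record shape, and each new publisher only requires a new adapter class rather than changes to the mining code.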

HathiTrust Research Center User Requirements Study White Paper

Eleanor Dickson et al. have self-archived "HathiTrust Research Center User Requirements Study White Paper."

Here's an excerpt:

This paper presents findings from an investigation into trends and practices in humanities and social sciences research that incorporates text data mining. As affiliates of the HathiTrust Research Center (HTRC), the purpose of our study was to illuminate researcher needs and expectations for text data, tools, and training for text mining in order to better understand our current and potential user community. Results of our study have and will continue to inform development of HTRC tools and services for computational text analysis.

"Text Data Mining from the Author’s Perspective: Whose Text, Whose Mining, and to Whose Benefit?"

Christine L. Borgman has self-archived "Text Data Mining from the Author's Perspective: Whose Text, Whose Mining, and to Whose Benefit?"

Here's an excerpt:

Given the many technical, social, and policy shifts in access to scholarly content since the early days of text data mining, it is time to expand the conversation about text data mining from concerns of the researcher wishing to mine data to include concerns of researcher-authors about how their data are mined, by whom, for what purposes, and to whose benefits.

An Analytical Review of Text and Data Mining Practices and Approaches in Europe

OpenForum Europe has released An Analytical Review of Text and Data Mining Practices and Approaches in Europe: Policy Recommendations in View of the Upcoming Copyright Legislative Proposal.

Here's an excerpt:

Europe needs a regime which enables any researcher, citizen, company or other entity to engage in TDM activities, using material to which they have lawful access, wherever they feel there is a good idea. The exact commercial rewards can be managed at subsequent stages, depending on the implementation of the mining outcome. The protection could be considered at the point at which some clearly commercially beneficial project, product, service, business or company has emerged.
