“Researchers Created an Open Rival to OpenAI’s o1 ‘Reasoning’ Model for Under $50”


S1 is based on a small, off-the-shelf AI model from Alibaba-owned Chinese AI lab Qwen, which is available to download for free. . . .

Training s1 took less than 30 minutes using 16 Nvidia H100 GPUs, and the resulting model achieved strong performance on certain AI benchmarks. . . . Niklas Muennighoff, a Stanford researcher who worked on the project, told TechCrunch he could rent the necessary compute today for about $20.

https://tinyurl.com/3mxwcv22

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

“Meta Torrented over 81 TB of Data through Anna’s Archive, Despite Few Seeders”


Freshly unsealed court documents reveal that Meta downloaded significant amounts of data from shadow libraries through Anna’s Archive. The company’s use of BitTorrent was already known, but internal email communication reveals sources and terabytes of downloaded data, as well as a struggle with limited availability and slow download speeds due to a lack of seeders.

https://tinyurl.com/yxzjtnvs

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

OpenAI Video: “Introduction to Deep Research”

From “Introducing Deep Research”:

Deep research is built for people who do intensive knowledge work in areas like finance, science, policy, and engineering and need thorough, precise, and reliable research. It can be equally useful for discerning shoppers looking for hyper-personalized recommendations on purchases that typically require careful research, like cars, appliances, and furniture. Every output is fully documented, with clear citations and a summary of its thinking, making it easy to reference and verify the information. It is particularly effective at finding niche, non-intuitive information that would require browsing numerous websites. Deep research frees up valuable time by allowing you to offload and expedite complex, time-intensive web research with just one query.

https://tinyurl.com/4h2sy9rt

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"DeepSeek Panic Triggers Tech Stock Sell-off as Chinese AI Tops App Store"


There are three elements of DeepSeek R1 that really shocked experts. First, the Chinese startup appears to have trained the model for only $6 million (reportedly about 3% of the cost of training o1) as a so-called “side project” while using less powerful Nvidia H800 AI-acceleration chips due to US export restrictions on cutting-edge GPUs. Secondly, it appeared just four months after OpenAI announced o1 in September 2024. Finally, and perhaps most importantly, DeepSeek released the model weights for free with an open MIT license, meaning anyone can download it, run it, and fine-tune (modify) it.

https://tinyurl.com/3e5bk3cw

For an in-depth analysis, see “China’s DeepSeek AI Model Shocks the World: Should You Sell Your Nvidia Stock?”

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Every AI Copyright Lawsuit in the US, Visualized"


Over the past two years, dozens of other copyright lawsuits against AI companies have been filed at a rapid clip. . . . This wide variety of rights holders are alleging that AI companies have used their work to train what are often highly lucrative and powerful AI models in a manner that is tantamount to theft. . . . Nearly every major generative AI company has been pulled into this legal fight, including OpenAI, Meta, Microsoft, Google, Anthropic, and Nvidia.

We’ve created visualizations to help you track and contextualize which companies and rights holders are involved, where the cases have been filed, what they’re alleging, and everything else you need to know.

https://tinyurl.com/sv4ja66n

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Open Science at the Generative AI Turn: An Exploratory Analysis of Challenges and Opportunities"


Technology influences Open Science (OS) practices, because conducting science in transparent, accessible, and participatory ways requires tools and platforms for collaboration and sharing results. Due to this relationship, the characteristics of the employed technologies directly impact OS objectives. Generative Artificial Intelligence (GenAI) is increasingly used by researchers for tasks such as text refining, code generation/editing, reviewing literature, and data curation/analysis. Nevertheless, concerns about openness, transparency, and bias suggest that GenAI may benefit from greater engagement with OS. GenAI promises substantial efficiency gains but is currently fraught with limitations that could negatively impact core OS values, such as fairness, transparency, and integrity, and may harm various social actors. In this paper, we explore the possible positive and negative impacts of GenAI on OS. We use the taxonomy within the UNESCO Recommendation on Open Science to systematically explore the intersection of GenAI and OS. We conclude that using GenAI could advance key OS objectives by broadening meaningful access to knowledge, enabling efficient use of infrastructure, improving engagement of societal actors, and enhancing dialogue among knowledge systems. However, due to GenAI’s limitations, it could also compromise the integrity, equity, reproducibility, and reliability of research. Hence, sufficient checks, validation, and critical assessments are essential when incorporating GenAI into research workflows.

https://doi.org/10.1162/qss_a_00337

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Text Analysis of Archival Finding Aids; Collection Scoping and Beyond"


In this study, we examine the suitability of text analysis as a method for analyzing collection scope strengths across a repository’s physical archival holdings. We apply a tool for text analysis called Leximancer to analyze a corpus of archival finding aids to explore topical coverage. Leximancer results were highly aligned with the baseline subject heading analysis that we performed, but the concepts, themes, and co-occurring topic pairs surfaced by Leximancer suggest areas of collection strength and potential focus for new acquisitions. We discuss the potential applications of text analysis for internal library use including collection development, as well as potential implications for wider description, discovery, and access. Text analysis can accurately surface topical strengths and directly lead to insights that can inform future acquisition decisions and archival collection development policies.

https://tinyurl.com/mr45f8e7

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"News and Views: How Much Content Can AI Legally Exploit?"


Most OA licenses, even permissive ones like CC BY, require attribution. However, generative AI models inherently strip attribution from the data they process, making compliance nearly impossible. Specialist AIs might be trained to circumvent this, but the bulk of big-name gen AI tools don’t. Compliance with the most basic OA requirement of attribution is unworkable.

Additionally, while traditional licenses clearly delineate permissible use, OA licenses often depend on interpretations of “non-commercial” or “derivative” use that may vary by jurisdiction.

https://tinyurl.com/562k8kee

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"OpenAI and Others Seek New Path to Smarter AI as Current Methods Hit Limitations"


To overcome these challenges [in training AIs with enormous amounts of increasingly scarce data] researchers are exploring “test-time compute,” a technique that enhances existing AI models during the so-called “inference” phase, or when the model is being used. For example, instead of immediately choosing a single answer, a model could generate and evaluate multiple possibilities in real time, ultimately choosing the best path forward. . . .

“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer,” said Noam Brown, a researcher at OpenAI who worked on o1, at the TED AI conference in San Francisco last month.
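The best-of-n idea described above, sampling several candidate answers at inference time and keeping the highest-rated one, can be sketched in a few lines. Everything here is a stand-in: `toy_model` plays the role of a single LLM sample, and its random score plays the role of a verifier's rating.

```python
import random

def toy_model(prompt, rng):
    # Stand-in for one LLM sample: returns a candidate answer plus a
    # quality score. Both are hypothetical, random placeholders.
    answer = rng.choice(["A", "B", "C", "D"])
    score = rng.random()
    return answer, score

def best_of_n(prompt, n, seed=0):
    # Test-time compute: spend extra inference on n candidates and
    # keep the highest-scoring one, instead of committing to the
    # first answer the model produces.
    rng = random.Random(seed)
    candidates = [toy_model(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: c[1])

answer, score = best_of_n("Which option is correct?", n=16)
print(answer, round(score, 3))
```

With a fixed seed, the best of 16 samples can never score worse than the first sample alone, which is the whole trade: more compute at inference time buys a better pick without retraining the model.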

https://tinyurl.com/5n9bwkv6

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft"


Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

https://tinyurl.com/ymen65js

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

New Horizons in Artificial Intelligence in Libraries


This publication explores emerging library AI paradigms: practical implementation of present use cases, opportunities on the horizon, and the large ethical questions now in play, including the need for transparency, scenario planning, and the considerations and implications of bias as library AI systems are developed and implemented today and for our collective future.

https://tinyurl.com/4b5juutm

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Intelligent Summaries: Will Artificial Intelligence Mark the Finale for Biomedical Literature Reviews?"


Manuscripts that only flatly summarize knowledge in a field could become superfluous, as AI-powered systems become better and better at automatically generating more comprehensive and up-to-date summaries. Furthermore, the use of AI technologies in data analysis and synthesis will greatly reduce human tasks, enabling more efficient and timely production of preliminary findings. What kind of reviews will still find room in an academic journal? It is reasonable to believe that reviews that provide critical analysis and unique interpretations of existing literature, connect different areas, shed novel light on available data, and are aware of their human partiality will continue to be valuable in academic journals.

https://doi.org/10.1002/leap.1648

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"How ChatGPT Search (Mis)represents Publisher Content"


In total, we pulled two hundred quotes from twenty publications and asked ChatGPT to identify the sources of each quote. We observed a spectrum of accuracy in the responses: some answers were entirely correct (i.e., accurately returned the publisher, date, and URL of the block quote we shared), many were entirely wrong, and some fell somewhere in between. . . .

In total, ChatGPT returned partially or entirely incorrect responses on a hundred and fifty-three occasions, though it only acknowledged an inability to accurately respond to a query seven times. . . .

Our tests found that no publisher—regardless of degree of affiliation with OpenAI—was spared inaccurate representations of its content in ChatGPT.

https://tinyurl.com/3z9dxttv

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Publishers are Selling Papers to Train AIs — and Making Millions of Dollars"


[Roger] Schonfeld [VP of Ithaka S+R] and his colleagues launched the Generative AI Licensing Agreement Tracker in October. It includes information about licensing deals — confirmed and forthcoming — between technology companies and six major academic publishers, including Wiley, Sage and Taylor & Francis. Schonfeld says that the list documents only public agreements, and that there are probably several others that remain undisclosed. . . .

Some scholars have been apprehensive about deals being made without their knowledge on content they produced. To address this issue, a few publishers have taken steps to involve authors in the process.

https://tinyurl.com/56zwe54p

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Google Shows off AI Tool for Reading Handwritten Text by Rewriting It Digitally"


Imagine writing by hand in a paper notebook, then showing the notes to your camera to instantly make them searchable and organize them in context with previous notes on physical pages. If you’re like me and have particularly messy handwriting, InkSight could help turn your chicken scratch into typewritten text that is still accurate to what you scribble.

https://tinyurl.com/2dt685ba

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"The Chatbot Optimisation Game: Can We Trust AI Web Searches?"


Those wanting a firmer grip on chatbots, then, may have to explore more underhand techniques, such as the one discovered by two computer-science researchers at Harvard University. They’ve demonstrated how chatbots can be tactically controlled by deploying something as simple as a carefully written string of text. This “strategic text sequence” looks like a nonsensical series of characters – all random letters and punctuation – but is actually a delicate command that can strong-arm chatbots into generating a specific response. Not part of a programming language, it’s derived using an algorithm that iteratively develops text sequences that encourage LLMs to ignore their safety guardrails – and steer them towards particular outputs.

https://tinyurl.com/2wuvuur9

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"AI Can Carry out Qualitative Research at Unprecedented Scale"


We have developed and launched an easy-to-use platform for conducting large-scale qualitative interviews, based on artificial intelligence in just this way. A chat interface allows the respondent to interact with an LLM that collects their responses and generates new questions. . . .

First, we asked a team of sociology PhD students from Harvard and the London School of Economics, who specialise in qualitative methods, to assess the quality of interviews based on the interview scripts. The AI-led interviews were rated approximately comparable to an average human expert (under the same conditions). . . . A vast majority of participants reported enjoying their interaction with the conversational agent and preferred this mode of interview over open text fields.

https://tinyurl.com/mry3vrat

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Ithaka S+R: A Third Transformation? Generative AI and Scholarly Publishing


What is not yet clear is how disruptive this [AI] growth will be. To this end, we interviewed 12 leaders in stakeholder communities ranging from large publishers and technology disruptors to academic librarians and scholars. The consensus among the individuals with whom we spoke is that generative AI will enable efficiency gains across the publication process. Writing, reviewing, editing, and discovery will all become easier and faster. Both scholarly publishing and scientific discovery in turn will likely accelerate as a result of AI-enhanced research methods. From that shared premise, two distinct categories of change emerged from our interviews. In the first and most commonly described future, the efficiency gains made publishing function better but did not fundamentally alter its dynamics or purpose. In the second, much hazier scenario, generative AI created a transformative wave that could dwarf the impacts of either the first or second digital transformations [URL added].

https://doi.org/10.18665/sr.321519

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Paywall: "A ‘Delve’ into the Evidence of AI in Production of Academic Business Literature"


The author performed a t-test using the average growth rates of articles published in the database ProQuest ABI/INFORM Global containing keywords or phrases purported to be commonly used in content generated by AI during the years before and after common generative AI availability. Results show evidence that publication rates after generative AI availability experienced an improbably high deviation from the norm.
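The paper's method, comparing average growth rates before and after generative AI availability with a t-test, can be sketched on made-up numbers. The growth rates below are illustrative, not the article's data, and Welch's unequal-variance form of the statistic is one reasonable choice for two groups of different sizes:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    # Welch's two-sample t statistic (does not assume equal variances).
    return (mean(a) - mean(b)) / sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )

# Hypothetical annual growth rates (%) of articles containing marker
# terms such as "delve", before and after generative AI availability.
before = [1.2, 0.8, 1.5, 1.1, 0.9]
after = [6.3, 8.1, 7.4]

print(round(welch_t(after, before), 2))
```

A t value well above 2 is the kind of result the author characterizes as an "improbably high deviation from the norm"; with real data one would also compute the degrees of freedom and a p-value.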

https://doi.org/10.1080/08963568.2024.2420300

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"‘Massive Copyright Violation’ Threatens One of the World’s Hottest AI Apps"


News Corp has officially filed a lawsuit against Perplexity AI over accusations that the startup has committed copyright infringement on a “massive scale.” . . .

Perplexity’s value proposition is instead to insert itself between search and content producers as a middleman, training its AI on copyrighted content that its chatbot will then regurgitate. . . to its own paying customers, without compensating or attributing the original content producers. . . .

https://tinyurl.com/y2h5fpeu

Perplexity

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Publishers Join with Worldwide Coalition to Condemn the Theft of Creative and Intellectual Authorship by Tech Companies for Generative AI Training"


Today, the Association of American Publishers (AAP) joined forces with more than 10,000 creators and coalition partners, including authors, musicians, actors, artists, and photographers, to condemn the theft of creative and intellectual authorship by big tech companies for use in their Generative AI models. In fact, these consumer-facing models and tools would not exist without the books, newspapers, songs, performances, and other invaluable human expressions that were—and continue to be—copied, ingested, and regenerated in blatant disregard of the law.

https://tinyurl.com/4e37e3ff

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Microsoft and OpenAI’s Close Partnership Shows Signs of Fraying"


When OpenAI got its giant investment from Microsoft, it agreed to an exclusive deal to buy computing power from Microsoft and work closely with the tech giant on new A.I. . . .

OpenAI employees complain that Microsoft is not providing enough computing power. . . some have complained that if another company beat it to the creation of A.I. that matches the human brain, Microsoft will be to blame because it hasn’t given OpenAI the computing power it needs. . . .

The contract contains a clause that says that if OpenAI builds artificial general intelligence, or A.G.I. — roughly speaking, a machine that matches the power of the human brain — Microsoft loses access to OpenAI’s technologies.

The clause was meant to ensure that a company like Microsoft did not misuse this machine of the future, but today, OpenAI executives see it as a path to a better contract. . . Under the terms of the contract, the OpenAI board could decide when A.G.I. has arrived.

https://tinyurl.com/y5mjr66d

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"Large Language Publishing: The Scholarly Publishing Oligopoly’s Bet on AI"


This article focuses on an offshoot of the big firms’ surveillance-publishing businesses: the post-ChatGPT imperative to profit from troves of proprietary “training data,” to make new AI products and—the essay predicts—to license academic papers and scholars’ tracked behavior to big technology companies.

https://tinyurl.com/ft2467my

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

Virginia Tech and UC Riverside: "University Libraries Receives Grant to Create Generative Artificial Intelligence Incubator Program"


University Libraries at Virginia Tech and the University of California, Riverside, received a $115,398 Institute of Museum and Library Services grant to create a generative artificial intelligence incubator program (GenAI) to increase the adoption of artificial intelligence (AI) in the library profession and academic libraries. . . .

[Yinlin] Chen [assistant director for the Center for Digital Research and Scholarship at Virginia Tech] will draw on his expertise in advanced GenAI techniques and multidisciplinary AI research to create the incubator program in collaboration with two co-principal investigators: Edward Fox, director of the digital library research laboratory and computer science professor at Virginia Tech, and Zhiwu Xie, assistant university librarian for research and technology at the University of California, Riverside. They will build training materials, workshops, and projects to help librarians become AI practitioners.

https://tinyurl.com/3sysn284

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |

"In Which Fields Can ChatGPT Detect Journal Article Quality? An Evaluation of REF2021 Results"


Time spent by academics on research quality assessment might be reduced if automated approaches can help. Whilst citation-based indicators have been extensively developed and evaluated for this, they have substantial limitations and Large Language Models (LLMs) like ChatGPT provide an alternative approach. This article assesses whether ChatGPT 4o-mini can be used to estimate the quality of journal articles across academia. It samples up to 200 articles from all 34 Units of Assessment (UoAs) in the UK’s Research Excellence Framework (REF) 2021, comparing ChatGPT scores with departmental average scores. There was an almost universally positive Spearman correlation between ChatGPT scores and departmental averages, varying between 0.08 (Philosophy) and 0.78 (Psychology, Psychiatry and Neuroscience), except for Clinical Medicine (rho=-0.12). Although other explanations are possible, especially because REF score profiles are public, the results suggest that LLMs can provide reasonable research quality estimates in most areas of science, and particularly the physical and health sciences and engineering, even before citation data is available. Nevertheless, ChatGPT assessments seem to be more positive for most health and physical sciences than for other fields, a concern for multidisciplinary assessments, and the ChatGPT scores are only based on titles and abstracts, so cannot be research evaluations.
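The headline statistic here, a Spearman correlation between ChatGPT quality scores and departmental average scores, is straightforward to compute: rank both score lists, then take the Pearson correlation of the ranks. The six score pairs below are invented for illustration and are not REF or paper data; the ranking helper assumes no tied scores.

```python
def ranks(xs):
    # Rank positions 1..n (this simple sketch assumes no ties).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the two rank vectors.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for six departments: ChatGPT quality estimates
# vs. departmental REF averages (illustrative values only).
gpt = [2.1, 3.4, 2.8, 3.9, 1.7, 3.1]
ref = [2.4, 3.2, 3.3, 3.8, 2.0, 3.0]
print(round(spearman(gpt, ref), 2))  # → 0.83
```

A rho of 0.83 on this toy data would sit near the top of the paper's reported range (0.08 for Philosophy to 0.78 for Psychology, Psychiatry and Neuroscience); the sign and magnitude, not the raw scores, are what the field-by-field comparison rests on.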

https://arxiv.org/abs/2409.16695

| Artificial Intelligence |
| Research Data Curation and Management Works |
| Digital Curation and Digital Preservation Works |
| Open Access Works |
| Digital Scholarship |