"Towards a Books Data Commons for AI Training"


This white paper describes ways of building a books data commons: a responsibly designed, broadly accessible data set of digitized books to be used in training AI models. This report, written in partnership with Creative Commons and Proteus Strategies, is based on a series of workshops that brought together practitioners building AI models, legal and policy scholars, and experts working with collections of digitized books.

In the paper, we first explain why books matter for AI training and how broader access could be beneficial. We then summarize two tracks that might be considered for developing such a resource, highlighting existing projects that help foreground the potential challenges. One track relies on public domain and permissively licensed books, while the other depends on exceptions to copyright to enable training on in-copyright books. The report also presents several key design choices and next steps that could advance further development of this approach.

https://tinyurl.com/2fu47552



Author: Charles W. Bailey, Jr.