Archive.fm

Forbes Daily Briefing

The Library Of Congress Is A Training Data Playground For AI Companies

With archives hosting about 180 million works, the world’s largest library is drawing interest from AI startups looking to train their large language models on content that won’t get them sued.

Broadcast on:
28 Sep 2024
Audio Format:
other



Here's your Forbes Daily Briefing bonus story of the week. Today on Forbes, the Library of Congress is a training data playground for AI companies.

Black-and-white portraits of Rosa Parks, letters penned by Thomas Jefferson, and the Giant Bible of Mainz, a 15th-century manuscript known to be one of the last handwritten Bibles in Europe. These are among the 180 million items, including books, manuscripts, maps, and audio recordings, housed within the Library of Congress. Every year, hundreds of thousands of visitors walk through the Library's high-ceilinged, pillared halls, passing beneath Renaissance-style domes embellished with murals and mosaics. But of late, the more than 200-year-old library has attracted a new type of patron: AI companies that are eager to access the Library's digital archives, and the 185 petabytes of data stored within them, to develop and train their most advanced AI models. For reference, one petabyte is equal to 1,000 terabytes, or 1 million gigabytes.

Judith Conklin, Chief Information Officer at the Library of Congress, told Forbes, quote, "We know that we have a large amount of digital material that large language model companies are very interested in, it's extraordinarily popular." The upsurge in interest in the Library's data is also reflected in the numbers. The Congress.gov website, which is managed by the Library of Congress and hosts data about bills, statutes, and laws, gets anywhere between 20 million and 40 million monthly hits on its API, an interface that allows programmers to download the Library's data in a machine-readable format. Conklin said the traffic to the Congress.gov API has consistently grown since it became available in September 2022. The Library's API now gets about a million visits every month.

The Library's digital archives host an abundance of rare, original, and authoritative information. It's also diverse.
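As an aside for the programmers mentioned above: the Congress.gov API serves its records as JSON over plain HTTP requests. A minimal sketch of how a request URL might be assembled is below; the v3 base URL and `api_key` parameter follow the public api.congress.gov documentation, but `DEMO_KEY` and the `congress_api_url` helper are illustrative placeholders, not part of the story.

```python
import urllib.parse

# Base URL per the public api.congress.gov documentation (v3).
BASE = "https://api.congress.gov/v3"

def congress_api_url(endpoint: str, api_key: str, **params: str) -> str:
    """Build a Congress.gov API URL for `endpoint` with extra query params.

    The actual fetch (e.g. urllib.request.urlopen) is left to the caller;
    a free API key from api.congress.gov is required for real requests.
    """
    query = urllib.parse.urlencode(
        {"format": "json", "api_key": api_key, **params}
    )
    return f"{BASE}/{endpoint}?{query}"

# Example: ask for the five most recent bills as machine-readable JSON.
url = congress_api_url("bill", api_key="DEMO_KEY", limit="5")
print(url)
```

Fetching that URL (with a valid key) returns a JSON document listing bills, which is the "machine-readable format" the transcript refers to.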
The collections feature content in more than 400 languages, spanning art, music, and most disciplines. But what makes this data especially appealing to AI developers is that these works are in the public domain, not copyrighted or otherwise restricted. While a growing group of artists and organizations are locking up their data to prevent AI companies from scraping it, the Library of Congress has made its data reserves freely available to anyone who wants them. For AI companies that have already mined the entirety of the Internet, scraping everything from YouTube videos to copyrighted books to train their models, the Library is one of the few remaining free resources. Otherwise, they must strike licensing deals with publishers or use AI-generated, so-called "synthetic" data, which can be problematic, leading to degraded responses from the model.

The only caveat: people who want access to the Library's data must collect it via the API, a portal through which anyone, from a genealogist to an AI researcher, can download data. But they are prohibited from scraping content directly from the site, a common practice among AI companies, and one that Conklin said has become a real, quote, "hurdle" for the Library because it slows public access to its archives. She said, quote, "There are others who want our data to train their own models, but they want it fast so they just scrape our websites. If they're hurting the performance of our websites, we have to manually slow them down."

The hunt for data is just one part of the story. Companies like OpenAI, Amazon, and Microsoft are also courting the world's largest library as a customer. They claim AI models can help librarians and subject matter specialists with tasks like navigating catalogs, searching records, and summarizing long documents. This is certainly possible, but there are some rough edges that need to be ironed out first.
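For the curious, the API-not-scraping distinction is concrete: the Library's loc.gov site will return its search results as JSON when asked via the `fo=json` query parameter (documented in the Library's public APIs), so a well-behaved client pages through results and paces its own requests instead of hammering the site. The helper names and the one-second pause below are illustrative assumptions, not the Library's prescription.

```python
import time
import urllib.parse

def loc_search_url(query: str, page: int = 1) -> str:
    """Build a loc.gov search URL that returns JSON instead of HTML.

    `fo=json` (format) and `sp` (search page) follow the Library's
    public loc.gov API conventions.
    """
    params = urllib.parse.urlencode({"q": query, "sp": page, "fo": "json"})
    return f"https://www.loc.gov/search/?{params}"

def paged_urls(query: str, pages: int, pause: float = 1.0):
    """Yield one URL per result page, pausing between pages so the client
    rate-limits itself rather than forcing the Library to, as Conklin
    describes having to do with aggressive scrapers. The pause length is
    an illustrative assumption."""
    for page in range(1, pages + 1):
        if page > 1:
            time.sleep(pause)
        yield loc_search_url(query, page)

urls = list(paged_urls("rosa parks", pages=2, pause=0.01))
print(urls)
```

Each yielded URL can then be fetched with any HTTP client; the point of the sketch is the self-imposed pacing, which is what distinguishes polite API use from the scraping the Library prohibits.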
Natalie Smith, the Library of Congress' Director of Digital Strategy, told Forbes that AI models trained on contemporary data sometimes struggle with historical accuracy, identifying a person holding a book as someone holding a cell phone, for example. She said, quote, "There is an overwhelming bias towards today's times, and so they often apply modern concepts to historical documents."

For full coverage, check out Rashi Srivastava's piece on Forbes.com. This is Kieran Meadows from Forbes. Thanks for tuning in.