OpenAI Caught in Legal Web: Judge Orders Disclosure of Deleted Book Databases

Photo by Sasun Bughdaryan on Unsplash
The tech world is buzzing with a groundbreaking legal showdown that could reshape how artificial intelligence companies handle copyrighted material. A Manhattan Federal Court Magistrate Judge has ordered OpenAI to turn over internal communications about why it deleted two massive book databases potentially used to train ChatGPT.
The case centers around a class-action lawsuit involving major players like the Authors Guild, bestselling authors George R.R. Martin and John Grisham. They allege that OpenAI downloaded and used books from LibGen, an online library previously shut down by courts, to train its AI systems without proper authorization.
In a significant twist, OpenAI deleted two datasets called “Books1” and “Books2” in 2022 - reportedly containing over 100,000 books - before any litigation began. Initially, the company claimed the datasets were deleted due to “non-use,” but their explanations have been inconsistent.
Judge Ona Wang’s ruling critically undermines OpenAI’s attempts to shield its internal communications under attorney-client privilege. “OpenAI has put its state of mind at issue,” Wang wrote, emphasizing that the company cannot selectively use legal protections to restrict inquiries into their actions.
The decision mandates that OpenAI must now disclose all communications with lawyers regarding the database deletions, including previously redacted internal references to LibGen. This ruling could have far-reaching implications for how AI companies handle training data and respect intellectual property rights.
OpenAI has already signaled its intent to appeal the ruling, stating they disagree with the judge’s decision. The case highlights the ongoing tension between technological innovation and copyright protection in the rapidly evolving AI landscape.
As the legal battle continues, it raises critical questions about transparency, consent, and the ethical boundaries of artificial intelligence development. The outcome could set a precedent for how tech companies approach data collection and usage in machine learning models.
AUTHOR: mls
SOURCE: The Mercury News



















































