Sarah Silverman sues OpenAI, Meta over copyright infringement in AI training

Comedian and writer Sarah Silverman has filed copyright infringement lawsuits against ChatGPT maker OpenAI and Facebook parent company Meta, saying that the companies' use of her copyrighted works to train their AI systems violates her intellectual property rights.

The suits, filed last week in federal district court in San Francisco, argue that Microsoft-backed OpenAI and Meta didn’t have permission to use copyrighted works by Silverman and two other authors, Christopher Golden and Richard Kadrey, when they used them to train ChatGPT and Meta's LLaMA (Large Language Model Meta AI). The suits ask for injunctions to prevent the companies from continuing similar practices, as well as unspecified monetary damages. (Both suits also ask the court to certify them as class actions.)

The heart of the lawsuit against OpenAI, according to the complaint, is the company's use of a data set called BookCorpus, which the complaint says was created in 2015 for the purpose of large language model training. Much of BookCorpus, the plaintiffs say, was copied from a site called Smashwords, a host for self-published novels, which were under copyright. Additionally, the complaint alleges that the book-based data sets used to train OpenAI's models cannot have come entirely from legal sources, as no legal databases offer enough content to account for the size of the “Books1” and “Books2” sets.

Instead, the plaintiffs say, it’s likely that OpenAI used so-called “shadow libraries” like LibGen, Z-Library and Bibliotik to train the AI, and that those sources are where the company obtained the copyrighted works of Silverman and the other plaintiffs.

Both suits were filed by the Joseph Saveri Law Firm, which has already filed a nearly identical class action on behalf of other authors against OpenAI. All of these cases are likely to turn on the court’s interpretation of “fair use” under US copyright law. Fair use is a provision that excuses what would otherwise be violations of copyright law when the use serves purposes such as criticism, news reporting and education. It is a somewhat nebulous concept, requiring courts to weigh four factors in deciding whether a use qualifies: the effect of the use on the potential market for the copyrighted work, the amount of the work used in proportion to its size, the nature of the work itself, and the “purpose and character” of the use.

OpenAI and Meta could argue that the first and last factors, effect on market value and purpose and character, militate in favor of a court finding that the use of copyrighted material in this case qualifies as fair use, while Silverman and her fellow plaintiffs could lean on the commercial nature of the companies’ use of their works, as well as the fact that the works were used in their entirety.

Requests for comment from Meta and OpenAI were not immediately returned.

IT World
