



OpenAI accidentally deleted case data

Lawyers representing the New York Times (NYT) and Daily News in their lawsuit against OpenAI, which alleges unauthorized use of their content to train AI models, say OpenAI engineers accidentally deleted data that could be important to the case, TechCrunch reported.

Earlier this year, OpenAI provided two dedicated virtual machines so that lawyers for The New York Times and Daily News could search for their copyrighted content in its AI training sets.

A lawyer representing the New York Times and Daily News filed a letter in the U.S. District Court for the Southern District of New York. The letter is a status update on training-data issues and a renewed demand that OpenAI be ordered to identify and admit which of the News Plaintiffs' (NYT and Daily News) works it used to train each of its GPT models.

The letter states that on November 14, 2024, OpenAI engineers erased the programs and search-result data stored on one of the dedicated virtual machines. However, the publishers' lawyer added that they had no reason to believe the deletion was intentional.

OpenAI training datasets are a ‘sandbox’: NYT

The publishers’ attorney said the plaintiffs “incur a significant burden and expense” in searching for their copyrighted works in OpenAI’s training datasets within a tightly controlled environment that the court and the parties have previously referred to as “the sandbox.”

The publishers’ lawyer said he and the experts the plaintiffs hired had spent more than 150 hours since November 1, 2024 searching OpenAI’s training data. He added that OpenAI was able to recover much of the “deleted” data, but that it had “irretrievably lost” the folder structure and file names of the publishers’ work product.

OpenAI is best placed to research its own datasets

Without the original folder structure and file names, he added, the recovered data is “unreliable” and cannot be used to determine whether OpenAI used the publishers’ copied articles to build its models. Calling the recovered data “unusable,” the publishers’ lawyer argued that OpenAI is best placed to search its own datasets for the publishers’ works using its own tools and equipment.

“The News plaintiffs have also provided the information OpenAI needs to conduct this research. All it takes is OpenAI’s commitment to do so in a timely manner,” he said.

The News Plaintiffs provided OpenAI with detailed instructions for searching its training data for their content using specific URLs and “n-gram” analysis, which detects overlapping phrases between their works and the datasets. However, OpenAI has not yet produced results or confirmed any significant progress. According to the filing, OpenAI’s lawyers reported only “promising meetings” with its engineers, but no tangible results. Additionally, OpenAI said in response to the plaintiffs’ formal requests for admission that it would “neither admit nor deny” the use of the publishers’ work in its training datasets or models.
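For context, n-gram analysis of this kind generally works by breaking each text into overlapping sequences of n consecutive words ("shingles") and checking how many of an article's n-grams also appear in a candidate training text. The sketch below is a generic Python illustration of that idea, not the parties' actual tooling or the instructions described in the filing; the function names, the tokenization, and the choice of 8-word n-grams are all assumptions made for the example.

```python
# Minimal sketch of n-gram overlap detection (hypothetical; not the actual tooling used in the case).
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase and tokenize the text, then return the set of n-word shingles."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(article: str, training_text: str, n: int = 8) -> float:
    """Fraction of the article's n-grams that also appear in the candidate training text."""
    article_grams = ngrams(article, n)
    if not article_grams:
        return 0.0
    training_grams = ngrams(training_text, n)
    return len(article_grams & training_grams) / len(article_grams)

if __name__ == "__main__":
    # Toy example: the candidate text quotes the article verbatim, so every 8-gram overlaps.
    article = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."
    candidate = "Witnesses said the quick brown fox jumps over the lazy dog near the riverbank at dawn every day."
    print(f"8-gram overlap: {overlap_ratio(article, candidate):.2%}")
```

A high overlap ratio would only suggest that a given article appears, verbatim or near-verbatim, in the text being searched; in practice this kind of check has to be run across enormous training corpora, which is why the plaintiffs argue OpenAI is better positioned to do it with its own infrastructure.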

OpenAI’s response

On November 22, 2024, OpenAI filed its answer in the case. In the response, OpenAI’s lawyer denied that the company deleted any evidence, instead attributing the problem to a system misconfiguration by the publishers that resulted in a technical glitch.

“Plaintiffs requested a configuration change to one of the many machines provided by OpenAI to search training datasets. Implementing the change requested by the plaintiffs, however, resulted in the deletion of the folder structure and some file names on a hard drive – a drive that was supposed to be used as a temporary cache to store OpenAI data, but which was obviously also used by the plaintiffs to save some of their search results (apparently without any backup). Regardless, there is no reason to believe that any files were actually lost, and plaintiffs could restart searches to recreate the files with only a few days of computing time,” the company said.

“Plaintiffs’ inspection efforts began with repeated execution of faulty code that overwhelmed and crashed the file system,” it adds.


OpenAI’s lawyer further said the company first made training data available for inspection in June, but publishers delayed reviewing it until October.

“Once they started, the plaintiffs triggered a series of technical problems due to their own errors. As a direct result of the plaintiffs’ self-inflicted wounds, OpenAI was forced to devote enormous resources to supporting the plaintiffs’ inspection, far more than should be necessary,” he adds.

The filing said the publishers want an order requiring OpenAI to respond to nearly 500 million requests for admission.

Willingness to collaborate

The response states that OpenAI is ready to collaborate with the publishers. “The main obstacle here is not technical; it is plaintiffs’ refusal to cooperate,” the company said in its response.

OpenAI said it had offered to support the publishers’ research, so long as they provided “clear and reasonable proposals.”

“OpenAI also offered to perform at least some of the plaintiffs’ research for them and asked the plaintiffs to make a full proposal. Despite OpenAI’s support, plaintiffs returned to their ineffective pursuit of ‘boiling the ocean,’ demanding ever-increasing hardware performance,” it says.

Background

In this case, OpenAI has argued that using publicly available data, such as articles from the NYT and Daily News, to train its models constitutes fair use. According to OpenAI, “learning” on billions of examples requires no licensing or compensation for the data. This remains true even when models use the data for commercial purposes.

Additionally, OpenAI has entered into licensing agreements with a growing number of publishers, including Condé Nast, TIME, the Associated Press, and News Corp.
