OpenAI inadvertently erased potential evidence in the NY Times copyright lawsuit

OpenAI developers unintentionally erased data that may have been pertinent to the lawsuit, according to attorneys for The New York Times and Daily News, which are suing the company for allegedly stealing their work to train its AI models without their consent.

Earlier in the fall, OpenAI agreed to provide two virtual machines so that lawyers for The Times and Daily News could search its AI training sets for their copyrighted content. (Virtual machines are software-based computers that run within another computer’s operating system; they are frequently used for testing, data backup, and application execution.) In a letter, the publishers’ lawyers say that since November 1, they and the experts they hired have spent more than 150 hours searching OpenAI’s training data.

However, the letter, which was filed late Wednesday in the U.S. District Court for the Southern District of New York, says that on November 14, OpenAI developers deleted all of the publishers’ search data stored on one of the virtual machines.

OpenAI tried to recover the data and was largely successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” according to the letter.

“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” counsel for The Times and Daily News wrote. “This supplemental letter is being filed today because the plaintiffs only found out yesterday that the recovered data is unusable and that a week’s worth of work by its experts and lawyers must be redone.”

The plaintiffs’ attorneys make clear that they don’t think the deletion was intentional. They say, however, that the episode underscores that OpenAI “is in the best position to search its own datasets” for potentially infringing content using its own tools.

A representative for OpenAI declined to comment.

In this and other cases, OpenAI has argued that training models on publicly accessible data, including articles from The Times and Daily News, is fair use. In other words, OpenAI believes it is not obligated to license or otherwise pay for the examples it uses to develop models like GPT-4o, which “learn” from billions of examples of e-books, essays, and other types of text to produce human-sounding content, even if it generates revenue from those models.

Nevertheless, OpenAI has signed licensing agreements with a growing number of news publishers, including News Corp., the Associated Press, Financial Times, People parent company Dotdash Meredith, and Business Insider owner Axel Springer. Although OpenAI has declined to disclose the terms of these agreements, one content partner, Dotdash Meredith, is reportedly paid at least $16 million annually.

OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without authorization.
