A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.
OpenAI is embroiled in lawsuits brought by authors, programmers, and other rights holders who accuse the company of using their works (books, code, and so on) to develop its models without permission. OpenAI has long claimed a fair use defense, but the plaintiffs in these cases argue that U.S. copyright law contains no carve-out for training data.
The study, co-authored by researchers at the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models behind an API, such as OpenAI's.
Models are prediction engines. Trained on large amounts of data, they learn patterns; that's how they're able to generate essays, photos, and more. Most outputs aren't verbatim copies of the training data, but owing to the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from films they were trained on, while language models have been observed effectively plagiarizing news articles.
The study's method relies on words the co-authors call "high-surprisal," that is, words that stand out as uncommon in the context of a larger body of work. For example, the word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it is statistically less likely than words such as "engine" or "radio" to appear before "humming."
The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If the models managed to guess correctly, they likely memorized the snippet during training, the co-authors concluded.
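The probe described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual code: the unigram surprisal model, the toy corpus frequencies, and the stubbed `guess_fn` are all placeholders standing in for a real reference language model and an API call to the model under test.

```python
import math
from collections import Counter

def surprisal_scores(words, freq):
    """Surprisal of each word under a simple unigram model: -log2 P(word),
    with add-one smoothing. Rarer words get higher surprisal."""
    total = sum(freq.values())
    return [-math.log2((freq.get(w, 0) + 1) / (total + len(freq))) for w in words]

def mask_high_surprisal(passage, freq, k=2):
    """Mask the k highest-surprisal words; return the masked text and the answers."""
    words = passage.split()
    scores = surprisal_scores(words, freq)
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:k]
    answers = {i: words[i] for i in top}
    masked = " ".join("[MASK]" if i in top else w for i, w in enumerate(words))
    return masked, answers

def memorization_score(guess_fn, masked, answers):
    """Fraction of masked words the model fills in exactly.
    guess_fn stands in for a call to the model being probed (an assumption)."""
    guesses = guess_fn(masked, len(answers))
    hits = sum(1 for (i, w), g in zip(sorted(answers.items()), guesses) if g == w)
    return hits / len(answers)

# Toy background corpus standing in for real word-frequency statistics.
freq = Counter("the and i sat with jack perfectly still humming engine radio the the and".split())

masked, answers = mask_high_surprisal(
    "jack and i sat perfectly still with the radar humming", freq, k=2)
# "radar" never occurs in the toy corpus, so it is among the masked words.
```

A high `memorization_score` across many snippets from one book would suggest, as in the study, that the model saw that text during training.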
According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted ebooks called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.
Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, told TechCrunch that the findings shed light on the "contentious data" models may have been trained on.
"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."
OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has struck certain content licensing deals and offers opt-out mechanisms that let copyright owners flag content they'd prefer the company not use for training purposes, it has also lobbied governments to codify "fair use" rules around AI training.