Welcome to The Parliament

Your source for the latest African news and insights

Software company OpenAI uses publicly available data to train its AI chatbot ChatGPT. This includes, for example, books and articles that can be found on the internet. But the producers of that material now want to be paid for it.

This type of data, commonly referred to as training data, is an essential component in the development of generative artificial intelligence, that is, AI software that generates answers in response to questions.

However, finding useful data is becoming increasingly difficult, which is why AI makers such as Google, Meta, OpenAI, Anthropic and Microsoft are looking for new sources. Meta reportedly went as far as considering the purchase of Simon & Schuster, one of the largest publishers in the world.

The problem is that a growing number of publishers are accusing AI creators of unlawfully using copyrighted data. Publishers believe they should be compensated for this.

Meta and OpenAI counter this claim by invoking the "fair use" doctrine in U.S. copyright law, which they argue permits the use of copyrighted material for training purposes. But it remains to be seen whether this argument will hold up in court.

OpenAI and Microsoft in the spotlight
The U.S.-based Center for Investigative Reporting (CIR), a nonprofit journalism publisher, sued OpenAI and Microsoft last week. The CIR accuses the two tech companies of unlawfully using "copyrighted works owned by creators around the world, including CIR."

"OpenAI and Microsoft are using our stories to make their products more powerful, but they have never asked for permission or offered compensation to do so. Something that other organizations that use our materials do in the form of a license," said CIR CEO Monika Bauerlein at the presentation of the complaint. "This behavior is not only unfair, it's a copyright infringement."

In another complaint, brought by the Authors Guild, two authors claim that OpenAI used information from their books to train ChatGPT. In December 2023, The New York Times sued OpenAI for a similar reason.

Last May, documents from the Authors Guild lawsuit revealed that OpenAI had deleted two giant datasets that had been used to train GPT-3. According to the Guild's lawyers, the datasets contained more than 100,000 books. The two employees who collected this data no longer work for the tech company, according to the same documents.

For some time, OpenAI has been making deals with news organizations, among others, so that it can use their content legally. For example, the creator of ChatGPT has signed agreements with The Associated Press, the publisher of The Wall Street Journal and the New York Post, The Atlantic, Prisa Media, Le Monde, the Financial Times, and Business Insider's parent company Axel Springer.

But that doesn't seem to solve the underlying need for training data. For generative AI to keep working and to improve, it needs to keep learning. A handful of licensing agreements is a drop in the ocean.
