Generative AI Models Facing the Risk of Exhausting Data to Train Their Tools: Stuart Russell

He claims that the technology, which relies on massive amounts of text to train these bots, is reaching its limits.

By Anshul Panda
July 22, 2023 14:28 +08

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, says ChatGPT and other AI-powered bots may be facing a big hurdle. He claims that the technology, which relies on massive amounts of text to train these bots, is reaching its limits. In a recent interview with the International Telecommunication Union, a UN communications agency, Russell said there is a finite quantity of digital text available for these AI systems to learn from.

This projected paucity of training material may have ramifications for future generative AI developers that rely on large datasets to train their technology. Nonetheless, Russell believes that AI will continue to displace humans in a variety of vocations, particularly those involving language processing, a concept he calls "language in, language out." These forecasts add to the growing scrutiny of OpenAI and other generative AI companies' data collection practices as they train large language models, or LLMs.

ChatGPT and other chatbots' data harvesting practices have come under growing scrutiny. Concerns have been raised by creatives worried about their work being copied without permission, as well as by social media executives worried about the unregulated use of their platforms' data. Russell's remarks, on the other hand, highlight another potential vulnerability: a lack of high-quality linguistic data to train these models.

OpenAI has been facing multiple lawsuits in recent weeks, with allegations suggesting that the company utilized datasets containing personal information and copyrighted materials to train ChatGPT. One significant lawsuit, spanning 157 pages and filed by 16 unnamed plaintiffs, contends that OpenAI accessed sensitive data, including private conversations and medical records, for training the AI model.

Separately, lawyers for comedian Sarah Silverman and two other authors have accused OpenAI of copyright infringement. They base the claim on ChatGPT's ability to provide accurate summaries of the authors' works, which they say were used without proper authorisation. In late June, authors Mona Awad and Paul Tremblay also filed a lawsuit against OpenAI, raising similar concerns about the unauthorised use of copyrighted materials in ChatGPT training.

In November of last year, Epoch, a group of AI researchers, estimated that machine learning datasets could run out of "high-quality language data" as early as 2026. According to the study, such "high-quality" linguistic data often comes from sources such as books, news stories, scientific papers, Wikipedia, and curated web material.