AI could ‘run out of text in the universe’ to train chatbots

Stuart Russell, a professor of computer science at the University of California, Berkeley, says OpenAI’s ChatGPT is among many chatbots built on large language models that could «run out of text» to train on.
Beata Zawrzel/NurPhoto via Getty Images

  • A Berkeley professor said at a United Nations summit that AI developers are «running out of text» to train chatbots.
  • He added that the strategy behind training large language models is «starting to hit a brick wall».
  • It is the latest concern about the data collection practices of OpenAI and other AI developers.

ChatGPT and other AI-powered bots may soon «run out of text in the universe» that trains them to know what to say, an artificial intelligence expert and professor at the University of California, Berkeley, said.

Stuart Russell says the technology that hoovers up mountains of text to train artificial intelligence bots like ChatGPT is «starting to hit a brick wall». In other words, there is only so much digital text for these bots to ingest, he told an interviewer from the International Telecommunication Union, a United Nations communications agency, last week.

This could affect how generative AI developers collect data and train their technology in the years to come, but Russell still thinks AI will replace humans in many jobs, which he described in the interview as «language in, language out».

Russell’s prediction adds to the growing attention paid in recent weeks to the data collection that OpenAI and other generative AI developers undertake to train large language models, or LLMs.

The data collection practices integral to ChatGPT and other chatbots are facing increasing scrutiny, including from creators concerned about their work being copied without their consent and from social media operators unhappy that the data on their platforms is being used freely. But Russell’s insights point to another potential weakness: a shortage of text to build these training datasets.

A study conducted last November by Epoch, a group of AI researchers, estimated that machine learning datasets would likely exhaust all «high-quality language data» before 2026. Language data in «high-quality» sets comes from sources such as «books, news articles, scientific papers, Wikipedia, and filtered web content,» according to the study.

The LLMs that power today’s most popular generative AI tools have been trained on large amounts of published text culled from publicly available online sources, including digital news sources and social networking sites. The «collection of data» from the latter is what led Elon Musk to limit the number of tweets that users can view daily, Musk has said.

In an email to Insider, Russell said that multiple reports, though unconfirmed, indicated that OpenAI, the company behind ChatGPT, had purchased text datasets from private sources. Russell added that while there could be other explanations for such a purchase, «the natural inference is that there is no longer enough high-quality public data.»

OpenAI did not immediately respond to a request for comment prior to publication.

Russell said in the interview that OpenAI, in particular, had to «supplement» its public language data with «private repositories» to create GPT-4, the company’s most powerful and advanced AI model to date. However, he acknowledged in an email to Insider that OpenAI has yet to detail the exact training dataset of GPT-4.

Several lawsuits against OpenAI over the past few weeks allege the company used datasets containing personal data and copyrighted material to train ChatGPT. Among the largest is a 157-page lawsuit filed by 16 unnamed plaintiffs, who allege OpenAI used sensitive data such as private conversations and medical records.

The latest legal challenge, brought by comedian Sarah Silverman and two other authors, accuses OpenAI of copyright infringement because of ChatGPT’s ability to write accurate summaries of their work. Two more authors, Mona Awad and Paul Tremblay, filed a lawsuit against OpenAI in late June with similar allegations.

OpenAI has not commented publicly on the string of lawsuits filed against it. Its CEO, Sam Altman, has also said little about the allegations, but has previously expressed a desire to avoid legal trouble.

At a technology conference in Abu Dhabi in June, Altman told the audience that he had no plans to take OpenAI public, citing the company’s unorthodox structure and decision-making as possible sources of conflict with investors.

«I really don’t want to be sued by a bunch of mass markets, Wall Street, whatever,» Altman said.
