AI companies, running out of conventional training datasets from the web, may be forced to shift from big, all-purpose LLMs to smaller, more specialized models

why human-sourced data can help prevent AI model collapse
Matthias Bastian / The Decoder : OpenAI co-founder says AI is reaching “peak data” as it hits the limits of the internet
Kylie Robison / The Verge : During his NeurIPS talk, Ilya Sutskever says “Pre-training as we know it will end”, as “we've achieved peak data and there'll be no more”

Bluesky:
James McDermott / @jmmcd : I refuse to read the article but I wonder what scenario people like this imagine, when they hear we're running out of data. Do they think it's used up, or deleted?
Nicolai B. Hansen / @nbhansen.dk : Honestly much more interested in small specialized models than big all purpose models. They so far have not really been that useful.
Casey Newton / @caseynewton : Seems to me that one answer to “we have run out of data to steal” could be to pay people to make stuff and use it with their consent
@photogenealogy : Using AI generated data to train AI! 😳 That can't be good www.nature.com/articles/d41...
Scott McGrath / @smcgrath.phd : 🧪 AI faces a significant data bottleneck: by 2028, training data may equal the total stock of public online text. Solutions like synthetic data, smaller models, and specialized datasets are key to future advances.🩺💻 #MLSky
Katherine Stiles / @katherinestiles.org : The Internet is a vast ocean of human knowledge, but it isn't infinite. And artificial intelligence (AI) researchers have nearly sucked it dry.
Christian Frezza / @frezzalab : The sad realistion that someone will have to do experiments at some point....damn it “The AI revolution is running out of data. What can researchers do?” www.nature.com/articles/d41...

X:
Eric Topol / @erictopol : What happens when LLMs run out of data to ingest? https://www.nature.com/... @nature feature by @nicolakimjones
@bermaninstitute : The AI revolution is running out of data. What can researchers do? AI developers are rapidly picking the Internet clean to train large language models such as those behind ChatGPT. Here's how they are trying to get around the problem. https://www.nature.com/...
Steven Ashley / @steveashleyplus : At current growth rates, the AI industry runs out of readily accessed HQ data in four yrs... Workarounds incl using synthetic data and less-easily-accessed data. (N) https://www.nature.com/...
Nicola Jones / @nicolakimjones : “Compute is growing but the data is not growing... data is the fossil fuel of AI” - Ilya Sutskever #NeurIPS2024 Read my story in Nature about the data shortage: https://www.nature.com/...

Nature 2024-12-15

