Autocomplete needs more than autocompleted inputs.
What would be really surprising is if it didn't, like it had accidentally stumbled on the name of god and kept feeding that back to itself to calculate fragments of reality.
But, nope, just autocomplete needing good data. And no magic sky monkey.
Please, people, stop sharing papers on topics you're unfamiliar with. Synthetic data is not only "a thing", but it's incredibly important. Synthetic data is *how you learn*. That's what "mulling over" new information is - lack thereof was a historic *weakness* in LLMs.
People, use basic logic here. For example: AI image generators have been widely used over the past couple of years. Their images are all over the net. Now compare the output quality of modern AI image generators to old ones. It's night and day, *for the better*. *Way* better than old ones.
* Creators are a selective filter
* Websites are a selective filter
* Dataset curators are a selective filter
* Even automated tools, like aesthetic gradient raters, are selective filters.
Ultimately, if something looks good, *it doesn't matter* who or what made it.
You always want new sources of information, of course. Lock someone in Plato's Cave for long enough and they'll forget what the world outside looks like apart from shadows. But even the very act of selective rating is itself the addition of new information to the system.
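To make that concrete, here's a minimal sketch of a selective filter, assuming a toy quality scorer (`rater` below is a hypothetical stand-in for something like an aesthetic model): overgenerate candidates, keep only the top-scoring slice. The rejections are exactly where the new information comes in.

```python
import random

def rater(sample: str) -> float:
    """Hypothetical stand-in for a quality/aesthetic model.
    Here: a toy heuristic rewarding lexical variety."""
    words = sample.split()
    return len(set(words)) / max(len(words), 1)

def curate(candidates: list[str], keep: float = 0.05) -> list[str]:
    """Keep only the top-scoring fraction of generated candidates.
    The rejected 95% is where new information enters the system:
    the surviving set reflects the rater's preferences, not just
    the generator's raw output distribution."""
    ranked = sorted(candidates, key=rater, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

# Usage: overgenerate, then filter hard.
random.seed(0)
candidates = [" ".join(random.choices("a b c d e f g h".split(), k=12))
              for _ in range(1000)]
print(len(curate(candidates)))  # 50 survivors out of 1000
```

Swap the heuristic for a learned rater and this is, in miniature, the curation loop that keeps synthetic training data from degrading.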
I use synthetic data extensively in LLMs (for diffusion models, see e.g. Howard Arson - https://bsky.app/profile/theophite.bsky.social/post/3ky3flcn4vq2y). Let's say I want to make a "needle in a haystack" model that finds text on a specific exact topic, and I want a big training dataset of such things. How do you go about it?
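A sketch of one plausible recipe (not necessarily the exact pipeline behind that work; `generate` and the example topic are hypothetical stand-ins): synthesize on-topic needles with a generator model, bury each one in off-topic filler at a random position, and keep the position as a free label.

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (local model, API, etc.)."""
    raise NotImplementedError("wire up your generator here")

TOPIC = "19th-century railway signalling"  # placeholder topic

def make_example(filler_paragraphs: list[str]) -> dict:
    """Build one labeled needle-in-a-haystack training example."""
    # Synthesize the needle: a paragraph squarely on the target topic.
    needle = generate(f"Write one factual paragraph about {TOPIC}.")
    # Bury it at a random position among off-topic filler.
    pos = random.randrange(len(filler_paragraphs) + 1)
    doc = filler_paragraphs[:pos] + [needle] + filler_paragraphs[pos:]
    # The label comes for free: we know exactly where the needle is.
    return {"text": "\n\n".join(doc), "needle_index": pos, "topic": TOPIC}
```

Loop that over topics and filler documents for scale, and filter out the botched needles with the same kind of rater described above.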
I was watching years-old Ann Reardon/How To Cook That videos last night, and in one of them she talked about content farms stealing the work of smaller YouTubers, and how that drives actual creatives offsite, so the farms all just start plagiarising each other and you get nothing new or interesting.
In-bred "AI" vapourware: it's just as crap as the data.
Myth buster: "AI" = a bit of smart code, fast processing and data, that's it, there's no magic!
As I keep pointing out, humans "train" on pre-existing art, music, literature, etc., and their output is reliably used to train the next generation, and so on, and this leads, over time, to vastly increased diversity, complexity, and quality.
1/?
2/2 If AI actually was *intelligent*, self-training would rapidly produce creative works beyond human capability, compressing millennia of artistic progress into months. But it's NOT, and there's no reason to think it will be using current paradigms.
I experienced a very small version of this first hand about 20 years ago (details are cumbersome to explain). It makes complete sense to me that people didn't grok this in advance, even tho it now looks like a pretty obvious pitfall.
The study looks at the results of "indiscriminately learning from data produced by other models." Of course there's no surprise. Novelty requires selection. Indiscriminate learning will result in collapse. Filtered learning will not. I mean, maybe it will. But then evolution is wrong.
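A toy illustration of that distinction, under the deliberately crude assumption that "training" means fitting a Gaussian and "generating" means sampling from the fit: the indiscriminate loop compounds its own estimation error each generation, while retaining even a small slice of original data (a blunt stand-in for selection) anchors it near the source.

```python
import random
import statistics

def fit_and_sample(data: list[float], n: int) -> list[float]:
    """'Train' (fit a Gaussian by mean/std) then 'generate' (sample the fit)."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
real = [random.gauss(0.0, 1.0) for _ in range(50)]

indiscriminate, anchored = real, real
for _ in range(50):
    # Indiscriminate: each generation trains purely on the previous output.
    indiscriminate = fit_and_sample(indiscriminate, 50)
    # Anchored: keep 20% original data in every generation's training mix.
    anchored = fit_and_sample(anchored[:40] + real[:10], 50)

# The indiscriminate estimate random-walks with a downward bias in
# variance; the anchored run stays pinned near the source distribution.
print(f"indiscriminate std: {statistics.stdev(indiscriminate):.2f}")
print(f"anchored std:       {statistics.stdev(anchored):.2f}")
```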
Except in this context:
"Hey remember those AI things from a few years back?"
"Yeah they were the worst"
"Well someone is trying it again"
"I'll get my pitchfork"
How could no one predict this sort of outcome after seeing what happened to people on FB becoming dumber with every post, or the QAnon faithful becoming dumber with every new drop?
Comments
Synthetic data is a growing portion of model training. Read the paper on the (spectacularly well-performing) LLaMA 3.1 as an example.
Seems like cog sci is pre-paradigmatic, last I checked.
"Hey remember those AI things from a few years back?"
"Yeah they were the worst"
"Well someone is trying it again"
"I'll get my pitchfork"
How could no on predict this sort of outcome after seeing what happened to people on FB becoming dumber with every post or the Qanon faithful becoming dumber with every new drop?