Meta considered licensing books to train AI—but opted instead to pirate LibGen, a database that currently contains more than 7.5 million books and 81 million research papers, Alex Reisner writes.
Comments
Can anyone explain how this might fall under the "fair use" exception to the general requirement that users of copyrighted works obtain consent from the original authors or creators?
I hope everyone reading this article knows that The Atlantic partners with OpenAI, which has been *aggressively* lobbying the US government to classify AI training on copyrighted data as "fair use." I commented and asked The Atlantic to clarify, but they didn't bother to reply. Best wishes to their partnership!!👍
To be precise, when an ordinary citizen does it, it's felony copyright infringement: you can go to prison for up to five years and pay up to $250,000. When a Facebook employee does it, it's a civil action only.
AI is not functional or viable without stealing from others. Remember the lawsuits against Google by the news outlets that weren't getting paid by a billion-dollar company. This time they are stealing from every contributor and writer.
I have to ask, does the reporting value of "naming" libgen justify the number of people who might be encouraged to use it themselves after reading the article?
I don't subscribe to the idea that training AI on copyrighted works is "theft," but on STOLEN works, yeah, that's theft.
To me, piracy and copyright infringements are not the main problem.
My concern is what AI does with the content. The key word here is decontextualisation.
Every work has a historical context reflected in the personal expression of the author. Works are unique and should be interpreted in the context of their creation. From a democratic perspective, the question arises as to how the AI user should interact with authors who are no longer recognisable.
1. I don't agree with Big Tech stealing content from pirate platforms to train gen AI (this appears to be the new norm in the U.S. - do what you want, don't ask, deal with the consequences later). However, it's important to understand their arguments in order to fight back.
2. Big Tech says its use of content from book/publication pirate platforms constitutes "fair use" (an illogical argument because pirate platforms have already faced lawsuits); Big Tech's argument is what's before the U.S. courts.
3. In OpenAI's case, they are pushing this argument hard, saying the U.S. will fall behind China in the "success of democratic AI" if they are not allowed to continue taking content without permission or, it seems, compensation to authors/creators. https://futurism.com/openai-over-copyrighted-work
Craig, did you ever read anything written by someone else?
How did you compensate them for "training" yourself on their work?
(To be clear, I agree that acquiring content illegally, e.g., via libgen, is theft. I do NOT agree that training AI on legitimately-acquired works is theft.)
The industrial, automated scale of backhoes has essentially put human ditch-diggers out of work. Is that fair? If LLMs do the same for writers, why is that different? The fact that some people don't like progress because it threatens their jobs doesn't mean that progress is wrong.
Backhoes are better at digging ditches than humans. Systems regurgitating content based on algorithms and presenting it as fact, without any actual intelligence and fact checking being involved, aren't better than humans. Backhoe companies aren't making money off the back of ditches dug by humans.
If the algorithms aren't better, then you have nothing to worry about. Their inferior product will not make any money for anyone. And if they make no money, it obviously won't be "off the backs" of anyone.
Incidentally, do you believe human fact-checking is more reliable? It isn't.
The problem with your argument is that 99.9% of humans can't spit back an entire literary work word-for-word. A computer can. Training AI and human learning are apples and oranges and comparing the two isn't any defense.
And your problem is that Generative AI DOESN'T DO THAT!
Try it. I did. I took a NYT article from a few months ago. I converted its title + subtitle, VERBATIM, into a ChatGPT Query. I have no doubt that ChatGPT "read" the NYT article, yet its output was not even close to plagiarism.
I challenge you to use any AI system to produce anything that would qualify as plagiarism. These systems draw from numerous sources, just as a human author would, but they produce unique text, just as a human author would.
When you write something, where does the underlying knowledge come from? Do you make it up? Hopefully not. Most likely a great deal of it comes from READING the works of others. Maybe some comes from direct observation, but when you PUBLISH that information, it then becomes PUBLIC - something any person or AI can read and learn from.
And to repeat myself AGAIN, I AGREE that using pirated work is theft, whether it's for a person or a computer.
Anyone training an AI system should have to legitimately acquire one copy of any work used for the training. But I break with the zealots who say it's STILL theft even then.
The backhoe argument misses the point. If your backhoe needs to use my property to operate, it either has to pay me for the use, or it doesn’t get to operate.
If I use a creator's work to make something new, particularly for commercial purposes, I must license (pay for) and credit that work. I expect these companies to do the same. If I failed to do that, I would be stealing.
It is simply not the same as an individual reading a book and remembering whatever bits they remember. Can you read a million books in seconds and retain everything?
The fact that kids were getting sued for sharing a few songs while Meta has yet to face any consequences shows that copyright law in the U.S. is a joke.
More concerned that the Dogebags have taken every bit of data from every individual, business, government entity, education, health, science, justice…for Musk’s private AI company.
This article is from 2013. A mere decade ago, this type of rampant piracy of intellectual property would have meant years in jail. Now it's a tech bro's right to steal and profit while stepping on necks. And the DOJ sends protection to Silicon Valley. What a world. https://nymag.com/intelligencer/2013/01/jstor-hacker-aaron-swartz-commits-suicide.html
It's never been a surprise that someone, somewhere scanned every book ever and put it online for others to collate in a database.
F12 ✊
fuck meta
https://www.reuters.com/technology/artificial-intelligence/french-publishers-authors-file-lawsuit-against-meta-ai-case-2025-03-12/
"…but I don’t mean that in a bad way…"
Dom Irrera might agree.
AI companies are going on about whether training is “fair use”
But they’re silent on the fact that they acquired the books illegally.
The latter makes it an open-and-shut case. If you steal, nothing you do with stolen property is "fair use."
“Soylent green IS people!”
I always do, unless it's a loan from a library or a friend.
I agree that ACQUIRING a work improperly is theft. But if the work is acquired properly, training an AI system on it is not.
so you also *usually* do it, because sometimes you borrow it (or, I would imagine, receive it as a gift), just like me 😅
the whole point of the article here is that genAI is training itself on tons of pirated work, so, theft
Someone?
https://nymag.com/intelligencer/2013/01/jstor-hacker-aaron-swartz-commits-suicide.html