Friends, for something to be open source, we need to see:
1. The data it was trained & evaluated on
2. The code
3. Model architecture
4. Model weights.
DeepSeek only gives 3, 4. And I'll see the day that anyone gives us #1 without being forced to do so, because all of them are stealing data.
Comments
for a more complete (and realistic) definition of Open Source AI:
https://opensource.org/ai/open-source-ai-definition
Open source depends on the license. Not this checklist.
i wanna know how much of the deepseek data was synthetic
We should also be careful not to assume any previous non A.I. collation or publication was with consent of everyone involved.
I just don't think we record information like that; we care about facts, not derivations.
OpenAI, thief in chief, robbed by another thief. It's funny.
There are many LLM projects that are open about training and evaluation data, such as AllenAI OLMo, several EU projects (EuroGPT, HPLT), and several Huggingface projects. I don't think anybody forced them to do so.
the thing is, you can get the architecture from the full codebase, but the other way around is more complicated (think of reverse engineering a program)
"If you only get 3, you're still missing 2, whereas if you get 2, you could infer 3 yourself"
I think the effort of the open source community is praiseworthy. But our world is based on profit. IMHO the open source community should be better organized, not only on the technical side but also in other respects, such as the economic one...
It’s open source if it uses an open source license.
It is certainly open enough to be very disruptive.
AI models cut up input and collage it into a Frankenstein's monster of stolen work.
That's the key difference.
If you trained someone to make a forgery, then when they forged something, it would be copyright infringement.
- Aren't being licensed
- Aren't crediting the original artists
- Are being stolen from them with the express intent to replace them and devalue their industry
Makes all of this even worse