dorialexander.bsky.social • 15 days ago

Announcing the release of Common Corpus 2. The largest fully open corpus for pretraining comes back better than ever: 2 trillion tokens with document-level licensing, provenance and language information. https://huggingface.co/datasets/PleIAs/common_corpus

Comments

philpax.me•15 days ago

terrific work as always

dorialexander.bsky.social•15 days ago

Common Corpus 2 is an in-kind contribution to CurrentAI, the new foundation for open source AI that just launched during the #AISummit. With support from the AI Alliance and institutional actors, we are contributing much-needed critical infrastructure for the emerging AI Commons.

dorialexander.bsky.social•15 days ago

We have taken the opportunity to enlarge and refine the existing content. The main addition is an entirely new multilingual dataset made in partnership with Wikidata and Wikimedia Deutschland: 110 billion knowledge items with structured data transcribed in 300 natural languages.

dorialexander.bsky.social•15 days ago

We've also finalized the integration of the largest open scientific collection under free licenses: 11 million articles, about 175 billion tokens, extracted from OpenAlex and extensively processed on Nebius using SOTA PDF parsing models for academic publications.

dorialexander.bsky.social•15 days ago

Two months ago we released the first ever models trained on Common Corpus, also the first ever models compliant with the AI Act. Due to the high share of PDFs in the training data, the models have shown strong performance on RAG and document processing tasks. https://huggingface.co/collections/PleIAs/common-models-674cd0667951ab7c4ef84cc4

dorialexander.bsky.social•15 days ago

I know multiple models are currently being trained with at least some part of Common Corpus. We aim to support all these diverse efforts by easing the curation of Common Corpus variants around criteria like licensing and language.

dorialexander.bsky.social•15 days ago

Much remains to be done. One of my main regrets to date is never having secured a relatively modest data grant to process the entirety of the corpus for OCR correction or reasoning filtering. We hope this will change in the near future: AI is not only about compute.
