Announcing the release of Common Corpus 2. The largest fully open corpus for pretraining is back and better than ever: 2 trillion tokens with document-level licensing, provenance, and language information. https://huggingface.co/datasets/PleIAs/common_corpus
