dvilasuero.hf.co
Everything datasets and human feedback for AI at Hugging Face. Prev: co-founder and CEO of Argilla (acquired by Hugging Face)
56 posts 3,388 followers 573 following

🚀 The open source community is unstoppable: 4M total downloads for DeepSeek models on @hf.co, with 3.2M coming from the 600+ models created by the community. That's 30% more than yesterday!

💫 Generate RAG data with the Synthetic Data Generator to improve your RAG system! 1️⃣ Generate from your documents, dataset, or dataset description. 2️⃣ Configure it. 3️⃣ Generate the synthetic dataset. 4️⃣ Fine-tune the retrieval and reranking models. 5️⃣ Build a RAG pipeline.
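The generated records feed step 4: a minimal stdlib sketch of splitting them into retrieval training pairs, assuming a hypothetical (context, question, response) record layout, not the tool's actual schema:

```python
# Hypothetical sketch: turn synthetic (context, question, response) records
# into (query, positive_passage) pairs for fine-tuning a retriever.
# Field names are illustrative, not the Synthetic Data Generator's schema.

def to_retrieval_pairs(records):
    """Pair each question (query) with the context it was generated from."""
    return [(r["question"], r["context"]) for r in records]

records = [
    {
        "context": "Argilla is an open-source data annotation tool.",
        "question": "What is Argilla?",
        "response": "An open-source tool for annotating datasets.",
    },
]

pairs = to_retrieval_pairs(records)
```

The same records could be re-paired with hard negatives for the reranking model.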

New chapter in the Hugging Face NLP course! 🤗 🚀 We've added a new chapter covering the very basics of Argilla. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub. Any feedback for improvements is welcome!

🎉 50,000+ annotations reached! The FineWeb2-C community is helping build better language models one annotation at a time. 📊 Current stats: - 115 languages represented - 419 amazing contributors - 24 languages with complete datasets But we're not done yet! 🧵

High-quality data for fine-tuning language models, for free and at the click of a button! Prompt, then wait for your dataset to be pushed to Argilla or the Hub. Evaluate, review, and fine-tune a model. Blog:

Was 2024 the year of datasets? Is 2025 the year for community-built datasets? It's exciting to see the progress of many languages in FineWeb-C: - Total annotations submitted: 41,577 - Languages with annotations: 106 - Total contributors: 363

The finish line is near! We're building FineWeb-Edu for many languages and need your help 🤗 Many FineWeb-C languages are close to 1,000 annotations! Assamese is 99.4% done, French needs 64 more annotations, and Tamil needs 216. Please help us reach the goal: huggingface.co/spaces/data-...

💥 Ending 2024: a full data annotation journey on the Hugging Face Hub, from raw data to training-ready datasets! With Argilla 2.6.0, push your data to the Hub straight from the UI. Let's make 2025 the year anyone can build more transparent and accountable AI, no coding or model skills needed.

🚀 Argilla v2.6.0 is here! 🎉 Let me show you how EASY it is to export your annotated datasets from Argilla to the Hugging Face Hub. 🤩 Take a look at this quick demo 👇 💁‍♂️ More info about the release at github.com/argilla-io/a... #AI #MachineLearning #OpenSource #DataScience #HuggingFace #Argilla

🔥 We got great feedback on this: the Synthetic Data Generator, a no-code tool for creating datasets with LLMs, letting ANYONE build datasets and models in minutes, without writing any code. Blog: https://buff.ly/4gybyoT GitHub: https://buff.ly/49IDSmd Space: https://buff.ly/3Y1S99z

Well, around 10 percent of the initial goal is complete, and so far it's been quite a one-man-army effort. We're still on the hunt for more people to join and contribute to this open-source initiative. @hf.co data-is-better-together-fineweb-c.hf.space/share-your-p...

The sprint for crowd-sourced annotations with Argilla is in full swing over at data-is-better-together-fineweb-c.hf.space I've just contributed 100 examples to this dataset: data-is-better-together-fineweb-c.hf.space/share-your-p... Big thanks to @dvilasuero.hf.co, @nataliaelv.hf.co and team 🙌

I've been building a small library for working with prompt templates on the @huggingface.bsky.social Hub: `pip install prompt-templates`. Motivation: The community currently shares prompt templates in a wide variety of formats: in datasets, in model cards, as strings in .py files, as .txt/... 🧡
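For illustration, the core idea behind sharing templates as structured data (text plus named placeholders) can be sketched with the stdlib alone; this is NOT the `prompt-templates` library's API, just the concept it standardizes:

```python
# Stdlib-only sketch of a prompt template as data: text with named
# placeholders, filled in at call time. Illustrative, not the real API.
from string import Template

template = Template("Summarize the following text in $n_words words:\n$text")

# Fill the placeholders to produce a concrete prompt.
prompt = template.substitute(n_words=20, text="Open datasets drive open models.")
```

Storing templates in this form (rather than as strings buried in .py files) is what makes them shareable and versionable on the Hub.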

Desperate to contribute to the development of Scots language AI. I've just contributed 16 examples to this dataset: data-is-better-together-fineweb-c.hf.space/share-your-p...

I've just contributed 156 examples to the FineWeb 2 Spanish dataset: data-is-better-together-fineweb-c.hf.space/share-your-p... If you want to contribute, sign in with @hf.co and find your language

Help shape the future of multilingual Open Source AI! Join the FineWeb 2 Community Annotation Sprint to create an open training dataset with full transparency and human validation in many languages. Review datasets in your language and help identify the best sources for training.

✨ Argilla 2.5.0 is live, and it comes with webhook listener support to supercharge your workflows! 🚀 #AI #MachineLearning #Webhooks #TechUpdate

πŸ‘ Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation by the @hf.co community. This dataset contains 10K text-to-image preference pairs across image generation categories, using different model families and prompt complexities. Blog: huggingface.co/blog/image-p...

Open Image Preferences released! 🚀 - Open-source dataset for text2image - 10K samples manually evaluated by the HF community - Binarized format for SFT, DPO, or ORPO. It comes with a nice blog post explaining the steps to pre-process and generate the data, along with the results.
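"Binarized" here means each pairwise vote becomes a chosen/rejected record, the layout DPO-style trainers expect. A hedged sketch with illustrative field names (not the released dataset's exact schema):

```python
# Hypothetical sketch: binarize a pairwise image preference into the
# chosen/rejected record layout used by DPO/ORPO-style training.
# Field names are illustrative, not the dataset's actual schema.

def binarize(prompt, image_a, image_b, preferred):
    """Return a chosen/rejected record from a single pairwise vote."""
    chosen, rejected = (image_a, image_b) if preferred == "a" else (image_b, image_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = binarize("a cat in a spacesuit", "img_001.png", "img_002.png", "a")
```

For SFT, only the prompt and the chosen sample would be kept.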

Announcing Global-MMLU, an improved open MMLU dataset with evaluation coverage across 42 languages. The result of months of work with the goal of advancing multilingual LLM evaluation, built together with the community and amazing collaborators at Cohere4AI, MILA, MIT, and many more.

We're about to launch the biggest collaboration effort since Open Assistant. Let's get the highest-quality data for open foundation models, with all the nuances & diversity of each language, all with data provenance and transparency. Join us as a language lead: docs.google.com/forms/d/10XI...

Next week we're launching a collaborative annotation effort to build a big multilingual dataset, so you can have high-quality data in your language. We are really close to getting leads for 100 languages! Can you help us cover the remaining 200?

For anyone interested in fine-tuning or aligning LLMs, I'm running this free and open course called smol course. It's not a big deal, it's just smol. 🧵>>

🙌 I just wanted to share a few thoughts about the latest Argilla release, 2.5.0, as it's a pretty big one! Argilla now has full support for webhooks, which means you can do some pretty cool stuff, like model training on the fly as annotations are created. 🤯 #MachineLearning #NLP #DataLabeling
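Training on the fly means your endpoint reacts to annotation events as they arrive. A minimal stdlib sketch of such a handler, with a hypothetical event payload (the real webhook schema may differ):

```python
# Sketch of a webhook event handler that reacts to new annotations.
# The payload shape ("type", "data.record_id") is hypothetical, not
# Argilla's actual webhook schema.
import json

def handle_event(raw_body: bytes) -> str:
    event = json.loads(raw_body)
    if event.get("type") == "response.created":
        # e.g. queue the freshly annotated record for an incremental
        # training step instead of waiting for a full export.
        return f"queued record {event['data']['record_id']} for training"
    return "ignored"

body = json.dumps({"type": "response.created",
                   "data": {"record_id": "abc-123"}}).encode()
```

In practice this function would sit behind an HTTP endpoint that the webhook POSTs to.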

[SATURDAY THREAD] ☕️ 🧑‍🎓 In case you spent the week reading GDPR legislation and missed everything. It's all about vision language models and image preference datasets. >> 🧵 Here are the models and datasets you can use in your projects.

Recently, I added a feature to #Argilla to optimize plugin loading 🎉. It removes unnecessary code, improves readability, and lets future plugins load automatically. 🚀 Check out the PR 👇 and make your first contribution to our repo. github.com/argilla-io/a... #dev_experience #clean_code

🚀 We're excited to announce Argilla v2.5.0, which includes: Argilla webhooks, a new design for the datasets home page, and Python 3.13 and Pydantic v2 support. 📙 Read the full release notes here 👇 github.com/argilla-io/a...

A dataset of 1 million or 2 million Bluesky posts is completely irrelevant to training large language models. The primary use case for the datasets that people are losing their shit over isn't ChatGPT; it's social science research and developing systems that improve Bluesky.

The best path forward in AI requires technologists to be reflective and self-critical about how their work impacts society. Transparency helps this. I appreciate Bsky for flagging AI ethics & my colleague's response. Let's make informed consent a real thing. More later; recommended: bsky.app/profile/cfie...

I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.

The community has labelled over 3000 image preferences in a few hours. One open source image preferences dataset coming right up!

At @huggingface.bsky.social 🤗 we're preparing a collaborative annotation effort to build an open-source multilingual dataset. If you'd like to get high-quality open data for your language, check if yours is listed in this form and sign up! forms.gle/DHJdtvoSNxAA...

🎨 Want better open-source AI art models? We need your help! Most top image generators are trained on human preferences, but those datasets are closed. Let's build our own! Rate images in pairs and help make AI art accessible to everyone 🔓 👉 huggingface.co/blog/burtens...

First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋 📊 1M public posts from Bluesky's firehose API 🔍 Includes text, metadata, and language predictions 🔬 Perfect for experimenting with ML on Bluesky 🤗 huggingface.co/datasets/blu...

Did you know that in Argilla we're adding a new feature to export labeled datasets directly to the Hugging Face Hub? 🤔 We're leveraging the Hugging Face datasets library for seamless integration, including support for span labels. Stay tuned for the release! 🧠✨ #MachineLearning #NLP #DataLabeling
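Span labels are the tricky part of such an export: they are character offsets into the text, so any serialization has to preserve them exactly. An illustrative stdlib sketch of the structure involved (not Argilla's actual export format):

```python
# Sketch of span labels as (start, end, label) character offsets, the
# structure an exporter must preserve. Illustrative only, not the real
# Argilla/datasets serialization format.

text = "Argilla exports datasets to the Hugging Face Hub."
spans = [(0, 7, "TOOL"), (32, 48, "ORG")]  # (start, end, label)

# Recover each labeled surface string by slicing with the offsets.
labeled = [(label, text[start:end]) for start, end, label in spans]
```

If the text is normalized or re-encoded during export without adjusting the offsets, the spans silently point at the wrong characters.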

👀 Who said the Argilla tool was only for text? I am proud of my brilliant teammates for setting up this significant initiative 🤗 @benburtenshaw.bsky.social @davidberenstein.bsky.social @danielvanstrien.bsky.social @dvilasuero.hf.co