Wrote up some notes on Microsoft's new Phi-4 LLM. They trained it on a LOT of synthetic data, and the details of how and why they did that are really interesting.
https://simonwillison.net/2024/Dec/15/phi-4-technical-report/
Comments
In the short term, too, there is a deficit of deployed compute infrastructure for putting these models to work.
I wonder why Microsoft always undersells its models' parameter counts, and calls this a small model too. smh.
Synthetic data is now industry standard. Glad everyone finally caught up to 2018.
Microsoft AI Foundry: why synthetic data is beneficial
In organic datasets… relationship between tokens is complex & indirect… [while] that generated by a language model is 👉by definition predicted by the preceding tokens… easier for a model to follow the resulting reasoning patterns👈