I wrote more about this today, and I have a bunch of links to other blogs and papers with more evidence about why. It's messed up, but everything I was taught in ML 101 about test set reuse was wrong.
This is a bit of a tangent, but it's still a related and interesting perspective on the topic (and the authors seem to have read Ben): https://arxiv.org/abs/2407.12220
I appreciate your tackling this, because I generally find that ML users don't have a clear view of it. In fact, whenever I see a new ML book, the first thing I do is look at the coverage of the topic. On the other hand, I definitely feel your objections are overly critical. Overfitting DOES exist.
I won't attempt to give it a precise definition. Roughly speaking, I mean an increase in model complexity that increases expected prediction error. That notion is problematic, yes, e.g. due to the Double Descent phenomenon. And even without DD, there is no reason the expected loss should be a smooth U with no bumps.
Maybe the train and test sets come from the same distribution, but the model didn't learn the distribution; it learned a polynomial interpolation of the train set with zero loss. This will be terrible on the test set even though nothing about the data changed.
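To make that concrete, here is a minimal numpy sketch of the scenario (the sine target, noise level, and sample sizes are my own choices, not from the thread): an exact polynomial interpolant of the training points gets essentially zero training loss and a terrible test loss, even though train and test are drawn from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Train and test come from the same distribution: y = sin(2*pi*x) + noise
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(12)
x_test, y_test = sample(1000)

# A degree n-1 polynomial interpolates the 12 training points exactly,
# driving the training loss to (numerically) zero.
coefs = np.polyfit(x_train, y_train, deg=len(x_train) - 1)

train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}   test MSE: {test_mse:.2e}")
```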
Right, so it wouldn't happen if I had a holdout set that I evaluated on during training, and I stopped training when the loss stopped improving on the holdout set?
I guess I just wish there was a name for what happens when you make the mistake of not stopping there and keep fitting
Let me turn the question around: There is an *infinite* set of functions that interpolate data. Polynomials are merely one example. You need some way of picking amongst these. How do you do it without a holdout set?
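As one hedged illustration of the holdout recipe being discussed (a toy setup of my own, not Ben's): sweep the candidate degrees, and note that training error alone prefers the most complex fit, while a holdout set does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def mse(coefs, x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_train, y_train = sample(30)
x_val, y_val = sample(30)   # holdout set, never touched during fitting

# Training error is non-increasing in degree; the holdout error is not.
fits = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 16)}
by_train = min(fits, key=lambda d: mse(fits[d], x_train, y_train))
by_val = min(fits, key=lambda d: mse(fits[d], x_val, y_val))
print("degree picked by training error:", by_train)   # typically the most complex fit
print("degree picked by holdout error:", by_val)      # typically something small
```

The early-stopping version of the same logic is what the question above describes: keep fitting only while the holdout loss keeps improving.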
In your list of rules, even though 2) sounds extreme, it has some merit depending on the use case. If your test set is exactly the data you want to work on, it's fine to look at it a lot. But if it's only a small (i.i.d. and noisy) sample of the real world, it may be a good idea not to look at it too much.
I would agree that classical learning theory made a bunch of incorrect predictions about what should or shouldn't work, but I don't think this comes from the ambiguity of the idea of overfitting. A lot of the generalization failures you mention are distinguishable from overfitting.
Sure, if the test distribution systematically differs from the training distribution, a model might not generalize. But that's distinct from a model that does worse on other bootstrapped subsets of a dataset than on the one it trained on. You can intervene to resolve the narrow issue in ways that fail to address the broad one.
Also, it seems better to me for definitions of phenomena not to insist definitionally on their causal mechanisms. That way we don't have to assume we understand a phenomenon correctly in order to talk about it, which lets us try to improve our understanding without running into tautologies.
It'd still be good to have a unifying concept, e.g. the sensitivity of performance to distortions (including sampling noise) of the test distribution away from the training distribution. But then we can say: ah, overfitting is the observation that such sensitivity often increases with additional training.
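A rough sketch of that distinction, under my own toy assumptions (sine data, polynomial fits, not anything from the post): the fit degrades on other bootstrapped subsets of the same pool, which is the narrow "overfitting" failure, and degrades in a different way under a systematic shift of the input distribution, which is the broad one.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed data pool; overfitting shows up as a gap between the bootstrap
# subset a model was fit on and other bootstrap subsets of the same pool.
x_pool = rng.uniform(-1.0, 1.0, 200)
y_pool = np.sin(2 * np.pi * x_pool) + 0.1 * rng.normal(size=200)

idx = rng.integers(0, 200, 15)                     # the bootstrap subset we fit on
coefs = np.polyfit(x_pool[idx], y_pool[idx], deg=12)

def mse(x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)

others = [mse(x_pool[j], y_pool[j])
          for j in (rng.integers(0, 200, 15) for _ in range(500))]
print("fit subset MSE:", mse(x_pool[idx], y_pool[idx]))
print("other bootstrap subsets, mean MSE:", np.mean(others))

# A systematic distribution shift is a different failure mode: same model,
# inputs drawn from a range it never saw.
x_shift = rng.uniform(0.5, 1.5, 1000)
y_shift = np.sin(2 * np.pi * x_shift) + 0.1 * rng.normal(size=1000)
print("shifted-distribution MSE:", mse(x_shift, y_shift))
```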
Appreciate the post and I agree DL provided new evidence.
I just think overfitting assumes i.i.d. train/test, so I'm not sure cases like the ones described in this paragraph count (e.g., black swans).
I don't think that poor performance from distribution shift would be classified as "overfitting".
I like your definition, but I definitely see overfitting used in the context of out-of-sample.
For example, "a skin cancer data set has rulers in images of malignant tumors, and the ML 'overfits' to the rulers instead of the features of the tumor." In your opinion, is that a misuse of the word?
I can see how one could use it colloquially, but it's not an accurate use.
imo, "overfitting" is fitting the noise beyond the signal.
It's more of a variance thing.
In your example, though, the ruler is not so much noise as a competing signal, more akin to confounding bias than to variance.
I'm flattered, but now you made me draw DAGs, Ben.
On the left, you don't expect ɛ (y=f(X)+ɛ) to be consistent across data splits since it's random, and thus fitting it is bad.
On the right, you don't expect U (ruler) to appear on deployment, so a model using it instead of X (skin) will be wrong.
Those two sound similar in their result (both latch onto information that is not expected to reappear), but the underlying mechanism is different: one is driven by random noise and the other by a consistent bias.
I think "overfit" should refer to the former; the latter ("contextual overfit") is "biased".
I claim that overfitting is simply a symptom observed a posteriori, not necessarily something created by the person building the model, but ultimately a consequence of the assumptions about the data made *under* a model. In DL, models can be ridiculously over-parameterized and work well. 1/
What would you call the process of fitting the model to noise from individual subject variation rather than adequately representing the underlying data-generating process? To me that's overfitting, often due to p > n.
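A minimal sketch of the p > n case, with made-up numbers rather than anything from the comment: a minimum-norm least-squares fit interpolates the training subjects almost exactly, yet its test error sits far above the irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 100                                 # p > n: more predictors than subjects

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.0                                 # only 3 predictors carry signal
y = X @ beta + rng.normal(size=n)              # the rest of any fit is noise

w = np.linalg.pinv(X) @ y                      # minimum-norm least-squares fit

X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta + rng.normal(size=1000)

print("train MSE:", round(float(np.mean((X @ w - y) ** 2)), 4))            # ~0: interpolation
print("test MSE:", round(float(np.mean((X_test @ w - y_test) ** 2)), 2))   # well above the
print("noise floor (irreducible error):", 1.0)                             # variance of eps
```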
>When the future turns out not to be like the past, machine learning can’t work!
What exactly does it mean for the future to not be like the past?
What would you call that?
imo, "overfitting" is fitting the noise beyond the signal.
It's more of a variance thing.
While in your example the ruler is less of a noise and more of a competing signal, more akin to confounding bias than to variance.
On the left, you don't expect ɛ (y=f(X)+ɛ) to be consistent across data splits since it's random, and thus fitting it is bad.
On the right, you don't expect U (ruler) to appear on deployment, so a model using it instead of X (skin) will be wrong.
I think "overfit" should refer to the former; the latter ("contextual overfit") is "biased".