I wrote more about this today, and I have a bunch of links to other blogs and papers with more evidence about why. It's messed up, but everything I was taught in ML 101 about test set reuse was wrong.
This is a bit of a tangent, but it's still a related and interesting perspective on the topic (and the authors seem to have read Ben): https://arxiv.org/abs/2407.12220
I appreciate your tackling this, because I generally find that ML users don't have a clear view of it. In fact, whenever I see a new ML book, the first thing I do is look at the coverage of the topic. On the other hand, I definitely feel your objections are overly critical. Overfitting DOES exist.
I won't attempt to give it a precise definition. Roughly speaking, I mean an increase in model complexity that increases expected prediction error. That notion is problematic, yes, e.g. due to the Double Descent phenomenon. And even without DD, there is no reason the expected loss should be a smooth U with no bumps.
Maybe the train and test sets come from the same distribution, but the model didn't learn the distribution; it learned a polynomial interpolation of the train set with zero loss. This will be terrible on the test set even though nothing about the data changed.
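To make that concrete, here is a minimal numpy sketch of the scenario (the sine target, noise level, and sample sizes are my own choices, not from the thread): an exact polynomial interpolant of the training points gets essentially zero training loss and a terrible test loss, even though train and test are drawn from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Train and test come from the same distribution: y = sin(2*pi*x) + noise
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(12)
x_test, y_test = sample(1000)

# A degree n-1 polynomial interpolates the 12 training points exactly,
# driving the training loss to (numerically) zero.
coefs = np.polyfit(x_train, y_train, deg=len(x_train) - 1)

train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.2e}   test MSE: {test_mse:.2e}")
```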
Right, so it wouldn't happen if I had a holdout set that I evaluated on during training, and I stopped training when the loss stopped improving on the holdout set?
I guess I just wish there was a name for what happens when you make the mistake of not stopping there and keep fitting
Let me turn the question around: There is an *infinite* set of functions that interpolate data. Polynomials are merely one example. You need some way of picking amongst these. How do you do it without a holdout set?
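As one hedged illustration of the holdout recipe being discussed (a toy setup of my own, not Ben's): sweep the candidate degrees, and note that training error alone prefers the most complex fit, while a holdout set does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def mse(coefs, x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)

x_train, y_train = sample(30)
x_val, y_val = sample(30)   # holdout set, never touched during fitting

# Training error is non-increasing in degree; the holdout error is not.
fits = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 16)}
by_train = min(fits, key=lambda d: mse(fits[d], x_train, y_train))
by_val = min(fits, key=lambda d: mse(fits[d], x_val, y_val))
print("degree picked by training error:", by_train)   # typically the most complex fit
print("degree picked by holdout error:", by_val)      # typically something small
```

The early-stopping version of the same logic is what the question above describes: keep fitting only while the holdout loss keeps improving.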
In your list of rules, even though 2) sounds extreme, it has some merit depending on the use case. If your test set is exactly the data you want to work on, it's fine to look at it a lot. But if it's only a small (i.i.d. and noisy) sample of the real world, it may be a good idea not to look at it too much.
I would agree that classical learning theory made a bunch of incorrect predictions about what should or shouldn't work, but I don't think this comes from the ambiguity of the idea of overfitting. A lot of the generalization failures you mention are distinguishable from overfitting.
Sure, if the test distribution systematically differs from the training distribution, a model might not generalize. But that's distinct from a model that does worse on other bootstrapped subsets of a dataset than on the one it trained on. You can intervene to resolve the narrow issue in ways that fail to address the broad one.
Also, it seems better to me for definitions of phenomena not to insist definitionally on their causal mechanisms. That way we don't have to assume we understand a phenomenon correctly in order to talk about it, which lets us try to improve our understanding without running into tautologies.
It'd still be good to have a unifying concept, e.g. the sensitivity of performance to distortions (including sampling noise) of the test distribution away from the training distribution. But then we can say: ah, overfitting is the observation that such sensitivity often increases with additional training.
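A rough sketch of that distinction, under my own toy assumptions (sine data, polynomial fits, not anything from the post): the fit degrades on other bootstrapped subsets of the same pool, which is the narrow "overfitting" failure, and degrades in a different way under a systematic shift of the input distribution, which is the broad one.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed data pool; overfitting shows up as a gap between the bootstrap
# subset a model was fit on and other bootstrap subsets of the same pool.
x_pool = rng.uniform(-1.0, 1.0, 200)
y_pool = np.sin(2 * np.pi * x_pool) + 0.1 * rng.normal(size=200)

idx = rng.integers(0, 200, 15)                     # the bootstrap subset we fit on
coefs = np.polyfit(x_pool[idx], y_pool[idx], deg=12)

def mse(x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)

others = [mse(x_pool[j], y_pool[j])
          for j in (rng.integers(0, 200, 15) for _ in range(500))]
print("fit subset MSE:", mse(x_pool[idx], y_pool[idx]))
print("other bootstrap subsets, mean MSE:", np.mean(others))

# A systematic distribution shift is a different failure mode: same model,
# inputs drawn from a range it never saw.
x_shift = rng.uniform(0.5, 1.5, 1000)
y_shift = np.sin(2 * np.pi * x_shift) + 0.1 * rng.normal(size=1000)
print("shifted-distribution MSE:", mse(x_shift, y_shift))
```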
Appreciate the post and I agree DL provided new evidence.
I just think overfitting assumes i.i.d. train/test, so I'm not sure cases like the ones described in this paragraph count (e.g., black swans).
I don't think that poor performance from distribution shift would be classified as "overfitting".
I like your definition, but I definitely see overfitting used in the context of out-of-sample.
For example, "a skin cancer data set has rulers in images of malignant tumors, and the ML 'overfits' to the rulers instead of the features of the tumor." In your opinion, is that a misuse of the word?
I can see how one could use it colloquially, but it's not an accurate use.
imo, "overfitting" is fitting the noise beyond the signal.
It's more of a variance thing.
In your example, though, the ruler is not so much noise as a competing signal, more akin to confounding bias than to variance.
I'm flattered, but now you made me draw DAGs, Ben.
On the left, you don't expect ɛ (y=f(X)+ɛ) to be consistent across data splits since it's random, and thus fitting it is bad.
On the right, you don't expect U (ruler) to appear on deployment, so a model using it instead of X (skin) will be wrong.
Those two sound similar in their result (both latch onto information that is not expected to reappear), but the underlying mechanism is different: one is driven by random noise and the other by a consistent bias.
I think "overfit" should refer to the former; the latter ("contextual overfit") is "biased".
I claim that overfitting is simply a symptom observed a posteriori, not necessarily something created by the person building the model, but ultimately a consequence of the assumptions about the data made *under* a model. In DL, models can be ridiculously over-parameterized and work well. 1/
What would you call the process of fitting the model to noise from individual subject variation rather than adequately representing the underlying data-generating process? To me that's overfitting, often due to p > n.
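A minimal sketch of the p > n case, with made-up numbers rather than anything from the comment: a minimum-norm least-squares fit interpolates the training subjects almost exactly, yet its test error sits far above the irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 100                                 # p > n: more predictors than subjects

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.0                                 # only 3 predictors carry signal
y = X @ beta + rng.normal(size=n)              # the rest of any fit is noise

w = np.linalg.pinv(X) @ y                      # minimum-norm least-squares fit

X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta + rng.normal(size=1000)

print("train MSE:", round(float(np.mean((X @ w - y) ** 2)), 4))            # ~0: interpolation
print("test MSE:", round(float(np.mean((X_test @ w - y_test) ** 2)), 2))   # well above the
print("noise floor (irreducible error):", 1.0)                             # variance of eps
```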
>When the future turns out not to be like the past, machine learning can’t work!
What exactly does it mean for the future to not be like the past?
What would you call that?
imo, "overfitting" is fitting the noise beyond the signal.
It's more of a variance thing.
While in your example the ruler is less of a noise and more of a competing signal, more akin to confounding bias than to variance.
On the left, you don't expect ɛ (y=f(X)+ɛ) to be consistent across data splits since it's random, and thus fitting it is bad.
On the right, you don't expect U (ruler) to appear on deployment, so a model using it instead of X (skin) will be wrong.
I think "overfit" should refer to the former; the latter ("contextual overfit") is "biased".