dingdingpeng.the100.ci • 28 days ago

Another q for the stats people!
People worry about collinearity (cf blog post below).

Consider a scenario in which the collinear predictors are just controls to account for confounding.
Including both of them doesn't impair the precision with which the effect of interest is estimated, does it?

Comments

aecoppock.bsky.social•28 days ago

had some fun simulating this one in DeclareDesign. outcome model is Y ~ 0.5 * D + 0.5*X_1 + 0.5*X_2 + U with D confounded by X_1 and X_2. gotta control for both or BIAS. Simulation agrees with you, no precision loss (on ATE estimate) in this setup when X_1 and X_2 are v. correlated.
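A rough Python analogue of this kind of simulation (not the DeclareDesign code itself, which is not shown in the thread). The outcome model follows the post; the treatment equation, sample size, number of replications and seed are assumptions made for illustration. With D built this way, the residual variance of D given the controls does not depend on rho, so the spread of the estimate should be about the same at every rho.

```python
import numpy as np

def se_of_treatment(rho, n=500, reps=300, seed=0):
    """Mean and empirical SD of the OLS estimate of D's coefficient across datasets."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        # Assumed treatment equation: D is confounded by both controls
        D = 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
        # Outcome model from the post: Y = 0.5*D + 0.5*X_1 + 0.5*X_2 + U
        Y = 0.5 * D + 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), D, X])
        beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
        estimates[r] = beta[1]
    return estimates.mean(), estimates.std(ddof=1)

results = {rho: se_of_treatment(rho) for rho in (0.0, 0.9, 0.99)}
for rho, (mean, sd) in results.items():
    print(f"rho={rho}: mean estimate={mean:.3f}, empirical SE={sd:.4f}")
```

Other ways of generating D (for example, holding its total variance fixed) can make the SE depend on rho, so this is one setup rather than a general result.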

rmkubinec.bsky.social•28 days ago

What would happen to power if X_1 and X_2 are *inversely correlated*. Would that increase power vs. the situation where X_1 and X_2 are independent? (or maybe it just doesn't matter)

aecoppock.bsky.social•28 days ago

whoa! Some mild dependence of the SE of the "both" estimator over the full range of rho...

rmkubinec.bsky.social•28 days ago

no one ever looks at inverse correlation...

vincentab.bsky.social•28 days ago

I like this a lot.

matloff.bsky.social•28 days ago

It does. They don't call VIF the Variance Inflation Factor for nothing. :-) Intuition: Multicollinearity makes X'X "nearly noninvertible," analogous to nearly 0 for scalars. 1/(nearly 0) is "large," and so is (X'X)^{-1}. Devise a little experiment to see this.
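One possible version of such a little experiment, sketched in Python. It relies on the standard identity that, for standardized predictors, the VIFs are the diagonal of the inverse correlation matrix. The correlation values are assumptions chosen for illustration: x1 and x2 are nearly collinear and x3 is uncorrelated with both.

```python
import numpy as np

def vifs(corr):
    """VIFs are the diagonal elements of the inverse predictor correlation matrix."""
    return np.diag(np.linalg.inv(corr))

# Assumed correlations: Cor(x1, x2) = 0.99, x3 uncorrelated with both
corr = np.array([
    [1.00, 0.99, 0.00],
    [0.99, 1.00, 0.00],
    [0.00, 0.00, 1.00],
])

condition_number = np.linalg.cond(corr)  # large: the matrix is "nearly noninvertible"
vif = vifs(corr)                         # but the inflation is specific to each predictor
print("condition number:", condition_number)
print("VIFs:", vif)
```

In this configuration the matrix is poorly conditioned, yet only the first two diagonal elements of the inverse are large.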

dingdingpeng.the100.ci•28 days ago

Right, I do know the VIF! But it was my understanding that this is *about the coefficient of the variable that is explained by the others*. If x1 and x2 are highly correlated, but the effect of interest is of x3, does the corr between x1 and x2 affect x3’s variance?>

matloff.bsky.social•28 days ago

Yes, understood. Will devise an example when I get to a computer.

stephenjwild.bsky.social•28 days ago

Adding myself here to see the answer to this

jounihelske.bsky.social•28 days ago

Yes it does, unless x3 is independent of all other covariates, in which case you don't need to control in the first place .

dingdingpeng.the100.ci•28 days ago

Right, good point! But is the overlap between x1 and x2 relevant; or would it be just the overlap between both of them and x3?

matloff.bsky.social•28 days ago

Meanwhile, please explain what you mean by overlap.

dingdingpeng.the100.ci•28 days ago

The covariance i guess 🙈 sorry i tend to think in these diagrams with overlapping circles representing variances/covariances

jounihelske.bsky.social•28 days ago

Off the top of my head, I would guess the former, as the uncertainty of the coefficients of x1 and x2 propagates to other coefficients as well? Norm has apparently already promised us an illustration, so fortunately, I don't need to spend my night playing around with this. 😀

matloff.bsky.social•28 days ago

Right, good phrasing. Well, let's see what I can come up with now.

awfuldodger.bsky.social•28 days ago

I’m doing the same! (as I can’t bookmark it)

dingdingpeng.the100.ci•28 days ago

Beyond, of course, any overlap with x3. i.e., does overlap between covariates affect the precision with which the focal effect is estimated?
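A small numeric sketch of this question, under a strong simplifying assumption: the correlation r of x3 with each control is held fixed while only Cor(x1, x2) = rho varies. In that case the R-squared of x3 on the controls is 2r^2 / (1 + rho), so the VIF of x3 depends on rho only mildly, and a negative rho inflates it more than a positive one. The function name and the values of r and rho are illustrative.

```python
import numpy as np

def vif_focal(r, rho):
    """VIF of x3 when Cor(x3, x1) = Cor(x3, x2) = r and Cor(x1, x2) = rho."""
    corr = np.array([
        [1.0, rho, r],
        [rho, 1.0, r],
        [r,   r,   1.0],
    ])
    return np.linalg.inv(corr)[2, 2]

for rho in (-0.5, 0.0, 0.5, 0.99):
    closed_form = 1.0 / (1.0 - 2 * 0.3**2 / (1.0 + rho))
    print(f"rho={rho}: VIF(x3)={vif_focal(0.3, rho):.4f}, closed form={closed_form:.4f}")
```

In a causal data-generating process, changing rho would usually change r as well, so this isolates only one piece of the picture.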

edesouza.bsky.social•28 days ago

It all depends on the correlation between the control variables and the variable(s) of interest. But the variance of the coefficient of the variable of interest depends on other factors too. The VIF is best ignored (see the discussion in Wooldridge's intro textbook).

freerangestats.info•28 days ago

x3 must be somewhat collinear with x1 and x2 or there would be no need to control for them as confounders.

ericjpedersen.bsky.social•28 days ago

Even if X'X is nearly non-invertible, that does not imply that the SE for any given parameter is large. To see this:

Fit Y ~ b1*X1 + b2*X2 + b3*X3, and we care about the SE(b1). [Y, X1:X3 are all centered and scaled] We assume that Cor(X2, X3) is "large", but Cor(X1, X2) and Cor(X1, X3) are "small"

matloff.bsky.social•28 days ago

I just intended my remark as rough intuition, nothing more.

ericjpedersen.bsky.social•28 days ago

But I think that rough intuition is wrong here; just because X'X is hard to invert does not mean that the standard error for any specific parameter is large

matloff.bsky.social•28 days ago

Please, I didn't say such a thing, "for any specific parameter." I was merely giving intuition as to why multicollinearity can be a problem.

ericjpedersen.bsky.social•28 days ago

But Julia's original point was that *just because X2 and X3 are strongly correlated, including them both does not necessarily impair the precision of our effect of interest*. I was trying to demonstrate that point.

ericjpedersen.bsky.social•28 days ago

Since all variables have mean 0, SD of 1, then X'X is just the correlation matrix Cor(X). We can write it as a block matrix:

X'X = [A  B]
      [B' C]

(A = 1, B = [Cor(X1,X2), Cor(X1,X3)], C = Cor(X2:X3))

SE(b) = (X'X)^-1[1,1], or the first diagonal element.

2/n

ericjpedersen.bsky.social•28 days ago

(as a note: A,B, and C should all be multiplied by n above)

A block matrix can be inverted blockwise (https://en.wikipedia.org/wiki/Block_matrix#Inversion) and the top left element of that inverse will have a value of:

(A - BC^-1B')^-1

3/n

ericjpedersen.bsky.social•28 days ago

But if B / B' are "small", then BC^-1B' should be close to zero regardless unless C^-1 is very large (very poorly conditioned)

ericjpedersen.bsky.social•28 days ago

A couple quick corrections (I wrote this too fast):
the matrix should be:

M = n*[A  B]
      [B' C]

and

SE(b1)^2 = (SE(res)^2 * M^{-1})[1,1]

(inverse of M times the squared residual standard error)

so SE(b1)^2 = SE(res)^2 * (1/n) * (1 - BC^{-1}B')^{-1}

But the same argument still holds
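The blockwise argument can be checked numerically. This is a minimal sketch with an assumed correlation matrix (Cor(X2, X3) = 0.95, Cor(X1, X2) = Cor(X1, X3) = 0.1); the variable names are illustrative.

```python
import numpy as np

# Assumed correlation matrix: X1 weakly related to X2 and X3, which are strongly related
R = np.array([
    [1.0, 0.1, 0.1],
    [0.1, 1.0, 0.95],
    [0.1, 0.95, 1.0],
])
A = R[0, 0]
B = R[0:1, 1:]   # 1x2 row: Cor(X1, X2), Cor(X1, X3)
C = R[1:, 1:]    # 2x2 block: correlation matrix of X2 and X3

# Top-left element of the inverse via the blockwise formula (A - B C^-1 B')^-1
schur = A - (B @ np.linalg.inv(C) @ B.T).item()
top_left_blockwise = 1.0 / schur
# Same element from inverting the full matrix directly
inverse_diag = np.diag(np.linalg.inv(R))
top_left_direct = inverse_diag[0]

print("blockwise:", top_left_blockwise, "direct:", top_left_direct)
print("all diagonal elements of the inverse:", inverse_diag)
```

With these numbers the element for X1 stays close to 1, while the elements for X2 and X3 are much larger.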

stephenjwild.bsky.social•28 days ago

It should affect the standard error of your variable of interest based on the extent it is correlated with them

stephenjwild.bsky.social•28 days ago

So if not correlated with them, it should not affect it at all.

But let's simulate!

dingdingpeng.the100.ci•28 days ago

Yes, most definitely! But the correlation only among the controls isn't really relevant, is it? Inferences about the focal effect shouldn't be affected by it.

stephenjwild.bsky.social•28 days ago

Me, channeling you: You should not be interpreting control variables 🤪

dingdingpeng.the100.ci•28 days ago

Exactly, and you shouldn't even care about whether they are highly correlated with their standard errors blowing up.

stephenjwild.bsky.social•28 days ago

Wooldridge had a tweet thread about this once:

https://x.com/jmwooldridge/status/1483493723233259527

dingdingpeng.the100.ci•28 days ago

Nice, that's very much aligned with what I'm trying to get at

mikedecr.computer•11 days ago

The world of toy models is one thing. Outside the toy model, the model is useful if it helps us forecast the effects of new interventions. I think explosive coefs from high correlation are only benign if you KNOW that new data follow the exact same DGP as training data. But.. reality has a way of biting

mikedecr.computer•11 days ago

tldr won’t someone think of the held out data

mattansb.bsky.social•28 days ago

I'm gonna go with "no" - collinearity only affects the SEs (or more generally the uncertainty) of the predictors that suffer from said high collinearity.

"Proof" by example:

isabellaghement.bsky.social•11 days ago

I do recall reading about collinearity “flipping the sign” of the coefficient of a predictor.

What is that about if collinearity only affects the SEs?

ingorohlfing.bsky.social•11 days ago

I also read, in multiple places I think, that very high collinearity may make estimates so unstable that they flip signs. I do not recall a simulation illustrating this.

steamtraen.eu•11 days ago

Is this relevant? https://steamtraen.shinyapps.io/suppressiongraphics/

dingdingpeng.the100.ci•11 days ago

Sorry Nick but mentioning suppression, that’s a paddling

dingdingpeng.the100.ci•11 days ago

(People in psych use that term for essentially anything “I included two variables and you won’t believe what happened next”; I think your shiny app is about various partial correlation structures and how they translate into regression coefficients?)

dingdingpeng.the100.ci•11 days ago

There’s two parts of this. First, for correlated predictors, including both may easily flip the sign of each one — because eg one is a confounder, or a collider between the outcome and the other etc.; that’s the usual causal inference stuff.>

dingdingpeng.the100.ci•11 days ago

Second, if the standard errors get very large, of course in single draws the coefficients may flip sign (that’s what the large SEs mean, after all).>
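A minimal simulation sketch of this second point. All true coefficients are positive (0.2), x1 and x2 correlate at 0.99, and x3 depends on both. The data-generating process, sample size and seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, rho = 100, 2000, 0.99
cov = np.array([[1.0, rho], [rho, 1.0]])
b1 = np.empty(reps)
b3 = np.empty(reps)
for r in range(reps):
    X12 = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # Assumed: the focal predictor x3 is related to both collinear controls
    x3 = 0.5 * X12[:, 0] + 0.5 * X12[:, 1] + rng.normal(size=n)
    y = 0.2 * X12[:, 0] + 0.2 * X12[:, 1] + 0.2 * x3 + rng.normal(size=n)
    design = np.column_stack([np.ones(n), X12, x3])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    b1[r], b3[r] = beta[1], beta[3]

# Share of simulated datasets in which the estimate has the "wrong" (negative) sign
wrong_sign_x1 = np.mean(b1 < 0)
wrong_sign_x3 = np.mean(b3 < 0)
print(f"x1: SD={b1.std():.3f}, wrong sign in {wrong_sign_x1:.1%} of draws")
print(f"x3: SD={b3.std():.3f}, wrong sign in {wrong_sign_x3:.1%} of draws")
```

The benchmark for "wrong sign" here is the true coefficient in the simulated model.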

dingdingpeng.the100.ci•11 days ago

However, the situation at hand is one in which two predictors (X1 and X2) are highly correlated, but we’re interested in the effect of X3. So the coefficients of X1 and X2 needn’t concern us in their own right. And it turns out that>

isabellaghement.bsky.social•11 days ago

Thank you, Julia! I’ve always struggled with this notion of “flip sign” - because I am not sure what the benchmark is here? Does the sign change w.r.t.: 1) no collinearity; 2) omitting the non-focal collinear predictor(s); 3) the sign of the true coefficient?

akmontoya.bsky.social•28 days ago

I make a big deal about this in my regression class, that collinearity does not invalidate or bias standard errors. You have low power because you’re supposed to have low power.

dingdingpeng.the100.ci•28 days ago

Exactly! Also…throwing out a necessary control variable is not a very good way to “fix” it 😭

teaguerhenry.bsky.social•27 days ago

Yes! I tell my class that regression coefficients are the "unique" relationships of a variable. If variables are not "unique", there is less information available to make the SEs "smaller." It's always either multicollinearity or normality assumptions that you've got to disabuse students of!

charlesdriver.bsky.social•28 days ago

Would be interesting to understand how this collinearity thing became so embedded in the psych basic stats training.

dingdingpeng.the100.ci•28 days ago

It’s kinda wild. I was even taught that this is why you center predictors prior to calculating their product 🫠
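A short sketch of what centering does and does not change, using simulated data (means, coefficients and seed are assumptions). Centering lowers the correlation between a predictor and the product term, but the interaction coefficient and its standard error are the same either way, because the centered model is a reparameterization of the raw one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(loc=5.0, scale=1.0, size=n)
z = rng.normal(loc=3.0, scale=1.0, size=n)
y = 1.0 + 0.5 * x + 0.3 * z + 0.2 * x * z + rng.normal(size=n)

def fit(x, z, y):
    """OLS of y on x, z and x*z; returns coefficients and classical standard errors."""
    X = np.column_stack([np.ones_like(x), x, z, x * z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

beta_raw, se_raw = fit(x, z, y)
xc, zc = x - x.mean(), z - z.mean()
beta_cen, se_cen = fit(xc, zc, y)

corr_raw = np.corrcoef(x, x * z)[0, 1]      # predictor vs. product, uncentered
corr_cen = np.corrcoef(xc, xc * zc)[0, 1]   # predictor vs. product, centered
print("Cor(x, x*z):", corr_raw, "after centering:", corr_cen)
print("interaction coef:", beta_raw[3], beta_cen[3], "SE:", se_raw[3], se_cen[3])
```

The lower-order coefficients do change with centering, since they then describe effects at the mean rather than at zero.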

foswald.bsky.social•27 days ago

a couple of psych papers on this, in case you haven't seen 'em:
https://link.springer.com/article/10.3758/s13428-015-0624-x
https://journals.sagepub.com/doi/abs/10.1177/0013164418817801

florcc.bsky.social•27 days ago

I would not recommend the paper of Iacobucci et al. though. See https://link.springer.com/article/10.3758/s13428-016-0785-2

foswald.bsky.social•27 days ago

Thank you, Florian! Will check this out.

florcc.bsky.social•27 days ago

You’re welcome! :)

alxndrmlk.bsky.social•11 days ago

I think this idea might come from the DoE community.

They would argue that centering can help in model selection when we include interaction terms.

But in DoE traditionally *all* variables are treatment variables.

federicovaggi.bsky.social•27 days ago

The only thing is you might run into numerical issues, especially if you are doing logistic regression. Also, depending on which flavor of statistical testing you are doing after the model is fit, some tests adjust for the number of covariates.

dingdingpeng.the100.ci•27 days ago

The numerical issues one is an interesting one because I’ve wondered about it but haven’t encountered it yet (as far as i can tell). Would that result in convergence issues? Also can it…possibly cause numerical issues in OLS? I was wondering whether there was a historic issue…

federicovaggi.bsky.social•27 days ago

OLS is just the loss function, so it depends on how you actually optimize it: looking at R, it uses QR decomposition by default which should be quite robust - in practice, I think SVD will be even more robust since you can just drop the problematic singular values.
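A NumPy sketch of why the solver matters (this is not R's `lm` internals). Forming X'X roughly squares the condition number of X, whereas QR- or SVD-based least squares work on X directly. The near-duplicate predictor and its noise scale are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-5 * rng.normal(size=n)   # nearly a copy of x1 (assumed noise scale)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + x2 + rng.normal(size=n)

cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)    # roughly cond_X squared

# SVD-based least squares (np.linalg.lstsq)
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)
# QR-based least squares: solve R beta = Q'y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(f"cond(X) = {cond_X:.3g}, cond(X'X) = {cond_XtX:.3g}")
print("max difference in fitted values:", np.max(np.abs(X @ beta_svd - X @ beta_qr)))
```

The two solvers may disagree in the individual coefficients of x1 and x2 at the level of rounding error, which is amplified along the nearly collinear direction, while the fitted values stay essentially identical.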

gregoryfaletto.com•27 days ago

If two columns are extremely highly correlated then X^T X could be almost rank-deficient and difficult to invert. But in practice that seems unlikely unless the number of predictors is close to the number of observations

katossky.bsky.social•26 days ago

It does happen in practice though. Never had R dropping regressors, warning you that "model matrix is nearly or exactly singular" and no coefficients for that regressor showing up in the summary?

dingdingpeng.the100.ci•26 days ago

I’ve only had that when the coefficients were actually deterministically related 😅 but I’m not doing a lot of heavy modeling I guess

federicovaggi.bsky.social•27 days ago

This is of course assuming the input datasets are small enough that optimizing the model is not actually a bottleneck - otherwise you have to use first order optimizers (SGD, SAGA, etc), which will be considerably more sensitive to numerical issues.

dingdingpeng.the100.ci•28 days ago

What will happen is that the coefficients of the correlated controls will be imprecise (because you don't know "which one it is" that's predicting the outcome). But it doesn't matter either way because the predicted outcome will always be the same.>

conjugateprior.org•28 days ago

One way I've seen to sharpen intuition about controls is to turn from the outcome to the treatment assignment side and note that collinearity doesn't matter inside a propensity score model either; its quality only depends on the extent to which conditioning on its expected values generates balance.

stippe87.bsky.social•28 days ago

Exactly. Collinearity is an issue if your aim is to have a precise estimate of the coefficient, or otherwise if you want to find a minimal set of regressors in EDA.

dingdingpeng.the100.ci•28 days ago

Starting to feel like "don't look at the coefficients, just calculate whatever metric is relevant to your research question" is a highly underappreciated stats hack and also I may have to get myself a marginaleffects T-shirt.

conjugateprior.org•28 days ago

alxndrmlk.bsky.social•11 days ago

😁

zabong69.bsky.social•28 days ago

Finally someone says it!

mattansb.bsky.social•28 days ago

Would this really solve the issue? Wouldn't the lack of precision be "inherited" by any estimate derived with the affected coefficient?

charlesdriver.bsky.social•28 days ago

Only if you treat the uncertainty very crudely. Generally, the correlation (and higher order relations if fancy pants) in the uncertainty ensures that e.g. when one coefficient is low, the other is high. Sampling parameter estimates makes this stuff easier.

mattansb.bsky.social•27 days ago

Right, but if I'm looking, say, at the ATE of the effect associated with that coefficient, instead of the coefficient itself, wouldn't I still get high SEs?

charlesdriver.bsky.social•27 days ago

Yes if you're looking at anything that only depends on one of the coefficients the high uncertainty of that coefficient will propagate through. Which is good, because, you're uncertain about it!

dingdingpeng.the100.ci•28 days ago

Yes, exactly! For example, if you make a prediction and x1 and x2 are highly correlated, it doesn’t matter which one gets the high coefficient — as x1 and x2 are highly correlated, you’ll end up with the same predictions anyway.
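A simulation sketch of this point, with assumed coefficients, sample size and seed. The individual coefficient of x1 varies a lot across datasets, but the prediction at a point where x1 and x2 agree is stable. The prediction at a point where x1 and x2 disagree, which is rare under rho = 0.99, is much less stable; this is where the concern raised elsewhere in the thread about new data from a different DGP applies.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, rho = 200, 1000, 0.99
cov = np.array([[1.0, rho], [rho, 1.0]])
typical = np.array([1.0, 1.0, 1.0])    # intercept, x1, x2: consistent with rho = 0.99
atypical = np.array([1.0, 1.0, -1.0])  # x1 and x2 disagree: far from the training data

b1 = np.empty(reps)
pred_typ = np.empty(reps)
pred_atyp = np.empty(reps)
for r in range(reps):
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    b1[r] = beta[1]
    pred_typ[r] = typical @ beta
    pred_atyp[r] = atypical @ beta

print("SD of b1 across datasets:          ", b1.std())
print("SD of prediction at typical point: ", pred_typ.std())
print("SD of prediction at atypical point:", pred_atyp.std())
```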

etotheipie.bsky.social•28 days ago

Yes! Do the F test you want, ignore all those small ones in the default table output.

zabong69.bsky.social•28 days ago

Us ML people have it easy. We never look at coefficients.

dingdingpeng.the100.ci•28 days ago

this is the way

ehudk.bsky.social•27 days ago

but often at the price of no uncertainty quantification. i have a task open to inquire the requirements for the delta method to see if it can apply to arbitrary (but probably smooth/donsker?) estimators, as well as "prediction-powered inference"
https://www.science.org/doi/10.1126/science.adi6000

ehudk.bsky.social•27 days ago

I'm not familiar enough with the delta method to know its requirements, but it does feel that, once all you care about is prediction space and not parameter space, that shifting towards more expressive estimators (better MSE -> narrower uncertainty intervals?) feels inevitable.

ergative-abs.bsky.social•11 days ago

This has been my strategy for ages. Build a control model with all the nuisance variables chucked in there, multicollinear or not, make sure they're all necessary, and *then* look at whether my predictors of interest improve fit.

dingdingpeng.the100.ci•11 days ago

This is the way

klauspforr.bsky.social•27 days ago

When is collinearity really a problem?

dingdingpeng.the100.ci•27 days ago

I'd say if a confounder correlates .99 with your cause of interest, you may be in a bit of trouble, conceptually and statistically speaking.
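For a sense of scale, a back-of-the-envelope calculation for this case, assuming a single confounder and the usual VIF formula 1 / (1 - r^2):

```python
import math

r = 0.99  # correlation between the confounder and the cause of interest (from the post)
vif = 1.0 / (1.0 - r**2)       # variance inflation relative to an uncorrelated predictor
se_inflation = math.sqrt(vif)  # corresponding inflation of the standard error
print(f"VIF = {vif:.2f}, SE inflated by a factor of {se_inflation:.2f}")
```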

klauspforr.bsky.social•27 days ago

All examples that come to mind seem to necessitate a different identification strategy, where we use an experiment or quasi-xp to manipulate the treatment

dingdingpeng.the100.ci•27 days ago

Fully agreed.

ehudk.bsky.social•26 days ago

A strong confounder-treatment association is essentially an overlap/positivity violation

econmaett.bsky.social•27 days ago

📌

chrisadamsecon.bsky.social•11 days ago

Not sure if it has come up, but you may want to look at this line of research. We may want to use a ML model to flexibly account for confounders. However, because ML models are biased, the treatment effect estimate will also be biased. https://economics.mit.edu/sites/default/files/2022-08/2017.06%20Double%20Debiased%20Machine%20Learning%20for%20Treat.pdf

doinkboy.bsky.social•27 days ago

Something not often realized about collinear predictors is that their Type 1 errors are correlated. Assuming pos. collinearity, if the effect of one variable is mistakenly estimated as positive, the other variable is more likely to have a neg. effect. For neg. col., the 2 effects are same signed.
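The sign pattern described here can be read off the coefficient covariance matrix, which is proportional to (X'X)^{-1}. A minimal sketch for two standardized predictors follows; the function name is illustrative. The sampling correlation between the two slope estimates comes out as minus the correlation between the predictors.

```python
import numpy as np

def coef_correlation(rho):
    """Correlation between the two OLS slope estimates implied by (X'X)^{-1}
    for two standardized predictors with correlation rho."""
    corr = np.array([[1.0, rho], [rho, 1.0]])
    inv = np.linalg.inv(corr)
    return inv[0, 1] / np.sqrt(inv[0, 0] * inv[1, 1])

for rho in (0.9, -0.9):
    print(f"Cor(x1, x2) = {rho}: Cor(b1_hat, b2_hat) = {coef_correlation(rho):.3f}")
```

So with positively correlated predictors the estimation errors tend to have opposite signs, and with negatively correlated predictors they tend to share a sign.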

boryslaw.bsky.social•28 days ago

I am sure you know this, but I felt compelled to spell this out: Collinearity is sometimes defined as high correlation and sometimes as linear dependence. These are two very different things.
1/3

isabellaghement.bsky.social•11 days ago

In practice, is there a way to distinguish between these two?

For example, for linear dependence, you could fit linear regression models

X_j ~ X_1 + … + X_n

(with X_j omitted from RHS)

and check the size of Rsq.

How on earth would you check correlation?
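A sketch of the auxiliary-regression check described above, on simulated data (the data-generating process and seed are assumptions). It also shows how the pairwise correlations can look moderate while the R-squared of one predictor on all the others is close to 1, and that 1 / (1 - R^2) matches the corresponding diagonal element of the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
base = rng.normal(size=(n, 3))                    # three independent predictors
x4 = base.sum(axis=1) + 0.2 * rng.normal(size=n)  # nearly their sum
X = np.column_stack([base, x4])

def r_squared(X, j):
    """R-squared from regressing column j on all other columns plus an intercept."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    return 1.0 - resid.var() / target.var()

pairwise = np.corrcoef(X, rowvar=False)
r2_x4 = r_squared(X, 3)
vif_x4 = 1.0 / (1.0 - r2_x4)
vif_from_inverse = np.linalg.inv(pairwise)[3, 3]

print("pairwise correlations of x4 with the others:", pairwise[3, :3])
print("R^2 of x4 on the others:", r2_x4)
print("VIF:", vif_x4, "from inverse correlation matrix:", vif_from_inverse)
```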

isabellaghement.bsky.social•11 days ago

Also, has anyone done simulations looking at the impact on estimation of X_j’s coefficient under these two distinct assumptions: correlation vs linear dependence?

P.S.: Are we talking about correlation between X_j and the linear combination of all other predictors? How is “correlation” defined here?

boryslaw.bsky.social•11 days ago

I am not sure if I understand the question well. Correlations between predictors can always be checked in the usual way.

isabellaghement.bsky.social•11 days ago

Ah, I thought you meant some kind of “multivariate” correlation rather than “bivariate” correlations.

boryslaw.bsky.social•28 days ago

One consequence of linear dependence is that, e.g., OLS estimates of linear regression coefficients are not unique, which is typically a serious problem. A mere high correlation can have a "similar" effect (e.g., prevent convergence) only because of how model fitting is implemented in software.
2/3

boryslaw.bsky.social•28 days ago

In the ideal world of arbitrary precision real numbers, OLS estimates of linear regression coefficients are the same as the orthogonal projection of data (as an n-dimensional point where n is the sample size) on the linear span of the predictors; as long as they are not linearly dependent, all is fine.
3/3
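A small sketch of the linear-dependence case, using simulated data with an assumed seed. When one predictor is an exact sum of two others, the design matrix loses rank: different coefficient vectors give exactly the same fitted values, so the projection is well defined but the coefficients are not unique.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                       # exact linear dependence
X = np.column_stack([x1, x2, x3])
y = x1 + 2.0 * x2 + rng.normal(size=n)

rank = np.linalg.matrix_rank(X)    # 2, not 3
# lstsq returns the minimum-norm solution among the infinitely many minimizers
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
# Adding any multiple of the null-space vector (1, 1, -1) gives another minimizer
beta_other = beta_min_norm + 3.0 * np.array([1.0, 1.0, -1.0])

fitted_a = X @ beta_min_norm
fitted_b = X @ beta_other
print("rank:", rank)
print("coefficients differ:", not np.allclose(beta_min_norm, beta_other))
print("fitted values agree:", np.allclose(fitted_a, fitted_b))
```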
