matloff.bsky.social
Em. Prof., UC Davis. Many awards, incl. book, teaching, public service. Many books, latest The Art of Machine Learning (uses qeML pkg). Former Editor in Chief, the R Journal. Views mine. heather.cs.ucdavis.edu/matloff.html
1,103 posts
476 followers
765 following
comment in response to
post
That's the major advantage of MI. Many people like to average the m results, but if you don't, then you avoid what amounts to replacing NAs by means. Hence no (big) distortion to X'X, again assuming the assumptions (e.g. MV normal) hold. 3/3
comment in response to
post
Say we've centered our predictor variables, and consider the matrix X'X in betahat = inv(X'X) X'Y. Then X'X will be approximately n times the variances and covariances of the predictors, which are now distorted by the replacement of NAs by some kind of mean (most non-MIs do this), reducing var(). 2/
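To illustrate that shrinkage numerically, here is a quick sketch of my own (Python/NumPy, hypothetical numbers, not from the thread): replacing 30% of a centered predictor's NAs by the observed mean cuts its sample variance by roughly that same 30%, which is exactly the distortion X'X inherits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)            # one centered predictor, true variance 1

# knock out 30% of the values completely at random
miss = rng.random(n) < 0.3
x_obs = x.copy()
x_obs[miss] = np.nan

# single mean imputation: replace each NA by the observed mean
x_imp = np.where(miss, np.nanmean(x_obs), x_obs)

print(np.var(x))      # ~1.0, the full-data variance
print(np.var(x_imp))  # ~0.7: shrunk by roughly the missingness fraction
```

Since X'X is about n times these variances and covariances, the imputed version understates them by the same factor.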
comment in response to
post
Thanks for the explanation. As I've said before, missing data is not my forte, even though I've dabbled in it for years.
But it seems to me that even if a missingness method is unbiased in the context of unconditional distributions, it cannot be unbiased in, e.g., a linear regression context. 🧵 1/
comment in response to
post
This looks very good.
comment in response to
post
I see. Then how would one check that, especially in the regression case?
comment in response to
post
I think we're talking past each other. :-) My phrase "doesn't matter" wasn't meant in your context here.
BTW, what specifically do you mean by "unbiased"? I'm not aware of any method to check this.
comment in response to
post
Right, I used the term "effect assessment." The toweranNA package won't help you there.
comment in response to
post
Doesn't matter, the methods in the lit are neutral on that.
comment in response to
post
If you are really doing prediction, as opposed to effect assessment, I do suggest toweranNA. As I said, it's specifically for prediction, with the bonus that its assumptions are verifiable. 4/4
comment in response to
post
I definitely recommend multiple imputation methods, as most missingness methods have high variance. Of course, they also all have bias, but there is not much one can do about that, given unverifiable assumptions. 3/
comment in response to
post
AFAIK, the only missingness method specifically designed for prediction is our toweranNA package, on CRAN.
But again, the standard is to apply missingness methods to your data first, no "equation," then fit your model. My students and I are developing a package to facilitate this. 2/
comment in response to
post
In the existing lit, the missingness is not on either side of the equation, because there is no equation, i.e. no predictive modeling. It simply assumes one has data, and it is up to the users what they want to do with the data after NAs have been dealt with, including predictive modeling. 🧵 1/
comment in response to
post
What I advise students is, just write SOMETHING. Don't optimize, just get the ball rolling. Then do iterative improvement.
comment in response to
post
Anybody who thought that this illustration enhanced clarity lives in an alternative reality.
comment in response to
post
Don't forget ctrl-p to search in the opposite direction.
comment in response to
post
* ...does NOT mean ...
comment in response to
post
So as a data analyst, @ke-alos-ghenate.bsky.social, this IS your field. It's a shame that these issues are not standard parts of courses and textbooks, but the concepts are not deep. 5/5
comment in response to
post
But due to inevitable model bias, that number easily could have been negative. The authors made no attempt to assess model validity, and indeed, the model was so complex that this would have been impossible to assess well. Yet that 20,000 figure will be quoted and used by policymakers. 4/
comment in response to
post
It MATTERS. Recently some economists published a study claiming that early reopening of the schools caused 20,000 additional deaths from Covid. This came from a very elaborate model that found, IIRC, 0.7 excess deaths per 100K population. 3/
comment in response to
post
Similarly, U and V can have similar cdfs yet quite divergent pdfs. Hence my reply to your comment yesterday that one can test for a Gaussian model being "reasonable." Aside from the general problems with NHST, it's even worse here, as e.g. a chi-square GOF test is essentially looking at cdfs. 2/
comment in response to
post
I disagree that it's not your field. Everyone who works with data should be familiar with this fundamental point: If U and V are random variables and their cdfs or pdfs are close, it does NOT mean their means, variances and so on are close. Same if U and V are random vectors etc. 🧵 1/
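A toy numerical illustration of that point, my own sketch (the variables and numbers are hypothetical, not from the thread): two random variables whose cdfs differ by at most 0.001 everywhere, yet whose means differ by about 1000.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

u = np.zeros(n)                                 # U is identically 0
v = np.where(rng.random(n) < 0.001, 1e6, 0.0)   # V: 1e6 w.p. 0.001, else 0

# The Kolmogorov (sup-norm) distance between the two cdfs is only 0.001:
# the cdfs agree except on [0, 1e6), where they differ by 0.001.
print(np.mean(v > 0))        # ~0.001

# ... yet the means are wildly different
print(u.mean(), v.mean())    # 0.0 vs ~1000
```

The same construction pushes the variance gap as far as you like, so "close cdfs" on its own guarantees essentially nothing about moments.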
comment in response to
post
Apparently so, sorry.
comment in response to
post
I think you and I are saying the same thing. Actually, this is a point that I frequently bring up, in more basic settings than the one brought up by @ke-alos-ghenate.bsky.social. Just fitting a linear regression model, one does not have unbiasedness, because the model is never correct.
comment in response to
post
Good timing! Actually, I rarely use such measures, but just yesterday I found a very good use for them in a package I'm writing (which caused me to gain a new appreciation of them), so your writeup will be highly useful.
comment in response to
post
If you are talking about the Gauss-Markov Theorem, it does require unbiasedness.
comment in response to
post
A hypothesis test will not tell you that an assumption is "reasonable" (matloff.github.io/No-P-Values). And the situation is even worse than that when one recognizes that two random variables can be close in cdfs but not in means, densities, etc. Worse still for conditional quantities.
comment in response to
post
Fantastic!
comment in response to
post
So, results were mixed, but nevertheless I was quite impressed with the first version. A great example of the power and limitations of AI in the coding realm. 9/9
comment in response to
post
However, in this version the test case blew up, not necessarily due to coding error but due to exhausting resources. I pointed this out, and asked the AI agent to try again. However, the corrected version blew up too, and I didn't go any further. 8/
comment in response to
post
But interestingly, the AI agent noted that it had worked around R's lack of a coroutine capability. I then replied that there is a CRAN package for this, 'coro', and asked the AI agent to make use of it, which the agent did. 7/
comment in response to
post
And I didn't even tell the AI agent where to find SimPy. But sure enough, it did translate the entire package, and so far, it seems to have done so correctly. It handles the test case, and a glance through the code didn't show any obvious problems. 6/
comment in response to
post
This was a rather audacious request, for two reasons. First, I'd previously asked AI to translate only individual R or Python functions, not a whole package. Second, and more subtle: SimPy makes use of Python's excellent coroutine capability, which R lacks. 5/
comment in response to
post
At that time (and now), there was a very good open-source DES package, SimPy, written in Python. So a few days ago, I was curious whether AI could translate it into R. 4/
comment in response to
post
I used to teach a simulation course, focusing mainly on "discrete event" simulation. The term "discrete" refers to events that occur instantaneously rather than evolving continuously. E.g. in a communications network, arrivals of new calls are such events, even though call duration is continuous. 3/
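To make the "instantaneous events, continuous durations" idea concrete, here is a bare-bones sketch of my own in Python; it is not SimPy (which instead expresses each process as a coroutine that yields at its events), just the classic event-queue form of DES, with hypothetical call times.

```python
import heapq

def simulate(arrivals, durations):
    """Tiny discrete-event loop: 'arrive' and 'hangup' events fire
    instantaneously; the count of active calls evolves between them."""
    events = []                       # min-heap ordered by event time
    for t, d in zip(arrivals, durations):
        heapq.heappush(events, (t, "arrive", d))
    active, log = 0, []
    while events:
        t, kind, d = heapq.heappop(events)
        if kind == "arrive":
            active += 1
            # the call's end is itself a future instantaneous event
            heapq.heappush(events, (t + d, "hangup", 0.0))
        else:
            active -= 1
        log.append((t, active))       # state trajectory over time
    return log

# three calls: start times and durations
print(simulate([0.0, 1.0, 2.0], [5.0, 1.5, 2.0]))
# → [(0.0, 1), (1.0, 2), (2.0, 3), (2.5, 2), (4.0, 1), (5.0, 0)]
```

The coroutine style SimPy uses replaces this central event loop with per-entity generators, which is why translating it to R (no native coroutines) is the interesting part.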
comment in response to
post
Though it should have been obvious, it had never occurred to me that language translation by AI is possible/practical. In this thread, I'll report on my latest usage of that concept, which was (mostly) successful. 2/
comment in response to
post
Very nice writeup! Especially clear re overfitting, a tricky topic.
comment in response to
post
How many holders of Statistics degrees would notice that this is invalid? How many would be able to fix it? 2/2
comment in response to
post
Every serious R user should learn at least basic Python.
comment in response to
post
And for THAT kind of data science, R is by far the better tool.
comment in response to
post
If one defines DS to be AI, then Python has the better developed libraries, and AI people tend to be CS. If, as I said, one defines DS to be "what most R people do" -- e.g. what most people reading this thread do -- it's more data analysis and modeling, including graphics, presentation tools, etc.
comment in response to
post
Doesn't surprise me at all. R has never been popular with the CS crowd. Many of them sneer at R. It's always been this way. The tech industry is NOT the measure we should use for the health of R.
comment in response to
post
Yes, and even people who are not trying to grab attention can make highly incorrect statements, if the statements are thought correct by the poster. Even highly respected posters can be quite wrong.
comment in response to
post
Very good point. I think you will agree with me when I say that a problem with many (though of course not all) journalists is that their livelihood depends on writing attention-grabbing statements, such as an impending demise of R.