matloff.bsky.social
Em. Prof., UC Davis. Many awards, incl. book, teaching, public service. Many books, latest The Art of Machine Learning (uses qeML pkg). Former Editor in Chief, the R Journal. Views mine. heather.cs.ucdavis.edu/matloff.html
1,103 posts
476 followers
765 following
comment in response to
post
That's the major advantage of MI. Many people like to average the m results, but if you don't, then you avoid what amounts to replacing NAs by means. Hence no (big) distortion to X'X, again assuming the assumptions (e.g. MV normal) hold. 3/3
comment in response to
post
Say we've centered our predictor variables, and consider the matrix X'X in betahat = inv(X'X) X'Y. Then X'X will be approximately n times the variances and covariances of the predictors, which are now distorted by the replacement of NAs by some kind of mean (most non-MIs do this), reducing var(). 2/
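To illustrate that shrinkage numerically, here is a quick sketch of my own (Python/NumPy, hypothetical numbers, not from the thread): replacing 30% of a centered predictor's NAs by the observed mean cuts its sample variance by roughly that same 30%, which is exactly the distortion X'X inherits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)            # one centered predictor, true variance 1

# knock out 30% of the values completely at random
miss = rng.random(n) < 0.3
x_obs = x.copy()
x_obs[miss] = np.nan

# single mean imputation: replace each NA by the observed mean
x_imp = np.where(miss, np.nanmean(x_obs), x_obs)

print(np.var(x))      # ~1.0, the full-data variance
print(np.var(x_imp))  # ~0.7: shrunk by roughly the missingness fraction
```

Since X'X is about n times these variances and covariances, the imputed version understates them by the same factor.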
comment in response to
post
Thanks for the explanation. As I've said before, missing data is not my forte, even though I've dabbled in it for years.
But it seems to me that even if a missingness method is unbiased in the context of unconditional distributions, it cannot be unbiased in, e.g., a linear regression context. 🧵 1/
comment in response to
post
This looks very good.
comment in response to
post
I see. Then how would one check that, especially in the regression case?
comment in response to
post
I think we're talking past each other. :-) My phrase "doesn't matter" wasn't meant in your context here.
BTW, what specifically do you mean by "unbiased"? I'm not aware of any method to check this.
comment in response to
post
Right, I used the term "effect assessment." The toweranNA package won't help you there.
comment in response to
post
Doesn't matter, the methods in the lit are neutral on that.
comment in response to
post
If you are really doing prediction, as opposed to effect assessment, I do suggest toweranNA. As I said, it's specifically for prediction, with the bonus that its assumptions are verifiable. 4/4
comment in response to
post
I definitely recommend multiple imputation methods, as most missingness methods have high variance. Of course, they also all have bias, but there is not much one can do about that, given unverifiable assumptions. 3/
comment in response to
post
AFAIK, the only missingness method specifically designed for prediction is our toweranNA package, on CRAN.
But again, the standard is to apply missingness methods to your data first, no "equation," then fit your model. My students and I are developing a package to facilitate this. 2/
comment in response to
post
In the existing lit, the missingness is not on either side of the equation, because there is no equation, i.e. no predictive modeling. It simply assumes one has data, and it is up to the users what they want to do with the data after NAs have been dealt with, including predictive modeling. 🧵 1/
comment in response to
post
What I advise students is, just write SOMETHING. Don't optimize, just get the ball rolling. Then do iterative improvement.
comment in response to
post
Anybody who thought that this illustration enhanced clarity lives in an alternative reality.
comment in response to
post
Don't forget ctrl-p to search in the opposite direction.
comment in response to
post
* ...does NOT mean ...
comment in response to
post
So as a data analyst, @ke-alos-ghenate.bsky.social, this IS your field. It's a shame that these issues are not standard parts of courses and textbooks, but the concepts are not deep. 5/5
comment in response to
post
But due to inevitable model bias, that number easily could have been negative. The authors made no attempt to assess model validity, and indeed, the model was so complex that this would have been impossible to assess well. Yet that 20,000 figure will be quoted and used by policymakers. 4/
comment in response to
post
It MATTERS. Recently some economists published a study claiming that early reopening of the schools caused 20,000 additional deaths from Covid. This came from a very elaborate model that found, IIRC, 0.7 excess deaths per 100K population. 3/
comment in response to
post
Similarly, U and V can have similar cdfs yet quite divergent pdfs. Hence my reply to your comment yesterday that one can test for a Gaussian model being "reasonable." Aside from the general problems with NHST, it's even worse here, as e.g. a chi-square GOF test is essentially looking at cdfs. 2/
comment in response to
post
I disagree that it's not your field. Everyone who works with data should be familiar with this fundamental point: If U and V are random variables and their cdfs or pdfs are close, it does NOT mean their means, variances and so on are close. Same if U and V are random vectors etc. 🧵 1/
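A toy numerical illustration of that point, my own sketch (the variables and numbers are hypothetical, not from the thread): two random variables whose cdfs differ by at most 0.001 everywhere, yet whose means differ by about 1000.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

u = np.zeros(n)                                 # U is identically 0
v = np.where(rng.random(n) < 0.001, 1e6, 0.0)   # V: 1e6 w.p. 0.001, else 0

# The Kolmogorov (sup-norm) distance between the two cdfs is only 0.001:
# the cdfs agree except on [0, 1e6), where they differ by 0.001.
print(np.mean(v > 0))        # ~0.001

# ... yet the means are wildly different
print(u.mean(), v.mean())    # 0.0 vs ~1000
```

The same construction pushes the variance gap as far as you like, so "close cdfs" on its own guarantees essentially nothing about moments.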
comment in response to
post
Apparently so, sorry.
comment in response to
post
I think you and I are saying the same thing. Actually, this is a point that I frequently bring up, in more basic settings than the one brought up by @ke-alos-ghenate.bsky.social. Just fitting a linear regression model, one does not have unbiasedness, because the model is never correct.
comment in response to
post
Good timing! Actually, I rarely use such measures, but just yesterday I found a very good use for them in a package I'm writing (which caused me to gain a new appreciation of them), so your writeup will be highly useful.
comment in response to
post
If you are talking about the Gauss-Markov Theorem, it does require unbiasedness.
comment in response to
post
A hypothesis test will not tell you that an assumption is "reasonable" (matloff.github.io/No-P-Values). And the situation is even worse than that when one recognizes that two random variables can be close in cdfs but not in means, densities, etc. Worse still for conditional quantities.
comment in response to
post
Fantastic!
comment in response to
post
So, results were mixed, but nevertheless I was quite impressed with the first version. A great example of the power and limitations of AI in the coding realm. 9/9
comment in response to
post
However, in this version the test case blew up, not necessarily due to coding error but due to exhausting resources. I pointed this out, and asked the AI agent to try again. However, the corrected version blew up too, and I didn't go any further. 8/
comment in response to
post
But interestingly, the AI agent noted that it had worked around R's lack of a coroutine capability. I then replied that there is a CRAN package for this, 'coro', and asked the AI agent to make use of it, which the agent did. 7/
comment in response to
post
And I didn't even tell the AI agent where to find SimPy. But sure enough, it did translate the entire package, and so far, it seems to have done so correctly. It handles the test case, and a glance through the code didn't show any obvious problems. 6/
comment in response to
post
This was a rather audacious request, for two reasons. First, I'd previously asked AI to translate only individual R or Python functions, not a whole package. Second, and more subtle: SimPy makes use of Python's excellent coroutine capability, which R lacks. 5/
comment in response to
post
At that time (and now), there was a very good open-source DES package, SimPy, written in Python. So a few days ago, I was curious whether AI could translate it into R. 4/
comment in response to
post
I used to teach a simulation course, focusing mainly on "discrete event" simulation. The term "discrete" refers to events that occur instantaneously rather than evolving continuously. E.g. in a communications network, arrivals of new calls are such events, even though call duration is continuous. 3/
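To make the "instantaneous events, continuous durations" idea concrete, here is a bare-bones sketch of my own in Python; it is not SimPy (which instead expresses each process as a coroutine that yields at its events), just the classic event-queue form of DES, with hypothetical call times.

```python
import heapq

def simulate(arrivals, durations):
    """Tiny discrete-event loop: 'arrive' and 'hangup' events fire
    instantaneously; the count of active calls evolves between them."""
    events = []                       # min-heap ordered by event time
    for t, d in zip(arrivals, durations):
        heapq.heappush(events, (t, "arrive", d))
    active, log = 0, []
    while events:
        t, kind, d = heapq.heappop(events)
        if kind == "arrive":
            active += 1
            # the call's end is itself a future instantaneous event
            heapq.heappush(events, (t + d, "hangup", 0.0))
        else:
            active -= 1
        log.append((t, active))       # state trajectory over time
    return log

# three calls: start times and durations
print(simulate([0.0, 1.0, 2.0], [5.0, 1.5, 2.0]))
# → [(0.0, 1), (1.0, 2), (2.0, 3), (2.5, 2), (4.0, 1), (5.0, 0)]
```

The coroutine style SimPy uses replaces this central event loop with per-entity generators, which is why translating it to R (no native coroutines) is the interesting part.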
comment in response to
post
Though it should have been obvious, it had never occurred to me that language translation by AI is possible/practical. In this thread, I'll report on my latest usage of that concept, which was (mostly) successful. 2/
comment in response to
post
Very nice writeup! Especially clear re overfitting, a tricky topic.
comment in response to
post
How many holders of Statistics degrees would notice that this is invalid? How many would be able to fix it? 2/2
comment in response to
post
Every serious R user should learn at least basic Python.
comment in response to
post
And for THAT kind of data science, R is by far the better tool.
comment in response to
post
If one defines DS to be AI, then Python has the better developed libraries, and AI people tend to be CS. If, as I said, one defines DS to be "what most R people do" -- e.g. what most people reading this thread do -- it's more data analysis and modeling, including graphics, presentation tools, etc.
comment in response to
post
Doesn't surprise me at all. R has never been popular with the CS crowd. Many of them sneer at R. It's always been this way. The tech industry is NOT the measure we should use for the health of R.
comment in response to
post
Yes, and even people who are not trying to grab attention can make highly incorrect statements, if the statements are thought correct by the poster. Even highly respected posters can be quite wrong.
comment in response to
post
Very good point. I think you will agree with me when I say that a problem with many (though of course not all) journalists is that their livelihood depends on writing attention-grabbing statements, such as an impending demise of R.