goringennady.bsky.social
🦠🧬📊 bioinformatics, statistics, and stochastic processes.
263 posts
277 followers
93 following
Regular Contributor
Active Commenter
comment in response to post
It is optimistic to say this is common knowledge in the field
comment in response to post
Regressing on x/y and y? Got it. Anyway, excited to someday see a reanalysis with more recent data, and a check of the model's stability with respect to Belarus 2020
comment in response to post
Makes perfect sense! But I am not familiar with this language: if the original outcome data do not play into the plot, where do the deviations from the parametric model fit come from?
comment in response to post
tedious!
comment in response to post
"AI-generated suggestion," "image search that does not find images," and "a week when Google Search refuses to serve the correct results from a particular domain."
comment in response to post
Google has never taken feedback, and the introduction of Google Lens was reviled by many, many people, because Chrome replaced the shortcut to Google Image Search, which could find image matches, with something that could not
comment in response to post
Even if you are excluding artifact cells, it will still show up as DE, predominantly because the expression is so high that a statistical test will have a lot of evidence to conclude it's DE. There are other, subtler reasons related to distributional assumptions too, I think.
comment in response to post
Exhaustive, pointed, and unfun critiques of methods are a good thing: they mean the field can settle down from the Wild West into the more esoteric statistical debates ("what's a good model here?" instead of "what's a replicate?"). I don't want a party, I want the right answer at a reasonable cost
comment in response to post
in the future all scientific computation will rely on the 0x5F3759DF fast invsqrt
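(For the uninitiated: that constant is the magic number from the Quake III fast inverse square root. A minimal Python transcription of the classic bit hack, with the usual single Newton step; a sketch for illustration, not production code:)

```python
import struct

def fast_invsqrt(x: float) -> float:
    """Approximate 1/sqrt(x) via the 0x5F3759DF bit hack (single precision)."""
    # Reinterpret the float's bits as a 32-bit unsigned integer.
    i = struct.unpack('<I', struct.pack('<f', x))[0]
    # The famous "what the..." line: shift and subtract from the magic constant.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack('<f', struct.pack('<I', i))[0]
    # One Newton-Raphson iteration sharpens the initial guess.
    return y * (1.5 - 0.5 * x * y * y)

print(fast_invsqrt(4.0))  # ~0.499, vs. the exact 0.5
```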
comment in response to post
bsky.app/profile/gori... I wonder which point this is under: the model requires a particular substrate (e.g. perfect-precision data) which cannot be obtained even in principle.
comment in response to post
perhaps a pithier formulation
comment in response to post
Re "all models are wrong but some are useful":
Useful models are designed with a good match between the model architecture and the scientific question, aka the ML task. A big gap between the model and what biologists mean by the cell communication questions makes for a model that is both useless and likely wrong.
comment in response to post
🤷‍♀️ It is more than a little dubious.
comment in response to post
miscalibration in Fig. 2; on the other, they seem OK with it in Fig. 3 as an induced lfc bound (?) (I think I'm just missing something here). But it certainly seems outré (relative to the literature) either way.
comment in response to post
absolutely, and it is applied on top of a different flavor of double-dipping (using the same genes to cluster and to do differential expression).
Now, what the effects of double-dipping through filtering are, I don't know. Probably worse FDR control. On one hand, Bourgon et al. show it leads to
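The cluster-then-test flavor is easy to demonstrate. A toy sketch (my own minimal construction, not Bourgon et al.'s setup): cluster pure noise on the same genes you then test, and "DE genes" materialize out of nothing.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 "cells" x 50 "genes" of pure Gaussian noise: no real groups, no real DE.
X = rng.normal(size=(200, 50))

# Double dip: define the clusters from the same matrix...
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ...then test each gene between those clusters.
pvals = np.array([stats.ttest_ind(X[labels == 0, g], X[labels == 1, g]).pvalue
                  for g in range(X.shape[1])])

# A valid procedure would leave ~5% of p-values below 0.05; here it is far
# more, because the clustering already selected for separation on these genes.
print((pvals < 0.05).mean())
```

Nothing here is specific to k-means or the t-test; any cluster-then-test pipeline on the same features inherits the selection effect.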
comment in response to post
p-values that are mostly <1e-100, because the authors downloaded Seurat and made a few changes to a tutorial script to do DEA. The story is, then, between unreliable and meaningless for all but the strongest signal.
It's been getting better, so cheerleading for bad stats is more than disturbing.
comment in response to post
But I'm not a statistician! If a statistician wants to do this leg work, more power to them; maybe they will bring a sea change to how stats is done in genomics. Seems like an easy way to have a lot of impact!
comment in response to post
I can elaborate on that, too (although this is certainly not exhaustive)
comment in response to post
Even if there is validation, it is also not great to adamantly and visibly ignore best practices (if the simplest analysis is plain wrong, are the more complex/new ones reliable?), and to ignore all signal other than the strongest visible in naive analysis (again, these are not cheap experiments).
comment in response to post
It's great when there is time and money to do validation! But more typically there isn't, and correct stats are the difference between the results being potentially interesting and worth following up on, and the results being unusable. These are not cheap experiments.
comment in response to post
Here is the typical scenario: there is an interesting paper on a unique system or human subjects, with a modest n. The authors did not and will never release data. The biological story is interesting. It would be worth consideration if the analysis were done right. But it is based on a table with
comment in response to post
3. None of this matters because sc is exploratory and everything should be validated anyway. This is great in theory, but a bit troubling that there is so much will to spend millions of dollars on experiments, seq kits, and GPUs, and none on ensuring the results are reliable on their own.
comment in response to post
papers reporting issues with pseudoreplication: Squair 2021, yes, but also Zimmerman 2021 and associated correspondence, and Junttila 2022, alongside best-practices reviews. It is easy but tedious to come up with adversarial examples where artifacts lead to meaninglessly low p-values on a per-cell basis.
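One such adversarial sketch (a hypothetical setup of mine, not taken from any of the papers above): a subject-level artifact, no treatment effect at all, and a cell-level test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# 4 subjects per condition, 500 cells each. No treatment effect anywhere;
# only subject-level shifts (batch, dissociation, depth -- pick your artifact).
subject_means = rng.normal(0, 1, size=8)
cells = np.repeat(subject_means, 500) + rng.normal(0, 1, size=8 * 500)
condition = np.repeat([0, 0, 0, 0, 1, 1, 1, 1], 500)

# Cell-level test: n = 2000 "independent" units per group.
print(stats.mannwhitneyu(cells[condition == 0], cells[condition == 1]).pvalue)
# Typically astronomically small, even though the subject-level null is true.

# Pseudobulk: aggregate to n = 4 subjects per group, a correctly sized test.
pseudobulk = cells.reshape(8, 500).mean(axis=1)
print(stats.mannwhitneyu(pseudobulk[:4], pseudobulk[4:]).pvalue)
```

The pseudobulk aggregation recovers the actual experimental unit; the cell-level test answers a different (and uninteresting) question with an absurd effective sample size.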
comment in response to post
2. Treating cells as independent experimental units is OK because everybody does it, and even if we don't, the conclusions still stand (at least for Penk).
I do not think pseudobulk is the best possible approach (nor do I think this particular flavor of pseudobulk is great). But there are many
comment in response to post
Speaking of rank-based methods: this is kind of a funny reference, because (5) says Wilcoxon is okay, if risky, for marker genes, but bad for DGE.
comment in response to post
Sometimes this nonindependence matters, sometimes it doesn't. It seems to be less important for rank-based methods. But those are low-power and do not really take advantage of the data.
comment in response to post
It does not take a lot of work to confirm (by drawing random variates from the NB) that yes, the DESeq2 filter (or a simpler implementation, throwing out low-expression genes) is independent (preserves the p-value distribution) and the fold change procedure is not. Here be dragons.
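Roughly the check I mean, as a sketch (a toy version: an ordinary t-test standing in for the DESeq2 pipeline, and ad hoc filter thresholds of my own):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n = 20000, 5
# Null NB counts: every gene has the same distribution in both groups.
mu = rng.lognormal(2, 1, n_genes)   # per-gene mean expression
theta = 0.5                         # common NB dispersion parameter
counts = rng.negative_binomial(theta, theta / (theta + mu[:, None]),
                               size=(n_genes, 2 * n))

g1, g2 = counts[:, :n], counts[:, n:]
pvals = stats.ttest_ind(g1, g2, axis=1).pvalue   # crude per-gene test

# Filter 1: overall mean expression -- independent of the test statistic.
keep_mean = counts.mean(axis=1) > 10
# Filter 2: observed fold change -- NOT independent of the test statistic.
lfc = np.log2((g1.mean(axis=1) + 1) / (g2.mean(axis=1) + 1))
keep_lfc = np.abs(lfc) > 0.5

# Under the null, surviving p-values should stay roughly uniform:
print((pvals[keep_mean] < 0.05).mean())  # stays near 0.05
print((pvals[keep_lfc] < 0.05).mean())   # far above 0.05 -- dragons
```

The asymmetry is the whole point: the mean filter never looks at the group difference, while the fold-change filter is nearly a monotone function of the very statistic being tested.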
comment in response to post
filtering in e.g. DESeq2. But that approach has a lot of statistical machinery behind it, discussed in great detail in 2010, and independence of the filtering criterion and the test statistic seems to be mandatory.
comment in response to post
This seems like an amazing and novel approach to doing NHST I have never once seen in a paper before. So perhaps the right solution here is to write a paper benchmarking and validating the method instead of baldly insisting it makes sense.
Intuitively, it looks superficially similar to independent
comment in response to post
It's no good
comment in response to post
which is to say: ideas certainly are cheap! But when the range of implemented products (I do not even ask for usable ones) is so spectacularly undiverse, there is either (1) ideation failure or (2) implementation failure, across a whole field, for a decade. I suspect (1), but neither is good
comment in response to post
or for that matter
John's peer in famous foursome (0, e.g.; 4)
comment in response to post
if one prefers so-called "legal" clues:
John's peer in famous foursome (0-0-5)
(may or may not be unconsciously stolen from a @frisco17.bsky.social Oh No!-meration)