chatgtp.bsky.social
Machine learning for molecular biology. ELLIS PhD student at Fabian Theis lab. EPFL alumnus.
27 posts
1,763 followers
4,096 following
comment in response to
post
At this point, the art of detecting which claims and publications are overhyped is a core research skill.
comment in response to
post
It was the first publication I had the chance to work on, back when I was an MSc student. I was lucky to be mentored by Slavica Dimitrieva, who led the project, and to work on it with Eric Durand. Both inspired me to continue on the bio-ML trajectory 🚀
comment in response to
post
The speaker was describing some situation of student misconduct and without any reason or justification mentioned the nationality of the student.
comment in response to
post
4️⃣ “A benchmark for prediction of transcriptomic responses to chemical perturbations across cell types”
@chatgtp.bsky.social
neurips.cc/virtual/2024...
comment in response to
post
I’m told burnout comes less from having too much to do than from feeling like what you have to do is out of your control (and/or unpleasant). So, working out substantial changes in what you’re obligated to do is the best way out, making space for something new and interesting!
comment in response to
post
Thanks for compiling. Happy to join the list!
comment in response to
post
It wouldn't have been possible without the Kaggle competitors who contributed their solutions and our collaborators who helped implement them into the platform. 🙏
comment in response to
post
Thanks to a great co-lead Andrew Benz, supervisors Daniel Burkhardt, Malte Luecken, @fabiantheis.bsky.social, help with OP from Robrecht Cannoodt, and everyone involved!
@chanzuckerberg.bsky.social and Cellarity for funding to generate data, Kaggle for competition, and SaturnCloud for compute. 🧵8/8
comment in response to
post
The best models' predictions are still far from ground truth, but we anticipated this room for growth: the platform is a living benchmark, and new methods can easily be integrated into the leaderboard via contributions on GitHub github.com/openproblems... . We're open to suggestions! 🧵7/8
comment in response to
post
We implemented the winning Kaggle competition methods in our Open Problems Perturbation Prediction (OP3) platform. It has a robust evaluation with baseline methods and dataset bootstrapping. Simple NNs (with a few caveats) perform best, and drugs with larger effects are harder to predict. 🧵6/8
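As a rough illustration of dataset bootstrapping for robust evaluation (the platform's actual procedure may differ; the function and metric below are illustrative, not from OP3), a metric can be resampled over observations to estimate its uncertainty:

```python
import numpy as np

def bootstrap_scores(y_true, y_pred, metric, n_boot=100, seed=0):
    """Bootstrap a metric over samples to get a mean and uncertainty estimate."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    return float(np.mean(scores)), float(np.std(scores))

# toy usage: mean absolute error of a perfect prediction is 0 with 0 spread
mae = lambda t, p: float(np.mean(np.abs(t - p)))
y = np.arange(10.0)
m, s = bootstrap_scores(y, y, mae)
```

Resampling the evaluation set like this gives confidence intervals for leaderboard scores instead of a single point estimate.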
comment in response to
post
We used this setup in a Kaggle competition (25k submissions, 1.3k competitors). It sourced models and feedback from competitors, which we used to refine the dataset and benchmark: filtering, cell type annotation, and estimation of perturbation effects.
competition: www.kaggle.com/competitions... 🧵5/8
comment in response to
post
Single-cell perturbation readouts have batch effects and a low signal-to-noise ratio. DEG analysis with GLMs and replicates helps, but we still need to choose a representation of perturbation effects, so we developed a “cross-donor retrieval” metric to evaluate such representations. 🧵4/8
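The post doesn't spell out the metric, but one plausible sketch of a "cross-donor retrieval" score is nearest-neighbour retrieval of the same perturbation's effect across donors (the function name and the cosine-similarity choice are illustrative assumptions, not the paper's exact definition):

```python
import numpy as np

def cross_donor_retrieval(effects_a, effects_b):
    """Fraction of perturbations in donor A whose nearest neighbour
    (by cosine similarity) among donor B's effects is the same perturbation.
    effects_a, effects_b: (n_perturbations, n_genes), rows aligned."""
    a = effects_a / np.linalg.norm(effects_a, axis=1, keepdims=True)
    b = effects_b / np.linalg.norm(effects_b, axis=1, keepdims=True)
    sim = a @ b.T  # cosine similarity between all perturbation pairs
    return float(np.mean(sim.argmax(axis=1) == np.arange(len(a))))

# toy usage: a noisy replicate of donor A's effects should retrieve well
rng = np.random.default_rng(0)
effects_a = rng.normal(size=(20, 50))                      # 20 perturbations, 50 genes
effects_b = effects_a + 0.1 * rng.normal(size=(20, 50))    # noisy "second donor"
score = cross_donor_retrieval(effects_a, effects_b)
```

The intuition: a good effect representation should make the same perturbation look similar across donors, despite donor-specific batch effects.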
comment in response to
post
We generated a single-cell dataset of 146 drug perturbations in PBMCs of 3 human donors. We used it to benchmark perturbation effect predictions for held-out (cell type, compound) pairs. Perturbation effects are derived from DEG analysis: treatment-vs-control contrasts in a generalized linear model. 🧵3/8
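A minimal sketch of a per-gene treatment-vs-control contrast with a linear model (ordinary least squares stands in here for the GLM actually used; all names and data are illustrative):

```python
import numpy as np

def perturbation_effect(expr, treated):
    """Estimate per-gene treatment effects with a linear model.

    expr:    (n_cells, n_genes) log-normalized expression
    treated: (n_cells,) boolean treatment indicator
    Returns the treatment coefficient per gene
    (the treatment-vs-control contrast)."""
    X = np.column_stack([np.ones(len(treated)), treated.astype(float)])
    # least-squares fit of expr ~ 1 + treatment for all genes at once
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return beta[1]  # row 1: treatment coefficient

# toy usage: gene 0 is upregulated by treatment, genes 1-2 are not
rng = np.random.default_rng(0)
treated = np.repeat([False, True], 50)
expr = rng.normal(size=(100, 3))
expr[treated, 0] += 2.0
effects = perturbation_effect(expr, treated)
```

With an intercept and a treatment indicator, the fitted coefficient reduces to the difference in group means; a real DEG pipeline would add donor/batch covariates and a count-appropriate GLM family.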
comment in response to
post
The chemical and biological space of possible perturbations is very large. Thus, methods try to learn from a fraction of possible experiments and infer the rest. However, existing perturbation datasets are limited in size and suffer from data quality issues. 🧵2/8
comment in response to
post
Genomics, Evolution, and More @jlsteenwyk.bsky.social bsky.app/starter-pack...
comment in response to
post
I'm a PhD student at Theislab, working on ML applications in omics with a focus on small-molecule perturbation modeling. I'm interested in applications of the above to cancer treatment (FPM).
comment in response to
post
For a list of sc-transformers with descriptions, check out github.com/theislab/sin....
(7/7)
comment in response to
post
While sc transformers are large compared to other sc models, they are tiny compared to LLMs: 650M vs 405B params. One way to leverage other diverse and abundant data is by training and using LLMs on sc tasks. (6/7)
comment in response to
post
The appeal of transformers is generalization across a variety of tasks and data, yet we highlight that in independent benchmarks they often lag behind specialized architectures. Maybe there's not enough diverse data; maybe we need different data preprocessing or models. (5/7)
comment in response to
post
Non-sequential (tabular) omics data requires preprocessing, and there are many possibilities. We highlight 3 dominant approaches: rank-based (iSEEEK, Geneformer), value binning (scBERT, scGPT), and value projection (TOSICA, CellPLM). (4/7)
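The three preprocessing schemes can be sketched on a toy expression vector (the bin edges, embedding dimension, and projection matrix below are illustrative assumptions, not the cited models' actual settings):

```python
import numpy as np

expr = np.array([0.0, 5.2, 1.1, 3.3])  # toy expression values for 4 genes

# 1) rank-based (Geneformer-style): order gene IDs by decreasing expression
rank_tokens = np.argsort(-expr)

# 2) value binning (scBERT/scGPT-style): discretize values into bin tokens
bin_edges = np.array([0.5, 2.0, 4.0])
bin_tokens = np.digitize(expr, bin_edges)

# 3) value projection (TOSICA/CellPLM-style): project each scalar to a vector
d_model = 8
W = np.random.default_rng(0).normal(size=(1, d_model))
value_embeddings = expr[:, None] @ W  # (n_genes, d_model)
```

Rank-based inputs discard magnitudes but are robust to normalization; binning keeps coarse magnitudes as a discrete vocabulary; projection keeps continuous values but gives up a token vocabulary.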
comment in response to
post
Unlike the autoencoders popular in the field, transformers take as input a set or a variable-length sequence of embeddings. Transformers rely on the attention mechanism and can be trained with MLM or NTP, but neither of these yields per-cell embeddings. (3/7)
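One common workaround for the missing per-cell embedding (a generic pooling trick, not specific to any model in the review) is to pool the per-token transformer outputs, e.g. by mean pooling:

```python
import numpy as np

# token_embeddings: transformer output, one vector per gene token
# (random stand-in here; a real model would produce these)
token_embeddings = np.random.default_rng(1).normal(size=(2000, 64))

# mean pooling collapses the variable-length token sequence
# into a single fixed-size per-cell embedding
cell_embedding = token_embeddings.mean(axis=0)
```

Alternatives include a dedicated [CLS]-style summary token or max pooling; the key point is that the pooling step is added on top of the MLM/NTP-trained backbone.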
comment in response to
post
For 7 years now, transformers have been taking over more and more fields, from NLP through image and speech processing to protein folding. Is it THE architecture for modeling non-sequential single-cell omics as well? Maybe we just need to make the data sequential? (2/7)
comment in response to
post
Even worse, it means the code is not sufficient to reproduce the scores even if you run it from scratch.