stevenjrobbins.bsky.social
Do my science @ace_uq studying coral reef microbiomes. Data wrangler, meta-omics and long-read wonk, clean energy enthusiast, Saganist zealot, collector of weird zoology facts, other nonsense.
593 posts 3,305 followers 446 following
comment in response to post
Only annoying thing is if you terminate at stop X and return via a different stop Y, it charges you the maximum, as if you rode all over the city. So if you’re commuting, it’s cheaper to return via the same stop at which you terminated.
comment in response to post
So a ticket is unnecessary. Can only speak directly to London, though.
comment in response to post
In London you literally just tap on and off the platform with your credit card for normal commuter trains and it records where you tap on and off and charges accordingly. I thought it was like that in most of the UK.
comment in response to post
It seems like today’s Resistance History talk from Tad Stoermer (Johns Hopkins) is instructive about how the Nazis first used laws about citizenship, then & deportation as the predicate steps to genocide in concentration camps. It’s about messaging & dehumanizing over time. Just like Trump
comment in response to post
CRISPR spacers are known to assemble poorly from metaG, so you need to look at the reads. Here we built a high-throughput tool (code.jgi.doe.gov/SRoux/spacerextractor) and mined ~800 million spacers from SRA metaG reads, all linked to a repeat and a sample (with taxonomy and metadata when possible).
comment in response to post
Is the other one Genomad? At this point, I feel like if one wanted to be the most confident in their viral set, one would just run Genomad and FDR correction and be done with it.
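A minimal sketch of what "FDR correction" on a viral classifier's output could look like, assuming each contig has a calibrated score that can be read as the probability of being viral. The function name, the input dictionary, and the 0.05 cutoff are all hypothetical; this is not geNomad's own implementation or command-line interface.

```python
def select_by_expected_fdr(scores, max_fdr=0.05):
    """Keep the highest-scoring contigs while the expected FDR stays <= max_fdr.

    scores: dict mapping contig ID -> calibrated probability of being viral
            (assumed input; not a real tool's output format).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    expected_false = 0.0
    for n, (contig_id, prob) in enumerate(ranked, start=1):
        # Each kept contig contributes (1 - prob) expected false positives.
        expected_false += 1.0 - prob
        # Scores are sorted descending, so the running mean never decreases
        # and it is safe to stop at the first violation.
        if expected_false / n > max_fdr:
            break
        kept.append(contig_id)
    return kept
```

The idea is that the expected fraction of false positives in a set is the mean of (1 - probability) over that set, so the cutoff falls where that mean crosses the chosen FDR.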
comment in response to post
You’re 100% right. 🙂 In that, I’m not criticizing the CheckV authors. It’s just a really hard problem for the reasons you point out. I’m a little surprised, though, that it doesn’t seem to filter out contigs that have a higher number of host than viral markers, as a broad sanity check.
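The broad sanity check described above could be sketched roughly as follows. The record type and field names are assumptions chosen for illustration (they would need to be mapped onto whatever marker counts your pipeline reports), and this is a post-hoc filter, not something CheckV does itself.

```python
from dataclasses import dataclass


@dataclass
class ContigMarkers:
    # Hypothetical record; field names are illustrative only.
    contig_id: str
    viral_genes: int
    host_genes: int


def passes_sanity_check(contig):
    """Reject contigs with no viral markers or with more host than viral markers."""
    return contig.viral_genes > 0 and contig.host_genes <= contig.viral_genes


def filter_contigs(contigs):
    """Return only the contigs that pass the host-vs-viral marker check."""
    return [c for c in contigs if passes_sanity_check(c)]
```

Requiring at least one viral marker is an extra assumption here; dropping that condition keeps contigs that have no markers of either kind.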
comment in response to post
Sub-thread 4: In constructing the viral (vMAG) component of the GBR-MGD, we find that some commonly used tools for viral metagenomics are unsuitable for use on long-read metagenome assemblies. bsky.app/profile/stev...
comment in response to post
We saw this same issue with ML-based "deep" classifiers for plasmids. DeepPlasmid, PlasClass, and Mobile-OG-db showed similar results when plotting Genomad's plasmid, chromosomal, and viral markers. Contigs predicted by these tools showed higher enrichment in chromosomal and viral markers than those from other tools.
comment in response to post
It surprised us how fragile this sort of viral pipeline is and how different CheckV is conceptually from CheckM. You can use CheckM to tell you whether something is a good pMAG; you really can't use CheckV like that. It can only give you a completeness estimate, trusting that the contig is viral.
comment in response to post
Interesting! Would you mind sharing your threshold? We've noticed that long reads break a few machine learning based tools in this same way, for plasmids as well. Seems like if these tools are to be useful, new benchmarking has to be done to establish parameters that make sense for long contigs.
comment in response to post
I've noticed the same with DVF and CheckV for long contigs. My way of dealing with it was to use a very high score threshold for DVF to remove non-viral contigs. Of course, shorter true viral contigs are also lost, but I'm using multiple predictors, so hopefully that compensates for their loss
comment in response to post
So we moved forward with the 5 remaining viral identification tools to create the GBR-MGD vMAGs and recommend avoiding DeepVirFinder for long reads. We advocate taking care to investigate machine learning-based tools when used on data types they're not benchmarked on--here, long read vs short read.
comment in response to post
What's also interesting is that if you look at the same plot for Illumina-only metagenomes, this issue becomes much less pronounced, simply because the contigs are much shorter and do not often reach the range of erroneous assignment. So on short-read metagenomes, DeepVirFinder/CheckV may be fine.
comment in response to post
but it shows that if you fed this high proportion of erroneously assigned "viral" contigs to CheckV, it would tell you that you have a lot of high quality viruses that aren't. This result surprised us and I'm happy to take feedback on it.
comment in response to post
You can see that, as contig length increases, DeepVirFinder's chance of designating a non-viral contig as viral goes up, as does the chance of CheckV assigning a high quality score to that non-viral long contig. CheckV does not assess contamination, only completeness, so this might be expected,
comment in response to post
Most meta-omic viral identification tools are tested on short-read metagenomes. We wanted to see if any looked wonky on long-read contigs. Most tools hold up, DeepVirFinder didn't. Plot shows the ratio of CheckV host to viral markers vs contig length for ONT assemblies, colored by CheckV quality.
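A rough sketch of how the quantity in that plot could be summarised numerically: the host-to-viral marker ratio per contig, grouped by order of magnitude of contig length. The input tuple layout and function names are assumptions for illustration, not the code used to make the original figure.

```python
import math
from collections import defaultdict
from statistics import median


def host_to_viral_ratio(host_genes, viral_genes):
    """Ratio of host to viral marker counts; None when a contig has no markers."""
    if viral_genes == 0:
        return math.inf if host_genes > 0 else None
    return host_genes / viral_genes


def median_ratio_by_length_bin(records):
    """Median host:viral ratio per log10 length bin.

    records: iterable of (contig_length, host_genes, viral_genes) tuples,
             with contig_length > 0 (assumed layout).
    Returns a dict such as {3: ..., 5: ...}, where key 3 covers 1-10 kb.
    """
    bins = defaultdict(list)
    for length, host, viral in records:
        ratio = host_to_viral_ratio(host, viral)
        if ratio is None:
            continue  # no markers at all, nothing to compare
        bins[int(math.log10(length))].append(ratio)
    return {b: median(ratios) for b, ratios in sorted(bins.items())}
```

If the median ratio climbs with the length bin for one tool's predictions but not for the others, that is the pattern described above.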
comment in response to post
@xrefugee13.bsky.social
comment in response to post
Heck yeah! I’m gonna shamelessly plug our recent preprint here in the spirit of WYMM. Kind of the first third is everything you’re missing with short reads, at least in seawater. bsky.app/profile/stev...
comment in response to post
Because traditional short-read platforms struggle with low GC/high strain diversity, NTMR indicator taxa like Pelagibacter/SAR86 would have been invisible to genomic analysis without @nanoporetech.com long reads. The ability to identify interesting biological insights required a comprehensive DB.
comment in response to post
And indeed, we find that NTMR reefs show statistically less nitrate than Fished reefs. We don’t really understand why, but the establishment of No Take Marine Reserves appears to lead to lower dissolved nitrogen, selecting for streamlined taxa with low GC.
comment in response to post
But why do indicators of NTMRs seem to always show streamlined genomes? The Giovannoni model would say that streamlining (low GC, smaller genomes) occurs in large population size, low nutrient communities, specifically low nitrogen, because G and C bases require more nitrogen. Low GC is efficient.
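To put a number on the nitrogen argument: a G:C base pair contains 8 nitrogen atoms (guanine 5, cytosine 3), while an A:T pair contains 7 (adenine 5, thymine 2). A small sketch, with hypothetical function names, that turns a GC fraction into average nitrogen atoms per base pair:

```python
# Nitrogen atoms per nucleobase (from the molecular formulas of A, T, G, C).
N_ATOMS = {"A": 5, "T": 2, "G": 5, "C": 3}


def gc_fraction(seq):
    """Fraction of unambiguous bases (A/C/G/T) that are G or C."""
    counted = [b for b in seq.upper() if b in N_ATOMS]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return sum(b in "GC" for b in counted) / len(counted)


def nitrogen_per_base_pair(gc):
    """Average nitrogen atoms in the bases of one base pair at a given GC fraction."""
    return 8 * gc + 7 * (1 - gc)
```

This counts only the nitrogen in the DNA bases themselves. It ignores any knock-on effect of GC content on the amino acid composition of encoded proteins, so it is a lower bound on the difference rather than a full budget.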
comment in response to post
Sub-thread 3: using the GBR-MGD, we identify microbial indicators of reef management status, most of which are from low-GC taxa, “streamlined” taxa that could not be recovered using short-read sequencing. bsky.app/profile/stev...
comment in response to post
In contrast, indicators of fished reefs all had larger, higher GC genomes than the rest of their phylogenetic group. For example, the genera UBA11663, UBA8752, and species UBA10364 sp003445735 in the class Bacteroidia had genomes with 45-63 GC%, 9-27% higher than the rest of the Bacteroidia.
comment in response to post
The thing that stuck out about all microbes indicative of NTMRs is that they are exactly the microbes that could not be recovered using traditional short reads—e.g. Pelagibacteraceae, SAR86, HIMB59, the phylum Marinisomatota. All with low-GC, streamlined genomes.
comment in response to post
By mapping the metagenomic reads against the GBR-MGD MAGs to calculate relative abundance, we identify microbial taxa that can predict whether a reef is an NTMR or fished reef with 71% accuracy. Note that we didn’t have an extensive sample catalog; this number would likely rise with more samples.
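One common way to turn read mapping into relative abundance is to normalise mapped read counts by genome length and then rescale so the values sum to 1. The sketch below assumes that approach; the post does not say which normalisation was actually used, and the function and argument names are hypothetical.

```python
def relative_abundance(mapped_reads, genome_lengths):
    """Length-normalised relative abundance per MAG.

    mapped_reads:   dict of MAG ID -> number of reads mapped to that MAG
    genome_lengths: dict of MAG ID -> genome length in bp
    Both inputs are assumed to come from your own mapping step.
    """
    # Reads per base corrects for larger genomes recruiting more reads.
    per_base = {mag: mapped_reads[mag] / genome_lengths[mag] for mag in mapped_reads}
    total = sum(per_base.values())
    if total == 0:
        raise ValueError("no reads mapped to any MAG")
    return {mag: value / total for mag, value in per_base.items()}
```

Without the length correction, a streamlined genome would look less abundant than a large genome present at the same cell count, which matters when the taxa of interest differ in genome size.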