rrwick.bsky.social
Bioinformatician at the Centre for Pathogen Genomics at the University of Melbourne
59 posts
1,021 followers
104 following
Regular Contributor
Active Commenter
comment in response to
post
Good to know - thanks for clarifying. Makes sense for a tool that's designed to work with big metagenomic datasets. I'm using it a bit out of its domain on a bacterial isolate.
comment in response to
post
The read set was only 240 MB (gzipped), so yes, it is memory hungry relative to the input size. The myloasm docs do acknowledge that it uses more memory than other assemblers.
Also, I ran my tests on an ARM Mac, but the docs suggest that myloasm (specifically the polishing step) will be even faster on x86-64 CPUs with AVX2.
comment in response to
post
Just ran a few more tests through GNU time:
1 thread: 435 seconds, 10.1 GB RAM
2 threads: 238 seconds, 10.0 GB RAM
4 threads: 133 seconds, 10.1 GB RAM
8 threads: 73 seconds, 10.1 GB RAM
16 threads: 49 seconds, 13.3 GB RAM
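If you want to measure this yourself, GNU time works like this (the myloasm arguments here are placeholders, not its actual interface):
```bash
# GNU time's -v flag reports wall-clock time and peak memory
# ("Maximum resident set size"). On macOS, the Homebrew build of
# GNU time is installed as gtime rather than time.
gtime -v myloasm <reads and options> 2> time_log.txt
```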
comment in response to
post
I tested myloasm on a 50x Klebsiella isolate, and it was very fast - only took about 1 minute to complete (on my MacBook).
comment in response to
post
Even though it's designed for metagenomes, I suspect it might work nicely on bacterial isolates as well, especially if you filter out low-depth contigs. This is what I've found for metaMDBG, a long-read metagenome assembler from @gaetanbenoit.bsky.social.
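If you want to try the low-depth filtering, here's a rough sketch (the `depth=` header format is hypothetical - check what your assembler actually writes):
```bash
# Keep only contigs whose header reports a depth of at least 10x.
# Assumes headers like ">contig_1 depth=52.0x".
awk '/^>/ {keep = (match($0, /depth=[0-9.]+/) && substr($0, RSTART+6, RLENGTH-6) + 0 >= 10)} keep' \
    assembly.fasta > filtered.fasta
```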
comment in response to
post
I have 🤔
comment in response to
post
Yes, I used 6mA,4mC_5mC when basecalling. I didn't see any 4mC calls in the methylated_sites.tsv file. Also couldn't see anything suspicious in the bedMethyl file.
The motif is often GAGCTC, but GAGCTA, GAGCTG and GAGCTT came up too 🤔
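For anyone wanting to replicate: Dorado takes the mods as part of the model argument, something like this (the pod5 directory is a placeholder):
```bash
# sup model plus the 6mA and 4mC_5mC modified-base models.
# The methylation calls end up as MM/ML tags in the output BAM.
dorado basecaller sup,6mA,4mC_5mC pod5s/ > calls.bam
```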
comment in response to
post
Reference genomes and assemblies/reads for [email protected] basecalling are available as supp data for the Autocycler preprint:
github.com/rrwick/Autoc...
If you're interested in the other reads ([email protected], [email protected], [email protected]), I'm happy to share - send me an email!
comment in response to
post
For example, in the Klebsiella assemblies, there was often an error in the GAGCT motif (or its rev-comp AGCTC). I suspected this could be methylation, but I ran @acritschristoph.bsky.social's MicrobeMod and couldn't find anything at those sites. So a mystery to me 🤷
comment in response to
post
Most of the remaining errors are homopolymer-length errors, e.g. the genome had G×11 but the assembly had G×10. The rest are mostly 1-bp substitutions and indels. Often these occur at similar motifs within a genome.
comment in response to
post
The curved lines at the top-left of the plot are reads with 1 error, 2 errors, 3 errors, etc. The vertical lines in the plot correspond to the sizes of small plasmids in this dataset. Reads which span the entirety of the plasmid are common and create these peaks in the read length distribution.
comment in response to
post
I don't know what's behind that bump, but I'm very curious! I've seen it before, so I think it's an ONT thing (not a this-run thing). Here's a qscore-vs-length plot which shows that this bimodal distribution occurs at all read lengths.
comment in response to
post
Since it uses multiple input assemblies, Autocycler is more computationally demanding than other assembly approaches. But when accuracy matters, it's worth it!
(6/6)
comment in response to
post
Speaking of small plasmids, shout out to Plassembler by @gbouras13.bsky.social! It's the only tool I know of that can reliably assemble small plasmids from long-read data, and I use it with Autocycler to improve small plasmid recovery.
github.com/gbouras13/pl...
(5/6)
comment in response to
post
While Autocycler can be fully automated, it also allows for manual intervention when accuracy is paramount. For example, Autocycler still sometimes misses small plasmids, but manually revising its clusters can fix this.
(4/6)
comment in response to
post
Benchmark TLDR:
• Flye and Canu had the fewest sequence errors of the individual assemblers.
• All single-tool assemblies were prone to structural issues (e.g. missing small plasmids).
• Autocycler assemblies had lower error rates and better structural accuracy than other methods.
(3/6)
comment in response to
post
If you're interested in using Autocycler, the docs are probably more useful than the preprint:
github.com/rrwick/Autoc...
But if you want to see how it compares to other assemblers/pipelines, the preprint has a benchmark.
(2/6)
comment in response to
post
It's a replacement for `conda env list` that shows more useful info: a description, last update date, and versions of key tools. You just add a simple `envinfo.txt` file to each environment to define what you care about.
comment in response to
post
I wasn't aware of ska lo (looks like it was released just a few weeks ago), so thanks for bringing it to my attention. It certainly sounds like it would improve recall - I'll need to give it a try!
comment in response to
post
So no, I don't think variant calling offers a particularly distinct way to improve an assembly beyond what polishing already does.
comment in response to
post
Genome polishing and variant calling are quite similar - both involve using reads to detect differences from a reference and then applying corrections. A classic example is short-read polishing of long-read assemblies, which works great.
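Using Polypolish as an example, that workflow looks roughly like this (flags from memory, so check its docs; filenames are placeholders):
```bash
# Align short reads to the long-read draft, keeping all alignments
# (-a) so repeat placements can be weighed, then polish.
bwa index draft.fasta
bwa mem -t 8 -a draft.fasta reads_1.fastq.gz > aln_1.sam
bwa mem -t 8 -a draft.fasta reads_2.fastq.gz > aln_2.sam
polypolish polish draft.fasta aln_1.sam aln_2.sam > polished.fasta
```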
comment in response to
post
P.S. while Louise Judd didn't have a Bluesky account when I posted this, she does now: juddlmj.bsky.social. You should follow her!
comment in response to
post
And as always, many thanks to my co-authors (Louise Judd, @tstinear.bsky.social and @ianmonknz.bsky.social) for all their hard work on this!
(8/8)
comment in response to
post
Another question for future work: Assemblies might not be great for calling SNPs and small indels, but what about larger structural variants? Long-read assemblies in particular could be really powerful here.
(7/8)
comment in response to
post
What if you don't have reads and so must call variants from assemblies? Our study used simple approaches (e.g. synthetic reads), but more sophisticated methods (e.g. masking repeats) could help reduce false positives. We didn't tackle this here, but it's a great follow-up question.
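To make the synthetic-read idea concrete, here's one possible sketch with wgsim (these exact parameters are illustrative, not what we used in the study):
```bash
# Shred the assembly into error-free synthetic read pairs, then run
# them through a normal read-based variant caller. Zeroing -e/-r/-R
# means no simulated errors or mutations, so any variant calls come
# from the assembly sequence itself.
wgsim -e 0 -r 0 -R 0 -N 500000 -1 150 -2 150 \
    assembly.fasta reads_1.fastq reads_2.fastq
```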
(6/8)
comment in response to
post
So unless you're very confident in your assemblies (or can tolerate errors), stick with traditional read-based variant calling. But as sequencing tech and assemblers improve, we may reach a point where assemblies are reliably error-free - making assembly-based variant calling a real option.
(5/8)
comment in response to
post
Results? As expected, if your assemblies are error-free, assembly-based variant calling works great. But any assembly errors = false positives! Hybracter assemblies (ONT+Illumina) performed well, but all other assembly methods introduced false positives - sometimes hundreds.
(4/8)
comment in response to
post
That led to the bigger question: How well does assembly-based variant calling work for more typical assemblies?
So I expanded the study: multiple assemblers, sequencing platforms, depths and variant-calling pipelines, all benchmarked against a ground truth dataset.
(3/8)
comment in response to
post
This project started last year when I assembled a set of closely related Staph genomes and wondered - if I MSAed the assemblies and directly called variants, would I get the exact same results as Snippy?
The answer: Yes! But these were very high-quality assemblies...
(2/8)
comment in response to
post
Here's the genome: www.ncbi.nlm.nih.gov/datasets/gen...
I ran CheckM v1.2.3 locally and got 99.51% completeness and 0.64% contamination. CheckM2 v1.0.2 did a bit better: 100.0% completeness and 0.14% contamination.
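The commands were along these lines (directory names are placeholders, and check each tool's docs for the exact options):
```bash
# CheckM v1: lineage-specific marker-gene workflow.
checkm lineage_wf -x fasta genomes/ checkm_out/
# CheckM2: machine-learning-based completeness/contamination estimates.
checkm2 predict -x fasta --input genomes/ --output-directory checkm2_out/
```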
comment in response to
post
Dorado can do assembly polishing. So you give it an input draft assembly (e.g. made by Flye) and ONT reads, and it outputs a polished version of the assembly which (hopefully) has fewer errors than the input.
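In command form, it's something like this (from memory, so check the Dorado docs; filenames are placeholders):
```bash
# Align the ONT reads to the draft, then polish. The polisher expects
# an aligned BAM (sorting/indexing may be needed) rather than raw reads.
dorado aligner draft.fasta reads.bam > aligned.bam
dorado polish aligned.bam draft.fasta > polished.fasta
```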
comment in response to
post
To make producing soft core alignments easier, we developed Core-SNP-filter, a simple and efficient tool to process SNP alignments with user-defined thresholds.
github.com/rrwick/Core-...
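Typical usage looks like this (flags from memory, so check the README):
```bash
# Keep sites where at least 95% of genomes have a base (-c 0.95) and
# drop invariant sites (-e), leaving a SNP-only core alignment.
coresnpfilter -e -c 0.95 core.full.aln > filtered.aln
```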
(4/4)
comment in response to
post
And the benefits grow with dataset size! A 100% strict core may work fine for small datasets (e.g. ~10 genomes) but is devastating for very large ones (e.g. 1000+ genomes). A 95% soft core works well across all dataset sizes.
(3/4)
comment in response to
post
Our key finding: a 95% soft core (allowing up to 5% missing data per site) is usually better than a 100% strict core. It retains more information, often leading to better phylogenetic resolution and stronger temporal signal.
(2/4)
comment in response to
post
The next version of Autocycler will contain some improvements for dealing with cases like this. It will still take a bit of manual intervention (fully automated handling of linear sequences is a longer-term goal), but the new version should help. Stay tuned 🙂
comment in response to
post
But if for whatever reason long-read assemblers struggle, then you might want to stick with Unicycler. Linear plasmids can be difficult to get right, and that's definitely a focus of mine for future development of Autocycler.