rrwick.bsky.social
Bioinformatician at the Centre for Pathogen Genomics at the University of Melbourne
59 posts
1,021 followers
104 following
Regular Contributor
Active Commenter
comment in response to
post
Good to know - thanks for clarifying. Makes sense for a tool that's designed to work with big metagenomic datasets. I'm using it a bit out of its domain on a bacterial isolate.
comment in response to
post
The read set was only 240 MB (gzipped), so yes, it is memory hungry relative to the input size. The myloasm docs do acknowledge that it uses more memory than other assemblers.
Also, I ran my tests on an ARM Mac, but the docs suggest that myloasm (specifically the polishing step) will be even faster on x86-64 CPUs with AVX2.
comment in response to
post
Just ran a few more tests through GNU time:
1 thread: 435 seconds, 10.1 GB RAM
2 threads: 238 seconds, 10.0 GB RAM
4 threads: 133 seconds, 10.1 GB RAM
8 threads: 73 seconds, 10.1 GB RAM
16 threads: 49 seconds, 13.3 GB RAM
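If you want to measure this yourself, GNU time works like this (the myloasm arguments here are placeholders, not its actual interface):
```bash
# GNU time's -v flag reports wall-clock time and peak memory
# ("Maximum resident set size"). On macOS, the Homebrew build of
# GNU time is installed as gtime rather than time.
gtime -v myloasm <reads and options> 2> time_log.txt
```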
comment in response to
post
I tested myloasm on a 50x Klebsiella isolate, and it was very fast - only took about 1 minute to complete (on my MacBook).
comment in response to
post
Even though it's designed for metagenomes, I suspect it might work nicely on bacterial isolates as well, especially if you filter out low-depth contigs. This is what I've found for metaMDBG, a long-read metagenome assembler from @gaetanbenoit.bsky.social.
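If you want to try the low-depth filtering, here's a rough sketch (the `depth=` header format is hypothetical - check what your assembler actually writes):
```bash
# Keep only contigs whose header reports a depth of at least 10x.
# Assumes headers like ">contig_1 depth=52.0x".
awk '/^>/ {keep = (match($0, /depth=[0-9.]+/) && substr($0, RSTART+6, RLENGTH-6) + 0 >= 10)} keep' \
    assembly.fasta > filtered.fasta
```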
comment in response to
post
I have 🤔
comment in response to
post
Yes, I used 6mA,4mC_5mC when basecalling. I didn't see any 4mC calls in the methylated_sites.tsv file. Also couldn't see anything suspicious in the bedMethyl file.
The motif is often GAGCTC, but GAGCTA, GAGCTG and GAGCTT came up too 🤔
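For anyone wanting to replicate: Dorado takes the mods as part of the model argument, something like this (the pod5 directory is a placeholder):
```bash
# sup model plus the 6mA and 4mC_5mC modified-base models.
# The methylation calls end up as MM/ML tags in the output BAM.
dorado basecaller sup,6mA,4mC_5mC pod5s/ > calls.bam
```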
comment in response to
post
Reference genomes and assemblies/reads for [email protected] basecalling are available as supp data for the Autocycler preprint:
github.com/rrwick/Autoc...
If you're interested in the other reads ([email protected], [email protected], [email protected]), I'm happy to share - send me an email!
comment in response to
post
For example, in the Klebsiella assemblies, there was often an error in the GAGCT motif (or its rev-comp AGCTC). I suspected this could be methylation, but I ran @acritschristoph.bsky.social's MicrobeMod and couldn't find anything at those sites. So a mystery to me 🤷
comment in response to
post
Most of the remaining errors are homopolymer-length errors, e.g. the genome had G×11 but the assembly had G×10. The rest are mostly 1-bp substitutions and indels. Often these occur at similar motifs within a genome.
comment in response to
post
The curved lines at the top-left of the plot are reads with 1 error, 2 errors, 3 errors, etc. The vertical lines in the plot correspond to the sizes of small plasmids in this dataset. Reads which span the entirety of the plasmid are common and create these peaks in the read length distribution.
comment in response to
post
I don't know what's behind that bump, but I'm very curious! I've seen it before, so I think it's an ONT thing (not a this-run thing). Here's a qscore-vs-length plot which shows that this bimodal distribution occurs at all read lengths.
comment in response to
post
Since it uses multiple input assemblies, Autocycler is more computationally demanding than other assembly approaches. But when accuracy matters, it's worth it!
(6/6)
comment in response to
post
Speaking of small plasmids, shout out to Plassembler by @gbouras13.bsky.social! It's the only tool I know of that can reliably assemble small plasmids from long-read data, and I use it with Autocycler to improve small plasmid recovery.
github.com/gbouras13/pl...
(5/6)
comment in response to
post
While Autocycler can be fully automated, it also allows for manual intervention when accuracy is paramount. For example, Autocycler still sometimes misses small plasmids, but manually revising its clusters can fix this.
(4/6)
comment in response to
post
Benchmark TLDR:
• Flye and Canu had the fewest sequence errors of the individual assemblers.
• All single-tool assemblies were prone to structural issues (e.g. missing small plasmids).
• Autocycler assemblies had lower error rates and better structural accuracy than other methods.
(3/6)
comment in response to
post
If you're interested in using Autocycler, the docs are probably more useful than the preprint:
github.com/rrwick/Autoc...
But if you want to see how it compares to other assemblers/pipelines, the preprint has a benchmark.
(2/6)
comment in response to
post
It's a replacement for `conda env list` that shows more useful info: a description, last update date, and versions of key tools. You just add a simple `envinfo.txt` file to each environment to define what you care about.
comment in response to
post
I wasn't aware of ska lo (looks like it was released just a few weeks ago), so thanks for bringing it to my attention. It certainly sounds like it would improve recall - I'll need to give it a try!
comment in response to
post
So no, I don't think variant calling offers a particularly distinct way to improve an assembly beyond what polishing already does.
comment in response to
post
Genome polishing and variant calling are quite similar - both involve using reads to detect differences from a reference and then applying corrections. A classic example is short-read polishing of long-read assemblies, which works great.
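Using Polypolish as an example, that workflow looks roughly like this (flags from memory, so check its docs; filenames are placeholders):
```bash
# Align short reads to the long-read draft, keeping all alignments
# (-a) so repeat placements can be weighed, then polish.
bwa index draft.fasta
bwa mem -t 8 -a draft.fasta reads_1.fastq.gz > aln_1.sam
bwa mem -t 8 -a draft.fasta reads_2.fastq.gz > aln_2.sam
polypolish polish draft.fasta aln_1.sam aln_2.sam > polished.fasta
```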
comment in response to
post
P.S. while Louise Judd didn't have a Bluesky account when I posted this, she does now: juddlmj.bsky.social. You should follow her!
comment in response to
post
And as always, many thanks to my co-authors (Louise Judd, @tstinear.bsky.social and @ianmonknz.bsky.social) for all their hard work on this!
(8/8)
comment in response to
post
Another question for future work: Assemblies might not be great for calling SNPs and small indels, but what about larger structural variants? Long-read assemblies in particular could be really powerful here.
(7/8)
comment in response to
post
What if you don't have reads and so must call variants from assemblies? Our study used simple approaches (e.g. synthetic reads), but more sophisticated methods (e.g. masking repeats) could help reduce false positives. We didn't tackle this here, but it's a great follow-up question.
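To make the synthetic-read idea concrete, here's one possible sketch with wgsim (these exact parameters are illustrative, not what we used in the study):
```bash
# Shred the assembly into error-free synthetic read pairs, then run
# them through a normal read-based variant caller. Zeroing -e/-r/-R
# means no simulated errors or mutations, so any variant calls come
# from the assembly sequence itself.
wgsim -e 0 -r 0 -R 0 -N 500000 -1 150 -2 150 \
    assembly.fasta reads_1.fastq reads_2.fastq
```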
(6/8)
comment in response to
post
So unless you're very confident in your assemblies (or can tolerate errors), stick with traditional read-based variant calling. But as sequencing tech and assemblers improve, we may reach a point where assemblies are reliably error-free - making assembly-based variant calling a real option.
(5/8)
comment in response to
post
Results? As expected, if your assemblies are error-free, assembly-based variant calling works great. But any assembly errors = false positives! Hybracter assemblies (ONT+Illumina) performed well, but all other assembly methods introduced false positives - sometimes hundreds.
(4/8)
comment in response to
post
That led to the bigger question: How well does assembly-based variant calling work for more typical assemblies?
So I expanded the study: multiple assemblers, sequencing platforms, depths and variant-calling pipelines, all benchmarked against a ground truth dataset.
(3/8)
comment in response to
post
This project started last year when I assembled a set of closely related Staph genomes and wondered - if I MSAed the assemblies and directly called variants, would I get the exact same results as Snippy?
The answer: Yes! But these were very high-quality assemblies...
(2/8)
comment in response to
post
Here's the genome: www.ncbi.nlm.nih.gov/datasets/gen...
I ran CheckM v1.2.3 locally and got 99.51% completeness and 0.64% contamination. CheckM2 v1.0.2 did a bit better: 100.0% completeness and 0.14% contamination.
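The commands were along these lines (directory names are placeholders, and check each tool's docs for the exact options):
```bash
# CheckM v1: lineage-specific marker-gene workflow.
checkm lineage_wf -x fasta genomes/ checkm_out/
# CheckM2: machine-learning-based completeness/contamination estimates.
checkm2 predict -x fasta --input genomes/ --output-directory checkm2_out/
```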
comment in response to
post
Dorado can do assembly polishing. So you give it an input draft assembly (e.g. made by Flye) and ONT reads, and it outputs a polished version of the assembly which (hopefully) has fewer errors than the input.
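In command form, it's something like this (from memory, so check the Dorado docs; filenames are placeholders):
```bash
# Align the ONT reads to the draft, then polish. The polisher expects
# an aligned BAM (sorting/indexing may be needed) rather than raw reads.
dorado aligner draft.fasta reads.bam > aligned.bam
dorado polish aligned.bam draft.fasta > polished.fasta
```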
comment in response to
post
To make producing soft core alignments easier, we developed Core-SNP-filter, a simple and efficient tool to process SNP alignments with user-defined thresholds.
github.com/rrwick/Core-...
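Typical usage looks like this (flags from memory, so check the README):
```bash
# Keep sites where at least 95% of genomes have a base (-c 0.95) and
# drop invariant sites (-e), leaving a SNP-only core alignment.
coresnpfilter -e -c 0.95 core.full.aln > filtered.aln
```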
(4/4)
comment in response to
post
And the benefits grow with dataset size! A 100% strict core may work fine for small datasets (e.g. ~10 genomes) but is devastating for very large ones (e.g. 1000+ genomes). A 95% soft core works well across all dataset sizes.
(3/4)
comment in response to
post
Our key finding: a 95% soft core (allowing up to 5% missing data per site) is usually better than a 100% strict core. It retains more information, often leading to better phylogenetic resolution and stronger temporal signal.
(2/4)
comment in response to
post
The next version of Autocycler will contain some improvements for dealing with cases like this. It will still take a bit of manual intervention (fully automated handling of linear sequences is a longer-term goal), but the new version should help. Stay tuned 🙂
comment in response to
post
But if for whatever reason long-read assemblers struggle, then you might want to stick with Unicycler. Linear plasmids can be difficult to get right, and that's definitely a focus of mine for future development of Autocycler.