arun-das.bsky.social - Profile | ThreadSky | a Reddit-style client for Bluesky

comment in response to post

Thank you! It was awesome to talk to you too, and to learn about all the cool data and insights from your project!

submitted 28 days ago

The homepage for this work is here: github.com/arun96/South... This analysis can be replicated on any population of your choosing, and the WDL scripts used to run the various stages (as well as any other analysis details) can all be found on that page and in our pre-print.

submitted 29 days ago

Finally, we compare our placed contigs to loci associated with biomarker traits in the UK Biobank and East London Genes & Health Dataset, and find a number of positions where a placed contig is close to a significant locus.

submitted 29 days ago

We are also able to align existing RNA-Seq data from 140 SAS individuals from MAGE directly to these contigs, allowing us to identify 200 contigs with a high density of RNA-Seq alignments. BLAST shows that these contigs are highly similar to non-reference human and primate sequences.

submitted 29 days ago

We show that the majority of the placements we make are missed by traditional insertion calling tools, but are consistent with calls from tools built specifically to detect large non-reference sequences. For the unplaced contigs, BLAST shows that the majority have high similarity to non-reference human and primate sequences.

submitted 29 days ago

We are able to place ~20K contigs against CHM13 through a combination of alignment, mate-pair read information and LD. We find >8,000 instances of a placed contig intersecting one of 106 protein-coding genes, and >6,000 placements within 1 Kb of a known GWAS site.

submitted 29 days ago
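
The "within 1 Kb of a known GWAS site" check in the post above can be sketched roughly as follows. This is a hypothetical illustration, not the code used in the project: the function names (index_sites, near_site), the tuple layouts and the example coordinates are all assumptions made for this sketch.

```python
from bisect import bisect_left
from collections import defaultdict


def index_sites(sites):
    """Group (chrom, pos) sites by chromosome and sort the positions."""
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()
    return dict(by_chrom)


def near_site(placement, site_index, window=1000):
    """Return True if any site lies within `window` bp of the placement.

    `placement` is a hypothetical (chrom, start, end) tuple.
    """
    chrom, start, end = placement
    positions = site_index.get(chrom, [])
    # First site at or after (start - window); it is the only candidate
    # that needs checking against the right-hand edge (end + window).
    i = bisect_left(positions, start - window)
    return i < len(positions) and positions[i] <= end + window
```

With made-up sites at chr1:5000 and chr1:20000, a placement at chr1:5500-5600 counts as near a site, while one at chr1:7000-7100 does not. Sorting once and using binary search keeps each lookup logarithmic in the number of sites per chromosome.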

We validate >80% of these contigs in a subset of 21 SAS individuals using auxiliary long read data. We repeat the linear pipeline with the HPRC v1 draft pangenomes, and see further improvements in alignment but only small reductions in the amount of assembled sequence.

submitted 29 days ago

Despite improvements in alignment compared to GRCh38, we assemble ~600 Kb of sequence in >1 Kb contigs per individual from unmapped reads against T2T-CHM13. Across the whole set, we assemble 410 Mb of sequence in 199K contigs (which collapses down to 50 Mb when accounting for shared sequence).

submitted 29 days ago

To do this, we align existing short read data from 640 South Asian (SAS) individuals from 1KGP and SGDP against linear & pangenome references, and assemble the unmapped reads into large contigs. We then attempt to analyze the functional impact of these sequences.

submitted 29 days ago

South Asians are severely underrepresented in genomics, and this lack of representation makes it difficult to catalog and understand the variation present in these communities. Our goal was to investigate the variation present in these populations that is missing in widely used reference genomes.

submitted 29 days ago

I'll also be on the job market this summer, so please reach out if you're interested! You can find out more about me at these links: LinkedIn: www.linkedin.com/in/arun96/ Personal Website: arundas.org

submitted 53 days ago

If any of that interests you, or you would like the link to my defense, please shoot me a DM here or contact me through any of the ways listed at arundas.org/Contact.html.

submitted 53 days ago

I’ve also worked on a review paper on the evolution of human reference genomes (doi.org/10.1146/annu...), a visualization tool to detect relatedness in populations (doi.org/10.1093/bioi...) and on different approaches to genomic privacy (ccmb.brown.edu/sites/defaul...).

submitted 53 days ago

In that work, we proposed a range of sketching and sampling approaches for classifying reads from metagenomic experiments without the overhead traditionally associated with alignment- or index-based approaches, and demonstrated that our approaches achieved comparable accuracy to those tools.

submitted 53 days ago
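
To make the sketching idea in the post above a little more concrete, here is a minimal toy example in Python. It is a sketch under stated assumptions, not the method from the paper: it keeps a bottom-s MinHash sketch of each reference's k-mers and assigns a read to the reference whose sketch shares the most k-mer hashes with it. The function names, the choice of BLAKE2b as the hash, and the tie-breaking behaviour are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_hash(kmer):
    """Stable 64-bit hash (Python's built-in hash() is salted per run)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes."""
    return set(sorted({kmer_hash(km) for km in kmers(seq, k)})[:size])


def classify(read, sketches, k):
    """Return the reference name sharing the most hashes, or None."""
    read_hashes = {kmer_hash(km) for km in kmers(read, k)}
    scores = {name: len(read_hashes & sk) for name, sk in sketches.items()}
    best = max(scores, key=scores.get)  # ties go to the first reference
    return best if scores[best] > 0 else None
```

Usage would look like building sketches = {name: minhash_sketch(seq, k, size) for each reference} once, then calling classify(read, sketches, k) per read. No alignment or full index is involved; only set intersections against small sketches.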

I have also worked on sketching and sampling approaches for fast and accurate long read classification. Our paper was published in BMC Bioinformatics in 2022: bmcbioinformatics.biomedcentral.com/articles/10....

submitted 53 days ago

Sapling utilizes learned index structures to predict the location of a query string within a suffix array, allowing for fast and accurate predictions that bypass the slow lookups often encountered in binary search.

submitted 53 days ago
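
As a rough illustration of the idea described in the Sapling post above (and not Sapling's actual implementation), the toy Python sketch below fits a single straight line from an integer-encoded k-mer to its rank in a suffix array, records the worst-case prediction error, and then binary-searches only inside that error window. The class name LearnedSuffixIndex, the 2-bit encoding and the single linear model are assumptions made for this example; it also assumes the text contains only A, C, G, T and is at least k long, and that queries have length exactly k.

```python
import math

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Map a k-mer to an integer that preserves lexicographic order."""
    value = 0
    for base in kmer:
        value = value * 4 + ENC[base]
    return value


class LearnedSuffixIndex:
    def __init__(self, text, k):
        self.text, self.k = text, k
        # Naive suffix array construction; fine for a toy example only.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # Training points: (encoded k-mer, rank) for suffixes of length >= k.
        pts = [(encode(text[p:p + k]), r)
               for r, p in enumerate(self.sa) if p + k <= len(text)]
        n = len(pts)
        mean_x = sum(x for x, _ in pts) / n
        mean_y = sum(y for _, y in pts) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in pts)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in pts)
        # Least-squares line: rank ~ a * code + b.
        self.a = cov / var_x if var_x else 0.0
        self.b = mean_y - self.a * mean_x
        errs = [y - (self.a * x + self.b) for x, y in pts]
        self.min_err, self.max_err = min(errs), max(errs)

    def _prefix(self, rank):
        p = self.sa[rank]
        return self.text[p:p + self.k]

    def lookup(self, kmer):
        """Return the first suffix-array rank whose suffix starts with kmer."""
        pred = self.a * encode(kmer) + self.b
        # Search only the window implied by the recorded error bounds
        # (padded by one slot to absorb floating-point rounding).
        lo = max(0, math.floor(pred + self.min_err) - 1)
        hi = min(len(self.sa), math.ceil(pred + self.max_err) + 2)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._prefix(mid) < kmer:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.sa) and self._prefix(lo) == kmer:
            return lo
        return None
```

The smaller the model's maximum error, the narrower the window that still needs a binary search, which is where the speed-up over searching the whole suffix array would come from.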

Now, onto some of my prior work. I worked on developing Sapling, an effort led by Dr. Melanie Kirsche, to utilize learned index structures to speed up suffix array queries. Sapling was published in Bioinformatics in 2021: academic.oup.com/bioinformati...

submitted 53 days ago

This work is being done with @mikeschatz.bsky.social, @rajivmccoy.bsky.social and @aabiddanda.bsky.social. I’ll also be presenting a poster for this at Biology of Genomes 2025, so swing by to check it out and say hi!

submitted 53 days ago

I am currently studying the variation present in South Asian populations relative to widely used linear & pangenome references, through the assembly of unmapped reads. We expect to have a pre-print very, very soon, and you can find all related resources here: github.com/arun96/South...

submitted 53 days ago