arun-das.bsky.social
Scientist working in computer science and genomics.
Currently looking for opportunities in genomics! Contact me at arundas.org
PhD from Schatz Lab @ JHU CS. Brown '18. He/His/Him. #YNWA
Thank you! It was awesome to talk to you too, and to learn about all the cool data and insights from your project!
The homepage for this work is here: github.com/arun96/South...
This analysis can be replicated on any population of your choosing, and the WDL scripts used to run the various stages (as well as any other analysis details) can all be found on that page and in our pre-print.
Finally, we compare our placed contigs to loci associated with biomarker traits in the UK Biobank and East London Genes & Health Dataset, and find a number of positions where a placed contig is close to a significant locus.
We are also able to align existing RNA-Seq data from 140 SAS individuals from MAGE directly to these contigs, allowing us to identify 200 contigs with a high density of RNA-Seq alignments.
BLAST shows that these contigs are highly similar to non-reference human and primate sequences.
We show that the majority of the placements we make are missed by traditional insertion calling tools, but are consistent with calls from tools designed specifically to detect large non-reference sequences.
For the unplaced contigs, BLAST shows that the majority have high similarity to non-reference human and primate sequences.
We are able to place ~20K contigs against CHM13 through a combination of alignment, mate-pair read information, and linkage disequilibrium (LD).
We find >8,000 instances of a placed contig intersecting one of 106 protein coding genes, and >6,000 placements within 1 Kb of a known GWAS site.
We validate >80% of these contigs in a subset of 21 SAS individuals using auxiliary long read data.
We repeat the linear pipeline with the HPRC v1 draft pangenomes, and see further improvements in alignment but only small reductions in the amount of assembled sequence.
Despite improvements in alignment compared to GRCh38, we assemble ~600 Kb of sequence in >1 Kb contigs per individual from unmapped reads against T2T-CHM13.
Across the whole set, we assemble 410 Mb of sequence in 199K contigs (which collapses down to 50 Mb when accounting for shared sequence).
To do this, we align existing short read data from 640 South Asian (SAS) individuals from 1KGP and SGDP against linear & pangenome references, and assemble the unmapped reads into large contigs.
We then attempt to analyze the functional impact of these sequences.
South Asians are severely underrepresented in genomics, and this lack of representation makes it difficult to catalog and understand the variation present in these communities.
Our goal was to investigate the variation present in these populations that is missing in widely used reference genomes.
I'll also be on the job market this summer, so please reach out if you're interested!
You can find out more about me at these links:
LinkedIn: www.linkedin.com/in/arun96/
Personal Website: arundas.org
If any of that interests you, or you would like the link to my defense, please shoot me a DM here or contact me through any one of these ways: arundas.org/Contact.html.
I’ve also worked on a review paper on the evolution of human reference genomes (doi.org/10.1146/annu...), a visualization tool to detect relatedness in populations (doi.org/10.1093/bioi...), and different approaches to genomic privacy (ccmb.brown.edu/sites/defaul...).
In that work, we proposed a range of sketching and sampling approaches for classifying reads from metagenomic experiments without the overhead traditionally associated with alignment- or index-based approaches, and demonstrated that our approaches achieved comparable accuracy to those tools.
I have also worked on sketching and sampling approaches for fast and accurate long read classification. Our paper was published in BMC Bioinformatics in 2022: bmcbioinformatics.biomedcentral.com/articles/10....
Sapling utilizes learned index structures to predict the location of a query string within a suffix array, allowing for fast and accurate predictions that bypass the slow lookups often encountered in binary search.
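For anyone curious what "predict, then search a small window" can look like, here is a minimal sketch of the general learned-index idea. It is not Sapling's actual implementation: the single linear model, the prefix length K, and the names encode, build_index and find are all assumptions made up for illustration.

```python
# Toy learned index over a suffix array (illustrative only, not Sapling's code).
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 4  # hypothetical prefix length used as the model input


def encode(s, k=K):
    """Encode the first k bases as a base-4 integer, padding short strings with 'A'."""
    value = 0
    for i in range(k):
        value = value * 4 + (ALPHABET[s[i]] if i < len(s) else 0)
    return value


def build_index(text):
    """Build a naive suffix array plus a one-segment linear model with its max error."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    xs = [encode(text[i:]) for i in sa]
    x0, x1 = xs[0], xs[-1]
    slope = (len(sa) - 1) / (x1 - x0) if x1 != x0 else 0.0
    # Largest gap between predicted and true rank over all suffixes in the text.
    max_err = max(abs(round((x - x0) * slope) - rank) for rank, x in enumerate(xs))
    return {"sa": sa, "x0": x0, "slope": slope, "max_err": max_err}


def find(index, text, query):
    """Return one text position where query occurs, or -1 if it does not occur."""
    if len(query) < K:
        raise ValueError("this sketch only supports queries of length >= K")
    sa = index["sa"]
    guess = round((encode(query) - index["x0"]) * index["slope"])
    # Binary search only inside the window allowed by the model's error bound.
    lo = max(0, guess - index["max_err"])
    hi = min(len(sa), guess + index["max_err"] + 1)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text[sa[lo]:].startswith(query):
        return sa[lo]
    return -1


text = "ACGTACGTTGCAACGTGGA"
index = build_index(text)
pos = find(index, text, "TTGCA")
```

The error bound is measured on the text's own suffixes, so any query of length at least K that occurs in the text has an occurrence inside the window, and the binary search only has to cover that window instead of the whole suffix array.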
Now, onto some of my prior work.
I worked on developing Sapling, an effort led by Dr. Melanie Kirsche, to utilize learned index structures to speed up suffix array queries. Sapling was published in Bioinformatics in 2021: academic.oup.com/bioinformati...
This work is being done with @mikeschatz.bsky.social, @rajivmccoy.bsky.social and @aabiddanda.bsky.social.
I’ll also be presenting a poster for this at Biology of Genomes 2025, so swing by and check it out and say hi!
I am currently working on studying the variation present in South Asian populations relative to widely used linear & pangenome references, through the assembly of unmapped reads.
We expect to have a pre-print very, very soon, and you can find all related resources here: github.com/arun96/South...