Abstract
Genome sequence analysis plays an essential role in scientific and medical research, with applications spanning disease analysis, personalized medicine, epidemiology, forensics, evolutionary biology, and population genetics. Recent advancements in DNA sequencing technologies have led to an explosion in data generation, far outpacing the growth of computational power. As large-scale projects, such as the UK Biobank, which includes 500,000 sequenced individuals and associated biomedical data, become increasingly common, the computational burden intensifies, exacerbating existing bottlenecks and significantly raising energy consumption. Addressing these challenges is crucial to ensure that genomic research remains both scalable and sustainable.
This thesis focuses on accelerating genomic data processing while reducing its overall energy footprint. Several strategies are explored to achieve this goal. First, we introduce a novel genotype compression format that reduces storage requirements and enhances computational efficiency by enabling faster data access and allowing direct processing of compressed data, a concept known as "compressive genomics". We then present a parallelized version of the positional Burrows-Wheeler transform and associated algorithms, designed to leverage modern multi-core processors and accelerate genetic applications such as haplotype estimation and population structure analysis. Additionally, we propose a cloud-distributed method capable of efficiently processing population-scale whole-genome sequencing data, improving the statistical phasing of hundreds of thousands of genomes at petabyte scale. Finally, we introduce innovative hardware in the form of computational storage devices, which not only store data but are also capable of processing it locally. We demonstrate their potential for acceleration and energy efficiency by designing a computational storage device specifically for genomics. This device integrates a complete genomic analysis pipeline, from DNA sequence alignment to variant calling, directly within the storage hardware. This integration minimizes data movement, reduces energy consumption, and provides acceleration opportunities.
By combining advances in compression, algorithmic optimization, cloud-scale processing, and hardware architecture innovation, this work offers a comprehensive approach to accelerating genomic data analysis while improving energy efficiency. These contributions not only enable faster and deeper genomic research but also lay the foundation for sustainable, large-scale genomics studies.