Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The seminar series highlights areas of research in computational biology. This year, we hope to feature three topics: (1) deep learning approaches for biology/biomedicine, (2) algorithms for genomics, and (3) computational methods for understanding and modeling evolution.

Fall 2025

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Location: 32G-575 (Stata Center at MIT; Gates Tower; 5th Floor)
Zoom link for virtual attendees: https://mit.zoom.us/j/95319499071

Date	Speaker	Title/Abstract
Sept. 10 (virtual)	Brian Hie (Stanford)*	Genome modeling and design across all domains of life All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
Sept. 17 (virtual)	Ron Dror (Stanford)*	Discovering Safe, Effective Drugs via Machine Learning and Simulation of 3D Structure Recent years have seen dramatic advances in both experimental determination and computational prediction of macromolecular structures. These structures hold great promise for the discovery of highly effective drugs with minimal side effects, but structure-based design of such drugs remains challenging. I will describe recent progress toward this goal, using both atomic-level molecular simulations and machine learning on three-dimensional structures.
Sept. 24	L. Aravind (NCBI)	Discovering new biochemistry from biological conflicts Biological replicators are locked in deeply intertwined genetic conflicts with each other. Using comparative genomics, protein sequence and structure analysis and evolutionary investigations, my lab has uncovered a staggering diversity of molecular armaments and mechanisms regulating their deployment, collectively termed biological conflict systems. These include toxins used in interorganismal interactions and a host of mechanisms involved in self/nonself discrimination, especially in the context of host-selfish element conflicts. Our studies have helped identify shared syntactical features in the organizational logic of biological conflict systems. These principles can be exploited to discover new conflict systems through computational analyses. Further, we find that across the range of biological organization, from intragenomic conflicts to interorganismal conflicts, a circumscribed set of effector protein domain families is deployed, targeting genetic information flow through the Central Dogma, certain membranes, and key molecules like NAD+ and NTPs. This has led to significant advances in discovering new biochemistry of these systems and furnished new biotechnological reagents for genome editing, sequencing and beyond. I’ll discuss this using specific examples of toxins in interorganismal conflict and effectors in antiviral immunity.
Oct. 1	Benedict Paten (UC Santa Cruz)	Furthering our understanding of human genetic variation: the human pangenome reference project second release Human genomics has relied on a single reference genome for the last twenty years. This reference genome is a cornerstone of much of what we do in genomics, but it cannot, by definition, represent the variation present in the human population, and as a reference it introduces a pervasive bias into genomic analyses. I will survey our recent efforts, through the Human Pangenome Reference Consortium, to build and use a reference pangenome – a collection of extremely high-quality reference genomes related together by a consensus genome alignment that we intend as a replacement for the reference genome.
Oct. 8 (virtual)	Vagheesh Narasimhan (UT Austin)*	AI integrating imaging and genetics to understand human evolution, development, aging, and disease Imaging has been the primary means of diagnosing as well as tracking the progression of many diseases for decades but has largely been collected in isolation. Recently, through the advent of large-scale biobanks, this rich type of data has become linked with genetic and electronic health care record data at the level of tens of thousands of individuals, providing an unprecedented ability to study the relationship between genotype and phenotype directly in humans. I will discuss our group's work leveraging >1.2M medical images (DXA, MRI and ultrasound) from ~60,000 individuals across multiple views of the heart, brain, skeleton, liver and pancreas to provide new insights in four domains of biological science: (a) understanding the evolution of the human skeletal form, which underlies our ability to be bipedal; (b) examining the classical question in developmental biology of the genetic basis of left-right symmetry; (c) building biological aging clocks to study mechanisms of age acceleration/deceleration and to identify gene targets to combat aging; and (d) developing multi-modal AI that combines imaging, genetics and metabolics to predict 10-year disease incidence for common complex disease.
Oct. 15	Marina Sirota (UC San Francisco)	From Data to Knowledge: Integrating Clinical and Molecular Data for Predictive Medicine Alzheimer’s disease (AD) remains one of the most pressing medical challenges, with limited therapeutic options and heterogeneous disease trajectories complicating diagnosis and treatment. Recent advances in computational biology and artificial intelligence (AI), together with the availability of rich molecular and clinical data, offer new opportunities to address these challenges by integrating molecular, clinical, and systems-level insights. In our recent studies, we developed a cell-type-directed, network-correcting approach to identify and prioritize rational drug combinations for AD, enabling targeted modulation of disease-relevant pathways across distinct cellular contexts (Li et al., Cell 2025). Complementarily, by leveraging large-scale electronic medical records (EMRs) integrated with biological knowledge networks, we demonstrated the ability to predict disease onset and progression while uncovering mechanistic insights into AD heterogeneity (Tang et al., Nature Aging 2024). Together, these complementary approaches illustrate the power of combining real-world clinical data, knowledge networks, and systems pharmacology to advance precision medicine for AD. This work highlights a paradigm shift toward AI-enabled, data-driven strategies that bridge molecular discovery and clinical application, ultimately informing novel therapeutic interventions and improving patient care.
Oct. 22	Kishwar Shafin (Google)	Creating the next generation of genome analysis tools with deep learning Deep learning is fueling a revolution in genomics, enabling the development of a new generation of analysis tools that offer unprecedented accuracy. This talk presents a suite of deep learning models designed to address fundamental challenges in variant calling and generating high-quality genome assemblies. We begin with DeepVariant, a convolutional neural network that redefined the standard for germline variant calling, and its extension, DeepSomatic, which adapts this technology to the critical task of identifying low-frequency somatic mutations in cancer genomes. Moving from variant analysis to genome construction, we introduce DeepPolisher. This tool leverages a powerful Transformer-based architecture to significantly reduce errors in genome assemblies, providing a more accurate and reliable foundation for downstream research. Finally, we explore the future of variant calling by integrating these methods with emerging pangenome references. We demonstrate how a pangenome-aware approach allows for a more comprehensive survey of human genetic diversity, resolving variation in previously intractable regions of the genome. Together, these tools represent a cohesive framework that is building the next generation of genomic analysis, transforming our ability to accurately read and interpret the code of life.
Oct. 29	Elinor Karlsson (UMass Medical, Broad Institute)	Exploring 100 million years of mammalian evolution for the origins of exceptional traits The Zoonomia Project, one of the largest comparative genomics initiatives ever undertaken, compared 240 mammalian species spanning over 100 million years of evolutionary history. This work revealed that at least 11% of the human genome is evolutionarily constrained, and that these constrained bases are more enriched for variants explaining common disease heritability than any other functional annotation. Yet nearly half of the most highly constrained bases remain unannotated in existing datasets, underscoring how much of the genome’s regulatory landscape remains unexplored. Building on this foundation, we are integrating the “common garden” framework from classical ecology with modern genomics to assay and compare cellular responses across diverse mammals. This effort includes RNA-seq and ATAC-seq profiling across 12 species and seven experimental states varying in temperature, oxygen, and glucose levels. We can identify molecular responses shared across mammals and those unique to species with remarkable physiological adaptations—such as camels that thrive in extreme heat, seals that dive deeply without suffering oxygen damage, and bats that tolerate extreme blood sugar fluctuations. Uncovering the genomic mechanisms that enable these exceptional traits may reveal new strategies for improving human health.
Nov. 5 (cancelled)	Chirag Patel (Harvard Medical School)	Interrogating the multi-omic architecture of the exposome and intervention from populations to individuals The “exposome” (the array of environmental exposures from diet to chemical to infection) is vast, dynamic, and intertwined with biology across scales, but unlike the genome remains elusive despite thousands of candidate studies. This talk traces a practical pipeline for scaling exposome-phenome associations and calibrating them with intervention. First, we show population-scale maps relating hundreds of environmental and lifestyle factors to diverse phenotypes, quantifying realistic effect sizes, replication, and variance explained. Modeling correlated exposures jointly reveals modest but meaningful gains—an architecture reminiscent of polygenic traits—and sets baselines for discovery, prioritization, and study design. Second, looking ahead, we integrate multi-omic layers—especially proteomics and metabolomics—to characterize trajectories of metabolic dysfunction and to nominate biology-anchored targets. By leveraging observational cohorts alongside interventional datasets (e.g., GLP-1–based therapy), we identify response-linked signatures for experimental opportunities. Third, we show recent work in the group to untangle intra-individual variation in glucose response, using AI approaches such as interpretable state-space models that fuse continuous glucose monitoring with wearable signals to forecast short-term risk and run counterfactual “what-if” scenarios for personalized self-management. We will also discuss an emerging consortium for exposomic research, Nexus-exposomics.org.
Nov. 12	Sumaiya Nazeen (Harvard Medical School)	From Networks to Subtypes: Statistical Frameworks for Mechanistic Insights into Complex Disease Genetics Complex diseases often arise from diverse genetic mechanisms acting through interconnected pathways and frequently encompass multiple hidden subtypes that share similar diagnostic features but have distinct genetic origins. Naturally, this raises two key questions: how can we move beyond single-gene associations to uncover mechanistic links to disease, and how can we identify latent, clinically meaningful subtype structure within complex disorders? Traditional analyses struggle to answer both. In this talk, I will present two approaches that bring a systems perspective to human genetics. First, I will introduce NERINE, a network-aware rare variant testing framework that integrates gene-gene interaction topology into a hierarchical model. By embedding human genetic variation within gene networks, NERINE enables competitive evaluation of biological hypotheses, achieving higher power and interpretability than traditional burden tests. Applied to both canonical pathway databases and experimentally derived networks, NERINE reveals novel disease mechanisms in breast cancer, type II diabetes, cardiovascular disease, and Parkinson’s disease (PD) across biobanks. In PD, rare variant associations identified by NERINE converge with a genome-scale CRISPRi screen in iPSC-derived neuronal models of synucleinopathies, revealing a mechanistic role for PRL in the α-synuclein stress response. Next, I will present Checkers, a new method for detecting genetic subtypes of complex diseases directly from genotype data. Using eigenanalysis under a liability threshold model, I will show that when subtypes arise from distinct underlying liabilities, a mixed disease cohort exhibits predictable low-dimensional patterns of sample relatedness at causal variants that can be detected without phenotypic or omics covariates. Checkers couples a novel matrix transformation to correct for sample kinship and linkage disequilibrium in controls with a statistical test on the eigenvalue distribution to estimate the number of disease subtypes. Our method effectively disentangles mixtures of binarized quantitative traits in the UK Biobank and provides a general framework for understanding disease heterogeneity. Together, these studies illustrate how network-aware and structure-aware computational frameworks can unify experimental and population-level perspectives, illuminating both the mechanistic architecture and hidden subtypes of human disease.
Nov. 19 (virtual)	Mile Šikić (Genome Institute of Singapore, University of Zagreb)*	AI for Genomes: Rethinking de novo Assembly Accurately resolving genomic paths in assembly graphs is a key challenge in de novo genome assembly, especially in the presence of repeats that create tangles and fragmentation. We present a geometric deep learning framework that learns directly from graph structure, bypassing conventional heuristics and exploiting problem symmetries to achieve PacBio HiFi reconstructions with state-of-the-art quality and contiguity. The same approach can be implemented for other sequencing technologies. Here, we will present results for haploid and diploid genomes. Our method performs robustly on both simulated and real datasets and will be able to utilise telomere-to-telomere reference expansion. By decoupling path inference from hard-coded strategies and generalising across species and genomic architectures, this framework opens the door to reconstructing highly complex genomes, including those with high ploidy or extensive structural variation.
Nov. 26	No speaker	No speaker due to the Thanksgiving holiday
Dec. 3 (virtual)	Fabian Theis (Helmholtz Munich)*	Towards virtual cells - the need for actionable, robust perturbation models Computational cell biology is evolving from descriptive atlases to predictive, actionable models — from mapping what cells are to simulating what they do. In this talk, I will outline progress toward virtual cells, focusing on machine learning approaches that enable robust perturbation modeling. Among others, I will present scConcept, a framework that differs from large-scale foundation models such as Geneformer by learning in the latent space through control-based objectives. Rather than passively embedding cellular states, scConcept explicitly models transitions between them, capturing how cells move through gene-expression space in response to context and perturbation. Building on this foundation, I will introduce CellFlow, a generative perturbation model that predicts how interventions — such as drugs, cytokines, or gene edits — reshape cellular phenotypes. By learning causal directions of change, CellFlow enables in silico experimentation and virtual screening of differentiation protocols. Together, these developments point toward virtual cells: computational counterparts capable of robustly predicting and designing biological behavior.
Dec. 10	Mona Singh (Princeton)	Advancing protein sequence analysis with protein language models Protein language models (PLMs) have emerged as transformative tools for understanding and interpreting protein sequences, enabling advances in structure prediction, functional annotation, and variant effect assessment directly from sequence alone. Yet realizing their full potential requires both algorithmic innovation and a deeper understanding of their capabilities and limitations. In this talk, I will present several recent developments that advance PLM-based protein sequence analysis along these dimensions. First, I will introduce Bag-of-Mer (BoM) pooling, a biologically inspired strategy for aggregating amino acid embeddings that can capture both local motifs and long-range interactions, improving performance on diverse tasks such as protein activity prediction, remote homology detection, and peptide–protein interaction prediction. Next, I will describe ARIES, a highly scalable multiple-sequence alignment algorithm that leverages PLM embeddings to achieve superior accuracy even in low-identity regions where traditional methods struggle. Finally, time permitting, I will discuss insights into PLM performance, including the roles of training data, sequence fit, and model architecture. Together, this work illustrates how PLMs can both power and reshape core computational biology tasks, while providing guidance for more effective and biologically grounded model development.

*Indicates the speaker will be presenting over Zoom. Otherwise, they will be presenting in person.

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by Bonnie Berger, Simons Professor of Mathematics at MIT and head of the Computation and Biology group at CSAIL. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger: bab@mit.edu

Megan Le (TA): meganle@mit.edu

To be added to the seminar's email announcement list, or for any questions about the seminar, please email bioinfo@csail.mit.edu and cc TA Megan Le (meganle@mit.edu).

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger (bab@mit.edu) and cc TA Megan Le (meganle@mit.edu) for more information.