Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The series covers current research in computational biology. This year, we hope to highlight three topics: (1) language models and their uses in biology and biomedicine, (2) ethical and societal issues in biomedical data (e.g., privacy, fairness), and (3) single-cell biology.

Fall 2023

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Hybrid lectures will be held in Stata (32) G-575.
Zoom link for virtual attendees:

Date Speaker Title/Abstract
Sep. 13 (hybrid) Manolis Kellis

From genomics to therapeutics: single cell dissection of disease circuitry

Disease-associated variants lie primarily in non-coding regions, increasing the urgency of understanding how gene-regulatory circuitry impacts human disease. To address this challenge, we generate comparative genomic, epigenomic, and transcriptional maps, spanning 823 human tissues, 1500 individuals, and 20 million single cells. We link variants to target genes, upstream regulators, cell types of action, and perturbed pathways, and predict causal genes and regions to provide unbiased views of disease mechanisms, sometimes re-shaping our understanding. We find that Alzheimer’s variants act primarily through immune processes, rather than neuronal processes, and the strongest genetic association with obesity acts via energy storage/dissipation rather than appetite/exercise decisions. We combine single-cell profiles, tissue-level variation, and genetic variation across healthy and diseased individuals to map genetic effects into epigenomic, transcriptional, and functional changes at single-cell resolution, to recognize cell-type-specific disease-associated somatic mutations indicative of mosaicism, and to recognize multi-tissue single-cell effects. We expand these methods to electronic health records to recognize multi-phenotype effects of genetics, environment, and disease, combining clinical notes, lab tests, and diverse data modalities despite missing data. We integrate large cohorts to factorize phenotype-genotype correlations to reveal distinct biological contributors of complex diseases and traits, to partition disease complexity, and to stratify patients for pathway-matched treatments. Lastly, we develop massively parallel, programmable and modular technologies for manipulating these pathways by high-throughput reporter assays, genome editing, and gene targeting in human cells and mice, to propose new therapeutic hypotheses in Alzheimer’s, ALS, obesity, cardiac disease, schizophrenia, aging, and cancer.
These results provide a roadmap for translating genetic findings into mechanistic insights and ultimately new therapeutic avenues for complex disease and cancer.

Sep. 20 (hybrid) Ellen Zhong
(Princeton University)

Machine learning for determining protein structure and dynamics from cryo-EM images

Major technological advances in cryo-electron microscopy (cryo-EM) have produced new opportunities to study the structure and dynamics of proteins and other biomolecular complexes. However, the structural heterogeneity that accompanies these dynamics complicates the algorithmic task of 3D reconstruction from the collected dataset of 2D cryo-EM images. In this seminar, I will give an overview of cryoDRGN and related methods that leverage the representation power of deep neural networks for cryo-EM reconstruction. Underpinning the cryoDRGN method is a deep generative model parameterized by an implicit neural representation of 3D volumes and a learning algorithm to optimize this representation from unlabeled 2D cryo-EM images. Extended to real datasets and released as an open-source tool, these methods have been used to discover new protein structures and visualize continuous trajectories of protein motion. I will discuss various extensions of the method for scalable and robust reconstruction, analyzing the learned generative model, and visualizing dynamic protein structures in situ.

Sep. 27 (hybrid) Stephen Quake
(Stanford University)

A Decade of Molecular Cell Atlases

Oct. 4 (virtual) David Knowles
(Columbia University)
Oct. 11 (virtual) Bogdan Pasaniuc
Oct. 18 (hybrid) William Yu

Augmenting k-mer sketching for (meta)genomic sequence comparisons

Over the last decade, k-mer sketching (e.g. minimizers or MinHash) to create succinct summaries of long sequences has proven effective at improving the speed of sequence comparisons. However, rigorously characterizing the accuracy of these techniques has been more difficult. In this talk, I'll touch on three results that showcase some of the modern theoretical developments and practical applications of theory to building faster sequence comparison tools for metagenomics.

We begin by rigorously providing average-case guarantees for the popular seed-chain-extend heuristic for pairwise sequence alignment under a random substitution model, showing that it is accurate and runs in close to O(n log n) time for similar sequences. Then, we will turn our focus to metagenomics: our new tool skani computes average nucleotide identity (ANI) using sparse approximate alignments, and is both more accurate and over 20 times faster than the current state-of-the-art FastANI for comparing incomplete, fragmented MAGs (metagenome-assembled genomes). This was enabled by Belbasi et al.'s work showing that minimizers are biased Jaccard estimators, whereas other k-mer sketching methods do not have that drawback. Finally, we will introduce sylph (unpublished work), which enables fast and accurate database search to find nearest-neighbor genomes (in ANI space) of low-coverage sequenced samples by combining k-mer sketching with a zero-inflated Poisson correction (45x faster than MetaPhlAn for screening databases).

All of the work in this talk is joint with my brilliant PhD student Jim Shaw.

Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research (2023) 33 (7), 1175-1187.
Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023).

Oct. 25 (hybrid) Hoon Cho
(Yale University)
Nov. 1 (virtual) Brian Hie
(Stanford University)
Nov. 8 (hybrid) Cory McLean
(Google Health)
Nov. 15 (virtual) Rohit Singh
(Duke University)
Nov. 22 (virtual) JP Hubaux
Nov. 29 (hybrid) Barbara Engelhardt
(Stanford University)
Dec. 6 (virtual) Emma Pierson
(Cornell University)

Using machine learning to increase equity in healthcare and public health

Our society remains profoundly unequal. This talk discusses how data science and machine learning can be used to combat inequality in health care and public health by presenting several vignettes from domains like policing, women's health, and cancer risk prediction.

Dec. 13 (hybrid) Yun Song

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by Bonnie Berger, Simons Professor of Mathematics at MIT and head of the Computation and Biology group at CSAIL. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger:

Shuvom Sadhuka (TA):

To be added to the seminar's email announcement list or for any questions you have about the seminar, please mail

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger and cc TA Shuvom Sadhuka for more information.