Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The seminar series highlights areas of research in computational biology. This year, we hope to highlight three topics: (1) language models and their uses in biology and biomedicine, (2) ethical and societal issues in biomedical data (e.g. privacy, fairness), and (3) single-cell biology.

Fall 2023

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Hybrid lectures will be held in Stata (32) G-575.
Zoom link for virtual attendees:

Date Speaker Title/Abstract
Sep. 13 (hybrid) Manolis Kellis

From genomics to therapeutics: single cell dissection of disease circuitry

Disease-associated variants lie primarily in non-coding regions, increasing the urgency of understanding how gene-regulatory circuitry impacts human disease. To address this challenge, we generate comparative genomic, epigenomic, and transcriptional maps, spanning 823 human tissues, 1500 individuals, and 20 million single cells. We link variants to target genes, upstream regulators, cell types of action, and perturbed pathways, and predict causal genes and regions to provide unbiased views of disease mechanisms, sometimes re-shaping our understanding. We find that Alzheimer’s variants act primarily through immune processes, rather than neuronal processes, and that the strongest genetic association with obesity acts via energy storage/dissipation rather than appetite/exercise decisions. We combine single-cell profiles, tissue-level variation, and genetic variation across healthy and diseased individuals to map genetic effects into epigenomic, transcriptional, and functional changes at single-cell resolution, to recognize cell-type-specific disease-associated somatic mutations indicative of mosaicism, and to recognize multi-tissue single-cell effects. We expand these methods to electronic health records to recognize multi-phenotype effects of genetics, environment, and disease, combining clinical notes, lab tests, and diverse data modalities despite missing data. We integrate large cohorts to factorize phenotype-genotype correlations to reveal distinct biological contributors of complex diseases and traits, to partition disease complexity, and to stratify patients for pathway-matched treatments. Lastly, we develop massively-parallel, programmable and modular technologies for manipulating these pathways by high-throughput reporter assays, genome editing, and gene targeting in human cells and mice, to propose new therapeutic hypotheses in Alzheimer’s, ALS, obesity, cardiac disease, schizophrenia, aging, and cancer.
These results provide a roadmap for translating genetic findings into mechanistic insights and ultimately new therapeutic avenues for complex disease and cancer.

Sep. 20 (hybrid) Ellen Zhong
(Princeton University)

Machine learning for determining protein structure and dynamics from cryo-EM images

Major technological advances in cryo-electron microscopy (cryo-EM) have produced new opportunities to study the structure and dynamics of proteins and other biomolecular complexes. However, the structural heterogeneity of these molecules complicates the algorithmic task of 3D reconstruction from the collected dataset of 2D cryo-EM images. In this seminar, I will give an overview of cryoDRGN and related methods that leverage the representation power of deep neural networks for cryo-EM reconstruction. Underpinning the cryoDRGN method is a deep generative model parameterized by an implicit neural representation of 3D volumes and a learning algorithm to optimize this representation from unlabeled 2D cryo-EM images. Extended to real datasets and released as an open-source tool, these methods have been used to discover new protein structures and visualize continuous trajectories of protein motion. I will discuss various extensions of the method for scalable and robust reconstruction, analyzing the learned generative model, and visualizing dynamic protein structures in situ.

Sep. 27 (hybrid) Stephen Quake
(Stanford University)

A Decade of Molecular Cell Atlases

Oct. 2 (hybrid, Stata G882) Steven Brenner
(UC Berkeley)

Notes on privacy timebombs in functional genomics data

Plans to sequence everyone in the developed world at birth were developed three decades ago. This is now plausible, and companies offer newborn sequencing as an option to parents—sometimes as an alternative to newborn screening (NBS) which identifies rare, treatable conditions that require urgent intervention. Yet the benefits and risks remain largely unknown. We probed the potential and pitfalls of performing pervasive population-scale public health sequencing of newborns. To do so, we drew upon an unparalleled NBS public health resource and used inborn errors of metabolism (IEMs) as a model system for human genetics.

We obtained archived residual dried blood spots and data for nearly all IEM cases from the 4.5 million infants born in California during an 8.5 year period. We found that exome analysis alone was insufficiently sensitive or specific to be a primary screen for most NBS IEMs. However, as a secondary test for infants, exomes could reduce false-positive results, facilitate timely case resolution and in some instances even suggest more appropriate or specific diagnosis. Sequence-based NBS could also be the foundation of a learning public health system identifying additional disorders.

As genomic data become increasingly mainstream, privacy risks have garnered increasing attention. Forensic cases using previously unidentifiable DNA have direct implications for research data. We found that consumer genomics and biological discovery activate latent privacy risk in ’omics data, including data scrubbed of any genetic variation information. I will briefly discuss how data previously deemed safe by privacy researchers for unrestricted sharing now appear to be at risk of potential re-identification.

Oct. 4 (virtual) David Knowles
(Columbia University)

Determining the molecular intermediates between genotype and phenotype

Abstract: I will describe two projects that aim to better dissect the causal chain from functional genetic variant through molecular intermediates and finally to organismal trait or disease risk. In the first, we use pooled profiling of splicing factor binding across individuals to measure and then computationally model genetic effects on both binding and RNA splicing. In the second, we developed an instrumental variable-based causal network inference method that scales to hundreds of nodes by leveraging convex optimization. We apply this approach to learning phenome-wide trait networks from the UK Biobank and directed gene regulatory networks from Perturb-seq.

Bio: Dr Knowles uses statistical machine learning—probabilistic graphical models, deep learning and convex optimization—to address challenges in understanding large genomic datasets. His lab develops methods to map the causes and consequences of transcriptomic dysregulation, especially RNA splicing, across the spectrum from rare to common genetic disease. They collaborate with labs at NYGC, Columbia and MSSM, focusing on understanding the genetic basis of neurological diseases, both degenerative and psychiatric.

Dr Knowles studied Natural Sciences and Information Engineering at (old) Cambridge before obtaining an MSc in Bioinformatics and Systems Biology at Imperial College London. During his PhD in the Cambridge University Machine Learning Group under Zoubin Ghahramani he worked on variational inference and Bayesian nonparametric models for factor analysis, hierarchical clustering and network analysis. He was a postdoc at Stanford developing methods for functional genomics with Daphne Koller (CS), Sylvia Plevritis (Computational Systems Biology/Radiology) and Jonathan Pritchard (Genetics/Biology). At Columbia, he is an Assistant Professor of Computer Science, an Interdisciplinary Appointee in Systems Biology and an Affiliate Member of the Data Science Institute. He is also a Core Faculty Member at the New York Genome Center.

Oct. 11 (virtual) Bogdan Pasaniuc

Calibrated prediction intervals for polygenic scores across diverse contexts

Polygenic scores (PGS) have emerged as the tool of choice for genomic prediction in a wide range of fields from agriculture to personalized medicine. We analyze data from two large biobanks, one in the US (All of Us) and one in the UK (UK Biobank), and find widespread variability in PGS performance across contexts. Many contexts, including age, sex, and income, impact PGS accuracy with a magnitude similar to that of genetic ancestry. PGSs trained in single-ancestry versus multi-ancestry cohorts show similar context-specificity in their accuracies. We introduce trait prediction intervals that are allowed to vary across contexts as a principled approach to account for context-specific PGS accuracy in genomic prediction. We show that prediction intervals need to be adjusted for all considered traits, with adjustments ranging from 10% for diastolic blood pressure to 80% for waist circumference.

Oct. 18 (hybrid) William Yu

Augmenting k-mer sketching for (meta)genomic sequence comparisons

Over the last decade, k-mer sketching (e.g. minimizers or MinHash) to create succinct summaries of long sequences has proven effective at improving the speed of sequence comparisons. However, rigorously characterizing the accuracy of these techniques has been more difficult. In this talk, I'll touch on three results that showcase some of the modern theoretical developments and practical applications of theory to building faster sequence comparison tools for metagenomics.

We begin by rigorously providing average-case guarantees for the popular seed-chain-extend heuristic for pairwise sequence alignment under a random substitution model, showing that it is accurate and runs in close to O(n log n) time for similar sequences. Then, we will turn our focus to metagenomics: our new tool skani computes average nucleotide identity (ANI) using sparse approximate alignments, and is both more accurate and over 20 times faster than the current state-of-the-art FastANI for comparing incomplete, fragmented MAGs (metagenome-assembled genomes). This was enabled by Belbasi et al.'s work showing that minimizers are biased Jaccard estimators, whereas other k-mer sketching methods do not have that drawback. Finally, we will introduce sylph (unpublished work), which enables fast and accurate database search to find nearest neighbor genomes (in ANI space) of low-coverage sequenced samples by using a combination of k-mer sketching with a zero-inflated Poisson correction (45x faster than MetaPhlAn for screening databases).

All of the work in this talk is joint with my brilliant PhD student Jim Shaw.

Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research (2023) 33 (7), 1175-1187.
Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023).

Oct. 25 (hybrid) Hoon Cho
(Yale University)

Computational tools for understanding and addressing privacy challenges in genomics

The sensitivity of human genomic data poses significant challenges for data sharing and collaboration in biomedicine. Balancing privacy protection and scientific progress is crucial. In this talk, I will discuss our recent efforts to develop effective algorithms that deepen our understanding of genomic privacy risks and facilitate secure data sharing. First, I will describe our novel probabilistic approach to modeling associations between gene expression levels and genotypes at the sequence level. This approach reveals a greater extent of data linkage risks than previously recognized. Next, I will describe our data-oblivious genomic analysis algorithms, designed for deployment in trusted execution environments (TEEs). These tools can power secure analytic services, providing confidential processing of a user’s genome. Lastly, I will introduce our suite of secure and federated (SF) algorithms for essential tasks including genome-wide association studies (GWAS), incorporating modern techniques from applied cryptography, distributed algorithms, and statistical genetics. These tools enable collaborative studies across large-scale genomic data silos, encompassing hundreds of thousands of genomes, while ensuring the confidentiality of each dataset. Our work lays the foundation for broader collaboration in biomedicine, advancing progress while respecting individual privacy.

Nov. 1 (virtual) Brian Hie
(Stanford University)

Learning to read and write protein evolution

Evolution is the powerful force driving both the real-time emergence of pathogen resistance to drugs and immunity, as well as the diversity of natural forms and functions that have emerged over longer timescales. Modern evolutionary models, especially those that leverage advances in machine learning, can improve our ability to design new proteins in the laboratory. This talk will cover how models of protein sequences and structures can learn evolutionary rules that help guide artificial evolution, including the affinity maturation of antibodies against diverse viral antigens and the in-silico evolution of modular and programmable de novo proteins with structures not found in nature.

Bio: Brian Hie is an incoming Assistant Professor of Chemical Engineering and Data Science at Stanford University and an Innovation Investigator at Arc Institute, where he conducts research at the intersection of biology and machine learning. He was previously a Stanford Science Fellow in the Stanford University School of Medicine and a Visiting Researcher at Meta AI. He completed his Ph.D. at MIT CSAIL and was an undergraduate at Stanford University.

Nov. 8 (hybrid) Cory McLean
(Google Health)


Genome-wide association studies have shed light on the genetic architecture of many diseases and complex traits. As genotyping becomes increasingly commoditized, the major challenge for discovering genotype/phenotype interactions is accurate phenotyping at scale. High-dimensional clinical data provide a unique opportunity to perform accurate phenotyping with deep learning models. This talk will give an overview of multiple techniques for coupling machine-learning-based phenotyping with biobank-scale genetic data to improve genomic discovery and risk prediction for respiratory, circulatory, and eye morphology diseases and traits.

Bio: Cory McLean is the engineering lead for the Genomics team in Google Research. He completed his PhD at Stanford as a Bio-X fellow, a postdoc at UCSF as a Damon Runyon Cancer Research Foundation fellow, and spent 3 years in the Research team at 23andMe before joining Google in 2015. His research interests broadly include applying machine learning to the analysis and interpretation of genomic data and publishing tools and methods as open-source software.

Nov. 15 (virtual) Rohit Singh
(Duke University)

The Geometry of Single-Cell Biology: Geodesics, Metrics, and Parallaxes

The power of single-cell genomics is the ability to measure each cell in a tissue. What can be measured varies: transcript counts, how many of the transcripts are spliced, the accessible locations of the chromatin, and so on. We can even make two sets of measurements at the same time! Each measurement tells us a different aspect of the cell's biology, and together, they offer a revolutionary way to understand cellular development, differentiation and disease.

However, the challenge is interpreting the measurements: the data is almost always high-dimensional, noisy, and sparse. In this talk, I will present a set of algorithms that reveal new biology by considering the geometric aspects of these measurements. By exploiting the structure of the Riemannian manifold of gene co-expression, we introduce a powerful new way of discovering gene programs that drive cell fate commitment. To integrate multiple measurement modalities into a cohesive whole, we construct a kernel-based metric learning formalization that can be efficiently solved with quadratic programming. To identify the causal drivers of cell differentiation and disease, we exploit the parallax between leading and lagging measurements in a single-cell snapshot. This allows us to introduce the first generalization of Granger causal inference to directed acyclic graphs. We demonstrate its power by using it to uncover new genes affected in schizophrenia.

Nov. 22 (virtual) JP Hubaux

Secure and Privacy-Preserving Decentralized Analytics and Its Application to Biomedical Data

To work properly, machine learning requires access to large amounts of data. Yet access to datasets can be difficult, because of regulations or because the controller considers its own data to be too sensitive or too precious. In this case, datasets remain in silos, jeopardizing the ability to properly train ML models with enough data. In this talk, we will present several results that show how to solve this problem, notably by leveraging recent advances in cryptography.

We first address the challenge of privacy-preserving training and evaluation of neural networks in an N-party, federated learning setting. We propose a novel system, POSEIDON, the first of its kind in the regime of privacy-preserving neural network training. It employs multiparty cryptography to preserve the confidentiality of the training data, the model, and the evaluation data, under a passive-adversary model and collusions between up to N−1 parties. It is based on Lattigo, an open-source lattice-based cryptographic library written in the Go language. Our experimental results show that POSEIDON achieves accuracy similar to centralized or decentralized non-private approaches.

We then switch to principal component analysis (PCA), an essential algorithm for dimensionality reduction in many data science domains. We address the problem of performing a federated PCA on private data distributed among multiple data providers while ensuring data confidentiality. Our solution, SF-PCA, is an end-to-end secure system that preserves the confidentiality of both the original data and all intermediate results in a passive-adversary model with up to all-but-one colluding parties. SF-PCA jointly leverages multiparty homomorphic encryption, interactive protocols, and edge computing to efficiently interleave computations on local cleartext data with operations on collectively encrypted data. SF-PCA obtains results as accurate as non-secure centralized solutions, independently of the data distribution among the parties.

Next, we show how we use these techniques in medical research. We propose FAMHE, a novel federated analytics system that, based on multiparty homomorphic encryption (MHE), enables privacy-preserving analyses of distributed datasets by yielding highly accurate results without revealing any intermediate data. We demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics.

Finally, we present Tune Insight SA, a start-up company that has industrialized the software implementing some of our results.

This work was carried out in collaboration with colleagues at EPFL, MIT, Broad Institute, Lausanne University Hospital and Tune Insight.

Short bio: Prof. Jean-Pierre Hubaux is the academic director of the EPFL Center for Digital Trust (C4DT). For its whole duration (April 2018 - December 2021), he led the national Data Protection in Personalized Health (DPPH) project. Until December 2021, he was a co-chair of the Data Security Work Stream of the Global Alliance for Genomics and Health (GA4GH). From 2008 to 2019 he was one of the seven commissioners of the Swiss FCC. He is a Fellow of both IEEE and ACM. Awards: three of his papers obtained distinctions at the IEEE Symposium on Security and Privacy, the flagship event on the topic (in 2015, 2018 and 2021). He is among the most cited researchers in privacy protection and in information security. He is a co-founder of Tune Insight SA.

Nov. 29 (hybrid) Barbara Engelhardt
(Stanford University)

Building 3D single-cell atlases

Comprehensive atlases of tissues, organs, and diseased tissues are being created regularly, but the definition of an 'atlas' remains unclear, and, as a result, the quality and consistency of available single cell atlases is variable. In this talk, I propose a definition for an atlas, and show models and methods we have developed to design, create, and annotate these atlases with minimal expert interventions. I show results benchmarking these methods on existing data and propose a series of future directions of study.

Bio: Barbara Engelhardt, PhD, is a Professor at Stanford University in the Department of Biomedical Data Science and a Senior Investigator at Gladstone Institutes. Prior to taking these roles in 2022, she was a Professor at Princeton University in the Computer Science Department, and before that an Assistant Professor at Duke University. She earned her PhD at University of California, Berkeley, working with Prof. Michael I Jordan, and was a postdoc at the University of Chicago with Prof. Matthew Stephens. She has held positions in industry, including Google Research, 23andMe, and Genomics plc. She has won awards that include the Google Anita Borg Scholarship, the SMBE Walter M. Fitch Prize, a Sloan Faculty Fellowship, an NSF CAREER, and the ISCB Overton Prize. She is currently a CIFAR fellow in the Multiscale Human program.

Dec. 6 (virtual) Emma Pierson
(Cornell University)

Using machine learning to increase equity in healthcare and public health

Our society remains profoundly unequal. This talk discusses how data science and machine learning can be used to combat inequality in healthcare and public health, presenting several vignettes from domains like policing, women's health, and cancer risk prediction.

Bio: Emma Pierson is an assistant professor of computer science at the Jacobs Technion-Cornell Institute at Cornell Tech and the Technion, and a computer science field member at Cornell University. She holds a secondary joint appointment as an Assistant Professor of Population Health Sciences at Weill Cornell Medical College. She develops data science and machine learning methods to study inequality and healthcare. Her work has been recognized by best paper, poster, and talk awards, an NSF CAREER award, a Rhodes Scholarship, Hertz Fellowship, Rising Star in EECS, MIT Technology Review 35 Innovators Under 35, and Forbes 30 Under 30 in Science. Her research has been published at venues including ICML, KDD, WWW, Nature, and Nature Medicine, and she has also written for The New York Times, FiveThirtyEight, Wired, and various other publications.

Dec. 13 (hybrid) Yun Song

Learning complex functional constraints in proteins and non-coding DNA

Predicting the impact of genetic variants is a significant challenge in computational biology, with crucial applications in disease diagnosis, gene regulation modeling, and protein engineering. In this talk, I will describe my lab's recent work on improving variant effect prediction for both coding and non-coding regions by leveraging advances in unsupervised learning, particularly self-supervised learning from natural language processing. For coding variants, I will introduce a robust learning framework to transfer properties between unrelated proteins and discuss how this approach fares in comparison with the recently published PrimateAI-3D and AlphaMissense methods. Regarding non-coding variants, I will present our work on DNA language models, highlighting their efficacy in genome-wide variant effect prediction.

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by MIT Simons Professor of Mathematics and head of the Computation and Biology group at CSAIL Bonnie Berger. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger:

Shuvom Sadhuka (TA):

To be added to the seminar's email announcement list or for any questions you have about the seminar, please mail

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger and cc TA Shuvom Sadhuka for more information.