Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The seminar series focuses on highlighting areas of research in the field of computational biology. This year, we are hoping to highlight three topics: (1) evolution and computational approaches to modeling and understanding it, (2) generative AI for biology/biomedicine, and (3) algorithms for computational biology/genomics.

Fall 2024

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Location: 32G-575 (Stata Center at MIT; Gates Tower; 5th Floor)
Zoom link for virtual attendees: https://harvard.zoom.us/j/99103715484

Date Speaker Title/Abstract
Sept. 11 Roshan Rao
(EvolutionaryScale)

Multimodal Protein Foundation Models

How can multimodality improve representations of proteins? Foundation models have shown promise in building powerful representations for many domains. Language models are able to access a vast quantity of human knowledge and are able to perform limited reasoning over this body of knowledge. Protein models learn the evolutionary patterns in proteins, enabling prediction of protein structure and function. This talk will cover the development of protein foundation models, understanding the representations they build, and how they scale. Finally, it will cover incorporating modalities beyond protein sequences, and how additional data could be added to produce better representations in the future.

Sept. 18 Ben Langmead
(JHU)

Pan-genomic advances for fighting reference bias

Sequencing data analysis often begins with aligning reads to a reference genome, where the reference takes the form of a linear string of bases. But linearity leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream statistical and biological results. This is a major concern in human genomics; we don't want to live in a world where diagnostics and therapeutics are differentially effective depending on whether and where our genetic variants happen to match the reference.

Fortunately, computer science and bioinformatics are meeting the moment. We can now index and align sequencing reads to references that include many population variants. I will present some of the major insights that have shaped this journey, from the early days of efficient genome indexing, especially the Burrows-Wheeler Transform, through recent methods for indexing graph-shaped references and references that include many genomes. I will emphasize recent results that show how to optimize simple and complex pan-genome representations for effective avoidance of reference bias. Finally, I will outline promising methods for addressing the bias, including new ideas for how to measure bias, new proposals in compressed indexing, and new workflows that integrate genotype imputation to reduce reference bias.

Sept. 25 Ard Louis
(Oxford)*

Does evolution have an inbuilt bias towards highly compressible phenotypes? 

Darwinian evolution proceeds by natural selection acting on random variation. I will argue that, although mutations are random, the novel phenotypes they produce can be highly biased towards simple or more compressible forms. This bias is so strong that it can dramatically shape the spectrum of adaptive outcomes. The basic intuition follows from an algorithmic twist on the infinite monkey theorem inspired by the fact that natural selection doesn’t act directly on mutations, but rather on the phenotypes that are generated by developmental programmes. If monkeys type at random in a computer language, they are much more likely to generate outputs derived from shorter algorithms. This intuition can be formalised with the coding theorem of algorithmic information theory, predicting that random mutations are exponentially more likely to result in simpler, more compressible phenotypes with low descriptional (Kolmogorov) complexity. Evidence for this evolutionary Occam’s razor can be found in the symmetry in protein complexes [1], and in the simplicity of RNA secondary structures [2], gene regulatory networks, leaf shape, and Richard Dawkins’ biomorphs model of development [3]. This principle may also extend to machine learning, offering insights into why neural networks generalize well on typical datasets [4].

[1] Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution, IG Johnston, et al., PNAS 119 (11), e2113883119 (2022).
[2] Phenotype bias determines how RNA structures occupy the morphospace of all possible shapes, Kamaludin Dingle, Fatme Ghaddar, Petr Sulc, and Ard A. Louis, Molecular Biology and Evolution, 39, msab280 (2021).
[3] Bias in the arrival of variation can dominate over natural selection in Richard Dawkins’s biomorphs, NS Martin, CQ Camargo, AA Louis, PLOS Comp. Bio. 20 (3), e1011893 (2024).
[4] Do deep neural networks have an inbuilt Occam's razor? C Mingard, H Rees, G Valle-Pérez, AA Louis, arXiv preprint arXiv:2304.06670.

Oct. 2 Smita Krishnaswamy
(Yale)*

Inferring and Characterizing Cellular and Neural Dynamics with Geometric and Topological Deep Learning

In the last decade there has been a data revolution in biology with the advent of high-throughput, high-dimensional data modalities such as single-cell RNA-sequencing, fMRI data, molecular structure data and other modalities. A key issue in these data types is that they provide static snapshots of highly dynamic biological entities. In this talk I will cover our work inferring and characterizing cellular and neural dynamics during various processes. First, I will cover how to infer cell state dynamics during differentiation and disease with a neural ODE framework called MIOflow that is regularized with data-geometric and manifold priors. Then I will discuss RITINI, our recent graph ODE network which allows us to learn gene regulation that underlies cellular dynamics, and potentially find new targets for treatments of disease. I will showcase applications of these in triple negative breast cancer and human embryonic stem cell differentiation. Once these dynamics are available, I will showcase tools to quantify and classify these dynamics based on graph signal processing and topological data analysis. This will involve our learnable geometric scattering transform to capture spatial signal patterns, as well as persistent homology and other tools to quantify time-varying patterns. Applications to characterization of brain activity data will be presented.

Oct. 9 Sriram Sankararaman
(UCLA)

Understanding the genetic basis of complex traits from Biobank-scale data: Statistical and Computational challenges

The quest to understand the interplay between evolution, genes and traits has been revolutionized by the collection of rich phenotypic and genetic data across millions of individuals in diverse populations. However, analyses of these Biobank-scale datasets present substantial statistical and computational challenges.

I will describe how we bring together statistical and computational insights to design accurate and highly scalable algorithms for a suite of problems that arise in the analysis of Biobank data: highly scalable randomized inference algorithms to dissect the genetic architecture of complex traits and deep-learning based phenotype imputation to deal with complex patterns of missingness. By applying these methods to about half a million individuals from the UK Biobank, we obtain novel insights into how genetic effects are distributed across the genome, the relative contributions of additive, dominance and gene-environment interaction effects to trait variation, and new genes that confer risk for hard-to-measure diseases.

Oct. 16 Kevin K. Yang
(Microsoft Research)

Deep generative models for protein engineering

Deep generative models are increasingly powerful tools for the in silico design of novel proteins. Recently, a family of generative models called diffusion models has demonstrated the ability to generate biologically plausible proteins that are dissimilar to any actual proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein design space. Here, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.

Oct. 23 Bin Yu
(UC Berkeley)*

Veridical Data Science and PCS Uncertainty Quantification

Data Science is central to AI and has driven most of the recent advances in biomedicine and beyond. Human judgment calls are ubiquitous at every step of the data science life cycle (DSLC). We will introduce Veridical (truthful) Data Science (VDS) based on three core principles of data science: Predictability, Computability and Stability (PCS) to formally take into account the human judgment calls as sources of uncertainty. PCS will be showcased through collaborative research in prostate cancer detection and in seeking genetic drivers of a heart disease. We will end with ongoing research on PCS uncertainty quantification (UQ) that addresses two unconventional but prominent sources of uncertainty in the DSLC: data cleaning and algorithm choices.

Oct. 30 Adam Phillippy
(NIH)

Telomere-to-telomere genome assembly and alignment

In 2022, roughly 20 years after the conclusion of the Human Genome Project, we were finally able to complete the last 8% of the human genome that had been missing from all prior assemblies. Our complete, gapless, “telomere-to-telomere” assembly revealed over 200 Mbp of novel sequence, comprising some of the most repetitive and structurally variable regions of the genome. In addition to new methods for genome sequencing and assembly, these regions have also required new methods for sequence alignment, annotation, and analysis that account for their unique evolutionary properties. I will cover some of the key algorithmic details that have now enabled the routine assembly and analysis of complete human genomes, and the new biology we are uncovering.

Nov. 6 Ava Amini
(Microsoft Research)

Learning the functional consequences of cell state across human cancers

Assessing how alterations in DNA control disease progression and overall cellular function is a core component of cancer biology and has largely driven how we search for and assign therapies. The advent of single-cell RNA-sequencing (scRNA-seq) has reshaped our understanding of human cancers by revealing that tumors are complex systems of interacting cells and exhibit substantial variation in transcriptional states in addition to mutational heterogeneity. Despite the generation of many high-resolution and multimodal single-cell atlases of cancer, we still have a limited understanding of the relative importance and functional consequences of cell state diversity in human malignancy. Addressing this problem is equal parts biology and computer science. In Project Ex Vivo, a joint cancer research collaboration between Microsoft Research and the Broad Institute, we are leveraging the knowledge within a diverse group of computer scientists, experimentalists, clinicians, and computational biologists to better understand the complexity of cell state phenotypes in cancer. I will discuss our efforts to build AI models to better define, model, and therapeutically target cell states in cancer.

Nov. 13 Pranav Rajpurkar
(HMS)

Building Machines That Can Match Doctors in Medical Image Interpretation

Accurate interpretation of medical images is crucial for disease diagnosis and treatment, and AI has the potential to minimize errors, reduce delays, and improve accessibility. The focal point of this presentation lies in a grand ambition: the development of 'Generalist Medical AI' systems that can closely resemble doctors in their ability to reason through a wide range of medical tasks, incorporate multiple data modalities, and communicate in natural language. Starting with pioneering algorithms that have already demonstrated their potential in diagnosing diseases from chest X-rays or electrocardiograms, matching the proficiency of expert radiologists and cardiologists, I will delve into the core challenges and advancements in the field. The discussion will navigate towards the topic of label-efficient AI models: with a scarcity of meticulously annotated data in healthcare, the development of AI systems capable of learning effectively from limited labels has become a key concern. In this vein, I'll delve into how the innovative use of self-supervision and pre-training methods has led to algorithmic advancements that can perform high-level diagnostic tasks using significantly less annotated data. Additionally, I will talk about initiatives in data curation, human-AI collaboration, and the creation of open benchmarks to evaluate the generalizability of medical AI algorithms. In sum, this talk aims to deliver a comprehensive picture of the state of 'Generalist Medical AI,' the advancements made, the challenges faced, and the prospects lying ahead.

Nov. 20 Jesse Bloom
(Fred Hutchinson Cancer Center)*

Interpreting the evolution of SARS-CoV-2 and other viruses

Some human viruses including SARS-CoV-2 and seasonal influenza evolve rapidly to erode antibody immunity. I will discuss how new high-throughput experimental techniques including deep mutational scanning and sequencing-based neutralization assays can be used to understand and to some extent forecast this evolution.

Nov. 27 Aleksandra Walczak
(Ecole Normale Supérieure)*

How personalised is your immune repertoire?

Immune repertoires provide a unique fingerprint reflecting the immune history of individuals, with potential applications in precision medicine. Can this information be used to identify a person uniquely? If it really is a personalised medical record, can it inform us about the outcomes of a COVID-19 infection? I will show how statistical analysis of immune repertoire sequencing experiments can answer these questions.

Dec. 4 Tristan Bepler
(New York Structural Biology Center)*

Breaking scaling laws in protein language models with the protein evolutionary transformer (PoET)

Protein language models (PLMs) are powerful tools for extracting information from large natural protein sequence databases. By learning from the manifold of natural proteins, these models are able to learn structural and functional properties in an unsupervised manner, making them powerful foundation models for protein structure and function prediction. However, these models become increasingly impractical as they scale to more and more parameters, need to be retrained to incorporate new data, and generally lack controllability. In this talk, I'll discuss PoET (the Protein Evolutionary Transformer), a fully generative model of whole protein families as sequences-of-sequences. By reformulating protein language modeling as a family-level, rather than individual protein-level, generative problem, PoET learns to extract evolutionary signatures from extremely small numbers of example sequences to generate novel proteins and model fitness distributions. Through controlling this sequence context, PoET can be prompted to generate proteins from any target distribution of interest while benefiting from learning general principles across the entire natural protein landscape. This enables PoET to achieve state-of-the-art performance for zero-shot variant effect prediction across deep mutational scanning and clinical datasets, without any structure conditioning. Homology-augmented PoET embeddings are state-of-the-art for transfer learning and protein function prediction, enabling accurate function predictors to be trained with 10x less data than alternative foundation models. PoET is also an order of magnitude smaller than other transformer-based PLMs. PoET is open source on GitHub (https://github.com/OpenProteinAI/PoET) and is also available through the OpenProtein.AI web app, where it underpins our protein property prediction and design tools.

Dec. 11 David Van Valen
(Caltech)*

TBA

*Indicates the speaker will be presenting over Zoom. Otherwise, they will be presenting in person.

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by MIT Simons Professor of Mathematics and head of the Computation and Biology group at CSAIL Bonnie Berger. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger: bab@mit.edu

Anna Sappington (TA): asapp@mit.edu

To be added to the seminar's email announcement list, or for any questions about the seminar, please email bioinfo@csail.mit.edu and cc TA Anna Sappington (asapp@mit.edu).

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger (bab@mit.edu) and cc TA Anna Sappington (asapp@mit.edu) for more information.