18.417: lecture 1

Molecular biology

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/



The depressing truth

Ultimately, it all comes down to 3 facts:

  1. All things eventually disappear.
  2. Making copies can delay this.
  3. With limited resources, what is left is that which makes good copies of itself.

To endure, copy. To really endure, copy better.

That's life.

The following patterns copy themselves in the right environment:

 main(){char q=34,n=10,*a="main(){char q=34,n=10,*a=%c%s%c;printf(a,q,a,q,n);}%c";printf(a,q,a,q,n);}

((lambda (h q) (list h (list q h) (list q q))) (quote 
 (lambda (h q) (list h (list q h) (list q q)))) (quote quote)) 

... TGAACTGCAC ACTTTCAGTC CGGTCCTCAC AGTTGAAAAG ACCTAAGCTT GTGCCTGATT ...

The first two use a computer environment.

The third is a substring of the human genome (it is beta globin on chr 11). It uses you.


Biology is the study of life

[picture: a fly]

OK, what does "life" mean?

Cells are either eukaryotic or prokaryotic depending on how their big molecules are organized.


Inside the cell

A prokaryotic cell

A eukaryotic cell

Generally, eukaryotes are much more complicated than prokaryotes.

More on this as we go.


Proteins

Proteins are polymers built from 20 kinds of amino acids.

Sometimes multiple proteins form a protein complex.

Enzymes are proteins or protein complexes that catalyze chemical reactions.


Deoxyribonucleic Acids (DNA)

Polymers formed from 4 nucleotides, named for their bases

  • Adenine
  • Cytosine
  • Guanine
  • Thymine

DNA polymers are long: 1-2 meters per cell for typical mammals

  • <1 million bps for most bacteria (prokaryote)
  • 5 million bps for E. coli (prokaryote)
  • 15 million bps for yeast (eukaryote)
  • 3 billion base-pairs for human/mouse
  • 17 billion base-pairs for Triticum aestivum (bread wheat)

In eukaryotes, many linear chromosomes

In prokaryotes, one circular chromosome


A more schematic DNA

  • Strands come in pairs.
  • Alternating sugar-phosphate backbone
  • A,T,G,C variability in the side groups
  • Bonds between paired nucleotides are hydrogen bonds

For reasons to be made clear later, DNA is read from the 5' end to the 3' end.

5' ATTAGCCCAT 3'
3' TAATCGGGTA 5'
is the string "attagcccat" bonded to its complement "atgggctaat" (each strand read from 5' to 3').
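
The pairing in this example can be sketched as a string operation: complement each base, then reverse, so that both strands are read in the same chemical direction. The function name `reverse_complement` is an assumption made for illustration.

```python
# Watson-Crick pairing for DNA bases: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in the same direction as the input."""
    # Complement every base, then reverse the result, because the two
    # strands of a double helix run in opposite directions.
    return strand.translate(COMPLEMENT)[::-1]


print(reverse_complement("attagcccat"))  # atgggctaat
```

Applying the function twice returns the original string, which is a quick sanity check on any implementation.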

Keeping DNA organized


Ribonucleic Acids (RNA)

Also polymers formed from 4 nucleotides, named for their bases
  • Adenine
  • Cytosine
  • Guanine
  • Uracil

RNA is stable in single-stranded form, though it will H-bond to DNA, other RNA, or itself.

RNA is less stable than double-stranded DNA. It doesn't get very long.

Folded RNA has some functional uses:

  • transfer RNA
  • ribosomes
  • other(???)


DNA replication

DNA strands can replicate

  • A replication fork is created
  • DNA polymerase is the enzyme responsible
  • DNA ligase is a helper enzyme for the lagging strand of the fork

Proteins are the catalysts which make this happen


DNA is transcribed to RNA

RNA-polymerase can produce a complementary RNA strand to some sections of DNA.

The tendency of RNA-polymerase to do this is affected by

  • chromatin structure (in eukaryotes)
  • regulatory proteins

Transcribed RNA goes on to become:

  • messenger RNA (mRNA)
  • ribosomal (rRNA)
  • transfer RNA (tRNA)
  • something we don't know about (?RNA)
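
As a rough sketch, transcription can be modeled as a base-by-base complement in which uracil takes the place of thymine. The function name `transcribe` and the convention that the template strand is supplied 3' to 5' (so the RNA comes out 5' to 3') are assumptions for illustration. Regulation by chromatin and proteins is ignored.

```python
# DNA template base -> RNA base that pairs with it (U replaces T in RNA).
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA complementary to a DNA template strand.

    The template is assumed to be written 3'->5', so the RNA that is
    returned reads 5'->3' without any reversal.
    """
    return template_3_to_5.upper().translate(DNA_TO_RNA)


print(transcribe("TACCTTAAGAGCGAG"))  # AUGGAAUUCUCGCUC
```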


RNA is translated to AA

Proteins get made by

  • RNA moves out from the nucleus to the ribosome
  • At the ribosome, tRNAs bring in amino acids
  • The ribosome and tRNAs catalyze the bonding of AAs

Things are a little more complicated for Eukaryotes


tRNA

tRNA comes in up to 61 flavors (one per non-stop codon)

  • typical stem loop structure
  • specific 3 base anticodon
  • specific tail at 3' for an amino acid


The codon table

DNA coding strand:         ATGGAATTCTCGCTC
mRNA from 5'->3' strand:   AUGGAAUUCUCGCUC
amino acid sequence:       MEFSL

                           Second position of codon
First                                                                          Third
position   T                C                A                G                position
T          TTT Phe [F]      TCT Ser [S]      TAT Tyr [Y]      TGT Cys [C]      T
           TTC Phe [F]      TCC Ser [S]      TAC Tyr [Y]      TGC Cys [C]      C
           TTA Leu [L]      TCA Ser [S]      TAA STOP [end]   TGA STOP [end]   A
           TTG Leu [L]      TCG Ser [S]      TAG STOP [end]   TGG Trp [W]      G

C          CTT Leu [L]      CCT Pro [P]      CAT His [H]      CGT Arg [R]      T
           CTC Leu [L]      CCC Pro [P]      CAC His [H]      CGC Arg [R]      C
           CTA Leu [L]      CCA Pro [P]      CAA Gln [Q]      CGA Arg [R]      A
           CTG Leu [L]      CCG Pro [P]      CAG Gln [Q]      CGG Arg [R]      G

A          ATT Ile [I]      ACT Thr [T]      AAT Asn [N]      AGT Ser [S]      T
           ATC Ile [I]      ACC Thr [T]      AAC Asn [N]      AGC Ser [S]      C
           ATA Ile [I]      ACA Thr [T]      AAA Lys [K]      AGA Arg [R]      A
           ATG Met [M]      ACG Thr [T]      AAG Lys [K]      AGG Arg [R]      G

G          GTT Val [V]      GCT Ala [A]      GAT Asp [D]      GGT Gly [G]      T
           GTC Val [V]      GCC Ala [A]      GAC Asp [D]      GGC Gly [G]      C
           GTA Val [V]      GCA Ala [A]      GAA Glu [E]      GGA Gly [G]      A
           GTG Val [V]      GCG Ala [A]      GAG Glu [E]      GGG Gly [G]      G
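
The codon table is just a lookup from 3-letter strings to amino acids, so translation can be sketched in a few lines. This is a minimal illustration, not a full translator: the names `CODON_TABLE` and `translate` are assumptions, the input is taken to be a DNA coding strand already in the right reading frame, and "*" stands in for STOP.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes in table order: first position varies slowest,
# third position fastest, each running through T, C, A, G. "*" marks STOP.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(coding_strand: str) -> str:
    """Translate a DNA coding strand codon by codon until a STOP codon."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(coding_strand) - 2, 3):
        amino_acid = CODON_TABLE[coding_strand[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGAATTCTCGCTC"))  # MEFSL
```

Working on the coding strand with T rather than on mRNA with U keeps the keys identical to the table as printed; replacing T with U in the keys would give the mRNA version.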

Splicing (eukaryotes)

In Eukaryotes, mRNA is modified before leaving the nucleus.

Splicing is the most significant of these modifications.

Additional ornamentation is added to the mRNA before it leaves the nucleus.


Gene structure

The structure of a general eukaryotic gene is a sequence of control regions, followed by coding regions.

Control regions are sites where promoter or inhibitor proteins might bind.

The concatenation of the coding regions is the final product which goes to translation.


Cellular reproduction (mitosis)

We will talk about meiosis and its effects later.


But reproductive fidelity is not 100%

During DNA replication, occasional changes occur.

ATCCCCA ------> ATCTCCA

insertion:      vw  --> vXw
deletion:       vXw --> vw
substitution:   vXw --> vYw
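
Writing a sequence as a prefix v, a changed piece X, and a suffix w, the three point mutations are simple string edits. The sketch below uses hypothetical helper names (`insert`, `delete`, `substitute`) and 0-based positions. It is only meant to make the notation concrete.

```python
def insert(seq: str, pos: int, x: str) -> str:
    """vw -> vXw: place x between seq[:pos] and seq[pos:]."""
    return seq[:pos] + x + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """vXw -> vw: drop `length` characters starting at pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, y: str) -> str:
    """vXw -> vYw: overwrite len(y) characters starting at pos with y."""
    return seq[:pos] + y + seq[pos + len(y):]


print(substitute("ATCCCCA", 3, "T"))  # ATCTCCA
print(insert("ATCCCCA", 3, "G"))      # ATCGCCCA
print(delete("ATCCCCA", 3))           # ATCCCA
```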

Other kinds of mutations:

  • large (5-10bp) indels
  • gene conversion
  • translocation
  • transpositions

Effects of mutations


Change, competition, diversity


The history of a string

Variations in the example gene across different species

Clearly the strategy of creating cows, chickens, and carp has been a successful one for the beta globin pattern.


Bioinformatics/computational biology

Computational Biology Central Dogma:

Sequence --> Structure --> Function

  • DNA sequence databases keep growing
  • About 70 species have been or are being sequenced
  • The data currently staggers most computational resources
  • It is believed that the data has value





Where do computational biology problems come from?

Supplement for experiment

Science is driven by observations.

Observations can include

  • a nucleotide sequence at a given location
  • the binding energy of a specific molecule to another
  • the expression and relative amount of a protein in a cell
  • whether a nucleotide sequence is a gene

Sometimes experimental devices fall short of the observation desired.

Computational techniques serve to supply hidden data.

Inferences

With sufficiently advanced technology, we can observe any fact.

Science draws inferences from observations.

Typical inferences include

  • two sequences are more closely related than another two
  • some region of the genome is undergoing selection
  • a set of proteins interact to support some complex function
  • generalizations and categorization of facts
  • inferences of past facts (evolutionary history)


What is computer science?

The study of automation:

  • What can a machine do? (will it stop with the right answer?)
  • How well can a machine do it? (will it stop today?)

Model machines

  • regular expressions
  • context free grammars
  • Turing machines
  • lambda calculus
It is no coincidence that many of these models are linguistic.

Originating in mathematics

  • deductive
  • abstract
This has an effect on interactions with biology, which is inductive and suspicious.

What does abstract mean?

Tends to consider the details less than the generality

What does suspicious mean?

Once, when trying to be general, a detail bit him on the ass.

What does deductive mean?

Truth can be derived from other true things.
In theory there is no difference between theory and practice.

What does inductive mean?

Truth must be demonstrated repeatedly.
In practice, there is.


Complexity

The second question of CS, that of complexity, plays a central role in applications to other disciplines.

Computer science provides complexity theory, which is about what can and can't be done practically.

Its statements are usually asymptotic, phrased in terms of limiting behaviors: e.g. as input size gets large.

  • a simplifying reduction: reason about the largest cost
  • hardware independence: requires few specifics of the actual machine

Contingencies

  • what descriptors describe the difficulty of the problem?
  • are we in the asymptotic regime of those descriptors?
  • can hardware (e.g. memory hierarchy) be neglected/simplified?

O(n P(log(n)))

n times a polynomial in the logarithm of n.

  • Runtime very roughly proportional to database size
  • Most string matching problems need to run here

O(2^P(n))

2 to some power which is a polynomial of n.

  • the hard problems
  • not really hard unless n is large
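
To get a feel for the gap between the two regimes, the sketch below tabulates one representative of each. The specific choices are assumptions for illustration only: P(x) = x^2 for the first class, giving n (log n)^2, and P(n) = n for the second, giving 2^n.

```python
import math


def quasilinear(n: int) -> float:
    """n * P(log n) with the assumed polynomial P(x) = x**2."""
    return n * math.log2(n) ** 2


def exponential(n: int) -> int:
    """2 ** P(n) with the assumed polynomial P(n) = n."""
    return 2 ** n


# Print both cost estimates side by side for a few input sizes.
for n in (10, 20, 40, 80):
    print(f"n={n:3d}  n*log2(n)^2={quasilinear(n):10.0f}  2^n={exponential(n)}")
```

For n = 10 the exponential cost is only 1024 steps, which illustrates why such problems are not really hard unless n is large. By n = 80 the two columns differ by many orders of magnitude.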


What this class should be about

This class is intended to survey some of the leading results in computational biology.

This class will emphasize sequence analysis over structural analysis.

After taking this course, you should be able to: