18.417: lecture 1
Molecular biology
instructor: Ross A. Lippert
http://www-math.mit.edu/~lippert/18.417/

The depressing truth
Ultimately, it all comes down to 3 facts:
- All things eventually disappear.
- Making copies can delay this.
- With limited resources, what is left is that which makes good copies of itself.
To endure, copy. To really endure, copy better.
That's life.
The following patterns copy themselves in the right environment:
main(){char q=34,n=10,*a="main(){char q=34,n=10,*a=%c%s%c;printf(a,q,a,q,n);}%c";printf(a,q,a,q,n);}
((lambda (h q) (list h (list q h) (list q q))) (quote
(lambda (h q) (list h (list q h) (list q q)))) (quote quote))
... TGAACTGCAC ACTTTCAGTC CGGTCCTCAC AGTTGAAAAG ACCTAAGCTT GTGCCTGATT ...
The first two use a computer environment.
The third is a substring of the human genome (it is from beta globin on chr 11). It uses you.
Biology is the study of life
OK, what does "life" mean?
- An organism is something alive.
- Organisms are composed of cells.
- Some organisms are multicellular and some are not.
- Cells are composed of
  - water (70%)
  - big molecules (23%)
  - small molecules (7%)
- Big molecules are
  - polymers of nucleotides (nucleic acids: DNA or RNA)
  - polymers of amino acids (proteins)
  - polysaccharides and lipids (which I know nothing about)
- Small molecules are
  - free-floating bits of DNA, RNA, AA
  - salts, ions, pharmaceuticals, etc.
Cells are either eukaryotic or prokaryotic depending on how their big molecules are organized.
Inside the cell
[Figures: a prokaryotic cell; a eukaryotic cell]
Generally, eukaryotes are much more complicated than prokaryotes.
- bigger
- specialized compartments/membrane-bound organelles
- capable of forming multicellular organisms
- DNA molecules are in chromosomal pairs
More on this as we go.
Proteins
Proteins are polymers built from 20 kinds of amino acids. Sometimes multiple proteins form a protein complex. Enzymes are proteins/protein complexes which do things: they catalyze chemical reactions.
Deoxyribonucleic Acids (DNA)
Polymers formed from 4 nucleotides, named for their bases:
- Adenine
- Cytosine
- Guanine
- Thymine
DNA molecules are long. Stretched out, long means 1-2 meters for typical mammals. Genome sizes in base pairs (bps):
- <1 million bps for most bacteria (prokaryote)
- 5 million bps for E. coli (prokaryote)
- 15 million bps for yeast (eukaryote)
- 3 billion bps for human/mouse
- 17 billion bps for Triticum aestivum (wheat)
In eukaryotes, many linear chromosomes.
In prokaryotes, one circular chromosome.
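The "1-2 meters" figure can be checked with rough arithmetic. The sketch below assumes a rise of about 0.34 nm per base pair along the double helix, which is a textbook value for B-form DNA and not something stated in these slides; the genome sizes are the rounded ones quoted above.

```python
# Rise per base pair along the helix, in meters (about 0.34 nm).
# This constant is an assumption taken from textbooks, not from the slides.
RISE_PER_BP_M = 0.34e-9

# Rounded genome sizes in base pairs, as quoted in the list above.
GENOME_BP = {
    "E. coli": 5e6,
    "yeast": 15e6,
    "human/mouse": 3e9,
    "Triticum aestivum": 17e9,
}


def stretched_length_m(base_pairs):
    """Length in meters of a double helix with this many base pairs."""
    return base_pairs * RISE_PER_BP_M


for name, bp in GENOME_BP.items():
    print(f"{name}: about {stretched_length_m(bp):.3g} m")
```

Under this assumption one copy of the human genome comes to roughly 1 meter, and a cell carrying its chromosomes in pairs holds about twice that.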
A more schematic DNA
- Strands come in pairs.
- Alternating sugar-phosphate backbone
- A,T,G,C variability in the side groups
- Bonds between paired nucleotides are hydrogen bonds
For reasons to be made clear later, DNA is read from the 3' end to the 5' end.
3' ATTAGCCCAT 5'
5' TAATCGGGTA 3'
is the string "attagcccat" bonded to its complement "atgggctaat".
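The pairing in this example can be reproduced mechanically: swap each base for its partner (A with T, G with C), then reverse the result so that the second strand is read in the same chemical direction as the first. A minimal sketch, with function names chosen here for illustration:

```python
# Watson-Crick pairing: A bonds with T, G bonds with C.
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def complement(strand):
    """Base-by-base partner strand, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Partner strand read in its own direction, i.e. the complement reversed."""
    return complement(strand)[::-1]


print(complement("attagcccat"))          # the strand drawn underneath
print(reverse_complement("attagcccat"))  # the same strand read end to end
```

The first call gives the lower strand as drawn on the slide, and the second gives the complement string quoted in the text.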
Keeping DNA organized

Ribonucleic Acids (RNA)
Also polymers formed from 4 nucleotides:
- Adenine
- Cytosine
- Guanine
- Uracil
RNA is stable in single-stranded form, though it will H-bond to DNA, other RNA, or itself. RNA is less stable than double-stranded DNA. It doesn't get very long. Folded RNA has some functional uses:
- transfer RNA
- ribosomes
- other(???)
DNA replication
DNA strands can replicate:
- A replication fork is created
- DNA polymerase is the enzyme responsible
- DNA ligase is a helper enzyme for the lagging part of the fork
Proteins are the catalysts which make this happen.
DNA is transcribed to RNA
RNA-polymerase can produce a complementary RNA strand to some sections of DNA. The tendency of RNA-polymerase to do this is affected by
- chromatin structure (in eukaryotes)
- regulatory proteins
Transcribed RNA goes on to become:
- messenger RNA (mRNA)
- ribosomal RNA (rRNA)
- transfer RNA (tRNA)
- something we don't know about (?RNA)
RNA is translated to AA
Proteins get made in these steps:
- RNA moves out from the nucleus to the ribosome
- At the ribosome, tRNAs bring in amino acids
- Ribosome/tRNAs catalyze the bonding of AAs
Things are a little more complicated for eukaryotes.
tRNA
tRNA comes in almost 61 flavors:
- typical stem loop structure
- specific 3 base anticodon
- specific tail at 3' for an amino acid
The codon table
| ATGGAATTCTCGCTC | DNA coding strand |
| AUGGAAUUCUCGCUC | mRNA from 5'->3' strand |
| MEFSL | amino acid sequence |
Rows are grouped by first position, columns give the second position, and the four rows within each group give the third position.
| First position | Second: T | Second: C | Second: A | Second: G | Third position |
| T | TTT Phe [F] | TCT Ser [S] | TAT Tyr [Y] | TGT Cys [C] | T |
| T | TTC Phe [F] | TCC Ser [S] | TAC Tyr [Y] | TGC Cys [C] | C |
| T | TTA Leu [L] | TCA Ser [S] | TAA STOP [end] | TGA STOP [end] | A |
| T | TTG Leu [L] | TCG Ser [S] | TAG STOP [end] | TGG Trp [W] | G |
| C | CTT Leu [L] | CCT Pro [P] | CAT His [H] | CGT Arg [R] | T |
| C | CTC Leu [L] | CCC Pro [P] | CAC His [H] | CGC Arg [R] | C |
| C | CTA Leu [L] | CCA Pro [P] | CAA Gln [Q] | CGA Arg [R] | A |
| C | CTG Leu [L] | CCG Pro [P] | CAG Gln [Q] | CGG Arg [R] | G |
| A | ATT Ile [I] | ACT Thr [T] | AAT Asn [N] | AGT Ser [S] | T |
| A | ATC Ile [I] | ACC Thr [T] | AAC Asn [N] | AGC Ser [S] | C |
| A | ATA Ile [I] | ACA Thr [T] | AAA Lys [K] | AGA Arg [R] | A |
| A | ATG Met [M] | ACG Thr [T] | AAG Lys [K] | AGG Arg [R] | G |
| G | GTT Val [V] | GCT Ala [A] | GAT Asp [D] | GGT Gly [G] | T |
| G | GTC Val [V] | GCC Ala [A] | GAC Asp [D] | GGC Gly [G] | C |
| G | GTA Val [V] | GCA Ala [A] | GAA Glu [E] | GGA Gly [G] | A |
| G | GTG Val [V] | GCG Ala [A] | GAG Glu [E] | GGG Gly [G] | G |
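The codon table is small enough to write down as a lookup, which makes the ATGGAATTCTCGCTC example easy to check. The sketch below is one possible encoding, not a standard library API: the 64 one-letter codes are listed in the table's own order (first, then second, then third position, each running over T, C, A, G), with "*" standing in for STOP. The function names are chosen here for illustration.

```python
BASES = "TCAG"

# One-letter amino acid codes in table order; '*' marks a STOP codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA codon (coding strand) to its amino acid.
CODON_TABLE = {
    first + second + third: AMINO[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_dna):
    """mRNA with the same sequence as the coding strand, with U in place of T."""
    return coding_dna.replace("T", "U")


def translate(coding_dna):
    """Read codons from the first base and stop at the first STOP codon."""
    protein = []
    for start in range(0, len(coding_dna) - 2, 3):
        amino = CODON_TABLE[coding_dna[start:start + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(transcribe("ATGGAATTCTCGCTC"))
print(translate("ATGGAATTCTCGCTC"))
```

This sketch always starts reading at the first base; finding the right starting ATG in a longer sequence is a separate problem.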
Splicing (eukaryotes)
In eukaryotes, mRNA is modified before leaving the nucleus. Splicing is the most significant of these modifications. Additional ornamentation is added to the mRNA before it leaves the nucleus.
Gene structure
The structure of a general eukaryotic gene is a sequence of control regions, followed by coding regions.
Control regions are sites where promoter or inhibitor proteins might bind.

The concatenation of the coding regions is the final product which goes to translation.
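As a string operation, producing the final product is just cutting out the coding regions and gluing them together. The sketch below assumes the coding regions are given as half-open (start, end) index pairs; the toy sequence and its coordinates are invented for illustration and do not come from a real gene.

```python
def splice(transcript, coding_regions):
    """Concatenate the coding regions, given as half-open (start, end) pairs."""
    return "".join(transcript[start:end] for start, end in coding_regions)


# Invented toy transcript: coding piece, non-coding filler, coding piece.
first_piece = "ATGGAA"
filler = "GTAAGTTTTCAG"
second_piece = "TTCTCGCTC"
toy_transcript = first_piece + filler + second_piece

# Hypothetical coordinates of the two coding pieces within the toy transcript.
toy_regions = [(0, 6), (18, 27)]

print(splice(toy_transcript, toy_regions))
```

In practice the hard part is finding the coordinates, not applying them.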
Cellular reproduction (mitosis)
We will talk about meiosis and its effects later.
But reproductive fidelity is not 100%
During DNA replication, occasional changes occur, e.g. ATCCCCA ------> ATCTCCA
| insertion | vw --> vXw |
| deletion | vXw --> vw |
| substitution | vXw --> vYw |
Other kinds of mutations:
- large (5-10bp) indels
- gene conversion
- translocation
- transpositions
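The three small mutation types are simple string edits. A minimal sketch, with function names and the 0-based position convention chosen here for illustration:

```python
def insert(seq, pos, new):
    """Insertion: place the string new between seq[:pos] and seq[pos:]."""
    return seq[:pos] + new + seq[pos:]


def delete(seq, pos, length=1):
    """Deletion: drop length bases starting at pos."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq, pos, new):
    """Substitution: overwrite len(new) bases starting at pos with new."""
    return seq[:pos] + new + seq[pos + len(new):]


original = "ATCCCCA"
print(substitute(original, 3, "T"))  # the change shown in the example above
print(insert(original, 3, "G"))
print(delete(original, 3))
```

The substitution call reproduces the ATCCCCA to ATCTCCA change from the example.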
Effects of mutations
- Beneficial: new functionalities arise, resistance to malaria or HIV
- Detrimental: death, debilitating or degenerative disease
- Neutral: not in a coding or control region -- the majority case!
Change, competition, diversity

The history of a string
Variations in the example gene across different species
Clearly the strategy of creating cows, chickens, and carp has been a successful one for the beta globin pattern.

Bioinformatics/computational biology
Computational Biology Central Dogma: Sequence --> Structure --> Function
- DNA sequence databases keep growing
- About 70 species have been or are being sequenced
- The data currently staggers most computational resources
- It is believed that the data has value



Where do computational biology problems come from?
Supplement for experiment
Science is driven by observations. Observations can include
- a nucleotide sequence at a given location
- the binding energy of a specific molecule to another
- the expression and relative amount of a protein in a cell
- whether a nucleotide sequence is a gene
Sometimes experimental devices fall short of the observation desired. Computational techniques serve to supply hidden data.
Inferences
With sufficiently advanced technology, we can observe any fact. Science draws inferences from observations. Typical inferences include
- two sequences are more closely related than another two
- some region of the genome is undergoing selection
- a set of proteins interact to support some complex function
- generalizations and categorization of facts
- inferences of past facts (evolutionary history)
What is computer science?
The study of automation:
- What can a machine do? (will it stop with the right answer?)
- How well can a machine do it? (will it stop today?)
Model machines:
- regular expressions
- context free grammars
- Turing machines
- lambda calculus
It is no coincidence that many of these models are linguistic.
Originating in mathematics, computer science is abstract and deductive. This has an effect on interactions with biology, which is inductive and suspicious.
What does abstract mean? Tends to consider the details less than the generality.
What does suspicious mean? Once, when trying to be general, a detail bit him on the ass.
What does deductive mean? Truth can be derived from other true things. In theory there is no difference between theory and practice.
What does inductive mean? Truth must be demonstrated repeatedly. In practice, there is.
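To make the "linguistic" point concrete, here is the simplest of the model machines, a regular expression, applied to a DNA string. The pattern is a sketch written for this note rather than a real gene finder: it describes an ATG, then as few whole codons as possible, then one of the three STOP codons from the codon table. The test sequence is made up.

```python
import re

# ATG, then a minimal run of whole codons, then a STOP codon (TAA, TAG, TGA).
# The lazy quantifier *? makes the match end at the first in-frame STOP.
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def first_orf(dna):
    """Return the leftmost ATG-to-STOP match, or None if there is none."""
    match = ORF_PATTERN.search(dna)
    return match.group(0) if match else None


print(first_orf("CCATGGAATTCTCGCTCTAAGG"))
print(first_orf("CCCCCC"))
```

Regular expressions like this can describe local patterns but cannot count or nest, which is where the richer models in the list come in.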
Complexity
The second question of CS, that of complexity, plays a central role in applications to other disciplines. Computer science provides Complexity Theory, which is about what can and can't be done practically. Its statements are usually asymptotic, phrased in terms of limiting behaviors, e.g. as input size gets large.
- a simplifying reduction: reason about the largest cost
- hardware independence: requires few specifics of the actual machine
Contingencies:
- what descriptors describe the difficulty of the problem?
- are we in the asymptotic regime of those descriptors?
- can hardware (e.g. memory hierarchy) be neglected/simplified?
O(n P(log(n))): n times a polynomial in the logarithm of n.
- Runtime very roughly proportional to database size
- Most string matching problems need to run here
O(2^P(n)): 2 to some power which is a polynomial of n.
- the hard problems
- not really hard unless n is large
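A few numbers show how far apart these two classes are. The sketch below takes the simplest possible polynomial, P(x) = x, in both expressions; that choice is an assumption made for illustration, so the functions compute n log2(n) and 2^n.

```python
import math


def quasi_linear(n):
    """n * P(log n) with the illustrative choice P(x) = x, using log base 2."""
    return n * math.log2(n)


def exponential(n):
    """2 ** P(n) with the illustrative choice P(x) = x."""
    return 2 ** n


for n in (10, 20, 40, 80):
    print(f"n={n}: n*log2(n) = {quasi_linear(n):.0f}, 2^n = {exponential(n)}")
```

For small n the exponential column is still manageable, which is the sense in which the hard problems are "not really hard unless n is large".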
What this class should be about
This class is intended to survey some of the leading results in computational biology.
- some examples will be too simple
- some examples will be outdated
- all of them illustrate important techniques still in use
This class will emphasize sequence analysis over structural analysis.
After taking this course, you should be able to:
- read the current literature
- converse in enough biology to be useful to a biologist