3

Brute force and motif finding

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/

ctgttgaaagtatacaacatgtaagtctgttcatcttttcgtatcaatcg
tatcgcgctaaaaattatggtagttactaacgtagtggtatacataatgt
caactgccgcatataatggatttgcctagtgcttgaacggaggtgcaatg
atgtcaaaacgcctgaattaattggtaatactatagcggtgcggacccta
ttaagtattaggtgcgtaacctctcagggttgccgcccggttttatcctt
tgtgtaatagcctttttagagtcgaccgttcctcgtcacgcgtaaaattt
gtatgaatcctcgttggtttgtgggacgaccctttgtctatagtataaca
ccccggcaagttctaatcggctgtcagctactactatcctgggcgaacag
tgaaggcgtcgcgagtcttatgggtcaaatggccgaataaaacaatctta
tgagaggtctgtagacgacgattcgctgtcttatttgcccgccaagtaag


Lac operon in E. coli


Trp operon in E. coli


Trp operon attenuation


Eukaryotic regulation


Eukaryotic regulation

But don't forget:


Some known transcription factors

Factor | Sequence Motif | Comments
c-Myc and Max | CACGTG | c-Myc first identified as retroviral oncogene; Max specifically associates with c-Myc in cells
c-Fos and c-Jun | TGA[CG]T[CA]A | both first identified as retroviral oncogenes; associate in cells, also known as the factor AP-1
CREB | TGACG[CT][CA][GA] | binds to the cAMP response element; family of at least 10 factors resulting from different genes or alternative splicing; can form dimers with c-Jun
c-ErbA; also TR (thyroid hormone receptor) | GTGTCAAAGGTCA | first identified as retroviral oncogene; member of the steroid/thyroid hormone receptor superfamily; binds thyroid hormone
c-Ets | [GC][AC]GGA[AT]G[TC] | first identified as retroviral oncogene; predominates in B- and T-cells
GATA | [TA]GATA | family of erythroid cell-specific factors, GATA-1 to -6
c-Myb | [TC]AAC[GT]G | first identified as retroviral oncogene; hematopoietic cell-specific factor
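
The bracket notation in the motif column is already valid regular-expression syntax, so known motifs can be located by a plain pattern scan. The following is a minimal sketch, assuming upper-case DNA and a search on the given strand only; the function name `scan` and the example sequence are made up for illustration and do not come from the lecture.

```python
import re

# Motif patterns copied from the table above; [CG] means "C or G".
MOTIFS = {
    "c-Myc and Max": "CACGTG",
    "c-Fos and c-Jun": "TGA[CG]T[CA]A",
    "CREB": "TGACG[CT][CA][GA]",
    "c-ErbA": "GTGTCAAAGGTCA",
    "c-Ets": "[GC][AC]GGA[AT]G[TC]",
    "GATA": "[TA]GATA",
    "c-Myb": "[TC]AAC[GT]G",
}


def scan(sequence):
    """Return {factor: [0-based start positions]} for every motif that occurs."""
    sequence = sequence.upper()
    hits = {}
    for factor, pattern in MOTIFS.items():
        # finditer reports non-overlapping matches, left to right.
        starts = [m.start() for m in re.finditer(pattern, sequence)]
        if starts:
            hits[factor] = starts
    return hits


# Hypothetical sequence, invented for this illustration.
example = "GGCACGTGTTTGACTCAAGATAG"
hits = scan(example)
print(hits)
# {'c-Myc and Max': [2], 'c-Fos and c-Jun': [10], 'GATA': [17]}
```

A scan like this only finds motifs that are already known; it does not check the reverse complement and reports non-overlapping matches only.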

Computational prediction of TFBS

upstream regions:

ctgttgaaagtatacaacatgtaagtctgttcatcttttcgtatcaatcg
tatcgcgctaaaaattatggtagttactaacgtagtggtatacataatgt
caactgccgcatataatggatttgcctagtgcttgaacggaggtgcaatg
atgtcaaaacgcctgaattaattggtaatactatagcggtgcggacccta
ttaagtattaggtgcgtaacctctcagggttgccgcccggttttatcctt
tgtgtaatagcctttttagagtcgaccgttcctcgtcacgcgtaaaattt
gtatgaatcctcgttggtttgtgggacgaccctttgtctatagtataaca
ccccggcaagttctaatcggctgtcagctactactatcctgggcgaacag
tgaaggcgtcgcgagtcttatgggtcaaatggccgaataaaacaatctta
tgagaggtctgtagacgacgattcgctgtcttatttgcccgccaagtaag

and a motif with logo


Motif finding

1/2*gtatca+1/2*gtatac
1*gtatac
1*atataa
1/4*ctatag+1/4*gtaata+1/4*gaatta+1/4*gtcaaa
1*gtatta
1/2*gtaaaa+1/2*gtgtaa
1*gtataa
1*ttctaa
1*gaataa
1/2*gtctta+1/2*gtagac

Collect the stats into a profile matrix

A   1      5/4    31/4   1      15/2   31/4
C   1/4    0      7/4    0      1/2    2
G   31/4   0      1/2    1/2    0      1/4
T   1      35/4   0      17/2   2      0

Consensus: gtataa

Consensus score: 48
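
The step from the weighted substrings to the profile matrix can be sketched as follows. This is an illustration under assumptions: the weights are taken from the list under "Motif finding", exact fractions are used so the entries stay readable, and the function names `weighted_profile` and `consensus_and_score` are hypothetical.

```python
from fractions import Fraction

# Weighted candidate substrings, one inner list per upstream sequence.
weighted = [
    [("gtatca", Fraction(1, 2)), ("gtatac", Fraction(1, 2))],
    [("gtatac", Fraction(1))],
    [("atataa", Fraction(1))],
    [("ctatag", Fraction(1, 4)), ("gtaata", Fraction(1, 4)),
     ("gaatta", Fraction(1, 4)), ("gtcaaa", Fraction(1, 4))],
    [("gtatta", Fraction(1))],
    [("gtaaaa", Fraction(1, 2)), ("gtgtaa", Fraction(1, 2))],
    [("gtataa", Fraction(1))],
    [("ttctaa", Fraction(1))],
    [("gaataa", Fraction(1))],
    [("gtctta", Fraction(1, 2)), ("gtagac", Fraction(1, 2))],
]


def weighted_profile(weighted, length):
    """Accumulate each substring's weight into a base-by-position table."""
    profile = {base: [Fraction(0)] * length for base in "acgt"}
    for candidates in weighted:
        for word, weight in candidates:
            for i, base in enumerate(word):
                profile[base][i] += weight
    return profile


def consensus_and_score(profile, length):
    """Pick the heaviest base per column; the score is the sum of those weights."""
    consensus = ""
    score = Fraction(0)
    for i in range(length):
        best = max("acgt", key=lambda b: profile[b][i])
        consensus += best
        score += profile[best][i]
    return consensus, score


profile = weighted_profile(weighted, 6)
consensus, score = consensus_and_score(profile, 6)
print(consensus, score)  # gtataa 48
```

Every column sums to 10 because each of the ten sequences contributes a total weight of 1.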

One approach: finding similar substrings

---gtatac---
---atataa---
---gaataa---
---gtctta---

Find substrings with the 'best' profile

Best profile = best CONSENSUS
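
To make "best consensus" concrete, here is a small sketch that scores one choice of aligned substrings: count the bases in each column, take the most frequent base as the consensus letter, and add up those top counts. The function name `consensus` and the alphabetical tie-breaking are assumptions for illustration; the four substrings are the ones shown above.

```python
from collections import Counter


def consensus(substrings):
    """Return (consensus word, consensus score) for equal-length substrings."""
    word = ""
    score = 0
    for column in zip(*substrings):
        counts = Counter(column)
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        base, count = max(sorted(counts.items()), key=lambda kv: kv[1])
        word += base
        score += count
    return word, score


chosen = ["gtatac", "atataa", "gaataa", "gtctta"]
word, score = consensus(chosen)
print(word, score)  # gtataa 19
```

The score ranges from roughly a quarter of the number of letters (no agreement) up to the number of substrings times their length (identical substrings); here the maximum would be 4 * 6 = 24.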


Consensus


Select substrings

0|ctgttgaaagtatacaacatgtaagtctgttcatcttttcgtatcaatcg
0|tatcgcgctaaaaattatggtagttactaacgtagtggtatacataatgt
0|caactgccgcatataatggatttgcctagtgcttgaacggaggtgcaatg
0|atgtcaaaacgcctgaattaattggtaatactatagcggtgcggacccta
0|ttaagtattaggtgcgtaacctctcagggttgccgcccggttttatcctt
0|tgtgtaatagcctttttagagtcgaccgttcctcgtcacgcgtaaaattt
0|gtatgaatcctcgttggtttgtgggacgaccctttgtctatagtataaca
0|ccccggcaagttctaatcggctgtcagctactactatcctgggcgaacag
0|tgaaggcgtcgcgagtcttatgggtcaaatggccgaataaaacaatctta
0|tgagaggtctgtagacgacgattcgctgtcttatttgcccgccaagtaag

Profile/consensus

A   3   1   2   3   4   3
C   1   1   4   2   0   2
G   2   1   1   3   2   2
T   4   7   3   2   4   3

Consensus: ttcaaa

Consensus score: 25

Motif finding problem

Approach by brute force
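
A brute-force search over starting positions can be sketched as below. This is an illustrative sketch, not necessarily the lecture's exact formulation: it tries every combination of one start position per sequence and keeps the combination with the highest consensus score. The function names and the toy sequences are hypothetical.

```python
from itertools import product


def consensus_score(substrings):
    """Sum over columns of the count of the most frequent base."""
    return sum(max(column.count(base) for base in "acgt")
               for column in zip(*substrings))


def brute_force_motif_search(sequences, length):
    """Try every tuple of start positions; return (best score, best starts)."""
    ranges = [range(len(s) - length + 1) for s in sequences]
    best_score, best_starts = -1, None
    for starts in product(*ranges):
        words = [s[i:i + length] for s, i in zip(sequences, starts)]
        score = consensus_score(words)
        if score > best_score:  # keep the first best combination found
            best_score, best_starts = score, starts
    return best_score, best_starts


# Toy input, invented for this illustration: "acgt" occurs in every sequence.
toy = ["acgtacgt", "ttacgtaa", "cacgtgga"]
best_score, best_starts = brute_force_motif_search(toy, 4)
print(best_score, best_starts)  # 12 (0, 2, 1)
```

With t sequences of length n and motif length l, the loop visits (n - l + 1)^t combinations. For the ten 50-letter upstream regions and l = 6 that is 45^10, about 3.4 * 10^16, which is why this direct search is impractical beyond toy inputs.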

What if we guess the profile string?

ttcaaa
1|ctgttgaaagtatacaacatgtaagtctgttcatcttttcgtatcaatcg
2|tatcgcgctaaaaattatggtagttactaacgtagtggtatacataatgt
2|caactgccgcatataatggatttgcctagtgcttgaacggaggtgcaatg
1|atgtcaaaacgcctgaattaattggtaatactatagcggtgcggacccta
3|ttaagtattaggtgcgtaacctctcagggttgccgcccggttttatcctt
2|tgtgtaatagcctttttagagtcgaccgttcctcgtcacgcgtaaaattt
3|gtatgaatcctcgttggtttgtgggacgaccctttgtctatagtataaca
1|ccccggcaagttctaatcggctgtcagctactactatcctgggcgaacag
1|tgaaggcgtcgcgagtcttatgggtcaaatggccgaataaaacaatctta
3|tgagaggtctgtagacgacgattcgctgtcttatttgcccgccaagtaag

Profile/consensus

A   2/7        29/28      61/42     1011/140   3163/420   788/105
C   47/60      12/35      381/70    1/5        107/210    17/21
G   223/70     16/15      23/14     107/210    31/30      11/30
T   2413/420   3173/420   307/210   869/420    389/420    277/210

Consensus: ttcaaa

Consensus score: 41

Note!

1+2+2+1+3+2+3+1+1+3 = 19
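
The sum above can be reproduced by taking, for each upstream sequence, the smallest Hamming distance between the guessed string and any substring of the same length. The sketch below does that for the guess ttcaaa; the function names are hypothetical, and the sequences are copied from the slide.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(word, sequence):
    """Smallest Hamming distance between word and any window of sequence."""
    k = len(word)
    return min(hamming(word, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))


upstream = [
    "ctgttgaaagtatacaacatgtaagtctgttcatcttttcgtatcaatcg",
    "tatcgcgctaaaaattatggtagttactaacgtagtggtatacataatgt",
    "caactgccgcatataatggatttgcctagtgcttgaacggaggtgcaatg",
    "atgtcaaaacgcctgaattaattggtaatactatagcggtgcggacccta",
    "ttaagtattaggtgcgtaacctctcagggttgccgcccggttttatcctt",
    "tgtgtaatagcctttttagagtcgaccgttcctcgtcacgcgtaaaattt",
    "gtatgaatcctcgttggtttgtgggacgaccctttgtctatagtataaca",
    "ccccggcaagttctaatcggctgtcagctactactatcctgggcgaacag",
    "tgaaggcgtcgcgagtcttatgggtcaaatggccgaataaaacaatctta",
    "tgagaggtctgtagacgacgattcgctgtcttatttgcccgccaagtaag",
]

distances = [min_distance("ttcaaa", s) for s in upstream]
total = sum(distances)
print(distances, total)  # [1, 2, 2, 1, 3, 2, 3, 1, 1, 3] 19
```

The per-sequence values match the numbers printed in front of each sequence on the slide.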

Median string problem

Approach by brute force
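
A brute-force median string search can be sketched as below: enumerate every string of the chosen length over {a, c, g, t} and keep the one whose total distance to the sequences is smallest. This is a minimal sketch under assumptions; the function names and the toy sequences are hypothetical, and ties are resolved in favour of the first candidate in enumeration order.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_distance(word, sequences):
    """Sum over sequences of the best (smallest) window distance to word."""
    k = len(word)
    return sum(min(hamming(word, s[i:i + k]) for i in range(len(s) - k + 1))
               for s in sequences)


def median_string(sequences, length):
    """Try all 4**length candidate words; return (best word, its total distance)."""
    best_word, best_total = None, None
    for letters in product("acgt", repeat=length):
        word = "".join(letters)
        total = total_distance(word, sequences)
        if best_total is None or total < best_total:
            best_word, best_total = word, total
    return best_word, best_total


# Toy input, invented for this illustration: "acgt" occurs in every sequence.
toy = ["acgtacgt", "ttacgtaa", "cacgtgga"]
median, median_total = median_string(toy, 4)
print(median, median_total)  # acgt 0
```

The search space here is 4^l candidate words (4096 for l = 6), each checked against every window of every sequence, which is far smaller than the space of all start-position combinations.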