Title: Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University
1Searching for Similarity in SequencesGary
BensonDepartments of Computer Science and
BiologyBoston University
2Topic 1 Outline
- Similarity and Alignment
- Define homology, similarity by descent and
similarity by convergence - Common mutations and their mathematical models
- Alignments
- Scoring Alignments
- Gap penalty functions
- Computing the best scoring alignment the
Longest Common Subsequence (LCS) problem
3Similarity and Biomolecules
- Similarity is expected among biomolecules that
are descended from a common ancestor. Mutations
cause differences, but survival of the organism
requires that mutations occur in regions that are
less critical to function while important
catalytic, regulatory or structural regions
remain similar.
4Similarity and Evolution
- Evolution has duplicated and shuffled bits and
pieces of molecules to produce new linear
arrangements that combine function in novel ways.
Regions of similarity often suggest an
evolutionary tie and/or common functional
properties between very different molecules.
5Three common similarity problems
- Start with a query sequence with unknown
properties and search within a database of
millions of sequences to find those which share
similarity with the query. - Start with a small set of sequences and identify
similarities and differences among them. - In many sequences or very long sequences, detect
commonly occurring patterns.
6What is Similarity?How can we measure it?
7Morphology
- Morphology is the form and structure of an
organism. - Should shared morphology mean similarity?
8Hands
9Aquatic Shape
10Shared morphology
- Shared morphology does not necessarily imply
common ancestry. - The animals with hands have all evolved from a
common ancester with a hand. - The ichthyosaur, shark and porpoise each evolved
sea life adaptations independently.
11Homology
- When similarity is due to common ancestry, we
call it homology.
12- Modern molecular biology seeks to understand
cellular processes through the action of DNA,
RNA, and protein molecules. This will ultimately
lead to a biochemical understanding of - The pathogenesis of infectious diseases like
AIDS, hepatitus and SARS. - The mutagenic properties of environmental toxins
and how they lead to diseases like cancer. - The etiology of human genetic disease.
- Strategies to prevent and treat diseases through
drug and vaccine design, gene therapy, risk
reduction, etc.
13How homology helps
- Given molecular sequences X and Y
- X Y AND INFO(Y) gt INFO(X)
- ( means similar)
14Are the Sequences Similar?
15Are the Sequences Similar
- How similar?
- What parts are the most similar?
- Remember, the common ancestor of the two
sequences may have existed millions of years ago.
16- How can we tell if the two sequences are similar?
- Similarity judgements should be based on
- The types of changes or mutations that occur
within sequences. - Characteristics of those different types of
mutations. - The frequency of those mutations.
17Common mutations in DNA
- Substitution
- A C G T T G A C
- A C G A T G A C
- Deletion
- A C G T T G A C
- A C G A C
- Insertion
- A C G T T G A C
- A C G C A A G T T G A C
18Common mutations
- Duplication
- A C G T T G A C
- A C G T T G A T T G A C
- Inversion (double stranded DNA shown)
- A C G T T G A C
- T G C A A C T G
- A C T C A A C C
- A C A G T T G G
19Frequency of mutations
- Substitution gt Insertion, Deletion
- gt gt
- Duplication
- gt
- Inversion
20Evolutionary history of sequences
21Alignments
- There are many ways to align two sequences. We
just saw one way - T T A C G T A C A G A T T A
- T - - G G A A C A - - - T A
- Here is another
- T T A C G T A C A G A T T A
- T - - - G G A A C - - A T - A
- Which is better? Remember, we can not choose
based on the evolutionary history, because that
is unknown.
22Alignments and Paths through the Alignment Array
23Alignments and Paths through the Alignment Array
t a c g - c a a - - - a c g t g a a t t
24Alignments and PathsAn Alternate Alignment
t - - a c g c a - - a a c g t g - - a a t t
25Finding the Best AlignmentRanking Alignments by
Score
- Score an alignment by
- Partitioning it into columns
- Assign a weight to each column
- Sum the column weights
26Distance Scoring
- Distance scoring
- Alignment gets a non-negative score.
- Alignment of identical sequences scores zero,
- all others gt zero.
- Best alignment has smallest score.
- Typical scoring functions are
- d(a,a) 0 identity
- d(a,b) d(b,a) gt 0 a ? b substitution
- g d(a, ) gt 0 indel (gap)
27Similarity Scoring
- Similarity scoring
- Alignment scores may be positive, zero, or
negative. - More similar means larger positive score.
- The best alignment has largest score.
- Typical scoring functions are
- s(a,b) is gt 0 if a and b are similar in one or
more characteristics or are observed to
substitute frequently for each other - 0 otherwise substitution
- g s(a, ) lt 0 indel (gap)
28Gap penalty functions
- Single character gap penalty
- g(a, ) c
- (c a constant or a value dependent on a)
- Affine (linear) gap penalty
- g(k) a ßk
- (a is a gap opening penalty, ß is a gap
extension penalty) - Concave gap penalty
- g(k) a ß(m(k))
- m(k) is a function like log(k) which grows more
slowly as k increases.
29Distance Scoring
- Alignment parameters
- d(a, a) 0 d(a, b) 2, g 4
- A G C C G T A T
- A C G A - - T - T
- 0 4 0 2 4 4 0 4 0
18
30Similarity Scoring
- Scoring parameters
- s(a, a) 5, s(a, b) - 3, g - 8
- A G C C G T A T
- A C G A - - T - T
- 5 5 5 5
- 8 3 8 8 8
- 15
-
31Similarity scoring with affine gap
- Alignment parameters
- s(a, a) 5, s(a, b) - 3,
- g(k) a ßk, a - 5, ß - 4
- A G C C G T A T
- A C G A - - T - T
- 5 5 5 5
- 8 3 8 4 8
- 11
-
32Computing the Optimal AlignmentThe LCS Problem
as Prototype
- The Longest Common Subsequence (LCS) problem is a
method for comparing sequences. Although the
solution does not produce an alignment, it
illustrates a method of dynamic programming that
is very similar to that used by alignment
algorithms.
33Longest Common Subsequence Problem
- Let X be a string of characters. A subsequence
X of X is formed by discarding zero or more
letters of X. Note that the letters in X
maintain their same order as in X. - Let X and Y be two strings. A common subsequence
Z is a subsequence of both. A longest common
subsequence (LCS) is the longest such Z. - Examples
X a b c d e b a Y b e b d c e a c d Z b d
e a
X a b c d e b a X a b d b
34LCS Problem
- Given Two sequences X and Y.
- Find An LCS for X and Y.
- A divide and conquer solution can be developed by
looking at what happens to the last letters in
each sequence. That is, are they part of the LCS
solution or not?
35Possible ways to split the problem
36LCS recursion
37Filling the dynamic programming array
38Filling the dynamic programming array
39Necessary values in adjacent cells
40Completed LCS array
41Tracing back for a solution
LCS bdea
42LCS time complexity
- There are (n 1)(m 1) cells in the LCS score
array. Each cell is filled by examining 3 other
cells in constant time. The time complexity to
fill the array is O(nm). - Tracing back for an LCS solution takes at most n
m steps. - The total time complexity is therefore O(nm).
43Topic 2 Outline
- Types of Alignment
- Substitution Matrices
- Global vs Local Alignment
- Recursions for Global, time complexity
- Global alignment with affine gap penalty, time
complexity - Similarity scoring and local alignment
- Recursion for local, time complexity
- Finding suboptimal local alignments declumping
- Substitution Matrices
44Global vs Local Alignment
- Given two strings, X and Y
- global alignment produces an alignment that
contains all of X and all of Y. - local alignment produces an alignment that
contains only the best matching substrings, one
from X and one from Y.
X
Y
X
Y
45Global vs Local Alignment
- Global alignment is useful when
- The sequences are known to be related throughout
their length, for example, similar protein
sequences from close species. - Local alignment is useful when
- The sequences are believed to contain parts that
are closely related.
46Global Alignment Problem
- Given two sequences X and Y and alignment
scoring functions, - Find the best scoring alignment that includes
all of X and all of Y. - Solution Dynamic Programming
47Global Alignment
- Analysis of global alignment is similar to the
LCS. Alignments can end in one of three ways.
In terms of the prefix strings x1xi and y1yj,
we have - 1. xi and yj are aligned with each other. (Here
it makes no difference whether xi and yj are the
same.) - Gi,j Gi 1, j 1 s(xi, yj)
- X C G T
- Y C G C
48Global Alignment
- xi is deleted (aligned against a dash).
- Gi, j Gi 1, j g
- X C A T
- Y C A -
- yj is deleted (aligned against a dash).
- Gi, j Gi, j 1 g
- X C A
- Y C A A
49Global alignment recursion(similarity scoring)
50Global alignment example
match 2, mismatch - 3, gap - 4
51Global Alignment Example
match 2, mismatch - 3, gap - 4
52Global Alignment Example
Tracing back for an alignment
C G T A G C C T A G A
53Global alignment time complexity
- As with the LCS problem, there are (n 1) (m
1) cells in the dynamic programming array. Each
is filled by examining 3 other cells in constant
time. The time complexity to fill the array is
O(nm). - Tracing back for a global alignment takes at most
n m steps. - The total time complexity is therefore O(nm).
54Global alignment and affine gap penalty
- Recall the affine gap penalty function
- g(k) a ßk
- When xi or yj is deleted, we have to consider
that it could be the last of a string of
characters that is deleted as one unit. And the
size of that unit will affect the deletion cost.
55Time Complexity (naïve)with Affine Gap Cost
- For each (i, j) in the alignment matrix, there
are O(n m) posible deletion costs that must be
considered in order to choose the optimal cost.
Without any improvements, the time complexity
grows to O(nm(n m)) or cubic O(n3) time.
56Refining the Affine Gap Computation
- The regularity of deletion costs helps reduce the
time complexity. Observe the two tables.
57Auxillary Functions for Affine Gap
Ei,j is max of all possibilities
58Auxillary Functions for Affine Gap
Fi,j is the maximum of all possibilities
59Global Alignment with Affine Gap
recursion(similarity scoring)
60Time Complexity with affine gap cost
- A total of (n 1)(m 1) cells must be computed.
For each cell, E, F, and G values must be
computed. E and F both require looking up 2
values. G requires looking up 3 values. Time to
compute scores is O(nm). - Tracing back can be done in O(n m) if the E and
F values are retained. This triples the memory
space required for scoring arrays (E, F, and G). - Total time O(nm).
61Local Alignment Problem
- Given two sequences X and Y and alignment
scoring functions, - Find the best scoring alignment over all
substring pairs, one from X and one from Y. - Solution Dynamic Programming
62Local Alignment Looks Harder than Global
Alignment
- Where global alignment asks for the solution to
one problem, the best alignment of X1m versus
Y1n, local alignment asks for the best
alignment out of O(n4) subproblems, any substring
in X versus any substring in Y - Xhi versus Ykj
- for 1 h i m, 1 k j n
- Instead, we solve O(n2) subproblems, the best
alignment of any substring ending at xi versus
any substring ending at yj.
63Local Alignment and Similarity Scoring
- Local alignment uses similarity scoring for the
following reason. When an alignment score is
negative, the alignment is worse than no
alignment at all. For an (i, j) pair, it often
happens that the best alignment of every
substring ending at xi with every substring
ending at yj has a negative score. - Similarity scoring detects these bad alignments
and local alignment discards them. If every
alignment score for an (i, j) cell is negative,
then the score is reset to zero.
64Local Alignment Recursion
65Local Alignment Time Complexity
- Proportional to the product of the sequence
lengths O(nm).
66Finding Suboptimal Alignments
- When computing local alignment, we may want to
know optimal and suboptimal alignments. This
can be important in the case where the sequences
contain several parts that are similar.
X
Y
67High Alignment Scores may not be Independent
68Declump scores by prohibiting match and
substitution pairings from realignment
69Source Michael S. Waterman, 1994
70Source Michael S. Waterman, 1994
71Substitution Matrices
- Used for protein alignments
- Substitution rates for amino acid pairs are
determined from known similar sequences - Matrices contain log-odds scores
- First matrices designed by Margaret Dayhof are
called PAM matrices (Point Accepted Mutation) - Current matrices designed by Henikoff and
Henikoff are called BLOSUM matrices (Blocks
Substitution Matrices) -
72(No Transcript)
73(No Transcript)
74Odds Ratios
- Odds is a ratio of probabilities for two events
which are mutually exclusive. - In horse racing, for example, odds is the ratio
of the betting that a horse will lose to the
betting that the horse will win. So, for a horse
with odds of 20 to 1, the betting is 20 times
higher that the horse will lose than it will win,
while for a horse with odds of 3 to 2, the
betting is only 1.5 times higher that the horse
will lose than win.
75Alignment and Odds
- A substitution score can be interpreted as an
odds ratio. For an individual pair of aligned
amino acids, the events are - The pair are aligned because they are
evolutionarily related - The pair are aligned merely by chance.
- For each pair, the relevant question is
- What are the odds that amino acid i would be
substituted for amino acid j if they were
evolutionarily related? - If the odds are good, then the pair supports the
alignment. - If the odds are bad, then the pair reduces
confidence in the alignment.
76Log Odds
- The odds for the entire alignment
- Aligned because the sequences are
evolutionarily related or aligned by chance
alone - can be obtained by multiplying the odds for each
aligned pair of amino acids. - Multiplication is expensive computationally, so
logarithms are used because they can be added.
77Log-Odds Substitution Matrices
- The BLOSUM and PAM substitution matrices contain
log-odds values. The ratios have the basic form - Observed probability of pairing amino acids i
and j in related sequences - Expected probability of pairing at random
Oij Eij
78BLOSUM Data
- DATA
- Ungapped multiple alignments (blocks) taken from
504 families of known related protein sequences
in the PROSITE database. In the original paper
(1992), this produced 2100 blocks.
79BLOSUM Observed Frequencies
- A typical block
- R T S C L H
- K T S A N P
- K W D C L P
- K
- R
- E
First column pairs RK 6 RR 1 RE 2 KK
3 KE 3 Repeat and accumulate for every column.
Fij Pair (i, j) counts Oi,j Fij / total
number of pairs observed pair frequencies
80BLOSUM Background Frequencies
- Pi probability of amino acid i occurring in an
amino acid pair - Pi Oii Sj ? i Oij / 2
- Eij expected probability of random pairs
- Eij
Pi Pj if i j Pi Pj Pj Pi
if i ? j
81BLOSUM Log-Odds Values
- Sij Log-odds ratio for aligning amino acids i
and j. - Sij 2 log 2 (Oij / Eij)
- If observed frequencies are
- as expected, Sij 0
- greater than expected, Sij gt 0 (positive)
- less than expected, Sij lt 0 (negative)
82Related Sequences Must Be Clustered
- Overrepresentation of closely related sequences
can bias the matrices. - K
- K 4 sequences more than
- K 80 identical
- K
- R
- E
- The overrepresentation of the closely related
sequences increases the observed K to K pairing,
falsely increasing the log-odds score for that
pair.
Other related sequences
83Clustering Sequences
- Closely related sequences are clustered and the
letters in any cluster are given fractional
values. In the cluster below, amino acids K in
the first column counts as ¼ each rather than 1.
R and T in the fourth column count as ½ each. - K R
- K R 4 sequences more than
- K T 80 identical clustered
- K T
- R L
- E K
84The BLOSUM Matrix Family
- Blosum 80 clusters sequences if they are 80
identical (clustering is not transitive). - Blosum 62 clusters sequences if they are 62
identical. - Lower numbers yield matrices that give higher
scores to alignments of distantly related
sequences.
85(No Transcript)
86Topic 3 Outline
- Database Searching Algorithms
- Sensitivity vs Selectivity vs Running Time
- Dot Plots
- Blast
- High Scoring Segment Pairs (HSP)
- Target Words
- Extending a Hit
- Statistics of HSP scores
- Gapped Blast modifications
87Alternatives to Alignment
- Alignment is fine when the sequences are
relatively short, but is unusable for longer
sequences such as are encountered in - database searches
- comparison of genomes
- large repetition identification
- because it takes too long. For these, we need
alternate methods.
88Dot Plots
- Two dimensional array like an alignment array.
Put a dot in each cell where the sequence
characters match. Long diagonal runs of dots
indicate sequence similarity. - Match rule for proteins might be positive
substitution value in BLOSUM matrix.
89Extended Dot Plots (Nature, 19 June 2003, Vol
423, p 831)
90(No Transcript)
91Tandem repeat
92Database Search AlgorithmsSensitivity,
Selectivity, Running Time
- Sensitivity the ability to detect weak
similarities between sequences (often due to long
evolutionary separation). Increasing sensitivity
reduces false negatives, those database sequences
similar to the query, but rejected. - Selectivity the ability to screen out
similarities due to chance. Increasing
selectivity reduces false positives, those
sequences recognized as similar when they are not.
Running time
1 Running time
Sensitivity
Selectivity
93BLAST
- The BLAST program is designed for fast sequence
to database search. The basic idea is that a
high scoring alignment between the query and a
database sequence will almost always contain a
core part that is well conserved. This core is
called a High-scoring Segment Pair (HSP). The
statistical theory behind BLAST deals with the
probability of finding HSPs.
94High-scoring Segment Pair (HSP)
- An HSP consists of two equal length substrings,
one from the query and one from the database
sequence. When aligned without gaps, the score,
S, is above a specified threshold and is locally
maximal, meaning that extending the substrings on
either end to lengthen the alignment produces a
smaller score. - Suppose we have the following two sequences
X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F A L
C C T W A F Y C 1 1 4 0 -2 9 13
One of many segment pairs is (BLOSUM scoring)
95High Scoring Segment Pairs
X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F
A L C C T W A F Y C 1 1 4
0 -2 9 13 This pair is locally maximal
because shortening or lengthening the alignment
reduces the score. L K F S F A L C
C S R P T W A F Y C -2 2 -4 1
1 4 0 -2 9
96Assumptions for the Statistical Theory
- We assume
- a simple protein model the probabilities for
the amino acids appearing at each position in a
protein are independent and identically
distributed (iid) and amino acid i occurs
randomly with probability Pi. - the substitution matrix (such as BLOSUM 62) has
at least one positive score. - the expected score for amino acid pairs
- S pipj sij lt 0
- is negative.
97Normalized Scores of HSPs
- Given Pi and Sij, the statistical theory yields
two parameters, ? and K which can be used to
normalize an HSP score S with the formula - S'
- The units of S' are bits. When two random
proteins sequences are compared, the expected
number of HSPs with a normalized score of S' is - E N / 2S'
- where N is the product of the sequence lengths.
?S ln K ln 2
98Score for a Given Statistical Significance
- Solving for S' yields
- S' log2 (N / E)
- For example, if
- the query protein has length 250.
- the database has length 50 000 000
- the E-value is .05
- then the score required to reach that level of
statistical signficance is 38 bits.
99Deficiencies in the BLAST theory
- The result is asymptotic, meaning there is some
error for finite sequences. - Local variations in residue composition in real
sequences often do not match the model. - The model assumes that all mutations at all sites
in a protein sequence can be described by a
single substitution matrix. - It does not apply to gapped alignments.
100Finding HSPs
- Blast finds HSPs by first looking for well
conserved core alignments, i.e., ungapped
alignments between short query words of length
3 and database words. Each core alignment must
score greater than a threshold score, T, which is
typically 13 using the BLOSUM matrix. - Core alignments or hits are typically found
using pattern matching finite automata, i.e.
linear search through the database.
101Query and Target Words
- Suppose the query sequence is LVNRKPVVP.
- Chop the query into all possible words of length
3 - LVN VNR NRK RKP KPV PVV VVP
- Collect target words associated with each query
word. For example, the word RKP has six target
words that score at least 13 (using BLOSUM 62)
when aligned with RKP - QKP KKP RQP REP RRP RKP
- In general, there will be a large set of target
words derived from the query sequence. Each
target word and its associated query word could
form the core of an HSP.
102Extending a Hit
- A hit is extended in either direction to find its
locally maximal segment pair. Extension
terminates when the score drops too far below the
highest score found. - For example, Suppose the target RRP is found and
it occurs in the sequence - EPGVCRRPLKCTAS
- When trying to extend the core against LVNRKPVVP
the HSP has 6 letters and a score of 16 - L V N R K P V V P
- G V C R R P L K C
- -3 4 -3 5 2 7 1 -2 -3
103Gapped Blast
- Gapped BLAST (1997) has two improvements over the
original BLAST (1990). - Two hits Only extends core alignments when two
occur nearby on the same diagonal. Involves
lowering the threshold T to retain sensitivity,
but reduces extension which is the most costly. - Gapped alignments are computed. Allows raising
the threshold T while retaining selectivity.
Speeds initial database scan.
104Optimal Substitution Matricesfor Distant
Homologies
- Among HSPs (ungapped) from the comparison of
random sequences, amino acids ai and aj are
aligned with frequency approaching - qij pipje?sij
- The qij are called target frequencies for the
given substitution matrix sij. Among alignments
of distantly related proteins, amino acids tend
to be paired with certain characteristic
frequencies. Only if these correspond to a
matrixs target frequencies ... can the matrix be
optimal for distinguishing the distant local
homologies. (Altschul 1991)
105Log-Odds Matrices Again
- Rearranging qij pipje?sij yields
- sij ln(qij /pipj)/?
- where ? acts as a scaling factor for the
logarithm. This shows - that the substitution scores are inherently
log-odds scores for the target and background
frequencies, and - scores may be chosen that correspond to any
desired set of target frequencies.
106Topic 4 Outline
- Specialized Sequence Alignment Algorithms
- Sim4
- Blat
107sim4
- Sim4 is a program for aligning a cDNA sequence to
a genomic sequence. - cDNA is complementary to a messenger RNA which is
the RNA molecule after introns have been cut out
and the exons spliced together. The difference
between cDNA and genomic DNA is the absence of
the intron sequences in cDNA. - Sim4 assumes that the differences between the two
sequences are limited to - introns in the genomic sequence
- sequencing errors in either sequence
108sim4 Find HSPs
- In the first step, sim4 finds HSPs which must
have an exact matching core of 12 nucleotides.
The core is extended on both ends with a score of
1 for match and -5 for mismatch.
109sim4 Select chains of HSPs
- HSps are chained with the constraints that
- starting positions in the cDNA are increasing
- HSPs are in nearby diagonals or are in diagonals
separated by plausible intron distances
Genomic
cDNA
starting positions increase
gap typical for intron
110sim4 Trim overlaps in cDNA
- Exon cores that overlap in the cDNA are trimmed
to find - GT ... AG, a common intron signal, in the
genomic DNA.
GT
AG
overlap
111sim4 Trim overlaps in cDNA
- Exon cores that overlap in the cDNA are trimmed
to find - GT ... AG, a common intron signal, in the
genomic DNA.
GT
AG
112sim4 Filling Gaps in cDNA
- Gaps are filled by recursively finding HSPs with
smaller cores (starting with exact matches of
length 8)
smaller cores are chained as before
113BLAT The BLAST-Like Alignment Tool
- BLAT is a specialized program for mRNA to DNA
alignments and cross-species protein alignments.
It is faster than BLAST and sim4 for these tasks.
It is not appropriate for distant homology
searching. - BLAT is available at the UC Santa Cruz human
genome website for database searches against the
human genome.
114BLAT vs BLAST
- Both BLAT and BLAST first look for high-scoring
pairs - BLAST collects short words in the query and does
a linear scan through the database for those
words. - BLAT builds an index of the database (a data
structure suited for rapid detection of word
matches) and then scans linearly through the
query. Since the query is much smaller than the
database, the scan phase is very rapid. The data
structure is persistent, meaning that it is built
once and then reused for every query search.
115BLAT Hits
- To form the database index, BLAT cuts the
database sequences into non-overlapping words of
length k.
Database
Query
A hit is an exact match or a near perfect match
(single character mismatch) with any query
subword.
Query k-words
116BLAT Criteria for Extending Hits
- BLAT allows different criteria for extending
hits - Single exact match
- Single near-exact match (one difference)
- C A G T G C G A T G A
- C A G T A C G A T G A
- Two exact matches on the same diagonal
separated by a small distance
117BLAT Statistics
- The size, k, of matching words is selected to
balance sensitivity and selectivity with running
time. BLAT makes flexible assumptions about the
query and database sequences to select k, the
word match size - M the percent matching between homologous
regions (98 for cDNA/genomic alignments, 89 for
cross-species protein alignments) - H the size of the homologous regions
- G the size of the database (3 000 000 000
bases) - Q the size of a query
- A the alphabet size (4 for DNA, 20 for protein)
118BLAT Statistics for k Selection (Exact Match)
- The number of non-overlapping k-words in a
homologous region is - T floor (H / k)
- Sequence letters are assumed to be iid. The
probability that - a match occurs between a homologous region k-word
and the query is - p Mk
- no match occurs is
- q (1 p)
- at least one homologous region word matches the
query is - Phit 1 minus no matches 1 q T
119BLAT Statistics for k Selection(Exact Match)
- The probability that two k words match by chance
is - r (1/A)k
- The number of
- query words is qw Q k 1
- database words is dw G/k
- The number of k words that are expected to match
by chance is - F qw dw r
120BLAT Statistics for k Selection
- With increasing k,
- the probability of a valid hit, Phit, goes down
because it becomes harder to find two words that
match due to errors or mutations. - the number of false hits, F, goes down because
probability of random word matching decreases as
words grow. - For example, in DNA, with M 0.95, H 100, and
Q 500, if - k 11, then Phit 0.999 and F 32 512
- k 14, then Phit 0.991 and F 399
- By decreasing sensitivity slightly, the number of
false hits can be reduced by almost a factor of
100. -
121BLAT indexing
- Each k-word (nonoverlapping) in the database is
converted to a number in base four - T C A G T T A
- 3 1 0 2 3 3 04
- List I points to a list L, the sorted locations
of the k-word in the database sequences.
A list, I, of all possible numbers is
maintained. For DNA, when k 7, this list has
47 or 16,384 entries.
I
3102330
3102331
L
100
730
600
350
930
870
122Deficiencies in the BLAT Program
- Near exact matching with proteins requires an
index which is too large to fit in memory, so a
less efficient hashing scheme is used instead - The alignment method is not standard optimal
alignment and - for DNA does not work well below 90 sequence
identity - for proteins does not work well with indels in
one of the sequences
123Topic 5 Outline
- Searching for Repetitive Motifs and Patterns
- Pattern Detection vs Alignment
- Short word methods
- TRF
124Pattern Detection
- In pattern detection problems
- there is no query sequence
- we look for a repetitive pattern or motif in one
long sequence or several sequences - broad characteristics of the target pattern are
specified in advance
125Short Word Methods
- In short word methods, small matching words are
detected. A cluster of short words indicates a
potential pattern. A typical applications is the
search for tandem repeats in DNA sequences.
Word spacings may differ.
126Short Word Methods
- Short word methods are well suited to DNA
sequences because - the alphabet is small so exact matching words are
common even in homologous regions that have
experienced significant mutation. - they can handle insertions and deletions that are
common in DNA - They are less well suited to protein sequences
because - large amino acid alphabet and frequent
substitution make short matching words uncommon.
127Tandem Repeats
A tandem repeat is any pattern of nucleotides
that has been duplicated so that it appears
several times in succession. For example, the
sequence fragment below contains a tandem repeat
of the trinucleotide CGT
tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt
128Approximate Tandem Repeats
More typically, the tandem copies are only
approximate due to mutations. Here is an
alignment of copies from a tandem repeat in C.
elegans.
Shown are the copies and a
consensus pattern
129Tandem Repeats Associated with Human Disease
- Trinucleotide diseases caused by expansion of a
trinucleotide repeat - Fragile-X mental retardation
- Myotonic dystrophy
- Huntingtons disease
- Friedreichs ataxia
- Multilocus diseases linked in some cases to
unstable or uncommon minisatellites - Epilepsy
- Diabetes
- Ovarian cancer
130Tandem Repeats Function and Usefulness
- Tandem repeats
- are involved in gene regulation and often
contain putative transcription factor binding
sites. - exhibit copy number polymorphism, making them
valuable genomic markers.
131Tandem Repeats Finder (TRF)
- TRF finds tandem repeats in genomic DNA sequences
using short word matches (k-tuple matches). It
assumes - repeats have on average gt80 sequence identity
- insertions and deletions occur on average in lt
10 of the pattern positions -
- These are average values for program parameter
settings. Repeats with lower sequence identity
and higher indel frequency are also found.
132- LOCUS HUMFMR1 3765 bp
mRNA PRI 08-NOV-1994 - DEFINITION Human Fragile X mental
retardation 1 FMR-1 gene, 3' end, clones - 1 gacggaggcg
cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
gcggcggcgg - 61 cggcggaggc ggcggcggcg gcggcggcgg
cggcggaggc ggcggcggcg gcggcggcgg - 121 cggcggctgg gcctcgagcg cccgcagccc
acctctcggg ggcgggctcc cggcgctagc - 181 agggctgaag agaagatgga ggagctggtg
gtggaagtgc ggggctccaa tggcgctttc - 241 tacaaggcat ttgtaaagga tgttcatgaa
gattcaataa cagttgcatt tgaaaacaac - 301 tggcagcctg ataggcagat tccatttcat
gatgtcagat tcccacctcc tgtaggttat - 361 aataaagata taaatgaaag tgatgaagtt
gaggtgtatt ccagagcaaa tgaaaaagag - 421 ccttgctgtt ggtggttagc taaagtgagg
atgataaagg gtgagtttta tgtgatagaa - 481 tatgcagcat gtgatgcaac ttacaatgaa
attgtcacaa ttgaacgtct aagatctgtt - 541 aatcccaaca aacctgccac aaaagatact
ttccataaga tcaagctgga tgtgccagaa - 601 gacttacggc aaatgtgtgc caaagaggcg
gcacataagg attttaaaaa ggcagttggt - 661 gccttttctg taacttatga tccagaaaat
tatcagcttg tcattttgtc catcaatgaa - 721 gtcacctcaa agcgagcaca tatgctgatt
gacatgcact ttcggagtct gcgcactaag - 781 ttgtctctga taatgagaaa tgaagaagct
agtaagcagc tggagagttc aaggcagctt - 841 gcctcgagat ttcatgaaca gtttatcgta
agagaagatc tgatgggtct agctattggt - 901 actcatggtg ctaatattca gcaagctaga
aaagtacctg gggtcactgc tattgatcta - 961 gatgaagata cctgcacatt tcatatttat
ggagaggatc aggatgcagt gaaaaaagct
133- LOCUS HUMFMR1 3765 bp
mRNA PRI 08-NOV-1994 - DEFINITION Human Fragile X mental
retardation 1 FMR-1 gene, 3' end, clones - 1 gacggaggcg
cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
gcggcggcgg - 61 cggcggaggc ggcggcggcg gcggcggcgg
cggcggaggc ggcggcggcg gcggcggcgg - 121 cggcggctgg gcctcgagcg cccgcagccc
acctctcggg ggcgggctcc cggcgctagc - 181 agggctgaag agaagatgga ggagctggtg
gtggaagtgc ggggctccaa tggcgctttc - 241 tacaaggcat ttgtaaagga tgttcatgaa
gattcaataa cagttgcatt tgaaaacaac - 301 tggcagcctg ataggcagat tccatttcat
gatgtcagat tcccacctcc tgtaggttat - 361 aataaagata taaatgaaag tgatgaagtt
gaggtgtatt ccagagcaaa tgaaaaagag - 421 ccttgctgtt ggtggttagc taaagtgagg
atgataaagg gtgagtttta tgtgatagaa - 481 tatgcagcat gtgatgcaac ttacaatgaa
attgtcacaa ttgaacgtct aagatctgtt - 541 aatcccaaca aacctgccac aaaagatact
ttccataaga tcaagctgga tgtgccagaa - 601 gacttacggc aaatgtgtgc caaagaggcg
gcacataagg attttaaaaa ggcagttggt - 661 gccttttctg taacttatga tccagaaaat
tatcagcttg tcattttgtc catcaatgaa - 721 gtcacctcaa agcgagcaca tatgctgatt
gacatgcact ttcggagtct gcgcactaag - 781 ttgtctctga taatgagaaa tgaagaagct
agtaagcagc tggagagttc aaggcagctt - 841 gcctcgagat ttcatgaaca gtttatcgta
agagaagatc tgatgggtct agctattggt - 901 actcatggtg ctaatattca gcaagctaga
aaagtacctg gggtcactgc tattgatcta - 961 gatgaagata cctgcacatt tcatatttat
ggagaggatc aggatgcagt gaaaaaagct
134 LOCUS RATIGCA 4461 bp DNA ROD
18-APR-1994 DEFINITION Rat Ig germline
epsilon H-chain gene C-region, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
135 LOCUS RATIGCA 4461 bp DNA ROD
18-APR-1994 DEFINITION Rat Ig germline
epsilon H-chain gene C-region, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
136Basic Assumption
- Mutated, adjacent copies of a pattern will
contain runs of exact matches.
T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
137Basic Assumption
- Mutated, adjacent copies of a pattern will
contain runs of exact matches.
T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
Runs of matches are identified using k-tuple
matches.
138Using k-tuple matches
- For purposes of program efficiency, fixed size
k-tuples are used.
139k-tuple matches
In pattern matching, a k-tuple is a window of
length k which contains text characters. Two
windows which contain the same text, form a
k-tuple match GAACGTTAGGTAACTGCAT CCTAGTTATACG
TTAAC
140(No Transcript)
141Modeling Tandem Repeats
- Multiple k-tuple matches suggest the occurrence
of a tandem repeat. The appropriate number of
matches depends on how a tandem repeat is
defined. For example, are the following two
aligned sequence fragments two copies of the same
underlying pattern? - TCGGCATCAGTCTATGG
- TCAA-TG-GTGT-TGG
-
142A Stochastic Model
- TRFs stochastic model is based on the
probability of character matching and the
frequency of insertion and deletion (indels)
between aligned adjacent copies. - CCACAACC-CGTCAGGCAAGT
- CTGCACCATCGTCTGGGAAGT
- HTTHHTHTTHHHHTHHTHHHH
- Note that the alignment has been converted into a
Bernoulli (coin-toss) sequence.
143Model Parameters
- PM the expected frequency of a match
- PI the expected frequency of an indel
- The parameters are applied to Bernoulli sequences
to establish criteria for detecting the repeats.
144Number of matches to indicate a repeat
- Sum of heads is the minimum required number of
matches - for a repeat with period n
- match probability p
- tuple size k
- Unless k 1, not all matches will be detected.
- For example
- HHTHHHHTHTHHHTTHHHT
k H seen
1 13
2 12
3 10
4 4
145Sum of heads
- Suppose a random Bernoulli sequence has length
100 and the expected number of heads is 75 (PM
0.75). If we count the number of heads, then 95
of the time we expect to count at least 68 heads.
- If we count only heads that occur in runs of
length 5 or more, then 95 of the time we expect
to count at least 26 heads. This is the sum of
heads criteria.
146Other Criteria
- Apparent Size Used to distinguish tandem from
non-tandem repeats. - Waiting time Used to pick a suitable tuple
size. - Random walk Used to accommodate insertions and
deletions.
147(No Transcript)
148The Tandem Repeats Database
- The Tandem Repeats Database (TRDB) is
- A public database of information on tandem
repeats. - A private workspace for extended research on
tandem repeats.
149C. elegans Distribution of Repeat Pattern Size
Chr 1
150Human Distribution of Pattern Size Chr 1
151C. elegans Distribution of Repeat Location Chr 1
152Human Distribution of Repeat Location Chr 1
153Human Distribution of Repeat Location Chr 1 for
Patternsize gt 5
154Clusters in 41 bp Repeats
155Two identical repeats
156Four repeats from a tandem repeat cluster
157Topic 6 Outline
- Composition Alignment
- Sequence composition and composition match
- Composition alignment algorithm
- Composition match scoring functions
- Limiting the length of a composition match
- Growth of local composition alignment scores
- Biological examples
158Sequence Composition
- Composition is a vector quantity describing the
frequency of occurrence of each alphabet letter
in a particular string. Let S be a string over
S. Then, - C(S)(fs1 , fs2 , fs3 , , fsS)
- is the composition of S, where fsi is the
fraction of the characters in S that are si. - Note that the order of letters is irrelevant as
it has no effect on the composition.
159Composition Example
- S ACTGTACCTGGCGCTATT
- C(S) ( 0.17, 0.28, 0.22, 0.33 )
- A C G
T
160Composition Match
- Two strings, S and T, have a composition match if
their lengths are equal and C(S) C(T). - For example, S and T below have a composition
match - S ACTGTACCTGGCGCTATT
- T AAACCCCCGGGGTTTTTT
161Composition and Sequence Features
- Isochores Multi-megabase, specifically GC-rich
or GC-poor. GC-rich isochores have greater gene
density. - CpG Islands Several hundred nucleotides, rich
in the dinucleotide CG which is underrepresented
in eukaryotic genomes. Methylation of the
cystine in these dinucleotides affects gene
expression. - Protein binding regions Tens of nucleotides,
dinucleotide composition contributes to DNA
flexibility, allowing the helix to change shape
during protein binding.
162Composition Alignment Problem
- Given Two sequences, S of length m, and T of
length n, over an alphabet S, and a scoring
function cm(s, t) for the score of a composition
match between substrings s and t. - Find The best scoring alignment (global or
local) of S with T such that the allowed scoring
options include composition match between
substrings of S and T as well as the standard
options of single character match, single
character mismatch, insertion and deletion.
163Example of composition alignment
- S AACGTCTTTGAGCTC
- T AGCCTGACTGCCTA
- Alignment
- AACGTCTTTGAGCTC
- lt-gt lt---gt
- AGCCTGACT-GCCTA
164Algorithm Analysis
- Given two sequences, S and T, the best alignment
of the prefix strings - S1, i s1 si
- T1, j t1 tj
- ends in one of four ways, mismatch, insertion,
deletion, or composition match between suffixes
of length l - 1 l min(i, j, limit)
- i.e., between substrings Si l 1, i and Tj
l 1, j
165Time Complexity
- Computing the optimal composition alignment is
done with dynamic programming and is similar to
standard alignment, except for the composition
match scoring option. The overall time
complexity is - O(nmZ)
- where Z is the time required per (i, j) pair to
find the best length l for the composition match.
166Computing length of the shortest composition
match
- Our goal here is to start with two strings, S and
T, of equal length, and for each prefix pair S1,
k, T1, k, find the length of the shortest
suffixes that have a composition match. For
example, let - S AACGTCTTTGAGCT
- T AGCCTGACTGCCTA
- Then for k 6, the shortest suffixes which have
a composition match have length 3 - S AACGTC
- T AGCCTG
167Composition difference
- Composition difference is a vector quantity for
two strings x and y - CD(x, y) (cs1 , , csS)
- where csi is the difference between the number
of times si occurs in x and in y.
168Using composition difference
Key observation two identical composition
differences at prefix lengths k and g indicate a
composition match of length k g.
169Sorting to find shortest composition matches
Sort on composition difference using stable sort.
Adjacent tuples with the same composition
difference identify shortest composition matches.
170Time complexity for composition matches
- O(nmS) to find nm shortest composition match
lengths for two strings of length n and m. - In our work, S, is a small constant (4 for DNA,
16 for dinucleotides). For larger alphabets, the
method of Amir, Apostolico, Landau and Satta
(2003) can be used.
171Composition match scoring functions
- Functions based on match length, k
- Function 1 cm(k) ck
- Function 2 cm(k) cv k
- where c is a constant.
- Functions based on substring composition
- Function 4 cm(C, B, k) ck H(C,B)
- where H is the relative entropy function, C is
the composition of the matching substrings and B
is a background composition.
172Additive and subadditive scoring functions
- The functions based on length are additive or
subadditive - cm(i j) cm(i) cm(j)
- Lemma For additive or subadditive composition
match scoring functions, any best scoring
alignment is equivalent in score to an alignment
which contains only shortest composition matches.
- Theorem Composition alignment with additive or
subadditive match scoring functions and finite
alphabet has time complexity O(nm).
173The limit parameter
- Intuitively, allowing scrambled letters to match
should increase the amount of matching between
sequences. If too much matching occurs,
alignments will not be meaningful. - The limit parameter is an upper bound on the
length l of the longest single composition match.
174Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length limit 1 2 5 10
binary 50 62.5 75.6 82.4
DNA 25 30 37.5 44.2
Dinucleotide 6.2 6.5 7.1 7.5
175Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length 100, all letters equal
probability p 0.25
limit 1 2 5 10
DNA 25 33.7 44.4 51
Sequence length 400, all letters equal
probability p 0.25
limit 1 2 10 50
dinucleotide 6.25 6.81 7.76 7.78
176Growth of local alignment scoreFunction 1
177Global score as a predictor of local parameter
suitability Function 1
178Growth of local alignment score Function 2
179Global score as a predictor of local parameter
suitability Function 2
180Limit values for DNA
- Function 1 cm(k) ck Limit 3.
- Function 2 cm(k) cvk Limit 10.
- Function 4 cm(C, B, k) ck H(C, B)
- Limit 50.
181Biological examples
- Composition alignment was tested on a set of 1796
promoter sequences from the Eukaryotic Promoter
Database. Each sequence is 600 nucleotides long,
500 bases upstream and 100 downstream of the
transcription initiation site. - Two local alignment scores were produced using
function 1, W using composition alignment and S
using standard alignment. The examples shown have
statistically significant W with W 3 S to
exclude good standard alignments.
182Example 1
Composition alignment and standard alignment of
two promoters. Standard alignment is not
statistically significant. Sequences are
characteristic of CpG islands.
Composition Alignment GCCCGCCCGCCGCGCTCCCGCCCGCC
GCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC lt-gt
ltgtltgtltgt ltgtltgt lt-gt ltgtltgt ltgtltgt
ltgt lt-gt CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGG
CCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC Standard
Alignment CGCCGCCGCCG CGCCGCCGCCG
183Example 2
Composition alignment of two promoter sequences.
Composition changes at vertical line.
A C G T Left
(0.01, 0.61, 0.30, 0.08) Right (0.19, 0.16,
0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GC
TCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG lt-gtlt
gtltgt ltgt lt-gtltgt ltgt
lt-gt ltgtltgtlt-gt ltgtltgtltgtlt-gtlt-gt CCGCGC
GCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGG
GGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
184Final Slide
- Recognize that similarity among biological
sequences most likely exists in ways which we can
not today perceive and for which we have no
detection tools. You are encouraged, therefore,
to think broadly, beyond the embellishment or
refinement of current methods, to new definitions
of similarity and new problems of comparison.