Title: Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University
1Searching for Similarity in SequencesGary
BensonDepartments of Computer Science and
BiologyBoston University
2Topic 1 Outline
 Similarity and Alignment
 Define homology, similarity by descent and
similarity by convergence  Common mutations and their mathematical models
 Alignments
 Scoring Alignments
 Gap penalty functions
 Computing the best scoring alignment the
Longest Common Subsequence (LCS) problem
3Similarity and Biomolecules
 Similarity is expected among biomolecules that
are descended from a common ancestor. Mutations
cause differences, but survival of the organism
requires that mutations occur in regions that are
less critical to function while important
catalytic, regulatory or structural regions
remain similar.
4Similarity and Evolution
 Evolution has duplicated and shuffled bits and
pieces of molecules to produce new linear
arrangements that combine function in novel ways.
Regions of similarity often suggest an
evolutionary tie and/or common functional
properties between very different molecules.
5Three common similarity problems
 Start with a query sequence with unknown
properties and search within a database of
millions of sequences to find those which share
similarity with the query.  Start with a small set of sequences and identify
similarities and differences among them.  In many sequences or very long sequences, detect
commonly occurring patterns.
6What is Similarity?How can we measure it?
7Morphology
 Morphology is the form and structure of an
organism.  Should shared morphology mean similarity?
8Hands
9Aquatic Shape
10Shared morphology
 Shared morphology does not necessarily imply
common ancestry.  The animals with hands have all evolved from a
common ancester with a hand.  The ichthyosaur, shark and porpoise each evolved
sea life adaptations independently.
11Homology
 When similarity is due to common ancestry, we
call it homology.
12 Modern molecular biology seeks to understand
cellular processes through the action of DNA,
RNA, and protein molecules. This will ultimately
lead to a biochemical understanding of  The pathogenesis of infectious diseases like
AIDS, hepatitus and SARS.  The mutagenic properties of environmental toxins
and how they lead to diseases like cancer.  The etiology of human genetic disease.
 Strategies to prevent and treat diseases through
drug and vaccine design, gene therapy, risk
reduction, etc.
13How homology helps
 Given molecular sequences X and Y
 X Y AND INFO(Y) gt INFO(X)
 ( means similar)
14Are the Sequences Similar?
15Are the Sequences Similar
 How similar?
 What parts are the most similar?
 Remember, the common ancestor of the two
sequences may have existed millions of years ago.
16 How can we tell if the two sequences are similar?
 Similarity judgements should be based on
 The types of changes or mutations that occur
within sequences.  Characteristics of those different types of
mutations.  The frequency of those mutations.
17Common mutations in DNA
 Substitution
 A C G T T G A C
 A C G A T G A C
 Deletion
 A C G T T G A C
 A C G A C
 Insertion
 A C G T T G A C
 A C G C A A G T T G A C
18Common mutations
 Duplication
 A C G T T G A C
 A C G T T G A T T G A C
 Inversion (double stranded DNA shown)
 A C G T T G A C
 T G C A A C T G
 A C T C A A C C
 A C A G T T G G
19Frequency of mutations
 Substitution gt Insertion, Deletion
 gt gt
 Duplication
 gt
 Inversion
20Evolutionary history of sequences
21Alignments
 There are many ways to align two sequences. We
just saw one way  T T A C G T A C A G A T T A
 T   G G A A C A    T A
 Here is another
 T T A C G T A C A G A T T A
 T    G G A A C   A T  A
 Which is better? Remember, we can not choose
based on the evolutionary history, because that
is unknown.
22Alignments and Paths through the Alignment Array
23Alignments and Paths through the Alignment Array
t a c g  c a a    a c g t g a a t t
24Alignments and PathsAn Alternate Alignment
t   a c g c a   a a c g t g   a a t t
25Finding the Best AlignmentRanking Alignments by
Score
 Score an alignment by
 Partitioning it into columns
 Assign a weight to each column
 Sum the column weights
26Distance Scoring
 Distance scoring
 Alignment gets a nonnegative score.
 Alignment of identical sequences scores zero,
 all others gt zero.
 Best alignment has smallest score.
 Typical scoring functions are
 d(a,a) 0 identity
 d(a,b) d(b,a) gt 0 a ? b substitution
 g d(a, ) gt 0 indel (gap)
27Similarity Scoring
 Similarity scoring
 Alignment scores may be positive, zero, or
negative.  More similar means larger positive score.
 The best alignment has largest score.
 Typical scoring functions are
 s(a,b) is gt 0 if a and b are similar in one or
more characteristics or are observed to
substitute frequently for each other  0 otherwise substitution
 g s(a, ) lt 0 indel (gap)
28Gap penalty functions
 Single character gap penalty
 g(a, ) c
 (c a constant or a value dependent on a)
 Affine (linear) gap penalty
 g(k) a ßk
 (a is a gap opening penalty, ß is a gap
extension penalty)  Concave gap penalty
 g(k) a ß(m(k))
 m(k) is a function like log(k) which grows more
slowly as k increases.
29Distance Scoring
 Alignment parameters
 d(a, a) 0 d(a, b) 2, g 4
 A G C C G T A T
 A C G A   T  T
 0 4 0 2 4 4 0 4 0
18
30Similarity Scoring
 Scoring parameters
 s(a, a) 5, s(a, b)  3, g  8
 A G C C G T A T
 A C G A   T  T
 5 5 5 5
 8 3 8 8 8
 15

31Similarity scoring with affine gap
 Alignment parameters
 s(a, a) 5, s(a, b)  3,
 g(k) a ßk, a  5, ß  4
 A G C C G T A T
 A C G A   T  T
 5 5 5 5
 8 3 8 4 8
 11

32Computing the Optimal AlignmentThe LCS Problem
as Prototype
 The Longest Common Subsequence (LCS) problem is a
method for comparing sequences. Although the
solution does not produce an alignment, it
illustrates a method of dynamic programming that
is very similar to that used by alignment
algorithms.
33Longest Common Subsequence Problem
 Let X be a string of characters. A subsequence
X of X is formed by discarding zero or more
letters of X. Note that the letters in X
maintain their same order as in X.  Let X and Y be two strings. A common subsequence
Z is a subsequence of both. A longest common
subsequence (LCS) is the longest such Z.  Examples
X a b c d e b a Y b e b d c e a c d Z b d
e a
X a b c d e b a X a b d b
34LCS Problem
 Given Two sequences X and Y.
 Find An LCS for X and Y.
 A divide and conquer solution can be developed by
looking at what happens to the last letters in
each sequence. That is, are they part of the LCS
solution or not?
35Possible ways to split the problem
36LCS recursion
37Filling the dynamic programming array
38Filling the dynamic programming array
39Necessary values in adjacent cells
40Completed LCS array
41Tracing back for a solution
LCS bdea
42LCS time complexity
 There are (n 1)(m 1) cells in the LCS score
array. Each cell is filled by examining 3 other
cells in constant time. The time complexity to
fill the array is O(nm).  Tracing back for an LCS solution takes at most n
m steps.  The total time complexity is therefore O(nm).
43Topic 2 Outline
 Types of Alignment
 Substitution Matrices
 Global vs Local Alignment
 Recursions for Global, time complexity
 Global alignment with affine gap penalty, time
complexity  Similarity scoring and local alignment
 Recursion for local, time complexity
 Finding suboptimal local alignments declumping
 Substitution Matrices
44Global vs Local Alignment
 Given two strings, X and Y
 global alignment produces an alignment that
contains all of X and all of Y.  local alignment produces an alignment that
contains only the best matching substrings, one
from X and one from Y.
X
Y
X
Y
45Global vs Local Alignment
 Global alignment is useful when
 The sequences are known to be related throughout
their length, for example, similar protein
sequences from close species.  Local alignment is useful when
 The sequences are believed to contain parts that
are closely related.
46Global Alignment Problem
 Given two sequences X and Y and alignment
scoring functions,  Find the best scoring alignment that includes
all of X and all of Y.  Solution Dynamic Programming
47Global Alignment
 Analysis of global alignment is similar to the
LCS. Alignments can end in one of three ways.
In terms of the prefix strings x1xi and y1yj,
we have  1. xi and yj are aligned with each other. (Here
it makes no difference whether xi and yj are the
same.)  Gi,j Gi 1, j 1 s(xi, yj)
 X C G T
 Y C G C
48Global Alignment
 xi is deleted (aligned against a dash).
 Gi, j Gi 1, j g
 X C A T
 Y C A 
 yj is deleted (aligned against a dash).
 Gi, j Gi, j 1 g
 X C A
 Y C A A
49Global alignment recursion(similarity scoring)
50Global alignment example
match 2, mismatch  3, gap  4
51Global Alignment Example
match 2, mismatch  3, gap  4
52Global Alignment Example
Tracing back for an alignment
C G T A G C C T A G A
53Global alignment time complexity
 As with the LCS problem, there are (n 1) (m
1) cells in the dynamic programming array. Each
is filled by examining 3 other cells in constant
time. The time complexity to fill the array is
O(nm).  Tracing back for a global alignment takes at most
n m steps.  The total time complexity is therefore O(nm).
54Global alignment and affine gap penalty
 Recall the affine gap penalty function
 g(k) a ßk
 When xi or yj is deleted, we have to consider
that it could be the last of a string of
characters that is deleted as one unit. And the
size of that unit will affect the deletion cost.
55Time Complexity (naïve)with Affine Gap Cost
 For each (i, j) in the alignment matrix, there
are O(n m) posible deletion costs that must be
considered in order to choose the optimal cost.
Without any improvements, the time complexity
grows to O(nm(n m)) or cubic O(n3) time.
56Refining the Affine Gap Computation
 The regularity of deletion costs helps reduce the
time complexity. Observe the two tables.
57Auxillary Functions for Affine Gap
Ei,j is max of all possibilities
58Auxillary Functions for Affine Gap
Fi,j is the maximum of all possibilities
59Global Alignment with Affine Gap
recursion(similarity scoring)
60Time Complexity with affine gap cost
 A total of (n 1)(m 1) cells must be computed.
For each cell, E, F, and G values must be
computed. E and F both require looking up 2
values. G requires looking up 3 values. Time to
compute scores is O(nm).  Tracing back can be done in O(n m) if the E and
F values are retained. This triples the memory
space required for scoring arrays (E, F, and G).  Total time O(nm).
61Local Alignment Problem
 Given two sequences X and Y and alignment
scoring functions,  Find the best scoring alignment over all
substring pairs, one from X and one from Y.  Solution Dynamic Programming
62Local Alignment Looks Harder than Global
Alignment
 Where global alignment asks for the solution to
one problem, the best alignment of X1m versus
Y1n, local alignment asks for the best
alignment out of O(n4) subproblems, any substring
in X versus any substring in Y  Xhi versus Ykj
 for 1 h i m, 1 k j n
 Instead, we solve O(n2) subproblems, the best
alignment of any substring ending at xi versus
any substring ending at yj.
63Local Alignment and Similarity Scoring
 Local alignment uses similarity scoring for the
following reason. When an alignment score is
negative, the alignment is worse than no
alignment at all. For an (i, j) pair, it often
happens that the best alignment of every
substring ending at xi with every substring
ending at yj has a negative score.  Similarity scoring detects these bad alignments
and local alignment discards them. If every
alignment score for an (i, j) cell is negative,
then the score is reset to zero.
64Local Alignment Recursion
65Local Alignment Time Complexity
 Proportional to the product of the sequence
lengths O(nm).
66Finding Suboptimal Alignments
 When computing local alignment, we may want to
know optimal and suboptimal alignments. This
can be important in the case where the sequences
contain several parts that are similar.
X
Y
67High Alignment Scores may not be Independent
68Declump scores by prohibiting match and
substitution pairings from realignment
69Source Michael S. Waterman, 1994
70Source Michael S. Waterman, 1994
71Substitution Matrices
 Used for protein alignments
 Substitution rates for amino acid pairs are
determined from known similar sequences  Matrices contain logodds scores
 First matrices designed by Margaret Dayhof are
called PAM matrices (Point Accepted Mutation)  Current matrices designed by Henikoff and
Henikoff are called BLOSUM matrices (Blocks
Substitution Matrices) 
72(No Transcript)
73(No Transcript)
74Odds Ratios
 Odds is a ratio of probabilities for two events
which are mutually exclusive.  In horse racing, for example, odds is the ratio
of the betting that a horse will lose to the
betting that the horse will win. So, for a horse
with odds of 20 to 1, the betting is 20 times
higher that the horse will lose than it will win,
while for a horse with odds of 3 to 2, the
betting is only 1.5 times higher that the horse
will lose than win.
75Alignment and Odds
 A substitution score can be interpreted as an
odds ratio. For an individual pair of aligned
amino acids, the events are  The pair are aligned because they are
evolutionarily related  The pair are aligned merely by chance.
 For each pair, the relevant question is
 What are the odds that amino acid i would be
substituted for amino acid j if they were
evolutionarily related?  If the odds are good, then the pair supports the
alignment.  If the odds are bad, then the pair reduces
confidence in the alignment.
76Log Odds
 The odds for the entire alignment
 Aligned because the sequences are
evolutionarily related or aligned by chance
alone  can be obtained by multiplying the odds for each
aligned pair of amino acids.  Multiplication is expensive computationally, so
logarithms are used because they can be added.
77LogOdds Substitution Matrices
 The BLOSUM and PAM substitution matrices contain
logodds values. The ratios have the basic form  Observed probability of pairing amino acids i
and j in related sequences  Expected probability of pairing at random
Oij Eij
78BLOSUM Data
 DATA
 Ungapped multiple alignments (blocks) taken from
504 families of known related protein sequences
in the PROSITE database. In the original paper
(1992), this produced 2100 blocks.
79BLOSUM Observed Frequencies
 A typical block
 R T S C L H
 K T S A N P
 K W D C L P
 K
 R
 E
First column pairs RK 6 RR 1 RE 2 KK
3 KE 3 Repeat and accumulate for every column.
Fij Pair (i, j) counts Oi,j Fij / total
number of pairs observed pair frequencies
80BLOSUM Background Frequencies
 Pi probability of amino acid i occurring in an
amino acid pair  Pi Oii Sj ? i Oij / 2
 Eij expected probability of random pairs
 Eij
Pi Pj if i j Pi Pj Pj Pi
if i ? j
81BLOSUM LogOdds Values
 Sij Logodds ratio for aligning amino acids i
and j.  Sij 2 log 2 (Oij / Eij)
 If observed frequencies are
 as expected, Sij 0
 greater than expected, Sij gt 0 (positive)
 less than expected, Sij lt 0 (negative)
82Related Sequences Must Be Clustered
 Overrepresentation of closely related sequences
can bias the matrices.  K
 K 4 sequences more than
 K 80 identical
 K
 R
 E
 The overrepresentation of the closely related
sequences increases the observed K to K pairing,
falsely increasing the logodds score for that
pair.
Other related sequences
83Clustering Sequences
 Closely related sequences are clustered and the
letters in any cluster are given fractional
values. In the cluster below, amino acids K in
the first column counts as ¼ each rather than 1.
R and T in the fourth column count as ½ each.  K R
 K R 4 sequences more than
 K T 80 identical clustered
 K T
 R L
 E K
84The BLOSUM Matrix Family
 Blosum 80 clusters sequences if they are 80
identical (clustering is not transitive).  Blosum 62 clusters sequences if they are 62
identical.  Lower numbers yield matrices that give higher
scores to alignments of distantly related
sequences.
85(No Transcript)
86Topic 3 Outline
 Database Searching Algorithms
 Sensitivity vs Selectivity vs Running Time
 Dot Plots
 Blast
 High Scoring Segment Pairs (HSP)
 Target Words
 Extending a Hit
 Statistics of HSP scores
 Gapped Blast modifications
87Alternatives to Alignment
 Alignment is fine when the sequences are
relatively short, but is unusable for longer
sequences such as are encountered in  database searches
 comparison of genomes
 large repetition identification
 because it takes too long. For these, we need
alternate methods.
88Dot Plots
 Two dimensional array like an alignment array.
Put a dot in each cell where the sequence
characters match. Long diagonal runs of dots
indicate sequence similarity.  Match rule for proteins might be positive
substitution value in BLOSUM matrix.
89Extended Dot Plots (Nature, 19 June 2003, Vol
423, p 831)
90(No Transcript)
91Tandem repeat
92Database Search AlgorithmsSensitivity,
Selectivity, Running Time
 Sensitivity the ability to detect weak
similarities between sequences (often due to long
evolutionary separation). Increasing sensitivity
reduces false negatives, those database sequences
similar to the query, but rejected.  Selectivity the ability to screen out
similarities due to chance. Increasing
selectivity reduces false positives, those
sequences recognized as similar when they are not.
Running time
1 Running time
Sensitivity
Selectivity
93BLAST
 The BLAST program is designed for fast sequence
to database search. The basic idea is that a
high scoring alignment between the query and a
database sequence will almost always contain a
core part that is well conserved. This core is
called a Highscoring Segment Pair (HSP). The
statistical theory behind BLAST deals with the
probability of finding HSPs.
94Highscoring Segment Pair (HSP)
 An HSP consists of two equal length substrings,
one from the query and one from the database
sequence. When aligned without gaps, the score,
S, is above a specified threshold and is locally
maximal, meaning that extending the substrings on
either end to lengthen the alignment produces a
smaller score.  Suppose we have the following two sequences
X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F A L
C C T W A F Y C 1 1 4 0 2 9 13
One of many segment pairs is (BLOSUM scoring)
95High Scoring Segment Pairs
X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F
A L C C T W A F Y C 1 1 4
0 2 9 13 This pair is locally maximal
because shortening or lengthening the alignment
reduces the score. L K F S F A L C
C S R P T W A F Y C 2 2 4 1
1 4 0 2 9
96Assumptions for the Statistical Theory
 We assume
 a simple protein model the probabilities for
the amino acids appearing at each position in a
protein are independent and identically
distributed (iid) and amino acid i occurs
randomly with probability Pi.  the substitution matrix (such as BLOSUM 62) has
at least one positive score.  the expected score for amino acid pairs
 S pipj sij lt 0
 is negative.
97Normalized Scores of HSPs
 Given Pi and Sij, the statistical theory yields
two parameters, ? and K which can be used to
normalize an HSP score S with the formula  S'
 The units of S' are bits. When two random
proteins sequences are compared, the expected
number of HSPs with a normalized score of S' is  E N / 2S'
 where N is the product of the sequence lengths.
?S ln K ln 2
98Score for a Given Statistical Significance
 Solving for S' yields
 S' log2 (N / E)
 For example, if
 the query protein has length 250.
 the database has length 50 000 000
 the Evalue is .05
 then the score required to reach that level of
statistical signficance is 38 bits.
99Deficiencies in the BLAST theory
 The result is asymptotic, meaning there is some
error for finite sequences.  Local variations in residue composition in real
sequences often do not match the model.  The model assumes that all mutations at all sites
in a protein sequence can be described by a
single substitution matrix.  It does not apply to gapped alignments.
100Finding HSPs
 Blast finds HSPs by first looking for well
conserved core alignments, i.e., ungapped
alignments between short query words of length
3 and database words. Each core alignment must
score greater than a threshold score, T, which is
typically 13 using the BLOSUM matrix.  Core alignments or hits are typically found
using pattern matching finite automata, i.e.
linear search through the database.
101Query and Target Words
 Suppose the query sequence is LVNRKPVVP.
 Chop the query into all possible words of length
3  LVN VNR NRK RKP KPV PVV VVP
 Collect target words associated with each query
word. For example, the word RKP has six target
words that score at least 13 (using BLOSUM 62)
when aligned with RKP  QKP KKP RQP REP RRP RKP
 In general, there will be a large set of target
words derived from the query sequence. Each
target word and its associated query word could
form the core of an HSP.
102Extending a Hit
 A hit is extended in either direction to find its
locally maximal segment pair. Extension
terminates when the score drops too far below the
highest score found.  For example, Suppose the target RRP is found and
it occurs in the sequence  EPGVCRRPLKCTAS
 When trying to extend the core against LVNRKPVVP
the HSP has 6 letters and a score of 16  L V N R K P V V P
 G V C R R P L K C
 3 4 3 5 2 7 1 2 3
103Gapped Blast
 Gapped BLAST (1997) has two improvements over the
original BLAST (1990).  Two hits Only extends core alignments when two
occur nearby on the same diagonal. Involves
lowering the threshold T to retain sensitivity,
but reduces extension which is the most costly.  Gapped alignments are computed. Allows raising
the threshold T while retaining selectivity.
Speeds initial database scan.
104Optimal Substitution Matricesfor Distant
Homologies
 Among HSPs (ungapped) from the comparison of
random sequences, amino acids ai and aj are
aligned with frequency approaching  qij pipje?sij
 The qij are called target frequencies for the
given substitution matrix sij. Among alignments
of distantly related proteins, amino acids tend
to be paired with certain characteristic
frequencies. Only if these correspond to a
matrixs target frequencies ... can the matrix be
optimal for distinguishing the distant local
homologies. (Altschul 1991)
105LogOdds Matrices Again
 Rearranging qij pipje?sij yields
 sij ln(qij /pipj)/?
 where ? acts as a scaling factor for the
logarithm. This shows  that the substitution scores are inherently
logodds scores for the target and background
frequencies, and  scores may be chosen that correspond to any
desired set of target frequencies.
106Topic 4 Outline
 Specialized Sequence Alignment Algorithms
 Sim4
 Blat
107sim4
 Sim4 is a program for aligning a cDNA sequence to
a genomic sequence.  cDNA is complementary to a messenger RNA which is
the RNA molecule after introns have been cut out
and the exons spliced together. The difference
between cDNA and genomic DNA is the absence of
the intron sequences in cDNA.  Sim4 assumes that the differences between the two
sequences are limited to  introns in the genomic sequence
 sequencing errors in either sequence
108sim4 Find HSPs
 In the first step, sim4 finds HSPs which must
have an exact matching core of 12 nucleotides.
The core is extended on both ends with a score of
1 for match and 5 for mismatch.
109sim4 Select chains of HSPs
 HSps are chained with the constraints that
 starting positions in the cDNA are increasing
 HSPs are in nearby diagonals or are in diagonals
separated by plausible intron distances
Genomic
cDNA
starting positions increase
gap typical for intron
110sim4 Trim overlaps in cDNA
 Exon cores that overlap in the cDNA are trimmed
to find  GT ... AG, a common intron signal, in the
genomic DNA.
GT
AG
overlap
111sim4 Trim overlaps in cDNA
 Exon cores that overlap in the cDNA are trimmed
to find  GT ... AG, a common intron signal, in the
genomic DNA.
GT
AG
112sim4 Filling Gaps in cDNA
 Gaps are filled by recursively finding HSPs with
smaller cores (starting with exact matches of
length 8)
smaller cores are chained as before
113BLAT The BLASTLike Alignment Tool
 BLAT is a specialized program for mRNA to DNA
alignments and crossspecies protein alignments.
It is faster than BLAST and sim4 for these tasks.
It is not appropriate for distant homology
searching.  BLAT is available at the UC Santa Cruz human
genome website for database searches against the
human genome.
114BLAT vs BLAST
 Both BLAT and BLAST first look for highscoring
pairs  BLAST collects short words in the query and does
a linear scan through the database for those
words.  BLAT builds an index of the database (a data
structure suited for rapid detection of word
matches) and then scans linearly through the
query. Since the query is much smaller than the
database, the scan phase is very rapid. The data
structure is persistent, meaning that it is built
once and then reused for every query search.
115BLAT Hits
 To form the database index, BLAT cuts the
database sequences into nonoverlapping words of
length k.
Database
Query
A hit is an exact match or a near perfect match
(single character mismatch) with any query
subword.
Query kwords
116BLAT Criteria for Extending Hits
 BLAT allows different criteria for extending
hits  Single exact match
 Single nearexact match (one difference)
 C A G T G C G A T G A
 C A G T A C G A T G A
 Two exact matches on the same diagonal
separated by a small distance
117BLAT Statistics
 The size, k, of matching words is selected to
balance sensitivity and selectivity with running
time. BLAT makes flexible assumptions about the
query and database sequences to select k, the
word match size  M the percent matching between homologous
regions (98 for cDNA/genomic alignments, 89 for
crossspecies protein alignments)  H the size of the homologous regions
 G the size of the database (3 000 000 000
bases)  Q the size of a query
 A the alphabet size (4 for DNA, 20 for protein)
118BLAT Statistics for k Selection (Exact Match)
 The number of nonoverlapping kwords in a
homologous region is  T floor (H / k)
 Sequence letters are assumed to be iid. The
probability that  a match occurs between a homologous region kword
and the query is  p Mk
 no match occurs is
 q (1 p)
 at least one homologous region word matches the
query is  Phit 1 minus no matches 1 q T
119BLAT Statistics for k Selection(Exact Match)
 The probability that two k words match by chance
is  r (1/A)k
 The number of
 query words is qw Q k 1
 database words is dw G/k
 The number of k words that are expected to match
by chance is  F qw dw r
120BLAT Statistics for k Selection
 With increasing k,
 the probability of a valid hit, Phit, goes down
because it becomes harder to find two words that
match due to errors or mutations.  the number of false hits, F, goes down because
probability of random word matching decreases as
words grow.  For example, in DNA, with M 0.95, H 100, and
Q 500, if  k 11, then Phit 0.999 and F 32 512
 k 14, then Phit 0.991 and F 399
 By decreasing sensitivity slightly, the number of
false hits can be reduced by almost a factor of
100. 
121BLAT indexing
 Each kword (nonoverlapping) in the database is
converted to a number in base four  T C A G T T A
 3 1 0 2 3 3 04
 List I points to a list L, the sorted locations
of the kword in the database sequences.
A list, I, of all possible numbers is
maintained. For DNA, when k 7, this list has
47 or 16,384 entries.
I
3102330
3102331
L
100
730
600
350
930
870
122Deficiencies in the BLAT Program
 Near exact matching with proteins requires an
index which is too large to fit in memory, so a
less efficient hashing scheme is used instead  The alignment method is not standard optimal
alignment and  for DNA does not work well below 90 sequence
identity  for proteins does not work well with indels in
one of the sequences
123Topic 5 Outline
 Searching for Repetitive Motifs and Patterns
 Pattern Detection vs Alignment
 Short word methods
 TRF
124Pattern Detection
 In pattern detection problems
 there is no query sequence
 we look for a repetitive pattern or motif in one
long sequence or several sequences  broad characteristics of the target pattern are
specified in advance
125Short Word Methods
 In short word methods, small matching words are
detected. A cluster of short words indicates a
potential pattern. A typical applications is the
search for tandem repeats in DNA sequences.
Word spacings may differ.
126Short Word Methods
 Short word methods are well suited to DNA
sequences because  the alphabet is small so exact matching words are
common even in homologous regions that have
experienced significant mutation.  they can handle insertions and deletions that are
common in DNA  They are less well suited to protein sequences
because  large amino acid alphabet and frequent
substitution make short matching words uncommon.
127Tandem Repeats
A tandem repeat is any pattern of nucleotides
that has been duplicated so that it appears
several times in succession. For example, the
sequence fragment below contains a tandem repeat
of the trinucleotide CGT
tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt
128Approximate Tandem Repeats
More typically, the tandem copies are only
approximate due to mutations. Here is an
alignment of copies from a tandem repeat in C.
elegans.
Shown are the copies and a
consensus pattern
129Tandem Repeats Associated with Human Disease
 Trinucleotide diseases caused by expansion of a
trinucleotide repeat  FragileX mental retardation
 Myotonic dystrophy
 Huntingtons disease
 Friedreichs ataxia
 Multilocus diseases linked in some cases to
unstable or uncommon minisatellites  Epilepsy
 Diabetes
 Ovarian cancer
130Tandem Repeats Function and Usefulness
 Tandem repeats
 are involved in gene regulation and often
contain putative transcription factor binding
sites.  exhibit copy number polymorphism, making them
valuable genomic markers.
131Tandem Repeats Finder (TRF)
 TRF finds tandem repeats in genomic DNA sequences
using short word matches (ktuple matches). It
assumes  repeats have on average gt80 sequence identity
 insertions and deletions occur on average in lt
10 of the pattern positions 
 These are average values for program parameter
settings. Repeats with lower sequence identity
and higher indel frequency are also found.
132 LOCUS HUMFMR1 3765 bp
mRNA PRI 08NOV1994  DEFINITION Human Fragile X mental
retardation 1 FMR1 gene, 3' end, clones  1 gacggaggcg
cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
gcggcggcgg  61 cggcggaggc ggcggcggcg gcggcggcgg
cggcggaggc ggcggcggcg gcggcggcgg  121 cggcggctgg gcctcgagcg cccgcagccc
acctctcggg ggcgggctcc cggcgctagc  181 agggctgaag agaagatgga ggagctggtg
gtggaagtgc ggggctccaa tggcgctttc  241 tacaaggcat ttgtaaagga tgttcatgaa
gattcaataa cagttgcatt tgaaaacaac  301 tggcagcctg ataggcagat tccatttcat
gatgtcagat tcccacctcc tgtaggttat  361 aataaagata taaatgaaag tgatgaagtt
gaggtgtatt ccagagcaaa tgaaaaagag  421 ccttgctgtt ggtggttagc taaagtgagg
atgataaagg gtgagtttta tgtgatagaa  481 tatgcagcat gtgatgcaac ttacaatgaa
attgtcacaa ttgaacgtct aagatctgtt  541 aatcccaaca aacctgccac aaaagatact
ttccataaga tcaagctgga tgtgccagaa  601 gacttacggc aaatgtgtgc caaagaggcg
gcacataagg attttaaaaa ggcagttggt  661 gccttttctg taacttatga tccagaaaat
tatcagcttg tcattttgtc catcaatgaa  721 gtcacctcaa agcgagcaca tatgctgatt
gacatgcact ttcggagtct gcgcactaag  781 ttgtctctga taatgagaaa tgaagaagct
agtaagcagc tggagagttc aaggcagctt  841 gcctcgagat ttcatgaaca gtttatcgta
agagaagatc tgatgggtct agctattggt  901 actcatggtg ctaatattca gcaagctaga
aaagtacctg gggtcactgc tattgatcta  961 gatgaagata cctgcacatt tcatatttat
ggagaggatc aggatgcagt gaaaaaagct
133 LOCUS HUMFMR1 3765 bp
mRNA PRI 08NOV1994  DEFINITION Human Fragile X mental
retardation 1 FMR1 gene, 3' end, clones  1 gacggaggcg
cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
gcggcggcgg  61 cggcggaggc ggcggcggcg gcggcggcgg
cggcggaggc ggcggcggcg gcggcggcgg  121 cggcggctgg gcctcgagcg cccgcagccc
acctctcggg ggcgggctcc cggcgctagc  181 agggctgaag agaagatgga ggagctggtg
gtggaagtgc ggggctccaa tggcgctttc  241 tacaaggcat ttgtaaagga tgttcatgaa
gattcaataa cagttgcatt tgaaaacaac  301 tggcagcctg ataggcagat tccatttcat
gatgtcagat tcccacctcc tgtaggttat  361 aataaagata taaatgaaag tgatgaagtt
gaggtgtatt ccagagcaaa tgaaaaagag  421 ccttgctgtt ggtggttagc taaagtgagg
atgataaagg gtgagtttta tgtgatagaa  481 tatgcagcat gtgatgcaac ttacaatgaa
attgtcacaa ttgaacgtct aagatctgtt  541 aatcccaaca aacctgccac aaaagatact
ttccataaga tcaagctgga tgtgccagaa  601 gacttacggc aaatgtgtgc caaagaggcg
gcacataagg attttaaaaa ggcagttggt  661 gccttttctg taacttatga tccagaaaat
tatcagcttg tcattttgtc catcaatgaa  721 gtcacctcaa agcgagcaca tatgctgatt
gacatgcact ttcggagtct gcgcactaag  781 ttgtctctga taatgagaaa tgaagaagct
agtaagcagc tggagagttc aaggcagctt  841 gcctcgagat ttcatgaaca gtttatcgta
agagaagatc tgatgggtct agctattggt  901 actcatggtg ctaatattca gcaagctaga
aaagtacctg gggtcactgc tattgatcta  961 gatgaagata cctgcacatt tcatatttat
ggagaggatc aggatgcagt gaaaaaagct
134 LOCUS RATIGCA 4461 bp DNA ROD
18APR1994 DEFINITION Rat Ig germline
epsilon Hchain gene Cregion, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
135 LOCUS RATIGCA 4461 bp DNA ROD
18APR1994 DEFINITION Rat Ig germline
epsilon Hchain gene Cregion, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
136Basic Assumption
 Mutated, adjacent copies of a pattern will
contain runs of exact matches.
T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
137Basic Assumption
 Mutated, adjacent copies of a pattern will
contain runs of exact matches.
T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
Runs of matches are identified using ktuple
matches.
138Using ktuple matches
 For purposes of program efficiency, fixed size
ktuples are used.
139ktuple matches
In pattern matching, a ktuple is a window of
length k which contains text characters. Two
windows which contain the same text, form a
ktuple match GAACGTTAGGTAACTGCAT CCTAGTTATACG
TTAAC
140(No Transcript)
141Modeling Tandem Repeats
 Multiple ktuple matches suggest the occurrence
of a tandem repeat. The appropriate number of
matches depends on how a tandem repeat is
defined. For example, are the following two
aligned sequence fragments two copies of the same
underlying pattern?  TCGGCATCAGTCTATGG
 TCAATGGTGTTGG

142A Stochastic Model
 TRFs stochastic model is based on the
probability of character matching and the
frequency of insertion and deletion (indels)
between aligned adjacent copies.  CCACAACCCGTCAGGCAAGT
 CTGCACCATCGTCTGGGAAGT
 HTTHHTHTTHHHHTHHTHHHH
 Note that the alignment has been converted into a
Bernoulli (cointoss) sequence.
143Model Parameters
 PM the expected frequency of a match
 PI the expected frequency of an indel
 The parameters are applied to Bernoulli sequences
to establish criteria for detecting the repeats.
144Number of matches to indicate a repeat
 Sum of heads is the minimum required number of
matches  for a repeat with period n
 match probability p
 tuple size k
 Unless k 1, not all matches will be detected.
 For example
 HHTHHHHTHTHHHTTHHHT
k H seen
1 13
2 12
3 10
4 4
145Sum of heads
 Suppose a random Bernoulli sequence has length
100 and the expected number of heads is 75 (PM
0.75). If we count the number of heads, then 95
of the time we expect to count at least 68 heads.
 If we count only heads that occur in runs of
length 5 or more, then 95 of the time we expect
to count at least 26 heads. This is the sum of
heads criteria.
146Other Criteria
 Apparent Size Used to distinguish tandem from
nontandem repeats.  Waiting time Used to pick a suitable tuple
size.  Random walk Used to accommodate insertions and
deletions.
147(No Transcript)
148The Tandem Repeats Database
 The Tandem Repeats Database (TRDB) is
 A public database of information on tandem
repeats.  A private workspace for extended research on
tandem repeats.
149C. elegans Distribution of Repeat Pattern Size
Chr 1
150Human Distribution of Pattern Size Chr 1
151C. elegans Distribution of Repeat Location Chr 1
152Human Distribution of Repeat Location Chr 1
153Human Distribution of Repeat Location Chr 1 for
Patternsize gt 5
154Clusters in 41 bp Repeats
155Two identical repeats
156Four repeats from a tandem repeat cluster
157Topic 6 Outline
 Composition Alignment
 Sequence composition and composition match
 Composition alignment algorithm
 Composition match scoring functions
 Limiting the length of a composition match
 Growth of local composition alignment scores
 Biological examples
158Sequence Composition
 Composition is a vector quantity describing the
frequency of occurrence of each alphabet letter
in a particular string. Let S be a string over
S. Then,  C(S)(fs1 , fs2 , fs3 , , fsS)
 is the composition of S, where fsi is the
fraction of the characters in S that are si.  Note that the order of letters is irrelevant as
it has no effect on the composition.
159Composition Example
 S ACTGTACCTGGCGCTATT
 C(S) ( 0.17, 0.28, 0.22, 0.33 )
 A C G
T
160Composition Match
 Two strings, S and T, have a composition match if
their lengths are equal and C(S) C(T).  For example, S and T below have a composition
match  S ACTGTACCTGGCGCTATT
 T AAACCCCCGGGGTTTTTT
161Composition and Sequence Features
 Isochores Multimegabase, specifically GCrich
or GCpoor. GCrich isochores have greater gene
density.  CpG Islands Several hundred nucleotides, rich
in the dinucleotide CG which is underrepresented
in eukaryotic genomes. Methylation of the
cystine in these dinucleotides affects gene
expression.  Protein binding regions Tens of nucleotides,
dinucleotide composition contributes to DNA
flexibility, allowing the helix to change shape
during protein binding.
162Composition Alignment Problem
 Given Two sequences, S of length m, and T of
length n, over an alphabet S, and a scoring
function cm(s, t) for the score of a composition
match between substrings s and t.  Find The best scoring alignment (global or
local) of S with T such that the allowed scoring
options include composition match between
substrings of S and T as well as the standard
options of single character match, single
character mismatch, insertion and deletion.
163Example of composition alignment
 S AACGTCTTTGAGCTC
 T AGCCTGACTGCCTA
 Alignment
 AACGTCTTTGAGCTC
 ltgt ltgt
 AGCCTGACTGCCTA
164Algorithm Analysis
 Given two sequences, S and T, the best alignment
of the prefix strings  S1, i s1 si
 T1, j t1 tj
 ends in one of four ways, mismatch, insertion,
deletion, or composition match between suffixes
of length l  1 l min(i, j, limit)
 i.e., between substrings Si l 1, i and Tj
l 1, j
165Time Complexity
 Computing the optimal composition alignment is
done with dynamic programming and is similar to
standard alignment, except for the composition
match scoring option. The overall time
complexity is  O(nmZ)
 where Z is the time required per (i, j) pair to
find the best length l for the composition match.
166Computing length of the shortest composition
match
 Our goal here is to start with two strings, S and
T, of equal length, and for each prefix pair S1,
k, T1, k, find the length of the shortest
suffixes that have a composition match. For
example, let  S AACGTCTTTGAGCT
 T AGCCTGACTGCCTA
 Then for k 6, the shortest suffixes which have
a composition match have length 3  S AACGTC
 T AGCCTG
167Composition difference
 Composition difference is a vector quantity for
two strings x and y  CD(x, y) (cs1 , , csS)
 where csi is the difference between the number
of times si occurs in x and in y.
168Using composition difference
Key observation two identical composition
differences at prefix lengths k and g indicate a
composition match of length k g.
169Sorting to find shortest composition matches
Sort on composition difference using stable sort.
Adjacent tuples with the same composition
difference identify shortest composition matches.
170Time complexity for composition matches
 O(nmS) to find nm shortest composition match
lengths for two strings of length n and m.  In our work, S, is a small constant (4 for DNA,
16 for dinucleotides). For larger alphabets, the
method of Amir, Apostolico, Landau and Satta
(2003) can be used.
171Composition match scoring functions
 Functions based on match length, k
 Function 1 cm(k) ck
 Function 2 cm(k) cv k
 where c is a constant.
 Functions based on substring composition
 Function 4 cm(C, B, k) ck H(C,B)
 where H is the relative entropy function, C is
the composition of the matching substrings and B
is a background composition.
172Additive and subadditive scoring functions
 The functions based on length are additive or
subadditive  cm(i j) cm(i) cm(j)
 Lemma For additive or subadditive composition
match scoring functions, any best scoring
alignment is equivalent in score to an alignment
which contains only shortest composition matches.
 Theorem Composition alignment with additive or
subadditive match scoring functions and finite
alphabet has time complexity O(nm).
173The limit parameter
 Intuitively, allowing scrambled letters to match
should increase the amount of matching between
sequences. If too much matching occurs,
alignments will not be meaningful.  The limit parameter is an upper bound on the
length l of the longest single composition match.
174Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length limit 1 2 5 10
binary 50 62.5 75.6 82.4
DNA 25 30 37.5 44.2
Dinucleotide 6.2 6.5 7.1 7.5
175Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length 100, all letters equal
probability p 0.25
limit 1 2 5 10
DNA 25 33.7 44.4 51
Sequence length 400, all letters equal
probability p 0.25
limit 1 2 10 50
dinucleotide 6.25 6.81 7.76 7.78
176Growth of local alignment scoreFunction 1
177Global score as a predictor of local parameter
suitability Function 1
178Growth of local alignment score Function 2
179Global score as a predictor of local parameter
suitability Function 2
180Limit values for DNA
 Function 1 cm(k) ck Limit 3.
 Function 2 cm(k) cvk Limit 10.
 Function 4 cm(C, B, k) ck H(C, B)
 Limit 50.
181Biological examples
 Composition alignment was tested on a set of 1796
promoter sequences from the Eukaryotic Promoter
Database. Each sequence is 600 nucleotides long,
500 bases upstream and 100 downstream of the
transcription initiation site.  Two local alignment scores were produced using
function 1, W using composition alignment and S
using standard alignment. The examples shown have
statistically significant W with W 3 S to
exclude good standard alignments.
182Example 1
Composition alignment and standard alignment of
two promoters. Standard alignment is not
statistically significant. Sequences are
characteristic of CpG islands.
Composition Alignment GCCCGCCCGCCGCGCTCCCGCCCGCC
GCTCTCCGTGGCCCCGCCGCGCTGCCGCCGCCGCCGCTGC ltgt
ltgtltgtltgt ltgtltgt ltgt ltgtltgt ltgtltgt
ltgt ltgt CCGCGCCGCCGCCGTCCGCGCCGCCCCGCCCTTGG
CCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC Standard
Alignment CGCCGCCGCCG CGCCGCCGCCG
183Example 2
Composition alignment of two promoter sequences.
Composition changes at vertical line.
A C G T Left
(0.01, 0.61, 0.30, 0.08) Right (0.19, 0.16,
0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTCCGCCCGCCCCTGC
TCCGGCCTTGCGCCTGCGCACAGTGGGATGCGCGGGGAG ltgtlt
gtltgt ltgt ltgtltgt ltgt
ltgt ltgtltgtltgt ltgtltgtltgtltgtltgt CCGCGC
GCCCCCGCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGCCCCGG
GGGCAGTCGCGCCTGTGAACGGTGAGTGCGGGCAGGG
184Final Slide
 Recognize that similarity among biological
sequences most likely exists in ways which we can
not today perceive and for which we have no
detection tools. You are encouraged, therefore,
to think broadly, beyond the embellishment or
refinement of current methods, to new definitions
of similarity and new problems of comparison.