Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University - PowerPoint PPT Presentation

1 / 184
About This Presentation
Title:

Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University

Description:

Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University – PowerPoint PPT presentation

Number of Views:178
Avg rating:3.0/5.0
Slides: 185
Provided by: Alfre121
Learn more at: http://tandem.bu.edu
Category:

less

Transcript and Presenter's Notes

Title: Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University


1
Searching for Similarity in SequencesGary
BensonDepartments of Computer Science and
BiologyBoston University
2
Topic 1 Outline
  • Similarity and Alignment
  • Define homology, similarity by descent and
    similarity by convergence
  • Common mutations and their mathematical models
  • Alignments
  • Scoring Alignments
  • Gap penalty functions
  • Computing the best scoring alignment the
    Longest Common Subsequence (LCS) problem

3
Similarity and Biomolecules
  • Similarity is expected among biomolecules that
    are descended from a common ancestor. Mutations
    cause differences, but survival of the organism
    requires that mutations occur in regions that are
    less critical to function while important
    catalytic, regulatory or structural regions
    remain similar.

4
Similarity and Evolution
  • Evolution has duplicated and shuffled bits and
    pieces of molecules to produce new linear
    arrangements that combine function in novel ways.
    Regions of similarity often suggest an
    evolutionary tie and/or common functional
    properties between very different molecules.

5
Three common similarity problems
  1. Start with a query sequence with unknown
    properties and search within a database of
    millions of sequences to find those which share
    similarity with the query.
  2. Start with a small set of sequences and identify
    similarities and differences among them.
  3. In many sequences or very long sequences, detect
    commonly occurring patterns.

6
What is Similarity?How can we measure it?
7
Morphology
  • Morphology is the form and structure of an
    organism.
  • Should shared morphology mean similarity?

8
Hands
9
Aquatic Shape
10
Shared morphology
  • Shared morphology does not necessarily imply
    common ancestry.
  • The animals with hands have all evolved from a
    common ancester with a hand.
  • The ichthyosaur, shark and porpoise each evolved
    sea life adaptations independently.

11
Homology
  • When similarity is due to common ancestry, we
    call it homology.

12
  • Modern molecular biology seeks to understand
    cellular processes through the action of DNA,
    RNA, and protein molecules. This will ultimately
    lead to a biochemical understanding of
  • The pathogenesis of infectious diseases like
    AIDS, hepatitus and SARS.
  • The mutagenic properties of environmental toxins
    and how they lead to diseases like cancer.
  • The etiology of human genetic disease.
  • Strategies to prevent and treat diseases through
    drug and vaccine design, gene therapy, risk
    reduction, etc.

13
How homology helps
  • Given molecular sequences X and Y
  • X Y AND INFO(Y) gt INFO(X)
  • ( means similar)

14
Are the Sequences Similar?
15
Are the Sequences Similar
  • How similar?
  • What parts are the most similar?
  • Remember, the common ancestor of the two
    sequences may have existed millions of years ago.

16
  • How can we tell if the two sequences are similar?
  • Similarity judgements should be based on
  • The types of changes or mutations that occur
    within sequences.
  • Characteristics of those different types of
    mutations.
  • The frequency of those mutations.

17
Common mutations in DNA
  • Substitution
  • A C G T T G A C
  • A C G A T G A C
  • Deletion
  • A C G T T G A C
  • A C G A C
  • Insertion
  • A C G T T G A C
  • A C G C A A G T T G A C

18
Common mutations
  • Duplication
  • A C G T T G A C
  • A C G T T G A T T G A C
  • Inversion (double stranded DNA shown)
  • A C G T T G A C
  • T G C A A C T G
  • A C T C A A C C
  • A C A G T T G G

19
Frequency of mutations
  • Substitution gt Insertion, Deletion
  • gt gt
  • Duplication
  • gt
  • Inversion

20
Evolutionary history of sequences
21
Alignments
  • There are many ways to align two sequences. We
    just saw one way
  • T T A C G T A C A G A T T A
  • T - - G G A A C A - - - T A
  • Here is another
  • T T A C G T A C A G A T T A
  • T - - - G G A A C - - A T - A
  • Which is better? Remember, we can not choose
    based on the evolutionary history, because that
    is unknown.

22
Alignments and Paths through the Alignment Array
23
Alignments and Paths through the Alignment Array
t a c g - c a a - - - a c g t g a a t t
24
Alignments and PathsAn Alternate Alignment
t - - a c g c a - - a a c g t g - - a a t t
25
Finding the Best AlignmentRanking Alignments by
Score
  • Score an alignment by
  • Partitioning it into columns
  • Assign a weight to each column
  • Sum the column weights

26
Distance Scoring
  • Distance scoring
  • Alignment gets a non-negative score.
  • Alignment of identical sequences scores zero,
  • all others gt zero.
  • Best alignment has smallest score.
  • Typical scoring functions are
  • d(a,a) 0 identity
  • d(a,b) d(b,a) gt 0 a ? b substitution
  • g d(a, ) gt 0 indel (gap)

27
Similarity Scoring
  • Similarity scoring
  • Alignment scores may be positive, zero, or
    negative.
  • More similar means larger positive score.
  • The best alignment has largest score.
  • Typical scoring functions are
  • s(a,b) is gt 0 if a and b are similar in one or
    more characteristics or are observed to
    substitute frequently for each other
  • 0 otherwise substitution
  • g s(a, ) lt 0 indel (gap)

28
Gap penalty functions
  • Single character gap penalty
  • g(a, ) c
  • (c a constant or a value dependent on a)
  • Affine (linear) gap penalty
  • g(k) a ßk
  • (a is a gap opening penalty, ß is a gap
    extension penalty)
  • Concave gap penalty
  • g(k) a ß(m(k))
  • m(k) is a function like log(k) which grows more
    slowly as k increases.

29
Distance Scoring
  • Alignment parameters
  • d(a, a) 0 d(a, b) 2, g 4
  • A G C C G T A T
  • A C G A - - T - T
  • 0 4 0 2 4 4 0 4 0

18
30
Similarity Scoring
  • Scoring parameters
  • s(a, a) 5, s(a, b) - 3, g - 8
  • A G C C G T A T
  • A C G A - - T - T
  • 5 5 5 5
  • 8 3 8 8 8


- 15
-
31
Similarity scoring with affine gap
  • Alignment parameters
  • s(a, a) 5, s(a, b) - 3,
  • g(k) a ßk, a - 5, ß - 4
  • A G C C G T A T
  • A C G A - - T - T
  • 5 5 5 5
  • 8 3 8 4 8


- 11
-
32
Computing the Optimal AlignmentThe LCS Problem
as Prototype
  • The Longest Common Subsequence (LCS) problem is a
    method for comparing sequences. Although the
    solution does not produce an alignment, it
    illustrates a method of dynamic programming that
    is very similar to that used by alignment
    algorithms.

33
Longest Common Subsequence Problem
  • Let X be a string of characters. A subsequence
    X of X is formed by discarding zero or more
    letters of X. Note that the letters in X
    maintain their same order as in X.
  • Let X and Y be two strings. A common subsequence
    Z is a subsequence of both. A longest common
    subsequence (LCS) is the longest such Z.
  • Examples

X a b c d e b a Y b e b d c e a c d Z b d
e a
X a b c d e b a X a b d b
34
LCS Problem
  • Given Two sequences X and Y.
  • Find An LCS for X and Y.
  • A divide and conquer solution can be developed by
    looking at what happens to the last letters in
    each sequence. That is, are they part of the LCS
    solution or not?

35
Possible ways to split the problem
36
LCS recursion
37
Filling the dynamic programming array
38
Filling the dynamic programming array
39
Necessary values in adjacent cells
40
Completed LCS array
41
Tracing back for a solution
LCS bdea
42
LCS time complexity
  • There are (n 1)(m 1) cells in the LCS score
    array. Each cell is filled by examining 3 other
    cells in constant time. The time complexity to
    fill the array is O(nm).
  • Tracing back for an LCS solution takes at most n
    m steps.
  • The total time complexity is therefore O(nm).

43
Topic 2 Outline
  • Types of Alignment
  • Substitution Matrices
  • Global vs Local Alignment
  • Recursions for Global, time complexity
  • Global alignment with affine gap penalty, time
    complexity
  • Similarity scoring and local alignment
  • Recursion for local, time complexity
  • Finding suboptimal local alignments declumping
  • Substitution Matrices

44
Global vs Local Alignment
  • Given two strings, X and Y
  • global alignment produces an alignment that
    contains all of X and all of Y.
  • local alignment produces an alignment that
    contains only the best matching substrings, one
    from X and one from Y.

X
Y
X
Y
45
Global vs Local Alignment
  • Global alignment is useful when
  • The sequences are known to be related throughout
    their length, for example, similar protein
    sequences from close species.
  • Local alignment is useful when
  • The sequences are believed to contain parts that
    are closely related.

46
Global Alignment Problem
  • Given two sequences X and Y and alignment
    scoring functions,
  • Find the best scoring alignment that includes
    all of X and all of Y.
  • Solution Dynamic Programming

47
Global Alignment
  • Analysis of global alignment is similar to the
    LCS. Alignments can end in one of three ways.
    In terms of the prefix strings x1xi and y1yj,
    we have
  • 1. xi and yj are aligned with each other. (Here
    it makes no difference whether xi and yj are the
    same.)
  • Gi,j Gi 1, j 1 s(xi, yj)
  • X C G T
  • Y C G C

48
Global Alignment
  • xi is deleted (aligned against a dash).
  • Gi, j Gi 1, j g
  • X C A T
  • Y C A -
  • yj is deleted (aligned against a dash).
  • Gi, j Gi, j 1 g
  • X C A
  • Y C A A

49
Global alignment recursion(similarity scoring)
50
Global alignment example
match 2, mismatch - 3, gap - 4
51
Global Alignment Example
match 2, mismatch - 3, gap - 4
52
Global Alignment Example
Tracing back for an alignment
C G T A G C C T A G A
53
Global alignment time complexity
  • As with the LCS problem, there are (n 1) (m
    1) cells in the dynamic programming array. Each
    is filled by examining 3 other cells in constant
    time. The time complexity to fill the array is
    O(nm).
  • Tracing back for a global alignment takes at most
    n m steps.
  • The total time complexity is therefore O(nm).

54
Global alignment and affine gap penalty
  • Recall the affine gap penalty function
  • g(k) a ßk
  • When xi or yj is deleted, we have to consider
    that it could be the last of a string of
    characters that is deleted as one unit. And the
    size of that unit will affect the deletion cost.

55
Time Complexity (naïve)with Affine Gap Cost
  • For each (i, j) in the alignment matrix, there
    are O(n m) posible deletion costs that must be
    considered in order to choose the optimal cost.
    Without any improvements, the time complexity
    grows to O(nm(n m)) or cubic O(n3) time.

56
Refining the Affine Gap Computation
  • The regularity of deletion costs helps reduce the
    time complexity. Observe the two tables.

57
Auxillary Functions for Affine Gap
Ei,j is max of all possibilities
58
Auxillary Functions for Affine Gap
Fi,j is the maximum of all possibilities
59
Global Alignment with Affine Gap
recursion(similarity scoring)
60
Time Complexity with affine gap cost
  • A total of (n 1)(m 1) cells must be computed.
    For each cell, E, F, and G values must be
    computed. E and F both require looking up 2
    values. G requires looking up 3 values. Time to
    compute scores is O(nm).
  • Tracing back can be done in O(n m) if the E and
    F values are retained. This triples the memory
    space required for scoring arrays (E, F, and G).
  • Total time O(nm).

61
Local Alignment Problem
  • Given two sequences X and Y and alignment
    scoring functions,
  • Find the best scoring alignment over all
    substring pairs, one from X and one from Y.
  • Solution Dynamic Programming

62
Local Alignment Looks Harder than Global
Alignment
  • Where global alignment asks for the solution to
    one problem, the best alignment of X1m versus
    Y1n, local alignment asks for the best
    alignment out of O(n4) subproblems, any substring
    in X versus any substring in Y
  • Xhi versus Ykj
  • for 1 h i m, 1 k j n
  • Instead, we solve O(n2) subproblems, the best
    alignment of any substring ending at xi versus
    any substring ending at yj.

63
Local Alignment and Similarity Scoring
  • Local alignment uses similarity scoring for the
    following reason. When an alignment score is
    negative, the alignment is worse than no
    alignment at all. For an (i, j) pair, it often
    happens that the best alignment of every
    substring ending at xi with every substring
    ending at yj has a negative score.
  • Similarity scoring detects these bad alignments
    and local alignment discards them. If every
    alignment score for an (i, j) cell is negative,
    then the score is reset to zero.

64
Local Alignment Recursion
65
Local Alignment Time Complexity
  • Proportional to the product of the sequence
    lengths O(nm).

66
Finding Suboptimal Alignments
  • When computing local alignment, we may want to
    know optimal and suboptimal alignments. This
    can be important in the case where the sequences
    contain several parts that are similar.

X
Y
67
High Alignment Scores may not be Independent
68
Declump scores by prohibiting match and
substitution pairings from realignment
69
Source Michael S. Waterman, 1994
70
Source Michael S. Waterman, 1994
71
Substitution Matrices
  • Used for protein alignments
  • Substitution rates for amino acid pairs are
    determined from known similar sequences
  • Matrices contain log-odds scores
  • First matrices designed by Margaret Dayhof are
    called PAM matrices (Point Accepted Mutation)
  • Current matrices designed by Henikoff and
    Henikoff are called BLOSUM matrices (Blocks
    Substitution Matrices)

72
(No Transcript)
73
(No Transcript)
74
Odds Ratios
  • Odds is a ratio of probabilities for two events
    which are mutually exclusive.
  • In horse racing, for example, odds is the ratio
    of the betting that a horse will lose to the
    betting that the horse will win. So, for a horse
    with odds of 20 to 1, the betting is 20 times
    higher that the horse will lose than it will win,
    while for a horse with odds of 3 to 2, the
    betting is only 1.5 times higher that the horse
    will lose than win.

75
Alignment and Odds
  • A substitution score can be interpreted as an
    odds ratio. For an individual pair of aligned
    amino acids, the events are
  • The pair are aligned because they are
    evolutionarily related
  • The pair are aligned merely by chance.
  • For each pair, the relevant question is
  • What are the odds that amino acid i would be
    substituted for amino acid j if they were
    evolutionarily related?
  • If the odds are good, then the pair supports the
    alignment.
  • If the odds are bad, then the pair reduces
    confidence in the alignment.

76
Log Odds
  • The odds for the entire alignment
  • Aligned because the sequences are
    evolutionarily related or aligned by chance
    alone
  • can be obtained by multiplying the odds for each
    aligned pair of amino acids.
  • Multiplication is expensive computationally, so
    logarithms are used because they can be added.

77
Log-Odds Substitution Matrices
  • The BLOSUM and PAM substitution matrices contain
    log-odds values. The ratios have the basic form
  • Observed probability of pairing amino acids i
    and j in related sequences
  • Expected probability of pairing at random

Oij Eij

78
BLOSUM Data
  • DATA
  • Ungapped multiple alignments (blocks) taken from
    504 families of known related protein sequences
    in the PROSITE database. In the original paper
    (1992), this produced 2100 blocks.

79
BLOSUM Observed Frequencies
  • A typical block
  • R T S C L H
  • K T S A N P
  • K W D C L P
  • K
  • R
  • E

First column pairs RK 6 RR 1 RE 2 KK
3 KE 3 Repeat and accumulate for every column.
Fij Pair (i, j) counts Oi,j Fij / total
number of pairs observed pair frequencies
80
BLOSUM Background Frequencies
  • Pi probability of amino acid i occurring in an
    amino acid pair
  • Pi Oii Sj ? i Oij / 2
  • Eij expected probability of random pairs
  • Eij

Pi Pj if i j Pi Pj Pj Pi
if i ? j
81
BLOSUM Log-Odds Values
  • Sij Log-odds ratio for aligning amino acids i
    and j.
  • Sij 2 log 2 (Oij / Eij)
  • If observed frequencies are
  • as expected, Sij 0
  • greater than expected, Sij gt 0 (positive)
  • less than expected, Sij lt 0 (negative)

82
Related Sequences Must Be Clustered
  • Overrepresentation of closely related sequences
    can bias the matrices.
  • K
  • K 4 sequences more than
  • K 80 identical
  • K
  • R
  • E
  • The overrepresentation of the closely related
    sequences increases the observed K to K pairing,
    falsely increasing the log-odds score for that
    pair.

Other related sequences
83
Clustering Sequences
  • Closely related sequences are clustered and the
    letters in any cluster are given fractional
    values. In the cluster below, amino acids K in
    the first column counts as ¼ each rather than 1.
    R and T in the fourth column count as ½ each.
  • K R
  • K R 4 sequences more than
  • K T 80 identical clustered
  • K T
  • R L
  • E K

84
The BLOSUM Matrix Family
  • Blosum 80 clusters sequences if they are 80
    identical (clustering is not transitive).
  • Blosum 62 clusters sequences if they are 62
    identical.
  • Lower numbers yield matrices that give higher
    scores to alignments of distantly related
    sequences.

85
(No Transcript)
86
Topic 3 Outline
  • Database Searching Algorithms
  • Sensitivity vs Selectivity vs Running Time
  • Dot Plots
  • Blast
  • High Scoring Segment Pairs (HSP)
  • Target Words
  • Extending a Hit
  • Statistics of HSP scores
  • Gapped Blast modifications

87
Alternatives to Alignment
  • Alignment is fine when the sequences are
    relatively short, but is unusable for longer
    sequences such as are encountered in
  • database searches
  • comparison of genomes
  • large repetition identification
  • because it takes too long. For these, we need
    alternate methods.

88
Dot Plots
  • Two dimensional array like an alignment array.
    Put a dot in each cell where the sequence
    characters match. Long diagonal runs of dots
    indicate sequence similarity.
  • Match rule for proteins might be positive
    substitution value in BLOSUM matrix.

89
Extended Dot Plots (Nature, 19 June 2003, Vol
423, p 831)
90
(No Transcript)
91
Tandem repeat
92
Database Search AlgorithmsSensitivity,
Selectivity, Running Time
  • Sensitivity the ability to detect weak
    similarities between sequences (often due to long
    evolutionary separation). Increasing sensitivity
    reduces false negatives, those database sequences
    similar to the query, but rejected.
  • Selectivity the ability to screen out
    similarities due to chance. Increasing
    selectivity reduces false positives, those
    sequences recognized as similar when they are not.

Running time
1 Running time
Sensitivity
Selectivity
93
BLAST
  • The BLAST program is designed for fast sequence
    to database search. The basic idea is that a
    high scoring alignment between the query and a
    database sequence will almost always contain a
    core part that is well conserved. This core is
    called a High-scoring Segment Pair (HSP). The
    statistical theory behind BLAST deals with the
    probability of finding HSPs.

94
High-scoring Segment Pair (HSP)
  • An HSP consists of two equal length substrings,
    one from the query and one from the database
    sequence. When aligned without gaps, the score,
    S, is above a specified threshold and is locally
    maximal, meaning that extending the substrings on
    either end to lengthen the alignment produces a
    smaller score.
  • Suppose we have the following two sequences

X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F A L
C C T W A F Y C 1 1 4 0 -2 9 13
One of many segment pairs is (BLOSUM scoring)
95
High Scoring Segment Pairs
X LKFSFALCCTIG Y ADQHLSRPTWAFYC S F
A L C C T W A F Y C 1 1 4
0 -2 9 13 This pair is locally maximal
because shortening or lengthening the alignment
reduces the score. L K F S F A L C
C S R P T W A F Y C -2 2 -4 1
1 4 0 -2 9
96
Assumptions for the Statistical Theory
  • We assume
  • a simple protein model the probabilities for
    the amino acids appearing at each position in a
    protein are independent and identically
    distributed (iid) and amino acid i occurs
    randomly with probability Pi.
  • the substitution matrix (such as BLOSUM 62) has
    at least one positive score.
  • the expected score for amino acid pairs
  • S pipj sij lt 0
  • is negative.

97
Normalized Scores of HSPs
  • Given Pi and Sij, the statistical theory yields
    two parameters, ? and K which can be used to
    normalize an HSP score S with the formula
  • S'
  • The units of S' are bits. When two random
    proteins sequences are compared, the expected
    number of HSPs with a normalized score of S' is
  • E N / 2S'
  • where N is the product of the sequence lengths.

?S ln K ln 2
98
Score for a Given Statistical Significance
  • Solving for S' yields
  • S' log2 (N / E)
  • For example, if
  • the query protein has length 250.
  • the database has length 50 000 000
  • the E-value is .05
  • then the score required to reach that level of
    statistical signficance is 38 bits.

99
Deficiencies in the BLAST theory
  • The result is asymptotic, meaning there is some
    error for finite sequences.
  • Local variations in residue composition in real
    sequences often do not match the model.
  • The model assumes that all mutations at all sites
    in a protein sequence can be described by a
    single substitution matrix.
  • It does not apply to gapped alignments.

100
Finding HSPs
  • Blast finds HSPs by first looking for well
    conserved core alignments, i.e., ungapped
    alignments between short query words of length
    3 and database words. Each core alignment must
    score greater than a threshold score, T, which is
    typically 13 using the BLOSUM matrix.
  • Core alignments or hits are typically found
    using pattern matching finite automata, i.e.
    linear search through the database.

101
Query and Target Words
  • Suppose the query sequence is LVNRKPVVP.
  • Chop the query into all possible words of length
    3
  • LVN VNR NRK RKP KPV PVV VVP
  • Collect target words associated with each query
    word. For example, the word RKP has six target
    words that score at least 13 (using BLOSUM 62)
    when aligned with RKP
  • QKP KKP RQP REP RRP RKP
  • In general, there will be a large set of target
    words derived from the query sequence. Each
    target word and its associated query word could
    form the core of an HSP.

102
Extending a Hit
  • A hit is extended in either direction to find its
    locally maximal segment pair. Extension
    terminates when the score drops too far below the
    highest score found.
  • For example, Suppose the target RRP is found and
    it occurs in the sequence
  • EPGVCRRPLKCTAS
  • When trying to extend the core against LVNRKPVVP
    the HSP has 6 letters and a score of 16
  • L V N R K P V V P
  • G V C R R P L K C
  • -3 4 -3 5 2 7 1 -2 -3

103
Gapped Blast
  • Gapped BLAST (1997) has two improvements over the
    original BLAST (1990).
  • Two hits Only extends core alignments when two
    occur nearby on the same diagonal. Involves
    lowering the threshold T to retain sensitivity,
    but reduces extension which is the most costly.
  • Gapped alignments are computed. Allows raising
    the threshold T while retaining selectivity.
    Speeds initial database scan.

104
Optimal Substitution Matricesfor Distant
Homologies
  • Among HSPs (ungapped) from the comparison of
    random sequences, amino acids ai and aj are
    aligned with frequency approaching
  • qij pipje?sij
  • The qij are called target frequencies for the
    given substitution matrix sij. Among alignments
    of distantly related proteins, amino acids tend
    to be paired with certain characteristic
    frequencies. Only if these correspond to a
    matrixs target frequencies ... can the matrix be
    optimal for distinguishing the distant local
    homologies. (Altschul 1991)

105
Log-Odds Matrices Again
  • Rearranging qij pipje?sij yields
  • sij ln(qij /pipj)/?
  • where ? acts as a scaling factor for the
    logarithm. This shows
  • that the substitution scores are inherently
    log-odds scores for the target and background
    frequencies, and
  • scores may be chosen that correspond to any
    desired set of target frequencies.

106
Topic 4 Outline
  • Specialized Sequence Alignment Algorithms
  • Sim4
  • Blat

107
sim4
  • Sim4 is a program for aligning a cDNA sequence to
    a genomic sequence.
  • cDNA is complementary to a messenger RNA which is
    the RNA molecule after introns have been cut out
    and the exons spliced together. The difference
    between cDNA and genomic DNA is the absence of
    the intron sequences in cDNA.
  • Sim4 assumes that the differences between the two
    sequences are limited to
  • introns in the genomic sequence
  • sequencing errors in either sequence

108
sim4 Find HSPs
  • In the first step, sim4 finds HSPs which must
    have an exact matching core of 12 nucleotides.
    The core is extended on both ends with a score of
    1 for match and -5 for mismatch.

109
sim4 Select chains of HSPs
  • HSps are chained with the constraints that
  • starting positions in the cDNA are increasing
  • HSPs are in nearby diagonals or are in diagonals
    separated by plausible intron distances

Genomic
cDNA
starting positions increase
gap typical for intron
110
sim4 Trim overlaps in cDNA
  • Exon cores that overlap in the cDNA are trimmed
    to find
  • GT ... AG, a common intron signal, in the
    genomic DNA.

GT
AG
overlap
111
sim4 Trim overlaps in cDNA
  • Exon cores that overlap in the cDNA are trimmed
    to find
  • GT ... AG, a common intron signal, in the
    genomic DNA.

GT
AG
112
sim4 Filling Gaps in cDNA
  • Gaps are filled by recursively finding HSPs with
    smaller cores (starting with exact matches of
    length 8)

smaller cores are chained as before
113
BLAT The BLAST-Like Alignment Tool
  • BLAT is a specialized program for mRNA to DNA
    alignments and cross-species protein alignments.
    It is faster than BLAST and sim4 for these tasks.
    It is not appropriate for distant homology
    searching.
  • BLAT is available at the UC Santa Cruz human
    genome website for database searches against the
    human genome.

114
BLAT vs BLAST
  • Both BLAT and BLAST first look for high-scoring
    pairs
  • BLAST collects short words in the query and does
    a linear scan through the database for those
    words.
  • BLAT builds an index of the database (a data
    structure suited for rapid detection of word
    matches) and then scans linearly through the
    query. Since the query is much smaller than the
    database, the scan phase is very rapid. The data
    structure is persistent, meaning that it is built
    once and then reused for every query search.

115
BLAT Hits
  • To form the database index, BLAT cuts the
    database sequences into non-overlapping words of
    length k.

Database
Query
A hit is an exact match or a near perfect match
(single character mismatch) with any query
subword.
Query k-words
116
BLAT Criteria for Extending Hits
  • BLAT allows different criteria for extending
    hits
  • Single exact match
  • Single near-exact match (one difference)
  • C A G T G C G A T G A
  • C A G T A C G A T G A
  • Two exact matches on the same diagonal
    separated by a small distance

117
BLAT Statistics
  • The size, k, of matching words is selected to
    balance sensitivity and selectivity with running
    time. BLAT makes flexible assumptions about the
    query and database sequences to select k, the
    word match size
  • M the percent matching between homologous
    regions (98 for cDNA/genomic alignments, 89 for
    cross-species protein alignments)
  • H the size of the homologous regions
  • G the size of the database (3 000 000 000
    bases)
  • Q the size of a query
  • A the alphabet size (4 for DNA, 20 for protein)

118
BLAT Statistics for k Selection (Exact Match)
  • The number of non-overlapping k-words in a
    homologous region is
  • T floor (H / k)
  • Sequence letters are assumed to be iid. The
    probability that
  • a match occurs between a homologous region k-word
    and the query is
  • p Mk
  • no match occurs is
  • q (1 p)
  • at least one homologous region word matches the
    query is
  • Phit 1 minus no matches 1 q T

119
BLAT Statistics for k Selection(Exact Match)
  • The probability that two k words match by chance
    is
  • r (1/A)k
  • The number of
  • query words is qw Q k 1
  • database words is dw G/k
  • The number of k words that are expected to match
    by chance is
  • F qw dw r

120
BLAT Statistics for k Selection
  • With increasing k,
  • the probability of a valid hit, Phit, goes down
    because it becomes harder to find two words that
    match due to errors or mutations.
  • the number of false hits, F, goes down because
    probability of random word matching decreases as
    words grow.
  • For example, in DNA, with M 0.95, H 100, and
    Q 500, if
  • k 11, then Phit 0.999 and F 32 512
  • k 14, then Phit 0.991 and F 399
  • By decreasing sensitivity slightly, the number of
    false hits can be reduced by almost a factor of
    100.

121
BLAT indexing
  • Each k-word (nonoverlapping) in the database is
    converted to a number in base four
  • T C A G T T A
  • 3 1 0 2 3 3 04
  • List I points to a list L, the sorted locations
    of the k-word in the database sequences.

A list, I, of all possible numbers is
maintained. For DNA, when k 7, this list has
47 or 16,384 entries.
I
3102330
3102331
L
100
730
600
350
930
870
122
Deficiencies in the BLAT Program
  • Near exact matching with proteins requires an
    index which is too large to fit in memory, so a
    less efficient hashing scheme is used instead
  • The alignment method is not standard optimal
    alignment and
  • for DNA does not work well below 90 sequence
    identity
  • for proteins does not work well with indels in
    one of the sequences

123
Topic 5 Outline
  • Searching for Repetitive Motifs and Patterns
  • Pattern Detection vs Alignment
  • Short word methods
  • TRF

124
Pattern Detection
  • In pattern detection problems
  • there is no query sequence
  • we look for a repetitive pattern or motif in one
    long sequence or several sequences
  • broad characteristics of the target pattern are
    specified in advance

125
Short Word Methods
  • In short word methods, small matching words are
    detected. A cluster of short words indicates a
    potential pattern. A typical applications is the
    search for tandem repeats in DNA sequences.

Word spacings may differ.
126
Short Word Methods
  • Short word methods are well suited to DNA
    sequences because
  • the alphabet is small so exact matching words are
    common even in homologous regions that have
    experienced significant mutation.
  • they can handle insertions and deletions that are
    common in DNA
  • They are less well suited to protein sequences
    because
  • large amino acid alphabet and frequent
    substitution make short matching words uncommon.

127
Tandem Repeats
A tandem repeat is any pattern of nucleotides
that has been duplicated so that it appears
several times in succession. For example, the
sequence fragment below contains a tandem repeat
of the trinucleotide CGT
tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt
128
Approximate Tandem Repeats
More typically, the tandem copies are only
approximate due to mutations. Here is an
alignment of copies from a tandem repeat in C.
elegans.
Shown are the copies and a
consensus pattern
129
Tandem Repeats Associated with Human Disease
  • Trinucleotide diseases caused by expansion of a
    trinucleotide repeat
  • Fragile-X mental retardation
  • Myotonic dystrophy
  • Huntingtons disease
  • Friedreichs ataxia
  • Multilocus diseases linked in some cases to
    unstable or uncommon minisatellites
  • Epilepsy
  • Diabetes
  • Ovarian cancer

130
Tandem Repeats Function and Usefulness
  • Tandem repeats
  • are involved in gene regulation and often
    contain putative transcription factor binding
    sites.
  • exhibit copy number polymorphism, making them
    valuable genomic markers.

131
Tandem Repeats Finder (TRF)
  • TRF finds tandem repeats in genomic DNA sequences
    using short word matches (k-tuple matches). It
    assumes
  • repeats have on average gt80 sequence identity
  • insertions and deletions occur on average in lt
    10 of the pattern positions
  • These are average values for program parameter
    settings. Repeats with lower sequence identity
    and higher indel frequency are also found.

132
  • LOCUS HUMFMR1 3765 bp
    mRNA PRI 08-NOV-1994
  • DEFINITION Human Fragile X mental
    retardation 1 FMR-1 gene, 3' end, clones
  • 1 gacggaggcg
    cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
    gcggcggcgg
  • 61 cggcggaggc ggcggcggcg gcggcggcgg
    cggcggaggc ggcggcggcg gcggcggcgg
  • 121 cggcggctgg gcctcgagcg cccgcagccc
    acctctcggg ggcgggctcc cggcgctagc
  • 181 agggctgaag agaagatgga ggagctggtg
    gtggaagtgc ggggctccaa tggcgctttc
  • 241 tacaaggcat ttgtaaagga tgttcatgaa
    gattcaataa cagttgcatt tgaaaacaac
  • 301 tggcagcctg ataggcagat tccatttcat
    gatgtcagat tcccacctcc tgtaggttat
  • 361 aataaagata taaatgaaag tgatgaagtt
    gaggtgtatt ccagagcaaa tgaaaaagag
  • 421 ccttgctgtt ggtggttagc taaagtgagg
    atgataaagg gtgagtttta tgtgatagaa
  • 481 tatgcagcat gtgatgcaac ttacaatgaa
    attgtcacaa ttgaacgtct aagatctgtt
  • 541 aatcccaaca aacctgccac aaaagatact
    ttccataaga tcaagctgga tgtgccagaa
  • 601 gacttacggc aaatgtgtgc caaagaggcg
    gcacataagg attttaaaaa ggcagttggt
  • 661 gccttttctg taacttatga tccagaaaat
    tatcagcttg tcattttgtc catcaatgaa
  • 721 gtcacctcaa agcgagcaca tatgctgatt
    gacatgcact ttcggagtct gcgcactaag
  • 781 ttgtctctga taatgagaaa tgaagaagct
    agtaagcagc tggagagttc aaggcagctt
  • 841 gcctcgagat ttcatgaaca gtttatcgta
    agagaagatc tgatgggtct agctattggt
  • 901 actcatggtg ctaatattca gcaagctaga
    aaagtacctg gggtcactgc tattgatcta
  • 961 gatgaagata cctgcacatt tcatatttat
    ggagaggatc aggatgcagt gaaaaaagct

133
  • LOCUS HUMFMR1 3765 bp
    mRNA PRI 08-NOV-1994
  • DEFINITION Human Fragile X mental
    retardation 1 FMR-1 gene, 3' end, clones
  • 1 gacggaggcg
    cccgtgccag ggggcgtgcg gcagcgcggc ggcggcggcg
    gcggcggcgg
  • 61 cggcggaggc ggcggcggcg gcggcggcgg
    cggcggaggc ggcggcggcg gcggcggcgg
  • 121 cggcggctgg gcctcgagcg cccgcagccc
    acctctcggg ggcgggctcc cggcgctagc
  • 181 agggctgaag agaagatgga ggagctggtg
    gtggaagtgc ggggctccaa tggcgctttc
  • 241 tacaaggcat ttgtaaagga tgttcatgaa
    gattcaataa cagttgcatt tgaaaacaac
  • 301 tggcagcctg ataggcagat tccatttcat
    gatgtcagat tcccacctcc tgtaggttat
  • 361 aataaagata taaatgaaag tgatgaagtt
    gaggtgtatt ccagagcaaa tgaaaaagag
  • 421 ccttgctgtt ggtggttagc taaagtgagg
    atgataaagg gtgagtttta tgtgatagaa
  • 481 tatgcagcat gtgatgcaac ttacaatgaa
    attgtcacaa ttgaacgtct aagatctgtt
  • 541 aatcccaaca aacctgccac aaaagatact
    ttccataaga tcaagctgga tgtgccagaa
  • 601 gacttacggc aaatgtgtgc caaagaggcg
    gcacataagg attttaaaaa ggcagttggt
  • 661 gccttttctg taacttatga tccagaaaat
    tatcagcttg tcattttgtc catcaatgaa
  • 721 gtcacctcaa agcgagcaca tatgctgatt
    gacatgcact ttcggagtct gcgcactaag
  • 781 ttgtctctga taatgagaaa tgaagaagct
    agtaagcagc tggagagttc aaggcagctt
  • 841 gcctcgagat ttcatgaaca gtttatcgta
    agagaagatc tgatgggtct agctattggt
  • 901 actcatggtg ctaatattca gcaagctaga
    aaagtacctg gggtcactgc tattgatcta
  • 961 gatgaagata cctgcacatt tcatatttat
    ggagaggatc aggatgcagt gaaaaaagct

134
LOCUS RATIGCA 4461 bp DNA ROD
18-APR-1994 DEFINITION Rat Ig germline
epsilon H-chain gene C-region, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
135
LOCUS RATIGCA 4461 bp DNA ROD
18-APR-1994 DEFINITION Rat Ig germline
epsilon H-chain gene C-region, 3' end.
2881 cgccccaagt aggcttcatc atgctctttg
gtttagcaat agcccaaagc aagctatgca 2941
tccatctcag gcccagaggg atgaggagac cagaatcaag
acatacccac gcccatccca 3001 cgcccaacca
ccaaccacca gcacatcagg ttcacacacc tgagaccagt
ggctcccatc 3061 acacacacac acacacacac
acacacacac acacacacac acacacaagc ccgtacacat
3121 ccaccatatc cagagacaag tgtctgagtc tgagatacct
ctgaggatca ccaatggcag 3181 agtcggccag
cacctcagcc tccaggccaa tccttatact ttggcccact
gcaggccatg 3241 agagatggag gaggtggagg
cctgagctgt ggaaaaccag agacaggaag atggtctgta
3301 ctccaggcca atccttatac tttggcccac tgcaggccat
gagagatgga ggaggtggag 3361 gcctgagctg
tggaaaacca gagacaggaa gatggtctgt atggagagag
tagtaaacca 3421 gattataggg agactgaggc
aggagtagag ctcctacaag gccagtagtc taccttagag
3481 tcctataagt ctgggctggg agtccatgtg tcctgacttg
ctcctcagat atcacaacca 3541 agattcctgg
agccagagtg tgcatgcagg ccctagaaga aatgtggagc
ttagagccct 3601 tcctggaggg ccctgggcac
tctgaacaaa aggcaattct gtaggctgta tagaggcatc
3661 ctgtcagata cacacacaca tgcacacaca tacacacaca
gagacacaga cacacacaca 3721 tgcccacaca
catgcataca cacatgcaca cacatacaca cacagagaca
cagacacaca 3781 catgcccaca cacatgcata
cacacatgca tgcacacaca cacacacaca tacacataca
3841 cacacacaca cacaccccgc aggtagcctt catcatgctg
tctagcgata gccctgctga 3901 gggtgggaga
tactgggtca tggtgggcac cggagtagaa agagggaatg
agcagtcagg 3961 gtcaggggaa aaggacatct
gcctccaggg ctgaacagag acttggagca gtcccagagc
4021 aagtgggatg gggagctctg ccactccagt ttcaccagga
ctgcctgaga ccagtgaggg
136
Basic Assumption
  • Mutated, adjacent copies of a pattern will
    contain runs of exact matches.

T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
137
Basic Assumption
  • Mutated, adjacent copies of a pattern will
    contain runs of exact matches.

T C C A C G G A G A T A T T T A
T A T A C G T C G A G A C T T A
Runs of matches are identified using k-tuple
matches.
138
Using k-tuple matches
  • For purposes of program efficiency, fixed size
    k-tuples are used.

139
k-tuple matches
In pattern matching, a k-tuple is a window of
length k which contains text characters. Two
windows which contain the same text, form a
k-tuple match GAACGTTAGGTAACTGCAT CCTAGTTATACG
TTAAC
140
(No Transcript)
141
Modeling Tandem Repeats
  • Multiple k-tuple matches suggest the occurrence
    of a tandem repeat. The appropriate number of
    matches depends on how a tandem repeat is
    defined. For example, are the following two
    aligned sequence fragments two copies of the same
    underlying pattern?
  • TCGGCATCAGTCTATGG
  • TCAA-TG-GTGT-TGG

142
A Stochastic Model
  • TRFs stochastic model is based on the
    probability of character matching and the
    frequency of insertion and deletion (indels)
    between aligned adjacent copies.
  • CCACAACC-CGTCAGGCAAGT
  • CTGCACCATCGTCTGGGAAGT
  • HTTHHTHTTHHHHTHHTHHHH
  • Note that the alignment has been converted into a
    Bernoulli (coin-toss) sequence.

143
Model Parameters
  • PM the expected frequency of a match
  • PI the expected frequency of an indel
  • The parameters are applied to Bernoulli sequences
    to establish criteria for detecting the repeats.

144
Number of matches to indicate a repeat
  • Sum of heads is the minimum required number of
    matches
  • for a repeat with period n
  • match probability p
  • tuple size k
  • Unless k 1, not all matches will be detected.
  • For example
  • HHTHHHHTHTHHHTTHHHT

k H seen
1 13
2 12
3 10
4 4
145
Sum of heads
  • Suppose a random Bernoulli sequence has length
    100 and the expected number of heads is 75 (PM
    0.75). If we count the number of heads, then 95
    of the time we expect to count at least 68 heads.
  • If we count only heads that occur in runs of
    length 5 or more, then 95 of the time we expect
    to count at least 26 heads. This is the sum of
    heads criteria.

146
Other Criteria
  • Apparent Size Used to distinguish tandem from
    non-tandem repeats.
  • Waiting time Used to pick a suitable tuple
    size.
  • Random walk Used to accommodate insertions and
    deletions.

147
(No Transcript)
148
The Tandem Repeats Database
  • The Tandem Repeats Database (TRDB) is
  • A public database of information on tandem
    repeats.
  • A private workspace for extended research on
    tandem repeats.

149
C. elegans Distribution of Repeat Pattern Size
Chr 1
150
Human Distribution of Pattern Size Chr 1
151
C. elegans Distribution of Repeat Location Chr 1
152
Human Distribution of Repeat Location Chr 1
153
Human Distribution of Repeat Location Chr 1 for
Patternsize gt 5
154
Clusters in 41 bp Repeats
155
Two identical repeats
156
Four repeats from a tandem repeat cluster
157
Topic 6 Outline
  • Composition Alignment
  • Sequence composition and composition match
  • Composition alignment algorithm
  • Composition match scoring functions
  • Limiting the length of a composition match
  • Growth of local composition alignment scores
  • Biological examples

158
Sequence Composition
  • Composition is a vector quantity describing the
    frequency of occurrence of each alphabet letter
    in a particular string. Let S be a string over
    S. Then,
  • C(S)(fs1 , fs2 , fs3 , , fsS)
  • is the composition of S, where fsi is the
    fraction of the characters in S that are si.
  • Note that the order of letters is irrelevant as
    it has no effect on the composition.

159
Composition Example
  • S ACTGTACCTGGCGCTATT
  • C(S) ( 0.17, 0.28, 0.22, 0.33 )
  • A C G
    T

160
Composition Match
  • Two strings, S and T, have a composition match if
    their lengths are equal and C(S) C(T).
  • For example, S and T below have a composition
    match
  • S ACTGTACCTGGCGCTATT
  • T AAACCCCCGGGGTTTTTT

161
Composition and Sequence Features
  • Isochores Multi-megabase, specifically GC-rich
    or GC-poor. GC-rich isochores have greater gene
    density.
  • CpG Islands Several hundred nucleotides, rich
    in the dinucleotide CG which is underrepresented
    in eukaryotic genomes. Methylation of the
    cystine in these dinucleotides affects gene
    expression.
  • Protein binding regions Tens of nucleotides,
    dinucleotide composition contributes to DNA
    flexibility, allowing the helix to change shape
    during protein binding.

162
Composition Alignment Problem
  • Given Two sequences, S of length m, and T of
    length n, over an alphabet S, and a scoring
    function cm(s, t) for the score of a composition
    match between substrings s and t.
  • Find The best scoring alignment (global or
    local) of S with T such that the allowed scoring
    options include composition match between
    substrings of S and T as well as the standard
    options of single character match, single
    character mismatch, insertion and deletion.

163
Example of composition alignment
  • S AACGTCTTTGAGCTC
  • T AGCCTGACTGCCTA
  • Alignment
  • AACGTCTTTGAGCTC
  • lt-gt lt---gt
  • AGCCTGACT-GCCTA

164
Algorithm Analysis
  • Given two sequences, S and T, the best alignment
    of the prefix strings
  • S1, i s1 si
  • T1, j t1 tj
  • ends in one of four ways, mismatch, insertion,
    deletion, or composition match between suffixes
    of length l
  • 1 l min(i, j, limit)
  • i.e., between substrings Si l 1, i and Tj
    l 1, j

165
Time Complexity
  • Computing the optimal composition alignment is
    done with dynamic programming and is similar to
    standard alignment, except for the composition
    match scoring option. The overall time
    complexity is
  • O(nmZ)
  • where Z is the time required per (i, j) pair to
    find the best length l for the composition match.

166
Computing length of the shortest composition
match
  • Our goal here is to start with two strings, S and
    T, of equal length, and for each prefix pair S1,
    k, T1, k, find the length of the shortest
    suffixes that have a composition match. For
    example, let
  • S AACGTCTTTGAGCT
  • T AGCCTGACTGCCTA
  • Then for k 6, the shortest suffixes which have
    a composition match have length 3
  • S AACGTC
  • T AGCCTG

167
Composition difference
  • Composition difference is a vector quantity for
    two strings x and y
  • CD(x, y) (cs1 , , csS)
  • where csi is the difference between the number
    of times si occurs in x and in y.

168
Using composition difference
Key observation two identical composition
differences at prefix lengths k and g indicate a
composition match of length k g.
169
Sorting to find shortest composition matches
Sort on composition difference using stable sort.
Adjacent tuples with the same composition
difference identify shortest composition matches.
170
Time complexity for composition matches
  • O(nmS) to find nm shortest composition match
    lengths for two strings of length n and m.
  • In our work, S, is a small constant (4 for DNA,
    16 for dinucleotides). For larger alphabets, the
    method of Amir, Apostolico, Landau and Satta
    (2003) can be used.

171
Composition match scoring functions
  • Functions based on match length, k
  • Function 1 cm(k) ck
  • Function 2 cm(k) cv k
  • where c is a constant.
  • Functions based on substring composition
  • Function 4 cm(C, B, k) ck H(C,B)
  • where H is the relative entropy function, C is
    the composition of the matching substrings and B
    is a background composition.

172
Additive and subadditive scoring functions
  • The functions based on length are additive or
    subadditive
  • cm(i j) cm(i) cm(j)
  • Lemma For additive or subadditive composition
    match scoring functions, any best scoring
    alignment is equivalent in score to an alignment
    which contains only shortest composition matches.
  • Theorem Composition alignment with additive or
    subadditive match scoring functions and finite
    alphabet has time complexity O(nm).

173
The limit parameter
  • Intuitively, allowing scrambled letters to match
    should increase the amount of matching between
    sequences. If too much matching occurs,
    alignments will not be meaningful.
  • The limit parameter is an upper bound on the
    length l of the longest single composition match.

174
Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length limit 1 2 5 10
binary 50 62.5 75.6 82.4
DNA 25 30 37.5 44.2
Dinucleotide 6.2 6.5 7.1 7.5
175
Limit and fraction of matching charactersrandom,
ungapped alignments
Sequence length 100, all letters equal
probability p 0.25
limit 1 2 5 10
DNA 25 33.7 44.4 51
Sequence length 400, all letters equal
probability p 0.25
limit 1 2 10 50
dinucleotide 6.25 6.81 7.76 7.78
176
Growth of local alignment scoreFunction 1
177
Global score as a predictor of local parameter
suitability Function 1
178
Growth of local alignment score Function 2
179
Global score as a predictor of local parameter
suitability Function 2
180
Limit values for DNA
  • Function 1 cm(k) ck Limit 3.
  • Function 2 cm(k) cvk Limit 10.
  • Function 4 cm(C, B, k) ck H(C, B)
  • Limit 50.

181
Biological examples
  • Composition alignment was tested on a set of 1796
    promoter sequences from the Eukaryotic Promoter
    Database. Each sequence is 600 nucleotides long,
    500 bases upstream and 100 downstream of the
    transcription initiation site.
  • Two local alignment scores were produced using
    function 1, W using composition alignment and S
    using standard alignment. The examples shown have
    statistically significant W with W 3 S to
    exclude good standard alignments.

182
Example 1
Composition alignment and standard alignment of
two promoters. Standard alignment is not
statistically significant. Sequences are
characteristic of CpG islands.
Composition Alignment GCCCGCCCGCCGCGCTCCCGCCCGCC
GCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC lt-gt
ltgtltgtltgt ltgtltgt lt-gt ltgtltgt ltgtltgt
ltgt lt-gt CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGG
CCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC Standard
Alignment CGCCGCCGCCG CGCCGCCGCCG
183
Example 2
Composition alignment of two promoter sequences.
Composition changes at vertical line.
A C G T Left
(0.01, 0.61, 0.30, 0.08) Right (0.19, 0.16,
0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GC
TCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG lt-gtlt
gtltgt ltgt lt-gtltgt ltgt
lt-gt ltgtltgtlt-gt ltgtltgtltgtlt-gtlt-gt CCGCGC
GCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGG
GGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
184
Final Slide
  • Recognize that similarity among biological
    sequences most likely exists in ways which we can
    not today perceive and for which we have no
    detection tools. You are encouraged, therefore,
    to think broadly, beyond the embellishment or
    refinement of current methods, to new definitions
    of similarity and new problems of comparison.
Write a Comment
User Comments (0)
About PowerShow.com