Protein Structure Similarity

Computation of Best Matches

- Two simultaneous subproblems
  - Find maximal correspondence set C
  - Find alignment transform T
- Chicken-and-egg issue
  - Each subproblem is relatively simple
    - If we knew C, we could compute T
    - If we knew T, we could get C by proximity
  - But the combination is hard!

Find Alignment Transform

- Two sets of points $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_n\}$
- Correspondence pairs $(a_i, b_i)$
- Find $T = \arg\min_T \mathrm{RMSD}(A, T(B))$
- O(n) closed-form solution [Arun, Huang, and Blostein, 87] [Horn, 87] [Horn, Hilden, and Negahdaripour, 88]

O(n) SVD-Based Algorithm

- T combines translation t and rotation R, such that $T(b_i) = t + R(b_i)$
- $\bar{b} = (\sum_{i=1}^{n} b_i)/n$ [mean of the $b_i$'s]
- Place the origin of the coordinate system at $\bar{b}$
- $\min_T \mathrm{RMSD}(A, T(B))$ simplifies to (up to some constants) $\min_{t,R} \sum_{i=1}^{n} \|a_i - t - R(b_i)\|^2$
- t and R can be computed separately
- $t = \bar{a}$ [mean of the $a_i$'s]

[Arun, Huang, and Blostein, 87]

O(n) SVD-Based Algorithm

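- $A_{3 \times n} = [a_1 - \bar{a}, \dots, a_n - \bar{a}]$, $B_{3 \times n} = [b_1 - \bar{b}, \dots, b_n - \bar{b}]$
- Compute the SVD of the 3×3 correlation matrix $BA^T$: $BA^T = UDV^T$, where D is a diagonal matrix with decreasing non-negative entries (singular values) along the diagonal
- If $\det(U)\det(V) = 1$ then $S = I$, else $S = \mathrm{diag}(1, 1, -1)$
- $R = VSU^T$ (with $BA^T = UDV^T$, the rotation taking B onto A is $VSU^T$; the form $USV^T$ applies when $AB^T$ is decomposed instead)

[Arun, Huang, and Blostein, 87]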


- [Arun, Huang, and Blostein, 87] → rotation matrix
- [Horn, 87] → quaternion
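
As a rough illustration of the closed-form fit, the sketch below follows the quaternion formulation attributed to Horn above rather than the SVD route, because it only needs the dominant eigenvector of a symmetric 4×4 matrix. Shifted power iteration is swapped in for a proper eigen-solver so that the example stays within the standard library. The function names (`superpose`, `transform`, `rmsd`), the iteration count, and the starting vector are assumptions of this sketch, not part of the cited papers.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def superpose(A, B, iterations=2000):
    """Return (R, t) so that R*b_i + t is a least-squares fit to a_i."""
    ca, cb = centroid(A), centroid(B)
    # S[j][k] = sum_i (b_i - cb)[j] * (a_i - ca)[k]
    S = [[sum((b[j] - cb[j]) * (a[k] - ca[k]) for a, b in zip(A, B))
          for k in range(3)] for j in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion.
    N = [[Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
         [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
         [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
         [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz]]
    # Shifting by the Frobenius norm makes N + shift*I positive semidefinite,
    # so power iteration converges to the eigenvector of the largest eigenvalue.
    shift = math.sqrt(sum(v * v for row in N for v in row))
    q = [1.0, 0.0, 0.0, 0.0]  # identity rotation if the data are degenerate
    if shift > 0.0:
        q = [1.0, 0.5, 0.25, 0.125]  # arbitrary (assumed) starting vector
        for _ in range(iterations):
            q = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r]
                 for r in range(4)]
            norm = math.sqrt(sum(v * v for v in q))
            q = [v / norm for v in q]
    w, x, y, z = q
    R = [[1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
         [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
         [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]]
    # Translation maps the rotated centroid of B onto the centroid of A.
    t = tuple(ca[k] - sum(R[k][j] * cb[j] for j in range(3)) for k in range(3))
    return R, t

def transform(R, t, p):
    return tuple(sum(R[k][j] * p[j] for j in range(3)) + t[k] for k in range(3))

def rmsd(A, B):
    return math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(A, B)) / len(A))
```

Calling `superpose(A, B)` and then applying `transform` to every point of B gives T(B); `rmsd(A, T(B))` should be close to zero when A is an exact rigid copy of B. Power iteration can be slow when the two largest eigenvalues are nearly equal, which is one reason a library eigen-solver or SVD would be preferred in practice.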

→ Trial-and-Error Approach to Protein Structure Comparison

1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
2. Compute the alignment transform T for CS and apply T to the second protein B
3. Update CS to include all pairs of features that are close to each other
4. If CS has changed, then return to Step 2, else return (CS,T)

→ Trial-and-Error Approach to Protein Structure Comparison

- result ← nil
- Iterate N times
  1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and apply T to the second protein B
  3. Update CS to include all pairs of features that are close to each other
  4. If CS has changed, then return to Step 2, else result ← result ∪ {(CS,T)}
- Return result
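
The refinement loop can be sketched on a deliberately simplified 1-D toy problem, in which the "transform" is a single translation and "close" means within a tolerance `eps`. The helper names (`fit_translation`, `close_pairs`, `refine`) and the tolerance are hypothetical; a real implementation would use a 3-D rigid-body fit in place of `fit_translation`.

```python
def fit_translation(pairs, A, B):
    """Least-squares translation t such that B[j] + t matches A[i]."""
    return sum(A[i] - B[j] for i, j in pairs) / len(pairs)

def close_pairs(A, B, t, eps):
    """Pair each translated point of B with its nearest point of A if within eps."""
    pairs = set()
    for j, b in enumerate(B):
        i = min(range(len(A)), key=lambda k: abs(A[k] - (b + t)))
        if abs(A[i] - (b + t)) < eps:
            pairs.add((i, j))
    return pairs

def refine(seed, A, B, eps=0.5, max_rounds=50):
    """Alternate between fitting T and updating CS until CS stops changing."""
    cs = set(seed)
    t = fit_translation(cs, A, B)
    for _ in range(max_rounds):
        new_cs = close_pairs(A, B, t, eps)
        if not new_cs or new_cs == cs:
            break
        cs = new_cs
        t = fit_translation(cs, A, B)
    return cs, t
```

Running `refine` from several different seeds and keeping every returned pair (CS, T) corresponds to the outer "iterate N times" loop.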

- How to get seed correspondences?

Seed Generation from Fragment

- From distance matrices
  - E.g., DALI [Holm and Sander, 1996]

Using Distance Matrices (DALI)

- Distances are invariant to rigid-body transformations
- DALI [Holm and Sander, 1996] looks for similar hexapeptides by searching for similar 7x7 Cα-Cα distance matrices
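
A small sketch of why distance matrices make convenient seeds: the matrix of pairwise distances within a fragment does not change when the fragment is rotated and translated. The function names and the example coordinates are illustrative assumptions and are not taken from DALI.

```python
import math

def distance_matrix(points):
    """Matrix of pairwise Euclidean distances within one fragment."""
    return [[math.dist(p, q) for q in points] for p in points]

def max_abs_difference(D1, D2):
    """Largest entry-wise difference between two equally sized matrices."""
    return max(abs(x - y) for r1, r2 in zip(D1, D2) for x, y in zip(r1, r2))

def rigid_move(points, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]
```

Two fragments can therefore be compared through `max_abs_difference` of their distance matrices without first superposing them.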

Seed Generation from Fragment

- From distance matrices
  - E.g., DALI [Holm and Sander, 1996]
- From secondary structure elements (SSEs)
  - E.g., LOCK [Singh and Brutlag, 1997]
- From voting scheme (using geometric hashing)
  - E.g., 3dSEARCH [Singh and Brutlag, 2000]

LOCK

- A.P. Singh and D.L. Brutlag. Hierarchical Protein Structure Superposition Using Both Secondary and Atomic Representations. Proc. ISMB, pp. 284-293, 1997.
- LOCK2: J. Shapiro and D.L. Brutlag. FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science, 13:278-294, 2004.
- http://motif.stanford.edu/lock2/

LOCK

- Two levels of features: SSEs and Cα atoms
- Stage 1 (SSE alignment): Initial alignment is computed using SSEs represented as vectors
- Stage 2 (atom alignment): Alignment is refined using Cα atoms represented as points

Rationale for LOCK

- Using types of features is an effective way to reduce combinatorial explosion and computation
- SSEs, which are responsible for most of the stability and functionality of the proteins, are more meaningful and better conserved than types of atoms and amino acids
- If 2 structures are similar, some of their SSEs should form similar substructures
- Drawback: It narrows down the set of possible applications, e.g., can't find small motifs at atomic level

Vector-Based Representation

[Figure: β-strands, loops, and α-helices, with one vector per SSE (helix, strand, loop)]

Vector-Based Representation

- DSSP [Kabsch and Sander, 1983] classifies residues into helices/strands
- For an α-helix starting at residue i: $X_{origin} = (0.74 X_i + X_{i+1} + X_{i+2} + 0.74 X_{i+3})/3.48$, where $X_i$ is the position of the Cα atom of residue i
- (angle between two consecutive residues is 100° → factor 0.74)
- Similar computation for $X_{end}$ and for β-strands
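
The weighted average for the helix origin can be written directly from the formula above. This is a minimal sketch; the function name `helix_endpoint` is an assumption, and the same weights would be applied to the last four residues to obtain the end point.

```python
def helix_endpoint(x0, x1, x2, x3):
    """Weighted mean of four consecutive C-alpha positions (weights 0.74, 1, 1, 0.74)."""
    weights = (0.74, 1.0, 1.0, 0.74)
    total = sum(weights)  # 3.48, the denominator in the formula
    points = (x0, x1, x2, x3)
    return tuple(sum(w * p[k] for w, p in zip(weights, points)) / total
                 for k in range(3))
```

Because the weights are symmetric, four evenly spaced collinear points yield their midpoint.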

Scoring Similarity

- Position-independent differences
  - angle(i,k) - angle(p,r)
  - angle(i,j) - angle(p,q)
  - angle(j,k) - angle(q,r)
  - distance(i,k) - distance(p,r)
  - length(k) - length(r)
- Position-dependent differences
  - angle(k,r)
  - distance(k,r)
- Scores are additive

Score $S = \sum_i S(d_i)$

[Figure: plot of $S(d_i)$, marking the maximal score and the value of $d_i$ for which the score is 0]

Stage 1: SSE Alignment

- For every pair of SSE vectors of protein A, find all pairs of vectors in B that align well using orientation-independent scores → seed correspondence sets
- For each correspondence set
  - Find alignment transform and apply it to B
  - Find correspondence set with maximal score
- (record transform T and correspondence set CS that yields maximal score)

Stage 1: SSE Alignment

- A = (i, j, k, l, m)
- B = (p, q, r, s, t)
- Seed correspondence: (i,p),(j,q)

- Simultaneous gaps in both structures are not allowed (not in SCOP2)
- Terminate a path when score of new correspondence is negative
- Re-compute new transform with each new correspondence (?)

Stage 2: Atom (Core) Alignment

1. Construct correspondence pairs of atoms
  - Atom i of A corresponds to atom j of T(B) iff i is the closest atom in A to j and j is the closest atom in T(B) to i
  - The distance between i and T(j) is < ε (3Å)
2. Prune correspondence set to largest subset of correspondence pairs that follow backbone alignment constraint
3. Re-compute T to be the transform that minimizes the RMSD of the atoms in the correspondence set
4. Iterate 1-2-3 until RMSD converges
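
Steps 1 and 2 can be sketched as follows. The sketch assumes that the "backbone alignment constraint" means the retained pairs must appear in the same sequence order in both chains, which is an interpretation and not something stated on the slide. The function names are hypothetical, and `TB` stands for the already transformed atoms T(B).

```python
import math

def mutual_nearest_pairs(A, TB, eps=3.0):
    """Pairs (i, j) where A[i] and TB[j] are each other's nearest atom, closer than eps."""
    pairs = []
    for i, a in enumerate(A):
        j = min(range(len(TB)), key=lambda m: math.dist(a, TB[m]))
        i_back = min(range(len(A)), key=lambda m: math.dist(A[m], TB[j]))
        if i_back == i and math.dist(a, TB[j]) < eps:
            pairs.append((i, j))
    return pairs

def prune_to_sequence_order(pairs):
    """Largest subset of pairs whose indices increase in both chains (O(n^2) DP)."""
    pairs = sorted(pairs)
    if not pairs:
        return []
    best = [1] * len(pairs)    # best[k]: longest chain ending at pairs[k]
    prev = [-1] * len(pairs)   # back-pointers used to recover the chain
    for k in range(len(pairs)):
        for m in range(k):
            ordered = pairs[m][0] < pairs[k][0] and pairs[m][1] < pairs[k][1]
            if ordered and best[m] + 1 > best[k]:
                best[k], prev[k] = best[m] + 1, m
    k = max(range(len(pairs)), key=lambda m: best[m])
    chain = []
    while k != -1:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]
```

The quadratic nearest-neighbour search is kept for brevity; a spatial grid or k-d tree would be the usual choice for full-size proteins.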

Experimental Results

- 685 protein structures from PDB such that each pair has less than 25% sequence identity
- 3 families of folds (based on SCOP classification)
  - myoglobins (11 structures), 20% amino acid identity
  - TIM barrels (50 structures)
  - immunoglobulins (38 structures)
- Goal: Given one query protein in each family, find the other members of the family (3 × 685 = 2055 alignments)
- Method: For each query, sort the 685 structures by score (computed by LOCK). Select the top k proteins. Count members of family (true positives) and non-members (false positives)
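
The counting step of this evaluation can be sketched in a few lines. The function name `count_hits`, the dictionary layout (structure name to score), and the toy data in the note below are assumptions made for illustration.

```python
def count_hits(scores, family, k):
    """Return (true positives, false positives) among the k best-scoring structures."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    true_pos = sum(1 for name in ranked if name in family)
    return true_pos, len(ranked) - true_pos
```

For instance, with scores `{"a": 9.0, "b": 7.5, "c": 7.0, "d": 3.0}` and family `{"a", "c"}`, the top 2 contain one family member and one non-member. Sweeping k from 1 to the database size yields the points of a ROC curve.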

Myoglobins (11)

True positives   False positives
11               0

TIM-barrels (50)

True positives   False positives
40               0
45               1
50               5

Immunoglobulins (38)

True positives   False positives
20               0
25               1
30               2
35               11
38               383

Alignment of 11 Myoglobins

Alignment of 50 TIM barrels

α-helices in red, β-strands in yellow

Alignments of 31 Immunoglobulins

Only β-strands are shown

ROC Curves

Running Time

- 1 ms per seed correspondence
- 1 h to search 10,000 protein structures
- 100s of days to compare all pairs of proteins in PDB
- → Geometric hashing to speed up stage 1

Voting Scheme with Hash Table

- Many-to-many comparison requires a better organization of computation to avoid repeating the same computation again and again
- Pre-computation: Index proteins in hash table
- Query phase: Voting scheme using hash table
- Several variants on this theme
  - 3d-Lookup [Holm and Sander, 1995]
  - 3dSEARCH [Singh 2002]

Indexing Target Structures in Hash Table (3dSEARCH [Singh 2002])

- Hash table: 3-D regular grid of cubic bins (2Å)
- For each target structure
  - For each pair of vectors (i,j)
    - Compute a coordinate system
    - Place an entry for each other vector k into the bin containing the coordinates of the midpoint of the vector (or average of coordinates of origin, middle, and end points). Store the ID of the coordinate system and k's orientation and type (α or β) in the entry.

[Figure: coordinate systems (u, v) built from different pairs of vectors; the grid is the same for all coordinate systems]

- Grid is sparsely occupied → hash table
- A structure with n SSEs contributes n(n-1)(n-2) entries. Each vector is represented (n-1)(n-2) times
- 10,000 structures with 10 SSEs each yield 7M entries
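
The entry count is simple arithmetic: each ordered pair (i,j) defines a coordinate system, and each of the remaining n-2 vectors adds one entry. A minimal sketch, with hypothetical function names:

```python
def entries_per_structure(n_sses):
    """n(n-1) ordered pairs of vectors, each indexing the other n-2 vectors."""
    return n_sses * (n_sses - 1) * (n_sses - 2)

def total_entries(n_structures, n_sses):
    """Table size when every structure has the same number of SSEs."""
    return n_structures * entries_per_structure(n_sses)
```

With 10 SSEs a structure contributes 720 entries, so 10,000 such structures give 7.2 million, consistent with the 7M quoted above.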

Voting Using Hash Table

- Given a query structure
- For each pair of vectors (i,j)
  - Compute a coordinate system
  - For each other vector k
    - Retrieve the bin accessed by this vector and the neighboring bins
    - For every entry (vector) in those bins that has the same orientation and type as k, add a vote for the coordinate system stored in the entry
- Sort target structures based on max number of votes received by any of its coordinate systems
- → Small number of target structures. Use LOCK for better alignment
- Hours of pure LOCK are reduced to seconds
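
The indexing and voting phases can be sketched together. This is a simplified illustration under several assumptions that are not taken from 3dSEARCH: each SSE is a `(start, end, kind)` tuple, the coordinate system for a pair (i,j) has its origin at the midpoint of i, its x-axis along i, and its y-axis pointing toward the midpoint of j, only the SSE type is compared (orientation is ignored), and neighboring bins are not searched.

```python
import math
from collections import defaultdict
from itertools import permutations

BIN = 2.0  # bin edge length in angstroms, as on the slide

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def midpoint(sse):
    start, end, _kind = sse
    return tuple((s + e) / 2 for s, e in zip(start, end))

def frame(si, sj):
    """Assumed frame: origin at the midpoint of si, x along si, y toward sj."""
    origin = midpoint(si)
    x = sub(si[1], si[0])
    x_len = math.sqrt(dot(x, x))
    if x_len == 0.0:
        return None
    x = tuple(c / x_len for c in x)
    d = sub(midpoint(sj), origin)
    y = sub(d, tuple(dot(d, x) * c for c in x))
    y_len = math.sqrt(dot(y, y))
    if y_len < 1e-9:  # sj lies on the axis of si: frame is undefined
        return None
    y = tuple(c / y_len for c in y)
    return origin, (x, y, cross(x, y))

def bin_of(frm, sse):
    """Grid bin containing the midpoint of sse, expressed in the given frame."""
    origin, axes = frm
    p = sub(midpoint(sse), origin)
    return tuple(math.floor(dot(p, axis) / BIN) for axis in axes)

def frames_with_bins(sses):
    """Yield ((i, j), [(bin, kind), ...]) for every usable ordered pair (i, j)."""
    for i, j in permutations(range(len(sses)), 2):
        frm = frame(sses[i], sses[j])
        if frm is not None:
            yield (i, j), [(bin_of(frm, s), s[2])
                           for k, s in enumerate(sses) if k not in (i, j)]

def build_index(targets):
    """Pre-computation: map each bin to (structure name, frame pair, SSE kind)."""
    table = defaultdict(list)
    for name, sses in targets.items():
        for pair, entries in frames_with_bins(sses):
            for b, kind in entries:
                table[b].append((name, pair, kind))
    return table

def vote(table, query):
    """Query phase: rank targets by the best vote count of any of their frames."""
    best = defaultdict(int)
    for _pair, entries in frames_with_bins(query):
        votes = defaultdict(int)
        for b, kind in entries:
            for name, pair, entry_kind in table.get(b, ()):
                if entry_kind == kind:
                    votes[(name, pair)] += 1
        for (name, _), n in votes.items():
            best[name] = max(best[name], n)
    return sorted(best.items(), key=lambda item: -item[1])
```

Votes are tallied separately for each query coordinate system, so a high count means that one query frame and one target frame agree on many vectors at once. Skipping the neighboring bins makes this sketch sensitive to points that fall near a bin boundary, which is the situation the neighbor lookup on the slide is meant to handle.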

Advantages of Voting System

- Very efficient in practice for many-to-many comparisons
- Can establish correspondence between partial, disconnected substructures
- Parallel implementation is straightforward
- Independent of the order in which vectors are considered
- Drawback (?): May establish correspondences that do not satisfy the backbone sequence constraint

Problem 4: Find Pharmacophore in Ligands

- Given
  - Collection of N (5 to 10) small flexible ligands with similar activity (binding at same sites)

[Figure: Benzamidine binding to β-trypsin (3ptb)]

[Figure: Inhibitor binding to HIV protease]

Problem 4: Find Pharmacophore in Ligands

- Given
  - Collection of N (5 to 10) small flexible ligands with similar activity (binding at same sites)
  - A set of low-energy conformations (dozens to few hundreds) for each ligand
- Find a substructure (pharmacophore) that has a match in at least one conformation of each ligand

Pharmacophore and Rational Drug Design

- Pharmacophore identification is a form of reverse engineering to get a model of a binding site
- A pharmacophore can be used to modify ligands into more potent drugs and/or to screen large databases of ligands for leads

Three Simultaneous Problems

- Conformations?
- Correspondence?
- Transform?
- But ligands are small molecules

Software

- DISCO [Martin et al., 1993]
- DISCOtech and GASP [Tripos, Inc.]
- CATALYST and HIPHOP [Accelrys] [Green et al., 1994] [Barnum et al., 1996]
- RAPID: P.W. Finn, L.E. Kavraki, J.C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao. RAPID: Randomized Pharmacophore Identification for Drug Design. Computational Geometry: Theory and Applications, 10, pp. 263-272, 1998

Pairwise Comparison

- Multi-Probe($M_1, \dots, M_N$)
  1. Extract invariants from $M_1$ and $M_2$ by calling Pair-Probe($P_1$,$P_2$) on every pair of conformations of the two ligands
  2. Test each candidate invariant S obtained at Step 1 against every ligand $M_i$, $i = 3, \dots, N$, by calling Pair-Probe(S,P) on S and each conformation P of $M_i$

Pair-Probe

- n: smallest number of atoms/features in a ligand; α: a given constant ($0 < \alpha \le 1$); $P_1$ and $P_2$: conformations of two distinct ligands (or candidate invariant)
- Pair-Probe($P_1$,$P_2$)
  - Perform s times
    - Pick a triplet of atoms at random from $P_1$
    - Determine three atoms in $P_2$ congruent to this triplet; compute the alignment transform T
    - Iterate: Apply T to $P_2$; determine the atoms in $P_1$ matching those in $P_2$; update T
    - If the number of matching atoms exceeds αn, then return this atom set as a candidate invariant S

Magnitude of s

- Pr[picking 3 atoms in invariant] ≈ $\alpha^3$
- Pr[failing to find invariant] ≈ $(1 - \alpha^3)^s$
- We want $(1 - \alpha^3)^s \le \gamma$ (γ is the acceptable probability of failure)
- $s \ge \ln(\gamma)/\ln(1 - \alpha^3)$
- Since $x < -\ln(1 - x)$ for $0 < x < 1$, it suffices to take $s \ge \ln(1/\gamma)/\alpha^3$
- For $\gamma = 10^{-2}$ and $\alpha = 0.3$, we get s ≈ 180
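
Both bounds are easy to evaluate numerically. The function names below are assumptions. For γ = 0.01 and α = 0.3, the exact bound comes out at 169 trials and the simplified bound at 171, slightly below the rounded figure of about 180 quoted above.

```python
import math

def trials_exact(alpha, gamma):
    """Smallest integer s with (1 - alpha^3)^s <= gamma."""
    return math.ceil(math.log(gamma) / math.log(1.0 - alpha ** 3))

def trials_simple(alpha, gamma):
    """Sufficient number of trials from the bound x < -ln(1 - x)."""
    return math.ceil(math.log(1.0 / gamma) / alpha ** 3)
```

The simplified bound is always at least as large as the exact one, so using it never raises the failure probability above γ.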

Some Results

- 63 to 69 atoms with 10 to 15 torsional degrees of freedom
- Feature: every non-H atom → 30 features of 6 types (atom types)
- Invariant in active conformations: 7-atom pharmacophore, 7-atom scaffolding

conf   t(s)    4    5    6    7    8    9   10   11   12   13   14
11     800    44   20   10    5    2    1    0    0    1    0    0

Fuel for Thoughts

Idea: Many-to-many correspondence may be more robust

Example: Hausdorff distance

Hausdorff Distance

- Two sets of points $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_m\}$ in $\mathbb{R}^k$
- $d_H(A,B) = \max_{a \in A} \min_{b \in B} \|a - b\|$
- $D_H(A,B) = \max\{d_H(A,B), d_H(B,A)\}$
- Variation for shape similarity: $\min_T D_H(A, T(B))$
- But efficient algorithms only exist for planar sets of points
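
For fixed point sets (no minimization over T), the two definitions translate directly into a brute-force O(nm) computation. This is a minimal sketch with hypothetical function names.

```python
import math

def directed_hausdorff(A, B):
    """d_H(A,B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """D_H(A,B): the symmetric Hausdorff distance."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

The directed distance is not symmetric: for A = {(0,0), (1,0)} and B = {(0,0), (3,0)}, d_H(A,B) = 1 while d_H(B,A) = 2, so D_H(A,B) = 2.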

Other Idea: Minimize cost of transforming A into B

- Old idea
  - Graphics: Morphing distance
  - Computer vision: Earth Mover's distance [Rubner, Tomasi, and Guibas, 1998]
- Protein similarity
  - Isotopic distance [Erdmann, 2004]

Structure Alignment Isotopies

- Two curves are isotopic if one can be deformed into the other without self-collision
- Example: Polygonal curve with n vertices
- One may think of structure alignment as an isotopy deforming one structure into the other
- Two structures are similar if the isotopy is small

M.A. Erdmann. Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings. CMU Tech. Rep. CMU-CS-04-138.

Small Isotopy

- Model a structure as a set of polygonal lines (e.g., vertices are Cα atoms)
- Two structures A and B are (T,δ)-isotopic if there exists an isotopy deforming A into T(B) in such a way that no vertex of A moves further away than some δ from its initial or final location

[Erdmann 2004]

Similarity Measure

- $\delta_T(A,B) = \inf \{\delta : A \text{ is } (T,\delta)\text{-isotopic to } B\}$
- $\delta(A,B) = \inf_T \delta_T(A,B)$
- δ is computable [Erdmann, 2004]
- But as complex as path planning, hence exponential in the number of degrees of freedom
- Possibility of approximating δ using probabilistic roadmaps?

Topology of Line Weavings

[Figure: 1xis and 1nar, α-helix axes]

→ 2 topologically equivalent line weavings

3 equivalence classes for 4 lines

[Erdmann 2004]

Another (incorrect) alignment of 1xis and 1nar

→ 2 non-equivalent line weavings

Why is topology interesting?

- Two conformations may be geometrically close (small RMSD) but may require a long continuous deformation to map one into the other (without steric clashes)

Conclusion

- Automatic computation of structure similarity is essential due to the rapid growth of the PDB and other molecule (e.g., ligand) libraries
- As the growth of new protein structures outpaces that of new folds, detecting structural similarity will have to be much more fine-grained than it is today
- Biological discoveries will likely lie in local, possibly rare structure similarities, rather than in global fold-level classification
- Need for better understanding of applications and radically new approaches
- Still a lot of work ...