Title: An Introduction to Structured Output Learning Using Support Vector Machines
1 An Introduction to Structured Output Learning Using Support Vector Machines
- Yisong Yue
- Cornell University
- Some material used courtesy of Thorsten Joachims (Cornell University)
2 Supervised Learning
- Find a function from input space X to output space Y such that the prediction error is low.
3 Examples of Complex Output Spaces
- Natural Language Parsing
- Given a sequence of words x, predict the parse tree y.
- Dependencies from structural constraints, since y has to be a tree.
4 Examples of Complex Output Spaces
- Part-of-Speech Tagging
- Given a sequence of words x, predict the sequence of tags y.
- Dependencies from tag-tag transitions in a Markov model.
- Similarly for other sequence labeling problems, e.g., RNA intron/exon tagging.
5 Examples of Complex Output Spaces
- Multi-class Labeling
- Protein Sequence Alignment
- Noun Phrase Co-reference Clustering
- Learning Parameters of Graphical Models
- Markov Random Fields
- Multivariate Performance Measures
- F1 Score
- ROC Area
- Average Precision
- NDCG
6 Notation
- Bold x, y are structured input/output examples.
- Usually consists of a collection of elements
- $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$
- Each input element $x_i$ belongs to some high dimensional feature space, $\mathbb{R}^d$
- Each output element $y_i$ is usually a multiclass label or real valued number
- Joint feature functions $\Psi$, $\Phi$ map input/output examples to points in $\mathbb{R}^D$
7 1st Order Sequence Labeling
- Given
- scoring function $S(x, y_1, y_2)$
- input example $(x_1, \ldots, x_n)$
- Find the sequence $(y_1, \ldots, y_n)$ that maximizes the total score summed over adjacent label pairs
- Solved with dynamic programming (Viterbi)
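As a rough illustration of the dynamic program, the sketch below assumes the total score is a sum of S over adjacent label pairs, with `None` standing in for the missing label before the first position. The names `viterbi` and `brute_force`, the extra position argument to the scoring function, and the toy lexicon are all assumptions made for this example and do not come from the slides.

```python
import itertools

def viterbi(x, labels, S):
    """Return (best_score, best_sequence) maximizing sum_i S(x, i, y_prev, y_cur).

    S(x, i, y_prev, y_cur) scores label y_cur at position i given the
    previous label y_prev (None at i = 0). This signature is an assumption.
    """
    n = len(x)
    if n == 0:
        return 0.0, []
    # best[i][y] = best score of any labeling of x[0..i] that ends in label y
    best = [{y: S(x, 0, None, y) for y in labels}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            # choose the best previous label for the current label y
            prev = max(labels, key=lambda p: best[i - 1][p] + S(x, i, p, y))
            best[i][y] = best[i - 1][prev] + S(x, i, prev, y)
            back[i][y] = prev
    # trace back from the best final label
    last = max(labels, key=lambda y: best[n - 1][y])
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    seq.reverse()
    return best[n - 1][last], seq

def brute_force(x, labels, S):
    """Exhaustive search over all |labels|^n sequences, for checking only."""
    def total(ys):
        return sum(S(x, i, ys[i - 1] if i else None, ys[i]) for i in range(len(x)))
    best = max(itertools.product(labels, repeat=len(x)), key=total)
    return total(best), list(best)

# Toy scoring function (hypothetical): reward tags that match a tiny lexicon
# and reward the transition DET -> NOUN.
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}

def toy_score(x, i, y_prev, y_cur):
    emit = 1.0 if LEXICON.get(x[i]) == y_cur else 0.0
    trans = 0.5 if (y_prev, y_cur) == ("DET", "NOUN") else 0.0
    return emit + trans

words = ["the", "dog", "barks"]
tags = ["DET", "NOUN", "VERB"]
score, seq = viterbi(words, tags, toy_score)
```

The recursion visits each position once and compares every pair of labels, so the cost grows linearly in sequence length and quadratically in the number of labels, whereas exhaustive search grows exponentially in sequence length.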
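8 Some Formulation Restrictions
- Assume S is parameterized linearly by some weight vector $w \in \mathbb{R}^D$.
- This means that each score is a dot product between w and a feature vector of the input and the two labels.
- Hypothesis function: predict the labeling with the highest total score.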
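9 The Linear Discriminant
- From last slide: each local score is linear in w.
- Putting it together: the total score of a labeling is a sum of local scores, so it is also linear in w.
- Our hypothesis function: the argmax over y of this total score.
Linear Discriminant Function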
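One plausible way to write out the steps named above, assuming a local feature map $\Phi$ over adjacent label pairs, a joint feature map $\Psi$, and a start symbol $y_0$ (the symbol names and index conventions are assumptions):

```latex
\begin{align*}
S(x, y_{j-1}, y_j) &= w^\top \Phi(x, y_{j-1}, y_j) \\
F(x, y; w) &= \sum_{j=1}^{n} w^\top \Phi(x, y_{j-1}, y_j) = w^\top \Psi(x, y),
\qquad \Psi(x, y) = \sum_{j=1}^{n} \Phi(x, y_{j-1}, y_j) \\
h(x; w) &= \operatorname*{argmax}_{y} \; w^\top \Psi(x, y)
\end{align*}
```

Here $F$ plays the role of the linear discriminant function and $h$ the hypothesis function; both depend on the training data only through $w$.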
10 Structured Learning Problem
- Efficient inference/prediction: the hypothesis function solves for y when given x and w
- Viterbi in sequence labeling
- CKY parser for parse trees
- Belief propagation for Markov random fields
- Sorting for ranking
- Efficient learning/training: need to efficiently learn parameters w from training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
- Solution: use the Structural SVM framework
- Can also use Perceptrons, CRFs, MEMMs, M3Ns, etc.
11 How to Train?
- Given a set of structured training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
- Different training methods can be used.
- Perceptrons perform an update whenever the current model mispredicts.
- CRFs plug the discriminant into a conditional log-likelihood function to optimize.
- Structural SVMs solve a quadratic program that minimizes a tradeoff between model complexity and a convex upper bound of the performance loss.
12 Support Vector Machines
- Input examples denoted by x (high dimensional point)
- Output targets denoted by y (either +1 or -1)
- SVMs learn a hyperplane w; predictions are $\mathrm{sign}(w^\top x)$
- Training involves finding the w that minimizes the norm of w plus a weighted sum of slack variables
- subject to a margin constraint on each training example
- The sum of slacks upper bounds the accuracy loss
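For reference, a common soft-margin form of this training problem is sketched below. The constant $C$, the $1/N$ scaling, and the absence of a bias term are assumptions; the exact formula used in the talk may differ.

```latex
\begin{align*}
\min_{w,\ \xi \ge 0} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & y^{(i)} \, \bigl(w^\top x^{(i)}\bigr) \ge 1 - \xi_i, \qquad i = 1, \ldots, N
\end{align*}
```

If example $i$ is misclassified then $y^{(i)} w^\top x^{(i)} \le 0$, which forces $\xi_i \ge 1$. Summing over examples, the slacks upper bound the number of training errors.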
13 Structural SVM Formulation
- Let x denote a structured input example $(x_1, \ldots, x_n)$
- Let y denote a structured output target $(y_1, \ldots, y_n)$
- Same objective function
- Constraints are defined for each incorrect labeling y over input $x^{(i)}$.
- The discriminant score for the correct labeling must be at least as large as that of the incorrect labeling plus the performance loss.
- Another interpretation: the margin between the correct label and an incorrect label must be at least as large as how bad the incorrect label is.
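Written out, the constraints described above take roughly the following form, assuming a joint feature map $\Psi$, a performance loss $\Delta$, and one slack variable $\xi_i$ per training example (the symbol names, $C$, and the $1/N$ scaling are assumptions):

```latex
\begin{align*}
\min_{w,\ \xi \ge 0} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & w^\top \Psi(x^{(i)}, y^{(i)}) \;\ge\; w^\top \Psi(x^{(i)}, y) + \Delta(y^{(i)}, y) - \xi_i
\qquad \forall i,\ \forall y \ne y^{(i)}
\end{align*}
```

Moving the score of the incorrect labeling to the left-hand side gives the margin reading: $w^\top \bigl[\Psi(x^{(i)}, y^{(i)}) - \Psi(x^{(i)}, y)\bigr] \ge \Delta(y^{(i)}, y) - \xi_i$.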
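14 Adapting to Sequence Labeling
- Minimize the same objective
- subject to the same margin constraints
- where the joint feature map
- and the performance loss are defined for label sequences
- The sum of slacks upper bounds the performance loss.
- Too many constraints!
Use the same slack variable for all constraints of the same structured training example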
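A small sketch of the two sequence-specific ingredients, under assumptions not confirmed by the text: the joint feature vector is taken to be a count of emission and transition indicators along the sequence, and the performance loss is taken to be Hamming loss (the number of mislabeled positions). `joint_features`, `hamming_loss`, `score`, and `margin_slack` are hypothetical names.

```python
from collections import Counter

def joint_features(x, y):
    """Psi(x, y): counts of (word, tag) emissions and (prev_tag, tag) transitions."""
    feats = Counter()
    for j, (word, tag) in enumerate(zip(x, y)):
        feats[("emit", word, tag)] += 1
        if j > 0:
            feats[("trans", y[j - 1], tag)] += 1
    return feats

def hamming_loss(y_true, y_pred):
    """Delta(y_true, y_pred): number of positions where the labels differ."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def score(w, feats):
    """Linear discriminant w . Psi for sparse, dict-based vectors."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def margin_slack(w, x, y_true, y_wrong):
    """Smallest slack that satisfies the constraint for this (y_true, y_wrong) pair."""
    gap = score(w, joint_features(x, y_true)) - score(w, joint_features(x, y_wrong))
    return max(0.0, hamming_loss(y_true, y_wrong) - gap)

# Toy example (hypothetical sentence, tags, and weights).
x = ["the", "dog", "barks"]
y_true = ["DET", "NOUN", "VERB"]
y_wrong = ["DET", "NOUN", "NOUN"]
w = {("emit", "barks", "VERB"): 2.0, ("trans", "NOUN", "VERB"): 0.5}
```

With these weights the correct labeling outscores `y_wrong` by more than its Hamming loss of 1, so that constraint needs no slack. With an all-zero weight vector the same constraint would need a slack of 1.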
15 Structural SVM Training
- Suppose we only solve the SVM objective over a small subset of constraints (working set).
- Some constraints from the global set might be violated.
- When finding a violated constraint, only y is free; everything else is fixed
- ys and xs fixed from training
- w and slack variables fixed from solving the SVM objective
- Degree of violation of a constraint is measured by how far the incorrect labeling's score plus its loss, minus the slack, exceeds the correct labeling's score
16 Structural SVM Training
- STEP 1: Solve the SVM objective function using only the current working set of constraints.
- STEP 2: Using the model learned in STEP 1, find the most violated constraint from the global set of constraints.
- STEP 3: If the constraint returned in STEP 2 is violated by more than epsilon, add it to the working set.
- Repeat STEPS 1-3 until no additional constraints are added. Return the most recent model that was trained in STEP 1.
STEPS 1-3 are guaranteed to loop for at most $O(1/\epsilon^2)$ iterations [Tsochantaridis et al. 2005].
This is known as a cutting plane method.
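The three steps can be sketched on a toy problem. Everything below is an assumption made for illustration: the task is plain multiclass classification treated as a structured problem, the oracle in STEP 2 is brute force over classes, and STEP 1 is solved only approximately by subgradient descent rather than by a quadratic programming solver. The iteration bound quoted above applies to exact solves and is not claimed for this sketch.

```python
def psi(x, y, K):
    """Toy joint feature map: copy x into the block of class y (assumption)."""
    out = [0.0] * (len(x) * K)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss(y_true, y):
    """0/1 loss between two class labels."""
    return 0.0 if y == y_true else 1.0

def hinge(w, x, y_true, y, K):
    """loss + score(y) - score(y_true): the constraint's violation before slack."""
    return loss(y_true, y) + dot(w, psi(x, y, K)) - dot(w, psi(x, y_true, K))

def slack(w, i, working, data, K):
    """Smallest xi_i >= 0 satisfying every working-set constraint of example i."""
    x, y_true = data[i]
    return max([0.0] + [hinge(w, x, y_true, y, K) for y in working[i]])

def solve_working_set(working, data, K, C, steps=500):
    """STEP 1, approximately: subgradient descent on the restricted objective."""
    w = [0.0] * (len(data[0][0]) * K)
    for t in range(1, steps + 1):
        grad = list(w)  # gradient of 0.5 * ||w||^2
        for i, (x, y_true) in enumerate(data):
            active = [y for y in working[i] if hinge(w, x, y_true, y, K) > 0]
            if active:
                worst = max(active, key=lambda y: hinge(w, x, y_true, y, K))
                diff = [a - b for a, b in zip(psi(x, worst, K), psi(x, y_true, K))]
                grad = [g + (C / len(data)) * d for g, d in zip(grad, diff)]
        w = [wd - gd / t for wd, gd in zip(w, grad)]  # step size 1/t
    return w

def most_violated(w, x, y_true, K):
    """STEP 2: brute-force oracle over the K - 1 incorrect classes."""
    return max((y for y in range(K) if y != y_true),
               key=lambda y: hinge(w, x, y_true, y, K))

def cutting_plane(data, K, C=10.0, eps=0.1):
    working = [set() for _ in data]  # one constraint set per training example
    while True:
        w = solve_working_set(working, data, K, C)            # STEP 1
        added = False
        for i, (x, y_true) in enumerate(data):
            y_hat = most_violated(w, x, y_true, K)            # STEP 2
            violation = hinge(w, x, y_true, y_hat, K) - slack(w, i, working, data, K)
            if violation > eps:                               # STEP 3
                working[i].add(y_hat)
                added = True
        if not added:
            return w, working

# Toy data (hypothetical): three 2-D points, one per class.
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
K = 3
w, working = cutting_plane(data, K)
```

Because the slack is recomputed from the current w, a constraint already in the working set can never be reported as violated again. Each pass therefore either adds a new constraint or stops, and the loop terminates after at most N(K - 1) additions on this toy problem.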
17 Illustrative Example
- Original SVM Problem
- Exponential constraints
- Most are dominated by a small set of important constraints
- Structural SVM Approach
- Repeatedly finds the next most violated constraint until the set of constraints is a good approximation.
21 Finding Most Violated Constraint
- Structural SVM is an oracle framework.
- Requires a subroutine to find the most violated constraint.
- Dependent on the formulation of the loss function and joint feature representation.
- Exponential number of constraints!
- Can usually expect an efficient solution when inference has an efficient algorithm.
22 Finding Most Violated Constraint
- Finding the most violated constraint is equivalent to maximizing the RHS w/o slack
- Requires solving an argmax over labelings y of the discriminant score plus the loss
- Highly related to inference
23 Sequence Labeling Revisited
- Finding the most violated constraint can be solved using Viterbi!
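Assuming Hamming loss and a score that decomposes over adjacent label pairs, the loss can be folded into the per-position scores, so the same Viterbi recursion finds the most violated constraint. The sketch below uses hypothetical names and a toy scoring function, and includes an exhaustive search for checking.

```python
import itertools

def viterbi(n, labels, local):
    """Maximize sum_j local(j, y_prev, y_cur) over label sequences of length n >= 1."""
    best = {y: local(0, None, y) for y in labels}
    backs = []
    for j in range(1, n):
        new_best, back = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + local(j, p, y))
            new_best[y] = best[prev] + local(j, prev, y)
            back[y] = prev
        best = new_best
        backs.append(back)
    # trace back from the best final label
    last = max(labels, key=lambda y: best[y])
    seq = [last]
    for back in reversed(backs):
        seq.append(back[seq[-1]])
    return best[last], seq[::-1]

def most_violated(y_true, labels, local):
    """argmax_y [ score(y) + Hamming(y_true, y) ] via loss-augmented Viterbi.

    The search includes y_true itself, which contributes zero loss.
    """
    def augmented(j, p, y):
        # add 1 to the local score wherever the label disagrees with the truth
        return local(j, p, y) + (1.0 if y != y_true[j] else 0.0)
    return viterbi(len(y_true), labels, augmented)

def brute_force(y_true, labels, local):
    """Exhaustive maximum of score + Hamming loss, for checking only."""
    def objective(ys):
        s = sum(local(j, ys[j - 1] if j else None, ys[j]) for j in range(len(ys)))
        return s + sum(a != b for a, b in zip(y_true, ys))
    return max(objective(ys) for ys in itertools.product(labels, repeat=len(y_true)))

# Toy setup (hypothetical): a small reward for the true label and for repeating a label.
labels = ["A", "B"]
y_true = ["A", "B", "A"]

def local(j, p, y):
    return (0.6 if y == y_true[j] else 0.0) + (0.3 if p == y else 0.0)

value, y_hat = most_violated(y_true, labels, local)
```

The only change from ordinary inference is the extra loss term inside `augmented`, which is why an efficient inference routine usually yields an efficient separation oracle.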
24 SVMStruct Abstracts Away Structure
- Minimize the same objective
- Subject to only the constraints in the working set
- Working set of constraints are fixed ys and xs
- Just like solving a conventional linear SVM!
- Notion of structure almost completely used in finding the most violated constraint
- (this is just an interpretation)