An Introduction to Structured Output Learning

Using Support Vector Machines

- Yisong Yue
- Cornell University
- Some material used courtesy of Thorsten Joachims
- (Cornell University)

Supervised Learning

- Find function from input space X to output space

Y - such that the prediction error is low.

Examples of Complex Output Spaces

- Natural Language Parsing
- Given a sequence of words x, predict the parse

tree y. - Dependencies from structural constraints, since y

has to be a tree.

Examples of Complex Output Spaces

- Part-of-Speech Tagging
- Given a sequence of words x, predict sequence of

tags y. - Dependencies from tag-tag transitions in Markov

model. - ? Similarly for other sequence labeling problems,

e.g., RNA Intron/Exon Tagging.

Examples of Complex Output Spaces

- Multi-class Labeling
- Protein Sequence Alignment
- Noun Phrase Co-reference Clustering
- Learning Parameters of Graphical Models
- Markov Random Fields
- Multivariate Performance Measures
- F1 Score
- ROC Area
- Average Precision
- NDCG

Notation

- Bold x,y are structured input/output examples.
- Usually consists of a collection of elements
- x (x1,,xn), y (y1,,yn)
- Each input element xi belongs to some high

dimensional feature space, Rd - Each output element yi is usually a multiclass

label or real valued number - Joint feature functions ?,F map input/output

examples to points in RD

1st Order Sequence Labeling

- Given
- scoring function S(x, y1, y2)
- input example (x1,,xn)
- Finds sequence (y1,,yn) to maximize
- Solved with dynamic programming (Viterbi)

Some Formulation Restrictions

- Assume S is parameterized linearly by some weight

vector w in RD. - This means that

Hypothesis Function

The Linear Discriminant

- From last slide
- Putting it together
- Our hypothesis function

Linear Discriminant Function

Structured Learning Problem

- Efficient Inference/Prediction hypothesis

function solves for y when given x and w - Viterbi in sequence labeling
- CKY Parser for parse trees
- Belief Propagation for Markov random fields
- Sorting for ranking
- Efficient Learning/Training need to efficiently

learn parameters w from training data

xi,yii1..N - Solution use Structural SVM framework
- Can also use Perceptrons, CRFs, MEMMs, M3Ns etc.

How to Train?

- Given a set of structured training examples

x(i),y(i)i1..N - Different training methods can be used.
- Perceptrons perform update whenever current model

mispredicts. - CRFs plug the discriminant into a conditional

log-likelihood function to optimize. - Structural SVMs solve a quadratic program

minimizes a tradeoff between model complexity and

a convex upper bound of performance loss.

Support Vector Machines

- Input examples denoted by x (high dimensional

point) - Output targets denoted by y (either 1 or -1)
- SVMs learns a hyperplane w, predictions are

sign(wTx) - Training involves finding w which minimizes
- subject to
- The sum of slacks upper bounds the

accuracy loss

Structural SVM Formulation

- Let x denote a structured input example (x1,,xn)
- Let y denote a structured output target (y1,

,yn) - Same objective function
- Constraints are defined for each incorrect

labeling y over input x(i) . - Discriminant score for the correct labeling at

least as large as incorrect labeling plus the

performance loss. - Another interpretation the margin between

correct label and incorrect label at least as

large as how bad the incorrect label is.

Adapting to Sequence Labeling

- Minimize
- subject to
- where

- and
- Sum of slacks upper bound performance

loss. - Too many constraints!

Use the same slack variable for all constraints

of the same structured training example

Structural SVM Training

- Suppose we only solve the SVM objective over a

small subset of constraints (working set). - Some constraints from global set might be

violated. - When finding a violated constraint, only y is

free, everything else is fixed - ys and xs fixed from training
- w and slack variables fixed from solving SVM

objective - Degree of violation of a constraint is measured

by

Structural SVM Training

- STEP 1 Solve the SVM objective function using

only the current working set of constraints. - STEP 2 Using the model learned in STEP 1, find

the most violated constraint from the global set

of constraints. - STEP 3 If the constraint returned in STEP 2 is

violated by more than epsilon, add it to the

working set. - Repeat STEP 1-3 until no additional constraints

are added. Return the most recent model that was

trained in STEP 1.

STEP 1-3 is guaranteed to loop for at most

O(1/epsilon2) iterations. Tsochantaridis et

al. 2005

This is known as a cutting plane method.

Illustrative Example

- Original SVM Problem
- Exponential constraints
- Most are dominated by a small set of important

constraints

- Structural SVM Approach
- Repeatedly finds the next most violated

constraint - until set of constraints is a good

approximation.

Illustrative Example

- Original SVM Problem
- Exponential constraints
- Most are dominated by a small set of important

constraints

- Structural SVM Approach
- Repeatedly finds the next most violated

constraint - until set of constraints is a good

approximation.

Illustrative Example

- Original SVM Problem
- Exponential constraints
- Most are dominated by a small set of important

constraints

- Structural SVM Approach
- Repeatedly finds the next most violated

constraint - until set of constraints is a good

approximation.

Illustrative Example

- Original SVM Problem
- Exponential constraints
- Most are dominated by a small set of important

constraints

- Structural SVM Approach
- Repeatedly finds the next most violated

constraint - until set of constraints is a good

approximation.

This is known as a cutting plane method.

Finding Most Violated Constraint

- Structural SVM is an oracle framework.
- Requires subroutine to find the most violated

constraint. - Dependent on formulation of loss function and

joint feature representation. - Exponential number of constraints!
- Can usually expect efficient solution when

inference has efficient algorithm.

Finding Most Violated Constraint

- Finding most violated constraint is equivalent to

maximizing the RHS w/o slack - Requires solving
- Highly related to inference

Sequence Labeling Revisited

- Finding most violated constraint
- can be solved using Viterbi!

SVMStruct Abstracts Away Structure

- Minimize
- Subject to
- Working set of constraints are fixed ys and xs
- Just like solving a conventional linear SVM!
- Notion of structure almost completely used in

finding the most violated constraint - (this is just an interpretation)