We last covered the dependency graph formulation of the parsing problem, and an algorithm which finds a dependency tree maximising the sum of arc weights. The missing piece of the puzzle is how we learn the ‘goodness’ of each candidate arc. Let the score of an arc be $s(i,j)=\mathbf{w}\cdot \mathbf{f}(i,j)$, the dot product of a weight vector $\mathbf{w}$ and a feature representation $\mathbf{f}(i,j)$ of that arc. Now all that’s left is to:

1. Represent the dependency between words $i$ and $j$ in a high-dimensional vector space
2. Learn the weight vector $\mathbf{w}$

No discussion of the parser features used appears in the paper. However, Ryan McDonald’s dissertation Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing cover the features used — features based on the lexical items and POS tags found in a neighbourhood of the head and argument words.

For learning, McDonald et al. use the Margin Infused Relaxed Algorithm (MIRA). Each step of MIRA selects a training example (which is a pair $(\mathbf{x},\mathbf{y}$ of the words vertices and the correct dependency structure). Then, given the original weight vector $\mathbf{w}^{(t)}$, MIRA tries to find a new weight vector $\mathbf{w}^{(t+1)}$ which nudges the current weight vector $\mathbf{w}^{(t)}$ as little as possible, such that the following set of constraints holds:
$\forall \mathbf{y}'\in dt(\mathbf{x}): s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})$

where $dt(\mathbf{x})$ is the set of all possible dependency structures over the sentence $\mathbf{x}$, and $L(\cdot, \cdot)$ is a loss function between two dependency structures. In other words, the score of the true dependency structure $\mathbf{y}$ and the score of any possible dependency structure $\mathbf{y’}$ should be separated by a margin of at least $L(\mathbf{y},\mathbf{y’})$.

If $\mathbf{y’}$ is equal to $\mathbf{y}$ the loss should be zero and the weight vector shouldn’t change (in other words, the model correctly classifies $\mathbf{y}$). If the trees differ, then the weight vector should change to yield a difference in scores at least as large as the loss between the true and candidate dependency structures. The above optimisation can be solved numerically through quadratic programming.

The problem with general MIRA is that the number of possible dependency structures $dt(\mathbf{x})$ is exponential in the length of the sentence $\mathbf{x}$, yielding an exponential number of constraints. To render MIRA tractable, McDonald et al. use the intuition that out of all possible dependency structures, the one which is best distinguished from the true example is the one with the current maximum score. And so, single-best MIRA considers only one constraint:
$s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})$
where
$\mathbf{y}' = \arg\max_{\mathbf{y}'}\; s(\mathbf{x},\mathbf{y}')$

McDonald et al. also discusses factored MIRA, with a set of constraints factored over arcs rather than complete dependency structures:
$\forall (l,j)\in \mathbf{y}, (k,j)\not\in \mathbf{y} s(l,j)-s(k,j)\ge 1$
In other words, a margin of one should separate the score of each correct arc $(l,j)$ and each incorrect arc $(k,j)$. Considering complete dependency structures again shows that the correct structure is separated from each incorrect structure by a margin of at least the number of incorrect arcs. While single-best MIRA yields an optimisation with one constraint, factored MIRA yields an optimisation with $O(n^2)$ constraints, because each of the $n$ constraints deals with $n-1$ incorrect arcs. (McDonald et al. note that the constraints applied in factored MIRA may rule out the true optimal dependency structure.)

They test on two languages: Czech and English. Czech, like German and Dutch, exhibits freer word order and hence generates more non-projective dependencies than English. The hypothesis that the non-projective MST algorithm yields better parsers for languages like Czech is borne out, with the best accuracy obtained from the factored MIRA model: 84.4%, versus 83.3% obtained with the McDonald et al. (2005) projective dependency parser. On the other hand, both models trained on a non-projective parser for English underperform, obtaining 90.2% relative to the 90.9% of the original McDonald et al. (2005) projective parser.

Seldom is one hammer good enough for every nail!