Archives for category: Parsing

We last covered the dependency graph formulation of the parsing problem, and an algorithm which finds a dependency tree maximising the sum of arc weights. The missing piece of the puzzle is how we learn the ‘goodness’ of each candidate arc. Let the score of an arc be $s(i,j)=\mathbf{w}\cdot \mathbf{f}(i,j)$, the dot product of a weight vector $\mathbf{w}$ and a feature representation $\mathbf{f}(i,j)$ of that arc. Now all that’s left is to:

1. Represent the dependency between words $i$ and $j$ in a high-dimensional vector space
2. Learn the weight vector $\mathbf{w}$

No discussion of the parser features used appears in the paper. However, Ryan McDonald’s dissertation, Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing, covers the features used: features based on the lexical items and POS tags found in a neighbourhood of the head and argument words.
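As a minimal sketch of what such arc scoring looks like, here is a sparse feature map and dot-product scorer. The feature templates below are simplified stand-ins of my own, not McDonald’s actual feature set:

```python
def features(sentence, tags, head, dep):
    """Binary features of a candidate arc head -> dep.

    These templates are illustrative only: word and POS identities of
    the head and dependent, plus the POS pair.
    """
    return {
        f"head_word={sentence[head]}",
        f"dep_word={sentence[dep]}",
        f"head_pos={tags[head]}",
        f"dep_pos={tags[dep]}",
        f"pos_pair={tags[head]}_{tags[dep]}",
    }

def score(w, sentence, tags, head, dep):
    """s(i, j) = w . f(i, j), with w a sparse dict of feature weights."""
    return sum(w.get(f, 0.0) for f in features(sentence, tags, head, dep))
```

Only the features present on the arc contribute, so the dot product reduces to a sum of lookups in the weight dictionary.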

For learning, McDonald et al. use the Margin Infused Relaxed Algorithm (MIRA). Each step of MIRA selects a training example (a pair $(\mathbf{x},\mathbf{y})$ of a sentence and its correct dependency structure). Then, given the current weight vector $\mathbf{w}^{(t)}$, MIRA finds a new weight vector $\mathbf{w}^{(t+1)}$ which changes $\mathbf{w}^{(t)}$ as little as possible, such that the following set of constraints holds:
$\forall \mathbf{y}'\in dt(\mathbf{x}): s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})$

where $dt(\mathbf{x})$ is the set of all possible dependency structures over the sentence $\mathbf{x}$, and $L(\cdot, \cdot)$ is a loss function between two dependency structures. In other words, the score of the true dependency structure $\mathbf{y}$ and the score of any possible dependency structure $\mathbf{y}'$ should be separated by a margin of at least $L(\mathbf{y},\mathbf{y}')$.
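A natural choice of $L$ counts the number of words assigned the wrong head (a Hamming loss over heads). A sketch, assuming a heads-array encoding where heads[i] is the index of word i’s head:

```python
def hamming_loss(gold_heads, pred_heads):
    """L(y, y'): number of words whose predicted head differs from gold.

    Both arguments are head arrays of equal length; entry i is the
    index of word i's head (with some sentinel, e.g. -1, for the root).
    """
    assert len(gold_heads) == len(pred_heads)
    return sum(g != p for g, p in zip(gold_heads, pred_heads))
```

Under this loss, $L(\mathbf{y},\mathbf{y}) = 0$, as the margin constraints require for the true structure.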

If $\mathbf{y}'$ is equal to $\mathbf{y}$ the loss should be zero and the weight vector shouldn’t change (in other words, the model correctly classifies $\mathbf{y}$). If the trees differ, then the weight vector should change to yield a difference in scores at least as large as the loss between the true and candidate dependency structures. The above optimisation can be solved numerically through quadratic programming.

The problem with general MIRA is that the number of possible dependency structures $dt(\mathbf{x})$ is exponential in the length of the sentence $\mathbf{x}$, yielding an exponential number of constraints. To render MIRA tractable, McDonald et al. use the intuition that, out of all possible dependency structures, the hardest to separate from the true example is the one with the current maximum score. And so, single-best MIRA considers only one constraint:
$s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})$
where
$\mathbf{y}' = \arg\max_{\mathbf{y}'}\; s(\mathbf{x},\mathbf{y}')$
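With only one constraint, the quadratic program has a closed-form solution: project $\mathbf{w}$ onto the half-space the constraint defines. A sketch over sparse feature dicts (the function name and representation are my own):

```python
def mira_update(w, f_gold, f_pred, loss):
    """One single-best MIRA step.

    Solves  min ||w' - w||^2  s.t.  w' . (f_gold - f_pred) >= loss,
    whose closed form is  w' = w + tau * (f_gold - f_pred)  with
    tau = max(0, (loss - w . (f_gold - f_pred)) / ||f_gold - f_pred||^2).
    """
    # Feature-vector difference between the gold and predicted structures.
    delta = {}
    for k in set(f_gold) | set(f_pred):
        d = f_gold.get(k, 0.0) - f_pred.get(k, 0.0)
        if d:
            delta[k] = d
    norm_sq = sum(d * d for d in delta.values())
    if norm_sq == 0.0:
        return w  # identical feature vectors: nothing to separate
    margin = sum(w.get(k, 0.0) * d for k, d in delta.items())
    tau = max(0.0, (loss - margin) / norm_sq)
    new_w = dict(w)
    for k, d in delta.items():
        new_w[k] = new_w.get(k, 0.0) + tau * d
    return new_w
```

Note that when the current margin already meets the loss, $\tau = 0$ and the weights are untouched, matching the behaviour described above for correctly classified examples.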

McDonald et al. also discuss factored MIRA, with a set of constraints factored over arcs rather than complete dependency structures:
$\forall (l,j)\in \mathbf{y}, (k,j)\not\in \mathbf{y}: s(l,j)-s(k,j)\ge 1$
In other words, a margin of one should separate the score of each correct arc $(l,j)$ and each incorrect arc $(k,j)$. Considering complete dependency structures again, this ensures that the correct structure is separated from each incorrect structure by a margin of at least the number of incorrect arcs. While single-best MIRA yields an optimisation with one constraint, factored MIRA yields an optimisation with $O(n^2)$ constraints, because each of the $n$ words has one correct incoming arc and up to $n-1$ incorrect ones. (McDonald et al. note that the constraints applied in factored MIRA may rule out the true optimal dependency structure.)

They test on two languages: Czech and English. Czech, like German and Dutch, exhibits freer word order and hence generates more non-projective dependencies than English. The hypothesis that the non-projective MST algorithm yields better parsers for languages like Czech is borne out, with the best accuracy obtained from the factored MIRA model: 84.4%, versus 83.3% obtained with the McDonald et al. (2005) projective dependency parser. On the other hand, on English both non-projective models underperform, obtaining 90.2% relative to the 90.9% of the original McDonald et al. (2005) projective parser.

Seldom is one hammer good enough for every nail!

We computational linguists have a reputation for formalism infidelity. In reality, each approach and each model is often good at handling certain aspects of a problem, and not so good at others. To know these strengths and weaknesses is to know the right tool (model, formalism, …) for the job.

We’ll focus on non-projective dependency parsing, a view on the parsing problem which lets us exploit some of the well-developed machinery and robust algorithms of graph theory. The seminal paper to which much of the recent interest is owed is Non-projective dependency parsing using spanning tree algorithms (McDonald et al., 2005).

Parsing is the teasing out of structure from the linear form of words. One way to represent this structure is by grouping particular words together (constituent parsing). Another is dependency parsing, where we represent a relationship between two words by drawing an arc between them. In a sentence like Johann loathes the buffalo, there would be three arcs: one from loathes to Johann (Johann being the loather), one from loathes to buffalo (being the ruminant loathed), and one from buffalo to the.

The dependencies naturally form a graph, with the words of the sentence as the vertices and the dependencies as arcs. Also, assume that each arc has a weight which expresses its ‘goodness’. This graph formulation lets us characterise parsing in this way: start by assuming a dependency between every pair of words. Then, remove arcs to form a spanning tree so as to maximise the sum of edge weights.

A given dependency graph is known as projective when the edges do not cross when the vertices are embedded from left-to-right in the order that they appear in the sentence. In other words, all the dependencies are properly nested.
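To make the definition concrete, here is a small projectivity check, assuming a heads-array encoding of my own (heads[i] is the index of word i’s head; the root word has head -1):

```python
def is_projective(heads):
    """Check that no two dependency arcs cross.

    An arc spans min(i, heads[i])..max(i, heads[i]); two arcs cross
    when exactly one endpoint of one lies strictly inside the other's
    span.  Arcs that share an endpoint never cross.
    """
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h >= 0]
    for a in range(len(arcs)):
        for b in range(a + 1, len(arcs)):
            (l1, r1), (l2, r2) = arcs[a], arcs[b]
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True
```

For Johann loathes the buffalo, with heads [1, -1, 3, 1], every arc nests properly and the check passes; a synthetic structure such as [2, 3, -1, 2] contains crossing arcs and fails.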

While the vast majority of dependencies in English are projective, English does license constructions which yield crossing dependencies. In languages exhibiting freer word order, the scales tip further towards non-projectivity. Work on graph parsing prior to McDonald et al. (2005) focused on projective dependency parsing as a simplification of the full setting. What McDonald et al. showed was surprising in two ways: not only is the non-projective case the easier one to solve, but the non-projective algorithm is in $O(n^2)$, while the projective algorithm is in $O(n^3)$.

Let’s assume for the time being that each candidate arc $(i,j)$ has a score $s(i,j)$ representing its ‘goodness’. To parse a sentence, we start with the complete graph $G=(V,E)$, then run an MST (maximum spanning tree) algorithm to find the subgraph $G^*=(V,E^*)$ such that $\sum_{(i,j)\in E^*} s(i,j)$ is maximised and $E^*$ forms a spanning tree over $V$.

While Prim’s and Kruskal’s algorithms are known to every CS undergrad, the corresponding algorithm to find an MST over a directed graph is less well known. Thankfully, the Chu-Liu-Edmonds (CLE) algorithm solves the problem and is not too difficult to describe.

For each vertex $v \in V$, we add the incoming edge $(v^*, v)$ which maximises the score $s(v^*, v)$. If the resulting graph has no cycles, then it can be shown that the resulting edge set induces an MST: we are done.
Otherwise, we perform a contraction on the cycle, shrinking it down to a supernode and setting the weights of the incoming and outgoing connections to and from the supernode. Because it can be shown that an MST on the contracted graph corresponds to an MST on the original graph, we simply run CLE recursively on the contracted graph.

When CLE completes, we break the cycle in each supernode, then unpack the supernode into the set of vertices it originally represented, resulting in a spanning tree over the vertices and dependency edges which maximises the sum of edge weights.
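The three steps above (greedy selection, contraction, expansion) can be sketched as a compact recursive implementation. This is a sketch over a dense score matrix, with node 0 as an artificial root, rather than anyone’s production code:

```python
def _find_cycle(heads, n):
    """Return the set of nodes on a cycle in the head map, or None."""
    for start in range(1, n):
        seen, v = [], start
        while v in heads and v not in seen:
            seen.append(v)
            v = heads[v]
        if v in seen:
            return set(seen[seen.index(v):])
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence rooted at node 0.

    scores[u][v] is the weight of arc u -> v (float('-inf') for absent
    arcs); arcs into node 0 are never chosen.  Returns a dict mapping
    each non-root node to its selected head.
    """
    n = len(scores)
    # 1. Greedily take the best incoming arc for every non-root node.
    heads = {v: max((u for u in range(n) if u != v),
                    key=lambda u: scores[u][v])
             for v in range(1, n)}
    cycle = _find_cycle(heads, n)
    if cycle is None:
        return heads
    # 2. Contract the cycle into a single supernode c.
    cycle_score = sum(scores[heads[v]][v] for v in cycle)
    old2new, new2old = {}, []
    for u in range(n):
        if u not in cycle:
            old2new[u] = len(new2old)
            new2old.append(u)
    c = len(new2old)
    NEG = float("-inf")
    new_scores = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}  # original arcs realising each supernode arc
    for u in range(n):
        for v in range(n):
            if u == v or (u in cycle and v in cycle):
                continue
            if u not in cycle and v not in cycle:
                new_scores[old2new[u]][old2new[v]] = scores[u][v]
            elif v in cycle:
                # Entering the cycle at v breaks the cycle arc into v.
                wt = cycle_score + scores[u][v] - scores[heads[v]][v]
                if wt > new_scores[old2new[u]][c]:
                    new_scores[old2new[u]][c] = wt
                    enter[old2new[u]] = (u, v)
            else:
                if scores[u][v] > new_scores[c][old2new[v]]:
                    new_scores[c][old2new[v]] = scores[u][v]
                    leave[old2new[v]] = u
    # 3. Recurse on the contracted graph, then expand the supernode.
    result = {}
    for v_new, u_new in chu_liu_edmonds(new_scores).items():
        if v_new == c:
            u_orig, v_orig = enter[u_new]
            result[v_orig] = u_orig          # the arc entering the cycle
            for node in cycle:
                if node != v_orig:
                    result[node] = heads[node]  # keep remaining cycle arcs
        elif u_new == c:
            result[new2old[v_new]] = leave[new2old[v_new]]
        else:
            result[new2old[v_new]] = new2old[u_new]
    return result
```

Each contraction removes at least one node, so the recursion terminates; a naive implementation like this one is $O(n^3)$, while the $O(n^2)$ bound quoted above requires a more careful bookkeeping of the best incoming arcs.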

But before we can apply this approach, we must answer the question: How do we find the scores $s(i,j)$ on dependency edges? We’ll cover this in part two.