We computational linguists, apparently by reputation, have a bad rap for formalism infidelity. In reality, each approach and each model is often good at handling certain aspects of a problem, and not so good at others. To know these strengths and weaknesses is to know the right tool (model, formalism, …) for the job.

We’ll focus on *non-projective dependency parsing*, a view on the parsing problem which lets us exploit some of the well-developed machinery and robust algorithms of graph theory. The seminal paper to which much of the recent interest is owed is Non-projective dependency parsing using spanning tree algorithms (McDonald et al., 2005).

Parsing is the teasing out of structure from the linear form of words. One way to represent this structure is by grouping particular words together (*constituent parsing*). Another is *dependency parsing*, where we represent a relationship between two words by drawing an *arc* between them. In a sentence like *Johann loathes the buffalo*, there would be three arcs: one from *loathes* to *Johann* (Johann being the loather), one from *loathes* to *buffalo* (being the ruminant loathed), and one from *buffalo* to *the*.

The dependencies naturally form a graph, with the words of the sentence as the vertices and the dependencies as arcs. Also, assume that each arc has a *weight* which expresses its ‘goodness’. This graph formulation lets us characterise parsing in this way: start by assuming a dependency between every pair of words. Then, remove arcs to form a spanning tree so as to maximise the sum of edge weights.

A given dependency graph is known as *projective* when the edges do not cross when the vertices are embedded from left-to-right in the order that they appear in the sentence. In other words, all the dependencies are properly nested.

While the vast majority of dependencies in English are projective, English does license constructions which yield crossing dependencies. In languages exhibiting freer word order, the scales tip further towards non-projectivity. Work prior to McDonald et al. (2005) on graph parsing would focus on projective dependency parsing as a simplification of the full setting. What McDonald et al. showed was surprising in two ways. Not only is the non-projective case which is easier to solve, the non-projective algorithm is in \(O(n^2)\), while the projective algorithm is in \(O(n^3)\).

Let’s assume for the time being that each candidate arc $(i,j)$ has a score $s(i,j)$ representing its ‘goodness’. To parse a sentence, we start with the complete graph $G=(V,E)$, then run a MST (maximum spanning tree) algorithm to find the subgraph $G^*=(V,E^*)$ such that $\sum_{(i,j)\in E^*} s(i,j)$ is maximised, and every vertex in $V$ appears in $E^*$.

While Prim’s and Kruskal’s algorithms are known to every CS undergrad, the corresponding algorithm to find a MST over a *directed* graph is less well known. Thankfully, the *Chu-Liu-Edmonds algorithm* solves the problem and is not too difficult to describe.

For each vertex $v \in V$, we add the incoming edge $(v^*, v)$ which maximises the score $s(v^*, v)$. If the resulting graph has no cycles, then it can be shown that the resulting edge set induces an MST: we are done.

Otherwise, we perform a *contraction* on the cycle, shrinking it down to a supernode and setting the weights of the incoming and outgoing connections to and from the supernode. Because it can be shown that an MST on the contracted graph corresponds to an MST on the original graph, we simply run CLE recursively on the contracted graph.

When CLE completes, we break the cycle in each supernode, then unpack the supernode into the set of vertices it originally represented, resulting in a spanning tree over the vertices and dependency edges which maximises the sum of edge weights.

But before we can apply this approach, we must answer the question: *How do we find the scores $s(i,j)$ on dependency edges?* We’ll cover this in part two.