I worked on my thesis in one repo, let’s call it thesis. My lab has another repo, pubs, and we have a policy of keeping all submissions (final or in draft) in this repo.

I decided to use a subtree merge to preserve the history of my own repo while moving its contents into another repo. In other words, I grafted a subdirectory of one repo onto a subdirectory of another. For generality, suppose you want to move the path a/b in repo OLD to the path y/z in repo NEW.

First, use git subtree split in repo OLD to synthesise a new history containing only the commits that touch the path a/b, putting the tip of that history on branch B:

# (inside repo OLD)
git subtree split -P a/b -b B

The last line of the command’s output is the hash of the new commit. Remember that hash, as we’ll use it later when merging into the new repo.

Then switch to the repo NEW, and add the repo OLD as a remote:

git remote add -f old-remote PATH-TO-OLD

Issue the following command, where COMMIT is the commit hash we got in the subtree split:

# (inside repo NEW)
git subtree add --prefix=y/z COMMIT

The result is that the commit history of subdirectory a/b in the OLD repo is spliced onto subdirectory y/z of NEW, which is what we wanted.

I switched to iTerm2 recently from a lifetime of Terminal.app, and I am not missing the old Terminal one bit.

Ever double-clicked a token in your terminal to select it, and wished that Terminal.app knew what you actually meant to select? iTerm2 lets you define what text a quadruple-click will select, by matching the surrounding context against a regex.

Here are some additional Smart Selection rules I’ve found very useful:

  • Capturing the value of key=value: (?<==)[A-Za-z0-9-]+, precision high
  • Hex strings (hashes): [A-Fa-f0-9]+, precision normal
  • Path without initial [ab]/: (?<=\b[ab]/)([[:letter:][:number:]._-]+/+)+[[:letter:][:number:]._-]+/?, precision high

The last rule picks paths out of git diff output without the leading a/ or b/.
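The rules themselves live in iTerm2’s preferences, but the first two regexes are plain enough to sanity-check elsewhere. A quick check in Python (the third rule uses POSIX character classes, which Python’s re module doesn’t understand, so it is left out):

import re

# The first two Smart Selection regexes, tried against sample tokens.
value_re = re.compile(r'(?<==)[A-Za-z0-9-]+')   # value of key=value
hex_re = re.compile(r'[A-Fa-f0-9]+')            # hex strings (hashes)

print(value_re.search('LANG=en-GB').group())    # -> en-GB
print(bool(hex_re.fullmatch('8badf00d')))       # -> True
print(bool(hex_re.fullmatch('not-a-hash')))     # -> False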

Constraints of the data model

  • Indexing is not available, so data may have to be denormalised
  • Columns and supercolumns are sorted by key name
    • names are byte strings, but how they are interpreted for sorting can be configured
  • Whether range queries over row keys are possible depends on the partitioner
    • RandomPartitioner distributes rows across machines according to the MD5 hash of the row key, which gives good load balancing (see the sketch after this list)
      • within a node, rows are stored in order of their MD5 tokens rather than their raw keys
    • OrderPreservingPartitioner distributes rows according to the key itself, so key range queries work, at the cost of potentially uneven load
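A rough illustration in Python of why RandomPartitioner balances load but gives up key-ordered range queries: placement is driven by the MD5 digest of the row key rather than the key itself (a sketch only; the real token and ring arithmetic differ in detail):

import hashlib

def md5_token(row_key: bytes) -> int:
    # Roughly how RandomPartitioner places a row: the token comes from the MD5
    # digest of the row key, so adjacent keys scatter across the cluster.
    return int.from_bytes(hashlib.md5(row_key).digest(), 'big')

for key in (b'user001', b'user002', b'user003'):
    print(key, md5_token(key) % 4)   # pretend the ring has 4 evenly spaced nodes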

Cassandra data model

Column :: key → value

  • similar to a single datum

SuperColumn :: key → { subkey1 → value1, … }

  • a datum whose value is structured

ColumnFamily :: { rowkey1 → { name1 → value1, name2 → value2, … }, … }

  • column families are stored in separate files
  • stored in key-major order: by row key first, then by column name within each row
  • similar to an RDBMS table, except sparse

SuperColumnFamily :: { supercolumn1, supercolumn2, … }

Keyspace :: { ColumnFamily1, ColumnFamily2, … } (a namespace for ColumnFamilies, roughly a database/schema)
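Spelled out as plain nested maps, the definitions above look roughly like this (Python dicts for illustration only, with made-up keys; in Cassandra the keys, names and values are byte strings):

# A Column is a single name -> value pair; a ColumnFamily maps row keys to
# collections of Columns; a SuperColumnFamily adds one more level of nesting.
column = {'email': 'jo@example.org'}

column_family = {                       # one row per row key
    'jsmith': {'email': 'jo@example.org', 'state': 'CA'},
    'bjones': {'email': 'bj@example.org', 'state': 'WA'},
}

super_column_family = {                 # each top-level name maps to a whole map
    'jsmith': {
        'address': {'city': 'San Mateo', 'state': 'CA'},
        'contact': {'email': 'jo@example.org'},
    },
}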

An example

  • User (an RDBMS table, a Cassandra ColumnFamily)
    • maps user attributes to byte array values
  • To do a query on one of those attributes, say state,
    • need to manually create a ColumnFamily { state → { city → { name → username } } }
      • like indexing on state
    • then, where state == ‘CA’ is efficient (since ColumnFamilies are sorted by key)
  • Composite keys
    • corresponds to where state == ‘CA’ and city == ‘San Mateo’
      • { state:city → {name → username} }
    • ColumnFamilies are sorted by key
    • we can do where state == ‘CA’ (get all cities)
    • but also where state == ‘CA’ and city == ‘San Mateo’ (get one city)
    • but not range queries on city alone (a dict sketch of these index ColumnFamilies follows below)
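Continuing the dict illustration, here is roughly what those hand-rolled index ColumnFamilies look like, one keyed by state and one by the composite state:city key (made-up data; in practice the column values are often left empty, since only the column names, i.e. the usernames, matter):

# Index ColumnFamily keyed by state: answers "where state == 'CA'".
users_by_state = {
    'CA': {'jsmith': '', 'apatel': ''},
    'WA': {'bjones': ''},
}

# Index ColumnFamily keyed by the composite state:city key: answers
# "where state == 'CA' and city == 'San Mateo'" directly, and
# "where state == 'CA'" as a row-key prefix scan (given an order-preserving
# partitioner), but not a range query on city alone.
users_by_state_city = {
    'CA:Palo Alto': {'apatel': ''},
    'CA:San Mateo': {'jsmith': ''},
    'WA:Seattle':   {'bjones': ''},
}

print(list(users_by_state['CA']))                 # all users in CA
print(list(users_by_state_city['CA:San Mateo']))  # users in one CA city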

Cribbed from

  • http://www.slideshare.net/benjaminblack/cassandra-basics-indexing
  • http://wiki.apache.org/cassandra/DataModel/
  • http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model

We last covered the dependency graph formulation of the parsing problem, and an algorithm which finds a dependency tree maximising the sum of arc weights. The missing piece of the puzzle is how we learn the ‘goodness’ of each candidate arc. Let the score of an arc be $s(i,j)=\mathbf{w}\cdot \mathbf{f}(i,j)$, the dot product of a weight vector $\mathbf{w}$ and a feature representation $\mathbf{f}(i,j)$ of that arc. Now all that’s left is to:

  1. Represent the dependency between words $i$ and $j$ in a high-dimensional vector space
  2. Learn the weight vector $\mathbf{w}$

No discussion of the parser features used appears in the paper. However, Ryan McDonald’s dissertation, Discriminative Learning and Spanning Tree Algorithms for Dependency Parsing, covers the features used: features based on the lexical items and POS tags found in a neighbourhood of the head and argument words.
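To make that concrete, here is a toy sketch in Python of the kind of arc feature function meant; the names and the feature set are made up for illustration, and the real templates are far richer (including features on the words between and surrounding the pair):

def arc_features(sentence, head, dep):
    # Toy arc feature extractor: `sentence` is assumed to be a list of
    # (word, pos) pairs; the returned dict of indicator features stands in
    # for the high-dimensional vector f(i, j).
    hw, hp = sentence[head]
    dw, dp = sentence[dep]
    return {
        f'head_word={hw}': 1.0,
        f'head_pos={hp}': 1.0,
        f'dep_word={dw}': 1.0,
        f'dep_pos={dp}': 1.0,
        f'head_pos+dep_pos={hp}+{dp}': 1.0,
        f'head_word+dep_word={hw}+{dw}': 1.0,
        f'distance={abs(head - dep)}': 1.0,
    }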

For learning, McDonald et al. use the Margin Infused Relaxed Algorithm (MIRA). Each step of MIRA selects a training example, a pair $(\mathbf{x},\mathbf{y})$ of the word vertices and the correct dependency structure. Then, given the current weight vector $\mathbf{w}^{(t)}$, MIRA finds a new weight vector $\mathbf{w}^{(t+1)}$ which changes $\mathbf{w}^{(t)}$ as little as possible while satisfying the following set of constraints:
\[
\forall \mathbf{y}'\in dt(\mathbf{x}): s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})
\]

where $dt(\mathbf{x})$ is the set of all possible dependency structures over the sentence $\mathbf{x}$, and $L(\cdot, \cdot)$ is a loss function between two dependency structures. In other words, the score of the true dependency structure $\mathbf{y}$ and the score of any possible dependency structure $\mathbf{y}'$ should be separated by a margin of at least $L(\mathbf{y},\mathbf{y}')$.

If $\mathbf{y}'$ is equal to $\mathbf{y}$, the loss is zero and the weight vector need not change (in other words, the model correctly classifies $\mathbf{y}$). If the trees differ, the weight vector should change to yield a difference in scores at least as large as the loss between the true and candidate dependency structures. The above optimisation can be solved numerically through quadratic programming.

The problem with general MIRA is that the number of possible dependency structures in $dt(\mathbf{x})$ is exponential in the length of the sentence $\mathbf{x}$, yielding an exponential number of constraints. To render MIRA tractable, McDonald et al. use the intuition that, of all possible dependency structures, the one most in danger of violating the margin is the one with the highest score under the current weights. And so, single-best MIRA considers only one constraint:
\[
s(\mathbf{x},\mathbf{y})-s(\mathbf{x},\mathbf{y'})\ge L(\mathbf{y},\mathbf{y'})
\]
where
\[
\mathbf{y}' = \arg\max_{\mathbf{y}'}\; s(\mathbf{x},\mathbf{y}')
\]
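With a single constraint, the optimisation has a closed-form solution, so the update is easy to write down. A minimal sketch in Python, assuming numpy feature vectors and hypothetical helpers: decode returns the highest-scoring tree under the current weights (e.g. via Chu-Liu-Edmonds), features the summed feature vector of a tree, and loss the structured loss (e.g. the number of words assigned the wrong head):

import numpy as np

def single_best_mira_step(w, sentence, gold_tree, decode, features, loss):
    # One single-best MIRA update: move w as little as possible while scoring
    # the gold tree above the current best tree by a margin of at least the
    # loss between them.
    y_hat = decode(sentence, w)              # current highest-scoring tree
    delta = features(sentence, gold_tree) - features(sentence, y_hat)
    margin = np.dot(w, delta)                # s(x, y) - s(x, y')
    violation = loss(gold_tree, y_hat) - margin
    norm_sq = np.dot(delta, delta)
    if violation <= 0 or norm_sq == 0:       # constraint already satisfied
        return w
    tau = violation / norm_sq                # closed-form step size for one constraint
    return w + tau * delta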

McDonald et al. also discuss factored MIRA, with a set of constraints factored over arcs rather than complete dependency structures:
\[
\forall (l,j)\in \mathbf{y},\ (k,j)\not\in \mathbf{y}:\quad s(l,j)-s(k,j)\ge 1
\]
In other words, a margin of one should separate the score of each correct arc $(l,j)$ from that of each incorrect arc $(k,j)$. Considering complete dependency structures again, this implies that the correct structure is separated from each incorrect structure by a margin of at least the number of incorrect arcs. While single-best MIRA yields an optimisation with one constraint, factored MIRA yields an optimisation with $O(n^2)$ constraints, because each of the $n$ correct arcs is paired with each of its roughly $n-1$ incorrect competitors. (McDonald et al. note that the constraints applied in factored MIRA may rule out the true optimal dependency structure.)
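A quick way to see where the $O(n^2)$ comes from is to enumerate the constraints. A sketch, assuming gold_heads maps each word to its correct head and vertex 0 is an artificial root:

def factored_constraints(gold_heads, n):
    # Yield one (correct arc, incorrect arc) pair per factored-MIRA constraint:
    # for each word j with correct head l, every other candidate head k gives
    # the constraint s(l, j) - s(k, j) >= 1.
    for j, l in gold_heads.items():         # n correct arcs (l, j)
        for k in range(n + 1):              # candidate heads, including the root 0
            if k not in (l, j):
                yield (l, j), (k, j)        # about n - 1 per word: O(n^2) in total

# For a 3-word sentence where word 2 hangs off the root and words 1 and 3 off word 2:
print(sum(1 for _ in factored_constraints({1: 2, 2: 0, 3: 2}, n=3)))   # -> 6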

They test on two languages: Czech and English. Czech, like German and Dutch, exhibits freer word order and hence more non-projective dependencies than English. The hypothesis that the non-projective MST algorithm yields better parsers for languages like Czech is borne out, with the best accuracy obtained by the factored MIRA model: 84.4%, versus 83.3% for the McDonald et al. (2005) projective dependency parser. On the other hand, both non-projective models underperform slightly on English, obtaining 90.2% against the 90.9% of the original McDonald et al. (2005) projective parser.

Seldom is one hammer good enough for every nail!

We computational linguists apparently have a bad rap for formalism infidelity. In reality, each approach and each model is often good at handling certain aspects of a problem, and not so good at others. To know these strengths and weaknesses is to know the right tool (model, formalism, …) for the job.

We’ll focus on non-projective dependency parsing, a view on the parsing problem which lets us exploit some of the well-developed machinery and robust algorithms of graph theory. The seminal paper to which much of the recent interest is owed is Non-projective dependency parsing using spanning tree algorithms (McDonald et al., 2005).

Parsing is the teasing out of structure from the linear form of words. One way to represent this structure is by grouping particular words together (constituent parsing). Another is dependency parsing, where we represent a relationship between two words by drawing an arc between them. In a sentence like Johann loathes the buffalo, there would be three arcs: one from loathes to Johann (Johann being the loather), one from loathes to buffalo (being the ruminant loathed), and one from buffalo to the.

The dependencies naturally form a graph, with the words of the sentence as the vertices and the dependencies as arcs. Also, assume that each arc has a weight which expresses its ‘goodness’. This graph formulation lets us characterise parsing in this way: start by assuming a dependency between every pair of words. Then, remove arcs to form a spanning tree so as to maximise the sum of edge weights.

A dependency graph is known as projective when no edges cross once the vertices are laid out left to right in the order they appear in the sentence. In other words, all the dependencies are properly nested.

While the vast majority of dependencies in English are projective, English does license constructions which yield crossing dependencies. In languages exhibiting freer word order, the scales tip further towards non-projectivity. Work on graph-based parsing prior to McDonald et al. (2005) focused on projective dependency parsing as a simplification of the full setting. What McDonald et al. showed was surprising: not only is the non-projective case tractable, it is actually the easier one to solve, with the non-projective algorithm running in \(O(n^2)\) against \(O(n^3)\) for the projective algorithm.

Let’s assume for the time being that each candidate arc $(i,j)$ has a score $s(i,j)$ representing its ‘goodness’. To parse a sentence, we start with the complete graph $G=(V,E)$, then run an MST (maximum spanning tree) algorithm to find the spanning subgraph $G^*=(V,E^*)$ such that every vertex in $V$ appears in $E^*$ and $\sum_{(i,j)\in E^*} s(i,j)$ is maximised.

While Prim’s and Kruskal’s algorithms are known to every CS undergrad, the corresponding algorithm for finding an MST of a directed graph is less well known. Thankfully, the Chu-Liu-Edmonds (CLE) algorithm solves the problem and is not too difficult to describe.

For each non-root vertex $v \in V$, we keep the incoming edge $(v^*, v)$ which maximises the score $s(v^*, v)$. If the resulting graph has no cycles, it can be shown that this edge set already forms an MST: we are done.
Otherwise, we contract a cycle down to a supernode and recompute the weights of the arcs into and out of the supernode: an arc into the supernode is weighted by the best way of entering the cycle at one of its vertices (counting the cycle arcs that survive once that vertex’s own cycle arc is displaced), and an arc out of the supernode takes the weight of the best arc leaving any vertex of the cycle. Because an MST of the contracted graph can be shown to correspond to an MST of the original graph, we simply run CLE recursively on the contracted graph.

When the recursion completes, we unpack each supernode into the vertices it represented, breaking its cycle at the vertex where the chosen incoming arc enters, which results in a spanning tree over the original vertices that maximises the sum of edge weights.
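Here is a compact Python sketch of the recursion just described. It favours clarity over efficiency (the naive recursion is closer to $O(n^3)$ than to the $O(n^2)$ achievable with a cleverer implementation), and it assumes a dense score matrix score[h][d] holding the weight of the arc from head h to dependent d, with vertex 0 as an artificial root:

def chu_liu_edmonds(score, root=0):
    """Return head[d] = h for every vertex d != root, forming a maximum
    spanning arborescence of the complete digraph whose arc h -> d has
    weight score[h][d].  A naive recursive sketch, not an optimised version."""
    n = len(score)

    # Step 1: every non-root vertex keeps only its highest-scoring incoming arc.
    head = {d: max((h for h in range(n) if h != d), key=lambda h: score[h][d])
            for d in range(n) if d != root}

    # Step 2: if the chosen arcs contain no cycle, they already form the MST.
    cycle = _find_cycle(head, root)
    if cycle is None:
        return head

    # Step 3: contract the cycle into a single supernode and recurse.
    in_cycle = set(cycle)
    cycle_score = sum(score[head[v]][v] for v in cycle)
    outside = [v for v in range(n) if v not in in_cycle]
    new_index = {v: i for i, v in enumerate(outside)}
    c = len(outside)                               # index of the supernode
    new_score = [[float('-inf')] * (c + 1) for _ in range(c + 1)]
    enters, leaves = {}, {}
    for u in outside:
        for v in outside:
            if u != v:
                new_score[new_index[u]][new_index[v]] = score[u][v]
        # Best way for u to enter the cycle: pay for the new arc, drop the
        # cycle arc it displaces, keep the rest of the cycle.
        best_in = max(cycle, key=lambda v: score[u][v] - score[head[v]][v])
        new_score[new_index[u]][c] = (score[u][best_in]
                                      - score[head[best_in]][best_in] + cycle_score)
        enters[new_index[u]] = best_in
        # Best arc from anywhere on the cycle out to u.
        best_out = max(cycle, key=lambda w: score[w][u])
        new_score[c][new_index[u]] = score[best_out][u]
        leaves[new_index[u]] = best_out

    contracted = chu_liu_edmonds(new_score, root=new_index[root])

    # Step 4: expand the supernode, breaking the cycle where the chosen arc enters.
    result = {}
    for d, h in contracted.items():
        if d == c:                                 # the arc entering the cycle
            entry = enters[h]
            result[entry] = outside[h]
            for v in cycle:
                if v != entry:
                    result[v] = head[v]            # the other cycle arcs survive
        elif h == c:                               # an arc leaving the cycle
            result[outside[d]] = leaves[d]
        else:
            result[outside[d]] = outside[h]
    return result


def _find_cycle(head, root):
    # Return the vertices of a cycle among the chosen arcs, or None.
    for start in head:
        path, v = [], start
        while v != root and v not in path:
            path.append(v)
            v = head[v]
        if v != root:
            return path[path.index(v):]
    return None


# Toy usage with made-up scores: vertex 0 is the root, vertices 1-3 are words.
toy = [[0, 5, 1, 1],
       [0, 0, 6, 2],
       [0, 7, 0, 3],
       [0, 2, 4, 0]]
print(chu_liu_edmonds(toy))   # -> {3: 2, 1: 0, 2: 1}: word 1 attaches to the root, 2 to 1, 3 to 2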

But before we can apply this approach, we must answer the question: How do we find the scores $s(i,j)$ on dependency edges? We’ll cover this in part two.