2D CKY algorithm
Summary
The CKY algorithm is a method of enumerating parse trees for context-free languages. Given some string w, and a particular context-free grammar in Chomsky Normal Form (G), the CKY algorithm lists all of the possible ways G can parse w. By combining a compact parse forest representation with a table-driven dynamic programming approach, the CKY algorithm can construct a potentially exponential number of trees in O(card(G)2 * |w|3) time.
The Algorithm
Let us be given some grammar G and some string w. We want not only to discover whether or not w is in G, but also to enumerate all of the possible trees which witness this membership. Let all indices into the table be zero-based.
- Item Form
-
Our version of the algorithm works inside of a table T of size |w|2 / 2 - one-half of a |w|2-sized table, cut on the diagonal in the positive direction on both axes. Each cell of this table is a set of local trees. Each of these local trees is either binary branching, unary branching, or trivial. In general, each child of a local tree in G will itself be the root some other local tree in G, and trivial trees will represent non-terminals. For the specific purposes of our algorithm, each child of a trivial tree will also store a pointer to a set of local trees. For the purpose of this description, let a child of a local tree c that points to some cell (p,q) be represented as c->(p,q).
- Initialization
-
The algorithm begins by filling the bottom row of the table (row zero) with trivial trees rooted by the characters w0...w|w| - 1 - cell T[0,0] contains w0(~,~), cell T[0,1] contains w1(~,~), and so on, up to |w|. If one of these trivial trees is not in G, then w cannot be in L(G), and processing terminates.
- Progress
-
An empty cell indexed by (a,b) is filled by examining pairs of cells. Initially, let {(i,j), (k,m)} = {(a, b), (a, b)}.
To fill the cell, let {(i,j), (k,m)} = {(i, j - 1), (k + 1, m - 1)}, and examine the pair of cells {(i,j), (k,m)}. For each of these pairs of cells, examine all of the possible ordered pairs of local trees (p, q) that can be formed by picking one local tree from cell (i,j) and one local tree from cell (k,m). Add a local tree Z(p->(i,j), q->(k,m)) to cell (a,b), if and only if Z(p,q) is in G. Repeat this while j and m are greater than zero.
- Goal
-
Our goal is a single set of local trees in the cell (0, r-1), for which each local tree is rooted by a start symbol. For example - in an instantiation of the grammar G(A, B), our goal would be a set of local trees, each of which is rooted by either A or B.
If the goal is reached, w is in G, and the pointers linking local trees to sets of local trees are sufficient to enumerate all trees witnessing this membership.
Algorithm Limitations
The CKY algorithm itself has several inherent limitations:
- The algorithm requires that each local tree (or rewriting rule) of the grammar be in Chomsky Normal Form (CNF). While there exists a simple algorithm to convert any n-branching grammar to CNF, the algorithm adds symbols to (and changes the structure of) the grammar. The parse trees that CKY generates from this altered grammar may be of limited utility.
- The algorithm can construct any particular parse forest in polynomial time, but the set of trees may be of exponential size (relative to |w|). To actually list out this set of trees, not surprisingly, will require exponential time.
Implementation Features
Our implementation of the CKY algorithm:
- Is written in Standard C++
- Uses the Standard Template Library (STL) for basic data structures (strings, lists, associative arrays).
- Is appropriately templatized, so as to avoid arbitrary restrictions on the language and state alphabets.
Implementation Limitations
Our implementation of the CKY algorithm currently does not:
- Work with trees in three or more dimensions. Work in this area is well underway; a suitable set of tree representations, along with a working algorithm, should be completed by the end of Summer, 2004.
- Support arbitrary-branching local trees. Work is underway to bring support for these non-CNF grammars to the recognizer; CKY will follow.
The Future
Once we have finished our discussions, we will add more information about our implementation.
