July 21, 2004

Embedding trees

It is useful for our CKY-style parsing algorithm to be able to handle arbitrary-branching trees of arbitrary dimensions as well as the binary-branching trees of arbitrary dimensions that it already handles. However, the straight-forward approach, which would be to use n-dimensional state trees paired with labels as the domain of the delta function of our automaton, will cause the time complexity to blow up exponentially as the branching factor increases. To reduce the time complexity of the CKY-style parsing algorithm, and also to be able to simply adapt the arbitrary-branching automata to our existing binary-branching code, it is desirable to "embed" the arbitrary-branching automaton rules into binary-branching automaton rules.

Thanks to our lemma, a binary-branching, n-dimensional tree in left-child right-sibling form can be viewed as a simple array of n+1 nodes, where the (i+1)-st node (indexed starting at 0) is the (n-i)-th dimensional successor of the i-th node. The node indexed at i can have no successor in any dimension less than i because the lemma forbids it, and it can have no successor in any dimension greater than i because the node already has a parent in each such dimension, and so the inclusion of such a successor would make the tree more than binary-branching.

Thus, to embed an arbitrary-branching tree into a binary-branching structure, we can simply iterate along the proper binary-branching structure while checking for illegal links. When such links are found, we replace the parent node of the link with a new, unique label, and we introduce a new rule rooted in that same label. The tree for the new rule consists of the original parent node of the link along with its yield in that link. If the dimension of that link is less than n-1, then the higher-dimensional parent nodes in the tree are padded out by using a special node value which we shall call tau. Tau is not permitted to be the root or head of any rule in the automaton, and it is understood that a tau node is not part of the tree but is intended to help us "excavate" or reconstruct the arbitrary-branching structure after parsing is complete. Because tau is never the root or the head and occurs only in binary-branching rules, we can guarantee that each tau node will always have exactly one successor. The arbitrary-branching structure will be excavated in two steps: first, each and every tau node is replaced with its successor. Second, each uniquely labeled node that was generated in the embedding is replaced with its n-dimensional successor. Once the new rule is generated, it then must be recursively embedded until it too is reduced to a binary-branching structure.

Example embedding of a 3-dimensional tree:
3d example

Example embedding of a 4-dimensional tree (the dashed arrows indicate that the process continues until the tree is fully embedded):
4d example

We also need to preserve the heads of the yields when embedding the structure, which complicates things a bit. When a new tree is introduced from an i-dimensional link, its heads in dimensions less than i will be unaffected, so they can be imported directly from the pre-existing structure. In addition, the head offsets in dimensions greater than i must all be 1, as a value of 0 would make a a tau node a head, which is not permitted. In dimension i, we find that the head offset will simply be one less than the i-dimensional head offset of the pre-existing structure, as we have effectively cut off the first node in the i-dimensional local yield. If the new head offset is -1, then the first node in the pre-existing i-dimensional local yield was the head, and so the i-dimensional head of the new rule is irrelevant, as nothing will ever be hung beneath it. In this case, if we must designate a head, we will simply use a head offset of 0.

The heads of the pre-existing structures may change as well. In this case, we can simply do a bottom-up traversal of the binary-branching structure after it is fully embedded. In this traversal, we simply check for head offsets that are greater than 1, and when we find them, we reset them to 1. This keeps the head pointing to the node where the removed subforest will be attached, so that the head information can be reconstituted later.

In the second step of the tree excavation, we can recover the original heads by adding together the i-dimensional head offsets of the adjoined structures, where i is the dimension of the link that was originally removed in the embedding process. The sum will be the new i-dimensional head offset of the adjoined structure, which will be the same head offset used in the the arbitrary-branching rule that produced the adjoined structure.


automaton.embed(label,(n-1)-dimensional tree t, state) {
  for dim from n-1 to 0 {
    for i from dim+1 to n {
      if (t has link i) {
        restate t as unique t'
        u = filltree(t.link(i),i,tree.id_head_offset(i)-1)
        embed(unique label T', u, t')
        t.unlink(i)
      }
    }
    t = t.link(dim) if dim > 0
  }
  fix_head(tree,n-1)
  add(label,tree,state)
}

filltree(n-dim tree t, dim, offset) {
  tree.set_head_offset(offset,dim)
  for i from dim to n-1 {
     t = new tree with t as 1-dimensional child
     t.setstate(tau)
  }
  return t
}

findheads(n-dimensoinal tree t, ydim) {
  if (ydim == 0) {
    set_head_offset(0)
    return
  }
  fixheads(tree.link(ydim),ydim-1)
  if( tree.get_head_offset() > 1 ) {
    tree.set_head_offset(1)
  }
}
Posted by kellyia at July 21, 2004 07:35 PM