Plugin to Python NLTK

ID: 2.8
Task leader: jordi.saludes
Assignees: jordi.saludes
Status: Completed

Develop a Python plugin for GF (based on the planned C plugin) and connect it to the relevant parts of the Natural Language Toolkit (NLTK, http://www.nltk.org/).

Subtasks

2.8.1 Develop Python bindings to GF.

2.8.2 NLTK integration.

GF python bindings

Using the GF python bindings

This page shows how to use some of the functionality of the GF shell from inside Python.

Installation

Because of a GHC glitch, the bindings currently build only on Linux.

You'll need the source distribution of GF, GHC and the Python development files [1]. Then go to the Python bindings folder and build them:

 cd GF/contrib/py-bindings
 make

This builds a shared library (gf.so) that you can import into Python and use as shown below.

Testing installation

To test if it works correctly, type:

 python -m doctest example.rst

Examples

Loading a pgf file

First you must import the library:

% import gf

then load a PGF file, like this tiny example:

% pgf = gf.read_pgf("Query.pgf")

We can ask for the supported languages:

% pgf.languages()
[QueryEng, QuerySpa]

The start category of the PGF module is:

% pgf.startcat()
Question

Parsing and linearizing

Let us save the languages for later:

% eng,spa = pgf.languages()

These are opaque objects, not strings:

% type(eng) 
<type 'gf.lang'>

and must be used when parsing:

% pgf.parse(eng, "is 42 prime") 
[Prime (Number 42)]

Yes, I know it should have a '?' at the end, but there is no support for other lexers at this time.

Notice that parsing returns a list of gf trees. Let's save it and linearize the first tree in Spanish:

% t = pgf.parse(eng, "is 42 prime")
% pgf.linearize(spa, t[0])
'42 es primo'

(42 is not prime, of course, but this is a question: the '?' is missing at the end, remember?)
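
The parse and linearize steps above can be combined into a small translation helper. This is only a sketch: translate is a hypothetical name, not part of the bindings, and it assumes that parse returns a list of trees (possibly several for an ambiguous sentence) and that linearize returns a string, as in the session above.

```python
def translate(pgf, source, target, sentence):
    """Parse `sentence` in `source` and linearize every tree in `target`.

    `pgf` is assumed to behave like the object returned by gf.read_pgf:
    parse(lang, text) gives a list of trees and linearize(lang, tree)
    gives a string. One translation is returned per parse tree.
    """
    return [pgf.linearize(target, tree) for tree in pgf.parse(source, sentence)]


# Usage with the Query grammar (requires the gf module):
#   eng, spa = pgf.languages()
#   translate(pgf, eng, spa, "is 42 prime")
```

Returning a list rather than a single string keeps every reading when the sentence has more than one parse.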

Getting parsing completions

One of the good things about the GF shell is that it suggests which tokens can continue the line you are composing.

The bindings offer this too. Suppose we have no idea how to start:

% pgf.complete(eng, "")
['is']

So there is only one sensible thing to put in. Let's continue:

% pgf.complete(eng, "is ")
[]

It is important to note the blank space at the end; without it, the word counts as unfinished and we get the same suggestion again:

% pgf.complete(eng, "is")
['is']

But why is nothing suggested at "is "? At this point a literal integer is expected, so GF would have to present an infinite list of alternatives. I cannot blame it for refusing to do so.

% pgf.complete(eng, "is 42 ")
['even', 'odd', 'prime']

Good. I will go for 'even', just to be on the safe side:

% pgf.complete(eng, "is 42 even ")
[]

Nothing again, but this time the phrase is complete. Let us check it by parsing:

% pgf.parse(eng, "is 42 even")
[Even (Number 42)]
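
Since forgetting the trailing blank is easy, a thin wrapper can build the prefix from a list of finished words. This is a sketch under assumptions: next_tokens is a hypothetical helper, and pgf.complete is taken to behave as in the session above.

```python
def next_tokens(pgf, lang, words):
    """Return the tokens that may follow the already chosen `words`.

    Every word is followed by a blank, so the last word counts as finished
    instead of being offered again as its own completion. An empty list of
    words gives the empty prefix "".
    """
    prefix = "".join(word + " " for word in words)
    return pgf.complete(lang, prefix)


# Usage with the Query grammar (requires the gf module):
#   next_tokens(pgf, eng, ["is", "42"])   # asks pgf.complete(eng, "is 42 ")
```

An empty result is still ambiguous: as seen above, it can mean that a literal is expected or that the phrase is already complete, so parsing remains the way to tell the two apart.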

Deconstructing gf trees

We store the last result and ask for its type:

% t = pgf.parse(eng, "is 42 even")[0]
% type(t)
<type 'gf.tree'>

What's inside this tree? We use unapply for that:

% t.unapply()
[Even, Number 42]

This method returns a list with the head of the fun judgement and its arguments:

% map(type, _)
[<type 'gf.cid'>, <type 'gf.expr'>]

Notice that the argument is again a tree (gf.tree or gf.expr, it is all the same here).

% t.unapply()[1]
Number 42

We will repeat the trick with it now:

% t.unapply()[1].unapply()
[Number, 42]

and again, the same structure shows up:

% map(type, _)
[<type 'gf.cid'>, <type 'gf.expr'>]

One more time, just to get to the bottom of it:

% t.unapply()[1].unapply()[1].unapply()
42

but now it is an actual number:

% type(_)
<type 'int'>

We end up with a fully decomposed fun judgement.
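
The repeated unapply calls above can be folded into one recursive function. This is a sketch, not part of the bindings: to_nested is a hypothetical name, and it assumes what the session shows, namely that unapply returns a list [head, arg, ...] for a function application and a plain Python value for a literal. How functions without arguments behave is not shown above, so that case is untested here.

```python
def to_nested(tree):
    """Turn a gf tree into nested Python lists by calling unapply() recursively.

    Each application becomes [head_name, arg1, arg2, ...], where head_name is
    the printed form of the head (a gf.cid) and every argument is converted
    in turn. Literals are returned unchanged.
    """
    parts = tree.unapply()
    if not isinstance(parts, list):
        return parts  # literal leaf, e.g. the integer 42
    head, args = parts[0], parts[1:]
    return [repr(head)] + [to_nested(arg) for arg in args]


# Usage with the tree parsed above (requires the gf module):
#   to_nested(pgf.parse(eng, "is 42 even")[0])
```

Under those assumptions, the tree Even (Number 42) should come out as ['Even', ['Number', 42]], which is convenient for handing the structure to code that knows nothing about gf types.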


  1. On Ubuntu I got them by installing the package python-all-dev.