3rd PM - Workshop day

Minutes and notes from the working groups day.

Cultural heritage WG

Participants: Ramona, Inari, Milen, LauriC

The task is to verbalize an ontology. The ontology in this case is the Gothenburg City Museum ontology, which contains information about the museum objects in that particular museum. We don't need to prepare for unrestricted vocabulary or user input; at least not in text form.

GF grammar details

The GF grammar has two parts. The first is a direct verbalization, that is, just the ontology triples translated into GF syntax. As one triple contains only one piece of information, the sentences this grammar produces are of the type "Mona Lisa is a Painting. Mona Lisa is painted by Leonardo da Vinci. Mona Lisa is located in Louvre." This grammar is still useful as a lexicon together with facts about the items in the lexicon. Creating a concrete syntax necessarily requires some human work: the ontology does not contain translations for the terms in all the languages we want, and even for the languages it covers, it doesn't have all the linguistic information needed in a GF grammar. (WP2 and WP3 tools could be used to help the grammar writer; more on that later.)
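The one-triple-one-sentence behaviour of the direct verbalization can be sketched in Python as follows. This is only an illustration, not the GF grammar itself: the predicate names, the templates and the label table are assumptions made for the example.

```python
# Hypothetical triples; identifiers are illustrative, not taken from the ontology.
TRIPLES = [
    ("MonaLisa", "type", "Painting"),
    ("MonaLisa", "paintedBy", "Leonardo da Vinci"),
    ("MonaLisa", "locatedIn", "Louvre"),
]

# One English template per predicate; in GF this role is played by a linearization rule.
TEMPLATES = {
    "type": "{s} is a {o}.",
    "paintedBy": "{s} is painted by {o}.",
    "locatedIn": "{s} is located in {o}.",
}

# Human-readable labels for identifiers; this is the part that needs human work.
LABELS = {"MonaLisa": "Mona Lisa"}


def verbalize(triple):
    """Turn one (subject, predicate, object) triple into one sentence."""
    s, p, o = triple
    return TEMPLATES[p].format(s=LABELS.get(s, s), o=LABELS.get(o, o))


sentences = [verbalize(t) for t in TRIPLES]
print(" ".join(sentences))
```

Each triple yields exactly one sentence, which is why the output reads as a list of isolated facts rather than as a text.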

In addition to the direct verbalization, we also have a discourse-building grammar. It consists of GenText.gf and GenLexicon.gf. GenLexicon is a more refined version of the direct verbalization grammar: some irrelevant parts have been removed (such as numerical codes, which have no information value for a human) and some of the lexical information is already aggregated. For example, the ontology triples for a painting might look like this:

xx:Guernica   yy:paintedBy  zz:Picasso
xx:Guernica   cc:hasColour   dd:Black
xx:Guernica   cc:hasColour   dd:White
xx:Guernica   cc:hasColour   dd:Gray

We can see that there are multiple instances of the predicate hasColour, and we can aggregate that information into a single colour in the GF grammar, called BlackWhiteGray.
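A minimal Python sketch of this aggregation step, using the example triples above. The function names are hypothetical, and the naming convention (concatenating colour names in the order they appear) is an assumption; the notes do not say how the aggregated name is formed.

```python
from collections import defaultdict

TRIPLES = [
    ("xx:Guernica", "yy:paintedBy", "zz:Picasso"),
    ("xx:Guernica", "cc:hasColour", "dd:Black"),
    ("xx:Guernica", "cc:hasColour", "dd:White"),
    ("xx:Guernica", "cc:hasColour", "dd:Gray"),
]


def local_name(term):
    """Strip the namespace prefix: 'dd:Black' -> 'Black'."""
    return term.split(":", 1)[1]


def aggregate_colours(triples):
    """Collapse all hasColour triples of a subject into one colour name."""
    colours = defaultdict(list)
    for s, p, o in triples:
        if p == "cc:hasColour":
            name = local_name(o)
            if name not in colours[s]:  # ignore duplicate triples
                colours[s].append(name)
    return {local_name(s): "".join(names) for s, names in colours.items()}


aggregated = aggregate_colours(TRIPLES)
print(aggregated)
```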

(A more generic approach would be like this:

White : Colour ;
Black : Colour ;
Red   : Colour ;
Multicolour : Colour -> Colour -> Colour ;

Multicolour Red (Multicolour Black White)
Multicolour Black (Multicolour Red (Multicolour White Black))

We could generate every combination of colours with that, but as the task is to verbalize only the paintings in the Gothenburg City Museum ontology, we can instead collect every combination that actually occurs and spell each one out in the grammar. This means more items in the GF grammar, but it makes the language more fluent: many languages have specific words for some common combinations of colours, whereas with the Multicolour approach we'd have to choose one generic rule for combining colours, producing output such as "black and white and grey and blue and red".)
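The trade-off between the two approaches can be illustrated with a small Python sketch. The classes mirror the Multicolour constructor above, while the table of fluent names is a made-up example of the "spelled out" alternative, not data from the grammar.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Colour:
    name: str


@dataclass(frozen=True)
class Multicolour:
    left: "ColourTerm"
    right: "ColourTerm"


ColourTerm = Union[Colour, Multicolour]


def flatten(term):
    """List the basic colours of a term, left to right."""
    if isinstance(term, Colour):
        return [term.name]
    return flatten(term.left) + flatten(term.right)


def generic(term):
    """Generic rule: join every colour with 'and'."""
    return " and ".join(flatten(term))


# Hypothetical spelled-out entries for combinations that have a fluent name.
FLUENT = {frozenset({"black", "white"}): "black-and-white"}


def fluent(term):
    """Prefer a spelled-out entry; fall back to the generic rule."""
    return FLUENT.get(frozenset(flatten(term)), generic(term))


tree = Multicolour(Colour("red"), Multicolour(Colour("black"), Colour("white")))
print(generic(tree))
print(fluent(Multicolour(Colour("black"), Colour("white"))))
```

The generic rule covers every possible tree but produces the same mechanical pattern each time; the lookup covers only the listed combinations but can use whatever wording is natural in the language.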

Finally, GenText.gf is the grammar where we do the information aggregation on a semantic level. We build multiple discourse patterns, since there are many ways to give information about a painting. In practice this is implemented with a large number of dependent types: a sentence that tells a painting's author, colour and location is of type Painting Author Location, and the items are all checked to be compatible with the ontology. We can't build a sentence "Guernica was painted by Pablo Picasso and it is located in Helsinki" if the details don't match the direct verbalization grammar. We can create proof objects automatically for every original triple. There are also proof objects with 3 or more arguments (and they are created by ???).

Anyway, the functions in GenText look somewhat like this:

BuildSentence : (p : Painting) -> (a : Author) -> (l : Location) -> AuthorProofObject p a -> LocationProofObject p l -> FinalProofObject p a l -> Text ;
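The effect of the proof-object arguments can be imitated in Python, with the caveat that GF checks them at type-checking time while this sketch checks them at run time. The fact table, the function names and the Madrid location are illustrative assumptions, and the three-argument FinalProofObject is left out for brevity.

```python
# Facts that would come from the direct verbalization grammar (illustrative data).
FACTS = {
    ("paintedBy", "Guernica", "Pablo Picasso"),
    ("locatedIn", "Guernica", "Madrid"),
}


class ProofError(Exception):
    """Raised when no fact supports the requested combination."""


def author_proof(painting, author):
    """Return a proof token only if the ontology has the matching triple."""
    if ("paintedBy", painting, author) not in FACTS:
        raise ProofError(f"{painting} is not known to be painted by {author}")
    return ("AuthorProof", painting, author)


def location_proof(painting, location):
    if ("locatedIn", painting, location) not in FACTS:
        raise ProofError(f"{painting} is not known to be located in {location}")
    return ("LocationProof", painting, location)


def build_sentence(painting, author, location):
    """Build the aggregated sentence; fails unless both proofs exist."""
    author_proof(painting, author)
    location_proof(painting, location)
    return f"{painting} was painted by {author} and it is located in {location}."


print(build_sentence("Guernica", "Pablo Picasso", "Madrid"))
```

Calling build_sentence("Guernica", "Pablo Picasso", "Helsinki") raises ProofError, which corresponds to the sentence that cannot be built in the grammar.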

Tools to help the grammar writer

We want translations of the existing terms, not new terms. We can use WP3 lexicon management tools to search FactForge's art terms. We can also use the example-based grammar writing tool.

Future work

Allow users to query the ontology and output exactly the answers they want. But that's not for the next deliverable.

Example-based grammar writing WG

Participants: Ramona, LauriC, LauriA, Aarne, Jordi, John, Inari, everyone

User interface

  • Questions in natural language: How do you say "boring pizza" is easier for translators than How do you say Mod Boring Pizza
    • also an option to show the abstract syntax tree
  • Some categories are not very intuitive for humans: Kind is a noun phrase without an article, so if we ask how to say Boring Pizza, a user might add an article, and then the user input can't be parsed as a CN.
  • Ask for another example if the example given by the user is ambiguous (for example, singular and plural are identical).
  • Give the user examples of the rules formed by their input: if Boring Fish is "boring fish", then Boring Cheese is "boring cheese".
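The last two points can be sketched in Python: infer a word-order rule from one user example, play it back on a new pair of words, and detect when an example is not informative enough. Everything here (function names, the two-way word-order rule, the Italian words) is an assumption for illustration; the actual tool works on GF trees, not on raw strings.

```python
def learn_order(adjective, kind, example):
    """Infer from one example whether the adjective precedes the kind."""
    tokens = example.lower().split()
    if tokens == [adjective, kind]:
        return "adjective-first"
    if tokens == [kind, adjective]:
        return "kind-first"
    raise ValueError(f"cannot analyse example: {example!r}")


def apply_order(order, adjective, kind):
    """Apply the learned rule to a new pair, to show the user what it implies."""
    if order == "adjective-first":
        return f"{adjective} {kind}"
    return f"{kind} {adjective}"


def needs_another_example(singular, plural):
    """An example is ambiguous if it cannot tell singular and plural apart."""
    return singular == plural


rule = learn_order("boring", "fish", "boring fish")
print(apply_order(rule, "boring", "cheese"))
```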

Hybrid Systems & Patents WG

Minutes

Participants: Aarne, Ramona, Milen, Borislav, Cristina, Meritxell

Decisions taken

Write GF grammars to solve problems in the SMT system: compound and biological names, word reordering, gender agreement.

Biological names are different from compound names and raise different challenges. LauriC can provide a database of compounds and biological names.

Increase parser robustness by chunking the claims, parsing the chunks separately and recombining the results with the help of the grammar. Reduce ambiguities with bottom-up disambiguation based on the corpus.

Simplify the query language because some of the English queries are weird in other languages.

Will need someone else to work on French grammars.

The user interface allows querying the retrieval system using the controlled language, free text and a combination of both.

Results are shown with the words relevant to the query highlighted. If it is not possible to find the words in translated documents (a lexicon is needed), then highlight the whole sentence.

GF and SMT tools have been shared and installed to start hybridisation studies.

TO DO LIST

Generate synthetic corpora and alignments from GF grammars.

Provide a translation of the lexicon.

Write simple abstract syntax representation and grammar for the results. Write templates for each topic.

Comparison between equivalent tools in GF and SMT systems.

Math WG

Patents WG

This WG is the same as Hybrid Systems & Patents WG.

Translators tools WG

Participants: Lauri A, Lauri C, Chunxiang, Inari

User interface

  • We'll aim to use already existing stuff as much as possible: Krasimir's editor as a base, decorate that.
  • User authorization; discussed options such as
    • Google authorization module
    • LDAP
    • OpenID
    • LauriC will look at translation project management websites (for example Project Open) to see if they have something to offer for us.
    • Discussed usability; we'd like to offer the possibility to use a gmail or facebook or whatever account if you're afraid of technology and think it's too complicated otherwise, or to create your own account if you're afraid of selling your soul to big corporations.
  • Lexicon management
    • We don't want to have another translation memory. Not GF-like to remember strings.
    • Modifying lexicon means modifying grammar; lexicon is just a name for leaf productions. Could example-based grammar writing also allow the user to modify idioms?
    • Hierarchy of demands, from the minimum to the ideal:
      • A complaint button, "this is wrong please fix it", and the message goes to a grammar writer who is actually a person.
      • Correct an error by inserting stuff manually, recompile the grammar and then it works (supposing the user did the right thing in the first place).
      • Error correction on the fly, corrections straight to PGF, no need for recompilation, user doesn't even need to know anything about PGF and compilation
    • Tabular editor, new features:
      • Implement "search for similar terms" query, and allow going higher up the tree
      • Modify terms in the row
      • Add new rows
      • Delete rows
      • By clicking a term, make it appear in the text in the editor tab
      • If term has many translations in one language, show only 1 row and add an expand button in the slot for the language that has the multiple translations

Back-end

  • LauriC has added a term tab to Krasimir's editor. Currently just a separate tab, no connection to the editor tab, but we should be working on it.
  • We have PGF stuff on our own server. Everything works fine.
  • The code for document manager is in darcs, but it's not working and there is no documentation.
  • Chunxiang's tabular editor: ideal workflow should be like this
    • Java backend: gets access to SPARQL endpoint, does the quer(y|ies)
    • JavaScript frontend: outputs the results nicely, allows user to go up or down in a hierarchy (which means more SPARQL queries done by the Java backend); allows user to delete concepts, add concepts and their translations and modify translations fetched from the ontology. All this information is stored in some place, probably in JSON.
    • Java backend: gets access to the storage, outputs nice GF files. Probably not touching existing GF files, but adding new modules: add a new abstract syntax, and concrete syntaxes for every language the user has chosen. Add the automatically created file to the human-made LexiconXyz.gf only as an import.
      • where to get the information about abstract types? One solution: the querying is connected to the existing grammar; not just saying "I want flowers", but "I want all things similar to this pizza", and if pizza's abstract syntax type is Kind, we'll know that all things returned by the query will be also of the type Kind.
    • Open question: how to modify the PGF? But anyway, even if the PGF is modified, we should add all changes to the gf files as well, because a human might want to modify the grammar later and recompile to a new PGF. LauriA is interested in working on it.
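The "storage to GF files" step of the workflow above could look roughly like the following sketch. It is written in Python rather than Java purely for brevity. The JSON layout, the module names GenTerms and Foods, and the plain-string linearizations are all assumptions; real linearizations depend on the lincat of the category in the existing grammar.

```python
import json

# Hypothetical stored state of the tabular editor.
STORED = json.loads("""
{
  "category": "Kind",
  "terms": [
    {"id": "Pizza",  "translations": {"Eng": "pizza",  "Ita": "pizza"}},
    {"id": "Cheese", "translations": {"Eng": "cheese", "Ita": "formaggio"}}
  ]
}
""")


def abstract_module(name, base, data):
    """New abstract module extending the existing one; existing files stay untouched."""
    funs = [f"    {t['id']} : {data['category']} ;" for t in data["terms"]]
    return "\n".join([f"abstract {name} = {base} ** {{", "  fun"] + funs + ["}"])


def concrete_module(name, base, lang, data):
    """One concrete module per chosen language; terms without a translation are skipped."""
    lins = [
        f'    {t["id"]} = "{t["translations"][lang]}" ;'
        for t in data["terms"]
        if lang in t["translations"]
    ]
    header = f"concrete {name}{lang} of {name} = {base}{lang} ** {{"
    return "\n".join([header, "  lin"] + lins + ["}"])


print(abstract_module("GenTerms", "Foods", STORED))
print(concrete_module("GenTerms", "Foods", "Ita", STORED))
```

Because every term comes from a query anchored at an existing item of type Kind, the generator can write the same category for all of them, which is the solution to the abstract-type question suggested above.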