3. The MOLTO Translation Tools Prototype

This section describes the code constituting the prototype. The code base of the translation tools extended prototype currently consists of the following parts.

MOLTO TT editor
GlobalSight (adapted for MOLTO)
TermFactory (adapted for MOLTO)
OntoText API
MOLTO Grammar Tools API

This document describes the prototype's software packages, their installation, use and current limitations. The last two components are not discussed further in this document, because they are described in other MOLTO deliverables. The services currently provided by the GF server are outlined in the MOLTO Grammar Tools API document. The GlobalSight WS API was described in the MOLTO Translation Tools API document. TermFactory is documented at length in the TermFactory manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml .

3.1 The MOLTO Translation Tools (TT) Editor

This section describes the GF translation editor originally developed by Bringert and Angelov at UGOT and reworked at UHEL.

To guide the development of a suitable translation editor API to support MOLTO translation needs, UGOT created a prototype web-based translation editor. It is implemented using the Google Web Toolkit and usable for authoring with small multilingual grammars. To use it from the web, all that is needed is a reasonably modern web browser. To install it locally, one needs in addition a web server, MySQL database and GF services.

The editor runs entirely in the web browser, so once you have opened the web page and have documents and grammars loaded, you can continue translation editing while you are offline.

3.1.1 Software requirements

In order to install the editor, you need to have the following components:

The editor code itself (in the eclipse package)
For developer version only:
- Eclipse Helios JEE (3.6)
- Google Web Toolkit plugin (tested with version 2.3.1)
Web server
- Apache (tested with 2.2.14 on Ubuntu)
- FastCGI (libapache2-mod-fastcgi)
Database
- HSQL (tested with version 1.8.1)
- HSQL-MySQL (1.8.1) -- a slightly modified version: hsql-mysql-1.8.1-molto.zip
- MySQL server (tested with 5.1.54 and 5.1.62)
GF server
- GF (tested with 3.3.3)
- Haskell (tested with ghc 7.0.4, cabal-install 0.10.2)

In this section we assume that Apache, MySQL and the GF server have already been configured. Please see the Appendix for instructions on these background settings.

3.1.2 Installation

3.1.2.1 Developer version

The prototype TT editor code is packaged as an Eclipse project archive http://tfs.cc/molto/molto-tt-0.9-linux-eclipse-20120529.zip ready for import in Eclipse (Helios).

Import the project in Eclipse. You should have the Google Web Toolkit plugin installed (tested with version 2.3.1). The runtime editor files are found in TT-0.9/www/editor/. To install the runtime, place the following files under the Apache2 server root (here /var/www) as shown.

/var/www/editor$ ls 
grammars  index.html  org.grammaticalframework.ui.gwt.EditorApp  WEB-INF

When you have placed the files under /var/www, then you can launch the project in Eclipse. Choose from the menu Run -> Run configurations -> Web Application -> (new configuration). In the tab Server untick Run built-in server. If you have put the files in directory /var/www/editor, then the launch address will be 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997.
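
The launch address above is made up of the web server address, the editor path, and the GWT code server passed in the gwt.codesvr query parameter. The following Python sketch shows that composition under the defaults quoted in the text; the function name and its parameters are illustrative only and are not part of the prototype.

```python
from urllib.parse import urlencode


def launch_url(host="127.0.0.1", port=8888, path="/editor/index.html",
               codesvr="127.0.0.1:9997"):
    """Compose the development-mode launch address for the editor."""
    # safe=":" keeps the host:port pair of the code server unescaped
    query = urlencode({"gwt.codesvr": codesvr}, safe=":")
    return f"{host}:{port}{path}?{query}"


print(launch_url())
```

If the files are placed in a directory other than /var/www/editor, only the path argument should need to change.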

Web server: The Apache2 fastcgi and actions modules must be enabled for the services. See the installation notes at the end for a sample Apache2 virtual host that handles the services on port 8888 (the default).

GF server: The editor also requires an installation of the GF server. The server binaries are content-service (for authentication and simple mysql database management) and pgf-service (for gf grammars). When compiling, the cabal option --global should be used; the GF service binaries are then installed in /usr/local/bin. They can be copied or linked under the fcgi-bin directory of the web server (by default Apache2) as follows.

/var/www/fcgi-bin$ ls -l 
content-service -> /usr/local/bin/content-service
pgf-service -> /usr/local/bin/pgf-service
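
The linking step can also be scripted. The sketch below is one possible way to do it in Python, assuming the directory layout shown above; the function link_services is hypothetical and not part of the GF distribution.

```python
from pathlib import Path

# Service binaries named in the text.
SERVICES = ("content-service", "pgf-service")


def link_services(bin_dir, fcgi_dir, names=SERVICES):
    """Create fcgi_dir/<name> -> bin_dir/<name> symlinks; return the links made."""
    bin_dir, fcgi_dir = Path(bin_dir), Path(fcgi_dir)
    fcgi_dir.mkdir(parents=True, exist_ok=True)
    made = []
    for name in names:
        target = bin_dir / name
        link = fcgi_dir / name
        if not target.exists():
            raise FileNotFoundError(f"missing service binary: {target}")
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link or copy from an earlier install
        link.symlink_to(target)
        made.append(link)
    return made
```

On the layout described here the call would be link_services("/usr/local/bin", "/var/www/fcgi-bin"), which normally needs write access to the web server root.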

Database: The TT editor back end requires an installation of MySQL, HSQL and the Haskell library hsql-mysql by Krasimir Angelov. Further instructions on how to create a database for MOLTO TT tools are in the installation notes.

The content service needs to read mysql database connection parameters from file /usr/local/bin/fpath. It should be in the same directory as content-service and contain four tokens, the mysql host and database names and the database owner credentials.

/usr/local/bin$ cat fpath
localhost moltodb moltouser moltopass
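
To make the layout of fpath explicit, here is a small Python sketch that reads the four tokens into named fields. The field order (host, database, user, password) is read off the sample line above; the names DbParams and parse_fpath are hypothetical and do not come from the content-service source.

```python
from typing import NamedTuple


class DbParams(NamedTuple):
    host: str
    database: str
    user: str
    password: str


def parse_fpath(text):
    """Split the fpath contents into its four whitespace-separated tokens."""
    tokens = text.split()
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, got {len(tokens)}")
    return DbParams(*tokens)


params = parse_fpath("localhost moltodb moltouser moltopass\n")
print(params.host, params.database)
```

A check of this kind can catch a missing or extra token before the content service is started.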

Then, the database is created by typing the following:

/usr/local/bin$ ./content-service fpath

Operations handled by the content service:

update_grammar (grammar cache)
delete_grammar
grammars (listing)
save (document to mysql db)
load (document)
search (documents)
delete (document)

Sign in: The prototype editor currently uses the Google authentication API for sign in. Authentication and authorization for Google APIs allow third-party applications to get limited access to a user's Google accounts for certain types of activities. A user needs to have a Google account to sign in to the application.

3.1.2.2 User version

All back-end requirements are needed also for the user version. Now, instead of opening the package in Eclipse, the only thing needed is to place the following files under Apache2 server root (here /var/www) as shown.

/var/www/editor$ ls 
grammars  index.html  org.grammaticalframework.ui.gwt.EditorApp  WEB-INF

Then, to run the editor, just type the address 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997 into a browser.

3.1.2.3 Limitations

Ideally, the same login should work throughout the different parts of the distributed toolkit. There should be some group scheme to set group level access restrictions. Eventually, we may want to provide MOLTO single-sign-on as a replacement for Google authentication.

3.1.3 Grammar manager

The prototype editor has a simple grammar manager that is intended to let a user upload her grammars to the editor's grammar cache under her name. The cache is kept on the editor server for reasons of speed and XSS restrictions. The user chooses the current grammar from among the cached grammars using a drop-down list.

3.1.3.1 Limitations

The grammar manager is not yet completed.

3.1.4 Document manager

The prototype editor has a simple document manager that saves a translated document in the mysql database and retrieves one from it using ContentService. The current document is saved in the database using a diskette icon on the editor page. The Documents tab shows the currently saved documents and allows the user to load a selected document for continued translation.

3.1.4.1 Limitations

Naming of documents is not yet supported. Both the grammar manager and document manager remain to be linked to the TMS.

3.1.5 Term manager

The TT editor includes a simple tabular equivalents editor for searching and editing translation correspondences from the web of data, including TermFactory services. The equivalents editor is an independent web application that may also be used standalone or as a plugin to other applications. When complete, the equivalents editor lets the user extend their GF grammars with terms entered in the term editor and/or upload them as term proposals to TermFactory.

3.1.5.1 Installation

The equivalents editor was built with the ExtJS javascript library. It can be downloaded from http://tfs.cc/molto/molto-term-editor.tgz. Unpack it and put the whole molto_term_editor directory under /var/www/ (or wherever your web server wants them, for example in Windows the path is probably C:\Program Files\Apache\htdocs). Open the file editor_sparql.html in a browser.

Note that this is also included in the complete editor as one of the tabs. As for function, the versions are identical. The screenshot below is from the standalone version.

3.1.5.2 Use

The term editor consists of two tabular grids. In the first (left side) grid, enter a term in the text input and opt for wider or narrower concepts. In the latter case (the default) the editor shows on the right another grid of concepts that are classed narrower than the search term in the data source (by default, OntoText FactForge) and their designations in a predefined selection of languages. In the former case, the editor fills out the left side grid with concepts that are classed in the data source as wider than the search term. Clicking on one of them does a search for its subconcepts and terms, shown in the right side grid.
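
The two search directions can be pictured as two SPARQL query shapes. The Python sketch below builds such queries as strings. It is not the query the equivalents editor actually sends: the use of skos:broader and rdfs:label, the language list, and the example URI are all assumptions made for illustration.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def concept_query(concept_uri, direction="narrower", langs=("en", "fi"), limit=50):
    """Build a SPARQL query for concepts narrower or wider than concept_uri."""
    if direction == "narrower":
        # concepts whose broader concept is the search concept
        pattern = f"?concept skos:broader <{concept_uri}> ."
    elif direction == "wider":
        # concepts that the search concept names as broader
        pattern = f"<{concept_uri}> skos:broader ?concept ."
    else:
        raise ValueError("direction must be 'narrower' or 'wider'")
    lang_filter = " || ".join(f'lang(?label) = "{code}"' for code in langs)
    return "\n".join([
        f"PREFIX skos: <{SKOS}>",
        f"PREFIX rdfs: <{RDFS}>",
        "SELECT ?concept ?label WHERE {",
        f"  {pattern}",
        "  ?concept rdfs:label ?label .",
        f"  FILTER ({lang_filter})",
        "}",
        f"LIMIT {limit}",
    ])


# Hypothetical concept URI, used only to show the generated text.
print(concept_query("http://example.org/food#Cheese"))
```

The default direction is "narrower", mirroring the default of the editor described above.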

The term grid is editable and the editor remembers the user's edits to the cells in the grid.

3.1.5.3 Limitations

The data source and choice of languages are not yet user definable. The editor is not yet connected to the TermFactory or GF grammar back ends.

3.1.6 Editor

In the current version, there is a sign-in box and tabs for grammars, documents, editor, and terms, plus two to query and browse the loaded grammar. The latter services are familiar from other GF front ends and based on the GF grammar Web API.

3.1.6.1 Use

After sign-in, the editor calls content-service to show the logged-in user's grammars from the grammarusers mysql table in the grammar list. The user chooses a domain grammar. This brings into view the initial vocabulary known by the grammar as fridge magnets to choose from. Alternatively, the user can type or paste text in the editor window. At every new input, the active translation unit is sent to the back end for translation, and the set of fridge magnets is updated. When a translation unit is complete and translatable, it is simultaneously translated to all the available languages and the translations are shown on the screen (in blue). If an input is not parsable, the editor underlines the unparsable part. The user can back off to the point of deviation using backspace. In addition, there is a button for clearing the input.
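
The interaction described above (magnets for the next word, underlining of the unparsable tail) can be illustrated with a toy model. The Python sketch below replaces the GF grammar and parser with a hard-coded list of accepted sentences, so it only mimics the behaviour seen by the user; the sentences and function names are invented for the example.

```python
# Toy stand-in for a grammar: the complete sentences it accepts (hypothetical).
SENTENCES = [
    "this cheese is fresh",
    "this cheese is very fresh",
    "this wine is warm",
    "that wine is fresh",
]
ACCEPTED = [s.split() for s in SENTENCES]


def magnets(prefix):
    """Words that can follow prefix in at least one accepted sentence."""
    n = len(prefix)
    return sorted({s[n] for s in ACCEPTED if s[:n] == prefix and len(s) > n})


def parsable_prefix(tokens):
    """Longest leading part of tokens that some accepted sentence starts with."""
    n = 0
    while n < len(tokens) and any(s[:n + 1] == tokens[:n + 1] for s in ACCEPTED):
        n += 1
    return tokens[:n]


def is_complete(tokens):
    """True if tokens form a whole accepted sentence (a translatable unit)."""
    return tokens in ACCEPTED


typed = "this cheese was fresh".split()
good = parsable_prefix(typed)
print("parsable part:", good)
print("to underline:", typed[len(good):])
print("magnets after parsable part:", magnets(good))
```

Because the model scans the input from the left, it shares the strict left-to-right orientation of the prototype.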

The editor guides the text author by showing a set of fridge magnets and offers autocompletion to hint how a text can be continued within the limits of the current grammar.

3.1.6.2 Limitations

The prototype gives a first rough idea of how a web-based GF translation editor could work. At present, however, it remains oriented to a very small vocabulary (fridge magnets are not apt to work well with thousands of words). It is also doubtful whether the setup is fast enough for the amount of interaction generated at the speeds of professional translation. How the editor and the back end can best work together needs to be reconsidered. A related limitation is the strict left-to-right orientation of the parsing. UGOT seems to be working on a robust parser which allows other ways of combining parsing and editing. The proper disposition of the translation result has not been worked out yet.

3.2 The extended translation tools prototype

We now move on to the extended prototype. We first recapitulate how the extended translation tools extend the single-translator scenario to a community of translators collaboratively using and maintaining MOLTO translation tools.

3.2.1 User management

For more flexibility (as well as vendor independence), the open source LDAP (Lightweight Directory Access Protocol) based user management implementation from GlobalSight has been adapted for MOLTO. It allows distinguishing different roles and user groups, and controlling access to resources by roles. The GlobalSight user management solution has been conservatively extended for the needs of MOLTO TermFactory users. The following screenshot displays a user's roles as an ontology editor.

Term ontology management roles are defined per domain, where a domain is represented by a regular expression on ontology URIs. The MOLTO GlobalSight user management system lets a company project administrator create users and grant them MOLTO TermFactory ontology read and write permissions. The TermFactory back end GateService reads the permissions off the GlobalSight LDAP directory and database and controls access to TermFactory content accordingly. If a user's credentials are not sufficient, TermFactory Gate will not permit term ontology queries or commits. The MOLTO permissions come over and above any constraints that ontology endpoints may impose on the content they manage. They enable fine grained project level control on who is allowed to do what to shared or restricted TermFactory resources.
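
As a rough illustration of per-domain permissions, the Python sketch below models a grant as a regular expression over ontology URIs together with read and write flags. It does not reproduce the GateService logic: the Grant class, the allowed function, the example URIs, and the choice of whole-string matching are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    domain: str          # regular expression over ontology URIs
    read: bool = False
    write: bool = False


def allowed(grants, uri, action):
    """True if some grant whose domain matches uri permits the action."""
    if action not in ("read", "write"):
        raise ValueError("action must be 'read' or 'write'")
    # Assumption: the domain expression must match the whole URI.
    return any(re.fullmatch(g.domain, uri) and getattr(g, action) for g in grants)


# Hypothetical grants for one user.
grants = [
    Grant(r"http://example\.org/food/.*", read=True, write=True),
    Grant(r"http://example\.org/.*", read=True),
]
print(allowed(grants, "http://example.org/food/Cheese", "write"))
print(allowed(grants, "http://example.org/art/Painting", "write"))
```

In this model a user may commit to the food ontologies but only query the others, which is the kind of project-level distinction the permissions are meant to express.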

3.2.2 Document management

The simple document manager of the prototype editor remains to be upgraded to a more sophisticated XLIFF based document manager built using the GlobalSight document management API. See the MOLTO TT API document for more detail.

3.2.3 Lexical resources

A key consideration for the usability of MOLTO translation is the ease with which its text coverage can be extended by a user community. We need to pay great attention to adaptability. The most important factor in extensibility is lexical coverage. Grammatical coverage can be developed and maintained with language engineering, and grammatical gaps can often be circumvented by paraphrasing. In contrast, paraphrasing is not a real option for special domain terms. There are two cases to consider: either the abstract grammar misses concepts, or concrete grammars for some language/s are missing equivalents. In the first case, we need to extend the domain ontology and its abstract grammar. In the second case, we need to add terms.

For both ontology and term management, we bring to MOLTO the TermFactory ontology-based terminology management concept. TermFactory is a system of distributed multilingual term ontology repositories maintained by a network of collaborative management platforms. It has been described at length in the TermFactory Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml.

Through the equivalents editor, the user of the MOLTO translation editor can directly query and edit term equivalents for concepts in available ontologies, either already in TermFactory or 'raw' from the Web of Data, in particular from the OntoText services serving data from the FactForge repository.

3.2.3.1 Term management

Say for instance there is no equivalent listed for cheese in some language's concrete grammar FooLang. The author/translator can use the equivalents editor to query for terms for the concept food:Cheese in TermFactory or do a search through OntoText services for candidate equivalents, or, if she knows the answer herself, submit equivalents through the equivalents editor. The new equivalent/s are saved in the user's own MOLTO lexicon, and submitted to TermFactory as term proposals for the community to evaluate.

3.2.3.2 Ontologies

If there is a conceptual gap not easily filled in through the equivalents editor, there is the option of forwarding the problem to an appropriate TermFactory collaborative platform. This route is slower, but it gives a better guarantee of quality in the longer run, as inconsistency or duplication of work may be avoided. Say there is no concept in the domain ontology for a new notion that occurs in the source text. In easy cases, new concepts can be added through the equivalents editor, subclassing some existing concept in the ontology. In more complex cases, where negotiations are needed in the community, an ontology extension proposal is submitted through a TermFactory wiki. TermFactory offers facilities for discussing and editing ontologies and their terms. In due time, the modified ontology gets implemented in a new release of the GF domain abstract grammar.

3.2.3.3 Ontology-grammar interface

TermFactory ontologies are extensible and support reasoning. Instead of implementing domain ontology-to-grammar bridges over and again for every new domain and application, it seems more promising to take advantage of the semantic network structure of (term) ontologies. Suppose verbalizations are already defined for a selection of upper or middle level ontologies. Special domain ontologies can subclass them and thereby also inherit the verbalizations that go with the superclasses and properties. UHEL is currently looking at the generalization of the MOLTO museum case ontology-to-grammar mapping in this direction.
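
The inheritance idea can be sketched in a few lines of Python. In this toy model a class without a verbalization of its own falls back to the nearest superclass that has one. The class names and phrases are invented for the example and do not come from the museum case ontology or from the UHEL mapping.

```python
# Hypothetical subclass links (child -> parent) and explicit verbalizations.
SUPERCLASS = {
    "OilPainting": "Painting",
    "Painting": "Artwork",
    "Artwork": "Thing",
}
VERBALIZATION = {
    "Artwork": "work of art",
    "Painting": "painting",
}


def verbalize(cls):
    """Return the verbalization of cls or of its nearest superclass that has one."""
    seen = set()
    while cls is not None and cls not in seen:  # 'seen' guards against cycles
        if cls in VERBALIZATION:
            return VERBALIZATION[cls]
        seen.add(cls)
        cls = SUPERCLASS.get(cls)
    return None


print(verbalize("OilPainting"))
```

A special domain ontology would then only need to add entries where the inherited wording is not specific enough.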

3.2.4 Translation editing

The TT translation editor is just a prototype. Different scenarios and platforms may call for different combinations of its features. One way to go is to extend the prototype with further tabs and facilities for CAT tool support. But there is also the opposite alternative to consider: calling MOLTO translation tool services from a third-party editor. GlobalSight has two built-in translation editors, called popup editor and inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in javascript using the FCKEditor library. It might just be feasible to embed MOLTO prototype editor functionalities into the GlobalSight editor(s). In the GlobalSight setup, there is already support for importing cut-and-dried MT translations from an MT service, but here we are talking about something rather more intricate.

It is not immediately obvious which route would provide least resistance. From the point of view of GF usability, finding a neat way of embedding GF editing functions in third party translation editors could be a better sales position than trying to maintain a whole new MOLTO translation environment. (Unless of course, the new environment is clearly more attractive to targeted users than existing ones.) We may also try to have it both ways.

3.2.5 Reviewing/feedback

It was noted above that blind translation in the case of incomplete or inadequate coverage in resource grammars can occasion a round of reviewing and giving feedback on the translations before publication. This part of the process is in its main outlines familiar from the translation industry workflow, and can be implemented as a variation of it. In the MOLTO workflow, reviewer comments are not returned (just) to the human author/translator(s), but they should have repercussions in the ontology and grammar management workflows. This part requires modifying and extending the existing GlobalSight revisioning tools to communicate with the MOLTO lexical resources and grammar services. The GlobalSight revisioning tools now use email as the human-to-human communication channel. We probably want to use a webservice channel for machine-to-machine communication, and possibly some web commenting system as an alternative to email.

3.2.6 Grammar engineering

To the extent that grammar engineering can be delegated to translation tool users, it must happen transparently without requiring knowledge of GF. One way to do this is through what is known as example-based grammar writing in GF. Example-based grammar writing is a new GF technique for reverse-engineering GF source from example translations. It can play a significant role in the translation-to-grammar feedback cycle. This part of the TT API will be borrowed from the MOLTO Grammar Developer Tools API.

The following sections describe which parts of the above list are already in place in the prototype and what remains to be done.

3.3 GlobalSight

GlobalSight (http://www.globalsight.com/) is an open source Translation Management System (TMS) released under the Apache License 2.0. Version 8.2 was released on Sept 15, 2011. As of version 7.1 it supports the TMX and SRX 2.0 Localization Industry Standards Association standards.[2] It was developed in the Java programming language and uses the MySQL database and OpenLDAP directory software. GlobalSight also supports computer-assisted translation and machine translation.

According to the documentation, GlobalSight has the following features:

Customizable workflows, created and edited using graphical workflow editor
Support for both human translation and fully integrated machine translation (MT)
Automation of many traditionally manual steps in the localization process, including: filtering and segmentation, TM leveraging, analysis, costing, file handoffs, email notifications, TM update, target file generation
Translation Memory (TM) management and leveraging, including multilingual TMs, and the ability to leverage from multiple TMs
In Context Exact matching, as well as exact and fuzzy matching
Terminology management and leveraging
Centralized and simplified Translation memory and terminology management
Full support for translation processes that utilize multiple Language Service Providers (LSPs)
Two online translation editors
Support for desktop Computer Aided Translation (CAT) tools such as Trados
Cost calculation based on configurable rates for each step of the localization process
Filters for dozens of filetypes, including Word, RTF, PowerPoint, Excel, XML, HTML, Javascript, PHP, ASP, JSP, Java Properties, Frame, InDesign, etc.
Concordance search
Alignment mechanism for generating Translation memory from previously translated documents
Reporting
Web services API for programmatic access to GlobalSight functionality and data
Integrated with Asia Online APIs for automated translation

3.3.1 Installation

The latest full Linux install version of GlobalSight is 7.1.0.x. It can be updated to the current version 8.2.2.0 using publicly available upgrade packages. The GlobalSight 7.2.0.0 base version and the upgrade packages are available from SourceForge. (Copies are available from tfs.cc under /srv/GlobalSight_backup/upgrade.) More detailed install instructions, including scripts to install LDAP for GlobalSight, can be found at http://tfs.cc/globalsight-molto-install/. A fully functional GlobalSight site also needs access to email services.

To upgrade from a working install of GlobalSight 8.2.2.0 to MOLTO GlobalSight, download, unpack and run http://tfs.cc/molto/GlobalSight_Installer_8.2.2.1.zip.

There is also a complete MOLTO GlobalSight eclipse project archive at http://tfs.cc/molto/molto-globalsight-8.2.2.1-linux-eclipse-20120529.zip containing the source as well as the runtime.

3.3.2 MOLTO GlobalSight

MOLTO GlobalSight differs from GlobalSight out of the box in two ways. First, MOLTO GlobalSight extends GlobalSight user roles to terminology editing. This is discussed in more detail in connection with TermFactory. Second, GlobalSight has two built-in translation editors, called popup editor and inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in javascript using the FCKEditor library. MOLTO GlobalSight extends the selection by embedding the MOLTO TT editor as a third option on the editor menu:

Clicking the option opens the MOLTO TT editor in another window.

3.3.2.1 Limitations

As yet, content from the document under translation is not automatically imported into the MOLTO TT editor. Content can be cut and pasted into the MOLTO TT editor.

3.4 TermFactory

The MOLTO TermFactory prototype consists of the generic TermFactory codebase plus MOLTO related ontology content. At present, such content comprises the English-Finnish WordNet ontology. Integration of the TermFactory back-end with the MOLTO KRI over JMS is underway.

The TermFactory codebase consists of

a term ontology query/editing back end run as an Axis2 Tomcat web service
a Tomcat webapp that provides standalone term ontology query form and editor
a MediaWiki installation with a TermFactory editor extension
a link to the Disqus comment system

TermFactory is an architecture and a workflow for Semantic Web based, multilingual, collaborative terminology work. In practice, this means that it applies Semantic Web and other document and language technology standards to the representation of multilingual special language terms and the related concepts. It also provides a plan for how such terminologies can be collected, updated, and agreed on by professionals (not only terminology professionals) all over the globe, during their everyday work on virtual work platforms over the web. As a whole, TF could be termed a semantic web framework for multilingual terminology work.

TF provides

ontology and terminology formats
format conversions
query and edit tools
repositories
web services

for people to work on terms jointly or separately, building on the results of the work of others, while maintaining quality and consistency between the different contributions.

As a prototype, there is a MediaWiki platform for human-to-human collaboration on collecting terminological data, plus a TF editor plugin for conveying the results of the collaboration into TermFactory ontology format. Here is a snapshot of a random MOLTO TF concept in the Wiki.

MOLTO Wiki TF Editor page

3.4.1 Use

MOLTO TermFactory MediaWiki is used in the usual way a wiki works. In the demo prototype, it has been populated with the Finnish-English WordNet (ca. 100K concepts, 2 languages, ca. 200K terms per language). The pages are generated automatically on demand. A WordNet page currently consists only of a set of iframes and links to related lexical resources on the web. In actual use, each category (WordNet is one) may generate its own boilerplate page design to help users describe and discuss the concepts of a category and their designations in different languages. A commenting system is in place that can be shared between different platforms and applications. The discussion threads are indexed by the URI of the relevant resource.

The TermFactory ontology content related to a resource can be queried and edited on the MediaWiki platform using a TermFactory ontology editor extension, shown at the top of the page as the Entry Editor tab. Below is a snapshot showing the TF editor opened to the TermFactory entry corresponding to the chosen WordNet term.

MOLTO Wiki TF Editor page

Instead of going by way of fill-in forms, the TermFactory approach is to support direct WYSIWYG editing of localized ontology triples in a HTML textarea editor. The TermFactory editor application uses the CKEditor javascript textarea editor for this purpose. TF adds to the CKEditor standard release a special purpose plugin that adds TermFactory specific action buttons and a menu to the standard issue.

While staying conceptually close to the original RDF format of the data, the TermFactory editor layout is quite versatile. With suitable parameters, it can be tweaked to show ontology content editable in shapes already familiar to professional terminologists. There is a customisable, schema-aware insertion menu to help inserting relevant content, plus customisable input and output layout templates. The editor is not limited to TermFactory ontologies, as it is built on a general purpose textarea editor using a generic RDF to HTML mapping.

A specialty of TermFactory is that it supports terminological reflexion. The metaterminology used in the editor is not fixed, but can be changed by giving it a TF term ontology as a parameter. Using TF localization and bridge ontologies, not only the editor interface, but also the content shown can be localized to a user community's conceptualization, language and terminology. Here is the same editor page fetched after setting the MediaWiki language to Finnish. Note how the terminological metalanguage used in the entry is now shown in Finnish. (The localization is not complete, because the current localization ontology's coverage has some gaps.)

MOLTO Wiki TF Editor page

3.4.2 Installation

The TermFactory source code is on svn at svn.it.helsinki.fi/repos/termfactory. A username and password on the repository server is needed for checkout. To check out a path, choose installation directory, go to it and do svn checkout https://<username>@svn.it.helsinki.fi/repos/termfactory/path.

The compiled web archive files for TF are

io/lib/tf-io.jar            The core library (offline tools)
ws/service/TFServices.aar   The Axis2 webservice archive
ws/servlet/TermFactory.war  The Tomcat webapp archive

These three archives should be enough for deployment of TF in Linux from binaries on Tomcat running Axis2. Installations of mysql and Jena TDB are needed for persistent storage of ontologies on the TermFactory server. File upload services require prior installation of WebDAV. Detailed TF source build and install instructions are available on request.

TermFactory MediaWiki is MediaWiki out of the box plus the TermFactory MediaWiki extension, downloadable from the TermFactory svn path fe/TermFactory. The extension requires the TF back end to be installed.

Install MediaWiki 1.16 (or newer)
Put everything under extensions/TermFactory
Add require("$IP/extensions/TermFactory/TermFactory.php"); to LocalSettings.php in the main directory
Go to page Special:EditTerm

3.4.3 Limitations

User management between MediaWiki, TermFactory services, and TermFactory WebDAV is not fully in sync yet.