
Encoding Conversion

Data conversion covers several kinds of change. For example, it could mean converting data from a legacy encoding to a Unicode encoding, or converting data from one writing system to another.

This section is an introduction to text conversion tools, based around the converter tools produced by SIL Global.

Text conversion can cover various automated changes to text including:

  • Encoding conversion - for example from a legacy encoding to Unicode encoding
  • Script conversion - changing from one script to another by transcription or transliteration
  • Data markup conversion - for example from USFM to XML
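As a minimal illustration of the first kind of conversion, the sketch below uses Python's built-in codecs to turn bytes stored in a standard legacy code page into Unicode. Windows-1252 and the sample bytes are assumptions chosen only for the example. Custom legacy font encodings need a purpose-built mapping instead, and this snippet is not part of any SIL tool.

```python
# Example bytes as they might appear in a Windows-1252 (legacy) file.
legacy_bytes = b"caf\xe9 na\xefve"

# Step 1: interpret the legacy bytes as Unicode text.
text = legacy_bytes.decode("cp1252")

# Step 2: store the text in a Unicode encoding (UTF-8 here).
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes)
```

Script conversion and markup conversion follow the same pattern of text in, text out, but need rules that a standard codec does not provide.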

The same software tools are used for all types of conversion, so most of this document applies to all.

For detailed information on the different tools mentioned, follow the links to the relevant websites.

All text is contained within ‘documents’ of one sort or another, so the conversion software needs to:

  1. Handle the structure of the document, to extract the raw text that needs to be converted
  2. Convert the text to the destination form
  3. Update the document with the converted text, including any relevant changes to the metadata related to the text

In terms of SIL Converters, steps (1) and (3) are covered by the “client applications” and step (2) by “transduction engines”. This separation gives great flexibility. The same mappings used by the transduction engines can be applied to various document types by choosing an appropriate client application. Likewise, one client application can work with various transduction engines, so the engine best suited to the conversion at hand can be chosen.
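The division of labour described above can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: `toy_engine` stands in for a transduction engine, `convert_sfm` stands in for a client application that understands simple SFM-style lines, and neither reflects the actual SIL Converters API.

```python
from typing import Callable

# A "transduction engine" is modelled as any function from text to text.
Engine = Callable[[str], str]


def toy_engine(text: str) -> str:
    """Hypothetical mapping: two legacy placeholder characters to Unicode."""
    return text.replace("@", "\u0259").replace("N", "\u014b")


def convert_sfm(document: str, markers: set[str], engine: Engine) -> str:
    """Hypothetical client application for SFM-style text.

    (1) Split each line into marker and content, (2) pass the content of
    the selected markers through the engine, (3) rebuild the line.
    """
    out = []
    for line in document.splitlines():
        marker, sep, content = line.partition(" ")
        if marker in markers:
            content = engine(content)
        out.append(marker + sep + content)
    return "\n".join(out)


doc = "\\lx taN@\n\\ge Not converted"
converted = convert_sfm(doc, {"\\lx"}, toy_engine)
print(converted)
```

Because `convert_sfm` only needs a function from text to text, a different engine can be swapped in without touching the document handling, and the same engine could be reused by a client written for another document type.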

The client applications can be either programs that work within the main application (e.g. a macro running within Microsoft Word) or programs that convert multiple documents in a batch process (e.g. the “Bulk Word Converter”).

The main set of tools, the SIL Converters package, is Windows-based. However, much of the underlying technology, in particular TECkit, is cross-platform.

Various other SIL tools, such as FieldWorks and Adapt It, have built-in integration with these conversion tools and work on various platforms. Developers can integrate TECkit conversion into their applications. The LibreOffice Linguistic Tools extension includes information on using Converters on Linux.

The SIL Converters package contains both client applications and transduction engines. The package also includes many predefined mappings for conversions. These can be used as they are - if they meet the conversion need - or as a starting point for similar conversions.

One of the strengths of SIL Converters (over a stand-alone transduction engine) is that it can chain conversions together, even ones using different transduction engines.
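The idea of chaining can be sketched as function composition. This is an illustration of the concept only, not how SIL Converters implements it: `chain`, `legacy_accents` and `to_nfc` are hypothetical names, and the two steps stand in for converters backed by different engines (a table lookup and Unicode normalization from Python's standard library).

```python
import unicodedata
from functools import reduce
from typing import Callable

Converter = Callable[[str], str]


def chain(*steps: Converter) -> Converter:
    """Return a converter that feeds each step the previous step's output."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def legacy_accents(text: str) -> str:
    """Hypothetical table step: legacy "`" becomes a combining grave accent."""
    return text.replace("`", "\u0300")


def to_nfc(text: str) -> str:
    """Second step using a different mechanism: Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


legacy_to_nfc = chain(legacy_accents, to_nfc)
result = legacy_to_nfc("pe`re")
print(result, [hex(ord(c)) for c in result])
```

Each step stays small and testable on its own, and the chained converter can be used anywhere a single converter is expected.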

A central component of SIL Converters is the converter repository where mappings are stored. Once a mapping is installed in the repository, it becomes available to all client applications.

For developers, there is a simple COM interface to select and use a converter. It can be used from VBA, C++, C#, Perl, Python or any other .NET- or COM-enabled language.

The SIL Converters package includes converters for:

  • Microsoft Word (macro and bulk conversion)
  • Microsoft Access, Excel and Publisher
  • XML and SFM documents
  • Clipboard data

The LibreOffice Linguistic Tools extension includes the ability to convert LibreOffice documents by calling SIL Converters.

SIL FieldWorks, Speech Analyzer, Phonology Assistant and Adapt It software also include integration with SIL Converters.

The TECkit conversion toolkit is included in the SIL Converters package, but can also be installed and used as a standalone tool. It can be run on Windows, Mac and Linux.

The core component is the “TECkit engine”, a library that performs conversions based on mappings. This library is used in the other tools, including SIL Converters. Its mapping tables use multi-pass, context-sensitive rules, which can often be written so that the overall mapping can be used forwards or backwards. The rules are compiled into a binary form which is optimized for speed of conversion.
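To make "multi-pass, context-sensitive rules" that work "forwards or backwards" more concrete, here is a small Python sketch. It does not use TECkit or its mapping language. The `PASSES` table, the rule format and the sample characters are all assumptions invented for the example, and the round trip only holds for input that the table was designed for.

```python
import re

# Each pass is a list of rules: (legacy, unicode, preceding-context regex).
# An empty context means the rule applies everywhere. All rules are invented.
PASSES = [
    # Pass 1: simple one-to-one character replacements.
    [("@", "\u0259", ""), ("N", "\u014b", "")],
    # Pass 2: "'" becomes a combining acute accent only after a vowel.
    [("'", "\u0301", "[aeiou\u0259]")],
]


def _apply(text: str, rules, forward: bool) -> str:
    """Apply one pass, swapping source and target when running backwards."""
    ordered = rules if forward else list(reversed(rules))
    for legacy, uni, context in ordered:
        src, dst = (legacy, uni) if forward else (uni, legacy)
        lookbehind = f"(?<={context})" if context else ""
        text = re.sub(lookbehind + re.escape(src), lambda m, d=dst: d, text)
    return text


def convert(text: str, forward: bool = True) -> str:
    """Run all passes in order, or in reverse order for the backward mapping."""
    passes = PASSES if forward else list(reversed(PASSES))
    for rules in passes:
        text = _apply(text, rules, forward)
    return text


unicode_text = convert("taN@' k'a")
round_trip = convert(unicode_text, forward=False)
print(unicode_text)
print(round_trip)
```

The apostrophe after "k" is left alone because its context does not match, which is the kind of distinction a plain character-for-character table cannot make. A real engine compiles such rules ahead of time rather than interpreting them with regular expressions as this sketch does.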

TECkit includes some command line tools, and, for developers, there are wrappers for C#, Perl, Python, C and C++.

Besides data conversion, other steps may be required. For example, a keyboard may need to be created to replace the legacy keyboard that was used. Additionally, data markup may need converting, for example from USFM to XML.

This document does not address those issues.