Language Tags
Introduction
Language tags are standardised identifier for language information. They are used to identify the orthography of a text, locale information, languages, etc. A good overview on how to create a tag may be found here: tagging.md. The technical specification for the structure of a language tag is BCP47. BCP47 makes reference to the IANALanguage Subtag registry that contains basic definitions for all the language tag subcomponents that require registering.
Since a language tag is designed around tagging text, it is best to think of a
language tag as an orthography tag. At the orthography level, multiple tags may
refer to the same thing. Thus en
, en-Latn
, en-US
, and en-Latn-US
can all
be considered equivalent. It is difficult to work out what these equivalences
are. For this there is a json file available here:
https://ldml.api.sil.org/langtags.json
which groups tags into tag sets based on their orthographic equivalence. A
description of the fields is given here:
langtags.md.
There is also a python module given as a reference implementation
here,
which is available as langtag on pypi.
Using Language Tags
There are typically two key equivalent tags, the shortest tag and the full tag.
In the case of English, the shortest tag is en
and the full tag is
en-Latn-US
. These may be found in a langtags.json tag set in the tag
and
full
fields. Users typically prefer to work with the shortest or minimal tag,
while applications value the full tag because it contains all the information
they need to do their work. Thus en-Latn-US
describes all the key information
about the orthography: it’s language, script and region. Meanwhile users
typically think: “I just want English, so en
”.
The extensions mechanism for language tags also allow tags to be extended to
specify such things as sort orders, transcription orthographies, etc. These are
beyond the scope of langtags.json, but can have considerable impact. For
example, en-Latn-US-t-wsg
indicates that the text is in English but is derived
from Ghondi, for example via automated (or manual) translation. The text is
still English (so en
would be sufficient), but the tagger wanted to accentuate
the derivative nature of the text from another language.
Problems
Given the importance of the standard, one might expect language tags to be
stable. But they are not. If there is an orthography revision, the new
orthography often takes over the primary tag set for that orthography, and if
lucky, another tag will be created for the old orthography. For example, Germany
regularly updates its orthography. Thus there is: de-1901
for the 1901
orthography revision, and de-1996
which is the current orthography revision.
Thus before 1996, de
would have been equivalent to de-1901
, but after 1996
it became equivalent to de-1996
. It is very difficult to ensure the long term
future stability of the tagging of some text. Only when orthographies are
reformed, and so two tags may be created, is there any hope.
Applications, therefore, need to provide the ability to change the tagging of
data when necessary. For example, the ability to switch all occurrences of de
to de-1996
and then to reuse de
.
While orthographies are in early development, which includes until they are standardised, and can take decades, the language tag is particularly unstable. It is only once there is enough literature or a large enough user community of a particular orthography revision, that issues of tag stability need to be considered.