How do I encode...?

Glottals

Question: What character should I use to represent the glottal stop?

Answer: People have represented the glottal stop in many different ways. The best choice depends on the shape your orthography uses and on whether it needs upper and lower case.

If you want something that looks like a curly quote, you should use U+02BC ʼ MODIFIER LETTER APOSTROPHE. You could use U+2019 ’ RIGHT SINGLE QUOTATION MARK, but there are at least two issues with that. First, it is classed as punctuation, so its properties differ from those of an orthographic character. Second, if your text also uses quotation marks, there is nothing to distinguish the glottal stop from a closing quote. Our Roman fonts (such as Andika, Charis, and Gentium) all have an alternate glyph for U+02BC ʼ MODIFIER LETTER APOSTROPHE which is a bit larger than normal to help distinguish it from U+2019 ’ RIGHT SINGLE QUOTATION MARK.
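The difference in properties can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` and `re` modules. The sample word is invented purely for illustration.

```python
import re
import unicodedata

MODIFIER_APOSTROPHE = "\u02BC"  # ʼ MODIFIER LETTER APOSTROPHE
RIGHT_SINGLE_QUOTE = "\u2019"   # ’ RIGHT SINGLE QUOTATION MARK

# General category: "Lm" is a modifier letter, "Pf" is final-quote punctuation.
cat_letter = unicodedata.category(MODIFIER_APOSTROPHE)
cat_quote = unicodedata.category(RIGHT_SINGLE_QUOTE)

# Hypothetical word with a glottal stop in the middle, encoded both ways.
word_with_letter = "ha" + MODIFIER_APOSTROPHE + "i"
word_with_quote = "ha" + RIGHT_SINGLE_QUOTE + "i"

# \w matches letters but not punctuation, so only the first encoding
# survives as a single word when the text is tokenized.
tokens_letter = re.findall(r"\w+", word_with_letter)
tokens_quote = re.findall(r"\w+", word_with_quote)

print(cat_letter, tokens_letter)
print(cat_quote, tokens_quote)
```

Word breaking, spell checking, and searching in many applications make a similar letter-versus-punctuation distinction, which is why the choice of character matters beyond appearance.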

Many orthographies have used something that looks like the straight quote. There were so many problems with using U+0027 ' APOSTROPHE for this character that we requested the addition of a dedicated character to Unicode. You should use U+A78C ꞌ LATIN SMALL LETTER SALTILLO. One language even “cases” this character, and U+A78B Ꞌ LATIN CAPITAL LETTER SALTILLO is used for the uppercase. Our Roman fonts (such as Andika, Charis, and Gentium) all have alternate glyphs for U+A78C ꞌ LATIN SMALL LETTER SALTILLO and U+A78B Ꞌ LATIN CAPITAL LETTER SALTILLO which are a bit larger than normal to help distinguish them from U+0027 ' APOSTROPHE.

U+02BE ʾ MODIFIER LETTER RIGHT HALF RING is sometimes used for transliterating Arabic hamza (glottal stop). This looks different from both U+A78C ꞌ LATIN SMALL LETTER SALTILLO and U+02BC ʼ MODIFIER LETTER APOSTROPHE and might be a good option for traditions which recognize the transliterated hamza.

Some Saskatchewan orthographies use an upper and lowercase glottal stop. Those are U+0241 Ɂ LATIN CAPITAL LETTER GLOTTAL STOP and U+0242 ɂ LATIN SMALL LETTER GLOTTAL STOP.

Of course, the IPA representation is U+0294 ʔ LATIN LETTER GLOTTAL STOP and some languages also use this in their orthographies (where casing is not required).
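If casing matters for your orthography, it may help to check which of these candidates actually have case pairs. This is a small sketch using Python's built-in case conversion; `has_case_pair` is a helper name made up for this example.

```python
import unicodedata

# Lowercase or caseless glottal stop candidates mentioned above.
CANDIDATES = ["\u02BC", "\uA78C", "\u02BE", "\u0242", "\u0294"]


def has_case_pair(ch: str) -> bool:
    """True if Unicode maps ch to a different upper- or lowercase form."""
    return ch.upper() != ch or ch.lower() != ch


case_report = {unicodedata.name(ch): has_case_pair(ch) for ch in CANDIDATES}

for name, cased in case_report.items():
    print(f"{name}: {'cased' if cased else 'caseless'}")
```

Only the saltillo and the small glottal stop report a case pair. The modifier letters and U+0294 stay unchanged under case conversion.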

Diacritics

Question: I want to put a diacritic on a “dotted i” and want to retain the dot on the “i”. Can you add that feature to your fonts?

Answer: The Unicode Standard addresses this in chapter 7. You should encode it as U+0069 i LATIN SMALL LETTER I + U+0307 ◌̇ COMBINING DOT ABOVE + U+0301 ◌́ COMBINING ACUTE ACCENT.
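As a rough illustration, the sequence can be built and inspected with Python's standard `unicodedata` module. The variable names are just for this sketch.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT

# "i" that keeps its dot, with an acute on top: three code points.
dotted_i_acute = "i" + DOT_ABOVE + ACUTE

# Both marks have the same combining class, so normalization never
# reorders them and the dot stays next to the base letter.
same_class = unicodedata.combining(DOT_ABOVE) == unicodedata.combining(ACUTE)

# NFC leaves the sequence untouched; it does not collapse into U+00ED (í).
nfc_dotted = unicodedata.normalize("NFC", dotted_i_acute)
nfc_plain = unicodedata.normalize("NFC", "i" + ACUTE)

print([f"U+{ord(c):04X}" for c in nfc_dotted])
print([f"U+{ord(c):04X}" for c in nfc_plain])
```

The explicit dot therefore survives normalization, and the text remains distinct from an ordinary i with acute.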

Question: I need a “V”, “t”, “n” and “l” with a macron under each. Unicode does not have these characters. Can you add these to your PUA and get them into Unicode for me, or is there another way I can encode this character?

Answer: Unicode does have some precomposed characters because they already existed in earlier standards. The Unicode Technical Committee will no longer accept new precomposed forms unless there is a very convincing argument.

However, each of these can be encoded in Unicode. So, for example, “V” with a macron under it should be encoded as two characters: U+0056 V LATIN CAPITAL LETTER V + U+0331 ◌̱ COMBINING MACRON BELOW.

The same thing can be done with each of your other characters, and, in fact, any other base + diacritic.
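A short sketch of the base + diacritic approach, using Python's standard `unicodedata` module. The helper `with_macron_below` is a name invented for this example.

```python
import unicodedata

MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW


def with_macron_below(base: str) -> str:
    """Return the base letter followed by the combining macron below."""
    return base + MACRON_BELOW


results = {}
for base in "Vtnl":
    decomposed = with_macron_below(base)
    # NFC replaces a sequence with a precomposed character where one exists.
    composed = unicodedata.normalize("NFC", decomposed)
    results[base] = composed
    print(base, [f"U+{ord(c):04X}" for c in composed])
```

Under NFC, the sequences for “t”, “n” and “l” collapse to existing precomposed characters, while “V” stays as two code points. The two forms are canonically equivalent, so normalizing all data to one form before comparing or searching should give consistent results either way.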

Question: You have left out one crucial Unicode range of four diacritics which are used within the Latin-script in the library world: U+FE20..U+FE23.

U+FE20 ◌︠ COMBINING LIGATURE LEFT HALF
U+FE21 ◌︡ COMBINING LIGATURE RIGHT HALF
U+FE22 ◌︢ COMBINING DOUBLE TILDE LEFT HALF
U+FE23 ◌︣ COMBINING DOUBLE TILDE RIGHT HALF

Transliterated Cyrillic records e.g. make heavy use of the first two.

Answer: Originally we made a deliberate decision not to include the combining half marks in our fonts. We consider U+0360 ◌͠ COMBINING DOUBLE TILDE and U+0361 ◌͡ COMBINING DOUBLE INVERTED BREVE to be the preferred characters to use. Thus, to put the U+0361 ◌͡ COMBINING DOUBLE INVERTED BREVE over an “ia”, the preferred encoding is to place it between the two letters: i + U+0361 ◌͡ COMBINING DOUBLE INVERTED BREVE + a.

However, we were convinced that the library world does need these characters, and so they were added to our Unicode Roman fonts (Andika, Charis, and Gentium). Positioning of these may not be perfect.
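If you have data encoded with the half marks and want the preferred double diacritics instead, a conversion along the following lines might work. This is only a sketch: the function name and mapping table are hypothetical, and it assumes each half mark directly follows a single-code-point base letter with no other diacritics in between.

```python
import re

# Assumed pairing of (left half, right half) with the double diacritic.
HALF_MARK_PAIRS = {
    ("\uFE20", "\uFE21"): "\u0361",  # ligature halves -> double inverted breve
    ("\uFE22", "\uFE23"): "\u0360",  # double tilde halves -> double tilde
}


def prefer_double_diacritics(text: str) -> str:
    """Rewrite base+left, base+right as base + double diacritic + base."""
    for (left, right), double in HALF_MARK_PAIRS.items():
        pattern = re.compile("(.)" + left + "(.)" + right)
        text = pattern.sub(lambda m: m.group(1) + double + m.group(2), text)
    return text


half_mark_form = "i\uFE20a\uFE21"
preferred_form = prefer_double_diacritics(half_mark_form)
print([f"U+{ord(c):04X}" for c in preferred_form])
```

Real catalogue records may contain sequences this simple pattern does not cover, so any conversion should be checked against the actual data.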

Question: I need a diacritic on an “i”. Should I use the dotless “i” that I found in Unicode or what should I do? I also need to have a diacritic that will go on the upper case “i” and I can’t find different heights for the diacritics.

Answer: This is where Unicode is really, really useful. You no longer need to encode two different versions of an “i” and two different versions of a diacritic. In fact, you should not! If you look at the character properties for the character you have suggested (U+0131 ı LATIN SMALL LETTER DOTLESS I) you will see that this character is only used for Turkish and Azerbaijani.

So, you should just use the base character plus the diacritic. (This makes data analysis much simpler as well.) Unicode, along with smart fonts, will automatically handle the dot removal for the “i” and height adjustment for the upper case “i”. For example, i with caron would be encoded as i + U+030C ◌̌ COMBINING CARON.
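The effect on data can be seen with a small sketch using Python's standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

CARON = "\u030C"      # COMBINING CARON
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

recommended = "i" + CARON        # ordinary i + diacritic
discouraged = DOTLESS_I + CARON  # dotless i + diacritic

# The recommended sequence is canonically equivalent to U+01D0 (ǐ).
nfc_recommended = unicodedata.normalize("NFC", recommended)

# The dotless-i sequence has no such equivalence and stays as it is,
# so it will not match the same word typed with a regular "i".
nfc_discouraged = unicodedata.normalize("NFC", discouraged)

# Uppercasing the recommended form gives I + caron, equivalent to U+01CF (Ǐ).
upper_recommended = unicodedata.normalize("NFC", recommended.upper())

print(nfc_recommended == nfc_discouraged)
```

Searching, sorting, and case conversion all treat the dotless-i version as a different letter, which is the kind of data analysis problem the plain base character avoids.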

In the following example you can see that the diacritic is shifted down if you have characters that have descenders:

Overlays

Question: I need to use a slash “L” (U+0141 Ł LATIN CAPITAL LETTER L WITH STROKE). I can see that Unicode has a precomposed slash “L”. Would it be better for me to use the precomposed version or make it decomposed?

Answer: Sometimes people get confused about whether to use precomposed or decomposed characters that are in Unicode. A simple rule-of-thumb to go by is that if a character has diacritics (either above or below the character), it can be decomposed. If the character has an “overlay” (superimposed on the character) then the preformed (not precomposed) character should be used.

An easy way to find Unicode characters is to look at the Collation charts, which are sorted alphabetically. However, they do not show character properties or decompositions, so if you need those you will need to go to the Unicode website. You can find charts of all the Unicode characters at this site.

In the example we are using (U+0141 Ł LATIN CAPITAL LETTER L WITH STROKE) you will find that there is no decomposition listed for this character and so you should not use “L” + “/” (U+004C L LATIN CAPITAL LETTER L + U+0338 ◌̸ COMBINING LONG SOLIDUS OVERLAY). This also means that we should not be using the term “precomposed” for this character, rather, it is “preformed”.
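You can also check for a decomposition programmatically. The sketch below uses Python's standard `unicodedata` module and compares the stroked L with U+00E9 é, which is chosen here only as a familiar character that does decompose.

```python
import unicodedata

L_WITH_STROKE = "\u0141"  # Ł: overlay, "preformed"
E_WITH_ACUTE = "\u00E9"   # é: diacritic above, "precomposed"

# An empty string means the character has no decomposition.
stroke_decomposition = unicodedata.decomposition(L_WITH_STROKE)
acute_decomposition = unicodedata.decomposition(E_WITH_ACUTE)

# L + combining long solidus overlay may look similar, but normalization
# never turns it into U+0141, so the two are different text.
lookalike = "L\u0338"
lookalike_nfc = unicodedata.normalize("NFC", lookalike)

print(repr(stroke_decomposition), repr(acute_decomposition))
print(lookalike_nfc == L_WITH_STROKE)
```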

Question: I cannot find a barred U+0261 ɡ LATIN SMALL LETTER SCRIPT G. Can you add it to your PUA?

Answer: Although what you are requesting looks different, fundamentally this is the same character as ǥ. You should encode it as U+01E5 ǥ LATIN SMALL LETTER G WITH STROKE. The Charis font allows you to choose the barred bowl form through Font Features (if you have an application which allows for this).

Character choices

Question: How do I know which version of the schwa to use? There is U+0259 ə LATIN SMALL LETTER SCHWA and U+01DD ǝ LATIN SMALL LETTER TURNED E.

Answer: This one will rise up and bite you if you are not careful! This is where looking at the documentation is important. If you look at U+0259 ə LATIN SMALL LETTER SCHWA you will see:

There are a number of useful bits of information here. Firstly, you see that it tells you U+018E Ǝ LATIN CAPITAL LETTER REVERSED E is associated with U+01DD ǝ LATIN SMALL LETTER TURNED E. The second bit of useful information is in the first cross reference you are given: U+018F Ə LATIN CAPITAL LETTER SCHWA. This tells us that U+018F is the upper case match to this character.

Another interesting test is to type both of the schwas into a word processor (like Word). Select them both and click on Format / Change Case… / UPPER CASE. You should see two different forms of the upper case schwa. This shows you how important it is to match the lower case character (which looks exactly the same) with the correct upper case character (which looks significantly different).

In this example, if you are using U+018F Ə LATIN CAPITAL LETTER SCHWA in your orthography, make sure the lower case is U+0259 ə LATIN SMALL LETTER SCHWA.
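The same test can be run without a word processor. A minimal sketch using Python's built-in case conversion:

```python
import unicodedata

SCHWA = "\u0259"     # ə LATIN SMALL LETTER SCHWA
TURNED_E = "\u01DD"  # ǝ LATIN SMALL LETTER TURNED E

# The two lowercase letters look alike but map to different capitals.
upper_schwa = SCHWA.upper()
upper_turned_e = TURNED_E.upper()

for lower, upper in [(SCHWA, upper_schwa), (TURNED_E, upper_turned_e)]:
    print(f"U+{ord(lower):04X} -> U+{ord(upper):04X} {unicodedata.name(upper)}")
```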

Question: I’ve noticed that when I’m looking for phonetic characters, not everything I want is in the IPA extensions.

For example, the beta which is used for a voiced bilabial fricative is, I believe, supposed to be encoded as U+03B2 β GREEK SMALL LETTER BETA, but that is in the Greek section, and its documentation does not make explicit that it is supposed to be used for a bilabial fricative nor that it is part of the IPA. So, I am still not absolutely sure I’ve got the right character.

Answer: You are right about the voiced bilabial fricative being encoded as U+03B2 β GREEK SMALL LETTER BETA. The bigger question, of course, is a need to know all the characters sanctioned as part of the IPA and what their Unicode codepoints are.

The official IPA site does not currently do this for us. There are several places you can check for this information. IPA Unicode codepoints gives several resources.

Question: I want an open o with the serif at the top. I see that Unicode now has U+2183 Ↄ ROMAN NUMERAL REVERSED ONE HUNDRED and U+2184 ↄ LATIN SMALL LETTER REVERSED C. Can I use those instead of U+0186 Ɔ LATIN CAPITAL LETTER OPEN O and U+0254 ɔ LATIN SMALL LETTER OPEN O?

Answer: U+2183 Ↄ ROMAN NUMERAL REVERSED ONE HUNDRED was added to the Roman numeral block for use as a Claudian letter. We do not recommend using U+2183 or U+2184 for anything other than what they were designed for. If you need an open o, please use U+0186 Ɔ LATIN CAPITAL LETTER OPEN O and U+0254 ɔ LATIN SMALL LETTER OPEN O, and find a font which has the serif where you want it. Our SIL Roman Unicode fonts provide an alternate form, so if you have an application that can handle it, you can choose whether you want a top or bottom serif.

Question: I want a handwritten style a. Unicode has U+0251 LATIN SMALL LETTER ALPHA. Can I use that?

Answer: U+0251 ɑ LATIN SMALL LETTER ALPHA is in Unicode as an IPA symbol. Please do not use it instead of an “a”. You should find a font which has a handwritten style “a” at U+0061 a LATIN SMALL LETTER A. If you use U+0251 ɑ LATIN SMALL LETTER ALPHA you will have unexpected results with data analysis as well as when using uppercase/lowercase pairs.

The only time you would want to use U+0251 ɑ LATIN SMALL LETTER ALPHA in an orthography is if you have contrastive use between U+0251 ɑ LATIN SMALL LETTER ALPHA and U+0061 a LATIN SMALL LETTER A. Then, you would also want to use U+2C6D Ɑ LATIN CAPITAL LETTER ALPHA for the uppercase of U+0251 ɑ LATIN SMALL LETTER ALPHA.
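The kind of unexpected result to watch for can be shown with a small sketch. The word “mama” is used here only as an invented sample.

```python
LATIN_ALPHA = "\u0251"  # ɑ LATIN SMALL LETTER ALPHA

word_with_a = "mama"
word_with_alpha = "m" + LATIN_ALPHA + "m" + LATIN_ALPHA

# The two words may look the same in a handwriting-style font,
# but they are different strings, so a search for one misses the other.
same_text = word_with_a == word_with_alpha

# Uppercasing also diverges: "a" becomes "A", while U+0251 becomes U+2C6D.
upper_a = word_with_a.upper()
upper_alpha = word_with_alpha.upper()

print(same_text)
print([f"U+{ord(c):04X}" for c in upper_alpha])
```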

Tone

Question: I see that Unicode (and your Charis font) has individual tone letters (U+02E5..U+02E9), but does not have the tone glides. Can you get those encoded in Unicode? They are very important in linguistic work.

Answer: Unicode can already handle these. You do need a smart font (like Charis) to make it work. You should type the tone letters in the correct linguistic order and they should become the correct tone glide. For example:
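In terms of the underlying text, a glide is simply a sequence of tone letters. The sketch below builds such sequences from numeric pitch levels. The helper `tone_contour` and the 5 = highest, 1 = lowest numbering are assumptions made for this example, and the rendering as a single glide still depends on a smart font.

```python
# Assumed mapping from numeric pitch levels to the tone letters U+02E5..U+02E9.
TONE_LETTERS = {
    "5": "\u02E5",  # ˥ extra-high
    "4": "\u02E6",  # ˦ high
    "3": "\u02E7",  # ˧ mid
    "2": "\u02E8",  # ˨ low
    "1": "\u02E9",  # ˩ extra-low
}


def tone_contour(levels: str) -> str:
    """Turn a digit string such as "51" into a sequence of tone letters."""
    return "".join(TONE_LETTERS[digit] for digit in levels)


falling = tone_contour("51")      # extra-high to extra-low
rise_fall = tone_contour("353")   # mid, extra-high, mid

print([f"U+{ord(c):04X}" for c in falling])
print([f"U+{ord(c):04X}" for c in rise_fall])
```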