OBD:Text encoding: Difference between revisions

From OniGalore
(general copy-edit and wiki touch-up)
(disambiguated/corrected all I could)
Line 1: Line 1:
{{TOCfloat|side=right}}
{{TOCfloat|side=right}}
Beyond the US English release of the game, Oni's text is known to have been translated into French, Italian, Spanish, German, Japanese and Chinese by various localization companies. Depending on the language version, vanilla Oni uses one of the following five encodings to render text:
Originally created in English, Oni has been translated into the following [[seven]] languages: French, Italian, Spanish, German, Russian, Japanese and Chinese.
:(An overview of the known language versions can be found [[OBD:Versions|HERE]], whereas localized content is detailed [[OBD:Localization|HERE]].)
Depending on the language version, vanilla Oni uses one of the following five encodings to render text:
*The original US version uses a trimmed-down [[wp:Mac_OS_Roman|Mac OS Roman]] code page that is effectively limited to US-ASCII (96 code points).
*The original US version uses a trimmed-down [[wp:Mac_OS_Roman|Mac OS Roman]] code page that is effectively limited to US-ASCII (96 code points).
*European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points).
*European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points).
Line 7: Line 9:
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift_JIS|Shift JIS]] implementation of [[wp:JIS_X_0208|JIS X 0208]].
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift_JIS|Shift JIS]] implementation of [[wp:JIS_X_0208|JIS X 0208]].
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.
:(A more thorough overview of the glyphs can be found [[/Fonts|HERE]].)


A more thorough overview of the glyphs can be found [[/Fonts|HERE]].


==Encodings==
==Encodings==
===US English===
===US English===
Below is the code page implemented by [[TSFF]]Tahoma in the US English version of Oni. It is based on Mac OS Roman ("MacRoman"), but with two differences:
Below is the code page implemented by [[TSFF]]Tahoma in the US English version of Oni. It is based on [[wp:Mac_OS_Roman|Mac OS Roman]] ("MacRoman" for short), but with two differences:
*Of the 223 printable glyphs provided by MacRoman, 42 were not created in TSFFTahoma (shown as grey-on-black).
*Of the 223 printable glyphs provided by MacRoman, 42 are not implemented in TSFFTahoma (shown as grey-on-black).
*Control point 0x7F (a typically non-printable "delete" character) is visible as a box glyph (◻).
*Control point 0x7F (a typically non-printable "delete" character) has a visible box-like glyph (◻) in this implementation.
{|border=1 cellpadding=3 cellspacing=0
{|border=1 cellpadding=3 cellspacing=0
|-bgcolor=silver
|-bgcolor=silver
Line 20: Line 22:
|-
|-
!bgcolor=silver|0x2...
!bgcolor=silver|0x2...
!<span style="color:silver;font-size:70%">SP</span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
!<span style="color:silver;font-size:80%"><sup>S</sup><sub>P</sub></span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
|-
|-
!bgcolor=silver|0x3...
!bgcolor=silver|0x3...
Line 36: Line 38:
!bgcolor=silver|0x7...
!bgcolor=silver|0x7...
!&#112;!!&#113;!!&#114;!!&#115;!!&#116;!!&#117;!!&#118;!!&#119;!!&#120;!!&#121;!!&#122;!!&#123;!!&#124;!!&#125;!!&#126;!!&#9723;
!&#112;!!&#113;!!&#114;!!&#115;!!&#116;!!&#117;!!&#118;!!&#119;!!&#120;!!&#121;!!&#122;!!&#123;!!&#124;!!&#125;!!&#126;!!&#9723;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0x8...
!bgcolor=silver|0x8...
!&#196;
!&#196;
!bgcolor=black|<span style="color:darkslategray">&#197;</span>
!bgcolor=black|<span style="color:darkslategray">&#197;</span>
!&#199;!!&#201;!!&#209;!!&#214;!!&#220;!!&#225;!!&#224;!!&#226;!!&#228;!!&#227;!!&#229;!!&#231;!!&#233;!!&#232;
!&#199;!!&#201;!!&#209;!!&#214;!!&#220;!!&#225;!!&#224;!!&#226;!!&#228;!!&#227;!!&#229;!!&#231;!!&#233;!!&#232;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0x9...
!bgcolor=silver|0x9...
!&#234;!!&#235;!!&#237;!!&#236;!!&#238;!!&#239;!!&#241;!!&#243;!!&#242;!!&#244;!!&#246;
!&#234;!!&#235;!!&#237;!!&#236;!!&#238;!!&#239;!!&#241;!!&#243;!!&#242;!!&#244;!!&#246;
Line 49: Line 51:
!bgcolor=black|<span style="color:darkslategray">&#251;</span>
!bgcolor=black|<span style="color:darkslategray">&#251;</span>
!&#252;
!&#252;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xA...
!bgcolor=silver|0xA...
!&dagger;
!&dagger;
Line 64: Line 66:
!bgcolor=black|<span style="color:darkslategray">&#198;</span>
!bgcolor=black|<span style="color:darkslategray">&#198;</span>
!&#216;
!&#216;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xB...
!bgcolor=silver|0xB...
!bgcolor=black|<span style="color:darkslategray">&infin;</span>
!bgcolor=black|<span style="color:darkslategray">&infin;</span>
Line 81: Line 83:
!bgcolor=black|<span style="color:darkslategray">Ω</span>
!bgcolor=black|<span style="color:darkslategray">Ω</span>
!&#230;!!&#248;
!&#230;!!&#248;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xC...
!bgcolor=silver|0xC...
!&#191;!!&#161;!!&#172;
!&#191;!!&#161;!!&#172;
Line 89: Line 91:
!bgcolor=black|<span style="color:darkslategray">∆</span>
!bgcolor=black|<span style="color:darkslategray">∆</span>
!&#171;!!&#187;!!&hellip;
!&#171;!!&#187;!!&hellip;
!bgcolor=black|<span style="color:darkslategray;font-size:70%">NBSP</span>
!bgcolor=black|<div style="color:darkslategray;font-size:70%;line-height:0.8em">NB<br />SP</div>
!&#192;
!&#192;
!bgcolor=black|<span style="color:darkslategray">&#195;</span>
!bgcolor=black|<span style="color:darkslategray">&#195;</span>
!&#213;
!&#213;
!&OElig;!!&oelig;
!&OElig;!!&oelig;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xD...
!bgcolor=silver|0xD...
!&ndash;!!&mdash;!!&#8223;!!&rdquo;!!&#8219;!!&rsquo;!!&#247;
!&ndash;!!&mdash;!!&#8223;!!&rdquo;!!&#8219;!!&rsquo;!!&#247;
Line 104: Line 106:
!bgcolor=black|<span style="color:darkslategray">&#xfb01;</span>
!bgcolor=black|<span style="color:darkslategray">&#xfb01;</span>
!bgcolor=black|<span style="color:darkslategray">&#xfb02;</span>
!bgcolor=black|<span style="color:darkslategray">&#xfb02;</span>
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xE...
!bgcolor=silver|0xE...
!&Dagger;
!&Dagger;
Line 111: Line 113:
!bgcolor=black|<span style="color:darkslategray">&#202;</span>
!bgcolor=black|<span style="color:darkslategray">&#202;</span>
!&#193;!!&#203;!!&#200;!!&#205;!!&#206;!!&#207;!!&#204;!!&#211;!!&#212;
!&#193;!!&#203;!!&#200;!!&#205;!!&#206;!!&#207;!!&#204;!!&#211;!!&#212;
|-bgcolor=yellow
|-bgcolor=orange
!bgcolor=silver|0xF...
!bgcolor=silver|0xF...
!bgcolor=black|[[File:Platform-Mac.png|12px]]
!bgcolor=black|[[File:Platform-Mac.png|12px]]
Line 128: Line 130:
|}
|}
;Minor notes
;Minor notes
*The layout was apparently "borrowed" before Mac OS 8.5 came out in 1998. How can we say that? Because 8.5 replaced MacRoman's glyph at 0xDB, which was the [[wp:Currency sign (typography)|international currency sign]] or scarab (¤), with the euro symbol (€), but TSFFTahoma still has the scarab.
*The MacRoman layout was apparently "borrowed" before 1998, when Mac OS 8.5 came out and the [[wp:Currency sign (typography)|international currency sign]] a.k.a. scarab (¤), at 0xDB, was replaced with the euro symbol (€).
*The Yen/Yuan symbol only has a single horizontal stroke, presumably due to lack of available detail at small font sizes.
*The actual font (see [[/Fonts|HERE]]) has some unusual typographical features, such as a single-stroke Yen/Yuan symbol (Ұ) and a vertical-stroke cent symbol (¢).
;Major notes
;Major notes
*Some of the removed glyphs (most importantly ß, Ê, ù and û, but also Ú and ú) occur in [[wp:Languages of the European Union#Knowledge|common European languages]]. This made the US TSFFTahoma unsuitable for [[wikt:EFIGS|EFIGS]] localizations, requiring the creation of a new version (see below).  
*Some of the removed glyphs (most importantly ß, ù and û, but also Ê, Ú and ú) occur in [[wp:Languages of the European Union#Knowledge|common European languages]]. This made the US TSFFTahoma unsuitable for [[wikt:EFIGS|EFIGS]] localizations, requiring the creation of a new version (see below).  
*The US engine cannot interpret any code points beyond the US-ASCII range (first 6 rows, white background), such as "…" (see [[#Ellipsis_issue|"Ellipsis issue"]] below). This is because of a provision for Asian encoding systems (EUC-CN and Shift JIS), which use two-byte sequences starting with a high-bit byte.
*The US engine actually cannot interpret any code points beyond the US-ASCII range (first 6 rows, white background), notably failing on "…" (see [[#Ellipsis_issue|"Ellipsis issue"]] below). This is because of a provision for Asian encoding systems (EUC-CN and Shift JIS), which use two-byte sequences starting with a high-bit byte.
 
 
----
----
===European===
===European===
The code page used by the five Western European versions (UK English, French, German, Spanish and Italian) is slightly different from the trimmed-down Mac OS Roman.
The code page used by the five Western European versions (UK English, French, German, Spanish and Italian) is slightly different from the trimmed-down Mac OS Roman.
*It tends to the needs of European localizations by adding back the following characters:
*It tends to the needs of European localizations by adding back the following characters:<br>German ß; French Ê and û; French/Italian ù; Spanish/Italian Ú and ú (relatively rare).
:German ß; French Ê and û; French/Italian ù; Spanish/Italian Ú and ú (relatively rare).
:'''N.B.''' The characters Æ and ÿ are not reinstated, despite their (very rare) occurrence in French script.
:'''N.B.''' The characters Æ and ÿ are not reinstated, despite their (very rare) occurrence in French script.
*Awkwardly enough, the six characters are not restored in their original positions (grey-on-black), but take the place of math symbols. Four more "math" positions are inexplicably filled with three duplicate characters (œ, ¡ and ª) and the truly enigmatic character ʖ̇ (which doesn't seem to be a character in any known language and is not in Unicode).
*Awkwardly enough, the six characters are not restored in their original positions (grey-on-black), but take the place of math symbols.<br/>Four more "math" positions are inexplicably filled with three duplicate characters (œ, ¡ and ª) and a truly enigmatic ʖ̇ , which doesn't seem to occur in any known language and has no dedicated code point in Unicode.
:'''N.B.''' The broken italic font variants do not fully implement the 10 new glyphs and use a regular question mark instead of the  ʖ̇.
:'''N.B.''' The broken italic font variants (see [[/Fonts#Italic|HERE]]) do not fully implement the 10 new glyphs and use a regular question mark instead of the  ʖ̇.
{|border=1 cellpadding=3 cellspacing=0
{|border=1 cellpadding=3 cellspacing=0
|-bgcolor=silver
|-bgcolor=silver
Line 146: Line 149:
|-
|-
!bgcolor=silver|0x2...
!bgcolor=silver|0x2...
!<span style="color:silver;font-size:70%">SP</span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
!<span style="color:silver;font-size:80%"><sup>S</sup><sub>P</sub></span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
|-
|-
!bgcolor=silver|0x3...
!bgcolor=silver|0x3...
Line 254: Line 257:
Coincidentally, with the 10 new glyphs, the European code page has exactly 96 glyphs in the US-ASCII half and 96 in the extension half (blue).
Coincidentally, with the 10 new glyphs, the European code page has exactly 96 glyphs in the US-ASCII half and 96 in the extension half (blue).
:'''N.B.''' Unlike the US version, all five Western European versions (including UK English) are able to render the full extended ASCII set.
:'''N.B.''' Unlike the US version, all five Western European versions (including UK English) are able to render the full extended ASCII set.
----
----
===Cyrillic===
===Cyrillic===
In the Russian version of Oni, TSFFTahoma fully implements the [[wp:Windows-1251|Windows-1251]] (Cyrillic) code page.
In the Russian version of Oni, TSFFTahoma implements the [[wp:Windows-1251|Windows-1251]] (Cyrillic) code page, with some deviations.
*All the Windows-1251 characters are present, although only 66 (purple) are used by Russian script.  
*The character 0x98, normally non-printable, is implemented as a visible box glyph (☐), slightly larger than 0x7F.
*The character 0x98 is normally non-printable, but in this font is visible as a box glyph (☐), like 0x7F.
*The character 0x81, normally a "Ѓ" glyph, is replaced with a thin space of inconsistent size (2px wide for most fonts, 3px for 13pt regular and 16pt regular).
*Apart from 0x20, there are two whitespace characters: the [[wp:Non-breaking space|non-breaking space]] and the [[wp:Soft hyphen|soft hyphen]].
*The character 0xA0, normally a [[wp:Non-breaking space|non-breaking space]], is a space of not-so-consistent size (anywhere from single to triple width, depending on the font).
 
*The character 0xAD, normally a [[wp:Soft hyphen|soft hyphen]], is a visible hyphen (similar to the [[wp:Hyphen-minus|hyphen-minus]], 0x2D) for 7pt fonts, and an inconsistently sized space for other fonts.<br/>(Oni's engine could in theory reserve a special treatment for soft hyphens and non-breaking spaces, specified in [[TSFL]]Roman, but in practice there is no such functionality.)
{|border=1 cellpadding=3 cellspacing=0
{|border=1 cellpadding=3 cellspacing=0
|-bgcolor=silver
|-bgcolor=silver
Line 266: Line 271:
|-
|-
!bgcolor=silver|0x2...
!bgcolor=silver|0x2...
!<span style="color:silver;font-size:70%">SP</span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
!<span style="color:silver;font-size:80%"><sup>S</sup><sub>P</sub></span>!!&#33;!!&#34;!!&#35;!!&#36;!!&#37;!!&#38;!!&#39;!!&#40;!!&#41;!!&#42;!!&#43;!!&#44;!!&#45;!!&#46;!!&#47;
|-
|-
!bgcolor=silver|0x3...
!bgcolor=silver|0x3...
Line 284: Line 289:
|-bgcolor=cyan
|-bgcolor=cyan
!bgcolor=silver|0x8...
!bgcolor=silver|0x8...
!Ђ||Ѓ||‚||ѓ||„||…||†||‡||€||‰||Љ||‹||Њ||Ќ||Ћ||Џ
!bgcolor=yellow|<div style="color:gray;font-size:70%;line-height:0.8em">S&nbsp;<br />&nbsp;P</div>
!‚||ѓ||„||…||†||‡||€||‰||Љ||‹||Њ||Ќ||Ћ||Џ
|-bgcolor=cyan
|-bgcolor=cyan
!bgcolor=silver|0x9...
!bgcolor=silver|0x9...
Line 290: Line 297:
|-bgcolor=cyan
|-bgcolor=cyan
!bgcolor=silver|0xA...
!bgcolor=silver|0xA...
!<span style="color:gray;font-size:70%">NBSP</span>||Ў||ў||Ј||¤||Ґ||¦||§||bgcolor=fuchsia|Ё||©||Є||«||¬||<span style="color:gray;font-size:70%">SHY</span>||®||Ї  
!bgcolor=orange|<div style="color:gray;font-size:70%;line-height:0.8em">NB<br />SP</div>
!Ў||ў||Ј||¤||Ґ||¦||§||bgcolor=fuchsia|Ё||©||Є||«||¬
!bgcolor=orange|
!®||Ї  
|-bgcolor=cyan
|-bgcolor=cyan
!bgcolor=silver|0xB...
!bgcolor=silver|0xB...
Line 307: Line 317:
!р||с||т||у||ф||х||ц||ч||ш||щ||ъ||ы||ь||э||ю||я
!р||с||т||у||ф||х||ц||ч||ш||щ||ъ||ы||ь||э||ю||я
|}
|}
;Italic fonts
:The Russian version only provides an implementation of Windows-1251 for regular and bold fonts. The five italic fonts (7pt, 9pt, 10pt, 12pt and 14pt) have exactly the same data (pixels and glyph descriptors) as for the European iteration of Mac OS Roman. This makes sense because italic fonts are inherently broken (see [[/Fonts#Italic|HERE]]) and thus not used by any text in vanilla Oni.
;Bold 14 font
:Somewhat surprisingly, the size-14 TSFT in the Russian version of TSFFTahoma does not have a complete Windows-1251 code page either. Instead it is limited to the US-ASCII character set (including the "printable delete" box at code point 0x7F), i.e., the upper section of the above table (white background). This causes no issue in vanilla Oni, but only because there is no text that uses bold 14. 
;Incomplete transparency
:A unique "feature" of the Russian/Cyrillic TSFFTahoma is that all the characters in the extended ASCII range (0x80-0xFF) have a slightly opaque background (about 3% opacity) in the regular (non-bold) font variant. This isn't visible in-game, but only because the engine (re)posterizes all the glyphs into 4-bit grayscale when rendering (so that only opacities above 6% are visible).
;Glyph alignment and spacing
:Last but not least, some fonts in the Russian TSFFTahoma have inconsistent vertical alignment, the most blatant example being 12 bold: some glyphs are one pixel shorter or taller than the full line height (ascender+descender), without a properly compensated vertical glyph offset; others simply have pixels that are not properly aligned within a glyph's rectangle. Besides, many glyphs have excessive padding to the left and/or right of a character, which affects readability.<br />'''N.B.''' There are other examples of poor alignment, e.g. for 12 bold, the character 0x9C (њ) has its right side cut off and is thus unusable (luckily it doesn't occur in Russian script).
----
----
===Chinese===
===Chinese===
The Chinese version of Oni has the same TSFFTahoma as the original US version (trimmed-down Mac OS Roman), but the engine cannot interpret the extended ASCII range, and in fact does not use TSFFTahoma at all. Instead the user launches a wrapper mini-app called '''oni.exe''' which in turn executes the code in '''Oni.dat''' (the game binary itself – in fact, the original Oni.exe from the US version), injecting a custom text engine found in '''xfhsm_oni.dll''' and the font data in '''xf_font.dat'''. Text strings that Oni intends to display are then intercepted by xfhsm_oni.dll and the resulting pixel data from xf_font.dat is injected into Oni's OpenGL context.
The Chinese version of Oni has the same TSFFTahoma as the original US version (trimmed-down Mac OS Roman), but the engine cannot interpret the extended ASCII range, and in fact does not use TSFFTahoma at all. Instead the user launches a wrapper mini-app called '''oni.exe''' which executes the main game code in '''Oni.dat''' (a renamed copy of the original Oni.exe from the US version) with text logic injected from '''xfhsm_oni.dll''' and font data loaded from '''xf_font.dat'''. Text strings that Oni intends to display are then intercepted by xfhsm_oni.dll and the resulting pixel data from xf_font.dat is injected into Oni's OpenGL context.


Unlike other versions of Oni, the Chinese font doesn't have a table listing the valid code points along with their "glyph descriptors" (i.e., instructions on how to extract a glyph from the raw pixel data). Instead all the glyphs have a standard size of 16x16 pixels and there are exactly 94x94=8,836 glyphs, filling up a standard [[wp:GB_2312|GB 2312]] plane (''qūwèi''), indexed through a compact numbering scheme known as [[wp:Extended_Unix_Code#EUC-CN|EUC-CN]]: each of the 94x94 code points is indexed by a pair of bytes that are both in the 0xA1-0xFE range. Code points that are not assigned under GB 2312 (e.g. rows 10-15 and 90-94) simply have blank pixel data in the corresponding regions of xf_font.dat.
Unlike other versions of Oni, the Chinese font doesn't have a table listing the valid code points along with their "glyph descriptors" (i.e., instructions on how to extract a glyph from the raw pixel data). Instead all the glyphs have a standard size of 16x16 pixels and there are exactly 94x94=8,836 glyphs, filling up a standard [[wp:GB_2312|GB 2312]] plane (''qūwèi''), indexed through a compact numbering scheme known as [[wp:Extended_Unix_Code#EUC-CN|EUC-CN]]: each of the 94x94 code points is indexed by a pair of bytes that are both in the 0xA1-0xFE range. Code points that are not assigned under GB 2312 (e.g., rows 10-15 and 90-94) simply have blank pixel data in the corresponding regions of xf_font.dat.


The pixel packing used by xf_font.dat is 1-bit black-and-white (i.e., without antialiasing), which is much more space-efficient than the 8-bit grayscale storage used in Oni's [[TSFT]]. Another gain comes from not having any glyph descriptors ([[TSGA]]s). Both a regular and a bold typeface are available (but in one size only, fixed-width 16x16).
The pixel packing used by xf_font.dat is 1-bit black-and-white (i.e., without antialiasing), which is much more space-efficient than the 8-bit grayscale storage used in Oni's [[TSFT]]. Another gain comes from not having any glyph descriptors ([[TSGA]]s). Both a regular and a bold typeface are available (but in one size only, fixed-width 16x16).
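
The indexing and the space saving described above can be sketched in a few lines of Python. This is only an illustration: the row-major ordering of glyphs and the function names are assumptions made for the example, not confirmed details of xf_font.dat.

```python
# Sketch only: glyph order (row-major) and all names are assumptions.
GLYPH_SIDE = 16           # every glyph is 16x16 pixels
ROWS = COLS = 94          # one GB 2312 plane
FIRST, LAST = 0xA1, 0xFE  # valid range for both EUC-CN bytes


def glyph_index(lead, trail):
    """Return the 0-based position in the 94x94 plane, or None if invalid."""
    if not (FIRST <= lead <= LAST and FIRST <= trail <= LAST):
        return None
    return (lead - FIRST) * COLS + (trail - FIRST)


def bytes_per_glyph(bits_per_pixel):
    """Storage needed for one 16x16 glyph at the given pixel depth."""
    return GLYPH_SIDE * GLYPH_SIDE * bits_per_pixel // 8


glyph_count = ROWS * COLS                          # 8,836 glyphs
size_1bit = glyph_count * bytes_per_glyph(1)       # 1-bit black-and-white
size_8bit = glyph_count * bytes_per_glyph(8)       # 8-bit grayscale, as in a TSFT
```

Under these assumptions a single typeface takes 282,752 bytes at 1 bit per pixel, against 2,262,016 bytes if the same plane were stored as 8-bit grayscale.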
Line 318: Line 338:


In theory, EUC-CN allows for single-byte control codes, which would be interpreted as US-ASCII and rendered using Oni's own TSFFTahoma. In practice, all of the strings in the Chinese game data use only two-byte control sequences.
In theory, EUC-CN allows for single-byte control codes, which would be interpreted as US-ASCII and rendered using Oni's own TSFFTahoma. In practice, all of the strings in the Chinese game data use only two-byte control sequences.
----
----
===Japanese===
===Japanese===
Line 835: Line 857:
{{divhide|end}}
{{divhide|end}}


As for the first code page of the Japanese TSFFTahoma, it implements only the 0x20-0x7F range of characters, i.e., is limited to [[wp:US-ASCII|US-ASCII]]. This is consistent with the simplified logic used by the Japanese engine, where any high-bit byte (in the 0x80-0xFF range) is treated as the start of a two-byte sequence (in actual Shift JIS some high-bit bytes are interpreted as half-width kana).
As for the first code page of the Japanese TSFFTahoma, it implements only the 0x20-0x7F range of characters, i.e., is limited to [[wp:US-ASCII|US-ASCII]]. This is consistent with the simplified logic used by the Japanese engine, where any high-bit byte (in the 0x80-0xFF range) is treated as the start of a two-byte sequence. (In actual Shift JIS some high-bit bytes are interpreted as half-width kana, a feature that isn't supported by Oni's engine.)
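
The simplified byte-splitting rule described above can be illustrated as follows. This is a hedged sketch of the rule as stated on this page, not a reconstruction of the engine's actual code; the function name is hypothetical.

```python
# Sketch only: models "any high-bit byte starts a two-byte sequence".
def split_units(data: bytes):
    """Split a byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            # High-bit byte: always consume it together with the next byte.
            units.append(data[i:i + 2])
            i += 2
        else:
            # Plain US-ASCII byte.
            units.append(data[i:i + 1])
            i += 1
    return units
```

For example, `split_units(b"A\x82\xa0B")` yields three units, the middle one being a double-byte sequence. A byte such as 0xB1, which real Shift JIS would read as a single half-width kana, swallows the following byte under this rule: `split_units(b"\xb1A")` yields one two-byte unit.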


It must be noted that, as compared to the separate .fnt files, TSFFTahoma provides a very rudimentary implementation of JIS X 0208 (only coding for 154 double-byte glyphs, whereas the .fnt files implement 1,357) and is essentially useless/unusable.
It must be noted that, as compared to the separate .fnt files, the Japanese TSFFTahoma provides a very rudimentary implementation of JIS X 0208 (only coding for 154 double-byte glyphs, whereas the .fnt files implement 1,357) and is essentially useless/unusable.
*The Japanese engine requires all four .fnt files to be present (bails out if any of them are missing) and uses them for all of the vanilla text strings, which only contain double-byte control codes. Thus, under normal conditions, TSFFTahoma remains completely unused in the Japanese version.
*The Japanese engine requires all four .fnt files to be present (bails out if any of them are missing) and uses them for all of the vanilla text strings, which only contain double-byte control codes. Thus, under normal conditions, TSFFTahoma remains completely unused in the Japanese version.
*If the US engine is used on the Japanese game data, then the .fnt files are ignored (obviously), and the incomplete TSFFTahoma is used to render the Japanese text strings as well as the few English strings supplied by the EXE. Due to the limited character set, many strings end up broken.  
*If the US engine is used on the Japanese game data, then the .fnt files are ignored (obviously), and the incomplete TSFFTahoma is used to render the Japanese text strings as well as the few English strings supplied by the EXE. Due to the limited character set, many strings end up broken.  


Possibly the incomplete Shift JIS code pages present in the Japanese TSFFTahoma represent an early attempt to implement all the glyphs within Oni's existing text system. As the number of kanji increased, supposedly, the TSFT grew prohibitively large due to the use of 8-bit grayscale storage for the pixel data, and the size taken up by the sparsely populated TSGA also increased out of proportion with the rest of the game data. Once the switch to separate .fnt files was made, no one bothered to clean up TSFFTahoma.
It appears that the Japanese localization team initially tried to use Oni's code page system and to fit all the required JIS glyphs into TSFT and TSGA. As the number of kanji increased, the TSFT supposedly grew prohibitively large due to the use of 8-bit grayscale storage for the pixel data, and the size taken up by the sparsely populated TSGA also increased out of proportion with the rest of the game data. At some point the engine switched to separate .fnt files, and no one bothered to clean up the incomplete code pages in TSFFTahoma.


At the time of writing, the code points and pixel data in the Japanese .fnt files have not been thoroughly analyzed and compared with JIS X 0208. We know that 1,357 glyphs are implemented, across 27 "lead bytes" (roughly 50 ''kuten'' rows). This is much smaller than the full ''kuten'' plane, and makes sense in terms of space efficiency. We also know that some code points are non-standard (rearranged) as compared to regular Shift JIS, although we do not yet know if this rearrangement is consistent with any common variation of Shift JIS. As long as Japanese game data contains text strings that match the game's encoding, non-standard code points are not a problem (but should be kept in mind).
At the time of writing, the code points and pixel data in the Japanese .fnt files have not been thoroughly analyzed and compared with JIS X 0208. We know that 1,357 glyphs are implemented, across 27 "lead bytes" (roughly 50 ''kuten'' rows). This is much smaller than the full ''kuten'' plane, and makes sense in terms of space efficiency. We also know that some code points are non-standard (rearranged) as compared to regular Shift JIS, although we do not yet know if this rearrangement is consistent with any common variation of Shift JIS. As long as Japanese game data contains text strings that match the game's encoding, non-standard code points are not a problem (but should be kept in mind).


==Text anomalies==
==Text anomalies==
===Ellipsis issue===
===Ellipsis issue===
Unlike other Western versions (UK English, French, German, Italian, Spanish, Russian), the US engine treats high-bit characters as part of a two-byte control sequence (a provision for Asian encodings), and therefore fails to render any character from the extended ASCII range. This happens twice in English Oni, because the ellipsis (…), encoded at 0xC9, was accidentally used in [[Quotes/Consoles/level_19d|these]] [[Quotes/Consoles/level_19e|two]] text consoles in place of three consecutive dots (probably an auto-substitution by a text editor). The result is that the two lines using a "…" are cut off at the offending character.
Unlike other Western versions (UK English, French, German, Italian, Spanish, Russian), the US engine treats high-bit characters as part of a two-byte control sequence (a provision for Asian encodings), and therefore fails to render any character from the extended ASCII range. This happens twice in English Oni, because the ellipsis character (…), encoded as 0xC9, was accidentally used in [[Quotes/Consoles/level_19d|These]] [[Quotes/Consoles/level_19e|Two]] text consoles in place of three consecutive dots (probably auto-substituted by a text editor). The result is that the two lines using a "…" are cut off at the offending character.
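
A quick way to spot such characters before they reach the US engine is to encode the text as MacRoman and look for high-bit bytes. The sketch below uses Python's built-in <code>mac_roman</code> codec; the helper names are hypothetical and the check is only an illustration, not an existing tool.

```python
# Sketch only: helper names are made up for this example.
def high_bit_offsets(text, encoding="mac_roman"):
    """List (offset, byte) pairs for every byte outside US-ASCII."""
    data = text.encode(encoding)
    return [(i, hex(b)) for i, b in enumerate(data) if b >= 0x80]


def ascii_safe(text):
    """Replace the single ellipsis character with three plain dots."""
    return text.replace("\u2026", "...")
```

Here `high_bit_offsets("Wait\u2026")` reports the ellipsis as byte 0xc9 at offset 4, whereas the output for `ascii_safe("Wait\u2026")` is an empty list.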


===(A1,A0) issue===
===(A1,A0) issue===
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on a standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN control sequence consists of two bytes that are both in the range 0xA1-0xFE (single US-ASCII characters are also allowed in theory).
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on a standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN control sequence consists of two bytes that are both in the range 0xA1-0xFE (single US-ASCII characters are also allowed in theory).


The text strings in the the Chinese version mostly conform to the EUC-CN scheme, except for the rare occurrence of the (A1,A0) sequence. This is not a valid control sequence under any common extension of EUC-CN, and in any case it does not correspond to any pixel data within xf_font.dat, which only covers the standard 94x94 ''qūwèi'' plane, corresponding to a strict 0xA1-0xFE range for the two encoding bytes.
The text strings in the Chinese version mostly conform to the EUC-CN scheme, except for the rare occurrence of the (A1,A0) sequence. This is not a valid control sequence under any common extension of EUC-CN, and in any case it does not correspond to any pixel data within xf_font.dat, which only covers the standard 94x94 ''qūwèi'' plane, corresponding to a strict 0xA1-0xFE range for the two encoding bytes. Any text string including the (A1,A0) sequence is broken off at the offending character: this is known to occur for
 
Another illegal sequence is (0xA3,0x89), which occurs only in the SUBTmessages entry xdash1 (five identical glyphs at the end of the string).
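
Sequences of this kind can be found mechanically. The following is a minimal sketch, assuming that every high-bit byte opens a two-byte sequence and that both bytes must fall within 0xA1-0xFE; the function name is hypothetical.

```python
# Sketch only: assumes strict EUC-CN with both bytes in 0xA1-0xFE.
def invalid_pairs(data: bytes):
    """Return (offset, lead, trail) for every out-of-range two-byte sequence."""
    bad = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            i += 1  # single US-ASCII byte, nothing to check
            continue
        # A lead byte at the very end of the data has no trail byte.
        trail = data[i + 1] if i + 1 < len(data) else None
        lead_ok = 0xA1 <= lead <= 0xFE
        trail_ok = trail is not None and 0xA1 <= trail <= 0xFE
        if not (lead_ok and trail_ok):
            bad.append((i, lead, trail))
        i += 2
    return bad
```

Both sequences mentioned above are flagged because of their second byte: 0xA0 and 0x89 lie below 0xA1.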


===Over-tall text===
===Over-tall text===
Although not strictly speaking a font issue, some of Oni's text fails to render because it doesn't fit vertically into a fixed-size frame (such as a [[:File:DATA_CONSOLE.png|text console]]). This is known to happen for [[Quotes/Consoles/level_1e|these]] [[Quotes/Consoles/level_8b|two]] consoles in the English version, and possibly for other screens in other language versions.
Although not strictly speaking a font issue, some of Oni's text fails to render because it doesn't fit vertically into a fixed-size frame (such as a [[:File:DATA_CONSOLE.png|text console]]). This is known to happen for [[Quotes/Consoles/level_1e|These]] [[Quotes/Consoles/level_8b|Two]] consoles in the English version, and possibly for other screens in other language versions.


===Over-long text===
===Over-long text===
Chinese glyphs have a fixed size of 16x16 pixels and do not fit horizontally into the drop-down lists, causing a disruptive line wrap to take place (this is visible in the Options screen's Resolution and Difficulty menus).
Although Chinese text strings typically have a much smaller number of glyphs than English originals, this is not always the case. The Chinese glyphs are also much wider on average, with each glyph taking up 16x16 pixels, and so there are situations where the rendered Chinese line is much wider than the English original, no longer fitting on one line as intended by the context.
 
This is only known to cause a problem for the "resolution" item in the Options menu (a WMM_ generated at runtime). The actual dropdown list is wide enough to accommodate even the longest resolution strings, but the currently selected resolution appears in a small window that is only 150 pixels wide, too narrow even for the shortest resolution string "640×480×16位" (which needs 176 pixels). As a result the active resolution is always displayed on two lines, no longer fitting into the frame vertically and thus unreadable.
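
The 176-pixel figure follows directly from the fixed glyph width. As a small worked example (assuming, as the figure implies, that every character of the string, digits included, is drawn as a 16-pixel-wide glyph):

```python
# Sketch only: assumes every character is rendered 16 pixels wide.
GLYPH_WIDTH = 16     # fixed width of a Chinese-font glyph, in pixels
WINDOW_WIDTH = 150   # width of the "current resolution" window, in pixels


def rendered_width(text):
    """Width in pixels of a string drawn with fixed-width 16px glyphs."""
    return len(text) * GLYPH_WIDTH


label = "640×480×16位"           # 11 characters
width = rendered_width(label)    # 11 * 16
fits = width <= WINDOW_WIDTH
```

With 11 characters the string needs 176 pixels, so `fits` is false and the text wraps onto a second line.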
 
===Chinese SUBT issues===
The Chinese (Windows) version of Oni is unique in that no game content was actually localized except for text. Because of the relative simplicity of the task, the Chinese team did not build a new set of game data files, and merely modified the original .dat and .raw from the US version. [[OBD:WMDD|WMDD]], [[OBD:WMM_|WMM_]] and [[OBD:IGSt|IGSt]] instances were patched inside each level's .dat, whereas the two [[OBD:SUBT|SUBT]] files were patched in level0_Final.raw. In the case of an IGSt, text is stored in a fixed-size array (384 bytes), which has more than enough space for any translated text. WMDD and WMM_ also have fixed-size arrays (256 and 64 bytes, respectively) with at least some spare space. SUBT files, however, have a much more compact storage.
 
The text strings of a SUBT file (stored in level0_Final.raw and indexed from the .dat part of the SUBT) are typically packed right next to each other, separated only by a single null char. Chinese text typically uses fewer glyphs, but each glyph takes up two bytes instead of one, and this also applies to punctuation and the trailing null. Thus for short sentences or interjections it is possible for a Chinese translation to completely fill up the space used by the original string and even extend into the next entry.
 
None of the Chinese translations in SUBTmessages or SUBTsubtitles are actually longer than the original English text, and it is only the extra null byte that intrudes on the next entry's handle on several occasions. The affected handle essentially becomes a null string, and the corresponding subtitle is never found and displayed.
 
In SUBTmessages this happens only once (the message corresponding to "xf1" overwrites the first character of "xreload", so Konoko is never prompted to reload her gun in the last training room). In SUBTsubtitles there are as many as 29 anomalies, summed up in the following table.
{|
|
{{divhide|&nbsp;List of corrupt subtitle handles in the Chinese version|align=left}}
{|border=1 cellspacing=0
!Culprit handle
!Original of culprit text (null char ° included)
!Victim handle
|-
!01_01_11
|Kerr:&nbsp;&nbsp;Good luck Konoko.°
!<strike>0</strike>1_02_01
|-
!01_03_07
|Griffin:&nbsp;&nbsp;All right Konoko. I'm giving you a shot at this.°
!<strike>0</strike>1_03_07
|-
!02_05_04
|Muro:&nbsp;&nbsp;Let me know when things start to get messy.°
!<strike>0</strike>2_05_05
|-
!02_06_02
|Griffin:&nbsp;&nbsp;Explain.°
!<strike>0</strike>2_06_03
|-
!02_09_03
|Griffin:&nbsp;&nbsp;So she's still stable?°
!<strike>0</strike>2_09_04
|-
!03_10_01
|Barabas:&nbsp;&nbsp;Let's get it on!°
!<strike>0</strike>3_10_02
|-
!03_11_01
|Barabas:&nbsp;&nbsp;She's with them.°
!<strike>0</strike>3_11_02
|-
!04_17_03
|Muro:&nbsp;&nbsp;I can't allow that.°
!<strike>0</strike>4_17_04
|-
!07_22_01
|Konoko:&nbsp;&nbsp;Showtime...°
!<strike>0</strike>7_23_01
|-
!07_26_15
|Konoko:&nbsp;&nbsp;Thanks.°
!<strike>0</strike>7_26_16
|-
!07_26_17
|Cop:&nbsp;&nbsp;No, we haven't secured a single area -- not even our armory.°
!<strike>0</strike>7_26_18
|-
!08_27_03
|Konoko:&nbsp;&nbsp;This is personal.°
!<strike>0</strike>8_27_04
|-
!09_31_02
|Shinatama:&nbsp;&nbsp;You are not who you think you are.°
!<strike>0</strike>9_31_03
|-
!09_31_03
|Konoko:&nbsp;&nbsp;What?°
!<strike>0</strike>9_31_04
|-
!09_31_24
|Konoko:&nbsp;&nbsp;No!°
!<strike>0</strike>9_31_25
|-
!11_40_07
|Mukade:&nbsp;&nbsp;We shall see...°
!<strike>c</strike>11_41_01konoko
|-
!12_46_02
|Konoko:&nbsp;&nbsp;What?°
!<strike>1</strike>2_46_03
|-
!12_46_06
|Konoko:&nbsp;&nbsp;Leave me alone...°
!<strike>1</strike>2_46_07
|-
!13_65_05
|Kerr:&nbsp;&nbsp;This may sting a bit...°
!<strike>1</strike>3_65_06
|-
!13_65_20
|Konoko:&nbsp;&nbsp;Griffin? But why?°
!<strike>1</strike>3_65_21
|-
!13_65_25
|Kerr:&nbsp;&nbsp;Muro.°
!<strike>1</strike>3_65_26
|-
!13_65_36
|Konoko:&nbsp;&nbsp;What?°
!<strike>1</strike>3_65_37
|-
!13_66_03
|Konoko:&nbsp;&nbsp;The crane controls...°
!<strike>1</strike>3_66_04
|-
!14_52_02
|Konoko:&nbsp;&nbsp;Gotcha.°
!<strike>1</strike>4_52_03
|-
!14_52_06
|Konoko:&nbsp;&nbsp;For you? Badly?°
!<strike>1</strike>4_52_07
|-
!00_01_09
|Shinatama:&nbsp;&nbsp;Super!°
!<strike>c</strike>00_01_10Shinatama
|-
!c00_01_10Shinatama
|Shinatama:&nbsp;&nbsp;Great!°
!<strike>0</strike>0_01_11
|-
!civmale3_trigger
|Civilian:&nbsp;&nbsp;Hi there!°
!<strike>c</strike>00_01_100shinatama
|-
!c00_01_101shinatama
|Shinatama:&nbsp;I'm sorry...so sorry! °
!<strike>c</strike>00_01_102shinatama
|}
{{divhide|end}}
|}
The systematic nature of this anomaly suggests that the Chinese team were careful not to exceed the string length of the original, and merely overlooked the extra null char (and of course didn't check the ingame rendition of the subtitles all that thoroughly).  
 


{{OBD}}

Revision as of 15:26, 4 January 2022

Originally created in English, Oni has been translated into the following seven languages: French, Italian, Spanish, German, Russian, Japanese and Chinese.

(An overview of the known language versions can be found HERE, whereas localized content is detailed HERE.)

Depending on the language version, vanilla Oni uses one of the following five encodings to render text:

  • The original US version uses a trimmed-down Mac OS Roman code page that is effectively limited to US-ASCII (96 code points).
  • European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points).
  • The Russian localization uses a full implementation of the Windows-1251 (Cyrillic) code page (224 code points).
  • The Chinese localization uses the EUC-CN implementation of GB 2312 (8,836 code points).
  • The Japanese localization uses 1,357 code points mostly conforming to the Shift JIS implementation of JIS X 0208.

Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.

(A more thorough overview of the glyphs can be found HERE.)


Encodings

US English

Below is the code page implemented by TSFFTahoma in the US English version of Oni. It is based on Mac OS Roman ("MacRoman" for short), but with two differences:

  • Of the 223 printable glyphs provided by MacRoman, 42 are not implemented in TSFFTahoma (shown as grey-on-black).
  • Control point 0x7F (a typically non-printable "delete" character) has a visible box-like glyph (◻) in this implementation.
  ...0 ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...A ...B ...C ...D ...E ...F
0x2... SP ! " # $ % & ' ( ) * + , - . /
0x3... 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x4... @ A B C D E F G H I J K L M N O
0x5... P Q R S T U V W X Y Z [ \ ] ^ _
0x6... ` a b c d e f g h i j k l m n o
0x7... p q r s t u v w x y z { | } ~
0x8... Ä Å Ç É Ñ Ö Ü á à â ä ã å ç é è
0x9... ê ë í ì î ï ñ ó ò ô ö õ ú ù û ü
0xA... ° £ § ß ® © ´ ¨ Æ Ø
0xB... ± Ұ µ π ª º Ω æ ø
0xC... ¿ ¡ ¬ ƒ « » NBSP À Ã Õ Œ œ
0xD... ÷ ÿ Ÿ ¤
0xE... · Â Ê Á Ë È Í Î Ï Ì Ó Ô
0xF... (Apple logo) Ò Ú Û Ù ı ˆ ˜ ¯ ̆ ̇ ̊ ̧ ̋ ̨ ̌
Minor notes
  • The MacRoman layout was apparently "borrowed" before 1998, when Mac OS 8.5 came out and the international currency sign a.k.a. scarab (¤), at 0xDB, was replaced with the euro symbol (€).
  • The actual font (see HERE) has some unusual typographical features, such as a single-stroke Yen/Yuan symbol (Ұ) and a vertical-stroke cent symbol (¢).
Major notes
  • Some of the removed glyphs (most importantly ß, ù and û, but also Ê, Ú and ú) occur in common European languages. This made the US TSFFTahoma unsuitable for EFIGS localizations, requiring the creation of a new version (see below).
  • The US engine actually cannot interpret any code points beyond the US-ASCII range (first 6 rows, white background), notably failing on "…" (see "Ellipsis issue" below). This is because of a provision for Asian encoding systems (EUC-CN and Shift JIS), which use two-byte sequences starting with a high-bit byte.



European

The code page used by the five Western European versions (UK English, French, German, Spanish and Italian) is slightly different from the trimmed-down Mac OS Roman.

  • It tends to the needs of European localizations by adding back the following characters:
    German ß; French Ê and û; French/Italian ù; Spanish/Italian Ú and ú (relatively rare).
N.B. The characters Æ and ÿ are not reinstated, despite their (very rare) occurrence in French script.
  • Awkwardly enough, the six characters are not restored in their original positions (grey-on-black), but take the place of math symbols.
    Four more "math" positions are inexplicably filled with three duplicate characters (œ, ¡ and ª) and a truly enigmatic ʖ̇ , which doesn't seem to occur in any known language and has no dedicated code point in Unicode.
N.B. The broken italic font variants (see HERE) do not fully implement the 10 new glyphs and use a regular question mark instead of the ʖ̇.
  ...0 ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...A ...B ...C ...D ...E ...F
0x2... SP ! " # $ % & ' ( ) * + , - . /
0x3... 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x4... @ A B C D E F G H I J K L M N O
0x5... P Q R S T U V W X Y Z [ \ ] ^ _
0x6... ` a b c d e f g h i j k l m n o
0x7... p q r s t u v w x y z { | } ~
0x8... Ä Ç É Ñ Ö Ü á à â ä ã å ç é è
0x9... ê ë í ì î ï ñ ó ò ô ö ú ù û ü
0xA... £ § ß ® © ´ ¨ Ø
0xB... ± Ұ µ Ê Ú ù ú û ª ß œ æ ø
0xC... ¿ ¡ ¬ ¡ ƒ ʖ̇ ª « » À Õ Œ œ
0xD... ÷ Ÿ ¤
0xE... Â Ê Á Ë È Í Î Ï Ì Ó Ô
0xF... Ò Ú Û Ù ˆ ˜ ¯

Coincidentally, with the 10 new glyphs, the European code page has exactly 96 glyphs in the US-ASCII half and 96 in the extension half (blue).

N.B. Unlike the US version, all five Western European versions (including UK English) are able to render the full extended ASCII set.



Cyrillic

In the Russian version of Oni, TSFFTahoma implements the Windows-1251 (Cyrillic) code page, with some deviations.

  • The character 0x98, normally non-printable, is implemented as a visible box glyph (☐), slightly larger than 0x7F.
  • The character 0x81, normally a "Ѓ" glyph, is replaced with a thin space of inconsistent size (2px wide for all fonts, 3px for 13pt regular and 16pt regular).
  • The character 0xA0, normally a non-breaking space, is a space of not-so-consistent size (anywhere from single to triple width, depending on the font).
  • The character 0xAD, normally a soft hyphen, is a visible hyphen (similar to the hyphen-minus, 0x2D) for 7pt fonts, and an inconsistently sized space for other fonts.
    (Oni's engine could in theory reserve a special treatment for soft hyphens and non-breaking spaces, specified in TSFLRoman, but in practice there is no such functionality.)
  ...0 ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...A ...B ...C ...D ...E ...F
0x2... SP ! " # $ % & ' ( ) * + , - . /
0x3... 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x4... @ A B C D E F G H I J K L M N O
0x5... P Q R S T U V W X Y Z [ \ ] ^ _
0x6... ` a b c d e f g h i j k l m n o
0x7... p q r s t u v w x y z { | } ~
0x8... Ђ (thin space) ѓ Љ Њ Ќ Ћ Џ
0x9... ђ љ њ ќ ћ џ
0xA... NBSP Ў ў Ј ¤ Ґ ¦ § Ё © Є « ¬ ® Ї
0xB... ° ± І і ґ µ · ё є » ј Ѕ ѕ ї
0xC... А Б В Г Д Е Ж З И Й К Л М Н О П
0xD... Р С Т У Ф Х Ц Ч Ш Щ Ъ Ы Ь Э Ю Я
0xE... а б в г д е ж з и й к л м н о п
0xF... р с т у ф х ц ч ш щ ъ ы ь э ю я
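For comparison, the standard Windows-1251 meaning of the four code points singled out above (0x81, 0x98, 0xA0 and 0xAD) can be looked up with Python's built-in "cp1251" codec. This is only a reference check against the standard code page; it says nothing about Oni's own font data, and the helper name is made up for illustration.

```python
import unicodedata

def standard_cp1251(byte_value):
    """Return the Unicode name of a byte under standard Windows-1251,
    or None if the code point is unassigned in the standard code page."""
    try:
        char = bytes([byte_value]).decode("cp1251")
    except UnicodeDecodeError:
        return None
    return unicodedata.name(char, "UNNAMED")

# The four code points where the Russian TSFFTahoma deviates from the standard.
for code in (0x81, 0x98, 0xA0, 0xAD):
    print(f"0x{code:02X}: {standard_cp1251(code)}")
```

The codec reports 0x98 as unassigned, which is consistent with it being "normally non-printable" in the description above.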
Italic fonts
The Russian version only provides an implementation of Windows-1251 for regular and bold fonts. The five italic fonts (7pt, 9pt, 10pt, 12pt and 14pt) have exactly the same data (pixels and glyph descriptors) as for the European iteration of Mac OS Roman. This makes sense because italic fonts are inherently broken (see HERE) and thus not used by any text in vanilla Oni.
Bold 14 font
Somewhat surprisingly, the size-14 TSFT in the Russian version of TSFFTahoma does not have a complete Windows-1251 code page either. Instead it is limited to the US-ASCII character set (including the "printable delete" box at code point 0x7F), i.e., the upper section of the above table (white background). This causes no issue in vanilla Oni, but only because there is no text that uses bold 14.
Incomplete transparency
A unique "feature" of the Russian/Cyrillic TSFFTahoma is that all the characters in the extended ASCII range (0x80-0xFF) have a slightly opaque background (about 3% opacity) in the regular (non-bold) font variant. This isn't visible ingame, but only because the engine (re)posterizes all the glyphs into 4-bit grayscale when rendering (so that only opacities above 6% are visible).
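The effect of the posterization can be sketched as follows. The exact rounding used by the engine is not documented here, so this sketch assumes the simplest possible scheme (dropping the low four bits of an 8-bit opacity); under that assumption the visibility threshold comes out at roughly 6%, in line with the figure quoted above. Both function names are hypothetical.

```python
def percent_to_alpha8(percent):
    """Convert an opacity percentage to an 8-bit value (0-255)."""
    return round(percent * 255 / 100)

def posterize_to_4bit(alpha8):
    """ASSUMPTION: 8-bit to 4-bit reduction by truncation (drop the low 4 bits)."""
    return alpha8 >> 4

faint_background = percent_to_alpha8(3)          # about 3% opacity
print(posterize_to_4bit(faint_background))       # level 0, i.e. fully transparent

# Smallest 8-bit opacity that survives as a non-zero 4-bit level.
threshold = next(a for a in range(256) if posterize_to_4bit(a) > 0)
print(threshold, f"{threshold / 255:.1%}")
```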
Glyph alignment and spacing
Last but not least, some fonts in the Russian TSFFTahoma have inconsistent vertical alignment, the most blatant example being 12 bold: some glyphs are one pixel shorter or taller than the full line height (ascender+descender), without a properly compensated vertical glyph offset; others simply have pixels that are not properly aligned within a glyph's rectangle. Besides, many glyphs have excessive padding to the left and/or right of a character, which affects readability.
N.B. There are other examples of poor alignment, e.g. for 12 bold, the character 0x9C (њ) has its right side cut off and is thus unusable (luckily it doesn't occur in Russian script).



Chinese

The Chinese version of Oni has the same TSFFTahoma as the original US version (trimmed-down Mac OS Roman), but the engine cannot interpret the extended ASCII range, and in fact does not use TSFFTahoma at all. Instead the user launches a wrapper mini-app called oni.exe which executes the main game code in Oni.dat (a renamed copy of the original Oni.exe from the US version) with text logic injected from xfhsm_oni.dll and font data loaded from xf_font.dat. Text strings that Oni intends to display are then intercepted by xfhsm_oni.dll and the resulting pixel data from xf_font.dat is injected into Oni's OpenGL context.

Unlike other versions of Oni, the Chinese font doesn't have a table listing the valid code points along with their "glyph descriptors" (i.e., instructions on how to extract a glyph from the raw pixel data). Instead all the glyphs have a standard size of 16x16 pixels and there are exactly 94x94=8,836 glyphs, filling up a standard GB 2312 plane (qūwèi), indexed through a compact numbering scheme known as EUC-CN: each of the 94x94 code points is indexed by a pair of bytes that are both in the 0xA1-0xFE range. Code points that are not assigned under GB 2312 (e.g., rows 10-15 and 90-94) simply have blank pixel data in the corresponding regions of xf_font.dat.

The pixel packing used by xf_font.dat is 1-bit black-and-white (i.e., without antialiasing), which is much more space-efficient than the 8-bit grayscale storage used in Oni's TSFT. Another gain comes from not having any glyph descriptors (TSGAs). Both a regular and a bold typeface are available (but in one size only, fixed-width 16x16).
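As a rough illustration of the indexing and of the space savings, the sketch below maps an EUC-CN byte pair to a glyph slot and compares storage sizes. The 94x94 grid, the 0xA1-0xFE byte range, the 16x16 glyph size and the 1-bit versus 8-bit depths come from the description above; the row-major ordering, the absence of a file header and the `base` parameter are assumptions, since the actual layout of xf_font.dat has not been verified here.

```python
GLYPH_SIDE = 16
ROWS = COLS = 94
BYTES_PER_GLYPH_1BIT = GLYPH_SIDE * GLYPH_SIDE // 8   # 32 bytes
BYTES_PER_GLYPH_8BIT = GLYPH_SIDE * GLYPH_SIDE        # 256 bytes

def glyph_index(lead, trail):
    """Slot number (0..8835) of an EUC-CN byte pair in the 94x94 plane."""
    if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError(f"not a valid EUC-CN pair: ({lead:#04x}, {trail:#04x})")
    return (lead - 0xA1) * COLS + (trail - 0xA1)

def glyph_offset(lead, trail, base=0):
    """ASSUMPTION: glyphs are stored row-major, back to back, starting at `base`."""
    return base + glyph_index(lead, trail) * BYTES_PER_GLYPH_1BIT

lead, trail = "啊".encode("gb2312")       # first hanzi of GB 2312 (row 16, cell 1)
print(hex(lead), hex(trail), glyph_index(lead, trail), glyph_offset(lead, trail))

print(ROWS * COLS * BYTES_PER_GLYPH_1BIT)   # bytes per typeface at 1 bit per pixel
print(ROWS * COLS * BYTES_PER_GLYPH_8BIT)   # same plane at 8 bits per pixel
```

Under these assumptions one full typeface takes 282,752 bytes at 1 bit per pixel, against 2,262,016 bytes if the same plane were stored as 8-bit grayscale.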

At the time of writing, the pixel data in xf_font.dat has not been thoroughly analyzed and compared with GB 2312, so we do not know for sure if all the GB 2312 glyphs are implemented or if there are some additional blanks. The encoding may also be one of several extensions of EUC-CN, although it should be kept in mind that control bytes need to remain inside the 0xA1-0xFE range for the raw 94x94 layout to work.

In theory, EUC-CN allows for single-byte control codes, which would be interpreted as US-ASCII and rendered using Oni's own TSFFTahoma. In practice, all of the strings in the Chinese game data use only two-byte control sequences.



Japanese

Japanese Oni uses a custom two-byte encoding that is mostly consistent with Shift JIS but with some of the control sequences rearranged in seemingly non-standard ways. Like Chinese Oni, the glyph data is stored in new, external files; in this case they are .fnt files stored in GameDataFolder. Three font sizes are available, with pixel sizes 11x11 (JPN_SMALL.fnt), 12x12 (JPN_MIDDLE.fnt) and 14x14 (JPN_BIG.fnt). The 14x14 font has a bold-face variant (JPN_BOLD.fnt). All four fonts are fixed-width, i.e. all glyphs have a square bounding box.

Unlike the Chinese version, the TSFFTahoma contained in the Japanese game data is not limited to the ASCII code page. There are a total of 154 double-byte code points (Romaji, punctuation, kana and kanji) across 19 code pages (TSGA) each corresponding to a different "lead byte" (0x81, 0x82, 0x83, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F, 0x90, 0x91, 0x92, 0x93, 0x95, 0x96, 0x97 and 0x98).

As for the first code page of the Japanese TSFFTahoma, it implements only the 0x20-0x7F range of characters, i.e., is limited to US-ASCII. This is consistent with the simplified logic used by the Japanese engine, where any high-bit byte (in the 0x80-0xFF range) is treated as the start of a two-byte sequence. (In actual Shift JIS some high-bit bytes are interpreted as half-width kana, a feature that isn't supported by Oni's engine.)
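The simplified rule can be sketched as a small tokenizer. This is a model of the behavior described above, not code taken from the engine, and the function name is hypothetical. The last lines contrast it with standard Shift JIS (as implemented by Python's "shift_jis" codec), where 0xB1 is a complete single-byte half-width kana.

```python
def split_units(data):
    """Split a byte string into 1-byte and 2-byte units, treating ANY
    high-bit byte (0x80-0xFF) as the lead byte of a two-byte sequence."""
    units = []
    i = 0
    while i < len(data):
        step = 2 if data[i] >= 0x80 else 1
        units.append(data[i:i + step])
        i += step
    return units

# 0x82 0xA0 is hiragana "a" in Shift JIS; it is kept together as one unit.
print(split_units(b"A\x82\xa0B"))

# Standard Shift JIS: 0xB1 alone is half-width katakana "a".
print(b"\xb1".decode("shift_jis"))
# Simplified rule: 0xB1 swallows the following byte instead.
print(split_units(b"\xb1A"))
```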

It must be noted that, as compared to the separate .fnt files, the Japanese TSFFTahoma provides a very rudimentary implementation of JIS X 0208 (only coding for 154 double-byte glyphs, whereas the .fnt files implement 1,357) and is essentially useless/unusable.

  • The Japanese engine requires all four .fnt files to be present (bails out if any of them are missing) and uses them for all of the vanilla text strings, which only contain double-byte control codes. Thus, under normal conditions, TSFFTahoma remains completely unused in the Japanese version.
  • If the US engine is used on the Japanese game data, then the .fnt files are ignored (obviously), and the incomplete TSFFTahoma is used to render the Japanese text strings as well as the few English strings supplied by the EXE. Due to the limited character set, many strings end up broken.

It appears that the Japanese localization team initially tried to put Oni's code page system to use, and to fill in all the required JIS glyphs into TSFT and TSGA. As the number of kanji increased, supposedly, the TSFT grew prohibitively large due to the use of 8-bit grayscale storage for the pixel data, and the size taken up by the sparsely populated TSGA also increased out of proportion with the rest of the game data. At some point the engine switched to separate .fnt files, and somehow no one bothered to clean up the incomplete code pages in TSFFTahoma.

At the time of writing, the code points and pixel data in the Japanese .fnt files have not been thoroughly analyzed and compared with JIS X 0208. We know that 1,357 glyphs are implemented, across 27 "lead bytes" (roughly 50 kuten rows). This is much smaller than the full kuten plane, and makes sense in terms of space efficiency. We also know that some code points are non-standard (rearranged) as compared to regular Shift JIS, although we do not yet know if this rearrangement is consistent with any common variation of Shift JIS. As long as Japanese game data contains text strings that match the game's encoding, non-standard code points are not a problem (but should be kept in mind).


Text anomalies

Ellipsis issue

Unlike other Western versions (UK English, French, German, Italian, Spanish, Russian), the US engine treats high-bit characters as part of a two-byte control sequence (a provision for Asian encodings), and therefore fails to render any character from the extended ASCII range. This happens twice in English Oni, because the ellipsis character (…), encoded as 0xC9, was accidentally used in two text consoles in place of three consecutive dots (probably auto-substituted by a text editor). The result is that the two lines using a "…" are cut off at the offending character.
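The following sketch reproduces the symptom under a deliberately simplified model: output simply stops at the first high-bit byte. The real engine logic is not shown on this page, so `visible_part` is a hypothetical stand-in, and the sample line is invented. The encoding step uses Python's built-in "mac_roman" codec, which maps "…" to 0xC9.

```python
def visible_part(raw):
    """Simplified model of the US engine: the line is cut off at the
    first byte in the 0x80-0xFF range."""
    for i, value in enumerate(raw):
        if value >= 0x80:
            return raw[:i]
    return raw

sample = "Stand by… loading"                 # hypothetical console line
encoded = sample.encode("mac_roman")
print(hex(encoded[encoded.index(0xC9)]))      # the ellipsis is a single byte, 0xC9
print(visible_part(encoded))                  # everything after "Stand by" is lost

# Replacing the ellipsis with three ASCII dots keeps the line intact.
fixed = sample.replace("…", "...").encode("mac_roman")
print(visible_part(fixed) == fixed)
```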

(A1,A0) issue

Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on a standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN control sequence consists of two bytes that are both in the range 0xA1-0xFE (single US-ASCII characters are also allowed in theory).

The text strings in the Chinese version mostly conform to the EUC-CN scheme, except for the rare occurrence of the (A1,A0) sequence. This is not a valid control sequence under any common extension of EUC-CN, and in any case it does not correspond to any pixel data within xf_font.dat, which only covers the standard 94x94 quwei plane, corresponding to a strict 0xA1-0xFE range for the two encoding bytes. Any text string including the (A1,A0) sequence is broken off at the offending character: this is known to occur for

Another illegal sequence is (0xA3,0x89), which occurs only in the SUBTmessages entry xdash1 (five identical glyphs at the end of the string).
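A scan for such sequences could look like the sketch below. It assumes, as stated above for the Chinese game data, that a string consists only of two-byte sequences; the function names are hypothetical and the sample bytes are made up to include the two illegal pairs mentioned here.

```python
def in_plane(value):
    """True if a byte lies in the 0xA1-0xFE range required by EUC-CN."""
    return 0xA1 <= value <= 0xFE

def find_invalid_pairs(data):
    """Return (offset, lead, trail) for every byte pair outside the 94x94 plane.
    ASSUMPTION: the string contains two-byte sequences only; an odd trailing
    byte, if any, is ignored."""
    bad = []
    for offset in range(0, len(data) - 1, 2):
        lead, trail = data[offset], data[offset + 1]
        if not (in_plane(lead) and in_plane(trail)):
            bad.append((offset, lead, trail))
    return bad

sample = bytes([0xB0, 0xA1,   # valid pair
                0xA1, 0xA0,   # the (A1,A0) sequence
                0xA3, 0x89])  # the (0xA3,0x89) sequence
for offset, lead, trail in find_invalid_pairs(sample):
    print(f"offset {offset}: ({lead:#04x}, {trail:#04x})")
```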

Over-tall text

Although not strictly speaking a font issue, some of Oni's text fails to render because it doesn't fit vertically into a fixed-size frame (such as a text console). This is known to happen for two consoles in the English version, and possibly for other screens in other language versions.

Over-long text

Although Chinese text strings typically have a much smaller number of glyphs than English originals, this is not always the case. The Chinese glyphs are also much wider on average, with each glyph taking up 16x16 pixels, and so there are situations where the rendered Chinese line is much wider than the English original, no longer fitting on one line as intended by the context.

This is only known to cause a problem for the "resolution" item in the Options menu (a WMM_ generated at runtime). The actual dropdown list is wide enough to accommodate even the longest resolution strings, but the currently selected resolution appears in a small window that is only 150 pixels wide, too narrow even for the shortest resolution string "640×480×16位" (which needs 176 pixels). As a result the active resolution is always displayed on two lines, no longer fitting into the frame vertically and thus unreadable.
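The 176-pixel figure follows directly from the fixed glyph width, as the short calculation below shows. It assumes that every character of the string, digits and "×" included, is drawn as one 16-pixel-wide glyph, which matches the all-two-byte strings of the Chinese game data; the constant and function names are illustrative only.

```python
GLYPH_WIDTH = 16    # fixed width of every Chinese-font glyph, in pixels
FRAME_WIDTH = 150   # width of the "currently selected resolution" window

def rendered_width(text):
    """Width of a line in pixels, assuming one fixed-width glyph per character."""
    return len(text) * GLYPH_WIDTH

label = "640×480×16位"
print(len(label), rendered_width(label))     # glyph count and total width
print(FRAME_WIDTH // GLYPH_WIDTH)            # glyphs that fit on one line
print(rendered_width(label) > FRAME_WIDTH)   # True: the string has to wrap
```

With 11 glyphs needed and room for only 9, even the shortest resolution string wraps onto a second line.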

Chinese SUBT issues

The Chinese (Windows) version of Oni is unique in that no game content was actually localized except for text. Because of the relative simplicity of the task, the Chinese team did not build a new set of game data files, and merely modified the original .dat and .raw from the US version. WMDD, WMM_ and IGSt instances were patched inside each level's .dat, whereas the two SUBT files were patched in level0_Final.raw. In the case of an IGSt, text is stored in a fixed-size array (384 bytes), which has more than enough space for any translated text. WMDD and WMM_ also have fixed-size arrays (256 and 64 bytes, respectively) with at least some spare space. SUBT files, however, have a much more compact storage.

The text strings of a SUBT file (stored in level0_Final.raw and indexed from the .dat part of the SUBT) are typically packed right next to each other, separated only by a single null char. Chinese text typically uses fewer glyphs, but each glyph takes up two bytes instead of one, and this also applies to punctuation and the trailing null. Thus for short sentences or interjections it is possible for a Chinese translation to completely fill up the space used by the original string and even extend into the next entry.

None of the Chinese translations in SUBTmessages or SUBTsubtitles are actually longer than the original English text, and it is only the extra null byte that intrudes on the next entry's handle on several occasions. The affected handle essentially becomes a null string, and the corresponding subtitle is never found and displayed.
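The mechanism can be simulated on a toy buffer. The sketch assumes that handles and texts are stored as alternating null-terminated strings, which is what the handle corruption suggests but is not a verified description of the .raw layout. The "translation" is a placeholder of full-width spaces with the same byte length as the English text, not the actual Chinese string, and the follow-up text is invented.

```python
def pack(strings):
    """Pack byte strings back to back, each followed by a single null byte."""
    return bytearray(b"".join(s + b"\x00" for s in strings))

def read_cstring(blob, offset):
    """Read a null-terminated string starting at `offset`."""
    end = blob.index(0, offset)
    return bytes(blob[offset:end])

english = b"Konoko:  Thanks."                  # 16 bytes, from handle 07_26_15
blob = pack([b"07_26_15", english, b"07_26_16", b"(next line)"])

text_offset = blob.index(english)
next_handle_offset = text_offset + len(english) + 1   # just past the single null

# Placeholder translation: 8 two-byte glyphs = 16 bytes, then a TWO-byte null.
translation = b"\xa1\xa1" * 8
payload = translation + b"\x00\x00"

patched = bytearray(blob)
patched[text_offset:text_offset + len(payload)] = payload  # in-place overwrite

print(read_cstring(blob, next_handle_offset))      # handle before patching
print(read_cstring(patched, next_handle_offset))   # empty: first byte is now null
print(read_cstring(patched, next_handle_offset + 1))
```

The translation itself fits exactly, and only the second null byte lands on the first character of the next handle, turning it into an empty string.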

In SUBTmessages this happens only once (the message corresponding to "xf1" overwrites the first character of "xreload", so Konoko is never prompted to reload her gun in the last training room). In SUBTsubtitles there are as many as 29 anomalies, summed up in the following table.

The systematic nature of this anomaly suggests that the Chinese team were careful not to exceed the string length of the original, and merely overlooked the extra null char (and of course didn't check the ingame rendition of the subtitles all that thoroughly).