OBD:Text encoding: Difference between revisions

m (added a little detail on the level-load crash)
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{TOCfloat|side=right}}
Originally created in English, Oni has been translated into the following [[seven]] languages: French, Italian, Spanish, German, Russian, Japanese and Chinese. An overview of the known language versions can be found on [[OBD:Releases]], but the details of these releases' localized content are found on [[OBD:Localization]].
Originally created in English, Oni has been translated into the following [[seven]] languages: French, Italian, Spanish, German, Russian, Japanese and Chinese.
 
:(An overview of the known language versions can be found [[OBD:Versions|HERE]], whereas localized content is detailed [[OBD:Localization|HERE]].)
Depending on the language version, vanilla Oni uses one of the following five encodings to render text:
*The original US version uses a trimmed-down [[wp:Mac_OS_Roman|Mac OS Roman]] code page that is effectively limited to [[wp:ASCII|US-ASCII]] (96 code points used, 256 available).
*The original US version uses a trimmed-down [[wp:Mac OS Roman|Mac OS Roman]] code page that is effectively limited to [[wp:ASCII|US-ASCII]] (96 code points used, 256 available).
*European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points used, 256 available).
*The Russian localization uses a (nearly) full implementation of the [[wp:Windows-1251|Windows-1251]] (Cyrillic) code page (224 code points used, 256 available).
*The Chinese localization uses the [[wp:Extended_Unix_Code#EUC-CN|EUC-CN]] implementation of [[wp:GB_2312|GB 2312]] (7,668 code points used, 8,836 available).
*The Chinese localization uses the [[wp:Extended Unix Code#EUC-CN|EUC-CN]] implementation of [[wp:GB 2312|GB 2312]] (7,668 code points used, 8,836 available).
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift_JIS|Shift JIS]] implementation of [[wp:JIS_X_0208|JIS X 0208]].
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift JIS|Shift JIS]] implementation of [[wp:JIS X 0208|JIS X 0208]].
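For readers who want to inspect Oni text outside the game, the standard encodings named in the list above have close counterparts among Python's built-in codecs. The mapping below is only a rough sketch: the names <code>CLOSEST_CODEC</code> and <code>decode_like</code> are made up for illustration, and because Oni's code pages are trimmed or customized (notably the European variant of Mac OS Roman and the Japanese subset), some code points will decode differently from what the game displays.

```python
# Hypothetical mapping from Oni language versions to the closest STANDARD codec.
# Oni's own code pages are trimmed/customized, so this is an approximation only.
CLOSEST_CODEC = {
    "US English": "mac_roman",
    "European": "mac_roman",   # Oni's European code page deviates from this
    "Russian": "cp1251",
    "Chinese": "gb2312",       # Python's gb2312 codec is the EUC-CN form
    "Japanese": "shift_jis",
}


def decode_like(version: str, raw: bytes) -> str:
    """Decode raw bytes with the nearest standard codec for the given version.

    Undecodable bytes become U+FFFD instead of raising an exception.
    """
    return raw.decode(CLOSEST_CODEC[version], errors="replace")


if __name__ == "__main__":
    # 0xC9 is the horizontal ellipsis in standard Mac OS Roman.
    print(decode_like("US English", b"\xc9"))
    # Six Windows-1251 bytes spelling a Russian word.
    print(decode_like("Russian", b"\xcf\xf0\xe8\xe2\xe5\xf2"))
```

Using <code>errors="replace"</code> keeps the sketch from raising on bytes that the standard tables leave undefined, which is convenient when scanning game data of unknown quality.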
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.
 
:(A more thorough overview of the glyphs can be found [[/Fonts|HERE]].)
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page. A more thorough overview of the glyphs can be found on the [[/Fonts|Fonts subpage]] (to be created).


==Encodings==
===US English===
Below is the code page implemented by [[TSFF]]Tahoma in the US English version of Oni. It is based on [[wp:Mac_OS_Roman|Mac OS Roman]] ("MacRoman" for short), but with two differences:
Below is the code page implemented by [[TSFF]]Tahoma in the US English version of Oni. It is based on [[wp:Mac OS Roman|Mac OS Roman]] ("MacRoman" for short), but with two differences:
*Of the 223 printable glyphs provided by MacRoman, 42 are not implemented in TSFFTahoma (shown as grey-on-black).
*Control point 0x7F (a typically non-printable "delete" character) has a visible box-like glyph (◻) in this implementation.
Line 129: Line 128:
|}
|}
;Minor notes
*The MacRoman layout was apparently "borrowed" before 1998, when Mac OS 8.5 came out and the [[wp:Currency sign (typography)|international currency sign]] a.k.a. scarab (¤), at 0xDB, was replaced with the euro symbol (€).
*The MacRoman layout was apparently "borrowed" before 1998, when Mac OS 8.5 came out and the [[wp:Currency sign (generic)|international currency sign]] a.k.a. scarab (¤), at 0xDB, was replaced with the euro symbol (€).
*The actual font (see [[/Fonts|HERE]]) has some unusual typographical features, such as a single-stroke Yen/Yuan symbol (Ұ) and a vertical-stroke cent symbol similar to Unicode's Fullwidth Cent Sign (¢) character as seen in Windows Arial (note to Mac users: don't be confused, as this character will appear with a diagonal stroke on your system like the regular '¢' character).
;Major notes
*Some of the removed glyphs (most importantly ß, ù and û, but also Ê, Ú and ú) occur in [[wp:Languages of the European Union#Knowledge|common European languages]]. This made the US TSFFTahoma unsuitable for [[wikt:EFIGS|EFIGS]] localizations, requiring the creation of a new version (see below).  
*The US engine actually cannot interpret any code points beyond the US-ASCII range (first 6 rows, white background), notably failing on 0xC9's "…". This is because of a nominal but unused provision for Asian text encodings. See "[[#Ellipsis_issue|Ellipsis issue]]" below for details.
*The US engine actually cannot interpret any code points beyond the US-ASCII range (first 6 rows, white background), notably failing on 0xC9's "…". This is because of a nominal but unused provision for Asian text encodings. See {{SectionLink||Ellipsis issue}} for details.




Line 143: Line 142:
:'''N.B.''' The characters Æ and ÿ are not reinstated, despite their (very rare) occurrence in French script.
*Awkwardly enough, the six characters are not restored in their original positions (grey-on-black), but take the place of math symbols.<br/>Four more "math" positions are inexplicably filled with three duplicate characters (œ, ¡ and ª) and a truly enigmatic ʖ̇ , which doesn't seem to occur in any known language and has no dedicated code point in Unicode (the character you see here was constructed from Unicode's U+0296 Latin Letter Inverted Glottal Stop (ʖ) plus U+0307 Combining Dot Above).
:'''N.B.''' The broken italic font variants (see [[/Fonts#Italic|HERE]]) do not fully implement the 10 new glyphs and use a regular question mark instead of the  ʖ̇.
:'''N.B.''' The broken italic font variants (see "Italic" section of [[/Fonts]] once it exists) do not fully implement the 10 new glyphs and use a regular question mark instead of the  ʖ̇.
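The two-code-point construction of the enigmatic glyph can be checked with Python's standard <code>unicodedata</code> module. This is a small sketch rather than anything taken from Oni's data: it looks up the names of the two code points and shows that NFC normalization leaves the sequence uncombined, which is consistent with there being no precomposed character for it.

```python
import unicodedata

# Base letter followed by a combining mark, as described in the note above.
glyph = "\u0296\u0307"

for ch in glyph:
    # Print each code point with its official Unicode character name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC would merge the pair into a single code point if a precomposed form existed.
composed = unicodedata.normalize("NFC", glyph)
print(len(composed))
```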
{|border=1 cellpadding=3 cellspacing=0
{|border=1 cellpadding=3 cellspacing=0
|-bgcolor=silver
|-bgcolor=silver
Line 319: Line 318:
|}
|}
;Italic fonts
:The Russian version only provides an implementation of Windows-1251 for regular and bold fonts. The five italic fonts (7pt, 9pt, 10pt, 12pt and 14pt) have exactly the same data (pixels and glyph descriptors) as for the European iteration of Mac OS Roman. This makes sense because italic fonts are inherently broken (see [[/Fonts#Italic|HERE]]) and thus not used by any text in vanilla Oni.  
:The Russian version only provides an implementation of Windows-1251 for regular and bold fonts. The five italic fonts (7pt, 9pt, 10pt, 12pt and 14pt) have exactly the same data (pixels and glyph descriptors) as for the European iteration of Mac OS Roman. This makes sense because italic fonts are inherently broken (see "Italic" section of [[/Fonts]] once it exists) and thus not used by any text in vanilla Oni.  
;14pt bold font
:Somewhat surprisingly, the 14pt bold TSFT in the Russian version of TSFFTahoma does not have a complete Windows-1251 code page either. Instead it is limited to the US-ASCII character set (including the "printable delete" box at code point 0x7F), i.e., the upper section of the above table (white background). This causes no issue in vanilla Oni, but only because there is no text that uses 14pt bold.   
Line 342: Line 341:


;N.B.
Unlike for other versions of Oni, an invalid code point does not interrupt the interpretation/rendering of a text string by xfhsm_oni.dll and can lead to a wide range of unexpected behavior: at best, a blank or otherwise unintended glyph will be displayed; at worst the rendered text will be garbled (memory corruption most likely), or the game may simply [[Blam|crash]].
Unlike for other versions of Oni, an invalid code point does not interrupt the interpretation/rendering of a text string by xfhsm_oni.dll and can lead to a wide range of unexpected behavior: at best, a blank or otherwise unintended glyph will be displayed; at worst the rendered text will be garbled (memory corruption most likely), or the game may simply crash with a [[Blam!]] message.


The current understanding is that xfhsm_oni.dll simply turns any two-byte code point QQ WW into the offset [(QQ-0xA1)*0x5E + (WW-0xA1)]*0x20, relative either to the start of the xf_font.dat data (for the 16x16 font) or to the middle of the data (for the small 12x12 font). Depending on the values of QQ and WW, both components of the offset can fall outside the intended 0-93 range, with values as high as 94 and as low as -161. There doesn't seem to be any sanity check, and the only special handling is for QQ=00 (in this case WW is ignored and the string is terminated).
Line 348: Line 347:
A valid EUC-CN code point (with both bytes in the 0xA1-0xFE range) results in a valid offset pointing to an actual glyph for the relevant font, whereas illegal bytes or byte pairs may point to a different glyph within the same font, or to a glyph of the other font, or to a completely unrelated memory region. In the worst case scenario, pixel data will be read at 486,432 bytes (~475 kB) before the start of the actual pixel data (if displaying the code point 01,00 for the large font) or at 3008-3040 bytes (~3 kB) past the end of the actual pixel data (if displaying the code point FF,FF for the small font).
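The offset arithmetic described above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as currently understood, not a reconstruction of the DLL's code: the function and constant names are invented here, and the per-font data size of 94 × 94 glyphs at 0x20 bytes each is an assumption derived from the 8,836 available code points.

```python
GLYPH_BYTES = 0x20   # bytes per glyph, from the formula above
CELLS_PER_ROW = 0x5E  # 94 positions per row
FIRST_BYTE = 0xA1     # lowest valid lead/trail byte in EUC-CN

# Assumed size of one font's pixel data: 94 rows x 94 cells x 32 bytes.
FONT_BYTES = CELLS_PER_ROW * CELLS_PER_ROW * GLYPH_BYTES


def glyph_offset(qq: int, ww: int) -> int:
    """Offset relative to the font's first glyph, with no range check at all."""
    return ((qq - FIRST_BYTE) * CELLS_PER_ROW + (ww - FIRST_BYTE)) * GLYPH_BYTES


def is_valid_euc_cn(qq: int, ww: int) -> bool:
    """True if both bytes lie in the 0xA1-0xFE range."""
    return 0xA1 <= qq <= 0xFE and 0xA1 <= ww <= 0xFE


if __name__ == "__main__":
    for qq, ww in [(0xA1, 0xA1), (0xFE, 0xFE), (0x01, 0x00), (0xFF, 0xFF)]:
        off = glyph_offset(qq, ww)
        print(f"{qq:02X},{ww:02X} valid={is_valid_euc_cn(qq, ww)} offset={off}")
    # Distance past the end of the font data for the code point FF,FF.
    print(glyph_offset(0xFF, 0xFF) - FONT_BYTES)
```

A check such as <code>is_valid_euc_cn</code> is the kind of sanity test that appears to be missing; any tool that generates text for the Chinese version could apply it before the text reaches the game.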


Reading garbage pixel data shouldn't be causing memory corruption per se (merely nonsensical/garbled text), but if similar out-of-bounds pointers occur for glyph rendering, then xfhsm_oni.dll may occasionally overwrite its own memory or even Oni's. This has not been thoroughly investigated, but it seems advisable to ensure that all text consists of valid EUC-CN code points (which is unfortunately not the case, see [[#Invalid EUC-CN input|"Invalid EUC-CN input"]] below).
Reading garbage pixel data shouldn't be causing memory corruption per se (merely nonsensical/garbled text), but if similar out-of-bounds pointers occur for glyph rendering, then xfhsm_oni.dll may occasionally overwrite its own memory or even Oni's. This has not been thoroughly investigated, but it seems advisable to ensure that all text consists of valid EUC-CN code points (which is unfortunately not the case, see {{SectionLink||Invalid EUC-CN input}}).




Line 1,181: Line 1,180:
|}
|}


Without a proper sanity check, some illegal code points will clearly result in pixel data being loaded not from a valid glyph region, but from irrelevant memory that belongs either to xfhsm_oni.dll or to the main Oni engine, resulting in garbled text. Memory corruption or segmentation fault (access violation) may occur if similar out-of-bounds pointers are used when rendering glyph textures. Possibly invalid EUC-CN input is what is causing most Chapters of the Chinese Oni version to crash on modern Windows systems, although this has not been investigated thoroughly.
Without a proper sanity check, some illegal code points will clearly result in pixel data being loaded not from a valid glyph region, but from irrelevant memory that belongs either to xfhsm_oni.dll or to the main Oni engine, resulting in garbled text. Memory corruption or a segmentation fault (access violation) may occur if similar out-of-bounds pointers are used when rendering glyph textures. Invalid EUC-CN input may be what causes most Chapters of the Chinese Oni version to crash on modern Windows systems. However, that crash is different because it happens without the Blam! dialog appearing, and it can be avoided by turning the graphics quality down to Superlow. This indicates an issue related to the amount of memory being used, but it's possible the crash is also text-related; the cause has yet to be determined.


====Non-translated US-ASCII====
ASCII strings are much more harmful when handled by xfhsm_oni.dll, as compared to the two invalid code points (A3,A0) and (A3,0x89), because pairs of US-ASCII bytes, misinterpreted as EUC-CN code points, end up referencing completely strange memory regions (outside the region occupied by xf_font.dat). Unfortunately, there are a few ASCII strings that xfhsm_oni.dll can come across even during regular gameplay, and many more arise if one allows for modding.
=====Count on it=====
The following string in SUBTsubtitles has not been translated into Chinese:
:Barabas:&nbsp;&nbsp;Count on it. When I get through with them they're...
Being encoded as plain US-ASCII, this string is entirely illegal considering the limited implementation of EUC-CN by xfhsm_oni.dll, which does not detect US-ASCII as single-byte code points and keeps interpreting pairs of ASCII bytes as (invalid) quwei indices. Through lucky coincidence, the string has an even number of printable bytes, so that the null character is still in a suitable place for terminating the string (the EUC-CN parser will see it as a null lead-byte and will not keep reading further data). However, the string still consists of 31 invalid two-byte code points (not counting the null). As a further lucky coincidence, this string is never read by Oni's engine, because the subtitle's handle (02_05_05) is one of those that have been clobbered by the spurious double-null (see [[#Chinese_SUBT_issues|"Chinese_SUBT_issues"]] below). Were it not for the clobbering, the game would crash upon displaying this subtitle.
Being encoded as plain US-ASCII, this string is entirely illegal considering the limited implementation of EUC-CN by xfhsm_oni.dll, which does not detect US-ASCII as single-byte code points and keeps interpreting pairs of ASCII bytes as (invalid) quwei indices. Through lucky coincidence, the string has an even number of printable bytes, so that the null character is still in a suitable place for terminating the string (the EUC-CN parser will see it as a null lead-byte and will not keep reading further data). However, the string still consists of 31 invalid two-byte code points (not counting the null). As a further lucky coincidence, this string is never read by Oni's engine, because the subtitle's handle (02_05_05) is one of those that have been clobbered by the spurious double-null (see {{SectionLink||Chinese SUBT issues}}). Were it not for the clobbering, the game would crash upon displaying this subtitle.
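The pairing behavior described above can be imitated in a short Python sketch. The function names are invented for illustration, and the sample bytes assume that the two non-breaking spaces shown in the quoted subtitle are two ordinary ASCII spaces in the game data (an assumption, though it agrees with the count of 31 pairs given above).

```python
def split_pairs(buf: bytes) -> list:
    """Read two bytes at a time and stop only when a LEAD byte is null."""
    pairs = []
    i = 0
    while i < len(buf) and buf[i] != 0x00:
        # The trail byte is taken blindly, even if it is the terminating null.
        trail = buf[i + 1] if i + 1 < len(buf) else 0x00
        pairs.append((buf[i], trail))
        i += 2
    return pairs


def is_valid_euc_cn(pair: tuple) -> bool:
    """True if both bytes lie in the 0xA1-0xFE range."""
    return all(0xA1 <= b <= 0xFE for b in pair)


# Assumed byte content of the untranslated subtitle, null-terminated.
subtitle = b"Barabas:  Count on it. When I get through with them they're...\x00"
pairs = split_pairs(subtitle)
invalid = [p for p in pairs if not is_valid_euc_cn(p)]
print(len(pairs), len(invalid))

# With an odd number of printable bytes, the null is swallowed as a trail byte
# and parsing runs on into whatever data follows the string.
overrun = split_pairs(b"abc\x00de\x00\x00")
print(overrun)
```

The second call shows why the even length matters: the terminator of the three-byte string is consumed as the trail byte of the second pair, so the bytes that follow it are parsed as well.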


=====Pre-beta ONLDs=====