
OBD:Text encoding: Difference between revisions

mystery solved-ish & minor touch-up
(→‎Chinese SUBT issues: filling in the "victim" strings, fwiw)
(mystery solved-ish & minor touch-up)
Line 3: Line 3:
:(An overview of the known language versions can be found [[OBD:Versions|HERE]], whereas localized content is detailed [[OBD:Localization|HERE]].)
:(An overview of the known language versions can be found [[OBD:Versions|HERE]], whereas localized content is detailed [[OBD:Localization|HERE]].)
Depending on the language version, vanilla Oni uses one of the following five encodings to render text:
Depending on the language version, vanilla Oni uses one of the following five encodings to render text:
*The original US version uses a trimmed-down [[wp:Mac_OS_Roman|Mac OS Roman]] code page that is effectively limited to US-ASCII (96 code points).
*The original US version uses a trimmed-down [[wp:Mac_OS_Roman|Mac OS Roman]] code page that is effectively limited to US-ASCII (96 code points used, 256 available).
*European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points).
*European localizations (UK English, French, Italian, Spanish, German) use a custom version of Mac OS Roman (192 code points used, 256 available).
*The Russian localization uses a full implementation of the [[wp:Windows-1251|Windows-1251]] (Cyrillic) code page (224 code points).
*The Russian localization uses a (nearly) full implementation of the [[wp:Windows-1251|Windows-1251]] (Cyrillic) code page (224 code points used, 256 available).
*The Chinese localization uses the [[wp:Extended_Unix_Code#EUC-CN|EUC-CN]] implementation of [[wp:GB_2312|GB 2312]] (8,836 code points).
*The Chinese localization uses the [[wp:Extended_Unix_Code#EUC-CN|EUC-CN]] implementation of [[wp:GB_2312|GB 2312]] (7,668 code points used, 8,836 available).
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift_JIS|Shift JIS]] implementation of [[wp:JIS_X_0208|JIS X 0208]].
*The Japanese localization uses 1,357 code points mostly conforming to the [[wp:Shift_JIS|Shift JIS]] implementation of [[wp:JIS_X_0208|JIS X 0208]].
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.
Properties of the fonts that are eventually used to render the text (via the encoding) are briefly described throughout the page.
Line 337: Line 337:
Two glyph sizes are available: 16x16 glyphs are stored in the first half of xf_font.dat, and 12x12 glyphs in the second half. Each 12x12 glyph is stored in the top left corner of a 16x16 bitmap, so the row/glyph alignment is the same in both cases: 2 bytes per pixel row and 32 bytes per glyph. The pixel packing is 1-bit black-and-white (i.e., without antialiasing), much more space-efficient than the 8-bit grayscale storage used in Oni's [[TSFT]]. Another gain comes from not having any glyph descriptors ([[TSGA]]s), and from having only two fonts instead of Oni's typical 15.
Two glyph sizes are available: 16x16 glyphs are stored in the first half of xf_font.dat, and 12x12 glyphs in the second half. Each 12x12 glyph is stored in the top left corner of a 16x16 bitmap, so the row/glyph alignment is the same in both cases: 2 bytes per pixel row and 32 bytes per glyph. The pixel packing is 1-bit black-and-white (i.e., without antialiasing), much more space-efficient than the 8-bit grayscale storage used in Oni's [[TSFT]]. Another gain comes from not having any glyph descriptors ([[TSGA]]s), and from having only two fonts instead of Oni's typical 15.
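The storage layout described above (2 bytes per pixel row, 32 bytes per glyph, 12x12 glyphs in the top left corner of a 16x16 cell) can be illustrated with a minimal Python sketch. The function name <code>glyph_rows</code> and its parameters are hypothetical, and two details are assumptions not confirmed by the text: that the most significant bit of each 2-byte row is the leftmost pixel, and that a set bit means "ink".

```python
GLYPH_BYTES = 32   # 16 rows x 2 bytes, same cell size for both fonts
ROW_BYTES = 2      # 16 one-bit pixels per row


def glyph_rows(font_data, glyph_index, size=16, base=0):
    """Return one glyph as a list of `size` strings made of '#' and '.'.

    font_data   -- raw contents of the font file (bytes)
    glyph_index -- zero-based index of the glyph within the chosen font
    size        -- 16 for the large font, 12 for the small one
    base        -- byte offset of the font (0, or len(font_data) // 2
                   for the small font stored in the second half)
    """
    start = base + glyph_index * GLYPH_BYTES
    if start < 0 or start + GLYPH_BYTES > len(font_data):
        raise ValueError("glyph lies outside the font data")
    rows = []
    for y in range(size):
        hi = font_data[start + ROW_BYTES * y]
        lo = font_data[start + ROW_BYTES * y + 1]
        word = (hi << 8) | lo  # ASSUMPTION: MSB is the leftmost pixel
        rows.append("".join(
            "#" if word & (0x8000 >> x) else "."  # ASSUMPTION: 1 = ink
            for x in range(size)
        ))
    return rows
```

For the small font, only the first 12 rows and the first 12 bits of each row are read; the rest of the 16x16 cell is ignored.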


All the GB 2312 glyphs listed [[wp:GB_2312#Non-Hanzi_rows|HERE]] and [[wikt:Appendix:Chinese_hanzi_by_GB_2312_quwei_code|HERE]] are implemented, except for the euro sign and the ten glyphs from [[wp:Vertical_Forms|Vertical Forms]].
All the GB 2312 glyphs listed [[wp:GB_2312#Non-Hanzi_rows|HERE]] and [[wikt:Appendix:Chinese_hanzi_by_GB_2312_quwei_code|HERE]] are implemented, except for the euro sign (row 2) and the ten glyphs from [[wp:Vertical_Forms|Vertical Forms]] (row 6). Thus of the 8,836 available code points only 7,668 (including the ideographic space A1,A1) correspond to actual glyphs, whereas the other 1,168 correspond to blank pixel data (indistinguishable from a space).  


;N.B.
Unlike for other versions of Oni, an invalid code point does not interrupt the interpretation/rendering of a text string by xfhsm_oni.dll and can lead to a wide range of unexpected behavior: at best, a blank or otherwise unintended glyph will be displayed; at worst the rendered text will be garbled (memory corruption most likely), or the game may simply [[Blam|crash]].
Unlike for other versions of Oni, an invalid code point does not interrupt the interpretation/rendering of a text string by xfhsm_oni.dll and can lead to a wide range of unexpected behavior: at best, a blank or otherwise unintended glyph will be displayed; at worst the rendered text will be garbled (memory corruption most likely), or the game may simply [[Blam|crash]].
The current understanding is that xfhsm_oni.dll simply turns any two-byte code point QQ WW into the offset [(QQ-0xA1)*0x5E + (WW-0xA1)]*0x20, relative either to the start of the xf_font.dat data (for the 16x16 font) or to the middle of the data (for the small 12x12 font). Both components of the offset can fall outside the intended 0-93 range, with values as high as 94 and as low as -161, depending on the values of QQ and WW, and there doesn't seem to be any sanity check. The only special case is QQ==00, in which case WW is ignored and the string is terminated.
A valid EUC-CN code point (with both bytes in the 0xA1-0xFE range) results in a valid offset pointing to an actual glyph for the relevant font, whereas illegal bytes or byte pairs may point to a different glyph within the same font, or to a glyph of the other font, or to a completely unrelated memory region. In the worst case scenario, pixel data will be read starting 486,432 bytes (~475 kB) before the start of the relevant glyph array (for the code point 01,00) or from the region 3008-3040 bytes (~3 kB) past its end (for the code point FF,FF).
Reading garbage pixel data shouldn't be causing memory corruption per se (merely nonsensical/garbled text), but if similar out-of-bounds pointers occur for glyph rendering, then xfhsm_oni.dll may occasionally overwrite its own memory or even Oni's. This has not been thoroughly investigated, but it seems advisable to ensure that all text consists of valid EUC-CN code points (which is unfortunately not the case, see [[#Invalid EUC-CN input|"Invalid EUC-CN input"]] below).
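The offset arithmetic above can be checked with a short Python sketch. It only mirrors the formula as currently understood and is not a reconstruction of the actual xfhsm_oni.dll code. The names <code>glyph_offset</code>, <code>is_valid_euc_cn</code> and <code>FONT_BYTES</code> are hypothetical.

```python
BASE = 0xA1          # first legal EUC-CN byte value
CELLS = 0x5E         # 94 cells per row, 94 rows
GLYPH_BYTES = 0x20   # 32 bytes of pixel data per glyph
FONT_BYTES = CELLS * CELLS * GLYPH_BYTES  # size of one glyph array


def glyph_offset(qq, ww):
    """Byte offset of the pixel data relative to the start of the
    relevant glyph array, or None if QQ == 0 (string terminator)."""
    if qq == 0:
        return None
    return ((qq - BASE) * CELLS + (ww - BASE)) * GLYPH_BYTES


def is_valid_euc_cn(qq, ww):
    """True if both bytes are in the legal 0xA1-0xFE range."""
    return 0xA1 <= qq <= 0xFE and 0xA1 <= ww <= 0xFE


# Worst cases quoted in the text:
lowest = glyph_offset(0x01, 0x00)                 # far before the array
highest = glyph_offset(0xFF, 0xFF) - FONT_BYTES   # just past its end
print(lowest, highest)
```

Every pair accepted by <code>is_valid_euc_cn</code> yields an offset between 0 and <code>FONT_BYTES - 32</code>; anything else may or may not land inside the array, which is why a range check on the input bytes is the safer test.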




----
----
===Japanese===
===Japanese===
Japanese Oni uses a custom two-byte encoding that is mostly consistent with [[wp:Shift_JIS|Shift JIS]] but with some of the control sequences rearranged in seemingly non-standard ways. Like Chinese Oni, the glyph data is stored in new, external files; in this case they are .fnt files stored in GameDataFolder. Three font sizes are available, with pixel sizes 11x11 ('''JPN_SMALL.fnt'''), 12x12 ('''JPN_MIDDLE.fnt''') and 14x14 ('''JPN_BIG.fnt'''). The 14x14 font has a bold-face variant ('''JPN_BOLD.fnt'''). All four fonts are fixed-width, i.e. all glyphs have a square bounding box.
Japanese Oni uses a custom two-byte encoding that is mostly consistent with [[wp:Shift_JIS|Shift JIS]] but with some of the control sequences rearranged in seemingly non-standard ways. Like Chinese Oni, the glyph data is stored in new, external files; in this case they are .fnt files stored in GameDataFolder. Three font sizes are available, with pixel sizes 11x11 ('''JPN_SMALL.fnt'''), 12x12 ('''JPN_MIDDLE.fnt''') and 14x14 ('''JPN_BIG.fnt'''). The 14x14 font has a bold-face variant ('''JPN_BOLD.fnt'''). All four fonts are fixed-width, i.e. all glyphs have a square bounding box.
Line 877: Line 883:
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on a standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN control sequence consists of two bytes that are both in the range 0xA1-0xFE and anything else is technically illegal (single US-ASCII characters could occur in theory, but are not handled properly by the custom text engine, xfhsm_oni.dll).
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on a standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN control sequence consists of two bytes that are both in the range 0xA1-0xFE and anything else is technically illegal (single US-ASCII characters could occur in theory, but are not handled properly by the custom text engine, xfhsm_oni.dll).
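As a minimal sketch of the rule above (both bytes in 0xA1-0xFE, anything else illegal), the following Python function scans a raw string for offending byte pairs. The name <code>find_invalid_pairs</code> is hypothetical, and the sketch assumes the input has already been extracted as a plain sequence of two-byte units, with no terminator or other framing.

```python
def find_invalid_pairs(data):
    """Return a list of (byte_offset, pair) for every two-byte unit
    that is not a legal EUC-CN code point.

    ASSUMPTION: `data` is a bytes object holding only two-byte units;
    a trailing odd byte is reported as invalid as well.
    """
    bad = []
    for i in range(0, len(data), 2):
        pair = data[i:i + 2]
        legal = len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)
        if not legal:
            bad.append((i, pair))
    return bad


# Example: two valid hanzi with the (A1,A0) sequence between them.
print(find_invalid_pairs(b"\xc4\xe3\xa1\xa0\xba\xc3"))
```

A check of this kind flags both (A1,A0) and (A3,89), as well as any single US-ASCII byte that would misalign the pairs after it.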


The text strings in the Chinese version mostly conform to the EUC-CN scheme. A notable exception is the (A1,A0) sequence, which occurs in a few subtitles and is rendered with a blank glyph (i.e., a space between valid glyphs, undistinguishable from an ordinary ideographic space), apparently due to some kind of wraparound. At the time of writing it is not known what was meant by the (A1,A0) sequence, as it doesn't seem to be a valid control sequence under any common extension of EUC-CN.
The text strings in the Chinese version mostly conform to the EUC-CN scheme. A notable exception is the (A1,A0) sequence, which occurs in a few subtitles and is rendered with a blank glyph (i.e., a space between valid glyphs, indistinguishable from an ordinary ideographic space). It appears that xfhsm_oni.dll simply subtracts 0xA1 from both bytes and then uses them as row-major indices into an array of 32-byte glyphs (94 glyphs per row), so that (A1,A0) is equivalent to (0,-1) and points to the 32-byte region immediately preceding the first glyph in the relevant glyph array. Since subtitles use the small font (second half of xf_font.dat), (0,-1) merely points to the very last glyph of the large font (the last cell of quwei row 94), which happens to be blank.
 
At the time of writing it is not known what was meant by the (A1,A0) sequence, as it doesn't seem to be a valid control sequence under any common extension of EUC-CN.


Another illegal sequence is (0xA3,0x89), which occurs only in the SUBTmessages entry xdash1 (five identical glyphs at the end of the string) and is somehow rendered as a ㈢, which would normally be encoded with (A2,E7). Such an improbable substitution is likely unintentional, and it is not known what the intended glyph was.
Another illegal sequence is (0xA3,0x89), which occurs only in the SUBTmessages entry xdash1 (five identical glyphs at the end of the string) and is rendered as ㈢. In this case, too, xfhsm_oni.dll simply subtracts 0xA1 from both bytes, ending up with (2,-24), which in row-major order (94 glyphs per row) is equivalent to (1,70) and produces the glyph ㈢. The correct EUC-CN code for ㈢ would be (A2,E7), although it is unlikely that this is what the translator meant to write. As with (A1,A0), it is not currently known what the intended glyph was.
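The two substitutions can be reproduced with a small Python sketch. It assumes that each byte is reduced by 0xA1 (the value that yields the (0,-1) and (2,-24) pairs quoted above) and that glyphs are laid out row-major with 94 glyphs per row. The function names are hypothetical.

```python
BASE = 0xA1   # value removed from each byte (assumed, see lead-in)
CELLS = 94    # glyphs per row in the row-major glyph array


def effective_cell(qq, ww):
    """Return the (row, column) pair that the byte pair actually
    selects once the linear index is folded back into 94-wide rows."""
    index = (qq - BASE) * CELLS + (ww - BASE)
    return divmod(index, CELLS)


def equivalent_code(qq, ww):
    """Return the legal EUC-CN byte pair addressing the same glyph,
    or None if the index falls outside the 94x94 array."""
    row, col = effective_cell(qq, ww)
    if not 0 <= row < CELLS:
        return None
    return bytes([row + BASE, col + BASE])


print(effective_cell(0xA3, 0x89))          # (1, 70)
print(equivalent_code(0xA3, 0x89).hex())   # a2e7
print(equivalent_code(0xA1, 0xA0))         # None
```

The <code>None</code> result for (A1,A0) reflects that the index is -1, one glyph before the start of the array; for the small font this is still inside xf_font.dat (the last glyph of the large font), whereas for the large font it would fall outside the font data altogether.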


At the time of writing the apparent wraparound behavior has not been investigated thoroughly, but it is established that some illegal code points are not recovered to a valid glyph at all, and instead result in garbled text or a crash. Possibly invalid EUC-CN input is what is causing most Chapters of the Chinese Oni version to crash on modern Windows systems (through varying degrees of memory corruption), although this has not been investigated thoroughly either.
Without a proper sanity check, some illegal code points will clearly result in pixel data being loaded not from a valid glyph region, but from irrelevant memory that belongs either to xfhsm_oni.dll or to the main Oni engine, resulting in garbled text. Memory corruption or segmentation fault (access violation) may occur if similar out-of-bounds pointers are used when rendering glyph textures. Possibly invalid EUC-CN input is what is causing most Chapters of the Chinese Oni version to crash on modern Windows systems, although this has not been investigated thoroughly.


===Over-tall text===
===Over-tall text===