OBD:Text encoding: Difference between revisions

m (→‎Invalid EUC-CN input: probably makes more sense that way)
Line 884: Line 884:
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on standard EUC-CN encoding, with exactly 8,836 code points (94×94). A proper EUC-CN double-byte sequence consists of two bytes that are both in the range 0xA1-0xFE; anything else is technically illegal (single US-ASCII characters could occur in theory, but are not handled properly by the custom text engine, xfhsm_oni.dll).
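The range rule above can be expressed as a short check. This is only an illustrative sketch: the function and constant names are made up for this page and are not taken from xfhsm_oni.dll.

```python
# Illustrative sketch only; names are hypothetical, not from xfhsm_oni.dll.
EUC_CN_MIN = 0xA1  # lowest legal value for either byte
EUC_CN_MAX = 0xFE  # highest legal value for either byte


def is_valid_euc_cn_pair(lead: int, trail: int) -> bool:
    """Return True if both bytes fall in the 0xA1-0xFE range."""
    return (EUC_CN_MIN <= lead <= EUC_CN_MAX
            and EUC_CN_MIN <= trail <= EUC_CN_MAX)


def code_point_count() -> int:
    """Number of well-formed double-byte code points (94 rows x 94 cells)."""
    span = EUC_CN_MAX - EUC_CN_MIN + 1
    return span * span


if __name__ == "__main__":
    print(code_point_count())
    print(is_valid_euc_cn_pair(0xA3, 0xA1))          # well-formed pair
    print(is_valid_euc_cn_pair(0xA3, 0xA0))          # trail byte too low
    print(is_valid_euc_cn_pair(ord("C"), ord("h")))  # two ASCII bytes
```

A check of this kind is what the text engine would need before using a byte pair as a glyph index.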


The text strings in the Chinese version mostly conform to the EUC-CN scheme, but there are two recurrent invalid characters, as well as some instances of non-translated US-ASCII (!!!).


====(A3,89)====
Line 1,265: Line 1,170:


Without a proper sanity check, some illegal code points will clearly result in pixel data being loaded not from a valid glyph region, but from unrelated memory that belongs either to xfhsm_oni.dll or to the main Oni engine, resulting in garbled text. Memory corruption or a segmentation fault (access violation) may occur if similar out-of-bounds pointers are used when rendering glyph textures. Invalid EUC-CN input is possibly what causes most chapters of the Chinese version of Oni to crash on modern Windows systems, although this has not been investigated thoroughly.
====Non-translated US-ASCII====
ASCII strings are much more harmful when handled by xfhsm_oni.dll than the two invalid code points (A3,A0) and (A3,89), because pairs of US-ASCII bytes, misinterpreted as EUC-CN code points, end up referencing unrelated memory regions (outside the region occupied by xf_font.dat). Unfortunately, there are a few ASCII strings that xfhsm_oni.dll can come across even during regular gameplay, and many more arise if one allows for modding.
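To see why ASCII pairs land so much farther out of bounds than (A3,A0) or (A3,89), consider the following sketch. It assumes the conventional row-major quwei layout, where a glyph's linear index is (qu - 1) × 94 + (wei - 1) with qu = lead - 0xA0 and wei = trail - 0xA0. Whether xfhsm_oni.dll computes its offsets in exactly this way has not been confirmed; the function names are hypothetical.

```python
# Sketch under an ASSUMED row-major quwei layout; not verified against
# xfhsm_oni.dll. All names here are hypothetical.
CELLS_PER_ROW = 94
GLYPH_COUNT = CELLS_PER_ROW * CELLS_PER_ROW


def quwei_index(lead: int, trail: int) -> int:
    """Linear glyph index computed with no range check on either byte."""
    qu = lead - 0xA0    # row number, 1..94 for legal input
    wei = trail - 0xA0  # cell number, 1..94 for legal input
    return (qu - 1) * CELLS_PER_ROW + (wei - 1)


def inside_glyph_table(index: int) -> bool:
    """True if the index still points somewhere inside the glyph table."""
    return 0 <= index < GLYPH_COUNT


if __name__ == "__main__":
    for lead, trail in [(0xA3, 0xA0), (0xA3, 0x89), (ord("C"), ord("h"))]:
        idx = quwei_index(lead, trail)
        print(f"({lead:02X},{trail:02X}) -> {idx}, in table: {inside_glyph_table(idx)}")
```

Under this assumption, the two recurrent invalid code points still resolve to a (wrong) slot inside the table, whereas an ASCII pair such as "Ch" yields a negative index, i.e. an address before the start of the glyph data.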
=====Count on it=====
The following string in SUBTsubtitles has not been translated into Chinese:
:Barabas:&nbsp;&nbsp;Count on it. When I get through with them they're...
Being encoded as plain US-ASCII, this string is entirely illegal given the limited implementation of EUC-CN in xfhsm_oni.dll, which does not detect US-ASCII as single-byte code points and keeps interpreting pairs of ASCII bytes as (invalid) quwei indices. By lucky coincidence, the string has an even number of printable bytes, so the null character is still in a suitable place for terminating the string (the EUC-CN parser will see it as a null lead-byte and will not keep reading further data). However, the string still consists of 31 invalid two-byte code points (not counting the null). As a further lucky coincidence, this string is never read by Oni's engine, because the subtitle's handle (02_05_05) is one of those that have been clobbered by the spurious double-null (see [[#Chinese_SUBT_issues|"Chinese_SUBT_issues"]] below). If it weren't for the clobbering, the game would crash upon displaying this subtitle.
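The byte arithmetic can be checked with a few lines of Python. The sketch assumes the two gaps after "Barabas:" are stored as ordinary ASCII spaces; the helper names are invented for illustration.

```python
# Sketch: assumes the subtitle is stored with two plain ASCII spaces after
# "Barabas:". Helper names are hypothetical.
SUBTITLE = b"Barabas:  Count on it. When I get through with them they're..."


def pair_up(text: bytes) -> list[bytes]:
    """Split a byte string into consecutive two-byte groups."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def null_lands_on_lead_byte(text: bytes) -> bool:
    """True if the terminating null falls in a lead-byte position,
    i.e. the text itself has an even number of bytes."""
    return len(text) % 2 == 0


if __name__ == "__main__":
    print(len(SUBTITLE), "bytes")
    print(len(pair_up(SUBTITLE)), "double-byte groups")
    print("null on lead byte:", null_lands_on_lead_byte(SUBTITLE))
```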
=====Pre-beta ONLDs=====
The "level definitions" ([[ONLD]]s) of [[Pre-beta_content#Cut_levels|pre-beta levels]] are never seen in vanilla Oni, but would appear in the "Load Game" dialog if a valid level#_Final.dat were to be supplied at startup (e.g. by a mod). Since xfhsm_oni.dll does not actually support US-ASCII, any untranslated ONLDs are potentially disruptive.
The following 8 pre-beta ONLDs were fully translated: "The Airport Part Deux" (level_05), "Obsolete" (level_07), "The Arena of Pain" (level_30), "Crossing Zone" (level_31), "Pit" (level_32), "Crossing Zone Too" (level_33), "Capture" (level_34), "Territories" (level_35).
The following 8 pre-beta ONLDs remained as US-ASCII: "Test_Stuff" (level_36), "AlexTestSite" (level_55), "Experimental_II" (level_66), "MARTY'S SOUND CORRIDOR" (level_68), "FiringRange" (level_71), "One Room" (level_77), "One Room 2" (level_88) and "Test Barn II" (level_99).
The most awkward case is that of the string "BGI HQ" (ONLDlevel_16), which was translated only partly: "HQ" was replaced with a pair of GB 2312 glyphs, but the first four characters "BGI " remained as plain ASCII (i.e., as two illegal EUC-CN code points).
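The mixed case can be illustrated by classifying each byte pair of such a string. Since the actual glyphs used for "HQ" are not listed here, the sketch substitutes two arbitrary well-formed pairs as placeholders; the function name is likewise hypothetical.

```python
# Sketch only. PLACEHOLDER_GLYPHS are arbitrary well-formed EUC-CN pairs
# standing in for the real (unlisted) translation of "HQ".
PLACEHOLDER_GLYPHS = b"\xb0\xa1\xb0\xa2"
MIXED_NAME = b"BGI " + PLACEHOLDER_GLYPHS


def classify_pairs(raw: bytes) -> list[tuple[bytes, bool]]:
    """Pair up the bytes and flag each pair as well-formed EUC-CN or not."""
    result = []
    for i in range(0, len(raw), 2):
        pair = raw[i:i + 2]
        ok = len(pair) == 2 and all(0xA1 <= b <= 0xFE for b in pair)
        result.append((pair, ok))
    return result


if __name__ == "__main__":
    for pair, ok in classify_pairs(MIXED_NAME):
        print(pair, "valid" if ok else "ILLEGAL")
```

The first two pairs ("BG" and "I ") are flagged as illegal, and the two placeholder pairs pass.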
=====Cheat messages=====
None of the 38 cheat messages was translated into Chinese (!!!), which means 38 more strings made entirely of illegal EUC-CN code points. Any time a cheat is entered, xfhsm_oni.dll attempts to display one of the following strings, which almost always causes a crash on modern Windows systems. Note how the null byte (shown as ° in the table) does not interrupt the input if it occurs in a trail-byte position, so that the parser reads on into whatever follows the string in memory.
{|
|
{{divhide|&nbsp;List of invalid EUC-CN strings triggered by cheats|align=left}}
{|border=1 cellspacing=0 cellpadding=3
!Cheat
!Invalid double-byte arrays (ASCII)
|-valign=top
!shapeshifter
|<tt>Ch<u>an</u>ge<u> C</u>ha<u>ra</u>ct<u>er</u>s <u>En</u>ab<u>le</u>d°<br /><u>Ch</u>an<u>ge</u> C<u>ha</u>ra<u>ct</u>er<u>s </u>Di<u>sa</u>bl<u>ed</u></tt>
|-valign=top
!liveforever
|<tt>In<u>vi</u>nc<u>ib</u>il<u>it</u>y <u>En</u>ab<u>le</u>d°<br /><u>In</u>vi<u>nc</u>ib<u>il</u>it<u>y </u>Di<u>sa</u>bl<u>ed</u></tt>
|-valign=top
!touchofdeath
|<tt>Om<u>ni</u>po<u>te</u>nc<u>e </u>En<u>ab</u>le<u>d°</u>to<u>uc</u>ho<u>fd</u>ea<u>th</u><br /><u>Om</u>ni<u>po</u>te<u>nc</u>e <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!canttouchthis
|<tt>Un<u>st</u>op<u>pa</u>bl<u>e </u>En<u>ab</u>le<u>d°</u>ca<u>nt</u>to<u>uc</u>ht<u>hi</u>s°<br /><u>Un</u>st<u>op</u>pa<u>bl</u>e <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!fatloot
|<tt>Fa<u>t </u>Lo<u>ot</u> R<u>ec</u>ei<u>ve</u>d°</tt>
|-valign=top
!glassworld
|<tt>Gl<u>as</u>s <u>Fu</u>rn<u>it</u>ur<u>e </u>En<u>ab</u>le<u>d°</u>gl<u>as</u>sw<u>or</u>ld<br /><u>Gl</u>as<u>s </u>Fu<u>rn</u>it<u>ur</u>e <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!winlevel
|<tt>In<u>st</u>an<u>tl</u>y <u>Wi</u>n <u>Le</u>ve<u>l°</u>wi<u>nl</u>ev<u>el</u></tt>
|-valign=top
!loselevel
|<tt>In<u>st</u>an<u>tl</u>y <u>Lo</u>se<u> L</u>ev<u>el</u></tt>
|-valign=top
!bighead
|<tt>Bi<u>g </u>He<u>ad</u> E<u>na</u>bl<u>ed</u><br /><u>Bi</u>g <u>He</u>ad<u> D</u>is<u>ab</u>le<u>d°</u></tt>
|-valign=top
!minime
|<tt>Mi<u>ni</u> M<u>od</u>e <u>En</u>ab<u>le</u>d°<br /><u>Mi</u>ni<u> M</u>od<u>e </u>Di<u>sa</u>bl<u>ed</u></tt>
|-valign=top
!superammo
|<tt>Su<u>pe</u>r <u>Am</u>mo<u> M</u>od<u>e </u>En<u>ab</u>le<u>d°</u>su<u>pe</u>ra<u>mm</u>o°<br /><u>Su</u>pe<u>r </u>Am<u>mo</u> M<u>od</u>e <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!reservoirdogs
|<tt>La<u>st</u> M<u>an</u> S<u>ta</u>nd<u>in</u>g <u>En</u>ab<u>le</u>d°<br /><u>La</u>st<u> M</u>an<u> S</u>ta<u>nd</u>in<u>g </u>Di<u>sa</u>bl<u>ed</u></tt>
|-valign=top
!roughjustice
|<tt>Ga<u>tl</u>in<u>g </u>Gu<u>ns</u> E<u>na</u>bl<u>ed</u><br /><u>Ga</u>tl<u>in</u>g <u>Gu</u>ns<u> D</u>is<u>ab</u>le<u>d°</u></tt>
|-valign=top
!chenille
|<tt>Da<u>od</u>an<u> P</u>ow<u>er</u> E<u>na</u>bl<u>ed</u><br /><u>Da</u>od<u>an</u> P<u>ow</u>er<u> D</u>is<u>ab</u>le<u>d°</u></tt>
|-valign=top
!behemoth
|<tt>Go<u>dz</u>il<u>la</u> M<u>od</u>e <u>En</u>ab<u>le</u>d°<br /><u>Go</u>dz<u>il</u>la<u> M</u>od<u>e </u>Di<u>sa</u>bl<u>ed</u></tt>
|-valign=top
!elderrune
|<tt>Re<u>ge</u>ne<u>ra</u>ti<u>on</u> E<u>na</u>bl<u>ed</u><br /><u>Re</u>ge<u>ne</u>ra<u>ti</u>on<u> D</u>is<u>ab</u>le<u>d°</u></tt>
|-valign=top
!moonshadow
|<tt>Ph<u>as</u>e <u>Cl</u>oa<u>k </u>En<u>ab</u>le<u>d°</u>mo<u>on</u>sh<u>ad</u>ow<br /><u>Ph</u>as<u>e </u>Cl<u>oa</u>k <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!munitionfrenzy
|<tt>We<u>ap</u>on<u>s </u>Lo<u>ck</u>er<u> C</u>re<u>at</u>ed</tt>
|-valign=top
!fistsoflegend
|<tt>Fi<u>st</u>s <u>Of</u> L<u>eg</u>en<u>d </u>En<u>ab</u>le<u>d°</u>fi<u>st</u>so<u>fl</u>eg<u>en</u>d°<br /><u>Fi</u>st<u>s </u>Of<u> L</u>eg<u>en</u>d <u>Di</u>sa<u>bl</u>ed</tt>
|-valign=top
!killmequick
|<tt>Ul<u>tr</u>a <u>Mo</u>de<u> E</u>na<u>bl</u>ed<br /><u>Ul</u>tr<u>a </u>Mo<u>de</u> D<u>is</u>ab<u>le</u>d°<u>Ul</u>tr<u>a </u>Mo<u>de</u> E<u>na</u>bl<u>ed</u></tt>
|-valign=top
!carousel
|<tt>Sl<u>ow</u> M<u>ot</u>io<u>n </u>En<u>ab</u>le<u>d°</u>ca<u>ro</u>us<u>el</u><br /><u>Sl</u>ow<u> M</u>ot<u>io</u>n <u>Di</u>sa<u>bl</u>ed</tt>
|}
{{divhide|end}}
|}
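The overrun seen in the table can be reproduced with a minimal pairwise scanner that, like the behavior described above, only tests the lead byte for null. The memory layout in the example (message, null, cheat code, null) is modeled on the "touchofdeath" row; the function name is hypothetical and the real parser may differ in detail.

```python
# Sketch of a pairwise scanner that only checks the LEAD byte for null.
# Hypothetical name; layout modeled on the "touchofdeath" table row.
def scan_double_bytes(memory: bytes, start: int = 0) -> list[bytes]:
    """Collect two-byte groups until a null appears in lead position."""
    pairs = []
    pos = start
    while pos + 1 < len(memory) and memory[pos] != 0:
        pairs.append(memory[pos:pos + 2])  # trail byte is never inspected
        pos += 2
    return pairs


if __name__ == "__main__":
    # 19-byte message: its null lands in a trail position and is skipped.
    odd = b"Omnipotence Enabled\x00touchofdeath\x00"
    print(b"".join(scan_double_bytes(odd)))

    # 20-byte message: its null lands in a lead position and stops the scan.
    even = b"Omnipotence Disabled\x00touchofdeath\x00"
    print(b"".join(scan_double_bytes(even)))
```

Messages with an odd byte count therefore drag the neighboring string into the render call, while even-length messages stop where intended.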
=====Debug printout and console=====
Oni has a well-hidden [[Developer Mode]] in which it can print informational output directly to the screen instead of writing to a text file. There are fully automatic warnings from the engine (e.g. about too many visible polygons or too many particles), or more or less regular printout (e.g., about a character's current animation status) that can be toggled through [[BSL:Variables|script variables]], or custom "dprint" messages that the developers used for visual feedback while testing [[BSL|scripts]]. Dev mode also has a togglable command line ("CMD: ") for entering script commands in real time. Both the debug printout and the command line use the main glyph-rendering pipeline (intercepted by xfhsm_oni.dll), with a small font size. This makes Dev mode essentially unusable in Chinese Oni, as most if not all of the debug printout or console output will be plain ASCII.
Interestingly, Oni ''does'' have some primitive debug printout that is not intercepted by xfhsm_oni.dll and thus is displayed normally using the smallest-sized TSFT from level0_Final's TSFFTahoma. All (most?) of the primitive printout is available without Dev mode. There is the Ctrl+Shift+Y hotkey (FPS display), some HUD-like overlays toggled by [[BSL:Variables|script variables]] (e.g., chr_debug_characters), and finally some 3D sprites added to the game (e.g., health indicators or name labels displayed above a character's head).


===Over-tall text===