OBD:Text encoding: Difference between revisions

→‎Invalid EUC-CN input: filling in and correcting some stuff; only level0 IGSt left to review
Unlike the Japanese version, where non-standard Shift JIS sequences are explicitly allowed in the .fnt files, the Chinese version does not have a code table and relies on standard EUC-CN encoding, with exactly 8,836 code points (94x94). A proper EUC-CN two-byte code consists of two bytes that are both in the range 0xA1-0xFE; anything else is technically illegal (single-byte US-ASCII characters could occur in theory, but are not handled properly by the custom text engine, xfhsm_oni.dll).
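The validity rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (both bytes in 0xA1-0xFE), not a reconstruction of what xfhsm_oni.dll does; the names <code>is_valid_euc_cn_pair</code> and <code>find_invalid_pairs</code> and the sample bytes are hypothetical.

```python
EUC_MIN, EUC_MAX = 0xA1, 0xFE  # valid range for both bytes of a pair


def is_valid_euc_cn_pair(lead: int, trail: int) -> bool:
    """True if (lead, trail) is one of the 94x94 two-byte EUC-CN codes."""
    return EUC_MIN <= lead <= EUC_MAX and EUC_MIN <= trail <= EUC_MAX


def find_invalid_pairs(data: bytes) -> list:
    """Walk the data two bytes at a time, stopping at a null lead byte.

    Returns (offset, lead, trail) for every pair outside the valid range.
    """
    invalid = []
    for offset in range(0, len(data) - 1, 2):
        lead, trail = data[offset], data[offset + 1]
        if lead == 0:
            break
        if not is_valid_euc_cn_pair(lead, trail):
            invalid.append((offset, lead, trail))
    return invalid


# Made-up sample: (A3,A0), a valid ideographic space (A1,A1), (A3,89), null.
sample = bytes([0xA3, 0xA0, 0xA1, 0xA1, 0xA3, 0x89, 0x00, 0x00])
for offset, lead, trail in find_invalid_pairs(sample):
    print(f"bytes {offset}-{offset + 1}: ({lead:02X},{trail:02X})")
```

A scan of this kind reports locations in the same "bytes N-M" form used in the occurrence lists on this page.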


The text strings in the Chinese version mostly conform to the EUC-CN scheme. A notable exception is the (A1,A0) sequence, which occurs in a few subtitles and is rendered with a blank glyph (i.e., a space between valid glyphs, indistinguishable from an ordinary ideographic space). It appears that xfhsm_oni.dll is simply subtracting 161 (0xA1) from both bytes and then using them as row-major indices into an array of 32-byte glyphs, so that (A1,A0) is simply equivalent to (0,-1) and points to the 32-byte region immediately preceding the first glyph in the relevant glyph array. Since subtitles use the small font (second half of xf_font.dat), (0,-1) merely points to the last glyph of quwei row 93 for the large font, which happens to be blank.
The text strings in the Chinese version mostly conform to the EUC-CN scheme, but there are two recurring invalid two-byte sequences, as well as one instance of untranslated US-ASCII.
====Non-translated US-ASCII====
The following string in SUBTsubtitles has not been translated into Chinese:
:Barabas:  Count on it. When I get through with them they're...
Being encoded as plain US-ASCII, this string is entirely illegal given the limited implementation of EUC-CN in xfhsm_oni.dll, which does not detect US-ASCII bytes as single-byte code points and keeps interpreting pairs of ASCII bytes as (invalid) quwei indices. By lucky coincidence, the string has an even number of printable bytes, so the null character is in a suitable place for terminating the string (the EUC-CN parser will see it as a null lead byte and will not read any further data). However, the string still consists of 31 invalid two-byte code points (not counting the null). As a further lucky coincidence, this string is never read by Oni's engine, because the subtitle's handle (02_05_05) is one of those that have been clobbered by the spurious double-null (see [[#Chinese_SUBT_issues|"Chinese_SUBT_issues"]] below).
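The pairing behavior described above can be illustrated with a small Python sketch. It assumes that the stored bytes are exactly the ASCII text quoted above followed by a single null; <code>split_pairs</code> is a hypothetical stand-in for a parser that always consumes two bytes at a time and stops only on a null lead byte, not the actual code of xfhsm_oni.dll.

```python
def split_pairs(data: bytes):
    """Consume two bytes at a time; stop only when the lead byte is null.

    Returns (pairs, terminated), where terminated is False if the parser
    ran off the end of the buffer without ever seeing a null lead byte.
    """
    pairs = []
    for i in range(0, len(data), 2):
        lead = data[i]
        if lead == 0:
            return pairs, True
        trail = data[i + 1] if i + 1 < len(data) else 0
        pairs.append((lead, trail))
    return pairs, False


# Assumption: the bytes are the quoted text plus one terminating null.
subtitle = b"Barabas:  Count on it. When I get through with them they're...\x00"

pairs, terminated = split_pairs(subtitle)
print(len(pairs), terminated)

# With one extra printable byte, the null lands in the trail position and
# is swallowed as the second half of a pair instead of ending the string.
shifted_pairs, shifted_terminated = split_pairs(b"!" + subtitle)
print(len(shifted_pairs), shifted_terminated)
```

In the second case a real parser would keep reading whatever data follows the string, which is why the even byte count matters.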


At the time of writing it is not known what was meant by the (A1,A0) sequence, as it doesn't seem to be a valid control sequence under any common extension of EUC-CN.
====(A3,89)====
The illegal sequence (A3,0x89) occurs only in the SUBTmessages entry '''xdash1''', the original English text being "Face the center of the room and [c.tap the forward key just before pressing and holding it down again (tap W then press and hold W)].".


Another illegal sequence is (0xA3,0x89), which occurs only in the SUBTmessages entry xdash1 (five identical glyphs at the end of the string) and is rendered as ㈢. In this case, too, xfhsm_oni.dll is simply subtracting 161 (0xA1) from both bytes, ending up with (2,-24), which is equivalent to (1,70) and produces the glyph ㈢. The correct EUC-CN code for ㈢ would be (A2,E7), although it is unlikely that this is what the translator meant to write. As with (A1,A0), it is not currently known what the intended glyph was.
There are five identical (A3,0x89) glyphs at the end of the string, just before the (double) null. All of them end up rendered as ㈢. What happens under the hood is that xfhsm_oni.dll simply subtracts 161 (0xA1) from both bytes, ending up with the row-major indices (2,-24). With 94 glyphs per row, this is equivalent to (1,70), since 2x94-24 = 1x94+70 = 164, and produces the GB 2312 glyph ㈢. The correct EUC-CN code for ㈢ would be (A2,E7), although it is unlikely that this is what the translator meant to write. It is not currently known what the intended glyph was, as (A3,0x89) doesn't seem to be a valid code under any common extension of EUC-CN.
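The index arithmetic can be checked with a short Python sketch. It assumes, as described above, that 0xA1 is subtracted from both bytes with no range check and that the results are used as row-major indices with 94 columns; the function names are hypothetical. Python's built-in <code>gb2312</code> codec is used only to look up which character sits in the resulting cell.

```python
BASE = 0xA1  # 161, subtracted from both bytes
COLS = 94    # glyphs per quwei row


def raw_indices(lead: int, trail: int):
    """Indices as computed without any range check."""
    return lead - BASE, trail - BASE


def effective_cell(lead: int, trail: int):
    """Zero-based (row, column) that a row-major lookup actually lands on."""
    row, col = raw_indices(lead, trail)
    return divmod(row * COLS + col, COLS)


def euc_cn_for_cell(row: int, col: int) -> bytes:
    """Proper EUC-CN byte pair for a zero-based (row, column) cell."""
    return bytes([row + BASE, col + BASE])


print(raw_indices(0xA3, 0x89))      # (2, -24)
row, col = effective_cell(0xA3, 0x89)
print((row, col))                   # (1, 70)
code = euc_cn_for_cell(row, col)
print(code.hex(), code.decode("gb2312"))
```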
 
====(A3,A0)====
The illegal sequence (A3,A0) is much more common than (A3,0x89). It occurs in SUBT entries (both in actual subtitles and in "messages"), as well as in the [[OBD:IGSt|IGSt]] resources of multiple [[OBD:TxtC|TxtC]] (text consoles) and one [[OBD:OPge|OPge]] (objective page). Lists of occurrences are provided below.
 
As with (A3,0x89), the pixel data addressed by the invalid code point remains within the same font: subtracting 161 from both bytes gives (2,-1), which is equivalent to (1,93), i.e. the (A2,FE) slot. This slot happens to be blank (and is thus indistinguishable from an intentional space glyph).
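A minimal Python sketch of the corresponding byte offsets follows. The 32-byte glyph size is taken from the description of the glyph array in the earlier revision of this section and is an assumption here, as are the names <code>glyph_offset</code> and <code>stays_in_font</code>.

```python
BASE = 0xA1        # subtracted from both bytes
COLS = 94          # glyphs per row
ROWS = 94          # rows per font
GLYPH_SIZE = 32    # assumption: bytes of pixel data per glyph


def glyph_offset(lead: int, trail: int) -> int:
    """Byte offset of the addressed pixel data, relative to the font start."""
    return ((lead - BASE) * COLS + (trail - BASE)) * GLYPH_SIZE


def stays_in_font(lead: int, trail: int) -> bool:
    """True if the addressed glyph lies inside the 94x94 glyph array."""
    return 0 <= glyph_offset(lead, trail) < ROWS * COLS * GLYPH_SIZE


# The invalid (A3,A0) and the valid (A2,FE) address the same 32 bytes.
print(glyph_offset(0xA3, 0xA0), glyph_offset(0xA2, 0xFE))
print(stays_in_font(0xA3, 0xA0), stays_in_font(0xA3, 0x89))
```

Both invalid sequences found in the game data pass this range test, which is consistent with them rendering as ordinary (if unintended) glyphs rather than as garbage.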
 
Unlike (A3,0x89), this sequence has multiple occurrences to look at, so we can make an informed guess as to what the intended glyph was: either an ordinary ideographic space, (A1,A1), or some variant thereof (such as a non-breaking space).
 
{|
|
{{divhide| List of Chinese SUBTmessages entries containing the (A3,A0) code point|align=left}}
{|border=1 cellspacing=0 cellpadding=3
!Handle
!Original text
!(A3,A0) location
|-valign=top
!xcombo
|To move diagonally, [c.use key combinations like (W+A) or (S+D)].°
|bytes 30-31,34-35
|-valign=top
!c01_50_11
|To perform a somersault escape move, [c.begin running (W,A,D,S) and then press (SHIFT)].°
|bytes 42-43
|-valign=top
!autoprompt_hypo
|Press [c.Q] to pick up HYPO SPRAY.°
|bytes 12-13
|-valign=top
!autoprompt_cell
|Press [c.Q] to pick up [b.ENERGY CELL].°
|bytes 16-17
|-valign=top
!xtabhypo
|Press [c.(TAB)] to use a hypo.°
|bytes 0-1
|}
{{divhide|end}}
|}
 
{|
|
{{divhide| List of Chinese SUBTsubtitles entries containing the (A3,A0) code point|align=left}}
{|border=1 cellspacing=0 cellpadding=3
!Handle
!Original text
!(A3,A0) location
|-valign=top
!01_01_08
|Shinatama:  Daodan latency holding at twenty seven point one. Bioplasmic waveforms stable. A class three adrenal spike when you gave the order, but nothing out of the ordinary.°
|bytes 0-1
|-valign=top
!01_01_09
|Kerr:  What are you sending her into?°
|bytes 0-1
|-valign=top
!01_01_10
|Griffin:  It's a simple bust: in and out. She needs a trial run.°
|bytes 0-1
|-valign=top
!01_03_02
|Griffin:  Well done Konoko. Fall back, I'll have you picked up.°
|bytes 0-1
|-valign=top
!01_03_04
|Griffin:  Negative. Fall back.°
|bytes 0-1
|-valign=top
!02_05_03
|Barabas:  You know it. They aren't getting out of here alive.°
|bytes 0-1
|-valign=top
!02_05_08
|Barabas:  I'm ready for anything. You made sure of that.°
|bytes 0-1
|-valign=top
!02_05_09
|Muro:  There is always someone stronger. Have you forgotten?°
|bytes 0-1
|-valign=top
!02_05_10
|Barabas:  No. I haven't. I'll be careful.°
|bytes 0-1
|-valign=top
!02_05_11
|Muro:  See that you are. You know the consequences of failure.°
|bytes 0-1
|-valign=top
!02_05_12
|Receptionist:  Please have a seat, someone will be right with you.°
|bytes 0-1
|-valign=top
!14_54_02
|Civilian:  Konoko, please don't hurt me. They made me do it, I swear.°
|bytes 44-45
|-valign=top
!15_59_01
|Muro:  Welcome sister. I am very impressed with what you have been able to accomplish without drawing on the full power of your Chrysalis. You are capable of so much more. Let me show you...°
|bytes 32-33,36-37
|}
{{divhide|end}}
|}
 
{|
|
{{divhide| List of Chinese [[IGSt]] containing the (A3,A0) code point|align=left}}
{|border=1 cellspacing=0 cellpadding=3
!Owner
!Page
!Original text
!(A3,A0) location
|-valign=top
![[Quotes/Objectives#CHAPTER_08_._AN_INNOCENT_LIFE|OPgelevel_10]]
!align=center|3
| There's no one left to trust.°
|bytes 0-1
|-
|colspan=4 bgcolor=silver|
|-valign=top
!TxtClevel_1f
!align=center|1
|Reload............................R (or LEFT MOUSE BUTTON)°
|bytes 16-17
|-valign=top
!TxtClevel_1f
!align=center|2
|Reload............................R (or LEFT MOUSE BUTTON)°
|bytes 16-17
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_2a|TxtClevel_2a]]
!align=center|1
|ENCRYPT SEQUENCE TaL0315-68 seq. 1°
|bytes 28-29,38-39
|-valign=top
![[Quotes/Consoles/level_2b|TxtClevel_2b]]
!align=center|1
|ENCRYPT SEQUENCE TaL0315-68 seq. 2°
|bytes 28-29,38-39
|-valign=top
![[Quotes/Consoles/level_2c|TxtClevel_2c]]
!align=center|1
|ENCRYPT SEQUENCE TaL0315-68 seq. 3°
|bytes 28-29,38-39
|-valign=top
![[Quotes/Consoles/level_2d|TxtClevel_2d]]
!align=center|1
|VOICE ENCRYPT 01.967.23 <Dr. Singh, Earnest M.>°<br />"Dr. Kafelnikov and I have just completed Test Part 483 in the ESS (Environmental Stress Simulator) with the settings prescribed by protocol AT-MOK 64.°
|bytes 26-27<br />bytes 56-57
|-valign=top
![[Quotes/Consoles/level_2e|TxtClevel_2e]]
!align=center|1
|VOICE ENCRYPT 01.965.04 <Dr. Kafelnikov, Roland V.>°
|bytes 26-27
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_3a|TxtClevel_3a]]
!align=center|1
|(ref.TANKER&nbsp;v1.6&nbsp;-&nbsp;1.9)°
|bytes 18-19,28-29,32-33
|-valign=top
![[Quotes/Consoles/level_3d|TxtClevel_3d]]
!align=center|1
|VAGO BIOTECH - Life is for Everyone°<br />(ref.BIOTECHNOLOGY TODAY vol.XXI)°
|bytes 8-9,24-25<br />bytes 24-25
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_4b|TxtClevel_4b]]
!align=center|1
|WCG.subref.AirCOn&nbsp;Region&nbsp;7Dispatch>°
|bytes 34-35,40-41
|-valign=top
![[Quotes/Consoles/level_4c|TxtClevel_4c]]
!align=center|1
|WCG.subref.AirCOn&nbsp;Environmental Update&nbsp;>°
|bytes 34-35,46-47
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_6a|TxtClevel_6a]]
!align=center|1
|WCG.subref.AirCon&nbsp;General Alert&nbsp;>°
|bytes 34-35,44-45
|-valign=top
![[Quotes/Consoles/level_6b|TxtClevel_6b]]
!align=center|1
|WCG.subref.NavCom&nbsp;Advisory >°
|bytes 44-45
|-valign=top
![[Quotes/Consoles/level_6c|TxtClevel_6c]]
!align=center|1
|WCG.subref.Custodial Heads-Up&nbsp;>°
|bytes 44-45
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_8a|TxtClevel_8a]]
!align=center|1
|CLASSIFIED - Clearance Gamma S16 and Above Only>°
|bytes 10-11,22-23,30-31
|-valign=top
![[Quotes/Consoles/level_8b|TxtClevel_8b]]
!align=center|1
|GENERAL ACCESS - Clearance Alpha G1>°
|bytes 24-25
|-valign=top
![[Quotes/Consoles/level_8f|TxtClevel_8f]]
!align=center|1
|GENERAL ACCESS - Clearance Alpha A1>°
|bytes 24-25
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_10b|TxtClevel_10b]]
!align=center|1
|<-> security mainframe&nbsp;<->OVERRIDE°
|bytes 14-15
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_14a|TxtClevel_14a]]
!align=center|1
|Project: 14&nbsp;(1.3.51)°
|bytes 10-11
|-valign=top
![[Quotes/Consoles/level_14b|TxtClevel_14b]]
!align=center|1
|Project: 14&nbsp;(9.1.28)°
|bytes 10-11
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_18a|TxtClevel_18a]]
!align=center|2
|TCTFdb88\sld\zZ1 Update: Omega Security Vault Retrofit°
|bytes 32-33
|-valign=top
![[Quotes/Consoles/level_18b|TxtClevel_18b]]
!align=center|1<br/><br/>2<br/>3<br/>4
|<<Clearance Theta K12 and Above Only>>°<br/>TCTF32\sld\taL15 Shinatama/Konoko Relationship Analysis°<br />TCTF32\sld\taL15 Shinatama/Konoko Relationship Analysis [cont]°<br />TCTF32\sld\taL15 Shinatama/Konoko Relationship Analysis [cont]°<br />TCTF32\sld\taL15 Shinatama/Konoko Relationship Analysis [cont]°
|bytes 8-9,20-21,28-29<br/>bytes 32-33<br/>bytes 32-33<br/>bytes 32-33<br/>bytes 32-33
|-valign=top
![[Quotes/Consoles/level_18c|TxtClevel_18c]]
!align=center|1
| -Internal car park facilities closed.&nbsp;&nbsp;Traffic redirected to security kiosk A as per protocol Theta K12.°
|bytes 34-35
|-valign=top
![[Quotes/Consoles/level_18d|TxtClevel_18d]]
!align=center|1
|SECURITY ALERT&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;19:35:06°
|bytes 8-9,10-11,12-13,14-15,16-17,18-19,20-21,22-23
|-
|colspan=4 bgcolor=silver|
|-valign=top
![[Quotes/Consoles/level_19b|TxtClevel_19b]]
!align=center|1
|5)&nbsp;&nbsp;Low Orbit satellite control signal burrow established°<br />6)&nbsp;&nbsp;STURMANDERUNG mountain compound construction°<br />7)&nbsp;&nbsp;Daodan core technology (ref.TITAN\ssob)°
|bytes 4-5,6-7<br />bytes 4-5,6-7<br />bytes 4-5,6-7
|-valign=top
![[Quotes/Consoles/level_19c|TxtClevel_19c]]
!align=center|1
|9)&nbsp;&nbsp;&nbsp;Daodan core technology (ref.TITAN\uwlb)°<br />10)&nbsp;&nbsp;STURMANDERUNG mountain compound construction°<br />11)&nbsp;&nbsp;Symbiote candidate selection and implantation°
|bytes 4-5,6-7,8-9<br />bytes 6-7,8-9<br />bytes 6-7,8-9
|-valign=top
![[Quotes/Consoles/level_19d|TxtClevel_19d]]
!align=center|1<br/><br/><br/>2<br/><br/><br/><br/>3
|13)&nbsp;&nbsp;ACC installation modification COMPLETE°<br />14)&nbsp;&nbsp;STURMANDERUNG mountain compound COMPLETE°<br />15)&nbsp;&nbsp;STURMANDERUNG transmitter array COMPLETE°<br />1)&nbsp;Initialize°<br />2)&nbsp;Test with current settings°<br />3)&nbsp;Edit current settings°<br />4)&nbsp;Abort current process°<br />Frequency: 1002&nbsp;&nbsp;&nbsp;Amplitude: 233&nbsp;&nbsp;Mode: 1°<br />input>Frequency&nbsp;=&nbsp;9999,&nbsp;Amplitude&nbsp;=&nbsp;9999,&nbsp;Mode:&nbsp;9999°
|bytes 6-7,8-9<br />bytes 6-7,8-9<br />bytes 6-7,8-9<br />bytes 4-5<br />bytes 4-5<br />bytes 4-5<br />bytes 4-5<br />bytes 14-15,16-17,18-19,32-33,34-35<br />bytes 10-11,14-15,26-27,32-33,36-37,48-49,56-57
|-valign=top
![[Quotes/Consoles/level_19e|TxtClevel_19e]]
!align=center|1
|1)&nbsp;Stop Point A operators must coordinate Blue tunnel for two-way traffic.°<br />2)&nbsp;Reprimand all personnel for sloppy operating behavior.°<br />3)&nbsp;Replace all remaining doors with Musashi DX1000.°<br />
|bytes 4-5<br />bytes 4-5<br />bytes 4-5
|}
{{divhide|end}}
|}


Without a proper sanity check, some illegal code points will clearly result in pixel data being loaded not from a valid glyph region, but from unrelated memory that belongs either to xfhsm_oni.dll or to the main Oni engine, resulting in garbled text. Memory corruption or a segmentation fault (access violation) may occur if similar out-of-bounds pointers are used when rendering glyph textures. Invalid EUC-CN input may be what causes most Chapters of the Chinese Oni version to crash on modern Windows systems, although this has not been investigated thoroughly.
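For comparison, the kind of sanity check that appears to be missing could look like the following Python sketch. It is a hypothetical illustration, not a patch for xfhsm_oni.dll: the 32-byte glyph size, the blank fallback glyph and the function name <code>safe_glyph</code> are all assumptions, and the font used in the example is synthetic.

```python
BASE, LAST = 0xA1, 0xFE
COLS = 94
GLYPH_SIZE = 32                  # assumption: bytes of pixel data per glyph
BLANK = bytes(GLYPH_SIZE)        # hypothetical fallback: an empty glyph


def safe_glyph(font: bytes, lead: int, trail: int) -> bytes:
    """Return the glyph for a valid pair, or a blank glyph otherwise.

    Rejecting the pair before any index arithmetic guarantees that the
    lookup can never read outside the 94x94 glyph array.
    """
    if not (BASE <= lead <= LAST and BASE <= trail <= LAST):
        return BLANK
    start = ((lead - BASE) * COLS + (trail - BASE)) * GLYPH_SIZE
    return font[start:start + GLYPH_SIZE]


# Synthetic font: glyph number i is filled with the byte value (i % 255) + 1.
font = b"".join(bytes([i % 255 + 1]) * GLYPH_SIZE for i in range(COLS * COLS))

print(safe_glyph(font, 0xA2, 0xFE)[:4])  # real pixel data of glyph 187
print(safe_glyph(font, 0xA3, 0xA0)[:4])  # rejected, blank fallback
```

With such a check, the invalid sequences listed above would render as blanks by design rather than by the accident of where the unchecked arithmetic happens to land.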