Development notes/PC-98

From Touhou Patch Center

So how will this be used?

As there are no DLLs on DOS, the thcrap port has to be compiled to a standalone TSR executable which intercepts some DOS interrupts.

Basically, the user has to call thcrap98 first, and then start the game the way it would normally be run:

	cd \thcrap98
	loadhigh thcrap98 runconf.js
	cd \th0x
	game

Having to copy the thcrap98 directory on every single HDI will be really annoying. Thus, the setup will copy the existing games (with all save data, of course!) from the 5 individual HDIs onto a single new, big HDI with every game on it. It will also install FreeDOS and add a nice start menu to automatically execute the commands above.

Windows -> DOS porting

(um... let's save that for later)

thcrap initialization

Good news: On DOS, .exe files are not too different from .com files. On invocation, the whole file is loaded into one contiguous block of memory. And I already feared that the different segments would be spread all over memory...

But well, that makes things no different than on Windows! We'll just get the start address of the .exe file and patch away.

... and "getting the start address" is where our problems begin. :-)

Problem #1: Under normal circumstances, int21,4b does not return until program execution has finished. Figures: DOS isn't a multitasking operating system, after all.

Solution: Call int21,4b01, then patch, then run (set some registers and jump to the CS:IP that the call returns in its parameter block). Gee, this is so much easier than on Windows for once!
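For reference, here's a rough sketch of the parameter block that int21,4b01 works on, following the layout commonly documented for this call (e.g. in Ralf Brown's Interrupt List). The struct, its field names and the linear_addr() helper are made up for illustration; only the offsets matter. On return, DOS fills in the last two far pointers with the child's initial SS:SP and its entry point CS:IP.

```c
#include <stddef.h>
#include <stdint.h>

/* Parameter block for int21,4b01 (load but do not execute).
 * All names are hypothetical; the offsets follow the documented layout. */
#pragma pack(push, 1)
typedef struct {
    uint16_t env_seg;      /* 00h: environment segment (0 = inherit) */
    uint16_t cmdtail_off;  /* 02h: far pointer to the command tail */
    uint16_t cmdtail_seg;
    uint16_t fcb1_off;     /* 06h: far pointer to the first FCB */
    uint16_t fcb1_seg;
    uint16_t fcb2_off;     /* 0Ah: far pointer to the second FCB */
    uint16_t fcb2_seg;
    uint16_t child_sp;     /* 0Eh: set on return, initial SS:SP */
    uint16_t child_ss;
    uint16_t child_ip;     /* 12h: set on return, entry point CS:IP */
    uint16_t child_cs;
} exec_load_block;
#pragma pack(pop)

/* Real-mode segment:offset to linear address. */
uint32_t linear_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

The actual interrupt call and the final far jump are left out here, since those need compiler-specific code.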

Gotta try this tomorrow. \o

Eh, and what about Diet?

Yup. After calling int21,4b01, packed .exe files are still packed. What do?

a) Get GreaseMonkey to complete his open-source unpacker
b) Try inserting an interrupt after the unpacking code to do the patching then
c) Bundle TRON.EXE with the patch (terrible idea, as it is a 16-bit executable and the patches then can't be installed on 64-bit operating systems)

Hey, Nmlgc, where is the progress?

Sorry, but the native PC-98 version of Turbo C++ keeps crashing for no apparent reason. Looks like I have to use the Open Watcom cross compiler after all (this would have been the better method from the beginning, but well...). Unfortunately, this also means that I have to rewrite a large chunk of my TSR testing code.

Damn, I really hope that this is the last instance of effort being wasted in this whole PC-98 patching affair.

2013-04-19: Got used to Open Watcom, everything works again as expected. Let's do some patching tomorrow! \o

OK, so how do we go about doing this?

Step 1: Get int21,4b01 to work for any executable, without doing any patching. This means hooking any call to int21,4b00 and replacing it with int21,4b01 plus the required jumps.

Step 2: Take a look at the Diet decompression code and look for a breakpoint where the entire program has been decompressed.

The Solution To Our Daily Memory Model Problems

Yet another really awkward thing to debug. Simple variable assignments suddenly fail, yet the C code compiles just fine.

The reason: Segments are getting mixed up, some pointers need to be far, some near, and everything gets really confusing...

Solution:

  • Just compile everything with the Compact model (near code, far data). :-)

Oh shit.

So apparently, whenever int21 is called, a pointer to some CPU register structure is stored in the PSP. The registers are then restored from this structure when a program terminates... and this somehow screws up our possibilities of returning to command.com after termination.

Shit.

I need to go to bed. Will have to look closer...

Looking closer

Alright, we now at least end up at the right code position after program termination. The registers still seem messed up, though...

So, the state at 07ec:01a7 needs to be the exact same as at 07ec:01ac.

Super-portable solution: Just store our SS:SP in a global variable. The registers shouldn't matter at all - they are restored from this stack anyway.

Hello World and th02 work now... every other game shows some weird error:

  • th01: shows intro, hangs somewhere in OP.EXE, before the menu displays
  • th03: hangs after the third invocation of ZUN.COM (zun -5)
  • th04: fails memory check in first invocation of ZUN.COM. Definitely _not_ a memory problem!
  • th05: seems to run through all ZUN.COM invocations, but shows STOP error before the intro

A~ha! Trying it with a utility recommended in the FreeDOS source, http://www.beroset.com/asm/showregs.asm, it displays a bunch of junk when the patch is active...

... oh wow, my old ROT-13 shenanigans were part of the problem! th01 and th05 now work as well. th03 and th04 still show the same issues.

WHAT THE FUCK IS GOING WRONG HERE

It's not Diet.

Seems to be the memory extender.

Alright, the hard way...

th03

ZUN.COM fails at the function at cs:0256... which is just a wrapper around int21,4c "Terminate process".

So the FLAGS register for the IRET in exec_user() (the very last function of the process termination chain) gets "lost from the stack" somewhere in the process, ZUN.COM's segment gets moved to the FLAGS register... and the emulation halts.

Fun.

Interestingly, this happens every time... which means that the games only worked because of sheer chance.

Basically, some stack reference somewhere is assigned incorrectly.

And in the end, there's no way around creating a temporary stack to execute int21,4b01 on. But how, when the register structure for the interrupt is on the stack?

Simple, we just put it in a data segment. :-)
CAUTION: The program now must be compiled without assuming that SS = DS! For Open Watcom, this is the compiler option -zu (C Compiler Switches -> 9. Code Generation Option Switches -> SS not assumed equal to DS).

Alright! Now, th03 works too. That only leaves...

th04

"Seems to be the memory extender."

Probably not. It fails at the first memory check because it can't allocate the full requested amount (480 KB). Being below 640 KB, theoretically, this shouldn't require a memory extender at all.

Heck, th05 asks for 520 KB and gets it easily.

Observation: This entire patch does not work on MS-DOS. Switching back to command.com after TSR-ing halts the OS with some memory allocation error. Same issue?

And it's not because of a wrong memory allocation strategy or UMB link state.

EMM386 - Japheth to the rescue

Because we didn't even have UMB access available in the first place~! In the end, we did have too little memory after all. th05 probably didn't request a contiguous block of memory.

I mean, just look at the weird, inconsistent terminology used in this mess:

Area | Name | Accessed using | Normally called | Note
0 - 640 KB | Conventional Memory | DOS | | "ought to be enough for anybody"
640 KB - 1024 KB | Expanded Memory | memory expander | EMM386.EXE | Graphics memory and other drivers are located here, too - memory expander is "only" required for allocators to access unused memory here
1024 KB - 1088 KB | High Memory | DOS | | FreeDOS relocates its kernel there
1088 KB - | Extended Memory | memory extender | HIMEM.SYS | Requires hardware-specific code
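
Purely to pin down the boundaries from the table, here's a small sketch that sorts a linear address into one of the four areas. The enum and function names are invented for this example, and the area names simply follow the table's terminology.

```c
#include <stdint.h>

/* The four areas from the table (names are hypothetical). */
typedef enum {
    AREA_CONVENTIONAL,  /* below 640 KB, plain DOS */
    AREA_EXPANDED,      /* 640 KB - 1024 KB, needs the memory expander */
    AREA_HIGH,          /* 1024 KB - 1088 KB */
    AREA_EXTENDED       /* 1088 KB and up, needs the memory extender */
} mem_area;

/* Maps a linear address to the area it falls into. */
mem_area classify_addr(uint32_t linear)
{
    if (linear < 640UL * 1024)  return AREA_CONVENTIONAL;
    if (linear < 1024UL * 1024) return AREA_EXPANDED;
    if (linear < 1088UL * 1024) return AREA_HIGH;
    return AREA_EXTENDED;
}
```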

The emm386 on the FreeDOS(98) page, surprisingly, doesn't work at all. It initializes, but then halts the system as soon as FreeCOM wants to swap itself to extended memory.

Oh no... do I have to port yet another DOS component to PC-98?!?

Hmm, let's try the latest version of FreeDOS' standard memory expander, Jemm. We obviously can't use the combined expander+extender (JemmEx) due to native IBM-compatible code in the extender, but the expander is worth a try...

WHAT THE- It works like a charm! Amazing!

Now, all this high loading business actually works, and th04 runs just fine.

What a nice place for our pre-Reitaisai PC-98 hiatus. On to the server!

Hashing, Diet and other problems

On first thought, DOS' flat memory model would perfectly lend itself to hashing the executable directly in memory, without having to reload the .exe file. This would be necessary for the patcher to work through Diet:

  1. Hash the compressed .exe file in-memory
  2. Put a breakpoint as soon as the executable is unpacked (one register should have the PSP)
  3. Hash the uncompressed .exe file in-memory
  4. Patch away

Sounds good, right? Well, there are three problems here:

  1. The .exe header itself is not kept in memory - the image starts directly after it. Sure, we can adjust the hashes accordingly... but that would make initial setup a pain because it has to parse each .exe header to get to the image. Or we'd have to use three sets of hashes: One for identification, one for the in-memory Diet version, and one for the actual, uncompressed game in memory.
  2. How do we get the size of the program? The second field in the Program Segment Prefix only gives us the total amount of memory initially allotted to the program. That memory is not reset to zero, which means that we have the contents of the .exe file at the beginning and a lot of shell code (literally. Remnants from whatever command.com derivative you happen to be using) afterwards.
  3. Can we be sure that the final jump to the uncompressed code always is at the same relative place? Seems that a thorough analysis of Diet is required to do this...
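
To make problem #1 a bit more concrete: a minimal sketch of what the setup would have to do on the disk side to find the load image inside an MZ .exe file. The function name is made up, and it only reads the three standard header fields it needs (bytes in last page, page count, header size in paragraphs). It does not help with getting the size from the PSP at runtime.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a little-endian 16-bit value from a byte buffer. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Computes where the load image starts in an MZ .exe file and how long
 * it is. Returns 0 on success, -1 if this doesn't look like an MZ header. */
int mz_image_extent(const uint8_t *hdr, size_t len,
                    uint32_t *image_off, uint32_t *image_len)
{
    uint32_t last_page, pages, hdr_paras, file_len;

    if (len < 0x0A || hdr[0] != 'M' || hdr[1] != 'Z') return -1;
    last_page = rd16(hdr + 0x02);  /* bytes used in the last 512-byte page */
    pages     = rd16(hdr + 0x04);  /* number of 512-byte pages, last one included */
    hdr_paras = rd16(hdr + 0x08);  /* header size in 16-byte paragraphs */
    if (pages == 0) return -1;

    file_len = pages * 512;
    if (last_page != 0) file_len -= 512 - last_page;

    *image_off = hdr_paras * 16;
    if (*image_off > file_len) return -1;
    *image_len = file_len - *image_off;
    return 0;
}
```

Hashing only the range from image_off to image_off + image_len would give a hash that matches what sits in memory right after loading, at least before any relocations are applied.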

The solution: Diet itself

Diet itself is actually a pretty cool piece of software. Not only does it have the ability to unpack a program back to its original state (the headers and some padding are quite different from what TRON.EXE spits out), it can also load itself as a TSR and automatically decompress the program on int21,4b! If started with

	diet -Z -E

it even automatically replaces the compressed executable on disk with its decompressed version! However, due to our "don't change or put anything in the game directory" policy, this is not what we should do.

But that still leaves problem #2 unsolved.

(Unfortunately, DIETTECH.DOC, which probably contains exactly what we need for both question #2 and #3, seems to have vanished from the Net.)

Give me some time to think about all that.

... or just unpacking it outright

Here's the source code for another Diet unpacker. The tombexcavator project is licensed under the GPL v3, so we could easily use that in our setup.

"You don't really want to do SHA-256 on a vintage 16-bit computer, do you?"

It sounded kind of weird to me too, but yeah, that was the plan. Turns out that there are indeed two major problems:

  1. It doesn't even work. Due to the fact that its internal state uses 32-bit words, I haven't managed to compile a 16-bit build that produces correct results so far.
  2. It is slow. Like, really slow. In my benchmarks, I ended up with a speed of about 354 bytes per MHz per second. With the most widespread Neko Project configuration of 78.6432 MHz (just a little above the recommended specs for PC-98 Touhou), this makes about 28 KB per second.

However, given the fact that the assembly doesn't look exactly perfect to me, it might still be possible to optimize this for decent performance (after I made it work in the first place). All of the PC-98 games require an 80486 CPU anyway, and tapping into the extended instruction set (most notably BSWAP) might help in that regard (as opposed to just compiling everything for the 8086 because you don't know better). (And yes, I've verified that this works on real hardware.)

And by "decent performance", I mean at least doubling the speed.

... or we could just do the sane thing and do CRC32 instead. But remember that I'm in the PC-98 business for the fun, so it should be clear what I'll be wasting my time on first.
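
For comparison, this is roughly all the "sane thing" would take: a bitwise CRC-32 with the common reflected polynomial 0xEDB88320 (the one zip and PNG use). The function name is my own, and a table-driven version would be faster; this is just the smallest sketch that works with nothing but 32-bit shifts and XORs.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320.
 * Pass 0 as `crc` for the first chunk, then feed the result back in. */
uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    size_t i;
    int bit;

    crc = ~crc;
    for (i = 0; i < len; i++) {
        crc ^= buf[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1) {
                crc = (crc >> 1) ^ 0xEDB88320UL;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

Since the result can be fed back in, a file could be hashed in small chunks without ever holding it in memory as a whole.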

... oh wow, this is viable after all?!?

So the biggest problem here is getting a 32-bit build of the SHA-256 algorithm. Since there's no way of setting a certain bitness for a C source file, I first tried adding a .asm compilation of the same file created by Visual C++... but well, that didn't work either. The linker was complaining about something like "can't mix 16-bit and 32-bit segments".

Damn. I mean, using 32-bit opcodes and registers from Real Mode (= 16-bit code) clearly works. (This behavior might be specific to PC-98?) But apparently, there are problems with relocations etc. and this is why Watcom complains. Oh well.

Then let's try to just create a 32-bit build containing just the SHA-256 example code. And there we already have the next problem: Finding a protected mode runtime that works on any system.

CauseWay: Doesn't work (error 11).

DOS/4GW: Blanks the screen? Seems like the splash went wrong.

... oh wait, reading the help file on configuration settings reveals that we have to manually

	set DOS16M=1

if running on a PC-98 :). But then it works.

Not on real hardware though - aside from having to switch to pure MS-DOS mode, it then says

	DOS/16M error: [17]  system software does not follow VCPI or DPMI specifications

Disabling EMM386.EXE fixes it and the application runs nicely. However, we should keep in mind that we do need a memory expander, if only to keep Reimu's scenario in Lotus Land Story from crashing before the Stage 5 boss battle. (Yes, this happens on real hardware too.) I have not yet managed to find a solution that runs both.

Oh, and this also introduces a dependency on (the legendary) DOS4GW.EXE.

And lo and behold, suddenly we're at 80 KB per second on the recommended configuration, and the generated assembly still leaves a lot of room for improvement. The single largest PC-98 Touhou executable, REIIDEN.EXE, is 233 KiB. Can we get there?
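
To put the numbers so far into perspective, here's the arithmetic as a small sketch. Both function names are invented, the clock is passed in kHz to stay in integer math, and I'm assuming that "KB per second" means 1000 bytes per second.

```c
#include <stdint.h>

/* Bytes hashed per second at the measured rate of about
 * 354 bytes per MHz per second. */
uint32_t sha256_bytes_per_sec(uint32_t cpu_khz)
{
    return (uint32_t)(((uint64_t)354 * cpu_khz) / 1000);
}

/* Milliseconds needed to hash `bytes` at `rate` bytes per second. */
uint32_t hash_time_ms(uint32_t bytes, uint32_t rate)
{
    return (uint32_t)(((uint64_t)bytes * 1000) / rate);
}
```

With these, the 16-bit build at 78.6432 MHz comes out at 27839 bytes per second, or about 8.6 seconds for the 233 KiB of REIIDEN.EXE. At 80 KB per second, the same file takes just under 3 seconds.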

OH MY GOD

Firstly: I know that none of the times measured in all of this are really precise - but second resolution is the best I can get right now. clock() unfortunately does not work on PC-98. Thus, time() is the best we can do. Every other interrupt or I/O port (haha) I could find only returns calendar time.

Secondly: There's really more to the performance than the MHz number. Even though the real PC-98 I own (WHICH DOES NOT COME WITH THE FM SOUND BOARD YOU ABSOLUTELY NEED FOR TOUHOU BGM) has 133 MHz, the algorithm runs at least 5 times faster there. (I'm not going to make that my point of reference though.)

On to optimization, then.

Hmm, so the obvious thing to optimize first would be the endianness conversion at the beginning. Thanks to Watcom's mighty #pragma aux directive, we can express C functions as one-line assembly instructions with all the register juggling handled by the compiler:

/* BSWAP needs a 486 or later. The argument goes in via EAX, the result comes back in EAX. */
#pragma aux bswap_32 = " bswap eax " parm [eax];
uint bswap_32(uint x);
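
The reason this conversion is needed at all: SHA-256 treats its input as big-endian 32-bit words, while x86 stores them little-endian. For builds on other compilers, a portable stand-in could look like the sketch below. Both names are made up, and load_be32() is just an alternative that sidesteps the swap by assembling the word from bytes.

```c
#include <stdint.h>

/* Portable equivalent of a single BSWAP: reverses the byte order. */
uint32_t bswap_32_c(uint32_t x)
{
    return ((x & 0x000000FFUL) << 24) | ((x & 0x0000FF00UL) << 8) |
           ((x & 0x00FF0000UL) >> 8)  | ((x & 0xFF000000UL) >> 24);
}

/* Loads one big-endian 32-bit message word from raw bytes,
 * independent of the host's own byte order. */
uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```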

And yay, that already gives us 10 KB/s more. But the best is yet to come:

Bit rotation is an integral part of the SHA-256 algorithm, done quite often during transformation. Originally, this was coded as a macro, and it thus compiled quite poorly.

#define ROTRIGHT(a,b) (((a) >> (b)) | ((a) << (32-(b))))

But there's an x86 instruction to do the exact same thing:

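/* Rotates x right by `bits`. x goes in via EAX; ROR with a variable count needs that count in CL. */
#pragma aux ROTRIGHT= " ror eax, cl " parm [eax] [cl];
uint ROTRIGHT(uint x, char bits);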
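
For reference, a portable version of the same rotation, plus the two "big sigma" functions from the SHA-256 specification as examples of where it gets used. The function names are my own. Masking the count also keeps a rotation by 0 well-defined, which the original macro is not (it would shift by 32).

```c
#include <stdint.h>

/* Rotates x right by `bits` (taken modulo 32). */
uint32_t rotr32(uint32_t x, unsigned bits)
{
    bits &= 31;
    return (x >> bits) | (x << ((32 - bits) & 31));
}

/* The two "big sigma" functions of the SHA-256 compression step. */
uint32_t sha256_big_sigma0(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

uint32_t sha256_big_sigma1(uint32_t x)
{
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}
```

All rotation counts in SHA-256 are constants like these, so an optimizing compiler that recognizes the pattern may well emit a rotate with an immediate count on its own. Whether Watcom does is something I haven't checked.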

This gives an enormous speed boost. We're now at 143 KB/s in Neko Project! ^_^

(And if we now could get rid of the CL register and directly assemble [bits] as an immediate value, I'd be *really* happy.)

Wait a moment....

It is actually impossible to "put a breakpoint as soon as the executable is unpacked" in advance - that position in memory is of course overwritten with the unpacked data.

What now? Are we screwed?

... Actually not. Our modifications to the executables rarely target the first few instructions, so we could simply hold off and apply the patches when INT 21 is hit for the first time with a new command line. And suddenly, we have bypassed all the problems we would have had with Diet! Great!
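
The bookkeeping for that idea could be sketched like this. Everything here is hypothetical: the struct, the function name, and the assumption that the hook gets handed the current program's command line as a plain C string. The actual patching is reduced to a counter.

```c
#include <string.h>

#define CMDLINE_MAX 128  /* the PSP reserves 128 bytes for the command tail */

typedef struct {
    char last_cmdline[CMDLINE_MAX]; /* command line we last patched for */
    int  have_last;                 /* 0 until the first program was seen */
    int  patch_runs;                /* how often the patches were applied */
} patch_state;

/* Would be called from the int21 hook with the caller's command line.
 * Returns 1 if this call triggered patching, 0 otherwise. */
int on_int21(patch_state *st, const char *cmdline)
{
    if (st->have_last &&
        strncmp(st->last_cmdline, cmdline, CMDLINE_MAX - 1) == 0) {
        return 0; /* same command line as before: already patched */
    }
    strncpy(st->last_cmdline, cmdline, CMDLINE_MAX - 1);
    st->last_cmdline[CMDLINE_MAX - 1] = '\0';
    st->have_last = 1;
    st->patch_runs++; /* placeholder for the real patching */
    return 1;
}
```

By the time a Diet-packed game issues its first int21 call, its unpacking stub should be done, so the patches would land on the unpacked code.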