VMU hackery (2017)

Table Of Contents

  • About the VMU
  • About the VMU's CPU
  • Flash storage
  • VMU's comms abilities
  • Exploring the known ROM
  • The SEGA-to-VMU comms protocol revealed
  • MAPLE protocol on STM32
  • Dumping the previously-unseen American v1.05 ROM
  • The final ROP attack to dump the ROM
  • ONE MORE THING: uM23 ! (that's right! A full ARM Cortex-M23 emulator in VMU assembly!)
    • Memory map
    • Interrupts
    • Instructions
    • Structure
    • Hypercalls
  • Emulating the VMU
  • Back to our emulators
    • Code size
    • Multiple programs concurrently on the VMU for the first time
    • Code demos
  • Downloads

About the VMU

Back in 1999, SEGA released the VMU. It was a companion to the Dreamcast - a smart memory card for saved games that also had some buttons, a buzzer, and a screen, so that it could itself be used for mini-games. Inserted into the controller, the VMU acted as a second screen. Standalone, it ran off its two CR2032 batteries and allowed save game file management, minigames, and simple animations. The VMU could store one single game (mandatorily starting at flash address 0) and multiple datafiles. A relatively large community sprang up around the VMU in the early 2000s, with a few homebrew games made, and some features documented. Later, some official documents appeared online from SEGA, and then the world lost interest.

Currently, one may get a VMU on eBay for $3, with free shipping. This got my attention, so I got one. Sadly, most of the websites dealing with VMU hacking from the 2000s are long gone. Web Archive only has a few archived (they have a terrible tendency to honor "robots.txt" from 2010 on content collected before it existed, so any domain that is resold usually causes old archives to be deleted). I did find the official documentation for the VMU (googling for "vmu.pdf" mirrored here helped) and pieces of the official dev kit. Over the next month I uncovered most of the VMU's secrets and today I present them here, together with the most practical way to program a VMU to date. The official doc that I mentioned above is mandatory reading if you intend to program the VMU.

About the VMU's CPU

I shall give a short summary of the VMU CPU (not to be used as a substitute for reading VMU.pdf). The CPU is an 8-bit one. Much in the tradition of old 8-bit CPUs, it is rather user-hostile. This one is in fact quite a bit more so than most. The architecture is Harvard, of course. Code may run from ROM (64K) or Flash (2x 64K pages, running from second page is undocumented). PC is not directly accessible as a register but is 16 bits exactly. Instructions are variable length with a minimum length of one byte and a maximum of 3 bytes.

There are a few classes of data memories. There are: internal RAM (2 banks x 256 bytes), work RAM a.k.a. VRAM (512 bytes), graphics RAM (2 banks x 96 bytes + 1 bank x 4 bytes), and SFRs, that is hardware registers (128 bytes). So how does the 8-bit CPU address all this memory? Banking and other weird methods, of course! So, first things first: data memory addresses are ... 9 bits long. Yes, 9 bits. Wait, you might say, this is an 8-bit CPU, no? Yes it is. So, what gives? For instructions that include an address in them, the full 9 bits are indeed encoded. But, you might ask, how would indirect addressing work? Well, it sort of doesn't. And it sort of does. There are RAM locations that can be used as pointer registers. But RAM bytes are 8 bits in size (as bytes often are). So how do we address a 9-bit address space using 8 bits? We do not. The bottom two pointer registers (R0 & R1) can only address data memory locations 0x000 - 0x0FF. The top two pointer registers (R2 & R3) can only address data memory locations 0x100 - 0x1FF. This might appear to mean that one cannot use a single pointer to iterate the entire data memory space. This is actually correct. Weird, eh?

Moving on... What does the data address space look like? The bottom 256 locations (0x000 - 0x0FF) address the current bank of internal memory. The bank is selected by bit BANK0 in the status byte PSW. The next 128 bytes (0x100 - 0x17F) address SFRs (the various hardware registers like SPI configs, clock settings, etc). The top 128 bytes (0x180 - 0x1FF) address the current bank of graphics RAM, which has 3 banks. Bank selection here is done using the XBNK SFR.

At this point you might notice that we ran out of data address space and did not at all mention work RAM (VRAM). That is because it has no direct visibility. It can only be accessed indirectly. There is a 9-bit address stored in VRMAD2:VRMAD1 SFRs and there is an indirect data SFR VTRBF that accesses the pointed-to byte. Optionally, VRMAD can be auto-incremented on each access to VTRBF. Stack is always in internal RAM bank 0 and it is a full-ascending stack with no indication of overflow/underflow.
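To make the layout above concrete, here is a minimal Python sketch. It is a toy model, not taken from any VMU tooling: the names `classify_data_address` and `WorkRam` are made up for illustration, and the wrap-around of VRMAD at 512 bytes is an assumption rather than something the documentation is quoted on here.

```python
def classify_data_address(addr):
    """Describe where a 9-bit VMU data address lands, per the layout above."""
    if not 0 <= addr <= 0x1FF:
        raise ValueError("data addresses are 9 bits wide")
    if addr <= 0x0FF:
        region, offset = "internal RAM (current bank)", addr
    elif addr <= 0x17F:
        region, offset = "SFR", addr - 0x100
    else:
        region, offset = "graphics RAM (current XBNK bank)", addr - 0x180
    # R0/R1 reach only the low half, R2/R3 only the high half.
    pointers = ("R0", "R1") if addr <= 0x0FF else ("R2", "R3")
    return region, offset, pointers


class WorkRam:
    """Toy model of the VRMAD/VTRBF window onto the 512-byte work RAM."""

    def __init__(self, auto_increment=True):
        self.mem = bytearray(512)
        self.vrmad = 0                      # 9-bit address (VRMAD2:VRMAD1)
        self.auto_increment = auto_increment

    def _advance(self):
        if self.auto_increment:
            # Wrapping at 512 bytes is an assumption of this sketch.
            self.vrmad = (self.vrmad + 1) & 0x1FF

    def write_vtrbf(self, value):
        """Store a byte at the address in VRMAD, then optionally step it."""
        self.mem[self.vrmad] = value & 0xFF
        self._advance()

    def read_vtrbf(self):
        """Fetch the byte at the address in VRMAD, then optionally step it."""
        value = self.mem[self.vrmad]
        self._advance()
        return value
```

With auto-increment on, copying a buffer into work RAM is just "set VRMAD once, then write VTRBF repeatedly", which is the access pattern the paragraph above describes.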

Every single byte decodes as an opcode, no free ones! The CPU has some curious ways to deal with control flow. For unconditional jumps, one can use JMP/JMPF/BR/BRF (which all only differ in range, cycle count, size, and whether they are relative or absolute). Conditional jumps come in a few interesting varieties, actually. BZ/BNZ jump based on the zero flag in PSW. BP/BN jump based on whether a given bit is set in a given RAM location. BPC does the same as BP, but after testing the bit, it will also clear it. DBNZ is useful for fixed-length loops - it will decrement a value in a given RAM location, and if the result is nonzero, take the jump. It is useful due to the fact that it does not affect flags. BE/BNE jump if a given RAM location (or immediate) contains a value that is equal to the current value of the accumulator. Additionally (and this is cool) they will set the carry flag if the RAM value (or immediate) is greater than or equal to the accumulator value. This, by the way, is the only way to compare the accumulator with something without destroying the accumulator. CALL/CALLF/CALLR can be used to make a function call. Simply put, they push the address of the next instruction and then jump like JMP/JMPF/BRF would. They differ only in length/range/absoluteness. RET can be used to return from a function call (and RETI from an interrupt). Both just pop PC from stack and go there.

There are also some useful bit operators. CLR1 will clear a given bit in a RAM location. SET1 will set one, and NOT1 will invert one. Pretty simple but convenient and fast (one cycle for any of those). There is also a less-useful-than-you-think XCH instruction that swaps the value of the accumulator and a RAM location. For logical ops, we have ROR/ROL/RORC/ROLC which rotate right and left, through carry or not. These can be used to implement any other kind of shift, given time. We also have AND/OR/EOR. LD and ST can be used to store a value from accumulator to a RAM address or to retrieve one. PUSH & POP can push any RAM location to stack or retrieve it from there. There is also NOP, of course.
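As an illustration of building shifts out of rotates, here is a small Python model. It assumes the usual meaning of a rotate-left-through-carry (the old bit 7 goes to carry, the old carry comes in at bit 0); the helper names `rolc` and `shl16` are invented for this sketch and are not VMU mnemonics beyond the ROLC they imitate.

```python
def rolc(value, carry):
    """Rotate an 8-bit value left through carry; returns (new_value, new_carry)."""
    new_value = ((value << 1) | (carry & 1)) & 0xFF
    new_carry = (value >> 7) & 1
    return new_value, new_carry


def shl16(high, low):
    """Shift the 16-bit value high:low left by one bit using two ROLC steps."""
    low, carry = rolc(low, 0)         # start with carry clear so a 0 is shifted in
    high, carry = rolc(high, carry)   # bit 7 of the low byte lands in bit 0 of the high byte
    return high, low, carry
```

On the real CPU this would be "clear carry, ROLC the low byte, ROLC the high byte", one pass per bit of shift, which is where the "given time" caveat comes from.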

All arithmetic and logical operations are done to the accumulator. Accumulator is always the implied destination of any operation. It is also accessible at address 0x100 in RAM (so it can be also used as a RAM location). So what math can we do? ADD/ADC/SUB/SUBC/INC/DEC/MUL/DIV. Wait? Did I say MUL and DIV? Yes. These rather complex and uncommon operators in 8-bit CPUs do make an appearance here. The multiplication is a 16 x 8 -> 24 bit operation, and DIV is 16 / 8 -> 16 + 8 quotient-remainder operation in 7 cycles each. Very fancy. INC and DEC are useful as they do not mess with the carry flag.

Most instructions that need an 8-bit value get it as an immediate in the instruction, from a direct RAM reference by encoding the 9-bit address in the instruction, or by indirect access using one of the 4 pointer registers (called @Rj mode). In reality the pointer regs are a bit more complex as there are 4 sets of them and which set is currently used is determined by some bits in the status SFR (PSW), but that is not really all that useful so we'll skip discussing it.

So, given that we have a Harvard architecture, can we at all read code space as data? Sort of. The LDC instruction will load the codespace byte at address (TRH:TRL) + ACC into ACC. This can be used to implement data tables in ROM/flash. The implied addition actually helps to make it convenient. So is that all? Not quite. There are two undocumented instructions as well. They are LDF & STF. LDF loads a byte from flash into ACC. How does it differ from LDC? LDC addresses current code space (which is ROM or flash). LDF always addresses flash. Also LDF can access the entire 128K of flash (top bit of address is taken from bottom bit of SFR at 0x154, I call it FLASHCTL). STF is sort of a complement to LDF and is used to write to flash (in a rather convoluted manner, see ROM for details). This will only work from ROM, so it is of no use to anyone writing code that lives in flash. Later you'll see how I repurposed this opcode.

Flash storage

Speaking of flash, what is the format of the data flash? The VMU allows for storage of multiple datafiles and one minigame. How? The format is actually rather like FAT16, with only a few minor differences. I will note the differences here, and assume you can find a FAT16 spec all by yourself. Cluster size is always one block. The superblock is in the last block (255) and not first (0). The FAT and the directory are also at the end. They are located using the superblock and can be anywhere, but conventionally the FAT is in block 254 and directory in blocks 241 - 253. The "unused" FAT marker is 0xFFFC, the "end of chain" marker is 0xFFFA. Conventionally data is not stored past block 200. The minigame must begin at block 0 and be contiguous, so the data files are usually positioned near the end of the free space, to leave as much space as possible for the game.

The superblock has a few useful fields, and a lot of useless ones. The useful ones are all 16-bit little-endian values as follows. FAT's block number is stored in 0x46, and FAT's length in blocks is at 0x48. Directory's block number is at 0x4A, and its length in blocks is at 0x4C. VMU's BIOS supports two types of files: 0xCC = game and 0x33 = datafile. Other types will be ignored but will break nothing, so feel free to make them. VMU's BIOS does not support nested directories, but, once again, it is quite easy to make a nested directory for your own purpose by just making it a file with a given type and treating it as a directory.

Anyways, directory entries are 32 bytes in size. Their format follows. At offset 0x00 we have the type byte (0x33 or 0xCC). At 0x01, we have copy protection flag (0xFF or 0x00). At 0x02 we have a 16-bit block number for where the file starts. The next 12 bytes are the file name (NULL-terminated if there is space). The next 8 bytes are the timestamp (in a weird BCD format). Then at 0x18 is the file length in blocks. At 0x1A is the file header offset in blocks (VMU expects each file to have a header, but for non-games you can skip it). For games the header offset is usually just 0x01.
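The offsets above translate fairly directly into code. What follows is a minimal Python sketch of a reader, under a few assumptions that are mine and not from the VMU docs: a block is taken to be 512 bytes (128K of flash over 256 blocks), FAT entries and the 16-bit directory fields are assumed little-endian like the superblock fields, and all function names are hypothetical.

```python
import struct

BLOCK_SIZE = 512      # assumption: 128K of flash split into 256 blocks
FAT_UNUSED = 0xFFFC   # "unused" FAT marker
FAT_END = 0xFFFA      # "end of chain" FAT marker


def parse_superblock(block):
    """Pull the four useful 16-bit little-endian fields out of block 255."""
    fat_block, fat_len, dir_block, dir_len = struct.unpack_from("<4H", block, 0x46)
    return {"fat_block": fat_block, "fat_len": fat_len,
            "dir_block": dir_block, "dir_len": dir_len}


def read_fat(image, superblock):
    """Return the FAT as a list of 16-bit entries (little-endian: assumption)."""
    start = superblock["fat_block"] * BLOCK_SIZE
    size = superblock["fat_len"] * BLOCK_SIZE
    return list(struct.unpack_from("<%dH" % (size // 2), image, start))


def parse_dir_entry(entry):
    """Decode one 32-byte directory entry (16-bit fields assumed little-endian)."""
    ftype, copy_flag, start = struct.unpack_from("<BBH", entry, 0x00)
    name = bytes(entry[0x04:0x10]).split(b"\x00")[0].decode("ascii", "replace")
    length, header_offset = struct.unpack_from("<HH", entry, 0x18)
    return {"type": ftype, "copy_flag": copy_flag, "start_block": start,
            "name": name, "timestamp_bcd": bytes(entry[0x10:0x18]),
            "length_blocks": length, "header_offset": header_offset}


def follow_chain(fat, start):
    """Return the list of blocks in a file, walking the FAT from `start`."""
    chain, block = [], start
    while block != FAT_END:
        if block == FAT_UNUSED or block >= len(fat) or block in chain:
            raise ValueError("corrupt FAT chain")
        chain.append(block)
        block = fat[block]
    return chain
```

Typical use would be: read block 255, call `parse_superblock`, load the FAT with `read_fat`, then walk each directory entry's `start_block` through `follow_chain`. The BCD timestamp is deliberately left undecoded, since its exact layout is not described here.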

VMU's comms abilities

VMUs can communicate with each other by being plugged into each other's connectors. The actual method of communications is effectively SPI. The VMU comes with two SPI units, and for VMU-to-VMU communications, they configure one as SPI master and another as slave. The pins are wired to the connector such that when two are plugged in to each other, one's master pins connect to the other's slave pins. I did not spend much time on this interface since it is rather limited and slow. It operates at file-level granularity, lacking useful primitives like direct reads and writes of the flash area. It also requires user interaction. However, if you want two-way comms and a method to send data to the VMU from your Arduino, this is the way to do it.

The other communications protocol that VMU supports is MAPLE. This is the proprietary protocol that SEGA came up with. I was able to figure out how it works. But decoding scope traces gets annoying very quickly, so I wrote a plugin for the wonderful Saleae Logic analyzer to decode it. I present it here today as well (it was my first time writing a Saleae plugin and I hate C++, so the code is a bit of a mess. It does not help that their development story is "use old sdk, debug, then switch to new SDK and use, you cannot debug new app - it will crash under debugger"). MAPLE protocol includes: a physical layer (two wires of single-ended 3.3V-level logic referenced to ground); a data-link layer that provides for packet preamble and CRC/post-amble to allow packets to have clear borders; and a network layer that includes addressing packets to places, device discovery, and device configuration. The physical layer is interesting because I've seen two very different implementations of it. The SEGA-to-Controller MAPLE bus is just two wires and is bidirectional. The Controller-to-VMU bus has four wires, and essentially consists of two MAPLE busses, each unidirectional. Googling around found a document on a MAPLE-bus gun accessory, which shed some light onto the higher-layer protocol. It was a start, but to get more info, the ROM would have to be disassembled.

Exploring the known ROM

The SEGA SDK for the VMU comes with the image of the VMU ROM v1.04, in Japanese. It looked like garbage and did not match the instruction set, also featuring lots of repeating bytes. Usually in ROM repeating bytes are 0xFF or 0x00. I figured it was probably a simple addition or XOR obfuscation and was right. The magic bytes were 0x37 and 0x43 for the two included BIOSes (one is normal, one skips the clock-setting-ui at boot and just runs the game).
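The guessing step can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not the tool actually used: it assumes a single-byte key, it assumes the most common byte in the image is padding that was originally 0xFF or 0x00, and since the text does not say whether the scheme was addition or XOR, it simply lists the candidate key for each combination. The function names are hypothetical.

```python
from collections import Counter


def candidate_keys(obfuscated):
    """Guess single-byte keys from the most common byte in an obfuscated image."""
    common = Counter(obfuscated).most_common(1)[0][0]
    return {
        ("xor", 0xFF): common ^ 0xFF,           # padding was 0xFF, scheme is XOR
        ("xor", 0x00): common,                  # padding was 0x00, scheme is XOR
        ("add", 0xFF): (common - 0xFF) & 0xFF,  # padding was 0xFF, key was added
        ("add", 0x00): common,                  # padding was 0x00, key was added
    }


def undo(obfuscated, scheme, key):
    """Reverse a single-byte XOR or additive obfuscation."""
    if scheme == "xor":
        return bytes(b ^ key for b in obfuscated)
    if scheme == "add":
        return bytes((b - key) & 0xFF for b in obfuscated)
    raise ValueError("unknown scheme: %r" % scheme)
```

Each candidate still has to be checked by eye: the right one is the one whose output starts disassembling into sensible instructions.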

Disassembling the ROM one instruction at a time, looking at the CPU docs was getting old too. I wrote an IDA plugin then to disassemble the instruction set of the VMU. I then taught it about the known (and later ones I reverse engineered) IO Registers and known API call addresses. I am releasing it here today too (I still hate C++, so the code is a bit messy). This IDA plugin will properly disassemble any ".vms" file and produce proper readable code, with cross-references and all the other conveniences you've come to expect from an IDA plugin, including labeling the syscalls to ROM.

The SEGA-to-VMU comms protocol revealed

Being able to disassemble the ROM was a huge help, and I was finally able to decode the entire higher-level MAPLE protocol. A few previously-undocumented details were found along the way, as well as some errors in previous documentation. Curiously, the entire comms with the Dreamcast is driven by software despite being very fast at 2Mbps. To start with, MAPLE protocol sends almost all data in 32-bit words. They are assembled from bytes big-endian, but then sent little-endian. That means that after any reception, and before any transmission, bytes need to be byteswapped in groups of 4. This is not simply "little endian" since strings are also affected. The basic structure of a packet is one word. Its constituent bytes, in order from LSB to MSB, are TYP, TO, FROM, and LEN. TYP is, obviously, the type of packet being sent. TO and FROM are addresses. Their top 2 bits identify the Dreamcast port the VMU is connected to (A=0, B=1, etc), and the bottom 6 are 0 for the Dreamcast and 1 or 2 for the VMU, depending on whether it is in slot 1 or slot 2 of the controller. LEN is the number of additional 32-bit words being sent in the packet. Given the length being in words, and header being a word, why did I say the data is sent almost always in words? Because after all this, one extra byte is sent as a checksum. The checksum used is just XOR of all the previously-sent bytes of this packet. This means that every packet is at least 5 bytes long, and at most 1025 bytes.
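Here is a minimal Python sketch of the framing just described. It is not the STM32 code and the function names are made up. It builds the packet in "logical" order (TYP, TO, FROM, LEN, payload, checksum) and leaves the group-of-4 byteswap as a separate helper, because exactly which side of the swap ends up on the wire is something this sketch does not try to settle. The XOR checksum is the same either way, since XOR does not care about byte order.

```python
from functools import reduce


def address(port, unit):
    """Top 2 bits: controller port (A=0, B=1, ...); low 6 bits: 0 = Dreamcast, 1 or 2 = VMU slot."""
    return ((port & 0x3) << 6) | (unit & 0x3F)


def swap4(data):
    """Reverse the byte order inside every group of 4 bytes."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))


def build_packet(typ, to, frm, payload=b""):
    """Header word (TYP, TO, FROM, LEN), payload words, then a one-byte XOR checksum."""
    if len(payload) % 4 or len(payload) > 255 * 4:
        raise ValueError("payload must be 0..255 whole 32-bit words")
    header = bytes([typ & 0xFF, to & 0xFF, frm & 0xFF, len(payload) // 4])
    body = header + bytes(payload)
    checksum = reduce(lambda a, b: a ^ b, body, 0)
    return body + bytes([checksum])
```

For example, `build_packet(0x01, address(0, 1), address(0, 0))` is the 5-byte minimum packet, and a payload of 255 words gives the 1025-byte maximum mentioned above.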

Type Byte | Packet Name | Data | Notes
0x01 | Get Dev Info/Init | - | Replies with device_info. Also must be first command sent to device
0x02 | Get Extended Dev Info | - | Replies with extended_device_info
0x03 | Reset Dev | - | Resets the device's MAPLE interface
0x04 | Shutdown Dev | - | Deinits device's MAPLE interface
0x05 | Device Info transfer | u32 functions_supported; u32 per_function_info[3]; u8 irrelevant[2]; char dev_name[30]; char license[60]; u32 pwrInfo; | Reply to device_info
0x06 | Extended Device Info transfer | u32 functions_supported; u32 per_function_info[3]; u8 irrelevant[2]; char dev_name[30]; char license[60]; u32 pwrInfo; char version[]; //till end of packet | Reply to extended_device_info
0x07 | ACK | - | Generic acknowledgement
0x08 | Data Xfer | data... | Transfer of some generic data in reply to a request
0x09 | Get Condition | u32 dst_function | Sort of like a secondary read. VMU only replies if target function is 8 (aka CLOCK). In that case reply is data xfer: {u8 zeroes[11], u8 buttonState}.
0x0A | Get Memory Info | u32 dst_function | Gets memory info. Reply data depends on function. For flash (function 2), some volume metrics are sent back, for LCD (function 4), some dimensions
0x0B | Read Block | u32 dst_function; u16 block; u8 phase; u8 pt; | For clock function, reply is just 12 bytes of clock data. For flash: pt should always be zero, block is the flash block to read, phase should be zero. Reply will be a data xfer with 512 bytes
0x0C | Write Block | u32 dst_function; u16 block; u8 phase; u8 pt; u8 data[] | For LCD function, data is 192 bytes, for clock function - 12. For flash: pt should always be zero, block is the flash block to write, phase should be 0..3 as you write each quarter of the block in turn. Data will be 128 bytes. After four quarters are written, command 0x0D must be sent.
0x0D | Complete Write | u32 dst_function; u16 block; u8 phase; u8 pt; | Called after 4 calls to flash write. block and pt should be the same as for write, phase should be 4 (one more than last write)
0x0E | Set Condition | u32 dst_function; u8 data[] | Sort of like a secondary write. VMU only replies if target function is 8 (aka CLOCK). In that case data is: {u8 reserved[2]; u8 dutyCy; u8 period}. This is used to configure the buzzer on the VMU to make a beep. It will go on until stopped
0xFA | ERROR: with code | u32 code | Some kind of error. Attached word is the error code
0xFB | ERROR: invalid flash address | - | The command you sent had an invalid flash address in it
0xFC | Please resend last packet | - | Causes interlocutor to resend last packet. In reality VMU does not record replies, so it will re-execute the last command it remembers if it can. Commands with data VMU will not re-execute. VMU will send you this if you send too much too fast or a CRC error occurs in the transmission.
0xFD | ERROR: unknown command | - | Command you sent was incomprehensible
0xFE | ERROR: unknown function | - | Command you sent had an incomprehensible "function" code

As you see, some of this had never been documented properly or at all, like the flash write complete command, and the ability to read buttons or sound the buzzer. My guess is that this is because the former can be skipped if you issue a read just after a write, and the latter has not been used since, inside the controller, the VMU's buttons are inaccessible. Well, now you can use them all. This opens the possibility of using the VMU as a convenient I/O device for some other project.
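As a worked example of the flash write sequence from the table (four Write Block commands with phases 0..3, then a Complete Write with phase 4), here is a short Python sketch. It only produces a list of steps and does not serialize anything, so it makes no claim about field byte order on the wire. The function name and the dictionary layout are invented for illustration.

```python
WRITE_BLOCK = 0x0C
COMPLETE_WRITE = 0x0D
FUNC_FLASH = 2        # "flash (function 2)" from the table
BLOCK_SIZE = 512
QUARTER = 128


def plan_flash_write(block, data):
    """Return the five commands needed to write one 512-byte flash block."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("a flash block is exactly 512 bytes")
    steps = []
    for phase in range(4):
        chunk = bytes(data[phase * QUARTER:(phase + 1) * QUARTER])
        steps.append({"cmd": WRITE_BLOCK, "function": FUNC_FLASH,
                      "block": block, "phase": phase, "pt": 0, "data": chunk})
    # Phase is one more than the last write; block and pt match the writes.
    steps.append({"cmd": COMPLETE_WRITE, "function": FUNC_FLASH,
                  "block": block, "phase": 4, "pt": 0, "data": b""})
    return steps
```

Each step would then be wrapped into a MAPLE packet and sent, with enough of a pause between packets that the VMU does not answer with 0xFC.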

MAPLE protocol on STM32

It was time to implement the communications protocol. Sending MAPLE packets is easy - you're the master, send them as fast or as slow as you'd like. I used an STM32F103 board I had lying around for this. This worked and the VMU replied. I could see this on the logic analyzer trace. Capturing this reply in the STM32 was much harder. The VMU belts it out at 2 megabits per second, and since MAPLE is not like SPI or any other protocol that STM32 supports, the reception had to be done by hand. After MANY attempts in various directions, a hand-unrolled state machine in ARM assembly finally worked. And even then, this required overclocking the STM32 to 80MHz. But who cares? It worked! I could now send and receive messages using the MAPLE bus. This allowed experimentation with all those commands I learned in the ROM. It all worked - I could draw to screen, beep the buzzer, read buttons, read and write flash. Cool! I am releasing this code today as well.

A very interesting sidenote: as long as you give the VMU ample time to process messages, you'll not get back the dreaded 0xFC reply, and thus if you only intend to write to it, you can do it at any speed, and ignore the responses. So, if you just want to draw onscreen, or upload code (like the code I shall provide later), you need not write fancy assembly or use mine. Even a lowly Arduino will do.

Well, now that I knew how the communications worked, what the messages were, and how to write data to the VMU, it was time to try to run some code on it. Well, sort of. I still lacked any way to do that. I started writing my own assembler for the weird CPU in there, but then realized that Marcus Comstedt, from the first wave of VMU developers in the 2000s, had already written one, and miraculously someone else had uploaded it to GitHub, so it was still online here. Sweet! It was not perfect, but it saved me a day of work finishing mine! Next up: a C compiler. I toyed with this idea for a while. But the more I looked at the CPU architecture, the more it was clear how C-hostile it is. There is effectively no sane way I can think of to generate not-terribly-bloated code for it, so the idea was abandoned. Later, I did indeed come up with another way to enable anyone to program the VMU without learning LC86K assembly.

Dumping the previously-unseen American v1.05 ROM

However, for now, what bugged me was that the version 1.05 ROM of the English VMU had never been dumped from the VMU (to the best of my knowledge). As you'll undoubtedly know from reading the VMU datasheet, the CPU executes from codespace, selectable from ROM or Flash at any given point in time. The instruction LDC can be used to read the current code space at address (TRH:TRL) + ACC. One cannot call functions in ROM from Flash or vice-versa using a normal subroutine call. Instead one uses the CHANGE meta-instruction. The docs never mentioned what that instruction actually is, but ROM disassembly made it clear: It was a combination of NOT1 EXT,0 and a JMPF to the destination address. So first one clearly switches the memory space in use, with delayed effect, and the second does a jump to a known address. How does return work? It does not. To return the ROM does the same thing to return to a known return address. The delayed effect was interesting and warranted experimentation. It turns out that the memory space is switched only when a JMPF is executed. So doing just the naive thing: NOT1 EXT,0 and then trying to LDC some data failed to read the ROM. So much for the easy option. Well, what else could we try?

The Japanese ROM we had, logically, should be similar to the English one in structure. After all, they do the same thing. Likely various functions are in different places, but the logic should be similar. If only I could find some info about where an LDC instruction lives in the English ROM, I could CHANGE to it, craft a stack to have it return back to me using a fixed-location CHANGE in the ROM. Most of the LDC instructions in the ROM deal with drawing characters onscreen, and are not easily reachable from user code. Plus - we do not even know their addresses, and guessing where a particular byte is in the 65,536-byte address space takes a while. What else?

Well, there was one more use of LDC in there - in the function that copied data from ROM into the MAPLE TX buffer, byteswapping it in the process. This was used to send back the reply to the "Device info" command as it is long and static. Well, cool, but how do we find that function in the ROM we cannot read? Well, recall that we discovered that the IRQ handler for the MAPLE bus is at 0x0043. Never mind, you cannot recall it since it is not documented (docs call it a rather cryptic "VMS interrupt request"). Anyways, from ROM disassembly it was clear. Well, it reads the MAPLE RX buffer (aka WORK RAM), processes the bytes in there as a command, prepares a reply, and sends it by writing it to WORK RAM and toggling the "send bit" in the MAPLE comms register (undocumented - see later). We know that besides the "get device info" command, which has a lot of code, a few commands that also call that ROM-to-WORK-RAM-with-byteswap code have less. Specifically, the simplest code path to it is the "get memory info" command with the "LCD" function code. OK, so, since the MAPLE RX buffer is the same thing as the WORK RAM, we can populate the WORK RAM with a "get memory info" command with the "LCD" function code set, prepare a special return address on stack (an address of a CHANGE in ROM, which we know exists because their locations are hardcoded), place a RET in our flash's hardcoded location ROM will CHANGE to, turn off interrupts, CHANGE into ROM space to jump to the MAPLE IRQ handler directly, and see what happens. This is effectively a ROP attack, while blindfolded. Well, it does what we expected - it populates the reply packet into WORK RAM. OK, so it works, and we can get control back to us. But all we got was a few bytes we already knew - the "memory info" reply for the "LCD" function. Nothing useful came of this, right?

Wrong. Besides the data in the WORK RAM, the ROM code left us a treasure chest above the stack pointer. There, it pushed return addresses of internal functions used to do this work. Given that we know that a CALL instruction is 2 bytes, and the longer CALLF is 3, we know the callsites of each function too. Since we were clever to pre-populate most common regs (ACC, B, C, TRH, TRL) with magic values before this call, we see where they were pushed on stack too. The stack layout matches the Japanese ROM, but the addresses differ. Sadly they are not simply offset by a constant amount, but this is useful already.
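The arithmetic for turning a leftover return address into a call site is tiny, but spelling it out helps. This Python snippet is purely illustrative (the function name is made up): since a return address points at the instruction after the call, the call itself starts 2 bytes earlier if it was a CALL and 3 bytes earlier if it was a CALLF. Which of the two it was has to be decided from context, for example by comparing against the Japanese ROM.

```python
CALL_LEN = 2    # bytes in a CALL instruction
CALLF_LEN = 3   # bytes in a CALLF instruction


def candidate_call_sites(return_address):
    """Map a 16-bit return address to the two possible call-site addresses."""
    return {
        "CALL": (return_address - CALL_LEN) & 0xFFFF,
        "CALLF": (return_address - CALLF_LEN) & 0xFFFF,
    }
```

For a return address of 0x1234 found above the stack pointer, the call was either a CALL at 0x1232 or a CALLF at 0x1231.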

The deepest function call is the call to the function that copies ROM data to WORK RAM. It takes pointer in TRH:TRL and number of words to copy in B. We know where it is called from, and that after it returns, a few more instructions execute before another RET is executed. None of them harmful (they do a MAPLE send). This means that if we craft the stack just right, we can call it to read any ROM location. Does it work? Yes it does! I used my little "show hex" function to display one byte at a time onscreen and jotted them down on paper. Before you know it, I was able to figure out where this function actually lives (by reading the CALL instr I knew called it), to call it more directly. But I was not going to dump 64K by hand. I needed a better solution. SPI was an option, but it is a pain. Why not just copy ROM to flash?

The final ROP attack to dump the ROM

The official write_to_flash API has limits - it only writes in the first 64K, and only within the actual game file. This is annoying. Luckily, looking at the Japanese ROM, we can see that after all these annoying checks, a few calls deep is the REAL write_flash func with no silly limitations. Using my existing on-screen dumping method I was able to find where it lived by dumping the official function (which starts at a documented hardcoded address in ROM, so that it may be called using a CHANGE). The location DID differ from the Japanese ROM, but, curiously, a jumptable of many useful funcs, including this one, seemed to be the same at 0xE000. I think this is on purpose. Cool! Anyways, at 0xE024 is a jump to a func that will write any aligned 128 bytes anywhere in flash.

Well, at this point, of course, I combined the two ROP gadgets to dump the entire ROM 0x0000 - 0xFFFF to Flash at 0x4000 - 0x13FFF. It took about 4 minutes. I then used my STM32-based MAPLE comms board to read the flash and save it to a file. It disassembled perfectly. To the best of my knowledge this is the first time this ROM version has been extracted from a VMU. I present it for your pleasure here as well. (If SEGA for some reason cares, I can remove it, but I figure enough decades have passed that this is probably OK).
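To put numbers on that dump, here is a small Python sketch of the address bookkeeping only (the real work was done by ROP gadgets on the VMU, not by Python). It assumes the 128-byte write granularity mentioned above and a 512-byte flash block, which is what the Read Block command returns. The names are hypothetical.

```python
ROM_SIZE = 0x10000     # 64K of ROM, 0x0000 - 0xFFFF
FLASH_BASE = 0x4000    # where the dump starts in flash
CHUNK = 128            # the raw flash write routine handles aligned 128-byte pieces
BLOCK_SIZE = 512       # flash block size as seen over MAPLE


def dump_plan():
    """Yield (rom_address, flash_address) pairs, one per 128-byte write."""
    for rom in range(0, ROM_SIZE, CHUNK):
        yield rom, FLASH_BASE + rom


def flash_blocks_used():
    """Return (first_block, last_block) to read back over MAPLE afterwards."""
    first = FLASH_BASE // BLOCK_SIZE
    last = (FLASH_BASE + ROM_SIZE - 1) // BLOCK_SIZE
    return first, last
```

That works out to 512 writes, landing in flash blocks 32 through 159, which are then read back one block at a time with the Read Block command.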


OK, I had the ROM, I could run code, what else is there to do? Well, there is always ONE MORE THING. Since the VMU is durable, cheap, has a screen and buttons, and can communicate with others, it is an ideal development and teaching device. It is a pity that the CPU architecture is so user-hostile that I cannot imagine targeting a C compiler to it. Nobody's going to write serious code in assembly for an obscure old CPU, right?

ONE MORE THING: uM23 ! (that's right! A full ARM Cortex-M23 emulator in VMU assembly!)

Well... I would. So what will I write? Given my goals of making the VMU accessible to all, I decided to write an ARM emulator of course... In VMU assembly... Well, ARM is large, so Thumb. Cortex-M0 is a good target, but it lacks a few useful instructions that help with code density and speed. Cortex-M3 instruction set is too complex to easily decode (especially in assembly on a weird CPU). What to do? Cortex-M23! It is like a Cortex-M0, but with most of what M0 has always been missing added. Namely, it is Cortex-M0 with the addition of these instructions: UDIV, SDIV, MOVW, MOVT, CBZ, CBNZ - exactly what it has been missing. The only things it needs for perfection would be UMULL, UMLAL, SMULL, SMLAL, but sadly ARM did not see fit to add them. Currently no Cortex-M23 silicon exists, so the VMU now might be the first real device you can get your hands on that will run Cortex-M23 code! Ha!

Memory map

Anyways, back to our VMUs... I got to work on writing a full Cortex-M23 emulator. It was hard work, but I managed to do it. This emulator, unlike my previous Cortex-M0 emulator for the AVR, is complete. Not only does it expose all of the VMU's functionality to the emulated code, it even allows interrupt handling, nesting, and exceptions to work properly like a real Cortex-M23. The exposed memory map is as follows:

Start | End | Mapped to | Repeats every | Notes
0x00000000 | 0x0FFFFFFF | Flash @ start of emulated program | 128K | The running program is always mapped here at 0
0x10000000 | 0x1FFFFFFF | Internal RAM bank 1 | 256 B | -
0x20000000 | 0x2FFFFFFF | Work RAM | 512 B | -
0x30000000 | 0x3FFFFFFF | Flash | 128K | Raw flash as it really is
0x40000000 | 0x4FFFFFFF | SFRs | 128 B | Direct access to all of the VMU's CPU's special hardware regs
0x50000000 | 0x5FFFFFFF | Graphics RAM | 512 B | Direct access to all of the VMU's graphics memory
0x60000000 | 0x6FFFFFFF | IRQC | 4 B | Our emulated IRQ controller

Each address space region repeats inside at a given boundary. This is on purpose. For example, that allows you to treat work RAM and internal RAM together as one large chunk of RAM. Just consider the 768 bytes at 0x1FFFFF00. Due to repeat, the first 256 B map to internal RAM, and the next 512 B to work RAM. Cool, right? What happened to internal RAM bank 0, you might ask. Well, the VMU OS uses the bottom 128 bytes so we leave those alone. In theory we could stop the OS from running, but we'd lose RTC, so we leave it be. The top 128 B are usually allocated to stack. Not quite so in uM23. The emulated CPU's state is 67 bytes. The emulated IRQ controller uses another 4 bytes. We allocate another 50 bytes to stack, which is enough for us. That accounts for bank 0 of internal RAM. Speed? Fast enough for flappy bird :).
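The repeat trick is easy to see in code. Below is a minimal Python sketch of the address translation implied by the table. It is a guess at the shape of the logic, not the emulator's actual readmem()/writemem() code: the top nibble of the address picks the region, and the rest is reduced modulo the region's repeat size. The name `translate` is invented for this example.

```python
REGIONS = {
    0x0: ("program flash", 128 * 1024),
    0x1: ("internal RAM bank 1", 256),
    0x2: ("work RAM", 512),
    0x3: ("raw flash", 128 * 1024),
    0x4: ("SFRs", 128),
    0x5: ("graphics RAM", 512),
    0x6: ("IRQC", 4),
}


def translate(addr):
    """Map an emulated 32-bit address to (region name, offset inside the region)."""
    region = REGIONS.get((addr >> 28) & 0xF)
    if region is None:
        # The real emulator raises a HardFault for nonexistent memory.
        raise LookupError("unmapped address 0x%08X" % addr)
    name, repeat = region
    return name, addr % repeat
```

Walking from 0x1FFFFF00 upward, the first 256 addresses translate to internal RAM offsets 0..255, and the very next address, 0x20000000, is work RAM offset 0, which is the 768-byte contiguous chunk described above.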

Interrupts

IRQ # | Vector number | Name
0 | 16 | IRQ0
1 | 17 | IRQ1
2 | 18 | IRQ2
3 | 19 | IRQ3
4 | 20 | SIO_0
5 | 21 | SIO_1
6 | 22 | MAPLE interrupt
7 | 23 | P3 (Port 3 is buttons)
8 | 24 | T0L.OVF (timer 0 low overflow)
9 | 25 | T0H.OVF (timer 0 high overflow)
10 | 26 | T1L.OVF (timer 1 low overflow)
11 | 27 | T1H.OVF (timer 1 high overflow)
12 | 28 | Base timer Irq 0
13 | 29 | Base timer Irq 1

All interrupts from the host CPU are actually forwarded into the emulated CPU. Yes, you read that right! You can handle VMU interrupts from your emulated code! I did not emulate the entire NVIC (too much work) though. I wrote my own IRQ controller (IRQC) which is a simple one. There is a 16-bit mask register at offset 0x02 and status register at 0x00. An interrupt is only caused if the mask bit is set and status bit is set. Status bits are cleared by writing one to them. IRQ numbers (and ARM vector numbers) are mapped as you see in the table. Normal ARM exceptions also work. Yes! You can write a custom SVC handler and expect it to work. Yes, HardFault will happen when you access unaligned, nonexistent, or read-only memory. The whole thing works!
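The IRQC rules are simple enough to model in a few lines. The Python class below is a toy model written for this explanation, not the emulator's code: it implements "pending = status AND mask", write-one-to-clear status bits, and the IRQ-to-vector offset of 16 from the table. Picking the lowest-numbered pending IRQ first is an assumption of the sketch, since the text does not state a priority order.

```python
class Irqc:
    """Toy model of the emulated IRQ controller."""

    def __init__(self):
        self.status = 0   # 16-bit status register at offset 0x00
        self.mask = 0     # 16-bit mask register at offset 0x02

    def raise_irq(self, irq):
        """Hardware side: latch an interrupt request (IRQ 0..13)."""
        self.status |= (1 << irq) & 0xFFFF

    def write_status(self, value):
        """Guest side: writing a 1 to a status bit clears it."""
        self.status &= ~value & 0xFFFF

    def write_mask(self, value):
        """Guest side: set which IRQs are allowed to fire."""
        self.mask = value & 0xFFFF

    def pending(self):
        """An interrupt fires only if both its status and mask bits are set."""
        return self.status & self.mask

    def next_vector(self):
        """ARM vector number of the lowest pending IRQ, or None (order is an assumption)."""
        pending = self.pending()
        if not pending:
            return None
        irq = (pending & -pending).bit_length() - 1
        return irq + 16
```

A button handler, for instance, would set bit 7 in the mask, take vector 23 when P3 fires, and write `1 << 7` to the status register before returning.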


Instructions

The emulated CPU supports pretty much every instruction that the Cortex-M23 has! I do not implement the optional security extensions, of course. In fact, a few extra instructions are supported beyond what a stock Cortex-M23 supports. Which? UMULL, UMLAL, SMULL, SMLAL! Why? Because I felt strongly about these and felt like I should. While an unmodified Cortex-M23 compiler will not emit them, it is simple to change one so that it does. Alternatively (see the mandelbrot example) a simple gcc-ism can be used to get the functionality. This provides large speed gains for certain types of emulated code. WFI and WFE are also supported. WFI will halt the VMU CPU, waiting for interrupts (even masked ones), like a button press or a timer expiry. WFE will put the VMU to sleep (it wakes up only when the user presses the SLEEP button).
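For reference, this is what those four long-multiply instructions compute, written as portable C. It is a sketch of the architectural semantics only: it is neither the emulator's implementation nor the gcc-ism used in the mandelbrot example. The function names simply mirror the mnemonics, and fix16_mul is a made-up example of the kind of fixed-point code that benefits.

```c
#include <stdint.h>

/* UMULL: RdHi:RdLo = Rn * Rm, unsigned 32x32 -> 64 */
static uint64_t umull(uint32_t rn, uint32_t rm)
{
    return (uint64_t)rn * rm;
}

/* UMLAL: RdHi:RdLo += Rn * Rm, unsigned, wraps modulo 2^64 */
static uint64_t umlal(uint64_t acc, uint32_t rn, uint32_t rm)
{
    return acc + (uint64_t)rn * rm;
}

/* SMULL: RdHi:RdLo = Rn * Rm, signed 32x32 -> 64 */
static int64_t smull(int32_t rn, int32_t rm)
{
    return (int64_t)rn * rm;
}

/* SMLAL: RdHi:RdLo += Rn * Rm, signed. The add is done in unsigned
 * arithmetic so that wraparound is well defined in C. */
static int64_t smlal(int64_t acc, int32_t rn, int32_t rm)
{
    return (int64_t)((uint64_t)acc + (uint64_t)((int64_t)rn * rm));
}

/* Illustrative use: a Q16.16 fixed-point multiply is one SMULL plus a shift.
 * Right-shifting a negative value is implementation-defined in C; common
 * compilers perform an arithmetic shift. */
static int32_t fix16_mul(int32_t a, int32_t b)
{
    return (int32_t)(smull(a, b) >> 16);
}
```

Without a 32x32 -> 64 multiply, each of these becomes a multi-instruction helper routine, which is presumably where the speed gain for emulated code comes from.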


Structure

The structure of uM23 is pretty simple. The main CPU emulation code is in core.s. This implements the instruction decoding and execution. The file soc.c implements the SoC layer. What's that? Mostly readmem()/writemem() and mapping those onto real hardware. The file ui.s implements the UI that allows you to select the file you want to run. It also houses the 4x6 font (a version of the Tom Thumb font) and the code to render it onscreen. Since the VMU uses a FAT16-like filesystem and it can get fragmented, defrag.s implements a simple defragmenter to defrag binaries before we emulate them (yes, I wrote a disk defragmentation utility in assembly). Lastly, uM23.s implements the glue that holds this all together. It also implements the hypercall handlers.


Hypercalls

"Hypercalls"? Of course there is a method to call out to the emulator to request additional services. The opcodes for that are 0xDExx, where xx is the hypercall number. In case you want to set the hypercall number programmatically, use 0xDE00 and store the number in the low 8 bits of R12. Parameters are in R0..R3 and the return value is in R0. What are the supported hypercalls?

Number | Name | Params | Returns | Notes
1 | DbgPutchar | char ch | - | Prints a character to the emulator's debug console. Do not do this on real hardware.
2 | DbgPutword | u32 val | - | Prints a 32-bit word to the emulator's debug console. Do not do this on real hardware.
3 | GetSelfAddr | - | const void* | Gets the address of this app in flash. While the app is mapped at 0, it is also somewhere in the flash range 0x30000000 - 0x3FFFFFFF.
4 | GetFontAddr | - | const u8* | The emulator includes a 4x6 font. This gets its address so emulated code can use it.
5 | FlashWrite | u32 addr | - | Data to write must be located at 0x10000080. Address must be 128-byte aligned.
6 | DrawChar | char ch, u8 row, u8 col, bool inverted | - | Draws a char to the screen at a given coordinate (char coords, not pixel coords).
7 | Exec | u32 addr | - | Executes another Cortex-M23 binary. Param is the address of the binary in flash.
8 | Exit | - | - | Returns to uM23.
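The opcode scheme above can be sketched in C as follows. This shows only the decode rule as described in the text; the enum and function names are invented, and the real handlers live in uM23.s in VMU assembly. As an aside, in the standard Thumb encoding the 0xDExx range is the permanently undefined UDF #imm8 instruction, so an assembler would presumably produce a fixed-number hypercall from something like `udf #6` or `.short 0xDE06`.

```c
#include <stdint.h>

#define HYPERCALL_OPCODE_BASE 0xDE00u

/* Hypercall numbers from the table; the enum names are made up. */
enum {
    HC_DBG_PUTCHAR   = 1,
    HC_DBG_PUTWORD   = 2,
    HC_GET_SELF_ADDR = 3,
    HC_GET_FONT_ADDR = 4,
    HC_FLASH_WRITE   = 5,
    HC_DRAW_CHAR     = 6,
    HC_EXEC          = 7,
    HC_EXIT          = 8
};

/* Returns the hypercall number for a 16-bit opcode, or -1 if the opcode
 * is not a hypercall. 0xDE00 means "number is in the low 8 bits of R12". */
static int hypercall_number(uint16_t opcode, uint32_t r12)
{
    unsigned imm;

    if ((opcode & 0xFF00u) != HYPERCALL_OPCODE_BASE)
        return -1;
    imm = opcode & 0xFFu;
    return (int)(imm != 0 ? imm : (r12 & 0xFFu));
}

/* Builds the fixed-number opcode for a given hypercall. */
static uint16_t hypercall_opcode(unsigned number)
{
    return (uint16_t)(HYPERCALL_OPCODE_BASE | (number & 0xFFu));
}
```

So 0xDE06 always means DrawChar, while 0xDE00 with R12 = 8 means Exit. The second form is handy when the number is only known at run time.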

Emulating the VMU

I was not going to test my first large work in an assembly language of a new CPU on real hardware with no debugging capabilities, of course. It was time to write a VMU emulator. Yup. Luckily, MAME had an LC86K CPU emulator. Unluckily, it was written in some incomprehensible useless language (C++). Luckily, I was able to rewrite the CPU code in C. I then wrapped it into the rest of the VMU emulation code I wrote, added some SDL magic for the UI, and there it was. The emulator is also part of what I am releasing today. It is pretty easy to use. The A and B keys are mapped to the A and B buttons on the VMU. Arrow keys are mapped to the VMU's arrows. M and S are mapped to MODE and SELECT. The current emulator includes the Japanese 1.04 ROM, but I did test it with my dumped 1.05 American ROM with great success. On the command line, the VMU emulator takes a few params. The first is the name of the game file to load. If you find some online (look for the VMS extension), they will work here. If you want to load some other files, give them as additional params, interspersed with the names to give them in the internal filesystem. This is how I tested uM23 before I loaded it onto real hardware. Afterwards I also added a debugging facility to let me debug the code in the emulator. Given that every single possible opcode byte is used by this CPU, this did not look easy at first. But STF is useless from flash (where my code lives), so I reused that opcode to mean "debug", and the emulator interprets the following bytes as different kinds of debug actions. See the source for more details.

Back to our emulators

Code size

The emulator took quite a while to write and a lot longer to debug. I tested each opcode with the most representative data I could think of and all the corner cases I could find in the ARMv8-M doc's footnotes. The CPU core emulator is just over 2000 instructions, optimized for size. I am sure, though, that it can be improved. After all, this is my first major piece of code for this obscure CPU. The total amount of code in the entire uM23 is about 3000 instructions. This includes the UI, defrag code, memory access code, and all else.

Multiple programs concurrently on the VMU for the first time

What does uM23 do? When run, it will find all VMU datafiles with names ending in ".M23" and show them in a scrollable menu. When one is selected (by pressing A), it will show info on the file. If A is pressed again, the file is run as a Cortex-M23 binary. If the file is fragmented, defragmentation is performed first, since the emulator assumes the file is linear in flash (for speed). Normally the VMU can only have one minigame in it. uM23 acts as that game. But since it allows running any datafile as a binary, you can have as many as you wish. This is pretty cool!

Code demos

What good is an emulator with no demos? I put together two. The first is a simple Mandelbrot set grapher. It is not at all optimized. Since the display is one-bit black-and-white, I simply color a point white if it takes over 8 iterations to diverge, and black otherwise. Vertical symmetry is not exploited. The code can be compiled normally for Cortex-M23 and it works. Alternatively, it can be forced to use the SMLAL instruction I support, for a huge speedup and code size shrink. You can find the source for it in the archive I provide. The second demo is a flappy bird game clone. Why? Because I was bored and this seemed like a funny enough idea. I then had to slow it down, since otherwise I could never score a single point. Haha! This demo shows how to scroll the screen efficiently, how to draw, how to read buttons, how to use hypercalls to write text onscreen, etc. You can see both demos in the video and animation.
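To give an idea of the coloring rule, here is a minimal fixed-point sketch of a one-bit Mandelbrot renderer. It is not the demo from the archive. The 48x32 resolution matches the VMU screen, but the Q12 fixed-point format, the plotted region, the row-major MSB-first buffer layout (which is not the layout of the VMU's graphics RAM), and all names are assumptions made for this illustration.

```c
#include <stdint.h>

#define SCREEN_W  48
#define SCREEN_H  32
#define MAX_ITER  8      /* the text's threshold: over 8 iterations = white */
#define FIX_SHIFT 12     /* assumed fixed-point format: 12 fractional bits */

typedef int32_t fix_t;
#define FIX(n) ((fix_t)((n) * (1 << FIX_SHIFT)))   /* integer -> fixed */

/* 32x32 -> 64 multiply, then shift: the operation SMULL/SMLAL speed up.
 * Right shift of a negative value is implementation-defined in C; common
 * compilers perform an arithmetic shift. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * b) >> FIX_SHIFT);
}

/* Returns 1 (white) if c = cr + i*ci has not escaped after MAX_ITER
 * iterations of z = z*z + c, else 0 (black). */
static int mandel_pixel(fix_t cr, fix_t ci)
{
    fix_t zr = 0, zi = 0;

    for (int i = 0; i < MAX_ITER; i++) {
        fix_t zr2 = fix_mul(zr, zr);
        fix_t zi2 = fix_mul(zi, zi);

        if (zr2 + zi2 > FIX(4))          /* |z|^2 > 4 means it diverges */
            return 0;
        zi = 2 * fix_mul(zr, zi) + ci;
        zr = zr2 - zi2 + cr;
    }
    return 1;
}

/* Renders the region [-2, 1] x [-1, 1] into a 1-bit-per-pixel buffer. */
static void mandel_render(uint8_t fb[SCREEN_H][SCREEN_W / 8])
{
    for (int y = 0; y < SCREEN_H; y++) {
        fix_t ci = FIX(-1) + (fix_t)(((int64_t)FIX(2) * y) / (SCREEN_H - 1));

        for (int x = 0; x < SCREEN_W; x++) {
            fix_t cr = FIX(-2) + (fix_t)(((int64_t)FIX(3) * x) / (SCREEN_W - 1));

            if (x % 8 == 0)
                fb[y][x / 8] = 0;        /* start each byte cleared */
            if (mandel_pixel(cr, ci))
                fb[y][x / 8] |= (uint8_t)(0x80u >> (x % 8));
        }
    }
}
```

Every iteration costs three fixed-point multiplies, so on an emulated CPU the cost of fix_mul dominates, which fits the speedup reported above for the SMLAL build.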


Downloads

So, where can you get all this goodness?
=> [RIGHT HERE] <=
This archive contains all the things promised above. The license is as always: free for non-commercial use with attribution, else contact me. Enjoy! If you have questions/comments/etc, please feel free to write me an email. Enjoy your VMU hackery!
© 2012-2017