Well, as may be apparent, I like writing emulators. Since my last one (an ARM emulator that booted Linux), I was wondering how much faster one could emulate an ARM CPU on an ATmega device. The C-language core netted 10KHz. This time I decided to reduce the scope of the problem and try to use a smaller device. So the goal was then: a Cortex-M0 emulator on an ATTiny85. The original Thumb instruction set is slightly extended in ARMv6-M, adding sign extension and byte reversal instructions, as well as some really weird status registers. I did not bother with the weird status registers, since I found them unnecessary for most anything you'd use this for. The emulator was written in C and then rewritten directly in AVR assembly. The assembly is quite a bit faster than the C-language emulator (3x, in fact). This fulfills my secondary goal of shining a spotlight on the inadequacy of avr-gcc.
The memory map of the emulated Cortex is as follows: 0x40000000-0x4000FFFF allows access to the first 64K of the host AVR's flash memory space. It can be easily modified to support parts with over 64KB of flash, but I did not do this, since the code would then not run on an ATTiny (which lacks a RAMPZ register). 0x20000000-0x2000FFFF allows direct access to the host AVR's RAM, including access to I/O memory and even the registers themselves. You can access the EEPROM using the EEPROM registers, of course.
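To make the map concrete, here is a minimal C sketch of how an emulated address could be split into a region and a 16-bit host offset. This is not the actual assembly from the emulator. The names (emu_decode, REGION_RAM and so on) are made up for illustration, and only the two address ranges come from the description above.

```c
#include <stdint.h>

/* Hypothetical names; only the two ranges are taken from the text. */
#define EMU_RAM_BASE   UINT32_C(0x20000000)  /* host data space (regs, I/O, SRAM) */
#define EMU_FLASH_BASE UINT32_C(0x40000000)  /* first 64K of host flash */
#define EMU_WINDOW     UINT32_C(0x00010000)  /* each window is 64K */

typedef enum { REGION_NONE, REGION_RAM, REGION_FLASH } emu_region_t;

/* Classify an emulated address and hand back the 16-bit host offset.
 * The unsigned subtraction wraps for addresses below the base, so a
 * single compare per window is enough. */
static emu_region_t emu_decode(uint32_t addr, uint16_t *host_off)
{
    uint32_t off = (uint32_t)(addr - EMU_RAM_BASE);
    if (off < EMU_WINDOW) {
        *host_off = (uint16_t)off;
        return REGION_RAM;
    }
    off = (uint32_t)(addr - EMU_FLASH_BASE);
    if (off < EMU_WINDOW) {
        *host_off = (uint16_t)off;
        return REGION_FLASH;
    }
    return REGION_NONE; /* unmapped; what to do here is left open */
}
```

With this sketch, emulated address 0x20000060 decodes to RAM offset 0x60, and 0x4000FFFF decodes to flash offset 0xFFFF.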
Device support is quite wide. The assembly code requires that the device support the LPM and MOVW instructions. The emulator core is 1300 instructions, so it fits in an ATTiny85. You can, in fact, port the code to not use LPM and MOVW, but at the cost of larger code size. One problem: ATTiny-series AVRs do not have a multiply instruction. To support both ATTiny and ATmega at optimal speed, the multiply instruction is emulated by a separate file. The ATTiny version is mul_attiny.S and can do a multiply in about 500 cycles (quite slow). The ATmega version is mul_atmega.S and does a multiply in 34 cycles. I've tested this emulator on an ATTiny85 and found it to work well.
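For a feel of why a multiply is slow without a hardware "mul", here is a generic shift-and-add multiply in C. It is only a sketch of the general technique and is not claimed to be what mul_attiny.S does. The function name is made up. It returns the low 32 bits of the product, which is what the Thumb MULS instruction produces.

```c
#include <stdint.h>

/* Hypothetical helper: 32x32 multiply keeping the low 32 bits,
 * built only from shifts and adds (no hardware multiplier needed). */
static uint32_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the shifted multiplicand for each set bit */
        }
        a <<= 1;           /* next bit is worth twice as much */
        b >>= 1;           /* move on to the next multiplier bit */
    }
    return result;         /* wraps modulo 2^32, like MULS */
}
```

Each loop pass is a handful of multi-byte shifts and adds on an 8-bit CPU, and there can be up to 32 passes, so the cost adds up quickly compared to a part with a hardware multiplier.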
The memory footprint of the emulator is quite tiny. The entire emulator state in RAM is 61 bytes in size, which can be reduced by a further 4 bytes by sacrificing some speed (3% or so). The emulator code uses 2 levels of calls and pushes nothing to the stack besides return addresses, which means that on the AVR it only needs 4 bytes of stack. The remaining RAM is free for the ARM code to use as it pleases. The emulator core, on start, resets the stack to the end of RAM, leaves 12 bytes free, and then points the emulated ARM's SP register to that location, thus giving the ARM code a working stack pointer on start.
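The startup stack arithmetic can be sketched like this in C. This is an illustration under assumptions, not the emulator's actual code. I assume the ARM stack top sits just below the reserved bytes, counting from one past the last RAM byte, and that host RAM appears at 0x20000000 as in the memory map. The function name and the parameter names are made up.

```c
#include <stdint.h>

#define EMU_RAM_BASE UINT32_C(0x20000000)  /* RAM window from the memory map */

/* Hypothetical helper: compute the emulated ARM's initial SP.
 * host_ramend: last valid host RAM address (0x025F on an ATTiny85).
 * reserve:     bytes kept at the very top for the AVR's own stack.
 * Assumption: the ARM stack grows down from just below the reserved area. */
static uint32_t initial_arm_sp(uint16_t host_ramend, uint16_t reserve)
{
    uint32_t top = (uint32_t)host_ramend + 1u;   /* one past the end of RAM */
    return EMU_RAM_BASE + (top - reserve);
}
```

For an ATTiny85 (last RAM address 0x025F) with 12 bytes reserved, this gives 0x20000254, which is 4-byte aligned as an ARM stack pointer should be.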
Emulated code has access to all the AVR's peripherals, but that's not all. There is support for hypercalls to the emulator. The Thumb opcode "BKPT" (encoded 0xBExx for any value of xx) causes the emulator to call the function "bkpt(u8 bkptNum)" with the value of xx. The function is free to access the CPU state and modify it at will. For example, one may read registers to gather parameters and write them to produce return values. In my example, I implemented a UART TX function in AVR assembly (uart.S) and use "bkpt" to call it, allowing the code to write to a UART.
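Here is a rough C sketch of the hypercall path. Only the 0xBExx encoding and the bkpt(u8 bkptNum) name come from the description above. Everything else is hypothetical: the cpu_state_t layout, the void return type, the meaning of hypercall 0, and the last_char variable standing in for a real UART transmit.

```c
#include <stdint.h>

typedef uint8_t u8;

/* Hypothetical emulator state: r[0..15] stand in for the ARM registers. */
typedef struct { uint32_t r[16]; } cpu_state_t;
static cpu_state_t cpu;

static u8 last_char;  /* hypothetical stand-in for a UART TX routine */

/* Hook called by the emulator for every BKPT. The hypercall numbering
 * here is invented for the example. */
void bkpt(u8 bkptNum)
{
    switch (bkptNum) {
    case 0:                        /* assumed: "send the char in r0" */
        last_char = (u8)cpu.r[0];  /* read the parameter from a register */
        cpu.r[0] = 0;              /* write a return value back */
        break;
    default:                       /* unknown hypercalls are ignored */
        break;
    }
}

/* Returns 1 if the halfword is a BKPT (0xBExx) and was dispatched. */
static int try_dispatch_bkpt(uint16_t opcode)
{
    if ((opcode & 0xFF00u) != 0xBE00u) {
        return 0;                  /* some other instruction */
    }
    bkpt((u8)(opcode & 0x00FFu));  /* low byte is the hypercall number */
    return 1;
}
```

On the ARM side, under these assumptions, the emulated code would load the character into r0 and execute "bkpt 0".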
Operational parameters? On average the emulated CPU runs at 200KHz on a 16MHz ATTiny. This is pretty good, considering that it's a 32-bit CPU. To compare more fairly: using 32-bit ops, an ATTiny can do about 2.67 million ops/second, so the emulator adds only about a 10x overhead - very nice indeed. Why did I do this then? Well, since it is an emulator, you can store the code it runs elsewhere, like a huge external i2c EEPROM, allowing tiny AVRs to execute huge programs. You can also plumb in a huge external i2c RAM, to give it a lot of RAM. Currently I store the code in the AVR itself as an array, but this need not be so. You can use a linker script to segment a piece of the flash just for ARM code, or, as previously mentioned, use external storage. My little demo app here measures the Vcc using the ADC and shows it to three decimal places, then produces some primes, and then just toggles the output port infinitely.
Give me the code already! OK OK. Code is here: [LINK]. To build just run make. If you want to try it on simulavr, run "make BUILD=sim" and then "simulavr --device atmega16 --file uM0 -W 0x20,-". Sadly you have to use atmega since simulavr does not support attiny85. If you want to use the fast multiply code (if you run on a device that has the "mul" instruction), change the makefile to use "mul_atmega.S" instead of "mul_attiny.S". Shown above is a simple test circuit that works with the given sample code. Code is under GPLv3. Enjoy. Feedback? Email it.