ROMRAM - Dmitry.GR

Using QSPI RAM with RP2040's SSI in read-write mode

The problem
1. Towards a solution
2. Read-only RAM is not much use
Let the nasty hacks begin
1. The horror
2. Emulators all the way down
Polishing it to perfection
Performance
Download

The problem

Can you use 8MB of external RAM with RP2040, memory mapped, like real memory? I call this ROMRAM.

RP2040 is a rather versatile chip. One of its most convenient features is support for flash XIP via SSI. SSI is quite configurable and can support all sorts of flash chips. It is, of course, not entirely bug free (try to configure it for SPI commands and QPI addresses, for example, and see how that goes), but a large memory with a fast cache is super nice. There is only one issue: RP2040's XIP mode only supports read and execute accesses, not writes. This makes sense given its purpose and what it was designed for, but who cares about that? COULD we attach a RAM to it? Well, actually this is not too hard. QSPI SRAM chips exist, made by ISSI, APMEMORY, and (my favourite) VilsionTech. They talk more or less the same protocol, and getting SSI to talk to them is trivial. On its own, though, this is useless: you can indeed manually issue read and write accesses to the RAM, but it is not memory-mapped. Could it be? Sure! Enabling XIP and configuring it properly will work - the RAM will support read and execute, but not write. This is still not all that useful.

Towards a solution

First of all, how would you boot without persistent memory? I solved this by having both a flash and a RAM onboard. RP2040 only has a single nCS pin for SSI and only a single memory mapped address range, so we'll not be able to use them both at once. The idea is to boot from flash, copy flash to the start of RAM, and continue running from RAM. How do we make all of this work? It does not take much: two OR gates and a NOT gate will do. In my design I used a tiny SMD dual-OR gate IC and a tiny SMD NAND gate as an inverter. We'll also need two resistors. The output of RP2040's SSI's nCS is pulled up, and feeds one input of each OR gate. A GPIO pin called RAM/nROM, pulled down by default, is the other part of the equation. It goes to the other input of one OR gate (gate A) and to the inverter. The output of the inverter goes to the other input of the second OR gate (gate B). Gate A's output goes to the flash chip's nCS input, gate B's output goes to the RAM's nCS.
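The gate arrangement described above can be written out as a truth function. This is only a host-side sketch of the logic, not firmware; the names `chip_selects` and `decode_cs` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Outputs of the two OR gates. Both chip selects are active-low. */
struct chip_selects {
    bool flash_ncs; /* gate A output, wired to the flash nCS */
    bool ram_ncs;   /* gate B output, wired to the RAM nCS */
};

/* ssi_ncs:  RP2040 SSI nCS (low while a transfer is active).
 * ram_nrom: the RAM/nROM GPIO (low selects flash, high selects RAM). */
static struct chip_selects decode_cs(bool ssi_ncs, bool ram_nrom)
{
    struct chip_selects cs;
    cs.flash_ncs = ssi_ncs || ram_nrom;   /* gate A */
    cs.ram_ncs   = ssi_ncs || !ram_nrom;  /* gate B, fed through the inverter */
    assert(cs.flash_ncs || cs.ram_ncs);   /* the two chips are never selected together */
    return cs;
}
```

With RAM/nROM low, only the flash sees nCS go low; with it high, only the RAM does. When the SSI is idle (nCS high), both chips stay deselected.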

What does this accomplish? When we boot, the GPIO is floating and the pulldown provides a logic low. This means that RP2040's SSI accesses flash (via gate A), and we can boot. The first stage loader can load a larger second stage loader to internal RAM. That loader can copy the entire application (in my case 2MB) from flash to RAM using almost all the internal memory as temporary space (in my case 256KB). It can toggle the RAM/nROM pin and reconfigure SSI as needed to access flash and RAM. Then, XIP can be enabled, and with proper SSI config, the RAM/nROM can be left in the high state, causing all accesses to go to RAM now.
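The copy loop of the second stage loader might look roughly like the following. This is a host-side model under stated assumptions, not the real loader: plain arrays stand in for the flash and the QSPI RAM, a field stands in for the RAM/nROM pin, and the SSI reconfiguration between the two chips is left out entirely. `struct board` and `copy_image` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-side model of the board: which chip the RAM/nROM pin selects. */
enum target { TARGET_FLASH = 0, TARGET_RAM = 1 };

struct board {
    const uint8_t *flash;  /* stands in for the QSPI flash contents */
    uint8_t *ram;          /* stands in for the QSPI RAM contents */
    enum target selected;  /* models the state of the RAM/nROM pin */
};

/* Copy image_len bytes from flash to RAM through a bounce buffer
 * that lives in internal memory. */
static void copy_image(struct board *b, size_t image_len,
                       uint8_t *bounce, size_t bounce_len)
{
    assert(bounce_len > 0);
    for (size_t off = 0; off < image_len; off += bounce_len) {
        size_t left = image_len - off;
        size_t n = left < bounce_len ? left : bounce_len;

        b->selected = TARGET_FLASH;         /* RAM/nROM low: SSI talks to flash */
        memcpy(bounce, b->flash + off, n);

        b->selected = TARGET_RAM;           /* RAM/nROM high: SSI talks to RAM */
        memcpy(b->ram + off, bounce, n);
    }
    b->selected = TARGET_RAM;               /* stay on RAM before XIP is enabled */
}
```

A larger bounce buffer means fewer pin toggles and SSI reconfigurations, which is why using almost all of the internal memory as temporary space makes sense.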

This will almost work. If you actually try this, you'll find a fun bug. If you attempt to reset the RP2040 using its RUN pin, you'll note that the manual is wrong, and the GPIO module does NOT get reset, the pin does NOT go back to floating, and you are still accessing RAM and not flash. Oopsie... Not sure how this was not noticed. In my case this was not a problem since when I ran out of pins, I moved RAM/nROM to an i2c io expander, and its nRST input does work. If you plan to use this without an io expander, keep this annoyance in mind.

Read-only RAM is not much use

OK, so our RP2040 now has a memory-mapped RAM. This is quite useless since we cannot write to it directly. Oh, sure, we can issue SSI commands, but this is (1) annoying, (2) boring, and (3) will not allow unmodified software that needs a few megabytes of RAM to run. How do we make this better? With nasty hacks, of course! The RP2040 has a few features (and misfeatures) that we can glue together to improve the situation. The XIP cache allows us to flush lines in it, which will be important since the cache has no idea that the backing store is writeable and can change. There is also an MPU which we can [ab]use.

Let the nasty hacks begin

By default, a write anywhere to 0x10xxxxxx (normal cached access to XIP) will be treated as a command to flush a cache line. That means that any write attempt in normal code will be silently ignored. No fun! Let's use the MPU to write-protect the region. Now a write attempt will trigger a HardFault. Ok, that's better! Our HardFault handler can now ... quickly interpret the faulting instruction, emulate the write, flush the cache line, and resume. This sounds easy ... NOT.
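For a feel of what write-protecting the region involves, here is a small sketch that builds an ARMv6-M MPU_RASR value for a read-only, executable region. The field positions come from the ARMv6-M architecture. The helper name `rasr_read_only` is made up, and which region number and size the actual code uses is not stated here, so the sizes in the usage note are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv6-M MPU_RASR field positions. */
#define RASR_ENABLE      (1u << 0)
#define RASR_SIZE_SHIFT  1    /* region covers 2^(SIZE+1) bytes */
#define RASR_AP_SHIFT    24
#define RASR_AP_RO       6u   /* read-only for privileged and unprivileged code */

/* Build the RASR value for a read-only region of 2^log2_size bytes. */
static uint32_t rasr_read_only(unsigned log2_size)
{
    assert(log2_size >= 8 && log2_size <= 32); /* ARMv6-M regions are at least 256 bytes */
    return (RASR_AP_RO << RASR_AP_SHIFT)
         | ((uint32_t)(log2_size - 1) << RASR_SIZE_SHIFT)
         | RASR_ENABLE;                        /* XN (bit 28) stays 0 so code can still execute */
}
```

For example, `rasr_read_only(23)` describes an 8MB region and `rasr_read_only(24)` a 16MB one. On real hardware this value would be written to MPU_RASR after selecting the region and base address through MPU_RNR and MPU_RBAR.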

The horror

Let's consider the concept. Clearly, this HardFault handler cannot itself live in XIP memory, since we do not want the XIP cache attempting a read while we're trying to issue a write. There will also be some other limits. We can only emulate accesses we can understand. What other kinds are there? There are two more sources of writes in the system besides code. One is DMA. The answer here is simple: we're targeting running unmodified code from elsewhere. Such code would not be relying on RP2040's DMA, so no issue here. And if you use DMA, be careful to not attempt to DMA to our ROMRAM (reads are OK). The second source of writes we cannot understand and emulate is the Cortex-M0+ CPU itself. The CPU will push 8 words to the current stack on any interrupt or fault. If the current stack lives in our ROMRAM, these writes will fail (caught by the MPU) and we'll have lost the info we need to resume the current code. The answer is, more or less, the same as before. Most likely "existing code" does not directly manipulate the stack pointer, so this should be avoidable. If you are writing new code and relying on ROMRAM, keep your stack in an internal memory of some sort. Easy.

Emulators all the way down

How easy is it to write a super fast partial ARMv6M emulator that can properly emulate any write instruction, including complex ones like STMIA? It is actually not too hard, especially if you throw some RAM at the problem. The simplest way to dispatch on the instruction type is to use the top 7 bits of it. That implies a table of 128 entries. That is 256 bytes of just jump instructions. This is not too hard to justify, really. So, as we take the HardFault, and after we assume that PC and CPSR.T are set right (checking for this would take more cycles), we can read the faulting instruction. Shift it right, add this to PC, and then come the 128 jump instructions to dispatch based on all the possible 128 cases. Most of them will go to a "some other fault happened" label since they do not decode to a valid instruction that could have caused a write. There are 2 variants of each of STRB, STRH, and STR that we need to handle. Since ARMv6M requires all writes to be properly aligned, we need not worry about any QSPI RAM page-crossing limitations here. We get the value from the proper register (decoding this using a few more jumptables), byteswap it (SPI is BE, CPU is LE), and issue the write directly to the SSI hardware.
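The dispatch on the top 7 bits can be illustrated in C with a switch, which plays the role of the 128-entry jump table. This is a sketch for reading along, not the handler itself (which is hand-written assembly). The encodings are the standard 16-bit Thumb store encodings. All names are invented, and the SP-relative STR case is included only for completeness; whether the real handler treats it is not stated in the text.

```c
#include <assert.h>
#include <stdint.h>

enum store_kind { NOT_A_STORE, ST_STR, ST_STRH, ST_STRB, ST_STR_SP, ST_STMIA };

/* Classify a 16-bit Thumb instruction by its top 7 bits (128 possible values). */
static enum store_kind classify(uint16_t instr)
{
    switch (instr >> 9) {
    case 0x28: return ST_STR;   /* STR  Rt, [Rn, Rm] */
    case 0x29: return ST_STRH;  /* STRH Rt, [Rn, Rm] */
    case 0x2A: return ST_STRB;  /* STRB Rt, [Rn, Rm] */
    case 0x30: case 0x31: case 0x32: case 0x33: return ST_STR;    /* STR  Rt, [Rn, #imm5*4] */
    case 0x38: case 0x39: case 0x3A: case 0x3B: return ST_STRB;   /* STRB Rt, [Rn, #imm5]   */
    case 0x40: case 0x41: case 0x42: case 0x43: return ST_STRH;   /* STRH Rt, [Rn, #imm5*2] */
    case 0x48: case 0x49: case 0x4A: case 0x4B: return ST_STR_SP; /* STR  Rt, [SP, #imm8*4] */
    case 0x60: case 0x61: case 0x62: case 0x63: return ST_STMIA;  /* STMIA Rn!, {list}      */
    default:   return NOT_A_STORE;  /* "some other fault happened" */
    }
}

/* Address of the first byte the store writes.
 * regs[0..15] are r0..r15 as they were when the fault was taken. */
static uint32_t store_address(uint16_t instr, const uint32_t regs[16])
{
    enum store_kind kind = classify(instr);
    assert(kind != NOT_A_STORE);

    if (kind == ST_STR_SP)
        return regs[13] + 4u * (instr & 0xFFu);
    if (kind == ST_STMIA)
        return regs[(instr >> 8) & 7u];

    uint32_t base = regs[(instr >> 3) & 7u];
    if ((instr >> 12) == 0x5u)                 /* register-offset forms */
        return base + regs[(instr >> 6) & 7u];

    uint32_t imm5 = (instr >> 6) & 0x1Fu;      /* immediate forms scale by access size */
    uint32_t scale = kind == ST_STR ? 4u : kind == ST_STRH ? 2u : 1u;
    return base + imm5 * scale;
}

/* Reverse the byte order of a word before it goes out on the wire. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

For instance, `0x6048` is `str r0, [r1, #4]`, so `store_address` returns r1 plus 4, while `0x6808` (`ldr r0, [r1]`) falls through to `NOT_A_STORE`.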

And then there is STMIA... This is a complex beast that can write up to 8 words to RAM at any word-aligned address. There are three ways I could have handled this. The first is to issue each write as a word write to QSPI. This will work for all QSPI RAMs. The second is to issue it as one long write. This is the fastest option, but it will only work on Vilsion RAMs since both ISSI and APMEMORY chips wrap all accesses to a 1KB-address-window. The third option is to detect crossing a 1KB boundary, and switch between the above options. This is the most complex option, and the checking itself may be more cost than it is worth. My code uses option two, since I use chips from VilsionTech. With that, emulating STMIA is just a matter of sending the proper words to write in a row fast enough.
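The boundary check behind the third option could be sketched as below. My code does not do this (it uses option two), so treat it as an illustration only. The 1KB window size is the one mentioned above; the function names are made up, and `nwords` is assumed to be at least 1.

```c
#include <stdbool.h>
#include <stdint.h>

#define WRAP_WINDOW 1024u  /* wrap window named above for ISSI and APMEMORY parts */

/* True if a burst of nwords 32-bit words starting at addr spans two windows. */
static bool burst_crosses_window(uint32_t addr, unsigned nwords)
{
    uint32_t last = addr + 4u * nwords - 1u;  /* last byte written */
    return (addr / WRAP_WINDOW) != (last / WRAP_WINDOW);
}

/* How many of the nwords words fit before the current window ends. */
static unsigned words_before_wrap(uint32_t addr, unsigned nwords)
{
    unsigned room = (WRAP_WINDOW - (addr % WRAP_WINDOW)) / 4u;
    return nwords < room ? nwords : room;
}
```

A handler using option three would send the first `words_before_wrap` words as one burst, then start a new write command at the next window for the rest.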

Polishing it to perfection

There are always hardware bugs

Fast enough? What!? Yes... RP2040's SSI seems to ignore the programmed "NDF" value for write-only transactions. Once it has started a write, it will raise nCS anytime the TX FIFO is empty. This means that you need to fill it just fast enough to keep it busy. This, in turn, means that you should carefully watch your SSI clock divisor... There was also an issue I found with writing to the SSI FIFO too fast (even when it is empty) and a NOP was needed. Do not ask... There were more bugs in the SSI. For example, sometimes, requesting a cache flush would trigger an XIP read. As you can imagine, this completely breaks things if we're in the middle of issuing a write command. The solution there was to delay all cache flushing till after the writing is done. This was only an issue for STMIA, of course, since all other writes are simple already. It might be reasonable to ask whether interrupts could cause any issues to this requirement of precise timing. The answer is no, since this code runs in HardFault context - interrupts will wait for it to be finished. This prioritization is important, since it also allows the interrupt handlers to easily write to ROMRAM.

Memory protection

This brings us to another interesting topic. I mentioned that the MPU is used to catch the writes. But I did not mention disabling the MPU. One might ask how it is that I flush the proper cache lines without re-triggering it (since cache flushes are done via writes). The answer is HFNMIENA. With this bit in the MPU control register left clear, the CPU core ignores the MPU while running in HardFault and NMI contexts. Not having to wrangle the MPU for each write saves valuable cycles in the handler, allowing it to be faster. But what if you do not want the entire ROMRAM region to be writeable? This is supported. Two global variables exist. One (mRomRamStart) records the address of the first writeable ROMRAM address, the other (mRomRamLen) records the writeable area length. They may be modified anytime to adjust the writeable region. In the rePalm project, I use them to split ROMRAM into three regions, for example. Region A is always below mRomRamStart and is always read-only (the copy of the code we're running that the second stage loader copied to RAM). Region B is next and is writeable or not based on an API call to protect it or not (PalmOS is weird). Region C is always writeable. This is pretty easy to do with the provided knobs.
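The bounds check against these two variables could look something like this in C. The variable names are the real ones; their types and the helper `romram_write_allowed` are assumptions made for this sketch, and the actual check lives in the assembly handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two knobs described above. The 32-bit types are an assumption. */
static uint32_t mRomRamStart;  /* first writeable ROMRAM address */
static uint32_t mRomRamLen;    /* length of the writeable area in bytes */

/* True if a write of `size` bytes at `addr` lies entirely in the writeable area. */
static bool romram_write_allowed(uint32_t addr, uint32_t size)
{
    /* Unsigned subtraction wraps to a huge value when addr < mRomRamStart,
     * so one comparison rejects addresses on either side of the area. */
    uint32_t offset = addr - mRomRamStart;
    return offset < mRomRamLen && size <= mRomRamLen - offset;
}
```

Moving the boundary between a protected and an unprotected state is then just a matter of changing `mRomRamStart` and `mRomRamLen` together.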

Multi-CPU

What would multi-CPU support look like for ROMRAM? You'd need to simply add use of one of the hardware mutexes to make sure two cores do not try to write at the same time. I leave this as an exercise to the reader. The rest of it will work. Just point the HardFault vectors from both CPUs to the same ROMRAM HardFault handler and you're done. Cool, right?

Performance

Ok, the million dollar question: how fast is it? Well, reads and execute are native speed, since they work via the usual pathways and cache. Write speeds depend on how writes are done. Each write instruction is emulated, and thus write instructions that write more produce higher throughput. This is good news for things like memcpy, since that kind of code usually uses STMIA. To put some hard numbers on it, I see memcpy to ROMRAM hitting 36Mbit/s at stock clock rates, which is not too terrible. This works well for cases when memory is more read than written (which is common).

We can approximate the actual cost of a write by looking at the instructions of the handler. The actual math differs based on the registers used. Let's check on a simple STR(immediate). The exception entry and exit take 12 and 10 cycles respectively. Exception entry code to handle various entry modes and getting the proper exception frame pointer takes 6 or 7 cycles. Dispatching based on instruction type takes 9 cycles. Getting the address calculated takes around 17 cycles. The math to verify that we're within writeable bounds takes 11 cycles or so. Getting the value to write takes around 10 cycles. Issuing the write command takes 28 cycles. Then we wait for the SSI to finish. At DIV of 4, it will need 256 cycles to finish issuing the write command. But we overlap with the first 6 of those in code, so effectively it takes 250 cycles of waiting for us to continue. Cleanup takes 10 more cycles. So all-in-all, a single word write took us 12+10+6+9+17+11+10+28+250+10 = 363 cycles. Some of this could be cut a little with some creative work (eg: by overlapping more of the SSI write and the data-getting). This optimization is also left as an exercise to the reader.
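The cycle budget above can be turned into a rough throughput figure for code that writes one word per STR. The per-stage numbers are the ones listed above; the clock frequency is a parameter, since the exact stock clock is not given here (125 MHz in the usage note is an assumption). The names are made up for this sketch.

```c
#include <stddef.h>

/* Cycle costs of one emulated STR(immediate), in the order listed above:
 * entry, exit, frame setup, dispatch, address, bounds check,
 * value fetch, command issue, SSI wait, cleanup. */
static const unsigned str_stage_cycles[] = { 12, 10, 6, 9, 17, 11, 10, 28, 250, 10 };

/* Total cycles for one emulated single-word store. */
static unsigned str_total_cycles(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < sizeof str_stage_cycles / sizeof str_stage_cycles[0]; i++)
        total += str_stage_cycles[i];
    return total;
}

/* Store throughput in Mbit/s when every store is a single 32-bit STR. */
static double str_throughput_mbit(double clk_hz)
{
    return clk_hz / str_total_cycles() * 32.0 / 1e6;
}
```

Assuming a 125 MHz system clock, `str_throughput_mbit(125e6)` comes out at roughly 11 Mbit/s. That is well below the memcpy figure, which fits the point that STMIA amortizes the fixed per-fault overhead over several words.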

Download

The code download for the second stage loader and the HardFault handler is [HERE]. License is BSD 2-clause. I am too lazy (and disgusted) to turn this into some sort of an arduino or a micropython plugin, but I am sure someone else will. My provided code will build standalone with no dependency on anything. Enjoy.
