My business card runs Linux, yours can too

Table of Contents

Linux card project cover image
  1. Why?
  2. Parts selection
  3. What to emulate
    1. A MIPS primer
    2. What system?
  4. Let's emulate!
    1. The CPU
    2. The FPU
    3. The MMU
      1. MMU basics
      2. The MIPS MMU
    4. Communication
    5. Hypercalls
  5. Bring on the hardware!
    1. The honeymoon period
    2. How not to design a DMA unit
    3. Clocks again
    4. SD card support
    5. Coolness enhancement
  6. How it works
    1. How a normal DECstation boots
    2. How uMIPS boots
    3. How uMIPS runs
    4. Linux changes
  7. Improving performance
    1. Instruction cache
    2. Improving CPU speed
    3. Improving RAM bandwidth
    4. Dirty hacks specifically for Linux
  8. How to build and use one
    1. Building
    2. Building from source
    3. If you are lazy
    4. Using
  9. In conclusion
    1. Acknowledgements
    2. Downloads
  10. Comments...

Why?

Linux card in action

A long long time ago (in 2012) I ran Linux on an 8-bit AVR. It was kind of a cool record at the time. I do not think anyone has beaten it - nobody's managed to run Linux on a lower-end device than that 8-bit AVR. The main problem was that is was too slow to be practical. The effective speed was 10KHz, the boot time was 6 hours. Cool, but I doubt that any one of those people who built one of those devices based on my design ever waited for the device to boot more than once. It was time to improve it!

So what could I improve? A number of things. First, I wanted the new design to be speedy enough to boot in a few minutes and reply to commands in seconds. This would make using the device practical and not a test of patience. Second, I wanted it to be easy to assemble for anyone. This meant no components with tight spacing, no components with too many pins, and no components with contacts hidden underneath them. A part of this wish was also that someone could actually assemble one, meaning that I had to select components that are actually buyable in the middle of the current ongoing shortage of, well, everything. Additionally, I wanted the device to be easy to interface with. The original project required a USB-to-serial adapter. This would not do. And, finally, I wanted the whole thing to be cheap and compact enough to serve as my business card.

Parts selection

Linux card schematics

Some things were pretty easy to decide on. For storage, for example, microSD is perfect - easy to interface with, widely available, cheap. I picked a simple microSD slot that is easy to solder and easy to buy: Amphenol 1140084168.

Some choices were a litle harder, but not too much so. For example, I was surely not going to use DRAM again. It requires too many pins, necessitating more soldering than I would consider acceptable, given that I wanted this device to be easy to assemble. SRAM in megabyte sizes does not really exist. But there is a cool thing called PSRAM. It is basically DRAM, but in easy mode. It itself takes care of all the refreshing and externally acts just like SRAM. Ok, cool, but still that would usually be a lot of pins. Right? Enter "AP Memory" and "ISSI". They make QSPI PSRAM chips in nice SOIC-8 packages. AP Memory has models with 2MB and 8MB of RAM per chip, ISSI has them in 1MB, 2MB, and 4MB sizes. I decided to use these. They are available and my code supports them all!

There were some miscellaneous choices, like which regulator to use. I chose MIC5317-3.3YM5TR due to having worked with it before and it being available in my "random chips" box. It is also easily available to buy.

The USB connector was also a fun choice. I settled on: none. With the proper PCB thickness, one can lay out the board edge to fit into the end of a USB-C cable. I've seen this done before for micro-USB and figured it could be done for USB-C as well. At the end, though, I did not even need to do it, since someone else already saved me the 30 minutes it would have taken. I just had to remember that the board thickness needs to be 0.8mm for this to work.

The last choice was the hardest - which microcontroller to use. The criteria were: built-in USB, no more than 32 pins with at least 0.65mm spacing, no pin-less packages, actually available to buy, QSPI support, as fast as possible. I did not get my last two wishes. After much searching and filtering for "in stock", I was forced to settle for an ATSAMD21 series chip, specifically the ATSAMDA1E16. It is not fast (specced to 48MHz, I clock it at 90MHz), it has many bugs (especially in its DMA engine), but it can be bought, it is easy to solder, and it'll have to do...

What to emulate

Linux card boot log

I could have just taken my old ARM emulator (uARM) and used that. But what's the fun there? I decided to pick a new target. The ideal emulation target will: (1) be a RISC chip so that I have to spend fewer cycles on decoding instructions, (2) have no condition codes (like MIPS) or only set them on demand (like ARM), so that I am not wasting time calculating them every virtual cycle, (3) be 32-bit since 16-bit machines are all funky and 64 bit is a pain to emulate, (4) be known, and (5) have a workable set of GNU tools and Linux userspaces available. This set of requirements actually only leaves a few candidates: PowerPC, ARM, MIPS. I've done ARM, and I had no desire to mess with an endian-switchable CPU, so MIPS it was! This gives rise to the internal name of the project: uMIPS.

A MIPS primer

MIPS is old - one of the original RISC designs. If you are a RISC-V fanboy(/girl/being), MIPS will look familiar - it is where 99.9994% of the initial RISC-V spec was copied from. The original MIPS was a 32-bit design, optimized for ease-of-designing-it. It has (and does not hide) a delay slot, has a lot of registers, including a hard-wired zero register, and does not use condition codes. The original design was R2000, back in 1986, followed soon by the improved R3000 in 1988. These were the last chips implementing the MIPS-I instruction set. MIPS-II was short lived and only included the R6000, which barely saw the light of day. The real successors were the MIPS-III R4000-series chips, released in 1991. These were 64-bit already in 1991! Clearly, the easiest target would be the R2000/R3000 chips with their simple MIPS-I instruction set.

MIPS-I is a rather simple instruction set. So much so that a complete emulator of just the instructions can be done in under 1000 lines of C code without any dirty tricks. The floating point unit is optional, so it can be skipped (for now). The MMU is weird. It is just a TLB that the software must fill manually. This may seem like a rather unuusal choice, but in reality it is a clever one, if you're in 1986 and tring to minimize the number of transistors in your chip. Why have a hardware pagetable walker, when you can make the software do it? You may ask how it handles the situation where the code that would do the walking is itself not mapped? Well, a part of physical memory is always hard-mapped at a certain address, and all exception handlers live there. Even if this were not the case, since the software manages the TLB, it would not be hard to reserve an entry for this purpose. The hardware even has support for some "wired" entries that are meant to be permanent. More on all of this later.

What system?

MIPS R2000/R3000 is a processor. A processor does not a complete system make. What system to emulate? I searched around for a cool system and settled on DECstation2100 (or its big brother - DECstation3100). Why bother? It seemed like a simple system that Linux does support. I was not, however, going to emulate the whole damn thing. Why? I initially had no plans to emulate the LANCE network adapter or the SII SCSI adapter. The last part might surprise you, since we will need a disk to use as our root fs. More on this later.

Let's emulate!

The CPU

MIPS is a rather old instruction set, which shows in a few places. The main one is that it attempts to prevent signed overflow. The normal instructions used for addition and subtraction will cause an actual exception if they cause an overflow. This does not map to how CPUs are used today, so nobody cares, but I still had to emulate it. There are "unsigned" versions of the instructions for addition and subtraction that do not do this, which is what all modern compilers will emit on MIPS.

I wrote an emulator for the CPU in C first, to allow easy testing on my desktop, while the PCBs were being manufactured. It was not fast, nor meant to be, but it did allow for testing. You can see this emulator in cpu.c. Along the way here, I implemented some features of the R4000 CPU optionally. It turned out that to boot Linux compiled with modern compilers, this is necessary, as the compilers assume these instructions exist. Technically this is a bug. Realistically, I am likely the only person to ever notice. So, which features did I need to add? Likely branches (BxxL instructions), conditional traps (Tcc/TccI instructions), and atomics (LL/CC instructions).

Of course, C is not the language one uses when one wants to go fast. I wrote an emulator in assembly too, targetting ARMv6-M (for the Cortex-M0 MCU I chose). I later added a sprinking of enhancements for ARMv7-M (in case I ever upgrade the project to a fancier CPU). This was tested on a Cortex-M7 and worked well too. The assembly emulator core is contained in cpuAsm.S and the ARMv6-M specific parts are in cpuM0.inc

I mentioned delay slots earlier. What is a delay slot? Well, back in the day it was considered cool to expose your CPU's pipeline to the world. Just kiding, it was just a way to save some more transistors. Basically, the instruction after a jump will be executed even if the jump happens. This is called the delay slot. A naive way to avoid dealing with this is to place a NOP after each jump instruction. But with a good compiler, the delay slot can be put to a good use in almost all cases. Obviously one cannot place a jump instruction in the delay slot, since the CPU is already jumping somewhere. Doing this is illegal and undefined. An issue arises, however, if the instruction in the delay slot causes an exception of any sort. The CPU will record that the instruction was in the delay slot, and point the exception handler to the jump whose delay slot we're in. There is no way to return to this "in delay slot" state, so the exception handler is expected to take steps to somehow execute the delay-slot instruction and then complete the jump.

The FPU

The DECstation came with an FPU, so that floating point operations would be fast. Back then this was a separate chip, which was optional in a MIPS R2000/R3000 system. Linux, in fact, will more-or-less corectly emulate the FPU if it is not present, but this is slow. I used this mode initially, and even fixed a few bugs in Linux's emulation, but, in the end, I implemented an FPU emulator. This was necessary since it seems like a lot of MIPS binaries I could find all assume the FPU is available and use it freely. I never reimplemented the FPU emulator in assembly, instead calling out to the C FPU emulator when needed. I figure that squeezing a few cycles out of each instruction is meaningless when the actual FPU operation takes hundreds. The code for this is in fpu.c.

The MMU

MMU basics

(this is a very oversimplified summary, feel free to skip if you know this, and do not complain to me that it is not perfectly accurate!)

Most CPUs access memory using virtual addresses (VA). The hardware works in terms of physical addresses (PA). Ability to map one to the other is the underpinning of memory safety in modern operating systems. The purpose of an MMU (Memory Management Unit) is to translate virtual addresses to physical addresses, to allow for this mapping. Normally this is done using a tree-like structure in RAM, called a pagetable. Most CPUs have a component whose job it is to walk that structure to resolve what physical address a given virtual address maps to. This component is a pagetable walker. In most cases the pagetable has 3 or 4 levels, which means that resolving a VA to a PA requires reading 3 or 4 words from main memory. Clearly you do not want to do 3 useless memory accesses for every useful one. So usually another component is included in an MMU - a TLB (Translation Lookaside Buffer). Basically you can think of a TLB as a cache of some of the current pagetable's contents. The idea is that before you go off doing those 3-4 memory reads into the pagetables, you can check and see if the TLB has a matching entry. If so, you can skip the pagetable walk.

Clearly, like any cache, the TLB needs to stay in sync with the things it caches (the current pagetables). So, if the OS changes the pagetables, it needs to flush the TLB, since it might have stale entries. Usually, TLBs expose very little interface to the CPU, so there isn't a way to go read all the entries and remove only the newly-invalid ones. Additionally, this would be slow, so this is not usually done. However, invalidating the entire TLB also has costs - it needs to be re-filled, at the cost of 3-4 memory accesses per entry. This could hurt performance. A solution commonly used is called an ASID.

What are the four main cases when pagetables might be modified? (1) Adding a new mapping over a virtual address that previously was not mapped to anything, (2) changing permissions on on existing mapping, (3) removing a mapping, and (4) entirely changing the memory map (for example to switch to a completely different process). In case 1, no TLB flush is necessary, since no stale TLB entry can exist. Cases 2 and 3 do indeed require flushing the TLB, but they aren't that common. Case 4 is quite common, though. It is done at every context switch. One might point out that since we're changing the entire memory map, the entire TLB would be invalid, and thus flushing it isn't a problem. This is wrong. Besides mapping userspace things, the MMU also maps various kernel structures, and there is no point penalizing them.

If we could somehow tag which entries in the TLB go with which process, and temporarily disable them when another process runs, we could avoid a lot of context-swich flushing and the performance costs imposed by it. It would also be cool if we could tag entires that belong to the kernel and are valid in every process. Well, this exact technology exists in many MMUs. The idea here is that each pagetable entry will have a bit marking it as "global" (valid in all memory maps) or not. There should also be a register in the CPU setting the current ASID (Address Space ID). When a TLB entry is populated from the pagetables, the current ASID is recorded in it. When a lookup in the TLB is done, only entries matching the current ASID or those marked "global" will match. Cool!

The MIPS MMU

The idea at the time was to save transistors. Which of the above could be cut? Well, cutting out the TLB guarantees terrible performance in all cases. But that pagetable walker, do we really need it? What if we make the sotware do it? We can add a little bit of assistance, like ability to manage the TLB efficiently, but skip on the pagetable walker hardware. This is what MIPS did. Here is the MIPS virtual address space:

Addresses Name Mapping 0x00000000..0x7fffffff kuseg mapped via MMU 0x80000000..0x9fffffff kseg0 mapped to physical 0x00000000..0x1fffffff, cached if there is a cache, only accessible in priviledged mode 0xa0000000..0xbfffffff kseg1 mapped to physical 0x00000000..0x1fffffff, not cached, only accessible in priviledged mode 0xc0000000..0xffffffff kseg2 mapped via MMU, only accessible in priviledged mode

So, as you can see, some VAs do not map via the MMU at all. This means that code living there is able to run no matter the state of the MMU. Linux, predictably, puts the kernel in kseg0. The kernel does, however, need to be able to dynamically map things in as well. kseg2 is one gigabyte of address space that is mappable via the MMU that the kernel can use. Memory-mapped devices will usually be accessed via kseg1. The 2gigabytes at the bottom of the address range(kuseg) are for userspace tasks.

What entry in a TLB should one replace when one needs to insert a new entry? An obvious answer might be "the one least recently used", but that would require tracking use, which costs transistors too. A simplification is "the one least recently added". This is easy, but it hides a fatal flaw. Imagine your TLB has N entries, and your workload sequentially uses N + 1 addresses, such that each would need a TLB entry. Now you'll always be replacing the entry you're about to need, guaranteeing that you NEVER hit the TLB and do a lot of pointless pagetable walks. How do we avoid this? The simplest method is replace a random entry. Sure, it might be the entry you're about to need, but for an N-entry TLB the chances are 1/N.

Generating random numbers is slow in software, so MIPS R2000/R3000 provide some help. The CPU has a register called, literally RANDOM which is supposed to be constantly incrementing, every cycle. Since the "when" of "when will you next need a new TLB entry" is not predictable, this is as good as random, and requires very few transistors. The idea is that whenever you need to replace a TLB entry, you use a special instruction TLBWR to write to a random entry. I did not tell you about ASIDs by accident either. The MIPS R3000 MMU implements a 6-bit ASID.

Communication

The DECstation had a few ways to communicate with the outside world. It had a built-in network card, which I do not emulate. It was optional, and I haven't found a use for it (yet?). Maybe I will later - it does not look complex. It also had a SCSI controller which one could attach hard disks and other SCSI peripherals to. Emulating this would be a fun challenge, and I'll probably get to it later, but I did not do it now - it was not necessary - I wrote a paravirtualized disk driver for Linux using hypercalls, more on this later. There was also an optional framebuffer card one could install that added support for a monochrome or a color display. Emulating these would also not be too hard, but my business card lacks a display, so I did not do it either - plus I am not even sure that Linux can make a use of it.

The last method of communications that the DECstation had was DC7085 - a serial port controller that is basically a clone of a PDP11-era DZ-11. It supports four serial ports at a blistering 9,600bps speed (or any integer division thereof). Each serial port was allocated a purpose, and they were wired to different connectors indicating this purpose. #0 was for the keyboard, #1 for the mouse, #2 for modem, and #3 for printer. To the machine they are all the same, this was just the purpose DEC assigned to them. The stock PROM would use #3 as serial console instead if it did not detect a keyboard at #0, thus it is customary to use #3 as serial console for Linux on the DECstation. My PROM surrogate does not bother looking for or supporting external keyboard, and just defaults to serial console on #3. That being said, since it is cool to allow multiple login sessions, I also export #0 as a second virtual serial port, so that you may login from two serial consoles at once, and do two things at once. How cool is that?

So, how do I export these serial ports? When you connect the card to a computer, it'll show up as a USB composite device comprised of two CDC-ACM virtual serial ports. One of them is port #3, another is port #0 on the virtual DZ-11. How will you know which is which? #3 has the boot console printing and will have the initial sh prompt. If you do not see this, try the other one, computers do not always number them in the order I export them.

Hypercalls

In the real world the PROM had to probe the real hardware to detect what was present where. As my PROM is running in an emulator, there is no need for such mess. We can simply request things from the emulator in an agreed-upon way. That way is a hypercall - a special invalid instruction that, if encounted in supervisor mode, the emulator will treat as a request for some kind of service. The instruction I chose is 0x4f646776, which is in the COP3 (coprocessor 3) decode space that was not allocated to any real purpose in these chips. The calling convention is close to the normal C calling convention on MIPS: parameters are passed in $a0, $a1, $a2, and $a3, return values are in $v0 and $v1. The $at register gets the "hypercall number" - the specific service we're requesting.

A few hypercalls are implemented. #0 is used to get the memory map. The parameter is word index of the memory map to read. Word 0 is "how many bits the memory map bitmap contains", word 1 is "how many bytes of RAM each bit represents", words 2 and on are the bits of the map, up to the total specified in word 0. This can be used to build a memory map that the PROM can furnish to the running OS and allows me to have discontinuous RAM. Linux supports this and I tried it, but did not end up needing it. It is here in case I change my mind and need it again.

Hypercall #1 outputs a single byte to the debug console (which is the same as DZ-11 port 3). This is used by the PROM and mbrboot to output debug strings without needing to have a complete DZ-11 driver in there. Hypercall #5 will terminate emulation. This can be used on the PC version of the emulator to quit peacefully.

Hypercalls #2, #3, and #4 are used for SD card access. #2 will return card size in sectors, #3 will request a read of a given sector to a given physical RAM address and reply with a nonzero value if that worked. #4 will do the same for a card write.

Bring on the hardware!

The honeymoon period

Linux card board layout

The first revision of this board came up well initially, after I sorted out the mess that is ATSAMD21's clocking system. I appreciate flexibility as much as the next guy, but this thing is TOO flexible. It took a lot longer than I'd care to admit to get this thing running at a sane speed and to enable some peripherals. The docs were too sparse to be of much use, too. Atmel, what happened to you? You used to have the best docs!

The first revision of the board had two memory chips, each on their own SPI bus, an SD card on an SPI bus, and USB with the proper resistors. The USB was perfect. Unlike everyone and their grandmother (STMicro, I am glaring at you), Atmel did not license annoying Synopsis USB IP. They made their own. It is easy to use, elegant, and works well. Seriously, it just worked. In two days I got the hardware to work and wrote a USB device stack. I tip my hat to the team that worked on the USB controller. That being said, I have concerns. My main issue: USB descriptors aren't small. They are constant. I'd prefer to keep them in flash. I'd prefer to, but cannot. The USB unit uses a built-in DMA unit to read the data to send. This DMA unit CAN access flash, but if you have any flash wait states enabled, it sends garbage. I suspect that Atmel only tested it for reading from RAM, forgot that some memories have wait states, and did not account for that. Keeping all my descriptors in RAM is a colossal waste of RAM, which there is only 8KB of. Remember that tiped hat? I rescind it, Atmel.

Using the SPI units directly worked well enough, until I tried to speed them up. Past about 18MHz The received data was garbled (missing a bit or two, all the following bits shifted). No amount of searching found an issue in my code, and all sample code did more or less the same things. My bus analyzer showed no issues. What gives? THIS GIVES (archived)! I was beyond furious when I found this forum post. Here I was, trying to build a fast device, and my SPI bus was going to be limited to the speed of a tired snail calmly strolling through peanut butter! With some more testing I found that the SPI units will work fine to about 16MHz, which I'll have to live with.

The SPI units have no FIFOs, so code must manually feed them one byte at a time and read one byte at a time. This means that there is space between bytes on the bus as code wrangles bytes in and out of registers and memory. This is a waste of potential speed. The solution is DMA. Luckily this chip has DMA. Unluckily, it is fucked beyond belief, to a point where I am beginning to suspect that it was designed by a sleep-deprived stark raving lunatic.

How not to design a DMA unit

A normal garden-variety DMA unit has some minimal global configuration, and a few channels, each independent from the rest. Each channel will usually have a source address, a destination address, a length, and some configuration, to store things like transfer chunk size, trigger, interrupt enable bits, etc. Thus it is common in ARM MCUs to have each channel have precisely these 4 32-bit configuration registeres: SRC, DST, LEN, CFG. This is 16 bytes of SRAM per channel. ATSAMD21 has 12 DMA channels, so that would be 192 bytes of config data for the DMA unit as a whole. Not that much. Well, Atmel was having none of this! Instead, the unit itself only has a POINTER to where in the user RAM all this config data lives. For every transfer, the DMA unit will load its internal state for the active channel from this structure in RAM, and then operate on the channel. If another channel's data was already loaded, it will be written out to RAM first. Depending on your experience level, you may already be on your third or fourth "oh, hell fucking no" as you read this...

Why is this bad? Let's imagine two SPI units being fed by DMA. Each one will have two DMA channels, one for receive, one for transmit. Four channels are active in total. Now what happens as both the SPI units are enabled? Two DMA channels (the transmit ones) will go active and attempt to send a byte. One will go first, then the second. This will generate 14(!!) bus transactions to the RAM! Four to read config data for one channel, one to read the byte to send, four to write back this config data, four to read the config data for the next channel, and one more to read the byte to send. So in order to send 2 bytes, the DMA unit did 14 RAM accesses. Not great. But wait...there's more. Let's take a look what happens next, as the SPI units finish sending this byte and clocking in the received byte, but are also ready for the next byte to send! At this point in time, logically only four bytes need to be moved (two from the units into the receive buffers, two from the transmit buffers into the units). Let's see how this plays out. Remember the DMA unit's internal config data is currently loaded to the second transmit channel's. First, it'll have to do 4 writes to write that data out, then 4 reads to load the first receive channel's structures, one write to memory to write the received byte to RAM, 4 writes to write out this channel's structures out, 4 reads to load structures for receive channel number 2, one write of the received byte to RAM, 4 bytes to write out the config structure for this channel back out to RAM, and then the 14 we already discussed to send the next two bytes. That adds up to 36 RAM accesses to simply read two bytes and write two bytes. All this pain, simply to save the transistors on the 192 bytes of SRAM it would have taken for the DMA unit to store all the config data internally.

So, why is this bad? Let's say our MCU is running at its designed speed of 48MHz, its SPI units running at their designed max speed of 12MHz. At the point the second bytes need to be sent and first received bytes need to be received, we'll need to perform 36 accesses to RAM, but also 4 accesses to the SPI unit. The SPI unit is on an APB bus, which means that any access to it takes at least 4 cycles. This means that in between each sent and received byte we'll need 36 + 4 * 4 = 52 cycles. If the SPI unit runs at 1/4 the CPU speed, then it will send/receive a byte every 8 * 4 = 32 cycles. So every 32 cycles we'll need to do 52 cycles' worth of work. When they do not get enough cycles, the DMA channels give up and stop working... Oops...

So, what can be done? I worked out a hybrid method where I send data using CPU writes and receive using DMA. This worked for two channels, but would not work for more. Once I got rev2 boards that had 4 RAM chips, even this failed, as just the 4 receive DMA units starved each other of bandwidth and got cancelled. Why was Atmel so damn stingy with internal SRAM? We'll probably never know. But they could have solved this exact issue simpler than with 192 bytes of SRAM in the DMA unit. Just adding a 4-byte FIFOs into the SPI units would do as well, then each DMA transaction could transfer more than a single byte, alleviating this traffic jam. Sadly, apparently nobody at Atmel has even tried to actually use their chip for anything. Atmel, what happened to you?

Clocks again

My clocking woes were not over yet. This chip has a number of internal oscillators, one of which is supposed to be a rather precise 32KHz oscillator called OSC32K. I wanted to use that as a source clock for a timer to implement my virtual real time clock. Well, despite much pain and many tears, the damn clock would not start... ever. The code should be simple: SYSCTRL->OSC32K.bit.EN32K = 1; SYSCTRL->OSC32K.bit.ENABLE = 1; while (!SYSCTRL->PCLKSR.bit.OSC32KRDY); Yeah... that did not happen. At the end, I decided that I can use a less-precise OSC32KULP to clock my timer. That one did start and I was able to use it. By this point in the project I was worn out, desensitized to this chip's many faults, and completely out of WTFs, so I resigned myself to a slightly imprecise real-time clock and trudged on.

SD card support

Not really much to say about SD card support. Been there, done that, got the t-shirt. My initial code for the prototype used multi-block reads and writes for better card access speed, but in the final prototype I was forced to abandon it since one of the RAM chips shares the SPI bus with the SD card, so leaving the card selected was not an option. This was not that big a deal since SD access is rarely, if ever, a bottleneck here. Any card up to 2TB is supported.

In the final revision of the board I wired up card detect pin to the MCU. It is not yet used, but I might find a use for it. I also added a card "activity" LED which lights up when card is accessed. It is simply a LED between the card's chip select line and Vcc. Whenever the card is selected, it is on. This LED also surves a second purpose. If at boot time the SD card or SPI SRAMs fail to initialize, it'll blink out an error code to help identify the problem.

Coolness enhancement

Now that the prototype worked and I was doing the layout for the final version, I decided to do some things to make it look cool. I buried all the traces in layers 2 and 3, leaving layers 1 and 4 uninterrupted copper. It loooks super cool! Of course the top layer copper is interrupted for the actual SMT pads, but other than that, it is all perfectly smooth and looks amazing!

How it works

How a normal DECstation boots

Normally there is a built in 256KB ROM (called PROM by DEC) at physical address 0x1fc00000 that contains enough code to show messages onscreen and accept keyboard input, talk to SCSI devices, load files from disk to RAM and jump to them. This PROM also provides a lot of services to the loaded operating system via an array of callbacks. This includes things like console logging, EEPROM-backed environment variables, memory mapping info, etc. This is rather similar to UEFI. Normally this PROM would read the environment variables from EEPROM that would tell it which device to boot, and then load a kernel and boot from that device if all goes well. This emulator does not boot this way

How uMIPS boots

I had no desire to include a large ROM in the emulator, as the flash space in the microcontroller is limited. I also do not have a graphical console or a keyboard per se. That being said, I had to implement a sizeable subset of the PROM somehow, since MIPS Linux uses it. What to do? I decided to come up with my own boot process, which can still work just as well. There is indeed a ROM at 0x1fc00000. This is necessary for rebooting to work from Linux. That rom is tiny - 32 bytes. Its source code is found in the "romboot" directory. It merely loads the first sector of the SD card to the start of RAM at 0x80000000 and jumps to it. The first sector of the SD card contains a standard MBR partition table and up to 446 bytes of code. The code that lives here can found in the "mbrboot" directory. It is also rather simple. It looks through the partition table for a partition with type byte of 0xBB. If not found, an error is shown. Else, the partition in its entirety is read into RAM at 0x80001000, and then jumped to. This partition can be arbitrarily large, and this is where my implementation of the "PROM" lives. The actual size limit on it is placed by the fact that MIPS Linux expects to be loaded at 0x80040000. This is no accident - the first 256K of RAM is reserved for the PROM to use as long as the operating system expects to use PROM's services. Thus the limit on the loader's size is 252K.

My PROM implementation's code can be found in the "loader" directory. It will search the SD card for a partition marked as active, attempt to mount it as FAT12/16/32, and look for a file called "VMLINUX" in the root directory. If found, it will be parsed as an ELF file, properly loaded, and run. Else an error will be shown. As this code has no serious size limits, it implements a proper ability to log to console, printf, and all sort of such creature comforts. As far as PROM services go, it provides console logging, memory mapping info, and reading environment variables, at least enough to make Linux happy. I have not tried to boot other operating systems on uMIPS (yet?).

The kernel commandline I pass is rather simple: earlyprintk=prom0 console=ttyS3 root=/dev/pvd3 rootfstype=ext4 rw init=/bin/sh. The first parameter provides for early boot logging via the PROM console, which is useful to see. After the kernel is up, it'll use the third serial port for console. Originally for the DECstation that was the printer serial port, but Linux users on DECstation use that for serial console due to that being the easiest port to convert into a simple serial port. The rest just tells the kernel how to boot. I prefer to boot into sh, and then issue exec init myself, thus the init=/bin/sh

How uMIPS runs

After all the optimizations (which I'll detail in a bit) the effective speed of my virtual MIPS R2000/R3000 on this infernal ATSAMD21 chip is around 900KHz. The CPU spends around 8% of its time handling timer interrupts, and thus around 0.83MIPS of CPU cycles are left for useful work. With this, the kernel takes around 2 minutes to boot and run sh. Executing busybox's init and getting to the login prompt takes another minute. Overall not too bad. Commands reply instantly, or in a few seconds. It takes gcc around 2 minutes to compile a hello world C program, and I estimate that in a few days' time, one could rebuild the kernel on the device itself, copy it to /boot, and reboot into it. Yes, I do intend to try this and time it.

The emulated real time clock is actually real time, plus or minus the inaccuracies of ATSAMD21's ultra-low-power 32KHz timer. It is ok enough that you will not notice. Try the uptime command.

There is just one thing I did not yet address concerning running Linux on uMIPS. The storage. I said that it is an SD card, but surely DECstation had no SD card slot. Now, I could emulate the proper SCSI card in the DECstation, and fake a SCSI disk, but I am much too lazy. I might have to do this if I want to boot another OS, but Linux is open source, why go through unnecessary trouble? I instead created my own very simple paravirtualized disk driver which uses a hypercall to talk directly to the emulator and request sectors to be read or written directly into the virtual RAM. To Linux, this looks just like DMA, except instant. The whole implementation of the driver is under 200 lines of code and can be seen in pvd.patch

Linux changes

I made some changes to the kernel to make life easier. They are provided as patches against the 4.4.292 kernel, and as is a working kernel image. Why that version? Because when I started that project, it was an LTS version of the kernel, and since RAM is short, I wanted the smallest possible kernel, so this was preferable to a later version. The config I am using is available in kernel_4.4.292.config

I did alot of work making the kernel as small as possible. Since Linux does not support paging out pieces of the kernel, every byte of kernel code is one byte fewer available to use for user space. I ruthlessly removed options that were not needed. In the end I got the kernel down to just under 4MB, which is pretty damn good, considering that MIPS instructions are not very dense.

As part of this work, I made a few code patches. For various reasons (cough..delay slots..cough) the kernel can find itself needing to interpret userspace code, or parse userspace instructions. No matter what kernel configs I gave, the code to handle microMIPS (a future MIPS expansion not known in the days of R2000/R3000) was present. It was wasting space and time trying to handle things that would never happen. The patch useless_exc_code.patch removes this code if the target CPU does not support microMIPS

Before I implemented my FPU emulator, I was using the kernel's FPU emulation code that traps and executes FPU instructions. It had a bug. If compiled for a 32-bit MIPS processor it did not properly emulate some FPU instructions that operate on the double type. I believe this is wrong. It was causing crashes in code compiled for R3000. The patch fpu.patch modifies the kernel's MIPS FPU emulator by adding a config option to enable the full FPU emulation even on MIPS-I chips.

Due to the differences between the R2000/R3000 and the R4000 the kernel needs to know at build time which CPU it is being built for. If you attempt to run the wrong kind of kernel on the wrong kind of CPU, it only gets far enough to panic about it. Fine, OK, but then why does this flag not affect a lot of TLB-handling code. Both kinds are always compiled in, despite us knowing at build time with 100% certainty that at least half of it will not ever be of any use? The patch tlbex_shrinkify.patch wraps the useless code in checks for the compile-time-selected CPU type and thus removes some kernel code, saving valuable bytes.

As uMIPS runs with a real real-time clock, I did not want Linux to spent too much time handling timer interrupts. Normally, a 128Hz timer is used on DECstations by Linux. I added options for 64Hz, 32Hz, and 16Hz timer ticks as well. This reduces effective timer resolution, but effectively unloads the virtual CPU from having to spend most of its time handling timer interrupts. The patch clocksrc.patch does this, and the one called kill_clocksrc_warning.patch silences a pointless warning about timer resolution.

Improving performance

Instruction cache

Linux card board layout

One thing the processor will surely do every cycle is fetch an instruction. This means that every cycle begins with a memory access. For us that is a painful subject thanks to Atmel's errata-ridden SPI unit. And not just that, memory translation also needs to happen, and that also takes time. A good way to avoid both of these problems is a VIVT instruction cache. It'll read instructions 32 bytes at a time, and allow us to hopefully often not need to translate addresses or reach for main memory. I allocated 2KB of RAM to this cache. It is 32 sets of 2 ways of 32 byte lines. Whenever memory mappings change, it needs to be invalidated. I do this automatically and thus the running code on the virtual MIPS CPU does not need to know about it. The measured hit rate while booting Linux is around 95%, which is pretty nice for such a small cache. The geometry was determined experimentally by profiling how long a boot takes with various cache geometries. This one was found to be the best.

Improving CPU speed

ATSAMD21 series is specified to run at 48MHz. In my testing they run perfectly well up to 96MHz, with some specific chips able to hit 110MHz. I found no chip unstable at 96MHz, so I decided to just run at 90MHz, for some safety margin. This immediately got me a pretty serious performance uplift. No, it is not really 100%, since (1) SPI RAM is still limited by the SPI speed limit, and (2) flash memory has wait states which had to increase for the larger speed. But this did give me an honest 65% improvement. Still a good start. Now RAM SPI runs at CPU / 6 = 15MHz.

Improving RAM bandwidth

Since I could not make the RAM SPI units go faster due to Atmel's incompetence, I decided to go wider! I can drive four units at once. Given, there is overhead to each read and write command, but still this is faster than one or two. My code initially supported one, two, or four RAM chips, but for simplification I dropped that support and now only support four-channel RAM. Quite the statement eh? This microcontroller has four-channel RAM! The emulator accesses RAM in increments of 32 bytes. The RAM read/write commands themselves are 4 bytes each. This means that for a single-RAM chip situation, reading 32 bytes takes (4 + 32) * 8 = 288 SPI bits. In dual-channel configuration it'll take (4 + 16) * 8 = 160 SPI bits, since the command is still 4 bytes long, but we only read 16 bytes from each RAM , for a total of 32. For quad-channel RAM, we thus have (4 + 8) * 8 = 96 SPI bits to read 32 bytes. This is a 66% improvement from the single-channel case! In reality the improvement is less, since quad-channel mode cannot use DMA at all, so it is a bit slower. Real-life measurement shows that quad-channel mode is a 50% improvement over the single-channel case. But still, given this damn chip, any improvement is an improvement I'll take.

But, why are all the RAM acceses 32 bytes in size? Well, as you see RAM accesses are slow. A typical 32-byte access takes 140-ish SPI cycles, which is around 12 microseconds. If every access took that long, my emulated CPU would be limited to no more than 85,000 memory accesses per second. That is too slow to be practical. Something had to be done. I decided on a cache. Sadly, my microcontroller has a very limited amount of RAM, so the cache had to be small. I evaluated various cache geometries, and found that a 20-set 2-way cache with 32-byte lines produced the best performance uplift for the emulator. It gets a 91% hit rate while bootling the kernel, which is a pretty good payoff for 1.25KB of RAM. With a hit taking around half a microsecond and a miss taking around 12 microseconds, adding this cache improved the average memory access by 87%! Yes, this is effectively an L2 cache. Now, how many emulators do you know that have an L2 cache to paper over the terrible performance of their chosen host hardware, eh? The cache allocates on reads and writes, except for reads and writes of precisely 32 bytes in size. Those are passed through directly because they are either SD card access DMA or icache fetches that do not need to also be cached in this cache.

After some more profiling, I rewrote the "hot" part of the memory access code in assembly for some more speed gain. GCC may have come a long way since a decade ago, but it still does not hold a candle up to hand-written assembly. I removed support I had for one and two-channel RAM to simplify the hot path as well. So now you need to populate all four RAM slots for the card to boot. If you populate different RAM sizes, the smallest one will dictate the final usable RAM size. The usable RAM size will always be four times the size of the smallest RAM chip. This isn't a big deal, the DECstation came with 4MB of RAM, and could be outfitted with a maximum of 24MB. This card can be outfitted with 32MB, so you'll be living like a king! That being said, due to the size of the Linux kernel, you're not going to get a successfull boot unless you have at least 6MB of RAM, and uMIPS will refuse to boot if that is the case (eg: if you populate 4x 1MB chip).

Dirty hacks specifically for Linux

Remember how on MIPS the operating system must do its own pagetable walking and filling of the TLB? As you can imagine this happens often. Very often. How could I speed this up without causing any correctness issues? On taking the TLB refill exception, I verify the handler has not changed and matches the expected bytes, if so, I do what it would have done, but in native code, not emulated MIPS. This helps this particular code run quite a bit faster. Correctness is not compromised since this is only done if the handler matches what is expected, byte for byte.

I also mentioned that due to how delay slots work, if a CPU takes an exception on an instruction in the delay slot, the kernel must be able to completely emulate that instruction, or in other way execute it and then jump to the right place? Linux uses the fact that MIPS has no PC-relative instructions, except jumps, and it is illegal to place a jump in the delay slot. How? Instead of emulating the delay-slot instruction, Linux copies it out to a special page in memory, where it is followed by a trap. Linux then jumps there in user mode to let it execute, catches the trap, and then re-directs execution where it should go. Now, if this sounds like a giant hassle to you, you are right. What can we do? Well, if an instruction in a delay slot causes an actual exception (like an illegal access, or a TLB refill exception, or some such thing), not much can be done. But what we CAN do is not make things worse. uMIPS will not deliver IRQs before executing an instruction in the delay slot of a branch. At worst, this will delay an IRQ by a cycle, which makes no difference to correctness. The benefit is that this sort of instruction copying and juggling can be done less.

How to build and use one

Building

Linux card PCBs

Now, why you really came here. How do you get one? Well, you could try knowing me personally and asking for my business card, I have a few to give out, but other than that, here is how to do it.

You'll need to order the board from a board fabrication place. I am a fan of JLPCB and recommend them. The gerber files I provide come in two flavours. One as you see my card exactly, and one without my name and contact info :). This is a four-layer board, the board house will ask you for layer order, it is: GTL, G1, G2, GBL. At least JLPCB has options to also cover the edge connector in gold for better contact, called "gold fingers" and to grind the board edge to 45° for easier insertion. I suggest selecting both of these options - they are free. Remember to set the board thickness to 0.8mm.

While you wait for the board to arrive, you'll want to order the parts. You'll need four of the same memory chip (I have the links above), an ATSAMDA1E16, an AMPHENOL 11400841 SD card slot, and a MIC5317-3.3YM5TR regulator. You'll also want to (optionally) order an 0603 sized blue or white LED for SD activity light. If you choose to have that LED, you'll also need a 430 ohm resistor in 0603 or 0805 size. Besides that, you'll need in 0603 or 0805 sizes: 2x 5.1Kohm resistor, 1x 1Kohm resistor, 3x 0.1uF capacitor, and 7x 1.0uF capacitor. You will also need an SD card and any SWD programmer capable of programming the ATSAMD chip. There are many out there. Pick your favourite.

You'll need an SD card as well. 128MB is the bare minimum here if you want to fit the busybox-based rootfs in. To fit the debian or hybrid image I am providing, you'll want at least 512MB. You can write the image to the card using your favourite tool for that. On Linux and MacOS that is probably dd, on windows, Win32DiskImager.

Once you've assembled the board, program the MCU with the provided binary software/emu/uMIPS.bin and you're done!

Building from source

You'll want to build a few things. You'll need both an ARM (CodeSourcery) and a MIPS GCC toolchain (I used mips-mti-linux I found online). First, build "romboot", "mbrboot", and "loader". Then, build the kernel. I provided the config, patches, etc. Then you'll want to build the emulator. To build for the MCU, use make CPU=atsamd21. To build for PC, try make CPU=pc. Then you can build the SD card image. You'll want to copy the MBR from one of mine and modify it, then use mkdisk.sh to embed your kernel, mbrboot, and loader. Use a loopback mount to copy in your rootfs.

If you want to run the emulator on PC, there are a few things to note. First of all, Ctrl^C will kill it :). Second, unlike the MCU version, the PC version does not incorporate the rom loader in the binary, so you'll need to provide a pointer to it on the command line. A typical command line is ./uMIPS ../romboot/loader.bin ../disk.wheezy

If you are lazy

For the lazy ones I am trialing selling all the parts and the board together as a kit on tindie. I'll see how this goes. My suspicion is that it'll end up being a giant pain in my ass and not worth the time, but I am giving it a fair shot. EDIT: Apparently not, and not even with a good reason. I quote: Please resubmit for admin approval once you have addressed: Other Reason.. LOL, how about NO? As a sidenote, if anyone knows companies that do this sort of thing for me (sell a kit I designed), please drop me a line by email. If you are really really lazy, I might consider having a batch of these factory-assembled by JLPCB as well. If you are interested, click here and let me know. No promises yet.

Using

I provide a few disk images. The smallest is the busybox-based one (disk.busybox) - it is small, fast, and cool. I built the busybox from source for MIPS-I with as many applets enabled as I could imagine being needed. The second image is a full debian wheezy (last version to support MIPS-I) rootfs. I should warn you that debian's "init" starts like 3000 processes while it boots, so that takes a long time. If you are using the debian disk image (disk.wheezy), I strongly suggest to just mount proc and sys, and do your things in "sh" without running "init", but it will work if you do ... eventually. I also provide a hybrid image (disk.hybrid). It has a busybox shell and init, but has all of the debian binaries, so things not provided by busybox are still there and work, like gcc and vim. This is the "hybrid" image.

Using the LinuxCard is easy, insert the SD card, connect USB-C to a computer, and open your favourite serial console app (minicom, PuTTY, etc), if you do not see the boot log, try the other virtual serial port (two exist). In case of a boot error, the SD card LED will blink in an infinite pattern, you can see the code for details on what various numbers of blinks mean.

Once you see the shell prompt, you can play around, or continue boot to login by typing exec init. After this you'll be able to login as "root" with the password of "mips". There will also be a login prompt on the second serial port as well. So cool!

In conclusion

Acknowledgements

I send a great many thanks to my cats for cutely lying under my table to keep me company during the many hours spent on this project. I send a giant, Mount Rushmore-sized middle finger to Atmel for this sorry excuse of a chip.

Downloads

The source code, gerbers, schematics, and all else except the disk images can be downloaded [here]. The license on my code is simple: free for all non-commercial use, including using as your own business card. For commercial use (like if you wish to sell kits of this project), contact me.

The disk images are large so I have no desire to host them on my site, so: [Google Drive] or [MEGA].

Comments...

© 2012-2022