MMU Caching Finally Explained

I had an absolute breakthrough tonight regarding how the WE32101 MMU handles caching. In fact, when I implemented the change, my simulator went from having 108 page miss errors during kernel boot, to 3. The cache is getting hit on almost every virtual address translation, and because of what I learned, the code is more efficient, too.

The key to all this was finally looking up 2-way set associative caching (see here, here, and here), which the WE32101 manual identifies as the caching strategy on the chip. Once I read about it, I was enlightened.

Remember how I was complaining about not enough bits being used for the tag? It turns out this is exactly by design. The tag is used in the cache to identify that you’ve found the right segment or page descriptor, yes, but it’s not used to index them. Instead, the “missing” bits are used as an index.

The way n-way set associative caching works is that some bits are used as a tag, and some bits are used as an index into the cache. You use that index to find a set of n entries. The SD cache is technically a 1-way set associative cache, because there’s only one entry in each set. Bits 17–19 are used to form an index into one of the 8 available cache slots. The tag is compared against the value in the slot. If they match, it’s a cache hit. If not, it’s a cache miss.

In the example above, you can see that the SSL part of a virtual address for a contiguous segment is broken into a 10-bit tag, and the three bits left over that I thought were “missing” are actually used as an index.
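To make that concrete, here is a minimal sketch of an SD cache lookup along those lines. It is not my actual simulator source; the struct layout and names are just illustrative, but the tag and index bit positions are the ones described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* One SD cache slot: a 10-bit tag plus the cached descriptor words. */
typedef struct {
    bool     valid;
    uint32_t tag;        /* bits 20-29 of the virtual address   */
    uint32_t sd_word0;   /* cached segment descriptor contents  */
    uint32_t sd_word1;
} sd_cache_entry;

/* 4 sections x 8 slots each, directly indexed: 1-way set associative. */
static sd_cache_entry sd_cache[4][8];

bool sd_cache_lookup(uint32_t va, sd_cache_entry **result)
{
    uint32_t sid   = (va >> 30) & 0x3;    /* section ID, bits 30-31           */
    uint32_t index = (va >> 17) & 0x7;    /* bits 17-19: which of the 8 slots */
    uint32_t tag   = (va >> 20) & 0x3ff;  /* bits 20-29: identifies the SD    */

    sd_cache_entry *e = &sd_cache[sid][index];
    if (e->valid && e->tag == tag) {
        *result = e;      /* hit: no trip to main memory needed */
        return true;
    }
    return false;         /* miss: fetch the SD from memory and refill this slot */
}
```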

The PD cache works similarly, except it’s 2-way set associative. There’s a “Left” and a “Right” entry in each cache slot. You compare your tag against each entry to see if you have a match. When writing into the cache, you pick which entry to replace by examining the Most Recently Used bit of the left cache entry, which flips back and forth between 1 and 0 depending on whether the left or the right entry was replaced most recently. In this way, each set in the cache is itself a Least Recently Used cache.
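Here is a sketch of the 2-way side, again with illustrative names and sizes. I am leaving the exact tag and set-index bit fields out of it; the part worth showing is the Left/Right pair and the MRU flip on replacement.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t pd;              /* cached page descriptor */
} pd_entry;

/* Each set is a Left/Right pair. The MRU bit records which side was
   replaced most recently, so the other side is the next victim. */
typedef struct {
    pd_entry left;
    pd_entry right;
    bool     mru_left;
} pd_set;

static pd_set pd_cache[4][8];     /* per-section sets; sizes illustrative */

bool pd_cache_lookup(uint32_t sid, uint32_t set, uint32_t tag, uint32_t *pd)
{
    pd_set *s = &pd_cache[sid][set];
    if (s->left.valid  && s->left.tag  == tag) { *pd = s->left.pd;  return true; }
    if (s->right.valid && s->right.tag == tag) { *pd = s->right.pd; return true; }
    return false;                 /* miss in both ways */
}

void pd_cache_fill(uint32_t sid, uint32_t set, uint32_t tag, uint32_t pd)
{
    pd_set *s = &pd_cache[sid][set];

    /* Replace whichever side was NOT written most recently... */
    pd_entry *victim = s->mru_left ? &s->right : &s->left;
    victim->valid = true;
    victim->tag   = tag;
    victim->pd    = pd;

    /* ...and flip the MRU bit, so each set behaves as a tiny LRU cache. */
    s->mru_left = !s->mru_left;
}
```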

Once I figured this out, it was a breeze to implement. In fact, it simplified the code and made look-ups a lot faster: I was doing an O(n) linear scan of the cache before; now it’s an O(1) indexed lookup.

The darn 3B2 simulator still doesn’t boot all the way, but every time I find and fix a bug, I know I’m getting closer.

3B2 System Timers and SIMH

I think I made a grievous error when I originally laid out how the 3B2 system timers would work in SIMH, and last night I started down the path of correcting that.

The 3B2/400 has multiple sources of clocks and interrupts: There’s a programmable interval timer with three outputs being driven at 100 kHz, a timer on the UART that runs at 235 kHz, and a Time-of-Day clock. They’re all driven at different speeds, and they’re all important to system functionality.

SIMH offers a timing service that allows the developer to tie any of these clocks to the real wall clock. This is essential for time-of-day clocks or anything else that wants to keep track of time. I used this functionality to drive the UART and programmable interval timers at their correct clock speeds.

But that’s completely wrong. Of course this is obvious in retrospect, but it seemed like a good idea at the time. The problem is that the main CPU clock is free to run as fast as it can on the SIMH host. On some hosts it will run very fast; on others, quite a bit slower. You can’t possibly know how fast the simulated CPU is stepping.

When your timers are tied to the wall clock but your CPU is running as fast as it can, there are going to be all kinds of horrible timing issues. I had lots of unpredictable and non-reproducible behavior.

Last night, I undid all of that. The timers are now counting down in CPU machine cycles. I used the simple power of arithmetic to figure out how many CPU machine cycles each step of each timer would take, and just did that instead.
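The arithmetic is as simple as it sounds. As a sketch (the 10 MHz CPU clock here is an assumption for the example only; the 100 kHz and 235 kHz rates are the ones mentioned above):

```c
/* Convert each timer's rate into a countdown in simulated CPU machine cycles.
   NOTE: the 10 MHz CPU clock is an assumption for this example; the 100 kHz
   PIT and 235 kHz UART timer rates are the ones described above. */
#define CPU_HZ        10000000u
#define PIT_HZ          100000u
#define UART_TIMER_HZ   235000u

#define PIT_CYCLES_PER_STEP   (CPU_HZ / PIT_HZ)         /* 100 cycles */
#define UART_CYCLES_PER_STEP  (CPU_HZ / UART_TIMER_HZ)  /* ~42 cycles */

/* Called once per simulated machine cycle. */
void timers_tick(void)
{
    static unsigned pit_countdown  = PIT_CYCLES_PER_STEP;
    static unsigned uart_countdown = UART_CYCLES_PER_STEP;

    if (--pit_countdown == 0) {
        pit_countdown = PIT_CYCLES_PER_STEP;
        /* step the programmable interval timer outputs here */
    }
    if (--uart_countdown == 0) {
        uart_countdown = UART_CYCLES_PER_STEP;
        /* step the UART timer here */
    }
}
```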

Now, it seems like everything is a lot more stable and a lot more predictable.

Simulating the Disk Controller

I spent last night probing my 3B2/310’s hard disk controller with a logic analyzer so I can see exactly how it behaves, both with and without a hard disk attached to the system. It proved to be very tricky to get the logic analyzer probes attached because the motherboard is so incredibly dense. In fact, I couldn’t get a probe attached to the chip select line no matter how hard I tried. There just wasn’t any room to fit a probe between the chip and a nearby resistor array, so I resorted to using a little piece of wire to just touch against the pin. I could have used three hands for that operation.

I exported the data from Saleae Logic as a CSV file, and wrote a little Python script to parse the commands out of it. This is what I got:

I can now match this sequence up exactly with my simulator, so I can simulate the precise behavior, including the delay between status 0x80 (which means “I’m busy, try again later”) and 0x40 (which means “I’m done now”).
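In the simulator, that boils down to something like the sketch below: stay busy for a while after a command starts, then report done. The names and the busy-cycle count are placeholders, not my real code; only the 0x80/0x40 status values come from the capture.

```c
#include <stdint.h>

#define ID_STAT_BUSY  0x80   /* "I'm busy, try again later" */
#define ID_STAT_DONE  0x40   /* "I'm done now" */

/* Placeholder delay; the real number would come from the capture above. */
#define ID_BUSY_CYCLES  4000u

extern uint64_t sim_cycles;  /* global simulated cycle counter (illustrative) */
static uint64_t id_done_at;  /* cycle at which the current command completes  */

void id_start_command(void)
{
    id_done_at = sim_cycles + ID_BUSY_CYCLES;
}

uint8_t id_read_status(void)
{
    /* Report busy until the simulated delay has elapsed, then report done. */
    return (sim_cycles < id_done_at) ? ID_STAT_BUSY : ID_STAT_DONE;
}
```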

Unfortunately, the simulator still hangs after loading and executing /etc/init from the boot floppy, but now at least I know it has nothing to do with a badly behaving disk controller. I have proof that mine behaves the same way as the real thing.

It’s Hard Disk Time

My next mini-project in the 3B2/400 simulator will be emulating the hard disk. The 3B2/400 used a NEC μPD7261A hard disk controller (PDF datasheet here), which has proved to be harder to emulate correctly than I would have liked.

So far, my hard disk controller emulation has been limited to the most minimal functionality needed to get the emulator to pass self-checks at all. Other than that, it’s just a skeleton. But I believe that it’s actually hanging up the floppy boot process now when UNIX tries to discover what hard drives are attached, so it’s time to get serious and fix it.

My progress isn’t good. I am following the datasheet to the letter, trying to give the correct status bits at the correct time, but the 3B2 just gets confused. It never even tries to read data off the drive; it just gives up trying to read status bits. So, clearly I’m doing something wrong, but I don’t know what it is.

Tonight I will strap a logic analyzer to the μPD7261A in my real 3B2 and see exactly what it’s doing. I’ll report on my findings when I have them.

The Equipped Device Table Question is Answered

And just like that, it’s solved. I figured out the mystery of the Equipped Device Table.

The answer was in some obscure piece of System V Release 3 source code. The 3B2 system board has a 16-bit register called the Control and Status Register (CSR). In the CSR is a bit named TIMEO that I never figured out.

It turns out that I just wasn’t reading the disassembled ROM code closely enough. The exception handler checks this status bit whenever it catches a processor exception while filling the EDT. If the bit is set, it skips the device.

So what is TIMEO? It’s the System Bus Timeout flag, according to the SVR3 source code.

The correct behavior, then, if nothing is listening at an I/O card’s address, is to raise an External Memory Exception and set this bit in the CSR. Once I implemented that in my simulator, the EDT started working exactly the same as it does on my real 3B2/310. Success!
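In simulator terms, the change looks roughly like the sketch below. The CSR_TIMEO bit position and the helper names are stand-ins, not the real code; the behavior is the point.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSR_TIMEO  0x8000u   /* placeholder bit position for the bus timeout flag */

extern uint16_t csr;                              /* Control and Status Register */
extern bool     io_slot_populated(uint32_t addr); /* does anything answer here?  */
extern uint32_t io_card_read(uint32_t addr);
extern void     cpu_external_memory_exception(void);

uint32_t io_read(uint32_t addr)
{
    if (!io_slot_populated(addr)) {
        csr |= CSR_TIMEO;                  /* system bus timeout */
        cpu_external_memory_exception();   /* the ROM's handler sees TIMEO set and
                                              skips this slot in the EDT */
        return 0;
    }
    return io_card_read(addr);
}
```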

The Mystery of the Equipped Device Table

[EDIT: I have made a follow-up post detailing the answer to this mystery!]

There is yet one more puzzling aspect of the 3B2 that I do not yet understand, and that is the equipped device table, or EDT. I’ve documented the nitty-gritty details on my main 3B2 reverse-engineering page, so I won’t bore you with the details. But the short version is this:

The EDT is what tells the system what I/O boards are installed. On startup, the ROM code probes each I/O slot to see what kind of card is in it. Each card has a read-only register that returns the board’s 16-bit Option Number, and the ROM fills in a special table in RAM with the Option Number for each card it discovers. It doesn’t fill in any other information, just the Option Number. For example, the SCSI expansion card returns option number 0x100, and the EPORTS card returns option number 0x102. That’s the only information the ROM gets from the card. Later, the user can run a program called filledt that looks up all kinds of other information about the card in a database held on disk.
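For a rough picture of what that ROM-side probe amounts to, here is a sketch. The slot count, addresses, and helper names are invented for illustration; the only behavior taken from the ROM is “read the option number, store it.”

```c
#include <stdint.h>

#define NUM_IO_SLOTS 12                    /* illustrative slot count       */

extern uint32_t slot_id_address(int slot); /* address of slot's ID register */
extern uint16_t bus_read_16(uint32_t addr);
extern void     edt_store(int slot, uint16_t option);

void fill_edt(void)
{
    for (int slot = 0; slot < NUM_IO_SLOTS; slot++) {
        /* Read the card's read-only Option Number register and record it;
           e.g. a SCSI card answers 0x100, an EPORTS card answers 0x102. */
        uint16_t option = bus_read_16(slot_id_address(slot));
        edt_store(slot, option);
    }
}
```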

So here’s today’s puzzler: How does the system decide that there’s nothing in a slot?

This is what I see when I boot my emulator:

But, this is what I see when I boot my real 3B2/310 (with no hard disk attached):

Clearly, the real 3B2 has discovered that there’s nothing in the expansion slots. I have confirmed this by running filledt on both the real system, and on the simulator.

This raises the question of how exactly the real 3B2 uses an option number read to determine there’s nothing in the slot. The ROM code should reveal all, but the code is very simple and straightforward: It just reads a value from each card and sticks it into memory, that’s all. It doesn’t branch on any magic values or anything: Just read a number, put it in memory.

The ROM also sets up an exception handler during the card slot read such that any exception causes the entire boot process to fail, so apparently it does not expect there to be any kind of bus read error.

So how? I’ve tried the following experiments:

  • Making each empty slot return 0xffff on read.
  • Making each empty slot return 0x0000 on read.
  • Making each empty slot raise an External Memory Exception on read.
  • Making each empty slot raise an External Memory Exception on write.

None of these, it seems, does the right thing. This is another problem that could probably be solved easily by having access to the right documentation, but that’s a rant for another time.

A Last Word on MMU Caching

Over the weekend I conducted several experiments with caching using my 3B2 simulator. I learned a few critical bits of information. For background, see this post and this post.

The first and most important thing I learned is that indexing cache entries only by their tags does not work. There are collisions galore, and no way to recover from them. However, if I index SD cache entries by the full SSL, and PD cache entries by the full SSL+PSL, everything seems to work perfectly. This leaves several big questions unanswered, but they are probably unanswerable. After all, I have no way of looking inside the MMU to see how it actually indexes entries in its cache, I can only go on published information and make educated guesses.
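Concretely, the keys in the interim scheme that works look something like this sketch. The SID and SSL bit positions match the earlier posts, the PSL bit positions are my assumption about the layout, and the key packing itself is just illustrative.

```c
#include <stdint.h>

/* Keys for the interim scheme: use every bit that locates the descriptor
   in memory, so two different descriptors can never collide in the cache. */

uint32_t sd_cache_key(uint32_t va)
{
    uint32_t sid = (va >> 30) & 0x3;     /* section ID, bits 30-31 */
    uint32_t ssl = (va >> 17) & 0x1fff;  /* full 13-bit SSL        */
    return (sid << 13) | ssl;
}

uint32_t pd_cache_key(uint32_t va)
{
    uint32_t sid = (va >> 30) & 0x3;
    uint32_t ssl = (va >> 17) & 0x1fff;
    uint32_t psl = (va >> 11) & 0x3f;    /* full 6-bit PSL; bit positions assumed */
    return (sid << 19) | (ssl << 6) | psl;
}
```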

Second, I learned that implementing MMU caching is required for the 3B2 simulator to work correctly. Until this weekend, I had not implemented any caching in the simulated MMU because I assumed that caching was only used for performance reasons and could be skipped. But this is not true. In fact, UNIX SVR3 changes page descriptors in memory without flushing the cache and relies on the MMU to use the old values until it requests a flush. Not having implemented caching was a source of several serious bugs in the simulator.

Third, I learned that the “Cacheable” bit in segment descriptors is inverted. When it’s a 1, caching is disabled. When it’s a 0, caching is enabled.
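In code, that just means the test is reversed (the mask below is a placeholder, not the real bit position within the descriptor):

```c
#include <stdbool.h>
#include <stdint.h>

#define SD_CACHEABLE_MASK  0x4u   /* placeholder; not the real bit position */

/* The sense is inverted: 0 means "cache this segment", 1 means "don't". */
bool sd_is_cacheable(uint32_t sd_word)
{
    return (sd_word & SD_CACHEABLE_MASK) == 0;
}
```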

The 3B2/400 simulator now makes it all the way through booting the SVR3 kernel and starting /etc/init. There are some other bugs preventing init from spawning new processes, but I hope to have these ironed out soon.

MMU Caching Redux

I had a Eureka! moment last night about MMU caching, but it all came tumbling down this morning.

My realization was that the Segment Descriptors are 8 bytes long, and that Page Descriptors are 4 bytes long. So, if we assume that the virtual address encodes the addresses of the SDs and PDs on word-aligned boundaries (and SDs and PDs are indeed word-aligned in memory), then you don’t need the bottom three bits for SD addresses, nor do you need the bottom two bits for PD addresses. Voila!

But this morning, I remembered two very important facts:

  • The SSL field in the virtual address is an index into the table of SDs, not an address offset.
  • Likewise, the PSL field is an index into the table of PDs, not an address offset.

From the WE32100 manual, here’s a pictorial representation of how an address is translated when using contiguous segments.

And here’s a concrete example. Immediately after the MMU is turned on in the 3B2/400 boot process, it translates the virtual address 0x40060f8c. Let’s break that down.

The SSL, bits 17–29 of the address, is 3 (b0000000000011). This is an index into a table, so to find SD number 3 in section 1, you take a base address for section 1 and add (3 * 8), since each SD is eight bytes long—two words.
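Here is that same breakdown as a tiny program. It only pulls out the fields and the table offset; the per-section base address comes from elsewhere, so I leave it out.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va  = 0x40060f8c;

    uint32_t sid = (va >> 30) & 0x3;     /* bits 30-31: section ID -> 1 */
    uint32_t ssl = (va >> 17) & 0x1fff;  /* bits 17-29: SSL        -> 3 */

    /* The SSL is a table index, not an address offset: SD number 3 lives
       at (section 1 base) + 3 * 8, since each SD is eight bytes long. */
    uint32_t sd_offset = ssl * 8;

    printf("sid=%u, ssl=%u, sd table offset=0x%02x\n",
           (unsigned)sid, (unsigned)ssl, (unsigned)sd_offset);
    return 0;
}
```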

So now I’m right back at square one. How does caching work if these are table indices instead of address offsets? If the TAG field of an SD cache entry only holds bits 20–29, I don’t yet grok how it uniquely identifies an SD.

For further reading, you can check out The WE32100 Microprocessor Information Manual pages 308–315, and The WE32101 MMU Datasheet pages 164–175.

MMU Caching for Fun and Profit

I’m in the middle of a very long, very drawn out project to try to emulate the AT&T 3B2/400 computer. I should probably have been sharing my progress more frequently than I have been, but it has for the most part been a painful and solitary endeavor.

Today, though, there is something in particular that is bothering me greatly, and I must yell into the void to get this frustration off my chest. And that is, how in the hell does the MMU cache work?

So first, a little background.

The MMU on the 3B2/400 divides virtual addresses into four parts. The first two bits (the SID) identify the section of virtual memory, of which there are four. The next 13 bits (the SSL) identify a physical memory offset to locate either a contiguous segment of memory, or a set of pages. The details of how contiguous segments and paged memory work are not that important for this discussion.

The actual segments and pages are described in the 3B2’s memory using Segment Descriptors and Page Descriptors. These descriptors are directions to the MMU that tell it how to do a translation from virtual address to physical address.

For performance, the MMU doesn’t want to go talking to main memory every time it does a translation. So, it has an on-chip cache. It can store 8 Segment Descriptors per section (a total of 32), and it can store 16 Page Descriptors per section (a total of 64). The cache keeps things humming along.
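In other words, the whole descriptor cache is small enough to write down as a pair of arrays (entry contents elided; the entry formats are exactly the part that puzzles me below):

```c
#include <stdint.h>

/* Dimensions only; the entry formats are the part in question below. */
typedef struct { uint32_t tag, sd_word0, sd_word1; } sd_cache_entry;
typedef struct { uint32_t tag, pd;                 } pd_cache_entry;

static sd_cache_entry sd_cache[4][8];   /*  8 SDs per section, 32 total */
static pd_cache_entry pd_cache[4][16];  /* 16 PDs per section, 64 total */
```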

So far, so good, right? Makes sense? Following along?

Here’s where things get crazy. Here’s how the MMU datasheet describes the internals of the cache entries.

Note the language on the “tag” field for each of the cache entry formats. These tags are supposed to uniquely identify the Segment Descriptor and Page Descriptor entries, right? And I presume they’re used for looking up a virtual address in the cache?

But wait! Bits 20–29 can’t possibly be enough to uniquely identify a segment descriptor. To look up a Segment Descriptor in main memory, you need all 13 bits of the SSL! There would surely be collisions in the cache, right?

And the problem extends to the Page Descriptor cache. Again, bits 13–16 and 18–29 of the virtual address can’t identify a Page Descriptor uniquely. You need all 13 bits of the SSL and all 6 bits of the PSL to locate a Page Descriptor in main memory.

This is my confusion. If those bits identify an SD and a PD uniquely in the cache, why can’t they identify them uniquely in main memory? And why aren’t there collisions in the cache?

This mysterious machine gets harder to understand all the time.