Nutshell: July 2010

MAME has again proven to be an invaluable debugging tool on my embedded software project. The problem first reared its ugly head when I moved a subroutine from one file to another. However, I found that this subroutine wasn't even being called when I ran the software, so at first glance the failure seemed to make no sense at all.

I traced the origin of the exception flow into the code that executes the RS232 ISR. I observed the state of the 68000 CPU registers in the MAME debugger, and compared it to a build of the software in which I had made no changes to this particular subroutine. I noticed that in the "correctly" executing software, the SR (status register) was left with the value 2700 after returning from the ISR to the (OS) interrupt dispatcher. The 2 (the 0x2000 bit) indicates that the CPU is in supervisor mode, which is what it should be on the 68000 when executing any privileged operation, such as an ISR.

In the broken software, the 2 was not set after leaving the ISR. Consequently, upon returning from the dispatcher, the SP (A7) was incorrectly loaded with 0x3FFFF0. The memory at this address is uninitialized (0), so the return popped a return address of 0 from the stack. The 68000 detects this as an invalid condition and throws an exception (address 0 is the first entry in the vector table, which is not a vector but rather contains the initial value to be loaded into SP at reset). I didn't notice until later, but 0x3FFFF0 happened to be the value in USP (the user stack pointer register: eventually I would remember that the 68k uses the USP as its stack pointer when not in supervisor mode).
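The two register details above (the supervisor bit in SR, and A7 switching between two stack pointers) can be sketched in a few lines of C. This is only an illustrative model, not code from the project: the function names are made up for this example, and the supervisor stack value used in the usage note is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits in the upper ("system") byte of the 68000 status register. */
#define SR_SUPERVISOR 0x2000u  /* S bit: set in supervisor mode */
#define SR_IPL_MASK   0x0700u  /* I2..I0: interrupt priority mask */
#define SR_IPL_SHIFT  8

/* True when the S bit is set, i.e. the CPU is in supervisor mode. */
static bool sr_is_supervisor(uint16_t sr)
{
    return (sr & SR_SUPERVISOR) != 0;
}

/* Extracts the 3-bit interrupt priority mask (0..7). */
static unsigned sr_interrupt_mask(uint16_t sr)
{
    return (sr & SR_IPL_MASK) >> SR_IPL_SHIFT;
}

/* A7 is banked: it refers to the supervisor stack pointer (SSP) when S is
 * set, and to the user stack pointer (USP) otherwise. */
static uint32_t active_a7(uint16_t sr, uint32_t ssp, uint32_t usp)
{
    /* Stack pointers on the 68000 are expected to be word-aligned. */
    assert((ssp & 1u) == 0 && (usp & 1u) == 0);
    return sr_is_supervisor(sr) ? ssp : usp;
}
```

With SR = 0x2700 the model reports supervisor mode and an interrupt mask of 7. Clear the 0x2000 bit and `active_a7` hands back the USP value instead, which is how 0x3FFFF0 ended up in A7 in the broken build.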

The question now is, why is the SR being clobbered when returning from the ISR? Note that the MAME debugger has no means for source-level debugging: you have to follow along in the ASM listing when using the MAME debugger - a minor inconvenience most of the time. In this case, I finally noticed that the code in question was running into some additional instructions which were not to be found in the listing for the ISR. The code was running these additional instructions and eventually encountering an RTE (return from exception).

RTE returns much as RTS does, loading the return address from the stack as might be expected. However, RTE also loads the SR from the stack, which is probably not what we want. In this case, the SR was being loaded with whatever happened to be in memory at that location. In the working code, the value loaded into SR happened to coincide with the return address that had been stacked, and that return address just happened to have the 0x2000 bit set, as needed to keep the SR in supervisor mode. In the broken code, I had moved a subroutine around, and the address that it ended up linking to did not allow the 2 to be set.
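To make the difference between the two returns concrete, here is a minimal C model of what RTS and RTE pop from the stack on a plain 68000. It is a sketch under simplifying assumptions: the `Cpu` struct and function names are invented for this example, the stack is a small big-endian byte array, SR is stored as the raw word, and the model does not switch A7 to the USP when the S bit ends up clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t sr;  /* status register */
    uint32_t pc;  /* program counter */
    uint32_t sp;  /* stack pointer, used here as an offset into mem */
} Cpu;

/* Reads a big-endian 16-bit word, as the 68000 would. */
static uint16_t read16(const uint8_t *mem, size_t size, uint32_t addr)
{
    assert((size_t)addr + 2 <= size);
    return (uint16_t)(((unsigned)mem[addr] << 8) | mem[addr + 1]);
}

/* Reads a big-endian 32-bit long word. */
static uint32_t read32(const uint8_t *mem, size_t size, uint32_t addr)
{
    return ((uint32_t)read16(mem, size, addr) << 16)
         | read16(mem, size, addr + 2);
}

/* RTS: pop a 4-byte return address. SR is left alone. */
static void do_rts(Cpu *cpu, const uint8_t *mem, size_t size)
{
    cpu->pc = read32(mem, size, cpu->sp);
    cpu->sp += 4;
}

/* RTE: pop a 2-byte SR first, then the 4-byte return address. */
static void do_rte(Cpu *cpu, const uint8_t *mem, size_t size)
{
    cpu->sr = read16(mem, size, cpu->sp);
    cpu->sp += 2;
    cpu->pc = read32(mem, size, cpu->sp);
    cpu->sp += 4;
}
```

The point of the model is that `do_rte` takes whatever word sits on top of the stack and makes it the new SR. If that word happens to have the 0x2000 bit set, the CPU stays in supervisor mode and the bug hides; if not, it drops to user mode.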

Eventually it became obvious that the ISR (written in assembly) was missing an RTS instruction at the end, and the execution was falling through to the next section of code that had been linked in (link order determined by a linker configuration script). This turned out to be some unused legacy code (we seem to have a lot of that) whose purpose was to ...change between supervisor and user mode and reload the status register!

The problem I was working on for the last week is finally seeing some progress. Actually, it was the manifestation of two different problems at once, which can make debugging embedded software awfully difficult.

First, there is the matter of adjusting the timeout for the watchdog. It turns out that the software is very sensitive to how much data it receives over the RS232 port: heavy traffic makes the tasks take longer, which stretches the interval between "petting" the watchdog. The right answer is somewhere between "not too short" and "not too long". The timeout must have enough wiggle room that we don't get a reset because a task takes a little bit longer when we've bombarded it with serial port messages. On the other hand, it must still be short enough to catch a real error where a task totally goes out in the weeds.
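One way to think about that trade-off is as a window: pad the worst interval observed between pets, then check that the result is still shorter than the longest hang you are willing to tolerate. The helper below is only a sketch of that reasoning. The function name, the percentage-based margin and the numbers in the usage note are all assumptions for illustration, not values from the project.

```c
#include <assert.h>
#include <stdint.h>

/* Picks a watchdog timeout in milliseconds.
 *
 * worst_pet_interval_ms: longest interval between pets seen under heavy
 *                        serial load (measured, hypothetical input)
 * margin_percent:        extra headroom on top of that, e.g. 50 for +50%
 * max_hang_ms:           longest time a hung task may go undetected
 *
 * Returns 0 when no timeout satisfies both bounds, which suggests the
 * tasks themselves need attention rather than the watchdog setting. */
static uint32_t pick_watchdog_timeout_ms(uint32_t worst_pet_interval_ms,
                                         uint32_t margin_percent,
                                         uint32_t max_hang_ms)
{
    assert(worst_pet_interval_ms > 0);
    assert(margin_percent <= 1000);

    /* Widen before multiplying so the padding cannot overflow. */
    uint64_t padded = (uint64_t)worst_pet_interval_ms
                    * ((uint64_t)100 + margin_percent) / 100;

    if (padded > max_hang_ms) {
        return 0;  /* "not too short" and "not too long" cannot both hold */
    }
    return (uint32_t)padded;
}
```

For example, a 40 ms worst case with a 50% margin and a 500 ms hang limit gives 60 ms, while a 400 ms worst case with the same margin and limit gives 0 because the padded value would exceed the limit.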

The side effect of the watchdog timeout that has caused me so much grief lately is that the software apparently doesn't cleanly reset everything on a "warm" reboot. In particular, it seems to leave interrupts enabled. The result is a race condition after a watchdog reset, where the interrupt fires before the software has had a chance to do some other initialization. That produces a really bizarre error in a part of the software that shouldn't even be executing, all because of wrongly initialized or uninitialized data following the reset.
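The race can be shown with a toy model in C. Everything here is hypothetical (the `Board` struct, its fields and the `startup` function are invented for illustration and are not the project's start-up code). It only demonstrates why masking interrupts as the very first start-up step closes the window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool irq_enabled;     /* interrupt enable state; may survive a warm reset */
    bool rx_initialized;  /* stands in for the data the ISR depends on */
    int  bad_isr_runs;    /* ISR entries that saw uninitialized data */
} Board;

/* Models a pending interrupt being delivered, if interrupts are enabled. */
static void deliver_pending_irq(Board *b)
{
    if (!b->irq_enabled) {
        return;                 /* masked: the interrupt stays pending */
    }
    if (!b->rx_initialized) {
        b->bad_isr_runs++;      /* ISR ran against uninitialized data */
    }
}

/* Models start-up after a reset. When mask_first is true, interrupts are
 * masked before anything else happens. */
static void startup(Board *b, bool mask_first)
{
    assert(b != NULL);
    if (mask_first) {
        b->irq_enabled = false;
    }
    deliver_pending_irq(b);     /* an interrupt arrives during initialization */
    b->rx_initialized = true;   /* the "other initialization" */
    b->irq_enabled = true;      /* now it is safe to take interrupts */
}
```

Starting from a warm-reset state with `irq_enabled` left true, `startup(&board, false)` records one bad ISR run, while `startup(&board, true)` records none.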

I have given the MAME debugger quite the workout lately: breakpoints, data watchpoints, execution traces - all these features are indispensable.

Friday, July 23, 2010

Debugging m68k in MAME

Tuesday, July 6, 2010

Kick the dog