Stable As A Two-Legged Stool
I am writing this blog post in a more than slightly agitated state, having just gone through a roughly 3-month period in which NetBSD crashed more than once a week. Since the upgrade to 7.0.2, my server has been plagued with instability.
Initially, I suspected memory problems, and replaced the 2G sticks with the older 1.5G sticks I had from earlier. This appears to have solved a few crashes that panicked the kernel and left a (relatively unhelpful) AE_NO_MEMORY error.
Now, however, the machine seems to go down silently every 2-5 days. So what is going on now? Thankfully, it at least produces a crash dump, so I can try some forensic techniques to winnow down the possibilities.
Debugging Without A ‘DEBUG’ Kernel
Reading through the NetBSD crash dump guide was only somewhat helpful – though the link in the article to Hubert Feyrer’s website was useful. The guide also recommended that I recompile my kernel with ‘DEBUG’ flags enabled before trying to use crashdump. Hmn, no thanks, I’ll pass.
As a side rant here: NetBSD Foundation, you really need to get a new wiki. The current one is clumsy, incomplete, and woefully out of date! The most useful information I get is when I poke around on the mailing list archives or search up independent NetBSD blogs.
Using NetBSD’s guide, it initially looked like there was a bug in the ale(4) network driver, or in the ifmedia controls. Since I have had Ethernet act up once before, I followed the crashdump traceback:
Reading symbols from /netbsd...done.
0xffffffff8063d735 in cpu_reboot ()
#0 0xffffffff8063d735 in cpu_reboot ()
#1 0xffffffff80865182 in vpanic ()
#2 0xffffffff8086523d in panic ()
#3 0xffffffff808a84d6 in trap ()
#4 0xffffffff80100f46 in alltraps ()
#5 0xffffffff806583b4 in mii_anar ()
#6 0xffffffff801e42c0 in atphy_service ()
#7 0xffffffff806570ab in mii_mediachg ()
#8 0xffffffff80438037 in ifmedia_ioctl ()
#9 0xffffffff803d6b35 in ale_ioctl ()
#10 0xffffffff803ca2ed in doifioctl ()
#11 0xffffffff8088011b in soo_ioctl ()
#12 0xffffffff80875da9 in sys_ioctl ()
#13 0xffffffff808801fa in syscall ()
#14 0xffffffff80100691 in Xsyscall ()
But it seems like doing a gdb crashdump might have been overkill, and counterproductive. Using Hubert Feyrer’s tips, I was actually able to get a useful dump of the kernel log buffer from just before the crash. This is much simpler, and doesn’t need any debug flags, either:
uvm_fault(0xffffffff810b16c0, 0xffffffff817e5000, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff806583b4 cs 8 rflags 10287 cr2 ffffffff817e59ec ilevel 6 rsp fffffe802f6c2ba8
curlwp 0xfffffe807d5bb9a0 pid 10237.1 lowest kstack 0xfffffe802f6c02c0
cpu0: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
atphy_service() at netbsd:atphy_service+0x1a7
mii_mediachg() at netbsd:mii_mediachg+0x48
ifmedia_ioctl() at netbsd:ifmedia_ioctl+0xa0
ale_ioctl() at netbsd:ale_ioctl+0x34
doifioctl() at netbsd:doifioctl+0x2f4
soo_ioctl() at netbsd:soo_ioctl+0x284
sys_ioctl() at netbsd:sys_ioctl+0x17e
syscall() at netbsd:syscall+0x9a
--- syscall (number 54) ---
cpu0: End traceback...
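For anyone wanting to reproduce this step, here is a minimal sketch of pulling the message buffer out of a saved dump. It assumes the approach is dmesg(8) with its -M (core file) and -N (kernel image) options; the helper names and the file paths in the usage note are hypothetical, and a compressed dump would need to be gunzipped first.

```python
import subprocess


def dmesg_command(core_path, kernel_path):
    """Build the dmesg invocation that reads the message buffer from a dump.

    -M names the crash dump to read instead of the live kernel's memory,
    -N names the kernel image that produced the dump.
    """
    return ["dmesg", "-M", core_path, "-N", kernel_path]


def read_crash_dmesg(core_path, kernel_path):
    """Run dmesg against the dump and return the buffer as text."""
    result = subprocess.run(
        dmesg_command(core_path, kernel_path),
        capture_output=True,
        text=True,
        check=True,  # raise if dmesg cannot read the dump
    )
    return result.stdout
```

Calling `read_crash_dmesg("/var/crash/netbsd.0.core", "/var/crash/netbsd.0")` would return the log text, with the panic message and traceback near the end. The paths are placeholders for whatever savecore(8) wrote on the machine in question.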
Hmn… UVM fault errors! It might not be the networking drivers, after all – though they do happen to show up in the backtrace. Could memory-mapped I/O devices be behaving badly in the new kernel?
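One cross-check that can be done with just the two traces above: the rip value in the trap line can be compared against the frame addresses gdb printed, and the cr2 value (the address that faulted) can be rounded down to its page and compared with the address in the uvm_fault line. The sketch below does that with values copied from this post; the function names are made up for illustration, and the 4 KiB page size is an assumption that holds for amd64.

```python
import re

# Copied verbatim from the kernel log buffer above.
TRAP_LINE = (
    "trap type 6 code 0 rip ffffffff806583b4 cs 8 rflags 10287 "
    "cr2 ffffffff817e59ec ilevel 6 rsp fffffe802f6c2ba8"
)
UVM_FAULT_ADDRESS = 0xFFFFFFFF817E5000  # second argument of uvm_fault()

# A few frames copied from the gdb backtrace above: (address, function).
GDB_FRAMES = [
    (0xFFFFFFFF80100F46, "alltraps"),
    (0xFFFFFFFF806583B4, "mii_anar"),
    (0xFFFFFFFF801E42C0, "atphy_service"),
    (0xFFFFFFFF806570AB, "mii_mediachg"),
]

PAGE_SIZE = 4096  # assumption: amd64 with 4 KiB pages


def parse_trap_line(line):
    """Extract the hexadecimal rip, cr2 and rsp values from a trap line."""
    fields = {}
    for name in ("rip", "cr2", "rsp"):
        match = re.search(rf"\b{name} ([0-9a-f]+)", line)
        if match:
            fields[name] = int(match.group(1), 16)
    return fields


def frame_for(address, frames):
    """Return the function whose gdb frame address equals `address`, if any."""
    for frame_address, function in frames:
        if frame_address == address:
            return function
    return None


regs = parse_trap_line(TRAP_LINE)
faulting_function = frame_for(regs["rip"], GDB_FRAMES)
fault_page = regs["cr2"] & ~(PAGE_SIZE - 1)  # round down to the page start

print(f"rip {regs['rip']:#x} matches gdb frame: {faulting_function}")
print(f"cr2 page {fault_page:#x} matches uvm_fault: {fault_page == UVM_FAULT_ADDRESS}")
```

With the values from this crash, the rip lines up with frame #5 (mii_anar) in the gdb output, and the faulting page is the one named in the uvm_fault line. That only says where the bad access happened, not why, so it does not settle the driver-versus-memory question on its own.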
Since I have not had any such errors on previous NetBSD versions, my immediate solution has been a roll-back to the 7.0 kernel. So far, two days without issue, though only time will tell. If I have more crashes, my next step will be running MEMTEST86 on my server to double-check that I don’t have faulty memory. I find this unlikely, however, since I have tested with two different sets of memory sticks, and because the server had been running stably until 7.0.2.
I’ll keep an eye on it!