Good 'morning' all. Up late today, because last night's Windows machine debugging ended at... about 8:35am.
But it was a happy ending!
So the kids got my old workstation, Fishcore, rehabilitated into a Windows gaming PC for Christmas. It's a big old beast of a Xeon that had gotten flaky in storage during my move to NH a few years ago; I figured a little TLC would get it back in shape.
It served me well for a good ten years; I bought the P9X79WS motherboard for it when it first came out, and for the kids' version it got a nice E5-1680 v2 (a processor that came out years after the mobo!). X79 feels like it was at or near Intel's pinnacle; nice hardware.
Also, I was one of the lucky bastards to land a 3080 on a whim. So, a hell of an Xmas gift, no lie.

...but the machine was never right after setting it up.
Aside from setup woes (Windows software RAID: never again), it ran incredibly well most of the time. But it would also just crawl to a halt a few times an hour for no discernible reason. Lots of disk activity.
The machine also inherited Fishcore's old RAID in cut-down form. I saw how fast the kids filled up everything else with games, so I added an SSD cache to the big spinning rust (still big) and the disk subsystem seemed perfectly snappy too.
But something was happening that just hammered it into the ground for no apparent reason, and the throughput during these episodes was *terrible*. I turned off inessential service after inessential service, etc, etc, no change.
The RAID was on an LSI 9271-8iCC, and getting it to work at all was no joke. Aside from a hardware blunder (previously documented here), getting the driver to not instantly bluescreen took days.

...so I tentatively blamed the RAID and decided to just ditch it for an SSD for now.
Got fancy too fast, I thought, let's get things stable.

Re-imaging onto an SSD went smoothly enough, and the seconds-per-frame episodes seemed better... not gone but better...
The machine was still hammering disk. The SSD was keeping up much better, but it was continuously logging 100-150 MB/s of writes, in bursts lasting several minutes.

Multiple processes appeared to be doing this. I kept turning things off one by one until it came down to svchost.
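(For what it's worth, this kind of hunt can also be scripted instead of squinting at Resource Monitor. A rough Python sketch using the third-party psutil module, just to illustrate the idea of diffing per-process write counters over an interval; run it elevated so it can see system processes like svchost:)

```python
# Rough sketch of the "who is hammering the disk?" hunt, using psutil
# (third-party: pip install psutil). Samples per-process write bytes
# twice and prints the biggest writers over the interval.
import time
import psutil

INTERVAL = 10  # seconds between samples

def snapshot():
    """Return {pid: (name, write_bytes)} for every process we can read."""
    out = {}
    for proc in psutil.process_iter(['pid', 'name']):
        try:
            io = proc.io_counters()
            out[proc.info['pid']] = (proc.info['name'], io.write_bytes)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            pass  # some system processes will refuse us; skip them
    return out

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

deltas = []
for pid, (name, wb_after) in after.items():
    if pid in before:
        wb_before = before[pid][1]
        deltas.append((wb_after - wb_before, pid, name))

for written, pid, name in sorted(deltas, reverse=True)[:10]:
    print(f"{written / 1e6:8.1f} MB written  pid={pid:<6} {name}")
```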
I am not a Windows person. I understand svchost no deeper than 'kind of like systemd and hated even more'.
Why would it hammer disk so hard, and why could a *very* fast caching raid not keep up when a single SSD could?
It couldn't be about throughput. The RAID had way more. Random access? But the RAID was caching both reads and writes, and had more cache than the main machine was using for flushes.

But what if Windows was sitting in a spinlock waiting for write-to-media confirmation? OH.
Logs. svchost was logging millions of tiny messages, and waiting for each one to commit to media. The logs didn't look big, but perhaps log rolling was adding to the pressure.
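(To make that failure mode concrete: here's a toy Python sketch. It is not Windows' actual logging path, just the same pattern. The same bytes get written both ways; forcing every tiny record to media before touching the next one is what murders you. Filenames and record contents are made up for the demo.)

```python
# Toy illustration of why "flush every tiny log record to media" hurts:
# same bytes, wildly different cost. Numbers will vary a lot by OS,
# filesystem, and whether the device actually honors fsync.
import os
import time

RECORD = b"some tiny log record that must hit the platters before we continue\n"
COUNT = 10_000  # trim this down if your disk is slow

def buffered(path):
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(RECORD)          # lands in the OS page cache

def synced(path):
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(RECORD)
            f.flush()                # push out of Python's buffer...
            os.fsync(f.fileno())     # ...and wait for the device to confirm

for fn in (buffered, synced):
    path = f"logdemo_{fn.__name__}.bin"
    t0 = time.perf_counter()
    fn(path)
    dt = time.perf_counter() - t0
    mb = COUNT * len(RECORD) / 1e6
    print(f"{fn.__name__:>8}: {mb:5.1f} MB in {dt:6.2f}s")
    os.remove(path)
```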

So checking the logs and... sure enough...
"Correctable bit error on page <foo>, removing from service"

Bad SDIMM apparently. It's a Xeon with full scrubbing so it was correcting all the errors, but svchost was logging each and every one of them, forcing a flush, waiting for completion, and moving on to the next.
This is *entirely correct behavior*. It was doing exactly what it should have been doing. And the message got through... eventually. After a month. Because no other part of the Windows system reported anything whatsoever amiss.
Aside from logging a serious system fault to a log hidden six control panels deep, the error wasn't reported anywhere else, and nothing checked for it. None of the troubleshooting tools saw it. None of the hardware diagnostic tools saw it. Nothing else cared. W.T.F.
I only found it after having an OH moment sparked by a random forum post after a month of googling. GAH. But it was the problem. And fixing it fixed the lags.
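(If you're chasing something similar: you don't have to spelunk through Event Viewer by hand. A minimal sketch that shells out to the stock wevtutil tool, assuming the corrected-memory messages land under the Microsoft-Windows-WHEA-Logger provider, which is where current Windows usually files hardware errors; your channel and provider may differ. Run it from an elevated prompt.)

```python
# Sketch: pull recent WHEA (hardware error) events out of the System log
# with the stock wevtutil tool. Assumes corrected memory errors show up
# under the Microsoft-Windows-WHEA-Logger provider; adjust the query if
# your build logs them elsewhere.
import subprocess

query = "*[System[Provider[@Name='Microsoft-Windows-WHEA-Logger']]]"
cmd = [
    "wevtutil", "qe", "System",
    f"/q:{query}",
    "/c:20",        # last 20 matching events
    "/rd:true",     # newest first
    "/f:text",      # human-readable output
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```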
The one bit of good fortune (aside from being able to afford to do this at all, that's a lot of good fortune there) was that it turned out the RAM wasn't bad. Nor the slot. Nor the mem controller on the CPU.

The mobo BIOS had cached bad training data for that DIMM slot.
The P9X79 halfway-exposes a bunch of 'overclocking tools' for enthusiasts, and one of them lets you explicitly trigger link training for the physical, analog layer of the SDIMM slots. It's part of (and necessary for) DDR3 and is usually prebaked or automatic.
You need to do this to reliably get the kind of speed over the physical wires and connectors that DDR3 uses. PCIe, SAS, OCuLink, etc., all do the same. Like I said, it's usually automatic and, for recent specs, continuous. The PHYs are always monitoring and running tests.
The P9X79 lets you trigger this for the SDIMM slots. And then caches the results apparently for eternity. But it doesn't tell you that. The manual says:

"MemOK: Press and hold MemOK button to optimize memory."

Thanks. Very clear.
Anyway, it was the problem.

Machine is happy, kids are happy, and the G Fish (like G Man except apparently merman style) is finally not embarrassed by the Xbox next to it.