Fixing stability issues with 1st generation Ryzen chips on Debian

2019-01-03 - Louis-Philippe Véronneau

I was an early adopter when Ryzen - AMD's latest CPU line - came out. The prices were very good, the chips had a lot of cores and they ran pretty fast. At the time, I thought the Ryzen 1600, with its 6 cores and 12 threads all running at 3.4 GHz, a TDP of 65W and support for ECC RAM, made the perfect home server chip.

Fast forward two years: I've finally worked around the stability issues that were hanging my server at random intervals. Sometimes everything was fine for months, but at other times the system froze twice in a week. Since I'm using full disk encryption on all the drives in my server, a whole system freeze meant I had to go back home and reboot the server manually.

I first thought I was affected by a "rare" bug that touched the first batch of Ryzen CPUs, so I RMAed mine and had to put up with nearly a month of downtime. Sadly, it didn't solve my problem. Two weeks ago I decided I was tired of this whole reboot cycle and tried to see if upgrading to a more recent kernel (4.9 -> 4.18) would do the trick. The problem only got worse and my server ended up freezing each and every night. As always, no errors showed up anywhere in the logs.

With the 4.18 kernel, the timing of the system freezes got me thinking and I found this bug report in Launchpad. Turns out the problem is caused by bad low-power handling. When the CPU idles for a long time, it eventually freezes and hangs the whole system. This is corroborated by this AMD report, which states:

      1109 MWAIT Instruction May Hang a Thread

      Description: Under a highly specific and detailed set of internal timing
                   conditions, the MWAIT instruction may cause a thread to
                   hang in SMT (Simultaneous Multithreading) Mode.
      Potential Effect on System: The system may hang or reset.
      Suggested Workaround: System software may contain the workaround for
                            this erratum.
      Fix Planned: No fix planned

To fix the problem I've:

  • disabled SMT in the BIOS
  • disabled "Cool 'n Quiet" in the BIOS
  • disabled "Global C-states" in the BIOS
  • set "Power Supply Idle Control" to "Common current idle" in the BIOS
  • set idle=nomwait on the kernel command line
  • set processor.max_cstate=5 on the kernel command line
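
To check that the two kernel parameters from the list above are really active after a reboot, one option is to look at /proc/cmdline. The following is only a minimal Python sketch: the helper names (parse_cmdline, missing_params) are made up for illustration, and it assumes a simple command line without quoted values containing spaces.

```python
from pathlib import Path

# The two parameters from the list above. The checker itself is a
# hypothetical helper, not something shipped with Debian.
EXPECTED = {"idle": "nomwait", "processor.max_cstate": "5"}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without a value (e.g. "quiet") map to an empty string.
    Assumption: no quoted values containing spaces.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in expected.items() if params.get(k) != v]


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        missing = missing_params(path.read_text())
        print("all set" if not missing else "missing: " + ", ".join(missing))
```

On a Debian system booting with GRUB, these parameters would typically be added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running update-grub and rebooting. Other bootloaders need a different procedure.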

Disabling Cool 'n Quiet and the C-states means that the CPU cores always run at 3.4 GHz and the chip draws 50W at idle instead of 30W, but that's a price I'm willing to pay for a stable server.
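
To see which idle states the kernel still exposes after changes like these, the cpuidle entries in sysfs can be listed. This is a rough Python sketch, assuming the kernel's cpuidle subsystem is enabled and exposes the usual /sys/devices/system/cpu/cpu0/cpuidle/stateN/name files. The idle_states function is a hypothetical helper, and the state names it prints depend on the cpuidle driver in use.

```python
from pathlib import Path


def idle_states(cpu_dir):
    """Return the idle state names exposed for one CPU, ordered by state index."""
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state[0-9]*"),
        # Sort numerically so that state10 comes after state2.
        key=lambda p: int(p.name[len("state"):]),
    )
    return [(d / "name").read_text().strip() for d in state_dirs]


if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0")
    print(idle_states(cpu0) or "no cpuidle states exposed")
```

An empty result would suggest that no idle states beyond the default are available to the kernel, but I wouldn't treat it as proof that the BIOS settings took effect.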

Note that from what I've read online, the 2nd generation Ryzen chips aren't affected by this. Don't take my word for it though. I guess I've learnt the hard way that trying to build a stable system out of a bleeding edge platform is a bad idea.