Few days ago I found out that my 24/7 system with Ryzen 9 3900X fails prime95 torture test when calculating FFTs with a length of 16K or 20K with AVX2 enabled at stock settings.
This seemed a bit strange to me, because I actually never overclocked this CPU, so the degradation can’t really be a problem here. Said tests fail on BIOS defaults (just after clearing CMOS) as well as with my 24/7 profile with infinity fabric clock and memory clock set to 1866/3733, but the CPU is fully stock (no PBO, no voltage offset). Other than this very specific case, system seems to be 100% stable. It passes OCCT Small data set AVX2 test for over 9 hours, it passes prime95 AVX2 with all other FFT lengths in the 1-64K range for hours, but 16K and 20K fails within few minutes. The fails only occur on worker threads 17,18,21 and 22, as far as I know prime95 always assigns same workers to same CPUs, which would suggest that only 2 out of 12 cores are failing. I run prime95 with SUM(INPUTS) error and Rounding checking enabled. I read about other people having similar problems with their Ryzen 3000 series CPUs1, but those reports were mostly about 3950X.
I am not sure if this issue actually occurred with my CPU from the first day or is it something which started only recently. As mentioned before I only observed this behavior few days ago, as I do not run prime95 with smallFFT preset (which includes 16K and 20K) that often. I definitely run it just after I got my new setup working (summer last year), but back then all BIOSes available for my motherboard had awfully broken PPT/TDC/EDC reporting and that made CPU boost behave in very funky way, because of that I cannot really compare situation back then with present.
My setup consists of:
- Ryzen 9 3900X
- ASRock X570 Taichi
- GSkill F4-3200C14D-32GTZKW (2x16GB B-Die)
- Samsung EVO 850 500GB
- Palit Gamerock Premium GTX1080
- Noctua NH-D15S
- Corsair AX860
So far I tried following things to fix this:
- clearing CMOS - does not help;
- latest AMD chipset drivers (2.07.14.327) - nothing changes;
- different BIOSes for my motherboard - 3.00; 3.20 and 3.40 - all behave the same way;
- running with just 1 stick of RAM - issue still occurs;
- enabling PBO - seems to help a bit (the error occurs bit later), this is most likely due to higher VCore being applied to the CPU, but I cannot make it pass the 16K test with PBO alone (maybe I need to test more combinations of PBO params);
- enabling Vcore offset of +50mV - this seems to fix the issue, but it makes the CPU runs slightly slower (higher Vcore, higher power usage - boost algorithm slows down the CPU faster);
- disabling AVX2 in prime95 - it is not really a fix, but the issue was not observed with AVX2 disabled.
I tried contacting both motherboard manufacturer (ASRock) and AMD, but I did not get any significant tips. However AMD support acknowledged that it MIGHT BE CPU defect. Both parties suggested trying CPU in different rig, but I can’t do it at the moment.
In my opinion it seems that CPU just can’t cope with prime95 16K FFT AVX2 test, because the voltage/frequency curve on which it operates is just too aggressive. On default settings it should be able to pass everything prime95 is throwing at it. The whole issue is not a big thing, but it is kind of annoying knowing that not so cheap CPU is not fully stable at default settings. My CPU is from early batches as I got it in summer 2019 and from what I heard the silicon quality got better over time, so hopefully new CPUs are totally stable even under prime95 and I am just unlucky with my chip.
I am still investigating this issue and I will update this post when/if I find some solution.
[11.10.2020] Update: Since writing original post I still did not figure out why my 3900X is failing 16k and 20k tests with AVX2 enabled in prime95. I tried RMAing CPU at the retailer, but apparently it was working fine in their tests (keep in mind they used most likely different motherboard). I also tested CPU with Linux Version of prime95 on Ubuntu 20.04 and tried fresh Windows install, but problem occurs on every system. In the meantime new version of prime95 was released (30.3b6), but that one still fails for me as well.
For now I gave up and bumped LLC to 1 (on ASRock BIOS it is the setting meaning lowest Vdroop) and my system can finally run 16k FFTs without a failure.
For me the problem seems to be caused by both motherboard not supplying voltage stable enough for the CPU to run correctly (this would require a bit of measurements to confirm)and aggressive binning from AMD side, not allowing for any margin of error for the stability of VCore. Maybe I will just switch to Zen3 once it is released and forget about my 3900X….