If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon processor, chances are you're using ECC memory. ECC memory is a worthy investment to improve the reliability of your machine and, if properly monitored, will allow you to spot memory problems before they become catastrophic.
On recent Linux kernels the rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory errors. As we'll see, with a little bit of tweaking it's also possible to know exactly which DIMM is experiencing the errors.
First of all you'll need to install rasdaemon; it's packaged for most Linux distributions:
# apt-get install rasdaemon
# dnf install rasdaemon
# zypper install rasdaemon
On Gentoo the package is currently marked as unstable, so you'll need to unmask it first:
# echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.keywords
Then I recommend enabling sqlite support. This makes rasdaemon record events to disk and is particularly useful for machines that get rebooted often:
# echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use
Finally install rasdaemon itself:
Then we'll set up rasdaemon to launch at startup and to record events to an on-disk sqlite database.
Debian/Ubuntu/Fedora/openSUSE and other systemd-based distros
Note that on Fedora rasdaemon will not work if Secure Boot is enabled because of kernel lockdown. You will have to disable either kernel lockdown or Secure Boot if you want to use rasdaemon.
# systemctl enable rasdaemon
# systemctl start rasdaemon
Gentoo with OpenRC
Add the following line to
rasdaemon to the default run-level and start it
# rc-config add rasdaemon default
# rc-config start rasdaemon
Configuring DIMM labels
At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. From now on I will use data from my machine to give an example of the output.
# ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#2channel#0    0       0
mc#0csrow#2channel#1    0       0
mc#0csrow#3channel#1    0       0
mc#0csrow#3channel#0    0       0
The CE column represents the number of corrected errors for a given DIMM, while UE represents the uncorrectable errors that were detected. The label on the left shows the EDAC path of every DIMM under /sys/devices/system/edac/mc/.
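On a machine with many DIMMs it can be handy to reduce this output to the entries that actually have errors. The following is only a minimal sketch and not part of rasdaemon: `dimms_with_errors` is a hypothetical helper name, and it assumes the Label/CE/UE column layout shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of rasdaemon.
# Reads `ras-mc-ctl --error-count` output on stdin and prints only the
# entries whose CE or UE counter is non-zero.
# Assumption: the last two fields of every data line are the CE and UE counts.
dimms_with_errors() {
    awk '
        # Skip the "Label CE UE" header line.
        $1 == "Label" && $NF == "UE" { next }
        NF >= 3 {
            ce = $(NF - 1)
            ue = $NF
            if (ce + 0 > 0 || ue + 0 > 0) {
                # Everything before the two counters is the label.
                label = $1
                for (i = 2; i <= NF - 2; i++) label = label " " $i
                printf "%s CE=%s UE=%s\n", label, ce, ue
            }
        }
    '
}
```

Typical use would be `ras-mc-ctl --error-count | dimms_with_errors`; no output means that no errors have been counted so far.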
This is not very readable. Since the kernel has no idea of the physical layout of your motherboard it will print the EDAC paths instead of the names of the DIMM slots. We can confirm that the labels are missing with this command:
# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS
To identify which DIMM slot corresponds to which EDAC path you will have to reboot your system with only one DIMM inserted, write down the name of the slot you inserted it in and then print out the paths with
In my case this was the mapping:
mc#0csrow#0channel#0    DIMM_A1
mc#0csrow#0channel#1    DIMM_A2
mc#0csrow#1channel#1    DIMM_A2
mc#0csrow#1channel#0    DIMM_A1
mc#0csrow#2channel#0    DIMM_B1
mc#0csrow#2channel#1    DIMM_B2
mc#0csrow#3channel#1    DIMM_B2
mc#0csrow#3channel#0    DIMM_B1
Note that there's more than one path per DIMM label; that's fine.
With this data at hand create a text file under
You will need to fill it up with the mapping data in the following format:
Vendor: <motherboard vendor name>
Model: <motherboard model name>
<label>: <mc>.<row>.<channel>
You can obtain the motherboard vendor and model name with the following command:
# sudo ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS
The label lines take a string (the name of the physical DIMM slot), then the
numbers in the EDAC path corresponding to the physical slot. You can put
more than one label entry per line by separating them with a semicolon. If a
given label is associated with more than one EDAC path you can add the separate
<mc>.<row>.<channel> sequences by separating them with a comma.
In my case the resulting file (/etc/ras/dimm_labels.d/asus) looks like this:
Vendor: ASUSTeK COMPUTER INC.
Model: PRIME B450-PLUS
DIMM_A1: 0.0.0, 0.1.0;
DIMM_A2: 0.0.1, 0.1.1;
DIMM_B1: 0.2.0, 0.3.0;
DIMM_B2: 0.2.1, 0.3.1;
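If you have many slots, writing the label entries by hand gets tedious. Here is a rough sketch of how they could be generated from the path-to-slot mapping collected earlier. `make_dimm_labels` is a hypothetical helper name, not a rasdaemon tool, and it assumes one `<EDAC path> <slot name>` pair per input line, with slot names that contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of rasdaemon.
# Usage: make_dimm_labels VENDOR MODEL < mapping.txt
# Each input line is assumed to hold an EDAC path and a slot name, e.g.
#   mc#0csrow#1channel#0 DIMM_A1
make_dimm_labels() {
    printf 'Vendor: %s\nModel: %s\n' "$1" "$2"
    awk '
        NF >= 2 {
            path = $1
            label = $2
            # Reduce "mc#0csrow#1channel#0" to its three numbers.
            gsub(/[^0-9]+/, " ", path)
            if (split(path, num, " ") != 3) next
            loc = num[1] "." num[2] "." num[3]
            if (label in locs) {
                # Further paths for a known slot are comma-separated.
                locs[label] = locs[label] ", " loc
            } else {
                order[++count] = label
                locs[label] = loc
            }
        }
        END {
            # Emit one entry per slot, in order of first appearance.
            for (i = 1; i <= count; i++)
                printf "%s: %s;\n", order[i], locs[order[i]]
        }
    '
}
```

With the mapping saved to a file (`mapping.txt` is a made-up name), running `make_dimm_labels 'ASUSTeK COMPUTER INC.' 'PRIME B450-PLUS' < mapping.txt` should give the same entries as the hand-written file above. Review the result before copying it into place.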
You can find another example of this, with configuration entries for a bunch of other motherboards, in the edac-utils repo.
Once the file is ready it's time to load the labels in the kernel with the following command:
# ras-mc-ctl --register-labels
Printing out labels and error counts will now use the physical DIMM slot names. This is much better if you need to figure out which of your DIMMs is faulty and needs to be replaced:
# ras-mc-ctl --print-labels
LOCATION                 CONFIGURED LABEL    SYSFS CONTENTS
                         DIMM_A1             0:0:0 missing
                         DIMM_A2             0:0:1 missing
                         DIMM_A1             0:1:0 missing
                         DIMM_A2             0:1:1 missing
mc0 csrow 2 channel 0    DIMM_B1             DIMM_B1
mc0 csrow 2 channel 1    DIMM_B2             DIMM_B2
mc0 csrow 3 channel 0    DIMM_B1             DIMM_B1
mc0 csrow 3 channel 1    DIMM_B2             DIMM_B2
# ras-mc-ctl --error-count
Label      CE      UE
DIMM_B2    0       0
DIMM_B1    0       0
DIMM_B1    0       0
DIMM_B2    0       0
To persist the DIMM names across reboots, load the ras-mc-ctl service at startup:
Debian/Ubuntu/Fedora and other systemd-based distros
# systemctl enable ras-mc-ctl
# systemctl start ras-mc-ctl
Gentoo with OpenRC
# rc-config add ras-mc-ctl default
# rc-config start ras-mc-ctl
You're done! After rebooting your system rasdaemon will be continually running and recording errors. You can use ras-mc-ctl to print out a summary of all the errors that have been seen and recorded. Since the counts are stored on disk they will persist across reboots. Here's some example output from my machine:
# ras-mc-ctl --summary
Memory controller events summary:
    Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5

PCIe AER events summary:
    1 Uncorrected (Non-Fatal) errors: BIT21

No Extlog errors.

No devlink errors.

Disk errors summary:
    0:0 has 6646 errors

No MCE errors.
ras-mc-ctl --status prints out "ras-mc-ctl: drivers are not loaded"
For rasdaemon to work the EDAC kernel drivers for your particular machine need to be loaded. They are usually loaded automatically at boot. You can check out which ones are loaded with this command:
# lsmod | grep edac
amd64_edac_mod         32768  0
edac_mce_amd           28672  1 amd64_edac_mod
If the EDAC drivers haven't been loaded automatically either your kernel doesn't provide one for your machine or you need to manually load it. Check the EDAC kernel documentation for more details.
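Besides lsmod you can also look at sysfs directly: once an EDAC driver has registered a memory controller, it appears as an mcN directory under /sys/devices/system/edac/mc/. The sketch below lists those directories. `list_memory_controllers` is a hypothetical helper name, and its optional directory argument exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of rasdaemon.
# Prints the memory controllers registered with EDAC (mc0, mc1, ...) and
# returns non-zero when there are none, which usually means that no EDAC
# driver is loaded. The optional argument overrides the sysfs directory.
list_memory_controllers() {
    edac_dir=${1:-/sys/devices/system/edac/mc}
    status=1
    for mc in "$edac_dir"/mc[0-9]*; do
        # An unmatched glob is left as-is, so check that the entry exists.
        [ -d "$mc" ] || continue
        basename "$mc"
        status=0
    done
    return "$status"
}
```

For example, `list_memory_controllers || echo "no EDAC memory controllers found"` prints the warning only when nothing is registered.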