NAME
mcelog.triggers - mcelog trigger scripts reference
SYNOPSIS
/etc/mcelog/bus-error-trigger
/etc/mcelog/cache-error-trigger
/etc/mcelog/dimm-error-trigger
/etc/mcelog/iomca-error-trigger
/etc/mcelog/page-error-trigger
/etc/mcelog/socket-memory-error-trigger
/etc/mcelog/unknown-error-trigger
DESCRIPTION
mcelog(8) maintains error thresholds using a leaky-bucket algorithm. When the
number of errors in a specific time window exceeds a pre-configured threshold,
a trigger is executed. Triggers are usually shell scripts in the /etc/mcelog
directory, but can also be other internal actions. Thresholds and triggers are
configured in mcelog.conf(5).
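The leaky-bucket thresholding mentioned above can be illustrated with a small shell sketch. This is a simplified model for intuition only, not mcelog's actual implementation: the names CAPACITY, AGETIME and bucket_account are hypothetical, and the bucket here drains in whole periods rather than continuously.

```shell
#!/bin/sh
# Simplified leaky-bucket counter (illustrative; not mcelog's code).
# Hypothetical threshold: 10 events per 24 hours.
CAPACITY=10
AGETIME=86400

bucket_count=0
bucket_tstamp=0

# bucket_account NOW INC
# Drain the bucket for every full AGETIME period that has elapsed,
# add INC new events, and return success once CAPACITY is reached.
bucket_account() {
    now=$1
    inc=$2
    elapsed=$((now - bucket_tstamp))
    if [ "$elapsed" -ge "$AGETIME" ]; then
        periods=$((elapsed / AGETIME))
        bucket_count=$((bucket_count - CAPACITY * periods))
        if [ "$bucket_count" -lt 0 ]; then
            bucket_count=0
        fi
        bucket_tstamp=$now
    fi
    bucket_count=$((bucket_count + inc))
    [ "$bucket_count" -ge "$CAPACITY" ]
}
```

Under these assumptions, a caller would account one event at a time, for example with the current time in seconds as NOW, and run the trigger when the function returns success. Errors that arrive slowly never fill the bucket, because it drains between them.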
Triggers run as the user configured for mcelog in mcelog.conf, by default
root. The default trigger action can be overridden by specifying a different
trigger script in the configuration file. Actions in addition to the default
trigger (such as notifying an administrator) can be put into the respective
/etc/mcelog/*.local script, which is executed after the default action. This
allows the default scripts to be updated without overwriting local actions.
All trigger actions are also logged to syslog.
The DIMM and socket memory error triggers
The /etc/mcelog/dimm-error-trigger and /etc/mcelog/socket-memory-error-trigger
scripts are executed when a DIMM or a CPU socket exceeds a configured
corrected or uncorrected memory error threshold. The thresholds are configured
in the mcelog.conf [dimm] and [socket] sections. The default triggers log a
warning message in the system log. The triggers are only executed when mcelog
runs as a daemon.
Arguments are passed as environment variables:
THRESHOLD        Human-readable threshold status
MESSAGE          Human-readable consolidated error message
TOTALCOUNT       Total corrected or uncorrected error count for the current
                 DIMM, depending on what triggered the event
LOCATION         Consolidated location as a single string
DMI_LOCATION     DIMM location from DMI/SMBIOS, if available
DMI_NAME         DIMM identifier from DMI/SMBIOS, if available
DIMM             DIMM number reported by hardware
CHANNEL          Channel number reported by hardware
SOCKETID         Socket ID of the CPU that includes the memory controller
                 with the DIMM
CECOUNT          Total corrected error count for the DIMM
UCCOUNT          Total uncorrected error count for the DIMM
LASTEVENT        Time stamp of the event that triggered the threshold (in
                 time_t format, seconds)
THRESHOLD_COUNT  Total number of events of the specific type in the current
                 threshold time period
After the default action, the local actions in
/etc/mcelog/dimm-error-trigger.local or
/etc/mcelog/socket-memory-error-trigger.local, respectively, are executed.
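As an illustration of a local action, the following sketch of a dimm-error-trigger.local script builds a one-line summary from the environment variables listed above and logs it with logger(1). The function names, the syslog tag and the choice of priorities are assumptions made for this example; they are not part of mcelog.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/dimm-error-trigger.local (illustrative sketch).

# Build a one-line summary from the documented variables.
# Unset variables fall back to placeholders.
dimm_summary() {
    printf '%s: location=%s ce=%s uc=%s window=%s threshold=%s\n' \
        "${DMI_NAME:-DIMM ${DIMM:-?}}" \
        "${LOCATION:-unknown}" \
        "${CECOUNT:-0}" "${UCCOUNT:-0}" \
        "${THRESHOLD_COUNT:-0}" "${THRESHOLD:-unknown}"
}

# Treat uncorrected errors as more urgent than corrected ones.
# This is a local policy choice, not something mcelog prescribes.
dimm_priority() {
    if [ "${UCCOUNT:-0}" -gt 0 ]; then
        echo daemon.crit
    else
        echo daemon.warning
    fi
}

# Act only when a message was supplied and logger(1) is available.
if [ -n "${MESSAGE:-}" ] && command -v logger >/dev/null 2>&1; then
    logger -t mcelog-local -p "$(dimm_priority)" "$(dimm_summary)"
fi
```

The same approach should carry over to socket-memory-error-trigger.local, since both triggers receive the same variables.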
The page error trigger
The /etc/mcelog/page-error-trigger script is executed by mcelog in daemon mode
when a page in memory exceeds a pre-configured corrected or uncorrected error
threshold. mcelog also implements offlining the page through the kernel
internally. This is configured through the [page] section of mcelog.conf(5).
The environment arguments are the same as for the dimm-error-trigger script.
After the default action, the local actions in
/etc/mcelog/page-error-trigger.local are executed.
The cache error trigger
The /etc/mcelog/cache-error-trigger shell script is called for cache error
handling in daemon mode when a CPU reports excessive corrected cache errors.
This could be an indication of future uncorrected errors.
This trigger is configured through the [cache] section in the mcelog.conf(5)
configuration file. The threshold is defined by the CPU.
The default trigger offlines the affected CPU cores, unless it is the last
core running.
Arguments are passed as environment variables:
MESSAGE          Human-readable error message
CPU              Linux CPU number that triggered the error
LEVEL            Cache level affected by the error
TYPE             Cache type affected by the error (Data, Instruction,
                 Generic)
AFFECTED_CPUS    List of CPUs sharing the affected cache
SOCKETID         Socket ID of the affected CPU
After the default action, the local actions in
/etc/mcelog/cache-error-trigger.local are executed.
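Because the default cache trigger offlines CPU cores, a local action might want to report which of the affected CPUs are still online afterwards. The sketch below assumes that AFFECTED_CPUS is a whitespace-separated list of CPU numbers and that CPU state is exposed under /sys/devices/system/cpu; the CPU_SYSFS_DIR variable and the function name are hypothetical.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/cache-error-trigger.local (illustrative sketch).

# Assumption: per-CPU state lives here; overridable for testing.
CPU_SYSFS_DIR=${CPU_SYSFS_DIR:-/sys/devices/system/cpu}

# Print the CPUs from AFFECTED_CPUS whose "online" file reads 1.
# CPUs without a readable "online" file (often cpu0) are skipped.
still_online_cpus() {
    online=""
    for cpu in ${AFFECTED_CPUS:-}; do
        state_file="$CPU_SYSFS_DIR/cpu$cpu/online"
        if [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]; then
            online="${online:+$online }$cpu"
        fi
    done
    echo "$online"
}

# Log the result only when mcelog supplied a CPU list.
if [ -n "${AFFECTED_CPUS:-}" ] && command -v logger >/dev/null 2>&1; then
    logger -t mcelog-local -p daemon.warning \
        "cache error (level ${LEVEL:-?}, type ${TYPE:-?}): CPUs still online: $(still_online_cpus)"
fi
```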
The bus-uc-threshold-trigger
The bus-uc-threshold-trigger runs on uncorrected errors on an I/O bus. It is
configured through the bus-uc-threshold-trigger and
bus-uc-threshold-trigger-threshold options in /etc/mcelog.conf. By default it
logs a message with the error location to the system log. After the default
action, the local actions in /etc/mcelog/bus-uc-error-trigger.local are
executed.
Arguments are passed as environment variables:
MESSAGE          Human-readable consolidated error message
LOCATION         Consolidated location as a single string
SOCKETID         Socket ID of the CPU that includes the memory controller
                 with the DIMM
LEVEL            Interconnect level
PARTICIPATION    Processor participation (Originator, Responder or Observer)
REQUEST          Request type (read, write, prefetch, etc.)
ORIGIN           Memory or I/O
TIMEOUT          Whether the request timed out
The iomca-error-trigger
The iomca-error-trigger runs when a socket receives bus or interconnect
errors. It is configured through the iomca-error-trigger and
iomca-error-trigger-threshold options in /etc/mcelog.conf. By default it logs
a message with the error location to the system log. After the default action,
the local actions in /etc/mcelog/iomca-error-trigger.local are executed.
Arguments are passed as environment variables:
MESSAGE          Human-readable consolidated error message
LOCATION         Consolidated location as a single string
SOCKETID         Socket ID of the CPU that includes the memory controller
                 with the DIMM
CPU              Linux CPU number that triggered the error
SET              PCI segment number
BUS              PCI bus number
DEVICE           PCI device number
FUNCTION         PCI function number
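The four PCI variables above can be combined into the usual segment:bus:device.function notation, which makes it easier to match the error against other tools' output. The following is a minimal sketch for iomca-error-trigger.local. It assumes the variables hold integers that printf(1) can parse (decimal, or hexadecimal with a 0x prefix); this page does not specify the number format, so check the actual values before relying on it. The function name and syslog tag are hypothetical.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/iomca-error-trigger.local (illustrative sketch).

# Format the PCI coordinates as segment:bus:device.function in hex.
# Assumption: SET, BUS, DEVICE and FUNCTION are integers that
# printf(1) can parse; missing values default to 0.
pci_address() {
    printf '%04x:%02x:%02x.%x\n' \
        "${SET:-0}" "${BUS:-0}" "${DEVICE:-0}" "${FUNCTION:-0}"
}

# Log the address only when mcelog supplied a bus number.
if [ -n "${BUS:-}" ] && command -v logger >/dev/null 2>&1; then
    logger -t mcelog-local -p daemon.warning \
        "IO MCA error at PCI $(pci_address) (socket ${SOCKETID:-?})"
fi
```

If lspci(8) is installed, the resulting string can be passed to its -s option to look up the device.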
The unknown-error-trigger
The unknown-error-trigger runs on any errors not otherwise categorized. It is
configured through the unknown-error-trigger and
unknown-error-trigger-threshold options in /etc/mcelog.conf. By default it
logs a message to the system log. After the default action, the local actions
in /etc/mcelog/unknown-error-trigger.local are executed.
Arguments are passed as environment variables:
MESSAGE          Human-readable consolidated error message
LOCATION         Consolidated location as a single string
SOCKETID         Socket ID of the CPU that includes the memory controller
                 with the DIMM
CPU              Linux CPU number that triggered the error
STATUS           IA32_MCi_STATUS register value
ADDR             IA32_MCi_ADDR register value
MISC             IA32_MCi_MISC register value
MCGSTATUS        IA32_MCG_STATUS register value
MCGCAP           IA32_MCG_CAP register value
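Since uncategorized errors usually need manual decoding, a local action could save the raw register values for later analysis. The sketch below appends one key=value record per event to a file. The RECORD_FILE path and the function name are made up for this example and are not mcelog defaults; the register values are written exactly as received.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/unknown-error-trigger.local (illustrative sketch).

# Assumption: an example path chosen for this sketch, not an mcelog default.
RECORD_FILE=${RECORD_FILE:-/var/log/mcelog-unknown-errors.log}

# Print the raw register values as a single key=value line.
register_record() {
    printf 'cpu=%s socket=%s status=%s addr=%s misc=%s mcgstatus=%s mcgcap=%s\n' \
        "${CPU:-?}" "${SOCKETID:-?}" "${STATUS:-?}" "${ADDR:-?}" \
        "${MISC:-?}" "${MCGSTATUS:-?}" "${MCGCAP:-?}"
}

# Append a record only when mcelog supplied a STATUS value.
if [ -n "${STATUS:-}" ]; then
    register_record >> "$RECORD_FILE"
fi
```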
SEE ALSO
http://www.mcelog.org
mcelog(8), mcelog.conf(5)