

[Linux] EDAC (Error Detection And Correction) Messages in the System Log



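EDAC stands for Error Detection And Correction. It is the Linux kernel subsystem that reports hardware errors such as ECC memory errors.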

Hardware environment

$ dmidecode -t system
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0100, DMI type 1, 27 bytes
System Information
	Manufacturer: HP
	Product Name: ProLiant DL380 G7
	Version: Not Specified
	Serial Number: SXXXXXXXXA
	UUID: 39444835-7926-4753-1346-64344631364E
	Wake-up Type: Power Switch
	SKU Number: XXXXXX-B21
	Family: ProLiant

Handle 0x2000, DMI type 32, 11 bytes
System Boot Information
	Status: No errors detected

Operating system environment

$ cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)

$ getconf LONG_BIT
64

$ uname -r
3.10.0-1062.18.1.el7.x86_64

System log (/var/log/messages)

kernel: mce: [Hardware Error]: Machine check events logged
kernel: EDAC MC0: 1 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

kernel: EDAC MC0: 1 CE error on CPU#0Channel#0_DIMM#0

- EDAC: error detection and correction

- MC: memory controller

- CE: correctable error

- DIMM: dual in-line memory module
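
When many of these messages accumulate, it can help to total them per memory controller and DIMM label. The following is a minimal sketch, assuming the message format shown in the log excerpt above ("EDAC MC<n>: <count> CE error on <label>"); other kernel versions may word the message differently. The function name edac_summary is hypothetical, and UE (uncorrectable error) is matched as well, on the assumption that it is logged in the same shape.

```shell
#!/usr/bin/env bash
# edac_summary (hypothetical helper): reads log lines on stdin and prints
# "<controller> <type> <label> <total count>" for each distinct DIMM label.
# Assumes the format "EDAC MC<n>: <count> <CE|UE> error on <label>".
edac_summary() {
  sed -n -E 's/.*EDAC (MC[0-9]+): ([0-9]+) (CE|UE) error on ([^ ]+).*/\1 \3 \4 \2/p' |
    awk '{ total[$1 " " $2 " " $3] += $4 }          # sum counts per controller/type/label
         END { for (k in total) print k, total[k] }' |
    sort
}
```

On a live system it would be fed from the log, for example: edac_summary < /var/log/messages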

Checking the error counts and locating the faulty memory slot

$ grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:4
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
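
The grep above prints every counter, including the ones that are still zero. As a rough sketch, the loop below reports only the non-zero counters and, where the driver exposes it, the DIMM label from the neighbouring chN_dimm_label file. The function name edac_nonzero_ce and its optional root argument are hypothetical; the label file is assumed to sit next to each counter, and "unknown" is printed when it does not.

```shell
#!/usr/bin/env bash
# edac_nonzero_ce (hypothetical helper): prints "<counter path> <count> <label>"
# for every per-channel CE counter that is greater than zero.
# The optional argument overrides the sysfs root (useful for trying it on a copy).
edac_nonzero_ce() {
  local root="${1:-/sys/devices/system/edac/mc}"
  local f count label label_file
  for f in "$root"/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] || continue                      # skip unmatched globs and unreadable files
    count=$(cat "$f")
    [ "$count" -gt 0 ] 2>/dev/null || continue   # ignore zero or non-numeric values
    label_file="${f%_ce_count}_dimm_label"       # assumed sibling file holding the DIMM label
    label="unknown"
    [ -r "$label_file" ] && label=$(cat "$label_file")
    printf '%s %s %s\n' "${f#"$root"/}" "$count" "$label"
  done
  return 0
}
```

Called without arguments, edac_nonzero_ce reads the live sysfs tree.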

Identifying the memory slot location (dmidecode command)

$ dmidecode -t memory | grep -v "Size: No Module Installed" | grep -C 3 -i Size
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: 3
	Locator: PROC 1 DIMM 3A
--
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 8192 MB
	Form Factor: DIMM
	Set: 12
	Locator: PROC 2 DIMM 3A
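
The grep pipeline above leaves a lot of surrounding context. As a compact alternative, the sketch below reduces the dmidecode output to one line per populated slot. It assumes, as in the output above, that the Size field appears before the Locator field within each memory device entry; the function name dimm_slots is hypothetical.

```shell
#!/usr/bin/env bash
# dimm_slots (hypothetical helper): reads `dmidecode -t memory` output on stdin
# and prints "<Locator>: <Size>" for every slot that has a module installed.
dimm_slots() {
  awk -F': ' '
    /^Memory Device/   { size = "" }      # start of a new device entry
    /^[ \t]*Size:/     { size = $2 }      # remember the size for this entry
    /^[ \t]*Locator:/  {                  # does not match "Bank Locator:"
      if (size != "" && size != "No Module Installed") print $2 ": " size
    }
  '
}
```

Typical use, run as root: dmidecode -t memory | dimm_slots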

Installing the EDAC utilities (edac-utils)

$ yum install -y libsysfs edac-utils

Running the edac-util command

$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU#0Channel#0_DIMM#0: 4 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU#1Channel#0_DIMM#0: 0 Corrected Errors
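
For periodic checks (from cron, for instance), the verbose report can be filtered down to the lines whose count is not zero. This is only a sketch that parses the text layout shown above; the function name edac_check is hypothetical, and the layout may differ between edac-utils versions.

```shell
#!/usr/bin/env bash
# edac_check (hypothetical helper): reads `edac-util -v` output on stdin,
# prints every line reporting a non-zero error count, and returns 1 if any
# such line was found (0 otherwise).
edac_check() {
  awk -F': ' '
    $NF ~ /^[1-9][0-9]* (Corrected|Uncorrected) Errors/ { print; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

Typical use: edac-util -v | edac_check. With the report above it would print only the mc0 line for CPU#0Channel#0_DIMM#0 and return a non-zero status.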

References

- https://www.kernel.org/doc/html/v5.0/admin-guide/ras.html
