CPU 2: Machine Check Exception: 4 Bank 4: f61c2001ba080813

Last updated on September 12, 2008

A real, honest, good failure while running Breakin on banstyle.nuxx.net. It points to something being wrong with the second CPU or bank of memory.

In testing, my server banstyle.nuxx.net has had its first real set of errors and failures. This is a good thing.

First, last night I started getting SMART warnings about bad blocks on ad6, which is the second hard drive. So today I just went ahead and ordered up a pair of ST3500320AS 500GB disks and a 3ware 8006-2LP, the same as is used in my current server.

Note the sdb errors, which are consistent with the other errors I’d been seeing indicating a bad block on the second hard disk.

Second, I came home today and found my server hung while running Breakin, displaying the error CPU 2: Machine Check Exception: 4 Bank 4: f61c2001ba080813 TSC 2561d00c4ef7 ADDR ce19fd00. So, at least I’ve got some place to look for what else might be the issue.
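For anyone who wants to feed a line like this into other tools, here is a minimal sketch of pulling it apart in Python. The regular expression is written only against the one message quoted above, and the field names (`mcg_status`, `status`, `tsc`, `addr`) are my own labels; in particular, treating the number after "Machine Check Exception:" as a global status value is an assumption.

```python
import re

# Matches only the console format quoted in this post; other kernels or
# versions may print a different layout.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: (?P<mcg_status>[0-9a-fA-F]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
    r"(?: TSC (?P<tsc>[0-9a-fA-F]+))?"
    r"(?: ADDR (?P<addr>[0-9a-fA-F]+))?"
)

def parse_mce_line(line):
    """Return the fields of an MCE console line as integers, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    # CPU and bank are printed in decimal, everything else in hex.
    fields = {"cpu": int(match.group("cpu")), "bank": int(match.group("bank"))}
    for name in ("mcg_status", "status", "tsc", "addr"):
        value = match.group(name)
        fields[name] = int(value, 16) if value is not None else None
    return fields

line = ("CPU 2: Machine Check Exception: 4 Bank 4: f61c2001ba080813 "
        "TSC 2561d00c4ef7 ADDR ce19fd00")
for key, value in parse_mce_line(line).items():
    print(key, hex(value))
```

The TSC and ADDR groups are optional so that a line without them still parses, with those fields set to `None`.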

This error was decoded by AMD’s Machine Check Analysis Tool (MCAT). Since the machine contains Opteron 885 CPUs which are in the AMD K10 family, the /gh flag is used:

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf41c2000ba080a13 0xce19fd00 0x2561d00c4ef7
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F41C2000 BA080A13
Error Address (0x): 00000000 CE19FD00
Error Misc    (0x): 00002561 D00C4EF7
Status Bit Decode :
   Uncorrectable ECC error
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 3297 MB rage
   Syndrome  (0x): BA38
   Address decode: 00000000CE19FD00
      Node ID: 5
      Channel Select: 0
      Chip Select: 0
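The "Status Bit Decode" and "Error Code" sections above can be reproduced by hand from the 64-bit status word. The following is a rough Python sketch, not a replacement for MCAT: it assumes the usual machine-check status layout (flag bits 63 down to 57, error code in the low 16 bits) and the bus error code pattern 0000 1PPT RRRR IILL. The function and table names are made up for illustration.

```python
# Flag bits at the top of the 64-bit status word (assumed standard layout).
STATUS_FLAGS = [
    (63, "Error valid"),
    (62, "Error overflow"),
    (61, "Error uncorrected"),
    (60, "Error enable"),
    (59, "Misc register valid"),
    (58, "Error address valid"),
    (57, "Processor context corrupt"),
]

# Field encodings for a bus error code of the form 0000 1PPT RRRR IILL.
PP = {0: "SRC", 1: "RES", 2: "OBS", 3: "GEN"}
RRRR = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "MEM", 2: "IO", 3: "GEN"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}

def decode_status(status):
    """Decode the generic flags and, if present, the bus error fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    decoded = {"flags": flags, "error_code": code, "bus": None}
    # Bit 11 set with bits 15..12 clear marks a bus error code.
    if (code & 0xF800) == 0x0800:
        decoded["bus"] = {
            "pp": PP[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 1),
            "rrrr": RRRR.get((code >> 4) & 0xF, "reserved"),
            "ii": II.get((code >> 2) & 0x3, "reserved"),
            "ll": LL[code & 0x3],
        }
    return decoded

result = decode_status(0xF41C2000BA080A13)
print(result["flags"])
print(hex(result["error_code"]), result["bus"])
```

For the status value passed to MCAT above, this yields the same five generic flags and the same RES / RD / MEM / LG fields that the tool prints.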

I then swapped the first and fourth DIMMs which are connected to the first CPU, then received this error after a few more hours of running Breakin:

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf41c2000ba080a13 0x8589bd00 0x147c1f963903
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F41C2000 BA080A13
Error Address (0x): 00000000 8589BD00
Error Misc    (0x): 0000147C 1F963903
Status Bit Decode :
   Uncorrectable ECC error
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 2136 MB rage
   Syndrome  (0x): BA38
   Address decode: 000000008589BD00
      Node ID: 5
      Channel Select: 0
      Chip Select: 0

I’ve got interleaving enabled, and I suspect that with it the DIMMs are being used in some sort of balanced fashion, round-robin or something like that. Given that, I would have expected the swap to move the error from the upper 2048 MB of that CPU’s RAM to the lower 2048 MB, but it didn’t. It’s very likely that I’m wrong in this thinking, though.
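To make the round-robin idea concrete, here is a deliberately simplified toy model in Python. It is not how this board's memory controller actually maps addresses: the 1024 MB DIMM size, the four-DIMM count, and the 4 KiB interleave granularity are all assumptions chosen for illustration, and `dimm_for_address` is a hypothetical helper.

```python
MB = 1 << 20

def dimm_for_address(addr, dimm_size, n_dimms, interleave_bytes=None):
    """Toy mapping from an offset in one CPU's RAM to a DIMM index.

    With interleave_bytes=None the DIMMs are laid out one after another.
    Otherwise consecutive chunks of interleave_bytes rotate across DIMMs.
    """
    if not 0 <= addr < dimm_size * n_dimms:
        raise ValueError("address outside the modelled range")
    if interleave_bytes is None:
        # Linear layout: DIMM 0 holds the first dimm_size bytes, and so on.
        return addr // dimm_size
    # Round-robin layout: chunk 0 -> DIMM 0, chunk 1 -> DIMM 1, ...
    return (addr // interleave_bytes) % n_dimms

# Compare the two layouts for the first few 4 KiB chunks.
for chunk in range(4):
    addr = chunk * 4096
    linear = dimm_for_address(addr, 1024 * MB, 4)
    rotated = dimm_for_address(addr, 1024 * MB, 4, interleave_bytes=4096)
    print(hex(addr), "linear:", linear, "interleaved:", rotated)
```

In the linear layout all four chunks land on DIMM 0, while in the interleaved layout they land on DIMMs 0 through 3 in turn. In this toy model, then, a fault in one interleaved DIMM would show up at addresses scattered across the whole range rather than in one half of it, so the failing address alone says little about which slot holds the bad module.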

So, tomorrow morning before work I’ll turn off interleaving, run the test again, and try to narrow it down to a single DIMM. If I can do that I’ll then move it over to the other CPU and see if it moves. Hopefully the error will appear in a manner which I can more easily narrow down to a single DIMM.

This third MCE, with Bank Interleaving set to Auto and Node Interleaving set to Disabled, resulted in the following:

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf61c2001ba080813 0x1b7ae9d00 0x4e7b5766b77b
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F61C2001 BA080813
Error Address (0x): 00000001 B7AE9D00
Error Misc    (0x): 00004E7B 5766B77B
Status Bit Decode :
   Error associated with CPU core 0
   Uncorrectable ECC error
   Processor context corrupt
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0813
   Error Type - Bus
   Participation Processor (PP) - Local node originated the request (SRC)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 7034 MB rage
   Syndrome  (0x): BA38
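The northbridge-specific parts of this decode (the ECC flag, the "CPU core 0" line, and the syndrome) also come out of the status word. The Python sketch below uses bit positions that I am assuming from AMD's northbridge status layout; they do reproduce the MCAT output for the values in this post, but treat them as unverified for any other processor. `decode_nb_fields` is a name invented here.

```python
def decode_nb_fields(status):
    """Extract northbridge (bank 4) fields from a 64-bit status word.

    Assumed layout: bit 45 uncorrectable ECC, bit 46 correctable ECC,
    bits 35..32 one flag per associated core, syndrome split across
    bits 54..47 (low byte) and bits 31..24 (high byte).
    """
    syndrome_low = (status >> 47) & 0xFF
    syndrome_high = (status >> 24) & 0xFF
    core_mask = (status >> 32) & 0xF
    return {
        "uncorrectable_ecc": bool((status >> 45) & 1),
        "correctable_ecc": bool((status >> 46) & 1),
        "syndrome": (syndrome_high << 8) | syndrome_low,
        "cores": [core for core in range(4) if (core_mask >> core) & 1],
    }

fields = decode_nb_fields(0xF61C2001BA080813)
print(hex(fields["syndrome"]), fields["cores"], fields["uncorrectable_ecc"])
```

Run against the status value from this third error, the sketch gives syndrome 0xBA38, core 0, and the uncorrectable ECC flag set, matching the tool's output above.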

Tomorrow I shall try to duplicate this result and memory location. If I can, I’ll begin shuffling parts around to see if the failure address moves. If I can’t, I’ll investigate the CPU.

This fourth MCE was generated with the same settings as the third, except with different disks (Seagate 500GB) and a new RAID controller (3ware 8006-2LP in JBOD mode):

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf41c2000ba080a13 0x1250f2d00 0xe0eb9e12d8e
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F41C2000 BA080A13
Error Address (0x): 00000001 250F2D00
Error Misc    (0x): 00000E0E B9E12D8E
Status Bit Decode :
   Uncorrectable ECC error
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 4688 MB rage
   Syndrome  (0x): BA38

This fifth MCE was generated with Bank and Node Interleaving both disabled:

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf41c2000ba080a13 0x1e224fd00 0x48e38a5ecb7
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F41C2000 BA080A13
Error Address (0x): 00000001 E224FD00
Error Misc    (0x): 0000048E 38A5ECB7
Status Bit Decode :
   Uncorrectable ECC error
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0A13
   Error Type - Bus
   Participation Processor (PP) - Local node responded to the request (RES)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 7714 MB rage
   Syndrome  (0x): BA38

The sixth MCE was generated under the same conditions as #5, with both Node and Bank Interleaving disabled:

C:\Program Files\AMD\MCat>mcat /gh /cmd 4 0xf61c2001ba080813 0x1a78d9d00 0x42f9116690d7
Processor Number  : 0
Bank Number       : 4
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): F61C2001 BA080813
Error Address (0x): 00000001 A78D9D00
Error Misc    (0x): 000042F9 116690D7
Status Bit Decode :
   Error associated with CPU core 0
   Uncorrectable ECC error
   Processor context corrupt
   Error address valid
   Error enable
   Error uncorrected
   Error overflow
   Error valid
Error Code    (0x): 0813
   Error Type - Bus
   Participation Processor (PP) - Local node originated the request (SRC)
   Timeout (T) - Request did not time out
   Memory Transaction Type (RRRR) - Generic read (RD)
   Memory or IO (II) - Memory Access (MEM)
   Cache Level (LL) - Generic, includes L3 cache (LG)
Bank 4 North Bridge Errors:
   ECC Error - DRAM ECC error detected in the NB.
   Error address at 6776 MB rage
   Syndrome  (0x): BA38
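With six errors recorded, it helps to line the addresses up side by side. The short Python sketch below recomputes the "Error address at N MB" figures from the raw addresses passed to MCAT (shifting right by 20 bits divides by 1 MiB) and also checks what the addresses have in common in their low 12 bits. The list and helper names are made up for illustration, and the sketch only reports numbers; it does not say which DIMM is at fault.

```python
# Error addresses from the six MCAT runs in this post, in order.
ERROR_ADDRESSES = [
    0xCE19FD00,
    0x8589BD00,
    0x1B7AE9D00,
    0x1250F2D00,
    0x1E224FD00,
    0x1A78D9D00,
]

def address_to_mb(addr):
    """Whole mebibytes below the address (addr divided by 2**20, floored)."""
    return addr >> 20

mb_offsets = [address_to_mb(addr) for addr in ERROR_ADDRESSES]
# Low 12 bits of each address, i.e. the offset inside a 4 KiB page.
page_offsets = {addr & 0xFFF for addr in ERROR_ADDRESSES}

for number, (addr, mb) in enumerate(zip(ERROR_ADDRESSES, mb_offsets), start=1):
    print(f"MCE {number}: {addr:#012x} is in the {mb} MB range")
print("distinct low-12-bit offsets:", sorted(hex(o) for o in page_offsets))
```

The MB values agree with the six decodes (3297, 2136, 7034, 4688, 7714, 6776), so the failures are spread widely through memory. All six addresses do share the same low 12 bits, 0xd00, though whether that is meaningful or just an artifact of how the address is reported is not something this sketch can establish.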

2 Comments

  1. Paul, September 9, 2008

    Good luck banstyle, you’ll pull through, we miss you!