When is a SMART failure not an impending disk failure?

My old TrueNAS has been playing up.  It dropped a drive, rebooted, found its old mate, and resilvered it but afterwards still had errors:

  1. One or more devices has experienced an unrecoverable error.
  2. Pool state is DEGRADED: One or more devices are faulted in response to persistent errors.
  3. Disk ATA WDC WD40EFRX-68N is FAULTED
  4. The system had an unscheduled system reboot.
  5. Pool state is ONLINE: One or more devices is currently being resilvered.
  6. Pool state is ONLINE: One or more devices has experienced an unrecoverable error.

A few years back my Intel “server grade” motherboard died 6 months out of warranty and Intel were spectacularly unhelpful about it.  Gumtree to the rescue with a second hand AMD Phenom CPU and motherboard, which were at least 10 years old when I bought them and have run beautifully for another 2+ years.  However, the reboots are getting more frequent, so the old box is finally getting a bit long in the tooth.  Plus, most likely, a dodgy drive?

Time to build a new NAS.  No point screwing about: AMD Ryzen 5 3600, ASRock X570 motherboard, 16GB ECC, 4U rackmount case, 10 hotswap bays.  Treat yo’self!

I also bought 5 x used 4TB SAS drives for the new build.  The seller seemed reputable enough, so it was worth a gamble on cheap storage.  I already had an HBA that lets me use up to 8 SATA or SAS drives in addition to the 6 x SATA connectors on the motherboard.

The first used drive I plugged in passed the SMART short test, then failed the SMART long test after only a couple of minutes (it should take 10 hours to complete).  Consistently.  On the same LBA.  Uh oh.

Well, fine.  Let’s be doubly sure.  Badblocks will surely tell me just how ruined this drive is.  Load it up with 5 passes (random, 0x55, 0xaa, 0xff, 0x00).  That’ll show me media errors.  Fast forward 5 days.  No errors.
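
For reference, a five-pattern run like that could be assembled roughly as below.  This is only a sketch under assumptions: the device path `/dev/sdX`, the 4096-byte block size, the destructive write mode (`-w`) and the helper name `badblocks_command` are all illustrative, not the exact invocation I used.  The script only prints the command, because `-w` wipes the whole drive.

```python
import shlex

def badblocks_command(device, patterns, block_size=4096):
    """Build (but do not run) a badblocks write-mode command line."""
    # -w: destructive write-mode test, -s: show progress, -v: verbose
    argv = ["badblocks", "-w", "-s", "-v", "-b", str(block_size)]
    for pattern in patterns:
        # one -t option per test pattern, in the order they should run
        argv += ["-t", pattern]
    argv.append(device)
    return argv

# The five passes described above; /dev/sdX is a placeholder device.
PATTERNS = ["random", "0x55", "0xaa", "0xff", "0x00"]
cmd = badblocks_command("/dev/sdX", PATTERNS)
print(shlex.join(cmd))
```

Double check the device name before running anything like this for real; it will happily destroy the wrong disk.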

Run SMART long again.  No errors.

So...  this drive obviously had something Really Bad(tm) on one of the sectors.  But NOT a drive problem.  Maybe power got pulled during a write or something, who knows.  My best guess is that badblocks overwriting every sector, including the bad one, gave the drive the chance to rewrite it cleanly.  Either way it’s fixed, and I can’t fault it.  5 passes of badblocks and a followup SMART long are perfect.  SMART also reports 0 elements in the grown defect list (which is the measure SAS drives use instead of reallocated sectors in SATA land).
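
If you want to check that grown defect count from a script, something like the following sketch would do.  The `SAMPLE` text is made up to mimic the relevant lines of `smartctl -a` output for a SAS drive (it is not captured from this drive), and the function name `grown_defects` is hypothetical.

```python
import re

# Illustrative text in the style of `smartctl -a` on a SAS drive.
# This is an assumed sample, not real output from the drive in this post.
SAMPLE = """\
SMART Health Status: OK
Elements in grown defect list: 0
"""

def grown_defects(smartctl_text):
    """Return the grown defect count, or None if the line is missing."""
    match = re.search(
        r"^Elements in grown defect list:\s*(\d+)",
        smartctl_text,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None

count = grown_defects(SAMPLE)
print("grown defects:", count)
```

On a real system you would feed it the text output of `smartctl -a` for the drive instead of the sample string.  A count that starts climbing is the thing to worry about.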

I trust this drive: it has had 5 passes of badblocks with no errors, passed the SMART long test twice, and is recording no defects on the media.  SMART can be dumb sometimes!