Linux HDD Diagnostics & Health Check

This guide outlines the steps to diagnose hard drive issues, ranging from system freezes (I/O Wait) to physical hardware failure.

Step 1: Is the Disk Slowing Down the System?

Before checking the disk physically, check if the disk is the bottleneck causing system lag or “freezes”.

Check I/O Wait (vmstat)

Run this command to print system activity once per second. The first line reports averages since boot, so judge by the lines that follow:

vmstat 1

What to look for:

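  • wa (Wait): The percentage of CPU time spent idle while waiting for I/O. Sustained values over 20-30% point to a storage bottleneck; 75% or more means the system is almost completely stalled on storage.
  • b (Blocked): The number of processes in uninterruptible sleep, usually waiting for I/O. If this stays above 0, processes are hanging.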
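The check above can be automated. The following is a minimal sketch, not a definitive tool: `flag_iowait` is a hypothetical helper name, and the default limit of 20 is simply the lower bound mentioned above. It locates the `b` and `wa` columns from the vmstat header instead of assuming fixed positions, because the number of CPU columns differs between vmstat versions.

```shell
# Hypothetical helper: reads `vmstat` output on stdin and prints a warning
# for every sample whose "wa" column is at or above the limit (default 20).
flag_iowait() {
    awk -v limit="${1:-20}" '
        # The header row starts with "r"; remember where "b" and "wa" sit.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "b")  bcol = i
                if ($i == "wa") wcol = i
            }
            next
        }
        # Data rows start with a number; ignore everything before the header.
        wcol && $1 ~ /^[0-9]+$/ && $wcol + 0 >= limit + 0 {
            printf "HIGH I/O WAIT: wa=%s%% blocked=%s\n", $wcol, $bcol
        }
    '
}

# Usage (ten samples, one per second):
#   vmstat 1 10 | flag_iowait 20
```

Because the first vmstat sample is an average since boot, a warning on that line alone is not meaningful.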

Check Kernel Logs (dmesg)

If the disk is disconnecting or timing out, the kernel will log it.

dmesg | grep -iE "error|fail|ata|scsi"

Red Flags:

  • ``ata1.00: failed to read native max address``
  • ``I/O error, dev sda, sector …``
  • ``task rsync:20263 blocked for more than 120 seconds``
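To search for only these three red flags rather than every line mentioning "ata" or "error", a narrower filter can help. This is a sketch under assumptions: `disk_red_flags` is a hypothetical name, and the patterns are taken from the example messages above, so other kernel versions may word them differently.

```shell
# Hypothetical helper: reads kernel log text on stdin and keeps only lines
# that match the three red-flag messages listed above.
disk_red_flags() {
    grep -iE 'failed to read native max address|I/O error, dev [a-z0-9]+|blocked for more than [0-9]+ seconds'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg | disk_red_flags
```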

Step 2: S.M.A.R.T. Health Analysis

Use ``smartctl`` to query the drive's internal health logs.

The "Quick Filter" Command

This command filters out the noise and shows only the critical health indicators:

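smartctl -a /dev/sdX | grep -Ei "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"

Replace /dev/sdX with your drive (e.g., /dev/sda). The match is case-insensitive so that the ATA result line ("SMART overall-health self-assessment test result") is included. smartctl usually needs root privileges.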
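To run the filter on every drive instead of one, the device list can come from `lsblk`. A minimal sketch, assuming `lsblk` is available; `list_disks` is a hypothetical helper name, and the loop in the usage comment needs root for smartctl.

```shell
# Hypothetical helper: reads `lsblk -dn -o NAME,TYPE` output on stdin and
# prints the /dev path of every whole disk (skips rom, loop, and so on).
list_disks() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Usage (run as root):
#   lsblk -dn -o NAME,TYPE | list_disks | while read -r dev; do
#       echo "== $dev"
#       smartctl -a "$dev" | grep -Ei "health|error|reallocated|pending|uncorrectable|crc|load_cycle|power_on"
#   done
```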

Step 3: Interpreting the Results

Here is how to read the attributes based on our diagnosis:

The "Certificate of Death" (Critical Failures)

If the raw value (the last column, RAW_VALUE) of any of these is greater than 0, the drive is dying and must be replaced immediately.

  • Reallocated_Sector_Ct: The drive found bad sectors and remapped them to its reserve area.
    • Diagnosis: Physical surface damage.
  • Current_Pending_Sector: Sectors the drive could not read and is waiting to remap. This causes system freezes, because the drive keeps retrying the read.
    • Diagnosis: The primary cause of I/O deadlocks.
  • Offline_Uncorrectable: Sectors that could not be read or corrected during offline testing. Data in these sectors is lost.
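The three critical attributes can be checked mechanically. The sketch below assumes the usual ATA attribute table printed by `smartctl -A`, where the attribute name is column 2 and the raw value is column 10; `critical_attrs` is a hypothetical name.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints every
# critical attribute whose raw value (column 10) is greater than 0.
critical_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            printf "CRITICAL: %s raw=%s\n", $2, $10
        }
    '
}

# Usage (run as root):
#   smartctl -A /dev/sdX | critical_attrs
```

No output means none of the three attributes has a non-zero raw value.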

The "Silent Killers" (Performance & Wear)

These indicate why a drive might be slow or unreliable, even if “Healthy”.

  • Load_Cycle_Count: How many times the head parked.
    • Warning: Laptop drives (2.5") park aggressively. If this is > 300,000, the mechanics are worn out.
  • UDMA_CRC_Error_Count: Communication errors.
    • Diagnosis: Usually a bad SATA/USB Cable, not the drive itself.
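The two wear indicators can be flagged the same way. A minimal sketch, assuming the `smartctl -A` table layout (name in column 2, raw value in column 10); `wear_attrs` is a hypothetical name, and the default limit of 300000 is the figure from the warning above, passed as an argument so it can be adjusted.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and reports
# heavy head parking and any cable/link CRC errors.
wear_attrs() {
    awk -v max="${1:-300000}" '
        $2 == "Load_Cycle_Count" && $10 + 0 > max + 0 {
            printf "WARNING: Load_Cycle_Count raw=%s (mechanical wear)\n", $10
        }
        $2 == "UDMA_CRC_Error_Count" && $10 + 0 > 0 {
            printf "CHECK CABLE: UDMA_CRC_Error_Count raw=%s\n", $10
        }
    '
}

# Usage (run as root):
#   smartctl -A /dev/sdX | wear_attrs 300000
```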

A Note on Seagate Drives

Ignore high raw values for:

  • ``Raw_Read_Error_Rate``
  • ``Seek_Error_Rate``

On Seagate drives, these raw values are internal counters, not error counts. Only worry if the normalized “VALUE” column drops to or below “THRESH”.
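Comparing VALUE against THRESH by eye is error-prone, so it can be scripted. This sketch assumes the `smartctl -A` table layout (VALUE in column 4, THRESH in column 6) and treats an attribute as failed when VALUE is less than or equal to THRESH, which is how smartmontools defines a failed attribute. `failing_attrs` is a hypothetical name. Attributes with a threshold of 0 are skipped because they cannot fail.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints every
# attribute whose normalized VALUE has reached its THRESH.
failing_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns.
        $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            printf "FAILING: %s VALUE=%s THRESH=%s\n", $2, $4, $6
        }
    '
}

# Usage (run as root):
#   smartctl -A /dev/sdX | failing_attrs
```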

Step 4: Identifying SMR Drives (The RAID Killer)

If a drive is healthy but causes RAID arrays to freeze during sync (speed drops to ~700 KB/s), it is likely an SMR (Shingled Magnetic Recording) drive.

Symptoms:

  • Good SMART status.
  • Terrible write performance on small files.
  • RAID Resync takes weeks or stalls.

Solution:

  • Do not use SMR drives in RAID arrays (ZFS or mdadm).
  • Use them only as single disks for Backup or Media storage.
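For mdadm arrays, the resync speed can be read from /proc/mdstat. The sketch below is a rough check, not an SMR detector: a slow resync is only a hint. `slow_resync` is a hypothetical name, the default limit of 1024 KB/s is an assumption standing in for "about 1 MB/s", and the parser assumes the progress line reports speed as `speed=<number>K/sec`.

```shell
# Hypothetical helper: reads /proc/mdstat text on stdin and prints arrays
# whose resync/recovery speed is below the limit in KB/s (default 1024).
slow_resync() {
    awk -v min="${1:-1024}" '
        # Remember the array name from lines such as "md0 : active raid1 ...".
        /^md[0-9]+/ { array = $1 }
        # "speed=" is 6 characters and "K/sec" is 5, so the digits in between
        # are RLENGTH - 11 characters long.
        match($0, /speed=[0-9]+K\/sec/) {
            kb = substr($0, RSTART + 6, RLENGTH - 11) + 0
            if (kb < min + 0) printf "SLOW: %s resync at %d KB/s\n", array, kb
        }
    '
}

# Usage:
#   slow_resync 1024 < /proc/mdstat
```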

Summary Checklist

Attribute               Value       Verdict
Reallocated / Pending   > 0         REPLACE IMMEDIATELY (Dead)
Load Cycle Count        > 600k      WARNING (Mechanical Wear)
CRC Errors              > 0         CHECK CABLE
Resync Speed            < 1 MB/s    SMR DRIVE (Unsuitable for RAID)
linux/check-hdd-health.txt · Last modified: by odefta