Linux HDD Diagnostics & Health Check
This guide outlines the steps to diagnose hard drive issues, ranging from system freezes (I/O Wait) to physical hardware failure.
Step 1: Is the Disk Slowing Down the System?
Before checking the disk physically, check if the disk is the bottleneck causing system lag or “freezes”.
Check I/O Wait (vmstat)
Run this command to watch system activity in real time (the 1 prints a new sample every second):
vmstat 1
What to look for:
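- wa (Wait): The percentage of CPU time spent idle while waiting for I/O. Sustained values above 20-30% point to the disk as the bottleneck; 75% or more usually means storage has all but stalled.
- b (Blocked): The number of processes in uninterruptible sleep, typically waiting for I/O. If this stays above 0, processes are hanging.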
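Watching the columns scroll by is error-prone, so here is a minimal sketch that averages ``wa`` and ``b`` over a short run. The function name ``iowait_summary`` and the ``WARN_PCT`` variable are hypothetical, chosen for this example; the columns are located by header name rather than by fixed position, since the exact column set varies between vmstat versions.

```shell
# Average the "wa" and "b" columns of vmstat output read on stdin.
# WARN_PCT is a hypothetical knob (default 20, the low end of the range above).
iowait_summary() {
    awk -v warn="${WARN_PCT:-20}" '
        # The header row names the columns; remember where wa and b sit.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) {
                if ($i == "wa") wa = i
                if ($i == "b")  b = i
            }
            next
        }
        # Data rows start with a number, which also skips the banner line.
        wa && $1 ~ /^[0-9]+$/ {
            # The first vmstat sample is an average since boot, so drop it.
            if (++seen == 1) next
            sum_wa += $wa; sum_b += $b; n++
        }
        END {
            if (!n) { print "no samples"; exit }
            avg = sum_wa / n
            printf "avg_wa=%.1f avg_blocked=%.1f %s\n", avg, sum_b / n, (avg >= warn ? "HIGH" : "ok")
        }'
}

# Usage: take 11 one-second samples (the first is discarded).
# vmstat 1 11 | iowait_summary
```

A ``HIGH`` result only says the CPU spent much of the window waiting on I/O; the kernel log and S.M.A.R.T. checks that follow are still needed to tell which disk is responsible.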
Check Kernel Logs (dmesg)
If the disk is disconnecting or timing out, the kernel will log it.
dmesg | grep -i "error\|fail\|ata\|scsi"
Red Flags:
- ``ata1.00: failed to read native max address``
- ``I/O error, dev sda, sector …``
- ``task rsync:20263 blocked for more than 120 seconds``
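To see how often each red flag occurs rather than scrolling through raw matches, the log can be tallied by pattern. This is a rough sketch: ``count_red_flags`` is a hypothetical helper name, and the three patterns are taken from the example messages above, so they may need adjusting for other kernels or controllers.

```shell
# Count kernel log lines matching each red-flag pattern listed above.
# Reads log text on stdin, so it works with live dmesg output or a saved log.
count_red_flags() {
    awk '
        /failed to read native max address/    { ata++ }
        /I\/O error, dev /                     { io++ }
        /blocked for more than [0-9]+ seconds/ { hung++ }
        END { printf "ata_errors=%d io_errors=%d hung_tasks=%d\n", ata, io, hung }'
}

# Usage (dmesg may need root where kernel log access is restricted):
# dmesg | count_red_flags
```

A handful of hung-task lines with no I/O errors can have other causes (a slow network mount, for instance); I/O errors naming a specific device are the stronger signal.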
Step 2: S.M.A.R.T. Health Analysis
Use ``smartctl`` to query the drive's internal health logs.
The "Quick Filter" Command
This command filters out the noise and shows only the critical health indicators:
smartctl -a /dev/sdX | grep -E "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"
Replace /dev/sdX with your drive (e.g., /dev/sda). smartctl normally needs root, so prefix the command with sudo if required.
Step 3: Interpreting the Results
Here is how to read the attributes the filter returns:
The "Certificate of Death" (Critical Failures)
If any of these are greater than 0, the drive is dying and must be replaced immediately.
- Reallocated_Sector_Ct: The drive found bad sectors and moved data to a reserve area.
- Diagnosis: Physical surface damage.
- Current_Pending_Sector: Sectors the drive currently cannot read and has marked for reallocation. These cause system freezes, because the drive retries each failed read over and over.
- Diagnosis: The primary cause of I/O deadlocks.
- Offline_Uncorrectable: Data in these sectors is permanently lost.
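The "greater than 0" rule for these three attributes can be checked mechanically. The sketch below assumes the standard attribute table printed by ``smartctl -A``, where the second column is the attribute name and the tenth is the raw value; ``critical_attrs`` is a hypothetical helper name, and drives that report raw values in another format would need a different parser.

```shell
# Print each critical attribute whose raw value is above zero.
# Expects the attribute table from `smartctl -A` on stdin:
# column 2 = attribute name, column 10 = raw value.
critical_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            printf "%s=%d\n", $2, $10
        }'
}

# Usage: empty output means none of the three counters is above zero.
# smartctl -A /dev/sdX | critical_attrs
```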
The "Silent Killers" (Performance & Wear)
These indicate why a drive might be slow or unreliable, even if “Healthy”.
- Load_Cycle_Count: How many times the head parked.
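- Warning: Laptop drives (2.5") park aggressively. If this is > 300,000, the mechanics are likely worn out. Rated limits differ by model (commonly 300,000 to 600,000 cycles), so check the drive's datasheet.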
- UDMA_CRC_Error_Count: Communication errors.
- Diagnosis: Usually a bad SATA/USB Cable, not the drive itself.
A Note on Seagate Drives
Ignore high raw values for:
- ``Raw_Read_Error_Rate``
- ``Seek_Error_Rate``
On Seagate drives, these are internal counters, not error counts. Only worry if the “VALUE” drops below “THRESH”.
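Comparing the normalized columns instead of the raw ones can be scripted as follows. This is a sketch under the same layout assumption as the ``smartctl -A`` table (column 2 name, column 4 VALUE, column 6 THRESH); ``below_thresh`` is a hypothetical name. It treats VALUE equal to THRESH as failing as well, which matches how smartctl flags a failed attribute, and skips attributes with a THRESH of 0 because they have no failure threshold.

```shell
# List attributes whose normalized VALUE has fallen to or below THRESH.
# Expects `smartctl -A` output on stdin: $2 = name, $4 = VALUE, $6 = THRESH.
below_thresh() {
    awk '
        # Attribute rows start with a numeric ID; this skips header lines.
        $1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            printf "%s value=%d thresh=%d\n", $2, $4, $6
        }'
}

# Usage: empty output means no attribute is at or under its threshold.
# smartctl -A /dev/sdX | below_thresh
```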
Step 4: Identifying SMR Drives (The RAID Killer)
If a drive is healthy but causes RAID arrays to freeze during sync (speed drops to ~700KB/s), it is likely SMR (Shingled Magnetic Recording).
Symptoms:
- Good SMART status.
- Terrible write performance on small files.
- RAID Resync takes weeks or stalls.
Solution:
- Do not use SMR drives in RAID (ZFS/MDADM).
- Use them only as single disks for Backup or Media storage.
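For mdadm arrays, the stalled-resync symptom can be confirmed from ``/proc/mdstat``, which reports the current speed on the progress line. The sketch below assumes the usual ``speed=<n>K/sec`` form of that line; ``slow_resync`` and ``SLOW_KBPS`` are hypothetical names, with the default of 1024 K/sec standing in for the roughly 1 MB/s mark used in this guide.

```shell
# Report resync/recovery speed per md array and flag slow ones.
# Reads /proc/mdstat-style text on stdin. SLOW_KBPS is a hypothetical knob.
slow_resync() {
    awk -v slow="${SLOW_KBPS:-1024}" '
        # Array lines look like "md0 : active raid1 ..."; remember the name.
        $1 ~ /^md/ && $2 == ":" { dev = $1 }
        # Progress lines carry "speed=<digits>K/sec".
        match($0, /speed=[0-9]+K\/sec/) {
            # Skip "speed=" (6 chars) and drop the trailing "K/sec" (5 chars).
            kbps = substr($0, RSTART + 6, RLENGTH - 11) + 0
            printf "%s speed=%dK/sec %s\n", dev, kbps, (kbps < slow ? "SLOW" : "ok")
        }'
}

# Usage:
# slow_resync < /proc/mdstat
#
# The kernel's zoned flag only reveals host-aware or host-managed SMR;
# drive-managed SMR, the kind usually found in consumer disks, reports "none":
# cat /sys/block/sdX/queue/zoned
```

A slow reading alone does not prove the drive is SMR, since a failing disk or heavy competing I/O can also throttle a resync; it is a reason to look up the drive's model number against the manufacturer's SMR list.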
Summary Checklist
| Attribute | Value | Verdict |
|---|---|---|
| Reallocated / Pending | > 0 | REPLACE IMMEDIATELY (Dead) |
| Load Cycle Count | > 300k | WARNING (Mechanical Wear) |
| CRC Errors | > 0 | CHECK CABLE |
| Resync Speed | < 1MB/s | SMR DRIVE (Unsuitable for RAID) |
