Linux HDD Diagnostics & Health Check

This guide outlines the steps to diagnose hard drive issues, ranging from system freezes (I/O Wait) to physical hardware failure.

Step 1: Is the Disk Slowing Down the System?

Before checking the disk physically, check if the disk is the bottleneck causing system lag or “freezes”.

Check I/O Wait (vmstat)

Run this command to see system activity in real time, refreshed every second:

vmstat 1

What to look for:

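wa (Wait): The percentage of CPU time spent idle while waiting for I/O. If this stays high (over 20-30%), the disk is the bottleneck. At 75% or more, storage is effectively stalled.
b (Blocked): The number of processes in uninterruptible sleep, usually stuck waiting for I/O. If this is constantly above 0, processes are hanging.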
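
As a rough sketch of the check described above, the helper below reads vmstat output on stdin and prints only the samples whose wa exceeds a limit. The function name check_iowait and the default limit of 20 are illustrative choices, not part of vmstat. It finds the wa and b columns by header name rather than by position. Keep in mind that the first sample vmstat prints reports averages since boot, so it may not reflect the current load.

```shell
# check_iowait [LIMIT]: read vmstat output on stdin and report samples
# whose "wa" value is above LIMIT (default 20, an illustrative threshold).
check_iowait() {
    awk -v limit="${1:-20}" '
        # The column-name header starts with "r"; record where "wa" and "b" sit.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "wa") wa_col = i
                if ($i == "b")  b_col = i
            }
            next
        }
        # Data rows start with a number; skip everything else.
        wa_col && $1 ~ /^[0-9]+$/ && $wa_col + 0 > limit {
            printf "HIGH iowait: wa=%s%% blocked=%s\n", $wa_col, $b_col
        }
    '
}

# Usage: take five one-second samples and flag the bad ones.
# vmstat 1 5 | check_iowait 20
```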

Check Kernel Logs (dmesg)

If the disk is disconnecting or timing out, the kernel will log it.

dmesg | grep -i "error\|fail\|ata\|scsi"

Red Flags:

``ata1.00: failed to read native max address``
``I/O error, dev sda, sector …``
``task rsync:20263 blocked for more than 120 seconds``
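
When the log is long, it can help to tally I/O errors per device. The sketch below assumes the kernel message contains the text "I/O error, dev NAME," as in the red flag above. The function name io_errors_by_device is made up for this example.

```shell
# io_errors_by_device: read kernel log lines on stdin and print
# "DEVICE COUNT" for every device that logged an I/O error.
io_errors_by_device() {
    awk '
        # "I/O error, dev " is 15 characters; the device name follows it.
        match($0, /I\/O error, dev [^ ,]+/) {
            dev = substr($0, RSTART + 15, RLENGTH - 15)
            count[dev]++
        }
        END { for (d in count) printf "%s %d\n", d, count[d] }
    ' | sort
}

# Usage:
# dmesg | io_errors_by_device
```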

Step 2: S.M.A.R.T. Health Analysis

Use ``smartctl`` to query the drive's internal health logs.

The "Quick Filter" Command

This command filters out the noise and shows only the critical health indicators:

smartctl -a /dev/sdX | grep -E "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"

Replace /dev/sdX with your drive (e.g., /dev/sda). The command usually needs root privileges.
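
To run the same filter on every drive, one option is to take the device list from `smartctl --scan`. This is only a sketch: the helper names scan_devices and quick_filter_all are invented here, and it assumes each scan line starts with the device path. Drives behind RAID controllers may need an extra -d option that this loop does not pass.

```shell
# scan_devices: print the device path (first field) of each line
# of `smartctl --scan` output read from stdin.
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# quick_filter_all: run the quick filter on every scanned device (needs root).
quick_filter_all() {
    smartctl --scan | scan_devices | while read -r dev; do
        echo "== $dev"
        smartctl -a "$dev" |
            grep -E "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"
    done
}

# Usage (as root):
# quick_filter_all
```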

Step 3: Interpreting the Results

Here is how to read the attributes. Unless noted otherwise, the numbers below refer to the last column of the smartctl output (RAW_VALUE).

The "Certificate of Death" (Critical Failures)

If any of these are greater than 0, the drive is dying and must be replaced immediately.

Reallocated_Sector_Ct: The drive found bad sectors and moved data to a reserve area.
- Diagnosis: Physical surface damage.
Current_Pending_Sector: Sectors the drive could not read and has flagged for reallocation. This causes system freezes while the drive retries the read again and again.
- Diagnosis: The primary cause of I/O deadlocks.
Offline_Uncorrectable: Data in these sectors is permanently lost.
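
The three critical counters can be pulled out of the attribute table automatically. The sketch below assumes the usual ATA layout of `smartctl -A` output, where the attribute name is the second column and RAW_VALUE is the tenth. NVMe and SAS drives report health in a different format and are not covered. The function name critical_smart_attrs is illustrative.

```shell
# critical_smart_attrs: read `smartctl -A` output on stdin and print
# each critical attribute whose raw value is greater than 0.
critical_smart_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # $10 is RAW_VALUE; "+ 0" forces a numeric comparison.
            if ($10 + 0 > 0) printf "%s raw=%s\n", $2, $10
        }
    '
}

# Usage (as root); no output means all three counters are 0:
# smartctl -A /dev/sdX | critical_smart_attrs
```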

The "Silent Killers" (Performance & Wear)

These indicate why a drive might be slow or unreliable, even if “Healthy”.

Load_Cycle_Count: How many times the head parked.
- Warning: Laptop drives (2.5") park aggressively. Drives are commonly rated for 300,000 to 600,000 load cycles; once the count passes the rating for your model, the mechanics are likely worn out.
UDMA_CRC_Error_Count: Communication errors.
- Diagnosis: Usually a bad SATA/USB Cable, not the drive itself.
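
A small helper can apply both checks to the attribute table at once. This is a sketch under the same assumption about the ATA `smartctl -A` column layout (name in column 2, raw value in column 10). The function name wear_and_cable_check is hypothetical, and the default limit of 300000 is only a placeholder that should be replaced with the rated figure for your drive.

```shell
# wear_and_cable_check [LIMIT]: read `smartctl -A` output on stdin.
# Warns when Load_Cycle_Count exceeds LIMIT (default 300000, a placeholder)
# or when any UDMA CRC errors have been counted.
wear_and_cable_check() {
    awk -v limit="${1:-300000}" '
        $2 == "Load_Cycle_Count" && $10 + 0 > limit {
            printf "WARNING: %d load cycles (limit %d)\n", $10, limit
        }
        $2 == "UDMA_CRC_Error_Count" && $10 + 0 > 0 {
            printf "CHECK CABLE: %d CRC errors\n", $10
        }
    '
}

# Usage (as root):
# smartctl -A /dev/sdX | wear_and_cable_check 600000
```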

A Note on Seagate Drives

Ignore high raw values for:

``Raw_Read_Error_Rate``
``Seek_Error_Rate``

On Seagate drives, these are internal counters, not error counts. Only worry if the “VALUE” drops below “THRESH”.
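
Comparing VALUE against THRESH by eye is error-prone because both are zero-padded. The sketch below lists attributes whose normalized VALUE has reached THRESH, assuming the ATA `smartctl -A` layout with VALUE in column 4 and THRESH in column 6. It treats a VALUE equal to THRESH as failed as well, and skips attributes with a THRESH of 0 since those have no threshold to cross. The function name below_threshold is illustrative.

```shell
# below_threshold: read `smartctl -A` output on stdin and print every
# attribute whose normalized VALUE is at or below its THRESH.
below_threshold() {
    awk '
        # Attribute rows start with a numeric ID; ignore THRESH of 0.
        $1 ~ /^[0-9]+$/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
            printf "%s VALUE=%d THRESH=%d\n", $2, $4, $6
        }
    '
}

# Usage (as root):
# smartctl -A /dev/sdX | below_threshold
```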

Step 4: Identifying SMR Drives (The RAID Killer)

If a drive reports healthy SMART data but causes RAID arrays to stall during sync (speed drops to ~700 KB/s), it is likely an SMR (Shingled Magnetic Recording) drive.

Symptoms:

Good SMART status.
Terrible write performance on small files.
RAID Resync takes weeks or stalls.
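
For mdadm arrays, the resync speed can be read from /proc/mdstat. The sketch below assumes the progress line contains a field of the form speed=NNNK/sec and flags arrays running below a limit. The function name slow_resync and the default limit of 1024 are illustrative. A slow resync alone does not prove that a drive is SMR, and ZFS pools do not appear in /proc/mdstat.

```shell
# slow_resync [LIMIT]: read /proc/mdstat content on stdin and report
# arrays whose resync/recovery speed is below LIMIT K/sec (default 1024).
slow_resync() {
    awk -v limit="${1:-1024}" '
        # Remember the array name from lines such as "md0 : active raid1 ...".
        /^md[^ ]* :/ { array = $1 }
        # "speed=" is 6 characters and "K/sec" is 5; the number sits between.
        match($0, /speed=[0-9]+K\/sec/) {
            kb = substr($0, RSTART + 6, RLENGTH - 11)
            if (kb + 0 < limit)
                printf "%s: resync at %dK/sec (limit %dK/sec)\n", array, kb, limit
        }
    '
}

# Usage:
# slow_resync < /proc/mdstat
```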

Solution:

Do not use SMR drives in RAID (ZFS/mdadm).
Use them only as single disks for Backup or Media storage.

Summary Checklist

Attribute	Value	Verdict
Reallocated / Pending	> 0	REPLACE IMMEDIATELY (Dead)
Load Cycle Count	> 600k	WARNING (Mechanical Wear)
CRC Errors	> 0	CHECK CABLE
Resync Speed	< 1MB/s	SMR DRIVE (Unsuitable for RAID)
