====== Linux HDD Diagnostics & Health Check ======
This guide outlines the steps to diagnose hard drive issues, ranging from system freezes (I/O Wait) to physical hardware failure.
==== Step 1: Is the Disk Slowing Down the System? ====
Before checking the disk physically, check if the disk is the bottleneck causing system lag or "freezes".
=== Check I/O Wait (vmstat) ===
Run this command to see system activity in real-time:
  vmstat 1
**What to look for:**
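* **wa (Wait):** The percentage of CPU time spent idle while waiting for I/O. Sustained values over 20-30% mean the disk is the bottleneck. **75%+ means the system is almost entirely stalled on storage.**
* **b (Blocked):** The number of processes in uninterruptible sleep, usually stuck waiting for I/O. If this is constantly > 0, processes are hanging.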
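To avoid watching the columns by eye, the two checks above can be sketched as a small filter. This is a minimal sketch, not a standard tool: the function name ``flag_io_wait`` and the default limit of 20 are assumptions made for illustration. It finds the ``wa`` and ``b`` columns from the vmstat header row, so it does not depend on their position.

```shell
# Hypothetical helper: reads vmstat output on stdin and prints only the
# samples where wa reaches the limit (default 20) or b is above 0.
flag_io_wait() {
  awk -v limit="${1:-20}" '
    $1 == "r" && $2 == "b" {          # header row: map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("wa" in col) && $1 ~ /^[0-9]+$/ {  # data rows only, after the header was seen
      wa = $(col["wa"]); b = $(col["b"])
      if (wa + 0 >= limit + 0 || b + 0 > 0)
        printf "wa=%s%% blocked=%s\n", wa, b
    }'
}
```

Usage: ``vmstat 1 10 | flag_io_wait 20``. Keep in mind that the first sample vmstat prints is an average since boot, so judge the disk by the later samples.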
=== Check Kernel Logs (dmesg) ===
If the disk is disconnecting or timing out, the kernel will log it.
  dmesg | grep -i "error\|fail\|ata\|scsi"
**Red Flags:**
* ``ata1.00: failed to read native max address``
* ``I/O error, dev sda, sector ...``
* ``task rsync:20263 blocked for more than 120 seconds``
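The broad grep above also matches harmless lines (for example, anything containing "data"). A narrower filter that looks only for the three red flags listed here might look like the following. It is a sketch: the function name ``scan_disk_errors`` is made up for this page, and the patterns cover only the messages shown above, not every possible disk error.

```shell
# Hypothetical helper: reads kernel log lines on stdin and keeps only
# the three red-flag messages listed in this guide.
scan_disk_errors() {
  grep -E 'failed to read native max address|I/O error, dev |blocked for more than [0-9]+ seconds'
}
```

Usage: ``dmesg -T | scan_disk_errors``. The ``-T`` option prints human-readable timestamps, which helps to match errors against the time of a freeze. No output means none of the three messages were logged.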
==== Step 2: S.M.A.R.T. Health Analysis ====
Use ``smartctl`` to query the drive's internal health logs.
=== The "Quick Filter" Command ===
This command filters out the noise and shows only the critical health indicators:
  smartctl -a /dev/sdX | grep -iE "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"
//Replace /dev/sdX with your drive (e.g., /dev/sda). smartctl must be run as root.//
==== Step 3: Interpreting the Results ====
Here is how to read the attributes returned by the filter above:
=== The "Certificate of Death" (Critical Failures) ===
If the raw value (last column) of any of these is **greater than 0**, the drive is dying and must be replaced immediately.
* **Reallocated_Sector_Ct:** The drive found bad sectors and moved their data to a reserve area.
  * //Diagnosis:// Physical surface damage.
* **Current_Pending_Sector:** The drive could not read these sectors and is waiting to retest or remap them. This causes **System Freezes** as the drive retries the read over and over.
  * //Diagnosis:// The primary cause of I/O deadlocks.
* **Offline_Uncorrectable:** Sectors that could not be read during the drive's offline scan. Data in these sectors is most likely lost.
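The "greater than 0" rule can be checked mechanically. The sketch below assumes the usual ATA attribute table printed by ``smartctl -A``, where the attribute name is the 2nd column and the raw value is the 10th. The function name ``critical_sectors`` is an illustration only, and drives that report raw values in a different layout (NVMe and SAS drives, for example) will not match.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# three critical attributes whose raw value (column 10) is above 0.
critical_sectors() {
  awk '
    $2 == "Reallocated_Sector_Ct" ||
    $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" {
      if ($10 + 0 > 0) printf "%s raw=%s\n", $2, $10
    }'
}
```

Usage: ``smartctl -A /dev/sdX | critical_sectors``. Empty output means all three counters are at 0.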
=== The "Silent Killers" (Performance & Wear) ===
These indicate why a drive might be slow or unreliable, even if its overall SMART status is "Healthy".
* **Load_Cycle_Count:** How many times the head has parked.
  * //Warning:// Laptop drives (2.5") park aggressively. If this is > 300,000, the mechanics are worn out.
* **UDMA_CRC_Error_Count:** Communication errors between the drive and the controller.
  * //Diagnosis:// Usually a bad **SATA/USB Cable**, not the drive itself.
=== A Note on Seagate Drives ===
**Ignore** high raw values for:
* ``Raw_Read_Error_Rate``
* ``Seek_Error_Rate``
On Seagate drives, the raw values of these attributes are internal counters, not error counts. Only worry if the normalized "VALUE" column drops to or below "THRESH".
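The VALUE versus THRESH comparison can be done for every attribute at once, which sidesteps the misleading raw values. This is a sketch under the same assumption about the ``smartctl -A`` table layout (VALUE in column 4, THRESH in column 6). The name ``below_threshold`` is hypothetical. It uses "less than or equal", which is how the smartctl manual defines a failed attribute, and it skips attributes with a threshold of 0 because those can never fail.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints every
# attribute whose normalized VALUE has reached its THRESH.
below_threshold() {
  awk '
    $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ {
      # THRESH of 0 means the attribute has no failure threshold
      if ($6 + 0 > 0 && $4 + 0 <= $6 + 0)
        printf "%s value=%d thresh=%d\n", $2, $4, $6
    }'
}
```

Usage: ``smartctl -A /dev/sdX | below_threshold``. A Seagate drive with a huge raw ``Seek_Error_Rate`` but a VALUE well above THRESH prints nothing here.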
==== Step 4: Identifying SMR Drives (The RAID Killer) ====
If a drive is healthy but causes RAID arrays to freeze during sync (speed drops to ~700 KB/s), it is likely an **SMR (Shingled Magnetic Recording)** drive.
**Symptoms:**
* Good SMART status.
* Terrible write performance on small files.
* RAID Resync takes weeks or stalls.
**Solution:**
* Do **not** use SMR drives in RAID arrays (ZFS/mdadm).
* Use them only as single disks for Backup or Media storage.
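For mdadm arrays, the resync speed symptom can be read from ``/proc/mdstat``. The sketch below assumes the progress line format ``speed=<number>K/sec`` that the md driver prints during a resync or recovery. The function name ``slow_resync`` and the default limit of 1000 K/sec are assumptions for illustration. It does not apply to ZFS, which reports resilver progress through its own tools.

```shell
# Hypothetical helper: reads /proc/mdstat content on stdin and prints the
# arrays whose current sync speed is below the limit (default 1000 K/sec).
slow_resync() {
  awk -v min_kb="${1:-1000}" '
    /^md[^ ]* :/ { dev = $1 }                 # remember the array being described
    match($0, /speed=[0-9]+K\/sec/) {
      # strip the leading "speed=" (6 chars) and trailing "K/sec" (5 chars)
      kb = substr($0, RSTART + 6, RLENGTH - 11)
      if (kb + 0 < min_kb + 0) printf "%s speed=%sK/sec\n", dev, kb
    }'
}
```

Usage: ``slow_resync < /proc/mdstat``. A slow resync only points towards SMR when the SMART attributes are clean. A drive with pending sectors can be just as slow.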
==== Summary Checklist ====
^ Attribute ^ Value ^ Verdict ^
| **Reallocated / Pending / Uncorrectable** | > 0 | **REPLACE IMMEDIATELY** (Failing) |
| **Load Cycle Count** | > 300k | **WARNING** (Mechanical Wear) |
| **CRC Errors** | > 0 | **CHECK CABLE** |
| **Resync Speed** | < 1MB/s | **SMR DRIVE** (Unsuitable for RAID) |