====== Linux HDD Diagnostics & Health Check ======

This guide outlines the steps to diagnose hard drive issues, ranging from system freezes (I/O Wait) to physical hardware failure.

==== Step 1: Is the Disk Slowing Down the System? ====

Before checking the disk physically, check if the disk is the bottleneck causing system lag or "freezes".

=== Check I/O Wait (vmstat) ===
Run this command to print system activity once per second (the first row shows averages since boot, so judge by the rows that follow):
<code bash>
vmstat 1
</code>

**What to look for:**
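  * **wa (Wait):** Percentage of CPU time spent idle while I/O requests are outstanding. Sustained values over 20-30% point to the disk as the bottleneck. **75%+ usually means storage has all but stalled.**
  * **b (Blocked):** The number of processes in uninterruptible sleep, usually waiting for I/O. If this stays above 0, processes are hanging on the disk.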
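
Reading the columns by eye gets tedious on a long capture. The following is a minimal sketch, not an official tool: a hypothetical helper named ``vmstat_summary`` that reads ``vmstat`` output on stdin and reports the average ``wa`` and the peak ``b``. It looks up the columns by header name, since the column set can differ between procps versions, and it skips the first sample because that row is an average since boot.

```bash
# Hypothetical helper: summarize vmstat output read from stdin.
vmstat_summary() {
  awk '
    $1 == "r" {                         # header row: map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("wa" in col) && $1 ~ /^[0-9]+$/ {  # numeric sample rows only
      if (!skipped) { skipped = 1; next }   # first sample = averages since boot
      n++
      wa_sum += $(col["wa"])
      if ($(col["b"]) + 0 > b_max) b_max = $(col["b"]) + 0
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d avg_wa=%.1f max_b=%d\n", n, wa_sum / n, b_max
    }
  '
}

# Usage: 11 reports, one per second (the first one is discarded)
#   vmstat 1 11 | vmstat_summary
```

An average ``wa`` above the 20-30% range, or a ``max_b`` that is rarely 0 across repeated runs, matches the symptoms described above.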

=== Check Kernel Logs (dmesg) ===
If the disk is disconnecting or timing out, the kernel will log it.
<code bash>
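# "ata[0-9]" avoids matching words such as "data"; "blocked for" catches hung-task warnings
dmesg | grep -iE "error|fail|ata[0-9]|scsi|blocked for"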
</code>
**Red Flags:**
  * ``ata1.00: failed to read native max address``
  * ``I/O error, dev sda, sector ...``
  * ``task rsync:20263 blocked for more than 120 seconds``
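
On a busy machine a broad keyword match still returns a lot of unrelated lines. As a rough sketch, the hypothetical filter ``disk_red_flags`` below keeps only lines shaped like the three red flags listed above. The patterns are assumptions derived from those examples, and kernel message wording varies between versions, so treat a clean result as a hint rather than proof.

```bash
# Hypothetical filter: keep only kernel log lines that resemble the red flags above.
disk_red_flags() {
  grep -E 'I/O error, dev [a-z0-9]+|blocked for more than [0-9]+ seconds|ata[0-9]+(\.[0-9]+)?: .*(fail|error|timeout)'
}

# Usage (may require root on systems that restrict dmesg):
#   dmesg | disk_red_flags
```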

==== Step 2: S.M.A.R.T. Health Analysis ====

Use ``smartctl`` to query the drive's internal health logs.

=== The "Quick Filter" Command ===
This command filters out the noise and shows only the critical health indicators:
<code bash>
smartctl -a /dev/sdX | grep -E "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"
</code>
//Replace /dev/sdX with your drive (e.g., /dev/sda)//

==== Step 3: Interpreting the Results ====

Here is how to read the attributes returned by ``smartctl``. Unless noted otherwise, look at the RAW_VALUE column (the last one):

=== The "Certificate of Death" (Critical Failures) ===
If the raw value of any of these is **greater than 0**, treat the drive as failing: back up the data now and replace the drive, especially if the count keeps growing.

  * **Reallocated_Sector_Ct:** The drive found bad sectors and moved data to a reserve area.
    * //Diagnosis:// Physical surface damage.
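  * **Current_Pending_Sector:** Sectors the drive could not read reliably and that are waiting to be rewritten or reallocated. This causes **System Freezes** because every read of such a sector stalls while the drive retries.
    * //Diagnosis:// The primary cause of I/O deadlocks.
  * **Offline_Uncorrectable:** Sectors that could not be read or corrected during the drive's offline scan. Data in these sectors is most likely lost.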
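
The check above can be sketched as a small filter. ``critical_smart_attrs`` is a hypothetical helper, and it assumes the standard ATA attribute table printed by ``smartctl -A``, where the attribute name is column 2 and RAW_VALUE is column 10. NVMe and SAS drives report health in a different format, so this sketch does not apply to them.

```bash
# Hypothetical helper: flag the three critical attributes when RAW_VALUE > 0.
# Reads `smartctl -A` output on stdin; exits 1 if anything was flagged.
critical_smart_attrs() {
  awk '
    $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
      if ($10 + 0 > 0) {
        printf "FAIL %s raw=%d\n", $2, $10
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}

# Usage (needs root; replace /dev/sdX with your drive):
#   smartctl -A /dev/sdX | critical_smart_attrs || echo "Back up and replace this drive"
```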

=== The "Silent Killers" (Performance & Wear) ===
These indicate why a drive might be slow or unreliable, even if "Healthy".

  * **Load_Cycle_Count:** How many times the head parked.
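    * //Warning:// Laptop drives (2.5") park aggressively. Drives are typically rated for roughly 300,000 to 600,000 load cycles (check the datasheet for your model); once the count passes the rated figure, treat the mechanics as worn.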
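  * **UDMA_CRC_Error_Count:** Communication errors on the link between the drive and the controller.
    * //Diagnosis:// Usually a bad **SATA/USB Cable**, not the drive itself. The counter is cumulative and does not reset, so after replacing the cable check whether it is still increasing.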
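
Both checks can be sketched the same way. ``wear_warnings`` is a hypothetical helper that again assumes the ATA attribute table from ``smartctl -A`` (name in column 2, RAW_VALUE in column 10). The load-cycle limit is passed as an argument and defaults to 300000 here, which is only a starting point; use the rated figure for your drive model if you know it.

```bash
# Hypothetical helper: warn about head-parking wear and link CRC errors.
# $1 = load-cycle limit (defaults to 300000, an assumed starting point).
wear_warnings() {
  awk -v limit="${1:-300000}" '
    $2 == "Load_Cycle_Count" && $10 + 0 > limit + 0 {
      printf "WARN load cycles %d > %d\n", $10, limit
    }
    $2 == "UDMA_CRC_Error_Count" && $10 + 0 > 0 {
      printf "CHECK CABLE crc errors %d\n", $10
    }
  '
}

# Usage (needs root; replace /dev/sdX with your drive):
#   smartctl -A /dev/sdX | wear_warnings 600000
```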

=== A Note on Seagate Drives ===
**Ignore** high raw values for:
  * ``Raw_Read_Error_Rate``
  * ``Seek_Error_Rate``
On Seagate drives, these raw values are composite internal counters, not plain error counts. Only worry if the normalized "VALUE" drops to or below "THRESH".
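
Comparing the normalized value against the threshold works for any attribute, not only the two Seagate counters. As a minimal sketch, the hypothetical helper ``below_threshold`` lists every attribute whose VALUE (column 4) is at or below its THRESH (column 6), which is the condition ``smartctl`` itself uses to mark an attribute as failed. Attributes with a threshold of 0 are skipped, since a threshold of 0 means the attribute can never fail. The column positions assume the ATA table from ``smartctl -A``.

```bash
# Hypothetical helper: list attributes whose normalized VALUE <= THRESH.
# Reads `smartctl -A` output on stdin; exits 1 if any attribute failed.
below_threshold() {
  awk '
    $1 ~ /^[0-9]+$/ && NF >= 10 && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
      printf "FAILED %s value=%d thresh=%d\n", $2, $4, $6
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Usage (needs root; replace /dev/sdX with your drive):
#   smartctl -A /dev/sdX | below_threshold
```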

==== Step 4: Identifying SMR Drives (The RAID Killer) ====

If a drive is healthy but causes RAID arrays to freeze during sync (speed drops to ~700KB/s), it is likely **SMR (Shingled Magnetic Recording)**.

**Symptoms:**
  * Good SMART status.
  * Terrible write performance on small files.
  * RAID Resync takes weeks or stalls.

**Solution:**
  * Do **not** use SMR drives in RAID (ZFS/mdadm).
  * Use them only as single disks for Backup or Media storage.
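
For mdadm arrays, the resync speed is reported in ``/proc/mdstat`` as ``speed=<N>K/sec``. The sketch below is a hypothetical helper, ``slow_resync``, that flags any reported speed under a limit (defaulting to 1000 K/sec, an assumed stand-in for the roughly 1MB/s figure used in this guide). A slow resync alone does not prove the drive is SMR, since a failing disk or heavy competing I/O can produce the same number, so combine it with the symptoms above and the drive's datasheet.

```bash
# Hypothetical helper: flag slow md resync speeds found in /proc/mdstat text.
# $1 = minimum acceptable speed in K/sec (defaults to 1000, an assumption).
slow_resync() {
  awk -v min="${1:-1000}" '
    match($0, /speed=[0-9]+K\/sec/) {
      # "speed=" is 6 characters and "K/sec" is 5, so the digits sit in between
      kb = substr($0, RSTART + 6, RLENGTH - 11) + 0
      if (kb < min + 0) {
        printf "SLOW resync: %d K/sec\n", kb
        slow = 1
      }
    }
    END { exit slow + 0 }
  '
}

# Usage:
#   slow_resync < /proc/mdstat || echo "Resync is crawling; check for SMR drives"
```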

==== Summary Checklist ====

^ Attribute ^ Value ^ Verdict ^
| **Reallocated / Pending** | > 0 | **REPLACE IMMEDIATELY** (Dead) |
| **Load Cycle Count** | > 600k | **WARNING** (Mechanical Wear) |
| **CRC Errors** | > 0 | **CHECK CABLE** |
| **Resync Speed** | < 1MB/s | **SMR DRIVE** (Unsuitable for RAID) |