====== Linux HDD Diagnostics & Health Check ======

This guide outlines the steps to diagnose hard drive problems, from system freezes caused by I/O wait to physical hardware failure.

==== Step 1: Is the Disk Slowing Down the System? ====

Before inspecting the disk itself, check whether it is the bottleneck behind system lag or "freezes".

=== Check I/O Wait (vmstat) ===

Run this command to print system activity once per second:

  vmstat 1

**What to look for:**

  * **wa (Wait):** The percentage of CPU time spent idle while waiting for I/O. Sustained values above 20-30% point to a storage bottleneck. **75% or more usually means storage is effectively deadlocked.**
  * **b (Blocked):** The number of processes in uninterruptible sleep, typically stuck waiting for I/O. If this stays above 0, processes are hanging.

=== Check Kernel Logs (dmesg) ===

If the disk is disconnecting or timing out, the kernel logs it (on many systems ``dmesg`` requires root):

  dmesg | grep -i "error\|fail\|ata\|scsi"

**Red Flags:**

  * ``ata1.00: failed to read native max address``
  * ``I/O error, dev sda, sector ...``
  * ``task rsync:20263 blocked for more than 120 seconds``

==== Step 2: S.M.A.R.T. Health Analysis ====

Use ``smartctl`` to query the drive's internal health logs.

=== The "Quick Filter" Command ===

This command filters out the noise and shows only the critical health indicators (run it as root):

  smartctl -a /dev/sdX | grep -iE "(Health|Error|Reallocated|Pending|Uncorrectable|CRC|Load_Cycle|Power_On)"

//Replace /dev/sdX with your drive (e.g., /dev/sda)//

The match is case-insensitive so that the lowercase "overall-health" result line is included.

==== Step 3: Interpreting the Results ====

Here is how to read the attributes:

=== The "Certificate of Death" (Critical Failures) ===

If the raw value of any of these is **greater than 0**, the drive is dying and must be replaced immediately.

  * **Reallocated_Sector_Ct:** The drive found bad sectors and remapped them to a reserve area.
    * //Diagnosis:// Physical surface damage.
  * **Current_Pending_Sector:** Sectors the drive could not read and has not yet remapped. This causes **System Freezes** while the drive retries the read over and over.
    * //Diagnosis:// The primary cause of I/O deadlocks.
  * **Offline_Uncorrectable:** Sectors that could not be read or corrected during offline scans. Data in these sectors is permanently lost.

=== The "Silent Killers" (Performance & Wear) ===

These explain why a drive might be slow or unreliable even when it reports "Healthy".

  * **Load_Cycle_Count:** How many times the heads have parked.
    * //Warning:// Laptop drives (2.5") park aggressively. Rated limits vary by model, commonly 300,000 to 600,000 cycles (check the datasheet). Beyond the rated limit, the mechanics are worn out.
  * **UDMA_CRC_Error_Count:** Communication errors between the drive and the controller.
    * //Diagnosis:// Usually a bad **SATA/USB Cable**, not the drive itself.

=== A Note on Seagate Drives ===

**Ignore** high raw values for:

  * ``Raw_Read_Error_Rate``
  * ``Seek_Error_Rate``

On Seagate drives, these raw values are internal counters, not error counts. Only worry if the "VALUE" column drops to or below "THRESH".

==== Step 4: Identifying SMR Drives (The RAID Killer) ====

If a drive is healthy but causes RAID arrays to freeze during sync (speed drops to ~700 KB/s), it is likely **SMR (Shingled Magnetic Recording)**.

**Symptoms:**

  * Good SMART status.
  * Terrible write performance on small files.
  * RAID resync takes weeks or stalls.

**Solution:**

  * Do **not** use SMR drives in RAID (ZFS/mdadm).
  * Use them only as single disks for backup or media storage.

==== Summary Checklist ====

^ Attribute ^ Value ^ Verdict ^
| **Reallocated / Pending / Uncorrectable** | > 0 | **REPLACE IMMEDIATELY** (Dead) |
| **Load Cycle Count** | > rated limit (often 300k-600k) | **WARNING** (Mechanical Wear) |
| **CRC Errors** | > 0 | **CHECK CABLE** |
| **Resync Speed** | < 1 MB/s | **SMR DRIVE** (Unsuitable for RAID) |
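
=== Scripting the Checklist (Sketch) ===

The SMART rows of the checklist can be applied automatically. The following is a minimal sketch, not a definitive tool: ``check_smart`` is a hypothetical helper name (it is not part of smartmontools), and it assumes the standard ATA attribute table printed by ``smartctl -A``, with the attribute name in column 2 and the raw value in column 10. The load-cycle limit is a parameter (defaulting to 600,000 here) because rated limits vary by model.

```shell
#!/bin/sh
# Sketch: apply the summary checklist to `smartctl -A` output read from stdin.
# Assumption: standard ATA attribute table (name in column 2, raw value in column 10).
# check_smart is a hypothetical helper name, not part of smartmontools.
check_smart() {
    # $1: load-cycle limit (optional; defaults to 600000)
    awk -v limit="${1:-600000}" '
        # The raw column can carry trailing text; adding 0 keeps the leading number.
        { raw = $10 + 0 }

        # Critical attributes: any raw value above 0 means replace the drive.
        $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
            if (raw > 0) print "REPLACE: " $2 " = " $10
        }

        # CRC errors usually point at the cable, not the drive.
        $2 == "UDMA_CRC_Error_Count" && raw > 0 { print "CHECK CABLE: " $2 " = " $10 }

        # Head parking count above the chosen limit suggests mechanical wear.
        $2 == "Load_Cycle_Count" && raw > limit { print "WARNING: " $2 " = " $10 }
    '
}
```

Usage, as root: ``smartctl -A /dev/sdX | check_smart``, or ``smartctl -A /dev/sdX | check_smart 300000`` to use a lower load-cycle limit. The function prints nothing when no rule triggers. It does not cover the resync-speed row, which has to be observed on the array itself.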