Checking hard drive and filesystem health

Table of Contents

SMART
Check results
Gsmartcontrol
Badblocks
Further reading

...wip...

SMART

This article assumes that the smartctl command (manual) is available (it is part of the smartmontools package) and that your drive is SMART¹ capable with SMART enabled. If it isn't, check this article.

SMART tests and queries can be run on a hard drive that is in use. All smartctl commands require elevated privileges.

First, let's see what we have:

#> smartctl --scan-open
/dev/sda -d sat # /dev/sda [SAT], ATA device
/dev/sdb -d sat # /dev/sdb [SAT], ATA device
/dev/sdc -d sat # /dev/sdc [SAT], ATA device
#> smartctl --info /dev/sda # let's use /dev/sda in this article

This output includes both spinning and solid state drives, and all have SMART enabled, as the relatively short --info output shows.

Is it necessary to specify the device type with -d sat for every subsequent command? I think not (the man page doesn't clarify). But if you do, it always has to be the last option.²

Getting all SMART information:

#> smartctl /dev/sda -a
## including non-SMART info:
#> smartctl /dev/sda -x

That's a lot. Let's see the overall health report only:

#> smartctl /dev/sda -H
...
SMART overall-health self-assessment test result: PASSED

That's nice to hear, but a little thin. Let's see what data has been stored:

#> smartctl /dev/sda -A -l error
...

Let's store that to a file for later comparison:

#> smartctl /dev/sda -A -l error > sda.before

I highly recommend reading the -A, --attributes section of the man page to fully understand what all this means, and avoid shock reactions when misunderstanding certain labels or values.

You can try -f [brief|old|hex] for slightly different output; the new brief format decodes additional data in the FLAGS column.
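To make the column layout concrete, here is a minimal sketch that condenses the attribute table into one line per attribute. It assumes the default (old) output format with RAW_VALUE in the tenth column. The function names and the sample rows (apart from Power_On_Hours) are made up for illustration. The FAILED flag follows the man page's rule that an attribute has failed when its normalized VALUE is less than or equal to THRESH.

```shell
#!/bin/sh
# Sample rows in the layout of `smartctl -A` (old format). Illustrative
# values only, not taken from a real drive.
sample_attributes() {
cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10101
194 Temperature_Celsius     0x0022   064   052   000    Old_age   Always       -       36
EOF
}

# Print name, normalized value, threshold and raw value per attribute,
# and mark rows whose normalized VALUE is at or below THRESH.
summarize_attributes() {
    awk '$1 ~ /^[0-9]+$/ {
        status = ($4 + 0 <= $6 + 0) ? "FAILED" : "ok"
        printf "%s value=%d thresh=%d raw=%s %s\n", $2, $4, $6, $10, status
    }'
}

sample_attributes | summarize_attributes
# On a real system (as root): smartctl /dev/sda -A | summarize_attributes
```

Only the first word of RAW_VALUE is printed, which is enough for a quick overview but drops any vendor-specific extras after it.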

We will perform a long (extended) self-test. But first, let's see when the last such test was performed:

#> smartctl /dev/sda -a | grep -iE 'Hours|Extended'
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10101
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     10042         -
...

The first line tells us how long the device has been in operation: 10101 hours. The last line tells us at which lifetime hour the last extended test was performed: 10042 hours. In other words, the last extended test ran 59 operating hours ago.
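If you want to script that subtraction, the following is a rough sketch rather than a polished tool. It assumes the attribute table is in the default format (raw value in column 10) and that the LifeTime(hours) column directly follows the "Remaining" percentage, as in the output above. The function names are hypothetical.

```shell
#!/bin/sh
# Sample lines copied from the grep output shown in this article.
sample_output() {
cat <<'EOF'
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       10101
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     10042         -
EOF
}

# Print the operating hours elapsed since the newest extended self-test.
hours_since_extended_test() {
    awk '
        $2 == "Power_On_Hours" { now = $10 }
        # Log entries are listed newest first; use the first Extended entry.
        /^# *[0-9]+ +Extended/ && last == "" {
            for (i = 1; i < NF; i++)
                if ($i ~ /^[0-9]+%$/) { last = $(i + 1); break }
        }
        END { if (now != "" && last != "") print now - last }
    '
}

sample_output | hours_since_extended_test
# On a real system (as root): smartctl /dev/sda -a | hours_since_extended_test
```

If either value is missing from the input, the function prints nothing instead of guessing.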

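The test results are stored persistently in the drive's self-test log (for ATA drives, the man page describes it as covering the twenty-one most recent self-tests). You don't need to perform a new test to read out the most recent test's data.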

Let's assume that the last extended test was long ago, and we want to run it now. But first, let's make sure "autosave of device vendor-specific Attributes" is enabled:

#> smartctl /dev/sda -S on
...
SMART Attribute Autosave Enabled.

I had hoped this would yield extra information after performing a test, but I cannot find any; maybe it will show over time. The man page says that a) smartctl has no way of checking the status of this setting (whether it is turned on or not) and b) the setting "is preserved across disk power cycles, so you should only need to issue it once".

Now start the extended test:

#> smartctl /dev/sda -t long
...

All SMART tests are performed in the background. You can check how far the test has progressed with something like

#> smartctl -a /dev/sda | grep -A1 execution
Self-test execution status:      ( 244) Self-test routine in progress...
                                        40% of test remaining.

until it says ( 0) The previous self-test routine completed without error or no self-test has ever...
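Instead of re-running that command by hand, you could poll for it. This is only a sketch: the helper names and the five-minute interval are my own choices, and it relies on the "in progress" wording shown above, which may differ between drives.

```shell
#!/bin/sh
# Succeeds (exit status 0) while the smartctl output on stdin reports a
# running self-test.
selftest_in_progress() {
    grep -q 'Self-test execution status:.*in progress'
}

# Print the "NN%" remaining figure from the line after the status line.
selftest_remaining() {
    awk '/Self-test execution status/ {
        if ((getline line) > 0 && line ~ /% of test remaining/) {
            split(line, parts, " ")
            print parts[1]
        }
    }'
}

# Block until the self-test on the given device has finished.
# Needs root; 300 seconds is an arbitrary polling interval.
wait_for_selftest() {
    device=$1
    while smartctl -a "$device" | selftest_in_progress; do
        sleep 300
    done
    echo "self-test on $device is no longer running"
}

# Usage: wait_for_selftest /dev/sda
```

Note that the loop also ends if smartctl itself fails (for example when run without root), so check the self-test log afterwards rather than trusting the loop alone.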

Check results

#> smartctl /dev/sda -A -l error
...

That's the most important part. Again: read the -A, --attributes section of the man page to fully understand what all this means, and avoid shock reactions when misunderstanding certain labels or values.

In short, you usually want to pay closest attention to RAW_VALUE.

Store it to a file again, and compare with before:

#> smartctl /dev/sda -A -l error > sda.after
#> diff sda.before sda.after
...

I see nothing that worries me there, but of course yours may be different.
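Since diff shows whole lines, a narrower view of just the changed raw values can be easier to read. The sketch below assumes both files hold attribute tables in the default format, where the third column is a hex FLAG such as 0x0032; that check is also what keeps error log lines out of the comparison. The function name is hypothetical.

```shell
#!/bin/sh
# Print every attribute whose RAW_VALUE differs between two saved
# smartctl attribute listings (first argument: before, second: after).
raw_value_changes() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ {
            # RAW_VALUE starts at column 10 and may contain several words.
            raw = $10 ""
            for (i = 11; i <= NF; i++) raw = raw " " $i
            if (FILENAME == ARGV[1]) {
                before[$2] = raw
            } else if (($2 in before) && before[$2] != raw) {
                printf "%s: %s -> %s\n", $2, before[$2], raw
            }
        }
    ' "$1" "$2"
}

# Usage: raw_value_changes sda.before sda.after
```

Expect Power_On_Hours and temperature readings to show up here after any long test; those changes are normal.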

Some other values may have been logged, e.g.:

#> smartctl -l scttemp /dev/sda
...

The presence of such data varies from device to device. Please refer to the -l TYPE, --log=TYPE section of the man page.

Gsmartcontrol

I recommend installing gsmartcontrol. If you followed this article, it should look familiar already. In addition to doing much of the legwork for you, it offers valuable extra explanations when you hover over fields with your pointer.

Badblocks

SMART only goes so far. While I believe that it works, I want to make doubly sure and check for bad blocks again - with badblocks.

Its man page says:

Important note: If the output of badblocks is going to be fed to the e2fsck or mke2fs programs, it is important that the block size is properly specified, since the block numbers which are generated are very dependent on the block size in use by the filesystem. For this reason, it is strongly recommended that users not run badblocks directly, but rather use the -c option of the e2fsck and mke2fs programs.

After some deliberation I decided to run a read-only check with badblocks and to use e2fsck only if errors are found.

Both utilities are part of the e2fsprogs package.

The drive/filesystem in question needs to be unmounted, so I fired up my trusty SysRescueCD USB stick.

This was simple:

#> for x in a b c; do badblocks -v /dev/sd$x > badblocks.sd$x; done

All three files were empty, which means badblocks did not find any bad blocks.
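With more drives, checking each result file by eye gets tedious. Here is a small sketch that summarizes them; it assumes the files were written as in the loop above, with one bad block number per line, and the function name is my own.

```shell
#!/bin/sh
# Report how many bad blocks each badblocks result file lists.
# An empty file means that badblocks listed none.
report_badblocks() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "$f: file not found"
        elif [ -s "$f" ]; then
            echo "$f: $(wc -l < "$f" | tr -d ' ') bad block(s) listed"
        else
            echo "$f: no bad blocks listed"
        fi
    done
}

# Usage: report_badblocks badblocks.sda badblocks.sdb badblocks.sdc
```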

Further reading

Some helpful links:

https://askubuntu.com/questions/539184/how-do-i-check-the-integrity-of-a-storage-medium-hard-disk-or-flash-drive
https://www.thomas-krenn.com/en/wiki/SMART_tests_with_smartctl
https://blog.shadypixel.com/monitoring-hard-drive-health-on-linux-with-smartmontools/
https://www.maketecheasier.com/check-repair-filesystem-fsck-linux/
https://www.linuxtechi.com/check-hard-drive-for-bad-sector-linux/