[Nagiosplug-devel] Problem with check_ide_smart

Alain Williams addw at phcomp.co.uk
Tue Feb 3 11:36:01 CET 2009


Hi,

I was monitoring a disk with check_ide_smart and did not receive any warnings.
The disk then went bad (fortunately it was in a Linux RAID, so I did not lose anything).
My logs show that the error first became apparent several weeks earlier, but I
did not pick it up because of email issues (the machine was being commissioned).

Is this a known issue?



Please find below the output of the command:

	smartctl -a -d ata /dev/sda

The line that is causing concern is the one that starts:

	198 Offline_Uncorrectable

The disk in error is a Seagate Barracuda.
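
To illustrate the kind of check that would flag that line, here is a minimal
Python sketch that scans the attribute table printed by smartctl and reports
a non-zero RAW_VALUE for attributes 5, 197 and 198. It is not how
check_ide_smart works; the function name, the set of watched IDs and the
"raw value greater than zero" rule are assumptions made for the example, and
the column layout is taken from the smartctl 5.36 output quoted below.

```python
# Hypothetical helper: flag non-zero raw counts for sector-related attributes.
# Assumes the 10-column attribute table layout shown in the smartctl output.
WATCHED_IDS = {5, 197, 198}  # reallocated, pending, offline-uncorrectable


def find_sector_warnings(smartctl_output):
    """Return (id, name, raw_value) for watched attributes with raw > 0."""
    warnings = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[2].startswith("0x"):
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip it
        if raw > 0:
            warnings.append((attr_id, fields[1], raw))
    return warnings


# Three rows copied from the attribute table in this message.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    for attr_id, name, raw in find_sector_warnings(SAMPLE):
        print(f"WARNING: {attr_id} {name} raw value {raw}")
```

Note that in the output below both attributes have a THRESH of 000 and the
overall self-assessment is PASSED, so a check based only on normalised values
and thresholds would see nothing wrong with them.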

I tried to rewrite the entire disk, but the error showed up again soon after.

I ran a badblocks check (also below) and it reported many bad blocks.
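
The badblocks numbers and the LBA in the SMART error log appear to describe
the same area of the disk. A small sketch of the conversion follows, assuming
the drive reports 512-byte sectors and that badblocks used its default
1024-byte block size (no -b option was given on the command line below); the
function name is made up for the example.

```python
# Convert a sector LBA from the SMART error log to a badblocks block number.
# Assumptions: 512-byte sectors, badblocks default block size of 1024 bytes.
SECTOR_BYTES = 512
BADBLOCKS_BLOCK_BYTES = 1024


def lba_to_badblocks_block(lba, sector_bytes=SECTOR_BYTES,
                           block_bytes=BADBLOCKS_BLOCK_BYTES):
    """Return the badblocks block that contains the given sector."""
    return (lba * sector_bytes) // block_bytes


if __name__ == "__main__":
    # LBA reported by "Error: UNC at LBA = 0x00b33930 = 11745584"
    print(lba_to_badblocks_block(0x00b33930))  # prints 5872792
```

Block 5872792 is one of the blocks listed by badblocks. The same block size is
consistent with the block count it prints: 976762584 blocks of 1024 bytes is
1,000,204,886,016 bytes, the User Capacity reported by smartctl.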


Regards


**************** Output of smartctl below

smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000340AS
Serial Number:    9QJ1Z6NL
Firmware Version: SD15
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Not recognized. Minor revision code: 0x29
Local Time is:    Mon Jan  5 13:57:56 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 634) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 226) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   096   006    Pre-fail  Always       -       83694887
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       63
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       29792870
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1410
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       63
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Unknown_Attribute       0x0032   016   016   000    Old_age   Always       -       84
188 Unknown_Attribute       0x0032   100   098   000    Old_age   Always       -       4295032856
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   071   047   045    Old_age   Always       -       538640413
194 Temperature_Celsius     0x0022   029   053   000    Old_age   Always       -       29 (Lifetime Min/Max 0/10)
195 Hardware_ECC_Recovered  0x001a   042   024   000    Old_age   Always       -       83694887
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 84 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 84 occurred at disk power-on lifetime: 1409 hours (58 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 39 b3 00  Error: UNC at LBA = 0x00b33930 = 11745584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 39 b3 e0 00  27d+17:43:19.816  READ DMA
  27 00 00 00 00 00 e0 00  27d+17:43:19.815  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  27d+17:43:19.812  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  27d+17:43:19.793  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  27d+17:43:19.792  READ NATIVE MAX ADDRESS EXT

Error 83 occurred at disk power-on lifetime: 1409 hours (58 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 39 b3 00  Error: UNC at LBA = 0x00b33930 = 11745584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 39 b3 e0 00  27d+17:43:16.833  READ DMA
  27 00 00 00 00 00 e0 00  27d+17:43:16.833  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  27d+17:43:16.830  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  27d+17:43:16.811  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  27d+17:43:16.811  READ NATIVE MAX ADDRESS EXT

Error 82 occurred at disk power-on lifetime: 1409 hours (58 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 39 b3 00  Error: UNC at LBA = 0x00b33930 = 11745584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 39 b3 e0 00  27d+17:43:13.861  READ DMA
  27 00 00 00 00 00 e0 00  27d+17:43:13.860  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  27d+17:43:13.857  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  27d+17:43:13.838  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  27d+17:43:13.837  READ NATIVE MAX ADDRESS EXT

Error 81 occurred at disk power-on lifetime: 1409 hours (58 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 39 b3 00  Error: UNC at LBA = 0x00b33930 = 11745584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 39 b3 e0 00  27d+17:43:10.878  READ DMA
  27 00 00 00 00 00 e0 00  27d+17:43:10.878  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  27d+17:43:10.875  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  27d+17:43:10.855  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  27d+17:43:10.855  READ NATIVE MAX ADDRESS EXT

Error 80 occurred at disk power-on lifetime: 1409 hours (58 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 30 39 b3 00  Error: UNC at LBA = 0x00b33930 = 11745584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 30 39 b3 e0 00  27d+17:43:07.903  READ DMA
  27 00 00 00 00 00 e0 00  27d+17:43:07.903  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  27d+17:43:07.900  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  27d+17:43:07.897  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  27d+17:43:07.885  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1398         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

**************** Badblocks output:

[root@bfpsev log]# badblocks -s -v /dev/sda
Checking blocks 0 to 976762584
Checking for bad blocks (read-only test): 5872768 5872768/      976762584
5872792 5872792/      976762584
5872793 5872793/      976762584
5872794 5872794/      976762584
5872795 5872795/      976762584
5872796
5872797
5872798
5872799
5872800
5872801
5872802
5872803
5872804
5872805
5872806
5872807
5872808
5872809
5872810
5872811
5872812
5872813
5872814
5872815
5872816
5872817
5872818
5872819
5872820
5872821
5872822
5872823
5872824
5872825
5872826
5872827
5872828
5872829
5872830
5872831
5872832
5872833
5872834
5872835
5872836
5872837
5872838
5872839
5872840
5872841
5872842
5872843
5872844
5872845
5872846
5872847
5872848
5872849
5872850
5872851
5872852
5872853
5872854
5872855
5872856
5872857
5872858
5872859
5872860
5872861
5872862
5872863
5872864
5872865
5872866
5872867
5872868
5872869
5872870
5872871
5872872
5872873
5872874
5872875
5872876
5872877
5872878
5872879
5872880
5872881
5872882
5872883
5872884
5872885
5872886
5872887
5872888
5872889
5872890
5872891
5872892
5872893
5872894
5872895
5872896
5872897
5872898
5872899


-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: http://www.phcomp.co.uk/contact.php
Past chairman of UKUUG: http://www.ukuug.org/
#include <std_disclaimer.h>



