[Nagiosplug-devel] [ nagiosplug-Bugs-2930789 ] check_ide_smart ignores SMART errors !

SourceForge.net noreply at sourceforge.net
Tue Jan 12 18:55:40 CET 2010


Bugs item #2930789, was opened at 2010-01-12 17:55
Message generated for change (Tracker Item Submitted) made by oernii
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=397597&aid=2930789&group_id=29880

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: General plugin execution
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Ernest Beinrohr (oernii)
Assigned to: Nobody/Anonymous (nobody)
Summary: check_ide_smart ignores SMART errors !

Initial Comment:
I recently started using this plugin and have now found that it is no good. Everything looks fine: the plugin checks the disks and reports on some tests, but now I do have a BAD drive. smartctl reports 7 uncorrectable errors, a clear sign of imminent failure, so I am replacing the drive. Yet check_ide_smart reports that everything is OK!

There is clearly a problem: the plugin should NOT ignore SMART errors. Detecting them is its main and only purpose. Here is the output of both check_ide_smart and smartctl for my /dev/sdd.

PS: check_ide_smart v1991 (nagios-plugins 1.4.13). I also tried nagios-plugins-1.4.14-61-g45e2.

$ /usr/lib/nagios/plugins/check_ide_smart  -d /dev/sdd
Id=  1, Status=15 {PreFailure , OnLine }, Value=114, Threshold=  6, Passed
Id=  3, Status= 3 {PreFailure , OnLine }, Value= 93, Threshold=  0, Passed
Id=  4, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 20, Passed
Id=  5, Status=51 {PreFailure , OnLine }, Value=100, Threshold= 36, Passed
Id=  7, Status=15 {PreFailure , OnLine }, Value= 43, Threshold= 30, Passed
Id=  9, Status=50 {Advisory    , OnLine }, Value= 90, Threshold=  0, Passed
Id= 10, Status=19 {PreFailure , OnLine }, Value=100, Threshold= 97, Passed
Id= 12, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 20, Passed
Id=184, Status=50 {Advisory    , OnLine }, Value=100, Threshold= 99, Passed
Id=187, Status=50 {Advisory    , OnLine }, Value= 93, Threshold=  0, Passed
Id=188, Status=50 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=189, Status=58 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=190, Status=34 {Advisory    , OnLine }, Value= 86, Threshold= 45, Passed
Id=194, Status=34 {Advisory    , OnLine }, Value= 14, Threshold=  0, Passed
Id=195, Status=26 {Advisory    , OnLine }, Value= 46, Threshold=  0, Passed
Id=197, Status=18 {Advisory    , OnLine }, Value=100, Threshold=  0, Passed
Id=198, Status=16 {Advisory    , OffLine}, Value=100, Threshold=  0, Passed
Id=199, Status=62 {Advisory    , OnLine }, Value=200, Threshold=  0, Passed
OffLineStatus=130 {Completed}, AutoOffLine=Yes, OffLineTimeout=10 minutes
OffLineCapability=123 {Immediate Auto SuspendOnCmd}
SmartRevision=10, CheckSum=161, SmartCapability=3 {SaveOnStandBy AutoSave}
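
The plugin's verdicts can be reproduced from its own output. Its Status column appears to be the SMART attribute flag word in decimal: Status=15 matches smartctl's FLAG 0x000f, and Status=50 matches 0x0032. In that word, bit 0 selects PreFailure or Advisory and bit 1 selects OnLine or OffLine. Every line is also consistent with a simple rule: an attribute is "Passed" while its Value is greater than its Threshold. The Python sketch below assumes exactly those two rules. `decode_status` and `verdict` are illustrative names, not functions from check_ide_smart. The sketch shows why the drive passes: only normalized values are compared. Attribute 187 (Reported_Uncorrect) has a threshold of 0, so its normalized value of 93 is nowhere near failing, even though smartctl shows a raw count of 7.

```python
# Illustrative decoder for the check_ide_smart lines above.
# ASSUMPTIONS (inferred from the output, not from the plugin source):
#   * Status is the SMART attribute flag word printed in decimal;
#     bit 0 set = PreFailure (else Advisory), bit 1 set = OnLine (else OffLine).
#   * An attribute is "Passed" while its normalized Value exceeds its Threshold.

PREFAILURE_BIT = 0x01
ONLINE_BIT = 0x02


def decode_status(status: int) -> str:
    """Render the flag word the way the plugin prints it."""
    kind = "PreFailure" if status & PREFAILURE_BIT else "Advisory"
    mode = "OnLine" if status & ONLINE_BIT else "OffLine"
    return f"{kind}, {mode}"


def verdict(value: int, threshold: int) -> str:
    """Normalized-value test only: raw counters are never consulted."""
    return "Passed" if value > threshold else "Failed"


# (Id, Status, Value, Threshold) copied from the /dev/sdd output above.
SAMPLE = [
    (1, 15, 114, 6),
    (5, 51, 100, 36),
    (10, 19, 100, 97),
    (187, 50, 93, 0),   # Reported_Uncorrect: smartctl shows raw 7, still "Passed"
    (198, 16, 100, 0),
]

if __name__ == "__main__":
    for attr_id, status, value, threshold in SAMPLE:
        print(f"Id={attr_id:3d} {{{decode_status(status)}}} "
              f"Value={value:3d} Threshold={threshold:3d} {verdict(value, threshold)}")
```
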


-------------------
$ smartctl -a /dev/sdd
smartctl version 5.38 [i586-mandriva-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     MAXTOR STM31000340AS
Serial Number:    9QJ1CCR0
Firmware Version: MX1A
User Capacity:    1 000 204 886 016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jan 12 18:46:47 2010 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 634) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 227) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       60566982
  3 Spin_Up_Time            0x0003   093   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   043   037   030    Pre-fail  Always       -       61272258239168
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9615
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   093   093   000    Old_age   Always       -       7
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       4295032834
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   086   049   045    Old_age   Always       -       14 (Lifetime Min/Max 4/30)
194 Temperature_Celsius     0x0022   014   051   000    Old_age   Always       -       14 (0 4 0 0)
195 Hardware_ECC_Recovered  0x001a   046   026   000    Old_age   Always       -       60566982
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 7 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 7 occurred at disk power-on lifetime: 9006 hours (375 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 03 b8 13 00  Error: UNC at LBA = 0x0013b803 = 1292291

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 fd b7 13 e0 00   2d+21:57:34.419  READ DMA
  27 00 00 00 00 00 e0 00   2d+21:57:34.417  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   2d+21:57:34.397  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+21:57:34.376  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   2d+21:57:34.336  READ NATIVE MAX ADDRESS EXT

Error 6 occurred at disk power-on lifetime: 9006 hours (375 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 03 b8 13 00  Error: UNC at LBA = 0x0013b803 = 1292291

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 fd b7 13 e0 00   2d+21:57:31.288  READ DMA
  27 00 00 00 00 00 e0 00   2d+21:57:31.287  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   2d+21:57:31.267  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+21:57:31.247  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   2d+21:57:31.196  READ NATIVE MAX ADDRESS EXT

Error 5 occurred at disk power-on lifetime: 9006 hours (375 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 03 b8 13 00  Error: UNC at LBA = 0x0013b803 = 1292291

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 fd b7 13 e0 00   2d+21:57:28.198  READ DMA
  27 00 00 00 00 00 e0 00   2d+21:57:28.197  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   2d+21:57:28.177  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+21:57:28.156  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   2d+21:57:28.116  READ NATIVE MAX ADDRESS EXT

Error 4 occurred at disk power-on lifetime: 9006 hours (375 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 03 b8 13 00  Error: UNC at LBA = 0x0013b803 = 1292291

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 fd b7 13 e0 00   2d+21:57:25.007  READ DMA
  27 00 00 00 00 00 e0 00   2d+21:57:25.005  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   2d+21:57:24.985  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+21:57:24.966  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   2d+21:57:24.907  READ NATIVE MAX ADDRESS EXT

Error 3 occurred at disk power-on lifetime: 9006 hours (375 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 02 b8 13 00  Error: UNC at LBA = 0x0013b802 = 1292290

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 fd b7 13 e0 00   2d+21:57:21.899  READ DMA
  27 00 00 00 00 00 e0 00   2d+21:57:21.897  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   2d+21:57:21.877  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   2d+21:57:21.860  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   2d+21:57:21.807  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

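Each error-log entry above can be checked by hand. With 28-bit LBA addressing, the sector address is split across the registers as follows:

* SN holds bits 0-7.
* CL holds bits 8-15.
* CH holds bits 16-23.
* The low nibble of DH holds bits 24-27.

This reproduces smartctl's "LBA = 0x0013b803 = 1292291". The Powered_Up_Time stamps are milliseconds since power-on. The "wraps after 49.710 days" note is consistent with a 32-bit millisecond counter, since 2^32 ms is about 49.710 days. The small Python sketch below works under those assumptions; its helper names are illustrative only.

```python
import re

# Illustrative helpers for checking smartctl's error-log entries by hand.
# ASSUMPTION: 28-bit LBA layout, i.e. SN = bits 0-7, CL = bits 8-15,
# CH = bits 16-23 and the low nibble of DH = bits 24-27.


def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Assemble a 28-bit LBA from the task-file register values."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_dump(dump: str) -> int:
    """Parse an 'ER ST SC SN CL CH DH' register row (hex bytes) from the log."""
    _er, _st, _sc, sn, cl, ch, dh = (int(byte, 16) for byte in dump.split()[:7])
    return lba28(sn, cl, ch, dh)


# Powered_Up_Time format documented in the log: DDd+hh:mm:SS.sss
UPTIME = re.compile(r"^(\d+)d\+(\d{2}):(\d{2}):(\d{2})\.(\d{3})$")


def uptime_ms(stamp: str) -> int:
    """Convert a Powered_Up_Time stamp to milliseconds since power-on."""
    m = UPTIME.match(stamp)
    if m is None:
        raise ValueError(f"unexpected Powered_Up_Time: {stamp!r}")
    days, hours, minutes, secs, millis = (int(g) for g in m.groups())
    return (((days * 24 + hours) * 60 + minutes) * 60 + secs) * 1000 + millis


# A 32-bit millisecond counter wraps after 2**32 ms; expressed in days.
WRAP_DAYS = 2**32 / (24 * 60 * 60 * 1000)

if __name__ == "__main__":
    row = "40 51 00 03 b8 13 00"  # register row of Error 7
    print(hex(lba_from_dump(row)), lba_from_dump(row))
    print(uptime_ms("2d+21:57:34.419") - uptime_ms("2d+21:57:31.288"), "ms between errors 6 and 7")
    print(f"counter wraps after {WRAP_DAYS:.3f} days")
```

Applied to the log, all five entries are READ DMA failures at LBA 1292290 or 1292291, spaced about 3.1 s apart. This looks like repeated retries of the same unreadable spot.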

----------------------------------------------------------------------
