Thursday, February 12, 2009

Disk Health Monitoring With Smartmon

Today I decided to load the smartmontools package on one of my Solaris 10 file servers. This toolset lets an administrator make use of the monitoring features built into S.M.A.R.T.-capable hard drives. My goal is to configure a storage server to run the smartd daemon and email me when a disk starts throwing errors. Hopefully, this will help me preemptively replace disks before a failure occurs.

Before we begin, I have to give credit to "Matty" for both of these posts: Blog O' Matty #1 and Blog O' Matty #2. Without them, I would probably still be trying to figure this out.

Here are the refined steps I used to set this up on one of my Solaris 10 storage systems. Enjoy.

Installing Smartmon on Solaris 10

Download smartmontools from SourceForge, then unpack, build, and install it:

# wget http://downloads.sourceforge.net/smartmontools/smartmontools-5.38.tar.gz

# gunzip -c smartmontools-5.38.tar.gz | tar xvf -

# cd smartmontools-5.38

# ./configure --sbindir=/usr/sbin \
--sysconfdir=/etc \
--mandir=/usr/share/man \
--with-docdir=/usr/share/doc/smartmontools-5.38 \
--with-initscriptdir=/etc/init.d


# make


# su

# make install


Create three wrapper scripts in /usr/local/bin (smartd.start, smartd.stop, and smartd.restart) and make each one executable with chmod +x:

#!/bin/sh
/etc/init.d/smartd start

#!/bin/sh
/etc/init.d/smartd stop

#!/bin/sh
/etc/init.d/smartd restart
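
If you would rather not create the three files by hand, here is a minimal sketch that generates them in a loop. The function name make_wrappers is my own invention, and the optional directory argument exists only so the helper can be tried somewhere other than /usr/local/bin first.

```shell
#!/bin/sh
# make_wrappers is a hypothetical helper, not part of smartmontools.
# It writes smartd.start, smartd.stop and smartd.restart into the given
# directory (default: /usr/local/bin) and marks each one executable.
make_wrappers() {
    dir=${1:-/usr/local/bin}
    for action in start stop restart; do
        # Each wrapper simply calls the init script with one action
        printf '#!/bin/sh\n/etc/init.d/smartd %s\n' "$action" > "$dir/smartd.$action" || return 1
        chmod 755 "$dir/smartd.$action" || return 1
    done
}
```

Run it as root with no argument to write into /usr/local/bin, or pass a scratch directory to inspect the output first.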


Now create a new SMF manifest, an XML file called "smartd.xml":

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type='manifest' name='smartd'>

<service
    name="application/smartd"
    type="service"
    version="1">

    <create_default_instance enabled="true"/>

    <exec_method
        type='method'
        name='start'
        exec='/usr/local/bin/smartd.start'
        timeout_seconds='3'>
    </exec_method>

    <exec_method
        type='method'
        name='stop'
        exec='/usr/local/bin/smartd.stop'
        timeout_seconds='3'>
    </exec_method>

    <exec_method
        type='method'
        name='restart'
        exec='/usr/local/bin/smartd.restart'
        timeout_seconds='3'>
    </exec_method>

</service>

</service_bundle>


Save the file and test it with svccfg:

# svccfg validate smartd.xml
# echo $?
0


If you get the utterly unhelpful error "svccfg: couldn't parse document", use xmllint to find the offending portion:

# xmllint -valid smartd.xml

Correct any errors and revalidate with svccfg.
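
The two validation steps can be folded into one helper. This is only a sketch: validate_manifest is a name I made up, and it assumes both svccfg and xmllint are on the PATH. It leans on the same exit status that the "echo $?" check above inspects.

```shell
#!/bin/sh
# validate_manifest is a hypothetical wrapper around the two commands above.
# Usage: validate_manifest smartd.xml
validate_manifest() {
    manifest=$1
    if svccfg validate "$manifest"; then
        echo "$manifest: OK"
    else
        # svccfg's parse error is terse, so let xmllint point at the
        # offending line; --noout suppresses the echo of the document.
        xmllint --valid --noout "$manifest"
        return 1
    fi
}
```

A return status of 0 means the manifest is ready for import. Anything else means xmllint has been asked to print more detail.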

Now import the new manifest:
# svccfg import smartd.xml

List the properties of the new service for verification:
# svccfg -s application/smartd listprop

Edit /etc/smartd.conf to your liking, so that it will run whatever tests you require for your environment. For my purposes, I simply added a line for every disk in the server:

/dev/rdsk/c1t0d0 -d scsi -o on -a
/dev/rdsk/c1t1d0 -d scsi -o on -a
/dev/rdsk/c1t2d0 -d scsi -o on -a
/dev/rdsk/c1t3d0 -d scsi -o on -a
/dev/rdsk/c3t0d0 -d scsi -S on -o on -a
/dev/rdsk/c3t1d0 -d scsi -S on -o on -a
/dev/rdsk/c3t2d0 -d scsi -S on -o on -a
/dev/rdsk/c3t3d0 -d scsi -S on -o on -a
/dev/rdsk/c3t4d0 -d scsi -S on -o on -a
/dev/rdsk/c3t5d0 -d scsi -S on -o on -a
/dev/rdsk/c3t8d0 -d scsi -S on -o on -a
/dev/rdsk/c3t9d0 -d scsi -S on -o on -a
/dev/rdsk/c3t10d0 -d scsi -S on -o on -a
/dev/rdsk/c3t11d0 -d scsi -S on -o on -a
/dev/rdsk/c3t12d0 -d scsi -S on -o on -a
/dev/rdsk/c3t13d0 -d scsi -S on -o on -a
/dev/rdsk/c3t14d0 -d scsi -S on -o on -a
/dev/rdsk/c3t15d0 -d scsi -S on -o on -a
/dev/rdsk/c5t0d0 -d scsi -S on -o on -a
/dev/rdsk/c5t1d0 -d scsi -S on -o on -a
/dev/rdsk/c5t2d0 -d scsi -S on -o on -a
/dev/rdsk/c5t3d0 -d scsi -S on -o on -a
/dev/rdsk/c5t4d0 -d scsi -S on -o on -a
/dev/rdsk/c5t5d0 -d scsi -S on -o on -a
/dev/rdsk/c5t8d0 -d scsi -S on -o on -a
/dev/rdsk/c5t9d0 -d scsi -S on -o on -a
/dev/rdsk/c5t10d0 -d scsi -S on -o on -a
/dev/rdsk/c5t11d0 -d scsi -S on -o on -a
/dev/rdsk/c5t12d0 -d scsi -S on -o on -a
/dev/rdsk/c5t13d0 -d scsi -S on -o on -a
/dev/rdsk/c5t14d0 -d scsi -S on -o on -a
/dev/rdsk/c5t15d0 -d scsi -S on -o on -a
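
With this many disks, typing the entries by hand gets tedious. Here is a rough sketch that builds them from a list of disk names. gen_smartd_conf is a hypothetical helper of my own, and the default directives simply mirror the first four lines above. Pass a different directive string as the first argument for disks that should also get "-S on".

```shell
#!/bin/sh
# gen_smartd_conf is a hypothetical helper: it reads disk names (for
# example c1t0d0) on stdin, one per line, and prints one smartd.conf
# entry per disk. An optional first argument overrides the directives.
gen_smartd_conf() {
    directives="-d scsi -o on -a"
    if [ -n "$1" ]; then
        directives=$1
    fi
    while read disk; do
        # Skip blank lines in the input
        [ -n "$disk" ] || continue
        printf '/dev/rdsk/%s %s\n' "$disk" "$directives"
    done
}
```

For example, "printf 'c1t0d0\nc1t1d0\n' | gen_smartd_conf" prints the first two entries shown above. Review the output before appending it to /etc/smartd.conf.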

Enable the new service and verify that it's running as expected:
# svcadm enable application/smartd
# ps -elf |grep smartd
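
Besides looking for the process, you can ask SMF itself what state the service is in. The following is a small sketch, with check_smartd being a hypothetical name. It relies on "svcs -H -o state" printing only the state column, and on "svcs -x" to explain a service that is not online.

```shell
#!/bin/sh
# check_smartd is a hypothetical helper: it reports whether SMF considers
# application/smartd online, and asks svcs to explain if it is not.
check_smartd() {
    # -H drops the header line, -o state prints only the state column
    state=`svcs -H -o state application/smartd`
    if [ "$state" = "online" ]; then
        echo "smartd is online"
    else
        echo "smartd is in state: $state"
        svcs -x application/smartd
        return 1
    fi
}
```

A service stuck in "maintenance" usually means one of the method scripts failed or timed out, and the "svcs -x" output points at the relevant log file.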


Originally, I was going to use the "-m" directive to send email alerts, but I found that smartd works quite well with syslog. Since I already have a centralized syslog server, I'll just add a swatch rule that watches for smartd entries and sends email alerts from there.
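
Before writing the swatch rule, it helps to see what smartd is actually logging. This is only a rough stand-in for that rule, not swatch itself: smartd_entries is a hypothetical helper, and the default path /var/adm/messages assumes the stock Solaris syslog setup. On a central syslog server the file will likely live elsewhere.

```shell
#!/bin/sh
# smartd_entries is a hypothetical helper: it prints only the lines of a
# syslog file that carry the "smartd[PID]:" tag.
smartd_entries() {
    # Assumed default log location; pass another file as the first argument
    logfile=${1:-/var/adm/messages}
    grep 'smartd\[[0-9]*]:' "$logfile"
}
```

The same pattern is a reasonable starting point for whatever the swatch rule ends up matching on.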
