Thursday, November 20, 2008

Kernel Panic - Oh No!

Well, it appears that the newly upgraded V240 that I was so impressed with crashed last night. It came right back up and hasn't had any issues since, but the fact that it happened at all is disturbing. There was only one user connected, and one job running at the time. Backups were running too. If anyone out there is proficient with picking through dump files, here's some mdb output for you to enjoy:

# dumpadm
Dump content: kernel pages
Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/sunfire
Savecore enabled: yes
# cd /var/crash/sunfire/
# ls
bounds unix.0 vmcore.0
# mdb 0
Loading modules: [ unix genunix specfs dtrace zfs sd pcisch ip hook neti sctp arp usba fcp fctl qlc nca lofs mpt md cpc random crypto wrsmd fcip logindmux ptm ufs sppp nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from sunfire
operating system: 5.10 Generic_137137-09 (sun4u)
panic message: BAD TRAP: type=31 rp=2a1009768e0 addr=0 mmu_fsr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
> ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 834712 6521 81%
Anon 97092 758 9%
Exec and libs 3492 27 0%
Page cache 3202 25 0%
Free (cachelist) 1543 12 0%
Free (freelist) 88943 694 9%

Total 1028984 8038
Physical 1025981 8015
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0000183bb88 1b 1 0 105 no no t-0 2a100977ca0 sched
1 0000180c000 1d 1 0 41 yes no t-0 30014117080 sas.e9bb95
> panic_thread/J
panic_thread:
panic_thread: 2a100977ca0
> 2a100977ca0::findstack
stack pointer for thread 2a100977ca0: 2a100975d51
000002a100975e01 die+0x78()
000002a100975ee1 trap+0x9e0()
000002a100976031 ktl0+0x48()
000002a100976181 ip_wput_ioctl+0xc4()
000002a100976231 tcp_xmit_early_reset+0x6b8()
000002a100976341 tcp_xmit_listeners_reset+0x1f4()
000002a100976411 ip_tcp_input+0xaf8()
000002a1009764f1 ip_input+0xa70()
000002a100976661 putnext+0x218()
000002a100976711 ce_intr+0x764c()
000002a1009771e1 pci_intr_wrapper+0xb8()
000002a100977291 intr_thread+0x168()
> $r
%g0 = 0x0000000000000000 %l0 = 0x0000060016e5eef0
%g1 = 0x00000000000001c0 %l1 = 0x000000007be78638 ip_ire_delete
%g2 = 0x0000000000005316 %l2 = 0x000000007001ac00 ip_areq_template+0x24
%g3 = 0x000006001f298254 %l3 = 0x0000000000005000
%g4 = 0x000006001f2981f0 %l4 = 0x0000000000000006
%g5 = 0x000006001f2981f0 %l5 = 0x000000007be783a8 ip_ire_advise
%g6 = 0x0000000000000010 %l6 = 0x000000007001aca8 ip_ioctl_ftbl+0x30
%g7 = 0x000002a100977ca0 %l7 = 0x000006001f2981f0

%o0 = 0x0000000000000000 %i0 = 0x00000600421dfb00
%o1 = 0x000002a100977ca0 %i1 = 0x0000060016e01380
%o2 = 0x0000000000000001 %i2 = 0x0000060016dcd0c0
%o3 = 0x0000000000005316 %i3 = 0x0000000000000000
%o4 = 0x0000000000000000 %i4 = 0x0000060011003e48
%o5 = 0x0000000000000064 %i5 = 0x0000000000000000
%o6 = 0x000002a100976181 %i6 = 0x000002a100976231
%o7 = 0x000000007be6a120 ip_wput_ioctl+0xc4 %i7 = 0x000000007bed8b94 tcp_xmit_early_reset+0x6b8

%ccr = 0x44 xcc=nZvc icc=nZvc
%fprs = 0x00 fef=0 du=0 dl=0
%asi = 0x80
%y = 0x0000000000000000
%pc = 0x0000000001047824 mutex_enter+4
%npc = 0x0000000001047828 mutex_enter+8
%sp = 0x000002a100976181 unbiased=0x000002a100976980
%fp = 0x000002a100976231

%tick = 0x0000000000000000
%tba = 0x0000000000000000
%tt = 0x31
%tl = 0x0
%pil = 0x6
%pstate = 0x016 cle=0 tle=0 mm=TSO red=0 pef=1 am=0 priv=1 ie=1 ag=0

%cwp = 0x04 %cansave = 0x00
%canrestore = 0x00 %otherwin = 0x00
%wstate = 0x00 %cleanwin = 0x00
> 2a100977ca0::thread -p
ADDR PROC LWP CRED
000002a100977ca0 1839750 60015eee058 60011003e48
> 1839750::ptree
0000000001839750 sched
0000060013401848 fsflush
0000060013402468 pageout
0000060013403088 init
000006001bd804b8 bpbkar
00000600183879b0 bpbkar
0000030015371ab8 bpbkar
00000300228a5238 bpbkar
0000030035cae210 bpbkar
0000030035ee12a8 bpbkar
0000030034cf8668 bpbkar
000006002dfce180 bpbkar
000003001e6b4e58 bpbkar
000006001b98e4a8 java
000006001bd7e058 dtlogin
000006001b8910c0 fmd
0000060019aeec48 snmpXdmid
000006001b98d888 dmispd
00000600145bf850 vold
000006001b7d0038 snmpdx
000006001b98c048 sendmail
000006001b9f50d0 snmpd
000006001b98f0c8 sendmail
00000600147f3098 syslogd
000006001b8904a0 sshd
000006001b7d1878 automountd
000006001aa9c030 automountd
000006001993b860 smcboot
000006001993a020 smcboot
000006001aa9e490 smcboot
000006001b7d30b8 utmpd
0000060019aee028 inetd
0000030031aa5270 in.telnetd
0000030034f212b0 ksh
0000030026e2bab0 sas.e9bb95
000006001667cda8 elssrv
000006001b9f44b0 in.telnetd
0000030027c7a6c8 ksh
00000600291e0220 sas.e9bb95
000006001b5460e8 elssrv
000006001b98cc68 in.telnetd
000006001b7d2498 ksh
0000060016667990 sas.e9bb95
0000060015f00db0 elssrv
000003002b6a46a8 in.telnetd
> $c
mutex_enter+4(600421dfb00, 60016e01380, 60016dcd0c0, 0, 60011003e48, 0)
tcp_xmit_early_reset+0x6b8(7be25368, 0, 6001f2981f0, 10, 0, 0)
tcp_xmit_listeners_reset+0x1f4(6001c73da80, 14, 0, 60013130000, 60033df1d40, b88c608d)
ip_tcp_input+0xaf8(18, 60015f1ee10, 30000d98068, 60033df1d40, 0, 30000d98068)
ip_input+0xa70(60015f1ee10, 0, 0, 30000d98068, 0, 0)
putnext+0x218(600143b6ed0, 600143b6ce0, 6001c73da80, 100, 600143b6a50, 0)
ce_intr+0x764c(1069128, 0, 6001c73da80, 11999b8, 600143b6a50, 600141eb700)
pci_intr_wrapper+0xb8(60014b12420, 300000b8148, 0, 0, 60014bd9548, 0)
intr_thread+0x168(ffffffff75702bdc, ffffffff7a9263a4, 4, 0, 0, 3)
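
For anyone who wants to dig further into this one, a couple of other dcmds are worth running against the same dump (output omitted here). ::panicinfo summarizes the panicking CPU's registers and the panic string in one shot, and ::msgbuf shows the kernel messages logged right before the panic:

> ::panicinfo
> ::msgbuf

The $c output above is the part that bothers me: the bad dereference happens in mutex_enter(), called from the TCP listener-reset path underneath the ce driver's interrupt handler, so my next move is to see whether any current kernel or ce patches touch that code path.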

Solaris 10, ZFS, and Dell JBODs - Redux - Update

As promised, here are the results of adding a second disk array to the SunFire V240. Interestingly, I didn't really see much of a performance boost by adding the second array:


The iozone charts above show the sequential read and write performance with two Dell PV220s disk arrays attached. Write performance stayed pretty much the same, and read performance only improved by about 200MB/second. Now, both of these arrays are plugged into the same SCSI controller, so either I've reached the capacity of the SCSI card or the server's PCI bus.

I think it's far more likely that the SCSI controller is simply doing all it can to keep up. I am now looking into moving the second disk array to its own SCSI controller. I expect this will yield yet another boost in performance. I'm not sure about the ramifications of trying to migrate 14 out of 28 disks in a single zpool to another controller. I'll have to research that one a bit.
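
For what it's worth, my understanding is that ZFS identifies member disks by the labels it writes on them rather than by their controller paths, so moving half the pool should come down to an export and an import once the recabling is done. Roughly (untested, and assuming this is the same sbidata pool described in the post below):

zpool export sbidata
(shut down, move one PV220S to the new controller, boot)
zpool import sbidata
zpool status sbidata

The import should find the 14 relocated disks under their new c#t#d# names on its own. That's the theory, anyway; I'll verify it on spare gear before touching this box.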

Friday, November 14, 2008

New Backup System

Our backup system is starting to get a little "long in the tooth", if you know what I mean. Currently we've got a single Windows 2003 server running Netbackup 5.0. It sends the incremental backups to a fibre-attached EMC CX300, and then dups them off to tape. The weekly and monthly fulls go straight to tape on an Adic Scalar 100 with 5 LTO2 drives, which is also connected via fibre.

The CX300 is no longer under warranty, and disks are going offline left and right. Dell wants $16K to renew maintenance and another few thousand to upgrade its firmware to the current revision. Unfortunately, with the CX300 only being used for d2d backups, it's really not worth the money to renew and update the thing.

I also hate running Netbackup on Windows. I love Netbackup, but I just feel that I could better leverage its capabilities on a Solaris system.

And finally, we spend a ton of money on offsite tape storage every year.

So to sum up, I want to:
- Get off of the CX300
- Upgrade from Netbackup 5 to Netbackup 6.5
- Use Solaris instead of Windows for the master server
- Drastically reduce offsite tape storage services

If I go with a non-SPARC server, I can run Solaris on x86 and save some money. I figure something like a Dell PowerEdge 2970, which only runs about $4K.

I did some searching and found that XStore carries a 24-disk SAS/SATA JBOD chassis for about $2500. I can get some Seagate 1TB enterprise-class disks for $205 apiece, probably less if I buy in bulk. This will set me up with a smoking-fast 24TB d2d system for under $13K. Tack on the Netbackup upgrade for another $12K and we're up to $25,000.
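
To spell out the back-of-the-envelope math (list prices, no bulk discount figured in):
- 24-disk JBOD chassis: ~$2,500
- 24 x 1TB Seagate disks at $205 each: ~$4,920
- Dell PowerEdge 2970 (or similar) master server: ~$4,000
That's roughly $11,400 in hardware, which leaves room for an HBA and cables under the $13K mark, and the ~$12K Netbackup 6.5 upgrade brings the project to about $25,000.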

This takes care of the first three issues I have with the existing backup system. Now for the offsite storage. We've been with the same vendor for about four years now, and we coincidentally only have four years of backups stored there. This year will cost us a little less than $30,000 to store our tapes offsite. For legal reasons we need to retain seven years of backups, so we can assume that our annual spend for offsite storage will nearly double by the end of 2011, when we'll have seven years' worth of tapes in storage instead of four. So, it's definitely worth it to look into an alternative.

To tackle this, I'm currently thinking about a software-based, block-level data mirroring solution. Something like Double-Take might do the trick. We've used it in the past, but only to keep user data on two file servers synchronized over a WAN. Generic user data is a lot different from compressed d2d backup images, so I'm not sure how viable a solution like this really is.
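
One alternative I keep coming back to: the new master servers would be Solaris boxes, and the d2d storage would almost certainly end up on ZFS, so I could skip the third-party mirroring and just ship incremental ZFS snapshots across the WAN. A rough sketch, with made-up pool and dataset names and no testing against real d2d image data:

zfs snapshot backups/images@2008-11-14
zfs send -i backups/images@2008-11-13 backups/images@2008-11-14 | \
    ssh dr-master zfs receive -F backups/images

Whether that moves compressed backup images across a WAN any more efficiently than Double-Take does is exactly the kind of thing I'd have to test first.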

If the synchronization does work, though, I will look at building out two of these servers, each with two of the SATA arrays. This should provide me with the ability to store all daily incrementals as well as weekly fulls on disk. I'll only have to offsite the monthly fulls. I figure I can just send the NY tapes to VA, and the VA tapes to NY. Or I could just keep the monthlies at the existing offsite storage vendor.

Anyway, with the synchronization piece, the project cost now jumps to about $55,000. If I can manage to trade in or sell the CX300 and reuse existing servers, this will reduce the capital outlay even more. I would love to have an ROI of less than 12 months.

I'll post more as I refine my plan. Input is certainly welcome.

Solaris 10, ZFS, and Dell JBODs - Redux

As the follow-on to my previous post, I just completed another project which again shows that ZFS with Dell JBODs just makes sense.

Another team of SAS users has a SunFire V240 running Solaris 9 with about the worst disk configuration I have ever seen. The performance was absolutely awful. Here's the scenario:

- One 1Gb fibre channel connection to an EMC Symmetrix 8530.
- The 8530 served up 17 concatenated disk pairs as, get this, 116 9GB logical disks. WTF??
- These 116 9GB logical disks were then combined into 13 RAID5 groups. Try making sense of iostat -xn output with that crap!
- One 2Gb fibre channel connection to a 4-disk (300GB, 10K RPM) RAID5 group on an EMC CX300.

To give you an idea of the performance, here are the iozone read/write results under that config:


350MB/sec write speed?! Blech! And the read is only a paltry 900MB/sec. With that many spindles on FC, this thing should scream. Oh, and this system is used for data mining of all things.

I was constrained by the parameters of the project as well. I couldn't buy a new server, and we were phasing out both the Symmetrix and the CX300. I needed a lot of disk and it had to perform.

I really liked the performance gain we got with the SunFire V445 and Dell JBODs in the previously posted project, so I decided to go with a similar config.

So, I ran a full backup of the server and then shut everything down. Disconnected the FC cables and pulled the HBAs. Dropped in two shiny new SCSI controllers, connected the PV220S arrays, and fired it up. It was only then that I realized I had two arrays of different speeds: one was U320 and the other was U160. It was do or die time, so I proceeded with just one array. If performance was poor, I could get the second array upgraded to U320 in just a few days.

I loaded Solaris 10 on the system, using the cool new ability to boot from a ZFS root. This time, I created a zpool with two 7-disk raidz vdevs:

zpool create sbidata raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t8d0 raidz c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0

I then restored the passwd, shadow, and group files, the home directories, the SAS application, and the data from the old system, brought everything up, and ran some tests. The iozone results with just the single 14-disk JBOD were staggering:



Unbelievable. Write performance had more than tripled and read performance doubled! Keep in mind, I went from two EMC fibre channel arrays to a single, 14-disk SCSI JBOD. The previous configuration was just that bad.
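
I haven't been pasting the actual iozone command lines into these posts, but a run along these general lines (the flags are an approximation rather than a copy from my notes, and /sbidata is just the pool's default mountpoint) will produce the same kind of sequential read/write throughput charts:

iozone -R -a -i 0 -i 1 -g 4g -f /sbidata/iozone.tmp -b iozone-results.xls

That runs the write/rewrite and read/reread tests across a range of file and record sizes and writes out an Excel-compatible report that graphs easily.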

Anyway, batch jobs that took 11 hours now take only 5. User-driven job times have been cut by as much as 80% in some cases, and by at least 66% in most. I just got the parts to upgrade the second JBOD to U320, and will make the change tomorrow morning. I will post the new iozone results when I'm done.

I can't wait to see what the performance looks like tomorrow afternoon!

Solaris 10, ZFS, and Dell JBODs

For everyone else out there who is trying to do more with less, I thought I would post some of my projects that I feel really had a lot of "bang for the buck".

The first in the series concerns Solaris 10 and my new favorite filesystem, ZFS. In the spring of '07, I was charged with migrating a team of SAS users off of a Sun V240 and onto a larger V445. Not only did they need a good amount of disk space, but performance was a critical factor. Their data would also likely grow about 8-12% per year.

The problem was, there just wasn't money in the budget for an expensive SAN. So, I started testing out ZFS in the lab with spare equipment and was amazed at the performance.

After enough testing, I decided to go with the SunFire V445 and two Dell PowerVault 220S JBOD arrays. I loaded each 220S with 14 U320 146GB SCSI disks and direct-attached each array to a separate SCSI controller on the server.

Now, at the time I was still very new to ZFS and did not choose an optimal configuration. I figured that more spindles in a RAID array meant better performance, so I assigned 21 of the 28 disks to the zpool as a single raidz vdev:

zpool create sbimktg raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t8d0

And, voila! A shiny new ZFS raidz for the marketing folks. I then ran iozone to get an idea of performance, and things looked great:



I know, I know: with the components I used, I should be able to reconfigure and obtain much better performance than what's shown in the graphs. But compared to what we were getting on the old server, this was a phenomenal performance boost.

I am planning on a reconfig in the near future, which will ultimately put the data on a single zpool consisting of four 7-disk raidz vdevs. This should substantially boost performance for the marketing team.
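
For what it's worth, the target layout would look something like the sketch below: four 7-disk raidz vdevs spread across both controllers (device names assume the same c2/c3 targets used above). Since a raidz vdev can't be reshaped in place, the reconfig will mean backing the data up, destroying the existing pool, and recreating it.

zpool create sbimktg raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t8d0 \
    raidz c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 c2t14d0 c2t15d0 \
    raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t8d0 \
    raidz c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0

Smaller vdevs should help because ZFS stripes across vdevs, so random I/O scales with the number of vdevs rather than with the number of disks in any single one of them.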