
Category: computers

New Hard Disk Is Failing

root@banstyle:~# smartctl -H /dev/ad4
smartctl version 5.38 [amd64-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  7 Seek_Error_Rate         0x000f   013   012   030    Pre-fail  Always   FAILING_NOW 38293929828058

root@banstyle:~#

I can’t win. Now one of the brand new hard disks in the server is getting a bunch of seek errors.
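
For anyone wanting to check a row like this from a script, here is a minimal sketch in Python. It assumes the ten-column layout shown in the smartctl output above, and relies on the smartmontools rule that an attribute has failed when its normalized VALUE is at or below THRESH. The function name parse_attribute is made up for illustration.

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into named fields."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),    # normalized current value
        "worst": int(fields[4]),    # lowest normalized value seen so far
        "thresh": int(fields[5]),   # vendor failure threshold
        "when_failed": fields[8],
        "raw": int(fields[9]),
    }


def has_failed(attr):
    """An attribute counts as failed when VALUE is at or below THRESH."""
    return attr["value"] <= attr["thresh"]


row = "  7 Seek_Error_Rate         0x000f   013   012   030    Pre-fail  Always   FAILING_NOW 38293929828058"
attr = parse_attribute(row)
print(attr["name"], "failed" if has_failed(attr) else "ok")
```

Here VALUE is 13 against a threshold of 30, which is why smartctl marks the attribute FAILING_NOW.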


3ware 8006-2LP Sucks Under FreeBSD 7.0-RELEASE

Results from using Bonnie++ on FreeBSD 7.0 with a 3ware controller (twe), gmirror, and just a single local disk.

As mentioned here I got my new server working with a 3ware 8006-2LP and a pair of new 500GB disks. While it was working fine, I noticed that when updating the FreeBSD ports collection the update would occasionally pause, consuming no CPU, with the update process in the sbwait state. I understand this to mean that the process is sleeping while it waits for data on a socket.

It turns out that the twe(4) driver is what is known as GIANT-LOCKED, which I believe means that it still runs under Giant, the single kernel-wide lock from FreeBSD’s original SMP implementation, rather than using its own finer-grained locking:

twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0x8c00-0x8c0f mem 0xfc7ffc00-0xfc7ffc0f,0xfb800000-0xfbffffff irq 28 at device 3.0 on pci1
twe0: [GIANT-LOCKED]
twe0: [ITHREAD]
twe0: 2 ports, Firmware FE8S 1.05.00.068, BIOS BE7X 1.08.00.048

Best I can tell, the result of this is that the disk controller’s driver needs to wait for the kernel to free up other resources and tell the driver that it can go ahead and work before it does things. The result of this tends to be that the driver works well, but there is a lot of latency.

This understanding matches what I observed, which was the aforementioned lengthy pauses when doing things which required a bunch of disk IO. In order to prove this understanding out, I set up a test hard disk running a stock FreeBSD 7.0 amd64 installation from which I could run Bonnie++, a file-based disk benchmarking suite.

In my testing I used the following three scenarios:

· One 120GB IBM Deskstar PATA drive (IC35L120AVVA07) connected to the motherboard booting the OS, listed in the results as banstyle_deskstar.
· Two 500GB Western Digital SATA drives (WD5000AAKS-40TMA0) connected to the motherboard with software RAID 1 via gmirror(8), listed in the results as banstyle_gmirror.
· Two 500GB Seagate SATA drives (ST3500320AS) connected to the 3ware 8006-2LP using the twe(4) driver in hardware RAID 1, listed in the results as banstyle_twe.

The result was that all three configurations deliver roughly the same throughput, but the 3ware controller had an absurd amount of latency. If one looks at the HTML version of the Bonnie++ output here (or the PNG here or above), one can see that it was giving nearly three SECONDS of latency for random seeks and writes using write(2). This is insane.
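
To illustrate the difference between throughput and latency, here is a rough sketch in Python that times individual small writes at random offsets, forcing each one to disk with fsync. This is not how Bonnie++ measures anything; the function name, block size, file size, and sample count are all arbitrary choices for illustration. The point is only that the worst single operation can be terrible even when the average looks fine.

```python
import os
import random
import tempfile
import time


def write_latencies(path, count=50, block_size=4096, file_size=1 << 20):
    """Time `count` synced writes of `block_size` bytes at random offsets."""
    payload = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, file_size)  # pre-size the file so seeks stay inside it
        for _ in range(count):
            offset = random.randrange(0, file_size - block_size)
            start = time.perf_counter()
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, payload)
            os.fsync(fd)  # do not let the OS cache hide the device latency
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


with tempfile.TemporaryDirectory() as tmp:
    samples = write_latencies(os.path.join(tmp, "probe.dat"))

mean_ms = sum(samples) / len(samples) * 1000
print(f"mean {mean_ms:.2f} ms, worst {max(samples) * 1000:.2f} ms")
```

To test a particular array, point the path at a file on that filesystem instead of a temporary directory.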

The only thing I can think to attribute this to is the GIANT-LOCK in twe(4). I guess this means that I’m going to have to go back to gmirror(8) for software RAID and return the card. How disappointing.

(If anyone reading this disagrees with these findings or wishes to comment on them, please don’t hesitate to do so here or by emailing me directly.)


Black and Shiny

Set up to polish my boots in the laundry room. One boot is done.

After eating some really nice Skillet Baked Ziti (recipe from America’s Test Kitchen) that Danielle made for dinner I avoided working on my server by polishing my boots. As you can see above or at this close-up of the toes of my boots, they needed it.

Now I get to go back to figuring out why twe(4) in FreeBSD 7.0 seems sluggish. It may just be my perception, so I’m double-checking this by comparing the new 3ware-based array to the old gmirror(8) version. Or, it may be that it’s one of three drivers (the other two are ohci(4) and atkbd(4)) which indicate that they are GIANT-LOCKED, which means that they use the old SMP locking method.


Control

Screen capture from Control of Sam Riley as Ian Curtis, with the Unknown Pleasures album artwork in the background.

Danielle and I finally watched Control (Official Site · IMDB · Wikipedia), which she had received from Netflix last week. While it was a bit slow and (obviously) predictable, I enjoyed it.

I think that tonight I also got banstyle.nuxx.net working properly again. Over the past two days I did a bunch of extensive testing with spare RAM, Breakin, and a white board, and I think that I may have narrowed down the problem. I believe that the MCEs I was seeing were caused by a combination of a failing DIMM and modules which were the same in part number but not in actual chip content. There may actually be a bad slot there too, but I’m not certain of that.

I’ve winnowed the box down to 6GB of matched, tested RAM and it seems to pass all the tests I’ve thrown at it thus far. With the discovery that ad6 is dying as well I ordered a 3ware 8006-2LP and two Seagate ST3500320AS 500GB disks. Those were fitted into the server and I then dumped the partitions from ad4 to the new array. Everything seemed to be working fine, but occasionally slowly. Jumpering the board to force the first PCI-X slot to 66MHz (to match the PCI 8006-2LP) and turning on bus mastering for IDE transfers on the PCI slots seems to have sorted this out.

SMART tests and a number of hours of Breakin have shown the disks to be okay, so come Monday morning I’ll attempt to get a good 36 hours of burning in happening. If this all goes well the server will be back in place on Wednesday, with everything moved (shifted?) back over by Thursday evening.

If you are interested, here is a photo of my workbench just after dumping the partitions from one half of the old mirror to the new mirror set. Due to a bug in dump (or UFS) on FreeBSD 7.0 I had 6.3 booting off of an external USB drive, running dump to throw data from one disk to another, a partition at a time.

After that photo was taken fstab was edited, everything booted up great, and then the new drives each passed an extended offline SMART test.


Time Machine Network Backup Speedup / Fix

I just acquired a new external disk enclosure and 750GB disk for hanging off of an AirPort Extreme and using for Time Machine backups of my main machine. From this I currently have ~480GB of data to back up, and for some reason the initial large backup repeatedly fails when I attempt to do it over the network.

The easy way around this is to first do the backup to the drive when it is connected locally and then hang it off of the AirPort Extreme to continue the incremental backups. The problem is that this doesn’t work as one would expect, because when an initial Time Machine backup is made to a local disk the backup ends up in a series of subdirectories, which is a different format from what it is via network.

When the backup is made to a volume hanging off of an AirPort Extreme a .sparsebundle file is created containing the backup; essentially a disk image stored on the network. Therefore, if you make a Time Machine backup locally and then try to use it via an AirPort Extreme the .sparsebundle file will be created on the disk in parallel to the now-useless directory structure.

So, how do you work around this? Easy. Hook the external disk up to the AirPort Extreme then either let the backup fail or cancel it, which will leave the incomplete .sparsebundle file on the disk. Disconnect the drive from the AirPort Extreme, connect it to your Mac, and point Time Machine to that volume. If it finds an appropriate .sparsebundle on the volume (which it will, since it’s already there) it’ll use that instead of creating the aforementioned subdirectory structure.

The backup will then happen quite quickly, and after it completes you can just hang the drive back off of the AirPort Extreme, redirect Time Machine to back up to that network volume, and things will continue via the network.

UPDATE: Since 10.5.5 was applied to my machine I have been unable to use this backup method and have had to resort to making the entire initial backup via network.


ad6 is Dying Too!

Error messages on the console showing that ad6 is actually failing hard. Good thing I ordered replacement disks.

It’s a good thing I received a 3ware 8006-2LP and a pair of Seagate 500GB disks today, because one of the two drives in the mirror set on my new server is just about to fail. To make matters worse, the failing disk is ad6, and ad4 is the one I’d accidentally broken the other night, so I’ve been desperately waiting for the disks to finish syncing so that everything would be backed up.

This failed at ~10:00am this morning, which kept me from rebooting the box remotely to run more stress testing and (hopefully) replicate last night’s error.

Now that the data is sync’d I’ll wait for it to finish fscking then I’ll shut it down cleanly and begin running Breakin again.


CPU 2: Machine Check Exception: 4 Bank 4: f61c2001ba080813

A real, honest, good failure while running Breakin on banstyle.nuxx.net. It points to something being wrong with the second CPU or bank of memory.

In testing, my server banstyle.nuxx.net has had its first real set of errors / failures. This is a good thing.

First, last night I started getting SMART warnings about bad blocks on ad6, which is the second hard drive. So today I just went ahead and ordered up a pair of ST3500320AS 500GB disks and a 3ware 8006-2LP, the same as is used in my current server.

Note the sdb errors, which are consistent with the other errors I’d been seeing indicating a bad block on the second hard disk.

Second, I came home today and found my server hung while running Breakin, displaying the error CPU 2: Machine Check Exception: 4 Bank 4: f61c2001ba080813 TSC 2561d00c4ef7 ADDR ce19fd00. So, at least I’ve got some place to look for what else might be the issue.
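
For anyone curious what that status value says, here is a small Python sketch that pulls apart the top flag bits. It assumes the usual x86 machine-check status register layout (VAL in bit 63 down to PCC in bit 57, with the MCA error code in the low 16 bits). The names below are my own labels, and I have not tried to decode the model-specific bits in between.

```python
# Bit positions of the architectural flags at the top of a machine-check
# status word (assumed layout; see the lead-in).
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Return (flags, mca_error_code) for a 64-bit status value."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    return flags, status & 0xFFFF


flags, error_code = decode_mci_status(0xF61C2001BA080813)
print([name for name, is_set in flags.items() if is_set], hex(error_code))
```

With this layout ADDRV is set, which fits with the message including an ADDR value, and UC plus PCC being set would fit with the box hanging rather than logging the error and carrying on.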


nuxx.net Is Back Up

Well, my site, nuxx.net is back up. I have the new server here at home and I’m starting to take a look at it. Hopefully I’ll have some sort of answer soon.

Unfortunately, in looking at it, I screwed up both the BIOS and the software RAID array. I tried to backrev the BIOS, only to find out that Tyan (the motherboard manufacturer) had changed Flash chips with the particular board I got, and the older BIOS versions don’t support it. Long story short, I was able to downgrade to an older BIOS, but as that older BIOS doesn’t support the new chip type, attempting to upgrade it again simply causes the flashing program to report “Error : Flash part is not supported”.

Beyond that, I was waiting for the server to rebuild the array (after the hard power off of the failure on Sunday morning) and getting impatient, so I decided to disconnect (via software) the inconsistent half of the array, thinking that I could just let it finish building later. This didn’t go so well (for some reason) and I ended up breaking the array. I think it’s back together, but I do worry a bit that something may be lost. We’ll see, I guess.

Mail is back up and things were sync’d over, but there is/will be some quirkiness with the mail received in the last day or two. Expect to see some duplicates. Sorry.

Oh, great: Sep 7 22:37:12 banstyle smartd[863]: Device: /dev/ad6, 1 Currently unreadable (pending) sectors

Right now I’m feeling really frustrated with this whole process and wanting to just put it away and maybe start over later. I just don’t want to leave everyone else’s stuff down.


banstyle.nuxx.net: It’s Back

The display on the VGA output of my new server, banstyle.nuxx.net, after it went down hard at ~01:30 EDT on 05-Sep-2008.

My server, banstyle.nuxx.net, is back. In case you didn’t see the LiveJournal post I made about the server being down, know that it went down about 01:30 EDT this morning and didn’t come back up overnight. The symptoms were that the machine had an active link to the switch, but the arp cache was aging and the box was generally unreachable and unresponsive. Here’s a Cacti graph showing the outage.

At lunch I drove down to the colo facility, was escorted down to the room, and first noticed that the box was powered up and the network activity LED was blinking, but the disk controller LED was dark. Plugging in a monitor I saw blinky colored bars overlaid on the normal console, looking like a hardware problem. Perhaps something with the video controller.

The box was rebooted, and as a precaution I went into the BIOS and disabled the bits which redirect video output (text mode only, of course) to the serial port, essentially allowing the whole box to be managed from a terminal. I figure that maybe, possibly, somehow this contributed. After that, I booted the OS back up, did an initial check to be sure everything was okay, started the backed up mail on the old box flushing, and left. Things were a bit slow at first while fsck ran in the background and the mail filtered through, but after that everything seemed good.

So, to be honest, I don’t really know what went wrong. The server is working well again, and I guess I’ll just have to keep a close eye on it for a while. This is particularly frustrating because it’d been working great for the last four months while I had it at home. If there are any more problems, please bear with me…

For reference, here’s the stuff in /var/log/messages showing that there was nothing between the events noted in last night’s post about SMTP-AUTH and the reboot this morning:

Sep 4 23:05:50 banstyle postfix/smtpd[91552]: sql_select option missing
Sep 4 23:05:50 banstyle postfix/smtpd[91552]: auxpropfunc error no mechanism available
Sep 5 11:57:31 banstyle syslogd: kernel boot file is /boot/kernel/kernel
Sep 5 11:57:31 banstyle kernel: Copyright (c) 1992-2008 The FreeBSD Project.
Sep 5 11:57:31 banstyle kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Sep 5 11:57:31 banstyle kernel: The Regents of the University of California. All rights reserved.
Sep 5 11:57:31 banstyle kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Sep 5 11:57:31 banstyle kernel: FreeBSD 7.0-RELEASE #2: Wed Aug 20 12:57:10 EDT 2008
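
If you want to find a silent stretch like this in a longer log without eyeballing it, here is a minimal Python sketch. It assumes classic syslog timestamps (which carry no year, so the year has to be supplied), English month abbreviations, and at least two lines of input. The function names are made up for illustration.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Build a datetime from the 'Mon D HH:MM:SS' prefix of a syslog line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year):
    """Return (gap, before, after) for the longest silence between entries."""
    times = [parse_syslog_time(line, year) for line in lines if line.strip()]
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(times, times[1:])]
    return max(gaps)


log = """\
Sep 4 23:05:50 banstyle postfix/smtpd[91552]: sql_select option missing
Sep 4 23:05:50 banstyle postfix/smtpd[91552]: auxpropfunc error no mechanism available
Sep 5 11:57:31 banstyle syslogd: kernel boot file is /boot/kernel/kernel
Sep 5 11:57:31 banstyle kernel: Copyright (c) 1992-2008 The FreeBSD Project.
""".splitlines()

gap, before, after = largest_gap(log, 2008)
print(f"{gap} of silence between {before} and {after}")
```

On the lines above this reports the stretch between the last Postfix message and the boot messages from the reboot.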


SMTP-AUTH for Postfix via courier-authlib (authdaemond)

Getting SMTP authentication working with Postfix via authdaemond on FreeBSD 7.0 without occasional, useless errors in /var/log/messages has just caused me an hour of frustration. Therefore, I wish to document what I had to do to make it work right:

First off, Postfix (mail/postfix) and courier-authlib with MySQL support (security/courier-authlib with AUTH_MYSQL set in the config) must be installed. Setting up courier-authlib to talk to a MySQL db is beyond the scope of this document, but it basically involves setting the following lines:

/usr/local/etc/authlib/authdaemonrc:

authmodulelist="authmysql"

/usr/local/etc/authlib/authmysqlrc:

MYSQL_SERVER localhost
MYSQL_SOCKET /tmp/mysql.sock
MYSQL_PORT 0
MYSQL_OPT 0
MYSQL_USERNAME mail
MYSQL_PASSWORD [OBSCURED]
MYSQL_DATABASE mail
MYSQL_USER_TABLE mailbox
MYSQL_CRYPT_PWFIELD password
MYSQL_UID_FIELD uid
MYSQL_GID_FIELD gid
MYSQL_LOGIN_FIELD pobox
MYSQL_HOME_FIELD homedir
MYSQL_MAILDIR_FIELD CONCAT(homedir,'/',maildir,'/')
MYSQL_QUOTA_FIELD quota
MYSQL_NAME_FIELD name

After that is set, Postfix’s main.cf must have SASL enabled with smtpd_sasl_auth_enable = yes. Next, the following smtpd.conf must be placed in /usr/local/etc/sasl2:

/usr/local/etc/sasl2/smtpd.conf:

pwcheck_method: authdaemond
log_level: 3
mech_list: PLAIN LOGIN
authdaemond_path: /var/run/authdaemond/socket

auxprop_plugin: mysql
sql_select: select password from users where email = '%u@%r'

Now, here’s the stupid part. See those last two lines, auxprop_plugin: mysql and sql_select: select...? They don’t do anything, and that SELECT statement won’t even return anything useful on my db. Without them there SMTP AUTH works great. However, if you don’t have those lines there, Postfix will regularly complain loudly with errors such as these:

Sep 4 21:30:02 banstyle postfix/smtpd[47677]: sql_select option missing
Sep 4 21:30:02 banstyle postfix/smtpd[47677]: auxpropfunc error no mechanism available

Please note that with authdaemond, CRAM-MD5 and DIGEST-MD5 authentication mechanisms won’t work. (These would normally be set with mech_list: PLAIN LOGIN CRAM-MD5 DIGEST-MD5.) If enabled they will appear available but won’t work.

One final thing… Want to know how to be sure that the server is notifying clients that it supports authentication? Just telnet to port 25 on your mail server and type in EHLO domain.com. The AUTH LOGIN PLAIN and AUTH=LOGIN PLAIN lines show you that plain-text authentication is now available:

c0nsumer@banstyle:~> telnet localhost 25
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 banstyle.nuxx.net ESMTP Postfix
EHLO nuxx.net
250-banstyle.nuxx.net
250-PIPELINING
250-SIZE 10240000
250-VRFY
250-ETRN
250-STARTTLS
250-AUTH LOGIN PLAIN
250-AUTH=LOGIN PLAIN
250-ENHANCEDSTATUSCODES
250-8BITMIME
250 DSN
QUIT
221 2.0.0 Bye
Connection closed by foreign host.
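
The same check can be scripted. Below is a minimal Python sketch: auth_mechanisms parses EHLO reply lines like the ones above, and probe asks a live server using the standard smtplib module. Both function names are my own, and probe is only defined here, not run, since it needs a reachable mail server.

```python
import smtplib


def auth_mechanisms(ehlo_lines):
    """Collect the mechanisms advertised on AUTH lines of an EHLO reply."""
    mechanisms = set()
    for line in ehlo_lines:
        keyword = line[4:].strip().upper()  # drop the "250-" or "250 " prefix
        # Servers may advertise both "AUTH ..." and the older "AUTH=..." form.
        if keyword.startswith("AUTH ") or keyword.startswith("AUTH="):
            mechanisms.update(keyword[5:].split())
    return mechanisms


def probe(host, port=25, timeout=10):
    """Connect to host, send EHLO, and return the advertised AUTH mechanisms."""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        return set(smtp.esmtp_features.get("auth", "").upper().split())


reply = [
    "250-banstyle.nuxx.net",
    "250-PIPELINING",
    "250-STARTTLS",
    "250-AUTH LOGIN PLAIN",
    "250-AUTH=LOGIN PLAIN",
    "250 DSN",
]
print(sorted(auth_mechanisms(reply)))
```

Calling probe("localhost") on the mail server itself should give the same set as the telnet session, assuming Postfix is listening on port 25 there.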
