Archive for the ‘nuxx.net’ Category.

Light Snow, Bike Riding, Feeling Sick

Bob riding across the S shaped bridge in The Pines at Stony Creek on a November evening.

Here’s a photo of Bob / utabintarbo riding across the S-shaped bridge which is part of The Pines at Stony Creek. He and I met up with the intention of getting some extra riding in before the normal Wednesday 6:30 PM group ride, but after our first lap (and a naughty daylight backwards run through The Pines) I was so out of it that I had to stop and go home early. I think I’m getting the cold that Danielle had while we were in the UK, as I feel extremely tired, I’m coughing, I can’t properly catch my breath, and I just feel blah. I hope this doesn’t turn into pneumonia.

Riding was interesting, as the leaf- and snow-covered trails were reasonably slippery, previously muddy areas were rock-hard narrow ruts, wet areas were now slick ice, and previously loose sand was hard as concrete. Fun. I had a very hard time making it through some normally easy areas, and I’m blaming this on being slightly overdressed for the cold weather and being unable to breathe properly. Ah well, hopefully I’ll be better next week.

A couple of trips to Home Depot and Lowes have resulted in my purchase of some spray paint designed for frosting windows, a replacement light bulb for the ceiling fan in my bedroom, and new LED-based nightlights for the bathrooms. Tomorrow I’m hoping to remove the blinds in the bathroom and frost the windows. Hopefully that will go as well as replacing the bulb in the ceiling fan did, which got the room lighting up properly again.

On a very positive note, I had no problems uploading the image above, and I don’t anticipate any more now that I’ve incorporated the fix mentioned at the bottom of the post below about php-cgi hanging in sbwait. It turns out that a default setting in lighttpd breaks particularly badly on FreeBSD 7.0-RELEASE, but not on previous versions. Changing it to a different setting suggested by one of the lighttpd developers has worked around the issue. This is good.

php-cgi hung as sbwait with lighttpd on FreeBSD

Gallery Remote hung at "Upload completed: server processing...", which is the most obvious symptom of the lighttpd / php-cgi problems I've been having.

When uploading a quantity of photos to my gallery I like to use a tool like Gallery Remote to make the process easier. However, since moving to banstyle (and a newer version of FreeBSD, lighttpd, and PHP) I’ve had Gallery Remote regularly hang at that “Upload completed: server processing…” message. It seems to happen after a few (typically two to five) images have been uploaded.

This problem has been bothering me for a while, but I was able to work around it by scp’ing the files to the server and then adding them locally, which doesn’t trigger the hang. Now that I have a bunch of photos from the UK trip to upload, I want to be able to use Gallery Remote again. This morning I set to getting it working, but I seem to have failed.
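
For reference, the workaround is nothing fancy; something like the following, followed by Gallery’s “add from local server” option, does the trick (the destination path here is a made-up placeholder, not my actual gallery layout):

# copy the photos up over SSH, then add them from the server-side path in Gallery
scp ~/Pictures/uk-trip/*.jpg c0nsumer@banstyle.nuxx.net:/tmp/gallery-incoming/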

In short, what happens is that after an upload hangs I see one of the php-cgi processes stuck in a status of sbwait, as can be seen in this screenshot:

81073 c0nsumer        1   4    0   116M 20988K sbwait 3   0:01  0.00% php-cgi
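
(That listing is from top; if you just want to check for a wedged php-cgi without a full top session, something along these lines works on FreeBSD:)

# list php-cgi processes with their symbolic state; a hung one shows up as sbwait
ps -axo pid,user,state,%cpu,command | grep [p]hp-cgi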

Digging around I found this thread where someone else indicates that they’re having the same problem, and only on SMP boxes. (Note: banstyle.nuxx.net is four-way SMP using SCHED_ULE.) I also came across this report filed with the lighttpd folks regarding the issue. The consensus seems to be that with a config such as mine, running PHP as a FastCGI under lighttpd, this occasionally happens. I’ve seen no reports of the issue occurring under Apache.

Since I’m able to reproduce the problem I did so, attached gdb to the seemingly hung php-cgi process, and grabbed a backtrace:

(gdb) bt
#0  0x00000008010a476a in read () from /lib/libc.so.7
#1  0x000000000057d5fc in fcgi_read ()
#2  0x000000000057e306 in sapi_cgi_read_post ()
#3  0x00000000004c84a4 in fill_buffer ()
#4  0x00000000004c88b5 in multipart_buffer_read ()
#5  0x00000000004c9c08 in rfc1867_post_handler ()
#6  0x00000000004c6ee5 in sapi_handle_post ()
#7  0x00000000004cc30c in php_default_treat_data ()
#8  0x00000000004cc7eb in php_hash_environment ()
#9  0x00000000004bff47 in php_request_startup ()
#10 0x000000000057f8c6 in main ()

(gdb) f 0
#0  0x00000008010a476a in read () from /lib/libc.so.7
(gdb) info frame
Stack level 0, frame at 0x7fffffff9c50:
 rip = 0x8010a476a in read; saved rip 0x57d5fc
 called by frame at 0x7fffffff9db0
 Arglist at 0x7fffffff9c40, args: 
 Locals at 0x7fffffff9c40, Previous frame's sp is 0x7fffffff9c50
 Saved registers:
  rip at 0x7fffffff9c48
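
For anyone wanting to do the same, attaching to a hung process looks something like this; the PID is the one from the top output above, and /usr/local/bin/php-cgi is where the ports install normally puts the binary, so adjust as needed:

# attach gdb to the running (hung) php-cgi, grab the backtrace, then let it go
gdb /usr/local/bin/php-cgi 81073
(gdb) bt
(gdb) frame 0
(gdb) info frame
(gdb) detach
(gdb) quit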

Based on input from some folks online, it looks like php-cgi is doing exactly what it’s supposed to and is just waiting for more data, which means that the problem is likely somewhere in lighttpd. I’m not really sure where to go from here, besides waiting for the lighttpd folks to (hopefully) fix the problem. With any luck I’ll be able to update this post later on with a solution. For now I’m going to contemplate the difficulty of going (back, in many ways) to Apache.

For reference, I’m running lighttpd 1.4.20 and PHP 5.2.6, both installed from ports, configured as described in my article about running lighttpd with PHP as FastCGI with each user having their own PHP processes.
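
The full details are in that article, but the general shape of the relevant lighttpd config is something like this; the socket path, process counts, and environment values here are illustrative placeholders, not my actual settings:

fastcgi.server = ( ".php" =>
    (( "socket"          => "/tmp/php-fastcgi-c0nsumer.sock",
       "bin-path"        => "/usr/local/bin/php-cgi",
       "max-procs"       => 2,
       "bin-environment" => ( "PHP_FCGI_CHILDREN"     => "4",
                              "PHP_FCGI_MAX_REQUESTS" => "500" )
    ))
)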

UPDATE: So, it seems that there is a fix for this, which was suggested in the aforementioned bug report. Setting the option server.network-backend = "writev", along with the already-set (in my case) server.event-handler = "freebsd-kqueue", in lighttpd fixes it. I’m not sure if both options are needed to resolve the issue, but it seems that the default server.network-backend setting of write is confirmed as broken under FreeBSD 7.0-RELEASE with lighttpd <= 1.4.20.
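
In lighttpd.conf terms the workaround boils down to these two lines (the kqueue event handler was already in my config; only the network backend changed):

server.event-handler   = "freebsd-kqueue"
server.network-backend = "writev"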

BoingBoinging

Graph of the network traffic on nuxx.net when 23 Tubes 1 Bowl was posted to BoingBoing.

A bit over a month ago I made a post entitled 23 Tubes 1 Bowl detailing my dispensing of 23 sample size tubes of toothpaste into one bowl. Well, on Tuesday it was posted on BoingBoing.

This had the expected effect, with a tremendous surge in traffic of 4-5 Mb/sec and one spike of over 10 Mb/sec. Strangely, despite all the hits, I only received something like $3 in ad revenue over the last day. When Lightsticks In The Toilet made digg the earnings were something like $30 the first day, and $12 the second. This surprised me a bit and suggests that dental-related ads probably just don’t pay that much.

That graph up top shows the spike of traffic observed. The server is currently on a burstable 1 Mb/sec connection, billed at the 95th percentile. As this is figured on a monthly basis I’ve managed to stay under the limit, so this BoingBoinging won’t cost any extra. It’s nice to see that a bunch of people found the photos interesting, too. I had fun taking them, I just didn’t like the strong mint smell that hung around wherever the bowl went.

restrict default ignore

In setting up NTP on nuxx.net I ran into a bit of a problem: time wouldn’t sync. My configuration was fairly simple, following the information on support.ntp.org for using the pool of North American servers, blocking external access, but allowing ntpq (et al) to work from localhost:

server 0.north-america.pool.ntp.org
server 1.north-america.pool.ntp.org
server 2.north-america.pool.ntp.org
server 3.north-america.pool.ntp.org

driftfile /var/db/ntp.drift

restrict default ignore
restrict 127.0.0.1

However, it seemed that no matter what I tried (disabling the firewall, adding exceptions for TCP/UDP 123, changing order of the restrict statements, etc) the box wasn’t able to contact its peers:

c0nsumer@banstyle:~> ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 217.160.254.116 .INIT.          16 u    -   64    0    0.000    0.000 4000.00
 209.132.176.4   .INIT.          16 u    -   64    0    0.000    0.000 4000.00
 209.40.97.141   .INIT.          16 u    -   64    0    0.000    0.000 4000.00
 216.14.98.234   .INIT.          16 u    -   64    0    0.000    0.000 4000.00

After some more digging I found that the restrict default ignore option, which is widely recommended to keep external folks from connecting to your ntpd, prevents synchronization from happening even with the exception for localhost: ignore drops the responses coming back from the pool servers too, and the restrict 127.0.0.1 line only exempts localhost, not them.
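
A commonly documented middle ground, which I haven’t tested on this particular box, is to restrict what outsiders can do without ignoring them outright, so responses from your configured servers still get through:

# allow time exchange but refuse ntpq/ntpdc queries, config changes, and peering
restrict default kod nomodify notrap nopeer noquery
restrict 127.0.0.1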

Having realized that, my ntp.conf is now just the basic config for the NA servers and the drift file, and it all works great:

server 0.north-america.pool.ntp.org
server 1.north-america.pool.ntp.org
server 2.north-america.pool.ntp.org
server 3.north-america.pool.ntp.org

driftfile /var/db/ntp.drift

Yep, it’s syncing just fine:

c0nsumer@banstyle:~> ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*217.160.254.116 18.26.4.105      2 u  200  256   17   37.192    4.619   1.461
 209.132.176.4   66.187.233.4     2 u  201  256   17  101.819   21.118   9.529
 209.40.97.141   192.5.41.40      2 u  197  256   17   38.565  -31.122  21.081
 216.14.98.234   216.218.254.202  2 u  200  256   17   18.731    3.940   4.848

c0nsumer@banstyle:~> ntptrace
localhost: stratum 3, offset 0.004619, root distance 0.043540
server.donkeyfly.com: stratum 2, offset -0.000686, root distance 0.006361
bonehed.lcs.mit.edu: stratum 1, offset 0.000018, root distance 0.000000, refid 'CDMA'

Now I just let pf restrict access to NTP. That works just fine.
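
The pf side of that is simple stateful filtering; a sketch, with $ext_if standing in for the external interface macro in my actual ruleset:

# let our own NTP queries out and their replies back in via state
pass out on $ext_if proto udp from any to any port 123 keep state
# drop anyone outside trying to talk to ntpd directly
block in on $ext_if proto udp from any to any port 123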

Server Issues and Rainbows

A rainbow above Stony Creek High School before a Wednesday night MMBA group ride.

As some of you may have noticed, all of the sites hosted on my new server were inaccessible for a brief while this morning. It turns out that there was a slight issue which led to the switch port being turned off for a while. The server itself stayed up the whole time, but was unreachable until the port was turned back on. So, it’s still fine. This is a huge relief.

Tonight was also the normal Wednesday night group ride. I was a bit cold at first wearing just a long-sleeved Target running shirt and shorts, but after we got going I was fine. This (intentionally saturated) photo was taken from the parking lot where we meet, right after the pre-ride rain stopped.

Thankfully some friends loaned me a head-mounted HID light so I could ride in the dark. This was definitely needed in the latter half of the ride, particularly after the rain moved in. The ride back to the car was a quick 1.5 miles, but a bit unpleasant in 50°F weather on a muddy dirt road with rain coming down. At least it stopped once we got to the parking lot, making loading my bike into the car easier.

Now I’m home, relaxing, doing some laundry (yay, clean socks!), debating turning on the furnace, and generally thinking about bed.

rowla.nuxx.net, RIP

PuTTY screenshot of a disconnected session to rowla.nuxx.net after shutting it down for the last time.

That’s it. rowla.nuxx.net has been turned off, and I’m slated to pick it up tomorrow sometime around lunch. Everything has been moved over and seems to be working great. So, if I host your stuff on nuxx.net and you are having a problem, please let me know so that it may be corrected.

Busy Weekend

This weekend looks to be very busy. I’m still at work, don’t know when I’ll be leaving, and likely will have to put in some time on either Saturday evening or early Sunday morning.

The new hard disks for my server are going to be delivered today, so hopefully the wipe of the failing ones (with DBAN) will be complete by the time I arrive home. Then I’ll be able to do the dump and restore, check out the install, and get on with more burn-in.
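
The dump and restore itself is the standard FreeBSD routine; a sketch, assuming the new mirror’s root filesystem is mounted at /mnt (the mount point is a placeholder, not my actual layout):

# snapshot-dump the live root filesystem and unpack it onto the new disk
dump -0Laf - / | (cd /mnt && restore -rf -)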

I’d originally planned on riding both the Tour De Troit and the Addison Oaks Fall Classic this Saturday and Sunday (respectively), but I just don’t think I want to schedule things that tightly. So, maybe I’ll get out and ride a bit, but it definitely won’t be anything planned or structured.

Now, to get this stuff at work wrapped up. Thankfully Danielle brought me some really, really yummy lunch from Rangoli Express so that I didn’t have to leave for lunch today. It was really, really, really good.

(No, I’m not neglecting work right now… I’m just waiting for some other folks so I can keep going with stuff that I’m doing.)

+12 Hours of Breakin

Breakin, having run for 12h 28m 33s after swapping RAM around.

Yesterday I ordered a pair of Seagate Barracuda ES.2 ST3500320NS disks to replace the two which failed on Tuesday. Today I called Newegg about my RMA for the old ones and the old controller and was able to get the 15% restocking fee waived for both the controller and drives. Hopefully the drives will arrive tomorrow and I can dump | restore the OS and such, then start Breakin running so that it can thrash the drives for a few days.

Speaking of Breakin, I disconnected the disks from the machine (but left them mostly fitted in the case so as not to disrupt airflow) and started Breakin running this morning before I left for work. When I arrived home it was still running, unlike last week when it regularly failed with MCEs. This is good, as I had been unable to get it to run this long before.

SMART Issues

When I got home I started running SeaTools, Seagate’s disk diagnostics utility for Windows, on ad4, the drive which had begun failing earlier. It reported that both it and the other drive were just fine. However, when booting back into FreeBSD afterward I found that both drives were now indicating that Seek_Error_Rate was past threshold. The OS booted very slowly, then kicked ad6 out of the mirror set.
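
Checking that from FreeBSD is quick with smartmontools; a sketch of the commands I mean, with the grep just pulling out the one attribute in question:

# overall verdict, then the normalized value vs. threshold for the seek error attribute
smartctl -H /dev/ad4
smartctl -A /dev/ad4 | grep Seek_Error_Rate
smartctl -A /dev/ad6 | grep Seek_Error_Rate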

I tried connecting the drives to another, standalone SATA controller (some plain old Maxtor bundle-in one) with new SATA cables and saw the same problem.

So, I’m not sure what to do. Here’s every issue I’ve had with the new server and its resolution:

Issue: Server locking up hard, unexpectedly. MCEs on console.
Resolution: Ensure that only matched RAM is used and that all RAM tests good during burn-in.

Issue: Slow performance / absurd latency while using the 3ware disk controller.
Resolution: Identified that the driver runs under the Giant lock; moved to software mirroring with gmirror (a command sketch follows this list).

Issue: One of the original two Western Digital disks, which were part of a gmirror set, started giving block errors.
Resolution: Replaced the disks with a brand new Seagate pair.

Issue: Both of the new Seagate drives began failing with excessive Seek_Error_Rate within a few hours of each other, after extensive burn-in.
Resolution: Unsure.
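
Since gmirror keeps coming up here, a quick sketch of the relevant commands; gm0 is an assumed mirror name, and ad4/ad6 are the device names from this box:

# create the mirror (done once), then the commands used when a member has to be swapped
gmirror label -v -b round-robin gm0 /dev/ad4 /dev/ad6
gmirror status
gmirror forget gm0                # drop a member that has gone missing
gmirror insert gm0 /dev/ad6       # add the replacement; it rebuilds in the background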

I can’t help but wonder if one of the Seagates beginning to fail was contributing to the latency observed with the 3ware controller, but since neither was throwing SMART errors at the time, I discount this.

My current thought is that I should order a pair of server-grade disks, burn them in as before (~50 hours of constant activity), copy the data to them, then see if things will keep working. The failed disks and the unwanted 3ware controller will go back to Newegg, and hopefully things will work right.

I don’t know what other option I have besides scrapping the whole idea of moving servers, but I’d really rather not do that. If anyone else has any ideas, I’d love to hear them…

New Hard Disk Is Failing

root@banstyle:~# smartctl -H /dev/ad4
smartctl version 5.38 [amd64-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  7 Seek_Error_Rate         0x000f   013   012   030    Pre-fail  Always   FAILING_NOW 38293929828058

root@banstyle:~#

I can’t win. Now one of the brand new hard disks in the server is getting a bunch of seek errors.