Setting kernel clocksource to HPET solves mysterious performance issues

Nathan Kinkade, April 10th, 2012

For quite a long time the server which runs this very site has had performance issues. The same server runs one or two instances of MediaWiki, and I had always just presumed that MediaWiki was the cause of the problems. I didn't give it much more thought, since the issues weren't causing any horrible user-facing problems. The server hobbled along in the background, fairly loaded, but still managing to serve up pages decently. However, the problem manifested itself most seriously for me personally when working in a remote shell. Sometimes I'd go to save a file and the operation would take 10 or 15 seconds to complete. I ignored this, too, for some time, but it reached a point where I couldn't take it any longer.

I watched the output of top for a while, sorting on various metrics, and noticed that flush and kjournald were pegged at the top when sorted by process state, both being in the disk-wait ("D") state. This didn't make any sense to me, since the machine doesn't host any really busy sites and should have plenty of memory to handle what it has. I did a web search for "linux flush kswapd" to see what it would turn up. As it happens, the very first article returned ended up indirectly shedding light on this issue, even though it was mostly unrelated to my own problem. What I did take away from it was learning of a utility that I didn't previously know about. Namely, perf, and specifically perf top -a.

What I discovered upon running this command was that the kernel was spending a huge amount of time (60% to 80%) in the function acpi_pm_read. A little investigation tracked this back to the kernel clocksource being set to acpi_pm. The current and available clocksources can be discovered by running, respectively:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
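Before making anything permanent, a different clocksource can also be selected at runtime as root, which is a low-risk way to test a candidate; note the change does not survive a reboot:

```shell
# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```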

I then went to another machine, also running MediaWiki, but one not having any performance issues, and found its clocksource to be hpet. After a little more research, some experimenting, and a few reboots, I found that adding the kernel parameter hpet=force to the variable GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then running update-grub got the system using hpet as the clocksource. And this seems to have totally cleared up the issues on the machine. Processor usage is way down, memory usage is way down, processes in the disk-wait state are down, and our MediaWiki site is returning pages much faster than it ever has.
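Concretely, the relevant line in /etc/default/grub ends up looking something like this (the existing flags, here just quiet, are an assumption; keep whatever your system already has), after which update-grub and a reboot apply it:

```shell
GRUB_CMDLINE_LINUX_DEFAULT="quiet hpet=force"
```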

For reference, here are a few machine specifications which might be useful for others investigating this:

  • OS: Debian Squeeze
  • Processors: 2 x AMD Opteron 246

Converting a remote, headless server to RAID1

Nathan Kinkade, April 6th, 2012

We have a particular server which has been running very well for the past few years, but I have had a certain amount of low-level anxiety that the hard disk in the machine might fail, causing the sites it hosts to go down. We have backups, of course, but restoring a machine from backups would take hours, if for no other reason than the time needed to transfer gigabytes of data to the restored disk. So I had our hosting company add a second hard disk to the machine so that I could attempt to convert it to boot from a RAID1 array. Playing around with how a system boots on a remote, headless machine to which you have no console access is a little nerve-racking, because it's very easy to make a small mistake and have the machine fail to boot. You are then at the mercy of a data center technician, and the machine may be down for some time.

There are several documents to be found on the Web about how to go about doing this, but I found one in particular to be the most useful. I pretty much followed the instructions in that document line for line, until it broke down because that person was on a Debian Etch system running GRUB version 1, while our system runs Debian Squeeze with GRUB 2 (1.98). The steps I followed diverge from that document starting about halfway down page two, where the author says "Now up to the GRUB boot loader." Here are the steps I used to configure GRUB 2:

The instructions below assume the following layout. Things may be different on your system, so change the device names to match yours:

Current system:

/dev/sda1 = /boot
/dev/sda5 = swap
/dev/sda6 = /

New RAID1 arrays:

/dev/md0 = /boot
/dev/md1 = swap
/dev/md2 = /
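Before any of the GRUB work, the referenced howto has you create the arrays in degraded mode on the new disk and put filesystems on them. As a rough sketch (the partition names follow the layout above; ext3 is an assumption, so use whatever filesystem your system has):

```shell
# mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
# mdadm --create /dev/md1 --level=1 --raid-devices=2 missing /dev/sdb5
# mdadm --create /dev/md2 --level=1 --raid-devices=2 missing /dev/sdb6
# mkfs.ext3 /dev/md0 && mkswap /dev/md1 && mkfs.ext3 /dev/md2
```

The keyword missing tells mdadm to build each mirror with only one member for now; the original disk's partitions get added later, once the machine has successfully booted from the arrays.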

Create a file called /etc/grub.d/06_raid with the following contents. Be sure to make the file executable:

#!/bin/sh
exec tail -n +3 $0
# A custom file for CC to get system to boot to RAID1 array

menuentry "Debian GNU/Linux, with Linux 2.6.32-5-amd64, RAID1" {
        insmod raid
        insmod mdraid
        insmod ext2
        set root='(md0)'
        search --no-floppy --fs-uuid --set 075a7fbf-eed4-4903-8988-983079658873
        echo 'Loading Linux 2.6.32-5-amd64 with RAID1 ...'
        linux /vmlinuz-2.6.32-5-amd64 root=/dev/md2 ro quiet panic=5
        echo 'Loading initial ramdisk ...'
        initrd /initrd.img-2.6.32-5-amd64
}

Of course, you are going to need to change the UUID in the search line, and also the kernel and initrd image names to match those on your system, and probably some other details.

We should tell GRUB to fall back on a known good boot scenario, just in case something doesn't work. I will say up front that this completely saved my ass, since it took me numerous reboots before I found a working configuration, and if it weren't for this fall-back mechanism the machine would have been stuck at a failed boot screen in GRUB. I found some instructions about how to go about doing this in GRUB 2. Create a file named /etc/grub.d/01_fallback and add the following contents. Be sure to make the file executable:

#! /bin/sh -e

if [ ! "x${GRUB_DEFAULT}" = "xsaved" ] ; then
  if [ "x${GRUB_FALLBACK}" = "x" ] ; then
    GRUB_FALLBACK=$( ls /boot | grep -c 'initrd.img' )
  fi
  echo "fallback set to menuentry=${GRUB_FALLBACK}" >&2
  cat << EOF
set fallback="${GRUB_FALLBACK}"
EOF
fi

Then add the following line to /etc/default/grub:

export GRUB_FALLBACK="1"
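The counting trick in 01_fallback (the fallback index is taken to be the number of initrd images in /boot) can be sanity-checked on its own. This sketch uses a scratch directory standing in for /boot, with made-up file names:

```shell
# Scratch directory standing in for /boot (demo assumption).
BOOTDIR=$(mktemp -d)
touch "$BOOTDIR/vmlinuz-2.6.32-5-amd64" "$BOOTDIR/initrd.img-2.6.32-5-amd64"
# Same logic as the 01_fallback script: count the initrd images.
GRUB_FALLBACK=$(ls "$BOOTDIR" | grep -c 'initrd.img')
echo "fallback set to menuentry=${GRUB_FALLBACK}"
rm -r "$BOOTDIR"
```

With a single kernel installed this yields 1, i.e. the menu entry right after the custom RAID1 entry (entry 0), which is the known-good stock boot.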

And while you are in there, also uncomment or add the following line, unless you plan to be using UUIDs everywhere. I'm not sure if this is necessary, but since I was mostly using device names (e.g. /dev/md0) everywhere, I figured it couldn't hurt:

GRUB_DISABLE_LINUX_UUID=true

Update the GRUB configuration by executing the following command:

# update-grub

Make sure GRUB is installed and configured in the MBR of each of our disks:

# grub-install /dev/sda
# grub-install /dev/sdb

Now we need to update the initramfs image so that it knows about our RAID setup. You could do this by simply running update-initramfs -u, but I found that running the following command did this for me (and perhaps some other relevant things), and it also verified that my mdadm settings were where they needed to be:

# dpkg-reconfigure mdadm

I used rsync, instead of cp, to move the data from the running system to the degraded arrays like so:

# rsync -ax /boot/ /mnt/md0
# rsync -ax / /mnt/md2

When rsync finishes copying / to /mnt/md2, edit the following files, changing any references to the current disk to our new mdX devices:

# vi /mnt/md2/etc/fstab
# vi /mnt/md2/etc/mtab

Warning: do not edit /etc/fstab and /etc/mtab on the currently running system, as the instructions would seem to indicate. Otherwise, if the new RAID configuration fails and the machine has to fall back to the current system, it won't be able to boot either.
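For illustration, after editing, /mnt/md2/etc/fstab would contain entries along these lines (the filesystem types and mount options here are assumptions; keep whatever your existing fstab uses):

```
/dev/md0  /boot  ext3  defaults            0  2
/dev/md1  none   swap  sw                  0  0
/dev/md2  /      ext3  errors=remount-ro   0  1
```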

I believe that was it, though it's possible I may have forgotten a step here. Don't run the following command unless you can afford to have the machine down for a while. This is also a good time to make sure you have good backups. But if you're ready, then run:

# shutdown -rf now

Now cross your fingers and hope the system comes back up on the RAID1 arrays, or at all.

If the machine comes back up on the RAID1 arrays, then you can now add the original disk to the new arrays with commands like the following:

# mdadm --manage /dev/md0 --add /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda5
# mdadm --manage /dev/md2 --add /dev/sda6

The arrays will automatically rebuild themselves, and you can check the status by running:

# cat /proc/mdstat

Tuning TCP on CC’s servers

Nathan Kinkade, December 8th, 2010

A couple weeks ago we launched a new rack-mount server, which is kindly hosted by the ISC in their Redwood City, California data center. The sole purpose of this new server is to host static content, mostly for what is probably the busiest domain CC has, due to the license icons and badges being served from there.

Upon moving to this new machine I noticed terrible problems with connection timeouts when requesting images. After thrashing around trying to work out why this was happening, I used tcpdump to capture some network traffic on the server and discovered that SYN packets were arriving at the machine and dying right there, with no subsequent SYN-ACK. At that point it was clear that this was not a Varnish or Apache problem, but something at a much lower level. After testing various TCP tweaks in the running kernel I discovered that setting net.ipv4.tcp_max_syn_backlog=2048, up from the default of 256, and turning on net.ipv4.tcp_syncookies seemed to resolve the connection timeout issues.

However, the kernel message log was filled with messages like the following. In fact, one such message was written to the log every minute:

possible SYN flooding on port 80. Sending cookies.

I was confused by this because ostensibly the site was functioning just fine. My understanding was that SYN cookies were only activated when the SYN queue filled up, but as far as I could tell I had increased the depth of the queue sufficiently to avoid that problem. I even tried setting net.ipv4.tcp_max_syn_backlog arbitrarily high to see what would happen. Same result: the site operated fine, but with SYN cookie kernel messages. In my testing I also discovered that disabling net.ipv4.tcp_syncookies would immediately bring back the connection timeout problems. Additionally, netstat revealed that though the site appeared to be functioning correctly, there were still an abnormally large number of 'failed connection attempts' listed in the TCP stats.

I went over and over all the TCP settings and just couldn’t figure out what was happening, nor did Google shed any light on this. I then decided to do this:

$ netstat -n | grep SYN_RECV | wc -l

I ran this command many times in a row over a period of time and was surprised to see that the result was nearly always 256, give or take a few. It then occurred to me that that number looked a lot like the default value of net.ipv4.tcp_max_syn_backlog. However, as far as I knew (and know), all of those kernel parameters are supposed to be dynamic, capable of being changed on-the-fly, with sysctl or writing directly to the /proc file system. So I set all my TCP changes in /etc/sysctl.conf and rebooted the machine. Sure enough, since coming back up about a day ago I haven’t seen a single kernel message about SYN cookies. I even decided to just disable SYN cookies altogether based on a recommendation to do so in the default /etc/sysctl.conf file found on Debian systems.

The machine is now humming along nicely. For reference here are the TCP parameters I changed. The values were gleaned from various sites while doing extensive research on TCP tuning. Some of the values seem improbable to me, but don’t seem to be having any perceptible negative impact, and were also recommended in TCP tuning articles on more than one site. I went ahead and implemented these settings on the rest of CC’s servers as well:

net.ipv4.tcp_fin_timeout = 3
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_no_metrics_save = 1 
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_max_syn_backlog = 8192
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.somaxconn = 1024
vm.min_free_kbytes = 65536
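Once these are in /etc/sysctl.conf they can be loaded by hand and individual values spot-checked, though as described above, in this case the backlog change only behaved correctly after a full reboot:

```shell
# sysctl -p
# sysctl net.ipv4.tcp_max_syn_backlog
```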