Archive for the ‘Linux’ Category

Tomato and AT&T U-Verse Disconnects

I recently ran into an issue with my home network setup where my Linksys WRT54G router running Tomato 1.27 was disconnecting my long-running active TCP connections every 10 minutes or so. After some investigation, I found this is a common issue: Tomato’s DHCP client (udhcpc) performs a unicast DHCP renewal, and the router’s own firewall blocks or misroutes the reply.

A number of people have published similar reports, but none of the suggested solutions appeared to work reliably for me, so I decided to diagnose, troubleshoot and resolve the issue myself. Here’s how I solved the problem. The notes I gathered while working on this are also located at 2WIRE & Tomato – Google Docs.

If you’d like to stop reading and skip right to the payoff, simply add the following two lines to the firewall script, which is located in the web-based user interface under Administration, Scripts, on the Firewall tab:

iptables -t nat -I PREROUTING -p udp -i vlan1 --dport 68 --sport 67 -j ACCEPT
iptables -I INPUT -p udp -i vlan1 --dport 68 --sport 67 -j ACCEPT

These firewall rules accept DHCP replies arriving on the WAN interface (vlan1), from UDP port 67 to UDP port 68, regardless of whether the traffic is broadcast or unicast. Please let me know if these rules are not optimal or could be improved.

Here are some references to other reports of this issue:

My troubleshooting process follows.

I can see in the logs that udhcpc attempts a renewal right up until the lease expires:

Feb 17 15:14:26 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:16:56 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:18:11 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:18:48 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:06 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:15 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:19 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:21 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Lease lost, entering init state
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_allmulti(master, 1)
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_promiscuity(master, -1)
Feb 17 15:19:22 tomato user.info kernel: device vlan1 left promiscuous mode
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending discover...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending select for 99.29.172.159...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Lease of 99.29.172.159 obtained, lease time 600
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_allmulti(master, -1)
Feb 17 15:19:22 tomato daemon.info dnsmasq[12612]: exiting on receipt of SIGTERM
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: started, version 2.51 cachesize 150
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: compile time options: no-IPv6 GNU-getopt no-RTC no-DBus no-I18N DHCP no-scripts no-TFTP
Feb 17 15:19:22 tomato daemon.info dnsmasq-dhcp[13007]: DHCP, IP range 192.168.3.100 -- 192.168.3.149, lease time 1d
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: reading /etc/resolv.dnsmasq
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 192.168.4.254#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 8.8.4.4#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 8.8.8.8#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: read /etc/hosts - 0 addresses
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: read /etc/hosts.dnsmasq - 16 addresses
Feb 17 15:19:25 tomato daemon.err miniupnpd[12649]: recv (state0): Connection reset by peer
Feb 17 15:19:27 tomato daemon.notice miniupnpd[12649]: received signal 15, good-bye
Feb 17 15:19:27 tomato daemon.notice miniupnpd[13043]: HTTP listening on port 5000
Feb 17 15:19:27 tomato daemon.notice miniupnpd[13043]: Listening for NAT-PMP traffic on port 5351
Feb 17 15:19:27 tomato user.info kernel: device br0 left promiscuous mode
Feb 17 15:19:27 tomato user.info kernel: vlan1: dev_set_allmulti(master, -1)
Feb 17 15:19:27 tomato user.info kernel: vlan1: del 01:00:5e:00:00:02 mcast address from master interface
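
The gaps between those renew attempts are easier to see as numbers. Here is a small sketch that prints the seconds between successive udhcpc renew attempts. The function name renew_gaps is my own invention, and it assumes the syslog format shown above, with the time of day in the third field:

```shell
# renew_gaps: print the seconds between successive udhcpc "Sending renew..."
# lines. Reads the named log files, or stdin if none are given.
# Assumes all timestamps fall on the same day (no midnight rollover handling).
renew_gaps() {
    awk '/udhcpc/ && /Sending renew/ {
        split($3, t, ":")                      # $3 is HH:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) print now - prev
        prev = now
        seen = 1
    }' "$@"
}

# Example using the first three renew lines from the log above.
renew_gaps <<'EOF'
Feb 17 15:14:26 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:16:56 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:18:11 tomato daemon.info udhcpc[285]: Sending renew...
EOF
```

For those three lines it prints 150 and 75. Across the full log the gaps keep roughly halving (37, 18, 9, 4, 2, 1) until the lease expires, which looks consistent with udhcpc retrying after about half of the remaining lease time and never getting an answer.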

Working from the solution mentioned in the forums, I added a firewall rule to the INPUT chain to allow DHCP traffic into the router itself. This worked well until I enabled DMZ mode for my Xbox 360. Once I enabled DMZ mode, the DHCP renewal issue cropped back up and I kept getting dropped. Luckily, I have experience with netfilter and iptables, so I know that DMZ is probably implemented in Tomato as a catch-all PREROUTING rule that NATs all otherwise unmatched inbound connections to a specified address. I also know the PREROUTING chain is processed before the INPUT chain, so any catch-all rule there would trump my fix to allow DHCP in the INPUT chain.

This can be verified with tcpdump and Wireshark. Luckily, there are precompiled versions of tcpdump for the MIPS architecture located at http://ipkg.nslu2-linux.org/feeds/unslung/wl500g/.

In order to get the tcpdump binary onto the router, I had to unpack the ipkg file:

wget http://ipkg.nslu2-linux.org/feeds/unslung/wl500g/tcpdump_3.9.7-1_mipsel.ipk
gzip -dc tcpdump_3.9.7-1_mipsel.ipk | tar xvf -
tar xvzf data.tar.gz
scp opt/bin/tcpdump fw:/tmp

Finally, capturing the data is easy and since we're dealing with DHCP traffic, there's not much worry about filling up the small /tmp filesystem on the router:

/tmp/tcpdump -w /tmp/renew.cap -v -i vlan1 -s 1500 port 67 or port 68

I copied the cap files back to my desktop and fired them up in Wireshark. Not too surprisingly, it's clear as day that the request packets are making it out, but the acknowledgement packets coming back from the DHCP server aren't making it to udhcpc.

Screen capture of wireshark displaying repeated attempts to renew the DHCP lease

After adding the explicit rules to the PREROUTING and INPUT chains, the conversation looks much less confusing:

The logs tell a similar tale. Note the lack of the full re-initialization of dnsmasq, miniupnpd, and the firewall script itself.

Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Lease of 99.29.172.159 obtained, lease time 600
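
To confirm the fix holds over a longer stretch, counting lease events is handier than eyeballing the log. This is a minimal sketch, assuming the same syslog format as above; lease_summary is a name I made up:

```shell
# lease_summary: count how often udhcpc lost its lease versus how often it
# obtained (or renewed) one. Reads the named log files, or stdin if none.
lease_summary() {
    awk '/udhcpc/ && /Lease lost/          { lost++ }
         /udhcpc/ && /Lease of .* obtained/ { obtained++ }
         END { printf "lost=%d obtained=%d\n", lost, obtained }' "$@"
}

# Example using the two log lines above.
lease_summary <<'EOF'
Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Lease of 99.29.172.159 obtained, lease time 600
EOF
```

With the rules in place, the lost count should stay at zero while the obtained count climbs by one every renewal.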

 

Python 2.5.2 Red Hat Enterprise Linux 5 RPMs

In order to support the same version of Python across all of our servers, I’ve also built Python 2.5.2 RPMs for Red Hat Enterprise Linux 5 (Tikanga).

This build is far more straightforward than the build for RHEL4: the system X11 libraries link without patching Setup.dist, and RHEL5 comes with a supported version of expat, so statically linking the library into the pyexpat module isn’t required.

The SRPM: python25-2.5.2-1.el5.src.rpm

Build command:

rpmbuild --define '__python_ver 25' --define 'dist .el5' -ba ~/redhat/SPECS/python.spec

This package will not conflict with the system python package. Scripts should use #!/usr/bin/env python25 to make sure the proper Python is being used.

 

Python 2.5.2 Red Hat Enterprise Linux 4 RPMs

I’ve successfully built Python 2.5.2 RPMs for Red Hat Enterprise Linux 4 (Nahant). The package is named python25 so as not to conflict with the system’s python package.

Other than some minor tweaks to the patch process to account for the location of X11 libraries and db4.2, the only major change is that the pyexpat module is statically linked against libexpat.a, since expat version 1.95.8 is required and not available in RHEL4. If you build my SRPM, you’ll need to download an SRPM for expat-1.95.8, then build and install expat-devel-1.95.8 or greater. Once present, the python25 SRPM will statically link in the correct version of the library.

The SRPM: python25-2.5.2-1.el4.src.rpm

 

Apache and strace /usr/sbin/httpd

Working with Apache today, I ran into an issue where the process would appear to start OK, returning a zero exit status, yet strace was showing a SIGCHLD being caught.

Needless to say, the server wasn’t actually running for any length of time, but I found the following strace command immensely helpful in figuring out the problem.

  strace -o /tmp/httpd.strace -ff /usr/sbin/httpd

Because Apache spawns a number of children, strace with -ff follows each child and records its system calls in /tmp/httpd.strace.$PID.

As it turns out, I was receiving the following error in the child processes:

    bind(5, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("0.0.0.0")}, 16) \
    = -1 EADDRINUSE (Address already in use)
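
With one trace file per process, the quickest way I know to find an error like this is to grep all of them for calls that returned -1. Here is a small sketch; failed_calls is a hypothetical helper name, and it assumes the trace files were written with the -o prefix used above:

```shell
# failed_calls [PREFIX]: list every system call that returned -1 with an
# errno, across all per-process trace files written by "strace -ff -o PREFIX".
# PREFIX defaults to the path used above.
failed_calls() {
    prefix="${1:-/tmp/httpd.strace}"
    # The trailing /dev/null forces grep to print file names even when
    # only one trace file matches the glob.
    grep -- ' = -1 E' "$prefix".* /dev/null
}

# Usage (after running the strace command above):
#   failed_calls /tmp/httpd.strace | grep -v ENOENT
```

Filtering out ENOENT is optional, but failed file lookups during library and config searches tend to be harmless noise that buries the interesting errors.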
 

DD-WRT replaces OpenWRT

Over the past few months, I’ve been getting fed up with stability issues plaguing my OpenWRT-based Linksys WRT54GS v2.0 router. Wireless under OpenWRT was very unreliable, often cutting out in the recent version of White Russian I was running.

Based on the advice of a friend, I’ve re-flashed my firmware to DD-WRT v23 SP2, and I must say, I’m quite impressed. The Web interface is very slick and clean, UPnP is working out of the box, QoS is present and configurable, though I haven’t tested it very much yet, the web interface allows SSH public keys to be configured easily, and stores them in NVRAM variables, and my dynamic DNS host name is also easily configured through the web interface.

All in all, I’m finding DD-WRT to be much more developed and polished than OpenWRT. I’ll comment on this post after a week or so in the event I have stability issues.

 

LVM Host Tagging with iSCSI

The quick problem and fix of the day deals with iSCSI storage, CentOS 5, RHEL5, and LVM. As previously mentioned, I’m using LVM tagging to arbitrate logical volume activation among a set of physical hosts all hitting the same storage. This has been working quite well, and appears to be a simple and effective solution to the clustered Xen host problem.

We recently installed a new iSCSI target, and my boss complained that its LVM logical volumes weren’t active on boot, despite being properly tagged. This is because all block devices are scanned for LVM signatures from within the initial ram disk, not later in the boot process. At this stage, there’s no networking, and the iSCSI initiator hasn’t been brought online yet.

Nothing necessary for boot lives on the iSCSI target; it’s really just a large pool of bits for our backup system. So I decided the simplest solution is to just activate all volumes a second time from /etc/rc.local. This appears to work well and reliably.

  # Append to /etc/rc.local, executed after all other init scripts.
  # Activate all logical volumes tagged with the local machine's hostname.
  lvchange -ay @$(uname -n)
 

Beware: lvm.conf hosttags and kernel upgrades

This is just a record of a problem I ran into today. I’ve been working with two CentOS 5 Opteron servers connected to a shared SAN, and I’m using LVM tags to arbitrate activation of the correct volumes on the correct hosts.

While hacking at /etc/lvm/lvm.conf, I apparently forgot this file is copied into the initial ram disk images in order to make a root device sitting on an LVM logical volume available to the kernel at boot.

I configured LVM to only activate devices with tags matching the host tags, which posed a problem when I upgraded to kernel 2.6.18-8.1.8.el5. At boot, the initial ram disk failed to activate any logical volumes because the tags were incorrect.

The solution was to make sure I specifically list the system volume group in lvm.conf, in addition to the host tags. For example, the following is correct:

tags { hosttags = 1 }
activation { volume_list = [ "@*", "VolGroup00" ] }

While the following configuration doesn’t activate any volumes:

tags { hosttags = 1 }
activation { volume_list = [ "@*" ] }
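
Since a mistake here only shows up at the next reboot, a sanity check before installing a new kernel seems worthwhile. The following is a rough sketch, not something from my original setup: vg_in_volume_list is a made-up helper, and it only understands a volume_list written on a single line, as in the examples above:

```shell
# vg_in_volume_list VG [CONF]: succeed if an uncommented volume_list line
# in CONF (default /etc/lvm/lvm.conf) names the volume group VG in quotes.
# Line-based only: a volume_list spread over several lines is not handled.
vg_in_volume_list() {
    vg="$1"
    conf="${2:-/etc/lvm/lvm.conf}"
    grep -v '^[[:space:]]*#' "$conf" |
        grep 'volume_list[[:space:]]*=' |
        grep -qF -- "\"$vg\""
}

# Usage, before a kernel upgrade rebuilds the initial ram disk:
#   vg_in_volume_list VolGroup00 || echo "VolGroup00 missing from volume_list"
```

The check passes for the first configuration above and fails for the second.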

LVM tags are a really nice feature. They allow me to control which logical volumes sitting on the shared SAN are activated by which of the hosts in the cluster. For example:

[root@xen02 ~]# lvdisplay @`uname -n` | grep /dev/
  LV Name                /dev/xensan/test01.rootfs
  LV Name                /dev/xensan/test01.swapfs
  LV Name                /dev/xensan/test-webserver

[root@xen03 ~]# lvdisplay @`uname -n` | grep /dev/
  LV Name                /dev/xensan/dns1.swapfs
  LV Name                /dev/xensan/dns1.rootfs
  LV Name                /dev/xensan/newtest
  LV Name                /dev/xensan/ns1.vm
  LV Name                /dev/xensan/test2.vm

Shared block devices are nice; they facilitate live migration of guest domains between the two xen hosts.

 

Quick Synergy KVM Scripts

I constantly use keyboard-sharing software like synergy2, teleport, x2vnc, and x2x. I’ve settled on synergy since it’s relatively platform independent. I use the command line program rather than synergyKM, just because I find it far more reliable. I also tunnel through ssh and found the certificate stuff in synergyKM to be less than ideal.

In any case, I wrote a small shell script which fires up the synergy server on the machine with the physical keyboard attached, then reaches out via ssh to the client machine, copies over the synergyc client binary, establishes a reverse tunnel back to the synergy server on my laptop, launches the client in the background on the remote machine so that it connects through the tunnel, and finally detaches the ssh session. When I close my lid and walk away at the end of the day, the tunnel and synergy processes gracefully clean up after themselves.
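
To make that sequence concrete, here is a dry-run sketch that only prints the kind of commands involved. It is not the actual kvmc script: the function name kvmc_plan, the config file location, and the binary paths are all assumptions for illustration. The one firm detail is 24800, which is synergy's default port.

```shell
# kvmc_plan HOST [PORT]: print, without running anything, the commands a
# script like kvmc might issue. Paths and options here are assumptions.
kvmc_plan() {
    host="$1"
    port="${2:-24800}"          # 24800 is synergy's default port
    cat <<EOF
synergys --config \$HOME/.synergy.conf
scp /usr/bin/synergyc $host:/tmp/synergyc
ssh -f -R $port:localhost:$port $host /tmp/synergyc -f localhost:$port
EOF
}

kvmc_plan ford
```

The last printed line carries most of the design: -R forwards the port on the remote machine back to the synergy server on the laptop, -f sends ssh to the background once it has authenticated, and running synergyc in the foreground (its own -f) ties the client's lifetime to the ssh session.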

I find this setup ideal, because whenever I bring my portable machine into the office in the morning, I just run

kvmc ford

and the script takes care of everything for me.

The scripts are located in the northstarlabs repository. As usual, most of their usefulness is derived from password-less ssh authentication using public keys and ssh-agent.

 

Moving Xen 2.0.x Guests to newer RHEL5 or CentOS5 Linux hosts.

Last Friday I attempted to move some of our older virtual machines hosted on do-it-yourself, jailtime-based Xen 2 hosts. The end result is that I decided it’s more trouble than it’s worth, primarily because Red Hat and CentOS handle guest machines in a fundamentally different way than I handled them with Xen 2, before there was any official vendor support.

The difference primarily lies in the partition table and boot process of the guest virtual machine. In Red Hat Enterprise Linux 5 (and CentOS 5), the system assumes you’ll be using a full partition table inside the guest machine’s block device. These devices are then mapped to /dev/xvd??, whereas in Xen 2 I would map unique block devices into partitions like /dev/sda1 and /dev/sda2. The xvda devices alone aren’t a big problem. The show-stopping problems I ran into stem from the initial ram disk now used by the modular Xen 3 kernel, the change in hardware architecture between the old and new machines, and a few other factors.

Red Hat also assumes you’ll be using pygrub to boot the kernel and initial ram disk from inside the guest file system, rather than the host file system. This more closely resembles the behavior of a “real” physical machine, so I understand why they’re doing things this way. Pygrub and the partition table end up being a bit awkward if you want to mount the guest file system inside the host, however, as you need to use byte offset options and loopback block devices since the guest partition table isn’t visible to the host.
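
The byte offset is just the partition's start sector multiplied by the sector size. This small sketch shows the arithmetic; partition_offset is a name I made up, the image and mount paths in the comment are placeholders, and 512-byte sectors are assumed unless told otherwise:

```shell
# partition_offset START_SECTOR [SECTOR_SIZE]: byte offset of a partition
# inside a disk image, given its start sector (e.g. from "fdisk -lu IMAGE").
partition_offset() {
    echo $(( $1 * ${2:-512} ))
}

# A first partition starting at sector 63, a common legacy layout:
partition_offset 63

# The result feeds the loopback mount on the host, for example:
#   mount -o loop,offset=$(partition_offset 63) /path/to/guest.img /mnt/guest
```

For a start sector of 63 this prints 32256.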

For the indexing bots and the curious, my full attempt is logged in my wiki at ReferenceXen2toXen3onCentOS5.

Since there’s now official vendor support for Xen virtual machines, I figure now’s the time to switch to their way of doing things in order to prevent migration issues like this after future updates to RHEL and CentOS.