Archive for February, 2010

OpenSolaris milestone/xvm grub dom0_mem problem

I’ve recently been struggling to track down a problem with my OpenSolaris xVM system, which runs xVM on OpenSolaris b133. My manual configuration of dom0_mem in /rpool/boot/grub/menu.lst keeps getting overwritten on reboot. This is a problem because I need dom0 clamped to a fixed size to prevent Xen’s balloon feature from fighting with the ZFS ARC. On top of this, bugs in OpenSolaris builds b132 and b133 require the config/dom0-min-mem SMF property to be set to match dom0_mem.

I’ve also been running into the dom0-min-mem issues documented at My South – Sun xVM 3.4.2 available, dom0_min_mem. Pascal also mentions setting the dom0-min-mem property, but doesn’t appear to be running into the issue I have with b132 and b133, where the property is consistently changed by the xvm-milestone service method script.

The problem is caused by the SMF xvm milestone (svc:/milestone/xvm) repeatedly rewriting these properties and the menu.lst file. The solution is to disable the xvm milestone and re-enable the xvm services manually. You can then edit menu.lst by hand without the xvm milestone overwriting your changes.

OpenSolaris introduced the xvm milestone in b126 around October of 2009. Please see [xen-discuss] FYI: enable/disable the xVM hypervisor.

Here is the recipe to fix the problem. First, make a backup copy of your menu.lst file, then disable the xvm milestone, enable the other xvm SMF services, and finally restore your menu.lst file. We do this because disabling the xvm milestone disables all of xvm, where we really just want to prevent /lib/svc/method/xvm-milestone from executing.

This assumes you already have xVM enabled through the use of svcadm enable milestone/xvm.

cd /rpool/boot/grub
pfexec cp -p menu.lst menu.lst.milestone-xvm.enabled
pfexec svcadm disable milestone/xvm
pfexec svcadm enable -r svc:/system/xvm/domains:default
pfexec cp -p menu.lst menu.lst.milestone-xvm.disabled
pfexec cp -p menu.lst.milestone-xvm.enabled menu.lst

Before rebooting, ensure the dom0_mem setting is something reasonable. I find 1.5GB to be a good balance.

title os-133-xvm1
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/os-133-xvm1
kernel$ /boot/$ISADIR/xen.gz console=vga dom0_mem=1536M dom0_vcpus_pin=false watchdog=false
module$ /platform/i86xpv/kernel/$ISADIR/unix /platform/i86xpv/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
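To double-check what is actually in the file before rebooting, a small helper can print the dom0_mem value from every xen.gz kernel line. This is only a sketch: the function name dom0_mem_values is made up for illustration, and it assumes the kernel lines look like the entry above.

```shell
# Print the dom0_mem value from every xen.gz kernel line in a GRUB menu file.
# dom0_mem_values is a hypothetical helper name, not part of OpenSolaris.
dom0_mem_values() {
    # $1: path to menu.lst (defaults to the path used in this post)
    menu="${1:-/rpool/boot/grub/menu.lst}"
    # Keep only the xen.gz lines, then print what follows "dom0_mem=".
    # Lines without a dom0_mem setting print nothing.
    grep 'xen\.gz' "$menu" | sed -n 's/.*dom0_mem=\([^ ]*\).*/\1/p'
}
```

Running dom0_mem_values with no arguments reads /rpool/boot/grub/menu.lst and prints one value per xVM boot entry, so a stray entry with a different setting is easy to spot.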

Finally, ensure SMF properties match the dom0_mem value:

svccfg -s svc:/milestone/xvm listprop hypervisor/dom0_mem
svccfg -s xend listprop config/dom0-min-mem

If they don’t match, they may be set using:

pfexec svccfg -s svc:/system/xvm/xend setprop config/dom0-min-mem = 1536
pfexec /usr/sbin/svccfg -s svc:/milestone/xvm setprop hypervisor/dom0_mem = 1536
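The GRUB setting carries a unit suffix (dom0_mem=1536M) while the SMF properties above are set to a bare number (1536), so comparing them by eye is easy to get wrong. Here is a rough sketch of the conversion, assuming the SMF values are in megabytes as the commands above suggest. The helper name to_mb is hypothetical, and it only handles the M and G suffixes.

```shell
# Convert a dom0_mem value such as 1536M or 2G to a plain megabyte count.
# to_mb is a hypothetical helper; only the M and G suffixes are handled.
to_mb() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;   # gigabytes to megabytes
        *M) echo "${1%M}" ;;               # already megabytes, drop the suffix
        *)  echo "unsupported dom0_mem value: $1" >&2; return 1 ;;
    esac
}
```

For example, to_mb 1536M yields the number to compare against the output of the two listprop commands.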

I plan to diagnose just why the xvm-milestone service method script is misbehaving so much and file the appropriate bug reports. If anyone has any suggestions or ideas, please let me know.

 

Tomato and AT&T U-Verse Disconnects

I recently ran into an issue with my home network setup where my Linksys WRT54G router running Tomato 1.27 was disconnecting my long-running active TCP connections every 10 minutes or so. After further investigation, this turns out to be a common issue caused by Tomato’s DHCP client performing a unicast DHCP renewal which the firewall blocks or misroutes.

A number of people have published similar reports, but none of the suggested solutions appeared to work reliably for me, so I decided to diagnose, troubleshoot and resolve the issue myself. Here’s how I solved the problem. The notes I gathered while working on this are also located at 2WIRE & Tomato – Google Docs.

If you’d like to stop reading and skip right to the payoff, simply add the following two lines to the firewall script, which is located in the web-based user interface under Administration, Scripts, on the Firewall tab:

iptables -t nat -I PREROUTING -p udp -i vlan1 --dport 68 --sport 67 -j ACCEPT
iptables -I INPUT -p udp -i vlan1 --dport 68 --sport 67 -j ACCEPT

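These firewall rules accept DHCP replies arriving on the WAN interface (vlan1) from the server (UDP source port 67, destination port 68), regardless of whether the traffic is broadcast or unicast. The PREROUTING rule keeps the replies from being caught by other NAT rules, and the INPUT rule lets them reach the DHCP client on the router. Please let me know if these rules are not optimal or could be improved.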
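The two rules hard-code vlan1, which is the WAN interface on my WRT54G. If your WAN interface has a different name, a small helper like the sketch below can print the same two rules for it. The function name dhcp_accept_rules is made up for illustration, and it only prints the commands rather than running them.

```shell
# Print the two DHCP accept rules for a given WAN interface.
# dhcp_accept_rules is a hypothetical helper name; it prints the iptables
# commands so they can be reviewed and pasted into the firewall script.
dhcp_accept_rules() {
    wan="$1"
    if [ -z "$wan" ]; then
        echo "usage: dhcp_accept_rules <wan-interface>" >&2
        return 1
    fi
    echo "iptables -t nat -I PREROUTING -p udp -i $wan --dport 68 --sport 67 -j ACCEPT"
    echo "iptables -I INPUT -p udp -i $wan --dport 68 --sport 67 -j ACCEPT"
}
```

Calling dhcp_accept_rules vlan1 reproduces the rules above. On Tomato, I would expect nvram get wan_ifname to report the interface name, but treat that as an assumption and confirm it on your own router.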

My troubleshooting process follows.

I can see in the logs that udhcpc attempts a renewal right up until the lease expires:

Feb 17 15:14:26 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:16:56 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:18:11 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:18:48 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:06 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:15 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:19 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:21 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Lease lost, entering init state
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_allmulti(master, 1)
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_promiscuity(master, -1)
Feb 17 15:19:22 tomato user.info kernel: device vlan1 left promiscuous mode
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending discover...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Sending select for 99.29.172.159...
Feb 17 15:19:22 tomato daemon.info udhcpc[285]: Lease of 99.29.172.159 obtained, lease time 600
Feb 17 15:19:22 tomato user.info kernel: vlan1: dev_set_allmulti(master, -1)
Feb 17 15:19:22 tomato daemon.info dnsmasq[12612]: exiting on receipt of SIGTERM
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: started, version 2.51 cachesize 150
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: compile time options: no-IPv6 GNU-getopt no-RTC no-DBus no-I18N DHCP no-scripts no-TFTP
Feb 17 15:19:22 tomato daemon.info dnsmasq-dhcp[13007]: DHCP, IP range 192.168.3.100 -- 192.168.3.149, lease time 1d
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: reading /etc/resolv.dnsmasq
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 192.168.4.254#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 8.8.4.4#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: using nameserver 8.8.8.8#53
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: read /etc/hosts - 0 addresses
Feb 17 15:19:22 tomato daemon.info dnsmasq[13007]: read /etc/hosts.dnsmasq - 16 addresses
Feb 17 15:19:25 tomato daemon.err miniupnpd[12649]: recv (state0): Connection reset by peer
Feb 17 15:19:27 tomato daemon.notice miniupnpd[12649]: received signal 15, good-bye
Feb 17 15:19:27 tomato daemon.notice miniupnpd[13043]: HTTP listening on port 5000
Feb 17 15:19:27 tomato daemon.notice miniupnpd[13043]: Listening for NAT-PMP traffic on port 5351
Feb 17 15:19:27 tomato user.info kernel: device br0 left promiscuous mode
Feb 17 15:19:27 tomato user.info kernel: vlan1: dev_set_allmulti(master, -1)
Feb 17 15:19:27 tomato user.info kernel: vlan1: del 01:00:5e:00:00:02 mcast address from master interface
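The spacing of the "Sending renew..." lines is worth a closer look. The sketch below prints the number of seconds between consecutive renew attempts in a syslog excerpt read from standard input. The function name renew_gaps is made up, and it assumes the timestamp is the third field as in the log above and that the excerpt does not cross midnight.

```shell
# Print the seconds between consecutive "Sending renew" log lines.
# renew_gaps is a hypothetical helper; it reads syslog lines on stdin and
# assumes field 3 is HH:MM:SS and that the excerpt stays within one day.
renew_gaps() {
    grep 'Sending renew' | awk '{
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) print now - prev
        prev = now
    }'
}
```

For the log above, the first two gaps come out as 150 and 75 seconds, and the remaining gaps keep roughly halving until the lease is lost. That pattern looks like the client retrying at half the remaining lease time without ever receiving an answer.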

Working with the solution mentioned in the forums, I added a firewall rule in the INPUT chain to allow DHCP traffic into the router itself. This worked well up until I enabled DMZ mode for my Xbox 360. Once I enabled DMZ mode, the DHCP renewal issue cropped back up and my connections kept getting dropped. Luckily, I have experience with netfilter and iptables, so I know that DMZ is probably implemented in Tomato as a catch-all PREROUTING rule to perform NAT on all unknown connections to a specified address. I also know the PREROUTING chain is processed before the INPUT chain, so any catch-all rule there would trump my fix to allow DHCP in the INPUT chain.

This can be verified with tcpdump and Wireshark. Luckily, there are pre-compiled versions of tcpdump for the MIPS architecture located at http://ipkg.nslu2-linux.org/feeds/unslung/wl500g/.

In order to get the tcpdump binary onto the router, I had to unpack the ipkg file:

wget http://ipkg.nslu2-linux.org/feeds/unslung/wl500g/tcpdump_3.9.7-1_mipsel.ipk
tar xzf tcpdump_3.9.7-1_mipsel.ipk   # assumes the .ipk is a gzipped tar that contains data.tar.gz
tar xvzf data.tar.gz
scp opt/bin/tcpdump fw:/tmp

Finally, capturing the data is easy and since we're dealing with DHCP traffic, there's not much worry about filling up the small /tmp filesystem on the router:

/tmp/tcpdump -w /tmp/renew.cap -v -i vlan1 -s 1500 port 67 or port 68

I copied the cap files back to my desktop and opened them in Wireshark. Not too surprisingly, it's clear as day that the request packets are making it out, but the acknowledgement packets coming back from the DHCP server aren't making it to udhcpc.

Screen capture of wireshark displaying repeated attempts to renew the DHCP lease

After adding the explicit rules to the PREROUTING and INPUT chains, the conversation looks much less confusing.

The logs tell a similar tale. Note the absence of the full re-initialization of dnsmasq, miniupnpd, and the firewall script itself.

Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Sending renew...
Feb 17 15:29:31 tomato daemon.info udhcpc[285]: Lease of 99.29.172.159 obtained, lease time 600