STATUS updates
Some links of interest:
I installed 2 additional OpenBSD systems as virtual machines… one i386, the other amd64.
Instead of installing packages, I opted to go the ports system route, which ends up building the packages and then installing them (something I mostly knew, but hadn't fully appreciated, so this was a good realization).
I made the following ports (make and make install):
Really, one just needs to cd into /usr/ports and then into the appropriate subdirectory, and run make followed by make install.
To build the non-X11 version of vim, I needed to make a change to the Makefile in /usr/ports/editors/vim… basically, look for the following lines:
FLAVORS= huge gtk2 athena motif no_x11 perl python ruby
FLAVOR?= gtk2
And change the FLAVOR?= line from gtk2 to no_x11… continue as usual.
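For reference, the overall flow looks roughly like this (a sketch of the steps described above, using the vim port as the example):

cd /usr/ports/editors/vim
# edit Makefile so the FLAVOR?= line reads: FLAVOR?= no_x11
make            # fetches, extracts, patches, and builds the port (plus dependencies)
make install    # packages up the result and installs it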
Yesterday, we set up the LAIRwall at CCC for the 2011 Student Art Show. Everything tested out fine.
This morning, it appears that only wall01, wall03, and wall05 powered on at the designated time… wall02, wall04, and wall06 did not power on until sometime later (around 8:11am; they were supposed to power on around 7:32-7:40am).
My first remedy is to ensure all clocks are synchronized accordingly.
My preorder of OpenBSD 4.9 arrived today, and I happily set about getting things set up to PXE boot from it, primarily focusing on the ALIX.2 board.
I had to configure PXE to force the serial console on install, which was made possible by creating an etc/boot.conf file within /export/tftpboot on NFS, which contains:
set tty com0
boot tftp:/distros/openbsd/4.9/bsd.i386.rd
I also configured a DHCP and DNS entry for the ALIX board, so I could purposefully steer it in the right direction (since the LAIR netboot menu couldn't exactly appear and be usable).
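For the record, the DHCP side of that boils down to a host entry along these lines (a sketch assuming ISC dhcpd; the MAC address, IP addresses, and filename below are placeholders, not the real values):

host alix {
        hardware ethernet 00:0d:b9:00:00:00;    # placeholder MAC for the ALIX board
        fixed-address 10.80.2.200;              # placeholder address
        next-server 10.80.2.10;                 # placeholder tftp server (the box serving /export/tftpboot)
        filename "pxeboot";                     # OpenBSD's PXE second-stage loader
}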
I got OpenBSD 4.9/i386 installed on it.
Some links:
Read-Only OpenBSD:
Turns out my calcweek script has a logic bug! It was a case where a calculation mixing a two-digit and a three-digit day-of-year value would yield woefully incorrect values for the week (tripping up the logic, making it think we're past the end of the semester, and forcing week to be 0).
I manually overrode the gn scripts last week, but forgot to look into it. This morning, I fixed it.
The fix: add a leading "0" to the $sem_start portion of the week calculation when $sem_start is still below 100 but the current day of the year ($sem_today) has reached 100.
##
## Perform THE calculation
##
fill=""
if [ "$sem_start" -lt 100 ]; then
    if [ "$sem_today" -ge 100 ]; then
        fill="0"
    fi
fi
week="$((((((${cur_year}${sem_today})-(${sem_year}${fill}${sem_start}))/7)+1)-$boffset))"
If the conditions are right, $fill becomes “0”… otherwise, it remains null.
Note that we couldn't just slap a leading 0 onto variables if they were less than 100, as that would automatically kick on “I'm an octal number” functionality, which we definitely do not want!
I discovered this morning that, although I was able to VPN in as I always do… I was unable to ping or ssh into anything on the lair.lan side of the universe.
I could ssh into juicebox, and anything on the offbyone.lan or student.lab portions of the universe, but not places like ahhcobras.
I went in and enabled a skip on the tun0 and tun1 interfaces on jb, reloaded the rules, and things lit back up.
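For reference, the relevant pf.conf addition on jb amounts to something like this (a sketch; it tells pf not to filter on the tunnel interfaces at all):

# in pf.conf on jb: skip filtering on the VPN tunnel interfaces
set skip on { tun0 tun1 }
# then reload the ruleset
pfctl -f /etc/pf.conf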
After seeing some output errors on main.sh for several weeks, I thought to finally do something about it. So I went in and tweaked the logic.
I thought to look (again) into fixing that indexed log message that profusely pops up.
I stumbled across a link I had visited before, this time it appeared to work:
auth1:~# echo "dn: olcDatabase={1}hdb,cn=config
changetype: modify
replace: olcDbIndex
olcDbIndex: uid,uidNumber,gidNumber,memberUid,uniqueMember,objectClass,cn eq" > indexchanges.ldif
auth1:~# sudo ldapmodify -f indexchanges.ldif -D cn=admin,cn=config -x -y /etc/ldap.secret
auth1:~# sudo /etc/init.d/slapd stop
auth1:~# sudo su -s /bin/bash -c slapindex openldap
auth1:~# sudo /etc/init.d/slapd start
The trick is the slapindex that I never noticed before and therefore never ran.
With my campus network explorations, I have discovered some background packet noise on our network, emanating from mgough's bigworld server.
Specifically, it is some UDP traffic sent from port 20018 to port 20018 at 255.255.255.255 (broadcast on a /32?), and at least everyone on 10.80.2.x sees it (annoying, but harmless as far as I am concerned).
A packet sniff for it would show the following:
machine:~# tcpdump -i eth0 host 10.80.2.60
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:27:55.519034 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:27:58.106817 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:01.500528 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:04.339277 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:06.859047 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:09.531102 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:13.387211 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:16.463583 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
16:28:20.451743 IP bigworld.offbyone.lan.20018 > 255.255.255.255.20018: UDP, length 20
So, as I've been “getting around to” fixing small things today, I finally put a lid on this one, via /etc/rc.local so it'll be fixed on subsequent boots:
iptables -A OUTPUT -p udp --dport 20018 -j DROP
bam! Traffic stopped. The world is a little bit quieter.
I've continued to look into possible configuration issues that would impact our campus network performance problems. I continue to come up empty handed. There is NOTHING I am setting that is directly causing an adverse reaction.
I did have an idea about looking into compressing traffic (especially after my experiences with latency yesterday). After some looking, I discovered mod_deflate.
I enabled it, and restarted apache2.
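On a Debian-style apache2 install, enabling it comes down to roughly the following (a sketch, not necessarily the exact commands I ran):

a2enmod deflate
/etc/init.d/apache2 restart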
According to http://www.whatsmyip.org/http_compression/, traffic is now compressed (typically around 70%), so the amount of data transferred is much less. Maybe this will help (but only in an "avoiding the problem" sort of way).
I set about implementing my PF rule improvements on juicebox this morning. Aside from accidentally blocking ICMP traffic for about 20-30 minutes, the transition appears to have gone smoothly, and will enable me to try some other experiments re: the campus network performance.
We may need to wait until JB is upgraded, as I think I want to try playing with relayd and do some TCP proxying (local connection to port 80 on JB then gets directed off somewhere… so the IP would appear as coming from JB itself, not from the internet).
I've added some additional queues:
I apparently need to do a full PF flush at some point, because the VPN sub-queues do not seem to be working (all traffic is still being tallied in the master vpn queue). All the other queues/sub-queues are working as they should.
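When I get around to it, the flush will probably look something like this (a sketch; note that flushing everything also drops existing state, so it needs to be timed carefully):

pfctl -F all            # flush rules, queues, states, tables, etc.
pfctl -f /etc/pf.conf   # reload the full ruleset, including the ALTQ queue definitions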
Since the MTU downsizing didn't seem to have any appreciable impact, and disabling DNS didn't make a difference… a random idea I should try at some point is to forcibly lower the TTL on outgoing packets on capri (preferably on a returned packet within an established HTTP connection) to see if that makes any difference when packets are lost.
After all, when coming from campus to the BDC… we are still in somewhat of a closed loop, so there shouldn't be that many routers in the way.
I had an opportunity to experience some network latency firsthand while up on campus. On a 10.100.60.210 IP (the computer in my office up on campus)… I was experiencing the horrific delays and page load times.
Unfortunately, my MTU theories did not seem to rectify it (although occasionally I would get an amazing page load transaction).
Typically, it appears as though traffic bursts and then halts, bursts again and then halts… etc. until done.
I was able to witness this through: telnet 143.66.50.18 80 and doing a: get /
I wonder if those lost packets are mucking with the transaction, causing delays as things time out and re-ACK.
At any rate it is good to realize I have more ready access to a likely troublesome machine for testing.
I discovered that tagging all outgoing packets as ToS “low delay” broke ALTQ's ability to distinguish between interactive ssh traffic and “bulk” scp/sftp ssh traffic.
I have removed this option, and things are back to where they should be.
As I'm investigating the network performance issues on the campus network, I definitely need to learn to use a packet sniffing tool. In the long term I want that tool to be tcpdump, but in the present time Wireshark (if only deceptively) gives me the feeling I have some flexibility over the information I can access.
So, following on that, I need to have wireshark deployed in all the places I need to analyze traffic.
This means I need to put wireshark on capri, and that's exactly what I've been working on.
The process of installing Wireshark…
First up, I need to have gtk+ 2 installed (since it is an X/GTK2 application). This means pulling in the various packages (there are many prerequisites).
caprisun:~# export PKG_PATH=ftp://ftp.openbsd.org/pub/OpenBSD/4.4/packages/i386/
caprisun:~# pkg_add gtk+2-2.12.11.tgz
...
As a point of information, packages being pulled off ftp from the internet (via capri) are appropriately queued in the "etc" queue, thereby not interfering with existing ssh, vpn, or web traffic.
Once that is done, download/extract the wireshark source, and run configure, then 'gmake'.
caprisun:~$ wget http://wiresharkdownloads.riverbed.com/wireshark/src/wireshark-1.4.6.tar.bz2
...
caprisun:~$ tar -jxf wireshark-1.4.6.tar.bz2
...
caprisun:~$ cd wireshark-1.4.6
caprisun:~/wireshark-1.4.6$ ./configure
...
caprisun:~/wireshark-1.4.6$ gmake
...
Also of note: make sure the OpenBSD packages 'gmake' and 'gtar' are installed (I had tar aliased to gtar, since the default BSD tar doesn't recognize -j).
I found I needed to add some more libraries to the system library path. ldconfig is the way to do this:
ldconfig -R /usr/local/lib
ldconfig -R /usr/X11R6/lib
Assuming X11 forwarding is enabled in sshd (kill -1 the sshd process to reload the config), and you ssh in with -X specified, you can run X applications remotely and it should work.
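Roughly, the server-side steps amount to the following (a sketch; the pid file path is the usual default, and the user/host in the client example are placeholders):

# in /etc/ssh/sshd_config, make sure this is set:
#   X11Forwarding yes
kill -1 $(cat /var/run/sshd.pid)        # HUP sshd so it re-reads its config
# then, from a workstation running an X server:
#   ssh -X user@caprisun wireshark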
So far I've heard that the latency problem hasn't gone away, but no responses from students yet aside from “seems good”. So we will see.
I added a few more entries for empty zones into capri's DNS config, to handle some campus subnets (especially computer labs) to see if this helps to make any difference in the on-going campus network performance investigations.
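The entries are along these lines (a sketch assuming BIND; the zone name and file path are just examples):

// claim the reverse zone for a campus lab subnet so those lookups are answered locally
zone "60.100.10.in-addr.arpa" {
        type master;
        file "/var/named/db.empty";     // placeholder path to a minimal (SOA/NS only) zone file
};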
My explorations continue….
First up, more useful links:
Learned that ICMP is pretty much all blocked, so Path MTU Discovery is out of the question, and we must resort to packet fragmentation (the first link to the cisco whitepaper is even quite informative).
I disabled Path MTU Discovery on capri (via sysctl).
Also adjusted the ext_if scrub rule in pf.conf:
scrub on $ext_if all random-id min-ttl 64 max-mss 1400 \
    set-tos lowdelay reassemble tcp fragment reassemble
I'm starting to think that we may not need ALL the scrubbing rules, and perhaps just one per interface will do.
Still need to do lots of testing…
An additional feature to add to my queueing setup on caprisun was the differentiation of ssh traffic (interactive ssh sessions vs. scp/sftp transfers).
As it turns out, pf can distinguish between the two types of traffic by analyzing the ToS (Type of Service) flag, where interactive ssh sessions flip the “low delay” setting, and the bigger “bulk” transfers do not.
pf makes this nice with the queue keyword… actual syntax follows:
altq on $ext_if cbq bandwidth 1980Kb queue { ssh, vpn, web, etc }
queue ssh bandwidth 40% priority 7 cbq(borrow red) { ssh_login, ssh_bulk }
queue  ssh_login bandwidth 60% priority 7 cbq(borrow red)
queue  ssh_bulk bandwidth 40% priority 5 cbq(borrow red)
queue vpn bandwidth 20% priority 6 cbq(borrow red)
queue web bandwidth 30% priority 4 cbq(borrow red)
queue etc bandwidth 10% priority 1 cbq(default borrow)
...
pass in quick on { $int_if, $bbn_if } from any to any tagged SSHTAG \
    queue (ssh_bulk, ssh_login)
pass in quick on { $ext_if } proto tcp from $approved tagged SSHTAG \
    queue (ssh_bulk, ssh_login)
pass in quick on { $ext_if } proto tcp tagged SSHTAG \
    queue (ssh_bulk, ssh_login) flags S/SA keep state \
    (max-src-conn 48, max-src-conn-rate 6/60, overload <brutes> flush global)
Note the queue (ssh_bulk, ssh_login)… that's the magic… the second queue listed is used for packets with a ToS of low delay, as well as for content-less ACK packets. So bam! Just have unique queues set aside, and assign as appropriate.
I stumbled across this site, calomel.org, which has tons of nifty tutorials on PF and related OpenBSD thingies.
This site also has some nifty things:
I've been revisiting the traffic sniffing adventures from earlier in the week.
To start off with, useful links:
Some additional vectors of attack include:
So far, I haven't had much luck with the MTU sizing stuff… with all the examples I try, I can never get the "payload too big" message, and I can crank the size up to really obscene values.
While toying with MTU values, I apparently set one too small (to 512). This created a bout of unhappiness, and to make sure everything was happy, I gave capri a reboot. Lesson learned.
My network explorations have led me into the realm of Path MTU Discovery.
Links!
Referencing the Network Tuning and Performance guide at:
I ended up modifying the following sysctl's on caprisun (set in /etc/sysctl.conf):
net.inet.tcp.mssdflt=1400       # Set MSS
net.inet.ip.ifq.maxlen=768      # Maximum allowed input queue length (256 * number of interfaces)
net.inet.tcp.ackonpush=1        # ACKs for packets with the push bit set shouldn't be delayed
net.inet.tcp.ecn=1              # Explicit Congestion Notification enabled
net.inet.tcp.recvspace=262144   # Increase TCP "receive" window size to increase performance
From system defaults, net.inet.ip.forwarding has also been changed (set to 1) to enable IP forwarding.
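Values in /etc/sysctl.conf only get applied at boot; to make them take effect immediately, it's something like this (a sketch):

caprisun:~# sysctl net.inet.ip.forwarding=1
caprisun:~# sysctl net.inet.tcp.mssdflt=1400
caprisun:~# sysctl net.inet.tcp.recvspace=262144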
My quest to better learn, and subsequently optimize, our PF rulesets continues!
Some links:
I installed pftop on capri, which allows me to see lots of neat things all condensed and organized.
pftop -v rules
has been particularly useful.
Kelly came and sat with me yesterday to help debug the on-going performance issues experienced from SOME CCC networks to the Lab46 web server.
Specifically, the following network VLANs experience sub-optimal delays (9000+ ms):
But the following consistently demonstrate adequate (200 ms) responses:
These results were gathered from performing “httping”s from the various networks.
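For reference, the probes were of the general form below (httping's -c sets the number of probes and -g the URL to hit; the URL here just points at the Lab46 web server IP noted elsewhere in these updates, not necessarily the exact page we measured):

httping -c 5 -g http://143.66.50.18/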
What is confusing is that:
A bit more on that last point:
So, while the symptoms may indicate that the problem is specific to our web server (either the config, or the config+certain content)… I am not entirely convinced that is the problem.
So could dokuwiki somehow be selectively filtering connections from those networks but not others, even though we have never configured it to? How would it know the difference between 10.100 and 10.200 when nowhere in its config have we ever mentioned any of those addresses?
One thing I did notice during some of our network investigations was that wireshark identified some TCP fragment reassembly errors.
In the process of investigating the apache2 config, I thought to search for the combination of two seemingly unrelated things… "apache2 tcp fragment reassembly", and variations thereof.
This actually turned up some very interesting things. The useful URLs:
It indicated a possible MSS/MTU problem on the network, which would cause such problems when sufficiently large pages are served, resulting in painfully slow page loads (exactly the issue, on those networks). This refined exploration led me to the following informative page:
Which goes into some nice detail about MTU, ICMP filtering, and related things.
The interesting bit I read from this document was:
Many network administrators have decided to filter ICMP at a router or firewall. There are valid (and many invalid) reasons for doing this, however it can cause problems. ICMP is an integral part of the Internet and can not be filtered without due consideration for the effects.
In this case, if the ICMP can't fragment errors can not get back to the source host due to a filter, the host will never know that the packets it is sending are too large. This means it will keep trying to send the same large packet, and it will keep being dropped–silently dropped from the view of any system on the other side of the filter.
We know that campus blocks ICMP, and is likely running a tighter network on the troublesome VLANs than the non-troublesome ones.
So my current investigations will be exploring MTU size (and where that needs to be set to hopefully make a difference) and then moving up from there.
As a result of the performance investigations, I did uncover some areas of our apache2 config that could see some improvements.
The following documents proved useful in this regard:
Some useful pf information:
I found some tweaks to apply to capri's pf.conf.
I set an MTU of 1400 on the following machines/interfaces (manually set, will go away on reboot):
No discernible difference in performance yet (at least on the “known working” networks)… will be testing this and my other changes later on our test environment:
The MSS/MTU ratio is not optimal, but somewhat intentionally undercut to see if it makes a difference. We'll be sending more, but smaller, packets. Let's see how this works.
After some further searching, playing, and exploring, I have refined the values a bit.
On the above-mentioned systems, MTU is now set to 1440, and MSS is at 1400.
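Concretely, that amounts to something like the following (a sketch; em0 is a placeholder interface name, and the MSS side is handled by the max-mss value in the pf scrub rule shown earlier):

ifconfig em0 mtu 1440       # manual change; does not survive a reboot
# and in pf.conf, the scrub rule clamps the TCP MSS:
#   scrub on $ext_if all ... max-mss 1400 ...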
Still have to do much more testing to see if this actually made any difference whatsoever.
Neat links that have some interesting information:
Started tinkering with altq on pf on caprisun…
Some useful links:
While debugging the "lab46 slowness" and following up on the LAIR's interaction with IT, I discovered that the subnet mask and broadcast address associated with 143.66.50.18 were incorrect. So we're going to try setting them manually with the correct values.
Before:
!ifconfig $if lladdr 00:16:3e:5d:88:d6 metric 0 dhcp
After:
!ifconfig $if lladdr 00:16:3e:5d:88:d6 metric 0 inet 143.66.50.18 255.255.255.248 143.66.50.23
!route add -net default 143.66.50.22
Caprisun was up for 300 days.
Further optimizations.
Before:
inet 10.80.2.1 255.255.255.0 10.80.2.255
inet alias 192.168.10.248 255.255.255.0 192.168.10.255
After:
inet 10.80.2.1 255.255.255.0 10.80.2.255
And optimizations here too.
Before:
inet 10.10.10.2 255.255.255.0 10.10.10.255
!route add 10.80.1.0/24 10.10.10.1
!route add 10.80.3.0/24 10.10.10.3
After:
inet 10.10.10.2 255.255.255.0 10.10.10.255
Boot sequence was hanging on the start of nullmailer-send, because it was grabbing the foreground.
This was fixed by doing the following:
(/usr/local/sbin/nullmailer-send 2>&1 &) && echo -n "nullmailer "
Still spits out “rescanning queue”, but does what we want.
Did an aptitude update && aptitude upgrade on the wildebeest herd, nfs1 and nfs2. Minor updates were applied.
Also updated the lairstations, bringing lairstation4 back on-line.
Many updates were available to install, and then I also installed SDL and the Java JDK (I had to enable the Canonical entry in sources.list).
Total package list is as follows:
lairstationX:~# aptitude install libsdl-ttf2.0-0 libsdl-ttf2.0-dev libsdl-sound1.2 libsdl-sound1.2-dev libsdl-net1.2 libsdl-net1.2-dev libsdl-mixer1.2 libsdl-mixer1.2-dev libsdl-gfx1.2-4 libsdl-gfx1.2-dev libsdl-image1.2 libsdl-image1.2-dev sun-java6-jdk
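For the curious, the sources.list entry needed for sun-java6-jdk is typically along these lines (this is an assumption on my part, and the release name is a guess for machines of this vintage):

deb http://archive.canonical.com/ubuntu lucid partner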
Squirrel was playing with some symbol manipulation in his SunPC decoding efforts, and I found some interesting knowledge worth documenting.
First up, two files, main.c and thing.c, defined as follows:
main.c:

#include <stdio.h>

void thing();

int main()
{
    thing();
    return(0);
}

void thing()
{
    printf("Local thing\n");
}
thing.c:

#include <stdio.h>

void thing()
{
    printf("External thing\n");
}
lab46:~$ gcc -c main.c
lab46:~$ gcc -c thing.c
lab46:~$ nm main.o
0000000000000000 T main
                 U puts
0000000000000015 T thing
lab46:~$ nm thing.o
                 U puts
0000000000000000 T thing
lab46:~$
The intention is to have main() call the thing() from thing.o instead of the one in main.o… but we can't do this by default, because the linker sees (and uses) the thing() in main.o.
To get around this, one way is to alter the state of the local thing symbol to a “weakened” state, which can be accomplished via the objcopy command:
lab46:~$ objcopy -Wthing main.o
lab46:~$ nm main.o
0000000000000000 T main
                 U puts
0000000000000015 W thing
lab46:~$
Note from the previous nm output the change from 'T' to 'W' states for the thing symbol. This means thing as part of main.o is now weakened.
We can then perform the desired operation:
lab46:~$ gcc -o final main.o thing.o
lab46:~$ ./final
External thing
lab46:~$
Pretty cool.
Question is… is there a way to remove the thing symbol entirely from main.o, as if it had never been implemented there in the first place? That remains an open question.
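One avenue I haven't actually tried yet (so treat this as an untested sketch) is objcopy's --strip-symbol option, though I suspect objcopy may refuse if the symbol is still referenced by a relocation inside main.o:

lab46:~$ objcopy -N thing main.o     # -N / --strip-symbol removes the symbol table entry outright
lab46:~$ nm main.o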
Some pages I found that might be useful:
It was discovered (last week) that the wiki edits were being incorrectly reported for the commitchk script (component of the grade not-z).
I finally got around to looking at it, re-remembering what I was doing, and starting the debug process, when I noticed an instance of “notes/data” hardcoded into some filtering logic. Aha! Of course the wiki edits were coming up empty… it was never allowed to look in the right place!
Changed “notes/data” to “notes/${CLASS}” as it should have been, and it lit right up, no other changes to the script needed.
Fixed.
On a whim I did a google search for running Plan9 under VirtualBox (after a few unsuccessful attempts at getting a MacOS X VM up and running)… turns out that, as of ~ version 4.0.2 of VirtualBox, Plan9 actually appears to run, and runs well.
I confirmed this after downloading last night's ISO (for both the main Plan9 distro and 9atom)… the regular Plan9 distro installed without a hitch… in fact did better than I expected, as I felt I pushed it a little (1280x1024x24!)
Got networking up and running via mostly manual config, and proceeded to install some of the necessary contrib packages (vim!) so I can complete the actual system configuration. Of course, existing entirely within rio from the start (doing a graphical install of Plan9 is a pleasant change from what I'm used to) makes a lot of things a lot easier.
A few days ago, a daily script of mine reported a loss of contact with the entire DSLAB cluster… data was still up, but none of the spartas were.
It turned out a power cable of some sort had been accidentally unplugged. Even after fixing this, the spartas were still inaccessible; the remaining culprit was apparently a problem with DNS entries.
The cluster resumed operations, with all appropriate filesystems mounted.
First saturday of the month, backups appear to have gone off on schedule.
Apparently there are ways to get audio working in a MacOS X VM:
Squirrel reported a brief power blip in the LAIR. Nothing appears to have been adversely affected. No machines (even non-UPS machines) appear to have gone off-line.
This was noticed prior to today, but I'm getting around to reporting it now: I noticed a slew of these appear in the logs on nfs1 one day earlier this week:
[38622060.304192] eth0: too many iterations (6) in nv_nic_irq.
[38622060.316295] eth0: too many iterations (6) in nv_nic_irq.
[38622060.447023] eth0: too many iterations (6) in nv_nic_irq.
[38622064.098704] eth0: too many iterations (6) in nv_nic_irq.
[38622064.112054] eth0: too many iterations (6) in nv_nic_irq.
[38622064.126185] eth0: too many iterations (6) in nv_nic_irq.
[38622065.153317] eth0: too many iterations (6) in nv_nic_irq.
Not what I'd like to see… but the machine still appears to be hauling along. The great news is that nfs1 had been running uninterrupted for 377 days (so the errors occurred Tuesday afternoon, March 30th)! Over a year of uptime… that put it in the uptime club with cobras (which we never could get an exact figure for)… not sure if we've got any other long-running gems plugging away in the LAIR.