<html><center></html>STATUS updates<html></center></html>
Some links of interest:
There's been a new version of dokuwiki released; I opted to upgrade both /haas and the lab46 wiki this morning… following the instructions located at:
As far as I can tell, the upgrades took place without incident. No immediate problems noticed (and we have all summer to find/work out any kinks).
The wikis are now running the dokuwiki-2011-05-25 “rincewind” release.
The LAIRwall was toted back to the LAIR today. Still disassembled, but all pieces are now back from its 2011 visit up to campus highlighting the annual student art show.
As was discovered over the weekend, climate control once again went off in the LAIR.
With the migration to the USB docking ports, there existed a situation where two of the four cursor keys were not appropriately mapped, so their auto-repeat functionality was not enabled.
This caused frustration during the semester as students could move the cursor as expected in two directions, but in the other two (I want to say right and down), they'd have to aggressively press those keys the number of times they wanted to move the cursor.
The problem could be temporarily rectified by manually running the following two commands in a local terminal (i.e. on one of the wildebeest VMs, NOT lab46):
<code>
gnu:~$ xset r 113
gnu:~$ xset r 116
</code>
From that point until they log out, all cursor keys (and their repeat functionalities) would work as expected.
But, it was very common to forget to re-run this on the next login.
I finally sat down to take another look at it and got it resolved!
Turns out I just needed to create a ~/.xinitrc file in each user's home directory and place those two xset lines in it (verbatim). Additionally, I needed to add a stanza to each X client system's /etc/profile file.
Upon each login, the file gets parsed and the user's cursor keys work!
The change to /etc/profile is as follows:
<code bash>
# Fix cursor key repeat settings for LAIR flake pods
if [ ! -z "$DISPLAY" ]; then
	if [ -e "$HOME/.xinitrc" ]; then
		source $HOME/.xinitrc
	fi
fi
</code>
Specifically, we check to see if there is a DISPLAY variable set, and if so, the inner if statement checks for the existence of $HOME/.xinitrc… if it is there, source it (which will run those two needed xset lines). And bam! Cursor key repeats work!
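The guard can be exercised without an X session; a minimal sketch using a scratch HOME and a stand-in .xinitrc (paths hypothetical, the real file holds the two xset lines):

```shell
# simulate the /etc/profile guard with a scratch HOME and a fake DISPLAY
scratch=$(mktemp -d)
OLDHOME=$HOME
HOME=$scratch
DISPLAY=":0"

# stand-in .xinitrc that records that it ran
echo 'touch "$HOME/.xinitrc-ran"' > "$HOME/.xinitrc"

if [ ! -z "$DISPLAY" ]; then
    if [ -e "$HOME/.xinitrc" ]; then
        . "$HOME/.xinitrc"
    fi
fi

ls "$scratch/.xinitrc-ran"    # the marker exists: the guard sourced the file
HOME=$OLDHOME
```

With DISPLAY unset (a plain ssh login to lab46, say), the outer test fails and nothing is sourced, which is exactly why this belongs guarded inside /etc/profile.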
I applied the necessary /etc/profile changes on all machines in the wildebeest herd and lab46.
To propagate this into each user's home directory, I hopped onto NFS and ran the following:
<code bash>
#!/bin/bash
#
# Put the proper .xinitrc file in each user's home directory
#
cd /export/home
for user in `/bin/ls -1A | grep -v '^wedge$'`; do
	cp -vn wedge/.xinitrc $user/.xinitrc	# -n: never clobber an existing .xinitrc
	chown $user:lab46 $user/.xinitrc
done
exit 0
</code>
If any users already had a .xinitrc file it would not overwrite it.
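GNU cp's -n (no-clobber) flag is one way to guarantee that behavior; a quick sketch in a scratch directory (usernames hypothetical):

```shell
# demonstrate a no-clobber copy with GNU cp's -n flag
work=$(mktemp -d)
mkdir "$work/wedge" "$work/alice"
echo 'xset lines'  > "$work/wedge/.xinitrc"
echo 'custom'      > "$work/alice/.xinitrc"    # alice already has her own

# newer coreutils exit nonzero when -n skips, hence the || true
cp -n "$work/wedge/.xinitrc" "$work/alice/.xinitrc" || true

cat "$work/alice/.xinitrc"    # prints "custom" -- the existing file survived
```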
I ended up installing the x11-xserver-utils package so that the xset command would be available for enabling cursor key repeat on the LAIR flake pods.
During the great NFS1/NFS2 DRBD resync of 2011, I realized that mail access was horrendously slow, due to the fact we're using Maildir/, and there are hundreds, if not thousands of files that need to be accessed. During heavy fileserver load, mail server performance suffers.
So I thought to migrate all mail from nfs back to mail (the reason I moved it from mail to nfs in the first place was to eliminate redundancy: all the VMs used to be hosted on nfs, but they no longer are).
I realized though, that the prime reason for keeping mail on nfs and NFS mounting it, was to enable “You have new mail” notifications for users on lab46. Until this problem is solved, mail must remain on nfs.
In the spirit of backing up user home directories, I realized I should also perform similar actions with the mail data. So I adapted homedirbackup.sh to a script called maildirbackup.sh that runs daily on the nfses:
<code bash>
#!/bin/bash
#
# maildirbackup.sh - script responsible for performing Maildir/ directory back ups
#                    to a location that is NOT the fileserver.
#
# 20110515 - adapted homedirbackup.sh to maildirbackup.sh (mth)
# 20101014 - fixed a directory typo on destination (mth)
# 20101013 - initial version (mth)
#
ismaster="`df | grep export | wc -l`"
if [ "$ismaster" -eq 1 ]; then
	date=`date +"%Y%m%d"`
	day=`date +"%d"`
	if [ "$day" -lt 14 ]; then
		bhost="sokraits"
	else
		bhost="halfadder"
	fi
	cd /export/lib

	# Check the load average, and delay as long as we're above 200% CPU
	loadavg=`uptime | sed 's/^.*average: \([0-9][0-9]*\)\.\([0-9][0-9]\).*$/\1/'`
	while [ "$loadavg" -ge 2 ]; do
		sleep "$((RANDOM % 64 + 64))"
		loadavg=`uptime | sed 's/^.*average: \([0-9][0-9]*\)\.\([0-9][0-9]\).*$/\1/'`
	done

	ssh $bhost "mkdir -p /export/backup/mail; /export/backup/prune.sh mail 14"
	tar cpf - mail | gzip -9 | ssh $bhost "dd of=/export/backup/mail/mail-${date}.tar.gz"
fi
exit 0
</code>
I decided to keep mail backups for 14 days (2 weeks). The archives are relatively small (~140MB), so this shouldn't be a huge space waster.
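The destination flips between the two backup hosts mid-month; that selection rule, pulled out on its own (mirroring the test in maildirbackup.sh above):

```shell
# backup destination alternates by day of month:
# days 1-13 go to sokraits, day 14 onward to halfadder
pick_bhost() {
    if [ "$1" -lt 14 ]; then
        echo sokraits
    else
        echo halfadder
    fi
}

pick_bhost 5     # sokraits
pick_bhost 20    # halfadder
```

This way each host only ever holds roughly half a month of archives, and losing one host still leaves recent backups on the other.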
It also makes use of a new option to prune.sh, that allows us to specify the number of backups to keep.
To assist in the nfs1/nfs2 drbd resync, I disabled many automated jobs on nfs and www to lighten the load on the disk.
Now with the resync done, I have re-enabled the jobs and have set about fixing additional bugs discovered thanks to taking nfs1 down on Friday.
It turns out that my user home directory backup scripts had not been running since late December (script existed, but job not present in any cron file). So I re-enabled that and manually kicked off a job to ensure that all proper connectivity was in place. This once again appears to be operational (I should check tomorrow to verify that the b's have been backed up).
Another problem: yesterday a lairdump run was to take place, and although the jobs fired off on their respective clients, there was some sort of problem getting a session on nfs2 (it would establish the connection, then promptly end the session).
This turned out to be a security/access issue, which was resolved by editing /etc/security/access.conf on nfs2 and adding “dump” as a privileged user for login:
<code>
-:ALL EXCEPT root lair dump:ALL EXCEPT LOCAL
</code>
Previously, “dump” was not included in the list of users/groups allowed to log in, so the system would happily cut off the connection each time.
I've had a prune script in place for some time which, upon the conclusion of a client's lairdump, will cap off the number of backups at a count of 4. This works great for clients with only 1 dump file per lairdump… but some machines (like www, lab46, and koolaid) have multiple dump files (www especially has one each for /, /var/www, and /var/repos).
My script logic would unceremoniously list all the files and delete all but the last four. File ordering being what it is, this could entirely remove legitimately recent files while preserving only one of the dump file series (on www this resulted in the deletion of the / and /var/repos dumps, but kept 4 copies of /var/www).
I enhanced prune.sh as follows:
<code bash>
#!/bin/bash
#
# prune.sh - prune backups, weed out any entries older than the most recent
#
# 20110515 - added logic to handle clients with multiple dump files (mth)
# 20110101 - adapted userdir prune to lair-backup (mth)
# 20101013 - initial version (mth)
#

##
## Configuration Variables and Settings
##
unit="$1"
datapath="/export/backup/${unit}"

##
## Make sure we have enough data to run
##
if [ -z "$1" ]; then
	echo "ERROR! Must be called with a proper argument."
	exit 1
fi

##
## Check if an alternate backup count is provided (arg 2)
##
bknum="`echo $2 | grep '^[0-9]*$'`"
if [ -z "$bknum" ]; then
	bknum=4		# if no count is provided, default to this value
else
	bknum="$2"
fi

##
## Ensure provided data is correct
##
if [ -e "${datapath}" ]; then
	cd ${datapath}	# move into position

	##
	## There may be a variety of entries, condense them, then iterate
	##
	for entry in `/bin/ls -1A | sed 's/-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].*$//' | sort | uniq`; do
		bckcnt="`/bin/ls -1A ${entry}-[0-9]* | wc -l`"	# how many are there

		##
		## If we're above our threshold, process the outliers
		##
		if [ "$bckcnt" -gt "${bknum}" ]; then
			echo
			echo "Pruning old backups . . ."
			let removal=$bckcnt-$bknum
			files="`/bin/ls -1A ${entry}-[0-9]* | head -$removal`"
			rm -vf $files
		fi
	done
fi
exit 0
</code>
Specifically, the addition of the for loop based on the list generated by the expression ''/bin/ls -1A | sed 's/-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].*$//' | sort | uniq'', which identifies every dump file series present so we can find the 4 most recent of each.
This way, no modifications need to be performed to the client end, only to prune.sh on the server.
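The grouping step can be seen on a synthetic backup directory (filenames hypothetical, following the NAME-YYYYMMDD pattern prune.sh expects):

```shell
# recreate the series-detection step of prune.sh on synthetic filenames
work=$(mktemp -d)
cd "$work"
touch root-20110512.dump root-20110513.dump \
      var_www-20110512.dump var_repos-20110512.dump

# strip the -YYYYMMDD suffix, leaving one name per dump file series
series=$(ls -1A | sed 's/-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].*$//' | sort | uniq)
echo "$series"
# root
# var_repos
# var_www
```

Pruning then runs once per series, so keeping 4 of each series replaces the old behavior of keeping 4 files total regardless of which series they came from.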
I deployed a copy of this prune.sh script on sokraits/halfadder, for use in processing the user home directory backups.
Note the other new feature: the specification of a 2nd command-line argument representing the number of backups to keep.
At around 2pm Saturday, climate control kicked off, and the temperature shot up to around 79 degrees.
The nfs1/nfs2 DRBD peer resync completed around 11:30pm. Disk performance has returned to normal.
While debugging what I thought might have been some sort of data corruption error (turned out to be a false alarm), I chose to reboot nfs1 and nfs2.
nfs1 had been running for just over 420 days when I took it down, and brought it back up as a secondary node for now, giving nfs2 the lead role for a while.
This resulted in uncovering several bugs that needed to be addressed or still need to be worked out:
The method of switching nfs2 from secondary to primary largely went without a hitch… I followed the instructions I had left for myself in /etc/rc.local, and also had to fill in some gaps. Here are the actual instructions:
Similar instructions would exist for nfs1 (in fact, they would be identical except for the now-changed offbyone.lan network interface).
Some TODOs for the summer regarding the NFSes:
In response to rebooting the NFSes, I ended up having to reboot lab46 and related services due to higher loads resulting from the switchover (some processes just do not like having NFS pulled out from under them, especially eggdrop IRC bots).
We lasted almost the entire semester, so it was certainly a good run.
As a result of my investigations into potential data loss/corruption (even though they proved to be unnecessary), I discovered that my automated home directory backup scripts (which back up user home directories to sokraits/halfadder each night, by letter of alphabet placement) had not run since late December of 2010… so I had no recent backups of user data.
I'll be adding this script (homedirbackup.sh, located at /export/lib/homedirbackup.sh) back in to the appropriate cron jobs on nfs1/nfs2.
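The missing piece was just the crontab entry itself; something along these lines (run time hypothetical) in root's crontab on nfs1/nfs2 restores the nightly run:

```
# nightly user home directory backups to sokraits/halfadder
30 2 * * *	/export/lib/homedirbackup.sh
```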
As a result of the outage, and apparently the time that transpired in getting both NFSes linked back up, they appear to have resolved the situation by performing a FULL resync of all 2.9TB of disk… so the NFSes will be a little busy this weekend.
To assist in this process, I have temporarily disabled all cron jobs that would result in additional disk activity (apt-mirror, netsync, backups).
I finally got around to playing with the flash on disk card in the alix box.
The switch on the card is to select master or slave… I originally ran into some problems because the compact flash card was assuming the master position.
I ended up being bold and removing the compact flash card (which meant unscrewing the antenna cable to remove the board, since the CF card is pretty much inaccessible)… and did a PXE boot to reinstall OpenBSD 4.9.
Seems to be working fine; it detects the card as a 7279.9MB drive (wd0), which I custom partitioned (/, /usr, /usr/local, /home, /var, and /tmp each in separate partitions).
Installed and rebooted without a hitch… on to putting the desired packages back on.
Does seem a bit more responsive with the disk on flash vs. the CF card.
These are nifty-looking things:
And the data sheet:
My testing will be done on the SF8GB5F45 (8GB) module, the MLC-based version. Not sure if I have version 2 or version 3 (probably version 2).
MLC-based flash seems to be rated for 10,000 write/erase cycles. The compact flash card I have in ambitus (the alix box) appears to also be MLC-based. The advantage of the SF8GB5F45 is that it utilizes the IDE interface, so I am expecting better throughput.
With the C/C++ class' “Shoveling Bob” development, we had a need for a graphics program better than XPaint, so I installed the GIMP on LAIRstation 2… I should make sure it is eventually installed on all the LAIRstations.
Some lab46 users who have customized their vim settings experience problems when running vim on the LAIRstations, due to missing color schemes. I went and checked, and it appears that Ubuntu and Debian ship with different default color schemes.
I fixed this problem by copying over the missing color schemes from lab46 to the LAIRstation:
<code>
lab46:/usr/share/vim/vim72/colors$ scp darkblue.vim nightshade* sean.vim widower.vim root@lairstation1.lair.lan:/usr/share/vim/vim72/colors/
root@lairstation1.lair.lan's password: 
darkblue.vim                             100% 2990     2.9KB/s   00:00    
nightshade.vim                           100% 3430     3.4KB/s   00:00    
nightshade_print.vim                     100% 3198     3.1KB/s   00:00    
sean.vim                                 100% 2774     2.7KB/s   00:00    
widower.vim                              100% 1248     1.2KB/s   00:00    
lab46:/usr/share/vim/vim72/colors$
</code>
This resolved that problem.
There is a need to run fluxbox on the LAIRstations in certain situations to test things (i.e. basically replicate the pod-like fluxbox environment, but with dual screens).
I wanted to just have a nice “select your environment” on login, but this apparently is not possible, so I did it manually.
The procedure to do this is as follows:
This step may prove to be unnecessary, especially if we manually stop gdm. But I did it on my test attempt, so I mention it here:
<code>
lairstationX:~# cd /etc/alternatives
lairstationX:/etc/alternatives# mv x-session-manager /
lairstationX:/etc/alternatives#
</code>
Don't forget to restore x-session-manager, otherwise things will most certainly not work as normal.
As I said, we may be able to avoid this step entirely and leave x-session-manager in place.
UPDATE: This step was verified to be necessary; do not skip it.
To ensure minimal operation, install the following packages:
In this step, we make a change to the system's default window manager, so that when we start X, it knows to start fluxbox:
<code>
lairstationX:~# update-alternatives --config x-window-manager
There are 2 choices for the alternative x-window-manager (providing /usr/bin/x-window-manager).

  Selection    Path                    Priority   Status
------------------------------------------------------------
* 0            /usr/bin/metacity        60        auto mode
  1            /usr/bin/metacity        60        manual mode
  2            /usr/bin/startfluxbox    50        manual mode

Press enter to keep the current choice[*], or type selection number: 2
update-alternatives: using /usr/bin/startfluxbox to provide /usr/bin/x-window-manager (x-window-manager) in manual mode.
root@lairstationX:/etc/alternatives#
</code>
So, change it from auto mode metacity to manual mode startfluxbox.
I just scp'ed the fluxbox-menu from one of the wildebeest herd so we'd have the same config:
<code>
lairstationX:~# scp gnu.offbyone.lan:/etc/X11/fluxbox/fluxbox-menu /etc/X11/fluxbox/
root@gnu.offbyone.lan's password: 
fluxbox-menu                             100% 3443     3.4KB/s   00:00    
lairstationX:~#
</code>
I also made the traditional fluxbox-menu-good as a backup in /etc/X11/fluxbox, so when installing new packages overwrites the fluxbox menu with the Ubuntu defaults, we aren't out of luck (just re-copy the “good” one to fluxbox-menu and we're back in action).
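The recovery is then a single copy; sketched here against scratch paths (contents hypothetical) rather than the live /etc/X11/fluxbox:

```shell
# simulate restoring the known-good fluxbox menu after a package
# upgrade resets it to the ubuntu default
etc=$(mktemp -d)
echo 'lair menu'   > "$etc/fluxbox-menu-good"   # our saved copy
echo 'ubuntu menu' > "$etc/fluxbox-menu"        # clobbered by the upgrade

cp "$etc/fluxbox-menu-good" "$etc/fluxbox-menu"

cat "$etc/fluxbox-menu"    # prints "lair menu"
```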
Now we're set… let's disable gdm so we can commence with this alternate arrangement:
<code>
lairstationX:~# /etc/init.d/gdm stop
</code>
With these changes set, log in as the desired user and fire up X by running the startx script.
You should be good to go.
When done, we should restore the necessary defaults… which can be done as follows:
<code>
lairstationX:~# mv /x-session-manager /etc/alternatives/
lairstationX:~# update-alternatives --config x-window-manager
There are 2 choices for the alternative x-window-manager (providing /usr/bin/x-window-manager).

  Selection    Path                    Priority   Status
------------------------------------------------------------
  0            /usr/bin/metacity        60        auto mode
  1            /usr/bin/metacity        60        manual mode
* 2            /usr/bin/startfluxbox    50        manual mode

Press enter to keep the current choice[*], or type selection number: 0
update-alternatives: using /usr/bin/metacity to provide /usr/bin/x-window-manager (x-window-manager) in auto mode.
root@lairstation1:~# /etc/init.d/gdm start
</code>
And we should be back at the usual GDM login screen. All set!
Applied any pending updates to all 4 LAIRstations, and made sure that xwit, xautomation, and xpaint were installed.
With the on-going campus connection investigations, some “errors” were finally noticed, so it was requested that we change our cable… I did that and more: swapped out the cable, AND changed the network interface.
So capri is now plugged into the campus connection on the free interface (xl0); I adjusted all the related things (moved /etc/hostname.pcn0 to /etc/hostname.xl0, and changed the ext_if variable definition in /etc/pf.conf)… I also rebooted for good order.
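The pf.conf side of that swap amounts to one macro change (value per the interface swap above; the rest of the ruleset is untouched):

```
# /etc/pf.conf on capri
ext_if="xl0"	# was "pcn0"
```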
This led to needing to remember how to re-establish the CloudVPN connection for the LAIRwall up on campus:
I'm writing this down so we don't forget!
I created /etc/hostname.tun2:
<code>
link0 up
</code>
If desired, we can do this manually:
<code>
ifconfig tun2 link0 up
</code>
I added tun2 into /etc/bridgename.bridge0:
<code>
add em1
add tun0
add tun2
up
</code>
If desired, we can do this manually as follows:
<code>
brconfig bridge0 add tun2
</code>
Make sure these are up and operational before proceeding to the actual cloudvpn config:
As root, fire up a screen session, then do the following:
<code>
caprisun:~# cd /home/squirrel/cloud
caprisun:/home/squirrel/cloud# rm -f gate.sock
caprisun:/home/squirrel/cloud# ./lairnorth
2011-05-05 11:15:57: (info) cloud@39: cloudvpn starting
2011-05-05 11:15:57: You are using CloudVPN, which is Free software.
2011-05-05 11:15:57: For more information please see the GNU GPL license,
2011-05-05 11:15:57: which you should have received along with this program.
2011-05-05 11:15:57: (info) common/sighandler@25: setting up signal handler
2011-05-05 11:15:57: (info) cloud@58: heartbeat is set to 50000 usec
2011-05-05 11:15:57: (info) cloud/status@153: exporting status to file `./status.txt'
2011-05-05 11:15:57: (info) cloud/route@63: ID cache max size is 32768
2011-05-05 11:15:57: (info) cloud/route@67: ID cache reduction halftime is 1000000
2011-05-05 11:15:57: (info) cloud/route@259: only ping changes above 5msec will be reported to peers
2011-05-05 11:15:57: (info) cloud/route@263: maximal node distance is 64
2011-05-05 11:15:57: (info) cloud/route@267: default TTL is 128
2011-05-05 11:15:57: (info) cloud/route@271: hop penalization is 0%
2011-05-05 11:15:57: (info) common/sq@108: maximal input queue size is 16777216 bytes
2011-05-05 11:15:57: (info) common/network@70: listen backlog size is 32
2011-05-05 11:15:57: (info) cloud/comm@1460: max connections count is 1024
2011-05-05 11:15:57: (info) cloud/comm@1466: maximal size of internal packets is 8192
2011-05-05 11:15:57: (info) cloud/comm@1472: max 1024000 pending data bytes
2011-05-05 11:15:57: (info) cloud/comm@1478: max 256 remote routes
2011-05-05 11:15:57: (info) cloud/comm@1483: connection retry is 10sec
2011-05-05 11:15:57: (info) cloud/comm@1488: connection timeout is 60sec
2011-05-05 11:15:57: (info) cloud/comm@1494: connection keepalive is 5sec
2011-05-05 11:15:57: (info) cloud/comm@85: Initializing ssl layer
2011-05-05 11:15:57: (info) cloud/comm@175: loaded 1 CAs from ./ca.crt
2011-05-05 11:15:57: (info) cloud/comm@197: SSL initialized OK
2011-05-05 11:15:57: (info) cloud/comm@1331: trying to listen on `0.0.0.0 3201'
2011-05-05 11:15:57: (info) common/network@148: created listening socket 5
2011-05-05 11:15:57: (info) cloud/comm@1339: listeners ready
2011-05-05 11:15:57: (info) cloud/comm@1376: no connections specified
2011-05-05 11:15:57: (info) cloud/gate@412: creating gate on `./gate.sock'
2011-05-05 11:15:57: (info) common/network@148: created listening socket 6
2011-05-05 11:15:57: (info) cloud/gate@420: gates ready
2011-05-05 11:15:57: (info) cloud/gate@475: max gate count is 64
2011-05-05 11:15:57: (info) cloud/gate@477: gate OK
2011-05-05 11:15:57: (info) cloud@113: initialization complete, entering main loop
2011-05-05 11:15:59: (info) cloud/comm@283: get connection from address 192.168.9.215 57354 on socket 7
2011-05-05 11:16:01: (info) cloud/comm@755: socket 7 accepted SSL connection id 0
2011-05-05 11:16:34: (info) cloud/gate@161: gate 0 handling address e78adefa.00:bd:44:e8:bd:03
</code>
CTRL-a c to create a second screen window, then run the following:
<code>
caprisun:~# cd /home/squirrel/cloud
caprisun:/home/squirrel/cloud# ether -gate ./gate.sock -iface_dev tun2
2011-05-05 11:16:34: (info) common/sighandler@25: setting up signal handler
2011-05-05 11:16:34: (info) common/sq@108: maximal input queue size is 16777216 bytes
2011-05-05 11:16:34: (info) common/network@70: listen backlog size is 32
2011-05-05 11:16:34: (info) ether@160: using `tun2' as interface
2011-05-05 11:16:34: (info) ether@361: iface has mac address 00:bd:44:e8:bd:03
2011-05-05 11:16:34: (info) ether@197: iface: initialized OK
2011-05-05 11:16:34: (info) ether@775: gate connected OK
</code>
This has been an issue that crops up from time to time, especially on modern Ubuntu systems, as we can't just remove a “-nolisten tcp” from an .xserverrc file and expect it to work.
I went and looked up some stuff, and came across the following:
In /etc/gdm/gdm.schemas:
<code>
<schema>
  <key>security/DisallowTCP</key>
  <signature>b</signature>
  <default>true</default>
</schema>
</code>
Change that true to a false:
<code>
<schema>
  <key>security/DisallowTCP</key>
  <signature>b</signature>
  <default>false</default>
</schema>
</code>
Save, and reboot.
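Since gdm.schemas carries many <schema> stanzas with boolean defaults, the edit is safer keyed on the DisallowTCP key rather than a blind true-to-false substitution; a sketch with awk against a scratch copy (not the live /etc/gdm/gdm.schemas; the AllowRoot stanza is just a stand-in neighbor):

```shell
# flip the default only inside the security/DisallowTCP stanza
cat > /tmp/gdm.schemas.test <<'EOF'
<schema>
  <key>security/AllowRoot</key>
  <signature>b</signature>
  <default>true</default>
</schema>
<schema>
  <key>security/DisallowTCP</key>
  <signature>b</signature>
  <default>true</default>
</schema>
EOF

awk '/security\/DisallowTCP/           { hit = 1 }
     hit && /<default>true<\/default>/ { sub(/true/, "false"); hit = 0 }
     { print }' /tmp/gdm.schemas.test > /tmp/gdm.schemas.new

grep -c '<default>true</default>' /tmp/gdm.schemas.new    # prints 1 (only AllowRoot remains true)
```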
It appears my optimizations made on Friday have worked, for the entire LAIRwall automatically booted up and went into an operational state this morning at the intended time.
I've uploaded some new student images (will likely get them inserted into the slide show tomorrow) and have been performing some weather data updates… I need to automate the weather updates so it grabs the new weather data itself.
For my UNIX class, I am going to have them perform various system administration tasks on a NetBSD VM. This meant figuring out how to pull off NetBSD under Xen, since it has been at least a couple of years (the old anti-/evil-jb days).
Some links:
As it turns out, NetBSD and Xen4 have network issues (we've experienced this a bit with the Plan9 stuff)… apparently running an older NetBSD kernel works, so I moved my efforts over to vmserver01, which is running Lenny and Xen3. Works fine.
Basically, I downloaded the following NetBSD kernels (and gunzipped them):
I created a Xen disk image as usual (dd a file full of zeroes).
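That step can be a fully-zeroed write or, quicker, a sparse truncate; a sketch of the sparse variant (image size and path hypothetical), where count=0 with a seek past the end makes GNU dd truncate the file out to the target size without writing a byte:

```shell
# create a 1 GiB sparse disk image for the domU
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=0 seek=1024 2>/dev/null

stat -c %s "$img"    # prints 1073741824 (1 GiB), while occupying almost no disk blocks
```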
Xen config file is as follows:
<code>
#kernel = "/xen/images/netbsd-INSTALL_XEN3PAE_DOMU"	# For install
kernel  = "/xen/images/netbsd-XEN3PAE_DOMU"		# For normal operations
vcpus   = '1'
memory  = '256'
root    = '/dev/xvda1 ro'
disk    = [ 'file:/xen/images/sp2011eoce.disk,xvda1,w' ]
name    = 'netbsd'
dhcp    = 'dhcp'
vif     = [ 'mac=00:16:3E:2E:C0:39' ]
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'restart'
</code>
Note the kernel lines: one is enabled for the actual install, the other for regular boots.
That's it!
I downloaded NetBSD 5.1 ISO images for i386 and amd64, and put them on the web-accessible repository so we can perform local installs.