STATUS updates

TODO

  • How to handle UNIX journal keywords?
  • Need to finish writing up HPC0 projects
  • the formular plugin is giving me errors; need to figure this out (email assignment form)
  • use include plugin to include a page containing various prior month status pages
  • can I install writer2latex on wildebeest herd without needing gcj??
  • update lair-nfs for new idmap Domain of “lair”
  • put UNIX course listing examples in public directory

April 25th, 2010

irssi window moving

I ended up with a non-optimal situation where some of my common windows in irssi were on different window numbers.

I googled it, and arrived at a fix… let's say I wanted #unix on window id 4, but it was on 12.

First, I'd switch to #unix in id 12, and do the following:

[#unix] /window move 4

bam! It is now placed on window 4, and existing windows move out of the way (#unix gets inserted at 4… the old window 4 gets pushed up one, etc.)

This was helpful: http://www.irssi.org/documentation/startup

April 23rd, 2010

juicebox resurrected/lair.lan back online

squirrel saved the day by pulling the classic original juicebox out of retirement and putting it back in service. It is now routing and serving DNS/DHCP for the lair.lan side of the network… very nearly a drop-in replacement for jb2 in its absence.

Since juicebox is over a year out of the loop, we had to pull some updated configs from the backup dumps I took of jb2.

restore has a nifty interactive feature, allowing one to select what files/directories they'd like to pull out of the dump.

# restore -i -f jb2.dump

Basically you “cd”/“ls” around, “add” files, and when done, “extract”… it asks for a volume number (I typed “1”), and it dumps the files into your current directory. Deploy them as needed.
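A sketch of what such a session looks like (the file names here are hypothetical, not the actual ones pulled):

restore > ls
.:
etc/   root/   var/
restore > add etc/pf.conf
restore > add etc/dhcpd.conf
restore > extract
You have not read any volumes yet.
Unless you know which volume your file(s) are on you should start
with the last volume and work towards the first.
Specify next volume #: 1
set owner/mode for '.'? [yn] n
restore > quit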

juicebox is also an OpenBSD 4.0 box, so there was some out-of-date-ness with regards to syntax.

Specifically, I ended up making the following changes to get things working on juicebox (from jb2's files):

pass quick on $int_if # no state

I basically had to add a # as it did not like the “no state” (okay in OpenBSD 4.4, not okay in OpenBSD 4.0).

And in /etc/bgpd.conf:

# filter out prefixes longer than 24 or shorter than 8 bits
#deny from any
#allow from any inet prefixlen 8 - 24

It did not like those deny and allow lines.
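Handy when back-porting configs like this: both pf and bgpd can syntax-check a file without applying it, which flags lines like these right away:

# pfctl -n -f /etc/pf.conf
# bgpd -n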

I pulled in the OpenVPN configs, pf.conf, DNS, DHCP, rc.local, rc.conf.local, and hostname/bridge files (for OpenVPN operation).

Ended up restarting juicebox to ensure everything was as it should be. Seems to be so.

LAIRwall network portability (part 1)

Until juicebox was brought on-line, the lair.lan network was a bit dysfunctional. With class in session, and a class that wanted to use the video wall at that, we had to make some quick changes… I ended up inserting records in DNS and DHCP on caprisun so we could just switch the LAIRwall over to the offbyone.lan network and have it DHCP.

Ended up doing it so the IPs would stay the same, minus the subnet change (so 10.80.1.71 → 10.80.2.71)… in an attempt to keep disruptions to a minimum. It served our purposes.
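The additions on caprisun were along these lines, one pair per wall node (expressed here in ISC dhcpd and BIND zone-file syntax; the MAC address is hypothetical):

host wall01 {
        hardware ethernet 00:16:3e:00:00:71;    # hypothetical MAC
        fixed-address 10.80.2.71;
}

wall01          IN      A       10.80.2.71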

I updated some immediately noticeable files to make them IP-agnostic (i.e., specifying hostnames instead of FQDNs and IPs).

wireless in wall01.lair.lan

In preparation for the LAIRwall's big trip up to campus next week, we installed a PCI-to-PCMCIA bridge and inserted a wireless card.

The card is on eth1.

To get things humming:

wall01:~$ sudo aptitude update && sudo aptitude install wireless-tools wavemon
...
wall01:~$ sudo ifconfig eth1 up
wall01:~$ sudo iwlist eth1 scan
wall01:~$ sudo iwconfig eth1 essid cccair  # associate to cccair network
wall01:~$ sudo iwconfig  # to see that it made the association

Then just start dhclient'ing on eth1, and we'll be good to go.
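That last step being:

wall01:~$ sudo dhclient eth1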

April 22nd, 2010

jb2 dead

Just prior to 7:30pm this evening, the hard drive in jb2 completely gave up the ghost.

A rescue operation took place… jb2 always ran hot, and last weekend SMART detected impending bad sectors somewhere in /var. Upon trying to fire up the machine, very ugly clicking noises were heard from the drive. It appeared unable to boot, and I ended up disconnecting it.

At the moment, nothing is routing for the RR connection.

climate control out

Around 4am this morning, we lost climate control in the LAIR… temperatures ballooned up to a balmy 80 degrees. Luckily it was not feeling overly humid.

Some superfluous machines were turned off to try to minimize temperature gains.

April 20th, 2010

lab46 lockup

Today during HPC0, Lab46 experienced another sudden lockup… around 3pm.

halfadder vm migration to nfs

With the Lab46 lockup, I took the opportunity to move the remaining VMs running on Halfadder over to the /export/xen share on nfs, and relaunch them… so now all core VMs are hosted on NFS.

Load seems pretty stable… no longer down near 0%, but hovering beneath 50%… one might say the server is actually keeping busy. Somewhat of a unique thing for us.

LAIRwest rack moved

Tending to other housecleaning tasks, I opted to move the LAIRwest rack back to its original position, back onto a common circuit, but away from the thermostat.

Chromium functionality on LAIRwall

Students in my HPC0 class achieved successful LAIRwall Chromium functionality today, and we enjoyed several full screen OpenGL screensavers.

April 17th, 2010

dokuwiki faqs: qna plugin

I've been contemplating getting some FAQ functionality going on the wiki, to give students another medium to contribute to (and especially another source of assistance for them… I often find them running into scenarios where, had they jotted down small notes when they discovered solutions to problems, they wouldn't get hung up as much).

I looked and there were two plugins, faq and qna… both introduce new syntax. I ended up going with 'qna' because I liked its syntax a bit better.

g7 backups

Since January, backup attempts from g7 have failed. It has been using the campus connection, and seems to only be able to get up to 16MB before bailing out. I can't see any limiting factor on our end, so I decided to blame the campus connection. As a test, I rigged up an attempt to send the backup through the RR connection.

I am having it connect to backup, even though I have reservations about this from a security point of view (it stores some backups…), so I tweaked the sshd settings:

##
## Extra Secure Settings for g7 backups through RR connection
##
PasswordAuthentication no
AllowUsers user1 user2 user3 user4
LoginGraceTime 4
X11Forwarding no
PermitRootLogin no

Basically, lock it down, and lock it down tight: allow only the most authorized of users in, with an exceptionally small login window (LoginGraceTime is in seconds, so 4 seconds to authenticate).

Obviously there are some problems with this (g7 isn't fast enough to make the login window).

So what I'm going to end up doing instead is rig up a temporal opening (i.e., enable access only around the times of g7's backups, then immediately close it off afterward). This is certainly a more ideal approach, because nobody has any business getting to the machine externally anyway (VPN access, sure).
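A rough sketch of the idea (the config file names and window length are hypothetical at this point; the real thing would hang off cron around g7's backup schedule):

#!/bin/sh
# open-backup-window.sh -- hypothetical sketch, nothing deployed yet.
# Run from cron just before g7's backup is scheduled to kick off:
# swap in the permissive sshd config, hold the window open, then
# swap the locked-down config back and restart sshd.

cp /etc/ssh/sshd_config.g7window /etc/ssh/sshd_config
/etc/init.d/ssh restart

sleep 3600    # hold the window open for the backup (length TBD)

cp /etc/ssh/sshd_config.lockdown /etc/ssh/sshd_config
/etc/init.d/ssh restart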

lair-nfs

I finally got around to resolving the idmap.conf settings in lair-nfs (after the power issues, I at last set the domain to “lair” instead of “localdomain”)… this of course meant that EVERY machine that utilizes nfs must be updated to use that domain.

At the time I did it manually.
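The change itself is a one-liner; on a Debian box the relevant stanza typically lives in /etc/idmapd.conf:

[General]
Domain = lair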

So I finally put the changes in, and also figured I'd resolve the other longstanding issue with lair-nfs: installation on etch systems. The issue is that lair-nfs modprobes the 'nfs4' module… it turns out that etch does not have a module called 'nfs4', just 'nfs'.

To fix, I added the following logic (and hence 1.2.0-6 was created and added to the repository):

# Ensure NFS support is available, otherwise package install will fail.
echo -n "Loading NFS4 kernel module ... "
nfsmodname="`grep -c '4.0' /etc/debian_version`"
if [ "$nfsmodname" -eq 0 ]; then
    modprobe nfs4 && echo "done." || echo "failed!"
else # etch (Debian 4.0) doesn't have an 'nfs4' module
    modprobe nfs && echo "done." || echo "failed!"
fi

Just a simple version check leading to the appropriate module insertion. Bam! Now a seamless install whether on etch or lenny.

LAIR packages

Just a quick review of the LAIR package structure.

LAIR packages are no longer hosted on web! Do not store them there.

Instead, they are on nfs! Under the /export/packages directory. All necessary scripts have been moved there.

In fact I just removed the old directories on web (/var/www/pages/packages/), to avoid any future confusion. Because it got me good, even though I made the change.

VMs on NFS serving Sokraits

I decided, in an effort to reintroduce some of the cool Xen functionality, to start moving VMs onto NFS under /export/xen (basically take the entire /xen directory tree that exists on sokraits and halfadder, and put it on nfs), then have them NFS-mount it… this affords us the flexibility of live migration of VMs, and gives us a little bit of data insurance, as we're no longer storing the only live copy of a VM on one hard disk (it sits on nfs's DRBD mirror… so if we lose “a” disk, it isn't so bad).

nfs config

Configuration on nfs was in /etc/exports:

/export/xen                     10.80.2.42/32(rw,sync,no_root_squash,no_subtree_check,fsid=2) 10.80.2.46/32(rw,sync,no_root_squash,no_subtree_check,fsid=2) 10.80.2.47/32(rw,sync,no_root_squash,no_subtree_check,fsid=2)

And then exporting the new share:

nfs:~# exportfs -r

And nfs now serves it (to JUST the designated VM servers).
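A quick sanity check from one of those servers confirms the export is visible:

sokraits:~# showmount -e nfs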

sokraits config

On sokraits, I established the live migration parameters in /etc/xen/xend-config.sxp:

(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-relocation-address '')
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^10.80.2.42$ ^10.80.2.46$ ^10.80.2.47$ ^halfadder$ ^halfadder.offbyone.lan$ ^sokraits$ ^sokraits.offbyone.lan$ ^yourmambas$ ^yourmambas.offbyone.lan$')

Again, I restricted access to JUST the VM servers.
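With halfadder given the same treatment (and both machines mounting the same /xen from nfs), a live migration should boil down to a single command, something like:

sokraits:~# xm migrate --live www halfadder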

In sokraits' /etc/rc.local, the following:

# Try to fix NFS callback stupidity
modprobe nfs4
sysctl fs.nfs.nfs_callback_tcpport=2049
/etc/init.d/nfs-common restart

# Mount Xen NFS share
mount -t nfs4 -o proto=tcp,intr nfs:/xen /xen

Rebooted it, and bam! We're up and running.

I shut down antelope, gnu, repos, and www, and copied their images over to nfs.

Then, I xm created them on sokraits. All four of those VMs are now up and running on sokraits (relieving halfadder of the heavier VM load it has shouldered for a few weeks now), but all VM data is being retrieved from NFS.
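For the record, each of those launches was just an xm create pointed at the VM's config file, along these lines (config path hypothetical):

sokraits:~# xm create /xen/conf/www.cfg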

It does drive load up a little, especially spiking during a VM boot or significant VM disk activity (doing an aptitude upgrade, for instance). We'll have to see if it is worth it to put ALL our production VMs on there (I figured www would be a good semi-test, as it likely sees usage as significant as lab46).

But, give it a few minutes to settle down, and load seems to settle. Again, this is only with 4 VMs (out of 14 total) running under this new setup. I'll slowly push some more over and see how it handles on load.

This is one of those areas where Ceph would likely shine for us.

VMs updated

In addition to being moved from halfadder onto nfs and relaunched via NFS on sokraits, antelope, gnu, repos, and www had an “aptitude update && aptitude upgrade && aptitude clean” performed on them today.

I also moved over web, db, and lab46db, bringing the total number of VMs running on sokraits off nfs to 7 (equally balancing VMs between sokraits and halfadder once again), updating with the same logic as above.

Load on nfs still seems okay… I'll be looking at the lrrd load reports tomorrow to see how it made out… I have a feeling that, aside from the higher spikes due to startup and disk-heavy maintenance tasks, things will hopefully not be overburdened… we really won't know until everyone is sitting in the LAIR logging into the pod machines, with people running stuff on Lab46 and multiple logins taking place.

April 16th, 2010

mambas/sokraits

Finding the urge to resume some ongoing projects, I wandered into the LAIR and fired up mambas and sokraits. Performed updates on both.

I had left sokraits off since our electric circuit adventures… and seeing as those are probably over for a while, I figured I'd start migrating VMs back to LAIReast.

As for mambas, I'd like to use it as a potential upgrade environment for VMs (i.e., the Lab46 rebuild).

For now, mambas is running Ubuntu 9.10, but I may upgrade it to 10.04 once it is released, as it is going to be an LTS (Long Term Support) release… figured if I make the Ubuntu switch, it is likely preferable to use an LTS release, so if things remain running for long periods of time, we'll at least still get some level of updates being released.

In its current form, Ubuntu actually lacks Xen dom0 kernel support (they can be domUs out of the box, but Ubuntu hasn't been packaging Xen-dom0-bootable kernels). According to one of the Ubuntu maintainers, it is really more a matter of waiting for pv_ops to be further formalized in the mainline kernel. All the userspace tools and daemons are in the package repository.

And amusingly, one of their preferred methods for running Xen on Ubuntu is to launch it with a Debian Xen kernel.

I also want to explore KVM… they seem to be doing some interesting things. There's a particularly interesting project called SheepDog, which looks to nicely add a level of redundancy to VM storage… of course, such things would also be solved when Ceph's DFS is production worthy.

After some playing… Ubuntu is looking less and less usable for my needs. Originally I had wanted to use it for the following reasons:

  • more up-to-date software
  • more up-to-date kernel and Xen
  • more software (less strict licensing than Debian)
  • includes 64-bit userspace
  • give me an opportunity to get used to more of the ubuntuisms

But Ubuntu-managed Xen is currently a no-go, and I just realized that Lenny includes a 64-bit userspace. Debian's 6.0 “Squeeze” release is just around the corner, we're used to Debian, it includes Xen and all its fixings… and aside from getting a little long in the tooth, it works. So I will likely be reinstalling Mambas with Debian to proceed with my Evil Plans™.

And now, reading up on the progress of Debian squeeze… it is (unsurprisingly) behind schedule… they were aiming for a Spring 2010 release, which was pushed to Summer 2010, and they are now looking even farther out. I really didn't want to go with Lenny, simply because I know Lab46 will be deployed for a couple of years before a rebuild (even if I have better intentions).

exploring KVM on mambas

So, with nothing set in stone, I decided to play with KVM a bit.

yourmambas:~# aptitude install ubuntu-virt-server

The file /etc/network/interfaces was adapted as follows:

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet manual

# The bridge interface (for KVM)
auto br0
iface br0 inet static
        address 10.80.2.42
        network 10.80.2.0
        netmask 255.255.255.0
        broadcast 10.80.2.255
        gateway 10.80.2.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
        bridge_maxwait 0
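After restarting networking, brctl (from bridge-utils) should show br0 with eth0 enslaved:

yourmambas:~# /etc/init.d/networking restart
yourmambas:~# brctl show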

April 10th, 2010

Sometime around 3:10AM this morning, Lab46 locked up again.

Did the usual. It is back up.

April 3rd, 2010

Around 2:26PM, Lab46 locked up. Got it all restarted.

As an aside, when the system is really busy, it starts spewing clock skew errors… when jjansen4 runs his plethora of bots, that seems to aggravate the situation.

I removed libc6-xen, which seemed to mitigate the libc “error 4” messages I thought may have been causing the lockups… but as we can see with the April 10th lockup, that doesn't seem to have fixed it.
