sokraits.offbyone.lan and halfadder.offbyone.lan are the two main Xen VM servers hosting virtual machines in the LAIR.
hostname | RAM | disk | swap | OS | Kernel |
---|---|---|---|---|---|
sokraits.lair.lan | 8GB | 32GB (/), 500GB + 500GB RAID1 (/dev/md0) | 1.2GB | Debian 8.0 “Jessie” (AMD64) | 3.16-2-amd64 |
hostname | RAM | disk | swap | OS | Kernel |
---|---|---|---|---|---|
halfadder.lair.lan | 8GB | 32GB (/), 500GB + 500GB RAID1 (/dev/md0) | 1.2GB | Debian 8.0 “Jessie” (AMD64) | 3.16-2-amd64 |
Machine | Interface | IP Address | MAC Address |
---|---|---|---|
sokraits.lair.lan | eth0 | 10.80.1.46 (lair.lan subnet) | 00:1a:92:cd:0b:1b |
 | eth1 | offbyone.lan subnet | 00:1a:92:cd:05:d6 |
 | eth2 | 172.16.1.1 (peer link) | 00:0a:cd:16:d9:ac |
Machine | Interface | IP Address | MAC Address |
---|---|---|---|
halfadder.lair.lan | eth0 | 10.80.1.47 (lair.lan subnet) | 00:1a:92:cd:0a:7f |
 | eth1 | offbyone.lan subnet | 00:1a:92:cd:06:60 |
 | eth2 | 172.16.1.2 (peer link) | 00:0a:cd:16:d3:cd |
Both Sokraits and Halfadder are using their (once forbidden!) second network interfaces (to exist on both primary LAIR subnets), as well as an additional add-in PCI-e NIC (to be used with an over-the-ceiling across-the-floor cable to connect to each other, for the specific purpose of performing DRBD and OCFS2 peer updates).
To ensure that all network interfaces come up as intended, we need to configure /etc/network/interfaces as follows (the example shown is for sokraits; halfadder is identical except for the eth2 address, as noted in the comment):
```
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
#auto eth0
iface eth0 inet manual

iface eth1 inet manual

## management and lair.lan access through xenbr0
auto xenbr0
iface xenbr0 inet dhcp
    bridge_ports    eth0
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

## configure a (separate) bridge for the DomUs without giving Dom0 an IP on it
auto xenbr1
iface xenbr1 inet manual
    bridge_ports    eth1
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

auto eth2
iface eth2 inet static
    address 172.16.1.1    # halfadder assigns the address: 172.16.1.2
    netmask 255.255.255.0
```
Additionally, I found some probe-order issues cropping up, so I had to manually pin which interface was which on both sokraits and halfadder via their /etc/udev/rules.d/70-persistent-net.rules files (sokraits first, then halfadder):
```
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:05:d6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0b:1b", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10ec:0x8168 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d9:ac", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
```
```
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0a:7f", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:06:60", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10ec:/sys/devices/pci0000:00/0000:00:16.0/0000:06:00.0 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d3:cd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
```
To reduce traffic caused by package transactions, I set up a proxy on the fileserver(s), so every client needs to configure itself appropriately. It turns out this is most easily done by creating the file /etc/apt/apt.conf.d/00apt-cacher-ng with the following contents:
Acquire::http { Proxy "http://10.80.1.3:3142"; };
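As a quick sanity check that a client can actually reach the cache (a sketch; it assumes apt-cacher-ng is listening on the fileserver at 10.80.1.3:3142 as configured above, and that nc is installed on the client):

```
# verify the proxy port answers from a client
client:~# nc -z -v 10.80.1.3 3142

# then refresh the package lists; apt should now fetch through the proxy
client:~# apt-get update
```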
I wanted a small installation footprint, so I disabled the installation of recommended packages by default.
To do so, create/edit /etc/apt/apt.conf.d/99_norecommends, and put in the following:
```
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
```
This can also be used to remove previously installed recommended packages after the fact: run aptitude, press 'g', then press 'g' again, and that should take care of business.
There are also some options that can be set in aptitude proper, via its console gui (options→preferences):
Useful URLs:
The following packages have been installed on both sokraits and halfadder:
bridge-utils lair-std lair-backup mdadm xen-linux-system xen-tools drbd8-utils ocfs2-tools smartmontools firmware-realtek qemu-system-x86-64
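For reference, they can be pulled in with a single aptitude invocation (a sketch; package names are copied verbatim from the list above, and this assumes the LAIR package repositories are already present in sources.list):

```
BOTH:~# aptitude install bridge-utils lair-std lair-backup mdadm xen-linux-system \
            xen-tools drbd8-utils ocfs2-tools smartmontools firmware-realtek \
            qemu-system-x86-64
```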
As specified in the Debian Xen Wiki: https://wiki.debian.org/Xen#Prioritise_Booting_Xen_Over_Native
```
machine:~# dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen
Adding 'local diversion of /etc/grub.d/20_linux_xen to /etc/grub.d/08_linux_xen'
machine:~# 
```
and then regenerate grub config:
```
machine:~# update-grub
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
done
machine:~# 
```
The Xend config file (/etc/xen/xend-config.sxp) for this host is as follows:
```
# -*- sh -*-
#
# Xend configuration file.
#

(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^10.80.2.42$ ^10.80.2.46$ ^10.80.2.47$ ^halfadder$ ^halfadder.offbyone.lan$ ^sokraits$ ^sokraits.offbyone.lan$ ^yourmambas$ ^yourmambas.offbyone.lan$ ^grrasp$ ^grrasp.offbyone.lan$')
(network-script network-bridge)
(vif-script vif-bridge)
(dom0-min-mem 196)
(enable-dom0-ballooning yes)
(total_available_memory 0)
(dom0-cpus 0)
(vnc-listen '10.80.1.46')
(vncpasswd '********')
(xend-domains-path /export/xen/xend/domains)    # be sure to create this directory
```
As usual, if left to its own devices, the system will only create 8 loop devices by default. Don't forget to edit /etc/modules as follows and reboot:
```
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

firewire-sbp2
loop max_loop=255
#forcedeth max_interrupt_work=20 optimization_mode=1 poll_interval=100
nfs callback_tcpport=2049
```
I commented out forcedeth for now… it does load, but I don't know whether those options need to be set explicitly with the current kernel. Time will tell.
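After the reboot, a quick check against the standard sysfs path (a sketch) confirms the loop module actually picked up the max_loop parameter:

```
# should report the value set in /etc/modules
machine:~# cat /sys/module/loop/parameters/max_loop
255
```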
Xen Tools appears to have been updated… it can now handle recent distributions!
The config file, /etc/xen-tools/xen-tools.conf, follows:
```
######################################################################
##
## Virtual Machine configuration
##
dir            = /xen
install-method = debootstrap
cache          = no

######################################################################
##
## Disk and Sizing options
##
size   = 4Gb      # Disk image size.
memory = 256Mb    # Memory size
swap   = 128Mb    # Swap size
fs     = ext4     # use the EXT3 filesystem for the disk image.
dist   = jessie   # Default distribution to install.
images = full

######################################################################
##
## Network configuration
##
bridge  = xenbr1
dhcp    = 1
gateway = 10.80.2.1
netmask = 255.255.255.0

######################################################################
##
## Password configuration
##
passwd = 1

######################################################################
##
## Package Mirror configuration
##
arch           = amd64
mirror         = http://ftp.us.debian.org/debian/
mirror_squeeze = http://ftp.us.debian.org/debian/
mirror_wheezy  = http://ftp.us.debian.org/debian/
mirror_jessie  = http://ftp.us.debian.org/debian/

######################################################################
##
## Proxy Settings for repositories
##
apt_proxy = http://10.80.1.3:3142/

######################################################################
##
## Filesystem settings
##
ext4_options   = noatime,nodiratime,errors=remount-ro
ext3_options   = noatime,nodiratime,errors=remount-ro
ext2_options   = noatime,nodiratime,errors=remount-ro
xfs_options    = defaults
reiser_options = defaults

######################################################################
##
## Xen VM boot settings
##
pygrub = 1

# Filesystem options for the different filesystems we support.
#
ext4_options  = noatime,nodiratime,errors=remount-ro,data=writeback,barrier=0,commit=600
ext3_options  = noatime,nodiratime,errors=remount-ro
ext2_options  = noatime,nodiratime,errors=remount-ro
xfs_options   = defaults
btrfs_options = defaults

######################################################################
##
## Xen VM settings
##
serial_device = hvc0
disk_device   = xvda

######################################################################
##
## Xen configuration files
##
output    = /xen/conf
extension = .cfg
```
Since we're running Xen 4.0.1, there are some additional configuration options to tend to (squeeze also does a better job of splitting this functionality out into specific files). /etc/default/xendomains is next… two changes need to be made:
```
## Type: string
## Default: /var/lib/xen/save
#
# Directory to save running domains to when the system (dom0) is
# shut down. Will also be used to restore domains from if
# XENDOMAINS_RESTORE
# is set (see below). Leave empty to disable domain saving on shutdown
# (e.g. because you rather shut domains down).
# If domain saving does succeed, SHUTDOWN will not be executed.
#
#XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SAVE=""
```
Basically make XENDOMAINS_SAVE an empty string, and:
```
## Type: boolean
## Default: true
#
# This variable determines whether saved domains from XENDOMAINS_SAVE
# will be restored on system startup.
#
XENDOMAINS_RESTORE=false
```
XENDOMAINS_RESTORE should be set to false.
Finally, we set a directory for auto-starting VMs on dom0 boot:
```
# This variable sets the directory where domains configurations
# are stored that should be started on system startup automatically.
# Leave empty if you don't want to start domains automatically
# (or just don't place any xen domain config files in that dir).
# Note that the script tries to be clever if both RESTORE and AUTO are
# set: It will first restore saved domains and then only start domains
# in AUTO which are not running yet.
# Note that the name matching is somewhat fuzzy.
#
XENDOMAINS_AUTO=/xen/conf/auto
```
The purpose of the disk array is to provide RAID1 (mirror) to the Xen VM images.
As we've had functioning RAID volumes for years, I thought I would re-do the arrays to take advantage of any new version features (when I first created them, mdadm was at version 0.8; now it is 1.2).
So, I first stopped the array:
sokraits:~# mdadm --stop /dev/md0
Then, I zeroed out the superblocks on both constituent drives:
```
sokraits:~# mdadm --zero-superblock /dev/sdb
sokraits:~# mdadm --zero-superblock /dev/sdc
```
Now we can proceed with creating the new array.
I opted to build the array straight on the raw disks; no messing with partition tables.
```
halfadder:~# mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb /dev/sdc
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc appears to be part of a raid array:
       level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
halfadder:~# 
```
To check the status:
```
halfadder:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc[1] sdb[0]
      488385424 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  8.9% (43629696/488385424) finish=56.9min speed=130132K/sec

unused devices: <none>
halfadder:~# 
```
Usually (when it is finished building and all is in order) it will look something like this:
```
halfadder:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb[0] sdc[1]
      488385424 blocks super 1.2 [2/2] [UU]

unused devices: <none>
halfadder:~# 
```
To avoid oddities on subsequent boots (such as /dev/md0 coming up as /dev/md127 and confusing everything), we should set up the /etc/mdadm/mdadm.conf file accordingly. Assuming the hardware is in identical places device-wise, the only data unique to each peer is the hostname and the md0 UUID, as seen in the following configs (sokraits first, then halfadder):
```
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST sokraits

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=731663e3:d5fd45ac:157baa06:11018534 name=sokraits:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
```
```
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST halfadder

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=c846eb24:6b9783db:cd9b436c:8470fd46 name=halfadder:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
```
To obtain the UUID generated for the md volume, simply run the following (it is unique per host):
```
halfadder:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Aug  7 11:50:06 2014
     Raid Level : raid1
     Array Size : 488255488 (465.64 GiB 499.97 GB)
  Used Dev Size : 488255488 (465.64 GiB 499.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Aug  7 13:11:52 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : halfadder:0  (local to host halfadder)
           UUID : c846eb24:6b9783db:cd9b436c:8470fd46
         Events : 979

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
halfadder:~# 
```
You'll see the UUID listed. Just copy this into /etc/mdadm/mdadm.conf in the appropriate place, as indicated by the above config files, to ensure the proper identification of the MD array.
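Alternatively, mdadm can generate the ARRAY line for you; a sketch (the output still needs to be merged into the hand-maintained file above, keeping the DEVICE/HOMEHOST lines intact):

```
halfadder:~# mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=halfadder:0 UUID=c846eb24:6b9783db:cd9b436c:8470fd46
```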
According to the information in /usr/share/doc/mdadm/README.upgrading-2.5.3.gz, once we configure the /etc/mdadm/mdadm.conf file, we must let the system know and rebuild the initial ramdisk:
```
BOTH:~# update-initramfs -t -u -k all
update-initramfs: Generating /boot/initrd.img-2.6.32-5-xen-amd64
update-initramfs: Generating /boot/initrd.img-2.6.32-5-amd64
BOTH:~# 
```
In order to have the “shared” storage that allows OCFS2 to work, we'll set up DRBD to constantly sync the volumes between sokraits and halfadder.
With the tools installed, we need to configure some files.
First up, we need to get the peers talking so we can form the volume and get OCFS2 established. Let's make the /etc/drbd.d/global_common.conf file look as follows:
```
global {
    usage-count no;
}

common {
    startup {
        wfc-timeout      60;
        degr-wfc-timeout 60;
    }

    disk {
        on-io-error detach;
    }

    syncer {
        rate 40M;
    }

    protocol C;
}
```
This is only an intermediate step. Further changes are needed before we can bring it up in dual-primary mode.
And the resource configuration (doesn't need to change), on both peers:
```
resource xen_data {
    device    /dev/drbd0;
    disk      /dev/md0;
    meta-disk internal;

    on sokraits {
        address 172.16.1.1:7788;
    }

    on halfadder {
        address 172.16.1.2:7788;
    }
}
```
Getting DRBD initially up-and-running has always had a bit of voodoo behind it… trying a number of commands and eventually stumbling upon something that works. I may have finally gotten the procedure down.
These identical steps are run on BOTH peers:
```
BOTH:~# drbdadm create-md xen_data
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
BOTH:~# modprobe drbd
BOTH:~# drbdadm attach xen_data
BOTH:~# drbdadm syncer xen_data
BOTH:~# drbdadm connect xen_data
BOTH:~# 
```
At this point we should be able to do the following:
```
EITHER:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:488370480
EITHER:~# 
```
Once we see a semblance of communication (the “Secondary/Secondary” in /proc/drbd output, for example), we can kick the two peers into operation.
This next step must only take place on ONE of the peers. I picked sokraits for this example.. but it really doesn't matter which:
sokraits:~# drbdadm -- --overwrite-data-of-peer primary xen_data
Of course, the choice isn't so arbitrary if one of the peers has the more up-to-date copy of the data; in that case, run the command on the peer whose data should be kept, so the stale peer is the one that gets overwritten.
We can then view /proc/drbd and see output like:
```
sokraits:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:361016 nr:0 dw:0 dr:367944 al:0 bm:21 lo:15 pe:76 ua:225 ap:0 ep:1 wo:b oos:488011888
        [>....................] sync'ed:  0.1% (476572/476924)M
        finish: 2:16:00 speed: 59,764 (59,764) K/sec
sokraits:~# 
```
Checking this occasionally (on either peer) will show the progress.
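If you'd rather not keep re-running the command by hand, the standard watch utility will refresh the view every few seconds:

```
# re-display /proc/drbd every 5 seconds until interrupted
EITHER:~# watch -n 5 cat /proc/drbd
```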
To format the volume, we ignore the underlying disks and always address /dev/drbd0. The OCFS2 filesystem is put on the disk array as follows:
```
halfadder:~# mkfs.ocfs2 -v -L datastore -N 4 -T datafiles /dev/drbd0
mkfs.ocfs2 1.4.4
Cluster stack: classic o2cb
Filesystem Type of datafiles
Label: datastore
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 1048576 (20 bits)
Volume size: 500105740288 (476938 clusters) (122096128 blocks)
Cluster groups: 15 (tail covers 25354 clusters, rest cover 32256 clusters)
Extent allocator size: 377487360 (90 groups)
Journal size: 33554432
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 5 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Writing lost+found: done
mkfs.ocfs2 successful

halfadder:~# 
```
Since we gave mkfs.ocfs2 the -N 4 argument, a maximum of 4 machines can simultaneously mount this volume (had no -N argument been specified, the default would have been 8). The intent is for just two machines (sokraits and halfadder) to ever mount this volume.
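Should more node slots ever be needed (a third VM server, for instance), they can be added after the fact with tunefs.ocfs2; a sketch, with the usual caveat that the cluster should be quiet while tuning:

```
# bump the filesystem from 4 to 6 node slots
halfadder:~# tunefs.ocfs2 -N 6 /dev/drbd0
```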
Because sokraits and halfadder will exist in a primary-primary peer relationship, we need to run a cluster-aware filesystem on our shared volume. Although many exist, the one we've had any amount of prior experience with is OCFS2, so it is redeployed here.
The following should be put in /etc/default/o2cb:
```
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running 'dpkg-reconfigure ocfs2-tools'.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=datastore

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max. time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min. time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000
```
Next we need to configure the OCFS2 cluster (/etc/ocfs2/cluster.conf):
```
node:
    ip_port = 7777
    ip_address = 172.16.1.1
    number = 0
    name = sokraits
    cluster = datastore

node:
    ip_port = 7777
    ip_address = 172.16.1.2
    number = 1
    name = halfadder
    cluster = datastore

cluster:
    node_count = 2
    name = datastore
```
Once the other prerequisites are taken care of, we can bring DRBD up in dual-primary mode, which the following configuration allows for. Duplicate this on both peers:
```
global {
    usage-count no;
}

common {
    startup {
        wfc-timeout       60;
        degr-wfc-timeout  60;
        become-primary-on both;
    }

    disk {
        on-io-error detach;
    }

    net {
        allow-two-primaries yes;
    }

    syncer {
        rate 80M;
    }

    protocol C;
}
```
This /etc/drbd.d/global_common.conf file needs to be identical and present on BOTH DRBD peers.
Recognizing the changes does not require a reboot! The following command (run on both DRBD peers) will update the config:
machine:~# drbdadm adjust xen_data
Assuming /etc/ocfs2/cluster.conf and /etc/default/o2cb are configured and identical, we can now establish OCFS2 cluster connectivity.
These steps take place on BOTH peers.
BOTH:~# modprobe ocfs2
Next, we need to bring the o2cb service online:
```
BOTH:~# /etc/init.d/o2cb start
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster datastore: OK
```
Whatever other OCFS2-related functionality there is, it is time to bring that online as well:
```
BOTH:~# /etc/init.d/ocfs2 start
BOTH:~# 
```
Assuming all is in order, we can now mount our volume:
```
BOTH:~# mount -t ocfs2 /dev/drbd0 /export
BOTH:~# df
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              27G  1.2G   25G   5% /
tmpfs                 1.7G     0  1.7G   0% /lib/init/rw
udev                  1.6G  160K  1.6G   1% /dev
tmpfs                 1.7G     0  1.7G   0% /dev/shm
/dev/drbd0            466G  1.6G  465G   1% /export
BOTH:~# 
```
We can have the system automatically mount our volume on boot by appending the following entry to the bottom of /etc/fstab:
/dev/drbd0 /export ocfs2 noatime 0 0
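To confirm the fstab entry is sane before trusting it to a reboot (a quick sketch; do this while no VMs are using the volume), unmount and remount by mount point so the options are read from /etc/fstab:

```
BOTH:~# umount /export
BOTH:~# mount /export          # options come from the new fstab entry
BOTH:~# mount | grep /export   # verify it is mounted as ocfs2
```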
The Linux default for swappiness is 60, which results in the system paging memory out to swap fairly eagerly. Cranking it down to 10 seems a more prudent setting, especially on systems with SSDs, which we want written to as little as possible.
I added the following line to /etc/sysctl.conf on both systems:
vm.swappiness = 10
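The sysctl.conf entry takes effect at the next boot; to apply it immediately (and verify), the value can also be set at runtime:

```
BOTH:~# sysctl -w vm.swappiness=10
vm.swappiness = 10
BOTH:~# cat /proc/sys/vm/swappiness
10
```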
It seems Debian has nice built-in support for mounting /tmp on tmpfs (a RAM-backed filesystem). All you need to do is edit /etc/default/tmpfs and uncomment/change the following line:
RAMTMP=yes
And reboot!
The disk array will hold the Xen virtual machine images (and supporting files), and also serve as another backup destination for resources in the LAIR.
The following directories have been created:
This is no longer needed, but may well be in the future.
There are two changes needed to successfully create jessie VMs, and both are symlinks:
```
halfadder:~# cd /usr/share/debootstrap
halfadder:/usr/share/debootstrap# ln -s sid jessie
```
```
halfadder:~# cd /usr/lib/xen-tools
halfadder:/usr/lib/xen-tools# ln -s debian.d jessie.d
```
Sokraits and Halfadder serve as the production virtual machine servers in the LAIR, predominantly to the offbyone.lan network, but also providing services utilized on lair.lan, student.lab, and across the BITS universe.
Running the Open Source Xen hypervisor (version 4.0.1), they host around a dozen virtual machines.
From the VM server, we can adjust various properties and control aspects of virtual machines via the xm command-line tool.
To determine what is running on a particular VM host, we use the xm list command.
Here's an example of the output (the list of VMs can and will vary):
```
sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1377.5
irc                                          8   128     1     -b----      9.2
lab46                                       11   512     2     -b----     10.6
lab46db                                      6   128     1     -b----     14.3
mail                                         5   192     1     -b----     18.0
www                                          4   192     1     -b----    129.2
sokraits:~# 
```
This shows us the VMs currently running locally on this VM server. If the VM you are looking for is believed to be running but does not appear in this list, it is likely running on the other VM server.
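When in doubt, a quick loop from either dom0 can check both servers at once; a sketch, assuming passwordless root ssh between the peers and using www purely as an example VM name:

```
# print any matching domain on each VM server
sokraits:~# for h in sokraits halfadder; do echo "== $h =="; ssh $h xm list | grep -w www; done
```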
To start a VM that is not presently running, assuming all prerequisites are met (an operational VM image, a correct configuration file, and available resources, mainly memory), we can use xm to create an instance of the virtual machine.
In this example, we will start the VM for repos.offbyone.lan:
```
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-25lenny1) (dannf@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Sep 16 16:32:15 UTC 2010
[    0.000000] Command line: root=/dev/xvda1 ro ip=:127.0.255.255::::eth0:dhcp clocksource=jiffies
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 0000000008800000 (usable)
[    0.000000] max_pfn_mapped = 34816
[    0.000000] init_memory_mapping
...
Starting periodic command scheduler: crond.

Debian GNU/Linux 5.0 repos.offbyone.lan hvc0

repos.offbyone.lan login:
```
Due to the -c argument we gave xm when creating the virtual machine, we are connected to the console of this virtual machine, allowing us to see it boot. Omit the -c from the xm command line and the machine will still start, but we'll be returned to the command prompt instead.
In this current scenario, we'll want to issue a: CTRL-]
Once you do that, you'll escape from the VM's prompt, and be returned to the prompt on the VM server.
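Should you want the console back later, it can be reattached at any time with xm console (escaping again with CTRL-] when done):

```
# reattach to the console of the running repos VM
halfadder:~# xm console repos
```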
And what if the VM is already running? If you are trying to start it on the same VM host it is already running on, you'll see the following:
```
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
Error: Domain 'repos' already exists with ID '3'
halfadder:~# 
```
If it is running, but on the other VM server, well, trouble is likely going to take place. Although the VM servers are using the cluster file system, the individual VMs are not, and will likely not take kindly to concurrent accesses. So prevent headaches and take care not to start multiple copies of the same VM!
If we desire to shut down a VM, we can do so (and properly!) from the VM server command line. Using the xm shutdown command, a shutdown signal is sent to the VM, and the machine shuts down just as if we had given it a “shutdown -h now” command.
Shutting down repos.offbyone.lan:
```
halfadder:~# xm shutdown repos
halfadder:~# 
```
After a bit, if you check the output of xm list, you will no longer see the VM in question listed. Once this condition is true, you can proceed with whatever operation is underway.
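If scripting such an operation, a small wait loop saves staring at xm list (a sketch; repos is just the example domain from above):

```
# poll until the domain disappears from the xm list output
halfadder:~# while xm list | grep -qw repos; do sleep 5; done; echo "repos is down"
```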
One of the impressive features we have available with the use of DRBD and OCFS2 is a multi-master concurrent filesystem. This creates “shared storage”, which grants us some advantages.
Specifically, we can use our shared storage to enable migration of virtual machines between VM servers. What's more, we can perform a live migration, transparently (to anyone using the virtual machine) moving the VM to another physical host without interrupting its operation.
What follows is an example of a live migration, moving the www virtual machine, which originally resides on sokraits:
```
sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1383.1
irc                                          8   128     1     -b----      9.8
lab46                                       11   512     2     -b----     11.3
lab46db                                      6   128     1     -b----     14.6
mail                                         5   192     1     -b----     18.8
www                                          4   192     1     -b----    133.3
sokraits:~# 
```
So we see www is running on sokraits.
```
sokraits:~# xm migrate --live www halfadder
sokraits:~# 
```
After only a few seconds, we get our prompt back.
Do another xm list on sokraits:
```
sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1387.8
irc                                          8   128     1     -b----      9.9
lab46                                       11   512     2     -b----     11.4
lab46db                                      6   128     1     -b----     14.6
mail                                         5   192     1     -b----     19.0
sokraits:~# 
```
As you can see, www is no longer present in the VM list on sokraits.
Switch over to halfadder, do a check:
```
halfadder:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  3305     1     r-----     65.0
auth                                         4   128     1     -b----      4.1
log                                          1   128     1     -b----      0.8
repos                                        5   128     1     -b----      8.3
web                                          2   128     1     -b----      2.9
www                                          6   192     1     -b----      0.4
halfadder:~# 
```
And voila! A successful live migration.
To facilitate administration, both sokraits and halfadder are configured as LRRD node clients, and their logged data can be retrieved from LRRD at: http://web.offbyone.lan/lrrd/
The first step is to install the actual LAIR package:
```
BOTH:~# aptitude install lrrd-node
The following NEW packages will be installed:
  libstatgrab6{a} lrrd-node python-statgrab{a}
0 packages upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 118 kB of archives. After unpacking 348 kB will be used.
Do you want to continue? [Y/n/?]
Get:1 http://mirror/debian/ squeeze/main libstatgrab6 amd64 0.16-0.1 [57.6 kB]
Get:2 http://mirror/debian/ squeeze/main python-statgrab amd64 0.4-1.1+b2 [53.0 kB]
Get:3 http://mirror/lair/ squeeze/main lrrd-node all 1.0.7-1 [7,128 B]
Fetched 118 kB in 0s (9,978 kB/s)
Selecting previously deselected package libstatgrab6.
(Reading database ... 28935 files and directories currently installed.)
Unpacking libstatgrab6 (from .../libstatgrab6_0.16-0.1_amd64.deb) ...
Selecting previously deselected package python-statgrab.
Unpacking python-statgrab (from .../python-statgrab_0.4-1.1+b2_amd64.deb) ...
Setting up libstatgrab6 (0.16-0.1) ...
Setting up python-statgrab (0.4-1.1+b2) ...
Processing triggers for python-support ...
Selecting previously deselected package lrrd-node.
(Reading database ... 28961 files and directories currently installed.)
Unpacking lrrd-node (from .../lrrd-node_1.0.7-1_all.deb) ...
Setting up lrrd-node (1.0.7-1) ...
Adding lrrdNode to init.d
update-rc.d: using dependency based boot sequencing
insserv: warning: script 'lrrdnode' missing LSB tags and overrides
Running lrrdNode ...
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
BOTH:~# 
```
Once installed and running on the client side, we need to configure (or reconfigure, as the case may be) at LRRD.
So pop a browser over to: http://web.offbyone.lan/lrrd/
And log in (~root, punctuation-less ~root pass).
Click on the “Configure” link, and find the host in question (if it has prior history reporting to LRRD).
If found, note that it is Enabled, and click the “reconfigure” link to the right of the entry.
There's an option to delete existing databases (do it), and check off any appropriate network interfaces.
If it is discovered that data reporting ceases, and other components of the LRRD system are still deemed functioning, it is likely that the lrrd-node client needs a restart. Simply do the following on the machine in question:
```
sokraits:~# /etc/init.d/lrrdnode restart
Stopping lrrdNode: stat collection thinger: lrrdNode
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
sokraits:~# 
```
Wait at least 5 minutes for data reporting to make it into graphable form.
Since we've been successfully running the systems out of a RAMdisk, care must be taken to preserve any changes in the event of a reboot or power failure.
With this in mind, I first had the systems rsync'ing to their local SSD (boot drive). I rigged up a custom cron job that ran 3 times a day. It looked as follows:
12 */8 * * * (mkdir -p /tmp/sda1; mount /dev/sda1 /tmp/sda1; rsync -av --one-file-system / /tmp/sda1/; umount /tmp/sda1)
This worked handily until sokraits lost its boot drive (again! within 2 months!), so I decided to investigate netbooting using an NFSroot.
In the process, I may have finally made a breakthrough in my longtime desire to put the entire system IN the initial ramdisk (so it would be running in RAM from the get-go). It turns out, according to the manual page, you merely have to put the system IN the initrd file… obviously one needs adequate memory (roughly 2x at boot: enough to hold the initrd and enough to decompress it into).
My cron job changed as follows:
24 */8 * * * (rsync -av --one-file-system / data.lair.lan:/export/tftpboot/netboot/halfadder/disk/)
I plan to rig up either some daily autogeneration of the initrd, or have a script on standby that can be used to make it. This will then become the method of booting both sokraits and halfadder (and potentially freeing up a still-working SSD in the process, which I can use in data2).
On the fileserver, I then grab the latest copies of the hypervisor and kernel, and generate a new whole-system initrd:
```
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/xen-4.4-amd64.gz .
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/vmlinuz-3.16-2-amd64 linux
data1:/export/tftpboot/netboot/halfadder# cd disk
data1:/export/tftpboot/netboot/halfadder/disk# find . | cpio -c -o | gzip -9 > ../initrd.gz
data1:/export/tftpboot/netboot/halfadder/disk# 
```
On the fileserver, in /export/tftpboot/pxelinux.cfg/, are two files, 0A50012E (sokraits) and 0A50012F (halfadder)… they are named according to the machine's IP address (expressed in hexadecimal).
The file(s) contain:
```
default netboot
prompt 1
timeout 2

label netboot
        kernel mboot.c32
        append netboot/halfadder/xen-4.4-amd64.gz --- netboot/halfadder/linux console=tty0 root=/dev/ram0 ro --- netboot/halfadder/initrd.gz

label memtest
        kernel distros/memtest/memtest86+
```
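For reference, the hex filename for a given client IP can be computed with printf (gethostip from the syslinux-utils package does the same job, if installed); a sketch using the two addresses above:

```
# each octet rendered as two uppercase hex digits
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 46
0A50012E
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 47
0A50012F
```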