Overview

sokraits.offbyone.lan and halfadder.offbyone.lan are the two main Xen VM servers hosting virtual machines in the LAIR.

hostname            RAM   disk                              swap    OS                            Kernel
sokraits.lair.lan   8GB   32GB (/)                          1.2GB   Debian 8.0 “Jessie” (AMD64)   3.16-2-amd64
                          500GB + 500GB RAID1 (/dev/md0)
halfadder.lair.lan  8GB   32GB (/)                          1.2GB   Debian 8.0 “Jessie” (AMD64)   3.16-2-amd64
                          500GB + 500GB RAID1 (/dev/md0)

News

TODO

Network Configuration

Overview

Machine              Interface   IP Address                      MAC Address
sokraits.lair.lan    eth0        10.80.1.46 (lair.lan subnet)    00:1a:92:cd:0b:1b
                     eth1        offbyone.lan subnet             00:1a:92:cd:05:d6
                     eth2        172.16.1.1 (peer link)          00:0a:cd:16:d9:ac
halfadder.lair.lan   eth0        10.80.1.47 (lair.lan subnet)    00:1a:92:cd:0a:7f
                     eth1        offbyone.lan subnet             00:1a:92:cd:06:60
                     eth2        172.16.1.2 (peer link)          00:0a:cd:16:d3:cd

Both Sokraits and Halfadder use their (once forbidden!) second network interfaces so they can exist on both primary LAIR subnets, along with an additional add-in PCI-e NIC connected host-to-host via an over-the-ceiling, across-the-floor cable, dedicated to DRBD and OCFS2 peer updates.

Interfaces

To ensure that all network interfaces come up as intended, we need to configure /etc/network/interfaces as follows (the example shown is for sokraits; halfadder is identical apart from the eth2 address, as noted inline):

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
#auto eth0
iface eth0 inet manual
iface eth1 inet manual

## management and lair.lan access through xenbr0
auto xenbr0
iface xenbr0 inet dhcp
   bridge_ports eth0
   bridge_stp off       # disable Spanning Tree Protocol
   bridge_waitport 0    # no delay before a port becomes available
   bridge_fd 0          # no forwarding delay

## configure a (separate) bridge for the DomUs without giving Dom0 an IP on it
auto xenbr1
iface xenbr1 inet manual
   bridge_ports eth1
   bridge_stp off       # disable Spanning Tree Protocol
   bridge_waitport 0    # no delay before a port becomes available
   bridge_fd 0          # no forwarding delay

auto eth2
iface eth2 inet static
    address 172.16.1.1  # halfadder assigns the address: 172.16.1.2
    netmask 255.255.255.0
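
Once networking is restarted (or after a reboot), a quick sanity check along these lines confirms the bridges and the peer link came up; this is just a hedged verification sketch, not part of the original procedure:

BOTH:~# brctl show              # xenbr0 should list eth0, xenbr1 should list eth1
BOTH:~# ip addr show xenbr0     # should hold the DHCP-assigned lair.lan address
BOTH:~# ip addr show eth2       # should hold the static 172.16.1.x peer address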

udev rules.d

Additionally, I found some probe-order issues cropping up, so I had to manually pin down which interface was which on both sokraits and halfadder via their /etc/udev/rules.d/70-persistent-net.rules files.

sokraits

# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:05:d6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0b:1b", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10ec:0x8168 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d9:ac", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"

halfadder

# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0a:7f", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:06:60", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10ec:/sys/devices/pci0000:00/0000:00:16.0/0000:06:00.0 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d3:cd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"

apt configuration

use LAIR apt proxy

To reduce traffic caused by package transactions, I set up an apt-cacher-ng proxy on the fileserver(s), so every client needs to configure itself appropriately. The easiest way is to create the file /etc/apt/apt.conf.d/00apt-cacher-ng with the following contents:

Acquire::http { Proxy "http://10.80.1.3:3142"; };
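
To confirm apt actually picked up the proxy (a quick hedged check, not from the original notes), dump the effective configuration:

BOTH:~# apt-config dump | grep -i proxy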

no recommends

I wanted a small installation footprint, so I disabled the installation of recommended packages by default.

To do so, create/edit /etc/apt/apt.conf.d/99_norecommends, and put in the following:

APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";

This can also retroactively remove previously installed recommended packages: run aptitude, press 'g', then 'g' again, and that should take care of business.

There are also some options that can be set in aptitude proper, via its console gui (options→preferences):

Useful URLs:

Packages

The following packages have been installed on both sokraits and halfadder (a one-shot install sketch follows the list):

bridge-utils
lair-std
lair-backup
mdadm
xen-linux-system
xen-tools
drbd8-utils
ocfs2-tools
smartmontools
firmware-realtek
qemu-system-x86-64
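
Assuming the LAIR package repository is already configured in sources.list (lair-std and lair-backup come from it), the whole set can be pulled in with a single transaction; a sketch, using the package names exactly as listed above:

BOTH:~# aptitude install bridge-utils lair-std lair-backup mdadm \
          xen-linux-system xen-tools drbd8-utils ocfs2-tools \
          smartmontools firmware-realtek qemu-system-x86-64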

GRUB Configuration

As specified in the Debian Xen Wiki: https://wiki.debian.org/Xen#Prioritise_Booting_Xen_Over_Native

machine:~# dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen
Adding 'local diversion of /etc/grub.d/20_linux_xen to /etc/grub.d/08_linux_xen'
machine:~# 

and then regenerate grub config:

machine:~# update-grub
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
done
machine:~# 
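
To confirm the diversion took and that a Xen entry now sorts first in the boot menu, a hedged sanity check might look like:

machine:~# dpkg-divert --list | grep linux_xen
machine:~# grep ^menuentry /boot/grub/grub.cfg | head -3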

Xen Configuration

Xend configuration

The Xend config file (/etc/xen/xend-config.sxp) for this host is as follows:

# -*- sh -*-

#
# Xend configuration file.
#
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^10.80.2.42$ ^10.80.2.46$ ^10.80.2.47$ ^halfadder$ ^halfadder.offbyone.lan$ ^sokraits$ ^sokraits.offbyone.lan$ ^yourmambas$ ^yourmambas.offbyone.lan$ ^grrasp$ ^grrasp.offbyone.lan$')
(network-script network-bridge)
(vif-script vif-bridge)
(dom0-min-mem 196)
(enable-dom0-ballooning yes)
(total_available_memory 0)
(dom0-cpus 0)
(vnc-listen '10.80.1.46')
(vncpasswd '********')
(xend-domains-path /export/xen/xend/domains)   # be sure to create this directory
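
The comment on xend-domains-path is not idle: the directory must exist before xend will use it. A minimal follow-up (assuming the xend init script is present, as it is with this Xen version) would be:

machine:~# mkdir -p /export/xen/xend/domains
machine:~# /etc/init.d/xend restart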

local loopback

As usual, if left to its own devices, only 8 loopback devices will be created by default. Don't forget to edit /etc/modules as follows and reboot:

# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

firewire-sbp2
loop max_loop=255
#forcedeth max_interrupt_work=20 optimization_mode=1 poll_interval=100
nfs callback_tcpport=2049

I commented out forcedeth for now.. it does load, but I don't know with the current kernel if I need to specifically set those options. Time will tell.
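
After a reboot, the effective limit can be verified (a quick hedged check; the parameter file only exists once the loop module is loaded):

BOTH:~# cat /sys/module/loop/parameters/max_loop

It should report 255.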

xen-tools

Xen Tools appears to have been updated… it can now handle recent distributions!

Config file, /etc/xen-tools/xen-tools.conf follows:

######################################################################
##
## Virtual Machine configuration
##
dir             = /xen
install-method  = debootstrap
cache           = no

######################################################################
##
## Disk and Sizing options
##
size            = 4Gb      # Disk image size.
memory          = 256Mb    # Memory size
swap            = 128Mb    # Swap size
fs              = ext4     # use the ext4 filesystem for the disk image.
dist            = jessie   # Default distribution to install.
images          = full

######################################################################
##
## Network configuration
##
bridge          = xenbr1
dhcp            = 1
gateway         = 10.80.2.1
netmask         = 255.255.255.0

######################################################################
##
## Password configuration
##
passwd          = 1

######################################################################
##
## Package Mirror configuration
##
arch            = amd64
mirror          = http://ftp.us.debian.org/debian/
mirror_squeeze  = http://ftp.us.debian.org/debian/
mirror_wheezy   = http://ftp.us.debian.org/debian/
mirror_jessie   = http://ftp.us.debian.org/debian/

######################################################################
##
## Proxy Settings for repositories
##
apt_proxy       = http://10.80.1.3:3142/

######################################################################
##
## Filesystem settings
##
ext4_options    = noatime,nodiratime,errors=remount-ro
ext3_options    = noatime,nodiratime,errors=remount-ro
ext2_options    = noatime,nodiratime,errors=remount-ro
xfs_options     = defaults
reiser_options  = defaults

######################################################################
##
## Xen VM boot settings
##
pygrub          = 1

#  Filesystem options for the different filesystems we support.
#
ext4_options    = noatime,nodiratime,errors=remount-ro,data=writeback,barrier=0,commit=600
ext3_options    = noatime,nodiratime,errors=remount-ro
ext2_options    = noatime,nodiratime,errors=remount-ro
xfs_options     = defaults
btrfs_options   = defaults

######################################################################
##
## Xen VM settings
##
serial_device   = hvc0
disk_device     = xvda

######################################################################
##
## Xen configuration files
##
output          = /xen/conf
extension       = .cfg
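
With that config in place, building a guest is a single xen-create-image invocation; a sketch (the hostname here is purely hypothetical):

EITHER:~# xen-create-image --hostname=test1.offbyone.lan
EITHER:~# xm create -c /xen/conf/test1.offbyone.lan.cfg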

xendomains config

Since we're running Xen 4.0.1, there are some additional configuration options to tend to (squeeze also does a better job of splitting functionality out into specific files). /etc/default/xendomains is next; two changes need to be made:

## Type: string
## Default: /var/lib/xen/save
#
# Directory to save running domains to when the system (dom0) is
# shut down. Will also be used to restore domains from if # XENDOMAINS_RESTORE
# is set (see below). Leave empty to disable domain saving on shutdown 
# (e.g. because you rather shut domains down).
# If domain saving does succeed, SHUTDOWN will not be executed.
#
#XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SAVE=""

Basically make XENDOMAINS_SAVE an empty string, and:

## Type: boolean
## Default: true
#
# This variable determines whether saved domains from XENDOMAINS_SAVE
# will be restored on system startup. 
#
XENDOMAINS_RESTORE=false

XENDOMAINS_RESTORE should be set to false.

Finally, we set a directory holding the configs of VMs to auto-start on dom0 boot:

# This variable sets the directory where domains configurations
# are stored that should be started on system startup automatically.
# Leave empty if you don't want to start domains automatically
# (or just don't place any xen domain config files in that dir).
# Note that the script tries to be clever if both RESTORE and AUTO are 
# set: It will first restore saved domains and then only start domains
# in AUTO which are not running yet. 
# Note that the name matching is somewhat fuzzy.
#
XENDOMAINS_AUTO=/xen/conf/auto
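
The auto directory has to exist, and VMs are opted in by dropping (or symlinking) their config files into it. A sketch, assuming www's config lives at /xen/conf/www.cfg:

machine:~# mkdir -p /xen/conf/auto
machine:~# ln -s /xen/conf/www.cfg /xen/conf/auto/www.cfg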

MD array configuration

The purpose of the disk array is to provide RAID1 (mirror) to the Xen VM images.

Re-initializing

As we've had functioning RAID volumes for years, I thought I would re-do the arrays to take advantage of any new format features (when I first created them, the mdadm superblock metadata was version 0.90; the current default is 1.2).

So, I first stopped the array:

sokraits:~# mdadm --stop /dev/md0

Then, I zeroed out the superblocks on both constituent drives:

sokraits:~# mdadm --zero-superblock /dev/sdb
sokraits:~# mdadm --zero-superblock /dev/sdc

Now we can proceed with creating the new array.

creating /dev/md0

I opted to build the array straight to disk, without messing with partition tables.

halfadder:~# mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb /dev/sdc 
mdadm: /dev/sdb appears to be part of a raid array:
       level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc appears to be part of a raid array:
       level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
halfadder:~# 

checking disk array status

To check the status:

halfadder:~# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdc[1] sdb[0]
      488385424 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  8.9% (43629696/488385424) finish=56.9min speed=130132K/sec
      
unused devices: <none>
halfadder:~# 

Usually (when finished building and all is in order) it'll look something like:

halfadder:~# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb[0] sdc[1]
      488385424 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
halfadder:~# 

Setting /etc/mdadm/mdadm.conf

To avoid oddities (such as /dev/md0 coming up as /dev/md127 and confusing everything) on subsequent boots, we should set up the /etc/mdadm/mdadm.conf file accordingly. Assuming hardware is in identical places device-wise, the only data unique to each peer is the hostname and the md0 uuid, as is seen in the following:

sokraits's mdadm.conf

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST sokraits

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=731663e3:d5fd45ac:157baa06:11018534 name=sokraits:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5

halfadder's mdadm.conf

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST halfadder

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=c846eb24:6b9783db:cd9b436c:8470fd46 name=halfadder:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5

How to find the local md volume UUID

To obtain the UUID generated for the md volume, simply run the following (it is unique per host):

halfadder:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Aug  7 11:50:06 2014
     Raid Level : raid1
     Array Size : 488255488 (465.64 GiB 499.97 GB)
  Used Dev Size : 488255488 (465.64 GiB 499.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Aug  7 13:11:52 2014
          State : active 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : halfadder:0  (local to host halfadder)
           UUID : c846eb24:6b9783db:cd9b436c:8470fd46
         Events : 979

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
halfadder:~# 

You'll see the UUID listed. Just copy this into /etc/mdadm/mdadm.conf in the appropriate place, as indicated by the above config files, to ensure the proper identification of the MD array.
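
Alternatively, mdadm can emit the ARRAY line itself, which avoids transcription errors; a hedged shortcut (trim the appended line to match the layout above afterwards):

BOTH:~# mdadm --detail --scan >> /etc/mdadm/mdadm.conf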

After configuring mdadm.conf

According to the information in /usr/share/doc/mdadm/README.upgrading-2.5.3.gz, once we configure the /etc/mdadm/mdadm.conf file, we must let the system know and rebuild the initial ramdisk:

BOTH:~# update-initramfs -t -u -k all
update-initramfs: Generating /boot/initrd.img-2.6.32-5-xen-amd64
update-initramfs: Generating /boot/initrd.img-2.6.32-5-amd64
BOTH:~# 

DRBD

In order to have the “shared” storage that allows OCFS2 to work, we'll set up DRBD to constantly sync the volumes between sokraits and halfadder.

With the tools installed, we need to configure some files.

/etc/drbd.d/global_common.conf BEFORE

First up, we need to get the peers talking so we can form the volume and get OCFS2 established. Let's make the /etc/drbd.d/global_common.conf file look as follows:

global
{
    usage-count no;
}

common
{
    startup
    {
        wfc-timeout 60;
        degr-wfc-timeout 60;
    }

    disk
    {
        on-io-error detach;
    }

    syncer
    {
        rate 40M;
    }

    protocol C;
}

This is only an intermediate step. Further changes are needed before we can bring it up in dual-primary mode.

/etc/drbd.d/xen_data.res

And the resource configuration, identical on both peers (this file does not need to change later):

resource xen_data
{
    device      /dev/drbd0;
    disk        /dev/md0;
    meta-disk   internal;

    on sokraits
    {
        address     172.16.1.1:7788;
    }

    on halfadder
    {
        address     172.16.1.2:7788;
    }
}

bootstrapping DRBD

Getting DRBD initially up-and-running has always had a bit of voodoo behind it… trying a number of commands and eventually stumbling upon something that works. I may have finally gotten the procedure down:

TO DO ON BOTH PEERS

Run these identical steps on both peers:

BOTH:~# drbdadm create-md xen_data
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
BOTH:~# modprobe drbd
BOTH:~# drbdadm attach xen_data
BOTH:~# drbdadm syncer xen_data
BOTH:~# drbdadm connect xen_data
BOTH:~# 

At this point we should be able to do the following:

EITHER:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757 
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:488370480
EITHER:~# 

TO DO ON ONLY ONE

Once we see a semblance of communication (the “Secondary/Secondary” in /proc/drbd output, for example), we can kick the two peers into operation.

This next step must only take place on ONE of the peers. I picked sokraits for this example… but it really doesn't matter which:

sokraits:~# drbdadm -- --overwrite-data-of-peer primary xen_data

Of course, the choice isn't so willy-nilly if one of the peers holds the more up-to-date copy of the data; in that case, run it on that peer.

After which we can view /proc/drbd and see output like:

sokraits:~# cat /proc/drbd 
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757 
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:361016 nr:0 dw:0 dr:367944 al:0 bm:21 lo:15 pe:76 ua:225 ap:0 ep:1 wo:b oos:488011888
	[>....................] sync'ed:  0.1% (476572/476924)M
	finish: 2:16:00 speed: 59,764 (59,764) K/sec
sokraits:~# 

Checking this occasionally (on either peer) will show the progress.
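
Rather than re-catting it by hand, watch(1) makes for a convenient progress display (a small hedged convenience, not from the original notes):

EITHER:~# watch -n5 cat /proc/drbd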

formatting the array

To format the volume, we ignore the underlying disks and always address /dev/drbd0. The OCFS2 filesystem was put on the disk array as follows:

halfadder:~# mkfs.ocfs2 -v -L datastore -N 4 -T datafiles /dev/drbd0
mkfs.ocfs2 1.4.4
Cluster stack: classic o2cb
Filesystem Type of datafiles
Label: datastore
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 1048576 (20 bits)
Volume size: 500105740288 (476938 clusters) (122096128 blocks)
Cluster groups: 15 (tail covers 25354 clusters, rest cover 32256 clusters)
Extent allocator size: 377487360 (90 groups)
Journal size: 33554432
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 5 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Writing lost+found: done
mkfs.ocfs2 successful

halfadder:~# 

Since -N 4 was given at format time, at most 4 machines can simultaneously mount this volume (had no -N argument been specified, the default would allow 8). The intent is for sokraits and halfadder to be the only two machines ever mounting this volume.
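
Should that ever need to change, the slot count can be adjusted after the fact with tunefs.ocfs2; a hedged sketch (check tunefs.ocfs2(8) for the conditions under which this is safe):

EITHER:~# tunefs.ocfs2 -N 8 /dev/drbd0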

OCFS2

Because sokraits and halfadder will exist in a primary-primary peer relationship, we need to run a cluster-aware filesystem on our shared volume. Although many exist, the one we've had any prior experience with is OCFS2, so it is redeployed here.

configuring OCFS2

The following should be put in /etc/default/o2cb:

#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running 'dpkg-reconfigure ocfs2-tools'.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=datastore

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max. time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min. time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000

Next we need to configure the OCFS2 cluster (/etc/ocfs2/cluster.conf):

node:
      ip_port = 7777
      ip_address = 172.16.1.1
      number = 0
      name =  sokraits
      cluster = datastore

node:
      ip_port = 7777
      ip_address = 172.16.1.2
      number = 1
      name =  halfadder
      cluster = datastore

cluster:
      node_count = 2
      name = datastore

/etc/drbd.d/global_common.conf AFTER OCFS2 is ready

Once the other prerequisites are taken care of, we can bring DRBD up in dual-primary mode (which OCFS2 requires), as the following config file allows. Duplicate this on both peers.

global                                                                                    
{
    usage-count no;
}
                                                                                          
common
{
    startup
    {
        wfc-timeout 60;
        degr-wfc-timeout 60;
        become-primary-on both;
    }   

    disk
    {   
        on-io-error detach;
    }       

    net
    {   
        allow-two-primaries yes;
    }   

    syncer
    {   
        rate 80M;
    }       

    protocol C;
}

This /etc/drbd.d/global_common.conf file needs to be identical and present on BOTH DRBD peers.

Recognizing the changes does not require a reboot! The following command (run on both DRBD peers), will update the config:

machine:~# drbdadm adjust xen_data

Bringing OCFS2 online

Assuming /etc/ocfs2/cluster.conf and /etc/default/o2cb are configured and identical, we can now establish OCFS2 cluster connectivity.

These steps take place on BOTH peers.

Step 0: kernel module

BOTH:~# modprobe ocfs2

Step 1: o2cb service

Next, we need to bring the o2cb service online:

BOTH:~# /etc/init.d/o2cb start
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster datastore: OK

Step 2: ocfs2 bits

Whatever other functionality is related to OCFS2, it's time to bring it online as well:

BOTH:~# /etc/init.d/ocfs2 start
BOTH:~# 

Step 3: mount the volume

Assuming all is in order, we can now mount our volume:

BOTH:~# mount -t ocfs2 /dev/drbd0 /export
BOTH:~# df
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              27G  1.2G   25G   5% /
tmpfs                 1.7G     0  1.7G   0% /lib/init/rw
udev                  1.6G  160K  1.6G   1% /dev
tmpfs                 1.7G     0  1.7G   0% /dev/shm
/dev/drbd0            466G  1.6G  465G   1% /export
BOTH:~# 

Local Modifications

Automating the mount in /etc/fstab

We can have the system automatically mount our volume on boot by appending the following entry to the bottom of /etc/fstab:

/dev/drbd0      /export         ocfs2   noatime         0       0

Turn swappiness way down

The Linux default for vm.swappiness is 60, which has the system paging things out to swap fairly readily. Cranking it down to 10 seems a more prudent setting, especially on systems with SSDs, where we want swap hit as little as possible.

I added the following line to /etc/sysctl.conf on both systems:

vm.swappiness = 10
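
The sysctl.conf entry takes effect on the next boot; to apply and verify it immediately (a quick hedged extra, not in the original notes):

BOTH:~# sysctl -w vm.swappiness=10
BOTH:~# sysctl vm.swappiness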

/tmp in RAM

Debian has nice built-in support for mounting /tmp in tmpfs (a RAM-backed filesystem). All you need to do is edit /etc/default/tmpfs and uncomment/change the following line:

RAMTMP=yes

And reboot!

integrating the array's storage into the system

The disk array is going to hold Xen virtual machine images (and supporting files), and also serve as another backup destination for resources in the LAIR.

The following directories have been created:

Historical: Configuring xen-tools to create Debian jessie VMs

This is no longer needed, but may well be needed again in the future.

There are two changes needed to successfully create jessie VMs, and both are symlinks:

Enable debootstrap to work with jessie

halfadder:~# cd /usr/share/debootstrap
halfadder:/usr/share/debootstrap# ln -s sid jessie

Enable xen-tools to recognize jessie as a valid distro

halfadder:~# cd /usr/lib/xen-tools
halfadder:/usr/lib/xen-tools# ln -s debian.d jessie.d

Xen in Operation

Overview

Sokraits and Halfadder serve as the production virtual machine servers in the LAIR, predominantly to the offbyone.lan network, but also providing services utilized on lair.lan, student.lab, and across the BITS universe.

Running the open source Xen hypervisor (version 4.0.1), they host around a dozen virtual machines.

Xen administration

From the VM server, we can adjust various properties and control aspects of virtual machines via the xm command-line tool.

What's running

To determine what is running on a particular VM host, we use the xm list command.

For example, here's a sample of its output (the list of VMs can and will vary):

sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1377.5
irc                                          8   128     1     -b----      9.2
lab46                                       11   512     2     -b----     10.6
lab46db                                      6   128     1     -b----     14.3
mail                                         5   192     1     -b----     18.0
www                                          4   192     1     -b----    129.2
sokraits:~# 

This shows us the VMs currently running locally on this VM server. If the VM you are looking for is believed to be running but is not in this list, it is likely running on the other VM server.

Boot a VM

To start a VM that is not presently running, assuming all prerequisites are met (operational VM image exists, correct configuration file, available resources (mainly memory)), we can use xm to create an instantiation of the virtual machine.

In this example, we will start the VM for repos.offbyone.lan:

halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-25lenny1) (dannf@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Sep 16 16:32:15 UTC 2010
[    0.000000] Command line: root=/dev/xvda1 ro ip=:127.0.255.255::::eth0:dhcp clocksource=jiffies
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 0000000008800000 (usable)
[    0.000000] max_pfn_mapped = 34816
[    0.000000] init_memory_mapping
...
Starting periodic command scheduler: crond.

Debian GNU/Linux 5.0 repos.offbyone.lan hvc0

repos.offbyone.lan login: 

Due to the -c argument we gave xm when creating the virtual machine, we are connected to the console of this virtual machine, allowing us to see it boot. Omit the -c from the xm command line and the machine will still start, but we'll be returned to the command prompt instead.

Detaching from VM console

In this current scenario, we'll want to issue a: CTRL-]

Once you do that, you'll escape from the VM's prompt, and be returned to the prompt on the VM server.
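
To reattach later, xm console does the reverse (the domain name here follows the repos example above):

halfadder:~# xm console repos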

Duplicate VM creation

And what if the VM is already running? If you are trying to start it on the same VM host it is already running, you'll see the following:

halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
Error: Domain 'repos' already exists with ID '3'
halfadder:~# 

If it is running, but on the other VM server, well, trouble is likely going to take place. Although the VM servers are using the cluster file system, the individual VMs are not, and will likely not take kindly to concurrent accesses. So prevent headaches and take care not to start multiple copies of the same VM!

Shut down a VM

If we desire to shut down a VM, we can do so (and properly!) from the VM server command-line. Using the xm shutdown command, a shutdown signal is sent to the VM, and the machine shuts down just as if we had given it a “shutdown -h now” command.

Shutting down repos.offbyone.lan:

halfadder:~# xm shutdown repos
halfadder:~# 

After a bit, if you check the output of xm list, you will no longer see the VM in question listed. Once this condition is true, you can proceed with whatever operation is underway.
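
If a guest wedges and ignores the shutdown signal, xm also offers a forceful teardown; treat it as a last resort, since it is the equivalent of yanking the power cord:

halfadder:~# xm destroy repos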

Live Migrate a VM

One of the impressive features we have available with the use of DRBD and OCFS2 is a multi-master concurrent filesystem. This creates “shared storage”, which grants us some advantages.

Specifically, we can use our shared storage to enable migration of virtual machines between VM servers. What's more, we can perform a live migration, transparently (to anyone using the virtual machine) moving the VM to another physical host without interrupting its operation.

What follows is an example of a live migration, moving the www virtual machine, originally residing on sokraits:

Step 0: Verify the running VM

sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1383.1
irc                                          8   128     1     -b----      9.8
lab46                                       11   512     2     -b----     11.3
lab46db                                      6   128     1     -b----     14.6
mail                                         5   192     1     -b----     18.8
www                                          4   192     1     -b----    133.3
sokraits:~# 

So we see www is running on sokraits.

Step 1: Live migrate it to halfadder

sokraits:~# xm migrate --live www halfadder
sokraits:~# 

After only a few seconds, we get our prompt back.

Step 2: Verify www is no longer running on sokraits

Do another xm list on sokraits:

sokraits:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2344     4     r-----   1387.8
irc                                          8   128     1     -b----      9.9
lab46                                       11   512     2     -b----     11.4
lab46db                                      6   128     1     -b----     14.6
mail                                         5   192     1     -b----     19.0
sokraits:~# 

As you can see, www is no longer present in the VM list on sokraits.

Step 3: Check running VMs on halfadder

Switch over to halfadder, do a check:

halfadder:~# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  3305     1     r-----     65.0
auth                                         4   128     1     -b----      4.1
log                                          1   128     1     -b----      0.8
repos                                        5   128     1     -b----      8.3
web                                          2   128     1     -b----      2.9
www                                          6   192     1     -b----      0.4
halfadder:~# 

And voila! A successful live migration.

LRRDnode configuration

To facilitate administration, both sokraits and halfadder are configured as LRRDnode clients and log data that can be retrieved from LRRD at: http://web.offbyone.lan/lrrd/

Install lrrd-node

First step is to install the actual LAIR package:

BOTH:~# aptitude install lrrd-node
The following NEW packages will be installed:
  libstatgrab6{a} lrrd-node python-statgrab{a} 
0 packages upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 118 kB of archives. After unpacking 348 kB will be used.
Do you want to continue? [Y/n/?] 
Get:1 http://mirror/debian/ squeeze/main libstatgrab6 amd64 0.16-0.1 [57.6 kB]
Get:2 http://mirror/debian/ squeeze/main python-statgrab amd64 0.4-1.1+b2 [53.0 kB]
Get:3 http://mirror/lair/ squeeze/main lrrd-node all 1.0.7-1 [7,128 B]
Fetched 118 kB in 0s (9,978 kB/s)
Selecting previously deselected package libstatgrab6.
(Reading database ... 28935 files and directories currently installed.)
Unpacking libstatgrab6 (from .../libstatgrab6_0.16-0.1_amd64.deb) ...
Selecting previously deselected package python-statgrab.
Unpacking python-statgrab (from .../python-statgrab_0.4-1.1+b2_amd64.deb) ...
Setting up libstatgrab6 (0.16-0.1) ...
Setting up python-statgrab (0.4-1.1+b2) ...
Processing triggers for python-support ...
Selecting previously deselected package lrrd-node.
(Reading database ... 28961 files and directories currently installed.)
Unpacking lrrd-node (from .../lrrd-node_1.0.7-1_all.deb) ...
Setting up lrrd-node (1.0.7-1) ...
Adding lrrdNode to init.d
update-rc.d: using dependency based boot sequencing
insserv: warning: script 'lrrdnode' missing LSB tags and overrides
Running lrrdNode ...
Starting lrrdNode: stat collection thinger: Starting LRRD Node
lrrdNode
                                         
BOTH:~# 

Configure lrrd-node at LRRD

Once installed and running on the client side, we need to configure (or reconfigure, as the case may be) the node at LRRD.

So pop a browser over to: http://web.offbyone.lan/lrrd/

And log in (~root, punctuation-less ~root pass).

Click on the “Configure” link, and find the host in question (if it has prior history reporting to LRRD).

If found, note that it is Enabled, and click the “reconfigure” link to the right of the entry.

There's an option to delete existing databases (do it), and check off any appropriate network interfaces.

Manual lrrd-node restart

If it is discovered that data reporting ceases, and other components of the LRRD system are still deemed functioning, it is likely that the lrrd-node client needs a restart. Simply do the following on the machine in question:

sokraits:~# /etc/init.d/lrrdnode restart
Stopping lrrdNode: stat collection thinger: lrrdNode
Starting lrrdNode: stat collection thinger: Starting LRRD Node
lrrdNode
sokraits:~# 

Wait at least 5 minutes for data reporting to make it into graphable form.

Sync'ing to data store

Since we've been successfully running the systems out of a RAMdisk, care must be taken to preserve any changes in the event of a reboot or power failure.

rsync to disk

In this light, I first had the systems rsync'ing to their local SSD (boot drive). I rigged up a custom cronjob that runs 3 times a day. It looks as follows:

12 */8 *   *   *     (mkdir -p /tmp/sda1; mount /dev/sda1 /tmp/sda1; rsync -av --one-file-system / /tmp/sda1/; umount /tmp/sda1)

rsync to fileserver

This worked handily until sokraits lost its boot drive (again! in 2 months' time!), so I decided to investigate netbooting using an NFSroot.

In the process, I may have finally made a breakthrough in my longtime desire to put the entire system IN the initial ramdisk (so it would be running in RAM from the get-go). Turns out, according to the manual page, you merely have to put the system IN the initrd file… obviously one needs adequate memory (2x at boot: enough to hold the initrd, and enough to decompress it into).

My cron job changed as follows:

24 */8 *   *   *     (rsync -av --one-file-system / data.lair.lan:/export/tftpboot/netboot/halfadder/disk/)

I plan to rig up either some daily autogeneration of the initrd, or have a script on standby that I can use to make it (a sketch follows the commands below). This will then become the method of booting both sokraits and halfadder (and potentially freeing up a still-working SSD in the process, which I can use in data2).

On the fileserver, I then grab the latest copies of the hypervisor and kernel, and generate a new all-system initrd:

data1:/export/tftpboot/netboot/halfadder# cp disk/boot/xen-4.4-amd64.gz .
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/vmlinuz-3.16-2-amd64 linux
data1:/export/tftpboot/netboot/halfadder# cd disk
data1:/export/tftpboot/netboot/halfadder/disk# find . | cpio -c -o | gzip -9 > ../initrd.gz
data1:/export/tftpboot/netboot/halfadder/disk# 
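
A small wrapper script, kept on the fileserver, could cover the "script on standby" idea above; this is only a sketch that repeats the commands just shown (paths assume halfadder's netboot tree):

#!/bin/sh
# regenerate the all-in-one netboot initrd from the rsync'd system tree
set -e
NETBOOT=/export/tftpboot/netboot/halfadder

cp ${NETBOOT}/disk/boot/xen-4.4-amd64.gz ${NETBOOT}/
cp ${NETBOOT}/disk/boot/vmlinuz-3.16-2-amd64 ${NETBOOT}/linux
cd ${NETBOOT}/disk
find . | cpio -c -o | gzip -9 > ../initrd.gz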

pxeboot file for sokraits/halfadder

On the fileserver, in /export/tftpboot/pxelinux.cfg/, are two files, 0A50012E (sokraits) and 0A50012F (halfadder)… they are named according to each machine's IP address expressed in hex (0A50012E is 10.80.1.46; 0A50012F is 10.80.1.47).

The file(s) contain:

default netboot
prompt 1
timeout 2

label netboot
    kernel mboot.c32
    append netboot/halfadder/xen-4.4-amd64.gz --- netboot/halfadder/linux console=tty0 root=/dev/ram0 ro --- netboot/halfadder/initrd.gz

label memtest
    kernel distros/memtest/memtest86+

References

Xen

Xen on Squeeze

Xen Live Migration

Xen vif-common.sh fixes

Xen domU loses network

Nvidia forcedeth (MCP55)

MDADM

Volume coming up on md127 instead of md0

DRBD

DRBD+OCFS2

Debian from RAM

/tmp as noexec

netboot system to nfsroot