=====Overview=====

sokraits.offbyone.lan and halfadder.offbyone.lan are the two main Xen VM servers hosting virtual machines in the LAIR.

^ hostname ^ RAM ^ disk ^ swap ^ OS ^ Kernel ^
| sokraits.lair.lan | 8GB | 32GB (/) | 1.2GB | Debian 8.0 "Jessie" (AMD64) | 3.16-2-amd64 |
| ::: | ::: | 500GB + 500GB RAID1 (/dev/md0) | ::: | ::: | ::: |

^ hostname ^ RAM ^ disk ^ swap ^ OS ^ Kernel ^
| halfadder.lair.lan | 8GB | 32GB (/) | 1.2GB | Debian 8.0 "Jessie" (AMD64) | 3.16-2-amd64 |
| ::: | ::: | 500GB + 500GB RAID1 (/dev/md0) | ::: | ::: | ::: |

=====News=====

  * Installed new disks, installed Debian squeeze (20101117)
  * Restored old SSH keys
  * Reinstalled Sokraits to bring it up to standards (20101222)
  * Sokraits and Halfadder functional DRBD+OCFS2 peers, live migration working (20101223)
  * Updated xen-tools config and set up symlinks to create Debian jessie VMs (20140411)
  * Reinstalled sokraits with Debian Jessie, upgraded to 8GB of RAM (20140422)
  * Re-reinstalled sokraits with Debian Jessie, getting ready to deploy (20140703)
  * Re-re-reinstalled sokraits with Debian Wheezy -> Jessie, due to a failed boot drive (20140806)
  * Re-installed halfadder with Debian Jessie (20140806)
  * Re-re-re-reinstalled sokraits as a clone of halfadder, netbooting with the entire system in the initrd, due to (another) failed boot drive (20141004)

=====TODO=====

  * rig up ramdisk /var and /tmp with periodic writes (since we have an SSD /). The system now runs in a RAMdisk.
  * find 3.5" to 5.25" drive brackets and remount sokraits' data drives in the case.
  * on the next halfadder reboot, verify that the OCFS2 /export gets mounted automatically (last time I had to run "/etc/init.d/ocfs2 restart" for it to do this).

=====Network Configuration=====

====Overview====

^ Machine ^ Interface ^ IP Address ^ MAC Address ^
| sokraits.lair.lan | eth0 | 10.80.1.46 (lair.lan subnet) | 00:1a:92:cd:0b:1b |
| ::: | eth1 | offbyone.lan subnet | 00:1a:92:cd:05:d6 |
| ::: | eth2 | 172.16.1.1 (peer link) | 00:0a:cd:16:d9:ac |

^ Machine ^ Interface ^ IP Address ^ MAC Address ^
| halfadder.lair.lan | eth0 | 10.80.1.47 (lair.lan subnet) | 00:1a:92:cd:0a:7f |
| ::: | eth1 | offbyone.lan subnet | 00:1a:92:cd:06:60 |
| ::: | eth2 | 172.16.1.2 (peer link) | 00:0a:cd:16:d3:cd |

Both Sokraits and Halfadder are using their (once forbidden!) second network interfaces (to exist on both primary LAIR subnets), as well as an additional add-in PCI-e NIC (connected to each other with an over-the-ceiling, across-the-floor cable, for the specific purpose of performing DRBD and OCFS2 peer updates).

====Interfaces====

To ensure that all network interfaces come up as intended, we need to configure **/etc/network/interfaces** as follows (the example shown is sokraits'; halfadder differs only in the eth2 address, as noted in the comment):

<code>
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
#auto eth0
iface eth0 inet manual

iface eth1 inet manual

## management and lair.lan access through xenbr0
auto xenbr0
iface xenbr0 inet dhcp
    bridge_ports    eth0
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

## configure a (separate) bridge for the DomUs without giving Dom0 an IP on it
auto xenbr1
iface xenbr1 inet manual
    bridge_ports    eth1
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

auto eth2
iface eth2 inet static
    address 172.16.1.1    # halfadder assigns the address: 172.16.1.2
    netmask 255.255.255.0
</code>
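A quick way to confirm the bridge layout once the machine is back up (my own addition, not from the original notes) is to list the bridges with **brctl** from the bridge-utils package and check the assigned addresses:

<code>
sokraits:~# brctl show
sokraits:~# ip addr show xenbr0
sokraits:~# ip addr show xenbr1
</code>

**brctl show** should list xenbr0 with eth0 enslaved and xenbr1 with eth1 (plus a vif* interface per running DomU), while only xenbr0 should carry a DHCP-assigned lair.lan address; xenbr1 stays addressless on Dom0 by design.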
====udev rules.d====

Additionally, I found some probe order issues cropping up, so I had to manually pin down which interface was which on both sokraits and halfadder via their **/etc/udev/rules.d/70-persistent-net.rules** files.

===sokraits===

<code>
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:05:d6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0b:1b", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10ec:0x8168 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d9:ac", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
</code>

===halfadder===

<code>
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0a:7f", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:06:60", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10ec:/sys/devices/pci0000:00/0000:00:16.0/0000:06:00.0 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d3:cd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
</code>

=====apt configuration=====

====use LAIR apt proxy====

To reduce traffic caused by package transactions, I set up a proxy on the fileserver(s), so every client needs to configure itself appropriately. It turns out this can be done most easily by creating the file **/etc/apt/apt.conf.d/00apt-cacher-ng** with the following contents:

<code>
Acquire::http { Proxy "http://10.80.1.3:3142"; };
</code>
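To confirm apt actually picked up the proxy (my own addition, not from the original notes), dump apt's effective configuration; it should echo back the Acquire::http::Proxy value set above:

<code>
BOTH:~# apt-config dump | grep -i proxy
</code>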
====no recommends====

I wanted a small installation footprint, so I disabled the installation of recommended packages by default. To do so, create/edit **/etc/apt/apt.conf.d/99_norecommends** and put in the following:

<code>
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
</code>

This can also be used to remove recommended packages that were previously installed: run **aptitude**, press 'g', then 'g' again, and that should take care of business.

There are also some options that can be set in **aptitude** proper, via its console GUI (Options -> Preferences):

  * Uncheck (it already was) "Install recommended packages automatically"
  * Check "Automatically upgrade installed packages"
  * Check "Remove obsolete package files after downloading new package lists"

Useful URLs:

  * http://askubuntu.com/questions/351085/how-to-remove-recommended-and-suggested-dependencies-of-uninstalled-packages
  * http://askubuntu.com/questions/223811/how-to-apt-get-install-with-only-minimal-components-necessary-for-an-application

=====Packages=====

The following packages have been installed on both sokraits and halfadder:

<code>
bridge-utils lair-std lair-backup mdadm xen-linux-system xen-tools
drbd8-utils ocfs2-tools smartmontools firmware-realtek qemu-system-x86-64
</code>

=====GRUB Configuration=====

As specified in the Debian Xen Wiki ( https://wiki.debian.org/Xen#Prioritise_Booting_Xen_Over_Native ):

<code>
machine:~# dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen
Adding 'local diversion of /etc/grub.d/20_linux_xen to /etc/grub.d/08_linux_xen'
machine:~#
</code>

and then regenerate the grub config:

<code>
machine:~# update-grub
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
done
machine:~#
</code>

=====Xen Configuration=====

====Xend configuration====

The Xend config file (**/etc/xen/xend-config.sxp**) for this host is as follows:

<code>
# -*- sh -*-
#
# Xend configuration file.
#
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^10.80.2.42$ ^10.80.2.46$ ^10.80.2.47$ ^halfadder$ ^halfadder.offbyone.lan$ ^sokraits$ ^sokraits.offbyone.lan$ ^yourmambas$ ^yourmambas.offbyone.lan$ ^grrasp$ ^grrasp.offbyone.lan$')
(network-script network-bridge)
(vif-script vif-bridge)
(dom0-min-mem 196)
(enable-dom0-ballooning yes)
(total_available_memory 0)
(dom0-cpus 0)
(vnc-listen '10.80.1.46')
(vncpasswd '********')
(xend-domains-path /export/xen/xend/domains)    # be sure to create this directory
</code>

====local loopback====

As usual, if left to its own devices, only 8 loopback devices will be created by default. Don't forget to edit **/etc/modules** as follows and reboot:

<code>
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

firewire-sbp2
loop max_loop=255
#forcedeth max_interrupt_work=20 optimization_mode=1 poll_interval=100
nfs callback_tcpport=2049
</code>

I commented out **forcedeth** for now... it does load, but I don't know whether, with the current kernel, I need to specifically set those options. Time will tell.

====xen-tools====

Xen Tools appears to have been updated… it can now handle recent distributions!
The config file, **/etc/xen-tools/xen-tools.conf**, follows:

<code>
######################################################################
##
## Virtual Machine configuration
##
dir            = /xen
install-method = debootstrap
cache          = no

######################################################################
##
## Disk and Sizing options
##
size   = 4Gb      # Disk image size.
memory = 256Mb    # Memory size
swap   = 128Mb    # Swap size
fs     = ext4     # use the ext4 filesystem for the disk image.
dist   = jessie   # Default distribution to install.
images = full

######################################################################
##
## Network configuration
##
bridge  = xenbr1
dhcp    = 1
gateway = 10.80.2.1
netmask = 255.255.255.0

######################################################################
##
## Password configuration
##
passwd = 1

######################################################################
##
## Package Mirror configuration
##
arch           = amd64
mirror         = http://ftp.us.debian.org/debian/
mirror_squeeze = http://ftp.us.debian.org/debian/
mirror_wheezy  = http://ftp.us.debian.org/debian/
mirror_jessie  = http://ftp.us.debian.org/debian/

######################################################################
##
## Proxy Settings for repositories
##
apt_proxy = http://10.80.1.3:3142/

######################################################################
##
## Filesystem settings
##
ext4_options   = noatime,nodiratime,errors=remount-ro
ext3_options   = noatime,nodiratime,errors=remount-ro
ext2_options   = noatime,nodiratime,errors=remount-ro
xfs_options    = defaults
reiser_options = defaults

######################################################################
##
## Xen VM boot settings
##
pygrub = 1

# Filesystem options for the different filesystems we support.
#
ext4_options  = noatime,nodiratime,errors=remount-ro,data=writeback,barrier=0,commit=600
ext3_options  = noatime,nodiratime,errors=remount-ro
ext2_options  = noatime,nodiratime,errors=remount-ro
xfs_options   = defaults
btrfs_options = defaults

######################################################################
##
## Xen VM settings
##
serial_device = hvc0
disk_device   = xvda

######################################################################
##
## Xen configuration files
##
output    = /xen/conf
extension = .cfg
</code>
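With those defaults in place, a new DomU is created with a single **xen-create-image** run. The hostname below is just a placeholder (this is a sketch, not from the original notes); everything not given on the command line comes from xen-tools.conf, so the VM config should land in /xen/conf/ and the disk image under /xen:

<code>
halfadder:~# xen-create-image --hostname=newvm.offbyone.lan
</code>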
====xendomains config====

Since we're running Xen 4.0.1, there are some additional configuration options to tend to (along with squeeze likely better distributing functionality to specific files). **/etc/default/xendomains** is next… two changes need to be made:

<code>
## Type: string
## Default: /var/lib/xen/save
#
# Directory to save running domains to when the system (dom0) is
# shut down. Will also be used to restore domains from if
# XENDOMAINS_RESTORE
# is set (see below). Leave empty to disable domain saving on shutdown
# (e.g. because you rather shut domains down).
# If domain saving does succeed, SHUTDOWN will not be executed.
#
#XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SAVE=""
</code>

Basically, make **XENDOMAINS_SAVE** an empty string, and:

<code>
## Type: boolean
## Default: true
#
# This variable determines whether saved domains from XENDOMAINS_SAVE
# will be restored on system startup.
#
XENDOMAINS_RESTORE=false
</code>

**XENDOMAINS_RESTORE** should be set to **false**.

Finally, we set a directory for auto-starting VMs on dom0 boot:

<code>
# This variable sets the directory where domains configurations
# are stored that should be started on system startup automatically.
# Leave empty if you don't want to start domains automatically
# (or just don't place any xen domain config files in that dir).
# Note that the script tries to be clever if both RESTORE and AUTO are
# set: It will first restore saved domains and then only start domains
# in AUTO which are not running yet.
# Note that the name matching is somewhat fuzzy.
#
XENDOMAINS_AUTO=/xen/conf/auto
</code>

=====MD array configuration=====

The purpose of the disk array is to provide RAID1 (mirrored) storage for the Xen VM images.

====Re-initializing====

As we've had functioning RAID volumes for years, I thought I would re-do the arrays so as to take advantage of any new version features (when I first created them, the superblock metadata was version 0.90; the current default is 1.2).

So, I first stopped the array:

<code>
sokraits:~# mdadm --stop /dev/md0
</code>

Then, I zeroed out the superblocks on both constituent drives:

<code>
sokraits:~# mdadm --zero-superblock /dev/sdb
sokraits:~# mdadm --zero-superblock /dev/sdc
</code>

Now we can proceed with creating the new array.

====creating /dev/md0====

I opted to build the array straight to disk, no messing with partition tables.

<code>
halfadder:~# mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb /dev/sdc
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc appears to be part of a raid array:
    level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
halfadder:~#
</code>

====checking disk array status====

To check the status:

<code>
halfadder:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc[1] sdb[0]
      488385424 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  8.9% (43629696/488385424) finish=56.9min speed=130132K/sec

unused devices: <none>
halfadder:~#
</code>

Usually (when it has finished building and all is in order) it'll look something like:

<code>
halfadder:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb[0] sdc[1]
      488385424 blocks super 1.2 [2/2] [UU]

unused devices: <none>
halfadder:~#
</code>
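As a shortcut for the next step (standard mdadm usage, not from the original notes), **mdadm** can print a ready-made ARRAY line, UUID included, that can be pasted straight into mdadm.conf; the output should look along these lines (halfadder shown):

<code>
halfadder:~# mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=halfadder:0 UUID=c846eb24:6b9783db:cd9b436c:8470fd46
</code>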
====Setting /etc/mdadm/mdadm.conf====

To avoid oddities (such as /dev/md0 coming up as /dev/md127 and confusing everything) on subsequent boots, we should set up the **/etc/mdadm/mdadm.conf** file accordingly. Assuming hardware is in identical places device-wise, the only data unique to each peer is the hostname and the md0 UUID, as is seen in the following:

===sokraits's mdadm.conf===

<code>
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST sokraits

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=731663e3:d5fd45ac:157baa06:11018534 name=sokraits:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
</code>

===halfadder's mdadm.conf===

<code>
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST halfadder

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=c846eb24:6b9783db:cd9b436c:8470fd46 name=halfadder:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
</code>

===How to find the local md volume UUID===

To obtain the UUID generated for the md volume, simply run the following (it is unique per host):

<code>
halfadder:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Aug  7 11:50:06 2014
     Raid Level : raid1
     Array Size : 488255488 (465.64 GiB 499.97 GB)
  Used Dev Size : 488255488 (465.64 GiB 499.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Aug  7 13:11:52 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : halfadder:0  (local to host halfadder)
           UUID : c846eb24:6b9783db:cd9b436c:8470fd46
         Events : 979

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
halfadder:~#
</code>

You'll see the **UUID** listed. Just copy this into **/etc/mdadm/mdadm.conf** in the appropriate place, as indicated by the above config files, to ensure the proper identification of the MD array.

===After configuring mdadm.conf===

According to the information in **/usr/share/doc/mdadm/README.upgrading-2.5.3.gz**, once we configure the **/etc/mdadm/mdadm.conf** file, we must let the system know and rebuild the initial ramdisk:

<code>
BOTH:~# update-initramfs -t -u -k all
update-initramfs: Generating /boot/initrd.img-2.6.32-5-xen-amd64
update-initramfs: Generating /boot/initrd.img-2.6.32-5-amd64
BOTH:~#
</code>

=====DRBD=====

In order to have the "shared" storage that allows OCFS2 to work, we'll set up DRBD to constantly sync the volumes between sokraits and halfadder. With the tools installed, we need to configure some files.

====/etc/drbd.d/global_common.conf BEFORE====

First up, we need to get the peers talking so we can form the volume and get OCFS2 established. Let's make the **/etc/drbd.d/global_common.conf** file look as follows:

<code>
global {
        usage-count no;
}

common {
        startup {
                wfc-timeout 60;
                degr-wfc-timeout 60;
        }

        disk {
                on-io-error detach;
        }

        syncer {
                rate 40M;
        }

        protocol C;
}
</code>

This is only an intermediate step. Further changes are needed before we can bring it up in dual-primary mode.
====/etc/drbd.d/xen_data.res====

And the resource configuration (the same on both peers; this one doesn't need to change):

<code>
resource xen_data {
        device    /dev/drbd0;
        disk      /dev/md0;
        meta-disk internal;

        on sokraits {
                address 172.16.1.1:7788;
        }

        on halfadder {
                address 172.16.1.2:7788;
        }
}
</code>

====bootstrapping DRBD====

Getting DRBD initially up-and-running has always had a bit of voodoo behind it... trying a number of commands and eventually stumbling upon something that works. I may have finally gotten the procedure down:

===TO DO ON BOTH PEERS===

These identical steps are run on both peers:

<code>
BOTH:~# drbdadm create-md xen_data
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
BOTH:~# modprobe drbd
BOTH:~# drbdadm attach xen_data
BOTH:~# drbdadm syncer xen_data
BOTH:~# drbdadm connect xen_data
BOTH:~#
</code>

At this point we should be able to do the following:

<code>
EITHER:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:488370480
EITHER:~#
</code>

===TO DO ON ONLY ONE===

Once we see a semblance of communication (the "Secondary/Secondary" in the /proc/drbd output, for example), we can kick the two peers into operation. This next step must only take place on **ONE** of the peers. I picked **sokraits** for this example... but it really doesn't matter which:

<code>
sokraits:~# drbdadm -- --overwrite-data-of-peer primary xen_data
</code>

Of course, this isn't so willy-nilly if one of the peers has the more up-to-date copy of the data.

We can then view /proc/drbd and see output like:

<code>
sokraits:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:361016 nr:0 dw:0 dr:367944 al:0 bm:21 lo:15 pe:76 ua:225 ap:0 ep:1 wo:b oos:488011888
        [>....................] sync'ed:  0.1% (476572/476924)M
        finish: 2:16:00 speed: 59,764 (59,764) K/sec
sokraits:~#
</code>

Checking this occasionally (on either peer) will show the progress.

====formatting the array====

To format the volume, we ignore the underlying disks and address **/dev/drbd0** at all times. The **OCFS2** filesystem was put on the disk array:

<code>
halfadder:~# mkfs.ocfs2 -v -L datastore -N 4 -T datafiles /dev/drbd0
mkfs.ocfs2 1.4.4
Cluster stack: classic o2cb
Filesystem Type of datafiles
Label: datastore
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 1048576 (20 bits)
Volume size: 500105740288 (476938 clusters) (122096128 blocks)
Cluster groups: 15 (tail covers 25354 clusters, rest cover 32256 clusters)
Extent allocator size: 377487360 (90 groups)
Journal size: 33554432
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 5 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Writing lost+found: done
mkfs.ocfs2 successful

halfadder:~#
</code>

By default, if no **-N #** argument is specified when formatting the filesystem, a maximum of 8 machines can simultaneously mount the volume. The intent is for just two machines (sokraits and halfadder) to be the only machines ever mounting this volume.
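Before wiring up the cluster, a quick sanity check (my own addition; **mounted.ocfs2** ships with ocfs2-tools) should show /dev/drbd0 carrying an OCFS2 filesystem with the **datastore** label and its UUID:

<code>
halfadder:~# mounted.ocfs2 -d
</code>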
=====OCFS2=====

Because sokraits and halfadder will exist in a primary-primary peer relationship, we need to run a cluster-aware filesystem on our shared volume. Although many exist, the one we've had any amount of prior experience with is OCFS2, so it is redeployed here.

====configuring OCFS2====

The following should be put in **/etc/default/o2cb**:

<code>
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running 'dpkg-reconfigure ocfs2-tools'.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=datastore

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max. time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min. time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000
</code>

Next we need to configure the OCFS2 cluster (**/etc/ocfs2/cluster.conf**):

<code>
node:
        ip_port = 7777
        ip_address = 172.16.1.1
        number = 0
        name = sokraits
        cluster = datastore

node:
        ip_port = 7777
        ip_address = 172.16.1.2
        number = 1
        name = halfadder
        cluster = datastore

cluster:
        node_count = 2
        name = datastore
</code>

====/etc/drbd.d/global_common.conf AFTER OCFS2 is ready====

Once the other prerequisites are taken care of, we can bring the OCFS2 cluster up in dual-primary mode, as the following config file allows for. Duplicate this on both peers:

<code>
global {
        usage-count no;
}

common {
        startup {
                wfc-timeout 60;
                degr-wfc-timeout 60;
                become-primary-on both;
        }

        disk {
                on-io-error detach;
        }

        net {
                allow-two-primaries yes;
        }

        syncer {
                rate 80M;
        }

        protocol C;
}
</code>

This **/etc/drbd.d/global_common.conf** file needs to be identical and present on BOTH DRBD peers. Recognizing the changes does not require a reboot! The following command (run on both DRBD peers) will update the config:

<code>
machine:~# drbdadm adjust xen_data
</code>

====Bringing OCFS2 online====

Assuming **/etc/ocfs2/cluster.conf** and **/etc/default/o2cb** are configured and identical, we can now establish OCFS2 cluster connectivity. These steps take place on **BOTH** peers.

===Step 0: kernel module===

<code>
BOTH:~# modprobe ocfs2
</code>

===Step 1: o2cb service===

Next, we need to bring the **o2cb** service online:

<code>
BOTH:~# /etc/init.d/o2cb start
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster datastore: OK
</code>

===Step 2: ocfs2 bits===

Whatever other functionality there is related to OCFS2, time to bring it on-line as well:

<code>
BOTH:~# /etc/init.d/ocfs2 start
BOTH:~#
</code>

===Step 3: mount the volume===

Assuming all is in order, we can now mount our volume:

<code>
BOTH:~# mount -t ocfs2 /dev/drbd0 /export
BOTH:~# df
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              27G  1.2G   25G   5% /
tmpfs                 1.7G     0  1.7G   0% /lib/init/rw
udev                  1.6G  160K  1.6G   1% /dev
tmpfs                 1.7G     0  1.7G   0% /dev/shm
/dev/drbd0            466G  1.6G  465G   1% /export
BOTH:~#
</code>

=====Local Modifications=====

====Automating the mount in /etc/fstab====

We can have the system automatically mount our volume on boot by appending an appropriate entry to the bottom of **/etc/fstab**:

<code>
/dev/drbd0      /export         ocfs2   noatime         0       0
</code>

====Turn swappiness way down====

The Linux default for swappiness is 60, which will result in the system paging stuff out to swap. Cranking it down to 10 seems a more prudent setting, especially on systems with SSDs, where we want swap exercised as little as possible. I added the following line to **/etc/sysctl.conf** on both systems:

<code>
vm.swappiness = 10
</code>
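The sysctl.conf entry takes effect at the next boot; to apply it immediately (standard sysctl usage, not from the original notes):

<code>
BOTH:~# sysctl -w vm.swappiness=10
vm.swappiness = 10
</code>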
====/tmp in RAM====

Seems Debian has nice built-in support for mounting /tmp in tmpfs (a RAM-backed filesystem). All you need to do is edit **/etc/default/tmpfs** and uncomment/change the following line:

<code>
RAMTMP=yes
</code>

And reboot!

====integrating the array's storage into the system====

The disk array will hold the Xen virtual machine images (and supporting files), and also serve as another backup destination for resources in the LAIR. The following directories have been created:

  * /export - the array's main mountpoint
  * /xen - location of Xen data (symlink to /export/xen)
  * /backup - location of backup data (symlink to /export/backup)

====Historical: Configuring xen-tools to create Debian jessie VMs====

**This is no longer needed, but may well be in the future.**

There are two changes needed to successfully create jessie VMs, and both are symlinks:

===Enable debootstrap to work with jessie===

<code>
halfadder:~# cd /usr/share/debootstrap
halfadder:/usr/share/debootstrap# ln -s sid jessie
</code>

===Enable xen-tools to recognize jessie as a valid distro===

<code>
halfadder:~# cd /usr/lib/xen-tools
halfadder:/usr/lib/xen-tools# ln -s debian.d jessie.d
</code>

=====Xen in Operation=====

====Overview====

Sokraits and Halfadder serve as the production virtual machine servers in the LAIR, predominantly for the offbyone.lan network, but also providing services used on lair.lan, student.lab, and across the BITS universe. Running the open source Xen hypervisor (version 4.0.1), around a dozen virtual machines are instantiated.

====Xen administration====

From the VM server, we can adjust various properties and control aspects of virtual machines via the **xm** command-line tool.

====What's running====

To determine what is running on a particular VM host, we use the **xm list** command. Here's an example of its output (the list of VMs can and will vary):

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1377.5
irc                                8   128     1     -b----      9.2
lab46                             11   512     2     -b----     10.6
lab46db                            6   128     1     -b----     14.3
mail                               5   192     1     -b----     18.0
www                                4   192     1     -b----    129.2
sokraits:~#
</code>

This shows us the VMs currently running locally on this VM server. If the VM you are looking for is believed to be running but is not in this list, it is likely running on the other VM server.
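Since any given VM could be on either host, a quick way to hunt one down (my own sketch; it assumes root SSH between the peers is set up) is to ask both servers at once:

<code>
sokraits:~# for h in sokraits halfadder; do echo "== $h =="; ssh $h xm list; done
</code>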
====Boot a VM====

To start a VM that is not presently running, assuming all prerequisites are met (an operational VM image exists, a correct configuration file, available resources (mainly memory)), we can use **xm** to create an instantiation of the virtual machine. In this example, we will start the VM for **repos.offbyone.lan**:

<code>
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-25lenny1) (dannf@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Sep 16 16:32:15 UTC 2010
[    0.000000] Command line: root=/dev/xvda1 ro ip=:127.0.255.255::::eth0:dhcp clocksource=jiffies
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 0000000008800000 (usable)
[    0.000000] max_pfn_mapped = 34816
[    0.000000] init_memory_mapping
...
Starting periodic command scheduler: crond.

Debian GNU/Linux 5.0 repos.offbyone.lan hvc0

repos.offbyone.lan login:
</code>

Due to the **-c** argument we gave **xm** when creating the virtual machine, we are connected to the console of this virtual machine, allowing us to see it boot. Omit the **-c** from the **xm** command line and the machine will still start, but we'll simply be returned to the command prompt.

====Detaching from VM console====

In this scenario, we'll want to issue a: **CTRL-]**

Once you do that, you'll escape from the VM's console and be returned to the prompt on the VM server.

====Duplicate VM creation====

And what if the VM is already running? If you are trying to start it on the same VM host it is already running on, you'll see the following:

<code>
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
Error: Domain 'repos' already exists with ID '3'
halfadder:~#
</code>

If it is running, but on the other VM server, trouble is likely going to take place. Although the VM servers are using the cluster file system, the individual VMs are not, and will likely not take kindly to concurrent accesses. So prevent headaches and take care not to start multiple copies of the same VM!

====Shut down a VM====

If we desire to shut down a VM, we can do so (and properly!) from the VM server command-line. Using the **xm shutdown** command, a shutdown signal is sent to the VM, and the machine shuts down just as if we had given it a "**shutdown -h now**" command.

Shutting down **repos.offbyone.lan**:

<code>
halfadder:~# xm shutdown repos
halfadder:~#
</code>

After a bit, if you check the output of **xm list**, you will no longer see the VM in question listed. Once this condition is true, you can proceed with whatever operation is underway.

====Live Migrate a VM====

One of the impressive features we have available with the use of DRBD and OCFS2 is a multi-master concurrent filesystem. This creates "shared storage", which grants us some advantages. Specifically, we can use our shared storage to enable migration of virtual machines between VM servers. What's more, we can perform a **live** migration, transparently (to anyone using the virtual machine) moving the VM to another physical host without interrupting its operation.

What follows is an example of a live migration, migrating the **www** virtual machine, originally residing on sokraits:

===Step 0: Verify the running VM===

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1383.1
irc                                8   128     1     -b----      9.8
lab46                             11   512     2     -b----     11.3
lab46db                            6   128     1     -b----     14.6
mail                               5   192     1     -b----     18.8
www                                4   192     1     -b----    133.3
sokraits:~#
</code>

So we see **www** is running on sokraits.

===Step 1: Live migrate it to halfadder===

<code>
sokraits:~# xm migrate --live www halfadder
sokraits:~#
</code>

After only a few seconds, we get our prompt back.

===Step 2: Verify www is no longer running on sokraits===

Do another **xm list** on sokraits:

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1387.8
irc                                8   128     1     -b----      9.9
lab46                             11   512     2     -b----     11.4
lab46db                            6   128     1     -b----     14.6
mail                               5   192     1     -b----     19.0
sokraits:~#
</code>

As you can see, **www** is no longer present in the VM list on sokraits.

===Step 3: Check running VMs on halfadder===

Switch over to halfadder and do a check:

<code>
halfadder:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  3305     1     r-----     65.0
auth                               4   128     1     -b----      4.1
log                                1   128     1     -b----      0.8
repos                              5   128     1     -b----      8.3
web                                2   128     1     -b----      2.9
www                                6   192     1     -b----      0.4
halfadder:~#
</code>

And voila! A successful live migration.
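Building on this, when one of the VM servers needs to come down for maintenance, everything can be evacuated in one pass. This is a sketch of my own (not from the original notes); it assumes it is run on the host being emptied (sokraits here) and that the other host has enough free memory for all of the incoming guests:

<code>
sokraits:~# for vm in $(xm list | awk 'NR>1 && $1 != "Domain-0" {print $1}'); do xm migrate --live $vm halfadder; done
</code>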
=====LRRDnode configuration=====

To facilitate administration, both sokraits and halfadder are configured as LRRDnode clients and log data that can be retrieved from LRRD at: http://web.offbyone.lan/lrrd/

====Install lrrd-node====

The first step is to install the actual LAIR package:

<code>
BOTH:~# aptitude install lrrd-node
The following NEW packages will be installed:
  libstatgrab6{a} lrrd-node python-statgrab{a}
0 packages upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 118 kB of archives. After unpacking 348 kB will be used.
Do you want to continue? [Y/n/?]
Get:1 http://mirror/debian/ squeeze/main libstatgrab6 amd64 0.16-0.1 [57.6 kB]
Get:2 http://mirror/debian/ squeeze/main python-statgrab amd64 0.4-1.1+b2 [53.0 kB]
Get:3 http://mirror/lair/ squeeze/main lrrd-node all 1.0.7-1 [7,128 B]
Fetched 118 kB in 0s (9,978 kB/s)
Selecting previously deselected package libstatgrab6.
(Reading database ... 28935 files and directories currently installed.)
Unpacking libstatgrab6 (from .../libstatgrab6_0.16-0.1_amd64.deb) ...
Selecting previously deselected package python-statgrab.
Unpacking python-statgrab (from .../python-statgrab_0.4-1.1+b2_amd64.deb) ...
Setting up libstatgrab6 (0.16-0.1) ...
Setting up python-statgrab (0.4-1.1+b2) ...
Processing triggers for python-support ...
Selecting previously deselected package lrrd-node.
(Reading database ... 28961 files and directories currently installed.)
Unpacking lrrd-node (from .../lrrd-node_1.0.7-1_all.deb) ...
Setting up lrrd-node (1.0.7-1) ...
Adding lrrdNode to init.d
update-rc.d: using dependency based boot sequencing
insserv: warning: script 'lrrdnode' missing LSB tags and overrides
Running lrrdNode ...
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
BOTH:~#
</code>

====Configure lrrd-node at LRRD====

Once installed and running on the client side, we need to configure (or reconfigure, as the case may be) the node at LRRD. So pop a browser over to: http://web.offbyone.lan/lrrd/

Log in (~root, punctuation-less ~root pass), click on the "Configure" link, and find the host in question (if it has prior history reporting to LRRD). If found, check that it is Enabled, and click the "reconfigure" link to the right of the entry. There is an option to delete existing databases (do it), and check off any appropriate network interfaces.

====Manual lrrd-node restart====

If data reporting ceases while other components of the LRRD system still appear to be functioning, it is likely that the lrrd-node client needs a restart. Simply do the following on the machine in question:

<code>
sokraits:~# /etc/init.d/lrrdnode restart
Stopping lrrdNode: stat collection thinger: lrrdNode
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
sokraits:~#
</code>

Wait at least 5 minutes for data reporting to make it into graphable form.

=====Sync'ing to data store=====

Since we've been successfully running the systems out of a RAMdisk, care must be taken to preserve any changes in the event of a reboot or power failure.

====rsync to disk====

In this light, I first had the systems rsync'ing to their local SSD (boot drive). I rigged up a custom cronjob that runs 3 times a day. It looks as follows:

<code>
12 */8 * * * (mkdir -p /tmp/sda1; mount /dev/sda1 /tmp/sda1; rsync -av --one-file-system / /tmp/sda1/; umount /tmp/sda1)
</code>

====rsync to fileserver====

This worked handily until sokraits lost its boot drive (again! in 2 months' time!), so I decided to investigate netbooting using an NFSroot.
In the process, I may have finally made a breakthrough in my longtime desire to put the entire system IN the initial ramdisk (so it would be running in RAM from the get-go). It turns out, according to the manual page, you merely have to put the system IN the initrd file... obviously one needs adequate memory (2x at boot: enough to hold the initrd, and enough to decompress it into). My cron job changed as follows:

<code>
24 */8 * * * (rsync -av --one-file-system / data.lair.lan:/export/tftpboot/netboot/halfadder/disk/)
</code>

I plan to rig up either some daily autogeneration of the initrd, or have a script on standby that can be used to make it. This will then become the method of booting both sokraits and halfadder (and potentially freeing up a still-working SSD in the process! Which I can use in data2).

On the fileserver, I then obtain the latest copy of the hypervisor and kernel, and generate a new all-system initrd:

<code>
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/xen-4.4-amd64.gz .
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/vmlinuz-3.16-2-amd64 linux
data1:/export/tftpboot/netboot/halfadder# cd disk
data1:/export/tftpboot/netboot/halfadder/disk# find . | cpio -c -o | gzip -9 > ../initrd.gz
data1:/export/tftpboot/netboot/halfadder/disk#
</code>

====pxeboot file for sokraits/halfadder====

On the fileserver, in **/export/tftpboot/pxelinux.cfg/**, are two files, **0A50012E** (sokraits) and **0A50012F** (halfadder)... they are named according to the machine's IP address (only in hex). The file(s) contain:

<code>
default netboot
prompt 1
timeout 2

label netboot
        kernel mboot.c32
        append netboot/halfadder/xen-4.4-amd64.gz --- netboot/halfadder/linux console=tty0 root=/dev/ram0 ro --- netboot/halfadder/initrd.gz

label memtest
        kernel distros/memtest/memtest86+
</code>
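The hex filenames are simply the host's IP address written as four hexadecimal octets; a one-liner to compute them (my own sketch, plain printf, run anywhere):

<code>
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 46     # sokraits, 10.80.1.46
0A50012E
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 47     # halfadder, 10.80.1.47
0A50012F
</code>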
=====References=====

====Xen====

===Xen on Squeeze===
  * http://wiki.debian.org/Xen

===Xen Live Migration===
  * http://www.linux.com/archive/feed/55773

===Xen vif-common.sh fixes===
  * http://xen.1045712.n5.nabble.com/PATCH-vif-common-sh-prevent-physdev-match-using-physdev-out-in-the-OUTPUT-FORWARD-and-POSTROUTING-che-td3255945.html
  * http://www.gossamer-threads.com/lists/xen/devel/189692
  * http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=571634#10

===Xen domU loses network===
  * http://xen.1045712.n5.nabble.com/domU-loses-network-after-a-while-td3265172.html
  * http://lists.xensource.com/archives/html/xen-users/2010-09/msg00026.html
  * http://blog.foaa.de/2009/11/hanging-network-in-xen-with-bridging/
  * http://www.gossamer-threads.com/lists/xen/users/183736

====Nvidia forcedeth (MCP55)====
  * https://bugs.launchpad.net/ubuntu/+source/linux/+bug/136836/

====MDADM====

===Volume coming up on md127 instead of md0===
  * http://www.spinics.net/lists/raid/msg30175.html

====DRBD====
  * http://en.gentoo-wiki.com/wiki/Active-active_DRBD_with_OCFS2
  * http://www.howtoforge.com/drbd-8.3-third-node-replication-with-debian-etch
  * http://blog.friedland.id.au/2010/08/setting-up-highly-available-nfs-cluster.html

====DRBD+OCFS2====
  * http://www.clusterlabs.org/wiki/Dual_Primary_DRBD_%2B_OCFS2

====Debian from RAM====
  * http://reboot.pro/topic/14547-linux-load-your-root-partition-to-ram-and-boot-it/
  * debirf:
    * http://cmrg.fifthhorseman.net/wiki/debirf
  * http://www.sphaero.org/blog:2012:0114_running_debian_from_ram

====/tmp as noexec====
  * http://www.debian-administration.org/articles/57

====netboot system to nfsroot====
  * http://www.iram.fr/~blanchet/tutorials/read-only_diskless_debian7.pdf
    * this led me to the initrd man page, which indicated we might be able to stick the entire system in the initrd and PXE boot that. So many things become simpler at that point.