=====Overview=====

sokraits.offbyone.lan and halfadder.offbyone.lan are the two main Xen VM servers hosting virtual machines in the LAIR.

^ hostname ^ RAM ^ disk ^ swap ^ OS ^ Kernel ^
| sokraits.lair.lan | 8GB | 32GB (/) | 1.2GB | Debian 8.0 "Jessie" (AMD64) | 3.16-2-amd64 |
| ::: | ::: | 500GB + 500GB RAID1 (/dev/md0) | ::: | ::: | ::: |

^ hostname ^ RAM ^ disk ^ swap ^ OS ^ Kernel ^
| halfadder.lair.lan | 8GB | 32GB (/) | 1.2GB | Debian 8.0 "Jessie" (AMD64) | 3.16-2-amd64 |
| ::: | ::: | 500GB + 500GB RAID1 (/dev/md0) | ::: | ::: | ::: |

=====News=====

  * Installed new disks, installed Debian squeeze (20101117)
  * Restored old SSH keys
  * Reinstalled Sokraits to bring it up to standards (20101222)
  * Sokraits and Halfadder functional DRBD+OCFS2 peers, live migration working (20101223)
  * Updated xen-tools config and set up symlinks to create Debian jessie VMs (20140411)
  * Reinstalled sokraits with Debian Jessie, upgraded to 8GB of RAM (20140422)
  * Re-reinstalled sokraits with Debian Jessie, getting ready to deploy (20140703)
  * Re-re-reinstalled sokraits with Debian Wheezy -> Jessie, due to a failed boot drive (20140806)
  * Re-installed halfadder with Debian Jessie (20140806)
  * Re-re-re-reinstalled sokraits as a clone of halfadder, netbooting with the entire system in the initrd, due to (another) failed boot drive (20141004)

=====TODO=====

  * rig up ramdisk /var and /tmp with periodic writes (since we have an SSD /). The system now runs in a RAMdisk.
  * find 3.5" to 5.25" drive brackets and remount sokraits' data drives in the case.
  * on the next halfadder reboot, verify that the OCFS2 /export gets mounted automatically (last time I had to run "/etc/init.d/ocfs2 restart" for it to do this).

=====Network Configuration=====

====Overview====

^ Machine ^ Interface ^ IP Address ^ MAC Address ^
| sokraits.lair.lan | eth0 | 10.80.1.46 (lair.lan subnet) | 00:1a:92:cd:0b:1b |
| ::: | eth1 | offbyone.lan subnet | 00:1a:92:cd:05:d6 |
| ::: | eth2 | 172.16.1.1 (peer link) | 00:0a:cd:16:d9:ac |

^ Machine ^ Interface ^ IP Address ^ MAC Address ^
| halfadder.lair.lan | eth0 | 10.80.1.47 (lair.lan subnet) | 00:1a:92:cd:0a:7f |
| ::: | eth1 | offbyone.lan subnet | 00:1a:92:cd:06:60 |
| ::: | eth2 | 172.16.1.2 (peer link) | 00:0a:cd:16:d3:cd |

Both Sokraits and Halfadder are using their (once forbidden!) second network interfaces (to exist on both primary LAIR subnets), as well as an additional add-in PCI-e NIC (connected to each other with an over-the-ceiling, across-the-floor cable, for the specific purpose of performing DRBD and OCFS2 peer updates).

====Interfaces====

To ensure that all network interfaces come up as intended, we need to configure **/etc/network/interfaces** as follows (the example shown is sokraits'; halfadder differs only in the eth2 address, as noted in the comment):

<code>
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
#auto eth0
iface eth0 inet manual

iface eth1 inet manual

## management and lair.lan access through xenbr0
auto xenbr0
iface xenbr0 inet dhcp
    bridge_ports    eth0
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

## configure a (separate) bridge for the DomUs without giving Dom0 an IP on it
auto xenbr1
iface xenbr1 inet manual
    bridge_ports    eth1
    bridge_stp      off   # disable Spanning Tree Protocol
    bridge_waitport 0     # no delay before a port becomes available
    bridge_fd       0     # no forwarding delay

auto eth2
iface eth2 inet static
    address 172.16.1.1    # halfadder assigns the address: 172.16.1.2
    netmask 255.255.255.0
</code>
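A quick way to confirm the bridge layout once the machine is back up (my own addition, not from the original notes) is to list the bridges with **brctl** from the bridge-utils package and check the assigned addresses:

<code>
sokraits:~# brctl show
sokraits:~# ip addr show xenbr0
sokraits:~# ip addr show xenbr1
</code>

**brctl show** should list xenbr0 with eth0 enslaved and xenbr1 with eth1 (plus a vif* interface per running DomU), while only xenbr0 should carry a DHCP-assigned lair.lan address; xenbr1 stays addressless on Dom0 by design.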
====udev rules.d====

Additionally, I found some probe order issues cropping up, so I had to manually pin down which interface was which on both sokraits and halfadder via their **/etc/udev/rules.d/70-persistent-net.rules** files.

===sokraits===

<code>
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:05:d6", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0b:1b", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10ec:0x8168 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d9:ac", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
</code>

===halfadder===

<code>
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:11.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:0a:7f", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x10de:/sys/devices/pci0000:00/0000:00:12.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:1a:92:cd:06:60", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x10ec:/sys/devices/pci0000:00/0000:00:16.0/0000:06:00.0 (r8169)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0a:cd:16:d3:cd", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
</code>

=====apt configuration=====

====use LAIR apt proxy====

To reduce traffic caused by package transactions, I set up a proxy on the fileserver(s), so every client needs to configure itself appropriately. It turns out this can be done most easily by creating the file **/etc/apt/apt.conf.d/00apt-cacher-ng** with the following contents:

<code>
Acquire::http { Proxy "http://10.80.1.3:3142"; };
</code>
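To confirm apt actually picked up the proxy (my own addition, not from the original notes), dump apt's effective configuration; it should echo back the Acquire::http::Proxy value set above:

<code>
BOTH:~# apt-config dump | grep -i proxy
</code>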
====no recommends====

I wanted a small installation footprint, so I disabled the installation of recommended packages by default. To do so, create/edit **/etc/apt/apt.conf.d/99_norecommends** and put in the following:

<code>
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
</code>

This can also be used to remove recommended packages that were previously installed: run **aptitude**, press 'g', then 'g' again, and that should take care of business.

There are also some options that can be set in **aptitude** proper, via its console GUI (Options -> Preferences):

  * Uncheck (it already was) "Install recommended packages automatically"
  * Check "Automatically upgrade installed packages"
  * Check "Remove obsolete package files after downloading new package lists"

Useful URLs:

  * http://askubuntu.com/questions/351085/how-to-remove-recommended-and-suggested-dependencies-of-uninstalled-packages
  * http://askubuntu.com/questions/223811/how-to-apt-get-install-with-only-minimal-components-necessary-for-an-application

=====Packages=====

The following packages have been installed on both sokraits and halfadder:

<code>
bridge-utils lair-std lair-backup mdadm xen-linux-system xen-tools
drbd8-utils ocfs2-tools smartmontools firmware-realtek qemu-system-x86-64
</code>

=====GRUB Configuration=====

As specified in the Debian Xen Wiki ( https://wiki.debian.org/Xen#Prioritise_Booting_Xen_Over_Native ):

<code>
machine:~# dpkg-divert --divert /etc/grub.d/08_linux_xen --rename /etc/grub.d/20_linux_xen
Adding 'local diversion of /etc/grub.d/20_linux_xen to /etc/grub.d/08_linux_xen'
machine:~#
</code>

and then regenerate the grub config:

<code>
machine:~# update-grub
Generating grub.cfg ...
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
Found linux image: /boot/vmlinuz-3.14-1-amd64
Found initrd image: /boot/initrd.img-3.14-1-amd64
done
machine:~#
</code>

=====Xen Configuration=====

====Xend configuration====

The Xend config file (**/etc/xen/xend-config.sxp**) for this host is as follows:

<code>
# -*- sh -*-
#
# Xend configuration file.
#
(xend-relocation-server yes)
(xend-relocation-port 8002)
(xend-address '')
(xend-relocation-address '')
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^10.80.2.42$ ^10.80.2.46$ ^10.80.2.47$ ^halfadder$ ^halfadder.offbyone.lan$ ^sokraits$ ^sokraits.offbyone.lan$ ^yourmambas$ ^yourmambas.offbyone.lan$ ^grrasp$ ^grrasp.offbyone.lan$')
(network-script network-bridge)
(vif-script vif-bridge)
(dom0-min-mem 196)
(enable-dom0-ballooning yes)
(total_available_memory 0)
(dom0-cpus 0)
(vnc-listen '10.80.1.46')
(vncpasswd '********')
(xend-domains-path /export/xen/xend/domains)    # be sure to create this directory
</code>

====local loopback====

As usual, if left to its own devices, only 8 loopback devices will be created by default. Don't forget to edit **/etc/modules** as follows and reboot:

<code>
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.

firewire-sbp2
loop max_loop=255
#forcedeth max_interrupt_work=20 optimization_mode=1 poll_interval=100
nfs callback_tcpport=2049
</code>

I commented out **forcedeth** for now... it does load, but I don't know whether, with the current kernel, I need to specifically set those options. Time will tell.

====xen-tools====

Xen Tools appears to have been updated… it can now handle recent distributions!
The config file, **/etc/xen-tools/xen-tools.conf**, follows:

<code>
######################################################################
##
## Virtual Machine configuration
##
dir            = /xen
install-method = debootstrap
cache          = no

######################################################################
##
## Disk and Sizing options
##
size   = 4Gb      # Disk image size.
memory = 256Mb    # Memory size
swap   = 128Mb    # Swap size
fs     = ext4     # use the ext4 filesystem for the disk image.
dist   = jessie   # Default distribution to install.
images = full

######################################################################
##
## Network configuration
##
bridge  = xenbr1
dhcp    = 1
gateway = 10.80.2.1
netmask = 255.255.255.0

######################################################################
##
## Password configuration
##
passwd = 1

######################################################################
##
## Package Mirror configuration
##
arch           = amd64
mirror         = http://ftp.us.debian.org/debian/
mirror_squeeze = http://ftp.us.debian.org/debian/
mirror_wheezy  = http://ftp.us.debian.org/debian/
mirror_jessie  = http://ftp.us.debian.org/debian/

######################################################################
##
## Proxy Settings for repositories
##
apt_proxy = http://10.80.1.3:3142/

######################################################################
##
## Filesystem settings
##
ext4_options   = noatime,nodiratime,errors=remount-ro
ext3_options   = noatime,nodiratime,errors=remount-ro
ext2_options   = noatime,nodiratime,errors=remount-ro
xfs_options    = defaults
reiser_options = defaults

######################################################################
##
## Xen VM boot settings
##
pygrub = 1

# Filesystem options for the different filesystems we support.
#
ext4_options  = noatime,nodiratime,errors=remount-ro,data=writeback,barrier=0,commit=600
ext3_options  = noatime,nodiratime,errors=remount-ro
ext2_options  = noatime,nodiratime,errors=remount-ro
xfs_options   = defaults
btrfs_options = defaults

######################################################################
##
## Xen VM settings
##
serial_device = hvc0
disk_device   = xvda

######################################################################
##
## Xen configuration files
##
output    = /xen/conf
extension = .cfg
</code>
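With those defaults in place, a new DomU is created with a single **xen-create-image** run. The hostname below is just a placeholder (this is a sketch, not from the original notes); everything not given on the command line comes from xen-tools.conf, so the VM config should land in /xen/conf/ and the disk image under /xen:

<code>
halfadder:~# xen-create-image --hostname=newvm.offbyone.lan
</code>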
====xendomains config====

Since we're running Xen 4.0.1, there are some additional configuration options to tend to (along with squeeze likely better distributing functionality to specific files). **/etc/default/xendomains** is next… two changes need to be made:

<code>
## Type: string
## Default: /var/lib/xen/save
#
# Directory to save running domains to when the system (dom0) is
# shut down. Will also be used to restore domains from if
# XENDOMAINS_RESTORE
# is set (see below). Leave empty to disable domain saving on shutdown
# (e.g. because you rather shut domains down).
# If domain saving does succeed, SHUTDOWN will not be executed.
#
#XENDOMAINS_SAVE=/var/lib/xen/save
XENDOMAINS_SAVE=""
</code>

Basically, make **XENDOMAINS_SAVE** an empty string, and:

<code>
## Type: boolean
## Default: true
#
# This variable determines whether saved domains from XENDOMAINS_SAVE
# will be restored on system startup.
#
XENDOMAINS_RESTORE=false
</code>

**XENDOMAINS_RESTORE** should be set to **false**.

Finally, we set a directory for auto-starting VMs on dom0 boot:

<code>
# This variable sets the directory where domains configurations
# are stored that should be started on system startup automatically.
# Leave empty if you don't want to start domains automatically
# (or just don't place any xen domain config files in that dir).
# Note that the script tries to be clever if both RESTORE and AUTO are
# set: It will first restore saved domains and then only start domains
# in AUTO which are not running yet.
# Note that the name matching is somewhat fuzzy.
#
XENDOMAINS_AUTO=/xen/conf/auto
</code>

=====MD array configuration=====

The purpose of the disk array is to provide RAID1 (mirrored) storage for the Xen VM images.

====Re-initializing====

As we've had functioning RAID volumes for years, I thought I would re-do the arrays so as to take advantage of any new version features (when I first created them, the superblock metadata was version 0.90; the current default is 1.2).

So, I first stopped the array:

<code>
sokraits:~# mdadm --stop /dev/md0
</code>

Then, I zeroed out the superblocks on both constituent drives:

<code>
sokraits:~# mdadm --zero-superblock /dev/sdb
sokraits:~# mdadm --zero-superblock /dev/sdc
</code>

Now we can proceed with creating the new array.

====creating /dev/md0====

I opted to build the array straight to disk, no messing with partition tables.

<code>
halfadder:~# mdadm --create /dev/md0 --level=1 --raid-disks=2 /dev/sdb /dev/sdc
mdadm: /dev/sdb appears to be part of a raid array:
    level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdb but will be lost or
       meaningless after creating array
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdc appears to be part of a raid array:
    level=raid0 devices=0 ctime=Wed Dec 31 19:00:00 1969
mdadm: partition table exists on /dev/sdc but will be lost or
       meaningless after creating array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
halfadder:~#
</code>

====checking disk array status====

To check the status:

<code>
halfadder:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc[1] sdb[0]
      488385424 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  8.9% (43629696/488385424) finish=56.9min speed=130132K/sec

unused devices: <none>
halfadder:~#
</code>

Usually (when it has finished building and all is in order) it'll look something like:

<code>
halfadder:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb[0] sdc[1]
      488385424 blocks super 1.2 [2/2] [UU]

unused devices: <none>
halfadder:~#
</code>
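As a shortcut for the next step (standard mdadm usage, not from the original notes), **mdadm** can print a ready-made ARRAY line, UUID included, that can be pasted straight into mdadm.conf; the output should look along these lines (halfadder shown):

<code>
halfadder:~# mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=halfadder:0 UUID=c846eb24:6b9783db:cd9b436c:8470fd46
</code>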
====Setting /etc/mdadm/mdadm.conf====

To avoid oddities (such as /dev/md0 coming up as /dev/md127 and confusing everything) on subsequent boots, we should set up the **/etc/mdadm/mdadm.conf** file accordingly. Assuming hardware is in identical places device-wise, the only data unique to each peer is the hostname and the md0 UUID, as is seen in the following:

===sokraits's mdadm.conf===

<code>
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST sokraits

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=731663e3:d5fd45ac:157baa06:11018534 name=sokraits:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
</code>

===halfadder's mdadm.conf===

<code>
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sdb /dev/sdc

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST halfadder

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0 metadata=1.2 UUID=c846eb24:6b9783db:cd9b436c:8470fd46 name=halfadder:0

# This file was auto-generated on Wed, 06 Aug 2014 11:17:46 -0400
# by mkconf 3.2.5-5
</code>

===How to find the local md volume UUID===

To obtain the UUID generated for the md volume, simply run the following (it is unique per host):

<code>
halfadder:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Aug  7 11:50:06 2014
     Raid Level : raid1
     Array Size : 488255488 (465.64 GiB 499.97 GB)
  Used Dev Size : 488255488 (465.64 GiB 499.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Aug  7 13:11:52 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : halfadder:0  (local to host halfadder)
           UUID : c846eb24:6b9783db:cd9b436c:8470fd46
         Events : 979

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
halfadder:~#
</code>

You'll see the **UUID** listed. Just copy this into **/etc/mdadm/mdadm.conf** in the appropriate place, as indicated by the above config files, to ensure the proper identification of the MD array.

===After configuring mdadm.conf===

According to the information in **/usr/share/doc/mdadm/README.upgrading-2.5.3.gz**, once we configure the **/etc/mdadm/mdadm.conf** file, we must let the system know and rebuild the initial ramdisk:

<code>
BOTH:~# update-initramfs -t -u -k all
update-initramfs: Generating /boot/initrd.img-2.6.32-5-xen-amd64
update-initramfs: Generating /boot/initrd.img-2.6.32-5-amd64
BOTH:~#
</code>

=====DRBD=====

In order to have the "shared" storage that allows OCFS2 to work, we'll set up DRBD to constantly sync the volumes between sokraits and halfadder. With the tools installed, we need to configure some files.

====/etc/drbd.d/global_common.conf BEFORE====

First up, we need to get the peers talking so we can form the volume and get OCFS2 established. Let's make the **/etc/drbd.d/global_common.conf** file look as follows:

<code>
global {
        usage-count no;
}

common {
        startup {
                wfc-timeout 60;
                degr-wfc-timeout 60;
        }

        disk {
                on-io-error detach;
        }

        syncer {
                rate 40M;
        }

        protocol C;
}
</code>

This is only an intermediate step. Further changes are needed before we can bring it up in dual-primary mode.
====/etc/drbd.d/xen_data.res====

And the resource configuration (the same on both peers; this one doesn't need to change):

<code>
resource xen_data {
        device    /dev/drbd0;
        disk      /dev/md0;
        meta-disk internal;

        on sokraits {
                address 172.16.1.1:7788;
        }

        on halfadder {
                address 172.16.1.2:7788;
        }
}
</code>

====bootstrapping DRBD====

Getting DRBD initially up-and-running has always had a bit of voodoo behind it... trying a number of commands and eventually stumbling upon something that works. I may have finally gotten the procedure down:

===TO DO ON BOTH PEERS===

These identical steps are run on both peers:

<code>
BOTH:~# drbdadm create-md xen_data
Writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.
BOTH:~# modprobe drbd
BOTH:~# drbdadm attach xen_data
BOTH:~# drbdadm syncer xen_data
BOTH:~# drbdadm connect xen_data
BOTH:~#
</code>

At this point we should be able to do the following:

<code>
EITHER:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:488370480
EITHER:~#
</code>

===TO DO ON ONLY ONE===

Once we see a semblance of communication (the "Secondary/Secondary" in the /proc/drbd output, for example), we can kick the two peers into operation. This next step must only take place on **ONE** of the peers. I picked **sokraits** for this example... but it really doesn't matter which:

<code>
sokraits:~# drbdadm -- --overwrite-data-of-peer primary xen_data
</code>

Of course, this isn't so willy-nilly if one of the peers has the more up-to-date copy of the data.

We can then view /proc/drbd and see output like:

<code>
sokraits:~# cat /proc/drbd
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
 0: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r----
    ns:361016 nr:0 dw:0 dr:367944 al:0 bm:21 lo:15 pe:76 ua:225 ap:0 ep:1 wo:b oos:488011888
        [>....................] sync'ed:  0.1% (476572/476924)M
        finish: 2:16:00 speed: 59,764 (59,764) K/sec
sokraits:~#
</code>

Checking this occasionally (on either peer) will show the progress.

====formatting the array====

To format the volume, we ignore the underlying disks and address **/dev/drbd0** at all times. The **OCFS2** filesystem was put on the disk array:

<code>
halfadder:~# mkfs.ocfs2 -v -L datastore -N 4 -T datafiles /dev/drbd0
mkfs.ocfs2 1.4.4
Cluster stack: classic o2cb
Filesystem Type of datafiles
Label: datastore
Features: sparse backup-super unwritten inline-data strict-journal-super
Block size: 4096 (12 bits)
Cluster size: 1048576 (20 bits)
Volume size: 500105740288 (476938 clusters) (122096128 blocks)
Cluster groups: 15 (tail covers 25354 clusters, rest cover 32256 clusters)
Extent allocator size: 377487360 (90 groups)
Journal size: 33554432
Node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 5 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Writing lost+found: done
mkfs.ocfs2 successful

halfadder:~#
</code>

By default, if no **-N #** argument is specified when formatting the filesystem, a maximum of 8 machines can simultaneously mount the volume. The intent is for just two machines (sokraits and halfadder) to be the only machines ever mounting this volume.
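Before wiring up the cluster, a quick sanity check (my own addition; **mounted.ocfs2** ships with ocfs2-tools) should show /dev/drbd0 carrying an OCFS2 filesystem with the **datastore** label and its UUID:

<code>
halfadder:~# mounted.ocfs2 -d
</code>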
=====OCFS2=====

Because sokraits and halfadder will exist in a primary-primary peer relationship, we need to run a cluster-aware filesystem on our shared volume. Although many exist, the one we've had any amount of prior experience with is OCFS2, so it is redeployed here.

====configuring OCFS2====

The following should be put in **/etc/default/o2cb**:

<code>
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running 'dpkg-reconfigure ocfs2-tools'.
# Please use that method to modify this file.
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=datastore

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max. time in ms before a keepalive packet is sent.
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min. time in ms between connection attempts.
O2CB_RECONNECT_DELAY_MS=2000
</code>

Next we need to configure the OCFS2 cluster (**/etc/ocfs2/cluster.conf**):

<code>
node:
        ip_port = 7777
        ip_address = 172.16.1.1
        number = 0
        name = sokraits
        cluster = datastore

node:
        ip_port = 7777
        ip_address = 172.16.1.2
        number = 1
        name = halfadder
        cluster = datastore

cluster:
        node_count = 2
        name = datastore
</code>

====/etc/drbd.d/global_common.conf AFTER OCFS2 is ready====

Once the other prerequisites are taken care of, we can bring the OCFS2 cluster up in dual-primary mode, as the following config file allows for. Duplicate this on both peers:

<code>
global {
        usage-count no;
}

common {
        startup {
                wfc-timeout 60;
                degr-wfc-timeout 60;
                become-primary-on both;
        }

        disk {
                on-io-error detach;
        }

        net {
                allow-two-primaries yes;
        }

        syncer {
                rate 80M;
        }

        protocol C;
}
</code>

This **/etc/drbd.d/global_common.conf** file needs to be identical and present on BOTH DRBD peers. Recognizing the changes does not require a reboot! The following command (run on both DRBD peers) will update the config:

<code>
machine:~# drbdadm adjust xen_data
</code>

====Bringing OCFS2 online====

Assuming **/etc/ocfs2/cluster.conf** and **/etc/default/o2cb** are configured and identical, we can now establish OCFS2 cluster connectivity. These steps take place on **BOTH** peers.

===Step 0: kernel module===

<code>
BOTH:~# modprobe ocfs2
</code>

===Step 1: o2cb service===

Next, we need to bring the **o2cb** service online:

<code>
BOTH:~# /etc/init.d/o2cb start
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster datastore: OK
</code>

===Step 2: ocfs2 bits===

Whatever other functionality there is related to OCFS2, time to bring it on-line as well:

<code>
BOTH:~# /etc/init.d/ocfs2 start
BOTH:~#
</code>

===Step 3: mount the volume===

Assuming all is in order, we can now mount our volume:

<code>
BOTH:~# mount -t ocfs2 /dev/drbd0 /export
BOTH:~# df
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              27G  1.2G   25G   5% /
tmpfs                 1.7G     0  1.7G   0% /lib/init/rw
udev                  1.6G  160K  1.6G   1% /dev
tmpfs                 1.7G     0  1.7G   0% /dev/shm
/dev/drbd0            466G  1.6G  465G   1% /export
BOTH:~#
</code>

=====Local Modifications=====

====Automating the mount in /etc/fstab====

We can have the system automatically mount our volume on boot by appending an appropriate entry to the bottom of **/etc/fstab**:

<code>
/dev/drbd0      /export         ocfs2   noatime         0       0
</code>

====Turn swappiness way down====

The Linux default for swappiness is 60, which will result in the system paging stuff out to swap. Cranking it down to 10 seems a more prudent setting, especially on systems with SSDs, where we want swap exercised as little as possible. I added the following line to **/etc/sysctl.conf** on both systems:

<code>
vm.swappiness = 10
</code>
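The sysctl.conf entry takes effect at the next boot; to apply it immediately (standard sysctl usage, not from the original notes):

<code>
BOTH:~# sysctl -w vm.swappiness=10
vm.swappiness = 10
</code>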
====/tmp in RAM====

Seems Debian has nice built-in support for mounting /tmp in tmpfs (a RAM-backed filesystem). All you need to do is edit **/etc/default/tmpfs** and uncomment/change the following line:

<code>
RAMTMP=yes
</code>

And reboot!

====integrating the array's storage into the system====

The disk array will hold the Xen virtual machine images (and supporting files), and also serve as another backup destination for resources in the LAIR. The following directories have been created:

  * /export - the array's main mountpoint
  * /xen - location of Xen data (symlink to /export/xen)
  * /backup - location of backup data (symlink to /export/backup)

====Historical: Configuring xen-tools to create Debian jessie VMs====

**This is no longer needed, but may well be in the future.**

There are two changes needed to successfully create jessie VMs, and both are symlinks:

===Enable debootstrap to work with jessie===

<code>
halfadder:~# cd /usr/share/debootstrap
halfadder:/usr/share/debootstrap# ln -s sid jessie
</code>

===Enable xen-tools to recognize jessie as a valid distro===

<code>
halfadder:~# cd /usr/lib/xen-tools
halfadder:/usr/lib/xen-tools# ln -s debian.d jessie.d
</code>

=====Xen in Operation=====

====Overview====

Sokraits and Halfadder serve as the production virtual machine servers in the LAIR, predominantly for the offbyone.lan network, but also providing services used on lair.lan, student.lab, and across the BITS universe. Running the open source Xen hypervisor (version 4.0.1), around a dozen virtual machines are instantiated.

====Xen administration====

From the VM server, we can adjust various properties and control aspects of virtual machines via the **xm** command-line tool.

====What's running====

To determine what is running on a particular VM host, we use the **xm list** command. Here's an example of its output (the list of VMs can and will vary):

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1377.5
irc                                8   128     1     -b----      9.2
lab46                             11   512     2     -b----     10.6
lab46db                            6   128     1     -b----     14.3
mail                               5   192     1     -b----     18.0
www                                4   192     1     -b----    129.2
sokraits:~#
</code>

This shows us the VMs currently running locally on this VM server. If the VM you are looking for is believed to be running but is not in this list, it is likely running on the other VM server.
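Since any given VM could be on either host, a quick way to hunt one down (my own sketch; it assumes root SSH between the peers is set up) is to ask both servers at once:

<code>
sokraits:~# for h in sokraits halfadder; do echo "== $h =="; ssh $h xm list; done
</code>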
====Boot a VM====

To start a VM that is not presently running, assuming all prerequisites are met (an operational VM image exists, a correct configuration file, available resources (mainly memory)), we can use **xm** to create an instantiation of the virtual machine. In this example, we will start the VM for **repos.offbyone.lan**:

<code>
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-25lenny1) (dannf@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Sep 16 16:32:15 UTC 2010
[    0.000000] Command line: root=/dev/xvda1 ro ip=:127.0.255.255::::eth0:dhcp clocksource=jiffies
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  Xen: 0000000000000000 - 0000000008800000 (usable)
[    0.000000] max_pfn_mapped = 34816
[    0.000000] init_memory_mapping
...
Starting periodic command scheduler: crond.

Debian GNU/Linux 5.0 repos.offbyone.lan hvc0

repos.offbyone.lan login:
</code>

Due to the **-c** argument we gave **xm** when creating the virtual machine, we are connected to the console of this virtual machine, allowing us to see it boot. Omit the **-c** from the **xm** command line and the machine will still start, but we'll simply be returned to the command prompt.

====Detaching from VM console====

In this scenario, we'll want to issue a: **CTRL-]**

Once you do that, you'll escape from the VM's console and be returned to the prompt on the VM server.

====Duplicate VM creation====

And what if the VM is already running? If you are trying to start it on the same VM host it is already running on, you'll see the following:

<code>
halfadder:~# xm create -c /xen/conf/repos.cfg
Using config file "/xen/conf/repos.cfg".
Error: Domain 'repos' already exists with ID '3'
halfadder:~#
</code>

If it is running, but on the other VM server, trouble is likely going to take place. Although the VM servers are using the cluster file system, the individual VMs are not, and will likely not take kindly to concurrent accesses. So prevent headaches and take care not to start multiple copies of the same VM!

====Shut down a VM====

If we desire to shut down a VM, we can do so (and properly!) from the VM server command-line. Using the **xm shutdown** command, a shutdown signal is sent to the VM, and the machine shuts down just as if we had given it a "**shutdown -h now**" command.

Shutting down **repos.offbyone.lan**:

<code>
halfadder:~# xm shutdown repos
halfadder:~#
</code>

After a bit, if you check the output of **xm list**, you will no longer see the VM in question listed. Once this condition is true, you can proceed with whatever operation is underway.

====Live Migrate a VM====

One of the impressive features we have available with the use of DRBD and OCFS2 is a multi-master concurrent filesystem. This creates "shared storage", which grants us some advantages. Specifically, we can use our shared storage to enable migration of virtual machines between VM servers. What's more, we can perform a **live** migration, transparently (to anyone using the virtual machine) moving the VM to another physical host without interrupting its operation.

What follows is an example of a live migration, migrating the **www** virtual machine, originally residing on sokraits:

===Step 0: Verify the running VM===

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1383.1
irc                                8   128     1     -b----      9.8
lab46                             11   512     2     -b----     11.3
lab46db                            6   128     1     -b----     14.6
mail                               5   192     1     -b----     18.8
www                                4   192     1     -b----    133.3
sokraits:~#
</code>

So we see **www** is running on sokraits.

===Step 1: Live migrate it to halfadder===

<code>
sokraits:~# xm migrate --live www halfadder
sokraits:~#
</code>

After only a few seconds, we get our prompt back.

===Step 2: Verify www is no longer running on sokraits===

Do another **xm list** on sokraits:

<code>
sokraits:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  2344     4     r-----   1387.8
irc                                8   128     1     -b----      9.9
lab46                             11   512     2     -b----     11.4
lab46db                            6   128     1     -b----     14.6
mail                               5   192     1     -b----     19.0
sokraits:~#
</code>

As you can see, **www** is no longer present in the VM list on sokraits.

===Step 3: Check running VMs on halfadder===

Switch over to halfadder and do a check:

<code>
halfadder:~# xm list
Name                              ID   Mem VCPUs      State   Time(s)
Domain-0                           0  3305     1     r-----     65.0
auth                               4   128     1     -b----      4.1
log                                1   128     1     -b----      0.8
repos                              5   128     1     -b----      8.3
web                                2   128     1     -b----      2.9
www                                6   192     1     -b----      0.4
halfadder:~#
</code>

And voila! A successful live migration.
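Building on this, when one of the VM servers needs to come down for maintenance, everything can be evacuated in one pass. This is a sketch of my own (not from the original notes); it assumes it is run on the host being emptied (sokraits here) and that the other host has enough free memory for all of the incoming guests:

<code>
sokraits:~# for vm in $(xm list | awk 'NR>1 && $1 != "Domain-0" {print $1}'); do xm migrate --live $vm halfadder; done
</code>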
=====LRRDnode configuration=====

To facilitate administration, both sokraits and halfadder are configured as LRRDnode clients and log data that can be retrieved from LRRD at: http://web.offbyone.lan/lrrd/

====Install lrrd-node====

The first step is to install the actual LAIR package:

<code>
BOTH:~# aptitude install lrrd-node
The following NEW packages will be installed:
  libstatgrab6{a} lrrd-node python-statgrab{a}
0 packages upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 118 kB of archives. After unpacking 348 kB will be used.
Do you want to continue? [Y/n/?]
Get:1 http://mirror/debian/ squeeze/main libstatgrab6 amd64 0.16-0.1 [57.6 kB]
Get:2 http://mirror/debian/ squeeze/main python-statgrab amd64 0.4-1.1+b2 [53.0 kB]
Get:3 http://mirror/lair/ squeeze/main lrrd-node all 1.0.7-1 [7,128 B]
Fetched 118 kB in 0s (9,978 kB/s)
Selecting previously deselected package libstatgrab6.
(Reading database ... 28935 files and directories currently installed.)
Unpacking libstatgrab6 (from .../libstatgrab6_0.16-0.1_amd64.deb) ...
Selecting previously deselected package python-statgrab.
Unpacking python-statgrab (from .../python-statgrab_0.4-1.1+b2_amd64.deb) ...
Setting up libstatgrab6 (0.16-0.1) ...
Setting up python-statgrab (0.4-1.1+b2) ...
Processing triggers for python-support ...
Selecting previously deselected package lrrd-node.
(Reading database ... 28961 files and directories currently installed.)
Unpacking lrrd-node (from .../lrrd-node_1.0.7-1_all.deb) ...
Setting up lrrd-node (1.0.7-1) ...
Adding lrrdNode to init.d
update-rc.d: using dependency based boot sequencing
insserv: warning: script 'lrrdnode' missing LSB tags and overrides
Running lrrdNode ...
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
BOTH:~#
</code>

====Configure lrrd-node at LRRD====

Once installed and running on the client side, we need to configure (or reconfigure, as the case may be) the node at LRRD. So pop a browser over to: http://web.offbyone.lan/lrrd/

Log in (~root, punctuation-less ~root pass), click on the "Configure" link, and find the host in question (if it has prior history reporting to LRRD). If found, check that it is Enabled, and click the "reconfigure" link to the right of the entry. There is an option to delete existing databases (do it), and check off any appropriate network interfaces.

====Manual lrrd-node restart====

If data reporting ceases while other components of the LRRD system still appear to be functioning, it is likely that the lrrd-node client needs a restart. Simply do the following on the machine in question:

<code>
sokraits:~# /etc/init.d/lrrdnode restart
Stopping lrrdNode: stat collection thinger: lrrdNode
Starting lrrdNode: stat collection thinger: Starting LRRD Node lrrdNode
sokraits:~#
</code>

Wait at least 5 minutes for data reporting to make it into graphable form.

=====Sync'ing to data store=====

Since we've been successfully running the systems out of a RAMdisk, care must be taken to preserve any changes in the event of a reboot or power failure.

====rsync to disk====

In this light, I first had the systems rsync'ing to their local SSD (boot drive). I rigged up a custom cronjob that runs 3 times a day. It looks as follows:

<code>
12 */8 * * * (mkdir -p /tmp/sda1; mount /dev/sda1 /tmp/sda1; rsync -av --one-file-system / /tmp/sda1/; umount /tmp/sda1)
</code>

====rsync to fileserver====

This worked handily until sokraits lost its boot drive (again! in 2 months' time!), so I decided to investigate netbooting using an NFSroot.
In the process, I may have finally made a breakthrough in my longtime desire to put the entire system IN the initial ramdisk (so it would be running in RAM from the get-go). It turns out, according to the manual page, you merely have to put the system IN the initrd file... obviously one needs adequate memory (2x at boot: enough to hold the initrd, and enough to decompress it into). My cron job changed as follows:

<code>
24 */8 * * * (rsync -av --one-file-system / data.lair.lan:/export/tftpboot/netboot/halfadder/disk/)
</code>

I plan to rig up either some daily autogeneration of the initrd, or have a script on standby that can be used to make it. This will then become the method of booting both sokraits and halfadder (and potentially freeing up a still-working SSD in the process! Which I can use in data2).

On the fileserver, I then obtain the latest copy of the hypervisor and kernel, and generate a new all-system initrd:

<code>
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/xen-4.4-amd64.gz .
data1:/export/tftpboot/netboot/halfadder# cp disk/boot/vmlinuz-3.16-2-amd64 linux
data1:/export/tftpboot/netboot/halfadder# cd disk
data1:/export/tftpboot/netboot/halfadder/disk# find . | cpio -c -o | gzip -9 > ../initrd.gz
data1:/export/tftpboot/netboot/halfadder/disk#
</code>

====pxeboot file for sokraits/halfadder====

On the fileserver, in **/export/tftpboot/pxelinux.cfg/**, are two files, **0A50012E** (sokraits) and **0A50012F** (halfadder)... they are named according to the machine's IP address (only in hex). The file(s) contain:

<code>
default netboot
prompt 1
timeout 2

label netboot
        kernel mboot.c32
        append netboot/halfadder/xen-4.4-amd64.gz --- netboot/halfadder/linux console=tty0 root=/dev/ram0 ro --- netboot/halfadder/initrd.gz

label memtest
        kernel distros/memtest/memtest86+
</code>
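The hex filenames are simply the host's IP address written as four hexadecimal octets; a one-liner to compute them (my own sketch, plain printf, run anywhere):

<code>
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 46     # sokraits, 10.80.1.46
0A50012E
data1:~# printf '%02X%02X%02X%02X\n' 10 80 1 47     # halfadder, 10.80.1.47
0A50012F
</code>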
=====References=====

====Xen====

===Xen on Squeeze===
  * http://wiki.debian.org/Xen

===Xen Live Migration===
  * http://www.linux.com/archive/feed/55773

===Xen vif-common.sh fixes===
  * http://xen.1045712.n5.nabble.com/PATCH-vif-common-sh-prevent-physdev-match-using-physdev-out-in-the-OUTPUT-FORWARD-and-POSTROUTING-che-td3255945.html
  * http://www.gossamer-threads.com/lists/xen/devel/189692
  * http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=571634#10

===Xen domU loses network===
  * http://xen.1045712.n5.nabble.com/domU-loses-network-after-a-while-td3265172.html
  * http://lists.xensource.com/archives/html/xen-users/2010-09/msg00026.html
  * http://blog.foaa.de/2009/11/hanging-network-in-xen-with-bridging/
  * http://www.gossamer-threads.com/lists/xen/users/183736

====Nvidia forcedeth (MCP55)====
  * https://bugs.launchpad.net/ubuntu/+source/linux/+bug/136836/

====MDADM====

===Volume coming up on md127 instead of md0===
  * http://www.spinics.net/lists/raid/msg30175.html

====DRBD====
  * http://en.gentoo-wiki.com/wiki/Active-active_DRBD_with_OCFS2
  * http://www.howtoforge.com/drbd-8.3-third-node-replication-with-debian-etch
  * http://blog.friedland.id.au/2010/08/setting-up-highly-available-nfs-cluster.html

====DRBD+OCFS2====
  * http://www.clusterlabs.org/wiki/Dual_Primary_DRBD_%2B_OCFS2

====Debian from RAM====
  * http://reboot.pro/topic/14547-linux-load-your-root-partition-to-ram-and-boot-it/
  * debirf:
    * http://cmrg.fifthhorseman.net/wiki/debirf
  * http://www.sphaero.org/blog:2012:0114_running_debian_from_ram

====/tmp as noexec====
  * http://www.debian-administration.org/articles/57

====netboot system to nfsroot====
  * http://www.iram.fr/~blanchet/tutorials/read-only_diskless_debian7.pdf
    * this led me to the initrd man page, which indicated we might be able to stick the entire system in the initrd and PXE boot that. So many things become simpler at that point.