Setting up high-availability failover mode

LAN model UCARP-based failover

Access Server comes with a built-in failover mode that can be deployed on a local area network. It is designed to let one primary node handle all tasks and, if that node fails, to let a secondary standby node come online automatically and take over from the failed node. This is done with a method called UCARP, which uses VRRP heartbeat network packets. The two nodes work together to keep a single virtual IP address online. During normal operation the currently active node handles all incoming requests, but when it goes down, the secondary takes over by becoming the new active node.

There will be a short disruption because the previous OpenVPN server and the VPN clients negotiated TLS encryption keys that are not valid on the new server. VPN clients will time out their connection after about 30 to 60 seconds. They will then reconnect automatically, complete their authentication with session tokens when possible, and negotiate new TLS keys with the server they are now connected to.

Platform compatibility

This method unfortunately does not work on all platforms. For example, on Amazon AWS, broadcast UCARP/VRRP traffic is simply filtered away, so this model cannot be used there. For those platforms we recommend the OpenVPN Access Server clustering capability instead, where multiple nodes can be active at the same time, each capable of handling incoming connections. A failure of a node in a cluster results in VPN clients automatically connecting to any of the other nodes in the cluster. UCARP/VRRP failover platform compatibility is further explained below.

  • Physical servers should work just fine on physical networks.
  • Microsoft Hyper-V and VMware ESXi are supported, but you may need to enable "MAC spoofing" or "Promiscuous mode".
  • Other virtualized platforms should also work as long as it's a local network where broadcast UCARP/VRRP is possible.
  • Amazon AWS is not supported, because the heartbeat signal is filtered away on their networks.
  • Most major cloud networks are not supported because they do not support the UCARP/VRRP traffic. Consider using clustering instead.
  • If one node is in a different network from the other node, this failover model can almost certainly not be used.
  • If multiple UCARP/VRRP failover pairs are present in the same network, you must adjust the VHID to be unique.

That last point requires further explanation. The VHID is a number that is sent along in the heartbeat signal that goes onto the local network. The secondary node monitors this heartbeat signal. If multiple UCARP/VRRP systems are online at the same time in the same network, multiple heartbeat signals can be seen. So that the secondary node knows which one to monitor, each heartbeat signal carries an identifier number that must be unique within the network. By default, an Access Server failover pair uses the number 94. You can adjust the VHID on the command line to ensure that each failover pair running in the same LAN recognizes its partner node properly.
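
Before choosing a VHID, it can help to see which ones are already in use on the LAN. The following is a minimal sketch, assuming tcpdump is installed and that it labels the VHID as "vrid" in its output, as in the example output in the Troubleshooting section. The function name list_vhids is hypothetical.

```shell
# Print every distinct VHID (shown by tcpdump as "vrid") found in
# tcpdump output read from stdin, sorted numerically.
list_vhids() {
    grep -o 'vrid [0-9][0-9]*' | awk '{ print $2 }' | sort -un
}

# Usage (as root, on a machine in the same LAN); -c 20 stops after 20 packets:
#   tcpdump -c 20 -eni any vrrp 2>/dev/null | list_vhids
```

Any number that this prints is already taken by a heartbeat on the network and should not be reused for a second failover pair.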

Function description

Typically you'd run a private network with 2 servers that each have their own private IP address. The UCARP/VRRP method works by setting up a third shared virtual private IP address where all services are handled by the currently active node. For example node A could be on 192.168.70.1 and node B on 192.168.70.2, while the shared virtual IP where the services are offered is at 192.168.70.222. You could set up your router to send requests from the Internet to that last IP.

While the active node is online it handles all traffic on the shared virtual IP address. It sends configuration changes to the standby node. The standby node stays dormant until it notices that the VRRP/UCARP heartbeat signal that the primary node sends out on the network has ceased. If this lasts more than a few seconds, the standby node takes over and becomes the active node. At this point it loads the configuration changes that were sent to it and takes over the shared virtual IP address. This happens very quickly, on the order of 5 to 10 seconds, maybe a bit more, depending on how fast your server is at starting the Access Server service and how much data is in the configuration, certificates, and user properties databases.

Clients will be momentarily interrupted by the failure of the primary node. Their current encryption sessions use TLS encryption keys that were agreed with the previously running server node, and the newly activated node doesn't know about them. After a timeout, usually 30 to 60 seconds, the client decides that the connection has failed and reconnects. It will try to use the previous session's authentication token to authenticate. The failover node validates this session token and then allows the client to reconnect automatically. Auto-login profiles don't need the session token logic since they authenticate by certificate alone. In a failover event an interruption of about a minute is to be expected, and in almost all cases connectivity should restore automatically. The client connection profiles from one node will be accepted by the other one, because that data is synchronized.

If an active node stops functioning while the other node is in standby, the standby node becomes the active node. If the previously active node then, for example, crashes, reboots, and comes back online, it will see an active node on the network and go into standby mode. In other words, if a node is online, handling requests and sending out a heartbeat signal, and the other node starts up, the starting node will not force a failover. This behavior is intentional: if the primary node has failed in a way that causes it to come online and crash again after a short while, ending up in a reboot cycle due to, say, a hardware failure, we don't want each reboot to cause another failover event. So after a failover event has occurred, you will have to intervene manually if you want the primary node to be the active master node again. To do so, ensure the primary node is running normally, and then restart the Access Server service on the secondary node, or reboot it. The primary will see this as a failure of the active secondary node and take over again.

First steps in setting up the primary node

This part is the same as setting up a normal OpenVPN Access Server installation on a private network. You will need a supported Linux operating system with a private static IP address. Our technical documentation describes how to set a static IP address on a Linux installation, if you need it. Some networks use a DHCP server with static IP address assignments for DHCP clients; if you have that configured and working, that is also acceptable.

Since you will be running the Access Server failover pair inside a private network, you will need to set up port forwarding on the gateway system that connects this network to the Internet if you want people on the Internet to reach it. For initial testing you can forward ports TCP 443, TCP 943, and UDP 1194 to the static IP address of your primary node. This way you can set up your Access Server and make it reachable from the outside. Later, you should direct the port forwards to the virtual IP chosen for your failover setup instead.

Ideally you would have a DNS record (FQDN) that points to the public IP address of your Internet gateway system, which forwards ports to your Access Server's shared virtual IP, and you would configure this FQDN in the host name or IP address field on the Access Server's Network Settings page. This field contains the address clients will try to connect to. A DNS name allows for easy updates if the public IP of the server ever changes, and it also makes it possible to install a proper SSL certificate.
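
How port forwarding is configured depends entirely on your gateway, and many gateways are configured through their own web interface. As an illustration only, the sketch below assumes the gateway is a Linux system using iptables and that its Internet-facing interface is eth0. Both are assumptions, and print_forward_rules is a hypothetical helper. It prints the rules for the three ports mentioned above instead of applying them, so you can review them first.

```shell
# Print (do not apply) DNAT rules that forward the Access Server ports
# to one internal IP address.
# $1 = internal target IP, $2 = Internet-facing interface (assumed default: eth0).
print_forward_rules() {
    target=$1
    wan_if=${2:-eth0}
    for spec in tcp:443 tcp:943 udp:1194; do
        proto=${spec%%:*}   # part before the colon
        port=${spec##*:}    # part after the colon
        echo "iptables -t nat -A PREROUTING -i $wan_if -p $proto --dport $port -j DNAT --to-destination $target:$port"
    done
}

# Initial testing: forward to the primary node's static IP.
print_forward_rules 192.168.70.1
```

Once failover is configured, the same helper can be called with the shared virtual IP (192.168.70.222 in this guide's example) instead. Depending on the gateway's firewall policy, matching rules in the FORWARD chain may also be needed.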

You will need the program rsync present on your primary node. Install it:

apt-get update
apt-get install rsync

The program rsync is used to transfer configuration backups, user certificates, and user properties, from the primary node to the secondary node. In the event of a failover, the secondary node loads these backups and goes online and takes over the tasks from the failed node with this up-to-date information.

Preparing the failover node for use

We are going to assume you have a server already set up as the primary node, as described in the section above.

To set up the secondary node, simply do a new deployment of Access Server. It doesn't matter whether you deploy it as an appliance, as a virtual image, or as a manual installation on Linux. You do not need to configure all the settings of the Access Server; just get it to the point where you can reach the command line and the Access Server package is installed. Next, set up a static IP address for this node as well, just like on the primary node, but with a different IP address. You do not need to do port forwarding to this node. Get root permissions on the server you are going to use as the secondary node and run the following destructive command on it to clear all its settings and prepare it for use as a secondary node.

Prepare the secondary node for its role as a failover system:

ovpn-init --secondary

You will have to manually confirm this step by typing the word DELETE to confirm that you want to wipe this server's settings and set it up as a failover node. It goes without saying that this step wipes this particular node of all of its settings, so if this is a production node and it contains data that you want to keep, obviously do not demote this node to a failover role, but instead set up a new failover node. If you want to automate this command completely so it doesn't ask confirmation then you can add the parameters --batch and --force to it.
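
Because the unattended form skips the DELETE confirmation, it is easy to run it on the wrong machine. The sketch below wraps the command in a guard. The --batch and --force parameters are the ones described above; the function name and the guard argument are hypothetical additions for this example and are not ovpn-init options.

```shell
# Wipe this node and set it up as a failover node without prompting.
# The guard argument is a hypothetical safety check added for this sketch.
init_secondary_unattended() {
    if [ "${1:-}" != "--i-understand-this-wipes-the-node" ]; then
        echo "Refusing to run: this erases all Access Server settings on this node." >&2
        return 1
    fi
    ovpn-init --secondary --batch --force
}

# Usage (as root, on the node that is to become the secondary):
#   init_secondary_unattended --i-understand-this-wipes-the-node
```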

You will need the program rsync present on your secondary node. Install it:

apt-get update
apt-get install rsync
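
To confirm that the required programs are actually available on a node, a small check like the following can be run on both the primary and the secondary. This is a minimal sketch; require_tools is a hypothetical helper name.

```shell
# Return 0 only if every named program is found in PATH;
# report each missing one on stderr.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage on each node:
#   require_tools rsync ssh && echo "all required tools present"
```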

Set up bi-directional SSH access

Currently the Access Server needs root-level access to the partner node in order to configure things and to keep the settings updated. There are two ways to go about this. One is to use passwordless SSH keys, which are automated and fairly secure. The other is to enable root user login directly through SSH with a password, but this is not considered secure. We are therefore going to focus here on the passwordless SSH key setup.

We are going to make a number of assumptions in this guide and you should adjust for your situation as necessary:

  • We are assuming that you cannot login with the username root via an SSH connection.
  • We assume that you do have the ability to login through SSH with a user other than root, and that with the use of the command 'sudo su', you can gain root privileges.
  • In our guide we assume that this non-root user is called simply sshuser.
  • We assume that 192.168.70.1 is your primary node's IP address.
  • We assume that 192.168.70.2 is your secondary node's IP address.
  • We assume that 192.168.70.222 is the shared virtual IP that your failover pair will work to keep online at all times.
  • We assume that you are logged on through SSH and have now obtained root privileges on both nodes.
  • All commands below are assumed to be run as the root user.

Log on to both nodes and run these commands on both nodes:

mkdir -p ~/.ssh
chmod 700 ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -f id_rsa -P ""
cat id_rsa.pub >> authorized_keys
chmod 600 authorized_keys

This creates SSH access keys that require no password to log in. But they need to be transferred to their partner node and put into the correct place so the nodes know when and how to use them for direct SSH access without the need to log in with credentials.

On the primary node, copy the key to the secondary node:

/usr/bin/ssh-copy-id -i ~/.ssh/id_rsa.pub sshuser@192.168.70.2

And vice-versa, on the secondary node, copy the key to the primary node:

/usr/bin/ssh-copy-id -i ~/.ssh/id_rsa.pub sshuser@192.168.70.1

You will likely have to confirm that you want to make a connection for the purpose of copying the SSH access key to its partner node. You will have to enter the password of the user sshuser to complete the transfer.

Once this copy process is done, the keys are in the wrong place: ssh-copy-id added them to the authorized_keys file of sshuser, not to that of root. Run this command on both nodes to put the SSH access keys in the correct place for root access:

cat /home/sshuser/.ssh/authorized_keys >> /root/.ssh/authorized_keys

To test that it is working, try to establish an SSH connection from the primary node to the secondary node by only typing:

ssh root@192.168.70.2

If this works, that means the passwordless SSH key setup has succeeded. You should test the other direction as well, from secondary node to primary node.
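
The same test can be run without opening an interactive session. The sketch below uses the OpenSSH BatchMode option, which makes ssh fail instead of asking for a password, so a successful exit means key-based root login works. It assumes the partner's host key was already accepted during the ssh-copy-id step. The function name and the SSH_CMD variable are hypothetical; SSH_CMD exists only so the check can be dry-run with a stand-in command.

```shell
# Return 0 if root can log in to the given host with a key and no prompt.
# SSH_CMD is a hypothetical override for dry runs; it defaults to ssh.
check_root_ssh() {
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}

# On the primary node:
#   check_root_ssh 192.168.70.2 && echo "key login to secondary OK"
# On the secondary node:
#   check_root_ssh 192.168.70.1 && echo "key login to primary OK"
```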

Configure the failover function

Log on to your primary node's admin UI web interface, and go to the failover page. Switch on the LAN model (UCARP-based failover) option and then enter the shared virtual IP that you want both nodes to try to keep online at all times, and enter the IP address of your primary node and your failover node. Assuming you used the passwordless SSH key setup described in the section above, you do not need to alter any of the other values. Now select the Validate option and let the Access Server check the connection. If all is well you should see a good result. You can then use the Commit and Restart button to commit the changes.

Once the changes have been committed, the primary node's Access Server service will automatically restart itself and go online as the primary node in failover mode. It will bring online the shared virtual IP address (192.168.70.222 in our example) and offer its services there. Now restart the secondary node's OpenVPN Access Server service to ensure it picks up the new configuration changes (service openvpnas restart). The secondary node will go into standby mode and no longer offer a web service or VPN service at its configured static IP address. It will simply stand by and wait for a failure of the primary node. If the primary node fails, it will automatically take over the role of the primary node, go online, offer a web service and VPN service, and handle incoming connections just like the failed node would have.

You should now update your port forwarding settings so that they point to the shared virtual IP address (192.168.70.222 in our example). Your failover setup is now functional. You can test it by, for example, shutting down the primary node and checking whether your failover node becomes the active node. You can follow the state changes in the /var/log/openvpnas.log and /var/log/openvpnas-node.log files, and you can also open the public address of your Access Server's web interface to check that it still responds once the primary node has been shut down.
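
Another way to see which node is active during such a test is to check which one currently holds the shared virtual IP. The sketch below assumes that the virtual IP appears as an additional address in the output of the ip command on the active node; holds_vip is a hypothetical helper name.

```shell
# Return 0 if the given address appears in `ip -o -4 addr show`
# output read from stdin. Field 4 of that output is ADDRESS/PREFIX.
holds_vip() {
    awk -v vip="$1" '
        { split($4, addr, "/"); if (addr[1] == vip) found = 1 }
        END { exit !found }
    '
}

# Usage on either node:
#   ip -o -4 addr show | holds_vip 192.168.70.222 && echo "this node is active"
```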

Finally, you should look into the licensing status of your servers.

Activate license on LAN model failover pair

It is recommended to purchase and activate a subscription activation key. It is designed with this use-case in mind. This type of software license activation key allows you to activate on multiple Access Servers at the same time. It also automatically migrates from one node to the other in a LAN model failover pair. This means that if you activate a subscription activation key on the currently active node in a failover pair, the other one will also automatically use it once a failover event occurs.

In the past we sold the fixed license activation keys. These were single-activation and had the drawback that you would need to have a separate license for each node in a failover pair. At the time we would offer special courtesy license keys for the failover node to match the purchased license key, but since we now have the subscription model, we advise people to switch to a subscription instead. That does require that you update your Access Server to version 2.8.1 or higher.

You can activate a subscription activation key by going to the Admin UI of the failover pair and opening Configuration > Activation. If you prefer a command line method, that is also possible, but it must be done on the node that is currently the master node. You can find out which node that is fairly easily by trying to open the web interface of the Access Server not on the shared virtual IP, but on the direct IP of the node in question. Only the current master node will respond. The node that is in standby will not respond. You can also check the log file /var/log/openvpnas.log to see the current node's status.
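
The check of the direct IPs can also be done from a shell. The following is a rough sketch, assuming curl is installed and that the web interface listens on TCP 943 (the port used for the web interface in the port forwarding example earlier; adjust it if yours differs). The function name and the CURL_CMD variable are hypothetical; CURL_CMD exists only so the check can be dry-run with a stand-in command. A node that does not respond may be in standby, but it may also simply be down.

```shell
# For each node IP, report whether its web interface answers.
# -k skips certificate verification because the node is addressed by IP.
# CURL_CMD is a hypothetical override for dry runs; it defaults to curl.
report_node_response() {
    for ip in "$@"; do
        if "${CURL_CMD:-curl}" -k -s -o /dev/null --connect-timeout 3 "https://$ip:943/"; then
            echo "$ip responding"
        else
            echo "$ip not responding"
        fi
    done
}

# Usage:
#   report_node_response 192.168.70.1 192.168.70.2
```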

Activating a subscription on the command line (as root):

cd /usr/local/openvpn_as/scripts/
./sacli -v "InsertTheSubscriptionActivationKeyHere" LoadSubscription

Verify that it works:

./sacli SubscriptionStatus

After a failover event occurs, the configuration data including the subscription activation data gets loaded onto the secondary node automatically. If you have any trouble with activation, see our troubleshooting guide for software licensing.

Troubleshooting

If both nodes simultaneously try to be the MASTER node, or primary node, then your nodes may simply not be able to communicate with each other using VRRP heartbeat signals. There is a way to find out for certain whether this is the case. An active primary node sends out VRRP packets onto the network; a secondary node in standby mode does not. If you stop the Access Server service on the secondary node, the primary node should be online as the primary node, be the MASTER in the network, and send out VRRP packets that are visible to the secondary node. So for testing purposes, stop the Access Server service on the secondary node and use tcpdump to see whether the VRRP packets arrive at the secondary node.

Stop the Access Server service on the secondary node:

service openvpnas stop

Install tcpdump on the secondary node:

apt-get update
apt-get install tcpdump

Use tcpdump to look for VRRP packets:

tcpdump -eni any vrrp

Example output:

18:15:53.000605 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36
18:15:54.000718 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36
18:15:55.000802 M 00:00:5e:00:00:5f ethertype IPv4 (0x0800), length 72: 192.168.70.1 > 224.0.0.18: VRRPv2, Advertisement, vrid 94, prio 0, authtype none, intvl 1s, length 36

If you do not see VRRP packets arriving, there is a very good chance that your network equipment is blocking them. In that case you should try to find a way to resolve that. If your network is incapable of passing these VRRP packets, then unfortunately you cannot use the LAN model UCARP-based failover of the OpenVPN Access Server product.
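
To turn this check into a simple count, the tcpdump output can be filtered for advertisements from the primary node. This is a minimal sketch that assumes the output format shown in the example above (VRRPv2 advertisements sent to 224.0.0.18); count_adverts is a hypothetical helper name.

```shell
# Count VRRP advertisements sent by the given source IP in tcpdump
# output read from stdin. Prints 0 if none are found.
count_adverts() {
    grep -c -F " $1 > 224.0.0.18: VRRPv2, Advertisement"
}

# Usage on the secondary node (as root): capture for 10 seconds, then count.
# -l makes tcpdump line-buffered so no output is lost when the capture stops.
#   timeout 10 tcpdump -l -eni any vrrp 2>/dev/null | count_adverts 192.168.70.1
```

With the 1 second interval shown in the example output, a 10 second capture should report roughly 10 advertisements. A count of 0 means the heartbeat is not reaching the secondary node.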