Collector failover principle
A collector can be configured to act as a failover collector for another collector.
Normal operation
In normal operation the main collector is collecting data and is “healthy”. The main collector makes its health status available to the failover collector. As long as the failover collector can verify that the main collector is healthy, it remains dormant.
Main collector is down
In this scenario the failover collector has not been able to verify the health of the main collector within a predefined time frame, because either the main collector or the machine it runs on has failed. The failover collector then becomes active and starts collecting data on the failover server.
Main collector is unhealthy but reachable
In this scenario the failover collector receives a health status from the main collector, but the status has been bad for a predefined time frame. The failover collector then becomes active and starts collecting data on the failover server.
Network down
In this scenario a network issue prevents communication between the main collector and the failover collector. Both collectors then collect data for the configured measurements.
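The four scenarios above can be summarized as a small decision function. This is only an illustrative sketch, not the collector's actual implementation: the names HealthCheck and failover_should_collect are hypothetical, and the predefined time frames are left out for brevity. It also assumes that the failover collector cannot tell a failed main collector apart from a network outage, which would explain why both collectors collect data when the network is down.

```python
from enum import Enum


class HealthCheck(Enum):
    """Hypothetical outcome of one health check, seen from the failover collector."""
    HEALTHY = "healthy"          # main collector answered and reported good health
    UNHEALTHY = "unhealthy"      # main collector answered but reported bad health
    UNREACHABLE = "unreachable"  # no answer: collector down, machine down, or network down


def failover_should_collect(result: HealthCheck) -> bool:
    """Return True when the failover collector should be collecting data.

    Assumption: only a verified healthy main collector keeps the failover dormant.
    """
    return result is not HealthCheck.HEALTHY


for outcome in HealthCheck:
    print(outcome.value, "->", "active" if failover_should_collect(outcome) else "dormant")
```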
Configuration
To configure failover for a collector, create a new collector, install it on the target machine, and perform the initial registration. Once the collector has been registered, go to the settings page. After you toggle the failover switch, a new panel called “Failover settings” should appear with the options described below.
Main collector
Select the main collector for which this collector should act as the failover. Only collectors of the same type can be selected.
Port
This is the port on which the main collector listens to serve its health status to the failover collector. Enter a valid TCP port number; if the collector runs without administrator or root privileges, the port must be greater than 1023. Make sure your firewall configuration allows connections on this port from the host running the failover collector to the host running the main collector.
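As a quick sanity check before entering a value, the port rule can be expressed as follows. This is a minimal sketch; the function name is_usable_port and the example port numbers are hypothetical and not part of the product.

```python
def is_usable_port(port: int, privileged: bool = False) -> bool:
    """Check a candidate health-status port against the rule described above."""
    # Valid TCP port numbers range from 1 to 65535.
    if not 1 <= port <= 65535:
        return False
    # Without administrator or root privileges, the port must be greater than 1023.
    return privileged or port > 1023


print(is_usable_port(8765))                  # hypothetical unprivileged port
print(is_usable_port(443))                   # too low without privileges
print(is_usable_port(443, privileged=True))  # acceptable with privileges
```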
Host
This is the host address the failover collector connects to in order to get the health status of the main collector. It can be an IP address or a resolvable hostname.
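To confirm that the configured host and port are reachable through the firewall, a plain TCP connection test can be run from the machine hosting the failover collector. The sketch below only checks that a TCP connection can be opened; it does not speak the collector's health protocol, which is not described here. The function name can_reach and the host and port in the usage line are placeholders.

```python
import socket


def can_reach(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves hostnames and accepts IP addresses alike.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the main collector's host and port.
    print(can_reach("main-collector.example.com", 8765))
```

A result of False while the main collector is running usually points to a firewall rule, a wrong port, or a hostname that does not resolve from the failover host.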
Timeout
This is the number of seconds the failover collector will wait before becoming “active” after it has deemed the main collector to be unhealthy.
Stop delay
This is the number of seconds the failover collector will wait before returning to the “idle” state after it has deemed the main collector to be healthy.
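The interplay of the timeout and the stop delay can be pictured as a two-state machine. The following is a rough sketch under stated assumptions, not the collector's real logic: the class FailoverState and its method observe are hypothetical, an unreachable main collector is treated the same as an unhealthy one, and it is assumed that the main collector must stay unhealthy (or healthy) for the whole waiting period before the failover collector changes state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverState:
    timeout: float     # seconds the main collector must look unhealthy before activating
    stop_delay: float  # seconds the main collector must look healthy before going idle
    active: bool = False
    pending_since: Optional[float] = None  # when the current pending transition started

    def observe(self, main_healthy: bool, now: float) -> bool:
        """Feed one health observation; return whether the failover is active."""
        # Idle with a healthy main, or active with an unhealthy main:
        # nothing to change, so any pending transition is cancelled (assumption).
        if main_healthy != self.active:
            self.pending_since = None
            return self.active
        # The observation contradicts the current state: start or continue waiting.
        if self.pending_since is None:
            self.pending_since = now
        wait = self.stop_delay if self.active else self.timeout
        if now - self.pending_since >= wait:
            self.active = not self.active
            self.pending_since = None
        return self.active


state = FailoverState(timeout=30, stop_delay=60)
print(state.observe(main_healthy=False, now=0))   # still idle, timeout just started
print(state.observe(main_healthy=False, now=30))  # timeout elapsed, becomes active
```

In a real loop the now argument would typically come from a monotonic clock such as time.monotonic(), so that system clock adjustments do not shorten or lengthen the waiting periods.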