The other day I stopped the PRTG services on the Primary node of my PRTG cluster. The Secondary node became active and took over the alerting, nice and smoothly.
When I started the PRTG services on the Primary node again, PRTG immediately failed back to this node. However, it took the node over 30 minutes to check all the sensors, which caused a lot of Business Service sensors to go red.
Is there a way to disable the automatic fail-back of the cluster? If I could fail back manually, I would do so only once the Primary node has checked all sensors again.
I know I can manually fail over to the Secondary node if I need to do maintenance on the Primary node. But I'm talking about the Primary node crashing (a BSOD, for example) and restarting automatically. I really don't want an automatic fail-back to cause an alert storm because the Primary node has not yet checked all of its sensors...
Kind regards,
Corné van den Bosch
Article Comments
Hello Luciano,
Not that big. Just a two-node PRTG cluster with not even 5k sensors.
The trick of setting the PRTG Core Server Service to start manually is a good one! I'll configure it on both nodes right away.
Especially because we're working on automatically creating tickets in our ticket system as soon as a sensor goes red, I can't have this sea of red sensors whenever the Primary Master reboots (it happens; maintenance and such). The Business Service sensors in particular are quite sensitive to this...
But this trick can help me avoid this. Thank you for the tip!
Kind regards,
Corné van den Bosch
Mar, 2017 - Permalink
Hello and thank you for your KB-Post,
Could it be that this is a fairly large deployment? Please see the performance constraints regarding clusters:
Within PRTG there's no way of controlling this. The node with the highest priority will automatically "re-take control of the cluster" as soon as it starts.
As a workaround, you could configure the PRTG Core Server Service to start manually only. The probe service will then still resume automatically after a reboot, which means that the failover node will start getting data from both nodes again, but the core server service (and with it the "primary cluster node") will only come back when you manually start the service.
This also means that when the Core Server starts again, the probe will have been running for a while and should already have polled all sensors at least once, so the Core won't have to wait for all sensors to slowly resume and start delivering data again.
You could also use a Windows Scheduled Task to postpone the start of the Core Server Service until a specific time after the system starts.
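As a rough illustration of the manual-start workaround, the startup type can be changed with the Windows `sc.exe` tool. The Python sketch below only builds that call and runs it on request. It assumes the Core Server Service is registered under the service name `PRTGCoreService` (please verify this in services.msc), and the helper names are made up for this example.

```python
import subprocess

# Assumption: the PRTG Core Server Service is registered under this
# Windows service name. Check services.msc before relying on it.
CORE_SERVICE = "PRTGCoreService"

# Startup types accepted by "sc.exe config ... start= <mode>".
VALID_MODES = {"auto", "demand", "disabled", "delayed-auto"}


def start_mode_command(service, mode):
    """Build the sc.exe call that changes a service's startup type."""
    if mode not in VALID_MODES:
        raise ValueError("unsupported start mode: " + mode)
    # sc.exe expects "start=" and its value as two separate arguments.
    return ["sc.exe", "config", service, "start=", mode]


def set_start_mode(service, mode, execute=False):
    """Return the command; run it only when execute=True (Windows, elevated)."""
    cmd = start_mode_command(service, mode)
    if execute:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `set_start_mode(CORE_SERVICE, "demand", execute=True)` from an elevated session on each node would switch the service to manual start; `"auto"` reverts it. The probe service is left untouched, so it keeps starting on its own.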
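A delayed start like this could be registered with `schtasks`, roughly as in the Python sketch below. It assumes the Core Server Service is set to manual start and is named `PRTGCoreService` (please verify); the task name, the 30-minute delay and the helper function are only illustrative.

```python
import subprocess

# Assumption: Windows service name of the PRTG Core Server Service.
CORE_SERVICE = "PRTGCoreService"


def delayed_start_task_command(service, delay_minutes,
                               task_name="Delayed PRTG Core start"):
    """Build a schtasks call that starts a service some minutes after boot."""
    if not 0 <= delay_minutes <= 9999:
        raise ValueError("delay_minutes must be between 0 and 9999")
    # schtasks expects the delay as mmmm:ss.
    delay = "{:04d}:00".format(delay_minutes)
    return [
        "schtasks", "/Create",
        "/TN", task_name,                     # hypothetical task name
        "/TR", "sc.exe start " + service,     # action: start the service
        "/SC", "ONSTART",                     # trigger: system startup
        "/DELAY", delay,                      # wait before running the action
        "/RU", "SYSTEM",                      # run under the SYSTEM account
        "/F",                                 # overwrite an existing task
    ]


def create_delayed_start_task(service, delay_minutes, execute=False):
    """Return the command; run it only when execute=True (Windows, elevated)."""
    cmd = delayed_start_task_command(service, delay_minutes)
    if execute:
        subprocess.run(cmd, check=True)
    return cmd
```

For example, `create_delayed_start_task(CORE_SERVICE, 30, execute=True)` would create a task that starts the core service 30 minutes after every boot. The delay should be chosen long enough for the probe to have polled all sensors once.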
Best Regards,
Luciano Lingnau [Paessler Support]
Mar, 2017 - Permalink