Hi,
I'm having an odd problem. I monitor some devices with EXE/Script Advanced sensors. Each sensor spawns a Perl script that opens an SSH session, collects some output, and returns XML to PRTG. These generally work fine; I use 15 of these sensors per device, each monitoring different statistics.
However, if I reboot the monitored device (i.e., the device is unreachable for some period and then comes back), the PRTG probe never recovers. The Core/Probe Health sensor goes into the error state with a status along these lines: "275 % Delay (Probe Interval Delay non-WMI&SNMP) is above the error limit of 100 % Delay. Sensors of this probe can not be scanned in their specified intervals. Try longer intervals or distribute load over probes."
The individual sensors give this error status: Timeout caused by wait for mutex (code: PE035)
Rebooting the PRTG probe device resolves this.
Any suggestions? Thanks, Tim
Article Comments
Thank you, that helped a lot. I also improved the error handling for SSH timeouts in the Perl script so that it fails faster and more reliably.
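For anyone hitting the same issue, here is a rough sketch of what that kind of timeout handling can look like. It is written in Python rather than Perl and is not the actual script from this thread. The function names (`collect`, `run_with_timeout`, `prtg_result`, `prtg_error`) are made up for illustration. The idea is to put a hard time limit on the external command and, if it is exceeded, hand PRTG an error in the EXE/Script Advanced XML format instead of letting the sensor hang.

```python
import subprocess
from xml.sax.saxutils import escape


def prtg_error(message):
    # EXE/Script Advanced error response: <error>1</error> plus a message text.
    return "<prtg><error>1</error><text>{}</text></prtg>".format(escape(message))


def prtg_result(channels):
    # Build one <result> element per channel name/value pair.
    parts = ["<prtg>"]
    for name, value in channels.items():
        parts.append(
            "<result><channel>{}</channel><value>{}</value></result>".format(
                escape(name), value
            )
        )
    parts.append("</prtg>")
    return "".join(parts)


def run_with_timeout(cmd, timeout_s):
    # Returns (stdout, None) on success or (None, error_text) on failure.
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return None, "command timed out after {} s".format(timeout_s)
    except OSError as exc:
        return None, "could not start command: {}".format(exc)
    if proc.returncode != 0:
        return None, "command exited with status {}".format(proc.returncode)
    return proc.stdout, None


def collect(cmd, timeout_s, parse):
    # parse is a caller-supplied function: raw output -> {channel: value}.
    output, err = run_with_timeout(cmd, timeout_s)
    if err is not None:
        return prtg_error(err)
    return prtg_result(parse(output))
```

A call might look like `print(collect(["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", "monitor@device.example", "show stats"], 20, my_parser))`, where the host, the remote command, and `my_parser` are placeholders. The overall timeout should stay well below the sensor's own timeout in PRTG so the script always answers first.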
Mar, 2011 - Permalink
Dear Tim,
It seems the sensors wait and block each other when such a target system is down. Can you try adding a Ping sensor to each of these devices and setting it as the "Master Object for parent" (Dependency Setting)? This way the Ping sensor should pause all other sensors on the device when it fails.
Does this help?
Best regards.
Mar, 2011 - Permalink