ESX Server Health Monitoring and Settings

We use PRTG to monitor our ESX 4.1 environment, which includes VM Lab Manager. We use our VMs mostly for performance testing, so CPU and memory usage run pretty high. As a result, we seem to get a lot of downs followed by immediate ups (often within the same minute) related to ESX Server health:

ESX Server Health (VMware Host Server (SOAP))

On the other hand, vCenter and Lab Manager seldom register an error (unless the machine loses communication with them), and users are not impacted during these PRTG alarms. Please also note that the ping sensor never reports the ESX server as down. So the server responds to ping and to the VMware infrastructure, yet fails the ESX Server Health check long enough to trigger an alarm.

My questions are as follows:

1) Could someone explain briefly the basic functions of the ESX Server Health Sensor?

2) How do SOAP-based sensors compare to ICMP? That is, why do pings stay up while these sensor calls go up and down?

3) What are the "best practices" settings for this sensor? Both for monitoring and then for the threshold to trigger an alarm?

Article Comments

I'm experiencing the same thing, a lot. Could somebody from the company respond to this? :O)

Jul, 2011 - Permalink

1. The ESX Server Health sensor shows the same performance values you see when accessing your ESX host with a vSphere Client.

2. SOAP, originally defined as Simple Object Access Protocol, is a protocol specification for exchanging structured information in the implementation of Web Services in computer networks. It relies on Extensible Markup Language (XML) for its message format, and usually relies on other Application Layer protocols, most notably Remote Procedure Call (RPC) and Hypertext Transfer Protocol (HTTP), for message negotiation and transmission.

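SOAP operates at a different OSI layer than ping does! Ping (ICMP) works at the network layer, while SOAP is an application-layer protocol, so a ping can still succeed while everything else on a machine has stopped working.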
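To make the difference concrete, here is a minimal sketch in Python of what a SOAP request body looks like before it is sent over HTTP. The operation name `GetHostHealth`, the namespace `urn:example-monitoring`, and the `hostName` parameter are made up for illustration; they are not the calls PRTG or VMware actually use. The point is only that the target has to receive, parse, and answer an XML document like this, which is far more work than echoing an ICMP packet.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_envelope(operation: str, service_ns: str, params: dict) -> bytes:
    """Wrap a single operation call in a SOAP 1.1 envelope and return XML bytes."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    # The operation element lives in the service's own namespace.
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and namespace, for illustration only.
request_xml = build_soap_envelope(
    "GetHostHealth", "urn:example-monitoring", {"hostName": "esx01"}
)
print(request_xml.decode("utf-8"))
```

A busy host can still answer pings at the network layer while the management service that must process such a request responds too slowly, which would fit the behaviour described in the question.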

3. That depends on the number of VMs you are monitoring. The scanning interval should not be less than 5 minutes. The latency for notifications should be at least as long as the scanning interval, or better, twice as long.
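The effect of that latency rule can be sketched in a few lines of Python. This is not PRTG's implementation, only an illustration of the idea; the class name `AlertDebouncer` and its fields are invented here. It assumes a 5-minute scanning interval and a notification latency of two intervals, so a single failed scan that recovers on the next one never triggers a notification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertDebouncer:
    scan_interval_s: int = 300   # assumed 5-minute scanning interval
    latency_intervals: int = 2   # notify only after 2 intervals of continuous Down
    down_since: Optional[float] = None

    def observe(self, timestamp_s: float, is_up: bool) -> bool:
        """Record one scan result; return True if a notification should fire."""
        if is_up:
            # Any successful scan resets the Down timer.
            self.down_since = None
            return False
        if self.down_since is None:
            self.down_since = timestamp_s
        latency_s = self.scan_interval_s * self.latency_intervals
        return timestamp_s - self.down_since >= latency_s


debouncer = AlertDebouncer()
# (timestamp in seconds, scan succeeded?): one short blip, then a real outage.
scans = [(0, False), (300, True), (600, False), (900, False), (1200, False)]
results = [debouncer.observe(t, up) for t, up in scans]
print(results)  # [False, False, False, False, True]
```

The blip at 0 seconds is cleared by the good scan at 300 seconds, and only the outage that lasts from 600 to 1200 seconds raises a notification.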

Jul, 2011 - Permalink

We see the same results: lots of false positives. We changed the scanning interval to 5 and later 10 minutes and did not see any change. It seems to happen when a peak in CPU or disk usage appears. We are on PRTG 8.4.1.2283.

Sep, 2011 - Permalink
