During the investigation of high disk response times in one VM using vSAN storage, I saw a strange vSAN metric (TCP Connection Half Open Drop Rate).
What is it?
I opened a support ticket with VMware Support (2025-02-13) and started my own troubleshooting in parallel.
vSAN ESA - TCP Connection Half Open Drop issue
Here is the screenshot of vSAN ESA - Half Open Drop Rate over 50% on some vSAN Nodes ...
Figure: vSAN ESA - Half Open Drop Rate over 50% on some vSAN Nodes
Physical infrastructure schema
Here is the physical infrastructure schema of VMware vSAN ESA cluster ...
Figure: The schema of Physical infrastructure
Virtual Networking schema
Here is the virtual networking schema of VMware vSphere ESXi host (vSAN Node) participating in vSAN ESA cluster ...
Figure: Virtual Networking of ESXi Host (vSAN Node)
vSAN Cluster state
- ESX01 dcserv-esx05 192.168.123.21 (agent) [56% half-open drop]
- ESX02 dcserv-esx06 192.168.123.22 (backup) [98% half-open drop]
- ESX03 dcserv-esx07 192.168.123.23 (agent) [54% half-open drop]
- ESX04 dcserv-esx08 192.168.123.24 (agent) [0% half-open drop]
- ESX05 dcserv-esx09 192.168.123.25 (master) [0% half-open drop, but roughly once an hour it shows a 42% - 49% drop]
- ESX06 dcserv-esx10 192.168.123.26 (agent) [0% half-open drop]
Do I have a problem? I’m not certain, but it doesn’t appear to be the case.
I had seen high virtual disk latency on a VM (a Docker host with a single NVMe vDisk) at a storage load below 12,000 IOPS (the IOPS limit is set to 25,000). That is why I looked deeper into the vSAN ESA infrastructure and found the TCP Half Open Drop "issue".
Figure: High vDisk (vNVMe) response times in first week of February
However, IOmeter on a Windows server with a single SCSI vDisk on the SCSI0:0 adapter can generate almost 25,000 IOPS at 0.6 ms response time with a 28.5 KB, 100% read, 100% random storage pattern and 12 workers (threads).
Figure: 12 workers on SCSI vDisk - we see performance of 25,000 IOPS @ 0.6 ms response time
Network Analysis - packet capturing
What is happening on a vSAN node (dcserv-esx06) that is in maintenance mode with all vSAN storage migrated off the node?
[root@dcserv-esx06:/usr/lib/vmware/vsan/bin] pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.22 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
The name of the uplink is vmnic4.
The session capture point is UplinkRcvKernel,UplinkSndKernel.
pktcap: The output file is -.
pktcap: No server port specifed, select 30749 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 30749.
pktcap: Main thread: 305300921536.
pktcap: Dump Thread: 305301452544.
pktcap: The output file format is pcapng.
pktcap: Recv Thread: 305301980928.
pktcap: Accept...
pktcap: Vsock connection from port 1032 cid 2.
reading from file -, link-type EN10MB (Ethernet), snapshot length 65535
09:19:52.104211 IP 192.168.123.22.52611 > 192.168.123.23.2233: Flags [SEW], seq 2769751215, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 401040956 ecr 0], length 0
09:20:52.142511 IP 192.168.123.22.55264 > 192.168.123.23.2233: Flags [SEW], seq 3817033932, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1805625573 ecr 0], length 0
09:21:52.182787 IP 192.168.123.22.57917 > 192.168.123.23.2233: Flags [SEW], seq 2055691008, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 430011832 ecr 0], length 0
09:22:26.956218 IP 192.168.123.22.59456 > 192.168.123.23.2233: Flags [SEW], seq 3524784519, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2597182302 ecr 0], length 0
09:22:52.225550 IP 192.168.123.22.60576 > 192.168.123.23.2233: Flags [SEW], seq 3089565460, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 378912106 ecr 0], length 0
09:23:52.397431 IP 192.168.123.22.63229 > 192.168.123.23.2233: Flags [SEW], seq 2552721354, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2409421282 ecr 0], length 0
09:24:52.436734 IP 192.168.123.22.12398 > 192.168.123.23.2233: Flags [SEW], seq 3269754737, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3563144147 ecr 0], length 0
09:25:52.476565 IP 192.168.123.22.15058 > 192.168.123.23.2233: Flags [SEW], seq 1510936927, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1972989571 ecr 0], length 0
09:26:52.515032 IP 192.168.123.22.17707 > 192.168.123.23.2233: Flags [SEW], seq 262766144, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3787605572 ecr 0], length 0
09:27:52.554904 IP 192.168.123.22.20357 > 192.168.123.23.2233: Flags [SEW], seq 2099691233, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2472387791 ecr 0], length 0
09:28:52.598409 IP 192.168.123.22.23017 > 192.168.123.23.2233: Flags [SEW], seq 1560369055, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 688302913 ecr 0], length 0
09:29:52.641938 IP 192.168.123.22.25663 > 192.168.123.23.2233: Flags [SEW], seq 394113563, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3836880073 ecr 0], length 0
09:30:52.682276 IP 192.168.123.22.28221 > 192.168.123.23.2233: Flags [SEW], seq 4232787521, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 830544087 ecr 0], length 0
09:31:52.726506 IP 192.168.123.22.30871 > 192.168.123.23.2233: Flags [SEW], seq 3529232466, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3037414646 ecr 0], length 0
09:32:52.768689 IP 192.168.123.22.33520 > 192.168.123.23.2233: Flags [SEW], seq 3467993307, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3716244554 ecr 0], length 0
09:33:52.809641 IP 192.168.123.22.36184 > 192.168.123.23.2233: Flags [SEW], seq 2859309873, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1556603624 ecr 0], length 0
09:34:52.849282 IP 192.168.123.22.38830 > 192.168.123.23.2233: Flags [SEW], seq 891574849, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 226049490 ecr 0], length 0
09:35:52.889434 IP 192.168.123.22.41487 > 192.168.123.23.2233: Flags [SEW], seq 1629372626, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 100385827 ecr 0], length 0
09:36:52.931192 IP 192.168.123.22.44140 > 192.168.123.23.2233: Flags [SEW], seq 3898717755, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3230029896 ecr 0], length 0
09:37:52.972758 IP 192.168.123.22.46788 > 192.168.123.23.2233: Flags [SEW], seq 3798420138, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1400467195 ecr 0], length 0
09:38:53.013565 IP 192.168.123.22.49449 > 192.168.123.23.2233: Flags [SEW], seq 1759807546, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1072184991 ecr 0], length 0
09:39:53.055394 IP 192.168.123.22.52096 > 192.168.123.23.2233: Flags [SEW], seq 2996482935, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3573008833 ecr 0], length 0
09:40:53.095123 IP 192.168.123.22.54754 > 192.168.123.23.2233: Flags [SEW], seq 103237119, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3275581229 ecr 0], length 0
09:41:53.136593 IP 192.168.123.22.57408 > 192.168.123.23.2233: Flags [SEW], seq 2105630912, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1990595855 ecr 0], length 0
09:42:53.178033 IP 192.168.123.22.60054 > 192.168.123.23.2233: Flags [SEW], seq 4245039293, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 296668711 ecr 0], length 0
09:43:38.741557 IP 192.168.123.22.62070 > 192.168.123.23.2233: Flags [SEW], seq 343657957, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3406471577 ecr 0], length 0
09:43:53.219844 IP 192.168.123.22.62713 > 192.168.123.23.2233: Flags [SEW], seq 452468561, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3555078978 ecr 0], length 0
09:44:53.264107 IP 192.168.123.22.11779 > 192.168.123.23.2233: Flags [SEW], seq 3807775128, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3836709718 ecr 0], length 0
09:45:53.306117 IP 192.168.123.22.14431 > 192.168.123.23.2233: Flags [SEW], seq 3580778695, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3478626421 ecr 0], length 0
09:46:53.348438 IP 192.168.123.22.17083 > 192.168.123.23.2233: Flags [SEW], seq 1098229669, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2219974257 ecr 0], length 0
09:47:53.386992 IP 192.168.123.22.19737 > 192.168.123.23.2233: Flags [SEW], seq 1338972264, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 708281300 ecr 0], length 0
09:48:53.426861 IP 192.168.123.22.22389 > 192.168.123.23.2233: Flags [SEW], seq 3973038592, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3153895628 ecr 0], length 0
09:49:53.469640 IP 192.168.123.22.25046 > 192.168.123.23.2233: Flags [SEW], seq 2367639206, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3155172682 ecr 0], length 0
09:50:53.510996 IP 192.168.123.22.27703 > 192.168.123.23.2233: Flags [SEW], seq 515312838, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3434645295 ecr 0], length 0
How does TCP SYN/SYN-ACK behave between DCSERV-ESX06 and other vSAN nodes?
The ESXi command to sniff TCP SYN packets from DCSERV-ESX06 (192.168.123.22) to DCSERV-ESX07 (192.168.123.23) is
pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.22 and dst host 192.168.123.23 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
The command to sniff the TCP SYN-ACK packets coming back is
pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.23 and dst host 192.168.123.22 and tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)'
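Both filter expressions are plain bit tests on the TCP flags byte. The following is a minimal Python sketch of the same logic, using the standard TCP flag bit values. The function names are illustrative only and are not part of tcpdump or pktcap-uw.

```python
# Standard TCP flag bits (the byte tcpdump exposes as tcp[tcpflags]).
FIN, SYN, RST, PSH, ACK, URG, ECE, CWR = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80


def is_initial_syn(flags: int) -> bool:
    """Mirror of: tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0"""
    return (flags & SYN) != 0 and (flags & ACK) == 0


def is_syn_ack(flags: int) -> bool:
    """Mirror of: tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)"""
    return (flags & (SYN | ACK)) == (SYN | ACK)


# tcpdump prints "Flags [SEW]" for SYN + ECE + CWR, i.e. a SYN that offers ECN.
sew = SYN | ECE | CWR

for label, flags in [("SEW", sew), ("SYN/ACK", SYN | ACK), ("RST/ACK", RST | ACK)]:
    print(label, hex(flags), is_initial_syn(flags), is_syn_ack(flags))
```

A RST/ACK segment matches neither filter, so a capture taken with these two filters cannot show connection resets.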
Here are the observations from the sniffing exercise.
No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX05 (192.168.123.21, agent vSAN node) during the limited sniffing time (several minutes).
Between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX07 (192.168.123.23, agent vSAN node), a new TCP connection is established (SYN/SYN-ACK) every minute.
No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX08 (192.168.123.24, agent vSAN node) during the limited sniffing time (several minutes).
No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX09 (192.168.123.25, master vSAN node) during the limited sniffing time (several minutes).
No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX10 (192.168.123.26, agent vSAN node) during the limited sniffing time (several minutes).
Interesting observation
A new TCP connection between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX07 (192.168.123.23, agent vSAN node) is usually established (SYN/SYN-ACK) every minute.
Why is this happening only between DCSERV-ESX06 (backup node) and DCSERV-ESX07 (agent node) and not with the other nodes? I do not know.
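The "every minute" cadence can be checked from the tcpdump text output instead of by eye. Here is a small Python sketch that parses the timestamps and prints the gap between consecutive SYN packets. It assumes the default tcpdump timestamp format (HH:MM:SS.microseconds) and a capture that does not cross midnight. The sample lines are the first four packets from the capture above.

```python
import re
from datetime import datetime

SAMPLE = """\
09:19:52.104211 IP 192.168.123.22.52611 > 192.168.123.23.2233: Flags [SEW], seq 2769751215, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 401040956 ecr 0], length 0
09:20:52.142511 IP 192.168.123.22.55264 > 192.168.123.23.2233: Flags [SEW], seq 3817033932, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1805625573 ecr 0], length 0
09:21:52.182787 IP 192.168.123.22.57917 > 192.168.123.23.2233: Flags [SEW], seq 2055691008, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 430011832 ecr 0], length 0
09:22:26.956218 IP 192.168.123.22.59456 > 192.168.123.23.2233: Flags [SEW], seq 3524784519, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2597182302 ecr 0], length 0
"""

# Timestamp, source, destination and flags of one tcpdump text line.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) IP (\S+) > (\S+): Flags \[([^\]]+)\]")


def syn_gaps(text):
    """Return the seconds between consecutive packets in tcpdump text output."""
    stamps = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            # No date in the output, so a capture crossing midnight would
            # produce a negative gap here.
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]


print(syn_gaps(SAMPLE))
```

Gaps close to 60 seconds point to a periodic timer, while the shorter gap before the 09:22:26 packet shows that some connection attempts fall outside that cadence.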
Further TCP network troubleshooting
The next step is to collect TCP SYN, TCP SYN/ACK, TCP statistics, and network statistics on DCSERV-ESX06 (the most "problematic" vSAN node) and DCSERV-ESX10 (a non-"problematic" vSAN node) into files. I will capture data for one hour (60 minutes) so that I can compare the number of SYN and SYN/ACK packets with the TCP and network statistics.
Capturing of TCP SYN
timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' > /tmp/dcserv-esx06_tcp-syn.dump
timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' > /tmp/dcserv-esx10_tcp-syn.dump
Capturing of TCP SYN/ACK
timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)' > /tmp/dcserv-esx06_tcp-syn_ack.dump
timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)' > /tmp/dcserv-esx10_tcp-syn_ack.dump
Capturing of TCP Statistics
for i in $(seq 60); do { date; vsish -e get /net/tcpip/instances/defaultTcpipStack/stats/tcp; } >> /tmp/dcserv-esx06_tcp_stats; sleep 60; done
for i in $(seq 60); do { date; vsish -e get /net/tcpip/instances/defaultTcpipStack/stats/tcp; } >> /tmp/dcserv-esx10_tcp_stats; sleep 60; done
Capturing of Network Statistics
net-stats captures run for 60 minutes: 120 samples x 30 seconds = 3,600 seconds = 60 minutes
for i in $(seq 120); do { date; net-stats -A -t WwQqihVv -i 30; } >> /tmp/dcserv-esx06_netstats ; done
for i in $(seq 120); do { date; net-stats -A -t WwQqihVv -i 30; } >> /tmp/dcserv-esx10_netstats ; done
Output Files Comparison
ESX06
tcpdump
15:48:32.422347 - 16:48:16.542078: 199 TCP SYN
15:49:16.434140 - 16:48:46.533262: 199 TCP SYN/ACK
Fri Mar 7 15:49:10 UTC 2025
tcp_statistics
connattempt:253432751
accepts:3996127
connects:8341861
drops:4778493
conndrops:247894569
minmssdrops:0
closed:257671058
Fri Mar 7 16:48:10 UTC 2025
tcp_statistics
connattempt:253587720
accepts:3997071
connects:8345071
drops:4781004
conndrops:248047267
minmssdrops:0
closed:257827086
tcp_statistics difference
connattempt:154969
accepts:944
connects:3210
drops:2511
conndrops:152698
minmssdrops:0
closed:156028
ESX10
tcpdump
15:49:44.554242 - 16:49:16.544940: 179 TCP SYN
15:50:16.441776 - 16:49:54.142493: 185 TCP SYN/ACK
Fri Mar 7 15:50:49 UTC 2025
tcp_statistics
connattempt:826534
accepts:2278888
connects:3105348
drops:1414905
conndrops:74
minmssdrops:0
closed:3338137
Fri Mar 7 16:49:49 UTC 2025
tcp_statistics
connattempt:826864
accepts:2279789
connects:3106579
drops:1415439
conndrops:74
minmssdrops:0
closed:3339470
tcp_statistics difference
connattempt:330
accepts:901
connects:1231
drops:534
conndrops:0
minmssdrops:0
closed:1333
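The differences above can be recomputed from the two snapshots, which also makes it easy to relate the counters to each other. In BSD-derived TCP stacks, conndrops counts embryonic (not yet established) connections that were dropped. The sketch below divides the conndrops delta by the connattempt delta. Treating that ratio as the vSAN "Half Open Drop Rate" is an assumption and has not been confirmed by VMware. The helper names are illustrative.

```python
def parse_counters(text):
    """Parse 'name:value' lines; date and header lines are skipped."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and value.strip().isdigit():
            counters[name] = int(value)
    return counters


def delta(before, after):
    """Per-counter difference between two snapshots."""
    return {key: after[key] - before[key] for key in before if key in after}


def drop_ratio(diff):
    """ASSUMPTION: half-open drop ratio = conndrops / connattempt."""
    attempts = diff.get("connattempt", 0)
    return diff.get("conndrops", 0) / attempts if attempts else 0.0


# ESX06 snapshots at 15:49:10 and 16:48:10 UTC, copied from the output above.
ESX06_START = """Fri Mar 7 15:49:10 UTC 2025
tcp_statistics
connattempt:253432751
accepts:3996127
connects:8341861
drops:4778493
conndrops:247894569
minmssdrops:0
closed:257671058"""

ESX06_END = """Fri Mar 7 16:48:10 UTC 2025
tcp_statistics
connattempt:253587720
accepts:3997071
connects:8345071
drops:4781004
conndrops:248047267
minmssdrops:0
closed:257827086"""

esx06 = delta(parse_counters(ESX06_START), parse_counters(ESX06_END))
esx10 = {"connattempt": 330, "conndrops": 0}  # ESX10 differences from above

print("ESX06", esx06, f"{drop_ratio(esx06):.1%}")
print("ESX10", f"{drop_ratio(esx10):.1%}")
```

Under that assumption, ESX06 dropped about 98.5% of its connection attempts during the hour, which is in line with the 98% half-open drop shown for dcserv-esx06 in the cluster state list, while ESX10 dropped none.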
What does it mean? I don't know. I have a VMware support case open and am waiting for their analysis.
There were various calls with various parts of VMware support, but here is the first meaningful response from VMware support (2025-04-03, 50 days after opening the support ticket):
Your capture is highly filtered and many details are missing. Please consider the following points when collecting the capture:
- Use the pktcap-uw command and capture in .pcap format. Collecting all the data in a single file will help us trace packets to specific connections.
- Capture all TCP packets, not just SYN/SYN-ACK. Half-open drops are usually caused by RESET packets
- TCP uses the same set of statistics for the entire network stack. Therefore, we must collect packets from all vmk interfaces in the default network stack, or from a common uplink.
You can use a command similar to below one:
pktcap-uw --vmk <vmk> --proto 0x6 --dir 2 -o <file.pcap>
pktcap-uw --uplink <vmnic> --proto 0x6 --dir 2 -o <file.pcap>
OK, no problem. Let's capture everything going through the uplink used by vSAN.
My vSAN ESA VMkernel interface is pinned to vmnic4, so I ran the uplink variant of the command above against vmnic4.
It is a good idea to monitor datastore usage, because the capture dumps 30 GB of network traffic in 4 minutes.
Another meaningful communication with VMware support (2025-05-08 - 85 days after opening a support ticket)
VMware support asked me for another packet capture. They want a packet capture not only from the uplink used for vSAN traffic (vmnic4), but also from the uplinks vmnic0, vmnic1, and vmnic5, which carry vSphere management traffic.
Below is the one-liner I used to capture network traffic and split it into ~2 GB (2,000 MB) files, as requested by VMware support.
timeout -t 360 pktcap-uw --uplink vmnic1 --proto 0x6 --dir 2 -o - | tcpdump-uw -r - -w vmnic1-pcap -C 2000 & \
timeout -t 360 pktcap-uw --uplink vmnic4 --proto 0x6 --dir 2 -o - | tcpdump-uw -r - -w vmnic4-pcap -C 2000 & \
timeout -t 360 pktcap-uw --uplink vmnic5 --proto 0x6 --dir 2 -o - | tcpdump-uw -r - -w vmnic5-pcap -C 2000 &
Explanation of the one-liner above:
- timeout -t 360: limits packet capturing to 6 minutes to keep the overall packet capture size below 30 GB
- -o -: sends the raw capture data to stdout
- tcpdump-uw -r -: reads the capture from stdin
- -w vmnic1-pcap: writes the capture to files with this base name (one base name per uplink)
- -C 2000: starts a new output file every 2,000 MB (~2 GB)
I've sent this new packet capture to VMware Support again and waited for their response.
Another meaningful communication with VMware support (2025-05-15 - 92 days after opening a support ticket)
Hello David,
Etcd is the misbehaving application. Looks like some of the hosts (100.68.81.23 and 100.68.81.21) dont have etcd configured and this host is trying to reach them. Can you help check why this configuration is missing on some of the hosts.
34 0.087251 0.000057000 100.68.81.23 100.68.81.22 2380 → 58192 [RST, ACK] Seq=0 Ack=2589825032 Win=0 Len=0 34
35 0.087370 0.000119000 100.68.81.23 100.68.81.22 2380 → 58193 [RST, ACK] Seq=0 Ack=1816019462 Win=0 Len=0 35
38 0.093287 0.000060000 100.68.81.21 100.68.81.22 2380 → 58194 [RST, ACK] Seq=0 Ack=3524013708 Win=0 Len=0 38
39 0.093407 0.000120000 100.68.81.21 100.68.81.22 2380 → 58195 [RST, ACK] Seq=0 Ack=2552292164 Win=0 Len=0 39
42 0.186674 0.000065000 100.68.81.23 100.68.81.22 2380 → 58196 [RST, ACK] Seq=0 Ack=428680618 Win=0 Len=0 42
43 0.186793 0.000119000 100.68.81.23 100.68.81.22 2380 → 58197 [RST, ACK] Seq=0 Ack=1113298373 Win=0 Len=0 43
46 0.193167 0.000056000 100.68.81.21 100.68.81.22 2380 → 58198 [RST, ACK] Seq=0 Ack=1739165024 Win=0 Len=0 46
47 0.193286 0.000119000 100.68.81.21 100.68.81.22 2380 → 58199 [RST, ACK] Seq=0 Ack=3827463043 Win=0 Len=0 47
50 0.286874 0.000073000 100.68.81.23 100.68.81.22 2380 → 58201 [RST, ACK] Seq=0 Ack=1641220058 Win=0 Len=0 50
51 0.286874 0.000000000 100.68.81.23 100.68.81.22 2380 → 58200 [RST, ACK] Seq=0 Ack=1825411290 Win=0 Len=0 51
./var/run/log/etcd.log:1556:2025-02-13T12:59:27Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1557:2025-02-13T12:59:30Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
./var/run/log/etcd.log:1558:2025-02-13T12:59:30Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
./var/run/log/etcd.log:1560:2025-02-13T12:59:32Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1561:2025-02-13T12:59:32Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1562:2025-02-13T12:59:35Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
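To see which peers refuse the connection and how often, the etcd.log lines can be tallied per peer. The following Python sketch does this for the six log lines quoted above. The regular expression is written against exactly this message format and is an assumption about how stable that format is across etcd versions.

```python
import re
from collections import Counter

# Matches the "connection refused" health-check message quoted above.
PEER_RE = re.compile(
    r"health check for peer (?P<peer>[0-9a-f]+) could not connect: "
    r"dial tcp (?P<addr>[\d.]+:\d+): connect: connection refused"
)

SAMPLE_LOG = """\
./var/run/log/etcd.log:1556:2025-02-13T12:59:27Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1557:2025-02-13T12:59:30Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
./var/run/log/etcd.log:1558:2025-02-13T12:59:30Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
./var/run/log/etcd.log:1560:2025-02-13T12:59:32Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1561:2025-02-13T12:59:32Z Wa(4) etcd[28532348]: health check for peer 7312e1f21f195833 could not connect: dial tcp 100.68.81.21:2380: connect: connection refused
./var/run/log/etcd.log:1562:2025-02-13T12:59:35Z Wa(4) etcd[28532348]: health check for peer 5c34e4f236d566f0 could not connect: dial tcp 100.68.81.23:2380: connect: connection refused
"""


def refused_by_peer(log_text):
    """Count refused health checks per (peer id, peer address)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PEER_RE.search(line)
        if match:
            counts[(match.group("peer"), match.group("addr"))] += 1
    return counts


for (peer, addr), count in sorted(refused_by_peer(SAMPLE_LOG).items()):
    print(peer, addr, count)
```

In this excerpt both peers are refused three times each within eight seconds, on the etcd peer port 2380.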
tail -f /var/run/log/etcd.log # What is the last etcd.log log entry?
ps | grep etcd # Does etcd process run in ESXi host?
DCSERV-ESX05
DCSERV-ESX06
What does this all mean?
- Why is etcd running on two ESXi hosts when I have just vSphere and vSAN? There is no Tanzu (aka VMware vSphere Kubernetes Service) enabled.
- I realized that the two running etcd instances could be associated with the two vCLS Pods, and when I consulted ChatGPT, I got the following answers
- In 8.0.2 and newer, VMware started shifting vCLS to “vCLS Pods”, running containers inside the VM, using a small internal container runtime.
- VMware uses etcd inside these pods as part of the vCLS control plane
- vCLS Pods communicate over port 2380, which is etcd’s peer port
I will share my findings and thoughts with VMware support and wait for their response, because ChatGPT's answers cannot be trusted without verification and vendor support is the main authority for their product.
Another meaningful communication with VMware support (2025-05-23 - 100 days after opening a support ticket)
just to follow up on previous mail
I checked this internally, etcd can run even if WCP/TKG isn't in use, this could be a 3 etcd node cluster, so may not be running on some hosts,
The number of half open drops are increasing because the connection requests are being denied by the other host as the service is not currently running on them.
Can you send me the output of the below command on the vcenter
/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
Can you also upload a full vcenter log bundle along with the host logs
What is the /usr/lib/vmware/clusterAgent/bin/clusterAdmin command?
The clusterAdmin tool in VMware ESXi is a command-line utility used for managing and administering vSphere clustering functionality, particularly vSphere HA (High Availability) and DRS (Distributed Resource Scheduler) operations at the host level. This tool is part of the cluster agent infrastructure that runs on each ESXi host and handles communication between the host and vCenter Server for cluster-related operations.
Primary Functions:
- Managing cluster membership and host participation in vSphere clusters
- Configuring and troubleshooting vSphere HA settings on individual hosts
- Handling cluster state information and synchronization
- Managing resource pool configurations and DRS policies
- Performing cluster-related diagnostic operations
Common Use Cases:
- Troubleshooting cluster connectivity issues
- Manually triggering cluster reconfiguration operations
- Checking cluster agent status and health
- Resetting cluster configuration when hosts become disconnected
- Diagnosing HA or DRS failures
Typical Usage: The tool is usually invoked with various subcommands and parameters, such as:
- Status checking operations
- Configuration reset commands
- Cluster membership management
- Resource allocation adjustments
This utility is primarily intended for VMware support engineers and advanced administrators who need to perform low-level cluster troubleshooting or maintenance operations that aren't available through the vSphere Client interface. It's part of the internal clustering infrastructure and should be used carefully, typically only when directed by VMware support or when following specific troubleshooting procedures.
Well, that's the case here. The VMware support engineer (TSE) asked for the command output, so here are the outputs from all ESXi hosts in the vSphere/vSAN cluster ...
dcserv-esx05
[root@dcserv-esx05:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "hosted",
"cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": true,
"members": {
"available": false
}
}
[root@dcserv-esx05:~]
dcserv-esx06
[root@dcserv-esx06:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "hosted",
"cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
"is_in_alarm": true,
"alarm_cause": "Timeout",
"is_in_cluster": true,
"members": {
"available": false
}
}
[root@dcserv-esx06:~]
dcserv-esx07
[root@dcserv-esx07:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "hosted",
"cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": true,
"members": {
"available": false
}
}
[root@dcserv-esx07:~]
dcserv-esx08
[root@dcserv-esx08:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "standalone",
"cluster_id": "",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": false,
"members": {
"available": false
}
}
[root@dcserv-esx08:~]
dcserv-esx09
[root@dcserv-esx09:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "hosted",
"cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": true,
"members": {
"available": true
},
"namespaces": [
{
"name": "root",
"up_to_date": true,
"members": [
{
"peer_address": "dcserv-esx09.dcserv.cloud:2380",
"api_address": "dcserv-esx09.dcserv.cloud:2379",
"reachable": true,
"primary": "yes",
"learner": false
}
]
}
]
}
[root@dcserv-esx09:~]
dcserv-esx10
[root@dcserv-esx10:~] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
{
"state": "standalone",
"cluster_id": "",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": false,
"members": {
"available": false
}
}
[root@dcserv-esx10:~]
It seems to me that the output above means that
- 4 nodes (dcserv-esx05, dcserv-esx06, dcserv-esx07, dcserv-esx09) are in the cluster
- only dcserv-esx09 has members available
- dcserv-esx06 is in an alarm state and the alarm cause is Timeout
- all other nodes are not in an alarm state
- when I check whether etcd is running (ps | grep etcd), it runs only on the following two ESXi hosts
- dcserv-esx06, dcserv-esx09
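Comparing six JSON outputs by eye is error-prone, so the same summary can be produced mechanically. Here is a minimal Python sketch that reduces one clusterAdmin output to the fields discussed above. It relies only on the keys visible in the outputs shown here. Any other keys the tool may emit are unknown to me, and the function name is illustrative.

```python
import json

# Two of the outputs above, copied verbatim (dcserv-esx06 and dcserv-esx09).
SAMPLES = {
    "dcserv-esx06": """{
        "state": "hosted",
        "cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
        "is_in_alarm": true, "alarm_cause": "Timeout", "is_in_cluster": true,
        "members": {"available": false}
    }""",
    "dcserv-esx09": """{
        "state": "hosted",
        "cluster_id": "5bab0e84-305e-4966-ae6e-b9386c6b19f3:domain-c2051",
        "is_in_alarm": false, "alarm_cause": "", "is_in_cluster": true,
        "members": {"available": true},
        "namespaces": [{"name": "root", "up_to_date": true, "members": [
            {"peer_address": "dcserv-esx09.dcserv.cloud:2380",
             "api_address": "dcserv-esx09.dcserv.cloud:2379",
             "reachable": true, "primary": "yes", "learner": false}]}]
    }""",
}


def summarize(host, raw_json):
    """Reduce one 'clusterAdmin cluster status' output to a flat summary."""
    status = json.loads(raw_json)
    peers = [
        member["peer_address"]
        for namespace in status.get("namespaces", [])
        for member in namespace.get("members", [])
    ]
    return {
        "host": host,
        "state": status.get("state"),
        "in_cluster": status.get("is_in_cluster"),
        "alarm": status.get("alarm_cause") if status.get("is_in_alarm") else None,
        "members_available": status.get("members", {}).get("available"),
        "peers": peers,
    }


for host, raw in SAMPLES.items():
    print(summarize(host, raw))
```

Run against all six outputs, this gives one row per host and makes the odd ones out (the Timeout alarm on dcserv-esx06, the single listed member on dcserv-esx09) easy to spot.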
The VMware TSE mentioned that "etcd can run even if WCP/TKG isn't in use, this could be a 3 etcd node cluster". However, I see the etcd service running on only two of the six ESXi hosts, while the TSE believes three nodes should be running. This leads to the following questions ...
Q1: What is the purpose of a 3-node etcd cluster in a vSphere/vSAN cluster?
Q2: Why are only 2 nodes running?
Anyway, I do not understand the /usr/lib/vmware/clusterAgent/bin/clusterAdmin tool. It is a low-level internal VMware tool, so let's wait for the next VMware Support follow-up.
System Logs from vCenter along with the host logs have been exported and uploaded to VMware Support Case. I'm looking forward to seeing if this will help VMware support to identify the root cause.