
Thursday, February 13, 2025

VMware vSAN ESA on Cisco UCS - TCP Connection Half Open Drop Rate

I noticed a strange vSAN metric, TCP Connection Half Open Drop Rate. What is it? In TCP terms, a half-open (embryonic) connection is one whose three-way handshake has not yet completed, and this metric reports the share of such connections that get dropped.

vSAN ESA - TCP Connection Half Open Drop issue

Here is a screenshot showing a vSAN ESA Half Open Drop Rate over 50% on some vSAN nodes ...

vSAN ESA - Half Open Drop Rate over 50% on some vSAN Nodes

Physical infrastructure schema

Here is the physical infrastructure schema of the VMware vSAN ESA cluster ...

The schema of Physical infrastructure

vSAN Cluster state

  • ESX01 dcserv-esx05 192.168.123.21 (agent)  [56% half-open drop]
  • ESX02 dcserv-esx06 192.168.123.22 (backup) [98% half-open drop]
  • ESX03 dcserv-esx07 192.168.123.23 (agent)  [54% half-open drop]
  • ESX04 dcserv-esx08 192.168.123.24 (agent)  [0% half-open drop]
  • ESX05 dcserv-esx09 192.168.123.25 (master) [0% half-open drop], but roughly once an hour it spikes to a 42%–49% drop
  • ESX06 dcserv-esx10 192.168.123.26 (agent)  [0% half-open drop]

Do I have a problem? I'm not certain, but it doesn't appear so.

I had seen high virtual disk latency on a VM (a Docker host with a single NVMe vDisk) at a storage load below 12,000 IOPS (with the IOPS limit set to 25,000), which is why I started checking the vSAN ESA infrastructure more deeply and found the TCP Half Open Drop "issue".

High vDisk (vNVMe) response times in first week of February

However, IOmeter on a Windows server with a single SCSI vDisk on the SCSI0:0 adapter is able to generate almost 25,000 IOPS @ 0.6 ms response time with a 28.5KB-100%_read-100%_random storage pattern and 12 workers (threads).

12 workers on SCSI vDisk - we see performance of 25,000 IOPS @ 0.6 ms response time
 
It is worth mentioning that roughly 2,600 IOPS (512B I/O size) to 1,400 IOPS (1MB I/O size) per storage worker is an "artificial" throughput limit not only of vSAN but of any shared enterprise storage, and for good reason (the explanation of that reason is another topic). It is therefore essential to use more workers (threads, outstanding I/Os) to achieve higher performance/throughput. Below is the performance result of a single worker (thread) with 4KB I/O size.
 
Single worker (thread) with 4KB I/O size
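The per-worker ceiling follows from Little's Law (throughput ≈ outstanding I/Os ÷ response time), which is general queueing behavior rather than anything vSAN-specific. A quick back-of-the-envelope sketch, assuming roughly 0.4 ms per I/O for a single outstanding request:

```shell
# Little's Law sketch: with one outstanding I/O per worker, throughput is capped
# at 1/latency. At ~0.4 ms per I/O a single worker tops out near 2,500 IOPS,
# in the same ballpark as the ~2,600 IOPS per worker mentioned above.
awk 'BEGIN { oio = 1; rt_s = 0.0004; printf "%.0f IOPS\n", oio / rt_s }'
# → 2500 IOPS
```

Raising the worker count multiplies the outstanding I/Os, which is exactly why the multi-worker runs below scale so much higher.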
 
So, let's use more workers (more storage threads = a higher queue depth = more parallelization) and test how many 28.5KB IOPS we can achieve on a single vDisk.
 
With 64 workers (IOmeter_64workers_28.5KB_IO_100%_read_100%_random) I can generate 108,000 IOPS @ 0.6 ms response time.
 
64 workers on SCSI vDisk we see performance of 108,000 IOPS @ 0.6 ms response time
 
It is important to mention that all the tests above were done on a SCSI vDisk on a PVSCSI adapter, which has a queue depth of 256, so performance can theoretically scale up to 256 workers, if the storage subsystem allows it.
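The same Little's Law arithmetic gives a rough sanity check of these numbers (a sketch, not a measurement): 64 outstanding I/Os at 0.6 ms lands close to the measured ~108,000 IOPS, while a full 256-deep queue at the same latency would sit far higher, suggesting the adapter queue itself was not the limit in these tests.

```shell
# IOPS ≈ outstanding I/Os / response time (Little's Law)
awk 'BEGIN { printf "64 OIO:  %.0f IOPS\n256 OIO: %.0f IOPS\n", 64/0.0006, 256/0.0006 }'
# → 64 OIO:  106667 IOPS
# → 256 OIO: 426667 IOPS
```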
 
However, if we use 128 workers (IOmeter_64workers_28.5KB_IO_100%_read_100%_random), we can see that the storage subsystem does not handle it: at 98,000 IOPS, performance is even lower than with 64 workers, and the response time increases to 1.3 ms.
 
128 workers on SCSI vDisk we see performance of 98,300 IOPS @ 1.3 ms response time
 
If we use the same storage workload with 128 workers (IOmeter_64workers_28.5KB_IO_100%_read_100%_random) but with an NVMe vDisk instead of a SCSI vDisk, we can see that the storage subsystem can handle 108,000 IOPS @ 1.2 ms, but that is still worse than 64 workers on a SCSI vDisk from a response-time perspective (1.2 ms vs. 0.6 ms).
 
128 workers on NVMe vDisk we see performance of 108,000 IOPS @ 1.2 ms response time
 
If we test 64 workers on an NVMe vDisk, we see performance of 110,000 IOPS @ 0.6 ms response time.
 
64 workers on NVMe vDisk we see performance of 110,000 IOPS @ 0.6 ms response time
 
Anyway, all the tests above show pretty good storage performance on a vSAN ESA cluster experiencing a high TCP Connection Half Open Drop Rate.

Network Analysis - packet capturing

What is happening on the vSAN node (dcserv-esx06) in maintenance mode, with all vSAN storage migrated off the node?

[root@dcserv-esx06:/usr/lib/vmware/vsan/bin] pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.22 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
The name of the uplink is vmnic4.
The session capture point is UplinkRcvKernel,UplinkSndKernel.
pktcap: The output file is -.
pktcap: No server port specifed, select 30749 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 30749.
pktcap: Main thread: 305300921536.
pktcap: Dump Thread: 305301452544.
pktcap: The output file format is pcapng.
pktcap: Recv Thread: 305301980928.
pktcap: Accept...
reading from file -pktcap: Vsock connection from port 1032 cid 2.
, link-type EN10MB (Ethernet), snapshot length 65535
09:19:52.104211 IP 192.168.123.22.52611 > 192.168.123.23.2233: Flags [SEW], seq 2769751215, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 401040956 ecr 0], length 0
09:20:52.142511 IP 192.168.123.22.55264 > 192.168.123.23.2233: Flags [SEW], seq 3817033932, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1805625573 ecr 0], length 0
09:21:52.182787 IP 192.168.123.22.57917 > 192.168.123.23.2233: Flags [SEW], seq 2055691008, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 430011832 ecr 0], length 0
09:22:26.956218 IP 192.168.123.22.59456 > 192.168.123.23.2233: Flags [SEW], seq 3524784519, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2597182302 ecr 0], length 0
09:22:52.225550 IP 192.168.123.22.60576 > 192.168.123.23.2233: Flags [SEW], seq 3089565460, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 378912106 ecr 0], length 0
09:23:52.397431 IP 192.168.123.22.63229 > 192.168.123.23.2233: Flags [SEW], seq 2552721354, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2409421282 ecr 0], length 0
09:24:52.436734 IP 192.168.123.22.12398 > 192.168.123.23.2233: Flags [SEW], seq 3269754737, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3563144147 ecr 0], length 0
09:25:52.476565 IP 192.168.123.22.15058 > 192.168.123.23.2233: Flags [SEW], seq 1510936927, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1972989571 ecr 0], length 0
09:26:52.515032 IP 192.168.123.22.17707 > 192.168.123.23.2233: Flags [SEW], seq 262766144, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3787605572 ecr 0], length 0
09:27:52.554904 IP 192.168.123.22.20357 > 192.168.123.23.2233: Flags [SEW], seq 2099691233, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2472387791 ecr 0], length 0
09:28:52.598409 IP 192.168.123.22.23017 > 192.168.123.23.2233: Flags [SEW], seq 1560369055, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 688302913 ecr 0], length 0
09:29:52.641938 IP 192.168.123.22.25663 > 192.168.123.23.2233: Flags [SEW], seq 394113563, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3836880073 ecr 0], length 0
09:30:52.682276 IP 192.168.123.22.28221 > 192.168.123.23.2233: Flags [SEW], seq 4232787521, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 830544087 ecr 0], length 0
09:31:52.726506 IP 192.168.123.22.30871 > 192.168.123.23.2233: Flags [SEW], seq 3529232466, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3037414646 ecr 0], length 0
09:32:52.768689 IP 192.168.123.22.33520 > 192.168.123.23.2233: Flags [SEW], seq 3467993307, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3716244554 ecr 0], length 0
09:33:52.809641 IP 192.168.123.22.36184 > 192.168.123.23.2233: Flags [SEW], seq 2859309873, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1556603624 ecr 0], length 0
09:34:52.849282 IP 192.168.123.22.38830 > 192.168.123.23.2233: Flags [SEW], seq 891574849, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 226049490 ecr 0], length 0
09:35:52.889434 IP 192.168.123.22.41487 > 192.168.123.23.2233: Flags [SEW], seq 1629372626, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 100385827 ecr 0], length 0
09:36:52.931192 IP 192.168.123.22.44140 > 192.168.123.23.2233: Flags [SEW], seq 3898717755, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3230029896 ecr 0], length 0
09:37:52.972758 IP 192.168.123.22.46788 > 192.168.123.23.2233: Flags [SEW], seq 3798420138, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1400467195 ecr 0], length 0
09:38:53.013565 IP 192.168.123.22.49449 > 192.168.123.23.2233: Flags [SEW], seq 1759807546, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1072184991 ecr 0], length 0
09:39:53.055394 IP 192.168.123.22.52096 > 192.168.123.23.2233: Flags [SEW], seq 2996482935, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3573008833 ecr 0], length 0
09:40:53.095123 IP 192.168.123.22.54754 > 192.168.123.23.2233: Flags [SEW], seq 103237119, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3275581229 ecr 0], length 0
09:41:53.136593 IP 192.168.123.22.57408 > 192.168.123.23.2233: Flags [SEW], seq 2105630912, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 1990595855 ecr 0], length 0
09:42:53.178033 IP 192.168.123.22.60054 > 192.168.123.23.2233: Flags [SEW], seq 4245039293, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 296668711 ecr 0], length 0
09:43:38.741557 IP 192.168.123.22.62070 > 192.168.123.23.2233: Flags [SEW], seq 343657957, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3406471577 ecr 0], length 0
09:43:53.219844 IP 192.168.123.22.62713 > 192.168.123.23.2233: Flags [SEW], seq 452468561, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3555078978 ecr 0], length 0
09:44:53.264107 IP 192.168.123.22.11779 > 192.168.123.23.2233: Flags [SEW], seq 3807775128, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3836709718 ecr 0], length 0
09:45:53.306117 IP 192.168.123.22.14431 > 192.168.123.23.2233: Flags [SEW], seq 3580778695, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3478626421 ecr 0], length 0
09:46:53.348438 IP 192.168.123.22.17083 > 192.168.123.23.2233: Flags [SEW], seq 1098229669, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 2219974257 ecr 0], length 0
09:47:53.386992 IP 192.168.123.22.19737 > 192.168.123.23.2233: Flags [SEW], seq 1338972264, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 708281300 ecr 0], length 0
09:48:53.426861 IP 192.168.123.22.22389 > 192.168.123.23.2233: Flags [SEW], seq 3973038592, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3153895628 ecr 0], length 0
09:49:53.469640 IP 192.168.123.22.25046 > 192.168.123.23.2233: Flags [SEW], seq 2367639206, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3155172682 ecr 0], length 0
09:50:53.510996 IP 192.168.123.22.27703 > 192.168.123.23.2233: Flags [SEW], seq 515312838, win 65535, options [mss 8960,nop,wscale 9,sackOK,TS val 3434645295 ecr 0], length 0

How does TCP SYN/SYN-ACK behave between DCSERV-ESX06 and the other vSAN nodes? Note that all the captured SYNs above target TCP port 2233, the port vSAN uses for RDT (Reliable Datagram Transport) communication between nodes.

The ESXi command to sniff TCP SYN from DCSERV-ESX06 (192.168.123.22) to DCSERV-ESX07 (192.168.123.23) is

pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.22 and dst host 192.168.123.23 and tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'

The command to sniff TCP SYN-ACK is

pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'src host 192.168.123.23 and dst host 192.168.123.22 and tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)' 

Here are the observations and screenshots from the sniffing exercise.

No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX05 (192.168.123.21, agent vSAN node) during a limited sniffing window (several minutes).

Between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX07 (192.168.123.23, agent vSAN node), a new TCP connection is established (SYN/SYN-ACK) every minute.

No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX08 (192.168.123.24, agent vSAN node) during a limited sniffing window (several minutes).

No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX09 (192.168.123.25, master vSAN node) during a limited sniffing window (several minutes).

No new TCP connections were initiated between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX10 (192.168.123.26, agent vSAN node) during a limited sniffing window (several minutes).

Interesting observation

A new TCP connection between DCSERV-ESX06 (192.168.123.22, backup vSAN node) and DCSERV-ESX07 (192.168.123.23, agent vSAN node) is usually established (SYN/SYN-ACK) every minute.

Why is this happening only between DCSERV-ESX06 (backup node) and DCSERV-ESX07 (agent node) and not with the other nodes? I do not know.

Further TCP network troubleshooting

The next step is to collect TCP SYN, TCP SYN/ACK, TCP stats, and net stats on DCSERV-ESX06 (the most "problematic" vSAN node) and DCSERV-ESX10 (a non-"problematic" vSAN node) into files. I will capture data for one hour (60 minutes) to be able to compare the number of SYN and SYN/ACK packets against the TCP and network statistics.

Capturing of TCP SYN

timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' > /tmp/dcserv-esx06_tcp-syn.dump

timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0' > /tmp/dcserv-esx10_tcp-syn.dump

Capturing of TCP SYN/ACK

timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)' > /tmp/dcserv-esx06_tcp-syn_ack.dump

timeout -t 3600 pktcap-uw --uplink vmnic4 --capture UplinkRcvKernel,UplinkSndKernel -o - | tcpdump-uw -r - 'tcp[tcpflags] & (tcp-syn|tcp-ack) = (tcp-syn|tcp-ack)' > /tmp/dcserv-esx10_tcp-syn_ack.dump
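Once the hour-long captures finish, the SYN and SYN/ACK lines can be counted directly from the text dumps with grep. Here is a minimal, self-contained sketch of the counting; the two sample lines below are illustrative, and in practice you would point grep at the /tmp dump files created above.

```shell
# In tcpdump flag notation, 'S' = SYN and '.' = ACK, so "[S.]" means SYN/ACK.
printf '%s\n' \
  '09:19:52.104211 IP 192.168.123.22.52611 > 192.168.123.23.2233: Flags [SEW], length 0' \
  '09:19:52.105002 IP 192.168.123.23.2233 > 192.168.123.22.52611: Flags [S.E], length 0' \
  > /tmp/sample_syn.dump
grep -c 'Flags \[S[^].]*\]' /tmp/sample_syn.dump   # SYN without ACK -> prints 1
grep -c 'Flags \[S[^]]*\.' /tmp/sample_syn.dump    # SYN/ACK -> prints 1
```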

Capturing of TCP Statistics

for i in $(seq 60); do { date; vsish  -e get /net/tcpip/instances/defaultTcpipStack/stats/tcp; }  >> /tmp/dcserv-esx06_tcp_stats; sleep 60; done

for i in $(seq 60); do { date; vsish  -e get /net/tcpip/instances/defaultTcpipStack/stats/tcp; }  >> /tmp/dcserv-esx10_tcp_stats; sleep 60; done 

Capturing of Network Statistics

net-stats captures for 60 minutes: 120 iterations × 30 seconds = 3,600 seconds = 60 minutes

for i in $(seq 120); do { date; net-stats -A -t WwQqihVv -i 30; } >> /tmp/dcserv-esx06_netstats ; done

for i in $(seq 120); do { date; net-stats -A -t WwQqihVv -i 30; } >> /tmp/dcserv-esx10_netstats ; done

Output Files Comparison

ESX06
tcpdump
15:48:32.422347 - 16:48:16.542078: 199 TCP SYN
15:49:16.434140 - 16:48:46.533262: 199 TCP SYN/ACK

Fri Mar  7 15:49:10 UTC 2025
tcp_statistics
   connattempt:253432751
   accepts:3996127
   connects:8341861
   drops:4778493
   conndrops:247894569
   minmssdrops:0
   closed:257671058

Fri Mar  7 16:48:10 UTC 2025
tcp_statistics
   connattempt:253587720
   accepts:3997071
   connects:8345071
   drops:4781004
   conndrops:248047267
   minmssdrops:0
   closed:257827086

tcp_statistics difference
   connattempt:154969
   accepts:944
   connects:3210
   drops:2511
   conndrops:152698
   minmssdrops:0
   closed:156028

ESX10
tcpdump
15:49:44.554242 - 16:49:16.544940: 179 TCP SYN
15:50:16.441776 - 16:49:54.142493: 185 TCP SYN/ACK

Fri Mar  7 15:50:49 UTC 2025
tcp_statistics
   connattempt:826534
   accepts:2278888
   connects:3105348
   drops:1414905
   conndrops:74
   minmssdrops:0
   closed:3338137

Fri Mar  7 16:49:49 UTC 2025
tcp_statistics
   connattempt:826864
   accepts:2279789
   connects:3106579
   drops:1415439
   conndrops:74
   minmssdrops:0
   closed:3339470

tcp_statistics difference
   connattempt:330
   accepts:901
   connects:1231
   drops:534
   conndrops:0
   minmssdrops:0
   closed:1333
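The per-counter differences above were computed by hand; a small awk helper (hypothetical, not part of the capture session) can diff the first and last snapshots in one of the stats files automatically. Here it is run against inline sample data in the same key:value format the vsish output uses:

```shell
# Build a two-snapshot sample, then print per-counter deltas: the first occurrence
# of each key is treated as the first snapshot, the last occurrence as the last.
printf '%s\n' '   connattempt:253432751' '   conndrops:247894569' \
              '   connattempt:253587720' '   conndrops:248047267' \
  > /tmp/sample_tcp_stats
awk -F: '/:[0-9]+$/ {
    gsub(/^[ \t]+/, "", $1)
    if (!($1 in first)) first[$1] = $2
    last[$1] = $2
  }
  END { for (k in first) printf "%s:%d\n", k, last[k] - first[k] }' /tmp/sample_tcp_stats
# Output contains connattempt:154969 and conndrops:152698 (order may vary)
```

Pointed at the real /tmp/dcserv-esx06_tcp_stats file, the same helper reproduces the tcp_statistics differences listed above.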
 

What does it mean? I don't know yet. I have a VMware support case open and am waiting on their analysis.

