Thursday, August 5, 2021

Retrospective: Intermittent failures of web service calls to WSO2

Modern business relies heavily on web services. Because almost everything is connected, a break anywhere in the chain means a business loss. So, drawing on my experience in the Customer Success team, I thought I would share one issue we ran into with WSO2 Enterprise Service Bus (ESB) or Enterprise Integrator (EI).


The problem was that a client invoking a service on WSO2 ESB was getting timeouts while connecting to the WSO2 server. This did not happen on every call, only intermittently. When we analyzed the logs, we could not find any issue inside the server.

Then we checked the thread dumps for any clue about what could cause that kind of behavior, but we found nothing.

As the next step, we mapped the timings of the client requests against the server log timings, and we could see that the specific requests that failed never reached the WSO2 server.
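To make that comparison concrete, here is a minimal sketch of the idea. It is not the tooling we actually used; the function name, the timestamps, and the two-second tolerance are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def unmatched_requests(client_times, server_times, tolerance=timedelta(seconds=2)):
    """Return client request times that have no server log entry within `tolerance`."""
    missing = []
    for sent in client_times:
        # A request counts as "seen" if any server log entry is close enough in time.
        if not any(abs(seen - sent) <= tolerance for seen in server_times):
            missing.append(sent)
    return missing


# Hypothetical sample data: three client requests, two server log entries.
fmt = "%Y-%m-%d %H:%M:%S"
client = [datetime.strptime(t, fmt) for t in (
    "2021-08-05 10:00:01", "2021-08-05 10:00:07", "2021-08-05 10:00:15")]
server = [datetime.strptime(t, fmt) for t in (
    "2021-08-05 10:00:01", "2021-08-05 10:00:16")]

for t in unmatched_requests(client, server):
    print("never reached the server:", t)
```

With real logs you would also need to account for clock differences between the client and server machines, which this sketch only covers through the tolerance value.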

Damn... What do we do now? 

Then we captured a TCP dump to look for anything suspicious. During that time period, we observed the following pattern of packets.



For reference, the flow for establishing a TLS connection is as follows.

image courtesy: https://hpbn.co/transport-layer-security-tls/



So, when a client makes a TLS connection to a server, the TCP handshake comes first: the client sends a "SYN" packet, and the server needs to respond to it with a "SYN, ACK" packet.

But in the packet capture, you can see that even though the client sent the "SYN" packet, it did not get a "SYN, ACK" response from the server. The client then made three more TCP retransmissions, but the server still did not respond.

Then, after some time, the client sent a "SYN" from a different port and the server replied with "SYN, ACK" as expected.
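If you want to spot this pattern without scanning the capture by eye, something like the sketch below could help. It assumes the capture has already been reduced to simple (client port, flags) pairs in capture order; that record format, the function name, and the port numbers are hypothetical, not the output of any particular tool.

```python
from collections import defaultdict


def unanswered_syns(packets):
    """Count SYNs per client port and keep only ports that never got a SYN,ACK.

    `packets` is an iterable of (client_port, flags) pairs, where flags is
    "SYN" (client to server) or "SYN,ACK" (server to client).
    """
    syn_counts = defaultdict(int)
    answered = set()
    for client_port, flags in packets:
        if flags == "SYN":
            syn_counts[client_port] += 1
        elif flags == "SYN,ACK":
            answered.add(client_port)
    return {port: count for port, count in syn_counts.items() if port not in answered}


# Hypothetical capture: one SYN plus three retransmissions with no reply,
# followed by a successful handshake from a different client port.
capture = [
    (50412, "SYN"), (50412, "SYN"), (50412, "SYN"), (50412, "SYN"),
    (50418, "SYN"), (50418, "SYN,ACK"),
]
print(unanswered_syns(capture))
```

A port that shows several SYNs and no SYN,ACK is a connection attempt the server never acknowledged, which is the same thing we saw in the dump.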

So, this narrowed down why WSO2 ESB did not respond: the client could not establish the connection to the WSO2 server in the first place. Then we researched what could cause this kind of behavior, and we ended up with multiple posts like [1] and [2].

Based on those, we could narrow it down to the following entry, which we had asked to be added to the /etc/sysctl.conf file [3] and which seems to be the key factor for this behavior.

net.ipv4.tcp_tw_recycle

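Basically, when there is NAT (Network Address Translation) in the communication path, having the above property set to "1" can cause this kind of intermittent TCP connection failure. With this setting, the kernel remembers the latest TCP timestamp seen from each remote IP address and drops packets that carry an older one. Clients behind the same NAT address share an IP but not a clock, so some of their SYN packets get silently discarded.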

The WSO2 documentation [3] has the following warning.

Note

Change this with caution and ONLY in internal networks where the network connectivity speeds are faster.
It is not recommended to use net.ipv4.tcp_tw_recycle = 1 when working with network address translation (NAT), such as if you are deploying products in EC2 or any other environment configured with NAT. 

So, ultimately we resolved the problem by removing this entry. I believe this is worth knowing in case you encounter the same behavior.
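If you want to check quickly whether a Linux host has this setting turned on, a small sketch like the one below should do. The function name is my own, and the script only reads the value; it does not change anything. Note that Linux kernels from 4.12 onward removed tcp_tw_recycle entirely, so on newer systems the file will simply not exist.

```python
from pathlib import Path


def read_tw_recycle(path="/proc/sys/net/ipv4/tcp_tw_recycle"):
    """Return the current setting as an int, or None if the kernel does not expose it."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        # Not Linux, or a kernel where the option no longer exists.
        return None


value = read_tw_recycle()
if value is None:
    print("tcp_tw_recycle is not available on this kernel")
elif value == 1:
    print("tcp_tw_recycle is enabled; this is risky when clients sit behind NAT")
else:
    print("tcp_tw_recycle is disabled")
```

Remember that removing the line from /etc/sysctl.conf only affects what is applied at the next load of that file, so it is worth confirming the running value as well.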

Enjoy!!

[1] https://serverfault.com/questions/235965/why-would-a-server-not-send-a-syn-ack-packet-in-response-to-a-syn-packet

[2] https://www.programmersought.com/article/45056975684/

[3] https://docs.wso2.com/display/EI600/Network+and+OS+Level+Performance+Tuning