Timeout in Distributed Systems

Possible Failures

A timeout is generally used to detect a (possible) failure.

After a user opens a URL, the browser looks up the IP addresses of the domain name. A DNS lookup timeout can detect DNS resolution failures.

The browser then tries to establish a TCP connection to one of the IP addresses, where a connect timeout can detect TCP connection failures. Once the connection is established, a TLS handshake timeout can detect TLS handshake failures.

After an HTTP request is sent successfully, a read timeout is raised if the server fails to issue a response in time. The call timeout spans the entire call: resolving DNS, connecting, writing the request body, server processing, and reading the response body 1. Any anomaly that blocks the call can thus be detected.
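
The read timeout described above can be sketched with Python's standard `socket` module. This is a minimal illustration, not how any particular browser or HTTP client implements it; the function name `read_response` and its parameters are hypothetical.

```python
import socket


def read_response(sock, read_timeout, max_bytes=4096):
    """Wait for the first bytes of a response, giving up after read_timeout seconds.

    Returns the bytes received, or None if the peer stayed silent too long.
    """
    sock.settimeout(read_timeout)
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        # The connection itself may still be healthy; the client simply
        # decides it has waited long enough and disconnects.
        sock.close()
        return None
```

A caller would pass an already connected socket, for example `read_response(sock, read_timeout=10.0)`, and treat `None` as "the server did not respond in time".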

Benefits of Blocking Call

A blocking call is preferred when the client needs to perform serial operations, i.e. when one operation can start only after another has completed. For instance, the client cannot make a TCP connection if the remote address is unknown. Similarly, the client will not be able to send an HTTP request if the TLS handshake fails.
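
The serial dependency above can be sketched as a chain of blocking phases, each consuming the result of the previous one, with every phase timeout capped by what is left of the overall call timeout. This is only an illustration under assumptions: `run_serial_call`, the `phases` tuple layout, and the injectable `clock` are hypothetical names, not part of any library.

```python
import time


def run_serial_call(phases, call_timeout, clock=time.monotonic):
    """Run blocking phases in order, feeding each result into the next phase.

    phases is a list of (name, phase_timeout, function); each function is
    called as function(previous_result, timeout).
    """
    expires_at = clock() + call_timeout
    result = None
    for name, phase_timeout, function in phases:
        remaining = expires_at - clock()
        if remaining <= 0:
            raise TimeoutError(f"call timeout exceeded before phase {name!r}")
        # A phase never gets more time than the overall call has left.
        result = function(result, min(phase_timeout, remaining))
    return result
```

In a real client the phases would be DNS resolution, TCP connect, TLS handshake, and the HTTP exchange; because each phase blocks, a failure or timeout in one naturally prevents the next from starting.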

Application Layer Recovery

TCP is said to be a reliable communication protocol. However, TCP cannot handle timeout events above the transport layer. For example, at the application layer, the client needs to decide when to disconnect if the server does not issue an HTTP response.

In addition, application layer logic can improve the overall resilience to connectivity issues. For example, feitsui.com has 3 different IP addresses. Modern browsers and clients such as curl silently recover by retrying the other IPs when any individual IP address times out.
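
The multi-IP recovery described above might look roughly like the following. It is a simplified sketch, not the algorithm that browsers or curl actually use; `connect_first_reachable` is a hypothetical helper, and the addresses are assumed to have been resolved already.

```python
import socket


def connect_first_reachable(addresses, connect_timeout):
    """Try each (host, port) pair in order and return the first socket that connects."""
    last_error = None
    for address in addresses:
        try:
            # connect_timeout applies to each attempt separately.
            return socket.create_connection(address, timeout=connect_timeout)
        except OSError as error:
            # Timeouts and refused connections are both OSError subclasses.
            last_error = error
    raise ConnectionError(f"all {len(addresses)} addresses failed") from last_error
```

Python's own `socket.create_connection` already walks through every address returned for a host name; the explicit loop is written out here only to make the failover visible.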

Timeout Interval

The timeout interval needs to be long enough to accommodate acceptable latency, processing time and network jitter. For example, the network latency between Cape Town and Seoul is above 500ms 2. It will be challenging for a user in South Africa to use an application hosted in North East Asia if the timeout interval is on the order of milliseconds. Some API calls require expensive computations and/or database queries, so the client also needs to wait "patiently" for the results.
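
As a rough worked example of this reasoning, the sketch below adds up the components a timeout has to cover. The function name and the processing and jitter figures are assumptions made for illustration; only the 500 ms latency comes from the text.

```python
def minimum_timeout_ms(network_latency_ms, processing_ms, jitter_ms):
    """Smallest timeout that still covers latency, server work and jitter."""
    return network_latency_ms + processing_ms + jitter_ms


# 500 ms latency (Cape Town to Seoul, from the text) plus assumed figures:
# 200 ms of server processing and 100 ms of jitter headroom.
floor_ms = minimum_timeout_ms(500, 200, 100)  # 800
```

Under these assumed numbers, any timeout below 800 ms would cut off requests that are behaving normally.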

Retry Interval

Immediate retry is appropriate for some scenarios. For example, in multi-IP failover it is reasonable to try the next IP immediately after the current one times out. Exponential backoff is commonly used for background operations. The wait time doubles each time the retry count $n$ increments: $t_n = t_0 \cdot 2^n$, where $t_0$ is the initial wait time.
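
The backoff schedule can be sketched as a small generator. The name `backoff_delays` and the optional `max_delay` cap are assumptions added for illustration; the cap is not part of the formula in the text.

```python
def backoff_delays(base_delay, retries, max_delay=None):
    """Yield the wait before each retry: base_delay * 2**n for n = 0 .. retries - 1."""
    for n in range(retries):
        delay = base_delay * 2 ** n
        if max_delay is not None:
            # Optional cap so the wait does not grow without bound.
            delay = min(delay, max_delay)
        yield delay
```

For instance, `list(backoff_delays(0.5, 5))` gives `[0.5, 1.0, 2.0, 4.0, 8.0]`.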

The type of operation determines the retry interval and the number of attempts. The collaborative whiteboard client is interactive, and one client with a connection error will not lead to a global, systemic failure. Therefore, it is reasonable to keep the retry interval short and attempt only a few retries in the background 3. After several consecutive failures, the client can stop retrying and prompt the user to manually refresh the page or ignore the error.
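
A bounded retry loop along these lines might look as follows. It is a sketch under assumptions: `call_with_retries`, its default values, and the injectable `sleep` are hypothetical, and how the user is actually prompted is left to the caller.

```python
import time


def call_with_retries(operation, max_attempts=3, retry_interval=0.5, sleep=time.sleep):
    """Run operation(); on a connection error retry a few times, then give up.

    Returns (True, result) on success, or (False, last_error) so the caller
    can prompt the user to refresh the page or ignore the error.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return True, operation()
        except OSError as error:
            # ConnectionError and TimeoutError are both OSError subclasses.
            last_error = error
            if attempt < max_attempts:
                sleep(retry_interval)
    return False, last_error
```

Keeping `max_attempts` small and `retry_interval` short matches the interactive case; a background job would more likely swap the fixed interval for an exponentially growing one.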

Footnotes

  1. https://square.github.io/okhttp/4.x/okhttp/okhttp3/-ok-http-client/-builder/call-timeout/
  2. https://www.cloudping.co/grid
  3. https://docs.microsoft.com/en-us/azure/architecture/best-practices/transient-faults