This article is a translation, by popular request, of Optimisations Nginx, bien comprendre sendfile, tcpnodelay et tcpnopush, which I wrote in French in January.
Most articles dealing with Nginx performance optimization recommend using the sendfile, tcp_nodelay and tcp_nopush options in the nginx.conf configuration file. Unfortunately, almost none of them explain how these options impact the Web server or how they actually work.
Everything started after Greg did the peer review of my Nginx configuration. He was challenging my optimization, asking me if I really knew what I was doing. I started digging into the basement of the TCP stack, as mixing tcp_nodelay and tcp_nopush seemed to be as logical as a pacifist joining the Navy SEALs (which have nothing to do with baby seals).
How can you force a socket to send the data in its buffer? A solution lies in the TCP_NODELAY option of the TCP(7) stack. Activating TCP_NODELAY forces a socket to send the data in its buffer, whatever the packet size. The Nginx tcp_nodelay option sets the TCP_NODELAY option when opening a new socket.
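To make this concrete, here is a minimal Python sketch of what setting TCP_NODELAY on a socket looks like from user space. It is not how Nginx does it internally (Nginx is written in C); the function name nodelay_is_set is made up for the illustration.

```python
import socket


def nodelay_is_set():
    """Open a TCP socket, enable TCP_NODELAY, and read the flag back."""
    # Hypothetical helper: the name and structure are illustrative only.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Ask the TCP stack to send small writes right away (disables Nagle).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        # getsockopt returns a non-zero integer when the option is enabled.
        return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

The option is set per socket, so a server has to apply it to every connection it wants to affect.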
To avoid network congestion, the TCP stack implements a mechanism that waits up to 0.2 seconds for more data, so it won’t send a packet that would be too small. This mechanism is Nagle’s algorithm, and 200 ms is the value used by the UNIX implementation.
To understand Nagle’s purpose, you need to remember that the Internet is not only about sending Web pages and huge files. Imagine yourself back in the 90s, using telnet to connect to a remote machine over a 14,400 bps dial-up connection. When you press ctrl+c, you send a one-byte message to the telnet server. To that message, you need to add the IP headers (20 bytes for IPv4, 40 bytes for IPv6) and the TCP headers (20 bytes). When pressing ctrl+c, you actually send 41 bytes over the network with IPv4 (61 bytes with IPv6). Nagle’s algorithm makes the stack wait in case you have something else to type before the data is sent.
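The header arithmetic is easy to check with a few lines of Python. This is only a sketch: it assumes minimal headers without any IP or TCP options, and the function names are invented for the example.

```python
# Minimal header sizes in bytes, assuming no IP or TCP options.
IP_HEADER = {"ipv4": 20, "ipv6": 40}
TCP_HEADER = 20


def packet_size(payload, ip_version="ipv4"):
    """Bytes on the wire (above the link layer) for a given payload size."""
    return payload + IP_HEADER[ip_version] + TCP_HEADER


def overhead_ratio(payload, ip_version="ipv4"):
    """Share of the packet that is headers rather than payload."""
    total = packet_size(payload, ip_version)
    return (total - payload) / total
```

For a single keystroke, packet_size(1) gives 41 bytes with IPv4 and packet_size(1, "ipv6") gives 61 bytes, so almost the whole packet is headers. That waste is what Nagle’s algorithm tries to avoid.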
That’s cool, but Nagle is not relevant to the modern Internet anymore. It is even counterproductive when you need to stream data over the network. The chances that your file exactly fills a whole number of full packets are close to 0, which means Nagle adds up to 0.2 seconds of latency on the client side for every file it downloads.
The TCP_NODELAY option bypasses Nagle and sends the data as soon as it’s available.
Nginx uses TCP_NODELAY on HTTP keepalive connections. keepalive connections are sockets that stay open for a while after sending data. keepalive allows sending more data without initiating a new connection and replaying a TCP three-way handshake for every HTTP request. This saves both time and sockets, as they don’t switch to FIN_WAIT after every data transfer.
Connection: Keep-Alive is optional in HTTP/1.0 and the default behavior in HTTP/1.1.
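To see keepalive at work without Nginx, here is a small Python sketch using only the standard library. It starts a throwaway HTTP/1.1 server on the loopback interface, sends several requests through one client connection, and counts how many distinct TCP connections the server saw. The class and function names are invented for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive is the default in HTTP/1.1
    seen_clients = set()           # (ip, port) of each connection that sent a request

    def do_GET(self):
        CountingHandler.seen_clients.add(self.client_address)
        body = b"ok"
        self.send_response(200)
        # A Content-Length is required so the client knows where the body ends
        # and can reuse the socket for the next request.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def connections_used(request_count):
    """Send request_count GETs over one client connection and return
    how many distinct TCP connections the server saw."""
    CountingHandler.seen_clients.clear()
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    try:
        for _ in range(request_count):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body before reusing the socket
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return len(CountingHandler.seen_clients)
```

With keepalive, connections_used(3) reports a single TCP connection for the three requests, so only one handshake was paid for.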
When downloading a full Web page, TCP_NODELAY can save you up to 0.2 seconds on every HTTP request, which is nice. When it comes to online gaming or high frequency trading, getting rid of latency is critical, even at the price of some network saturation.
On Nginx, the configuration option tcp_nopush works as the opposite of tcp_nodelay. Instead of optimizing delays, it optimizes the amount of data sent at once.
To keep everything logical, the Nginx tcp_nopush option activates the TCP_CORK option in the Linux TCP stack, since the TCP_NOPUSH one comes from FreeBSD and does not exist on Linux.
The well-named TCP_CORK blocks the data until the packet reaches the MSS, which equals the MTU minus the 40 or 60 bytes of the IP and TCP headers (IPv4 or IPv6 respectively).
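As a quick sanity check, the MSS and the number of full packets needed for a file can be computed like this. It is a sketch that assumes minimal IP and TCP headers (no options such as TCP timestamps) and a typical Ethernet MTU of 1500 bytes; the function names are made up.

```python
# Minimal header sizes in bytes, assuming no IP or TCP options.
IP_HEADER = {"ipv4": 20, "ipv6": 40}
TCP_HEADER = 20


def mss(mtu=1500, ip_version="ipv4"):
    """Largest TCP payload that fits in one packet for a given MTU."""
    return mtu - IP_HEADER[ip_version] - TCP_HEADER


def split_into_segments(file_size, mtu=1500, ip_version="ipv4"):
    """Return (number of full segments, bytes left for the last partial one)."""
    return divmod(file_size, mss(mtu, ip_version))
```

For example, mss() gives 1460 bytes with IPv4 and mss(ip_version="ipv6") gives 1440. A 102400-byte file then fills 70 full IPv4 segments and leaves a 200-byte partial one, which is precisely the kind of leftover packet that TCP_CORK holds back.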
Everything is well explained in the Linux kernel source code:

    /* Return false, if packet can be sent now without violation Nagle's rules:
     * 1. It is full sized.
     * 2. Or it contains FIN. (already checked by caller)
     * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
     * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
     *    With Minshall's modification: all sent small packets are ACKed.
     */
    static inline bool tcp_nagle_check(const struct tcp_sock *tp,
                                       const struct sk_buff *skb,
                                       unsigned int mss_now, int nonagle)
    {
        return skb->len < mss_now &&
            ((nonagle & TCP_NAGLE_CORK) ||
             (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
    }
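If you don’t read C fluently, the same decision can be sketched in Python. This is a simplified model and not kernel code: the kernel structures are replaced by plain arguments, and the small_unacked boolean stands in for the result of tcp_minshall_check(). The two flag values are the ones the kernel headers use for TCP_NAGLE_OFF and TCP_NAGLE_CORK.

```python
TCP_NAGLE_OFF = 1   # set when TCP_NODELAY is enabled
TCP_NAGLE_CORK = 2  # set when TCP_CORK is enabled


def nagle_check(skb_len, mss_now, nonagle, packets_out, small_unacked):
    """Return True when the segment must wait, False when it can go now.

    skb_len       -- size of the segment we would like to send
    mss_now       -- current maximum segment size
    nonagle       -- bit field built from the TCP_NAGLE_* flags above
    packets_out   -- number of packets sent but not yet ACKed
    small_unacked -- simplified stand-in for tcp_minshall_check()
    """
    if skb_len >= mss_now:
        return False  # rule 1: full-sized packets always go out
    if nonagle & TCP_NAGLE_CORK:
        return True   # corked: hold every partial packet
    # Plain Nagle: hold small packets while earlier small ones are unACKed.
    return bool(not nonagle and packets_out > 0 and small_unacked)
```

Note how the cork test comes first: nagle_check(100, 1460, TCP_NAGLE_OFF | TCP_NAGLE_CORK, 0, False) is True, so a corked socket holds small packets even when TCP_NODELAY is set.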
TCP_CORK needs to be explicitly removed if you want to send half empty (or half full) packets.
The TCP(7) manpage explains that TCP_NODELAY and TCP_CORK used to be mutually exclusive, but they can be combined since Linux 2.5.71.
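You can check that both options coexist on a recent Linux kernel with a short Python sketch. It is Linux-only, since Python only exposes socket.TCP_CORK there, and the function name is invented for the example.

```python
import socket


def cork_and_nodelay():
    """Set TCP_NODELAY and TCP_CORK on one socket and read both flags back.

    Linux-only sketch: socket.TCP_CORK is not defined on other platforms.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
        nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
        cork = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)
        return bool(nodelay), bool(cork)
```

Both flags read back as enabled, which means the kernel keeps track of them independently.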
In the Nginx configuration, tcp_nopush must be activated together with sendfile, which is exactly where things get interesting.
Nginx’s initial fame came from its awesomeness at sending static files. This has a lot to do with the association of sendfile, tcp_nodelay and tcp_nopush. The Nginx sendfile option enables the use of sendfile(2) for everything related to… sending files.
sendfile(2) transfers data from one file descriptor to another directly in kernel space.
sendfile(2) saves lots of resources:
sendfile(2) is a syscall, which means execution is done inside kernel space, hence no costly context switching.
sendfile(2) replaces the combination of both read(2) and write(2).
sendfile(2) allows zero copy, which means writing directly to the kernel buffer from the block device memory through DMA.
Unfortunately, sendfile(2) requires a source file descriptor that supports mmap(2) and friends, which excludes UNIX sockets, used for example to exchange data with a local Rails backend without all the network latency.
The in_fd argument must correspond to a file which supports mmap(2)-like operations (i.e., it cannot be a socket).
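Python exposes the syscall as os.sendfile, which makes it easy to try. The sketch below sends a temporary file into one end of a socket pair and reads it back from the other end. The source is a regular file, so the in_fd restriction quoted above is respected; a socket as the destination is fine. It assumes Linux, and the function names are made up for the illustration.

```python
import os
import socket
import tempfile


def send_file(out_fd, path):
    """Copy a whole file to out_fd with sendfile(2); return the bytes sent."""
    with open(path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        offset = 0
        while offset < size:
            # The kernel moves the data; it never enters a Python buffer.
            sent = os.sendfile(out_fd, src.fileno(), offset, size - offset)
            if sent == 0:
                break  # file was truncated under us, stop instead of looping
            offset += sent
    return offset


def demo():
    """Send a small temporary file through a socket pair and check it."""
    payload = b"static file content\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    left, right = socket.socketpair()
    try:
        sent = send_file(left.fileno(), tmp.name)
        left.close()  # signals end of stream to the reader
        received = b""
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            received += chunk
    finally:
        left.close()
        right.close()
        os.unlink(tmp.name)
    return sent, received == payload
```

The loop in send_file matters because sendfile(2) may transfer fewer bytes than requested, just like write(2).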
Depending on your needs, sendfile can be either totally useless or completely essential.
If you’re serving locally stored static files, sendfile is totally essential to speed up your Web server. But if you use Nginx as a reverse proxy to serve pages from an application server, you can deactivate it. Unless you start serving a micro cache from a tmpfs. I’ve been doing it here, and didn’t even notice the day I was featured on the HN homepage, Reddit or good old Slashdot.
Let’s mix everything together
Things get really interesting when you mix sendfile, tcp_nodelay and tcp_nopush together. I was wondering why anyone would mix 2 antithetic and mutually exclusive options. The answer lies deep inside a 2005 thread from the (Russian) Nginx mailing list.
Combined with sendfile, tcp_nopush ensures that the packets are full before being sent to the client. This greatly reduces network overhead and speeds up the way files are sent. Then, when it reaches the last (probably partial) packet, Nginx removes tcp_nopush. Then, tcp_nodelay forces the socket to send the data, saving up to 0.2 seconds per file.
This behavior is confirmed in a comment from the TCP stack source about TCP_CORK:
When set indicates to always queue non-full frames. Later the user clears this option and we transmit any pending partial frames in the queue. This is meant to be used alongside sendfile() to get properly filled frames when the user (for example) must write out headers with a write() call first and then use sendfile to send out the data parts.
TCP_CORK can be set together with TCP_NODELAY and it is stronger than TCP_NODELAY.
Nice, isn’t it?
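To tie everything together, here is a rough Python sketch of the sequence: cork the socket, write the headers, push the file with sendfile, then remove the cork so the last partial frame leaves immediately. It only illustrates the idea and is not how Nginx is implemented. It assumes Linux (for socket.TCP_CORK) and a loopback connection, and the names serve_file and demo are invented.

```python
import os
import socket
import tempfile


def serve_file(conn, path):
    """Send HTTP-like headers and a file body, corked, then flush."""
    size = os.path.getsize(path)
    headers = f"HTTP/1.1 200 OK\r\nContent-Length: {size}\r\n\r\n".encode()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)  # queue partial frames
    conn.sendall(headers)  # headers go out with a plain write
    with open(path, "rb") as src:
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), src.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
    # Removing the cork flushes whatever partial frame is still queued.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
    return len(headers) + offset


def demo():
    """Serve a temporary file over loopback and verify what the client got."""
    body = b"x" * 5000
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(body)
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    try:
        sent = serve_file(conn, tmp.name)
        conn.close()  # end of stream for the client
        received = b""
        while True:
            chunk = client.recv(65536)
            if not chunk:
                break
            received += chunk
    finally:
        conn.close()
        client.close()
        listener.close()
        os.unlink(tmp.name)
    return sent == len(received) and received.endswith(body)
```

The headers and the start of the file can share the same frames, and nothing waits for a timer at the end because the cork is removed explicitly.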
Here we are, I think we’re done. I did not mention writev(2) as an alternative to tcp_nopush on purpose, to avoid adding complexity. I hope you enjoyed reading this. Don’t hesitate to send me an email if you have something to add, I’ll publish it with pleasure.