Building High-Performance Application Servers – What You Need to Know
Using scalable and reliable software is vital for the success of any large-scale IT project. As increasing numbers of transactions are made, application infrastructure needs to stand strong and support that growth, and not be another source of problems.
Building application servers is no easy task. It requires in-depth knowledge about various components such as networking or OS kernel. I spend the last couple of years in the fascinating area of software engineering: high-performance application servers, such as HTTP servers, SSL terminators, API gateways, recent Software-defined Networks (SDN) or Network Function Virtualization (NFV) tools.
This article elaborates on key findings that I have found during this journey. I humbly hope they will be useful for people who work in this exciting field.
The reactor pattern is the secret behind the success of NGINX, Node.js, and many other modern software programs. 
The reactor pattern consists of two main elements: the event loop and events. Events are fired by OS kernel to act upon various situations such as incoming network packets. The application server sleeps during the event loop and is “woken up” by the arrival of such event. The event is then handled by the application code and the server returns to the event loop.
The event loop is built on top of a well-known POSIX functions such as
poll() or their modern alternatives
epoll(). Another OS provides various methods how to achieve the same effect such as kqueue in BSD or macOS.
There are useful libraries that provide a handy boilerplate for reactor-based high-performance network IO, such as libevent or libev.   The problem is elaborated in more details in older but still relevant article "The C10K problem". 
This architecture offers outstanding performance for two reasons - it offers UNIX design match and provides non-blocking IO operations. The application doesn't wait for the read or write to finish in IO non-blocking mode - it can handle other events in meanwhile. This is a secret sauce of how to deliver higher possible IO performance.
The reactor pattern matches the UNIX design, the application remains largely inactive and reacts only to input and output events that are propagated by the OS kernel to the userspace so the available system resources are utilised optimaly.
Inbound versus Outbound
A common network protocol is bi-directional and full duplex, such as a TCP connection on lower layer or HTTP protocol on the higher layer. A TCP connection can exist in a half-closed state and still transmit data in the remaining open direction.
HTTP protocol takes advantage of this half-closed state. One way of indicating end-of-data in the HTTP request is to close outbound direction of the connection after a request data is sent to the server. The HTTP response is then received from this half-closed connection which is eventually closed by a server. The half-closed state is achieved by
shutdown() system call.
To deal with these situations, the application server should implement processing of inbound and outbound direction as much independently as possible. It is valid also for application servers which reply to the same network connection that has been used to receive a respective request. Think of the difference between an HTTP server and an HTTP proxy.
On top of that, the inbound and outbound are also the start and the end of an application transaction. The application server transaction starts with a read event that is initiated by an inbound network data, a
read() system call, an application logic, and it ends at the outbound part.
Based on the write 'readiness', the transaction can perform so-called direct
write() system call, or it stores outbound data in an outbound buffer. In the later case, a
write() call is delayed till an event that indicates that underlying network is ready, arrives.
A production-grade application server has to deal with the bandwidth management particularly gateway-type of servers that sit between networks with different throughputs such as a data center network and a mobile network. As the outbound buffer grows, throttling provides a useful feedback loop for inbound event handling.
Satisfactory results can be seen in a quite simple approach, such as a temporal 'pause' of inbound read events when the outbound buffer reaches a certain size, coupled with a 'resume' action when outbound buffer size decreases under another threshold.
Single-threaded over multi-threaded
A common performance penalty associated with higher code complexity comes from the multi-thread nature of application servers. Threads have to fight one another for the locks and mutexes that synchronize access to various important resources such as memory, file descriptors, or event logs. Such a synchronization is vital for a proper application functionality but it takes a lot of availabile resources such as CPU capacity.
A multi-threaded application server doesn’t scale in a linear fashion. You’ll eventually reach a point where adding more CPU cores won’t have any positive impact on performance.
Because of these potential problems, I much prefer single-threaded designs. They scale more or less linearly and free from major types of race conditions - a common pitfall of multi-threaded network I/O apps. Another benefit is that the code is simple and straightforward.
Single-threaded designs are able to work parallel to each other, thanks to dedicated processes, also called workers. Each worker process is single-threaded and handles a portion of the application transactions.
A worker process can be easily restarted after some time, for instance if you needed to reduce memory fragmentation.
Memory and zero-copy
Memory is also vital. The pre-allocated memory pool of fixed frames used to store transaction data tends to work very well.
The application server also should avoid data copy operations as much as possible. Part of the protocol implementation could therefore be built as header/trailer add/remove. The data then stays in the same memory location during the whole transaction. Vectored I/O (scatter/gather I/O) is another effective method to help avoid unnecessary copy operations. 
A notable exception to this rule is encryption and decryption of the data, which is a copy operation by its very nature.
Surprisingly, logging operations can be costly in high-performance servers. Logs provide necessary insights into server operations, and therefore they shouldn't be sacrificed completely. However, some naive coders tend to accidentally introduce bottlenecks such as additional I/O operations connected with log file writes.
A more viable approach is to log into memory during an active application transaction and flush the memory in bigger chunks. This can introduce lags in the logging, but you can mitigate this problem by choosing a good flushing interval. 200ms is typically a good starting point.
Also, modern log crunching tools handle log stream with non-linear timestamps without any issues.
All techniques described in this article are available in the open-source project LibFT or "Frame Transporter."  The project is accessible on GitHub, and I would be happy to receive your feedback about how you’ve got on with the project.
The most popular language in this area are still C, closely followed by C++, and Go. This is only logical, since you want to stay as close to the chip, "the Silicon" as possible. Higher abstraction languages such as C#, Java, or Ruby don’t work well, and may limit desired performance.
It is not that uncommon to see techniques that utilize assembler code. 
Another common theme is to optimize the architecture and code to the underlying hardware design such as network card or CPU. For example, aligning a network card queue with CPU cores can dramatically improve a network throughput. See article "How to receive a million packets per second." 
What you can expect in the future
I'm not going to stop here. There are many topics in this area that need to be explored. I've started with a protocol chaining that allows easy and standardized implementation of various layered protocols without impacting the performance, for example, the chain of TCP, TLS and HTTP. 
You can avoid this hassle by skipping a Linux network stack and dealing with raw packets directly in the user space. There are projects out there in this area, for instance DPDK.  This approach has to be combined with the user space implementation of the TCP/IP stack, which can cause plenty of problems. However, it's well worth the effort because the resulting throughput performance is outstanding.
The next logical step is to look at TLS implementation on top of such architecture because standard libraries such as SSL, that are part of OpenSSL, don't provide optimal tools for this kind of data traffic. 
TeskaLabs produces hi-tech software for you to build and operate high-performance, large-scale mobile and Internet of Things applications that run high-performance application servers. To find out more about our products, simply click on this link. Alternatively, you can connect with me on Linkedin to get all the latest updates and articles.
- Reactor Pattern - https://en.wikipedia.org/wiki/Reactor_pattern
- libevent - http://libevent.org
- libev - http://software.schmorp.de/pkg/libev.html
- Vectored I/O - https://en.wikipedia.org/wiki/Vectored_I/O
- LibFT - https://github.com/TeskaLabs/Frame-Transporter
- HTTP Strings Processing Using C, SSE4.2 and AVX2 - http://natsys-lab.blogspot.com/2016/10/http-strings-processing-using-c-sse42.html
- DPDK - http://dpdk.org
- OpenSSL - https://www.openssl.org/
You Might Be Interested in Reading These Articles
In order to help you to evaluate and use our product we have prepared an trial version that is freely available for download. Trial version is limited to emulator/simulator only, you cannot use that on a real device. There is however no expiration date of a trial, so feel free to use it for any amount of time you need for the evaluation or even an actual development.
Published on August 17, 2014
Containerization is an alternative for full machine virtualization. You probably know well-known containerization technology from Docker or Rocket. However, this article addresses the pros and cons of mobile “containerization” or wrapper used to isolate the mobile app from the mobile operating system or other applications installed on the same device. These type of “containerization” work in a different way.
Published on September 27, 2016
The official source of OpenSSL software is the OpenSSL website. One can download OpenSSL source codes archives and compile them for a given platform. The compilation work can sometimes be quite tedious, especially for exotic platforms. We, at TeskaLabs, set up this page because we frequently compile OpenSSL for various platforms for our internal purposes and this may save some time to other developers.
Published on July 20, 2017