Will PostgreSQL switch to a Thread-based model ?
Redesigning the PostgreSQL architecture using thread-based model
Introduction
There is a renowned saying “Change is the only constant in software”. The code that you write today may or may not be used in the later versions. Software & Hardware both are evolving at a rapid pace. Every month a new version of software is released & developers are at the forefront to leverage the latest features.
Also, I have observed that developers including me are never happy with the code they write. We always think about rewriting, optimising, and pushing maintainable code. However, due to business priorities, we never reach that level of perfection.
Re-architecting a software system like a Database system would years of efforts and contribution from hundreds of people. There is a significant investment that goes into software re-architecture. Since the investments are huge, such projects are picked up only if there is a proportionate Return On Investment (ROI). In the end, it’s all business right ?
I have been using PostgreSQL since the past five years. Over the years, I developed an interest in understanding the internals of PostgreSQL. I am amazed that it’s an open source software and powering critical applications such as banking, trading, flight booking & other systems. It is a 35 year old system and has stood the test of time.
Recently, I came across a discussion where the community was deliberating to move to a Thread-based model instead of Process-based model. This sparked my curiosity and I have gone through interesting emails, which discussed the pros and cons of each approach. In this article, we will understand how PostgreSQL uses the process-based model, why it does it, what are the advantages of thread-based model & direction that PostgreSQL community will take in the future.
Process-based model in PostgreSQL
PostgreSQL is a database server and it is responsible for managing, organizing, and querying the data. Additionally, it needs to handle the client connection, parse the query, optimize the query, & manage the transactions.
The core of PostgreSQL is a process known as postmaster. This process acts as facade that clients talk to. The postmaster process listens on the default port that is 5432 and accepts the incoming connections. For every new client connection, the postmaster spins up a new worker process and offloads the connection to this new process. So, every time, a client connects to the database, a new process is started that handles the client connection. The other interactions such as query parsing, optimization, etc are handled by the worker processes.
The below diagram illustrates how PostgreSQL launches a new process for every incoming connection :-
Processes are heavy weight and PostgreSQL can’t create infinite processes and handle millions of connections. For this reason, it defines a parameter max_connections that indicates how many connections it can handle at once. By default, the value is 100 & you can’t create more than 100 connections at a time to PostgreSQL. As soon as you exceed this limit, the database server will refuse any new incoming connection.
The connection limit of PostgreSQL is justified since it expects clients such as web servers to create a pool of connections and reuse instead of creating a new connection every time. However, there are several disadvantages and also advantages of the process-bases model. We will discuss this in the subsequent sections.
Advantages of the Process-Based model
Fault Isolation & Stability - In case a process crashes, it will impact only one connection. The client will need to retry and establish a new connection. However, imagine that we had a thread-based model and the main process itself crashed. This would bring down all the connections. Process-based model offers a high degree of stability and fault isolation.
Strong Security Guarantees - Each process can have its own set of permissions and privileges. One client connection can’t read or write onto someone else’s connection. With the process-based model, the database guarantees strong security for the clients.
Efficient Resource Management - Each process handling a client connection can be assigned system resources such as CPU, memory and I/O fairly. This prevents one connection from dominating the system resources and starving other connections. The model provides robust control and ensures efficient resource allocation to handle the demands of individual connections.
Cross-Platform Compatibility - The thread libraries are Operating system dependent and the implementation differs across different Operating systems. As a result, thread-based models aren’t compatible and can’t be ported easily. Process-based model overcomes this hurdle and it ensures reliable and consistent behaviour across different environments.
Limitations of the Process-Based Model
Although, there a several advantages of using a process-based model, it also has a list of drawbacks. The drawbacks are one of the reasons why the community is planning to move to a thread-based model.
High resource overhead
Processes are heavy weight as compared to the threads. In the future, PostgreSQL would hit it’s limits on bigger machines particularly while context switching. Let’s understand how this takes place in detail.
The operating system uses assigns virtual memory to all the processes. Since the OS can’t keep all the processes in the memory at the same time, it divides the process into chunks known as pages. At a given time, there are set of pages present in the working memory.
OS maintains a data structure known as page tables to translate the virtual memory to physical memory. The CPU uses a hardware component known as Translation Lookaside Buffer (TLB) to do a quick look up of the translation. The TLB acts like a cache and in case the virtual address isn’t found in the TLB, it is fetched from the memory. The TLB can’t be shared across different processes.
If the database server handles lots of connections, it will have same number of processes handling each connection. This would lead to high number of context switches between the processes. Since the number of processes is high, the probability of TLB miss would increase with each context switch. This would create a significant overhead and reduce the performance of the database.
Increased Complexity & Code duplication
Currently, the individual processes use a shared memory for communication. As stated in the official docs of PostgreSQL, “Those backend processes communicate with each other and with other processes of the instance using semaphores and shared memory to ensure data integrity throughout concurrent data access“.
Threads run within the same address space. In case of a single address space, it is possible to share the state in the memory more efficiently between several threads.
The process-based model results in lot of code duplication. There are several memory management mechanisms used in the current code of PostgreSQL. This model introduces additional complexity in the design and development.
Why PostgreSQL Chose the Process-Based Model?
Until now, we have understood the process-based model and looked at it’s pros and cons. You might wonder why PostgreSQL decide to use a process-based model instead of a thread-based model ?
In software development, there is no right or wrong answer. It’s all about trade-offs. You choose one solution that best suits your use case but compromise on certain quality attributes.
Also, the software design can evolve over the years. This is why we have seen companies move from monolith to microservices and back to monolithic systems. Similarly, many companies switched from onprem to cloud & then back to onprem. And now we are thinking of moving PostgreSQL to a thread-based model.
Let’s see why PostgreSQL used a process-based approach in the first place. Following are the primary reasons why it adopted this architecture :-
Variations in Threading Libraries - Each OS has its own implementation of threading libraries. As per this discussion, there was a bug in the BSD which crashed MySQL. Thus, the process-based approach is more robust and ensures better compatibility across different Operating systems.
Single-point of failure - Process-based model ensures fault isolation and reduces the blast radius & the impact.
Race conditions - Multi-threaded code is prone to race conditions. It’s very difficult to debug a multi-threaded code. From my personal experience, it took me about a month to figure about a race condition and patch it with a permanent fix. Your code can work as expected most of the times but that one race-condition out of a million transactions can pose a challenge & would require lot of efforts to find the root cause.
Exploring the Thread-Based Model
Recently, the community was discussing the thread-based model and how it would benefit PostgreSQL. In case of a thread-based model, there would be a single process which would listen to the incoming connections. It would then offload the connection to a thread and subsequently the thread would be responsible for managing the connection.
The thread-based model seems simple and doesn’t create an extra overhead. It is also light-weight as compared to the process-based model. MySQL also uses a similar approach. In case of MySQL, a user thread is created for every new connection & it is responsible for executing the queries and managing the connections.
In theory, the thread-based model looks good but introducing this change in PostgreSQL code would be a daunting task. I was looking at one of the discussions and the concerns that developers had highlighted. It was primarily around use of global variables.
PostgreSQL’s current implementation relies heavily on the global variables. There are more than 2000 global variables that are currently used. In a process-based model, each process has it’s own copy of global variables.
For migrating to a thread-based model, one of the approach discussed was to move the global variables to a thread-local storage. This would be straight-forward, and simple to implement. However, increased reliance on thread-local storage would impose a performance penalty and we wouldn’t realise the benefits of switching to the new model.
Another thing that was highlighted was supporting both the thread-based and process-based models of PostgreSQL. By adopting a thread-based model, they would compromise certain benefits offered by the process-based model. As a result, it would be essential to support both the models for sometime. This would create a maintenance overhead.
Furthermore, the community would have to redesign the postmaster, develop tooling, ensure compatibility with several extensions which would add to the complexity.
Conclusion
As per the discussion, there is a consensus in the community that thread-based approach is good in theory. However, several members have raised a concern and porting to a thread-based model doesn't seem feasible in the near term.
It’s very cumbersome to introduce change in a mature software like PostgreSQL. The software is used widely throughout the globe for several purposes. A simple security flaw due to redesign can have disastrous consequences.
Additionally, the benefits of redesign should outweigh the costs involved. In case the redesign doesn’t offer significant benefit, one should refrain from making any investments.
According to me, the best way to go forward would be to execute a Proof of Concept (POC) & port PostgreSQL to a thread-based model. The POC would validate the hypothesis and also give some direction to the community on which approach seems best. There are several such initiatives which are on-going like Thread-based Postgres. The author has drafted the learnings here - Let’s make PostgreSQL multi-threaded.
So will PostgreSQL move to a thread-based model ? Well, not in the near term at least. We don’t know how things will shape up in the future. As “Change is the only constant” in software, it wouldn’t be surprising to see a multi-threaded PostgreSQL server running in production.
Let me know what you think about the future of PostgreSQL in the comments below.
Thanks for reading the article! Before you go: