System Design Concepts: Dive deep into Database Sharding Strategies
Sharding Strategies, their pros and cons.
Introduction
In my previous article, I had explained the fundamentals of Database sharding. We learnt how data is distributed across shards through an example. There are several ways in which data can be distributed across the shards. Sharding strategy is the technique used to distribute data across different shards.
Software systems deal with a variety of data, encompassing user information, financial transactions, and real-time stock tick data. Each system has its own requirements. For eg:- To fetch the user information, you only need the userId. But, to fetch the price data for a stock over a period, you need to pass the Start and End date. Additionally, the size of stock market data could be orders of magnitude larger than that of user information. To handle this complexity and build efficient systems, we have several Sharding strategies.
In this article, we will look at the different database sharding strategies. We will walkthrough each sharding strategy and understand the working in detail. Additionally, we will explore the downsides of each strategy and the workloads that the strategy is suited for. Let’s begin our journey with the simplest strategy that is Range-based sharding.
Range-Based Sharding
Working
Let’s understand the concept of Range-based sharding with a real-world example. Assume that you have lot many books and you want to organize them in a book shelf.
The first thing that you would want is to easily find a book whenever you feel like reading. You don’t want to scan every book as it’s a cumbersome activity.
The simplest technique is to assign a range to each row in the book shelf. For eg:- Row 1 would have books whose starting letter lies between A-G (both included). Similarly, Row 2 would be assigned a range H-O, and so on.
If you want to add a book named Deep Work, you would place it in the first row. The book War and Peace would be added in the last row. The below diagram illustrates this process.
In future, if you feel like reading “Deep Work”, you would search only the first row since it holds the range A-G. Thus, you would save lot of time and happily read the book.
The same concept is used and applied in Range-based sharding. Each database shard is responsible for managing a range of values. The application or the service is aware of the shards and their corresponding ranges. The application uses the shard key, finds the range in which the key lies and determines the shard number.
Let’s take example of stock data for Microsoft (MSFT). The data consists of stock close price for each day. We can distribute the data based on the date. For eg:- Shard 1 would contain data for year 2010, Shard 2 would have the data for 2011, and so on. The date would become the shard key here.
In case we want to fetch the stock price between April 2010 till December 2010, we can query Shard 1 and get the data. Additionally, each shard would store the data in a sorted order to improve the query performance. This is illustrated in the below diagram.
Applications
Time-series data - Systems that handle stock market data, financial transactions, logs, sensor data are suited for range-based sharding. Data is partitioned based on time intervals (e.g., hourly, daily, or monthly) to optimize querying different time ranges.
E-commerce and Order Management - In e-commerce, data of the products can be sharded using price as the shard key. It becomes easy to apply filters and fetch a range of items that lie with given price-bounds.
Pros
Range-based queries - It’s easy to query a range of values (provided all reside on a single database server).
Easy to implement - It is less complex and requires only a lookup table to determine the shard using the shard key.
Cons
Non-uniform data distribution - Range-based sharding could lead to uneven data distribution. This happens if the shard key values aren’t uniformly distributed. For eg:- There are very few books which start with W, U, V or Z. As a result, the shard assigned for this range will store few values. While the shard that would manage the range A-G or H-O will be overloaded.
Multi-range queries - Range-based sharding isn’t helpful if your application fetches the data that spans multiple ranges. In such cases, data needs to be retrieved from multiple shards, and thus increases the complexity.
Hash-based Sharding
Working
One of the primary drawbacks of range-based sharding was uneven distribution of data. As a result, some shards wouldn’t be fully utilized while others would reach their maximum capacity. Hash-based sharding solves this problem by uniformly distributing the data across different shards. Let’s understand hash-based sharding with an example.
Let’s say you are building a social media website and storing the user’s information. In case you want to shard the user information, you can use the userId as the shard key. This shard key is then passed to a hash function, that outputs an integer result. We use this result and perform the operation result % (shard count), to determine the owner shard.
The below diagram illustrates the process of hash-based sharding.
In the above example, we used userId as the shard key because we would lookup the user information using the userId. The choice of the shard key depends on the data access pattern. The userId uniquely identifies a user record in a database, hence, it’s suited to be a shard key.
We rarely search user info using the date of birth, last name, city, etc. So, such candidates are not suited for the shard key.
Applications
Content-Delivery Networks (CDNs) - CDNs use the hash value of the content’s url to determine the edge server that would store the content. This enables faster content delivery and reduces latency for end-users.
Load Balancing - Hash-based sharding efficiently distributes the data records across a set of backend database servers. This evenly distributes the load among the servers. Additionally, a single shard doesn’t become a bottleneck unlike range-based sharding.
Pros
Uniform data distribution - Hash-based sharding uniformly distributes the data across several database servers. It doesn’t result in skewed traffic to certain database servers.
Cons
Data rebalancing - This technique uses the formula Hash(shard key) % (shard count), to determine the owner shard. This works fine if the number of shards remains constant. But the moment, we introduce new shards or take down existing shards (shard count changes), this would fail for existing shard keys. The shard for each key would change, and we would have to move all the data resulting in heavy I/O operation. As a result, this approach is not suitable for workloads which frequently scale by adding new shards.
Range-based queries - Since data is distributed uniformly based on the hash values, it’s unlikely to find values which lie within a range on a single server. Thus, if your requirement is to search for a range of values, you shouldn’t use hash-based sharding.
Consistent Hashing
Working
The downside of Hash-based sharding is that majority of the data needs to be rebalanced while adding or removing database servers. This increases the I/O and also consumes lot of bandwidth. Further, it introduces additional complexity due to shard rebalancing.
Consistent Hashing solves this problem and ensures that only a subset of records are moved in case a shard is added or removed. Let’s understand briefly how Consistent Hashing works.
We start with a ring and place the servers on this ring. The server’s IP address is used, passed to a hash function and the operation hash(IP address) % 360 is done. This operation helps us to determine the position of the server on the ring. This process is followed for all the servers and the servers are placed on the ring.
Next, the operation is applied on the shard key to determine it’s location on the ring. The closest server in the clock-wise direction becomes the owner of the shard-key and it’s record. This process is followed every time a new record is added. This is explained in the below diagram.
What if a server fails ? In case of server failure, the keys will moved to the first server in the clock-wise direction.In the above example, if Server 2 fails, the keys that it holds will get transferred to the next server i.e Server 3.
Similarly, if a new server is introduced between Server 3 and Server 1, it would take the ownership of the keys just before it and after Server 1.
Consistent hashing is used in databases such as Amazon’s DynamoDB and Apache Cassandra. It is also used in most of the key-value stores and for distributed caching.
Pros
Scalability - Shards can be easily introduced or removed by impacting minimum number of shard keys.
Cons
Range-based queries - Similar to hash-based sharding, Consistent Hashing isn’t suitable if your want to query a range of values.
Geo-based sharding
Working
In recent times, many of the tech companies have started expanding to different countries. One of the patterns that has emerged is to store the data close to the users. This is known as Geo-sharding.
The application uses an attribute such as the country or the IP address as the shard key. It then determines the closest database server and then persists the data. For eg:- Data for European users would be stored in Dublin. While data for users in Asia would be stored somewhere in Mumbai.
The below diagram illustrates the process of Geo-sharding.
Pros
Performance - One of the biggest advantage of Geo-sharding is improvement in the latency and performance. As the data is stored close to the users, it would require less time to fetch the data.
Cons
Data replication - The data needs to be replicated to a different region for redundancy. Transferring data between regions incurs network transfer costs & leads to higher operational expenses.
Data Security Concerns - Storing data in multiple regions may increase the attack surface and introduce potential security vulnerabilities. Proper measures must be implemented to ensure data security and protect against breaches.
Conclusion
Database sharding is the best solution to meet the increasing demands of the customers. Sharding helps us to scale our databases and store large amounts of data. There are several techniques to shard the database & each has its own pros & cons.
Range-based sharding is suitable for the workloads that query a range of values. The only downside is uneven data distribution and skewness of the load.
Hash-based sharding distributes the data uniformly across the different shards. It’s not suitable for queries that fetch a range of values. The data needs to be redistributed in case a shard fails or gets added to the cluster.
Consistent Hashing solves the problem associated with Hash-based sharding. It minimizes the number of records that need to be moved due to addition or removal of nodes. Amazon DynamoDB and Apache Cassandra use this strategy to shard the data.
With geographical expansion, storing data close to the users has become an emerging pattern. This results in improved performance due to lower latency.
The choice of sharding strategy depends on the data access pattern & data distribution. There is no one size fits all strategy that can be used for all the data storage problems.
We will continue our journey to demystify complex tech concepts next week. I am planning to cover Microservices, Service Mesh, etc in my article next week. Stay tuned for subscribe to my newsletter. Let me know your thoughts in the comments.
Thanks for reading the article! Before you go: