DynamoDB Global Secondary Indexes - Internal Working and Best Practices
Understanding the Trade-offs, Performance Impact, and Real-World Use Cases
DynamoDB is an ideal database until your application requires different keys for data lookup. It provides a feature known as Global Secondary Index (GSI) that enables users to look up data using a variety of keys.
But is it free of cost? Does it impact the table’s read or write performance? What are some best practices to use while designing GSIs?
This article will answer all of the above questions. We will start with a real-world problem and understand the internals of GSIs.
Having a sound knowledge of database features is essential to build robust and scalable applications. By the end of this article, you will gain sufficient expertise to apply the knowledge and use best practices to solve impactful problems.
With that, let’s begin and solve a simple real-world problem.
Problem statement
Let’s assume that you want to design a shopping cart system for an e-commerce website. The system must support the following functions:-
Users must be able to add items to the cart.
They must be able to look up all the added items in the cart.
It must be possible to find all the items that were added to the cart one day ago.
We will now devise a solution that meets all the above requirements.
Solution
Given the simple requirements, a key-value store like DynamoDB would be appropriate for our use case. Here’s what our data for a single added item would look like:-
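As a rough illustration, a single cart entry could be represented as follows. This is only a sketch: userId, itemId, createdAt, and status are the attributes discussed in this article, while quantity and all the values are made-up assumptions.

```python
from datetime import datetime, timezone

# Hypothetical cart entry. Only userId, itemId, createdAt, and status come
# from the article; quantity and every value here are illustrative.
cart_item = {
    "userId": "user1",
    "itemId": "item1",
    # Stored as an ISO-8601 UTC string (an assumption) so that string
    # comparison matches chronological order.
    "createdAt": datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc).isoformat(),
    "quantity": 2,
    "status": "ACTIVE",
}

print(cart_item)
```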
DynamoDB requires users to state a partition key and a sort key (optional) during schema definition. Both should be chosen in a way to make the lookup efficient.
Since our access pattern needs to look up all items for a user, the userId should be the partition key. By passing a userId, we should be able to find all the added items in the cart.
We can optionally choose the itemId as the sort key. For a given user, all the items would be sorted based on the itemId. A combination of userId and itemId would return the information of the item in the cart.
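The key choice above can be sketched as the parameters of a table definition. The table name ShoppingCart, the string key types, and the on-demand billing mode are assumptions made for illustration; the function only builds the request dictionary and does not call AWS.

```python
def cart_table_definition(table_name="ShoppingCart"):
    """Build create-table parameters for the cart table (hypothetical name)."""
    return {
        "TableName": table_name,
        # Only key attributes are declared up front; other attributes are schemaless.
        "AttributeDefinitions": [
            {"AttributeName": "userId", "AttributeType": "S"},
            {"AttributeName": "itemId", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "userId", "KeyType": "HASH"},   # partition key
            {"AttributeName": "itemId", "KeyType": "RANGE"},  # sort key
        ],
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


table_definition = cart_table_definition()
print(table_definition["KeySchema"])
```

With boto3 installed and credentials configured, these parameters could be passed along as boto3.client("dynamodb").create_table(**table_definition).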
The solution meets requirements 1 and 2 from the previous section. But what about requirement 3? Would the given schema also enable efficient lookups by item? Let’s understand this in the next section.
Non-partition key lookups
In our schema, the itemId is a non-partition key attribute, so it doesn’t support efficient lookups on its own. However, we can specify a filter expression in DynamoDB’s Scan API to retrieve a given item.
A scan reads all the records in the table and returns only the ones that match the criteria. For example: filter only the items where itemId = “item1” and createdAt < today - 1
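The filter described above could be expressed as Scan parameters roughly like this. It is a sketch under assumptions: the table name ShoppingCart is hypothetical, and createdAt is assumed to be stored as an ISO-8601 UTC string so that a string comparison follows time order.

```python
from datetime import datetime, timedelta, timezone


def scan_request(item_id, now, table_name="ShoppingCart"):
    """Build Scan parameters that filter on itemId and createdAt."""
    cutoff = (now - timedelta(days=1)).isoformat()
    return {
        "TableName": table_name,
        # The filter runs after the items are read, so the scan still
        # touches every item in the table.
        "FilterExpression": "itemId = :item AND createdAt < :cutoff",
        "ExpressionAttributeValues": {
            ":item": {"S": item_id},
            ":cutoff": {"S": cutoff},
        },
    }


request = scan_request("item1", datetime(2024, 1, 16, tzinfo=timezone.utc))
print(request["ExpressionAttributeValues"])
```

Because the filter is applied only after the data is read, the request consumes read capacity for every scanned item, not just for the matching ones.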
But would this solution scale? What if we have 1 million items or 100 million items in the table? Think about it before reading further.
The answer is no. Here’s what would happen with 100 million records:-
Client would send a request with the filtering criteria.
DynamoDB would retrieve each record from the disk and apply the filtering criteria.
Every record would be read from the disk, and a single Scan call returns at most 1 MB of data, so the client would have to issue many time-consuming calls (~100-200 milliseconds each).
If we extrapolate this to 100 million records, the overall time taken would be in hours or, at worst, days.
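A back-of-the-envelope estimate makes the extrapolation concrete. The sketch assumes an average item size of 1 KB, a sequential scan, and the 100-200 millisecond figure applied to each 1 MB page; all three are assumptions, and real numbers depend on item size, capacity, and parallelism.

```python
import math

SCAN_PAGE_BYTES = 1024 * 1024  # a single Scan call reads at most 1 MB


def estimate_scan_seconds(item_count, avg_item_bytes, seconds_per_page):
    """Estimate the duration of a sequential full-table scan."""
    pages = math.ceil(item_count * avg_item_bytes / SCAN_PAGE_BYTES)
    return pages * seconds_per_page


# Assumptions: 100 million items of about 1 KB each, 0.1 to 0.2 s per page.
low = estimate_scan_seconds(100_000_000, 1024, 0.1)
high = estimate_scan_seconds(100_000_000, 1024, 0.2)
print(f"{low / 3600:.1f} to {high / 3600:.1f} hours")
```

Under these assumptions the scan needs close to 100,000 sequential calls, which lands in the range of hours for a single lookup.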
The slow operation would result in poor user experience and timeouts. So, how do you fix this?
DynamoDB provides a feature known as Global Secondary Indexes (GSIs) to solve this problem. In the next section, we will learn how GSIs do it.
Global Secondary Indexes (GSIs)
GSIs enable users to define an index whose keys are attributes other than the table’s partition key and sort key. Thus, key-based lookups are no longer constrained to the table’s own keys.
For our use case, we need to look up all the users based on an item. We can create a GSI by using itemId as the partition key and createdAt as the sort key.
With the GSI, DynamoDB will look up the itemId and use createdAt to find the items that were added a day ago. The operation would typically complete in under 10 milliseconds.
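The index and the lookup described above can be sketched as request parameters. The index name itemId-createdAt-index, the table name, the ALL projection, and the ISO-8601 string format for createdAt are assumptions; the code only builds dictionaries and does not call AWS.

```python
from datetime import datetime, timedelta, timezone

INDEX_NAME = "itemId-createdAt-index"  # hypothetical index name


def gsi_definition():
    """Index definition with itemId as partition key and createdAt as sort key."""
    return {
        "IndexName": INDEX_NAME,
        "KeySchema": [
            {"AttributeName": "itemId", "KeyType": "HASH"},
            {"AttributeName": "createdAt", "KeyType": "RANGE"},
        ],
        # Assumption: copy every attribute into the index.
        "Projection": {"ProjectionType": "ALL"},
    }


def query_request(item_id, now, table_name="ShoppingCart"):
    """Build Query parameters that use the index instead of scanning."""
    cutoff = (now - timedelta(days=1)).isoformat()
    return {
        "TableName": table_name,
        "IndexName": INDEX_NAME,
        # Both conditions are key conditions, so only matching entries are read.
        "KeyConditionExpression": "itemId = :item AND createdAt < :cutoff",
        "ExpressionAttributeValues": {
            ":item": {"S": item_id},
            ":cutoff": {"S": cutoff},
        },
    }


index = gsi_definition()
query = query_request("item1", datetime(2024, 1, 16, tzinfo=timezone.utc))
print(query["KeyConditionExpression"])
```

The index definition would go into the GlobalSecondaryIndexes list when creating the table, with createdAt also declared in AttributeDefinitions. Unlike the scan, the query reads only the entries stored under the requested itemId.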
In the future, we may get a requirement to find all the users who have added an item to the cart. We can create a GSI with the following schema:-
itemId - partition key
userId - sort key
This would allow us to perform queries such as:-
Find all users who have added the item to their cart -
Key = {itemId}
Find the status of the item for a given user -
Key = {itemId, userId}
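The two lookups above can be sketched as Query parameters. The index name itemId-userId-index and the table name are hypothetical. Note that indexes are read with Query (or Scan) rather than GetItem, so even the single-entry lookup is written as a query.

```python
USER_INDEX = "itemId-userId-index"  # hypothetical index name


def users_for_item(item_id, table_name="ShoppingCart"):
    """Key = {itemId}: every cart entry that contains the item."""
    return {
        "TableName": table_name,
        "IndexName": USER_INDEX,
        "KeyConditionExpression": "itemId = :item",
        "ExpressionAttributeValues": {":item": {"S": item_id}},
    }


def item_for_user(item_id, user_id, table_name="ShoppingCart"):
    """Key = {itemId, userId}: the entry of one item for one user."""
    return {
        "TableName": table_name,
        "IndexName": USER_INDEX,
        "KeyConditionExpression": "itemId = :item AND userId = :user",
        "ExpressionAttributeValues": {
            ":item": {"S": item_id},
            ":user": {"S": user_id},
        },
    }


all_users = users_for_item("item1")
one_user = item_for_user("item1", "user1")
print(one_user["KeyConditionExpression"])
```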
Let’s now dive into what happens behind the scenes when you create a GSI.
GSI internals
When you create a GSI on an existing DynamoDB table, it internally creates a new physical table. It uses the GSI’s partition and sort key for the newly created table and allocates the data accordingly.
The following diagram shows the original DynamoDB table along with its GSI.
In the above example, we created a GSI using itemId as the partition key and the createdAt attribute as the sort key. As seen from the diagram, it creates a separate table for the GSI.
DynamoDB abstracts the new physical table from the end users. End users are only aware of the GSI.
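A toy in-memory model can make the idea of a second physical table tangible. This is not how DynamoDB is implemented; it is a minimal sketch in which the class ToyTableWithGsi and its methods are invented for illustration, and the index is updated immediately for simplicity.

```python
from collections import defaultdict


class ToyTableWithGsi:
    """Toy model: one store keyed by the table keys, one keyed by the GSI keys."""

    def __init__(self):
        self.base = {}                # (userId, itemId) -> item
        self.gsi = defaultdict(list)  # itemId -> copies sorted by createdAt

    def put_item(self, item):
        key = (item["userId"], item["itemId"])
        if key in self.base:
            # Drop the stale copy of this entry from the index first.
            self.gsi[item["itemId"]] = [
                e for e in self.gsi[item["itemId"]] if e["userId"] != item["userId"]
            ]
        self.base[key] = dict(item)
        # The index keeps its own copy of the data, organised by its own keys.
        self.gsi[item["itemId"]].append(dict(item))
        self.gsi[item["itemId"]].sort(key=lambda e: e["createdAt"])

    def query_gsi(self, item_id, created_before):
        return [e for e in self.gsi.get(item_id, []) if e["createdAt"] < created_before]


table = ToyTableWithGsi()
table.put_item({"userId": "user1", "itemId": "item1", "createdAt": "2024-01-14T09:00:00"})
table.put_item({"userId": "user2", "itemId": "item1", "createdAt": "2024-01-16T09:00:00"})
table.put_item({"userId": "user1", "itemId": "item2", "createdAt": "2024-01-14T10:00:00"})
matches = table.query_gsi("item1", "2024-01-15T00:00:00")
print([m["userId"] for m in matches])
```

The same three entries end up stored twice, once under (userId, itemId) and once under itemId, which is where the extra storage comes from.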
Now that you understand the working of GSI, can you think of the downsides of using a GSI? Give it a thought before reading further.
Let’s now look at some downsides of having GSIs.
Downsides of GSIs
Costs
Astute readers might have noticed that you are duplicating the data for every GSI. The amount of data stored would grow with every GSI.
Additionally, in provisioned capacity mode, GSIs don’t share the provisioned read and write capacity units (RCUs/WCUs) with the original table. Users have to allocate capacity units for each GSI separately.
As a result, if you are being charged $X for the original table, your overall cost could be close to 2*$X with one GSI that projects all attributes. With every new GSI, your storage costs would grow linearly.
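The linear growth can be written as a small formula. The sketch below is a simplification: it assumes each GSI stores a fixed fraction of the table's data (1.0 when all attributes are projected) and ignores per-item index overhead and pricing details.

```python
def total_storage_gb(base_gb, gsi_projected_fractions):
    """Approximate storage for a table plus its GSIs.

    Each entry is the fraction of the table's data copied into one GSI
    (1.0 for a full projection). Per-item index overhead is ignored.
    """
    return base_gb * (1 + sum(gsi_projected_fractions))


one_full_gsi = total_storage_gb(100, [1.0])
two_gsis = total_storage_gb(100, [1.0, 0.25])
print(one_full_gsi, two_gsis)
```

Projecting fewer attributes into a GSI is one way to keep the multiplier below 2 for that index.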
Hot partition
For our use case, what would happen if an item suddenly becomes popular? For example, many users might try to purchase a new iPhone on release.
Our main table would be able to scale and efficiently handle the writes. Since the original table uses userId as the partition key, it would distribute the data efficiently across different partitions.
But what about the GSI on the itemId? What would happen to the user writes?
Let’s understand in detail what happens to the GSI. Here’s what would happen:-
The GSI would receive several requests for the same item (e.g., iPhone).
Since the itemId is the partition key, the partition handling the item would receive a spike in the count of requests.
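If the request count crosses the partition’s throughput limit, DynamoDB would throttle (reject) the requests.
As a result, updates to the GSI would fall behind, and sustained throttling on the GSI would cause writes to the original table to be throttled as well.
This would slow down user writes and widen the window of inconsistency between the GSI and the original table.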
The below diagram illustrates how a GSI partition becomes a hot partition. The original table uses userId as the partition key and uniformly distributes data. But, if many users purchase the same item, the partition storing that item becomes a hot partition.
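A small simulation illustrates the skew. It is only a sketch: the partition count, the hash function, and the traffic mix (90% of writes for one item) are assumptions, and DynamoDB's real partitioning is internal and more sophisticated.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # illustrative; real partition counts are managed by DynamoDB


def partition_for(key):
    """Map a partition key to a partition (stand-in for DynamoDB's internal hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# 10,000 writes from distinct users; 9 out of 10 add the same popular item.
writes = [(f"user{i}", "iphone" if i % 10 else f"item{i}") for i in range(10_000)]

base_load = Counter(partition_for(user_id) for user_id, _ in writes)
gsi_load = Counter(partition_for(item_id) for _, item_id in writes)

base_peak = max(base_load.values()) / len(writes)
gsi_peak = max(gsi_load.values()) / len(writes)
print(f"busiest table partition: {base_peak:.0%}, busiest GSI partition: {gsi_peak:.0%}")
```

Keyed by userId, the writes spread across all partitions; keyed by itemId, at least 90% of them land on whichever partition holds the popular item.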
It’s important to assess the impact of choosing an attribute as a GSI partition key. We need to take measures to ensure that we don’t run into hot partition issues in the future.
Eventual Consistency
Data is propagated from the original table to its GSIs asynchronously. Under normal conditions this takes a fraction of a second, but it may take a few seconds or longer when the GSI is throttled.
Since GSIs support only eventually consistent reads, there would be brief periods where the data is present in the original table but not in the GSI.
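The window of inconsistency can be illustrated with a toy model in which index updates wait in a queue. The class ToyAsyncIndex and its explicit propagate step are invented for illustration; in DynamoDB the propagation happens automatically in the background.

```python
from collections import deque


class ToyAsyncIndex:
    """Toy model of a table whose index is updated asynchronously."""

    def __init__(self):
        self.base = {}          # (userId, itemId) -> item
        self.index = {}         # itemId -> list of items
        self.pending = deque()  # writes not yet applied to the index

    def put_item(self, item):
        # The write is acknowledged once the base table has it.
        self.base[(item["userId"], item["itemId"])] = item
        self.pending.append(item)

    def propagate(self):
        # Stand-in for the background propagation to the index.
        while self.pending:
            item = self.pending.popleft()
            self.index.setdefault(item["itemId"], []).append(item)

    def query_index(self, item_id):
        return list(self.index.get(item_id, []))


store = ToyAsyncIndex()
store.put_item({"userId": "user1", "itemId": "item1"})
before = store.query_index("item1")  # index has not caught up yet
store.propagate()
after = store.query_index("item1")   # index now reflects the write
print(len(before), len(after))
```

A reader that queries the index between the write and the propagation sees the older state, which is why flows that must read their own writes usually go to the base table instead.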
Now that you understand the downsides of having GSIs, let’s look at some best practices while designing GSIs.
GSI best practices
Schema design
Once a GSI is created, its key schema and projected attributes can’t be modified; you would have to delete the index and create a new one. So, it’s essential to carefully design the GSI schema. Here are some tips to design the schema:-
Design the GSI keys based on your application’s most frequent query patterns.
Combine multiple attributes into a composite sort key (for example, status#createdAt) to enable filtering on several criteria.
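One common way to put several attributes into a sort key is to concatenate them into a single string attribute. The sketch below assumes a hypothetical attribute named statusCreatedAt with a # separator, and it only builds the key condition parameters.

```python
def composite_sort_key(status, created_at):
    """Join status and createdAt into one sortable string, e.g. ACTIVE#2024-01-15..."""
    return f"{status}#{created_at}"


def items_with_status(item_id, status):
    """Key condition selecting one item's entries that have a given status."""
    return {
        # statusCreatedAt is a hypothetical attribute used as the GSI sort key.
        "KeyConditionExpression": "itemId = :item AND begins_with(statusCreatedAt, :prefix)",
        "ExpressionAttributeValues": {
            ":item": {"S": item_id},
            ":prefix": {"S": f"{status}#"},
        },
    }


sort_key = composite_sort_key("ACTIVE", "2024-01-15T10:30:00+00:00")
condition = items_with_status("item1", "ACTIVE")
print(sort_key)
```

The order of the parts matters: the attribute placed first can be matched exactly or by prefix, while the last one can additionally be range-filtered within that prefix.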
Performance considerations
Based on your application’s performance requirements, the following are some practices that you can follow:-
Avoid hot partitions for low cardinality GSI keys. For example, status can be either ACTIVE or FAILED. In such cases, add a suffix (0-100) to the key value to distribute the writes uniformly.
GSIs are not the right fit in case you want strong consistency. So, choose GSIs if your application can tolerate eventual consistency.
Monitor throttling of base table and GSI to minimize consistency issues.
Consider using on-demand capacity for GSIs in case the load is unpredictable and spiky in nature. For predictable workloads, provisioned capacity would be appropriate.
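The suffix technique for low cardinality keys can be sketched as follows. The shard count of 100 and the # separator are assumptions; writers pick a random suffix, and readers have to query every suffixed key and merge the results.

```python
import random

SHARD_COUNT = 100  # assumption: suffixes 0 to 99


def sharded_key(status, rng=random):
    """Pick a random shard suffix so writes spread over many key values."""
    return f"{status}#{rng.randrange(SHARD_COUNT)}"


def all_shard_keys(status):
    """Every key value a reader must query to see all entries with this status."""
    return [f"{status}#{n}" for n in range(SHARD_COUNT)]


write_key = sharded_key("ACTIVE", random.Random(0))
read_keys = all_shard_keys("ACTIVE")
print(write_key, len(read_keys))
```

The trade-off is read amplification: one logical lookup becomes SHARD_COUNT queries, so the shard count should be only as large as the write volume requires.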
Cost considerations
GSIs are costly since you get charged for the additional storage and for the writes that propagate from the original table to each index. Here’s how you can optimize the costs:-
Avoid creating GSIs unless there’s a strong need.
In case the number of GSIs exceeds 10 or 15 (the default maximum is 20 per table), it may be worth reconsidering whether DynamoDB is the right choice of database.
Keep sparse data in the GSI. For example, in case you only want active items in the cart, create a separate attribute (GSI_Status) as the GSI key and set its value only when the cart item is ACTIVE. For other states, leave the attribute out of the item. A GSI only contains items that have its key attributes, so this reduces the data size.
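A sparse index relies on how items are written. In this sketch, which assumes GSI_Status is the partition key of the GSI, the attribute is left out entirely for non-active entries, since an item without the index key attribute is not written to the index. The helper name to_cart_item is hypothetical.

```python
def to_cart_item(user_id, item_id, status):
    """Build a cart item that carries GSI_Status only while it is ACTIVE."""
    item = {"userId": user_id, "itemId": item_id, "status": status}
    if status == "ACTIVE":
        # Only items with this attribute appear in the GSI keyed on GSI_Status.
        item["GSI_Status"] = status
    return item


active = to_cart_item("user1", "item1", "ACTIVE")
removed = to_cart_item("user1", "item2", "REMOVED")
print("GSI_Status" in active, "GSI_Status" in removed)
```

When an entry leaves the ACTIVE state, removing GSI_Status from the item also removes the entry from the index.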
Conclusion
DynamoDB provides the Global Secondary Index (GSI) feature to perform lookups on non-partition key attributes. The feature is helpful for applications that need different data access patterns.
While GSIs extend the querying capabilities of an application, they have the following downsides:-
Internally, a separate physical table is created with user specified partition and sort key.
It results in data duplication and higher storage costs.
GSIs consume capacity units separately from the original table.
A sound understanding of GSIs and their best practices helps you avoid problems like hot partitions. It also helps prevent unexpected billing issues in the future.
It’s important to keep track of how many GSIs you use. If the count crosses into double digits, you should reflect on whether a key-value store is the right solution to the problem.
If you have used GSIs in the past, what was your experience using them? Leave your thoughts in the comments below.