Library vs Service: Lessons from Netflix's Multi-Year Migration
Framework with real-world case studies to future-proof your architecture decisions
Should I choose a library or build a service? This is a question that puzzles many software engineers. A wrong decision can lead to months or even years of rewrites and increased costs.
Software teams constantly change, and a wrong decision can impact future team members—especially those with the least context. I have worked on both sides of the table — building a service by deprecating an old library and replacing a service with a lightweight library.
I worked at a mid-sized startup where microservices were a golden hammer. They built a microservice for even the smallest piece of functionality, and the loosely coupled architecture turned into a mess. I remember a colleague once joked, “They’ll probably create a service just for Java’s toString() function!” 😁
Finally, the company decided to combine the related functionality into a library. The rewrite frustrated the developers, who kept asking, “Why didn’t the previous team think of this?” 🤔
On the other hand, while working in one of the big tech teams, we ran into scalability challenges and bottlenecks with a legacy library. Our team members blamed the old library owners for the poor design.
One pattern was common to both anecdotes: the previous team members always became the scapegoat. 😁
How can we prevent this? How do we make better decisions as software developers? How do we build software that outlives the team that built it?
This article answers these questions. It walks through a case study of how Netflix replaced a library with a service. You will learn how to make critical decisions and future-proof your tech choices.
With that, let’s begin with the case study.
Netflix Case Study - Platform for managing Membership plans
Today, Netflix has a presence in 190 countries and a subscriber base of 300 million. It offers a variety of plans tailored to devices, streaming quality, regions, and more.
But back in 2010, it was present only in the US, with just 12 million subscribers. It offered customers only three plans: Basic, Standard and Premium.
Let’s now understand the software behind the membership plans.
Membership plans library
In 2010, Netflix had a monolith (Membership service) that was responsible for managing and tracking the user subscription plans. For modularity, they decided to build a library that would:
Handle the business logic for evaluating the plans.
Store the plan metadata in various configuration files.
The below diagram shows the basic data model for the different membership plans:
Stakeholders could easily modify the plans, release and deploy the library. The below diagram shows how the service read the plans from the library.
Everything was simple until the company’s global expansion in the mid-2010.
Global expansion
Netflix first launched in Canada, followed by Latin America, the United Kingdom, Australia, India, and other markets. With the expansion, the number of subscribers grew.
For scalability, the tech teams decided to split the monolith into multiple services. Each service served a different purpose such as Playback, Billing, Edge services, etc.
Its initial strategy of offering three basic plans wasn’t sufficient for the new regions. They needed better strategies to acquire customers.
The subscription plans grew in the following dimensions:-
Devices - Plans catered for specific devices such as Mobile, TV, etc.
Quality of service - Different plans based on the resolution and audio quality.
Regions - People’s buying choices led to unique plans for various regions.
In addition, the product team wanted to run A/B tests on different subscription plans and evaluate their impact.
As the company grew, the once lightweight library accumulated complex rules and plan metadata. With the proliferation of microservices, nearly every service imported it. Although the company grew rapidly, the library couldn’t keep pace.
The below diagram presents the membership plans data model after Netflix’s geographical expansion.
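To make the new dimensions concrete, here is a minimal sketch of a plan record keyed by region, device and quality. Everything in it is an assumption made for illustration: the `Plan` class, its field names, the `CATALOG` entries and the resolution values are hypothetical and are not Netflix's actual data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan record; field names are illustrative assumptions."""
    name: str
    region: str                # country code, e.g. "IN"
    devices: Tuple[str, ...]   # device types the plan can be used on
    max_resolution: str        # made-up quality tier


# Made-up catalog: the same plan name can differ per region.
CATALOG = [
    Plan("Mobile", "IN", ("mobile",), "480p"),
    Plan("Basic", "IN", ("mobile", "tv"), "720p"),
    Plan("Basic", "US", ("mobile", "tv"), "720p"),
    Plan("Premium", "US", ("mobile", "tv"), "4K"),
]


def plans_for(region: str, device: str) -> List[str]:
    """Return the names of plans offered in a region for a given device."""
    return [p.name for p in CATALOG if p.region == region and device in p.devices]
```

Even in this toy form, every new region or device multiplies the number of entries, which hints at why the plan metadata grew so quickly.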
Let’s now dive into the challenges that the team faced while managing the library.
Library challenges
The library owners faced the following challenges:-
Operational complexity - Plan changes required changes in metadata config and business logic. The development and testing took several days. Additionally, even simple changes took days to reach production.
Reliability - Pricing consistency was critical for Netflix across devices and pages showing the subscription plans. With the library, two or more services could rely on different versions resulting in inconsistency in the shown plans.
Manageability - It became difficult to manage thousands of different plans along with custom business logic.
One of the biggest challenges of using a library is maintaining version consistency across services. The Netflix team even wrote a large-scale web crawler to verify version correctness.
The below diagram shows how different services and their instances use different library versions.
The tight coupling caused by the library resulted in less flexibility, an increased blast radius, and long development cycles.
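The version drift described above can be illustrated with a small check. This is only a sketch under stated assumptions: the `find_version_drift` function, the instance names and the version strings are all hypothetical, and it says nothing about how Netflix's own tooling worked.

```python
from collections import defaultdict
from typing import Dict, List


def find_version_drift(deployed: Dict[str, str]) -> Dict[str, List[str]]:
    """Group service instances by library version.

    Returns an empty dict when every instance runs the same version,
    otherwise a mapping of version -> instances running it.
    """
    by_version: Dict[str, List[str]] = defaultdict(list)
    for instance, version in deployed.items():
        by_version[version].append(instance)
    return dict(by_version) if len(by_version) > 1 else {}


# Hypothetical inventory: instance name -> plans library version in use.
deployed = {
    "signup-1": "2.3.0",
    "signup-2": "2.3.0",
    "billing-1": "2.1.4",
}
drift = find_version_drift(deployed)
```

Here `drift` is non-empty because the billing instance lags behind, which is exactly the situation in which two pages could show different plans.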
We will now understand how Netflix tackled these challenges.
Library deprecation
To solve the pain-points, the team decided to centralize the logic in a service. The primary goal of the platform was to provide an API interface that abstracted the business rules and data.
Here’s how the library got refactored into a platform:-
Business rules - A rule engine was developed to manage the complex rules. Rules were written in the form of expressions instead of code.
Persistence - The data was migrated from metadata configurations to a database layer for persistence. The database acted as a single source of truth for all the plans.
Self-service UI - This layer enabled stakeholders to easily configure new plans and to create and validate rules. This eliminated the need for library code changes that took weeks of effort.
Service layer - It provided APIs for different services to fetch the plans based on the context. For example, a service could pass region, device and userId in the context, and the platform would return the plans after executing the business rules.
The below diagram shows how rules were written in the library and then transformed into rule engine expressions.
As the comparison shows, defining rules as expressions is far easier than rewriting business logic every time a plan changes.
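The idea of rules as expressions can be sketched in a few lines. This is a minimal illustration, not Netflix's rule engine: the `RULES` format, the `OPS` table and the `evaluate` function are assumptions, and the single rule simply restates the IN/TH mobile condition used in the library example later in this article.

```python
import operator
from typing import Any, Dict, List

# Supported comparison operators (an assumed, deliberately tiny set).
OPS = {
    "==": operator.eq,
    "in": lambda actual, allowed: actual in allowed,
}

# Rules are plain data: a list of (field, operator, value) conditions
# plus the plans to return when all conditions hold.
RULES = [
    {
        "when": [("country", "in", ["IN", "TH"]), ("mobile_enabled", "==", True)],
        "plans": ["Mobile", "Premium"],
    },
    {"when": [], "plans": ["Basic", "Standard", "Premium"]},  # default rule
]


def evaluate(context: Dict[str, Any], rules: List[dict] = RULES) -> List[str]:
    """Return the plans of the first rule whose conditions all match."""
    for rule in rules:
        if all(OPS[op](context.get(field), value) for field, op, value in rule["when"]):
            return rule["plans"]
    return []
```

Because the rules are data rather than code, adding a plan means appending an entry (for example through a self-service UI) instead of releasing a new library version.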
The following diagram shows the high-level architecture of the platform.
The re-architecture was a multi-year effort and involved more than 20 different engineering teams. The new architecture could now scale and keep up with Netflix’s expansion in different regions.
While the effort was worth it, do you think Netflix made a wrong choice in developing a library first? 🤔
Was Netflix wrong?
Now that you understand Netflix’s old and current architecture, let’s assess whether Netflix made the wrong choice in the 2010s.
Had Netflix built a Plan service platform instead of the library, it would have supported the company’s rapid growth. It would have given more flexibility to the product teams to experiment and iterate fast.
But what would have happened had Netflix not grown the way it did in the 2010s? 🤔
Assuming that Netflix existed only in the US with no business growth, the membership platform would have:-
Increased the infrastructure costs.
Been an overkill for a simple use case.
Resulted in most of the functionality remaining unused.
So, Netflix wasn’t wrong to choose a library in the first place. But the team didn’t tackle the library’s tech debt at the right moment, which resulted in a multi-year re-architecture.
It’s very easy to criticize the previous software teams for the current challenges. But, we must respect the past engineering decisions assuming they were taken rationally with certain assumptions.
Netflix could have avoided the multi-year migration effort by:-
Identifying the issues with the library proactively.
Prioritising the tech debt and reducing operational burden.
Future-proofing the choice of a library by designing it to evolve easily into a service.
A better approach would have been to understand the library pain-points after expanding in two or three regions. A futuristic view of growing requirements would have helped them tackle the problem effectively.
Library vs Service
While deciding between a library and a service, you can use the following table to weigh the pros and cons.
It’s essential to evaluate the effectiveness of either approach in production. A proactive approach saves effort and cost in the future.
Here’s how you can evaluate the success of the solution:-
Consumer count - A service suits a large number of consumers. If your library is adopted by many consumers (> 5), it’s a sign to move the logic to a service.
Frequency of changes - A library offers less flexibility. Hence, if the code or config changes frequently (> 10 changes/month), it may indicate the need for a service.
Performance implications - If you have tight performance constraints, a service may not be the right solution, because every call adds a network hop. For example, if a 10 ms delay results in a $100K loss, you need to make trade-offs and adopt a library-based approach.
Blast radius - If changes impact multiple upstream consumers and affect critical business decisions (dollar value impact), then a centralized service is appropriate for the use case.
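The four signals above can be turned into a simple checklist. The sketch below is one possible encoding, not a definitive rule: the `UsageSignals` class and `evaluate_signals` function are hypothetical, and the thresholds are just the rough numbers quoted in the list, which you should tune for your own context.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UsageSignals:
    """Hypothetical snapshot of how a library is used in production."""
    consumer_count: int
    changes_per_month: int
    latency_critical: bool
    changes_hit_critical_consumers: bool


def evaluate_signals(s: UsageSignals) -> Tuple[List[str], List[str]]:
    """Return (signals pointing to a service, signals pointing to a library)."""
    toward_service: List[str] = []
    toward_library: List[str] = []
    if s.consumer_count > 5:               # threshold taken from the list above
        toward_service.append("consumer count")
    if s.changes_per_month > 10:           # threshold taken from the list above
        toward_service.append("frequency of changes")
    if s.changes_hit_critical_consumers:
        toward_service.append("blast radius")
    if s.latency_critical:
        toward_library.append("performance")
    return toward_service, toward_library
```

The function returns both lists rather than a single verdict on purpose: the signals can conflict, and the final call remains a trade-off for the team to make.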
The above metrics help teams decide when to pivot from one solution to the other (library to service, or the other way round). Additionally, to future-proof technology choices, teams must build an architecture that evolves seamlessly.
Let’s see, with the help of an example, how teams can design an evolvable architecture.
Library to service migration
If a team decides to develop a library, it must anticipate future business growth and challenges. It must ensure that a later transition is smooth and doesn’t turn into a multi-year effort.
Here’s how the problem can be tackled:-
API facade - Introduce a facade that mimics the future service’s API.
Client migration - Migrate the clients in a controlled manner to the internal API endpoints.
Library logic in service - The service must run the same library logic.
Fault tolerance - Include retries, monitoring and fallbacks to protect the clients.
Let’s go through a code example to understand this in detail. We will take Netflix’s example of fetching the plans.
The below code is used to fetch the membership plans.
# plan_selector.py (the current library)
def get_available_plans(user_country: str, is_mobile_enabled: bool) -> list:
    if user_country in ["IN", "TH"] and is_mobile_enabled:
        return ["Mobile", "Premium"]
    return ["Basic", "Standard", "Premium"]
Now, we will create a facade module that would:
Wrap the current library.
Mimic the API of the planned HTTP/gRPC service (inputs and outputs).
Allow switching between library and service backends in the future.
# plan_facade.py
from plan_selector import get_available_plans
# from requests import post  # Future: for HTTP call to service

USE_REMOTE_SERVICE = False  # Toggle this when ready to switch

def fetch_plans(user_context: dict) -> dict:
    """
    Mimics the API structure of the planned service.
    Input: { country: "IN", features: { mobile_enabled: true } }
    Output: { plans: [...] }
    """
    if USE_REMOTE_SERVICE:
        # Future call to the remote Plan service
        # response = post("http://plan-service/api/v1/plans", json=user_context)
        # return response.json()
        raise NotImplementedError("Remote service not yet implemented")
    else:
        # Local fallback using the library
        country = user_context.get("country")
        mobile_enabled = user_context.get("features", {}).get("mobile_enabled", False)
        plans = get_available_plans(country, mobile_enabled)
        return {"plans": plans}
This approach ensures a smooth transition from a library to a service.
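The fault-tolerance step from the list above is not covered by the facade yet. Here is a minimal sketch of how retries and a library fallback might look once the remote service exists. It rests on several assumptions: `fetch_plans_with_fallback`, the injected `remote_call` function and the retry parameters are hypothetical, and the library function is repeated only to keep the example self-contained.

```python
import time
from typing import Callable


def get_available_plans(user_country: str, is_mobile_enabled: bool) -> list:
    """Library logic, repeated from plan_selector.py for self-containment."""
    if user_country in ["IN", "TH"] and is_mobile_enabled:
        return ["Mobile", "Premium"]
    return ["Basic", "Standard", "Premium"]


def fetch_plans_with_fallback(
    user_context: dict,
    remote_call: Callable[[dict], dict],  # hypothetical client for the Plan service
    retries: int = 2,
    backoff_seconds: float = 0.1,
) -> dict:
    """Try the remote service, retry on failure, then fall back to the library."""
    for attempt in range(retries + 1):
        try:
            return remote_call(user_context)
        except Exception:
            # Broad catch for brevity; a real client would catch specific
            # network errors and record a metric here for monitoring.
            if attempt < retries:
                time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff

    # All attempts failed: serve the request from the local library.
    country = user_context.get("country")
    mobile_enabled = user_context.get("features", {}).get("mobile_enabled", False)
    return {"plans": get_available_plans(country, mobile_enabled)}
```

Passing `remote_call` in as a parameter keeps the sketch independent of any particular HTTP or gRPC client and makes the fallback path easy to exercise in tests by injecting a function that always raises.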
How would the same work if you want to replace a service with a library? Think about it and share your thoughts in the comments.
Conclusion
Netflix’s case study teaches us that decisions are not inherently right or wrong. But what’s wrong is turning a blind eye to the tech debt and not thinking strategically.
A few companies witness exponential growth, while many others do not. However, this uncertainty shouldn’t stop developers from making a deliberate choice between a library and a service.
A more pragmatic approach is to make your tech choice future-proof by pivoting at the right time. It’s critical to keep an eye on signals such as tech debt and operational burden.
For early-stage companies or teams, it makes sense to invest in a library. But develop layers of abstraction to easily convert it into a service. It’s important to be mindful of the growth and not fall into the trap of thinking tactically.
Have you worked on a refactor or a rewrite similar to Netflix’s? If yes, what did you learn from it? Leave your thoughts in the comments below.