Designing Instagram's Video Uploads: Optimizing for Low Latency and Scalability
System Design, Architecture, and Trade-offs in Building a Video Upload Service
Have you wondered how Instagram handles millions of video uploads daily?
How does it make an uploaded video available for consumption within seconds?
Instagram engineers have built a robust and scalable Video Upload Service that abstracts away the complexity of video upload and processing.
In this article, we will design and architect one such service. We will tackle the problem starting with a simple solution, identify its bottlenecks, and iterate to optimize it.
By the end, you will understand the core challenges and trade-offs involved in building a reliable, scalable and efficient video upload service. You’ll also gain key skills to tackle a similar problem in your next system design interview.
With that, let’s revisit the fundamental video processing concepts discussed in the last edition of this newsletter.
Video processing fundamentals
Videos go through six phases before they are available for user consumption. The below diagram illustrates the process.
Video viewers use different applications and devices for playing videos. Similarly, their network bandwidth also varies.
Hence, it’s essential to encode a video into several formats and at varying bitrates for compatibility. For example, a 1080p video may not render as expected on a slow internet connection, so a lower resolution such as 240p or 144p must also be available.
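The bandwidth-to-resolution mapping can be sketched as a simple rendition ladder. The resolutions and bitrate thresholds below are illustrative assumptions, not Instagram’s actual values:

```python
# Illustrative rendition ladder: (resolution, minimum bandwidth in kbps).
# The thresholds are assumptions for the sketch, not real product numbers.
LADDER = [("1080p", 5000), ("480p", 1500), ("240p", 700), ("144p", 0)]

def pick_resolution(bandwidth_kbps):
    # Return the highest resolution the connection can sustain.
    for resolution, min_kbps in LADDER:
        if bandwidth_kbps >= min_kbps:
            return resolution
    return "144p"  # fallback for extremely slow connections

print(pick_resolution(6000))  # fast connection -> "1080p"
print(pick_resolution(300))   # slow connection -> "144p"
```

In production, the client typically makes this choice continuously during playback (adaptive bitrate streaming), but the core idea is this lookup.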
In the video processing step, the system re-encodes the video into multiple formats. This step also performs operations such as watermarking, thumbnail generation, encryption, etc.
As seen in the above diagram, the video processing step can be modelled as a Directed Acyclic Graph (DAG). Each node in the DAG represents an operation, and the edges indicate data flow.
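The DAG structure can be sketched with Python’s standard library. The operation names below are illustrative placeholders, not Instagram’s actual pipeline:

```python
from graphlib import TopologicalSorter

# Each key is an operation; its value is the set of operations it depends on.
# Operation names are illustrative placeholders for the sketch.
processing_dag = {
    "validate": set(),
    "transcode_1080p": {"validate"},
    "transcode_480p": {"validate"},
    "thumbnail": {"validate"},
    "watermark": {"transcode_1080p", "transcode_480p"},
    "publish": {"watermark", "thumbnail"},
}

# A topological order is one valid sequential execution of the DAG;
# operations whose dependencies are all met could also run in parallel.
order = list(TopologicalSorter(processing_dag).static_order())
print(order)
```

Modelling the step as an explicit graph is what later lets a scheduler hand out independent branches (e.g. the two transcodes) to different machines.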
Now that you know the basics, let’s design a service that uploads and processes videos.
Problem Statement
Design a scalable, efficient and reliable video upload and processing service.
Let’s break down the problem statement into functional and non-functional requirements.
Functional Requirements
Users must be able to upload videos.
The system must process and generate multiple video formats. It must include functionalities such as watermarking, encryption, etc.
Users must be able to upload videos through channels like messages, stories, and posts.
Let’s define the non-functional requirements in the context of each functional requirement.
Non-functional Requirements
System must scale to handle millions of concurrent uploads.
Upload and processing p90 latency must be within 10 seconds. (This keeps the experience interactive.)
It must be fault-tolerant and reliably recover from any failures.
Given clear requirements, we will now come up with a simple design that meets them.
Approach 1 - Synchronous solution
In this approach, the client uploads the video and it gets stored in a blob storage. The video then gets processed by the backend Video Processing Service.
The following diagram shows the end to end process.
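A minimal sketch of the synchronous flow, where the client waits while every rendition is produced in sequence (function names, resolutions, and the simulated delay are illustrative):

```python
import time

def transcode(video, resolution):
    # Stand-in for a compute-heavy re-encode; real transcoding
    # time grows with the length of the video.
    time.sleep(0.01)
    return f"{video}@{resolution}"

def handle_upload_sync(video):
    """Synchronous handler: the client waits for every step to finish."""
    outputs = []
    for resolution in ["1080p", "480p", "240p", "144p"]:
        # Each rendition is produced one after another; a failure
        # anywhere forces the client to restart the whole request.
        outputs.append(transcode(video, resolution))
    return outputs

results = handle_upload_sync("cat_video.mp4")
print(results)
```

The total latency is the sum of all the per-rendition times, which is exactly the property the next approaches attack.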
While the solution is easy to understand and simple to manage, would it scale?
Before reading further, take a pause to think about where it would fail to meet our goals.
Here’s why the solution wouldn’t scale:
Reliability & Fault tolerance - If video processing fails, the client has to restart the whole process. Even if a single operation (e.g., watermarking) fails, the whole step needs to be retried. This compromises the service’s reliability.
Latency - Video processing time is proportional to the length of the video. Due to sequential processing, large video files would take a long time: minutes, if not hours, for large files (more than 200 MB).
Scalability - Given the sequential nature, a given server can handle only a limited number of requests. Adding more instances would result in higher costs.
Efficiency - The solution performs compute-heavy video transcoding operations in sequence. It doesn’t exploit parallel processing, which makes it inefficient.
Now that we know the downsides of the synchronous solution, let’s improve it one step at a time.
Approach 2 - Asynchronous solution
We can improve the solution in the previous section by making the backend video processing asynchronous.
The compute-heavy video transcoding can be decoupled from the uploading and pre-processing steps (video file validation, repairing malformed files, etc.). This decoupling allows both layers to scale independently and handle failures robustly.
The above diagram shows the new architecture. Let’s dive into the key changes introduced in the design.
Here’s how the video transcoding can be handled:
Scheduler - The scheduler is responsible for orchestrating the video processing step (the DAG). It manages and coordinates the work of several worker instances.
Workers - These server instances perform operations such as transcoding, watermarking, etc. They work independently and inform the scheduler on completion of a task.
The communication between the Scheduler and Workers takes place via a message queue.
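A minimal in-process sketch of the scheduler-worker pattern, using Python threads and queues in place of separate machines and a real message broker (task names are illustrative):

```python
import queue
import threading

task_queue = queue.Queue()  # scheduler -> workers
done_queue = queue.Queue()  # workers -> scheduler (completion events)

def worker():
    # Workers pull independent operations and report back on completion.
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut down this worker
            task_queue.task_done()
            break
        done_queue.put(f"{task}:done")
        task_queue.task_done()

# The "scheduler" enqueues the DAG operations that are ready to run.
for op in ["transcode_1080p", "transcode_480p", "watermark"]:
    task_queue.put(op)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
task_queue.join()          # wait until every task is acknowledged
for _ in threads:
    task_queue.put(None)   # one sentinel per worker
for t in threads:
    t.join()

completed = []
while not done_queue.empty():
    completed.append(done_queue.get())
print(sorted(completed))
```

In a real deployment the queues would be a durable broker (Kafka, SQS, etc.) so that tasks survive worker crashes, but the coordination pattern is the same.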
This solution addresses the primary bottlenecks of the previous one as follows:
Reliability & Fault-tolerance - The whole video upload process doesn’t need to be retried in case of failures. Individual processing operations can be retried by either the scheduler or the worker.
Scalability - The async process with workers makes it easy to scale the processing step. Similarly, the upload and pre-processing steps can be scaled independently.
Efficiency - Unlike the previous approach, two or more transcoding operations can be performed on multiple worker machines in parallel.
While this approach introduces excellent improvements, it comes with the following downsides:
Latency - The overall latency doesn’t improve significantly since it depends on the successful completion of all the individual processing steps.
Critical operations - The scheduler doesn’t identify and prioritize critical operations over non-critical ones (like video analytics). This further impacts the latency.
Let’s see how we can make this solution better.
Optimizations
We can further improve the process as follows:
Latency - The video can be made available as soon as the highest resolution is ready. Hence, the system doesn’t need to wait for all the resolutions to be available.
Critical operations - The Scheduler can maintain two different queues for tasks: critical and non-critical. Latency-sensitive tasks go into the critical queue, while the rest go into the non-critical one.
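The two-queue idea can be sketched as follows; the scheduler always drains the critical queue before touching the non-critical one (task names are illustrative):

```python
from collections import deque

critical = deque()      # latency-sensitive ops (e.g. transcoding)
non_critical = deque()  # best-effort ops (e.g. video analytics)

def submit(task, is_critical):
    (critical if is_critical else non_critical).append(task)

def next_task():
    # Critical work is always dispatched first.
    if critical:
        return critical.popleft()
    if non_critical:
        return non_critical.popleft()
    return None

submit("analytics", is_critical=False)
submit("transcode_480p", is_critical=True)
submit("thumbnail", is_critical=True)

order = [next_task() for _ in range(3)]
print(order)  # critical tasks first, then analytics
```

A production scheduler would also guard against starvation of the non-critical queue (e.g. by reserving some worker capacity for it), but strict priority illustrates the core trade-off.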
Although the above solution reduces the latency to an extent, clients on slow networks wouldn’t be able to load high-resolution videos. This would compromise the experience for clients with limited bandwidth.
This is an acceptable tradeoff since we expect all the resolutions to be available within a couple of minutes, and the percentage of users with a poor internet connection is small (< 10%).
Can you improve and optimize the design further? Before proceeding to the next section, take a moment to identify ways to tackle the problem.
Approach 3 - Asynchronous processing with Segmented video upload
In the previous two approaches, video processing starts only once the file upload completes. The longer the upload takes, the higher the overall latency.
Instead of waiting for the full file to be uploaded, what if we upload segments of the file? Would that reduce the overall latency?
The answer is yes. The video file can be divided into a sequence of contiguous segments known as Groups of Pictures (GOPs). Clients can upload the segments in parallel.
Further, each GOP segment can be encoded and processed independently. So, they can be processed in parallel and eventually stitched into a complete video.
We have reduced the overall latency using two techniques:
Overlapping upload and processing through segmentation.
Parallel processing of individual segments.
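The two techniques can be sketched together: a toy “video” (a list of frames) is split into fixed-size segments, the segments are processed in parallel, and the results are stitched back in order. The frame and segment representations are illustrative stand-ins for real GOPs:

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SIZE = 2  # frames per segment; illustrative, not a real GOP length

def split_into_segments(frames, size):
    # Split the "video" (a list of frames) into fixed-size segments.
    return [frames[i:i + size] for i in range(0, len(frames), size)]

def process_segment(segment):
    # Stand-in for encoding one GOP independently of the others.
    return [f"enc({frame})" for frame in segment]

def stitch(segments):
    # Final step: concatenate processed segments back into one video.
    return [frame for seg in segments for frame in seg]

frames = [f"f{i}" for i in range(6)]
segments = split_into_segments(frames, SEGMENT_SIZE)

# Segments are processed in parallel; map() preserves their order,
# which is what makes the final stitch correct.
with ThreadPoolExecutor() as pool:
    processed = list(pool.map(process_segment, segments))

video = stitch(processed)
print(video)
```

Because each segment is independent, the slowest segment (rather than the whole file) bounds the processing latency.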
The below diagram shows the overall architecture which meets our functional and non-functional requirements.
Let’s now understand a few tradeoffs of the final solution.
Tradeoffs
Stitching step
The final stitching step that combines all the segments adds complexity to the pipeline. However, the benefits of parallel processing outweigh the downsides of the stitching step.
Small video files
For small video files (< 10 MB), segmented upload and parallel processing offer no additional advantage. Since they would be overkill, such files can be routed through a separate, simpler pipeline.
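The routing decision can be sketched as a simple threshold check, using the 10 MB cutoff from the text (the pipeline names are hypothetical):

```python
SMALL_FILE_BYTES = 10 * 1024 * 1024  # 10 MB threshold from the text

def choose_pipeline(size_bytes):
    # Small files skip segmentation entirely: at this size the
    # coordination overhead outweighs any parallelism benefit.
    return "simple" if size_bytes < SMALL_FILE_BYTES else "segmented"

print(choose_pipeline(3 * 1024 * 1024))    # small clip -> "simple"
print(choose_pipeline(200 * 1024 * 1024))  # large upload -> "segmented"
```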
Segment size
Compression algorithms exploit the temporal locality within large segments, leading to better compression. However, larger segments mean fewer of them, which reduces the degree of parallelism.
Hence, the segment size needs to be adjusted based on the expected video quality. Higher-quality videos (reels, stories) can use larger segments, while use cases like videos shared via messages can use smaller ones.
Client-side bandwidth constraints
In certain cases, clients may have poor connectivity. In such cases, part of the video processing can be done on the client side (provided the client devices support it). Client-side processing reduces the file size and helps with faster video uploads.
Conclusion
In this article, we learnt the system design and architecture of a video upload service. We started with a naive solution and iteratively improved it by identifying the bottlenecks.
Here’s a quick summary of how to build a scalable, reliable and robust video upload service:
Segmented upload - Break the video into segments. Upload and process the segments in parallel.
Parallelism - Use parallelism to execute the compute-heavy operations.
Critical vs Non-critical operations - Segregate critical and non-critical operations. Prioritize the critical operations while scaling the system.
Fault-tolerance - Ensure that the system robustly recovers from failures in individual steps through retries.
Before you leave, take this question as an exercise -
What would happen if the video upload service received a surge of uploads (10x traffic due to some event)? How would the system scale?
Leave your answers to the above question in the comments below.
Before you go:
❤️ the story and follow the newsletter for more such articles
Your support helps keep this newsletter free and fuels future content. Consider a small donation to show your appreciation here - Paypal Donate