Designing systems is tough. A system design interview is tougher. With little time to impress, it’s vital to stay focused and avoid getting sidetracked by unimportant details. That's why a structured framework is essential. It keeps you on track and ensures you emphasize the right aspects of problem-solving, communication, and technical skills.
Why does this matter?
It can determine whether you’re seen as a fit for junior or senior roles.
It also has a direct impact on your compensation.
This article will explore a proven framework. It will help you ask the right questions and focus on what matters in a system design interview. ⚡️
Before we dive deep, here's a quick system design cheat sheet 🧷
Step 1: Clarifying System Design Requirements
The first step in any system design interview is to grasp the problem. It's vital to ask good questions. They ensure a deep understanding of the system's purpose, users, and limits.
1.a Types of requirements in system design
A key part of this step is distinguishing between the two broad types of requirements.
Functional requirements
Core features: What is the primary function of the system?
Feature prioritization: Are there any features that are more critical or time-sensitive?
Users: Who are the main users (e.g., customers, internal teams like developers, DevOps, QA)?
User interaction: How will users interact with the system—via web, mobile, or API?
External integrations: Does the system need to connect with third-party services or APIs?
Business logic: Must the system follow any specific rules or constraints?
Notifications: Should there be real-time or delayed notifications (eg: emails, push notifications)?
Error handling: What should happen in the case of a system error or failure?
Non-functional requirements
High-Priority Considerations:
Scalability: How many users and requests should the system handle? Plan for future growth.
Availability: Must the system be highly available, or can it have some downtime?
Latency: What are the acceptable latency and throughput for efficient system functioning?
Low-Priority Considerations:
Consistency: Can the system tolerate eventual consistency? Or does it need strong consistency?
Security: How critical is securing the APIs and protecting data? Are there specific security requirements or compliance needs?
Accuracy: How important is absolute accuracy in the system? Can it tolerate minor discrepancies, or must it be 100% accurate?
Tip💡: For senior roles, emphasize non-functional requirements like scalability, availability, and performance. For junior roles, focus more on the core functional requirements of the system.
1.b Capacity Estimation
The second sub-step estimates the needed system capacity. It is based on insights from the previous step. This estimate is vital for sound design choices in the next stages of the interview. Capacity estimation helps guide choices around infrastructure, scaling strategies, and performance optimization.
We can break down capacity estimation into four key aspects:
Users
Estimate the number of monthly and daily active users.
Account for peak activity periods by estimating the number of concurrent users at peak.
Storage
Estimate volume through activity patterns (eg: 2 daily posts for a Twitter-like system).
Identify the type of data (eg: text, images, videos) that is being stored and the expected duration of storage.
Calculate the total storage capacity required based on the volume and type of data.
Requests
Estimate the system's Queries Per Second (QPS) based on the number of users and their activity.
Provide a peak QPS estimate to ensure the system can handle high traffic.
Divide QPS into read and write operations to inform design decisions, such as database architecture and caching.
Memory
Estimate the memory required to store frequently accessed data for rapid retrieval.
Figure 3 below shows a back-of-the-envelope calculation for a Twitter-like system based on the above aspects (These numbers are illustrative, not actual Twitter data).
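The four aspects above can also be worked through in a few lines of code. The sketch below is only illustrative: the function name, the 100 million daily active users, the 1 KB post size, the 5-year retention, the 2x peak factor, and the 20% hot-data fraction are all assumptions made for this example. Only the "2 daily posts" figure comes from the text above.

```python
# Back-of-the-envelope capacity estimate for a hypothetical Twitter-like system.
# All inputs are illustrative assumptions, not real traffic data.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate_capacity(
    daily_active_users: int,
    posts_per_user_per_day: int,
    reads_per_user_per_day: int,
    avg_post_size_bytes: int,
    retention_years: int,
    peak_factor: float,
    hot_data_fraction: float,
) -> dict:
    """Return rough QPS, storage, and cache-memory estimates."""
    # Requests: split traffic into writes (posts) and reads (views).
    writes_per_day = daily_active_users * posts_per_user_per_day
    reads_per_day = daily_active_users * reads_per_user_per_day
    avg_write_qps = writes_per_day / SECONDS_PER_DAY
    avg_read_qps = reads_per_day / SECONDS_PER_DAY

    # Storage: daily volume times the retention period.
    storage_per_day_bytes = writes_per_day * avg_post_size_bytes
    total_storage_bytes = storage_per_day_bytes * 365 * retention_years

    return {
        "avg_write_qps": avg_write_qps,
        "avg_read_qps": avg_read_qps,
        "peak_write_qps": avg_write_qps * peak_factor,
        "peak_read_qps": avg_read_qps * peak_factor,
        "storage_per_day_bytes": storage_per_day_bytes,
        "total_storage_bytes": total_storage_bytes,
        # Memory: keep a fraction of one day's data hot in the cache.
        "cache_memory_bytes": storage_per_day_bytes * hot_data_fraction,
    }


estimate = estimate_capacity(
    daily_active_users=100_000_000,
    posts_per_user_per_day=2,
    reads_per_user_per_day=100,
    avg_post_size_bytes=1_000,
    retention_years=5,
    peak_factor=2.0,
    hot_data_fraction=0.2,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

With these assumed inputs, the estimate comes to roughly 2,300 write QPS and 116,000 read QPS on average, about 200 GB of new data per day, about 365 TB over five years, and about 40 GB of cache memory. In an interview, round aggressively; the order of magnitude matters more than the exact figure.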
Step 2: Propose a high-level system design
With the requirements and capacity estimation phase complete, the next step is designing the system. This process has three key parts (Figure 4), each of which will be discussed in detail.
2.a High level design
Create a simple block diagram. It should show the key system components. It must outline the high-level flow of data and requests. This starts from the client, goes through the backend, and returns the response.
Figure 5 highlights the essential components that should be emphasized in any high-level design.
The diagram should include the following components:
Client Applications
Specify if the client is a mobile app, web app, or an internal dashboard for internal teams.
If internal teams are the users, the client may be a CLI tool within the virtual cloud network (VCN).
Content Delivery Networks (CDN)
Use case: If handling media storage or streaming, CDNs improve performance by caching media close to users.
CDNs can also enforce rate limits at the edge to protect backend services.
API Gateway
Entry point for all incoming requests into the VCN, handling routing to respective services.
Functions: Load balancing, rate limiting, and sometimes authentication (e.g., AWS API Gateway).
Application Servers
Deploy microservices for specific functionalities (e.g., user onboarding and authentication services).
Bonus: Briefly mention Monolith vs Microservices but focus on microservices, which is the modern approach.
Database
Specify database type based on the use case.
Examples: MySQL for relational data and MongoDB for semi-structured, document-oriented data.
Caching
Use Redis or Memcached to reduce database load by caching frequently accessed data.
Bonus: Mention caching strategies such as write-through or TTL for cache invalidation.
Message Queues
Required for asynchronous communication, like OTP or email delivery.
Bonus: Compare Kafka (distributed, scalable) vs RabbitMQ (traditional messaging) if needed.
External Services
Outline external dependencies like payment processors and alert services. Note any third-party integrations critical to the core functionality.
Bonus: Discuss methods of status updates like webhooks, status APIs, or event-driven updates.
Bonus Components (Optional): If time allows, and you understand them, add these optional components to your block diagram.
Ingress Gateway
Sits at the edge of the Kubernetes cluster, directing traffic from the API gateway to the microservices.
Egress Gateway
It handles outbound requests from microservices to external services. It sits at the edge of the Kubernetes cluster.
2.b API Design
Designing the API and choosing the right protocols are crucial. They define how system components communicate and how clients access the system's features.
Figure 6 shows the key aspects of API design.
Types of Communication Protocols:
HTTPS: Standard for secure communication in most APIs (eg: REST APIs for user authentication or data fetching).
WebSockets: Ideal for real-time data exchange where low latency is key (eg: live chat, stock market updates).
gRPC: Best for fast, efficient communication between microservices (eg: internal service-to-service calls).
Message Protocols: Use in asynchronous scenarios (eg: Kafka or RabbitMQ for event-driven architectures).
Types of APIs:
REST: It is often used for CRUD operations, like creating, reading, updating, and deleting user details.
GraphQL: Allows clients to request specific data fields (eg: flexibly fetching custom user profiles).
SOAP (XML-based): Suitable for legacy systems or applications needing a strict schema (eg: financial data exchange systems).
Follow Common Conventions:
Plural Naming: Use plurals for resource names (eg: /v1/users for multiple users).
Versioning: Use versioning for backward compatibility (eg: /v1/users, not /users/v1).
Pagination: Always paginate search or listing APIs to handle large datasets (eg: GET /v1/users?page=5&pageSize=20).
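As a rough illustration of the pagination convention, here is a minimal sketch of how a handler for a listing endpoint might slice a result set. The function name, the response field names, and the cap of 100 items per page are assumptions for this example. A real service would push the offset and limit into the database query rather than slicing an in-memory list.

```python
from typing import Any


def paginate(items: list[Any], page: int, page_size: int, max_page_size: int = 100) -> dict:
    """Return one page of items plus metadata the client needs to keep paging."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")

    # Cap the page size so a client cannot request the whole dataset at once.
    page_size = min(page_size, max_page_size)

    start = (page - 1) * page_size
    total = len(items)
    total_pages = (total + page_size - 1) // page_size  # ceiling division

    return {
        "data": items[start:start + page_size],
        "page": page,
        "pageSize": page_size,
        "total": total,
        "totalPages": total_pages,
    }


# Hypothetical data set of 100 user ids, paged as in ?page=5&pageSize=20.
users = list(range(1, 101))
response = paginate(users, page=5, page_size=20)
print(response["data"][0], response["data"][-1], response["totalPages"])
```

Offset-based paging like this is simple to explain. For very large or fast-changing datasets, cursor-based paging is a common alternative worth mentioning.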
Idempotency:
Ensure critical APIs, like payment processing, are idempotent. This prevents duplicate charges. For example, use an idempotency key like {idempotency_key: "abcd-fdfs"} with POST /v1/payment/initiate.
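To make the idempotency idea concrete, here is a minimal in-memory sketch. The PaymentService class, its method names, and the response fields are hypothetical. A production system would keep the key-to-response mapping in a durable store and make the check-and-record step atomic.

```python
import uuid


class PaymentService:
    """Toy payment service that charges at most once per idempotency key."""

    def __init__(self) -> None:
        self._responses = {}   # idempotency_key -> response already returned
        self.charges_made = 0  # counts real charges, for demonstration only

    def initiate_payment(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retried request carries the same key, so replay the stored response
        # instead of charging the user again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        self.charges_made += 1  # stand-in for the call to a payment processor
        response = {
            "payment_id": str(uuid.uuid4()),
            "amount_cents": amount_cents,
            "status": "initiated",
        }
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
first = service.initiate_payment("abcd-fdfs", 4_999)
retry = service.initiate_payment("abcd-fdfs", 4_999)  # e.g. client timed out and retried
print(first["payment_id"] == retry["payment_id"], service.charges_made)
```

The client generates the key once per logical payment and reuses it on every retry. That reuse is what lets the server tell a retry apart from a new payment.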
Data in API Calls:
Define request/response body structures for consistency and ease of use. For example, include user details in the POST /v1/users request body.
For use cases such as fraud detection, geolocation and IP address are often required. Pass this data via headers (eg: x-latitude for location and x-ip-address for the user's IP).
Mention response status codes clearly:
200 OK: Request was successful and data is returned.
201 CREATED: A new resource was successfully created.
403 FORBIDDEN: The request is valid, but the server is refusing to fulfill it.
302 FOUND: The requested resource has been temporarily moved to a different URL.
API Segregation:
Public APIs:
These are open to external clients, often with few restrictions. For example, anyone can create an account via the public POST /v1/signup.
Implement rate limiting to prevent abuse. Inform clients of limits via headers (eg: x-rate-limit: 1000 for a max of 1000 requests per user).
Private APIs:
Require authentication and are accessible only to authorized users. For example, to access sensitive account details, use GET /v1/account with an Authorization: Bearer <token> header.
Use JWT-based tokens for managing sessions and validating users securely.
Internal APIs:
Internal services or dashboards rely on these APIs for operations not exposed to external clients. For example, GET /v1/users retrieves user data for internal admin dashboards.
To ensure that they are not accessible from outside the network, block these at the CDN/API gateway level.
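The rate limiting mentioned for public APIs can be sketched with a fixed-window counter, one of several possible algorithms (token bucket and sliding window are common alternatives). The class name, the per-client keying, and the injectable clock are assumptions for this example. A real deployment would usually keep the counters in a shared store such as Redis so that all gateway instances see the same counts.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int, clock=time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._state = {}     # client_id -> (window_index, requests_seen_in_window)

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        seen_window, count = self._state.get(client_id, (window, 0))

        # A new window has started, so the counter resets.
        if seen_window != window:
            count = 0

        if count >= self.limit:
            self._state[client_id] = (window, count)
            return False  # the API would answer with 429 Too Many Requests

        self._state[client_id] = (window, count + 1)
        return True


# Hypothetical limit: 3 requests per minute per client.
limiter = FixedWindowRateLimiter(limit=3, window_seconds=60)
print([limiter.allow("client-a") for _ in range(4)])
```

Fixed windows are easy to reason about but allow bursts at window boundaries. That trade-off is a natural talking point if the interviewer asks about alternatives.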
2.c Database Design
This step (Figure 7) involves designing the DB schema and choosing a DB type. We must also address concerns about query performance through indexing, denormalization, partitioning, and caching.
Database schema
Table Structure - Map out the key tables and include important fields. Clearly define the data types for each attribute. Eg: user_id as INT, product_name as VARCHAR.
Primary & Foreign Keys - Specify primary and foreign keys to establish relationships between tables. Eg: product_id as the primary key in the Products table and a foreign key in the Prices table.
Indexing - Identify columns that should be indexed for faster query performance. Eg: index product_name and category_name in an e-commerce database to improve search speed.
Denormalization - Create denormalized tables to avoid complex joins in read-heavy queries. Eg: In an e-commerce app, a denormalized table combining product details, reviews, and prices reduces the need for multiple table joins.
Partitioning - Partition large tables by date or range. Eg: Partition orders by month in an online store to improve query performance for recent transactions.
Caching Frequently Accessed Data - Implement caching for commonly queried data like product prices or user profiles to reduce load on the database and improve performance.
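The table structure, keys, and indexes described above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the Products and Prices tables follow the examples in the text, but the extra columns (category_name, price_id, amount_cents) and the index names are made up for illustration. SQLite is used only because it needs no setup.

```python
import sqlite3


def build_schema() -> sqlite3.Connection:
    """Create a small e-commerce schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE products (
            product_id    INTEGER PRIMARY KEY,
            product_name  VARCHAR(255) NOT NULL,
            category_name VARCHAR(100) NOT NULL
        );

        CREATE TABLE prices (
            price_id     INTEGER PRIMARY KEY,
            product_id   INTEGER NOT NULL REFERENCES products(product_id),
            amount_cents INTEGER NOT NULL
        );

        -- Indexes on the columns used for search.
        CREATE INDEX idx_products_product_name  ON products(product_name);
        CREATE INDEX idx_products_category_name ON products(category_name);
        """
    )
    return conn


conn = build_schema()
conn.execute(
    "INSERT INTO products (product_id, product_name, category_name) "
    "VALUES (1, 'Laptop', 'Electronics')"
)
conn.execute("INSERT INTO prices (product_id, amount_cents) VALUES (1, 99900)")

# Join the two tables through the primary key / foreign key relationship.
row = conn.execute(
    "SELECT p.product_name, pr.amount_cents "
    "FROM products p JOIN prices pr ON pr.product_id = p.product_id"
).fetchone()
print(row)
```

Inserting a price for a product_id that does not exist in products raises sqlite3.IntegrityError. That is the foreign key doing its job.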
Types of Databases
Relational (SQL) - Store data in structured tables with rows and columns; useful when the data has well-defined relationships. Eg: MySQL, Postgres, CockroachDB (distributed SQL)
Non-Relational (NoSQL) - Flexible schema design; can store various data types. Types of NoSQL DBs:
Key-value - Simple database where data is stored as key-value pairs. Eg: Redis → used for caching and real-time analytics.
Column store - Organize data by columns or column families rather than by rows; columnar layouts speed up analytical, read-heavy queries. Eg: Cassandra → A wide-column store well-suited for building metadata stores in distributed systems; ClickHouse → A columnar store optimized for high-performance analytics and data warehousing, ideal for processing large volumes of data quickly.
Graph - Store data as nodes and edges, ideal for complex relationships. Eg: Neo4j → Used for social networks and recommendation engines.
Document - Store data in JSON-like documents; flexible schema. Eg: MongoDB → Popular for applications requiring dynamic data structures.
Before wrapping up this section, here are some useful tips to follow during the high-level design step:
Tips💡:
Avoid including irrelevant components. For example, in a payment system, there's no need to include authentication in the high-level design.
At this stage, avoid going into details about individual components. Allocate time to dive deeper later on in the discussion.
Note down key points to discuss in the next section:
API scaling
DB scaling
Concurrency
Failure scenarios
Consistency
Step 3: Address Key Issues in the design
This is the final step, which involves identifying and addressing the key challenges the system is likely to face (Figure 8). This phase needs a structured approach to problem-solving. Collaborate with the interviewer to find which areas to explore in depth.
Key points to keep in mind:
Non-Functional Requirements: Emphasize scalability, performance, reliability, and other non-functional factors based on prior discussions.
Collaborate on Focus Areas: Work with the interviewer to pick specific sections or problems to explore in-depth.
Follow a 4-step problem-solving framework to apply to any problem in this section (Figure 9):
Articulate the problem: Clearly define the challenge being addressed.
Generate multiple solutions: Present at least two potential solutions.
Discuss trade-offs: Compare the pros and cons of each solution. Consider cost, complexity, and scalability.
Pick and deep dive: Choose the most appropriate solution and analyze it in detail with the interviewer.
Potential problem areas and solution approaches:
Scalability:
Application Scaling:
Discuss whether to scale horizontally (adding more servers) or vertically (upgrading CPU and memory of existing servers).
Example: In a high-traffic e-commerce platform, horizontal scaling might be preferred to handle spikes during peak sales.
Database Scaling:
Add indexes to critical columns to improve read performance.
Partition large tables into smaller ones to reduce query time.
Normalize tables to reduce data redundancy, or use denormalization to optimize read-heavy operations.
Caching:
Identify high QPS APIs where caching can reduce load (eg: frequently accessed product pages in an e-commerce system).
Discuss cache storage options like Redis or Memcached, outlining the pros and cons of each.
Present different caching strategies (eg: write-through, write-back) and select the most appropriate one for the system’s needs.
Define cache invalidation strategies, either manual (via code) or automatic with TTL (Time To Live).
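The TTL-based and manual invalidation options above can be sketched with a small in-memory cache. Note that the read path shown here is cache-aside (check the cache, fall back to the database, then populate the cache), which is a different strategy from the write-through and write-back ones named above. The TTLCache class, the get_product helper, and the load_from_db callback are hypothetical names for this example. In practice the cache would be Redis or Memcached rather than a Python dict.

```python
import time

_MISSING = object()  # sentinel so that a cached value of None is still a hit


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return _MISSING
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # automatic invalidation via TTL
            return _MISSING
        return value

    def set(self, key, value) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key) -> None:
        self._entries.pop(key, None)  # manual invalidation, e.g. after an update


def get_product(product_id, cache: TTLCache, load_from_db):
    """Cache-aside read: serve from the cache, fall back to the database on a miss."""
    value = cache.get(product_id)
    if value is _MISSING:
        value = load_from_db(product_id)
        cache.set(product_id, value)
    return value


cache = TTLCache(ttl_seconds=60)
product = get_product(42, cache, lambda pid: {"id": pid, "price_cents": 1999})
print(product)
```

A short TTL limits how stale a cached price can get, at the cost of more database reads. Manual invalidation on write keeps data fresher but adds code paths that must not be forgotten.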
Message Queues:
Replace synchronous communication with message queues for tasks that can be processed asynchronously.
Discuss strategies to ensure exactly-once delivery, which is critical for systems like payments (read "Avoid double payments" to learn the key strategies).
Example: In a payment processing system, idempotency keys ensure that users are charged only once, even if the request is retried.
Implement retry mechanisms for handling failed deliveries at both producer and consumer sides, ensuring reliability.
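A retry mechanism like the one described above might look like the following sketch. It retries a failing operation with exponentially growing delays and random jitter. The jitter, the function name, and the default limits are assumptions added for this example. The sleep and rng parameters exist only so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Delay doubles on every attempt (0.1s, 0.2s, 0.4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay * rng())


# Hypothetical flaky delivery that fails twice before succeeding.
attempts = {"count": 0}


def deliver_message():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("broker unavailable")
    return "delivered"


print(retry_with_backoff(deliver_message, base_delay=0.01))
```

Retries only make sense when the operation is safe to repeat. That is why they pair naturally with the idempotency keys discussed earlier.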
Reliability:
Single Points of Failure (SPOFs): Identify and eliminate SPOFs by introducing redundancy (eg: multiple load balancers, database replicas).
Failure Handling:
Implement exponential back-off strategies for retries, preventing the system from being overwhelmed.
Log persistent failures in a DB and retry later using cron jobs or background workers until successful completion.
Apply circuit breaker patterns to prevent cascading failures, protecting the system from becoming overloaded.
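The circuit breaker pattern from the list above can be sketched in a few dozen lines. This is a simplified illustration, not a production implementation: the class and exception names, the failure threshold of 3, and the 30-second reset timeout are assumptions, and there is no locking for concurrent callers.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock      # injectable so timeouts can be tested without sleeping
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the struggling dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, so let one trial call through.

        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise

        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
print(breaker.call(lambda: "healthy response"))
```

Failing fast while the circuit is open gives the downstream service room to recover, which is how the pattern limits cascading failures.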
Data Replication & Disaster Recovery (DR):
Discuss strategies to handle large-scale failures, such as a cloud service outage (eg: AWS going down).
Explore disaster recovery strategies, ensuring high availability through data replication across regions or cloud providers.
Monitoring and Alerting:
Implement logging and metrics collection to monitor system health and performance.
Use the ELK stack (Elasticsearch, Logstash, Kibana) for log aggregation and analysis.
Discuss how to use OpenTelemetry for collecting traces, offering visibility into application performance and failure points.
Define alerting mechanisms to notify teams of critical issues in real-time.
This 3-step template (with many sub-steps😄) should help. It will let you tackle any system design interview question without fear!
The goal is not to find a perfect solution. It's to show you can analyze complex problems, make design choices, and communicate your reasoning. 😄
Thanks for reading 🙏🏻!
The details in this article are based on 6+ yrs of practical experience that has come from designing real systems and seeing problems first hand. If you liked the deep dive, do give a ❤️ and subscribe to encourage me to publish more such deep dives.