System Design
105 articles tagged with System Design.

How Slack Built Secure Enterprise Search?
Slack enables secure enterprise search using real-time fetch, RAG, ACLs, and OAuth, storing no data and staying permission-aware and private across tools.

How Airbnb Migrated a Petabyte Without Users Noticing
Airbnb rebuilt Mussel into a cloud-native KV store and migrated 1PB+ of data using Apache Kafka with zero downtime.

API Gateway vs Load Balancer
Discover the differences between API gateway vs load balancer and find out which is best for your system's performance and security needs.

How Slack cut their E2E Build Time by 80%?
Slack cut E2E time 80% by skipping redundant frontend builds and reusing cached assets, saving compute, storage, and hours.

What are Immutable Data Structures?
Explore immutable data structures and why they matter in modern coding. Discover how they enhance reliability, simplify concurrency, and prevent bugs.

How Shopify Made Commerce Data Queryable Without SQL
ShopifyQL Notebooks lets merchants explore business data without SQL, using commerce-focused models built for clarity, speed, and action.

What is Backpressure?
Learn what backpressure is in distributed systems, why it’s vital for stability, and key strategies to prevent overloads in large-scale systems.

How Slack Automatically Stops Suspicious Activity in Real Time
Slack’s AER detects suspicious activity and automatically terminates user sessions, shrinking response time from hours to minutes.

How Shopify Built Super-Fast Search at C++ Speed
Shopify built RankFlow to run ML-powered search at C++ speed, letting data scientists iterate fast without sacrificing latency or scale.

Why Spotify’s Shuffle Never Felt Random (and What They Did About It)
Spotify kept Shuffle random but made it feel fair by choosing the least repetitive random order, so songs feel fresher without breaking true randomness.

Replication vs Redundancy. What's the Difference?
Learn the key differences between replication and redundancy to optimize your data protection strategies. Discover which method suits your needs best.

How Dropbox Dash Uses a Feature Store for Real-Time AI
Dropbox Dash uses a hybrid feature store to deliver fast, fresh signals at scale, keeping AI search accurate, low-latency, and reliable.

What Is MLOps?
Learn what MLOps is: bridging the gap between DevOps and machine learning. Explore its lifecycle, tools, and best practices for scaling AI effectively.

How Spotify Scaled Content Annotations to Millions (Without Losing Quality)
Spotify built a scalable annotation platform by combining human experts, smart tools, and strong infrastructure to power high-quality ML training data

How Dropbox Dash Uses Context Engineering to Build Smarter AI
Dropbox Dash evolved into agentic AI by engineering context: fewer tools, relevant data, and specialized agents, making AI faster and smarter at work.

What is Publish-Subscribe Pattern?
What is publish-subscribe pattern? Learn how pub/sub decouples components, with real-world examples and benefits for scalable systems.

How LinkedIn Uses Machine Learning to Moderate Content at Scale
LinkedIn uses ML to prioritize content more intelligently, not to replace humans but to help reviewers act faster, scale better, and keep the platform safe without losing judgment or nuance.

What is Configuration Drift?
What is Configuration Drift? Learn causes, risks, and best practices to detect, prevent, and fix drift with IaC and GitOps.

How Instagram Improved HDR Video on iOS With Dolby Vision
Dolby Vision initially hurt Reels due to load delays from metadata. Compression fixed it, boosting watch time and enabling rollout on Instagram iOS.

What is Lazy Loading vs Eager Loading?
Lazy loading vs eager loading explained with practical examples. Learn when to apply each approach for performance and resource efficiency.

How LinkedIn Rebuilt its Profile Highlights System
LinkedIn rebuilt Profile Highlights into a plug-in platform, enabling faster experiments, independent teams, better performance, and ~50% lower costs.

How LinkedIn Reduced Latency and Cost by Merging Two Critical Systems
LinkedIn merged identity midtier and data services, cutting network hops to reduce latency, memory use, and cost while keeping APIs unchanged.

How Lyft Built In-App Messaging Without Annoying Riders
Lyft built in-app messaging by starting with simple banners and scaling into a smart, context-aware system that delivers timely messages without annoying riders.

How Airbnb builds Products 10x Faster Using GraphQL and Apollo
Airbnb ships faster by using GraphQL and Apollo to power backend-driven UI, automatic types, and tooling that lets engineers focus on building features

How Zomato Improved their Android App Startup Time by Over 20% Using Baseline Profiles
Zomato cut Android app startup time by 20% using Baseline Profiles, pre-optimizing key code paths for faster launches and a smoother, consistent user experience.

What Happens During a Database Migration?
Discover what happens during a database migration. This practical guide covers planning, execution, validation, and strategies for a smooth transition.

How Airbnb Measures the Lifetime Value of a Listing
Airbnb’s LTV framework shows which listings drive value, supports hosts, and adapts to market changes for smarter, data-driven decisions.

What Is CQRS?
What is CQRS? This guide explains the CQRS pattern with simple analogies and practical examples to help you build scalable and high-performance applications.

How Lyft Rebuilt its Iconic Dashboard Emblem and its entire IoT Platform along with it?
Lyft’s Glow is more than an emblem: it’s a unified IoT platform with secure provisioning, real-time control, device shadowing, and safe OTA updates.

How LinkedIn Made the “My Network” Tab Faster, Smoother, and More Flexible
LinkedIn sped up My Network by unifying APIs, adding pagination, and using a backend-driven render model, cutting latency and improving the overall UX.

What Is the N+1 Query Problem?
What is the N+1 query problem? Learn how it slows apps, why it happens, and practical fixes with code examples to speed up performance.

How Swiggy Cut QA Regression Time by 66% Using Automated Event Testing
Swiggy built ARD Automator to automate mobile event verification using contracts and validators, cutting QA time by 66% and boosting accuracy.

How Razorpay Uses Terraform to Simplify and Scale Infrastructure Management
Razorpay leverages Terraform + Atlantis to automate, secure, and scale infrastructure with GitOps workflows and modular IaC practices.

What is Token Bucket Algorithm?
Discover how the token bucket algorithm controls request rates, refilling tokens at a fixed rate while allowing short bursts up to the bucket's capacity.

How LinkedIn Cut Build Times from 30 Minutes to 10 Seconds
LinkedIn’s RDev lets engineers code in the cloud with pre-built containers, cutting setup from 30 mins to 10 secs while keeping CI consistent.

How Swiggy Scaled and Maintained Postgres
Swiggy scaled Postgres by cleaning unused indexes, controlling auto-vacuum, and using pg_repack for online maintenance and better performance.

How Razorpay prepared for Chrome’s Third-Party Cookie Deprecation
Razorpay uses partitioned cookies (CHIPS) to tackle Chrome’s 3P cookie phaseout, cutting drop-offs while ensuring a smooth, reliable checkout.

How LinkedIn Built a Faster, Safer, and Smarter HDFS Ecosystem
LinkedIn scaled HDFS with HA, Observer nodes, encryption & Wormhole, boosting speed, reliability & secure data access for massive growth.

How Salesforce Reinvented Task Execution for the Cloud Era
Salesforce built a cloud-native task execution system in Hyperforce, replacing SSH with secure, scalable, multi-cloud automation using recipes & workers.

How Zomato Handles 100 Million Daily Search Queries
Zomato fixed search scale issues by moving from Field Cache to DocValues and using nested docs, cutting costs and OOM errors while boosting speed.

How Salesforce migrated 200,000 Machines from CentOS 7 to RHEL 9
Using automation for zero downtime, stronger security & faster parallel upgrades, Salesforce successfully migrated 200,000 machines from CentOS 7 to RHEL 9.

Edge Computing vs Fog Computing: Making the Right Choice
When comparing edge computing vs. fog computing, the main difference comes down to a simple question: where does the data processing happen?

How Swiggy Improved Video Performance with Smart Caching
Swiggy boosted video cache hits & cut costs by clustering widths with K-means, reducing redundant processing while keeping playback seamless.

Circuit Breaker vs Retry in Microservices
When building resilient systems, the debate of circuit breaker vs retry is about choosing the right tool for the right kind of failure. A Retry pattern is...

Sharding vs Partitioning: What's the Difference?
Partitioning splits data within one database for faster retrieval, while sharding spreads data across multiple databases to handle scale and traffic.

Latency vs Throughput: A Guide for System Performance
When you hear engineers talk about latency vs throughput, they are discussing two sides of the same coin: speed versus capacity.

Write-Through, Write-Back & Write-Around in Cache: A Practical Guide
Your app writes data every second but how it writes can change everything. Write-Through, Write-Back & Write-Around hide big trade-offs.

How Hyperforce Edge Networking Scaled to 20 Million Domains With Less Than 30GB of RAM
Scaling from 3M to 20M+ domains, Salesforce Hyperforce Edge cut memory to under 30GB with a new storage design, boosting speed, reliability & security.

What Is an Application Server? Role & Importance
Ever wondered what happens behind the curtain when you log into an app, book a flight, or add something to your online shopping cart? That seamless, interactive experience is powered by an unseen engine...

How Razorpay Capital Detects Duplicate or Fraudulent Merchants
Razorpay scaled payments to billions of transactions by re-engineering its core systems, ensuring speed, security & reliability at scale.

Performance and Scalability in Web Applications
Ever wondered why some apps stay smooth at 100 users but crash at 10k? That is where performance meets scalability.

Data Management in Applications
Whether you’re building a simple note-taking app, a social media platform, or a large-scale e-commerce system, your application’s success depends on how well...

Authentication & Access Control
You sign in to your bank account and can only view your balance. The bank manager logs in and can approve loans. Same system, different powers. But how does the app decide?

System Design Tutorial
When applications grow beyond a handful of users, writing code alone isn’t enough. To scale, stay reliable, and support complex features, software needs strong...

How Salesforce Migrated 760+ Kafka Nodes Handling 1M Messages per Second with Zero Downtime
Salesforce upgraded 760+ Kafka nodes handling 1M+ msg/sec with zero downtime, scaling Marketing Cloud seamlessly for the future.

Vertical vs Horizontal Scaling
Is it better to make one server stronger or add more servers?

How X (Formerly Twitter) Handles Millions of Tweets Every Second
X scaled from Ruby to Java, microservices, real-time data, and AI to handle millions of tweets, searches, and users with speed and reliability.

How Spotify Powers Music Streaming for Millions
Spotify uses Kafka, microservices, and ML to deliver real-time, personalized music to millions, powered by a fast, scalable cloud backend.

How Meta Powers its Cloud Gaming Infrastructure at Scale
Meta streams games from cloud GPUs to your device with ultra-low latency, using real-time encoding, smart networking, and fast decoding.

How Amazon Key Unlocks 100 Million Doors a Year
Amazon Key lets drivers unlock gates for faster deliveries. From serverless to microservices, it now powers 100M+ secure unlocks yearly.

EP 88: How Pinterest Evolved its Architecture to Serve 500 Million Users
Pinterest began as a simple side project and scaled by simplifying tech, embracing microservices, and building strong pipelines and monitoring.

EP 87: How Uber Handles 40 Million+ Reads Per Second Using an Integrated Cache
Uber serves 40M+ reads/sec by pairing Docstore with a smart Redis cache, using CDC for near-instant updates and clever sharding for scale.

EP 86: How Facebook Scales Live Streaming for Millions of Viewers at Once?
Facebook scaled Live streaming for millions by building robust ingestion, delivery, and ISP optimizations, powering events like the UEFA Final.

How Uber Eats Scaled Search to Handle Billions of Daily Queries
Uber Eats scaled search by revamping indexing, geo-sharding & ranking, supporting billions of queries daily without compromising latency.

EP 84: How Pinterest Built Text-to-SQL to make Data analysis easier
Pinterest built a Text-to-SQL tool using LLMs and RAG to help analysts convert questions into SQL and find the right data faster and easier.

EP 83: How Pinterest Rebuilt its $3B+ Ads System without any Downtime
Pinterest rebuilt its $3B+ ad system with a graph-based design for better scale, safety & dev speed, launched with zero downtime and big cost wins.

EP 82: How Pinterest uses LLMs to make your Search Results more Relevant?
Pinterest's AI teacher-student system improved search by 19.7%, understanding user intent beyond keywords for better relevance globally

EP 81: How Pinterest Built “Holiday Finds” to make Gift Shopping easier?
Pinterest Holiday Finds uses smart recommendations, auto wishlists and a fresh UI to make holiday gifting easy!

EP 80: How Pinterest improved ABR Video Performance?
Pinterest sped up video playback by embedding manifests in API responses and using Memcache to reduce startup latency.

EP 79: How Grab enabled near Real-Time analytics on their Data Lake
Grab used Apache Hudi with Flink and Spark to enable near real-time analytics, ensuring fast ingestion and low-latency queries on their data lake.

How Discord’s "Go Live" streaming works
Discord’s “Go Live” streams in real time by capturing, encoding, transmitting, and decoding, adapting quality to your network and device.

EP 77: How GitHub made Push Processing faster and more Reliable
GitHub sped up and stabilized push processing by splitting one big job into parallel Kafka-triggered tasks with better retries and monitoring.

EP 76: How Mixpanel Fixed Their Load Balancing Problem using Power of 2 Choices
Mixpanel fixed Compacter’s load imbalance using Power-of-2-Choices, boosting efficiency and cutting costs by 70% with minimal changes!

EP 75: How Netflix built a Distributed Counter for Billions of User Interactions
Netflix uses a smart Distributed Counter system to track billions of user actions daily with speed, accuracy, and massive scale.

How Stripe Scales its APIs using Rate Limiters
Stripe uses token buckets, concurrency limits & load shedders to scale APIs, prevent abuse & keep critical traffic flowing reliably.

EP 55: How did Magic Pocket help Dropbox save millions?
Dropbox scaled its storage with its custom-built system, Magic Pocket, and high-density SMR drives, increasing its gross revenue by 75%.

EP 54: How Dropbox scaled its storage infrastructure?
Dropbox scaled its storage infrastructure with a custom-built system called Magic Pocket, utilizing high-density SMR drives and advanced data replication for durability and scalability.

EP 53: How TikTok Optimizes Video Streaming
TikTok boosts streaming by preloading videos, optimizing buffers, and reusing media players, with on-device upscaling and task distribution for smooth playback on all networks.

EP 52: How GitHub manages continuous integration and deployment
GitHub manages CI/CD by automating testing, building, and deploying code changes, allowing developers to release updates faster and with confidence.

EP 51: How Instagram handled user growth and scale?
Instagram achieved rapid user growth by maintaining a simple and efficient tech stack built on AWS, Django, and Postgres, and by managing traffic with load balancing, caching, and data sharding to handle the increasing demand.

EP 50: How Google search works?
Google Search works by using crawlers to scan and index web pages, then processes your queries to rank and display relevant results in seconds.

EP 49: How Stripe Handles Global Payments Technology
Stripe utilizes a tech stack of Ruby and JavaScript to enable secure, compliant global payments and currency conversion.

EP 48: How Tinder Streams to 75 Million Users with HTTP Live Streaming
Tinder used HTTP Live Streaming (HLS) & AWS CloudFront to deliver Swipe Night videos efficiently, ensuring seamless, adaptive playback.

How Netflix Secures Content Delivery using Open Connect CDN?
Netflix secures content delivery through its proprietary Open Connect CDN, which caches content on local servers, ensuring low-latency streaming and minimizing network congestion.

EP 46: How Uber Manages Real-Time Analytics with Apache Flink
Uber Eats uses real-time data processing with Apache Kafka, Flink, and Pinot to manage order updates, optimize delivery logistics, and provide quick analytics for efficient and accurate food delivery.

EP 45: How Slack Maintains Reliability and Uptime
Slack maintains reliability and uptime through automated incident detection, real-time collaboration, proactive monitoring, and a resilient microservices architecture.

How Zoom Ensures Low Latency Video Calls
Zoom ensures low latency by using distributed data centers, optimized video encoding, and adaptive bitrate streaming to maintain real-time communication quality.

EP 43: How Amazon Personalizes Product Recommendations
Amazon personalizes product recommendations using machine learning, collaborative filtering, and user interaction data to tailor suggestions based on individual preferences.

EP 42: How Pinterest Scales Their Image Search with Elasticsearch
Pinterest scales its image search by using Elasticsearch for fast indexing, real-time search, and advanced machine learning features.

EP 41: How Facebook Handles Billions of Messages Daily
Facebook manages billions of daily messages using scalable servers, distributed systems and advanced algorithms for efficient processing and real-time delivery.

EP 39: How Twitter Manages High Availability with Kubernetes
Twitter achieves high availability with Kubernetes through multi-node deployments, load balancing, and data center redundancy.

EP 38: How Spotify Optimized Their Recommendation System
Spotify optimized recommendations by combining collaborative filtering, content-based filtering, and audio analysis to deliver highly personalized music recommendations.

EP 37: What is OAuth?
OAuth is an open standard protocol that allows users to grant apps access to their data without sharing their passwords.

What is PostgreSQL?
PostgreSQL is a robust, open-source object-relational database system known for advanced features, scalability, and support for complex queries.

EP 29: What is Cassandra?
Cassandra is a scalable, distributed NoSQL database for handling large data with high availability.

EP 28: What is Kafka?
Kafka is a distributed streaming platform for real-time data with low latency and high throughput.

EP 27: What is Kubernetes?
Kubernetes orchestrates and automates the deployment, scaling, and management of containerized applications.

EP 26: What is Docker?
Docker simplifies application deployment by packaging software into standardized containers.

EP 25: What is DMARC Record? Why is it used?
DMARC prevents email spoofing and phishing by authenticating email senders.

EP 22: What is SPF Record? Why is it used?
An SPF record controls which servers can send emails for a domain, preventing email fraud.

What is Micro Frontend Architecture?
Micro frontends extend the concepts of microservices to the frontend world.

What is CI/CD and why is it even needed?
It's the automation that makes developers' lives simpler and more efficient!

What is DNS, and how does it work?
Why is DNS so important that Facebook, Instagram, and WhatsApp had an outage because of it?

What are microservices?
Netflix uses microservices but Google doesn't. But what exactly are they?

What is an ORM?
Ever thought of skipping database languages? ORM is for you!