2.2 Indexing for Speed
🪄 Step 1: Intuition & Real-World Analogy
🎯 Core Idea
Indexes make data retrieval lightning-fast — but they come with a cost. They act like a shortcut for reading, but every shortcut has a maintenance fee.
When you run:
CREATE INDEX idx_customer_id ON orders(customer_id);you’re telling the database:
“Hey, please remember where each customer’s rows are stored — so you can find them instantly later.”
But the trade-off is that every INSERT, UPDATE, or DELETE now has extra work to keep that “shortcut” accurate.
🌍 Real-Life Analogy
Think of an index like a library catalog.
Without a catalog (no index): you walk through every shelf until you find the book. → Sequential scan.
With a catalog (index): you look up the author’s name and go directly to the shelf. → Index scan.
But every time a new book arrives (new record inserted), the librarian must update the catalog — which takes time.
So:
“Indexes make readers happy but librarians cry.” 😅
💬 Conversational Style
You might think, “If indexes make things faster, why not index every column?” Because — every index is like a performance tax on writes. Databases love balanced trade-offs, not blind optimization.
🌱 Step 2: The Core Concept Explained
What It Really Does
How It Works Internally
When you query:
SELECT * FROM orders WHERE customer_id = 101;the DBMS can:
- Scan the entire table (Sequential Scan) — O(N)
- Use the index to jump directly to matching rows — O(log N)
That’s like skipping to the right page instead of reading every line.
Internally, a B-Tree index stores (key → pointer to row) pairs, ordered for binary search.
Common Use Cases
- Columns frequently used in WHERE, JOIN, or ORDER BY clauses.
- Foreign keys (like
customer_idinorders). - Columns with high cardinality (many unique values).
💻 Step 3: Code Demonstration & Explanation
🧩 Example 1: Without Index
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;Output:
Seq Scan on orders (cost=0.00..350.00 rows=100 width=50)
Filter: (customer_id = 42)Interpretation:
The database reads every row, checking if customer_id = 42.
If the table has 10 million orders — it scans all 10 million. 🐢
🧩 Example 2: With Index
CREATE INDEX idx_customer_id ON orders(customer_id);
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;Output:
Index Scan using idx_customer_id on orders (cost=0.43..8.52 rows=100 width=50)
Index Cond: (customer_id = 42)Interpretation: The DB now jumps directly to the rows for customer 42. → No full table scan. Result: Query that took seconds may now finish in milliseconds.
🧩 Example 3: Write Trade-Offs
INSERT INTO orders VALUES (10001, 42, 499.99, '2025-10-24');Now the DB must:
- Insert into the table.
- Update every relevant index (like
idx_customer_id).
So bulk inserts or frequent updates slow down slightly.
🧩 Example 4: Low-Cardinality Trap
CREATE INDEX idx_status ON orders(status);If status only has 3 values: 'Pending', 'Shipped', 'Cancelled'
→ The index doesn’t help — the DB might still scan most rows.
That’s like having a catalog sorted by “color” when 90% of books are blue. 😅
Low selectivity = poor index performance.
⚙️ Step 4: Performance Insights & Optimization
Execution Choice: The optimizer decides between a Seq Scan and Index Scan based on:
- Table size
- Index selectivity (uniqueness of values)
- Cost of random I/O vs sequential read
Rule of Thumb:
- Indexes shine when queries filter down to <10% of rows.
- For large scans, sequential reads are faster (fewer disk seeks).
Composite Indexes:
CREATE INDEX idx_customer_date ON orders(customer_id, order_date);Great for queries like:
WHERE customer_id = 42 AND order_date > '2025-01-01'But order matters: The leftmost column (
customer_id) must appear in the condition.Maintenance Tip: Periodically run:
REINDEX TABLE orders;to rebuild fragmented indexes and keep lookups fast.
🎯 Step 5: Interview Perspective & Reasoning
Common Interview Questions
- “When does an index improve performance?”
- “Why might an index be ignored by the query planner?”
- “What’s the downside of having too many indexes?”
- “Would an index help on a boolean column?”
- “Explain the difference between clustered and non-clustered indexes.”
Reasoning Expectation
Interviewers test depth of understanding, not memorization:
- Can you explain the trade-off between read and write performance?
- Do you know how selectivity affects index usefulness?
- Can you justify why an index isn’t being used in a given plan?
Bonus Challenge
Suppose your query filters on
gender = 'M'andregion = 'North'. Which index is better: 1️⃣(gender, region)or 2️⃣(region, gender)?Discuss why — in terms of cardinality and leftmost prefix rules.
🧩 Step 6: Common Mistakes & Debugging Insights
🚨 Common Mistakes (Click to Expand)
Over-indexing: Creating indexes on every column → slower inserts & wasted space.
Wrong column order in composite indexes: The optimizer can’t use
(A, B)index if you only query byB.Function-wrapped columns:
WHERE LOWER(name) = 'john'won’t use an index on
name. Use a functional index instead:CREATE INDEX idx_lower_name ON customers(LOWER(name));Ignoring low-cardinality columns: Index on
is_active(Y/N) rarely helps — DB will still scan most rows.Forgotten maintenance: After major data changes, run:
ANALYZE;to update table stats; otherwise, planner may choose bad paths.
🧠 Step 7: Key Takeaways
🧠 Main Idea: Indexes accelerate lookups by maintaining sorted key references — but they slow down data modifications.
💻 How It Works: Uses data structures like B-Trees or Hash maps to find rows in logarithmic time instead of scanning everything.
🚀 Why It Matters: Smart indexing is one of the most powerful performance tuning tools in SQL — mastering it separates data users from database engineers.