2.2 Indexing for Speed

5 min read 939 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea

Indexes make data retrieval lightning-fast — but they come with a cost. They act like a shortcut for reading, but every shortcut has a maintenance fee.

When you run:

CREATE INDEX idx_customer_id ON orders(customer_id);

you’re telling the database:

“Hey, please remember where each customer’s rows are stored — so you can find them instantly later.”

But the trade-off is that every INSERT, UPDATE, or DELETE now has extra work to keep that “shortcut” accurate.


🌍 Real-Life Analogy

Think of an index like a library catalog.

  • Without a catalog (no index): you walk through every shelf until you find the book. → Sequential scan.

  • With a catalog (index): you look up the author’s name and go directly to the shelf. → Index scan.

But every time a new book arrives (new record inserted), the librarian must update the catalog — which takes time.

So:

“Indexes make readers happy but librarians cry.” 😅


💬 Conversational Style

You might think, “If indexes make things faster, why not index every column?” Because — every index is like a performance tax on writes. Databases love balanced trade-offs, not blind optimization.


🌱 Step 2: The Core Concept Explained

What It Really Does
An index is a data structure (usually a B-Tree or Hash) that allows the database to quickly find rows matching a condition — without scanning the entire table.
How It Works Internally

When you query:

SELECT * FROM orders WHERE customer_id = 101;

the DBMS can:

  1. Scan the entire table (Sequential Scan) — O(N)
  2. Use the index to jump directly to matching rows — O(log N)

That’s like skipping to the right page instead of reading every line. Internally, a B-Tree index stores (key → pointer to row) pairs, ordered for binary search.

Common Use Cases
  • Columns frequently used in WHERE, JOIN, or ORDER BY clauses.
  • Foreign keys (like customer_id in orders).
  • Columns with high cardinality (many unique values).

💻 Step 3: Code Demonstration & Explanation

🧩 Example 1: Without Index

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

Output:

Seq Scan on orders  (cost=0.00..350.00 rows=100 width=50)
  Filter: (customer_id = 42)

Interpretation: The database reads every row, checking if customer_id = 42. If the table has 10 million orders — it scans all 10 million. 🐢


🧩 Example 2: With Index

CREATE INDEX idx_customer_id ON orders(customer_id);

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

Output:

Index Scan using idx_customer_id on orders  (cost=0.43..8.52 rows=100 width=50)
  Index Cond: (customer_id = 42)

Interpretation: The DB now jumps directly to the rows for customer 42. → No full table scan. Result: Query that took seconds may now finish in milliseconds.


🧩 Example 3: Write Trade-Offs

INSERT INTO orders VALUES (10001, 42, 499.99, '2025-10-24');

Now the DB must:

  • Insert into the table.
  • Update every relevant index (like idx_customer_id).

So bulk inserts or frequent updates slow down slightly.


🧩 Example 4: Low-Cardinality Trap

CREATE INDEX idx_status ON orders(status);

If status only has 3 values: 'Pending', 'Shipped', 'Cancelled' → The index doesn’t help — the DB might still scan most rows.

That’s like having a catalog sorted by “color” when 90% of books are blue. 😅

Low selectivity = poor index performance.


⚙️ Step 4: Performance Insights & Optimization

ℹ️
  • Execution Choice: The optimizer decides between a Seq Scan and Index Scan based on:

    • Table size
    • Index selectivity (uniqueness of values)
    • Cost of random I/O vs sequential read
  • Rule of Thumb:

    • Indexes shine when queries filter down to <10% of rows.
    • For large scans, sequential reads are faster (fewer disk seeks).
  • Composite Indexes:

    CREATE INDEX idx_customer_date ON orders(customer_id, order_date);

    Great for queries like:

    WHERE customer_id = 42 AND order_date > '2025-01-01'

    But order matters: The leftmost column (customer_id) must appear in the condition.

  • Maintenance Tip: Periodically run:

    REINDEX TABLE orders;

    to rebuild fragmented indexes and keep lookups fast.


🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions
  • “When does an index improve performance?”
  • “Why might an index be ignored by the query planner?”
  • “What’s the downside of having too many indexes?”
  • “Would an index help on a boolean column?”
  • “Explain the difference between clustered and non-clustered indexes.”
Reasoning Expectation

Interviewers test depth of understanding, not memorization:

  • Can you explain the trade-off between read and write performance?
  • Do you know how selectivity affects index usefulness?
  • Can you justify why an index isn’t being used in a given plan?
Bonus Challenge

Suppose your query filters on gender = 'M' and region = 'North'. Which index is better: 1️⃣ (gender, region) or 2️⃣ (region, gender)?

Discuss why — in terms of cardinality and leftmost prefix rules.


🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)
  • Over-indexing: Creating indexes on every column → slower inserts & wasted space.

  • Wrong column order in composite indexes: The optimizer can’t use (A, B) index if you only query by B.

  • Function-wrapped columns:

    WHERE LOWER(name) = 'john'

    won’t use an index on name. Use a functional index instead:

    CREATE INDEX idx_lower_name ON customers(LOWER(name));
  • Ignoring low-cardinality columns: Index on is_active (Y/N) rarely helps — DB will still scan most rows.

  • Forgotten maintenance: After major data changes, run:

    ANALYZE;

    to update table stats; otherwise, planner may choose bad paths.


🧠 Step 7: Key Takeaways

🧠 Main Idea: Indexes accelerate lookups by maintaining sorted key references — but they slow down data modifications.

💻 How It Works: Uses data structures like B-Trees or Hash maps to find rows in logarithmic time instead of scanning everything.

🚀 Why It Matters: Smart indexing is one of the most powerful performance tuning tools in SQL — mastering it separates data users from database engineers.

Any doubt in content? Ask me anything?
Chat
🤖 👋 Hi there! I'm your learning assistant. If you have any questions about this page or need clarification, feel free to ask!