4.2 Star and Snowflake Schemas

6 min read 1147 words

๐Ÿช„ Step 1: Intuition & Real-World Analogy

๐ŸŽฏ Core Idea

When you move from transactional systems (like OLTP databases) to analytical systems (like data warehouses), the design philosophy changes:
You stop optimizing for updates and start optimizing for fast reads, aggregations, and analytics.

The go-to design patterns for this are:

  • Star Schema ๐ŸŒŸ โ€” simple, denormalized design with a central Fact table surrounded by Dimension tables.
  • Snowflake Schema โ„๏ธ โ€” normalized version of a star schema with multi-level dimensions.

๐ŸŒ Real-Life Analogy

Think of a shopping mallโ€™s accounting system:

  • The Fact table is your giant sales log โ€” every purchase, every penny, every customer.
  • The Dimension tables are your โ€œlookup guidesโ€:
    • Who made the purchase? (Customer dimension)
    • What did they buy? (Product dimension)
    • When did it happen? (Date dimension)
    • Where was it sold? (Store dimension)

A Star Schema connects all these lookups directly to the sales log โ€” like spokes of a wheel.
A Snowflake Schema adds detail to the spokes โ€” for example, the โ€œProductโ€ dimension has its own โ€œCategoryโ€ table.


๐Ÿ’ฌ Conversational Style

You might ask:

โ€œWhy not just keep everything in one big table?โ€

Because queries like:

SELECT region, category, SUM(sales_amount)
FROM transactions
GROUP BY region, category;

would become slow and redundant if all attributes (region, category, brand, etc.) are stored repeatedly per row.

Fact and dimension tables separate raw numbers from descriptive attributes, reducing storage and speeding up analytical queries.


๐ŸŒฑ Step 2: The Core Concept Explained

๐Ÿงฎ Fact Tables: The Numerical Core

A Fact Table stores measurable events โ€” things you can add, count, or average. Examples: Sales, Clicks, Orders, Transactions.

Typical columns:

  • Foreign Keys to dimensions (product_id, customer_id, date_id)
  • Measures (e.g., sales_amount, quantity, profit)
  • Granularity (defines one row = one event, e.g., per sale, per day)

Example:

date_idcustomer_idproduct_idstore_idsales_amountquantity
202501011001501301249.992

๐Ÿงพ Dimension Tables: The Context Providers

Dimension Tables describe who, what, where, when, and how of each fact. They provide textual or categorical attributes for grouping and filtering.

Examples:

  • dim_customer โ€” name, age, gender, region
  • dim_product โ€” brand, category, supplier
  • dim_date โ€” day, month, quarter, year

Example:

product_idproduct_namecategorybrand
501iPhone 15SmartphonesApple

๐ŸŒŸ Star Schema

A Star Schema has:

  • One central Fact Table
  • Multiple directly connected Dimension Tables

Diagrammatically:

  graph TD
    F[Fact_Sales] --> D1[Dim_Customer]
    F --> D2[Dim_Product]
    F --> D3[Dim_Date]
    F --> D4[Dim_Store]

Advantages:

  • Simple structure
  • Fast aggregations (fewer joins)
  • Ideal for BI tools like Power BI, Tableau, or Looker

Disadvantage:

  • Some data duplication in dimension tables.

โ„๏ธ Snowflake Schema

A Snowflake Schema normalizes the dimensions โ€” splitting them into smaller related tables.

Diagrammatically:

  graph TD
    F[Fact_Sales] --> D1[Dim_Product]
    D1 --> D11[Dim_Category]
    D1 --> D12[Dim_Brand]
    F --> D2[Dim_Customer]
    D2 --> D21[Dim_Region]
    F --> D3[Dim_Date]

Advantages:

  • Saves space by removing redundancy.
  • Better for complex hierarchies (e.g., Product โ†’ Category โ†’ Department).

Disadvantages:

  • More joins โ†’ slower queries.
  • Slightly harder for analysts to navigate.

๐Ÿ’ป Step 3: Code Demonstration & Explanation

๐Ÿงฉ Example 1: Star Schema Implementation

-- Dimension: Product
CREATE TABLE dim_product (
  product_id INT PRIMARY KEY,
  product_name TEXT,
  category TEXT,
  brand TEXT
);

-- Dimension: Customer
CREATE TABLE dim_customer (
  customer_id INT PRIMARY KEY,
  customer_name TEXT,
  region TEXT
);

-- Dimension: Date
CREATE TABLE dim_date (
  date_id DATE PRIMARY KEY,
  month INT,
  year INT,
  quarter INT
);

-- Fact: Sales
CREATE TABLE fact_sales (
  date_id DATE REFERENCES dim_date(date_id),
  customer_id INT REFERENCES dim_customer(customer_id),
  product_id INT REFERENCES dim_product(product_id),
  sales_amount NUMERIC,
  quantity INT
);

Query Example:

SELECT 
  d.year,
  p.category,
  SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.year, p.category;

Fast, simple, and optimized for reporting tools.


๐Ÿงฉ Example 2: Snowflake Schema Variation

CREATE TABLE dim_category (
  category_id INT PRIMARY KEY,
  category_name TEXT
);

CREATE TABLE dim_product (
  product_id INT PRIMARY KEY,
  product_name TEXT,
  category_id INT REFERENCES dim_category(category_id),
  brand TEXT
);

Now, the product table connects to dim_category instead of embedding it. More normalized, but requires extra joins during aggregation.


โš™๏ธ Step 4: Performance Insights & Optimization

โšก Performance Deep Dive

Fact Tables Are Large

  • Contain billions of rows โ€” one per transaction or event.
  • High cardinality (many unique entries).
  • Indexed heavily on foreign keys for fast joins.

Dimension Tables Are Small

  • Contain descriptive info with limited unique entries (e.g., 10K products, 5K customers).
  • Low cardinality = small size = efficient lookups.

Star Schema โ†’ Fewer Joins

  • Analytical queries scan fact tables, filter via dimensions, and aggregate results.
  • BI engines love it โ€” fewer joins, predictable paths.

Snowflake Schema โ†’ More Normalized

  • Saves storage (reduces redundancy in dimensions).
  • Slightly more JOIN operations โ†’ higher query latency.

Best Practice: Use a Star Schema for BI and analytical queries. Use Snowflake Schema only when dimensions are large and hierarchically structured.

Typical Optimization Tactics

  • Partition Fact Tables by date (e.g., one partition per month).
  • Use Columnar Storage (e.g., Redshift, BigQuery, Snowflake DB).
  • Use Bitmap Indexes for categorical joins.
  • Create Materialized Views for common aggregates (e.g., sales by month, by region).

๐ŸŽฏ Step 5: Interview Perspective & Reasoning

Common Interview Questions

  • Whatโ€™s the difference between a Star and a Snowflake schema?
  • Whatโ€™s a Fact table vs a Dimension table?
  • Why are dimension tables small and fact tables large?
  • When would you use Snowflake schema over Star schema?
  • How do cardinality and selectivity impact joins?

Reasoning Expectation

Interviewers test whether you understand analytical schema design trade-offs:

  • Fact table = many records (high cardinality)
  • Dimension table = few unique entries (low cardinality)
  • Joins depend on selectivity โ€” filtering via small dimension first drastically reduces fact scans.

A top answer connects schema choice to query performance and business reporting goals.


Bonus Challenge

Suppose your company has 1B sales transactions and 100K products. You need to calculate total revenue per brand quickly.

Would you use: (A) Star schema (dim_product.brand directly in fact join), or (B) Snowflake schema (brand normalized into dim_brand)?

Explain your choice in terms of JOIN complexity and selectivity.


๐Ÿงฉ Step 6: Common Mistakes & Debugging Insights

๐Ÿšจ Common Mistakes

  • Mixing transactional (OLTP) and analytical (OLAP) design in one database.
  • Not defining granularity of the fact table โ€” leads to duplicate counting.
  • Using natural keys (like product names) instead of surrogate numeric keys.
  • Over-normalizing dimensions in a star schema (turning it into a snowflake unnecessarily).
  • Not indexing foreign keys in fact tables.

Debugging Tip: Use EXPLAIN plans to check:

  • Join order and fact table scan cost.
  • Ensure dimension filters apply before aggregating large fact tables.

๐Ÿง  Step 7: Key Takeaways

๐Ÿง  Main Idea: Star and Snowflake schemas structure analytical databases for fast aggregation and reporting by separating facts (numbers) from dimensions (descriptors).

๐Ÿ’ป How It Works:

  • Fact Tables: large, numeric, high cardinality.
  • Dimension Tables: small, descriptive, low cardinality.
  • Star = denormalized (fewer joins).
  • Snowflake = normalized (saves space).

๐Ÿš€ Why It Matters: These schemas are the foundation of BI and data warehousing. Understanding them helps you balance performance, storage, and usability in enterprise-scale systems.

Any doubt in content? Ask me anything?
Chat
๐Ÿค– ๐Ÿ‘‹ Hi there! I'm your learning assistant. If you have any questions about this page or need clarification, feel free to ask!