4.3 Data Integrity and Keys

Core Skills Guide for AI Interviews (Math, Code, SQL) 2025

6 min read 1068 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea

Keys are the glue that holds relational data together. They ensure:

Each record is unique.
Relationships between tables remain consistent.
Data can be safely joined, referenced, or updated without corruption.

Think of them as the “identity and passport system” of your database — every row has a unique identifier (primary key), may reference another entity (foreign key), and must obey certain rules of uniqueness (unique constraints).

🌍 Real-Life Analogy

Imagine a university system:

Every student has a unique student ID (primary key).
Every enrollment record links to a student — that’s the foreign key.
Two students can’t have the same email address — that’s a unique constraint.

And you might think:

“Can’t I just use email as the primary key?”

Well… what if the student changes their email? That’s when surrogate keys — stable, system-generated IDs — shine.

💬 Conversational Style

Data integrity rules are like database promises:

“Every student has one ID.”
“Every course exists before you enroll.”
“No two professors share the same office number.”

Break these promises, and your data becomes messy, unreliable, and — worst of all — untrustworthy for analytics.

🌱 Step 2: The Core Concept Explained

Primary Key (PK)

A Primary Key uniquely identifies each record in a table.

Must be unique and non-null.
Enforces row-level identity.
Used as the target of foreign keys from other tables.

Example: In a students table:

student_id	name	email
101	Alice	alice@example.com
102	Bob	bob@example.com

student_id is the primary key — every student has exactly one.

Foreign Key (FK)

A Foreign Key links one table’s record to another — enforcing relational integrity.

Example: In an enrollments table:

enrollment_id	student_id	course_id
1	101	C01
2	102	C02

student_id references students(student_id).

If you try to insert a record with a student who doesn’t exist, the database rejects it — preserving referential integrity.

Unique Constraint

Ensures that no two rows have the same value in a given column (or column combination).

Example: Two students cannot share the same email:

ALTER TABLE students ADD CONSTRAINT unique_email UNIQUE(email);

Difference from PK: A table can have only one primary key but multiple unique constraints.

Surrogate Key vs Natural Key

Type	Description	Example	Pros	Cons
Natural Key	Real-world attribute already unique	Email, SSN, VIN	Easy to understand	May change, privacy risk
Surrogate Key	Artificial system-generated key	`id SERIAL`, `UUID`	Stable, simple, no business meaning	Adds extra join, not human-readable

💡 Guideline: Use surrogate keys in analytics and warehousing systems where data may evolve. Use natural keys in small, controlled domains (like ISO country codes).

💻 Step 3: Code Demonstration & Explanation

🧩 Example 1: Primary and Foreign Keys

CREATE TABLE students (
  student_id SERIAL PRIMARY KEY,
  name TEXT,
  email TEXT UNIQUE
);

CREATE TABLE courses (
  course_id TEXT PRIMARY KEY,
  course_name TEXT
);

CREATE TABLE enrollments (
  enrollment_id SERIAL PRIMARY KEY,
  student_id INT REFERENCES students(student_id),
  course_id TEXT REFERENCES courses(course_id)
);

✅ students.student_id is the PK. ✅ enrollments.student_id is a FK referencing it. ✅ email is enforced as unique.

Now, you can’t:

Enroll a non-existent student.
Create duplicate emails.
Accidentally delete a student without cascading logic.

🧩 Example 2: Surrogate vs Natural Key

Natural key approach:

CREATE TABLE products (
  product_code TEXT PRIMARY KEY,
  product_name TEXT
);

Looks fine — until one day, marketing decides to change product codes. 😬

Better: Surrogate key

CREATE TABLE products (
  product_id SERIAL PRIMARY KEY,
  product_code TEXT UNIQUE,
  product_name TEXT
);

Now, business logic changes don’t break foreign key relationships.

🧩 Example 3: Composite Keys (Multi-column PKs)

CREATE TABLE enrollment_grades (
  student_id INT,
  course_id TEXT,
  grade CHAR(2),
  PRIMARY KEY (student_id, course_id)
);

Here, a student can only have one grade per course — enforced by the composite primary key.

⚙️ Step 4: Performance Insights & Optimization

🔹 Indexing and Lookups

Every Primary Key automatically creates a unique index — making lookups lightning-fast.
Foreign Keys often benefit from indexing as well (manually added).

CREATE INDEX idx_enrollments_student_id ON enrollments(student_id);

🔹 Surrogate Keys in Analytics

They’re integer-based, making joins faster than text-based natural keys (e.g., emails or product codes).
Ideal for fact-dimension joins in data warehouses.

🔹 Cardinality and Stability

Natural keys might seem convenient but are often unstable:

A user can change their username or email.
A product can change SKU.
A region can be renamed.

Surrogate keys never change — ensuring schema stability across ETL processes and time-based snapshots.

🔹 Trade-off Summary

Design	Storage	Read Speed	Schema Evolution
Natural Key	Compact	Slower (text)	Fragile
Surrogate Key	Slightly larger	Fast (int joins)	Stable

🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions

“What’s the difference between a primary key and a unique key?”
“What happens if a foreign key constraint is violated?”
“Why use surrogate keys instead of natural keys?”
“Can a table have multiple primary keys?”
“What’s the impact of foreign keys on performance?”

Reasoning Expectation

Interviewers expect you to:

Clearly define PK, FK, and unique constraints.
Understand data integrity enforcement.
Show awareness of real-world evolution — like natural keys changing or being reused.
Relate key choices to query performance and maintainability.

Bonus Challenge

You’re designing a dim_customer table in a data warehouse.
Customers can change email, address, and even be merged after deduplication.
Should you use:
(A) Email as the primary key, or
(B) A surrogate customer_id with email as unique?
Explain your reasoning in terms of stability, indexing, and long-term schema flexibility.

🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)

Using text-based natural keys for joins — slow and error-prone.
Forgetting to index foreign keys — leads to poor join performance.
Cascading deletes without caution — might delete more than intended.
Having duplicate natural keys (e.g., non-unique emails).
Overusing surrogate keys — adding meaningless IDs for every small lookup table.

Debugging Tip: Run:

EXPLAIN ANALYZE
SELECT * FROM enrollments e
JOIN students s ON e.student_id = s.student_id;

If you see a Seq Scan, your FK field isn’t indexed. Add it to improve performance instantly.

🧠 Step 7: Key Takeaways

🧠 Main Idea: Keys enforce identity, consistency, and referential integrity — the core pillars of reliable databases.

💻 How It Works:
Primary Key = unique row identifier.
Foreign Key = relationship enforcer.
Unique Constraint = duplication guard.
Surrogate Key = system-stable, integer-based identifier for evolving data.

🚀 Why It Matters: Choosing the right key design ensures data accuracy, performance, and evolution readiness — crucial in both OLTP systems and analytical warehouses.

4.2 Star and Snowflake Schemas