4.3 Data Integrity and Keys
🪄 Step 1: Intuition & Real-World Analogy
🎯 Core Idea
Keys are the glue that holds relational data together. They ensure:
- Each record is unique.
- Relationships between tables remain consistent.
- Data can be safely joined, referenced, or updated without corruption.
Think of them as the “identity and passport system” of your database — every row has a unique identifier (primary key), may reference another entity (foreign key), and must obey certain rules of uniqueness (unique constraints).
🌍 Real-Life Analogy
Imagine a university system:
- Every student has a unique student ID (primary key).
- Every enrollment record links to a student — that’s the foreign key.
- Two students can’t have the same email address — that’s a unique constraint.
And you might think:
“Can’t I just use email as the primary key?”
Well… what if the student changes their email? That’s when surrogate keys — stable, system-generated IDs — shine.
💬 Conversational Style
Data integrity rules are like database promises:
- “Every student has one ID.”
- “Every course exists before you enroll.”
- “No two professors share the same office number.”
Break these promises, and your data becomes messy, unreliable, and — worst of all — untrustworthy for analytics.
🌱 Step 2: The Core Concept Explained
Primary Key (PK)
A Primary Key uniquely identifies each record in a table.
- Must be unique and non-null.
- Enforces row-level identity.
- Used as the target of foreign keys from other tables.
Example:
In a students table:
| student_id | name | |
|---|---|---|
| 101 | Alice | alice@example.com |
| 102 | Bob | bob@example.com |
student_id is the primary key — every student has exactly one.
Foreign Key (FK)
A Foreign Key links one table’s record to another — enforcing relational integrity.
Example:
In an enrollments table:
| enrollment_id | student_id | course_id |
|---|---|---|
| 1 | 101 | C01 |
| 2 | 102 | C02 |
student_id references students(student_id).
If you try to insert a record with a student who doesn’t exist, the database rejects it — preserving referential integrity.
Unique Constraint
Ensures that no two rows have the same value in a given column (or column combination).
Example: Two students cannot share the same email:
ALTER TABLE students ADD CONSTRAINT unique_email UNIQUE(email);Difference from PK: A table can have only one primary key but multiple unique constraints.
Surrogate Key vs Natural Key
| Type | Description | Example | Pros | Cons |
|---|---|---|---|---|
| Natural Key | Real-world attribute already unique | Email, SSN, VIN | Easy to understand | May change, privacy risk |
| Surrogate Key | Artificial system-generated key | id SERIAL, UUID | Stable, simple, no business meaning | Adds extra join, not human-readable |
💡 Guideline: Use surrogate keys in analytics and warehousing systems where data may evolve. Use natural keys in small, controlled domains (like ISO country codes).
💻 Step 3: Code Demonstration & Explanation
🧩 Example 1: Primary and Foreign Keys
CREATE TABLE students (
student_id SERIAL PRIMARY KEY,
name TEXT,
email TEXT UNIQUE
);
CREATE TABLE courses (
course_id TEXT PRIMARY KEY,
course_name TEXT
);
CREATE TABLE enrollments (
enrollment_id SERIAL PRIMARY KEY,
student_id INT REFERENCES students(student_id),
course_id TEXT REFERENCES courses(course_id)
);✅ students.student_id is the PK.
✅ enrollments.student_id is a FK referencing it.
✅ email is enforced as unique.
Now, you can’t:
- Enroll a non-existent student.
- Create duplicate emails.
- Accidentally delete a student without cascading logic.
🧩 Example 2: Surrogate vs Natural Key
Natural key approach:
CREATE TABLE products (
product_code TEXT PRIMARY KEY,
product_name TEXT
);Looks fine — until one day, marketing decides to change product codes. 😬
Better: Surrogate key
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
product_code TEXT UNIQUE,
product_name TEXT
);Now, business logic changes don’t break foreign key relationships.
🧩 Example 3: Composite Keys (Multi-column PKs)
CREATE TABLE enrollment_grades (
student_id INT,
course_id TEXT,
grade CHAR(2),
PRIMARY KEY (student_id, course_id)
);Here, a student can only have one grade per course — enforced by the composite primary key.
⚙️ Step 4: Performance Insights & Optimization
🔹 Indexing and Lookups
- Every Primary Key automatically creates a unique index — making lookups lightning-fast.
- Foreign Keys often benefit from indexing as well (manually added).
CREATE INDEX idx_enrollments_student_id ON enrollments(student_id);🔹 Surrogate Keys in Analytics
- They’re integer-based, making joins faster than text-based natural keys (e.g., emails or product codes).
- Ideal for fact-dimension joins in data warehouses.
🔹 Cardinality and Stability
Natural keys might seem convenient but are often unstable:
- A user can change their username or email.
- A product can change SKU.
- A region can be renamed.
Surrogate keys never change — ensuring schema stability across ETL processes and time-based snapshots.
🔹 Trade-off Summary
| Design | Storage | Read Speed | Schema Evolution |
|---|---|---|---|
| Natural Key | Compact | Slower (text) | Fragile |
| Surrogate Key | Slightly larger | Fast (int joins) | Stable |
🎯 Step 5: Interview Perspective & Reasoning
Common Interview Questions
- “What’s the difference between a primary key and a unique key?”
- “What happens if a foreign key constraint is violated?”
- “Why use surrogate keys instead of natural keys?”
- “Can a table have multiple primary keys?”
- “What’s the impact of foreign keys on performance?”
Reasoning Expectation
Interviewers expect you to:
- Clearly define PK, FK, and unique constraints.
- Understand data integrity enforcement.
- Show awareness of real-world evolution — like natural keys changing or being reused.
- Relate key choices to query performance and maintainability.
Bonus Challenge
You’re designing a
dim_customertable in a data warehouse.Customers can change email, address, and even be merged after deduplication.
Should you use:
- (A) Email as the primary key, or
- (B) A surrogate
customer_idwith email as unique?Explain your reasoning in terms of stability, indexing, and long-term schema flexibility.
🧩 Step 6: Common Mistakes & Debugging Insights
🚨 Common Mistakes (Click to Expand)
- Using text-based natural keys for joins — slow and error-prone.
- Forgetting to index foreign keys — leads to poor join performance.
- Cascading deletes without caution — might delete more than intended.
- Having duplicate natural keys (e.g., non-unique emails).
- Overusing surrogate keys — adding meaningless IDs for every small lookup table.
Debugging Tip: Run:
EXPLAIN ANALYZE
SELECT * FROM enrollments e
JOIN students s ON e.student_id = s.student_id;If you see a Seq Scan, your FK field isn’t indexed. Add it to improve performance instantly.
🧠 Step 7: Key Takeaways
🧠 Main Idea: Keys enforce identity, consistency, and referential integrity — the core pillars of reliable databases.
💻 How It Works:
- Primary Key = unique row identifier.
- Foreign Key = relationship enforcer.
- Unique Constraint = duplication guard.
- Surrogate Key = system-stable, integer-based identifier for evolving data.
🚀 Why It Matters: Choosing the right key design ensures data accuracy, performance, and evolution readiness — crucial in both OLTP systems and analytical warehouses.