4.3 Data Integrity and Keys

6 min read 1068 words

🪄 Step 1: Intuition & Real-World Analogy

🎯 Core Idea

Keys are the glue that holds relational data together. They ensure:

  • Each record is unique.
  • Relationships between tables remain consistent.
  • Data can be safely joined, referenced, or updated without corruption.

Think of them as the “identity and passport system” of your database — every row has a unique identifier (primary key), may reference another entity (foreign key), and must obey certain rules of uniqueness (unique constraints).


🌍 Real-Life Analogy

Imagine a university system:

  • Every student has a unique student ID (primary key).
  • Every enrollment record links to a student — that’s the foreign key.
  • Two students can’t have the same email address — that’s a unique constraint.

And you might think:

“Can’t I just use email as the primary key?”

Well… what if the student changes their email? That’s when surrogate keys — stable, system-generated IDs — shine.


💬 Conversational Style

Data integrity rules are like database promises:

  • “Every student has one ID.”
  • “Every course exists before you enroll.”
  • “No two professors share the same office number.”

Break these promises, and your data becomes messy, unreliable, and — worst of all — untrustworthy for analytics.


🌱 Step 2: The Core Concept Explained

Primary Key (PK)

A Primary Key uniquely identifies each record in a table.

  • Must be unique and non-null.
  • Enforces row-level identity.
  • Used as the target of foreign keys from other tables.

Example: In a students table:

student_idnameemail
101Alicealice@example.com
102Bobbob@example.com

student_id is the primary key — every student has exactly one.


Foreign Key (FK)

A Foreign Key links one table’s record to another — enforcing relational integrity.

Example: In an enrollments table:

enrollment_idstudent_idcourse_id
1101C01
2102C02

student_id references students(student_id).

If you try to insert a record with a student who doesn’t exist, the database rejects it — preserving referential integrity.


Unique Constraint

Ensures that no two rows have the same value in a given column (or column combination).

Example: Two students cannot share the same email:

ALTER TABLE students ADD CONSTRAINT unique_email UNIQUE(email);

Difference from PK: A table can have only one primary key but multiple unique constraints.


Surrogate Key vs Natural Key
TypeDescriptionExampleProsCons
Natural KeyReal-world attribute already uniqueEmail, SSN, VINEasy to understandMay change, privacy risk
Surrogate KeyArtificial system-generated keyid SERIAL, UUIDStable, simple, no business meaningAdds extra join, not human-readable

💡 Guideline: Use surrogate keys in analytics and warehousing systems where data may evolve. Use natural keys in small, controlled domains (like ISO country codes).


💻 Step 3: Code Demonstration & Explanation

🧩 Example 1: Primary and Foreign Keys

CREATE TABLE students (
  student_id SERIAL PRIMARY KEY,
  name TEXT,
  email TEXT UNIQUE
);

CREATE TABLE courses (
  course_id TEXT PRIMARY KEY,
  course_name TEXT
);

CREATE TABLE enrollments (
  enrollment_id SERIAL PRIMARY KEY,
  student_id INT REFERENCES students(student_id),
  course_id TEXT REFERENCES courses(course_id)
);

students.student_id is the PK. ✅ enrollments.student_id is a FK referencing it. ✅ email is enforced as unique.

Now, you can’t:

  • Enroll a non-existent student.
  • Create duplicate emails.
  • Accidentally delete a student without cascading logic.

🧩 Example 2: Surrogate vs Natural Key

Natural key approach:

CREATE TABLE products (
  product_code TEXT PRIMARY KEY,
  product_name TEXT
);

Looks fine — until one day, marketing decides to change product codes. 😬

Better: Surrogate key

CREATE TABLE products (
  product_id SERIAL PRIMARY KEY,
  product_code TEXT UNIQUE,
  product_name TEXT
);

Now, business logic changes don’t break foreign key relationships.


🧩 Example 3: Composite Keys (Multi-column PKs)

CREATE TABLE enrollment_grades (
  student_id INT,
  course_id TEXT,
  grade CHAR(2),
  PRIMARY KEY (student_id, course_id)
);

Here, a student can only have one grade per course — enforced by the composite primary key.


⚙️ Step 4: Performance Insights & Optimization

🔹 Indexing and Lookups

  • Every Primary Key automatically creates a unique index — making lookups lightning-fast.
  • Foreign Keys often benefit from indexing as well (manually added).
CREATE INDEX idx_enrollments_student_id ON enrollments(student_id);

🔹 Surrogate Keys in Analytics

  • They’re integer-based, making joins faster than text-based natural keys (e.g., emails or product codes).
  • Ideal for fact-dimension joins in data warehouses.

🔹 Cardinality and Stability

Natural keys might seem convenient but are often unstable:

  • A user can change their username or email.
  • A product can change SKU.
  • A region can be renamed.

Surrogate keys never change — ensuring schema stability across ETL processes and time-based snapshots.

🔹 Trade-off Summary

DesignStorageRead SpeedSchema Evolution
Natural KeyCompactSlower (text)Fragile
Surrogate KeySlightly largerFast (int joins)Stable

🎯 Step 5: Interview Perspective & Reasoning

Common Interview Questions
  • “What’s the difference between a primary key and a unique key?”
  • “What happens if a foreign key constraint is violated?”
  • “Why use surrogate keys instead of natural keys?”
  • “Can a table have multiple primary keys?”
  • “What’s the impact of foreign keys on performance?”
Reasoning Expectation

Interviewers expect you to:

  • Clearly define PK, FK, and unique constraints.
  • Understand data integrity enforcement.
  • Show awareness of real-world evolution — like natural keys changing or being reused.
  • Relate key choices to query performance and maintainability.
Bonus Challenge

You’re designing a dim_customer table in a data warehouse.

Customers can change email, address, and even be merged after deduplication.

Should you use:

  • (A) Email as the primary key, or
  • (B) A surrogate customer_id with email as unique?

Explain your reasoning in terms of stability, indexing, and long-term schema flexibility.


🧩 Step 6: Common Mistakes & Debugging Insights

🚨 Common Mistakes (Click to Expand)
  • Using text-based natural keys for joins — slow and error-prone.
  • Forgetting to index foreign keys — leads to poor join performance.
  • Cascading deletes without caution — might delete more than intended.
  • Having duplicate natural keys (e.g., non-unique emails).
  • Overusing surrogate keys — adding meaningless IDs for every small lookup table.

Debugging Tip: Run:

EXPLAIN ANALYZE
SELECT * FROM enrollments e
JOIN students s ON e.student_id = s.student_id;

If you see a Seq Scan, your FK field isn’t indexed. Add it to improve performance instantly.


🧠 Step 7: Key Takeaways

🧠 Main Idea: Keys enforce identity, consistency, and referential integrity — the core pillars of reliable databases.

💻 How It Works:

  • Primary Key = unique row identifier.
  • Foreign Key = relationship enforcer.
  • Unique Constraint = duplication guard.
  • Surrogate Key = system-stable, integer-based identifier for evolving data.

🚀 Why It Matters: Choosing the right key design ensures data accuracy, performance, and evolution readiness — crucial in both OLTP systems and analytical warehouses.

Any doubt in content? Ask me anything?
Chat
🤖 👋 Hi there! I'm your learning assistant. If you have any questions about this page or need clarification, feel free to ask!