Building a Movie Recommendation Engine

Shivank Poudel · 2 min read
machine-learning, recommendation-systems, python

Recommendation systems are a great first real machine learning project: the problem is intuitive, the data is easy to find, and you can see whether the results feel right. This post walks through a small content-based recommender for movies.

The idea

A content-based recommender answers a simple question:

Given a movie you liked, which other movies are most similar to it?

"Similar" here means similar in content (genres, keywords, cast, and a short overview) rather than similar in who watched them. That makes it a good starting point, because it works with no user history at all.

Step 1 — Clean the metadata

Raw metadata is messy: missing fields, stringified JSON, inconsistent casing. The first job is always to get to a tidy table.

import pandas as pd

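movies = pd.read_csv("movies.csv")
# Drop rows missing the essentials, then reset the index so row labels
# match the row positions of the TF-IDF matrix built in Step 2.
movies = movies.dropna(subset=["title", "overview"]).reset_index(drop=True)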

# Combine the fields we care about into a single text "soup".
movies["soup"] = (
    movies["genres"].fillna("")
    + " " + movies["keywords"].fillna("")
    + " " + movies["overview"].fillna("")
)
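The "stringified JSON" problem deserves a quick illustration. The sketch below assumes a column whose cells look like `[{"id": 28, "name": "Action"}]`, which is a common layout in public movie dumps but may not match your file. The helper name `names_from_json` and the sample cells are made up for this example.

```python
import json

def names_from_json(cell):
    """Return the "name" values of a stringified JSON list as one lowercase string.

    Anything that is missing, not a string, or not valid JSON becomes "".
    """
    if not isinstance(cell, str) or not cell.strip():
        return ""
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return ""
    if not isinstance(items, list):
        return ""
    return " ".join(
        str(item["name"]).lower()
        for item in items
        if isinstance(item, dict) and "name" in item
    )

# Invented sample cells: one valid, one missing, one malformed.
raw_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 18, "name": "Drama"}]',
    None,
    "not json",
]
cleaned = [names_from_json(cell) for cell in raw_genres]
print(cleaned)
```

If your genres column really is stored this way, something like `movies["genres"].apply(names_from_json)` before building the soup would be the natural place to use it. Lowercasing here also takes care of the inconsistent casing.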

Step 2 — Turn text into vectors

To compare movies numerically, each "soup" string is converted into a vector with TF-IDF, which weights a word more heavily the more often it appears in one movie's text and the rarer it is across the whole catalogue.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
matrix = vectorizer.fit_transform(movies["soup"])
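To see why distinctive words end up with more weight, here is a hand-rolled sketch of the calculation on three invented one-line "soups". It uses the smoothed IDF formula described in scikit-learn's documentation, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. It is only meant to show the idea and skips everything else the vectorizer does (stop words, tokenisation rules, feature limits).

```python
import math
from collections import Counter

# Three invented documents; "hero" appears in all of them.
docs = [
    "hero batman gotham",
    "hero toys cowboy",
    "hero space ranger",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the first document, "batman" and "gotham" come out heavier than "hero", because "hero" appears everywhere and tells you little about which movie you are looking at.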

Step 3 — Rank by similarity

With vectors in hand, cosine similarity scores any two movies by the cosine of the angle between their vectors. The smaller the angle, the closer the score is to 1 and the more alike the movies are.

from sklearn.metrics.pairwise import cosine_similarity

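def recommend(title, top_n=5):
    # Row position of the query movie (positions line up with matrix rows).
    matches = (movies["title"] == title).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title!r}")
    idx = matches[0]
    # Similarity of the query row against every movie in the catalogue.
    scores = cosine_similarity(matrix[idx], matrix).flatten()
    # Highest scores first, skipping the query movie itself.
    ranked = [i for i in scores.argsort()[::-1] if i != idx][:top_n]
    return movies.iloc[ranked]["title"].tolist()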
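If the "angle" picture feels abstract, a tiny worked example may help. This is a plain-Python sketch of the cosine formula (dot product divided by the product of the vector lengths), with made-up three-dimensional vectors standing in for TF-IDF rows. The helper name `cosine` is just for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Same direction, different length: the angle is zero, so the score is about 1.
same_direction = cosine([1, 2, 0], [2, 4, 0])

# No shared non-zero dimensions: the vectors are at right angles, score 0.
no_overlap = cosine([1, 0, 0], [0, 1, 0])

# Partial overlap lands somewhere in between.
partial = cosine([1, 1, 0], [1, 0, 0])

print(same_direction, no_overlap, partial)
```

Note that length does not matter, only direction, which is why a long overview and a short one can still score as near neighbours. Since `TfidfVectorizer` L2-normalises its rows by default, a plain dot product (for example scikit-learn's `linear_kernel`) should give the same ranking.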

What the results look like

For a well-known title, the neighbours are reassuringly on-theme:

| Query | Top recommendations |
| --- | --- |
| The Dark Knight | Batman Begins, The Prestige, Inception |
| Toy Story | Toy Story 2, Monsters Inc., A Bug's Life |

Where to go next

This is deliberately the simplest version. A few natural extensions:

  • Add collaborative filtering to use real user ratings
  • Blend content and collaborative scores into a hybrid ranker
  • Cache the similarity matrix so lookups are instant
  • Serve it behind a small API and a search box
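As a taste of the caching idea, here is one possible sketch: compute the top neighbours for every movie once and keep them in a dictionary, so each lookup is a plain key access. The function name `build_neighbour_cache` and the small dense array are stand-ins for illustration. A real catalogue would use the sparse TF-IDF matrix, and a full pairwise similarity matrix can get large, so treat this as a starting point and not a recipe.

```python
import numpy as np

def build_neighbour_cache(vectors, top_n=5):
    """Map each row position to the positions of its top_n most similar rows."""
    # Normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit.T
    # A movie should never be recommended to itself.
    np.fill_diagonal(sims, -np.inf)
    # Sort each row by descending similarity and keep the first top_n columns.
    order = np.argsort(-sims, axis=1)[:, :top_n]
    return {i: row.tolist() for i, row in enumerate(order)}

# Invented vectors: rows 0 and 1 point one way, rows 2 and 3 another.
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
cache = build_neighbour_cache(vectors, top_n=1)
print(cache)
```

The cost moves from query time to build time, which is usually a good trade for a catalogue that changes rarely.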

Content-based filtering won't win a Netflix Prize, but it is honest, fast, and a genuinely useful baseline to measure everything else against.