Building a Movie Recommendation Engine
Recommendation systems are a great first real machine learning project: the problem is intuitive, the data is easy to find, and you can see whether the results feel right. This post walks through a small content-based recommender for movies.
The idea
A content-based recommender answers a simple question:
Given a movie you liked, which other movies are most similar to it?
"Similar" here means similar in their content — genres, keywords, cast, and a short overview — rather than similar in who watched them. That makes it a good starting point because it needs no user history to work.
Step 1 — Clean the metadata
Raw metadata is messy: missing fields, stringified JSON, inconsistent casing. The first job is always to get to a tidy table.
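```python
import pandas as pd

movies = pd.read_csv("movies.csv")

# Drop rows without a title or overview, then renumber the rows 0..n-1 so that
# index labels match row positions in the TF-IDF matrix built in Step 2.
movies = movies.dropna(subset=["title", "overview"]).reset_index(drop=True)

# Combine the fields we care about into a single text "soup".
movies["soup"] = (
    movies["genres"].fillna("")
    + " " + movies["keywords"].fillna("")
    + " " + movies["overview"].fillna("")
)
```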
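The soup above treats `genres` and `keywords` as plain text. In some public movie datasets those columns arrive as stringified JSON, for example `[{"id": 28, "name": "Action"}]`. Assuming that shape (an assumption about the data, not something `movies.csv` guarantees), a hypothetical helper such as `names_from_json` could flatten each cell to space-separated names before the soup is built:

```python
import json

import pandas as pd


def names_from_json(cell):
    """Return the "name" values from a JSON list of objects as one string.

    Assumes cells look like '[{"id": 28, "name": "Action"}]'. Anything
    missing or unparseable becomes an empty string.
    """
    if not isinstance(cell, str):
        return ""
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return ""
    if not isinstance(items, list):
        return ""
    # Join multi-word names with "_" so "science fiction" stays one token.
    return " ".join(
        item["name"].lower().replace(" ", "_")
        for item in items
        if isinstance(item, dict) and isinstance(item.get("name"), str)
    )


# Tiny stand-in frame; the real column would come from movies.csv.
demo = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
        None,
        "not json",
    ]
})
demo["genres_clean"] = demo["genres"].apply(names_from_json)
print(demo["genres_clean"].tolist())
```

Lowercasing here also takes care of the inconsistent casing mentioned earlier, at least for these two columns.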
Step 2 — Turn text into vectors
To compare movies numerically, each "soup" string is converted into a TF-IDF vector, which gives more weight to words that are frequent in one movie's text but rare across the whole catalogue.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop common English stop words and keep the 20,000 most frequent terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
matrix = vectorizer.fit_transform(movies["soup"])
```
Step 3 — Rank by similarity
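With vectors in hand, cosine similarity scores any two movies by the cosine of the angle between their vectors. The smaller the angle, the closer the score is to 1 and the more alike the movies are. Because TF-IDF weights are never negative, the scores here fall between 0 and 1.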
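```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, top_n=5):
    # Look up the row position (not the index label) so it lines up with `matrix`.
    matches = np.flatnonzero((movies["title"] == title).to_numpy())
    if matches.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = matches[0]
    scores = cosine_similarity(matrix[idx], matrix).flatten()
    # Sort from most to least similar, skipping the query movie itself.
    ranked = scores.argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]
    return movies.iloc[ranked]["title"].tolist()
```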
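One design note on the similarity call: `TfidfVectorizer` L2-normalises each row by default (`norm="l2"`), so the cosine of two rows equals their plain dot product. The sketch below checks that on invented strings, using scikit-learn's `linear_kernel` as the dot-product version. Whether it is measurably faster on a real catalogue is something to benchmark rather than assume.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Invented soups, standing in for movies["soup"].
soups = [
    "action crime gotham vigilante",
    "action crime gotham origin",
    "animation toys friendship",
]
matrix = TfidfVectorizer().fit_transform(soups)

cos = cosine_similarity(matrix[0], matrix).flatten()
dot = linear_kernel(matrix[0], matrix).flatten()

# Rows are unit length, so both measures agree.
print(np.allclose(cos, dot))

# Row positions ordered from most to least similar to the first soup:
# the query itself first, then the soup sharing the most terms.
print(cos.argsort()[::-1].tolist())
```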
What the results look like
For a well-known title, the neighbours are reassuringly on-theme:
| Query | Top recommendations |
|---|---|
| The Dark Knight | Batman Begins, The Prestige, Inception |
| Toy Story | Toy Story 2, Monsters Inc., A Bug's Life |
Where to go next
This is deliberately the simplest version. A few natural extensions:
- Add collaborative filtering to use real user ratings
- Blend content and collaborative scores into a hybrid ranker
- Cache the similarity matrix so lookups are instant
- Serve it behind a small API and a search box
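As a rough sketch of the caching idea in the list above: rather than storing the full n × n similarity matrix, which grows quadratically with the catalogue, one option is to precompute each movie's top neighbours once and keep them in a dictionary keyed by title. The function name `build_neighbour_cache` and the toy data are hypothetical, and the sketch assumes titles are unique and the catalogue is small enough for a dense similarity matrix to fit in memory while the cache is built.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_neighbour_cache(titles, matrix, top_n=5):
    """Precompute the top_n most similar titles for every movie."""
    sims = cosine_similarity(matrix)      # dense (n, n) array
    np.fill_diagonal(sims, -1.0)          # never recommend the query itself
    order = np.argsort(-sims, axis=1)[:, :top_n]
    return {
        title: [titles[j] for j in row]
        for title, row in zip(titles, order)
    }


# Invented catalogue standing in for the cleaned movies table.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "soup": [
        "crime gotham vigilante",
        "crime gotham origin",
        "toys friendship adventure",
        "toys friendship space",
    ],
})
matrix = TfidfVectorizer().fit_transform(movies["soup"])
cache = build_neighbour_cache(movies["title"].tolist(), matrix, top_n=1)

print(cache["A"])  # lookup is now a plain dictionary access
```

The trade-off is that the cache has to be rebuilt whenever the catalogue or the vectorizer changes.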
Content-based filtering won't win a Netflix Prize, but it is honest, fast, and a genuinely useful baseline to measure everything else against.