Movies and Shows Data Analysis¶
Project Overview¶
This project analyzes a dataset of movies and shows using Python and pandas. The analysis includes data cleaning, exploration, and the creation of custom functions to extract insights about actors, genres, decades, and IMDb ratings.
Dataset Description¶
The dataset movies_and_shows.csv contains information about various movies and shows, including:
- name: The name of the actor or actress
- character: The character they played
- role: The role type (e.g., ACTOR)
- title: The title of the movie or show
- type: Whether it's a MOVIE or SHOW
- release_year: The year it was released
- genres: A list of genres the movie or show belongs to
- imdb_score: The IMDb score of the movie or show
- imdb_votes: The number of votes on IMDb
1. Setup and Data Loading¶
# Import necessary libraries
import pandas as pd
# Read the dataset into a DataFrame
df = pd.read_csv('data/movies_and_shows.csv')
# Display the first few rows to understand the structure
df.head()
| name | Character | r0le | TITLE | Type | release Year | genres | imdb sc0re | imdb v0tes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Robert De Niro | Travis Bickle | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 1 | Jodie Foster | Iris Steensma | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 2 | Albert Brooks | Tom | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 3 | Harvey Keitel | Matthew 'Sport' Higgins | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 4 | Cybill Shepherd | Betsy | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
2. Data Cleaning¶
The dataset contains inconsistent column names with mixed cases and special characters. Let's standardize them.
# View current column names
print("Original column names:")
print(df.columns.tolist())
Original column names: [' name', 'Character', 'r0le', 'TITLE', ' Type', 'release Year', 'genres', 'imdb sc0re', 'imdb v0tes']
# Standardize column names: strip spaces, lowercase and replace special characters
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('0', 'o')
print("\nCleaned column names:")
print(df.columns.tolist())
Cleaned column names: ['name', 'character', 'role', 'title', 'type', 'release_year', 'genres', 'imdb_score', 'imdb_votes']
# Verify the changes
df.head()
| name | character | role | title | type | release_year | genres | imdb_score | imdb_votes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Robert De Niro | Travis Bickle | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 1 | Jodie Foster | Iris Steensma | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 2 | Albert Brooks | Tom | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 3 | Harvey Keitel | Matthew 'Sport' Higgins | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 4 | Cybill Shepherd | Betsy | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
3. Data Exploration¶
# Display basic information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 85579 entries, 0 to 85578 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 85579 non-null object 1 character 85579 non-null object 2 role 85579 non-null object 3 title 85578 non-null object 4 type 85579 non-null object 5 release_year 85579 non-null int64 6 genres 85579 non-null object 7 imdb_score 80970 non-null float64 8 imdb_votes 80853 non-null float64 dtypes: float64(2), int64(1), object(6) memory usage: 5.9+ MB
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
Missing values per column: name 0 character 0 role 0 title 1 type 0 release_year 0 genres 0 imdb_score 4609 imdb_votes 4726 dtype: int64
# Display summary statistics
df.describe()
| release_year | imdb_score | imdb_votes | |
|---|---|---|---|
| count | 85579.000000 | 80970.000000 | 8.085300e+04 |
| mean | 2015.879994 | 6.425877 | 5.978271e+04 |
| std | 7.724668 | 1.122655 | 1.846287e+05 |
| min | 1954.000000 | 1.500000 | 5.000000e+00 |
| 25% | 2015.000000 | 5.700000 | 1.266000e+03 |
| 50% | 2018.000000 | 6.500000 | 5.448000e+03 |
| 75% | 2021.000000 | 7.200000 | 3.360900e+04 |
| max | 2022.000000 | 9.500000 | 2.294231e+06 |
# Filter for movies/shows that contain 'drama' in their genres
drama_df = df[df['genres'].str.contains('drama', case=False, na=False)]
print(f"Found {len(drama_df)} entries containing 'drama'")
drama_df.head()
Found 52713 entries containing 'drama'
| name | character | role | title | type | release_year | genres | imdb_score | imdb_votes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Robert De Niro | Travis Bickle | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 1 | Jodie Foster | Iris Steensma | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 2 | Albert Brooks | Tom | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 3 | Harvey Keitel | Matthew 'Sport' Higgins | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
| 4 | Cybill Shepherd | Betsy | ACTOR | Taxi Driver | MOVIE | 1976 | ['drama', 'crime'] | 8.2 | 808582.0 |
4.2 Filtering by Decade¶
# Function to filter movies/shows by decade
def filter_by_decade(df, start_year):
"""
Filter the DataFrame for movies/shows released in a specific decade.
Parameters:
- df: DataFrame to filter
- start_year: Starting year of the decade (e.g., 1980 for 1980s)
Returns:
- Filtered DataFrame
"""
end_year = start_year + 9
return df[(df['release_year'] >= start_year) & (df['release_year'] <= end_year)]
# Example: Filter movies from the 1990s
movies_90s = filter_by_decade(df, 1990)
print(f"Movies/shows from the 1990s: {len(movies_90s)}")
movies_90s.head()
Movies/shows from the 1990s: 2845
| name | character | role | title | type | release_year | genres | imdb_score | imdb_votes | |
|---|---|---|---|---|---|---|---|---|---|
| 571 | Ray Liotta | Henry Hill | ACTOR | GoodFellas | MOVIE | 1990 | ['drama', 'crime'] | 8.7 | 1131681.0 |
| 572 | Robert De Niro | James Conway | ACTOR | GoodFellas | MOVIE | 1990 | ['drama', 'crime'] | 8.7 | 1131681.0 |
| 573 | Joe Pesci | Tommy DeVito | ACTOR | GoodFellas | MOVIE | 1990 | ['drama', 'crime'] | 8.7 | 1131681.0 |
| 574 | Lorraine Bracco | Karen Hill | ACTOR | GoodFellas | MOVIE | 1990 | ['drama', 'crime'] | 8.7 | 1131681.0 |
| 575 | Paul Sorvino | Paul Cicero | ACTOR | GoodFellas | MOVIE | 1990 | ['drama', 'crime'] | 8.7 | 1131681.0 |
4.3 High-Rated Movies¶
# Filter for highly-rated content (IMDb score >= 8.0)
high_rated = df[df['imdb_score'] >= 8.0]
print(f"Number of highly-rated titles (>=8.0): {len(high_rated)}")
high_rated[['title', 'imdb_score', 'release_year']].drop_duplicates().sort_values('imdb_score', ascending=False).head(10)
Number of highly-rated titles (>=8.0): 6093
| title | imdb_score | release_year | |
|---|---|---|---|
| 4456 | Breaking Bad | 9.5 | 2008 |
| 4872 | Avatar: The Last Airbender | 9.3 | 2005 |
| 46185 | Our Planet | 9.3 | 2019 |
| 22523 | Reply 1988 | 9.2 | 2015 |
| 65401 | Major | 9.1 | 2022 |
| 30885 | My Mister | 9.1 | 2018 |
| 51965 | Kota Factory | 9.1 | 2019 |
| 44207 | The Last Dance | 9.1 | 2020 |
| 20795 | Leah Remini: Scientology and the Aftermath | 9.0 | 2016 |
| 15764 | Attack on Titan | 9.0 | 2013 |
def get_actors_for_title(title):
"""
Retrieve a comma-separated list of actors for a given movie/show title.
Parameters:
- title: The title of the movie or show
Returns:
- String of actor names separated by commas
"""
# Filter for rows with the specified title and role as 'ACTOR'
title_actors_df = df[(df['title'] == title) & (df['role'] == 'ACTOR')]
# Extract the 'name' column for actor names
actor_names = title_actors_df['name']
# Combine names into a single string
return ', '.join(actor_names)
# Example usage
print("Actors in 'Taxi Driver':")
print(get_actors_for_title("Taxi Driver"))
Actors in 'Taxi Driver': Robert De Niro, Jodie Foster, Albert Brooks, Harvey Keitel, Cybill Shepherd, Peter Boyle, Leonard Harris, Diahnne Abbott, Gino Ardito, Martin Scorsese, Murray Moston, Richard Higgs, Bill Minkin, Bob Maroff, Victor Argo, Joe Spinell, Robinson Frank Adu, Brenda Dickson, Norman Matlock, Harry Northup, Harlan Cary Poe, Steven Prince, Peter Savage, Nicholas Shields, Ralph S. Singleton, Annie Gagen, Carson Grant, Mary-Pat Green, Debbi Morgan, Don Stroud, Copper Cunningham, Garth Avery, Nat Grant, Billie Perkins, Catherine Scorsese, Charles Scorsese, Odunlade Adekola, Ijeoma Grace Agu
5.2 Categorize Movies by IMDb Score¶
def categorize_imdb_score(title):
"""
Categorize a movie/show based on its IMDb score.
Categories:
- Excellent: >= 9.0
- Good: 7.0 - 8.9
- Average: 5.0 - 6.9
- Low: < 5.0
Parameters:
- title: The title of the movie or show
Returns:
- Category string or 'Title not found'
"""
# Filter for the row with the specified title
imdb_scores = df[df['title'] == title]['imdb_score'].tolist()
# Check if title exists
if not imdb_scores:
return 'Title not found'
# Get the IMDb score
imdb_score = float(imdb_scores[0])
# Categorize and return
if imdb_score >= 9.0:
return 'Excellent'
elif imdb_score >= 7.0:
return 'Good'
elif imdb_score >= 5.0:
return 'Average'
else:
return 'Low'
# Test the categorization function with titles that cover all categories
test_titles = ["Breaking Bad", "Taxi Driver", "The Blue Lagoon", "Dostana", "Nonexistent Movie Title"]
for title in test_titles:
category = categorize_imdb_score(title)
print(f"{title}: {category}")
Breaking Bad: Excellent Taxi Driver: Good The Blue Lagoon: Average Dostana: Low Nonexistent Movie Title: Title not found
6. Summary and Insights¶
Key Findings:¶
Data Quality: The dataset contained inconsistent column naming which was standardized for better usability.
Content Distribution: The dataset includes both movies and shows across multiple decades.
Rating Analysis: Movies can be categorized into quality tiers based on IMDb scores, helping identify highly-rated content.
Actor Information: Custom functions allow for easy retrieval of cast information for any title.
Potential Next Steps:¶
- Analyze genre trends over time
- Identify most prolific actors or highest-rated actors
- Compare movie vs. show ratings
- Investigate correlation between number of votes and IMDb scores