Which one is a better plan?¶
You work as an analyst for the telecom operator Megaline. The company offers its clients two prepaid plans, Surf and Ultimate. The commercial department wants to know which of the plans brings in more revenue in order to adjust the advertising budget.
You are going to carry out a preliminary analysis of the plans based on a relatively small client selection. You'll have the data on 500 Megaline clients: who the clients are, where they're from, which plan they use, and the number of calls they made and text messages they sent in 2018. Your job is to analyze the clients' behavior and determine which prepaid plan brings in more revenue.
Project Overview¶
The purpose of this project is to analyze customer data from Megaline's two prepaid plans (Surf and Ultimate) to determine which plan generates more revenue. I will:
- Load and explore data from multiple sources (users, calls, messages, internet usage, and plans)
- Clean and prepare the data for analysis
- Calculate monthly usage and revenue for each customer
- Analyze customer behavior patterns across different plans
- Test statistical hypotheses about revenue differences between plans and regions
- Draw conclusions to help the commercial department make data-driven decisions
Initialization¶
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
Load data¶
# Load the data files into different DataFrames
plans = pd.read_csv('../data/megaline_plans.csv')
users = pd.read_csv('../data/megaline_users.csv')
calls = pd.read_csv('../data/megaline_calls.csv')
messages = pd.read_csv('../data/megaline_messages.csv')
internet = pd.read_csv('../data/megaline_internet.csv')
Prepare the data¶
Plans¶
# Print the general/summary information about the plans' DataFrame
print(plans.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   messages_included      2 non-null      int64
 1   mb_per_month_included  2 non-null      int64
 2   minutes_included       2 non-null      int64
 3   usd_monthly_pay        2 non-null      int64
 4   usd_per_gb             2 non-null      int64
 5   usd_per_message        2 non-null      float64
 6   usd_per_minute         2 non-null      float64
 7   plan_name              2 non-null      object
dtypes: float64(2), int64(5), object(1)
memory usage: 260.0+ bytes
None
# Print a sample of data for plans
print(plans)
   messages_included  mb_per_month_included  minutes_included  \
0                 50                  15360               500
1               1000                  30720              3000

   usd_monthly_pay  usd_per_gb  usd_per_message  usd_per_minute plan_name
0               20          10             0.03            0.03      surf
1               70           7             0.01            0.01  ultimate
Observations about Plans data:¶
The plans dataset contains information about two plans: Surf and Ultimate. All data types appear appropriate (numeric values for prices and limits, string for plan name). There are no missing values.
Key observations:
- Surf plan: 20 dollars/month with 500 minutes, 50 messages, and 15360 MB (15 GB) included
- Ultimate plan: 70 dollars/month with 3000 minutes, 1000 messages, and 30720 MB (30 GB) included
- Overage charges are lower for the Ultimate plan (0.01 vs 0.03 dollars per minute and per message, 7 dollars vs 10 dollars per GB); the billing arithmetic is sketched below
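To make the billing rules concrete, here is a minimal sketch of the overage arithmetic for a hypothetical Surf user (600 minutes, 70 messages, 18 GB in one month); the numbers are illustrative only, and the same logic is applied to the real data later in the notebook.
# Hypothetical example: one month on the Surf plan with 600 minutes, 70 messages, 18 GB
base_fee = 20                          # Surf monthly fee
extra_minutes = max(0, 600 - 500)      # 100 minutes over the allowance
extra_messages = max(0, 70 - 50)       # 20 messages over the allowance
extra_gb = max(0, 18 - 15)             # 3 GB over the allowance
bill = base_fee + extra_minutes * 0.03 + extra_messages * 0.03 + extra_gb * 10
print(bill)  # 20 + 3.00 + 0.60 + 30.00 = 53.60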
Fix data¶
# No fixes needed for plans data
Enrich data¶
# Add a column for GB/month included with each plan
plans['gb_per_month_included'] = plans['mb_per_month_included'] / 1024
print(plans)
   messages_included  mb_per_month_included  minutes_included  \
0                 50                  15360               500
1               1000                  30720              3000

   usd_monthly_pay  usd_per_gb  usd_per_message  usd_per_minute plan_name  \
0               20          10             0.03            0.03      surf
1               70           7             0.01            0.01  ultimate

   gb_per_month_included
0                   15.0
1                   30.0
Users¶
# Print the general/summary information about the users' DataFrame
print(users.info())
print()
print(users.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 500 non-null int64
1 first_name 500 non-null object
2 last_name 500 non-null object
3 age 500 non-null int64
4 city 500 non-null object
5 reg_date 500 non-null object
6 plan 500 non-null object
7 churn_date 34 non-null object
dtypes: int64(2), object(6)
memory usage: 31.4+ KB
None
user_id age
count 500.000000 500.000000
mean 1249.500000 45.486000
std 144.481833 16.972269
min 1000.000000 18.000000
25% 1124.750000 30.000000
50% 1249.500000 46.000000
75% 1374.250000 61.000000
max 1499.000000 75.000000
# Print a sample of data for users
print(users.head(10))
print(f'\nTotal users: {len(users)}')
print(f"\nPlan distribution:\n{users['plan'].value_counts()}")
user_id first_name last_name age city \
0 1000 Anamaria Bauer 45 Atlanta-Sandy Springs-Roswell, GA MSA
1 1001 Mickey Wilkerson 28 Seattle-Tacoma-Bellevue, WA MSA
2 1002 Carlee Hoffman 36 Las Vegas-Henderson-Paradise, NV MSA
3 1003 Reynaldo Jenkins 52 Tulsa, OK MSA
4 1004 Leonila Thompson 40 Seattle-Tacoma-Bellevue, WA MSA
5 1005 Livia Shields 31 Dallas-Fort Worth-Arlington, TX MSA
6 1006 Jesusa Bradford 73 San Francisco-Oakland-Berkeley, CA MSA
7 1007 Eusebio Welch 42 Grand Rapids-Kentwood, MI MSA
8 1008 Emely Hoffman 53 Orlando-Kissimmee-Sanford, FL MSA
9 1009 Gerry Little 19 San Jose-Sunnyvale-Santa Clara, CA MSA
reg_date plan churn_date
0 2018-12-24 ultimate NaN
1 2018-08-13 surf NaN
2 2018-10-21 surf NaN
3 2018-01-28 surf NaN
4 2018-05-23 surf NaN
5 2018-11-29 surf NaN
6 2018-11-27 ultimate 2018-12-18
7 2018-07-11 surf NaN
8 2018-08-03 ultimate NaN
9 2018-04-22 surf NaN
Total users: 500
Plan distribution:
plan
surf 339
ultimate 161
Name: count, dtype: int64
Observations about Users data:¶
The users dataset contains 500 records with user demographics and plan information.
Issues identified:
- Date columns (reg_date and churn_date) are stored as objects instead of datetime format

Other observations:
- The churn_date column has many missing values (users who haven't churned), which is expected
- Most users (339 of 500) are on the Surf plan
Fix Data¶
# Convert date columns to datetime format
users['reg_date'] = pd.to_datetime(users['reg_date'])
users['churn_date'] = pd.to_datetime(users['churn_date'])
print(users.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     500 non-null    int64
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64
 4   city        500 non-null    object
 5   reg_date    500 non-null    datetime64[ns]
 6   plan        500 non-null    object
 7   churn_date  34 non-null     datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(4)
memory usage: 31.4+ KB
None
Enrich Data¶
# Create a simpler city category to identify NY-NJ area for hypothesis testing
users['is_ny_nj'] = users['city'].str.contains('NY-NJ', na=False)
print(users.info())
print(f"\nUsers in NY-NJ area: {users['is_ny_nj'].sum()}")
print(f"Users in other areas: {(~users['is_ny_nj']).sum()}")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     500 non-null    int64
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64
 4   city        500 non-null    object
 5   reg_date    500 non-null    datetime64[ns]
 6   plan        500 non-null    object
 7   churn_date  34 non-null     datetime64[ns]
 8   is_ny_nj    500 non-null    bool
dtypes: bool(1), datetime64[ns](2), int64(2), object(4)
memory usage: 31.9+ KB
None

Users in NY-NJ area: 80
Users in other areas: 420
Calls¶
# Print the general/summary information about the calls' DataFrame
print(calls.info())
print()
print(calls.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 137735 non-null object
1 user_id 137735 non-null int64
2 call_date 137735 non-null object
3 duration 137735 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.2+ MB
None
user_id duration
count 137735.000000 137735.000000
mean 1247.658046 6.745927
std 139.416268 5.839241
min 1000.000000 0.000000
25% 1128.000000 1.290000
50% 1247.000000 5.980000
75% 1365.000000 10.690000
max 1499.000000 37.600000
# Print a sample of data for calls
print(calls.head(10))
         id  user_id   call_date  duration
0   1000_93     1000  2018-12-27      8.52
1  1000_145     1000  2018-12-27     13.66
2  1000_247     1000  2018-12-27     14.48
3  1000_309     1000  2018-12-28      5.76
4  1000_380     1000  2018-12-30      4.22
5  1000_388     1000  2018-12-31      2.20
6  1000_510     1000  2018-12-27      5.75
7  1000_521     1000  2018-12-28     14.18
8  1000_530     1000  2018-12-28      5.77
9  1000_544     1000  2018-12-26      4.40
Observations about Calls data:¶
The calls dataset contains individual call records with user_id, call_date, and duration (in minutes).
Issues identified:
- The call_date column is stored as object type instead of datetime

Other observations:
- Duration is recorded in minutes with fractional values, so it must be rounded up to whole minutes for billing
- Some calls have zero duration; these will be billed as 1 minute
Fix data¶
# Convert call_date to datetime format
calls['call_date'] = pd.to_datetime(calls['call_date'])
#Replace 0-duration calls with 1 minute
calls.loc[calls['duration'] == 0, 'duration'] = 1
print(calls.info())
print()
print('Duration statistics after correction:')
print(calls['duration'].describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   id         137735 non-null  object
 1   user_id    137735 non-null  int64
 2   call_date  137735 non-null  datetime64[ns]
 3   duration   137735 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 4.2+ MB
None

Duration statistics after correction:
count    137735.000000
mean          6.940750
std           5.623616
min           0.010000
25%           1.290000
50%           5.980000
75%          10.690000
max          37.600000
Name: duration, dtype: float64
Enrich data¶
# Extract month from call_date for aggregation
calls['month'] = calls['call_date'].dt.to_period('M')
# Round up call duration to nearest minute (what the user would be billed for)
calls['duration_rounded'] = np.ceil(calls['duration'])
print(calls.head(10))
         id  user_id  call_date  duration    month  duration_rounded
0   1000_93     1000 2018-12-27      8.52  2018-12               9.0
1  1000_145     1000 2018-12-27     13.66  2018-12              14.0
2  1000_247     1000 2018-12-27     14.48  2018-12              15.0
3  1000_309     1000 2018-12-28      5.76  2018-12               6.0
4  1000_380     1000 2018-12-30      4.22  2018-12               5.0
5  1000_388     1000 2018-12-31      2.20  2018-12               3.0
6  1000_510     1000 2018-12-27      5.75  2018-12               6.0
7  1000_521     1000 2018-12-28     14.18  2018-12              15.0
8  1000_530     1000 2018-12-28      5.77  2018-12               6.0
9  1000_544     1000 2018-12-26      4.40  2018-12               5.0
Messages¶
# Print the general/summary information about the messages' DataFrame
print(messages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            76051 non-null  object
 1   user_id       76051 non-null  int64
 2   message_date  76051 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
None
# Print a sample of data for messages
print(messages.head(10))
         id  user_id message_date
0  1000_125     1000   2018-12-27
1  1000_160     1000   2018-12-31
2  1000_223     1000   2018-12-31
3  1000_251     1000   2018-12-27
4  1000_255     1000   2018-12-26
5  1000_346     1000   2018-12-29
6  1000_386     1000   2018-12-30
7  1000_413     1000   2018-12-31
8  1000_502     1000   2018-12-27
9  1000_525     1000   2018-12-28
Observations about Messages data:¶
The messages dataset contains individual message records with user_id and message_date. Each row represents one message sent.
Issues identified:
- The message_date column is stored as object type instead of datetime
Fix data¶
# Convert message_date to datetime format
messages['message_date'] = pd.to_datetime(messages['message_date'])
print(messages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            76051 non-null  object
 1   user_id       76051 non-null  int64
 2   message_date  76051 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 1.7+ MB
None
Enrich data¶
# Extract month from message_date for aggregation
messages['month'] = messages['message_date'].dt.to_period('M')
print(messages.head(10))
         id  user_id message_date    month
0  1000_125     1000   2018-12-27  2018-12
1  1000_160     1000   2018-12-31  2018-12
2  1000_223     1000   2018-12-31  2018-12
3  1000_251     1000   2018-12-27  2018-12
4  1000_255     1000   2018-12-26  2018-12
5  1000_346     1000   2018-12-29  2018-12
6  1000_386     1000   2018-12-30  2018-12
7  1000_413     1000   2018-12-31  2018-12
8  1000_502     1000   2018-12-27  2018-12
9  1000_525     1000   2018-12-28  2018-12
Internet¶
# Print the general/summary information about the internet DataFrame
print(internet.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            104825 non-null  object
 1   user_id       104825 non-null  int64
 2   session_date  104825 non-null  object
 3   mb_used       104825 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.2+ MB
None
# Print a sample of data for the internet traffic
print(internet.head(10))
         id  user_id session_date  mb_used
0   1000_13     1000   2018-12-29    89.86
1  1000_204     1000   2018-12-31     0.00
2  1000_379     1000   2018-12-28   660.40
3  1000_413     1000   2018-12-26   270.99
4  1000_442     1000   2018-12-27   880.22
5    1001_0     1001   2018-08-24   284.68
6    1001_3     1001   2018-12-09   656.04
7    1001_4     1001   2018-11-04    16.97
8   1001_10     1001   2018-11-27   135.18
9   1001_15     1001   2018-12-13   761.92
Observations about Internet data:¶
The internet dataset contains internet session records with user_id, session_date, and mb_used (megabytes).
Issues identified:
- The session_date column is stored as object type instead of datetime

Other observations:
- Usage is recorded in MB, but plan pricing is per GB; billing rounds the monthly total up to whole gigabytes (see the sketch below)
- Some sessions show 0 MB used
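Because data is billed by rounding the monthly total up to the nearest GB rather than rounding each session, the order of aggregation matters. A minimal sketch with made-up session sizes:
# Hypothetical sessions for one user in one month, in MB
example_sessions_mb = np.array([300.0, 512.0, 700.0])
# Rounding each session up separately would overbill: 1 + 1 + 1 = 3 GB
per_session_gb = np.ceil(example_sessions_mb / 1024).sum()
# Summing first and then rounding the monthly total gives ceil(1512 / 1024) = 2 GB
monthly_gb = np.ceil(example_sessions_mb.sum() / 1024)
print(per_session_gb, monthly_gb)  # 3.0 2.0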
Fix data¶
# Convert session_date to datetime format
internet['session_date'] = pd.to_datetime(internet['session_date'])
print(internet.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            104825 non-null  object
 1   user_id       104825 non-null  int64
 2   session_date  104825 non-null  datetime64[ns]
 3   mb_used       104825 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 3.2+ MB
None
Enrich data¶
# Extract month from session_date for aggregation
internet['month'] = internet['session_date'].dt.to_period('M')
# Convert MB to GB for easier comparison with plan limits
internet['gb_used'] = internet['mb_used'] / 1024
print(internet.head(10))
         id  user_id session_date  mb_used    month   gb_used
0   1000_13     1000   2018-12-29    89.86  2018-12  0.087754
1  1000_204     1000   2018-12-31     0.00  2018-12  0.000000
2  1000_379     1000   2018-12-28   660.40  2018-12  0.644922
3  1000_413     1000   2018-12-26   270.99  2018-12  0.264639
4  1000_442     1000   2018-12-27   880.22  2018-12  0.859590
5    1001_0     1001   2018-08-24   284.68  2018-08  0.278008
6    1001_3     1001   2018-12-09   656.04  2018-12  0.640664
7    1001_4     1001   2018-11-04    16.97  2018-11  0.016572
8   1001_10     1001   2018-11-27   135.18  2018-11  0.132012
9   1001_15     1001   2018-12-13   761.92  2018-12  0.744062
Study plan conditions¶
# Print out the plan conditions and make sure they are clear for you
for i in range(len(plans)):
print(f"\n{plans.iloc[i]['plan_name'].upper()} Plan:")
print(f" Monthly fee: ${plans.iloc[i]['usd_monthly_pay']}")
print(f" Included minutes: {plans.iloc[i]['minutes_included']}")
print(f" Included messages: {plans.iloc[i]['messages_included']}")
print(f" Included data: {plans.iloc[i]['gb_per_month_included']:.1f} GB")
print(f" Overage cost per minute: ${plans.iloc[i]['usd_per_minute']}")
print(f" Overage cost per message: ${plans.iloc[i]['usd_per_message']}")
print(f" Overage cost per GB: ${plans.iloc[i]['usd_per_gb']}")
SURF Plan:
  Monthly fee: $20
  Included minutes: 500
  Included messages: 50
  Included data: 15.0 GB
  Overage cost per minute: $0.03
  Overage cost per message: $0.03
  Overage cost per GB: $10

ULTIMATE Plan:
  Monthly fee: $70
  Included minutes: 3000
  Included messages: 1000
  Included data: 30.0 GB
  Overage cost per minute: $0.01
  Overage cost per message: $0.01
  Overage cost per GB: $7
Aggregate data per user¶
# Calculate the number of calls made by each user per month. Save the result.
calls_per_month = calls.groupby(['user_id', 'month']).agg(num_calls=('id', 'count')).reset_index()
print(calls_per_month.head(10))
   user_id    month  num_calls
0     1000  2018-12         16
1     1001  2018-08         27
2     1001  2018-09         49
3     1001  2018-10         65
4     1001  2018-11         64
5     1001  2018-12         56
6     1002  2018-10         11
7     1002  2018-11         55
8     1002  2018-12         47
9     1003  2018-12        149
# Calculate the amount of minutes spent by each user per month. Save the result.
minutes_per_month = calls.groupby(['user_id', 'month']).agg(total_minutes=('duration_rounded', 'sum')).reset_index()
print(minutes_per_month.head(10))
   user_id    month  total_minutes
0     1000  2018-12          124.0
1     1001  2018-08          187.0
2     1001  2018-09          326.0
3     1001  2018-10          411.0
4     1001  2018-11          441.0
5     1001  2018-12          422.0
6     1002  2018-10           62.0
7     1002  2018-11          393.0
8     1002  2018-12          393.0
9     1003  2018-12         1135.0
# Calculate the number of messages sent by each user per month. Save the result.
messages_per_month = messages.groupby(['user_id', 'month']).agg(num_messages=('id', 'count')).reset_index()
print(messages_per_month.head(10))
   user_id    month  num_messages
0     1000  2018-12            11
1     1001  2018-08            30
2     1001  2018-09            44
3     1001  2018-10            53
4     1001  2018-11            36
5     1001  2018-12            44
6     1002  2018-10            15
7     1002  2018-11            32
8     1002  2018-12            41
9     1003  2018-12            50
# Calculate the volume of internet traffic used by each user per month. Save the result.
internet_per_month = internet.groupby(['user_id', 'month']).agg(total_mb=('mb_used', 'sum'), total_gb=('gb_used', 'sum')).reset_index()
# Round up GB to nearest whole number (what the user would be billed for)
internet_per_month['gb_rounded'] = np.ceil(internet_per_month['total_gb'])
print(internet_per_month.head(10))
   user_id    month  total_mb   total_gb  gb_rounded
0     1000  2018-12   1901.47   1.856904         2.0
1     1001  2018-08   6919.15   6.756982         7.0
2     1001  2018-09  13314.82  13.002754        14.0
3     1001  2018-10  22330.49  21.807119        22.0
4     1001  2018-11  18504.30  18.070605        19.0
5     1001  2018-12  19369.18  18.915215        19.0
6     1002  2018-10   6552.01   6.398447         7.0
7     1002  2018-11  19345.08  18.891680        19.0
8     1002  2018-12  14396.24  14.058828        15.0
9     1003  2018-12  27044.14  26.410293        27.0
# Merge the data for calls, minutes, messages, internet based on user_id and month
# First, merge calls data (which includes both number of calls and minutes)
user_month_data = calls_per_month.merge(minutes_per_month, on=['user_id', 'month'], how='outer')
# Merge messages
user_month_data = user_month_data.merge(messages_per_month, on=['user_id', 'month'], how='outer')
# Merge internet
user_month_data = user_month_data.merge(internet_per_month, on=['user_id', 'month'], how='outer')
# Fill NaN values with 0 (users who didn't use a service that month)
user_month_data = user_month_data.fillna(0)
user_month_data.info()
print()
print(user_month_data.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 2293 non-null int64
1 month 2293 non-null period[M]
2 num_calls 2293 non-null float64
3 total_minutes 2293 non-null float64
4 num_messages 2293 non-null float64
5 total_mb 2293 non-null float64
6 total_gb 2293 non-null float64
7 gb_rounded 2293 non-null float64
dtypes: float64(6), int64(1), period[M](1)
memory usage: 143.4 KB
user_id month num_calls total_minutes num_messages total_mb \
0 1000 2018-12 16.0 124.0 11.0 1901.47
1 1001 2018-08 27.0 187.0 30.0 6919.15
2 1001 2018-09 49.0 326.0 44.0 13314.82
3 1001 2018-10 65.0 411.0 53.0 22330.49
4 1001 2018-11 64.0 441.0 36.0 18504.30
total_gb gb_rounded
0 1.856904 2.0
1 6.756982 7.0
2 13.002754 14.0
3 21.807119 22.0
4 18.070605 19.0
# Add the plan information
# First, add user information
user_month_data = user_month_data.merge(users[['user_id', 'plan', 'reg_date', 'churn_date', 'is_ny_nj', 'city']], on='user_id', how='left')
# Add plan details
user_month_data = user_month_data.merge(plans, left_on='plan', right_on='plan_name', how='left')
user_month_data.info()
print()
print(user_month_data.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 2293 non-null int64
1 month 2293 non-null period[M]
2 num_calls 2293 non-null float64
3 total_minutes 2293 non-null float64
4 num_messages 2293 non-null float64
5 total_mb 2293 non-null float64
6 total_gb 2293 non-null float64
7 gb_rounded 2293 non-null float64
8 plan 2293 non-null object
9 reg_date 2293 non-null datetime64[ns]
10 churn_date 158 non-null datetime64[ns]
11 is_ny_nj 2293 non-null bool
12 city 2293 non-null object
13 messages_included 2293 non-null int64
14 mb_per_month_included 2293 non-null int64
15 minutes_included 2293 non-null int64
16 usd_monthly_pay 2293 non-null int64
17 usd_per_gb 2293 non-null int64
18 usd_per_message 2293 non-null float64
19 usd_per_minute 2293 non-null float64
20 plan_name 2293 non-null object
21 gb_per_month_included 2293 non-null float64
dtypes: bool(1), datetime64[ns](2), float64(9), int64(6), object(3), period[M](1)
memory usage: 378.6+ KB
user_id month num_calls total_minutes num_messages total_mb \
0 1000 2018-12 16.0 124.0 11.0 1901.47
1 1001 2018-08 27.0 187.0 30.0 6919.15
2 1001 2018-09 49.0 326.0 44.0 13314.82
3 1001 2018-10 65.0 411.0 53.0 22330.49
4 1001 2018-11 64.0 441.0 36.0 18504.30
total_gb gb_rounded plan reg_date ... \
0 1.856904 2.0 ultimate 2018-12-24 ...
1 6.756982 7.0 surf 2018-08-13 ...
2 13.002754 14.0 surf 2018-08-13 ...
3 21.807119 22.0 surf 2018-08-13 ...
4 18.070605 19.0 surf 2018-08-13 ...
city messages_included \
0 Atlanta-Sandy Springs-Roswell, GA MSA 1000
1 Seattle-Tacoma-Bellevue, WA MSA 50
2 Seattle-Tacoma-Bellevue, WA MSA 50
3 Seattle-Tacoma-Bellevue, WA MSA 50
4 Seattle-Tacoma-Bellevue, WA MSA 50
mb_per_month_included minutes_included usd_monthly_pay usd_per_gb \
0 30720 3000 70 7
1 15360 500 20 10
2 15360 500 20 10
3 15360 500 20 10
4 15360 500 20 10
usd_per_message usd_per_minute plan_name gb_per_month_included
0 0.01 0.01 ultimate 30.0
1 0.03 0.03 surf 15.0
2 0.03 0.03 surf 15.0
3 0.03 0.03 surf 15.0
4 0.03 0.03 surf 15.0
[5 rows x 22 columns]
# Calculate the monthly revenue for each user
# Revenue = monthly fee + overage charges for minutes, messages, and data
#Calculate overage for each service (overage = max(0, used - included))
user_month_data['minutes_overage'] = np.maximum(0, user_month_data['total_minutes'] - user_month_data['minutes_included'])
user_month_data['messages_overage'] = np.maximum(0, user_month_data['num_messages'] - user_month_data['messages_included'])
user_month_data['gb_overage'] = np.maximum(0, user_month_data['gb_rounded'] - user_month_data['gb_per_month_included'])
# Calculate overage charges
user_month_data['minutes_overage_charge'] = user_month_data['minutes_overage'] * user_month_data['usd_per_minute']
user_month_data['messages_overage_charge'] = user_month_data['messages_overage'] * user_month_data['usd_per_message']
user_month_data['gb_overage_charge'] = user_month_data['gb_overage'] * user_month_data['usd_per_gb']
# Calculate total monthly revenue
user_month_data['monthly_revenue'] = (user_month_data['usd_monthly_pay'] + user_month_data['minutes_overage_charge'] + user_month_data['messages_overage_charge'] + user_month_data['gb_overage_charge'])
print(user_month_data[['user_id', 'month', 'usd_monthly_pay', 'minutes_overage_charge', 'messages_overage_charge', 'gb_overage_charge', 'monthly_revenue']].head(10))
   user_id    month  usd_monthly_pay  minutes_overage_charge  \
0     1000  2018-12               70                    0.00
1     1001  2018-08               20                    0.00
2     1001  2018-09               20                    0.00
3     1001  2018-10               20                    0.00
4     1001  2018-11               20                    0.00
5     1001  2018-12               20                    0.00
6     1002  2018-10               20                    0.00
7     1002  2018-11               20                    0.00
8     1002  2018-12               20                    0.00
9     1003  2018-12               20                   19.05

   messages_overage_charge  gb_overage_charge  monthly_revenue
0                     0.00                0.0            70.00
1                     0.00                0.0            20.00
2                     0.00                0.0            20.00
3                     0.09               70.0            90.09
4                     0.00               40.0            60.00
5                     0.00               40.0            60.00
6                     0.00                0.0            20.00
7                     0.00               40.0            60.00
8                     0.00                0.0            20.00
9                     0.00              120.0           159.05
Study user behaviour¶
Calls¶
# Compare the average monthly call minutes per user for each plan in each distinct month. Plot a bar plot to visualize it.
avg_minutes_by_plan_month = user_month_data.groupby(['plan', 'month'])['total_minutes'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_minutes = avg_minutes_by_plan_month.pivot(index='month', columns='plan', values='total_minutes')
# Plot bar chart
pivot_minutes.plot(kind='bar', title='Average call minutes per month by plan', xlabel='Month', ylabel='Average Minutes', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the number of minutes users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['total_minutes'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly minutes by plan',xlabel='Minutes', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(500, color='blue', linestyle='--', label='Surf limit (500 min)', linewidth=2)
plt.axvline(3000, color='orange', linestyle='--', label='Ultimate limit (3000 min)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly call duration
call_stats = user_month_data.groupby('plan')['total_minutes'].agg(['mean', 'var', 'std', 'median'])
call_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 440.324221 | 57743.745959 | 240.299284 | 438.0 |
| ultimate | 442.431944 | 60876.012052 | 246.730647 | 437.5 |
# Plot a boxplot to visualize the distribution of the monthly call duration
user_month_data.boxplot(column='total_minutes', by='plan', figsize=(10, 6))
plt.title('Distribution of Monthly Call Minutes by Plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total Minutes')
plt.show()
Conclusions on call behavior:¶
Based on the analysis of call data:
- Despite Ultimate offering 3000 minutes vs Surf's 500 minutes, users consume similar amounts
- This suggests people choose plans based on perceived needs or risk aversion rather than actual usage patterns
- Ultimate users substantially under-utilize their allowance, averaging about 440 of their 3000 included minutes
Messages¶
# Compare the number of messages users of each plan tend to send each month
avg_messages_by_plan_month = user_month_data.groupby(['plan', 'month'])['num_messages'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_messages = avg_messages_by_plan_month.pivot(index='month', columns='plan', values='num_messages')
# Plot bar chart
pivot_messages.plot(kind='bar', title='Average messages per month by plan', xlabel='Month', ylabel='Average Messages', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the number of messages users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['num_messages'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly messages by plan',xlabel='Messages', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(50, color='blue', linestyle='--', label='Surf limit (50 messages)', linewidth=2)
plt.axvline(1000, color='orange', linestyle='--', label='Ultimate limit (1000 messages)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly messages
message_stats = user_month_data.groupby('plan')['num_messages'].agg(['mean', 'var', 'std', 'median'])
message_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 31.159568 | 1126.724522 | 33.566717 | 24.0 |
| ultimate | 37.551389 | 1208.756744 | 34.767179 | 30.0 |
# Plot a boxplot to visualize the distribution of monthly messages
user_month_data.boxplot(column='num_messages', by='plan', figsize=(10, 6))
plt.title('Distribution of monthly messages by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total Messages')
plt.show()
Conclusions on Message Behavior:¶
Based on the analysis of message data:
- Message usage is low relative to the allowances, especially the Ultimate plan's 1000-message limit
- Plan choice does not meaningfully affect messaging behavior (Surf users average about 31 messages per month vs about 38 for Ultimate)
- Message allowances are over-provisioned, especially for Ultimate
- Message volume increased over the course of 2018
Internet¶
# Compare the internet usage users of each plan tend to use each month
avg_internet_by_plan_month = user_month_data.groupby(['plan', 'month'])['gb_rounded'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_internet = avg_internet_by_plan_month.pivot(index='month', columns='plan', values='gb_rounded')
# Plot bar chart
pivot_internet.plot(kind='bar', title='Average GB used per month by plan', xlabel='Month', ylabel='Average GB used', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the internet usage users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['gb_rounded'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly internet usage by plan',xlabel='GB usage', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(15, color='blue', linestyle='--', label='Surf limit (15 GB)', linewidth=2)
plt.axvline(30, color='orange', linestyle='--', label='Ultimate limit (30 GB)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly internet usage
internet_stats = user_month_data.groupby('plan')['gb_rounded'].agg(['mean', 'var', 'std', 'median'])
internet_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 16.670693 | 61.58360 | 7.847522 | 17.0 |
| ultimate | 17.306944 | 58.83055 | 7.670108 | 17.0 |
# Plot a boxplot to visualize the distribution of monthly internet usage
user_month_data.boxplot(column='gb_rounded', by='plan', figsize=(10, 6))
plt.title('Distribution of internet usage by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total GB used')
plt.show()
Conclusions on Internet Usage Behavior:¶
Based on the analysis of internet usage data:
- Nearly identical usage patterns across both plans
- Surf users frequently exceed their allowance
- Ultimate users stay well within their limit
Revenue¶
# Compare the revenue each plan tends to generate each month
avg_revenue_by_plan_month = user_month_data.groupby(['plan', 'month'])['monthly_revenue'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_revenue = avg_revenue_by_plan_month.pivot(index='month', columns='plan', values='monthly_revenue')
# Plot bar chart
pivot_revenue.plot(kind='bar', title='Average revenue generated per month by plan', xlabel='Month', ylabel='Average revenue', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the revenue generated by users for each plan. Plot a histogram.
user_month_data.groupby('plan')['monthly_revenue'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly revenue by plan', xlabel='Monthly Revenue ($)', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(20, color='blue', linestyle='--', label='Surf base fee ($20)', linewidth=2)
plt.axvline(70, color='orange', linestyle='--', label='Ultimate base fee ($70)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the revenue
revenue_stats = user_month_data.groupby('plan')['monthly_revenue'].agg(['mean', 'var', 'std', 'median'])
revenue_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 60.899053 | 3075.367902 | 55.456000 | 40.42 |
| ultimate | 72.313889 | 129.848486 | 11.395108 | 70.00 |
# Plot a boxplot to visualize the distribution of monthly revenue
user_month_data.boxplot(column='monthly_revenue', by='plan', figsize=(10, 6))
plt.title('Distribution of revenue by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Revenue Generated')
plt.show()
Revenue Analysis Summary:¶
1. Ultimate generates more revenue per user:
- Ultimate average: 72.31 dollars per month vs Surf average: 60.90 dollars per month (about 19 percent higher)
- Ultimate is more profitable per customer
2. Surf revenue is unpredictable due to overage charges:
- Surf median (40.42 dollars) is double the 20 dollar base fee - most users pay overage charges (quantified in the sketch after this summary)
- High variability (standard deviation: 55.46 dollars) means bills fluctuate significantly month-to-month
- Primary driver: data overages from exceeding the 15 GB limit
3. Ultimate revenue is stable and predictable:
- Ultimate median (70.00 dollars) matches the base fee - most users do not pay overages
- Low variability (standard deviation: 11.40 dollars) means consistent monthly bills
- Users stay within their allowances, generating reliable revenue
4. Revenue increased throughout 2018:
- Surf revenue grew from roughly 20 dollars per user early in the year to about 70 dollars by December as data usage and overages grew
- Ultimate remained stable at 65 to 75 dollars throughout the year
- By December, both plans generated similar revenue (about 70 to 75 dollars per month)
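As a quick check of the claim that most Surf users pay overage charges, here is a small sketch (not part of the original cells) that computes the share of Surf user-months billed above the 20 dollar base fee:
# Share of Surf user-months where the bill exceeded the base fee
surf_months = user_month_data[user_month_data['plan'] == 'surf']
share_with_overage = (surf_months['monthly_revenue'] > 20).mean()
print(f'Share of Surf user-months with overage charges: {share_with_overage:.1%}')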
Test statistical hypotheses¶
Hypothesis 1: Average revenue from Ultimate and Surf plans differs¶
Hypotheses:
- Null hypothesis (H₀): The average revenue from users of Ultimate and Surf calling plans is equal
- Alternative hypothesis (H₁): The average revenue from users of Ultimate and Surf calling plans differs
Statistical test: Two-sample t-test (independent samples) using scipy.stats.ttest_ind()
Significance level: α = 0.05 (critical statistical significance level)
Reasoning: We use a two-sample t-test because we are comparing the means of two independent groups (Surf users vs Ultimate users). We'll first check the variances of both groups to determine whether to set equal_var=True or equal_var=False.
To ensure the independence assumption of the t-test is satisfied, I first aggregated the data to the user level. For each user, I calculated their average monthly revenue across all 12 months of 2018. This ensures that each observation in the test represents one independent user rather than multiple correlated months from the same user.
# Calculate average monthly revenue per user
user_avg_revenue = user_month_data.groupby('user_id').agg({'monthly_revenue': 'mean','plan': 'first', 'is_ny_nj': 'first'}).reset_index()
# Rename column for clarity
user_avg_revenue.rename(columns={'monthly_revenue': 'avg_monthly_revenue'}, inplace=True)
print(user_avg_revenue.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490 entries, 0 to 489
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              490 non-null    int64
 1   avg_monthly_revenue  490 non-null    float64
 2   plan                 490 non-null    object
 3   is_ny_nj             490 non-null    bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 12.1+ KB
None
# Extract revenue data for each plan
surf_revenue = user_avg_revenue[user_avg_revenue['plan'] == 'surf']['avg_monthly_revenue']
ultimate_revenue = user_avg_revenue[user_avg_revenue['plan'] == 'ultimate']['avg_monthly_revenue']
# Set significance level
alpha = 0.05
# Check variances to decide on equal_var parameter
print('Variance check:')
print(f'Surf variance: {surf_revenue.var():.2f}')
print(f'Ultimate variance: {ultimate_revenue.var():.2f}')
print(f'Variance ratio: {surf_revenue.var() / ultimate_revenue.var():.2f}')
print()
Variance check:
Surf variance: 1902.16
Ultimate variance: 55.13
Variance ratio: 34.51
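As an optional complement to the simple variance ratio (a sketch, not part of the original analysis), Levene's test from scipy.stats can formally assess whether the two variances differ before choosing the equal_var setting:
# Optional: Levene's test for equality of variances
levene_stat, levene_p = st.levene(surf_revenue, ultimate_revenue)
print(f'Levene statistic: {levene_stat:.2f}, p-value: {levene_p:.4f}')
# A small p-value (< 0.05) would indicate unequal variances, supporting equal_var=False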
# Perform two-sample t-test
# Since variances appear different, we set equal_var=False
results = st.ttest_ind(surf_revenue, ultimate_revenue, equal_var=False)
print(f'Surf plan - Mean revenue: ${surf_revenue.mean():.2f}')
print(f'Ultimate plan - Mean revenue: ${ultimate_revenue.mean():.2f}')
print()
print(f'p-value: {results.pvalue}')
print()
if results.pvalue < alpha:
print("We reject the null hypothesis")
print('Conclusion: There IS a statistically significant difference in average revenue')
print('between the Surf and Ultimate plans.')
else:
print("We can't reject the null hypothesis")
print('Conclusion: There is NO statistically significant difference in average revenue')
print('between the Surf and Ultimate plans.')
Surf plan - Mean revenue: $58.38
Ultimate plan - Mean revenue: $71.82

p-value: 8.692101224680108e-08

We reject the null hypothesis
Conclusion: There IS a statistically significant difference in average revenue
between the Surf and Ultimate plans.
Hypothesis 2: Average revenue from NY-NJ area differs from other regions¶
Hypotheses:
- Null hypothesis (H₀): The average revenue from users in NY-NJ area is equal to that of users from other regions
- Alternative hypothesis (H₁): The average revenue from users in NY-NJ area differs from that of users from other regions
Statistical test: Two-sample t-test (independent samples) using scipy.stats.ttest_ind()
Significance level: α = 0.05 (critical statistical significance level)
Reasoning: Similar to the first hypothesis, we use a two-sample t-test to compare means between two independent groups (NY-NJ users vs other region users). We'll check the variances to determine the appropriate equal_var parameter.
Similar to Hypothesis 1, I aggregated the data to the user level by calculating each user's average monthly revenue across all months. This aggregation ensures the independence assumption is met, as we're comparing average revenues from distinct users rather than multiple observations from the same users.
# Extract revenue data for each region
ny_nj_revenue = user_avg_revenue[user_avg_revenue['is_ny_nj'] == True]['avg_monthly_revenue']
other_revenue = user_avg_revenue[user_avg_revenue['is_ny_nj'] == False]['avg_monthly_revenue']
# Set significance level
alpha = 0.05 # critical statistical significance level
# Check variances to decide on equal_var parameter
print('Variance check:')
print(f'NY-NJ variance: {ny_nj_revenue.var():.2f}')
print(f'Other regions variance: {other_revenue.var():.2f}')
print(f'Variance ratio: {ny_nj_revenue.var() / other_revenue.var():.2f}')
print()
Variance check:
NY-NJ variance: 1071.36
Other regions variance: 1402.89
Variance ratio: 0.76
# Perform two-sample t-test
# Variances are similar (ratio of about 0.76), so we set equal_var=True
results = st.ttest_ind(ny_nj_revenue, other_revenue, equal_var=True)
print(f'NY-NJ area - Mean revenue: ${ny_nj_revenue.mean():.2f}')
print(f'Other regions - Mean revenue: ${other_revenue.mean():.2f}')
print()
print(f'p-value: {results.pvalue}')
print()
if results.pvalue < alpha:
print("We reject the null hypothesis")
print('Conclusion: There IS a statistically significant difference in average revenue')
print('between NY-NJ area and other regions.')
else:
print("We can't reject the null hypothesis")
print('Conclusion: There is NO statistically significant difference in average revenue')
print('between NY-NJ area and other regions.')
NY-NJ area - Mean revenue: $60.04
Other regions - Mean revenue: $63.19

p-value: 0.48516408450209136

We can't reject the null hypothesis
Conclusion: There is NO statistically significant difference in average revenue
between NY-NJ area and other regions.
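To complement the p-value, the sketch below (not in the original run) computes a 95 percent confidence interval for the difference in mean revenue between the two regions, using the pooled-variance formula consistent with equal_var=True; because the test was not significant, the interval should contain zero.
# Sketch: 95% confidence interval for the difference in means (pooled variance)
n1, n2 = len(ny_nj_revenue), len(other_revenue)
mean_diff = ny_nj_revenue.mean() - other_revenue.mean()
pooled_var = ((n1 - 1) * ny_nj_revenue.var() + (n2 - 1) * other_revenue.var()) / (n1 + n2 - 2)
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = st.t.ppf(0.975, df=n1 + n2 - 2)
print(f'95% CI for the difference: [{mean_diff - t_crit * std_err:.2f}, {mean_diff + t_crit * std_err:.2f}]')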
General Conclusion¶
Key Findings¶
This analysis of Megaline's Surf and Ultimate prepaid plans reveals important insights about customer behavior, revenue generation, and strategic opportunities for the business.
1. Revenue Performance¶
Ultimate plan generates higher revenue per customer:
- Ultimate users contribute an average of 71.82 dollars per month compared to Surf users at 58.38 dollars per month (23 percent higher)
- Statistical testing confirms this difference is significant (p-value = 8.692101224680108e-08)
However, Surf plan dominates in customer volume:
- Surf attracts 339 users (68 percent of customer base) versus Ultimate's 161 users (32 percent)
- The lower 20 dollar price point makes Surf more attractive for customer acquisition
- By December 2018, Surf revenue converged with Ultimate (both generating approximately 70-75 dollars per month) due to accumulated overage charges
2. Customer Behavior Patterns¶
Usage is similar across plans despite different allowances:
- Call minutes: Both plans average around 440 minutes per month (median values nearly identical)
- Messages: Surf users average 31.2 messages vs Ultimate users at 37.6 messages (minimal difference)
- Internet data: Surf users average 16.7 GB vs Ultimate users at 17.3 GB (only 0.6 GB difference)
Critical insight: Customer behavior is driven by personal habits rather than plan limits. Users consume what they need regardless of their allowance.
3. Overage Patterns and Revenue Stability¶
Surf plan shows high revenue variability:
- Median revenue (40.42 dollars) is double the base fee, indicating most users incur overage charges
- High standard deviation (55.46 dollars) means bills fluctuate significantly month-to-month
- Primary driver: data overages from exceeding the 15 GB limit
- Revenue increased throughout 2018 as data usage grew
Ultimate plan provides stable, predictable revenue:
- Median revenue (70.00 dollars) matches the base fee exactly - most users stay within allowances
- Low standard deviation (11.40 dollars) indicates consistent monthly bills
- Users are significantly over-provisioned, using less than 60 percent of data allowance and less than 15 percent of minute allowance
- Stable revenue stream throughout 2018 (65-75 dollar range)
4. Regional Differences¶
- NY-NJ average: 60.04 dollars per month
- Other regions average: 63.19 dollars per month
- Difference: 3.15 dollars per month (about 5 percent lower in NY-NJ)
- Statistical testing (p-value = 0.485) indicates this difference is not statistically significant
Strategic Recommendations¶
Based on this analysis, I recommend a dual-strategy approach rather than focusing exclusively on one plan:
Maintain Both Plans with Distinct Positioning¶
Ultimate Plan Strategy:
- Target customers who value predictability and peace of mind
- Position as the premium, worry-free option for heavy users and business customers
- Emphasize stable billing and no surprise charges in marketing
- Continue to provide stable, high-margin revenue stream
Surf Plan Strategy:
- Continue using as primary customer acquisition tool due to attractive entry price
- Accept revenue variability as trade-off for higher customer volume
- Monitor customer satisfaction regarding data overage frequency
- Consider as gateway product for eventual upsell to Ultimate
Consider Plan Modifications¶
Adjust Surf plan data allowance:
- Current 15 GB limit is increasingly inadequate as average usage approaches 17 GB
- Recommendation: Increase to 18-20 GB to reduce customer frustration while maintaining overage revenue potential
- This could improve customer satisfaction without significantly impacting revenue; a rough estimate of the forgone overage revenue is sketched after this list
Introduce a mid-tier plan:
- Create a 40-50 dollar per month option with 1000-1500 minutes and 20-25 GB data
- Capture customers who find Surf too limiting but Ultimate excessive
- Fill the gap between budget-conscious and premium segments
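To ground the allowance recommendation, here is a rough sketch (reusing user_month_data from the analysis above; the 20 GB figure is hypothetical) of how much data-overage revenue would be forgone if the Surf allowance were raised:
# Sketch: Surf data-overage revenue today vs. with a hypothetical 20 GB allowance
surf_months = user_month_data[user_month_data['plan'] == 'surf']
current_data_revenue = (np.maximum(0, surf_months['gb_rounded'] - 15) * 10).sum()   # 15 GB included, $10/GB
proposed_data_revenue = (np.maximum(0, surf_months['gb_rounded'] - 20) * 10).sum()  # hypothetical 20 GB allowance
print(f'Current data-overage revenue:  ${current_data_revenue:,.2f}')
print(f'With a 20 GB allowance:        ${proposed_data_revenue:,.2f}')
print(f'Estimated revenue forgone:     ${current_data_revenue - proposed_data_revenue:,.2f}')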
Market Segmentation Approach¶
Surf Plan: Budget-conscious users, light-to-moderate users, students, price-sensitive customers
Mid-tier Plan (if introduced): Moderate users seeking value without overage risk
Ultimate Plan: Heavy users, business customers, those prioritizing bill stability and peace of mind
Final Recommendation¶
Do not focus exclusively on either plan. Both Surf and Ultimate serve critical but different business purposes:
- Surf drives customer acquisition through attractive pricing and captures a larger market share
- Ultimate maximizes per-customer profitability and provides revenue stability
The optimal strategy is to maintain both plans while optimizing each for its target customer segment. As data usage continues to grow (as evidenced by the steady growth in usage and overage charges throughout 2018), consider adjusting plan parameters to balance customer satisfaction with revenue generation.
The company should leverage its diverse plan portfolio to serve different customer needs while building a foundation for long-term revenue growth through strategic upselling and plan optimization.