Which one is a better plan?¶
You work as an analyst for the telecom operator Megaline. The company offers its clients two prepaid plans, Surf and Ultimate. The commercial department wants to know which of the plans brings in more revenue in order to adjust the advertising budget.
You are going to carry out a preliminary analysis of the plans based on a relatively small client selection. You'll have the data on 500 Megaline clients: who the clients are, where they're from, which plan they use, and the number of calls they made and text messages they sent in 2018. Your job is to analyze the clients' behavior and determine which prepaid plan brings in more revenue.
Project Overview¶
The purpose of this project is to analyze customer data from Megaline's two prepaid plans (Surf and Ultimate) to determine which plan generates more revenue. I will:
- Load and explore data from multiple sources (users, calls, messages, internet usage, and plans)
- Clean and prepare the data for analysis
- Calculate monthly usage and revenue for each customer
- Analyze customer behavior patterns across different plans
- Test statistical hypotheses about revenue differences between plans and regions
- Draw conclusions to help the commercial department make data-driven decisions
Initialization¶
# Loading all the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
Load data¶
# Load the data files into different DataFrames
plans = pd.read_csv('../data/megaline_plans.csv')
users = pd.read_csv('../data/megaline_users.csv')
calls = pd.read_csv('../data/megaline_calls.csv')
messages = pd.read_csv('../data/megaline_messages.csv')
internet = pd.read_csv('../data/megaline_internet.csv')
Prepare the data¶
Plans¶
# Print the general/summary information about the plans' DataFrame
print(plans.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   messages_included      2 non-null      int64
 1   mb_per_month_included  2 non-null      int64
 2   minutes_included       2 non-null      int64
 3   usd_monthly_pay        2 non-null      int64
 4   usd_per_gb             2 non-null      int64
 5   usd_per_message        2 non-null      float64
 6   usd_per_minute         2 non-null      float64
 7   plan_name              2 non-null      object
dtypes: float64(2), int64(5), object(1)
memory usage: 260.0+ bytes
None
# Print a sample of data for plans
print(plans)
   messages_included  mb_per_month_included  minutes_included  \
0                 50                  15360               500
1               1000                  30720              3000

   usd_monthly_pay  usd_per_gb  usd_per_message  usd_per_minute plan_name
0               20          10             0.03            0.03      surf
1               70           7             0.01            0.01  ultimate
Observations about Plans data:¶
The plans dataset contains information about two plans: Surf and Ultimate. All data types appear appropriate (numeric values for prices and limits, string for plan name). There are no missing values.
Key observations:
- Surf plan: 20 dollars/month with 500 minutes, 50 messages, and 15360 MB (15 GB) included
- Ultimate plan: 70 dollars/month with 3000 minutes, 1000 messages, and 30720 MB (30 GB) included
- Overage charges are lower for the Ultimate plan (0.01 vs 0.03 dollars per minute and per message, 7 dollars vs 10 dollars per GB); the billing arithmetic is sketched below
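To make the billing rules concrete, here is a minimal sketch of the overage arithmetic for a hypothetical Surf user (600 minutes, 70 messages, 18 GB in one month); the numbers are illustrative only, and the same logic is applied to the real data later in the notebook.
# Hypothetical example: one month on the Surf plan with 600 minutes, 70 messages, 18 GB
base_fee = 20                          # Surf monthly fee
extra_minutes = max(0, 600 - 500)      # 100 minutes over the allowance
extra_messages = max(0, 70 - 50)       # 20 messages over the allowance
extra_gb = max(0, 18 - 15)             # 3 GB over the allowance
bill = base_fee + extra_minutes * 0.03 + extra_messages * 0.03 + extra_gb * 10
print(bill)  # 20 + 3.00 + 0.60 + 30.00 = 53.60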
Fix data¶
# No fixes needed for plans data
Enrich data¶
# Add a column for GB/month included with each plan
plans['gb_per_month_included'] = plans['mb_per_month_included'] / 1024
print(plans)
   messages_included  mb_per_month_included  minutes_included  \
0                 50                  15360               500
1               1000                  30720              3000

   usd_monthly_pay  usd_per_gb  usd_per_message  usd_per_minute plan_name  \
0               20          10             0.03            0.03      surf
1               70           7             0.01            0.01  ultimate

   gb_per_month_included
0                   15.0
1                   30.0
Users¶
# Print the general/summary information about the users' DataFrame
print(users.info())
print()
print(users.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 500 non-null int64
1 first_name 500 non-null object
2 last_name 500 non-null object
3 age 500 non-null int64
4 city 500 non-null object
5 reg_date 500 non-null object
6 plan 500 non-null object
7 churn_date 34 non-null object
dtypes: int64(2), object(6)
memory usage: 31.4+ KB
None
user_id age
count 500.000000 500.000000
mean 1249.500000 45.486000
std 144.481833 16.972269
min 1000.000000 18.000000
25% 1124.750000 30.000000
50% 1249.500000 46.000000
75% 1374.250000 61.000000
max 1499.000000 75.000000
# Print a sample of data for users
print(users.head(10))
print(f'\nTotal users: {len(users)}')
print(f"\nPlan distribution:\n{users['plan'].value_counts()}")
user_id first_name last_name age city \
0 1000 Anamaria Bauer 45 Atlanta-Sandy Springs-Roswell, GA MSA
1 1001 Mickey Wilkerson 28 Seattle-Tacoma-Bellevue, WA MSA
2 1002 Carlee Hoffman 36 Las Vegas-Henderson-Paradise, NV MSA
3 1003 Reynaldo Jenkins 52 Tulsa, OK MSA
4 1004 Leonila Thompson 40 Seattle-Tacoma-Bellevue, WA MSA
5 1005 Livia Shields 31 Dallas-Fort Worth-Arlington, TX MSA
6 1006 Jesusa Bradford 73 San Francisco-Oakland-Berkeley, CA MSA
7 1007 Eusebio Welch 42 Grand Rapids-Kentwood, MI MSA
8 1008 Emely Hoffman 53 Orlando-Kissimmee-Sanford, FL MSA
9 1009 Gerry Little 19 San Jose-Sunnyvale-Santa Clara, CA MSA
reg_date plan churn_date
0 2018-12-24 ultimate NaN
1 2018-08-13 surf NaN
2 2018-10-21 surf NaN
3 2018-01-28 surf NaN
4 2018-05-23 surf NaN
5 2018-11-29 surf NaN
6 2018-11-27 ultimate 2018-12-18
7 2018-07-11 surf NaN
8 2018-08-03 ultimate NaN
9 2018-04-22 surf NaN
Total users: 500
Plan distribution:
plan
surf 339
ultimate 161
Name: count, dtype: int64
Observations about Users data:¶
The users dataset contains 500 records with user demographics and plan information.
Issues identified:
- Date columns (reg_date and churn_date) are stored as objects instead of datetime format

Other observations:
- The churn_date column has many missing values (users who haven't churned), which is expected
- Most users (339 of 500) are on the Surf plan
Fix Data¶
# Convert date columns to datetime format
users['reg_date'] = pd.to_datetime(users['reg_date'])
users['churn_date'] = pd.to_datetime(users['churn_date'])
print(users.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     500 non-null    int64
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64
 4   city        500 non-null    object
 5   reg_date    500 non-null    datetime64[ns]
 6   plan        500 non-null    object
 7   churn_date  34 non-null     datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(4)
memory usage: 31.4+ KB
None
Enrich Data¶
# Create a simpler city category to identify NY-NJ area for hypothesis testing
users['is_ny_nj'] = users['city'].str.contains('NY-NJ', na=False)
print(users.info())
print(f"\nUsers in NY-NJ area: {users['is_ny_nj'].sum()}")
print(f"Users in other areas: {(~users['is_ny_nj']).sum()}")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   user_id     500 non-null    int64
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64
 4   city        500 non-null    object
 5   reg_date    500 non-null    datetime64[ns]
 6   plan        500 non-null    object
 7   churn_date  34 non-null     datetime64[ns]
 8   is_ny_nj    500 non-null    bool
dtypes: bool(1), datetime64[ns](2), int64(2), object(4)
memory usage: 31.9+ KB
None

Users in NY-NJ area: 80
Users in other areas: 420
Calls¶
# Print the general/summary information about the calls' DataFrame
print(calls.info())
print()
print(calls.describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 137735 non-null object
1 user_id 137735 non-null int64
2 call_date 137735 non-null object
3 duration 137735 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.2+ MB
None
user_id duration
count 137735.000000 137735.000000
mean 1247.658046 6.745927
std 139.416268 5.839241
min 1000.000000 0.000000
25% 1128.000000 1.290000
50% 1247.000000 5.980000
75% 1365.000000 10.690000
max 1499.000000 37.600000
# Print a sample of data for calls
print(calls.head(10))
         id  user_id   call_date  duration
0   1000_93     1000  2018-12-27      8.52
1  1000_145     1000  2018-12-27     13.66
2  1000_247     1000  2018-12-27     14.48
3  1000_309     1000  2018-12-28      5.76
4  1000_380     1000  2018-12-30      4.22
5  1000_388     1000  2018-12-31      2.20
6  1000_510     1000  2018-12-27      5.75
7  1000_521     1000  2018-12-28     14.18
8  1000_530     1000  2018-12-28      5.77
9  1000_544     1000  2018-12-26      4.40
Observations about Calls data:¶
The calls dataset contains individual call records with user_id, call_date, and duration (in minutes).
Issues identified:
- The call_date column is stored as object type instead of datetime

Other observations:
- Duration is recorded in minutes with fractional values, so it must be rounded up to whole minutes for billing
- Some calls have zero duration; these will be billed as 1 minute
Fix data¶
# Convert call_date to datetime format
calls['call_date'] = pd.to_datetime(calls['call_date'])
#Replace 0-duration calls with 1 minute
calls.loc[calls['duration'] == 0, 'duration'] = 1
print(calls.info())
print()
print('Duration statistics after correction:')
print(calls['duration'].describe())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   id         137735 non-null  object
 1   user_id    137735 non-null  int64
 2   call_date  137735 non-null  datetime64[ns]
 3   duration   137735 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 4.2+ MB
None

Duration statistics after correction:
count    137735.000000
mean          6.940750
std           5.623616
min           0.010000
25%           1.290000
50%           5.980000
75%          10.690000
max          37.600000
Name: duration, dtype: float64
Enrich data¶
# Extract month from call_date for aggregation
calls['month'] = calls['call_date'].dt.to_period('M')
# Round up call duration to nearest minute (what the user would be billed for)
calls['duration_rounded'] = np.ceil(calls['duration'])
print(calls.head(10))
         id  user_id  call_date  duration    month  duration_rounded
0   1000_93     1000 2018-12-27      8.52  2018-12               9.0
1  1000_145     1000 2018-12-27     13.66  2018-12              14.0
2  1000_247     1000 2018-12-27     14.48  2018-12              15.0
3  1000_309     1000 2018-12-28      5.76  2018-12               6.0
4  1000_380     1000 2018-12-30      4.22  2018-12               5.0
5  1000_388     1000 2018-12-31      2.20  2018-12               3.0
6  1000_510     1000 2018-12-27      5.75  2018-12               6.0
7  1000_521     1000 2018-12-28     14.18  2018-12              15.0
8  1000_530     1000 2018-12-28      5.77  2018-12               6.0
9  1000_544     1000 2018-12-26      4.40  2018-12               5.0
Messages¶
# Print the general/summary information about the messages' DataFrame
print(messages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            76051 non-null  object
 1   user_id       76051 non-null  int64
 2   message_date  76051 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
None
# Print a sample of data for messages
print(messages.head(10))
         id  user_id message_date
0  1000_125     1000   2018-12-27
1  1000_160     1000   2018-12-31
2  1000_223     1000   2018-12-31
3  1000_251     1000   2018-12-27
4  1000_255     1000   2018-12-26
5  1000_346     1000   2018-12-29
6  1000_386     1000   2018-12-30
7  1000_413     1000   2018-12-31
8  1000_502     1000   2018-12-27
9  1000_525     1000   2018-12-28
Observations about Messages data:¶
The messages dataset contains individual message records with user_id and message_date. Each row represents one message sent.
Issues identified:
- The message_date column is stored as object type instead of datetime
Fix data¶
# Convert message_date to datetime format
messages['message_date'] = pd.to_datetime(messages['message_date'])
print(messages.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            76051 non-null  object
 1   user_id       76051 non-null  int64
 2   message_date  76051 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 1.7+ MB
None
Enrich data¶
# Extract month from message_date for aggregation
messages['month'] = messages['message_date'].dt.to_period('M')
print(messages.head(10))
         id  user_id message_date    month
0  1000_125     1000   2018-12-27  2018-12
1  1000_160     1000   2018-12-31  2018-12
2  1000_223     1000   2018-12-31  2018-12
3  1000_251     1000   2018-12-27  2018-12
4  1000_255     1000   2018-12-26  2018-12
5  1000_346     1000   2018-12-29  2018-12
6  1000_386     1000   2018-12-30  2018-12
7  1000_413     1000   2018-12-31  2018-12
8  1000_502     1000   2018-12-27  2018-12
9  1000_525     1000   2018-12-28  2018-12
Internet¶
# Print the general/summary information about the internet DataFrame
print(internet.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            104825 non-null  object
 1   user_id       104825 non-null  int64
 2   session_date  104825 non-null  object
 3   mb_used       104825 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.2+ MB
None
# Print a sample of data for the internet traffic
print(internet.head(10))
         id  user_id session_date  mb_used
0   1000_13     1000   2018-12-29    89.86
1  1000_204     1000   2018-12-31     0.00
2  1000_379     1000   2018-12-28   660.40
3  1000_413     1000   2018-12-26   270.99
4  1000_442     1000   2018-12-27   880.22
5    1001_0     1001   2018-08-24   284.68
6    1001_3     1001   2018-12-09   656.04
7    1001_4     1001   2018-11-04    16.97
8   1001_10     1001   2018-11-27   135.18
9   1001_15     1001   2018-12-13   761.92
Observations about Internet data:¶
The internet dataset contains internet session records with user_id, session_date, and mb_used (megabytes).
Issues identified:
- The session_date column is stored as object type instead of datetime

Other observations:
- Usage is recorded in MB, but plan pricing is per GB; billing rounds the monthly total up to whole gigabytes (see the sketch below)
- Some sessions show 0 MB used
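Because data is billed by rounding the monthly total up to the nearest GB rather than rounding each session, the order of aggregation matters. A minimal sketch with made-up session sizes:
# Hypothetical sessions for one user in one month, in MB
example_sessions_mb = np.array([300.0, 512.0, 700.0])
# Rounding each session up separately would overbill: 1 + 1 + 1 = 3 GB
per_session_gb = np.ceil(example_sessions_mb / 1024).sum()
# Summing first and then rounding the monthly total gives ceil(1512 / 1024) = 2 GB
monthly_gb = np.ceil(example_sessions_mb.sum() / 1024)
print(per_session_gb, monthly_gb)  # 3.0 2.0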
Fix data¶
# Convert session_date to datetime format
internet['session_date'] = pd.to_datetime(internet['session_date'])
print(internet.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   id            104825 non-null  object
 1   user_id       104825 non-null  int64
 2   session_date  104825 non-null  datetime64[ns]
 3   mb_used       104825 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 3.2+ MB
None
Enrich data¶
# Extract month from session_date for aggregation
internet['month'] = internet['session_date'].dt.to_period('M')
# Convert MB to GB for easier comparison with plan limits
internet['gb_used'] = internet['mb_used'] / 1024
print(internet.head(10))
         id  user_id session_date  mb_used    month   gb_used
0   1000_13     1000   2018-12-29    89.86  2018-12  0.087754
1  1000_204     1000   2018-12-31     0.00  2018-12  0.000000
2  1000_379     1000   2018-12-28   660.40  2018-12  0.644922
3  1000_413     1000   2018-12-26   270.99  2018-12  0.264639
4  1000_442     1000   2018-12-27   880.22  2018-12  0.859590
5    1001_0     1001   2018-08-24   284.68  2018-08  0.278008
6    1001_3     1001   2018-12-09   656.04  2018-12  0.640664
7    1001_4     1001   2018-11-04    16.97  2018-11  0.016572
8   1001_10     1001   2018-11-27   135.18  2018-11  0.132012
9   1001_15     1001   2018-12-13   761.92  2018-12  0.744062
Study plan conditions¶
# Print out the plan conditions and make sure they are clear for you
for i in range(len(plans)):
print(f"\n{plans.iloc[i]['plan_name'].upper()} Plan:")
print(f" Monthly fee: ${plans.iloc[i]['usd_monthly_pay']}")
print(f" Included minutes: {plans.iloc[i]['minutes_included']}")
print(f" Included messages: {plans.iloc[i]['messages_included']}")
print(f" Included data: {plans.iloc[i]['gb_per_month_included']:.1f} GB")
print(f" Overage cost per minute: ${plans.iloc[i]['usd_per_minute']}")
print(f" Overage cost per message: ${plans.iloc[i]['usd_per_message']}")
print(f" Overage cost per GB: ${plans.iloc[i]['usd_per_gb']}")
SURF Plan:
  Monthly fee: $20
  Included minutes: 500
  Included messages: 50
  Included data: 15.0 GB
  Overage cost per minute: $0.03
  Overage cost per message: $0.03
  Overage cost per GB: $10

ULTIMATE Plan:
  Monthly fee: $70
  Included minutes: 3000
  Included messages: 1000
  Included data: 30.0 GB
  Overage cost per minute: $0.01
  Overage cost per message: $0.01
  Overage cost per GB: $7
Aggregate data per user¶
# Calculate the number of calls made by each user per month. Save the result.
calls_per_month = calls.groupby(['user_id', 'month']).agg(num_calls=('id', 'count')).reset_index()
print(calls_per_month.head(10))
   user_id    month  num_calls
0     1000  2018-12         16
1     1001  2018-08         27
2     1001  2018-09         49
3     1001  2018-10         65
4     1001  2018-11         64
5     1001  2018-12         56
6     1002  2018-10         11
7     1002  2018-11         55
8     1002  2018-12         47
9     1003  2018-12        149
# Calculate the amount of minutes spent by each user per month. Save the result.
minutes_per_month = calls.groupby(['user_id', 'month']).agg(total_minutes=('duration_rounded', 'sum')).reset_index()
print(minutes_per_month.head(10))
   user_id    month  total_minutes
0     1000  2018-12          124.0
1     1001  2018-08          187.0
2     1001  2018-09          326.0
3     1001  2018-10          411.0
4     1001  2018-11          441.0
5     1001  2018-12          422.0
6     1002  2018-10           62.0
7     1002  2018-11          393.0
8     1002  2018-12          393.0
9     1003  2018-12         1135.0
# Calculate the number of messages sent by each user per month. Save the result.
messages_per_month = messages.groupby(['user_id', 'month']).agg(num_messages=('id', 'count')).reset_index()
print(messages_per_month.head(10))
   user_id    month  num_messages
0     1000  2018-12            11
1     1001  2018-08            30
2     1001  2018-09            44
3     1001  2018-10            53
4     1001  2018-11            36
5     1001  2018-12            44
6     1002  2018-10            15
7     1002  2018-11            32
8     1002  2018-12            41
9     1003  2018-12            50
# Calculate the volume of internet traffic used by each user per month. Save the result.
internet_per_month = internet.groupby(['user_id', 'month']).agg(total_mb=('mb_used', 'sum'), total_gb=('gb_used', 'sum')).reset_index()
# Round up GB to nearest whole number (what the user would be billed for)
internet_per_month['gb_rounded'] = np.ceil(internet_per_month['total_gb'])
print(internet_per_month.head(10))
   user_id    month  total_mb   total_gb  gb_rounded
0     1000  2018-12   1901.47   1.856904         2.0
1     1001  2018-08   6919.15   6.756982         7.0
2     1001  2018-09  13314.82  13.002754        14.0
3     1001  2018-10  22330.49  21.807119        22.0
4     1001  2018-11  18504.30  18.070605        19.0
5     1001  2018-12  19369.18  18.915215        19.0
6     1002  2018-10   6552.01   6.398447         7.0
7     1002  2018-11  19345.08  18.891680        19.0
8     1002  2018-12  14396.24  14.058828        15.0
9     1003  2018-12  27044.14  26.410293        27.0
# Merge the data for calls, minutes, messages, internet based on user_id and month
# First, merge calls data (which includes both number of calls and minutes)
user_month_data = calls_per_month.merge(minutes_per_month, on=['user_id', 'month'], how='outer')
# Merge messages
user_month_data = user_month_data.merge(messages_per_month, on=['user_id', 'month'], how='outer')
# Merge internet
user_month_data = user_month_data.merge(internet_per_month, on=['user_id', 'month'], how='outer')
# Fill NaN values with 0 (users who didn't use a service that month)
user_month_data = user_month_data.fillna(0)
user_month_data.info()
print()
print(user_month_data.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 2293 non-null int64
1 month 2293 non-null period[M]
2 num_calls 2293 non-null float64
3 total_minutes 2293 non-null float64
4 num_messages 2293 non-null float64
5 total_mb 2293 non-null float64
6 total_gb 2293 non-null float64
7 gb_rounded 2293 non-null float64
dtypes: float64(6), int64(1), period[M](1)
memory usage: 143.4 KB
user_id month num_calls total_minutes num_messages total_mb \
0 1000 2018-12 16.0 124.0 11.0 1901.47
1 1001 2018-08 27.0 187.0 30.0 6919.15
2 1001 2018-09 49.0 326.0 44.0 13314.82
3 1001 2018-10 65.0 411.0 53.0 22330.49
4 1001 2018-11 64.0 441.0 36.0 18504.30
total_gb gb_rounded
0 1.856904 2.0
1 6.756982 7.0
2 13.002754 14.0
3 21.807119 22.0
4 18.070605 19.0
# Add the plan information
# First, add user information
user_month_data = user_month_data.merge(users[['user_id', 'plan', 'reg_date', 'churn_date', 'is_ny_nj', 'city']], on='user_id', how='left')
# Add plan details
user_month_data = user_month_data.merge(plans, left_on='plan', right_on='plan_name', how='left')
user_month_data.info()
print()
print(user_month_data.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2293 entries, 0 to 2292
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 2293 non-null int64
1 month 2293 non-null period[M]
2 num_calls 2293 non-null float64
3 total_minutes 2293 non-null float64
4 num_messages 2293 non-null float64
5 total_mb 2293 non-null float64
6 total_gb 2293 non-null float64
7 gb_rounded 2293 non-null float64
8 plan 2293 non-null object
9 reg_date 2293 non-null datetime64[ns]
10 churn_date 158 non-null datetime64[ns]
11 is_ny_nj 2293 non-null bool
12 city 2293 non-null object
13 messages_included 2293 non-null int64
14 mb_per_month_included 2293 non-null int64
15 minutes_included 2293 non-null int64
16 usd_monthly_pay 2293 non-null int64
17 usd_per_gb 2293 non-null int64
18 usd_per_message 2293 non-null float64
19 usd_per_minute 2293 non-null float64
20 plan_name 2293 non-null object
21 gb_per_month_included 2293 non-null float64
dtypes: bool(1), datetime64[ns](2), float64(9), int64(6), object(3), period[M](1)
memory usage: 378.6+ KB
user_id month num_calls total_minutes num_messages total_mb \
0 1000 2018-12 16.0 124.0 11.0 1901.47
1 1001 2018-08 27.0 187.0 30.0 6919.15
2 1001 2018-09 49.0 326.0 44.0 13314.82
3 1001 2018-10 65.0 411.0 53.0 22330.49
4 1001 2018-11 64.0 441.0 36.0 18504.30
total_gb gb_rounded plan reg_date ... \
0 1.856904 2.0 ultimate 2018-12-24 ...
1 6.756982 7.0 surf 2018-08-13 ...
2 13.002754 14.0 surf 2018-08-13 ...
3 21.807119 22.0 surf 2018-08-13 ...
4 18.070605 19.0 surf 2018-08-13 ...
city messages_included \
0 Atlanta-Sandy Springs-Roswell, GA MSA 1000
1 Seattle-Tacoma-Bellevue, WA MSA 50
2 Seattle-Tacoma-Bellevue, WA MSA 50
3 Seattle-Tacoma-Bellevue, WA MSA 50
4 Seattle-Tacoma-Bellevue, WA MSA 50
mb_per_month_included minutes_included usd_monthly_pay usd_per_gb \
0 30720 3000 70 7
1 15360 500 20 10
2 15360 500 20 10
3 15360 500 20 10
4 15360 500 20 10
usd_per_message usd_per_minute plan_name gb_per_month_included
0 0.01 0.01 ultimate 30.0
1 0.03 0.03 surf 15.0
2 0.03 0.03 surf 15.0
3 0.03 0.03 surf 15.0
4 0.03 0.03 surf 15.0
[5 rows x 22 columns]
# Calculate the monthly revenue for each user
# Revenue = monthly fee + overage charges for minutes, messages, and data
#Calculate overage for each service (overage = max(0, used - included))
user_month_data['minutes_overage'] = np.maximum(0, user_month_data['total_minutes'] - user_month_data['minutes_included'])
user_month_data['messages_overage'] = np.maximum(0, user_month_data['num_messages'] - user_month_data['messages_included'])
user_month_data['gb_overage'] = np.maximum(0, user_month_data['gb_rounded'] - user_month_data['gb_per_month_included'])
# Calculate overage charges
user_month_data['minutes_overage_charge'] = user_month_data['minutes_overage'] * user_month_data['usd_per_minute']
user_month_data['messages_overage_charge'] = user_month_data['messages_overage'] * user_month_data['usd_per_message']
user_month_data['gb_overage_charge'] = user_month_data['gb_overage'] * user_month_data['usd_per_gb']
# Calculate total monthly revenue
user_month_data['monthly_revenue'] = (user_month_data['usd_monthly_pay'] + user_month_data['minutes_overage_charge'] + user_month_data['messages_overage_charge'] + user_month_data['gb_overage_charge'])
print(user_month_data[['user_id', 'month', 'usd_monthly_pay', 'minutes_overage_charge', 'messages_overage_charge', 'gb_overage_charge', 'monthly_revenue']].head(10))
   user_id    month  usd_monthly_pay  minutes_overage_charge  \
0     1000  2018-12               70                    0.00
1     1001  2018-08               20                    0.00
2     1001  2018-09               20                    0.00
3     1001  2018-10               20                    0.00
4     1001  2018-11               20                    0.00
5     1001  2018-12               20                    0.00
6     1002  2018-10               20                    0.00
7     1002  2018-11               20                    0.00
8     1002  2018-12               20                    0.00
9     1003  2018-12               20                   19.05

   messages_overage_charge  gb_overage_charge  monthly_revenue
0                     0.00                0.0            70.00
1                     0.00                0.0            20.00
2                     0.00                0.0            20.00
3                     0.09               70.0            90.09
4                     0.00               40.0            60.00
5                     0.00               40.0            60.00
6                     0.00                0.0            20.00
7                     0.00               40.0            60.00
8                     0.00                0.0            20.00
9                     0.00              120.0           159.05
Study user behaviour¶
Calls¶
# Compare the average monthly call minutes per user for each plan in each distinct month. Plot a bar plot to visualize it.
avg_minutes_by_plan_month = user_month_data.groupby(['plan', 'month'])['total_minutes'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_minutes = avg_minutes_by_plan_month.pivot(index='month', columns='plan', values='total_minutes')
# Plot bar chart
pivot_minutes.plot(kind='bar', title='Average call minutes per month by plan', xlabel='Month', ylabel='Average Minutes', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the number of minutes users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['total_minutes'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly minutes by plan',xlabel='Minutes', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(500, color='blue', linestyle='--', label='Surf limit (500 min)', linewidth=2)
plt.axvline(3000, color='orange', linestyle='--', label='Ultimate limit (3000 min)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly call duration
call_stats = user_month_data.groupby('plan')['total_minutes'].agg(['mean', 'var', 'std', 'median'])
call_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 440.324221 | 57743.745959 | 240.299284 | 438.0 |
| ultimate | 442.431944 | 60876.012052 | 246.730647 | 437.5 |
# Plot a boxplot to visualize the distribution of the monthly call duration
user_month_data.boxplot(column='total_minutes', by='plan', figsize=(10, 6))
plt.title('Distribution of Monthly Call Minutes by Plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total Minutes')
plt.show()
Conclusions on call behavior:¶
Based on the analysis of call data:
- Despite Ultimate offering 3000 minutes vs Surf's 500 minutes, users consume similar amounts
- This suggests people choose plans based on perceived needs or risk aversion rather than actual usage patterns
- Ultimate users substantially under-utilize their allowance, averaging about 440 of their 3000 included minutes
Messages¶
# Compare the number of messages users of each plan tend to send each month
avg_messages_by_plan_month = user_month_data.groupby(['plan', 'month'])['num_messages'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_messages = avg_messages_by_plan_month.pivot(index='month', columns='plan', values='num_messages')
# Plot bar chart
pivot_messages.plot(kind='bar', title='Average messages per month by plan', xlabel='Month', ylabel='Average Messages', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the number of messages users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['num_messages'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly messages by plan',xlabel='Messages', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(50, color='blue', linestyle='--', label='Surf limit (50 messages)', linewidth=2)
plt.axvline(1000, color='orange', linestyle='--', label='Ultimate limit (1000 messages)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly messages
message_stats = user_month_data.groupby('plan')['num_messages'].agg(['mean', 'var', 'std', 'median'])
message_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 31.159568 | 1126.724522 | 33.566717 | 24.0 |
| ultimate | 37.551389 | 1208.756744 | 34.767179 | 30.0 |
# Plot a boxplot to visualize the distribution of monthly messages
user_month_data.boxplot(column='num_messages', by='plan', figsize=(10, 6))
plt.title('Distribution of monthly messages by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total Messages')
plt.show()
Conclusions on Message Behavior:¶
Based on the analysis of message data:
- Message usage is low relative to the allowances, especially the Ultimate plan's 1000-message limit
- Plan choice does not meaningfully affect messaging behavior (Surf users average about 31 messages per month vs about 38 for Ultimate)
- Message allowances are over-provisioned, especially for Ultimate
- Message volume increased over the course of 2018
Internet¶
# Compare the internet usage users of each plan tend to use each month
avg_internet_by_plan_month = user_month_data.groupby(['plan', 'month'])['gb_rounded'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_internet = avg_internet_by_plan_month.pivot(index='month', columns='plan', values='gb_rounded')
# Plot bar chart
pivot_internet.plot(kind='bar', title='Average GB used per month by plan', xlabel='Month', ylabel='Average GB used', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the internet usage users of each plan require each month. Plot a histogram.
user_month_data.groupby('plan')['gb_rounded'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly internet usage by plan',xlabel='GB usage', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(15, color='blue', linestyle='--', label='Surf limit (15 GB)', linewidth=2)
plt.axvline(30, color='orange', linestyle='--', label='Ultimate limit (30 GB)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the monthly internet usage
internet_stats = user_month_data.groupby('plan')['gb_rounded'].agg(['mean', 'var', 'std', 'median'])
internet_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 16.670693 | 61.58360 | 7.847522 | 17.0 |
| ultimate | 17.306944 | 58.83055 | 7.670108 | 17.0 |
# Plot a boxplot to visualize the distribution of monthly internet usage
user_month_data.boxplot(column='gb_rounded', by='plan', figsize=(10, 6))
plt.title('Distribution of internet usage by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Total GB used')
plt.show()
Conclusions on Internet Usage Behavior:¶
Based on the analysis of internet usage data:
- Nearly identical usage patterns across both plans
- Surf users frequently exceed their allowance
- Ultimate users stay well within their limit
Revenue¶
# Compare the revenue each plan tends to generate each month
avg_revenue_by_plan_month = user_month_data.groupby(['plan', 'month'])['monthly_revenue'].mean().reset_index()
# Create a pivot table for easier plotting
pivot_revenue = avg_revenue_by_plan_month.pivot(index='month', columns='plan', values='monthly_revenue')
# Plot bar chart
pivot_revenue.plot(kind='bar', title='Average revenue generated per month by plan', xlabel='Month', ylabel='Average revenue', rot=45, figsize=(14,6))
plt.legend(title='Plan')
plt.show()
# Compare the revenue generated by users for each plan. Plot a histogram.
user_month_data.groupby('plan')['monthly_revenue'].plot(kind='hist', bins=30, alpha=0.5, title='Distribution of monthly revenue by plan', xlabel='Monthly Revenue ($)', ylabel='Frequency', figsize=(14,6), legend=True)
plt.axvline(20, color='blue', linestyle='--', label='Surf base fee ($20)', linewidth=2)
plt.axvline(70, color='orange', linestyle='--', label='Ultimate base fee ($70)', linewidth=2)
plt.legend(title='Plan')
plt.show()
# Calculate the mean and the variance of the revenue
revenue_stats = user_month_data.groupby('plan')['monthly_revenue'].agg(['mean', 'var', 'std', 'median'])
revenue_stats
| plan | mean | var | std | median |
|---|---|---|---|---|
| surf | 60.899053 | 3075.367902 | 55.456000 | 40.42 |
| ultimate | 72.313889 | 129.848486 | 11.395108 | 70.00 |
# Plot a boxplot to visualize the distribution of monthly revenue
user_month_data.boxplot(column='monthly_revenue', by='plan', figsize=(10, 6))
plt.title('Distribution of revenue by plan')
plt.suptitle('')
plt.xlabel('Plan')
plt.ylabel('Revenue Generated')
plt.show()
Revenue Analysis Summary:¶
1. Ultimate generates more revenue per user:
- Ultimate average: 72.31 dollars per month vs Surf average: 60.90 dollars per month (about 19 percent higher)
- Ultimate is more profitable per customer
2. Surf revenue is unpredictable due to overage charges:
- Surf median (40.42 dollars) is double the 20 dollar base fee - most users pay overage charges (quantified in the sketch after this summary)
- High variability (standard deviation: 55.46 dollars) means bills fluctuate significantly month-to-month
- Primary driver: data overages from exceeding the 15 GB limit
3. Ultimate revenue is stable and predictable:
- Ultimate median (70.00 dollars) matches the base fee - most users do not pay overages
- Low variability (standard deviation: 11.40 dollars) means consistent monthly bills
- Users stay within their allowances, generating reliable revenue
4. Revenue increased throughout 2018:
- Surf revenue grew from roughly 20 dollars per user early in the year to about 70 dollars by December as data usage and overages grew
- Ultimate remained stable at 65 to 75 dollars throughout the year
- By December, both plans generated similar revenue (about 70 to 75 dollars per month)
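As a quick check of the claim that most Surf users pay overage charges, here is a small sketch (not part of the original cells) that computes the share of Surf user-months billed above the 20 dollar base fee:
# Share of Surf user-months where the bill exceeded the base fee
surf_months = user_month_data[user_month_data['plan'] == 'surf']
share_with_overage = (surf_months['monthly_revenue'] > 20).mean()
print(f'Share of Surf user-months with overage charges: {share_with_overage:.1%}')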
Test statistical hypotheses¶
Hypothesis 1: Average revenue from Ultimate and Surf plans differs¶
Hypotheses:
- Null hypothesis (H₀): The average revenue from users of Ultimate and Surf calling plans is equal
- Alternative hypothesis (H₁): The average revenue from users of Ultimate and Surf calling plans differs
Statistical test: Two-sample t-test (independent samples) using scipy.stats.ttest_ind()
Significance level: α = 0.05 (critical statistical significance level)
Reasoning: We use a two-sample t-test because we are comparing the means of two independent groups (Surf users vs Ultimate users). We'll first check the variances of both groups to determine whether to set equal_var=True or equal_var=False.
To ensure the independence assumption of the t-test is satisfied, I first aggregated the data to the user level. For each user, I calculated their average monthly revenue across all 12 months of 2018. This ensures that each observation in the test represents one independent user rather than multiple correlated months from the same user.
# Calculate average monthly revenue per user
user_avg_revenue = user_month_data.groupby('user_id').agg({'monthly_revenue': 'mean','plan': 'first', 'is_ny_nj': 'first'}).reset_index()
# Rename column for clarity
user_avg_revenue.rename(columns={'monthly_revenue': 'avg_monthly_revenue'}, inplace=True)
print(user_avg_revenue.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490 entries, 0 to 489
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   user_id              490 non-null    int64
 1   avg_monthly_revenue  490 non-null    float64
 2   plan                 490 non-null    object
 3   is_ny_nj             490 non-null    bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 12.1+ KB
None
# Extract revenue data for each plan
surf_revenue = user_avg_revenue[user_avg_revenue['plan'] == 'surf']['avg_monthly_revenue']
ultimate_revenue = user_avg_revenue[user_avg_revenue['plan'] == 'ultimate']['avg_monthly_revenue']
# Set significance level
alpha = 0.05
# Check variances to decide on equal_var parameter
print('Variance check:')
print(f'Surf variance: {surf_revenue.var():.2f}')
print(f'Ultimate variance: {ultimate_revenue.var():.2f}')
print(f'Variance ratio: {surf_revenue.var() / ultimate_revenue.var():.2f}')
print()
Variance check:
Surf variance: 1902.16
Ultimate variance: 55.13
Variance ratio: 34.51
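As an optional complement to the simple variance ratio (a sketch, not part of the original analysis), Levene's test from scipy.stats can formally assess whether the two variances differ before choosing the equal_var setting:
# Optional: Levene's test for equality of variances
levene_stat, levene_p = st.levene(surf_revenue, ultimate_revenue)
print(f'Levene statistic: {levene_stat:.2f}, p-value: {levene_p:.4f}')
# A small p-value (< 0.05) would indicate unequal variances, supporting equal_var=False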
# Perform two-sample t-test
# Since variances appear different, we set equal_var=False
results = st.ttest_ind(surf_revenue, ultimate_revenue, equal_var=False)
print(f'Surf plan - Mean revenue: ${surf_revenue.mean():.2f}')
print(f'Ultimate plan - Mean revenue: ${ultimate_revenue.mean():.2f}')
print()
print(f'p-value: {results.pvalue}')
print()
if results.pvalue < alpha:
print("We reject the null hypothesis")
print('Conclusion: There IS a statistically significant difference in average revenue')
print('between the Surf and Ultimate plans.')
else:
print("We can't reject the null hypothesis")
print('Conclusion: There is NO statistically significant difference in average revenue')
print('between the Surf and Ultimate plans.')
Surf plan - Mean revenue: $58.38
Ultimate plan - Mean revenue: $71.82

p-value: 8.692101224680108e-08

We reject the null hypothesis
Conclusion: There IS a statistically significant difference in average revenue
between the Surf and Ultimate plans.
Hypothesis 2: Average revenue from NY-NJ area differs from other regions¶
Hypotheses:
- Null hypothesis (H₀): The average revenue from users in NY-NJ area is equal to that of users from other regions
- Alternative hypothesis (H₁): The average revenue from users in NY-NJ area differs from that of users from other regions
Statistical test: Two-sample t-test (independent samples) using scipy.stats.ttest_ind()
Significance level: α = 0.05 (critical statistical significance level)
Reasoning: Similar to the first hypothesis, we use a two-sample t-test to compare means between two independent groups (NY-NJ users vs other region users). We'll check the variances to determine the appropriate equal_var parameter.
Similar to Hypothesis 1, I aggregated the data to the user level by calculating each user's average monthly revenue across all months. This aggregation ensures the independence assumption is met, as we're comparing average revenues from distinct users rather than multiple observations from the same users.
# Extract revenue data for each region
ny_nj_revenue = user_avg_revenue[user_avg_revenue['is_ny_nj'] == True]['avg_monthly_revenue']
other_revenue = user_avg_revenue[user_avg_revenue['is_ny_nj'] == False]['avg_monthly_revenue']
# Set significance level
alpha = 0.05 # critical statistical significance level
# Check variances to decide on equal_var parameter
print('Variance check:')
print(f'NY-NJ variance: {ny_nj_revenue.var():.2f}')
print(f'Other regions variance: {other_revenue.var():.2f}')
print(f'Variance ratio: {ny_nj_revenue.var() / other_revenue.var():.2f}')
print()
Variance check:
NY-NJ variance: 1071.36
Other regions variance: 1402.89
Variance ratio: 0.76
# Perform two-sample t-test
# Variances are similar (ratio of about 0.76), so we set equal_var=True
results = st.ttest_ind(ny_nj_revenue, other_revenue, equal_var=True)
print(f'NY-NJ area - Mean revenue: ${ny_nj_revenue.mean():.2f}')
print(f'Other regions - Mean revenue: ${other_revenue.mean():.2f}')
print()
print(f'p-value: {results.pvalue}')
print()
if results.pvalue < alpha:
print("We reject the null hypothesis")
print('Conclusion: There IS a statistically significant difference in average revenue')
print('between NY-NJ area and other regions.')
else:
print("We can't reject the null hypothesis")
print('Conclusion: There is NO statistically significant difference in average revenue')
print('between NY-NJ area and other regions.')
NY-NJ area - Mean revenue: $60.04
Other regions - Mean revenue: $63.19

p-value: 0.48516408450209136

We can't reject the null hypothesis
Conclusion: There is NO statistically significant difference in average revenue
between NY-NJ area and other regions.
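To complement the p-value, the sketch below (not in the original run) computes a 95 percent confidence interval for the difference in mean revenue between the two regions, using the pooled-variance formula consistent with equal_var=True; because the test was not significant, the interval should contain zero.
# Sketch: 95% confidence interval for the difference in means (pooled variance)
n1, n2 = len(ny_nj_revenue), len(other_revenue)
mean_diff = ny_nj_revenue.mean() - other_revenue.mean()
pooled_var = ((n1 - 1) * ny_nj_revenue.var() + (n2 - 1) * other_revenue.var()) / (n1 + n2 - 2)
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = st.t.ppf(0.975, df=n1 + n2 - 2)
print(f'95% CI for the difference: [{mean_diff - t_crit * std_err:.2f}, {mean_diff + t_crit * std_err:.2f}]')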
General Conclusion¶
Key Findings¶
This analysis of Megaline's Surf and Ultimate prepaid plans reveals important insights about customer behavior, revenue generation, and strategic opportunities for the business.
1. Revenue Performance¶
Ultimate plan generates higher revenue per customer:
- Ultimate users contribute an average of 71.82 dollars per month compared to Surf users at 58.38 dollars per month (23 percent higher)
- Statistical testing confirms this difference is significant (p-value = 8.692101224680108e-08)
However, Surf plan dominates in customer volume:
- Surf attracts 339 users (68 percent of customer base) versus Ultimate's 161 users (32 percent)
- The lower 20 dollar price point makes Surf more attractive for customer acquisition
- By December 2018, Surf revenue converged with Ultimate (both generating approximately 70-75 dollars per month) due to accumulated overage charges
2. Customer Behavior Patterns¶
Usage is similar across plans despite different allowances:
- Call minutes: Both plans average around 440 minutes per month (median values nearly identical)
- Messages: Surf users average 31.2 messages vs Ultimate users at 37.6 messages (minimal difference)
- Internet data: Surf users average 16.7 GB vs Ultimate users at 17.3 GB (only 0.6 GB difference)
Critical insight: Customer behavior is driven by personal habits rather than plan limits. Users consume what they need regardless of their allowance.
3. Overage Patterns and Revenue Stability¶
Surf plan shows high revenue variability:
- Median revenue (40.42 dollars) is double the base fee, indicating most users incur overage charges
- High standard deviation (55.46 dollars) means bills fluctuate significantly month-to-month
- Primary driver: data overages from exceeding the 15 GB limit
- Revenue increased throughout 2018 as data usage grew
Ultimate plan provides stable, predictable revenue:
- Median revenue (70.00 dollars) matches the base fee exactly - most users stay within allowances
- Low standard deviation (11.40 dollars) indicates consistent monthly bills
- Users are significantly over-provisioned, using less than 60 percent of data allowance and less than 15 percent of minute allowance
- Stable revenue stream throughout 2018 (65-75 dollar range)
4. Regional Differences¶
- NY-NJ average: 60.04 dollars per month
- Other regions average: 63.19 dollars per month
- Difference: 3.15 dollars per month (about 5 percent lower in NY-NJ)
- Statistical testing (p-value = 0.485) indicates this difference is not statistically significant
Strategic Recommendations¶
Based on this analysis, I recommend a dual-strategy approach rather than focusing exclusively on one plan:
Maintain Both Plans with Distinct Positioning¶
Ultimate Plan Strategy:
- Target customers who value predictability and peace of mind
- Position as the premium, worry-free option for heavy users and business customers
- Emphasize stable billing and no surprise charges in marketing
- Continue to provide stable, high-margin revenue stream
Surf Plan Strategy:
- Continue using as primary customer acquisition tool due to attractive entry price
- Accept revenue variability as trade-off for higher customer volume
- Monitor customer satisfaction regarding data overage frequency
- Consider as gateway product for eventual upsell to Ultimate
Consider Plan Modifications¶
Adjust Surf plan data allowance:
- Current 15 GB limit is increasingly inadequate as average usage approaches 17 GB
- Recommendation: Increase to 18-20 GB to reduce customer frustration while maintaining overage revenue potential
- This could improve customer satisfaction without significantly impacting revenue; a rough estimate of the forgone overage revenue is sketched after this list
Introduce a mid-tier plan:
- Create a 40-50 dollar per month option with 1000-1500 minutes and 20-25 GB data
- Capture customers who find Surf too limiting but Ultimate excessive
- Fill the gap between budget-conscious and premium segments
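To ground the allowance recommendation, here is a rough sketch (reusing user_month_data from the analysis above; the 20 GB figure is hypothetical) of how much data-overage revenue would be forgone if the Surf allowance were raised:
# Sketch: Surf data-overage revenue today vs. with a hypothetical 20 GB allowance
surf_months = user_month_data[user_month_data['plan'] == 'surf']
current_data_revenue = (np.maximum(0, surf_months['gb_rounded'] - 15) * 10).sum()   # 15 GB included, $10/GB
proposed_data_revenue = (np.maximum(0, surf_months['gb_rounded'] - 20) * 10).sum()  # hypothetical 20 GB allowance
print(f'Current data-overage revenue:  ${current_data_revenue:,.2f}')
print(f'With a 20 GB allowance:        ${proposed_data_revenue:,.2f}')
print(f'Estimated revenue forgone:     ${current_data_revenue - proposed_data_revenue:,.2f}')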
Market Segmentation Approach¶
Surf Plan: Budget-conscious users, light-to-moderate users, students, price-sensitive customers
Mid-tier Plan (if introduced): Moderate users seeking value without overage risk
Ultimate Plan: Heavy users, business customers, those prioritizing bill stability and peace of mind
Final Recommendation¶
Do not focus exclusively on either plan. Both Surf and Ultimate serve critical but different business purposes:
- Surf drives customer acquisition through attractive pricing and captures a larger market share
- Ultimate maximizes per-customer profitability and provides revenue stability
The optimal strategy is to maintain both plans while optimizing each for its target customer segment. As data usage continues to grow (as evidenced by the steady growth in usage and overage charges throughout 2018), consider adjusting plan parameters to balance customer satisfaction with revenue generation.
The company should leverage its diverse plan portfolio to serve different customer needs while building a foundation for long-term revenue growth through strategic upselling and plan optimization.