Starbucks Best Offers Analysis

Anurag Lahon
7 min read · Apr 25, 2020

Overview

In this project, I will try to find out how Starbucks customers use the app and how well the current offer system works. I will also look at whom the app should target with promotions. This dataset contains simulated data that mimics customer behavior on the Starbucks rewards system in its mobile application. From it, we can understand the customers’ behavior, which might help us make better decisions. Some users might not receive any offer during certain weeks.

Our analysis will follow these steps: define the business question, understand the datasets, prepare and wrangle the data, analyze the data, model the data, compare model performance, and finally select one model and improve it.

Business Understanding

My goal for this project is to predict which kind of offer, Buy One Get One Free (BOGO), discount, or informational, is best to give a current customer, knowing only their age, gender, income, and the amount they are paying.

Data Understanding

There are three data frames: profile, portfolio, and transcript.

Before analyzing the data, we have to explore what we have. We need to check whether it is clean, check for missing values, and so on.

The data is provided by Starbucks. Here is a brief overview of what the data looks like:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json: 10 rows, 6 columns.

  • id (string) — offer id
  • offer_type (string) — type of offer ie BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

profile.json: 17000 rows, 5 columns.

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json: 306534 rows, 4 columns.

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value (dict of strings) — either an offer id or transaction amount depending on the record

Data preparation / Wrangling :

For the portfolio, we can see that the ‘channels’ column needs some cleaning. It contains a list, so each value in that list must have its own column. After separating each value, we drop the ‘channels’ column as it is no longer needed. The table will look like this:
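As a minimal sketch of this step, assuming pandas and a small made-up stand-in for portfolio.json (the offer ids and channel lists below are hypothetical), the list column can be expanded into one 0/1 column per channel:

```python
import pandas as pd

# Toy stand-in for portfolio.json; ids and channel lists are made up.
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b", "offer_c"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "mobile", "social"], ["web", "email"], ["email", "mobile"]],
})

# Collect every channel name that appears in any list.
channel_names = sorted({name for names in portfolio["channels"] for name in names})

# Add one 0/1 indicator column per channel.
for name in channel_names:
    portfolio[name] = [int(name in names) for names in portfolio["channels"]]

# The original list column is no longer needed.
portfolio = portfolio.drop(columns="channels")
print(portfolio)
```

Each offer keeps its id and type, and every channel becomes its own numeric column that a model can consume directly.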

For the profile.json dataframe, the gender and income columns have NaN values. For gender, NaN values were converted to NA. For income, NaN values were replaced by the mean.
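The two replacements described above can be sketched with pandas `fillna`. The rows below are made-up toy data, and using the string "NA" as the gender placeholder is my reading of the text rather than confirmed code:

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile.json; all values are made up.
profile = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "gender": ["F", None, "M", "O"],
    "income": [70000.0, np.nan, 50000.0, 60000.0],
})

# Missing gender becomes the placeholder string "NA" (assumed representation).
profile["gender"] = profile["gender"].fillna("NA")

# Missing income is replaced by the mean of the known incomes.
profile["income"] = profile["income"].fillna(profile["income"].mean())
print(profile)
```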

For the third dataframe, transcript, there are no NaN values.

However, the value column holds a dictionary with the keys offer id, amount, offer_id, and reward, so we have to separate each value into its own column and drop the ‘value’ column as it is no longer needed.

Now that the data is cleaned and understood, we can move on to the analysis and modeling part.
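One way to do this, sketched on made-up records, is to expand the dictionaries into a DataFrame and merge the two spellings of the offer id key. Merging them into a single `offer_id` column is an assumption on my part; the text only says the values are separated:

```python
import pandas as pd

# Toy stand-in for transcript.json; all records are made up.
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c2"],
    "event": ["offer received", "transaction", "offer completed"],
    "time": [0, 6, 12],
    "value": [
        {"offer id": "offer_a"},
        {"amount": 9.5},
        {"offer_id": "offer_a", "reward": 5},
    ],
})

# Expand each dict into columns; keys missing from a record become NaN.
values = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)

# Merge the two spellings of the offer id key into a single column.
values["offer_id"] = values["offer_id"].fillna(values["offer id"])
values = values.drop(columns="offer id")

# Replace the dict column with the expanded columns.
transcript = pd.concat([transcript.drop(columns="value"), values], axis=1)
print(transcript)
```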

Analyzing the Data :

I. Univariate Exploration :

What is the average income for Starbucks customers?

The average income is about 65,405.

What is the average age for Starbucks customers?

The average age is about 62.5.

What is the most common promotion?

I show only the top 3 promotions, and only completed promotions, as they are more important.

The offer ID ‘fafdcd668e3743c1bb461111dcafc2a4’ is the most common, with 5317 completions. The least common offer is ‘4d5c57ea9a6940dd891ad53e9dbe8da0’, with a total of 3331 completions.

BOGO (buy one, get one free) is the most used offer type, followed closely by discount. Informational offers come third, about 40,000 behind, which is a huge gap.

What are the most common values for each column in each dataframe?

Age 58, the adult age group, and male are the most common values.

Who are the most loyal customers, in other words, those with the most transcript records?

List of the most loyal customers (customers who spend a lot of money on offers/transactions).

What are the most common events in our transcripts?

Transactions have the most rows in the transcript data frame, around 140k, almost half of the total.

Let’s move on to multivariate exploration.
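Most of the univariate questions above reduce to a mean, a mode, or a frequency count. A small sketch on made-up data (the frames below are toy stand-ins, not the real dataset):

```python
import pandas as pd

# Toy stand-ins for the cleaned profile and transcript frames; values are made up.
profile = pd.DataFrame({
    "age": [58, 58, 33, 70],
    "gender": ["M", "M", "F", "M"],
    "income": [60000.0, 70000.0, 50000.0, 80000.0],
})
transcript = pd.DataFrame({
    "person": ["c1", "c1", "c2", "c2", "c1"],
    "event": ["transaction", "offer received", "transaction", "offer viewed", "transaction"],
})

average_income = profile["income"].mean()             # average income
most_common_age = int(profile["age"].mode().iloc[0])  # most common age
event_counts = transcript["event"].value_counts()     # rows per event type
records_per_person = transcript["person"].value_counts()  # transcript rows per customer

print(average_income, most_common_age)
print(event_counts)
print(records_per_person.head(3))
```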

II. Multivariate Exploration

What is the most common promotion for children, teens, young adults, adults, and elderly customers?

Since it’s a multivariate question, we will use a multi-bar chart.

We can observe that all age groups show similar results across offer types: transactions have the upper hand, followed by BOGO. We can also see that young adults and teens are not our main customer group, so we can focus on elderly customers and adults.

Which gender has the higher income, males or females?

The graph above shows that the median income (the white dot) for females (around 70k) is higher than for males (around 60k). For females, income spreads from 40k to 100k. For males, most incomes fall between 40k and 70k, which is close to the median.

Which type of promotion does each gender like?

It seems that all genders share the same interest and prefer BOGO, but the difference between discount and BOGO is small.
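The counts behind a multi-bar chart like this can be sketched with `pd.cut` and `pd.crosstab`. The age boundaries below are hypothetical, since the article does not state which cut-offs define each group, and the rows are made up:

```python
import pandas as pd

# Toy merged frame of customer age and offer type; values are made up.
merged = pd.DataFrame({
    "age": [16, 25, 40, 45, 70, 72],
    "offer_type": ["bogo", "discount", "bogo", "bogo", "discount", "bogo"],
})

# Hypothetical bin edges: (0, 12], (12, 19], (19, 35], (35, 60], (60, 120].
bins = [0, 12, 19, 35, 60, 120]
labels = ["child", "teen", "young adult", "adult", "elderly"]
merged["age_group"] = pd.cut(merged["age"], bins=bins, labels=labels).astype(str)

# Rows are age groups, columns are offer types, cells are record counts.
counts = pd.crosstab(merged["age_group"], merged["offer_type"])
print(counts)
```

Calling `counts.plot(kind="bar")` on the resulting table would draw one group of bars per age group, provided matplotlib is installed.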
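The medians read off the graph can also be computed directly with a group-by. The incomes below are made-up toy values, not the real data:

```python
import pandas as pd

# Toy profile rows; incomes are made up for illustration.
profile = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "income": [60000.0, 70000.0, 90000.0, 40000.0, 60000.0, 65000.0],
})

# Median income per gender.
median_income = profile.groupby("gender")["income"].median()
print(median_income)
```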

Modeling the Data :

In this part, we build a model that can identify which kind of offer we should give a customer. Because the model will predict the offer_type, I will only use the transcript records that have an offer id, so I will ignore all transactions without an offer id for now.

Since we have a simple classification problem, I will use accuracy to evaluate my models.

Our features will be:

  • Event (converted from categorical to numerical)
  • Time (normalized)
  • Offer_id (converted from categorical to numerical)
  • Amount (normalized)
  • Reward (normalized)
  • Age_group (converted from categorical to numerical)
  • Gender (converted from categorical to numerical)
  • Income (normalized)

While our target will be offer type.
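The feature preparation listed above can be sketched as follows. Two choices here are assumptions, because the article does not say which methods it used: categorical columns are mapped to integer codes, and "normalized" is taken to mean min-max scaling to the range 0 to 1. The data is made up:

```python
import pandas as pd

# Toy feature frame; values are made up.
data = pd.DataFrame({
    "event": ["offer received", "offer viewed", "offer completed", "offer viewed"],
    "gender": ["F", "M", "M", "O"],
    "income": [40000.0, 60000.0, 80000.0, 100000.0],
    "time": [0, 6, 12, 24],
})

# Categorical to numerical: integer codes in sorted category order (assumed encoding).
for column in ["event", "gender"]:
    data[column] = data[column].astype("category").cat.codes

# Normalization: min-max scaling to [0, 1] (assumed method).
for column in ["income", "time"]:
    low, high = data[column].min(), data[column].max()
    data[column] = (data[column] - low) / (high - low)

print(data)
```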

The models used are: Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Random Forest, and Naive Bayes.
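A rough sketch of fitting these six model families and recording train and test accuracy with scikit-learn is shown here. It runs on synthetic three-class data from `make_classification`, not the Starbucks data, and the hyperparameters are library defaults rather than the ones used in this project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (X) and offer type (y).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and record accuracy on both splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }

for name, result in scores.items():
    print(f"{name}: train={result['train']:.3f}, test={result['test']:.3f}")
```

Recording both scores side by side makes it easier to spot models that fit the training split perfectly.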

Comparing different model performance :

I evaluate their performance based on accuracy.

We can see 100% accuracy on both the training and testing datasets for 4 of the models. To avoid overfitting, I will choose Logistic Regression, since it got good results: 80.5% on the training dataset and 92.8% on the testing dataset. Logistic Regression is a good fit here since the target has only a few possible outcomes, and we have a decent amount of data to work with. Now, let’s improve our model to get better results.

Model Improvement :

After using Grid Search with Logistic Regression, we managed to get better results.

That is an increase of about 2.21%, which is great. I don’t think the model needs further improvement. To make the results even better, though, I would try to improve the data collection and fix the issues with NaN values. I would also try to get more data, such as location, when each transaction was completed, at which branch, and at what time of day. All of this data can help us know when and where to give our offers.
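A grid search of this kind might look like the sketch below, using scikit-learn's `GridSearchCV`. The parameter grid (values of `C`), the 5-fold cross-validation, and the synthetic data are all assumptions for illustration; the article does not list the grid it searched:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and offer type.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# Accuracy of the best model (refit on the full training split) on held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```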

Conclusion :

In this project, I tried to analyze the data and build a model to predict the best offer to give a Starbucks customer. First, I explored the data to see what I had to change before starting the analysis. Then I did some exploratory analysis on the cleaned data. After that, I trained several models, chose one, and improved it to get better results. In conclusion, I think that Starbucks needs to focus more on adults and males, and offer more BOGO and discount offers to its customers.
