Causal ML for Orange Juice Price Elasticity

DATA622-Lab6

#--package load--
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Overview

You have been given a single data file, [oj_large.csv]. It is a subset of the data compiled during a large study by the Chicago Booth School of Business, which collaborated with a local supermarket chain, Dominick’s Finer Foods, to study the impact of prices, advertising, and demographics on the sales of a number of products. The file contains over 29,000 observations of orange juice prices and sales for different brands at different Dominick’s stores. A description of the variables is available here: [oj_dictionary.qmd]

The goal of this homework assignment is to use Causal Machine Learning to understand how different demographic factors influence something called the price elasticity of orange juice. You can read more about price elasticity here: [Wikipedia Price Elasticity of Demand]

Elasticity of demand is the ratio of the percentage change in sales to the percentage change in price:

\[ \epsilon = \frac{\partial (\mathrm{SALES})}{\partial (\mathrm{PRICE})}\frac{\mathrm{PRICE}}{\mathrm{SALES}} \]

This relationship is most natural when expressed in terms of the log transform of both sales and price, as the elasticity becomes the coefficient in a linear regression model relating the two:

\[ \log(\mathrm{SALES}) = \epsilon\log(\mathrm{PRICE}) + \mathrm{error\ terms} \]
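To make the log-log relationship concrete, here is a minimal sketch on made-up data (not oj_large.csv). The elasticity of -2, the price range, and the noise level are all assumptions chosen for illustration. It shows that the slope of a straight-line fit of log(SALES) on log(PRICE) recovers the elasticity.

```python
import numpy as np

rng = np.random.default_rng(0)

true_eps = -2.0                                  # assumed elasticity for this toy example
price = rng.uniform(1.0, 4.0, size=5_000)        # made-up prices
# SALES = A * PRICE^eps * multiplicative noise, so log(SALES) is linear in log(PRICE)
sales = 1_000.0 * price**true_eps * np.exp(rng.normal(0.0, 0.1, size=price.size))

# The slope of log(SALES) on log(PRICE) is the elasticity estimate
slope, intercept = np.polyfit(np.log(price), np.log(sales), deg=1)
print(f"estimated elasticity: {slope:.3f}")
```

With real data the slope is only a causal elasticity if nothing else drives both price and sales, which is why the rest of the lab controls for confounders.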

The elasticity \(\epsilon\) can depend on a variety of other factors. It can depend on the price itself (so that the relationship between the logs is not a straight line), on the type of product, on the demographics of the shoppers, and more. The EconML developers wrote a vignette in which they used Causal ML to show that \(\epsilon\) is a function of income, which is to say that sales are more sensitive to price in stores where the median income is lower. You can find that vignette here, and I recommend that you read it and use some code from it as a starting point (it covers more than OJ, but the OJ example is there): [EconML OJ Vignette]. To read more about the package and its applications, see [pywhy EconML]. The [tutorial for this package] is helpful as well.

Problem 1: Testing on Fake Data

(a)

It is standard practice in Causal Inference to test models on simulated response data, built from the original covariates of the dataset, before fitting to the real responses. Use the same selection of confounders as in the original vignette (the W matrix): exclude week, store, price, INCOME, and logmove, and apply One-Hot encoding/dummies to the brand variables to incorporate them into W. Apply the StandardScaler to W. Then put the variable INCOME into a matrix called “X” and standardize it. The matrix X contains _modifier_ variables whose effect on the elasticity will be studied.

Then simulate a relationship between your confounders and the price. You can use code like this: T_sim = 0.8 + W[:, support] @ coefs_T + noise, where support is a sparse 0/1 mask over the columns of W (most entries are 0, the rest are 1; store it as a boolean array so that the indexing selects the intended columns) and coefs_T is random (the 0.8 is just for scale).

Look to the simulation code earlier in the EconML vignette for inspiration.
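The treatment simulation could be sketched as follows. This is only an illustration: W here is random noise standing in for the encoded confounders, and the sizes, the number of active columns, the coefficient range, and the noise scale are all assumptions rather than values taken from the vignette.

```python
import numpy as np

rng = np.random.default_rng(42)

n, n_w = 2_000, 12                       # hypothetical sizes, not those of oj_large.csv
W_raw = rng.normal(size=(n, n_w))        # stand-in for the one-hot-encoded confounders
# Column-wise standardization; StandardScaler produces the same result
W = (W_raw - W_raw.mean(axis=0)) / W_raw.std(axis=0)

# Sparse boolean mask over the columns of W: only 3 of the 12 confounders matter
support = np.zeros(n_w, dtype=bool)
support[rng.choice(n_w, size=3, replace=False)] = True

coefs_T = rng.uniform(0.0, 0.3, size=support.sum())   # random coefficients on the support
noise_T = rng.normal(0.0, 0.1, size=n)

# Simulated treatment (log price); the 0.8 only sets the scale
T_sim = 0.8 + W[:, support] @ coefs_T + noise_T
```

Using a boolean mask matters: an integer array of 0s and 1s would be read by NumPy as column indices and would repeatedly select columns 0 and 1 instead.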

(b)

Now, simulate the values of the logmove (in a matrix called Y_sim) using your T_sim, your confounders W, and your modifier X. Make the relationship between Y_sim, T_sim, and X nonlinear using something like this: Y_sim = (-2.5 * np.tanh(2.0*X))*T_sim + W[:, support] @ coefs_Y + noise, where coefs_Y is random (we are using the same support in both simulations).
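A sketch of the outcome simulation, again on stand-in data with assumed sizes and noise scales. One practical detail it illustrates: if X is stored as an (n, 1) matrix and T_sim as a length-n vector, multiplying them directly broadcasts to an (n, n) array, so the modifier is flattened first.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_w = 2_000, 12                                # hypothetical sizes
W = rng.normal(size=(n, n_w))                     # stand-in for standardized confounders
X = rng.normal(size=(n, 1))                       # stand-in for standardized INCOME

# Same sparse support for the treatment and the outcome
support = np.zeros(n_w, dtype=bool)
support[rng.choice(n_w, size=3, replace=False)] = True
coefs_T = rng.uniform(0.0, 0.3, size=support.sum())
coefs_Y = rng.uniform(0.0, 0.3, size=support.sum())

T_sim = 0.8 + W[:, support] @ coefs_T + rng.normal(0.0, 0.1, size=n)

# Flatten X so the product with T_sim is element-wise rather than (n, n)
true_elasticity = -2.5 * np.tanh(2.0 * X[:, 0])
Y_sim = true_elasticity * T_sim + W[:, support] @ coefs_Y + rng.normal(0.0, 0.1, size=n)
```

Keeping true_elasticity as its own array is convenient later, since it is the ground truth the fitted models are compared against.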

(c)

Using the code from the vignette, fit a Causal Forest (CausalForestDML) and a linear model (LinearDML) to the simulated data.

For both models, plot the predicted elasticity as a function of INCOME, showing the confidence intervals and the true relationship.

Also plot the predicted elasticity against the true elasticity for each simulated observation. Report the true and estimated ATE with confidence intervals. Comment on the performance of both models on the simulated data.
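For intuition about what the DML estimators are doing, the sketch below hand-rolls the residual-on-residual idea with NumPy only. It is not EconML's implementation and is no substitute for the vignette code: the nuisance models are plain least squares, the cross-fitting uses two folds, and the simulated effect is linear in the modifier (an assumption made so that a linear final stage can recover it exactly).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4_000
W = rng.normal(size=(n, 5))                       # made-up confounders
x = rng.normal(size=n)                            # made-up modifier
T = 0.8 + W @ np.array([0.3, 0.2, 0.0, 0.0, 0.0]) + rng.normal(0.0, 0.3, size=n)
true_effect = -2.0 + 0.5 * x                      # assumed linear heterogeneity
Y = true_effect * T + W @ np.array([0.4, 0.0, 0.1, 0.0, 0.0]) + rng.normal(0.0, 0.3, size=n)

def ols_predict(A_train, b_train, A_test):
    """Fit least squares on the training rows and predict on the test rows."""
    coef, *_ = np.linalg.lstsq(A_train, b_train, rcond=None)
    return A_test @ coef

# Nuisance features: intercept, modifier, confounders, and their interactions
features = np.column_stack([np.ones(n), x, W, x[:, None] * W])

# Two-fold cross-fitting: residuals for each fold come from a model fit on the other
folds = rng.permutation(n) % 2
Y_res, T_res = np.empty(n), np.empty(n)
for k in (0, 1):
    train, test = folds != k, folds == k
    Y_res[test] = Y[test] - ols_predict(features[train], Y[train], features[test])
    T_res[test] = T[test] - ols_predict(features[train], T[train], features[test])

# Final stage: Y_res is approximately (a + b * x) * T_res
D = np.column_stack([T_res, x * T_res])
(a_hat, b_hat), *_ = np.linalg.lstsq(D, Y_res, rcond=None)

ate_hat = np.mean(a_hat + b_hat * x)
print(f"effect(x) = {a_hat:.2f} + {b_hat:.2f} * x, ATE = {ate_hat:.2f}, true ATE = {true_effect.mean():.2f}")
```

A linear final stage like this one cannot follow the tanh-shaped elasticity from part (b), which is the kind of difference the comparison between LinearDML and CausalForestDML is meant to expose.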

Problem 2: Checking for Overlap

In order for Causal ML to be successful, there needs to be variation in the treatment variable for all combinations of the confounder variables. For a continuous treatment, it is important that there is residual variation left over after the treatment has been predicted from the confounders. Keeping with the structure of the original vignette (same definition of W, X, Y, and T), use LassoCV to predict \(T\) using \(W\). Calculate and report the \(R^2\). Does this value of \(R^2\) support the suitability of this dataset for Causal Inference?
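One way to set up this check is sketched below on synthetic data. The choice of out-of-fold predictions via cross_val_predict is an assumption on my part (an in-sample \(R^2\) from a single fit is also a reasonable reading of the task), and the data-generating numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
n, n_w = 1_000, 10                                # hypothetical sizes
W = rng.normal(size=(n, n_w))                     # stand-in for standardized confounders
# Treatment partly explained by two confounders, plus independent variation
T = 0.8 + 0.3 * W[:, 0] - 0.2 * W[:, 1] + rng.normal(0.0, 0.25, size=n)

# Out-of-fold predictions of T from W, so the R^2 is not inflated by overfitting
T_hat = cross_val_predict(LassoCV(cv=5), W, T, cv=5)
r2 = r2_score(T, T_hat)

# What remains is the variation in T that the confounders cannot explain
T_res = T - T_hat
print(f"R^2 = {r2:.3f}, residual std of T = {T_res.std():.3f}")
```

An \(R^2\) close to 1 would mean the confounders nearly determine the price, leaving little residual variation from which to learn an effect.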

Problem 3: Fitting and Interpreting the Model

(a)

Perform a train-test split on the data. Repeat the vignette's fit on the training set (you can copy their code with suitable modifications to make it work), using a CausalForestDML model to learn the effect of income on price elasticity. Plot the price elasticity versus income with confidence intervals. Calculate the average treatment effect and its confidence interval.

(b)

Compute the R-score on the testing/validation set to determine the strength of the heterogeneity. What is your interpretation of the R-score value? Fit a LinearDML model in the same manner and compare the R-score to the CausalForestDML. Is either model noticeably better?
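To help interpret the number, the sketch below computes an R-score by hand from residualized outcomes and treatments. It reflects my understanding of the quantity (one minus the ratio of the model's residual loss to that of the best constant effect); EconML's own scorer may differ in details such as how the residuals and the baseline are fit. The function r_score and all data here are hypothetical.

```python
import numpy as np

def r_score(Y_res, T_res, cate_pred):
    """1 - loss(model) / loss(best constant effect), on residualized Y and T."""
    model_loss = np.mean((Y_res - cate_pred * T_res) ** 2)
    # Best single effect in the least-squares sense (no heterogeneity)
    const_effect = np.sum(Y_res * T_res) / np.sum(T_res ** 2)
    base_loss = np.mean((Y_res - const_effect * T_res) ** 2)
    return 1.0 - model_loss / base_loss

rng = np.random.default_rng(5)
n = 5_000
x = rng.normal(size=n)                            # made-up modifier
T_res = rng.normal(0.0, 0.5, size=n)              # made-up treatment residuals
true_cate = -2.5 * np.tanh(2.0 * x)
Y_res = true_cate * T_res + rng.normal(0.0, 0.2, size=n)

score_hetero = r_score(Y_res, T_res, true_cate)                     # heterogeneous effects
score_const = r_score(Y_res, T_res, np.full(n, true_cate.mean()))   # one effect for everyone
print(f"heterogeneous: {score_hetero:.3f}, constant: {score_const:.3f}")
```

Under this definition a score near 0 (or below) means the estimated heterogeneity explains no more than a single constant elasticity would, while larger positive values indicate that the heterogeneity is real and captured by the model.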

(c)

Compute a sensitivity check using the sensitivity_interval method of your fit model. This determines how strong an _unobserved confounder_ would have to be to change the results of your analysis in a meaningful way. The method recalculates confidence intervals for the _ATE_ based on two parameters, c_t and c_y, which are the fractions of residual variance explained by the hypothetical confounder for the treatment and the target respectively. One method for determining the range of c_t and c_y to explore is to check the values of c_t and c_y for existing confounders. If you were to do this check, you would find that the most important confounder is the feat variable (whether the item was advertised that week), and the range should be up to c_t=0.1 and c_y=0.3. Compute the sensitivity check for the most extreme scenario, with c_t=0.1 and c_y=0.3. What are the resulting confidence intervals for the ATE? Do they contain 0?

Problem 4: CATE for Brands and Income

The three brands of orange juice have different price points and are targeted at different customer segments, with dominicks as the discount brand, minute.maid as the mid-range brand, and tropicana as the premium brand. Move the brand variables from the confounder matrix W to the modifier matrix X and refit the model. Calculate the feature importances for all the modifiers and plot the elasticity as a function of income for each of the three brands. How do the elasticities differ by brand?
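The bookkeeping for the new modifier matrix might look like the sketch below. The toy rows and the column names brand and INCOME are assumptions about how the file is laid out, and scaling is omitted for brevity. The per-brand grids are one possible way to get the inputs needed for plotting elasticity against income separately for each brand.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names `brand` and `INCOME` are assumed, not confirmed
df = pd.DataFrame({
    "brand": ["dominicks", "minute.maid", "tropicana", "dominicks"],
    "INCOME": [10.2, 10.9, 11.1, 10.5],
})

# Modifier matrix: income plus one dummy column per brand
brand_dummies = pd.get_dummies(df["brand"], prefix="brand", dtype=float)
X = pd.concat([df[["INCOME"]], brand_dummies], axis=1)

# One evaluation grid per brand: income varies, exactly one brand dummy is switched on
income_grid = np.linspace(df["INCOME"].min(), df["INCOME"].max(), 5)
grids = {}
for col in brand_dummies.columns:
    grid = pd.DataFrame(0.0, index=range(len(income_grid)), columns=X.columns)
    grid["INCOME"] = income_grid
    grid[col] = 1.0
    grids[col] = grid

print(X.columns.tolist())
```

Each grid can then be passed to the fitted model's effect method in place of the income-only grid used earlier, giving one elasticity curve per brand.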