Shapley Values and Regressions
Multicollinearity Approach with Shapley Values to Regressions
Lloyd Shapley was an American Mathematician who won the Nobel-Prize in economics. One of his greatest contributions was the concept of a Shapley value in the field of cooperative game theory. Recently, econometricians have adopted Shapley values as a way to tackle multicollinearity in regressions. It is a highly peculiar approach since one can view a regression as a game between multiple independent variables X competing to see who has the greatest impact on the dependent variable Y. Consider a simple regression with 3 independent variables: $$ Y=\alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$ The approach utilized to dealing with muticollinearity with Shapley values involves running multiple permutations of regressions to see which factors truly have the strongest influence on Y and averaging out those contributions. In our three case scenario the total number of permutations of regressions we could run is equal to the factorial of the number of our independent variables X (6): $$ Y=\alpha + \beta_1 X_1 + \epsilon$$ $$ Y=\alpha + \beta_2 X_2 + \epsilon$$ $$ Y=\alpha + \beta_3 X_3 + \epsilon$$ $$ Y=\alpha + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$ $$ Y=\alpha + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$ $$ Y=\alpha + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$ $$ Y=\alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_2 X_3 + \epsilon$$ First we would check how much each independent variable X impacts the dependent variable Y. We then check how different combinations of two variables impact Y. At the end we can check how much of an impact they all have together. This gives us our R-squared values for all of the possible ways Y can be impacted. Now we utilize the Shapley formula which can be basically expressed for econometrics as: $$ \varphi (\beta_i )=\frac {1}{!X} \sum_{coalitions \; excluding \; beta_i}\frac{marginal \; contribution \; of \; \beta_i \; to \; coalition} {number \; of \; coalitions \; excluding \; \beta_i \; i} $$ This will give us the correct beta for each independent variable. Suppose we are a chip manufacturer like Intel and we are looking to adjust our budget to allocate capital across three segments which we think drive our revenue for chip sales. We think that chip speed, chip density, and chip quality are our main factors that impact our sales. We want to figure out how much we could allocate to each department in order to get the best bang for the buck in the future. We at first construct our regression as: $$ Y_{sales} =\alpha + \beta_{s} X_{s} + \beta_{d} X_{d} + \beta_{q} X_{q} + \epsilon$$ Where s=speed, d=density, q=quality. I've generated a randomized data set for these variables and made them all drift upwards. The black graph is the dependent sales variable while the blue are the independent variables.

The first thing we notice when we run this regression is multicollinearity. Our coefficients are all over the place. This doesn't help us much when we have to allocate our budget for what will drive sales.
OLS Regression Results ============================================================================== Dep. Variable: sales R-squared: 0.976 Model: OLS Adj. R-squared: 0.976 Method: Least Squares F-statistic: 1.351e+04 Date: Sun, 23 Sep 2018 Prob (F-statistic): 0.00 Time: 12:16:38 Log-Likelihood: -3629.0 No. Observations: 999 AIC: 7266. Df Residuals: 995 BIC: 7286. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -868.3913 16.928 -51.300 0.000 -901.609 -835.173 speed 5.0703 0.098 51.703 0.000 4.878 5.263 quality 0.3679 0.063 5.883 0.000 0.245 0.491 density 4.3654 0.244 17.869 0.000 3.886 4.845 ============================================================================== Omnibus: 11.081 Durbin-Watson: 0.069 Prob(Omnibus): 0.004 Jarque-Bera (JB): 11.151 Skew: 0.256 Prob(JB): 0.00379 Kurtosis: 3.072 Cond. No. 9.92e+03 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 9.92e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Presence of multicollinearity
Instead of applying ridge or lasso regressions which usually involve certain parameters that one has to estimate/guess in order to adjust the weights of each beta coefficient to deal with multicollinearity, we rather apply the Shapley formula where no such parameters are necessary and wherein for the values we substitute the adjusted R-squared values for each of the possible regressions. We run all of our possible combinations of regressions which give us the following R-squared values: $$ R^2_{speed}= 0.965 $$ $$ R^2_{quality}= 0.743 $$ $$ R^2_{density}= 0.899 $$ $$ R^2_{speed, quality}= 0.968 $$ $$ R^2_{speed, density}= 0.975 $$ $$ R^2_{quality, density}= 0.912 $$ $$ R^2_{speed, quality, density}= 0.976 $$ We know that these regressions are spurious and colinear so we use the Shapley value formula on these R-squared outputs. Each dependent variable can be seen as competing for its relevance in a game where the game is who's the most relevant at guessing sales. We can check how relevant it is not just by how much it impacts the dependent variable by itself, but also as the sum of all of its marginal contributions when it is a part of a coalition with other dependent variables (part of a multiple regression). We apply the Shapley formula as follows:
Showing with an example, we wind up getting a value of 0.39 for our Speed coefficient. Solving for all of the coefficients with Shapley we get: $$ \beta_{speed}= 0.39 $$ $$ \beta_{quality}= 0.25 $$ $$ \beta_{density}= 0.33 $$ These are our adjusted betas after being treated for multicollinearity. We can now use this information to help with our Intel capital budgeting decision by using the beta coefficients as our multipliers. If we have $100 million to allocate to next year's R&D budget, we'd allocate them as:
$$ $39,000,000 = speed $$ $$ $25,000,000 = quality $$ $$ $33,000,000 = density $$
Python Code for Multicollinearity Treatment with Shapley Values
Since doing the above calculation is an incredible nightmare, here is the code for applying Shapley values for multiple regressions. An interesting note to make is that the Shapley formula is factorial in its time complexity. This means that if you have 3 independent variables, then your Shapley calculation involves 6 regressions as was shown above in our Intel example. If you have 10 independent variable, then your Shapley calculation involves 3,628,800 regressions!
import numpy as np import pandas as pd import statsmodels.formula.api as sm import shap yd=pd.read_csv('sales.csv') # dependent variable data xa=pd.read_csv('speed.csv') # independent variable 1 xb=pd.read_csv('quality.csv') # independent variable 2 xc=pd.read_csv('density.csv') # independent variable 3 yd=pd.DataFrame(yd) xa=pd.DataFrame(xa) xb=pd.DataFrame(xb) xc=pd.DataFrame(xc) series=pd.concat([yd,xa,xb,xc], axis=1) series_unfiltered=pd.concat([yd,xa,xb,xc], axis=1) series_unfiltered.columns=['sales','speed','quality','density'] series.columns=['sales','speed','quality','density'] result=sm.ols(formula='sales ~ speed + quality + density', data=series_unfiltered).fit() print result.summary() # OLS shows us multicollinearity def reg_m(y,x): ones=np.ones(len(x[0])) X=sm.add_constant(np.column_stack((x[0],ones))) for ele in x[1:]: X=sm.add_constant(np.column_stack((ele,X))) results=sm.OLS(y,X).fit() return results regression_combos=shap.power_set(np.array(series.columns[1:])) characteristic_fx={} for i in xrange(0,len(regression_combos)): deps='+'.join(regression_combos[i]) result=sm.ols(formula='sales ~ '+ deps, data=series_unfiltered).fit() print str(deps) + ' with R-squared of: '+ str(round(result.rsquared,3)) characteristic_fx[str(','.join(regression_combos[i]))]= round(result.rsquared,5) player_list = list(np.array(series.columns[1:])) coalition_dictionary = characteristic_fx g = shap.Coop_Game(player_list,coalition_dictionary) print player_list # our list of independent variables print coalition_dictionary # our list of all of the R-squared for all of the possible combinations print 'Shapley adjusted beta coefficients are: ' + str(g.shapley())