EBM Internals - Regression#
This is part 1 of a 3 part series describing EBM internals and how to make predictions. For part 2, click here. For part 3, click here.
In this part 1 we’ll cover the simplest useful EBM: a regression model that does not have interactions, missing values, or other complications.
At their core, EBMs are generalized additive models where the score contributions from individual features and interactions are added together to make a prediction. Each individual score contribution is determined using a lookup table. Before doing the lookup, we first need to discretize continuous features and assign bin indexes to categorical features.
Regression is the simplest form of EBM model because the final sum is the actual prediction without requiring an inverse link function.
# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingRegressor
import numpy as np
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
# make a dataset composed of a nominal categorical, and a continuous feature
X = [["Peru", 7.0], ["Fiji", 8.0], ["Peru", 9.0]]
y = [450.0, 550.0, 350.0]
# Fit a regression EBM without interactions
# Eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingRegressor(
interactions=0,
validation_size=0, outer_bags=1, min_samples_leaf=1, min_hessian=1e-9)
ebm.fit(X, y)
show(ebm.explain_global())
Let’s have a look at some of the most important attributes of the ExplainableBoostingRegressor.
print(ebm.feature_types_in_)
['nominal', 'continuous']
Since we did not pass a feature_types parameter to the __init__ function of the ExplainableBoostingRegressor, some reasonable feature type guesses were assigned. Those guesses were recorded in ebm.feature_types_in_. We support the following base feature types: ‘continuous’, ‘nominal’, and ‘ordinal’. For evaluation purposes, ‘nominal’ and ‘ordinal’ can be treated identically as they are both categoricals and are represented the same way in the model. ‘nominal’ and ‘ordinal’ are treated differently during training, but we’re focusing just on prediction here.
print(ebm.bins_)
[[{'Fiji': 1, 'Peru': 2}], [array([7.5, 8.5])]]
ebm.bins_ defines how to bin both categorical (‘nominal’ and ‘ordinal’) and ‘continuous’ features.
For categorical features we use a dictionary that maps the category strings to bin indexes. In this example, “Fiji” has been assigned bin #1 and “Peru” has been assigned bin #2.
Continuous feature binning is defined with a list of cut points that partition the continuous range into regions. In this example, the dataset has 3 unique values in the continuous feature: 7.0, 8.0, and 9.0. To separate these 3 values into 3 bins, we require 2 cut points. The EBM has chosen the cut points of 7.5 and 8.5, but it could have chosen any cut point values between 7.0 to 8.0 and between 8.0 to 9.0.
When making predictions involving continuous features, the feature values we receive can be anywhere in the continuous range between -infinity and +infinity. In this example therefore, our two cut points define 3 binned regions: bin #1 is [-inf, 7.5), bin #2 is [7.5, 8.5), and bin #3 is [8.5, +inf].
If there are any feature values that are equal to a bin cut value, they are placed into the upper bin choice. To convert a continuous feature into bins, we can use the numpy.digitize function with a slight adjustment that we’ll see below.
EBMs also include 2 special bins: the missing bin, and the unknown bin. The missing bin is the bin that we use if there are any feature values that are missing, like NaN or ‘None’. The unknown bin is used whenever we see a categorical value during prediction that was not present in the training set. For example, if our testing dataset had the categorical value “Vietnam”, or “Brazil”, then we would use the unknown bin in this example since those countries did not appear in the training set.
The missing bin is always located at the 0th index, and the unknown bin is always at the last index.
print(ebm.term_scores_[0])
[ 0. 84.55036163 -42.27518082 0. ]
ebm.term_scores_ contains the lookup tables for each additive term. ebm.term_scores_[0] is the lookup table for the first feature in this example, which is the categorical feature containing country strings.
Since the first feature is a categorial, we use the dictionary from ebm.bins_[0], which is {‘Fiji’: 1, ‘Peru’: 2} to lookup which bin to use when either of those strings appear as a feature value. If we received a feature value of NaN, then we’d use the score value at index 0. If the feature value was “Fiji”, we’d use the score value at index 1. If the feature value was “Peru”, we’d use the score value at index 2. If the feature value was anything else, we’d use the score value at index 3.
Another thing to note is that the values stored in ebm.term_scores_ are identical to the values shown in the global explanation graphs (see above). This is true for both categorical and continuous features, and also for interactions. This full model transparency is what makes EBMs a glassbox model.
print(ebm.term_scores_[1])
[ 0. 42.27518082 15.44963837 -57.72481918 0. ]
ebm.term_scores_[1] is the lookup table for the continuous feature in our dataset. The 0th index is again reserved for missing values, and the last index is again reserved for unknown values. In the context of a continuous feature, the unknown bin is used for anything that cannot be converted to a float, so if we had been given the string “BAD_VALUE” instead of a number, then we’d use the last bin. The unknown bin’s score value can optionally be set to NaN if you would prefer an error condition.
The 3 remaining scores in the middle correspond to the 3 binned regions, which in our example are: bin #1 [-numpy.inf, 7.5), bin #2 [7.5, 8.5), and bin #3 [8.5, +numpy.inf].
Once again, the scores in ebm.term_scores_[1] match the values shown in the global explanation graphs for feature_0001 (see above).
print(ebm.intercept_)
450.00000000000034
ebm.intercept_ should usually be very close to the base score. In this example, the intercept is indeed very close to the average of the three ‘y’ values: numpy.average([450, 550, 350]) == 450.
When making predictions, we start from the intercept value and add the scores from each lookup table.
Sample code
Finally, here’s some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like interactions, missing values, unknown values, or classification.
If you need a drop-in complete function that can work in all EBM scenarios, see the multiclass example in part 3 which handles regression and binary classification in addition to multiclass and all the other nuances.
sample_scores = []
for sample in X:
# start from the intercept for each sample
score = ebm.intercept_
print("intercept: " + str(score))
# we have 2 features, so add their score contributions
for feature_idx, feature_val in enumerate(sample):
bins = ebm.bins_[feature_idx][0]
if isinstance(bins, dict):
# categorical feature
bin_idx = bins[feature_val]
else:
# continuous feature. bins is an array of cut points
# add 1 because the 0th bin is reserved for 'missing'
bin_idx = np.digitize(feature_val, bins) + 1
local_score = ebm.term_scores_[feature_idx][bin_idx]
# local_score is also the local feature importance (see plot below)
print(ebm.feature_names_in_[feature_idx] + ": " + str(local_score))
score += local_score
sample_scores.append(score)
print()
print("PREDICTIONS:")
print(ebm.predict(X))
print(np.array(sample_scores))
intercept: 450.00000000000034
feature_0000: -42.27518081674776
feature_0001: 42.27518081674705
intercept: 450.00000000000034
feature_0000: 84.55036163349553
feature_0001: 15.4496383665047
intercept: 450.00000000000034
feature_0000: -42.27518081674776
feature_0001: -57.72481918325175
PREDICTIONS:
[450. 550. 350.]
[450. 550. 350.]
The predictions match the predictions from the ExplainableBoostingRegressor’s predict function. In this example, the EBM almost exactly recovered the original ‘y’ values of [450, 550, 350]. This is not surprising since this model is extremely overfit for illustration purposes.
Another interesting thing to note is that the values we retrieved from the lookup tables, which were assigned to the variable ‘local_score’, are identical to the values shown in the EBM local explanations.
show(ebm.explain_local(X, y), 0)