Smoothing EBMs
In this demonstration notebook, we create an Explainable Boosting Machine (EBM) on a specially designed synthetic dataset. Because we control the data generation process, we can visually assess how well the EBM recovers the original functions that were used to create the data. The functions we are trying to recover are defined in the dataset generation code on GitHub: synthetic generation code
This notebook can be found in our examples folder on GitHub.
# install interpret if not already installed
try:
    import interpret
except ModuleNotFoundError:
    !pip install --quiet interpret scikit-learn
# boilerplate - generate the synthetic dataset and split into test/train
import numpy as np
from sklearn.model_selection import train_test_split
from interpret.utils import make_synthetic
from interpret import show
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
seed = 42
X, y, names, types = make_synthetic(classes=None, n_samples=50000, missing=False, seed=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
Train the Explainable Boosting Machine (EBM)
The synthetic dataset contains many smooth functions. To fit these smoothly varying relationships well, we set the smoothing_rounds parameter when fitting the EBM. During the smoothing rounds, boosting starts in a non-greedy manner by selecting random split points when constructing the internal decision trees. This helps avoid initial overfitting and establishes smooth baseline partial responses before boosting switches to the greedy approach, which is better at fitting any remaining sharp transitions in the partial responses. We also use the reg_alpha regularization parameter to further smooth the results. EBMs additionally support reg_lambda and max_delta_step, which may be useful in some cases.
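To make the random-then-greedy schedule concrete, here is a toy sketch in plain Python. It is not interpret's implementation: it boosts depth-1 stumps on a single feature, picking a random split point during the first `smoothing_rounds` rounds and the best squared-error split afterwards. The helper names (`side_means`, `sse_after_split`, `boost_stumps`), the learning rate, and the data are all hypothetical and chosen only for illustration.

```python
import math
import random


def side_means(xs, residuals, split):
    """Mean residual on each side of `split` (x <= split goes left)."""
    left = [r for x, r in zip(xs, residuals) if x <= split]
    right = [r for x, r in zip(xs, residuals) if x > split]
    return sum(left) / len(left), sum(right) / len(right)


def sse_after_split(xs, residuals, split):
    """Sum of squared residuals left over after fitting a stump at `split`."""
    left_mean, right_mean = side_means(xs, residuals, split)
    return sum(
        (r - (left_mean if x <= split else right_mean)) ** 2
        for x, r in zip(xs, residuals)
    )


def boost_stumps(xs, ys, rounds, smoothing_rounds, learning_rate=0.1, seed=0):
    """Toy 1-D boosting: random splits first, greedy splits afterwards."""
    rng = random.Random(seed)
    # Drop the largest value so both sides of every split are non-empty.
    candidates = sorted(set(xs))[:-1]
    preds = [0.0] * len(xs)
    for round_index in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        if round_index < smoothing_rounds:
            # Smoothing phase (toy version): split point chosen at random.
            split = rng.choice(candidates)
        else:
            # Greedy phase: split point that minimizes squared error.
            split = min(candidates, key=lambda s: sse_after_split(xs, residuals, s))
        left_mean, right_mean = side_means(xs, residuals, split)
        preds = [
            p + learning_rate * (left_mean if x <= split else right_mean)
            for x, p in zip(xs, preds)
        ]
    return preds


# Made-up smooth target: a cosine sampled on an evenly spaced grid.
xs = [i / 10 for i in range(60)]
ys = [math.cos(x) for x in xs]
preds = boost_stumps(xs, ys, rounds=60, smoothing_rounds=40)
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(f"training MSE after boosting: {mse:.4f}")
```

Because the random rounds spread many small steps across the feature range instead of repeatedly cutting at the same locally optimal points, the early shape tends to be built from many overlapping steps. A real EBM additionally cycles over features and uses binned data, which this sketch leaves out.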
For some datasets with large outliers, increasing the validation_size and/or taking the median model from the outer bags might be helpful, as described in interpretml/interpret#548
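As a rough illustration of why a median across outer bags can be more robust than a mean, the sketch below aggregates made-up per-bag scores bin by bin. The numbers are invented and the `bagged_scores` list is a hypothetical stand-in, not output from interpret. Recent versions of interpret appear to expose per-bag scores on a fitted model as `bagged_scores_`, but treat that name as an assumption and check it against your installed version.

```python
from statistics import mean, median

# Hypothetical per-bag scores for one term: one row per outer bag, one column per bin.
bagged_scores = [
    [0.10, 0.50, 0.90],
    [0.12, 0.48, 0.88],
    [0.11, 0.52, 0.91],
    [0.09, 0.49, 5.00],  # one bag distorted by an outlier in the last bin
]

# Aggregate across bags, one bin (column) at a time.
mean_scores = [mean(column) for column in zip(*bagged_scores)]
median_scores = [median(column) for column in zip(*bagged_scores)]

print("mean:  ", mean_scores)
print("median:", median_scores)
```

In the last bin the mean is pulled far toward the distorted bag, while the median stays close to the other three bags.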
from interpret.glassbox import ExplainableBoostingRegressor
ebm = ExplainableBoostingRegressor(
    feature_names=names,
    feature_types=types,
    interactions=3,
    smoothing_rounds=5000,
    reg_alpha=10.0,
)
ebm.fit(X_train, y_train)
ExplainableBoostingRegressor(feature_names=['feature_0', 'feature_1', 'feature_2', 'feature_3_integers', 'feature_4', 'feature_5', 'feature_6', 'feature_7_unused', 'feature_8_low_cardinality', 'feature_9_high_cardinality'], feature_types=['continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'continuous', 'nominal', 'nominal'], interactions=3, reg_alpha=10.0, smoothing_rounds=5000)
Global Explanations
The graphs below confirm that the EBM successfully recovered the original data generation functions for this particular problem.
# Feature 0 - Cosine partial response generated on uniformly distributed data.
show(ebm.explain_global(), 0)
# Feature 1 - Sine partial response generated on normally distributed data.
show(ebm.explain_global(), 1)
# Feature 2 - Squared partial response generated on exponentially distributed data.
show(ebm.explain_global(), 2)
# Feature 3 - Linear partial response generated on Poisson distributed integers.
show(ebm.explain_global(), 3)
# Feature 4 - Square wave partial response generated on a feature with correlations
# to features 0 and 1 with added normally distributed noise.
show(ebm.explain_global(), 4)
# Feature 5 - Sawtooth wave partial response generated on a feature with a conditional
# correlation to feature 2 with added normally distributed noise.
show(ebm.explain_global(), 5)
# Feature 6 - exp(x) partial response generated on a feature with interaction effects
# between features 2 and 3 with added normally distributed noise.
show(ebm.explain_global(), 6)
# Feature 7 - Unused in the generation function. Should have minimal importance.
show(ebm.explain_global(), 7)
# Feature 8 - Linear partial response generated on a low cardinality categorical feature.
# The category strings end in integers that indicate the increasing order.
show(ebm.explain_global(), 8)
# Feature 9 - Linear partial response generated on a high cardinality categorical feature.
# The category strings end in integers that indicate the increasing order.
show(ebm.explain_global(), 9)
# Interaction 0 - Pairwise interaction generated by XORing the sign of feature 0
# with the least significant bit of the integers from feature 3.
show(ebm.explain_global(), 10)
# Interaction 1 - Pairwise interaction generated by multiplying feature 1 and 2.
show(ebm.explain_global(), 11)
# Interaction 2 - Pairwise interaction generated by multiplying feature 3 and 8.
show(ebm.explain_global(), 12)
For RMSE regression, the EBM's intercept should be identical to the mean of the training targets
print(np.average(y_train))
print(ebm.intercept_)
0.8689547089058739
0.8689547089058739
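The match follows from a short derivation, assuming the intercept acts as the best constant predictor while the additive terms are centered around zero on the training data (an assumption about how the EBM is normalized, not something shown in this notebook). The constant $c$ that minimizes the sum of squared errors is the mean of the targets:

```latex
\frac{d}{dc}\sum_{i=1}^{n}(y_i - c)^2
  = -2\sum_{i=1}^{n}(y_i - c) = 0
\quad\Longrightarrow\quad
c = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}
```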
Importances of the features and pairwise terms
show(ebm.explain_global())
Evaluate EBM performance
from interpret.perf import RegressionPerf
ebm_perf = RegressionPerf(ebm).explain_perf(X_test, y_test, name='EBM')
print("RMSE: " + str(ebm_perf._internal_obj["overall"]["rmse"]))
show(ebm_perf)
RMSE: 0.26813750938453734
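The `_internal_obj` attribute used above is private (leading underscore), so it may change between releases. As a cross-check, RMSE can also be computed directly from predictions. The `rmse` helper below is a hypothetical plain-Python sketch, shown on made-up numbers; with this notebook's objects the call would be `rmse(y_test, ebm.predict(X_test))`, which should give the same value up to floating-point rounding.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy check with made-up targets and predictions.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```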