Smoothing EBMs

In this demonstration notebook, we create an Explainable Boosting Machine (EBM) on a specially designed synthetic dataset. Because we control the data generation process, we can visually assess how well the EBM recovers the original functions used to create the data. To understand the underlying functions we are trying to recover, you can examine the full dataset generation code on GitHub: synthetic generation code

This notebook can be found in our examples folder on GitHub.

# install interpret if not already installed
try:
    import interpret
except ModuleNotFoundError:
    !pip install --quiet interpret scikit-learn
# boilerplate - generate the synthetic dataset and split into test/train

import numpy as np
from sklearn.model_selection import train_test_split
from interpret.utils import make_synthetic
from interpret import show

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

seed = 42

X, y, names, types = make_synthetic(classes=None, n_samples=50000, missing=False, seed=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)

Train the Explainable Boosting Machine (EBM)

The synthetic dataset contains many smooth functions. To fit these smoothly varying relationships effectively, we use the ‘smoothing_rounds’ parameter when fitting the EBM. ‘smoothing_rounds’ starts the boosting process in a non-greedy manner by selecting random split points when constructing the internal decision trees. This strategy helps avoid early overfitting and establishes smooth baseline partial responses before switching to a greedy approach, which is better at fitting any remaining sharp transitions in the partial responses. We also use the reg_alpha regularization parameter to further smooth the results. EBMs additionally support reg_lambda and max_delta_step, which might be useful in some cases.

For some datasets with large outliers, increasing the validation_size and/or taking the median model from the outer bags might be helpful as described here: interpretml/interpret#548

from interpret.glassbox import ExplainableBoostingRegressor

ebm = ExplainableBoostingRegressor(
    feature_names=names,
    feature_types=types,
    interactions=3,
    smoothing_rounds=5000,
    reg_alpha=10.0,
)
ebm.fit(X_train, y_train)
ExplainableBoostingRegressor(feature_names=['feature_0', 'feature_1',
                                            'feature_2', 'feature_3_integers',
                                            'feature_4', 'feature_5',
                                            'feature_6', 'feature_7_unused',
                                            'feature_8_low_cardinality',
                                            'feature_9_high_cardinality'],
                             feature_types=['continuous', 'continuous',
                                            'continuous', 'continuous',
                                            'continuous', 'continuous',
                                            'continuous', 'continuous',
                                            'nominal', 'nominal'],
                             interactions=3, reg_alpha=10.0,
                             smoothing_rounds=5000)

Global Explanations

The visual graphs below confirm that the EBM was able to successfully recover the original data generation functions for this particular problem.

# Feature 0 - Cosine partial response generated on uniformly distributed data.

show(ebm.explain_global(), 0)