{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# EBM Internals - Multiclass\n",
"\n",
"This is part 3 of a 3 part series describing EBM internals and how to make predictions. For part 1, click [here](./ebm-internals-regression.ipynb). For part 2, click [here](./ebm-internals-classification.ipynb).\n",
"\n",
"In this part 3 we'll cover multiclass, specified bin cuts, term exclusion, and unknown values. Before reading this part you should be familiar with the information in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# boilerplate\n",
"from interpret import show\n",
"from interpret.glassbox import ExplainableBoostingClassifier\n",
"import numpy as np\n",
"\n",
"from interpret import set_visualize_provider\n",
"from interpret.provider import InlineProvider\n",
"set_visualize_provider(InlineProvider())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# make a dataset composed of a nominal, an unused feature, and a continuous \n",
"X = [[\"Peru\", \"\", 7], [\"Fiji\", \"\", 8], [\"Peru\", \"\", 9], [None, \"\", None]]\n",
"y = [6000, 5000, 4000, 6000] # integer classes\n",
"\n",
"# Fit a classification EBM without interactions\n",
"# Specify exact bin cuts for the continuous feature\n",
"# Exclude the middle feature during fitting\n",
"# Eliminate the validation set to handle the small dataset\n",
"ebm = ExplainableBoostingClassifier(\n",
" interactions=0, \n",
" feature_types=['nominal', 'nominal', [7.25, 9.0]], \n",
" exclude=[(1,)],\n",
" validation_size=0, outer_bags=1, min_samples_leaf=1, min_hessian=1e-9)\n",
"ebm.fit(X, y)\n",
"show(ebm.explain_global())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"
\n",
"
\n",
"
\n",
"
\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.classes_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are integers, but we also accept strings as seen in [part 2](./ebm-internals-classification.ipynb)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.feature_types)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example we passed feature_types into the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier. Per scikit-learn convention, this was recorded unmodified in the ebm object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.feature_types_in_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature_types passed into \\_\\_init\\_\\_ were actualized into the base feature types of ['nominal', 'nominal', 'continuous']."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.feature_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"feature_names were not specified in the call to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, so it was set to None following the scikit-learn convention of recording \\_\\_init\\_\\_ parameters unmodified."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.feature_names_in_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we did not specify feature names, some default names were created for the model. If we had passed feature_names to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, or if we had used a Pandas dataframe with column names, then ebm.feature_names_in_ would have contained those names. Following scikit-learn's [SLEP007 convention](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html), we recorded this in ebm.feature_names_in_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.term_features_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the call to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, we specified exclude=[(1,)], which means we excluded the middle feature in the list of terms for the model. The middle feature is thus missing from the list of terms in ebm.term_features_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.term_names_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_names_ is also missing the middle feature since ebm.term_features_ is missing that feature "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.bins_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.bins_ is a per-feature attribute, so the middle feature is listed here. We see however that the middle feature does not have a binning definition since it is not considered when making predictions with the model.\n",
"\n",
"These bins are structured as described in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb). One change to note though is that the continuous feature bin cuts are the same as the bin cuts [7.25, 9.0] specified in the feature_types parameter to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier.\n",
"\n",
"It is also noteworthy that the last bin cut specified is exactly equal to the largest feature value of 9.0. In this situation where a feature value is identical to the cut value, the feature gets placed into the upper bin."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For multiclass, ebm.intercept_ is an array containing a logit for each of the predicted classes in ebm.classes_. This behavior is identical to how other scikit-learn multiclass classifiers generate one logit per class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.term_scores_[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_scores_[0] is the lookup table for the nominal categorical feature containing country names. For multiclass, each bin consists of an array of logits with 1 logit per class being predicted. In this example, each row corresponds to a bin. There are 4 bins in the outer index and 3 class logits in the inner index.\n",
"\n",
"Missing values are once again placed in the 0th bin index, shown above as the first row of 3 logits. The unknown bin is the last row of zeros.\n",
"\n",
"Since this feature is a nominal categorial, we use the dictionary {'Fiji': 1, 'Peru': 2} to lookup which row of logits to use for each categorical string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(ebm.term_scores_[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_scores_[1] is the lookup table for the continuous feature. Once again, the 0th and last index are for missing values, and unknown values respectively. This particular example has 5 bins consisting of the 0th missing bin index, the three partitions from the 2 cuts, and the unknown bin index. Each row is a single bin that contains 3 class logits. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"