{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# EBM Internals - Multiclass\n",
    "\n",
    "This is part 3 of a 3 part series describing EBM internals and how to make predictions. For part 1, click [here](./ebm-internals-regression.ipynb). For part 2, click [here](./ebm-internals-classification.ipynb).\n",
    "\n",
    "In this part 3 we'll cover multiclass, specified bin cuts, term exclusion, and unseen values. Before reading this part you should be familiar with the information in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# boilerplate\n",
    "from interpret import show\n",
    "from interpret.glassbox import ExplainableBoostingClassifier\n",
    "import numpy as np\n",
    "\n",
    "from interpret import set_visualize_provider\n",
    "from interpret.provider import InlineProvider\n",
    "set_visualize_provider(InlineProvider())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a dataset composed of a nominal, an unused feature, and a continuous \n",
    "X = [[\"Peru\", \"\", 7], [\"Fiji\", \"\", 8], [\"Peru\", \"\", 9], [None, \"\", None]]\n",
    "y = [6000, 5000, 4000, 6000] # integer classes\n",
    "\n",
    "# Fit a classification EBM without interactions\n",
    "# Specify exact bin cuts for the continuous feature\n",
    "# Exclude the middle feature during fitting\n",
    "# Eliminate the validation set to handle the small dataset\n",
    "ebm = ExplainableBoostingClassifier(\n",
    "    interactions=0, \n",
    "    feature_types=['nominal', 'nominal', [7.25, 9.0]], \n",
    "    exclude=[(1,)],\n",
    "    validation_size=0, outer_bags=1, min_samples_leaf=1, min_hessian=1e-9)\n",
    "ebm.fit(X, y)\n",
    "show(ebm.explain_global())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n",
    "<br/>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.classes_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are integers, but we also accept strings as seen in [part 2](./ebm-internals-classification.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_types)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we passed feature_types into the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier. Per scikit-learn convention, this was recorded unmodified in the ebm object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_types_in_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature_types passed into \\_\\_init\\_\\_ were actualized into the base feature types of ['nominal', 'nominal', 'continuous']."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "feature_names were not specified in the call to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, so it was set to None following the scikit-learn convention of recording \\_\\_init\\_\\_ parameters unmodified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.feature_names_in_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we did not specify feature names, some default names were created for the model. If we had passed feature_names to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, or if we had used a Pandas dataframe with column names, then ebm.feature_names_in_ would have contained those names.  Following scikit-learn's [SLEP007 convention](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html), we recorded this in ebm.feature_names_in_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_features_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the call to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier, we specified exclude=[(1,)], which means we excluded the middle feature in the list of terms for the model. The middle feature is thus missing from the list of terms in ebm.term_features_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_names_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_names_ is also missing the middle feature since ebm.term_features_ is missing that feature "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.bins_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.bins_ is a per-feature attribute, so the middle feature is listed here. We see however that the middle feature does not have a binning definition since it is not considered when making predictions with the model.\n",
    "\n",
    "These bins are structured as described in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb). One change to note though is that the continuous feature bin cuts are the same as the bin cuts [7.25, 9.0] specified in the feature_types parameter to the \\_\\_init\\_\\_ function of the ExplainableBoostingClassifier.\n",
    "\n",
    "It is also noteworthy that the last bin cut specified is exactly equal to the largest feature value of 9.0. In this situation where a feature value is identical to the cut value, the feature gets placed into the upper bin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For multiclass, ebm.intercept_ is an array containing a logit for each of the predicted classes in ebm.classes_. This behavior is identical to how other scikit-learn multiclass classifiers generate one logit per class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_scores_[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_scores_[0] is the lookup table for the nominal categorical feature containing country names. For multiclass, each bin consists of an array of logits with 1 logit per class being predicted. In this example, each row corresponds to a bin. There are 4 bins in the outer index and 3 class logits in the inner index.\n",
    "\n",
    "Missing values are once again placed in the 0th bin index, shown above as the first row of 3 logits. The unseen bin is the last row of zeros.\n",
    "\n",
    "Since this feature is a nominal categorial, we use the dictionary {'Fiji': 1, 'Peru': 2} to lookup which row of logits to use for each categorical string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(ebm.term_scores_[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ebm.term_scores_[1] is the lookup table for the continuous feature. Once again, the 0th and last index are for missing values, and unseen values respectively. This particular example has 5 bins consisting of the 0th missing bin index, the three partitions from the 2 cuts, and the unseen bin index. Each row is a single bin that contains 3 class logits. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Sample code</h2>\n",
    "\n",
    "This sample code incorporates everything discussed in all 3 sections. It could be used as a drop in replacement for the existing EBM predict function of the ExplainableBoostingRegressor or as the predict_proba function of the ExplainableBoostingClassifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils.extmath import softmax\n",
    "\n",
    "sample_scores = []\n",
    "for sample in X:\n",
    "    # start from the intercept for each sample\n",
    "    score = ebm.intercept_.copy()\n",
    "    if isinstance(score, float) or len(score) == 1:\n",
    "        # regression or binary classification\n",
    "        score = float(score)\n",
    "\n",
    "    # we have 2 terms, so add their score contributions\n",
    "    for term_idx, features in enumerate(ebm.term_features_):\n",
    "        # indexing into a tensor requires a multi-dimensional index\n",
    "        tensor_index = []\n",
    "\n",
    "        # main effects will have 1 feature, and pairs will have 2 features\n",
    "        for feature_idx in features:\n",
    "            feature_val = sample[feature_idx]\n",
    "            bin_idx = 0  # if missing value, use bin index 0\n",
    "\n",
    "            if feature_val is not None and feature_val is not np.nan:\n",
    "                # we bin differently for main effects and pairs, so first \n",
    "                # get the list containing the bins for different resolutions\n",
    "                bin_levels = ebm.bins_[feature_idx]\n",
    "\n",
    "                # what resolution do we need for this term (main resolution, pair\n",
    "                # resolution, etc.), but limit to the last resolution available\n",
    "                bins = bin_levels[min(len(bin_levels), len(features)) - 1]\n",
    "\n",
    "                if isinstance(bins, dict):\n",
    "                    # categorical feature\n",
    "                    # 'unseen' category strings are in the last bin (-1)\n",
    "                    bin_idx = bins.get(feature_val, -1)\n",
    "                else:\n",
    "                    # continuous feature\n",
    "                    try:\n",
    "                        # try converting to a float, if that fails it's 'unseen'\n",
    "                        feature_val = float(feature_val)\n",
    "                        # add 1 because the 0th bin is reserved for 'missing'\n",
    "                        bin_idx = np.digitize(feature_val, bins) + 1\n",
    "                    except ValueError:\n",
    "                        # non-floats are 'unseen', which is in the last bin (-1)\n",
    "                        bin_idx = -1\n",
    "        \n",
    "            tensor_index.append(bin_idx)\n",
    "        # local_score is also the local feature importance\n",
    "        local_score = ebm.term_scores_[term_idx][tuple(tensor_index)]\n",
    "        score += local_score\n",
    "    sample_scores.append(score)\n",
    "\n",
    "predictions = np.array(sample_scores)\n",
    "\n",
    "if hasattr(ebm, 'classes_'):\n",
    "    # classification\n",
    "    if len(ebm.classes_) == 2:\n",
    "        # binary classification\n",
    "\n",
    "        # softmax expects two logits for binary classification\n",
    "        # the first logit is always equivalent to 0 for binary classification\n",
    "        predictions = [[0, x] for x in predictions]\n",
    "    predictions = softmax(predictions)\n",
    "\n",
    "if hasattr(ebm, 'classes_'):\n",
    "    print(\"probabilities for classes \" + str(ebm.classes_))\n",
    "    print(\"\")\n",
    "    print(ebm.predict_proba(X))\n",
    "else:\n",
    "    print(ebm.predict(X))\n",
    "print(\"\")\n",
    "print(predictions)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}