Generating counterfactuals for multi-class classification and regression models

This notebook demonstrates how the DiCE library can be used for multiclass classification and regression with scikit-learn models. You can use any of the methods ("random", "kdtree", "genetic"); just specify it in the method argument at the initialization step. The rest of the code stays the same. For demonstration, we will use the genetic algorithm to generate counterfactuals (CFs).
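The method choice described above can be sketched as a small helper. This is only an illustration: make_explainer and SUPPORTED_METHODS are hypothetical names that are not part of DiCE. Only the three method strings and the Dice(data, model, method=...) call are taken from this notebook.

```python
# Hypothetical helper; only the method names and the Dice(...) call
# come from this notebook.
SUPPORTED_METHODS = ("random", "kdtree", "genetic")


def make_explainer(data, model, method="genetic"):
    """Build a DiCE explainer after checking the method name."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"Unknown method {method!r}; expected one of {SUPPORTED_METHODS}"
        )
    # Imported lazily so the name check runs even without dice_ml installed.
    from dice_ml import Dice
    return Dice(data, model, method=method)
```

With the d_iris and m_iris objects created later in this notebook, make_explainer(d_iris, m_iris, method="kdtree") would be expected to give an explainer that is used in the same way as the genetic one.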

[1]:

import dice_ml
from dice_ml import Dice

from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

import pandas as pd

[2]:

%load_ext autoreload
%autoreload 2

We will use datasets provided by sklearn to demonstrate DiCE’s features in this notebook.

Multiclass Classification

For multiclass classification, we will use sklearn’s Iris dataset. It records the sepal and petal length and width of three types of iris (Setosa, Versicolour, and Virginica). More information at https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset

[3]:

df_iris = load_iris(as_frame=True).frame
df_iris.head()

[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

[4]:

df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

[5]:

outcome_name = "target"
continuous_features_iris = df_iris.drop(outcome_name, axis=1).columns.tolist()
target = df_iris[outcome_name]

[6]:

# Split data into train and test
datasetX = df_iris.drop(outcome_name, axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=target)

categorical_features = x_train.columns.difference(continuous_features_iris)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features_iris),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_iris = Pipeline(steps=[('preprocessor', transformations),
                           ('classifier', RandomForestClassifier())])
model_iris = clf_iris.fit(x_train, y_train)

[7]:

d_iris = dice_ml.Data(dataframe=df_iris,
                      continuous_features=continuous_features_iris,
                      outcome_name=outcome_name)

# We provide the type of model as a parameter (model_type)
m_iris = dice_ml.Model(model=model_iris, backend="sklearn", model_type='classifier')

[8]:

exp_genetic_iris = Dice(d_iris, m_iris, method="genetic")

As the output below shows, every generated counterfactual is predicted as the desired class.

[9]:

# Single input
query_instances_iris = x_test[2:3]
genetic_iris = exp_genetic_iris.generate_counterfactuals(query_instances_iris, total_CFs=7, desired_class=2)
genetic_iris.visualize_as_dataframe()

100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]

Query instance (original outcome : 0)

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.8	1.9	0.4	0


Diverse Counterfactual set (new outcome: 2)

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
7.9	3.8	6.4	2.0	2
7.7	3.8	6.7	2.2	2
6.3	3.8	4.4	2.1	2
6.4	3.3	6.1	0.9	2
7.2	3.6	5.2	1.3	2
4.3	3.3	4.5	1.7	2
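The integer outcomes in the output above are class labels. As a small sketch, the mapping below spells them out using the species names that sklearn's load_iris reports in its target_names attribute. IRIS_CLASS_NAMES and describe_outcome are illustrative names, not part of DiCE or sklearn.

```python
# Label-to-name mapping as reported by load_iris().target_names.
IRIS_CLASS_NAMES = {0: "setosa", 1: "versicolor", 2: "virginica"}


def describe_outcome(original, desired):
    """Return a readable description of a class change."""
    return f"{IRIS_CLASS_NAMES[original]} -> {IRIS_CLASS_NAMES[desired]}"


# The query above was predicted as class 0 and desired_class was 2.
print(describe_outcome(0, 2))  # prints "setosa -> virginica"
```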

[10]:

# Multiple queries can be given as input at once
query_instances_iris = x_test[17:19]
genetic_iris = exp_genetic_iris.generate_counterfactuals(query_instances_iris, total_CFs=7, desired_class=2)
genetic_iris.visualize_as_dataframe(show_only_changes=True)

100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.56s/it]

Query instance (original outcome : 1)

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	6.7	3.1	4.4	1.4	1


Diverse Counterfactual set (new outcome: 2)

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
-	-	5.6	2.4	2.0
6.4	-	5.5	1.8	2.0
6.9	-	5.4	2.1	2.0
6.9	-	5.1	2.3	2.0
-	3.0	5.2	2.3	2.0
5.7	-	5.0	2.0	2.0
-	3.3	5.7	2.1	2.0

Query instance (original outcome : 1)

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	7.0	3.2	4.7	1.4	1


Diverse Counterfactual set (new outcome: 2)

sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
7.2	-	6.0	1.8	2.0
6.5	-	5.1	2.0	2.0
6.9	-	5.7	2.3	2.0
6.8	-	5.9	2.3	2.0
6.4	-	5.3	2.3	2.0
-	2.8	1.5	2.0	2.0
6.9	3.1	5.4	2.1	2.0
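In the tables above, a "-" marks a feature whose value is the same as in the query instance. The following is a minimal sketch of how such a view could be derived from plain dictionaries. It is not DiCE's implementation: mask_unchanged is a hypothetical function, and the exact-equality comparison is an assumption.

```python
def mask_unchanged(query, counterfactuals, placeholder="-"):
    """Replace counterfactual values equal to the query with a placeholder."""
    masked_rows = []
    for cf in counterfactuals:
        masked_rows.append({
            name: placeholder if cf[name] == query[name] else cf[name]
            for name in query
        })
    return masked_rows


# First query instance and its first counterfactual from the output above.
query = {
    "sepal length (cm)": 6.7,
    "sepal width (cm)": 3.1,
    "petal length (cm)": 4.4,
    "petal width (cm)": 1.4,
}
first_cf = {
    "sepal length (cm)": 6.7,
    "sepal width (cm)": 3.1,
    "petal length (cm)": 5.6,
    "petal width (cm)": 2.4,
}

masked = mask_unchanged(query, [first_cf])
print(masked[0])
```

Here only the two petal measurements keep their values, which matches the first row of the first counterfactual table.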

Regression

For regression, we will use sklearn’s California Housing dataset. The target is the median house value of a California district, expressed in units of $100,000. More information at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

[11]:

housing_data = fetch_california_housing()
df_housing = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
df_housing[outcome_name] = pd.Series(housing_data.target)
df_housing.head()

[11]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	target
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

[12]:

df_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

[13]:

continuous_features_housing = df_housing.drop(outcome_name, axis=1).columns.tolist()
target = df_housing[outcome_name]

[14]:

# Split data into train and test
datasetX = df_housing.drop(outcome_name, axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=0)

categorical_features = x_train.columns.difference(continuous_features_housing)

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features_housing),
        ('cat', categorical_transformer, categorical_features)])

# Append regressor to preprocessing pipeline.
# Now we have a full prediction pipeline.
regr_housing = Pipeline(steps=[('preprocessor', transformations),
                               ('regressor', RandomForestRegressor())])
model_housing = regr_housing.fit(x_train, y_train)

[15]:

d_housing = dice_ml.Data(dataframe=df_housing, continuous_features=continuous_features_housing, outcome_name=outcome_name)
# We provide the type of model as a parameter (model_type)
m_housing = dice_ml.Model(model=model_housing, backend="sklearn", model_type='regressor')

[16]:

exp_genetic_housing = Dice(d_housing, m_housing, method="genetic")

As the output below shows, the predicted target of every counterfactual lies in the desired range.

[17]:

# Multiple queries can be given as input at once
query_instances_housing = x_test[2:4]
genetic_housing = exp_genetic_housing.generate_counterfactuals(query_instances_housing,
                                                               total_CFs=2,
                                                               desired_range=[3.0, 5.0])
genetic_housing.visualize_as_dataframe(show_only_changes=True)

100%|█████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.09s/it]

Query instance (original outcome : 1.5158300399780273)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	target
0	4.3487	29.0	5.930712	1.026217	1554.0	2.910112	38.650002	-121.839996	1.51583


Diverse Counterfactual set (new outcome: [3.0, 5.0])

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	target
0	3.6976	-	5.9	1.0	-	2.0	34.24	-124.35	3.6281412000000004
0	6.5173	24.0	6.5	1.0	-	2.7	34.45	-119.81	3.517150499999996

Query instance (original outcome : 0.8763800263404846)

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	target
0	2.4511	37.0	4.992958	1.316901	390.0	2.746479	33.200001	-115.599998	0.87638


Diverse Counterfactual set (new outcome: [3.0, 5.0])

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	target
0	10.3682	-	8.1	1.1	-	2.6	33.61	-117.92	4.94750939999999
0	2.9167	43.0	4.6	1.2	-	1.6	34.01	-118.47	4.3242802999999945
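The claim that all counterfactual targets fall in the desired range can be checked by hand. The sketch below copies the target values from the output above into a list; all_in_range is a hypothetical helper, and treating both bounds as inclusive is an assumption rather than documented DiCE behavior.

```python
def all_in_range(values, desired_range):
    """Return True if every value lies within [low, high], bounds included."""
    low, high = desired_range
    return all(low <= value <= high for value in values)


# Counterfactual targets copied from the two tables above.
cf_targets = [
    3.6281412000000004,
    3.517150499999996,
    4.94750939999999,
    4.3242802999999945,
]
# Original predictions of the two query instances.
query_targets = [1.51583, 0.87638]

print(all_in_range(cf_targets, [3.0, 5.0]))     # prints True
print(all_in_range(query_targets, [3.0, 5.0]))  # prints False
```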