Generating counterfactual explanations without access to training data

If only the trained model is available but not the training data, DiCE can still be used to generate counterfactual explanations. Below we show an example where DiCE uses only basic metadata about each feature used in the ML model.

[1]:
# import DiCE
import pandas as pd
import dice_ml
from dice_ml.utils import helpers  # helper functions
[2]:
%load_ext autoreload
%autoreload 2

Defining metadata

We simulate the “adult” income dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult) by providing only meta information about the data: a range for each continuous feature and a list of categories for each categorical feature. Note that for Python <= 3.6, the features parameter should be provided as an OrderedDict, in the same order that was used to train the ML model.

[3]:
d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                                         'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')
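
For Python <= 3.6, the same metadata can be passed as an OrderedDict to make the column order explicit. A minimal sketch, mirroring the dictionary above:

from collections import OrderedDict

# Explicit column order for Python <= 3.6, where plain dicts do not
# guarantee insertion order; the order must match the training data.
features_ordered = OrderedDict([
    ('age', [17, 90]),
    ('workclass', ['Government', 'Other/Unknown', 'Private', 'Self-Employed']),
    ('education', ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                   'Prof-school', 'School', 'Some-college']),
    ('marital_status', ['Divorced', 'Married', 'Separated', 'Single', 'Widowed']),
    ('occupation', ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales',
                    'Service', 'White-Collar']),
    ('race', ['Other', 'White']),
    ('gender', ['Female', 'Male']),
    ('hours_per_week', [1, 99]),
])
d = dice_ml.Data(features=features_ordered, outcome_name='income')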

Explaining pre-trained sklearn models

We first explain a RandomForest model that has been pre-trained on the Adult dataset.

[4]:
backend = 'sklearn'
sk_modelpath = helpers.get_adult_income_modelpath(backend=backend)  # pretrained model
m = dice_ml.Model(model_path=sk_modelpath, backend=backend)

The next two steps are the same as when using DiCE with training data. We specify the genetic algorithm and provide an input query instance.

[5]:
# instantiate DiCE
exp = dice_ml.Dice(d, m, method="genetic")
[6]:
# query instance as a one-row DataFrame built from a dictionary;
# keys: feature names, values: feature values
query_instance = pd.DataFrame({'age': 22,
                               'workclass': 'Private',
                               'education': 'HS-grad',
                               'marital_status': 'Single',
                               'occupation': 'Service',
                               'race': 'White',
                               'gender': 'Female',
                               'hours_per_week': 45}, index=[0])

Generate diverse counterfactuals

The initialization must be set to random, since the default kdtree initialization requires access to the training data and is therefore not supported for private data.

[7]:
# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, desired_class="opposite",
                                        initialization="random")
# visualize the results
dice_exp.visualize_as_dataframe(show_only_changes=True)
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.10s/it]
Query instance (original outcome : 0)

age workclass education marital_status occupation race gender hours_per_week income
0 22 Private HS-grad Single Service White Female 45 0

Diverse Counterfactual set without sparsity correction since only metadata about each feature is available (new outcome: 1)
age workclass education marital_status occupation race gender hours_per_week income
0 - - Doctorate Widowed White-Collar - - 39 1
0 - - Masters Widowed White-Collar - - 31 1
0 - - Doctorate Widowed White-Collar - - 97 1
0 23 Self-Employed Bachelors Married Other/Unknown - - 53 1

Explaining pre-trained deep learning models

We can also use a trained model based on TensorFlow or PyTorch. Below, we use a pre-trained ML model that achieves high accuracy on test data, comparable to other popular baselines. This sample trained model is included with our package.

The variable backend below indicates the implementation type of DiCE we want to use. We use TensorFlow 2 in this notebook with backend='TF2'. You can set backend to 'TF1' or 'PYT' to use DiCE with TensorFlow 1.x or with PyTorch, respectively. Note that the time required to find counterfactuals with TensorFlow 2.x's eager style of execution is significantly greater than with TensorFlow 1.x's graph execution.

[8]:
import tensorflow as tf  # noqa

backend = 'TF' + tf.__version__[0]  # TF2
ML_modelpath = helpers.get_adult_income_modelpath(backend=backend)
m = dice_ml.Model(model_path=ML_modelpath, backend=backend, func="ohe-min-max")
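
As a sketch of the PyTorch path mentioned above, assuming the bundled PyTorch model is retrievable through the same helper with backend='PYT':

# Hypothetical variant: load the pre-trained PyTorch model instead of TF2.
backend = 'PYT'
ML_modelpath = helpers.get_adult_income_modelpath(backend=backend)
m = dice_ml.Model(model_path=ML_modelpath, backend=backend, func="ohe-min-max")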
[9]:
# instantiate DiCE
exp = dice_ml.Dice(d, m, method="gradient")
[10]:
# query instance as a one-row DataFrame built from a dictionary;
# keys: feature names, values: feature values
query_instance = pd.DataFrame({'age': 22,
                               'workclass': 'Private',
                               'education': 'HS-grad',
                               'marital_status': 'Single',
                               'occupation': 'Service',
                               'race': 'White',
                               'gender': 'Female',
                               'hours_per_week': 45}, index=[0])

Generate diverse counterfactuals

[11]:
# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, desired_class="opposite")
# visualize the results
dice_exp.visualize_as_dataframe(show_only_changes=True)
Diverse Counterfactuals found! total time taken: 00 min 46 sec
Query instance (original outcome : 0.01899999938905239)
age workclass education marital_status occupation race gender hours_per_week income
0 22.0 Private HS-grad Single Service White Female 45.0 0.019

Diverse Counterfactual set without sparsity correction since only metadata about each feature is available (new outcome: 1.0)
age workclass education marital_status occupation race gender hours_per_week income
0 60.0 Self-Employed Prof-school Married Professional - - 43.0 0.911
1 38.0 Other/Unknown Assoc Married - - - 55.0 0.74
2 90.0 - Doctorate - - - - 99.0 0.755
3 70.0 - - - White-Collar Other Male 73.0 0.525

Note on weighting different features: When the training data is available, the distance between two values of a continuous feature is, by default, scaled by the inverse median absolute deviation (MAD) of that feature computed over the training data. This captures the relative prevalence of observing a continuous feature at a particular value, as discussed in our paper. Without access to the training data, as in the case above, no such scaling is done, so all features are weighted equally in their normalized form. As a result, the counterfactuals generated above differ from those in the DiCE_getting_started notebook, where the training data was available. Nonetheless, you can manually provide the scaling constants through the feature_weights parameter of the generate_counterfactuals() method, as shown in this advanced notebook, or you can provide the MADs directly to the data interface dice_ml.Data if you know them.
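
For instance, a minimal sketch of supplying manual scaling constants via feature_weights (the weight values here are illustrative, not the true MADs of the Adult data):

# Hypothetical weights: a larger weight increases the distance penalty for
# changing that feature, making it costlier to alter in a counterfactual.
feature_weights = {'age': 10.0, 'hours_per_week': 5.0}
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4,
                                        desired_class="opposite",
                                        feature_weights=feature_weights)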