Generating counterfactual explanations without access to training data

If only the trained model is available but not the training data, DiCE can still be used to generate counterfactual explanations. Below we show an example where DiCE uses only basic metadata about each feature used in the ML model.

[1]:
# import DiCE
import pandas as pd
import dice_ml
from dice_ml.utils import helpers  # helper functions
[2]:
%load_ext autoreload
%autoreload 2

Defining metadata

We simulate the “adult” income dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult) by providing only meta information about the data: a range for each continuous feature and a list of categories for each categorical feature. Note that for Python <= 3.6, the features parameter should be provided as an OrderedDict, in the same order that was used to train the ML model.

[3]:
d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                                         'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')
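
For Python <= 3.6, the same metadata can be passed as an OrderedDict to make the column order explicit. A minimal sketch, mirroring the dictionary above:

from collections import OrderedDict

# Explicit column order for Python <= 3.6, where plain dicts do not
# guarantee insertion order; the order must match the training data.
features_ordered = OrderedDict([
    ('age', [17, 90]),
    ('workclass', ['Government', 'Other/Unknown', 'Private', 'Self-Employed']),
    ('education', ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                   'Prof-school', 'School', 'Some-college']),
    ('marital_status', ['Divorced', 'Married', 'Separated', 'Single', 'Widowed']),
    ('occupation', ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales',
                    'Service', 'White-Collar']),
    ('race', ['Other', 'White']),
    ('gender', ['Female', 'Male']),
    ('hours_per_week', [1, 99]),
])
d = dice_ml.Data(features=features_ordered, outcome_name='income')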

Explaining pre-trained sklearn models

We first explain a RandomForest model that has been pre-trained on the Adult dataset.

[4]:
backend = 'sklearn'
sk_modelpath = helpers.get_adult_income_modelpath(backend=backend)  # pretrained model
m = dice_ml.Model(model_path=sk_modelpath, backend=backend)

The next two steps are the same as when using DiCE with training data. We specify the genetic algorithm and provide an input query instance.

[5]:
# instantiate DiCE
exp = dice_ml.Dice(d, m, method="genetic")
[6]:
# query instance as a one-row DataFrame built from a dictionary;
# keys: feature names, values: feature values
query_instance = pd.DataFrame({'age': 22,
                               'workclass': 'Private',
                               'education': 'HS-grad',
                               'marital_status': 'Single',
                               'occupation': 'Service',
                               'race': 'White',
                               'gender': 'Female',
                               'hours_per_week': 45}, index=[0])

Generate diverse counterfactuals

The initialization must be set to random, since the default kdtree initialization requires access to the training data and is therefore not supported for private data.

[7]:
# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, desired_class="opposite",
                                        initialization="random")
# visualize the results
dice_exp.visualize_as_dataframe(show_only_changes=True)
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.10s/it]
Query instance (original outcome : 0)

age workclass education marital_status occupation race gender hours_per_week income
0 22 Private HS-grad Single Service White Female 45 0

Diverse Counterfactual set without sparsity correction since only metadata about each feature is available (new outcome: 1)
age workclass education marital_status occupation race gender hours_per_week income
0 - - Doctorate Widowed White-Collar - - 39 1
0 - - Masters Widowed White-Collar - - 31 1
0 - - Doctorate Widowed White-Collar - - 97 1
0 23 Self-Employed Bachelors Married Other/Unknown - - 53 1

Explaining pre-trained deep learning models

We can also use a trained model based on TensorFlow or PyTorch. Below, we use a pre-trained ML model that achieves high accuracy on test data, comparable to other popular baselines. This sample trained model is included with our package.

The variable backend below indicates the implementation type of DiCE we want to use. We use TensorFlow 2 in this notebook with backend='TF2'. You can set backend to 'TF1' or 'PYT' to use DiCE with TensorFlow 1.x or with PyTorch, respectively. Note that the time required to find counterfactuals with TensorFlow 2.x's eager style of execution is significantly greater than with TensorFlow 1.x's graph execution.

[8]:
import tensorflow as tf  # noqa

backend = 'TF' + tf.__version__[0]  # TF2
ML_modelpath = helpers.get_adult_income_modelpath(backend=backend)
m = dice_ml.Model(model_path=ML_modelpath, backend=backend, func="ohe-min-max")
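
As a sketch of the PyTorch path mentioned above, assuming the bundled PyTorch model is retrievable through the same helper with backend='PYT':

# Hypothetical variant: load the pre-trained PyTorch model instead of TF2.
backend = 'PYT'
ML_modelpath = helpers.get_adult_income_modelpath(backend=backend)
m = dice_ml.Model(model_path=ML_modelpath, backend=backend, func="ohe-min-max")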
[9]:
# instantiate DiCE
exp = dice_ml.Dice(d, m, method="gradient")
[10]:
# query instance as a one-row DataFrame built from a dictionary;
# keys: feature names, values: feature values
query_instance = pd.DataFrame({'age': 22,
                               'workclass': 'Private',
                               'education': 'HS-grad',
                               'marital_status': 'Single',
                               'occupation': 'Service',
                               'race': 'White',
                               'gender': 'Female',
                               'hours_per_week': 45}, index=[0])

Generate diverse counterfactuals

[11]:
# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, desired_class="opposite")
# visualize the results
dice_exp.visualize_as_dataframe(show_only_changes=True)
Diverse Counterfactuals found! total time taken: 00 min 46 sec
Query instance (original outcome : 0.01899999938905239)
age workclass education marital_status occupation race gender hours_per_week income
0 22.0 Private HS-grad Single Service White Female 45.0 0.019

Diverse Counterfactual set without sparsity correction since only metadata about each feature is available (new outcome: 1.0)
age workclass education marital_status occupation race gender hours_per_week income
0 60.0 Self-Employed Prof-school Married Professional - - 43.0 0.911
1 38.0 Other/Unknown Assoc Married - - - 55.0 0.74
2 90.0 - Doctorate - - - - 99.0 0.755
3 70.0 - - - White-Collar Other Male 73.0 0.525

Note on weighting different features: When the training data is available, the distance between two values of a continuous feature is, by default, scaled by the inverse median absolute deviation (MAD) of that feature computed over the training data. This captures the relative prevalence of observing a continuous feature at a particular value, as discussed in our paper. Without access to the training data, as in the case above, no such scaling is done, so all features are weighted equally in their normalized form. As a result, the counterfactuals generated above differ from those in the DiCE_getting_started notebook, where the training data was available. Nonetheless, you can manually provide the scaling constants through the feature_weights parameter of the generate_counterfactuals() method, as shown in this advanced notebook, or you can provide the MADs directly to the data interface dice_ml.Data if you know them.
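
For instance, a minimal sketch of supplying manual scaling constants via feature_weights (the weight values here are illustrative, not the true MADs of the Adult data):

# Hypothetical weights: a larger weight increases the distance penalty for
# changing that feature, making it costlier to alter in a counterfactual.
feature_weights = {'age': 10.0, 'hours_per_week': 5.0}
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4,
                                        desired_class="opposite",
                                        feature_weights=feature_weights)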