Credit Card Approval Prediction and Explainability with LIME and SHAP
Predicting credit card approval is a crucial task for financial institutions that need to assess the creditworthiness of their customers. In this blog post, we will explore how to build a Multilayer Perceptron (MLP) model to predict credit card approval and, more importantly, explain the model’s predictions using LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations). We’ll break down the entire process step by step, from data preprocessing to model interpretability.
Problem Statement
The goal of this project is to predict whether a customer will be approved for a credit card based on certain demographic and financial features. We will train an MLP classifier and use LIME and SHAP to explain the model’s decisions, ensuring transparency and trust in the model.
The Dataset
The dataset used in this project contains labeled data with the following features:
- ID: Customer ID
- Age: Customer’s age
- Experience: Years of work experience
- Income: Annual income (in thousands)
- Zipcode: Residential zipcode
- Family: Number of family members
- CCAvg: Average monthly credit card spend
- Education: Education level (1: Bachelor, 2: Master, 3: Advanced Degree)
- Mortgage: Mortgage value (in thousands)
- Securities Account: Boolean flag for having a securities account
- CD Account: Boolean flag for having a Certificate of Deposit account
- Online: Boolean flag for using online banking
- CreditCard: Target column (1: Credit card approved, 0: Not approved)
Our target variable, CreditCard, indicates whether a customer was approved for a credit card or not. We will use this variable to train our classifier.
Step 1: Data Preprocessing and EDA
Loading the Data
First, we load the dataset and inspect its structure to understand the types of features we are working with. Here is a sample of the code used to load the data:
import pandas as pd
# Load the dataset
data = pd.read_csv("UniversalBank.csv")
# Inspect the dataset
print(data.shape)
print(data.head())
Exploratory Data Analysis (EDA)
Next, we explore the dataset to identify any patterns, missing values, or correlations that might impact model performance. We use visualizations such as histograms and heatmaps to understand feature distributions and correlations.
import seaborn as sns
import matplotlib.pyplot as plt
# Plot feature distributions
data.hist(figsize=(12, 8))
plt.show()
# Correlation heatmap
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.show()
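Since the EDA step also looks for missing values, it helps to check for nulls and to inspect the balance of the target classes before modeling. This is a small optional addition using the same data DataFrame:
# Check for missing values in each column
print(data.isnull().sum())
# Check how balanced the target classes are
print(data['CreditCard'].value_counts(normalize=True))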
Feature Normalization
Certain features such as Income, Mortgage, and CCAvg have vastly different scales, which can adversely affect the model. Hence, we normalize these features using the StandardScaler to ensure they are on the same scale.
from sklearn.preprocessing import StandardScaler
# Select numerical features for scaling
features_to_scale = ['Income', 'CCAvg', 'Mortgage']
# Initialize scaler
scaler = StandardScaler()
# Scale the selected features
data[features_to_scale] = scaler.fit_transform(data[features_to_scale])
Step 2: MLP Model Implementation
The Multilayer Perceptron (MLP) is a feedforward neural network with multiple layers. We will use Keras to implement a simple MLP with two hidden layers.
MLP Architecture
Our MLP architecture is defined as follows:
import tensorflow as tf
from tensorflow.keras import layers, models
# Define the MLP model: two hidden layers and a sigmoid output for binary classification
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(data.shape[1] - 1,)),  # all columns except the target
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
K-Fold Cross-Validation
To ensure robustness, we use 5-fold cross-validation, training the model on different subsets of the data and evaluating it on the remaining portions.
from sklearn.model_selection import KFold
# Separate the features from the target column
X = data.drop('CreditCard', axis=1).values
y = data['CreditCard'].values
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train, epochs=10, verbose=0)
    # Evaluate on the held-out fold (in practice, re-initialize the model weights for each fold)
    fold_accuracies.append(model.evaluate(X_test, y_test, verbose=0)[1])
Step 3: Explainability with LIME
Once the model is trained, we use LIME to explain the predictions of our model for individual instances.
LIME Overview
LIME works by perturbing the input data and observing how the model’s predictions change. It generates a local surrogate model to approximate the complex model within a small region around the input data point.
Here is a Mermaid diagram that explains how LIME works:
graph TD
    A[Input Data] --> B[Model Prediction]
    B --> C{Perturb Input}
    C --> D[Local Surrogate Model]
    D --> E[Explain Predictions]
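To make the perturb-and-fit idea concrete, here is a minimal, hypothetical sketch of LIME’s core loop (not the library’s actual implementation): perturb one instance with Gaussian noise, query the black-box model, weight the samples by their proximity to the instance, and fit a weighted linear surrogate whose coefficients serve as local feature importances. The function name and kernel width are illustrative choices.
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_explanation(instance, predict_fn, n_samples=1000, kernel_width=0.75):
    # 1. Perturb the instance with Gaussian noise (assumes standardized features)
    perturbations = instance + np.random.normal(0, 1, size=(n_samples, instance.shape[0]))
    # 2. Query the black-box model on the perturbed samples
    predictions = predict_fn(perturbations).ravel()
    # 3. Weight each sample by its proximity to the original instance
    distances = np.linalg.norm(perturbations - instance, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # 4. Fit a weighted linear surrogate; its coefficients are the local explanation
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbations, predictions, sample_weight=weights)
    return surrogate.coef_

# Example usage with the trained Keras model (hypothetical):
# local_importances = lime_like_explanation(X_test[0], model.predict)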
LIME Code Example
from lime.lime_tabular import LimeTabularExplainer
import numpy as np
# LIME expects class probabilities for every class, so wrap the single-output Keras model
def predict_proba(x):
    p = model.predict(x)
    return np.hstack([1 - p, p])
# Initialize LIME explainer (feature names exclude the target column)
explainer = LimeTabularExplainer(X_train, feature_names=data.drop('CreditCard', axis=1).columns.tolist(), class_names=['Not Approved', 'Approved'], mode='classification')
# Explain a single prediction
exp = explainer.explain_instance(X_test[0], predict_proba)
exp.show_in_notebook()
Step 4: SHAP Explainability
SHAP assigns an importance value to each feature by calculating how much each feature contributes to the model’s prediction for a specific instance.
SHAP Overview
SHAP values come from cooperative game theory: each feature is treated as a “player” and the prediction as the “payout”, which Shapley values distribute fairly among the features according to their marginal contributions. SHAP values can also be aggregated and visualized as a summary plot.
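As a toy illustration of that allocation (not how the shap library computes things in practice), the sketch below enumerates every coalition of features and averages each feature’s marginal contribution, replacing “absent” features with a background mean. All names here are hypothetical, and this exact enumeration is only feasible for a handful of features:
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley_values(instance, background_mean, predict_fn):
    # Exact Shapley values by enumerating all coalitions of the other features
    n = len(instance)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                # Features outside the coalition are set to the background mean
                x_without = background_mean.astype(float).copy()
                x_without[list(subset)] = instance[list(subset)]
                x_with = x_without.copy()
                x_with[i] = instance[i]
                # Shapley weight for a coalition of this size
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (predict_fn(x_with) - predict_fn(x_without))
    return phi

# Toy usage with a linear "model" over three features:
# f = lambda x: 0.5 * x[0] + 2.0 * x[1] - 1.0 * x[2]
# print(exact_shapley_values(np.array([1.0, 2.0, 3.0]), np.zeros(3), f))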
SHAP Code Example
import shap
# Flatten the Keras output to a 1-D vector of probabilities for SHAP
f = lambda x: model.predict(x).flatten()
# Use a small background sample to keep KernelExplainer tractable
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(f, background)
# Calculate SHAP values for a subset of the test data (KernelExplainer is slow)
shap_values = explainer.shap_values(X_test[:50])
# Plot SHAP summary
shap.summary_plot(shap_values, X_test[:50], feature_names=data.drop('CreditCard', axis=1).columns.tolist())
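The summary plot gives a global view; for an individual decision, a force plot shows how each feature pushes that one prediction above or below the baseline. A minimal sketch, assuming a notebook environment and the explainer and shap_values computed above (instance index 0 is arbitrary):
# Enable the JavaScript visualizations used by force_plot
shap.initjs()
# Local explanation for the first explained test instance
shap.force_plot(explainer.expected_value, shap_values[0], X_test[0], feature_names=data.drop('CreditCard', axis=1).columns.tolist())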
Step 5: Observations and Comparison
By comparing the results of LIME and SHAP, we observe that both methods offer valuable insights into the model’s predictions, but they operate differently:
- LIME focuses on local explanations, making it ideal for explaining individual predictions.
- SHAP computes per-instance feature contributions that can be aggregated across the dataset, giving both local explanations and a more holistic, global view of feature importance.
Conclusion
In this project, we successfully predicted credit card approval using an MLP model and applied LIME and SHAP to explain our model’s predictions. Understanding the decision-making process of machine learning models is crucial in sensitive domains like finance, and LIME and SHAP are powerful tools to ensure transparency and trust in model predictions.