In this assignment we build a DNN model for classifying MNIST digits. The DNN model consists of 784 input nodes, a hidden layer with 128 nodes, and 10 output nodes (corresponding to the 10 digits). We use mnist.load_data() to get the 70,000 images, divided into a set of 60,000 training images and 10,000 test images. We hold back 20% of the 60,000 training images (12,000 images) for validation.

After training the DNN model we analyze its performance. In particular, we use confusion matrices to compare the predicted classes with the class labels, to try to determine why some images were misclassified by the model.

We then examine the activation values of one of the hidden nodes for the (original) set of training data. We want to use these activation values as "proxies" for the predicted classes of the 60,000 images. Just as we compare the predicted classes with the class labels using confusion matrices to determine the efficacy of the model, we use box plots to visualize the relationship between the activation values of one hidden node and the class labels. We don't expect these activation values to have much "predictive power": the same activation values can be associated with multiple class labels, resulting in a lot of overlap in the box plots.

We also use scatter plots to visualize the relationship between a pair of pixel values and the class labels (represented by different colored dots), and, after reducing the dimensionality with PCA, a scatter plot to visualize the correlation between the two principal component values and the class labels.

First we import all the packages that will be used in the assignment. Since Keras is integrated in TensorFlow 2.x, we import keras from tensorflow and use tensorflow.keras.xxx to import all other Keras packages. The seed argument produces a deterministic sequence of tensors across multiple calls.
import datetime
from packaging import version
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib as mpl # EA
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import accuracy_score
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import mnist
2022-10-16 21:23:11.471856: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
%matplotlib inline
np.set_printoptions(precision=3, suppress=True)
print("This notebook requires TensorFlow 2.0 or above")
print("TensorFlow version: ", tf.__version__)
assert version.parse(tf.__version__).release[0] >=2
This notebook requires TensorFlow 2.0 or above TensorFlow version: 2.10.0
print("Keras version: ", keras.__version__)
Keras version: 2.10.0
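The introduction mentions a seed argument for reproducibility. As a minimal sketch (these two calls are an assumption added for illustration, not part of the original assignment code), global seeds for NumPy and TensorFlow can be set explicitly:

np.random.seed(42)      # illustrative: seed NumPy's global random generator
tf.random.set_seed(42)  # illustrative: seed TensorFlow's global random generator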
#from google.colab import drive
#drive.mount('/content/gdrive')
def print_validation_report(test_labels, predictions):
    print("Classification Report")
    print(classification_report(test_labels, predictions))
    print('Accuracy Score: {}'.format(accuracy_score(test_labels, predictions)))
    print('Root Mean Square Error: {}'.format(np.sqrt(MSE(test_labels, predictions))))
def plot_confusion_matrix(y_true, y_pred):
    mtx = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(8, 8))
    sns.heatmap(mtx, annot=True, fmt='d', linewidths=.75, cbar=False, ax=ax,
                cmap='Blues', linecolor='white')
    # square=True,
    plt.ylabel('true label')
    plt.xlabel('predicted label')
def plot_history(history):
    losses = history.history['loss']
    accs = history.history['accuracy']
    val_losses = history.history['val_loss']
    val_accs = history.history['val_accuracy']
    epochs = len(losses)
    plt.figure(figsize=(16, 4))
    for i, metrics in enumerate(zip([losses, accs], [val_losses, val_accs], ['Loss', 'Accuracy'])):
        plt.subplot(1, 2, i + 1)
        plt.plot(range(epochs), metrics[0], label='Training {}'.format(metrics[2]))
        plt.plot(range(epochs), metrics[1], label='Validation {}'.format(metrics[2]))
        plt.legend()
    plt.show()
def plot_digits(instances, pos, images_per_row=5, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    pos.imshow(image, cmap='binary', **options)
    pos.axis("off")
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap='hot', interpolation="nearest")
    plt.axis("off")
The MNIST dataset comes bundled with tf.keras. We use tf.keras.datasets.mnist.load_data to get these datasets (and the corresponding labels) as NumPy arrays.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz 11490434/11490434 [==============================] - 4s 0us/step
load_data returns two tuples, (x_train, y_train) and (x_test, y_test):

x_train, x_test: uint8 arrays of grayscale image data with shape (num_samples, 28, 28).
y_train, y_test: uint8 arrays of digit labels (integers in the range 0-9).

print('x_train:\t{}'.format(x_train.shape))
print('y_train:\t{}'.format(y_train.shape))
print('x_test:\t\t{}'.format(x_test.shape))
print('y_test:\t\t{}'.format(y_test.shape))
x_train: (60000, 28, 28) y_train: (60000,) x_test: (10000, 28, 28) y_test: (10000,)
print("First ten labels training dataset:\n {}\n".format(y_train[0:10]))
First ten labels training dataset: [5 0 4 1 9 2 1 3 1 4]
items = [{'Class': x, 'Count': y} for x, y in Counter(y_train).items()]
distribution = pd.DataFrame(items).sort_values(['Class'])
sns.barplot(x=distribution.Class, y=distribution.Count);
items = [{'Class': x, 'Count': y} for x, y in Counter(y_test).items()]
distribution = pd.DataFrame(items).sort_values(['Class'])
sns.barplot(x=distribution.Class, y=distribution.Count);
Counter(y_train).most_common()
[(1, 6742), (7, 6265), (3, 6131), (2, 5958), (9, 5949), (0, 5923), (6, 5918), (8, 5851), (4, 5842), (5, 5421)]
Counter(y_test).most_common()
[(1, 1135), (2, 1032), (7, 1028), (3, 1010), (9, 1009), (4, 982), (0, 980), (8, 974), (6, 958), (5, 892)]
fig = plt.figure(figsize = (15, 9))
for i in range(50):
    plt.subplot(5, 10, 1 + i)
    plt.title(y_train[i])
    plt.xticks([])
    plt.yticks([])
    plt.imshow(x_train[i].reshape(28, 28), cmap='binary')
We will change the way the labels are represented from numbers (0 to 9) to vectors (1D arrays) of shape (10, ) with all the elements set to 0 except the one which the label belongs to - which will be set to 1. For example:
original label | one-hot encoded label |
---|---|
5 | [0 0 0 0 0 1 0 0 0 0] |
7 | [0 0 0 0 0 0 0 1 0 0] |
1 | [0 1 0 0 0 0 0 0 0 0] |
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)
print("First ten entries of y_train:\n {}\n".format(y_train[0:10]))
print("First ten rows of one-hot y_train:\n {}".format(y_train_encoded[0:10,]))
First ten entries of y_train: [5 0 4 1 9 2 1 3 1 4] First ten rows of one-hot y_train: [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
print('y_train_encoded shape: ', y_train_encoded.shape)
print('y_test_encoded shape: ', y_test_encoded.shape)
y_train_encoded shape: (60000, 10) y_test_encoded shape: (10000, 10)
Reshape the images from shape (28, 28) 2D arrays to shape (784, ) vectors (1D arrays).
# Before reshape:
print('x_train:\t{}'.format(x_train.shape))
print('x_test:\t\t{}'.format(x_test.shape))
x_train: (60000, 28, 28) x_test: (10000, 28, 28)
# Reshape the images:
x_train_reshaped = np.reshape(x_train, (60000, 784))
x_test_reshaped = np.reshape(x_test, (10000, 784))
# After reshape:
print('x_train_reshaped shape: ', x_train_reshaped.shape)
print('x_test_reshaped shape: ', x_test_reshaped.shape)
x_train_reshaped shape: (60000, 784) x_test_reshaped shape: (10000, 784)
print(set(x_train_reshaped[0]))
{0, 1, 2, 3, 9, 11, 14, 16, 18, 23, 24, 25, 26, 27, 30, 35, 36, 39, 43, 45, 46, 49, 55, 56, 64, 66, 70, 78, 80, 81, 82, 90, 93, 94, 107, 108, 114, 119, 126, 127, 130, 132, 133, 135, 136, 139, 148, 150, 154, 156, 160, 166, 170, 171, 172, 175, 182, 183, 186, 187, 190, 195, 198, 201, 205, 207, 212, 213, 219, 221, 225, 226, 229, 238, 240, 241, 242, 244, 247, 249, 250, 251, 252, 253, 255}
Rescale the pixel values from the range [0, 255] to the range [0, 1].
x_train_norm = x_train_reshaped.astype('float32') / 255
x_test_norm = x_test_reshaped.astype('float32') / 255
# Take a look at the first reshaped and normalized training image:
print(set(x_train_norm[0]))
{0.0, 0.011764706, 0.53333336, 0.07058824, 0.49411765, 0.6862745, 0.101960786, 0.6509804, 1.0, 0.96862745, 0.49803922, 0.11764706, 0.14117648, 0.36862746, 0.6039216, 0.6666667, 0.043137256, 0.05490196, 0.03529412, 0.85882354, 0.7764706, 0.7137255, 0.94509804, 0.3137255, 0.6117647, 0.41960785, 0.25882354, 0.32156864, 0.21960784, 0.8039216, 0.8666667, 0.8980392, 0.7882353, 0.52156866, 0.18039216, 0.30588236, 0.44705883, 0.3529412, 0.15294118, 0.6745098, 0.88235295, 0.99215686, 0.9490196, 0.7647059, 0.2509804, 0.19215687, 0.93333334, 0.9843137, 0.74509805, 0.7294118, 0.5882353, 0.50980395, 0.8862745, 0.105882354, 0.09019608, 0.16862746, 0.13725491, 0.21568628, 0.46666667, 0.3647059, 0.27450982, 0.8352941, 0.7176471, 0.5803922, 0.8117647, 0.9764706, 0.98039216, 0.73333335, 0.42352942, 0.003921569, 0.54509807, 0.67058825, 0.5294118, 0.007843138, 0.31764707, 0.0627451, 0.09411765, 0.627451, 0.9411765, 0.9882353, 0.95686275, 0.83137256, 0.5176471, 0.09803922, 0.1764706}
Below is the neural network architecture we will use today for classifying MNIST digits.
We use the Sequential class defined in Keras to create our model. All the layers are Dense layers: as in the figure shown above, every node of a layer is connected to every node of the preceding layer, i.e. the layers are densely connected.

After the model is built, we view a summary of its architecture.
model = Sequential([
    Dense(input_shape=[784], units=128, activation=tf.nn.relu,
          kernel_regularizer=tf.keras.regularizers.L2(0.001)),
    Dense(name="output_layer", units=10, activation=tf.nn.softmax)
])
model.summary()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_2 (Dense) (None, 128) 100480 output_layer (Dense) (None, 10) 1290 ================================================================= Total params: 101,770 Trainable params: 101,770 Non-trainable params: 0 _________________________________________________________________
keras.utils.plot_model(model, "mnist_model.png", show_shapes=True)
In addition to setting up our model architecture, we also need to define which algorithm the model should use to optimize the weights and biases given the data. We will use RMSprop, a variant of stochastic gradient descent.

We also need to define a loss function. Think of this function as a measure of the difference between the predicted outputs and the actual outputs given in the dataset. This loss needs to be minimized in order to have a higher model accuracy. That's what the optimization algorithm essentially does: it minimizes the loss during model training. For our multi-class classification problem, categorical cross entropy is commonly used.
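For reference, for a single example with one-hot label vector $y$ and predicted probability vector $\hat{y}$, the categorical cross-entropy loss is

$$\mathcal{L}(y, \hat{y}) = -\sum_{k=0}^{9} y_k \log \hat{y}_k = -\log \hat{y}_c,$$

where $c$ is the true class; the loss is small exactly when the model assigns high probability to the correct digit.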
Finally, we will use the accuracy during training as a metric to keep track of as the model trains.
We use the following optimizer and loss function:

tf.keras.optimizers.RMSprop: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop
tf.keras.losses.CategoricalCrossentropy: https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy
model.compile(optimizer='rmsprop',
loss = 'categorical_crossentropy',
metrics=['accuracy'])
We train the model with tf.keras.Model.fit, adding ModelCheckpoint and EarlyStopping callbacks:

tf.keras.Model.fit: https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
tf.keras.callbacks.EarlyStopping: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping
history = model.fit(
    x_train_norm,
    y_train_encoded,
    epochs=200,
    validation_split=0.20,
    callbacks=[tf.keras.callbacks.ModelCheckpoint("DNN_model.h5", save_best_only=True, save_weights_only=False),
               tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)]
)
Epoch 1/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.3965 - accuracy: 0.9148 - val_loss: 0.2699 - val_accuracy: 0.9431 Epoch 2/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.2392 - accuracy: 0.9509 - val_loss: 0.2192 - val_accuracy: 0.9548 Epoch 3/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.2018 - accuracy: 0.9602 - val_loss: 0.1957 - val_accuracy: 0.9596 Epoch 4/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1818 - accuracy: 0.9639 - val_loss: 0.1832 - val_accuracy: 0.9617 Epoch 5/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1705 - accuracy: 0.9659 - val_loss: 0.1786 - val_accuracy: 0.9632 Epoch 6/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1636 - accuracy: 0.9672 - val_loss: 0.1624 - val_accuracy: 0.9682 Epoch 7/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1568 - accuracy: 0.9699 - val_loss: 0.1613 - val_accuracy: 0.9682 Epoch 8/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1526 - accuracy: 0.9708 - val_loss: 0.1514 - val_accuracy: 0.9697 Epoch 9/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1510 - accuracy: 0.9702 - val_loss: 0.1472 - val_accuracy: 0.9711 Epoch 10/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1443 - accuracy: 0.9715 - val_loss: 0.1834 - val_accuracy: 0.9588 Epoch 11/200 1500/1500 [==============================] - 2s 1ms/step - loss: 0.1436 - accuracy: 0.9723 - val_loss: 0.1433 - val_accuracy: 0.9707
In order to ensure that the model has not simply "memorized" the training data, we evaluate its performance on the test set. This is easy to do: we simply use the evaluate method of our model.
model = tf.keras.models.load_model("DNN_model.h5")
print(f"Test acc: {model.evaluate(x_test_norm, y_test_encoded)[1]:.3f}")
313/313 [==============================] - 0s 889us/step - loss: 0.1485 - accuracy: 0.9696 Test acc: 0.970
loss, accuracy = model.evaluate(x_test_norm, y_test_encoded)
print('test set accuracy: ', accuracy * 100)
313/313 [==============================] - 0s 859us/step - loss: 0.1485 - accuracy: 0.9696 test set accuracy: 96.96000218391418
preds = model.predict(x_test_norm)
print('shape of preds: ', preds.shape)
313/313 [==============================] - 0s 1ms/step shape of preds: (10000, 10)
Look at the first 25 test set images, plotted along with their predicted and actual labels, to see how the trained model actually performed.
plt.figure(figsize = (12, 8))
start_index = 0
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    pred = np.argmax(preds[start_index + i])
    actual = np.argmax(y_test_encoded[start_index + i])
    col = 'g'
    if pred != actual:
        col = 'r'
    plt.xlabel('i={} | pred={} | true={}'.format(start_index + i, pred, actual), color=col)
    plt.imshow(x_test[start_index + i], cmap='binary')
plt.show()
history_dict = history.history
history_dict.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
We use Matplotlib to create two side-by-side plots, displaying the training and validation loss (resp. accuracy) for each training epoch.
losses = history.history['loss']
accs = history.history['accuracy']
val_losses = history.history['val_loss']
val_accs = history.history['val_accuracy']
epochs = len(losses)
history_df=pd.DataFrame(history_dict)
history_df.tail().round(3)
 | loss | accuracy | val_loss | val_accuracy |
---|---|---|---|---|
6 | 0.157 | 0.970 | 0.161 | 0.968 |
7 | 0.153 | 0.971 | 0.151 | 0.970 |
8 | 0.151 | 0.970 | 0.147 | 0.971 |
9 | 0.144 | 0.972 | 0.183 | 0.959 |
10 | 0.144 | 0.972 | 0.143 | 0.971 |
plot_history(history)
pred1 = model.predict(x_test_norm)
pred1 = np.argmax(pred1, axis=1)
313/313 [==============================] - 0s 691us/step
print_validation_report(y_test, pred1)
Classification Report precision recall f1-score support 0 0.98 0.99 0.98 980 1 0.99 0.98 0.99 1135 2 0.97 0.96 0.97 1032 3 0.93 0.98 0.96 1010 4 0.96 0.97 0.97 982 5 0.98 0.96 0.97 892 6 0.98 0.97 0.98 958 7 0.98 0.96 0.97 1028 8 0.95 0.96 0.96 974 9 0.97 0.96 0.96 1009 accuracy 0.97 10000 macro avg 0.97 0.97 0.97 10000 weighted avg 0.97 0.97 0.97 10000 Accuracy Score: 0.9696 Root Mean Square Error: 0.7425631286294789
Let us see what the confusion matrix looks like, computing it with both sklearn.metrics and tf.math. Then we visualize the confusion matrix and see what it tells us.
# Get the predicted classes:
# pred_classes = model.predict_classes(x_train_norm)# give deprecation warning
pred_classes = np.argmax(model.predict(x_test_norm), axis=-1)
pred_classes;
313/313 [==============================] - 0s 721us/step
conf_mx = tf.math.confusion_matrix(y_test, pred_classes)
conf_mx;
cm = sns.light_palette((260, 75, 60), input="husl", as_cmap=True)
df = pd.DataFrame(preds[0:20], columns = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
df.style.format("{:.2%}").background_gradient(cmap=cm)
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00% | 0.00% | 0.00% |
1 | 0.00% | 0.02% | 99.95% | 0.04% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
2 | 0.00% | 98.71% | 0.14% | 0.09% | 0.14% | 0.03% | 0.20% | 0.16% | 0.53% | 0.00% |
3 | 99.96% | 0.00% | 0.01% | 0.00% | 0.00% | 0.00% | 0.01% | 0.01% | 0.00% | 0.00% |
4 | 0.00% | 0.00% | 0.01% | 0.00% | 99.89% | 0.00% | 0.00% | 0.02% | 0.00% | 0.07% |
5 | 0.00% | 99.58% | 0.01% | 0.02% | 0.11% | 0.00% | 0.00% | 0.15% | 0.13% | 0.00% |
6 | 0.00% | 0.00% | 0.00% | 0.00% | 99.97% | 0.00% | 0.00% | 0.00% | 0.02% | 0.01% |
7 | 0.00% | 0.00% | 0.30% | 2.26% | 0.90% | 0.02% | 0.00% | 0.06% | 0.02% | 96.43% |
8 | 0.01% | 0.00% | 0.61% | 0.00% | 0.19% | 12.99% | 79.64% | 0.00% | 6.28% | 0.29% |
9 | 0.00% | 0.00% | 0.00% | 0.00% | 1.50% | 0.00% | 0.00% | 0.67% | 0.67% | 97.16% |
10 | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
11 | 0.00% | 0.00% | 0.00% | 0.00% | 0.02% | 0.00% | 99.88% | 0.00% | 0.10% | 0.00% |
12 | 0.00% | 0.00% | 0.00% | 0.19% | 0.80% | 0.02% | 0.00% | 0.38% | 0.01% | 98.60% |
13 | 100.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
14 | 0.00% | 99.92% | 0.00% | 0.02% | 0.01% | 0.00% | 0.00% | 0.00% | 0.05% | 0.00% |
15 | 0.01% | 0.00% | 0.01% | 11.86% | 0.00% | 88.02% | 0.00% | 0.00% | 0.11% | 0.00% |
16 | 0.00% | 0.00% | 0.02% | 0.03% | 0.13% | 0.00% | 0.00% | 0.48% | 0.05% | 99.28% |
17 | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 100.00% | 0.00% | 0.00% |
18 | 0.00% | 0.00% | 0.00% | 99.98% | 0.00% | 0.00% | 0.00% | 0.00% | 0.02% | 0.00% |
19 | 0.00% | 0.00% | 0.00% | 0.00% | 99.94% | 0.00% | 0.00% | 0.00% | 0.00% | 0.06% |
We use code from chapter 3 of Hands-On Machine Learning (A. Géron) (cf. https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb) to display a "heat map" of the confusion matrix. Then we normalize the confusion matrix so we can compare error rates.
plot_confusion_matrix(y_test,pred_classes)
Looks like 28 fours were misclassified as nines (and 10 nines were classified as fours). We display some of these misclassifications along with examples of fours and nines that were correctly identified.
cl_a, cl_b = 4, 9
X_aa = x_test_norm[(y_test == cl_a) & (pred_classes == cl_a)]
X_ab = x_test_norm[(y_test == cl_a) & (pred_classes == cl_b)]
X_ba = x_test_norm[(y_test == cl_b) & (pred_classes == cl_a)]
X_bb = x_test_norm[(y_test == cl_b) & (pred_classes == cl_b)]
plt.figure(figsize=(16,8))
p1 = plt.subplot(221)
p2 = plt.subplot(222)
p3 = plt.subplot(223)
p4 = plt.subplot(224)
plot_digits(X_aa[:25], p1, images_per_row=5);
plot_digits(X_ab[:25], p2, images_per_row=5);
plot_digits(X_ba[:25], p3, images_per_row=5);
plot_digits(X_bb[:25], p4, images_per_row=5);
p1.set_title(f"{cl_a}'s classified as {cl_a}'s")
p2.set_title(f"{cl_a}'s classified as {cl_b}'s")
p3.set_title(f"{cl_b}'s classified as {cl_a}'s")
p4.set_title(f"{cl_b}'s classified as {cl_b}'s")
# plt.savefig("error_analysis_digits_plot_EXP1_valid")
plt.show()
We want to examine the contribution of the individual hidden nodes to the classifications made by the model. We first get the activation values of all the hidden nodes for each of the 60,000 training images and treat these 128 activations as the features that determine the classification class. For the sake of comparison, we also consider the 784 pixels of each training image and determine the contribution of the individual pixels to the predicted classification class.
Our goal is to use box and scatter plots to visualize how these features (pixel and activation values) correlate with the class labels. Because of the high dimension of the feature spaces, we apply PCA decomposition and t-distributed stochastic neighbor embedding (t-SNE) to reduce the number of features in each case.

We use the following two articles as references. The plan is as follows:
1. Raw data is 60,000 x 784. Just do a scatter plot of column 1 vs column 2, overlaying the color-coded classes. We should not see any patterns, since two columns do not carry enough information to discriminate between the classes.
2. PCA of the raw data, as discussed earlier. Plot PC1 vs PC2 with the class overlay. This should be 'better', since these two components capture information from all 784 columns (see the sketch after this list).
3. PCA of the activation values, as discussed earlier. This should be 'better' than the previous two, since the activations have captured features that are specific to discriminating between the classes.
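As a minimal sketch of plot 2 (the variable names pca_2d and pcs are illustrative, not from the assignment), the PCA scatter plot of the raw pixel data might be produced as follows; the same recipe applied to the hidden-layer activations computed below gives plot 3:

# Sketch: PCA of the raw pixel data down to 2 components,
# then a scatter plot of PC1 vs PC2 colored by class label.
# Assumes x_train_norm (60000, 784) and y_train (60000,) from above.
pca_2d = PCA(n_components=2)
pcs = pca_2d.fit_transform(x_train_norm)  # shape (60000, 2)
plt.figure(figsize=(10, 8))
scatter = plt.scatter(pcs[:, 0], pcs[:, 1], c=y_train, cmap='tab10', s=2, alpha=0.5)
plt.legend(*scatter.legend_elements(), title="Class")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()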
To get the activation values of the hidden nodes, we need to create a new model, activation_model, that takes the same input as our current model but outputs the activation values of the hidden layer, i.e. of each of its 128 hidden nodes. Then we use the predict function to get the activation values.
# Extracts the outputs of the 2 layers:
layer_outputs = [layer.output for layer in model.layers]
# Creates a model that will return these outputs, given the model input:
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)
print(f"There are {len(layer_outputs)} layers")
layer_outputs; # description of the layers
There are 2 layers
# Get the outputs of all the hidden nodes for each of the 60000 training images
activations = activation_model.predict(x_train_norm)
hidden_layer_activation = activations[0]
output_layer_activations = activations[1]
hidden_layer_activation.shape # each of the 128 hidden nodes has one activation value per training image
1875/1875 [==============================] - 2s 784us/step
(60000, 128)
output_layer_activations.shape
(60000, 10)
print(f"The maximum activation value of the hidden nodes in the hidden layer is \
{hidden_layer_activation.max()}")
The maximum activation value of the hidden nodes in the hidden layer is 4.983288288116455
# Some stats about the output layer as an aside...
np.set_printoptions(suppress = True) # display probabilities as decimals and NOT in scientific notation
print(f"The output layer has shape {output_layer_activations.shape}")
print(f"The outputs for the first image are {output_layer_activations[0].round(4)}")
print(f"The sum of the probabilities is (approximately) {output_layer_activations[0].sum()}")
The output layer has shape (60000, 10) The outputs for the first image are [0. 0. 0. 0.012 0. 0.988 0. 0. 0. 0. ] The sum of the probabilities is (approximately) 0.9999999403953552
# Build a dataframe with the class label and all 128 hidden-node activation values per image
activation_data = {'actual_class': y_train}
for k in range(128):
    activation_data[f"act_val_{k}"] = hidden_layer_activation[:, k]
activation_df = pd.DataFrame(activation_data)
activation_df.head(15).round(3).T
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
actual_class | 5.000 | 0.000 | 4.000 | 1.000 | 9.000 | 2.000 | 1.000 | 3.000 | 1.000 | 4.000 | 3.000 | 5.000 | 3.000 | 6.000 | 1.000 |
act_val_0 | 0.000 | 0.000 | 0.000 | 0.473 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.022 | 0.073 | 0.961 | 0.000 | 0.473 | 0.000 |
act_val_1 | 0.000 | 0.171 | 0.000 | 0.000 | 0.121 | 0.000 | 0.184 | 0.191 | 0.210 | 0.000 | 0.000 | 0.000 | 0.887 | 0.000 | 0.312 |
act_val_2 | 1.562 | 1.363 | 0.213 | 0.362 | 0.397 | 0.525 | 0.000 | 2.138 | 0.000 | 0.000 | 0.604 | 0.000 | 1.161 | 0.000 | 0.000 |
act_val_3 | 0.121 | 0.000 | 0.000 | 1.439 | 0.000 | 0.000 | 0.925 | 0.000 | 0.578 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.069 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
act_val_123 | 0.919 | 0.000 | 0.000 | 0.088 | 0.173 | 0.597 | 0.000 | 0.317 | 0.000 | 1.015 | 0.000 | 0.632 | 0.000 | 1.103 | 0.252 |
act_val_124 | 0.543 | 0.662 | 0.076 | 0.000 | 0.741 | 1.571 | 0.000 | 0.000 | 0.237 | 0.517 | 0.000 | 0.090 | 0.000 | 0.000 | 0.022 |
act_val_125 | 0.819 | 0.497 | 0.000 | 0.000 | 0.000 | 0.000 | 0.919 | 1.198 | 0.780 | 0.000 | 1.096 | 0.000 | 1.451 | 0.000 | 0.487 |
act_val_126 | 0.199 | 0.000 | 0.218 | 0.440 | 0.371 | 0.000 | 0.257 | 0.850 | 0.087 | 0.616 | 0.753 | 0.000 | 0.746 | 0.178 | 0.125 |
act_val_127 | 0.000 | 0.000 | 0.062 | 0.000 | 0.234 | 0.000 | 0.017 | 0.895 | 0.000 | 0.218 | 0.706 | 0.100 | 0.261 | 0.000 | 0.006 |
129 rows × 15 columns
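With activation_df in hand, the box plot described in the introduction is straightforward. A minimal sketch, assuming activation_df from above and picking act_val_0 as an arbitrary example node:

# Sketch: box plot of one hidden node's activation values grouped by class.
# act_val_0 is an arbitrary example node, chosen for illustration.
plt.figure(figsize=(12, 5))
sns.boxplot(x='actual_class', y='act_val_0', data=activation_df)
plt.xlabel('class label')
plt.ylabel('activation value of hidden node 0')
plt.show()

If the expectation from the introduction holds, the boxes for the different classes will overlap heavily, confirming that a single node's activation has little predictive power on its own.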