Week 1
Model selection with AI assistance
Based on the AI's suggestion, an SVM is initially selected as the model for the research implementation.
Dataset: SMS Spam Collection
Task: Text Classification
Load the dataset
Show/Hide the code
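The loading code is hidden above; below is a minimal sketch of how the SMS Spam Collection could be read with pandas. The file path and the tab-separated, header-less layout are assumptions based on the UCI distribution of this dataset.

```python
import pandas as pd

# Assumed file name/location; the UCI "SMS Spam Collection" is distributed as a
# tab-separated file with two columns (label, message) and no header row.
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "message"])
print(df.shape)  # expected: (5572, 2)
df
```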
| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
Data Preparation
Show/Hide the code
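The preparation cell is hidden; the sketch below shows one plausible preparation step, assuming the labels are encoded as 0/1 and the messages are turned into TF-IDF feature vectors. The vectorizer choice and variable names are my own, not necessarily what the hidden cell does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Encode labels as integers: spam -> 1 (positive class), ham -> 0.
y = (df["label"] == "spam").astype(int).to_numpy()

# Turn the raw messages into TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["message"])
```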
Dummy Model
Show/Hide the code
The result shows that this dummy model has an extremely low recall, reflecting its inability to detect spam messages.
Show/Hide the code
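The metric values below could be produced along these lines; a sketch using scikit-learn, where `y_true` and `y_pred` are assumed names for the encoded labels and the dummy model's predictions.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# y_true / y_pred are the encoded labels and the baseline's predictions (assumed names).
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("Classification Report:")
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```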
Accuracy: 0.8898061737257718
Precision: 0.7509433962264151
Recall: 0.26639892904953144
F1 Score: 0.3932806324110672
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.90 | 0.99 | 0.94 | 4825 |
| spam | 0.75 | 0.27 | 0.39 | 747 |
| accuracy | | | 0.89 | 5572 |
| macro avg | 0.82 | 0.63 | 0.67 | 5572 |
| weighted avg | 0.88 | 0.89 | 0.87 | 5572 |
The confusion matrix
Show/Hide the code
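A minimal sketch of how the confusion matrix might be computed and plotted, again assuming `y_true` and `y_pred` hold the labels and predictions.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Build and plot the confusion matrix from true labels and predictions (assumed names).
ConfusionMatrixDisplay.from_predictions(y_true, y_pred,
                                        display_labels=["ham", "spam"])
plt.show()
```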
Week 2
SVM is a type of linear classifier. Its simplest form is the maximum margin classifier. This kind of classifier requires the data to be linearly separable and seeks an optimal hyperplane that divides the data into two classes. The criterion for finding the optimal hyperplane is that it should be as far away from the data points as possible. In order to understand the concepts of data points, hyperplanes, and their relationship to the basic machine learning framework (hypothesis, loss, and optimizer), I decided to first implement a perceptron classifier and study its learning process.
Perceptron Learning Algorithm
The implementation of a Perceptron Classifier.
A perceptron classifies data using a hyperplane defined by a linear function $f(x) = w \cdot x + b$. It has been proven that if the data is linearly separable, the perceptron algorithm is guaranteed to converge. Let the class label be denoted by $y$, where $y = 1$ for the positive class and $y = -1$ for the negative class. The predicted label is given by $\hat{y} = \text{sign}(f(x))$, and the product $\hat{y}y$ indicates the correctness of the prediction: a positive value implies a correct prediction, while a negative value indicates a misclassification.
When a data point is correctly classified, the algorithm proceeds to the next data point without updating the model. However, in the case of a misclassification, the model parameters are updated according to the following rules:
$$ w \leftarrow w + y \cdot x \cdot \text{lr} $$
$$ b \leftarrow b + y \cdot \text{lr} $$
where $\text{lr}$ is a pre-defined learning rate.
This update mechanism implies that if the true label is positive, the normal vector $w$ of the hyperplane is adjusted in the direction of $x$, thereby increasing the likelihood of correctly classifying $x$ as a positive instance. Conversely, if the true label is negative, the vector $w$ is updated in the opposite direction of $x$, making it more likely that $x$ will be classified as negative.
If the dataset is linearly separable, the perceptron algorithm will converge to a solution in a finite number of steps. Otherwise, it will continue to update indefinitely without convergence.
The code below implements a simple perceptron classifier according to the above description, with the ability to record the training process.
Show/Hide the code
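The actual cell is hidden; below is a minimal sketch of what such a perceptron with history recording might look like. The class name, defaults, and per-epoch history format are my own choices.

```python
import numpy as np

class Perceptron:
    """Minimal perceptron classifier that records its training history."""

    def __init__(self, lr=1.0, max_epochs=100):
        self.lr = lr
        self.max_epochs = max_epochs
        self.history = []          # snapshots of (w, b, n_mistakes) per epoch

    def fit(self, X, y):
        # y is expected to take values in {-1, +1}.
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                # Misclassified (or on the boundary): apply the update rule.
                if yi * (np.dot(self.w, xi) + self.b) <= 0:
                    self.w += self.lr * yi * xi
                    self.b += self.lr * yi
                    mistakes += 1
            self.history.append((self.w.copy(), self.b, mistakes))
            if mistakes == 0:      # converged: every point classified correctly
                break
        return self

    def predict(self, X):
        return np.sign(X @ self.w + self.b)
```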
Show/Hide the code
Generate some mock data
Show/Hide the code
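A sketch of how linearly separable 2-D mock data could be generated; the class means, spread, and sample sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs on either side of a line, labelled +1 and -1.
n_per_class = 50
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(n_per_class, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.8, size=(n_per_class, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
```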
Show/Hide the code
100, 87, 106, 86, 78, 101, 76, 20, 100, 0
loss
Show/Hide the code
Visualize the hyperplane
The following equation rewrites the hyperplane from its normal vector form into a slope-intercept form, which facilitates visualization.
$$ w_1 x_1 + w_2 x_2 + b = 0 $$
$$ x_2 = -\frac{w_1 x_1 + b}{w_2} $$
As illustrated in the figure below, this transformation makes it easier to observe how the hyperplane’s normal vector is influenced by the data points. The green arrow represents the normal vector before the update, while the black arrow shows the normal vector after the update.
For example, in Update 2, when a negative sample lies on the positive side of the hyperplane, the green arrow shifts slightly away from this data point after the update, making it more likely for the point to appear on the opposite side relative to the updated normal vector. A more pronounced case can be seen in Update 4, where a negative sample (marked in yellow) initially lies on the positive side of the hyperplane but moves to the opposite side following the update.
Show/Hide the code
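A sketch of how a recorded hyperplane and its normal vector could be drawn in slope-intercept form. It assumes the `X`, `y`, and fitted `model` (with its `history`) from the sketches above; the author's version appears to record individual updates rather than epochs, so the indexing here is only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hyperplane(w, b, ax, **kwargs):
    """Draw w1*x1 + w2*x2 + b = 0 as x2 = -(w1*x1 + b) / w2 (assumes w2 != 0)."""
    x1 = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100)
    x2 = -(w[0] * x1 + b) / w[1]
    ax.plot(x1, x2, **kwargs)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y)

# Hyperplane and normal vector before and after one recorded update (assumed history format).
(w_before, b_before, _), (w_after, b_after, _) = model.history[0], model.history[1]
plot_hyperplane(w_before, b_before, ax, linestyle="--", label="before update")
plot_hyperplane(w_after, b_after, ax, label="after update")
ax.quiver(0, 0, w_before[0], w_before[1], color="green",
          angles="xy", scale_units="xy", scale=1)
ax.quiver(0, 0, w_after[0], w_after[1], color="black",
          angles="xy", scale_units="xy", scale=1)
ax.legend()
plt.show()
```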
ML Components in Perceptron
In the case of the perceptron, the hypothesis class is the family of affine functions on $\mathbb{R}^n$; the loss is the number of points misclassified by the perceptron. The optimizer is the update rule
$$ w \leftarrow w + y \cdot x \cdot \text{lr} $$
$$ b \leftarrow b + y \cdot \text{lr} $$
Week 3
In the case of the perceptron, the loss for a single point is either 0 (correct prediction) or 1 (incorrect prediction), and the overall loss is the number of points that are incorrectly classified. This is one valid way to measure how well a hypothesis approximates the true, unknown function $f$. Using different criteria to measure the deviation of hypothesis functions from the true function $f$ can lead to a different final choice of $g$. For example, in a system designed to detect harmful content targeted at children, the probability of false negatives - that is, harmful content that goes undetected - should be minimized as much as possible. In this case, the evaluation criterion should prioritize penalizing false negatives to the greatest extent.
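As a small illustration of this idea, a loss can weight the two error types differently. The sketch below uses arbitrary weights and is not the criterion used elsewhere in this log.

```python
import numpy as np

def weighted_zero_one_loss(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """0/1-style loss that penalizes false negatives more heavily than false positives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    return fn_weight * false_negatives + fp_weight * false_positives
```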
Logistic Regression
The implementation of a logistic regression classifier.
Show/Hide the code
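The implementation cell is hidden; below is a minimal sketch of a logistic regression classifier trained by batch gradient descent on the cross-entropy loss. The class name, learning rate, and iteration count are my own choices.

```python
import numpy as np

class LogisticRegressionGD:
    """Logistic regression trained with batch gradient descent."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.losses = []            # cross-entropy loss per iteration

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        # y is expected to take values in {0, 1}.
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.w + self.b)       # predicted P(y=1 | x)
            # Gradient of the average cross-entropy loss w.r.t. w and b.
            grad_w = X.T @ (p - y) / n_samples
            grad_b = np.mean(p - y)
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
            # Record the loss (clipped to avoid log(0)).
            eps = 1e-12
            loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            self.losses.append(loss)
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```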
Try Logistic Regression on a Synthetic Dataset
generate data
Show/Hide the code
learn
Show/Hide the code
loss
Show/Hide the code
visualize learning process
Show/Hide the code
Try Logistic Regression on a Real Dataset
load data and learn
Show/Hide the code
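A sketch of what loading the data and learning might look like here, assuming a TF-IDF representation and an 80/20 split (a 20% test set is at least consistent with the support of 1115 in the report below). Whether the hidden cell uses the custom class above or scikit-learn's LogisticRegression is unknown; the hyperparameters are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Reload and vectorize the SMS data (file name assumed, as in Week 1).
df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "message"])
y = (df["label"] == "spam").astype(int).to_numpy()

X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["message"], y, test_size=0.2, random_state=42, stratify=y)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Fit the logistic regression sketch from above on the TF-IDF features.
model = LogisticRegressionGD(lr=0.5, n_iters=2000).fit(X_train, y_train)
y_pred = model.predict(X_test)
```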
evaluate performance
Show/Hide the code
Accuracy: 0.9488789237668162
Precision: 1.0
Recall: 0.6174496644295302
F1 Score: 0.7634854771784232
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.94 | 1.00 | 0.97 | 966 |
| spam | 1.00 | 0.62 | 0.76 | 149 |
| accuracy | | | 0.95 | 1115 |
| macro avg | 0.97 | 0.81 | 0.87 | 1115 |
| weighted avg | 0.95 | 0.95 | 0.94 | 1115 |
Show/Hide the code
loss
Show/Hide the code
Week 4
The loss for multi-output regression should be the sum of the losses for all outputs. Whether using MSE or other types of loss functions, it’s simply a numerical summation. However, since the outputs of classification tasks are interpreted as probabilities, their sum must be 1. This means that an increase in the probability of one class must lead to a decrease in the probabilities of other classes in order to maintain the total sum of 1.
As for the cross entropy loss, I prefer to view it from a probabilistic and statistical perspective, namely as minimizing the negative log-likelihood function.
For a binary classification task, define the random variable:
$$ X = \begin{cases} 1, & \text{if positive} \\ 0, & \text{if negative} \end{cases} $$
Then $X \sim \text{Bern}(p)$, where $p$ is the probability of the sample being positive; $p$ is in effect a function of the model parameters $\mathbf{w}$.
The probability mass function is:
$$ P(X=y) = p^y (1-p)^{1-y}, \quad y \in \{0,1\}. $$
For a classification task with $K$ classes, let:
- $\mathbf{p} = (p_1, p_2, \dots, p_K)$ be the probability vector over all classes
- $\mathbf{y} = (y_1, y_2, \dots, y_K)$ be the one-hot encoded label
Then the probability of observing $\mathbf{y}$ is:
$$ P(\mathbf{X} = \mathbf{y}) = \prod_{k=1}^K p_k^{y_k}. $$
The likelihood function for a multi-class classification problem with $n$ samples and $K$ possible classes is:
$$ L(\mathbf{y} \mid \mathbf{x}; \mathbf{p}) = \prod_{i=1}^n \prod_{k=1}^K p_{i,k}^{\,y_{i,k}}, $$
where:
- $p_{i,k}$ is the predicted probability that sample $i$ belongs to class $k$, which is a function of the model parameters
- $y_{i,k}$ is the one-hot label
Taking the negative logarithm of the likelihood gives the negative log-likelihood:
$$ -\log L(\mathbf{y} \mid \mathbf{x}; \mathbf{p}) = -\sum_{i=1}^n \sum_{k=1}^K y_{i,k}\,\log p_{i,k}. $$
Since, for each sample $i$, only the true class has $y_{i,k}=1$, this simplifies to:
$$ -\log L(\mathbf{y} \mid \mathbf{x}; \mathbf{p}) = -\sum_{i=1}^n \log p_{i,c_i}, $$
where $c_i$ denotes the true class of sample $i$.
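A small numerical check of this simplification, computing the negative log-likelihood both from the full double sum and from the simplified form that picks out the true-class probabilities. The probabilities below are made up for illustration.

```python
import numpy as np

# Predicted class probabilities for 3 samples over K = 3 classes (rows sum to 1).
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
# One-hot labels: the true classes are 0, 1, 2.
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])

# Full double sum: -sum_i sum_k y_{i,k} * log p_{i,k}.
nll_full = -np.sum(y * np.log(p))

# Simplified form: -sum_i log p_{i, c_i}, where c_i is the true class of sample i.
true_class = y.argmax(axis=1)
nll_simple = -np.sum(np.log(p[np.arange(len(p)), true_class]))

print(nll_full, nll_simple)   # both equal -(log 0.7 + log 0.8 + log 0.5)
```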
Week 5
- Input and output formats of the functions
The simplest form of the Support Vector Machine (SVM), the Maximum Margin Classifier, shares the same family of hypotheses as the perceptron, namely the set of hyperplanes in $N$-dimensional space. The key difference lies in how they select the optimal hypothesis. The perceptron learning algorithm processes data points sequentially, one at a time. If a data point is misclassified, the algorithm updates the parameters, effectively adjusting the hypothesis.
By contrast, the Maximum Margin Classifier seeks an optimal hyperplane that maximizes the margin, positioning it as far as possible from all data points. An SVM takes an $N$-dimensional vector, representing a data point, as input. The output is a prediction $\hat{y} \in \{-1, 1\}$.
- Are they parametrized?
Yes, they are. The parameter space is $\mathbb{R}^{N+1}$, where $N$ is the dimensionality of the input vectors.
- Randomly sample 100 functions from this family
Suppose the dataset is 2-dimensional.
Show/Hide the code
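The sampling cell is hidden; below is a sketch of how 100 random hypotheses (lines in 2-D, i.e. parameter triples in $\mathbb{R}^3$) could be drawn and plotted. The sampling distribution is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

fig, ax = plt.subplots()
x1 = np.linspace(-5, 5, 100)
for _ in range(100):
    # Sample (w1, w2, b) from a standard normal; each triple defines one hypothesis.
    w1, w2, b = rng.normal(size=3)
    if abs(w2) < 1e-6:          # skip near-vertical lines for plotting convenience
        continue
    # Plot the line w1*x1 + w2*x2 + b = 0 in slope-intercept form.
    ax.plot(x1, -(w1 * x1 + b) / w2, alpha=0.3)
ax.set_xlim(-5, 5)
ax.set_ylim(-5, 5)
plt.show()
```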