Getting Started with Data Processing: Chapter 3

Chapter 3: Dive Into Machine Learning - From Regression to Clustering

Picking up where the last chapter left off

Welcome to the third chapter of my data analytics blog series! Today, we’ll explore the fundamentals of machine learning, from core definitions to practical algorithms like linear regression, classification, and clustering. Let’s jump in!

Introduction to Machine Learning

What Is Machine Learning?

Machine learning is commonly defined by three core components:

  • Task (T): The goal the model aims to achieve (e.g., classifying emails as spam).
  • Experience (E): Labeled or unlabeled data used to train the model (e.g., historical email data with “spam” labels).
  • Performance (P): A metric that measures how well the model performs the task (e.g., the percentage of correctly classified emails).

In short: Machines learn from experience to improve performance on a specific task—no explicit programming required!

Supervised vs. Unsupervised Learning

Supervised Learning:
  • Core feature: Uses labeled data (X = input features + y = target labels).
  • “Teacher” role: Yes; the labels guide learning.
  • Key goal: Predict known targets.
  • Examples: classify tumors as malignant/benign; predict house prices; filter spam emails.

Unsupervised Learning:
  • Core feature: Uses unlabeled data (only X = input features).
  • “Teacher” role: No; the model discovers hidden patterns on its own.
  • Key goal: Uncover unknown structures (e.g., clusters).
  • Examples: group customers by shopping behavior; identify anomalies in sensor data; segment social media users by interests.

Regression vs. Classification

Both are subfields of supervised learning; they differ in the type of target variable:

Regression:
  • Target variable: Continuous numeric values.
  • Key goal: Predict a quantity (e.g., price, temperature).
  • Examples: predict house price based on size; forecast monthly sales; estimate student exam scores.

Classification:
  • Target variable: Discrete categorical labels.
  • Key goal: Assign a category (e.g., “yes/no”, “A/B/C”).
  • Examples: classify loan applicants as “safe/risky”; detect COVID-19 (positive/negative); identify handwritten digits (0-9).

Sampling for Model Evaluation

To assess a model’s performance, we split the data into training and testing sets. Here are two common methods (a code sketch of both follows the list):

  1. Hold-Out Approach:
    • Split the data randomly into a training set (e.g., 70-80%) and a test set (e.g., 20-30%).
    • Train the model on the training set; evaluate accuracy on the test set.
    • Simple and fast, but performance depends on the random split.
  2. Cross-Validation:
    • More robust than hold-out. Split the data into k equal “folds” (e.g., k = 5).
    • Train the model on k − 1 folds; test on the remaining fold.
    • Repeat k times (each fold acts as the test set once); average the results.
    • Reduces the bias introduced by a single random split (this is known as k-fold cross-validation).
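
To make this concrete, here is a minimal sketch of both methods using scikit-learn (assumed installed; the data is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic data: 100 samples, 1 feature, roughly linear with noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 5 + rng.normal(0, 1, size=100)

# 1. Hold-out: 80% training set, 20% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R^2 on the test set:", model.score(X_test, y_test))

# 2. 5-fold cross-validation: each fold acts as the test set once
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("R^2 per fold:", scores)
print("Average R^2:", scores.mean())
```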

Linear Regression

Linear regression is a supervised learning algorithm for predicting continuous targets. Let’s break down its key components.

Key Terms in Linear Regression

fθ(x) = θ0 + θ1x

  • fθ(x): The predicted target value.
  • θ0: Intercept, the value of fθ(x) when x = 0 (where the line crosses the y-axis).
  • θ1: Coefficient, the slope of the line (the change in y per unit change in x).
  • x: Input feature, the variable used to predict y.

Simple vs. Multiple Linear Regression

Simple Linear Regression:
  • Number of features: 1 input feature (x).
  • Model formula: fθ(x) = θ0 + θ1x
  • Example: predict house price from square footage.

Multiple Linear Regression:
  • Number of features: ≥2 input features (x1, x2, ..., xn).
  • Model formula: fθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn
  • Example: predict house price from square footage + number of bedrooms + age (see the fitting sketch below).
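
Here is a minimal sketch of fitting both model types with scikit-learn; the house data is made up for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [square footage, bedrooms, age] -> price (in $1,000s)
X = np.array([[1400, 3, 20], [1600, 3, 15], [1700, 4, 10],
              [1875, 4, 5], [2100, 5, 2]])
y = np.array([245, 312, 279, 308, 399])

# Simple linear regression: square footage only (first column)
simple = LinearRegression().fit(X[:, [0]], y)
print("theta0 (intercept):", simple.intercept_)
print("theta1 (coefficient):", simple.coef_[0])

# Multiple linear regression: all three features
multiple = LinearRegression().fit(X, y)
print("Intercept:", multiple.intercept_, "Coefficients:", multiple.coef_)

# Predict the price of a 2,000 sq ft, 4-bedroom, 8-year-old house
print("Predicted price:", multiple.predict([[2000, 4, 8]])[0])
```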

How to Plot a Simple Linear Regression Model

The model fθ(x) = θ0 + θ1x is a straight line; its shape depends on θ0 and θ1:

  • Changing θ0 (intercept): Shifts the line up (higher θ0) or down (lower θ0) along the y-axis.
    Example: θ0 = 10, θ1 = 2 → the line crosses the y-axis at (0, 10); θ0 = 5, θ1 = 2 → it crosses at (0, 5).
  • Changing θ1 (coefficient): Adjusts the slope.
    Example: θ0 = 10, θ1 = 3 → a steeper line; θ0 = 10, θ1 = 1 → a flatter line.

To plot (see the matplotlib sketch below):

  1. Choose a range of x values (e.g., 0 to 10).
  2. Calculate fθ(x) for each x using the formula.
  3. Plot the (x, fθ(x)) points and draw a straight line through them.
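
The three steps in a minimal matplotlib sketch, plotting lines with different intercepts and slopes:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)  # step 1: choose a range of x values

for theta0, theta1 in [(10, 2), (5, 2), (10, 3)]:
    f_x = theta0 + theta1 * x  # step 2: calculate f(x) for each x
    plt.plot(x, f_x, label=f"theta0={theta0}, theta1={theta1}")  # step 3

plt.xlabel("x (input feature)")
plt.ylabel("f(x) (predicted value)")
plt.legend()
plt.show()
```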

Regression Model Evaluation

We use three key metrics to measure how well the model’s predictions match the actual values (all three are computed in code after the list):

  1. MAE (Mean Absolute Error):
    • Formula: MAE = (1/n) Σ |yi − fθ(xi)|
    • Definition: Average of the absolute differences between actual (yi) and predicted (fθ(xi)) values.
    • Interpretation: Lower MAE = better model (less error).
  2. MSE (Mean Squared Error):
    • Formula: MSE = (1/n) Σ (yi − fθ(xi))²
    • Definition: Average of the squared differences between actual and predicted values.
    • Interpretation: Penalizes large errors (it squares them); lower MSE = better model.
  3. R-Squared (Coefficient of Determination):
    • Formula: R² = 1 − Σ (yi − fθ(xi))² / Σ (yi − ȳ)²
    • Terms: ȳ = mean of the actual values; the numerator is the sum of squared errors; the denominator is the total variance of y.
    • Interpretation: Proportion of the variance in y explained by the model (typically 0 ≤ R² ≤ 1). Closer to 1 = the model explains more variance.
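
A quick sketch of all three metrics with scikit-learn; the actual and predicted values are made up:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = [3.0, 5.0, 7.5, 10.0]     # yi
y_predicted = [2.8, 5.4, 7.0, 10.5]  # f(xi)

print("MAE:", mean_absolute_error(y_actual, y_predicted))
print("MSE:", mean_squared_error(y_actual, y_predicted))
print("R^2:", r2_score(y_actual, y_predicted))
```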

Model Evaluation & Selection

  • Training Error: Error on the training set (e.g., MAE/MSE calculated on the training data).
    Measures how well the model fits the training data.
  • Generalization (Test) Error: Error on the test set.
    Measures how well the model performs on new, unseen data (the true goal!).
  • Hold-Out Approach for Data Partitioning:
    • Randomly split the data into training (70-80%) and test (20-30%) sets.
    • Train on the training data; use the test error to estimate the generalization error.
    • Avoid overfitting: a model with low training error but high test error is “memorizing” the training data, not learning (see the sketch below).
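
A small sketch of what overfitting looks like in practice; the high-degree polynomial model and the synthetic sine data are my own illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data: 30 samples, 1 feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, size=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A degree-15 polynomial has enough flexibility to "memorize" the training set
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print("Training MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
# Expect the training MSE to be near zero and the test MSE to be much larger
```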

Classification

Classification predicts discrete categorical labels (e.g., “spam” vs. “not spam”). Let’s explore its key metrics and decision tree models.

Classification Accuracy Measures

First, define the confusion matrix terms for binary classification (a code sketch follows the list):

  • TP (True Positive): Actual positive → predicted positive (e.g., COVID positive → test positive).
  • TN (True Negative): Actual negative → predicted negative (e.g., COVID negative → test negative).
  • FP (False Positive): Actual negative → predicted positive (e.g., COVID negative → test positive).
  • FN (False Negative): Actual positive → predicted negative (e.g., COVID positive → test negative).
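
Here is a minimal sketch of building a confusion matrix with scikit-learn; the labels are invented (1 = positive, 0 = negative):

```python
from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 1, 0, 0, 0, 0, 1]     # ground truth
y_predicted = [1, 1, 0, 0, 0, 1, 0, 1]  # model output

# For binary labels, the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```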

Using these, we can calculate the core metrics (computed in code after the list):

  1. Overall Accuracy:
    • Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
    • Definition: Percentage of total predictions that are correct.
    • Limitation: Misleading for imbalanced data (e.g., with 99% non-spam emails, predicting everything as non-spam gives 99% accuracy but fails to detect any spam).
  2. Precision:
    • Formula: Precision = TP / (TP + FP)
    • Definition: Percentage of predicted positives that are actually positive.
    • Use case: Critical when FP is costly (e.g., diagnosing a rare disease, where false positives cause unnecessary treatment).
  3. Recall (Sensitivity):
    • Formula: Recall = TP / (TP + FN)
    • Definition: Percentage of actual positives that are correctly predicted.
    • Use case: Critical when FN is costly (e.g., fraud detection, where missing actual fraud cases is expensive).
  4. F1-Score:
    • Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
    • Definition: The harmonic mean of precision and recall; it is the most common member of the general family of F-scores.
    • Use case: A single number that balances precision and recall; the higher the F1-score, the more accurate the model.
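
The same metrics in scikit-learn, reusing the invented labels from the confusion matrix sketch above:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_actual = [1, 1, 1, 0, 0, 0, 0, 1]
y_predicted = [1, 1, 0, 0, 0, 1, 0, 1]

print("Accuracy:", accuracy_score(y_actual, y_predicted))
print("Precision:", precision_score(y_actual, y_predicted))
print("Recall:", recall_score(y_actual, y_predicted))
print("F1-score:", f1_score(y_actual, y_predicted))
```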

Classification Accuracy Can Be Misleading

Let us focus on a two-class problem (e.g., “non-cancer”/“cancer” patients) where:

  • the number of class C1 (non-cancer) tuples is 9,990, and
  • the number of class C2 (cancer) tuples is 10.
    • If the classifier predicts everything to be class C1, its accuracy is 99.9%.
    • However, this is misleading: the classifier does not correctly predict a single tuple from C2 (the quick check below makes this concrete).
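
A quick check of that arithmetic, using the class counts from above and treating C2 (cancer) as the positive class:

```python
# The classifier predicts C1 for everything:
# all 9,990 C1 tuples are "correct", all 10 C2 tuples are missed
tp, tn, fp, fn = 0, 9990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(f"Accuracy: {accuracy:.1%}")    # 99.9%
print(f"Recall on C2: {recall:.1%}")  # 0.0%
```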

Decision Tree Representation

A decision tree is a hierarchical model for classification, consisting of three components:

  • Internal Nodes: Represent tests on a feature (e.g., “Is taxable income > 80K?”).
  • Branches (Edges): Represent outcomes of the test (e.g., “Yes” or “No” for the income test).
  • Leaf Nodes: Represent final class labels (e.g., “Cheat = Yes” or “Cheat = No”).

Node Splitting Criteria

Nodes are split so that the resulting branches are:

  • As similar as possible within a branch: Most data points in a branch belong to the same class.
  • As different as possible between branches: Data points in different branches tend to belong to different classes.

Simple Example: Binary Split

Suppose we have a numeric feature “Taxable Income” and a class “Cheat” (Yes/No):

  • Split the feature into two branches: “Income ≤ 80K” and “Income > 80K”.
  • Goal: Ideally, all “Cheat = Yes” cases fall into one branch and all “Cheat = No” cases into the other.
  • Result: Branch 1 (“≤ 80K”) has 90% “No” cases; Branch 2 (“> 80K”) has 70% “Yes” cases, so the split is effective (a small training sketch follows).
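
Here is a minimal sketch of learning a single split with a scikit-learn decision tree; the income and cheat values are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: taxable income in $K -> cheat (0 = No, 1 = Yes)
X = [[60], [70], [75], [78], [85], [90], [95], [120]]
y = [0, 0, 0, 0, 1, 1, 0, 1]

# max_depth=1 forces one split; the tree picks the threshold that best
# separates the two classes
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["taxable_income"]))
```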

Clustering Methods

Clustering is an unsupervised learning technique that groups unlabeled data into clusters based on similarity.

What Is Clustering?

Clustering aims to:

  • Group objects with high similarity into the same cluster.
  • Ensure objects in different clusters have low similarity.
  • Discover hidden structures in data (e.g., customer groups with similar buying habits).

3 Examples of Clustering Applications

  1. Customer Segmentation: Group customers by purchase frequency, spending amount, or product preferences to design targeted marketing campaigns.
  2. Anomaly Detection: Identify unusual patterns (e.g., fraudulent transactions, faulty sensor readings) that don’t fit any cluster.
  3. Text Document Clustering: Group news articles, research papers, or social media posts by topic (e.g., “politics”, “technology”, “sports”).

Similarity Measures: Manhattan & Euclidean Distance

Similarity is often measured by distance: the smaller the distance, the higher the similarity. For two instances with two features, x = (x1, x2) and y = (y1, y2) (both distances are computed in code after the list):

  1. Manhattan Distance:
    • Formula: dM(x, y) = |x1 − y1| + |x2 − y2|
    • Definition: Sum of the absolute differences of their coordinates (like walking city blocks).
    • Example: Instance A (2, 3) and Instance B (5, 7) → dM = |2 − 5| + |3 − 7| = 3 + 4 = 7.
  2. Euclidean Distance:
    • Formula: dE(x, y) = √((x1 − y1)² + (x2 − y2)²)
    • Definition: Straight-line distance between the two points in 2D space.
    • Example: Instance A (2, 3) and Instance B (5, 7) → dE = √(3² + 4²) = √25 = 5.
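
The same two distances in a few lines of NumPy, using points A and B from the examples above:

```python
import numpy as np

a = np.array([2, 3])  # Instance A
b = np.array([5, 7])  # Instance B

manhattan = np.sum(np.abs(a - b))          # |2-5| + |3-7| = 7
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16) = 5.0
print("Manhattan:", manhattan, "Euclidean:", euclidean)
```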

How Distance Is Used for Clustering

Clustering algorithms use distance to group instances. Let’s take a simple example (a k-means sketch follows the list):

  • Data: 4 cars with two features: Color (encoded numerically: Red = 1, Blue = 2) and Speed (km/h).
    • Car 1: (1, 100) → Red, 100 km/h
    • Car 2: (1, 110) → Red, 110 km/h
    • Car 3: (2, 180) → Blue, 180 km/h
    • Car 4: (2, 170) → Blue, 170 km/h
  • Distance Calculation (Euclidean):
    • Car 1 & Car 2 (same color): √(0² + 10²) = 10 (small distance → same cluster).
    • Car 3 & Car 4 (same color): √(0² + 10²) = 10 (small distance → same cluster).
    • Car 1 & Car 3 (different color and speed): √(1² + 80²) ≈ 80.01 (large distance → different clusters).
  • Result: Two clusters, {Car 1, Car 2} and {Car 3, Car 4}, grouped by color and speed similarity!
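
A minimal sketch of clustering those four cars with k-means in scikit-learn (k = 2; the feature values come from the example above):

```python
import numpy as np
from sklearn.cluster import KMeans

# [color (Red=1, Blue=2), speed (km/h)] for Cars 1-4
cars = np.array([[1, 100], [1, 110], [2, 180], [2, 170]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cars)
print("Cluster labels:", kmeans.labels_)  # e.g., [0 0 1 1]
print("Cluster centers:", kmeans.cluster_centers_)
```

Note that speed dominates the Euclidean distance here because the two features are on very different scales; in practice, features are usually standardized before clustering.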

Machine learning is all about turning data into actionable insights: predicting values with regression, classifying labels with decision trees, and finding hidden groups with clustering. These fundamentals will help you tackle real-world problems with confidence!

THE END