Train/Test Split

Why you must evaluate a model on data it has never seen

Explanation

You must evaluate your model on data it was NOT trained on. Otherwise you are testing whether it memorized the training data, not whether it can generalize to new data.

The split:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 80% train, 20% test
# random_state ensures reproducibility
```

Workflow:

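1. Split data FIRST, before any preprocessing that learns from the data (scaling, imputation, feature selection), so that no information from the test set leaks into training
2. Train the model on X_train / y_train ONLY
3. Evaluate on X_test / y_test (data the model has NEVER seen)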
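
The three steps above can be sketched as follows. This is a minimal illustration, not the only way to do it: the synthetic data, the choice of StandardScaler as the preprocessing step, and all variable names are assumptions made for the example. The point it shows is that the scaler is fitted on the training rows only and then merely applied to the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# 1. Split FIRST, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Learn the scaling statistics from the training set ONLY,
#    then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)

# 3. Evaluate on the held-out test set
test_r2 = model.score(X_test_scaled, y_test)
print(f"Test R²: {test_r2:.3f}")
```

Calling scaler.fit on the full X before splitting would let the test set's mean and variance influence training, which is the leakage that step 1 is meant to prevent.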

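Why 80/20? It is a common default, not a rule. With more data you can afford a smaller test fraction, because even a small fraction still contains enough samples for a stable estimate. With very little data, use cross-validation instead.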
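
For the small-data case, a rough sketch of cross-validation with scikit-learn's cross_val_score might look like this. The 30-sample dataset and the choice of 5 folds are assumptions for illustration; each sample is used for testing exactly once and for training in the other folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# A deliberately small synthetic dataset (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2 * X.squeeze() + rng.normal(scale=2.0, size=30)

# 5-fold cross-validation: train on 4 folds, score on the 5th, rotate
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a single 80/20 split could vary on a dataset this small.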

Train accuracy vs Test accuracy:

Train ≈ Test, both high → good fit
Train >> Test → overfitting (memorized training data)
Train ≈ Test, both low → underfitting (model too simple)
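
To make the overfitting case concrete, here is a small sketch that compares train and test scores for a model that can memorize its training data. The unrestricted DecisionTreeRegressor and the synthetic data are assumptions chosen for illustration; any sufficiently flexible model could show the same pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy linear data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.squeeze() + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With no depth limit, the tree can fit every training point, noise included
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
gap = train_r2 - test_r2

print(f"Train R²: {train_r2:.3f}")
print(f"Test  R²: {test_r2:.3f}")
print(f"Gap:      {gap:.3f}")
```

A near-perfect train score paired with a clearly lower test score is the Train >> Test pattern. Without the held-out test set, the train score alone would make this model look excellent.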

Validation set: Often data is split 3 ways: train / validation / test. Validation is used to tune hyperparameters; test is only used at the very end.
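
One common way to get a three-way split is to call train_test_split twice. The sketch below assumes a 60/20/20 ratio and uses placeholder random data; both are choices made for the example rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# First carve off 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Hyperparameters would be compared using X_val / y_val, and X_test / y_test would be touched once, after all tuning decisions are final.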

Examples

Full train/test workflow

R² close to 1.0 means a good fit; model.score() returns R² for scikit-learn regressors.

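```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Synthetic data: y = 2x plus Gaussian noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + np.random.randn(100) * 2

# Hold out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns R² for regressors
train_score = model.score(X_train, y_train)
test_score  = model.score(X_test, y_test)
print(f"Train R²: {train_score:.3f}")
print(f"Test  R²: {test_score:.3f}")

# Mean squared error on the held-out test set
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {test_mse:.3f}")
```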
