Train/Test Split
Why you must evaluate a model on data it has never seen
Explanation
You must evaluate your model on data it was NOT trained on. Otherwise you are testing whether it memorized the training data, not whether it can generalize to new data.
The split:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 80% train, 20% test
# random_state ensures reproducibility
```
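To make the reproducibility point concrete, here is a minimal sketch that runs the same split twice and checks that both calls return the same rows. The toy arrays are made up for illustration and are not part of the lesson's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce the same shuffle
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Train shape:", X_train_a.shape)
print("Test shape:", X_test_a.shape)
print("Identical test sets:", same_split)
```

If random_state is left out, each run may shuffle differently, so scores can vary slightly between runs.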
Workflow:
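- 1. Split the data FIRST, before fitting any preprocessing (scaling, imputation, feature selection), so no information from the test set leaks into training
- 2. Train the model on X_train / y_train ONLY
- 3. Evaluate on X_test / y_test (data the model has NEVER seen)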
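The three steps above can be sketched as follows. The synthetic data and the choice of StandardScaler as the preprocessing step are assumptions made for illustration; the point is only that the scaler is fitted on the training set and then reused, unchanged, on the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): y = 2x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.squeeze() + rng.normal(0, 2, size=100)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the preprocessing and the model on the training data only
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# 3. Evaluate on the test data, reusing the scaler fitted on the training data
test_r2 = model.score(scaler.transform(X_test), y_test)
print(f"Test R²: {test_r2:.3f}")
```

Fitting the scaler on the full dataset before splitting would let the test set's mean and variance influence training, which is the leakage step 1 is meant to prevent.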
Why 80/20? Common default. With more data you can afford a smaller test set. With very little data, use cross-validation instead.
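For the small-data case, a minimal cross-validation sketch might look like this. The 30-sample dataset and the choice of 5 folds are assumptions for illustration; cross_val_score fits the model once per fold and scores it on the held-out fold each time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset (illustrative assumption): only 30 samples
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2 * X.squeeze() + rng.normal(0, 2, size=30)

# 5-fold cross-validation: every sample is held out exactly once.
# For regressors the default score is R².
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print("Fold R² scores:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f}")
```

Averaging over folds gives a steadier estimate than a single 6-sample test set would.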
Train accuracy vs Test accuracy:
- Train ≈ Test, both high → good fit
- Train >> Test → overfitting (the model memorized the training data)
- Both low → underfitting (the model is too simple)
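As a rough illustration of the overfitting pattern above, the sketch below compares an unrestricted decision tree with a depth-limited one on noisy synthetic data. The data, the tree models, and max_depth=3 are all assumptions chosen for the demonstration, not part of the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (illustrative assumption): y = 2x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.squeeze() + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can grow until it reproduces every training point
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# A depth-limited tree has less capacity to memorize noise
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = shallow.score(X_train, y_train), shallow.score(X_test, y_test)

print(f"Deep tree     train R²: {deep_train:.3f}  test R²: {deep_test:.3f}")
print(f"Shallow tree  train R²: {shallow_train:.3f}  test R²: {shallow_test:.3f}")
```

The deep tree should score essentially perfectly on the training set and noticeably lower on the test set, which is the Train >> Test signature. The shallow tree's two scores would be expected to sit closer together.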
Validation set: Often data is split 3 ways: train / validation / test. Validation is used to tune hyperparameters; test is only used at the very end.
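One common way to get a three-way split is to call train_test_split twice. The sketch below assumes a 60/20/20 ratio and toy arrays, both chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): 100 samples, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First hold out the test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80% into train and validation.
# 0.25 of the remaining 80% is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters would be compared using X_val / y_val, and X_test / y_test would be touched only once, for the final evaluation.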
Examples
Full train/test workflow
An R² close to 1.0 means a good fit; similar train and test R² values suggest the model generalizes.
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Synthetic data: y = 2x plus Gaussian noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + np.random.randn(100) * 2

# Hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)  # train on the training set only

# For regressors, score() returns R²
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train R²: {train_score:.3f}")
print(f"Test R²: {test_score:.3f}")
```