AtomLearn
Machine Learning Basics

Train/Test Split

Why you must evaluate a model on data it has never seen

Explanation

You must evaluate your model on data it was NOT trained on — otherwise you're testing whether it memorized the data, not whether it can generalize.
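To make the memorization point concrete, here is a minimal sketch on made-up data where the labels are pure noise, so there is nothing real to learn. The dataset, the choice of a 1-nearest-neighbor classifier, and all variable names are illustrative assumptions, not part of the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed toy data: random features and random 0/1 labels (no real pattern)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A 1-nearest-neighbor model effectively stores the training set
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # each point is its own nearest neighbor
test_acc = model.score(X_test, y_test)     # should hover around chance level

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

Scoring on the training data alone would suggest a perfect model here; only the held-out score shows that nothing generalizable was learned.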

The split:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 80% train, 20% test
# random_state ensures reproducibility
```

Workflow:

  1. Split the data FIRST, before any preprocessing that learns from the data (scaling, imputation, feature selection)
  2. Train the model on X_train / y_train ONLY
  3. Evaluate on X_test / y_test (data the model has NEVER seen)
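The workflow above can be sketched as follows. This assumes the "processing" in step 1 is something like feature scaling; the synthetic data, the `StandardScaler` step, and the variable names are illustrative choices, not the lesson's own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed toy data: three features with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the scaling statistics from the training split only,
# then apply the same transformation to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Train on the training split only
model = LinearRegression().fit(X_train_scaled, y_train)

# 3. Evaluate on the held-out split
test_r2 = model.score(X_test_scaled, y_test)
print(f"Test R²: {test_r2:.3f}")
```

Fitting the scaler on the full dataset before splitting would let test-set statistics leak into training, which is why the split comes first.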

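Why 80/20? It is a common default, not a rule. With more data you can afford a smaller test fraction, because even a small share still contains enough examples for a stable estimate. With very little data, a single split gives a noisy estimate, so use cross-validation instead.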
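For the small-data case, a rough sketch of cross-validation might look like this. The 30-sample dataset, the choice of 5 folds, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumed toy data: only 30 samples, too few for a reliable single split
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2 * X.squeeze() + rng.normal(scale=2.0, size=30)

# 5 folds: each sample is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# For a regressor, the default score is R², computed once per fold
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print("R² per fold:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a single 80/20 split could mislead on a dataset this small.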

Train accuracy vs Test accuracy:

  • Train ≈ Test, both high → good fit
  • Train >> Test → overfitting (memorized training data)
  • Both low → underfitting (model too simple)
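One way to see these patterns is to vary model complexity and compare the two scores. The sketch below uses decision trees of different depths on made-up noisy sine data; the data, the depths, and the labels in `results` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a sine curve with Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.squeeze()) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, depth in [("too simple", 1), ("moderate", 4), ("unrestricted", None)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (train R², test R²) for each complexity level
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for name, (train_r2, test_r2) in results.items():
    print(f"{name:>12}: train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

The depth-1 tree should score low on both splits (underfitting), while the unrestricted tree fits the training noise perfectly and drops noticeably on the test split (overfitting).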

Validation set: Often data is split 3 ways: train / validation / test. Validation is used to tune hyperparameters; test is only used at the very end.
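A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions, the synthetic data, and the use of `Ridge` with a small list of candidate `alpha` values are assumptions for illustration, not something the lesson prescribes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumed toy data: 1000 samples, 5 features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([3.0, -1.0, 0.0, 2.0, 0.5]) + rng.normal(scale=1.0, size=1000)

# First carve off the test set (20%), then split the rest into
# train and validation (0.25 of the remaining 80% = 20% of the total)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Tune the hyperparameter using the validation set only
candidates = [0.01, 1.0, 100.0]
val_scores = {
    alpha: Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)
    for alpha in candidates
}
best_alpha = max(val_scores, key=val_scores.get)

# Touch the test set once, at the very end
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print(f"Best alpha: {best_alpha}, test R²: {test_r2:.3f}")
```

Because the test set played no part in choosing `alpha`, its score remains an honest estimate of performance on new data.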

Examples

Full train/test workflow

R² close to 1.0 means good fit

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Synthetic data: y = 2x plus Gaussian noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + np.random.randn(100) * 2

# Hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# For regressors, score() returns R²
train_score = model.score(X_train, y_train)
test_score  = model.score(X_test, y_test)
print(f"Train R²: {train_score:.3f}")
print(f"Test  R²: {test_score:.3f}")
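For reference, the R² that `score()` reports for a regressor is 1 minus the ratio of the residual sum of squares to the total sum of squares. A small hand-checkable sketch, using made-up numbers rather than the data above:

```python
import numpy as np
from sklearn.metrics import r2_score

# Assumed toy values, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.0375 = 0.9625

print(f"Manual R²:  {r2_manual:.4f}")
print(f"sklearn R²: {r2_score(y_true, y_pred):.4f}")
```

A value of 1.0 means the predictions match exactly, 0 means the model does no better than always predicting the mean, and negative values are possible on held-out data.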
