Perfect. Here is Lab 2: Linear Regression in Python, written in the same one-action-at-a-time format your intern can follow in a Jupyter Notebook.
This lab assumes they completed Lab 1 and already have:
X_train_scaled
X_test_scaled
y_train
y_test
Goal: Learn how to train, evaluate, and interpret a Linear Regression model using the cleaned, encoded, and scaled data from Lab 1.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

model = LinearRegression()
model
You should see:
LinearRegression()
This means your model object was created successfully.
model.fit(X_train_scaled, y_train)
Intern should understand: The model learns one coefficient per feature (plus an intercept), choosing the line that fits the training data with the smallest possible squared error.
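Optional aside for a curious intern: ordinary least squares has a closed-form solution, so the "best possible line" sklearn finds can be reproduced directly with NumPy. This is only a sketch; it assumes X_train_scaled and y_train from Lab 1 are still in memory and that the features are not perfectly collinear.

# Rebuild the OLS solution by hand and compare it to sklearn's fit
X_aug = np.column_stack([np.ones(len(X_train_scaled)), X_train_scaled])  # add an intercept column of ones
beta, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)                   # least-squares solution
print(beta[0], model.intercept_)   # the two intercepts should match
print(beta[1:])                    # should match model.coef_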
y_train_pred = model.predict(X_train_scaled)
y_train_pred[:5]

y_test_pred = model.predict(X_test_scaled)
y_test_pred[:5]
Expected: predicted scores close to the actual ones, but not exact; this is normal.
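One quick, optional way to sanity-check these predictions is to line the first few up next to the actual test scores (assumes pandas is imported as pd, as above):

# Side-by-side view of actual vs predicted for the first five test students
comparison = pd.DataFrame({
    "Actual": np.asarray(y_test)[:5],
    "Predicted": y_test_pred[:5]
})
comparison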
We will use 3 evaluation metrics:
MAE (Mean Absolute Error): the average size of the prediction errors, in score points.
MSE (Mean Squared Error): like MAE, but large errors are penalized more heavily.
R² (R-squared): the share of the variation in FinalScore that the model explains (1.0 is perfect).
train_mae = mean_absolute_error(y_train, y_train_pred)
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
train_mae, train_mse, train_r2
You should see three numbers: the training MAE, MSE, and R² (the exact values depend on your Lab 1 data).
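If the intern wants to see exactly what these three numbers measure, here is a minimal sketch that recomputes them by hand with NumPy; it should reproduce the sklearn values.

errors = np.asarray(y_train) - y_train_pred
mae_by_hand = np.mean(np.abs(errors))                            # MAE: average absolute error
mse_by_hand = np.mean(errors ** 2)                               # MSE: average squared error
ss_res = np.sum(errors ** 2)                                     # residual sum of squares
ss_tot = np.sum((np.asarray(y_train) - np.mean(y_train)) ** 2)   # total sum of squares
r2_by_hand = 1 - ss_res / ss_tot                                 # R² = 1 - SS_res / SS_tot
mae_by_hand, mse_by_hand, r2_by_hand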
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)
test_mae, test_mse, test_r2
Ideally, the test metrics should be close to the training metrics: MAE and MSE only slightly higher, R² only slightly lower.
If test performance is much worse, the model is overfitting.
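A small, optional table makes the train-vs-test comparison easier to eyeball (assumes the six metric variables above are defined):

# Train and test metrics side by side
metrics = pd.DataFrame(
    {"Train": [train_mae, train_mse, train_r2],
     "Test": [test_mae, test_mse, test_r2]},
    index=["MAE", "MSE", "R²"]
)
metrics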
Linear Regression gives us two things to interpret: an intercept and one coefficient per feature.
model.intercept_
This is the predicted score when every feature is zero, in scaled units (which is not the same as a raw value of zero).
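Optional check to make the intercept concrete: predicting a single row of all-zero scaled feature values returns exactly the intercept. This sketch assumes X_train_scaled is a NumPy array from Lab 1.

zero_row = np.zeros((1, X_train_scaled.shape[1]))   # one "student" with every scaled feature equal to 0
model.predict(zero_row)[0], model.intercept_        # the two values should be identical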
model.coef_
This gives an array of coefficient values, one per feature.
But this is useless without knowing which coefficient belongs to which feature.
coef_table = pd.DataFrame({
    "Feature": X_train.columns,   # column names from the unscaled Lab 1 DataFrame
    "Coefficient": model.coef_
}).sort_values(by="Coefficient", ascending=False)
coef_table
This shows each feature next to its coefficient, sorted from most positive to most negative.
Intern should observe: which features have large positive coefficients (they push the predicted score up) and which are negative (they pull it down).
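An optional horizontal bar chart of coef_table makes the relative sizes and signs easier to see (assumes matplotlib is imported as plt, as above):

coef_table.plot(kind="barh", x="Feature", y="Coefficient", legend=False)
plt.xlabel("Coefficient (change in predicted score per scaled unit)")
plt.title("Linear Regression Coefficients")
plt.tight_layout()
plt.show()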
plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual Final Scores")
plt.ylabel("Predicted Final Scores")
plt.title("Actual vs Predicted β Linear Regression")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red") # perfect prediction line
plt.show()
Intern should understand: points close to the red line are accurate predictions; points far above or below it are students the model misjudges.
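As an optional follow-up to the plot, the intern can quantify how close the points sit to the red line; the 5-point threshold below is arbitrary and only for illustration.

within_5 = np.abs(np.asarray(y_test) - y_test_pred) <= 5   # True where the error is at most 5 score points
print(f"{within_5.mean() * 100:.1f}% of test predictions are within 5 points of the actual score")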
Let's simulate a new student:
We must format the data with exactly the same columns, in the same order, as the training data.
new_student = pd.DataFrame({
    "studyhours": [4],
    "prevscore": [70],
    "gradelevel_encoded": [11],
    "gender_M": [1],
    "gender_Unknown": [0],
    "city_Mumbai": [1],
    "city_Delhi": [0]
})
new_student

new_student_scaled = scaler.transform(new_student)
model.predict(new_student_scaled)
You will get a predicted FinalScore (e.g., 75.4).
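Optionally, these three steps can be wrapped in a small helper so new students can be scored without retyping the dictionary. This is only a sketch: predict_final_score is a made-up name, and it assumes scaler, model, and new_student (for the column order) are still in memory; the example values below are hypothetical.

def predict_final_score(feature_values):
    """Predict FinalScore for one student from a dict of raw feature values."""
    row = pd.DataFrame([feature_values])   # one-row DataFrame
    row = row[new_student.columns]         # enforce the same column order as above
    row_scaled = scaler.transform(row)     # reuse the scaler fitted in Lab 1
    return model.predict(row_scaled)[0]

predict_final_score({
    "studyhours": 6, "prevscore": 82, "gradelevel_encoded": 12,
    "gender_M": 0, "gender_Unknown": 0, "city_Mumbai": 0, "city_Delhi": 1
})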
The intern now knows how to:
Train a Linear Regression model on scaled data
Evaluate it with MAE, MSE, and R² on both the training and test sets
Interpret the intercept and the feature coefficients
Predict the FinalScore for a brand-new student