Lesson 10
Regression in Python
Big question
How do we estimate a simple regression with pandas and statsmodels?
Learning objectives
- Explain regression in Python in plain language.
- Use pandas to load and inspect the data that a regression interpretation relies on.
- Connect the lesson idea to a formula, graph, Python result, or real example.
Simple explanation
Python lets students load data, inspect variables, estimate a regression, and keep the workflow reproducible. The core pattern has five steps: load the CSV, define y and X, add an intercept, fit the model, and print the summary.
Key terms
- pandas: A Python library for loading and working with tabular data.
- statsmodels: A Python library for estimating statistical models and printing regression output.
- add_constant: A statsmodels helper that adds an intercept column to the explanatory variables.
- summary output: A formatted table reporting regression estimates and diagnostics.
Estimated sample regression function
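For the wage example in this lesson, the estimated line can be written in standard textbook notation as follows. The hat symbols and the subscript i are conventional notation assumed here; in the Python code, the intercept corresponds to model.params["const"] and the slope to model.params["education"].

```latex
\widehat{\text{wage}}_i = \hat{\beta}_0 + \hat{\beta}_1 \, \text{education}_i
```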
Example
The local CSV wage_sample.csv is small enough for students to read but still shows the full regression workflow.
Python workflow
Load, inspect, estimate, explain, and save the result.
- Load: Read wage_sample.csv into a pandas DataFrame.
- Inspect: Check columns, ranges, and a scatter plot before modeling.
- Estimate: Use statsmodels OLS with a constant column.
- Explain: Translate coefficients, R-squared, and limitations.
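The Load and Inspect steps might look like the sketch below. Because the contents of wage_sample.csv are not shown in this lesson, the sketch reads a small inline stand-in with the same column names (wage, education, experience); every value in it is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for wage_sample.csv; all values are invented
csv_text = """wage,education,experience
12.5,12,5
18.0,16,3
15.2,14,8
22.1,18,2
10.4,10,12
"""

# Load: read the CSV text into a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Inspect: column names, data types, summary statistics, missing values
print(df.columns.tolist())
print(df.dtypes)
print(df.describe())      # count, mean, std, min, quartiles, max per column
print(df.isna().sum())    # number of missing values per column
```

With the real file, the only change would be passing the file path to pd.read_csv instead of the inline text.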
Estimate wage on education
    import pandas as pd
    import statsmodels.api as sm

    # Load the sample data into a DataFrame
    df = pd.read_csv("wage_sample.csv")

    # Dependent variable, and explanatory variable plus an intercept column
    y = df["wage"]
    X = sm.add_constant(df["education"])

    # Fit ordinary least squares and print the full regression table
    model = sm.OLS(y, X).fit()
    print(model.summary())

    # Print the key estimates, rounded for easier interpretation
    print("Intercept:", round(model.params["const"], 2))
    print("Education slope:", round(model.params["education"], 2))
    print("R-squared:", round(model.rsquared, 3))

Python walkthrough
- Step 1: pandas reads the CSV into a DataFrame, where columns behave like named variables.
- Step 2: y stores the dependent variable and X stores the explanatory variable plus a constant for the intercept.
- Step 3: sm.OLS(y, X).fit() estimates the line that minimizes the sum of squared residuals.
- Step 4: The printed values are rounded so students can immediately practice interpretation.
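To make "minimizes the sum of squared residuals" concrete, the sketch below computes a simple regression by hand in plain Python, using the textbook closed-form formulas for one explanatory variable. The five data points are invented for illustration, and this is not how statsmodels computes its estimates internally.

```python
# Hypothetical data, invented for illustration only
education = [10, 12, 14, 16, 18]
wage = [10.0, 13.0, 15.0, 18.0, 21.0]

n = len(education)
x_bar = sum(education) / n
y_bar = sum(wage) / n

# Closed-form OLS for one regressor:
# slope = sum of cross-deviations / sum of squared x-deviations
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, wage))
sxx = sum((x - x_bar) ** 2 for x in education)
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def ssr(b0, b1):
    """Sum of squared residuals for the line wage = b0 + b1 * education."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(education, wage))


# R-squared: share of the variation in wage explained by the fitted line
sst = sum((y - y_bar) ** 2 for y in wage)
r_squared = 1 - ssr(intercept, slope) / sst

print("Intercept:", round(intercept, 2))
print("Education slope:", round(slope, 2))
print("R-squared:", round(r_squared, 3))

# Any other line gives a larger sum of squared residuals
print(ssr(intercept, slope) < ssr(intercept, slope + 0.1))
```

Nudging the slope or the intercept away from the computed values always increases the sum of squared residuals, which is the sense in which the fitted line is the "best" one.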
Interactive activity
Code prediction
    X = sm.add_constant(X)

What does this line add?
Try it yourself
Change the independent variable from education to experience and rerun the regression.
Common mistakes
Check these before you move on.
Reading a regression coefficient as a causal effect. A coefficient describes a pattern in the data; it supports a causal interpretation only when the assumptions or research design justify one.
Quick quiz
Why does the Python code use sm.add_constant?
What does model.params['education'] return?
Key takeaway
A good Python regression workflow is short, reproducible, and easy to translate into plain language.