Dataset library

Practice datasets

Download classroom-sized data, results templates, and interpretation templates for hands-on practice.

wage_sample.csv

Wage sample

A small teaching dataset for practicing descriptive statistics, scatter plots, and early regression intuition.

CSV, Module 2, Regression, Python-ready

Type

Cross-sectional

Related lesson

Regression in Python

Example regression

wage on education

Source

Local CSV

Practice exercise

Estimate wage on education, plot the fitted line, then interpret the slope and R-squared.

Data note

Loaded directly by Module 2 charts and Python examples.

Python loading example

import pandas as pd

data = pd.read_csv("/data/wage_sample.csv")
print(data[["wage", "education"]].head())

Variables

wage (numeric, dollars per hour): Hourly wage in dollars.
education (numeric, years): Years of completed education.
experience (numeric, years): Years of labor market experience.
female (binary): Indicator equal to 1 for female workers.
married (binary): Indicator equal to 1 for married workers.

Preview rows

wage	education	experience	female	married
18.5	12	3	0	0
24.2	16	6	1	1
31.8	18	10	0	1
21.1	14	4	1	0
28.4	16	9	0	1
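
As a rough sketch of the practice exercise, the slope, intercept, and R-squared can be computed by hand with NumPy from the five preview rows above. It uses only those preview rows, not the full file, so the numbers are illustrative and will differ from full-sample estimates. The names `slope`, `intercept`, and `r_squared` are chosen here for illustration.

```python
import numpy as np

# The five preview rows shown above (education in years, wage in dollars per hour).
education = np.array([12, 16, 18, 14, 16], dtype=float)
wage = np.array([18.5, 24.2, 31.8, 21.1, 28.4])

# Simple OLS: slope = sum of cross-deviations / sum of squared x-deviations.
x_dev = education - education.mean()
y_dev = wage - wage.mean()
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept = wage.mean() - slope * education.mean()

# R-squared = 1 - (sum of squared residuals) / (total sum of squares).
fitted = intercept + slope * education
residuals = wage - fitted
r_squared = 1 - (residuals ** 2).sum() / (y_dev ** 2).sum()

print(f"slope: {slope:.3f}")
print(f"intercept: {intercept:.3f}")
print(f"R-squared: {r_squared:.3f}")
```

The slope is read as the estimated change in hourly wage, in dollars, associated with one more year of education in these rows. R-squared is the share of the sample variation in wage that the fitted line accounts for.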

WAGE1.DTA

WAGE1

Worker wage data for practicing the classic education-wage simple regression and basic coefficient interpretation.

DTA, Module 2, Wages, Education

Type

Cross-sectional

Related lesson

What does beta1 mean?

Example regression

wage on educ

Source

Google Drive Stata file

Practice exercise

Estimate how hourly wage changes with one more year of education, then compare the result with the local teaching sample.

Data note

Use this Drive file for the larger wage dataset; the local CSV is the classroom-sized starter version.

Python loading example

import pandas as pd

data = pd.read_stata("WAGE1.DTA")
print(data[["wage", "educ"]].head())

Variables

wage (numeric, dollars per hour): Hourly wage.
educ (numeric, years): Years of education.
exper (numeric, years): Labor market experience.
tenure (numeric, years): Years with current employer.
female (binary): Indicator for female workers.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.
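
The conversion mentioned here might look like the following minimal sketch. The helper name `stata_to_csv` and the output name `wage1.csv` are assumptions made for illustration, and the example only runs the conversion if `WAGE1.DTA` has already been downloaded to the working directory.

```python
from pathlib import Path

import pandas as pd


def stata_to_csv(dta_path, csv_path):
    """Read a Stata .dta file, write it out as CSV, and return the DataFrame."""
    data = pd.read_stata(dta_path)
    # index=False keeps the pandas row index out of the CSV file.
    data.to_csv(csv_path, index=False)
    return data


# Assumed local file names; nothing happens if the download is not present.
if Path("WAGE1.DTA").exists():
    wage1 = stata_to_csv("WAGE1.DTA", "wage1.csv")
    print(wage1[["wage", "educ"]].head())
```

The same helper should work for the other Stata files in this library by changing the two file names.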

ceosal1.xls

CEOSAL1

CEO compensation data for practicing how executive salary is associated with company return on equity.

XLS, Module 2, Firms, Salary

Type

Cross-sectional

Related lesson

Dependent variable and explanatory variable

Example regression

salary on roe

Source

Google Drive Excel file

Practice exercise

Identify salary as y and return on equity as x, then estimate and interpret the slope.

Data note

This Excel file was found in the course Google Drive dataset folder.

Python loading example

import pandas as pd

# Reading a legacy .xls workbook with pandas requires the xlrd package.
data = pd.read_excel("ceosal1.xls")
print(data[["salary", "roe"]].head())

Variables

salary (numeric, thousands of dollars): CEO salary.
roe (numeric, percent): Return on equity.
sales (numeric, millions of dollars): Firm sales.
profits (numeric, millions of dollars): Firm profits.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Excel file to CSV in Python with pandas before running the regression.

CEOSAL2.DTA

CEO salary

Executive compensation data for studying how salary is associated with firm performance or firm size.

DTA, Module 2, Firms, Salary

Type

Cross-sectional

Related lesson

Interpreting regression output

Example regression

salary on roe or salary on sales

Source

Google Drive Stata file

Practice exercise

Estimate salary on return on equity, then write one cautious sentence about what the slope does and does not prove.

Data note

Use as the CEOSAL practice dataset for Module 2; CEOSAL1 follows the same simple-regression idea.

Python loading example

import pandas as pd

data = pd.read_stata("CEOSAL2.DTA")
print(data[["salary", "roe"]].head())

Variables

salary (numeric, thousands of dollars): CEO salary.
roe (numeric, percent): Return on equity.
sales (numeric, millions of dollars): Firm sales.
profits (numeric, millions of dollars): Firm profits.
ceoten (numeric, years): Years as CEO.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

VOTE1.DTA

Campaign spending and vote share

Election data for connecting a fitted line to campaign spending and candidate vote share.

DTA, Module 2, Politics, Campaigns

Type

Cross-sectional

Related lesson

Dependent variable and explanatory variable

Example regression

voteA on shareA

Source

Google Drive Stata file

Practice exercise

Graph vote share against candidate A's spending share, then interpret the fitted slope in percentage-point terms.
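
One way to approach this exercise is sketched below with NumPy and matplotlib. The helper name `plot_fitted_line` and the output file `vote1_fit.png` are hypothetical, and the plotting step only runs if `VOTE1.DTA` is present in the working directory.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_fitted_line(data, x, y, out_path):
    """Scatter y against x, overlay the OLS line, save the figure.

    Returns (intercept, slope).
    """
    clean = data.dropna(subset=[x, y])
    # np.polyfit returns coefficients from the highest degree down: slope first.
    slope, intercept = np.polyfit(clean[x], clean[y], deg=1)
    grid = np.linspace(clean[x].min(), clean[x].max(), 100)

    fig, ax = plt.subplots()
    ax.scatter(clean[x], clean[y], alpha=0.6)
    ax.plot(grid, intercept + slope * grid, color="black")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    fig.savefig(out_path)
    plt.close(fig)
    return intercept, slope


# Assumed local file name; skipped if the download is not present.
if Path("VOTE1.DTA").exists():
    vote = pd.read_stata("VOTE1.DTA")
    b0, b1 = plot_fitted_line(vote, "shareA", "voteA", "vote1_fit.png")
    print(f"intercept: {b0:.2f}, slope: {b1:.3f}")
```

Because both variables are measured in percent, the slope is read as the estimated change in candidate A's vote share, in percentage points, associated with a one-percentage-point increase in A's share of spending.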

Python loading example

import pandas as pd

data = pd.read_stata("VOTE1.DTA")
print(data[["voteA", "shareA"]].head())

Variables

voteA (numeric, percent): Candidate A's vote share.
shareA (numeric, percent): Candidate A's share of campaign spending.
expendA (numeric, dollars): Campaign spending by candidate A.
expendB (numeric, dollars): Campaign spending by candidate B.
prtystrA (numeric): Party strength for candidate A.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

SLEEP75.DTA

Sleep and work time

Time-use data for examining whether more work time is associated with less sleep.

DTA, Module 2, Time use, Labor

Type

Cross-sectional

Related lesson

Fitted values and residuals

Example regression

sleep on totwrk

Source

Google Drive Stata file

Practice exercise

Estimate sleep on total work minutes and identify one observation with a large residual.
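
A minimal sketch of this exercise follows, assuming the file has been downloaded as `SLEEP75.DTA`. The helper name `add_fitted_and_residuals` and the added column names `fitted` and `residual` are illustrative choices, not part of the dataset.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def add_fitted_and_residuals(data, y, x):
    """Fit y on x by OLS and return a copy with 'fitted' and 'residual' columns."""
    out = data.dropna(subset=[y, x]).copy()
    slope, intercept = np.polyfit(out[x], out[y], deg=1)
    out["fitted"] = intercept + slope * out[x]
    # Residual = actual value minus fitted value.
    out["residual"] = out[y] - out["fitted"]
    return out


# Assumed local file name; skipped if the download is not present.
if Path("SLEEP75.DTA").exists():
    sleep = pd.read_stata("SLEEP75.DTA")
    result = add_fitted_and_residuals(sleep, "sleep", "totwrk")
    # Row label of the observation farthest from the fitted line.
    worst = result["residual"].abs().idxmax()
    print(result.loc[worst, ["sleep", "totwrk", "fitted", "residual"]])
```

A large positive residual marks a respondent who sleeps more than the fitted line predicts for their work time; a large negative residual marks one who sleeps less.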

Python loading example

import pandas as pd

data = pd.read_stata("SLEEP75.DTA")
print(data[["sleep", "totwrk"]].head())

Variables

sleep (numeric, minutes per week): Minutes slept per week.
totwrk (numeric, minutes per week): Total minutes worked per week.
educ (numeric, years): Years of education.
age (numeric, years): Age.
male (binary): Indicator for male respondents.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

BWGHT.DTA

Birth weight

Family and birth outcome data for discussing health examples and careful non-causal interpretation.

DTA, Module 2, Health, Families

Type

Cross-sectional

Related lesson

What is the error term?

Example regression

bwght on cigs

Source

Google Drive Stata file

Practice exercise

Estimate birth weight on cigarettes smoked and list at least two omitted factors that may affect interpretation.

Python loading example

import pandas as pd

data = pd.read_stata("BWGHT.DTA")
print(data[["bwght", "cigs"]].head())

Variables

bwght (numeric, ounces): Infant birth weight.
cigs (numeric, cigarettes per day): Cigarettes smoked by the mother.
faminc (numeric): Family income.
motheduc (numeric, years): Mother's education.
fatheduc (numeric, years): Father's education.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

401K.DTA

401(k) participation

Retirement plan data for studying participation rates and employer match rates.

DTA, Module 2, Retirement, Policy

Type

Cross-sectional

Related lesson

Ordinary Least Squares intuition

Example regression

prate on mrate

Source

Google Drive Stata file

Practice exercise

Regress participation rate on match rate and explain the fitted line in percentage-point language.

Python loading example

import pandas as pd

data = pd.read_stata("401K.DTA")
print(data[["prate", "mrate"]].head())

Variables

prate (numeric, percent): Plan participation rate.
mrate (numeric, match rate): Employer match rate.
totpart (numeric): Total participants.
totelg (numeric): Total eligible employees.
age (numeric, years): Plan age.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

MEAP93.DTA

School spending and math performance

School-level data for practicing log transformations, fitted values, and cautious school-performance interpretation.

DTA, Module 2, Education, Schools

Type

Cross-sectional

Related lesson

R-squared

Example regression

math10 on log(expend)

Source

Google Drive Stata file

Practice exercise

Estimate math pass rate on log spending and compare the R-squared with a scatter plot.
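
The level-log regression in this exercise could be sketched as follows. The helper name `fit_level_log` is hypothetical, and the example assumes the file is available locally as `MEAP93.DTA`.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def fit_level_log(data, y, x):
    """OLS of y on log(x). Returns (intercept, slope, r_squared)."""
    clean = data.dropna(subset=[y, x])
    log_x = np.log(clean[x])
    slope, intercept = np.polyfit(log_x, clean[y], deg=1)
    fitted = intercept + slope * log_x
    ssr = ((clean[y] - fitted) ** 2).sum()           # sum of squared residuals
    sst = ((clean[y] - clean[y].mean()) ** 2).sum()  # total sum of squares
    return intercept, slope, 1 - ssr / sst


# Assumed local file name; skipped if the download is not present.
if Path("MEAP93.DTA").exists():
    meap = pd.read_stata("MEAP93.DTA")
    b0, b1, r2 = fit_level_log(meap, "math10", "expend")
    print(f"slope on log(expend): {b1:.2f}, R-squared: {r2:.3f}")
    # Level-log reading: a 1% rise in spending goes with about b1/100 points.
    print(f"approx. change in math10 per 1% more spending: {b1 / 100:.3f}")
```

In a level-log model the slope divided by 100 approximates the change in `math10`, in percentage points, associated with a 1% increase in spending per pupil. A low R-squared should match a scatter plot in which the points spread widely around the fitted line.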

Python loading example

import pandas as pd
import numpy as np

data = pd.read_stata("MEAP93.DTA")
# Natural log of per-pupil spending, the regressor in math10 on log(expend).
data["log_expend"] = np.log(data["expend"])
print(data[["math10", "log_expend"]].head())

Variables

math10 (numeric, percent): Percentage passing the math test.
expend (numeric, dollars per student): Expenditure per pupil.
lnchprg (numeric, percent): Lunch program percentage.
enroll (numeric, students): School enrollment.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

WAGE2.DTA

WAGE2

A larger wage dataset for extending the wage example after students understand the simple education slope.

DTA, Module 2, Wages, Extension

Type

Cross-sectional

Related lesson

Practice regression project

Example regression

wage on educ

Source

Google Drive Stata file

Practice exercise

Repeat the education-wage regression with WAGE2 and compare the slope to the WAGE1 or local sample result.

Python loading example

import pandas as pd

data = pd.read_stata("WAGE2.DTA")
print(data[["wage", "educ"]].head())

Variables

wage (numeric): Wage measure in the dataset.
educ (numeric, years): Years of education.
exper (numeric, years): Experience.
tenure (numeric, years): Job tenure.
IQ (numeric): IQ score measure.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

RDCHEM.DTA

Research and development

Firm data for practicing regression with business investment and sales measures.

DTA, Module 2, Firms, R&D

Type

Cross-sectional

Related lesson

Practice regression project

Example regression

rd on sales

Source

Google Drive Stata file

Practice exercise

Estimate R&D spending on firm sales and explain whether the intercept is meaningful.

Python loading example

import pandas as pd

data = pd.read_stata("RDCHEM.DTA")
print(data[["rd", "sales"]].head())

Variables

rd (numeric, dollars or millions, depending on file): Research and development spending.
sales (numeric, dollars or millions, depending on file): Firm sales.
profits (numeric, dollars or millions, depending on file): Firm profits.
rdintens (numeric): R&D intensity measure where available.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.

charity.dta

Charitable giving

Donation data for a real-world practice project on mailings, response, and gift amounts.

DTA, Module 2, Nonprofit, Applied practice

Type

Cross-sectional

Related lesson

Practice regression project

Example regression

gift on mailsyear

Source

Google Drive Stata file

Practice exercise

Estimate gift amount on mailings per year and write one limitation related to donor selection.

Python loading example

import pandas as pd

data = pd.read_stata("charity.dta")
print(data[["gift", "mailsyear"]].head())

Variables

gift (numeric, dollars): Gift amount.
mailsyear (numeric, mailings per year): Number of mailings per year.
avggift (numeric, dollars): Average previous gift.
propresp (numeric): Response proportion.

Preview

Open the Drive file to inspect the full dataset. Students can convert the Stata file to CSV in Python with pandas before running the regression.
