{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Simple Regression in Python\n",
        "\n",
        "Estimate a simple regression of wage on education. This notebook runs in the browser and can also be downloaded as a regular `.ipynb` file. If `statsmodels` is available, the notebook uses it. If not, it uses the same OLS formulas directly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "df = pd.read_csv(\"wage_sample.csv\")\n",
        "print(df.head())\n",
        "print(\"Rows:\", len(df))"
      ]
    },
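    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before fitting, it can help to confirm that the two columns used in the later cells are numeric and complete. The cell below is a minimal sketch, assuming `wage_sample.csv` contains the `education` and `wage` columns that the rest of the notebook references."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Columns assumed from the later cells of this notebook\n",
        "cols = [\"education\", \"wage\"]\n",
        "\n",
        "# Summary statistics for a quick sanity check of ranges and units\n",
        "print(df[cols].describe())\n",
        "\n",
        "# Count missing values in each column\n",
        "print(\"Missing values per column:\")\n",
        "print(df[cols].isna().sum())"
      ]
    },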
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "plt.scatter(df[\"education\"], df[\"wage\"])\n",
        "plt.xlabel(\"Years of education\")\n",
        "plt.ylabel(\"Hourly wage\")\n",
        "plt.title(\"Wage and education\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "x = df[\"education\"]\n",
        "y = df[\"wage\"]\n",
        "\n",
        "try:\n",
        "    import statsmodels.api as sm\n",
        "    X = sm.add_constant(x)\n",
        "    model = sm.OLS(y, X).fit()\n",
        "    intercept = model.params[\"const\"]\n",
        "    slope = model.params[\"education\"]\n",
        "    r_squared = model.rsquared\n",
        "    print(model.summary())\n",
        "except Exception as error:\n",
        "    print(\"statsmodels is not available in this browser runtime.\")\n",
        "    print(\"Using the same OLS formulas directly instead.\")\n",
        "    x_bar = x.mean()\n",
        "    y_bar = y.mean()\n",
        "    slope = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()\n",
        "    intercept = y_bar - slope * x_bar\n",
        "    fitted = intercept + slope * x\n",
        "    residuals = y - fitted\n",
        "    ssr = (residuals ** 2).sum()\n",
        "    sst = ((y - y_bar) ** 2).sum()\n",
        "    r_squared = 1 - ssr / sst\n",
        "\n",
        "print(\"Intercept:\", round(intercept, 2))\n",
        "print(\"Education slope:\", round(slope, 2))\n",
        "print(\"R-squared:\", round(r_squared, 3))"
      ]
    },
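    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The fallback branch above relies on the standard closed-form OLS estimates for a single regressor. As a sketch in conventional textbook notation (the symbols $\\hat{\\beta}_0$, $\\hat{\\beta}_1$, $\\bar{x}$, and $\\bar{y}$ are notation assumed here, standing for `intercept`, `slope`, `x_bar`, and `y_bar` in the code):\n",
        "\n",
        "$$\\hat{\\beta}_1 = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_i (x_i - \\bar{x})^2}, \\qquad \\hat{\\beta}_0 = \\bar{y} - \\hat{\\beta}_1 \\bar{x}$$\n",
        "\n",
        "$$R^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y}_i)^2}{\\sum_i (y_i - \\bar{y})^2}, \\qquad \\hat{y}_i = \\hat{\\beta}_0 + \\hat{\\beta}_1 x_i$$\n",
        "\n",
        "The numerator and denominator of the $R^2$ fraction correspond to `ssr` and `sst` in the code."
      ]
    },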
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "education_grid = np.linspace(df[\"education\"].min(), df[\"education\"].max(), 50)\n",
        "predicted_wage = intercept + slope * education_grid\n",
        "\n",
        "plt.scatter(df[\"education\"], df[\"wage\"], label=\"Actual workers\")\n",
        "plt.plot(education_grid, predicted_wage, label=\"Fitted regression line\")\n",
        "plt.xlabel(\"Years of education\")\n",
        "plt.ylabel(\"Hourly wage\")\n",
        "plt.title(\"Simple regression fitted line\")\n",
        "plt.legend()\n",
        "plt.show()"
      ]
    },
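    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The fitted line can also be evaluated at individual education levels. The cell below is a minimal sketch that assumes `intercept` and `slope` were set by the regression cell above; the 12 and 13 years used here are illustrative values chosen for this example, not taken from the data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustrative education levels (assumed for this example)\n",
        "years_low = 12\n",
        "years_high = 13\n",
        "\n",
        "# Evaluate the fitted line at each level\n",
        "wage_low = intercept + slope * years_low\n",
        "wage_high = intercept + slope * years_high\n",
        "\n",
        "print(\"Predicted wage at\", years_low, \"years:\", round(wage_low, 2))\n",
        "print(\"Predicted wage at\", years_high, \"years:\", round(wage_high, 2))\n",
        "\n",
        "# For a one-year gap, the difference in predictions equals the slope by construction\n",
        "print(\"Difference:\", round(wage_high - wage_low, 2))"
      ]
    },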
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Interpretation practice\n",
        "\n",
        "In this teaching sample, one more year of education is associated with about the estimated slope dollars higher predicted hourly wage. This is an association from a simple model, not automatic proof of causation."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
