From 8b6442a07081352922ee9331becca118274efe83 Mon Sep 17 00:00:00 2001 From: Saran Ahluwalia Date: Fri, 10 Dec 2021 08:06:03 -0500 Subject: [PATCH] draft --- ...da_se_12_09_2021-revised-denominator.ipynb | 686 ++++++++++++++++++ .../ipython/hud_eda_se_12_09_2021.ipynb | 45 +- 2 files changed, 717 insertions(+), 14 deletions(-) create mode 100644 data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021-revised-denominator.ipynb diff --git a/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021-revised-denominator.ipynb b/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021-revised-denominator.ipynb new file mode 100644 index 00000000..3327ed80 --- /dev/null +++ b/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021-revised-denominator.ipynb @@ -0,0 +1,686 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Methodology per Blair Russell\n", + "\n", + "We may want to rethink the denominator of our equation for housing cost burden.\n", + "\n", + "\"Right now it’s all housing units with a cost burden computed. \n", + "\n", + "Alternatively, you could use low-income households (with cost burden computed) as the denominator, which would be a measure of relative cost burden just for low-income households. \n", + "\n", + "Both approaches are appropriate, but they tell a different story. You can imagine an area with few low-income households but a vast majority of them being cost burdened. In your calculation, you’d get a small percentage. \n", + "\n", + "In the alternative approach, it’s a large percentage. Just something to think about. 
It depends on the story you want to tell.\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Packages" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import math\n", + "import numpy as np\n", + "import os\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Table of Contents" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Comparisons\n", + "* [Implementation by Lucas](#lucas)\n", + "* [Implementation by Saran](#saran)\n", + "* [Side-by-side Comparison](#comparison)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The relevant denominator variables - all with line type of subtotal - in table 8 of the CHAS dataset are the following (CHAS data dictionary available [here](https://www.huduser.gov/portal/datasets/cp/CHAS-data-dictionary-14-18.xlsx)):\n", + "\n", + "| Name | Label |\n", + "|---------|-----------------------------------------------------|\n", + "|T8_est3 | Owner occupied less than or equal to 30% of HAMFI |\n", + "|T8_est16 | Owner occupied greater than 30% but less than or equal to 50% of HAMFI |\n", + "|T8_est29 | Owner occupied greater than 50% but less than or equal to 80% of HAMFI |\n", + "|T8_est69 | Renter occupied less than or equal to 30% of HAMFI |\n", + "|T8_est82 | Renter occupied greater than 30% but less than or equal to 50% of HAMFI |\n", + "|T8_est95 | Renter occupied greater than 50% but less than or equal to 80% of HAMFI |\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementation by Lucas " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Read in the data from https://www.huduser.gov/portal/datasets/cp.html\n", + "housing = pd.read_csv(\"Table8.csv\", \n", + " encoding = \"ISO-8859-1\", \n", + " dtype = {'Tract_ID': 
object, 'st': object, 'geoid': object})\n", + "\n", + "\n", + "# Remove Puerto Rico (state FIPS code '72'); this is the only state code dropped here:\n", + "housing.drop(housing.loc[housing['st'] == '72'].index, inplace = True)\n", + "\n", + "# Rename columns\n", + "housing = housing.rename(columns = {'geoid' :'FIPS_tract_id',\n", + " 'st' : 'state'\n", + " })\n", + "\n", + "# Owner occupied numerator fields\n", + "OWNER_OCCUPIED_NUMERATOR_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est7\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est10\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + " \"T8_est20\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est23\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + " \"T8_est33\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est36\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + "]\n", + "\n", + "# These columns hold the counts where cost burden was not computed because of no or negative income.\n", + "OWNER_OCCUPIED_NOT_COMPUTED_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est13\",\n", + " # 
Subtotal\n", + " # Owner occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est26\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est39\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est52\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 80% but less than or equal to 100% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est65\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 100% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + "]\n", + "\n", + "OWNER_OCCUPIED_POPULATION_FIELD = \"T8_est2\"\n", + "# Subtotal\n", + "# Owner occupied\n", + "# All\n", + "# All\n", + "# All\n", + "\n", + "OWNER_OCCUPIED_POPULATION_HAMFI_FIELD = \"T8_est3\"\n", + "# Subtotal\n", + "# Owner occupied\n", + "# less than or equal to 30% of HAMFI\n", + "# All\n", + "# All\n", + "\n", + "# Renter occupied numerator fields\n", + "RENTER_OCCUPIED_NUMERATOR_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est73\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est76\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + " \"T8_est86\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est89\",\n", + " # Subtotal\n", + " # Renter 
occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + " \"T8_est99\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # greater than 30% but less than or equal to 50%\n", + " # All\n", + " \"T8_est102\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # greater than 50%\n", + " # All\n", + "]\n", + "\n", + "# These columns hold the counts where cost burden was not computed because of no or negative income.\n", + "RENTER_OCCUPIED_NOT_COMPUTED_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est79\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est92\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est105\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est118\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 80% but less than or equal to 100% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + " \"T8_est131\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 100% of HAMFI\n", + " # not computed (no/negative income)\n", + " # All\n", + "]\n", + "\n", + "# T8_est68: Subtotal, Renter occupied, All, All, All\n", + "RENTER_OCCUPIED_POPULATION_FIELD = \"T8_est68\"" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Original computation:\n", + "# (\n", + "# # of Owner Occupied Units Meeting Criteria\n", + "# + # 
of Renter Occupied Units Meeting Criteria\n", + "# )\n", + "# divided by\n", + "# (\n", + "# Total # of Owner Occupied Units\n", + "# + Total # of Renter Occupied Units\n", + "# - # of Owner Occupied Units with Cost Burden Not Computed\n", + "# - # of Renter Occupied Units with Cost Burden Not Computed\n", + "# )\n", + "\n", + "housing[\"numerator_pre\"] = housing[\n", + " OWNER_OCCUPIED_NUMERATOR_FIELDS\n", + "].sum(axis=1) + housing[RENTER_OCCUPIED_NUMERATOR_FIELDS].sum(axis=1)\n", + "\n", + "\n", + "\n", + "housing[\"denominator_pre\"] = (\n", + " housing[OWNER_OCCUPIED_POPULATION_FIELD]\n", + " + housing[RENTER_OCCUPIED_POPULATION_FIELD]\n", + " - housing[OWNER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)\n", + " - housing[RENTER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Implementation by Saran " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "OWNER_REVISED_DENOMINATOR_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est3\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # All\n", + " \"T8_est16\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 30% but less than or equal to 50% of HAMFI\n", + " # All\n", + " \"T8_est29\",\n", + " # Subtotal\n", + " # Owner occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # All\n", + " ]\n", + "\n", + "RENTAL_REVISED_DENOMINATOR_FIELDS = [\n", + " # Column Name\n", + " # Line_Type\n", + " # Tenure\n", + " # Household income\n", + " # Cost burden\n", + " # Facilities\n", + " \"T8_est69\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # less than or equal to 30% of HAMFI\n", + " # All\n", + " \"T8_est82\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 30% but less than or equal to 50% of 
HAMFI\n", + " # All\n", + " \"T8_est95\",\n", + " # Subtotal\n", + " # Renter occupied\n", + " # greater than 50% but less than or equal to 80% of HAMFI\n", + " # All\n", + " ]\n", + "\n", + "# New computation:\n", + "# (\n", + "# # of Owner Occupied Units Meeting Criteria\n", + "# + # of Renter Occupied Units Meeting Criteria\n", + "# )\n", + "# divided by\n", + "# (\n", + "# Total # of Owner Occupied Units that are low income (at or below 80% of HAMFI)\n", + "# + Total # of Renter Occupied Units that are low income (at or below 80% of HAMFI)\n", + "# - # of Owner Occupied Units with Cost Burden Not Computed\n", + "# - # of Renter Occupied Units with Cost Burden Not Computed\n", + "# )\n", + "\n", + "# NOTE: as written, the expression below evaluates to\n", + "# low_income_subtotals - (-owner_not_computed - renter_not_computed),\n", + "# so the not-computed counts are added, not subtracted as described above.\n", + "# The not-computed lists also include the income bands above 80% of HAMFI.\n", + "housing['denominator_post'] = housing[\n", + " RENTAL_REVISED_DENOMINATOR_FIELDS\n", + "].sum(axis = 1) + housing[OWNER_REVISED_DENOMINATOR_FIELDS].sum(axis = 1) - (\n", + " - housing[OWNER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)\n", + " - housing[RENTER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " numerator_pre denominator_pre denominator_post\n", + "0 174 765 295\n", + "1 177 720 350\n", + "2 279 1291 559\n", + "3 274 1635 525\n", + "4 885 4135 1090" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing.iloc[:, -3:].head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Concatenate GeoIDs with Derived Columns" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "housing_df = pd.concat([housing.iloc[:, 2: 7], housing.iloc[:, -3:]], axis = 1)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "housing_df[\"ratio_pre\"] = np.round(\n", + " housing_df['numerator_pre'] / housing_df['denominator_pre'], 2)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "housing_df[\"ratio_post\"] = np.round(\n", + " housing_df['numerator_pre'] / housing_df['denominator_post'], 2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Comparison " + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " FIPS_tract_id name state cnty \\\n", + "0 14000US01001020100 Census Tract 201, Autauga County, Alabama 01 1 \n", + "1 14000US01001020200 Census Tract 202, Autauga County, Alabama 01 1 \n", + "2 14000US01001020300 Census Tract 203, Autauga County, Alabama 01 1 \n", + "3 14000US01001020400 Census Tract 204, Autauga County, Alabama 01 1 \n", + "4 14000US01001020500 Census Tract 205, Autauga County, Alabama 01 1 \n", + "\n", + " tract numerator_pre denominator_pre denominator_post ratio_pre \\\n", + "0 20100 174 765 295 0.23 \n", + "1 20200 177 720 350 0.25 \n", + "2 20300 279 1291 559 0.22 \n", + "3 20400 274 1635 525 0.17 \n", + "4 20500 885 4135 1090 0.21 \n", + "\n", + " ratio_post \n", + "0 0.59 \n", + "1 0.51 \n", + "2 0.50 \n", + "3 0.52 \n", + "4 0.81 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_df.head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021.ipynb b/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021.ipynb index e30228ee..648f8bc5 100644 --- a/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021.ipynb +++ b/data/data-pipeline/data_pipeline/ipython/hud_eda_se_12_09_2021.ipynb @@ -4,17 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Methodology per Blair Russell\n", - "\n", - "We may want to rethink the denominator of our equation for housing cost burden.\n", - "\n", - "\"Right now it’s all housing units with a cost burden computed. 
\n", - "\n", - "Alternatively, you could use low-income households (with cost burden computed) as the denominator, which would be a measure of relative cost burden just for low-income households. \n", - "\n", - "Both approaches are appropriate, but they tell a different story. You can imagine an area with few low-income households but a vast majority of them being cost burdened. In your calculation, you’d get a small percentage. \n", - "\n", - "In the alternative approach, it’s a large percentage. Just something to think about. It depends on the story you want to tell.\"\n" + "## Methodology to address fundamental problem 1 itemized in Issue 1024" ] }, { @@ -36,7 +26,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -61,11 +51,11 @@ "\n", "This data is sourced from the 2014-2018 Comprehensive Housing Affordability Strategy dataset from the Department of Housing and Urban Development (HUD) using the census tract geographic summary level, and contains cost burdens for households by percent HUD-adjusted median family income (HAMFI) category. This data can be found [here](https://www.huduser.gov/portal/datasets/cp.html). \n", "\n", - "Because CHAS data is based on American Communities Survey (ACS) estimates, which come from a sample of the population, they may be unreliable if based on a small sample or population size. T\n", + "Because CHAS data is based on American Communities Survey (ACS) estimates, which come from a sample of the population, they may be unreliable if based on a small sample or population size.\n", "\n", "The standard error and relative standard error were used to evaluate the reliability of each estimate using CalEnviroScreen’s methodology. 
\n", "\n", - "Census tract estimates that met either of the following criteria were considered reliable and included in the analysis(CalEnviroScreen, 2017):\n", + "Census tract estimates that met either of the following criteria were considered reliable and included in the analysis [(CalEnviroScreen, 2017, page 129)](https://oehha.ca.gov/media/downloads/calenviroscreen/report/ces3report.pdf ):\n", "\n", "- Relative standard error less than 50 (meaning the standard error was less than half of the estimate), OR \n", "- Standard error less than the mean standard error of all census tract estimates \n", @@ -302,6 +292,33 @@ "source": [ "housingburden.head()" ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(73056, 4)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housingburden.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": {