Notebook investigating NHPD as a source for providing contemporary foreclosure data (#1012)

Co-authored-by: Saran Ahluwalia <sarahluw@cisco.com>
Saran Ahluwalia 2022-01-18 13:08:27 -05:00 committed by GitHub
parent b7f1483bd2
commit a07bf752b0


@ -0,0 +1,648 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem statement and Goal\n",
"NHPD: For all counties, aggregate the National Housing Preservation Database at the census tract level using whatever variables seem most interesting or relevant, and combine it with the “processed” datasets. What does the overall low-income housing stock look like in areas with high eviction rates? Are any of these features statistically related to the incidence of evictions in these counties? Furthermore, are there any insights common to two or more counties, or is the “state” of low-income housing unique to each county?\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data_set = pd.read_excel('nhpd_data.xlsx', engine='openpyxl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### General inferences from the data dictionary\n",
"- The data set 'Active and Inconclusive Properties.xlsx' appears to be the `Data Extract` rather than the `Data Grid`: each row represents one property, and subsidy information sits alongside the property information instead of being expanded from the property record.\n",
"- Each property maps to exactly one address.\n",
"- Most phased properties are already recorded separately, one per address location.\n",
"- In the rare cases where phases are combined into one record, we might need to split them by address location. This is probably unnecessary: funding is tracked at the property level, so unless we need specifics that sub-categorize the property, there is little use in maintaining these records as separate entities.\n",
"- The keywords 'Development', 'Project' and 'Property' are used interchangeably to mean a 'cluster of buildings' tracked by the same identification number.\n",
"    - If we wish to run NLP algorithms on the descriptions, we may need to replace these words with 'Property'. Otherwise, their vectorized versions may not be close to each other.\n",
"    - Even embedding models such as Word2Vec may not place these vectors close to one another, because 'Development', 'Project' and 'Property' have different meanings in general usage."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>NHPDPropertyID</th>\n",
" <th>PropertyName</th>\n",
" <th>PropertyAddress</th>\n",
" <th>City</th>\n",
" <th>State</th>\n",
" <th>Zip</th>\n",
" <th>CBSACode</th>\n",
" <th>CBSAType</th>\n",
" <th>County</th>\n",
" <th>CountyCode</th>\n",
" <th>...</th>\n",
" <th>NumberActiveMR</th>\n",
" <th>NumberInconclusiveMR</th>\n",
" <th>NumberInactiveMR</th>\n",
" <th>Mr_1_Status</th>\n",
" <th>Mr_1_ProgramName</th>\n",
" <th>Mr_1_AssistedUnits</th>\n",
" <th>Mr_2_Status</th>\n",
" <th>Mr_2_ProgramName</th>\n",
" <th>Mr_2_AssistedUnits</th>\n",
" <th>OldNHPDPropertyID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1000000</td>\n",
" <td>IVY ESTATES</td>\n",
" <td>6729 Zeigler Blvd</td>\n",
" <td>Mobile</td>\n",
" <td>AL</td>\n",
" <td>36608-4253</td>\n",
" <td>33660.0</td>\n",
" <td>Metropolitan Statistical Area</td>\n",
" <td>Mobile</td>\n",
" <td>1097.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1000001</td>\n",
" <td>RENDU TERRACE WEST</td>\n",
" <td>7400 Old Shell Rd</td>\n",
" <td>Mobile</td>\n",
" <td>AL</td>\n",
" <td>36608-4549</td>\n",
" <td>33660.0</td>\n",
" <td>Metropolitan Statistical Area</td>\n",
" <td>Mobile</td>\n",
" <td>1097.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1000002</td>\n",
" <td>TWB RESIDENTIAL OPPORTUNITIES II</td>\n",
" <td>93 Canal Rd</td>\n",
" <td>Port Jefferson Station</td>\n",
" <td>NY</td>\n",
" <td>11776-3024</td>\n",
" <td>35620.0</td>\n",
" <td>Metropolitan</td>\n",
" <td>Suffolk</td>\n",
" <td>36103.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1000003</td>\n",
" <td>THE DAISY HOUSE</td>\n",
" <td>615 Clarissa St</td>\n",
" <td>Rochester</td>\n",
" <td>NY</td>\n",
" <td>14608-2485</td>\n",
" <td>40380.0</td>\n",
" <td>Metropolitan</td>\n",
" <td>Monroe</td>\n",
" <td>36055.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1000004</td>\n",
" <td>MAIN AVENUE APARTMENTS</td>\n",
" <td>105 E Walnut St</td>\n",
" <td>Sylacauga</td>\n",
" <td>AL</td>\n",
" <td>35150-3012</td>\n",
" <td>45180.0</td>\n",
" <td>Micropolitan Statistical Area</td>\n",
" <td>Talladega</td>\n",
" <td>1121.0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 252 columns</p>\n",
"</div>"
],
"text/plain": [
" NHPDPropertyID PropertyName PropertyAddress \\\n",
"0 1000000 IVY ESTATES 6729 Zeigler Blvd \n",
"1 1000001 RENDU TERRACE WEST 7400 Old Shell Rd \n",
"2 1000002 TWB RESIDENTIAL OPPORTUNITIES II 93 Canal Rd \n",
"3 1000003 THE DAISY HOUSE 615 Clarissa St \n",
"4 1000004 MAIN AVENUE APARTMENTS 105 E Walnut St \n",
"\n",
" City State Zip CBSACode \\\n",
"0 Mobile AL 36608-4253 33660.0 \n",
"1 Mobile AL 36608-4549 33660.0 \n",
"2 Port Jefferson Station NY 11776-3024 35620.0 \n",
"3 Rochester NY 14608-2485 40380.0 \n",
"4 Sylacauga AL 35150-3012 45180.0 \n",
"\n",
" CBSAType County CountyCode ... NumberActiveMR \\\n",
"0 Metropolitan Statistical Area Mobile 1097.0 ... 0 \n",
"1 Metropolitan Statistical Area Mobile 1097.0 ... 0 \n",
"2 Metropolitan Suffolk 36103.0 ... 0 \n",
"3 Metropolitan Monroe 36055.0 ... 0 \n",
"4 Micropolitan Statistical Area Talladega 1121.0 ... 0 \n",
"\n",
" NumberInconclusiveMR NumberInactiveMR Mr_1_Status Mr_1_ProgramName \\\n",
"0 0 0 NaN NaN \n",
"1 0 0 NaN NaN \n",
"2 0 0 NaN NaN \n",
"3 0 0 NaN NaN \n",
"4 0 0 NaN NaN \n",
"\n",
" Mr_1_AssistedUnits Mr_2_Status Mr_2_ProgramName Mr_2_AssistedUnits \\\n",
"0 NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN \n",
"4 NaN NaN NaN NaN \n",
"\n",
" OldNHPDPropertyID \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"\n",
"[5 rows x 252 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shape = data_set.shape\n",
"data_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"No. of data points: 82287\n",
"No. of features: 252\n",
"Different property statuses: ['Active' 'Inconclusive']\n"
]
}
],
"source": [
"# `shape` is the (rows, columns) tuple captured in the previous cell.\n",
"print(\"No. of data points:\", shape[0])\n",
"print(\"No. of features:\", shape[1])\n",
"print(\"Different property statuses:\", data_set[\"PropertyStatus\"].unique())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>NHPDPropertyID</th>\n",
" <th>CBSACode</th>\n",
" <th>CountyCode</th>\n",
" <th>CensusTract</th>\n",
" <th>Latitude</th>\n",
" <th>Longitude</th>\n",
" <th>ActiveSubsidies</th>\n",
" <th>TotalInconclusiveSubsidies</th>\n",
" <th>TotalInactiveSubsidies</th>\n",
" <th>TotalUnits</th>\n",
" <th>...</th>\n",
" <th>NumberInconclusivePBV</th>\n",
" <th>NumberInactivePBV</th>\n",
" <th>Pbv_1_AssistedUnits</th>\n",
" <th>Pbv_2_AssistedUnits</th>\n",
" <th>NumberActiveMR</th>\n",
" <th>NumberInconclusiveMR</th>\n",
" <th>NumberInactiveMR</th>\n",
" <th>Mr_1_AssistedUnits</th>\n",
" <th>Mr_2_AssistedUnits</th>\n",
" <th>OldNHPDPropertyID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>8.228700e+04</td>\n",
" <td>72919.000000</td>\n",
" <td>82229.000000</td>\n",
" <td>8.222400e+04</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>...</td>\n",
" <td>82287.0</td>\n",
" <td>82287.0</td>\n",
" <td>2784.000000</td>\n",
" <td>173.000000</td>\n",
" <td>82287.000000</td>\n",
" <td>82287.0</td>\n",
" <td>82287.0</td>\n",
" <td>500.000000</td>\n",
" <td>9.000000</td>\n",
" <td>58370.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1.074656e+06</td>\n",
" <td>30447.401281</td>\n",
" <td>28953.566759</td>\n",
" <td>2.895217e+10</td>\n",
" <td>38.483402</td>\n",
" <td>-90.228069</td>\n",
" <td>1.386355</td>\n",
" <td>0.077485</td>\n",
" <td>0.369098</td>\n",
" <td>66.711145</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>43.463003</td>\n",
" <td>35.011561</td>\n",
" <td>0.006234</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>34.208000</td>\n",
" <td>26.333333</td>\n",
" <td>52459.141083</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>4.021620e+04</td>\n",
" <td>11096.935896</td>\n",
" <td>15256.967835</td>\n",
" <td>1.525466e+10</td>\n",
" <td>4.975471</td>\n",
" <td>15.636717</td>\n",
" <td>0.895868</td>\n",
" <td>0.287468</td>\n",
" <td>0.734841</td>\n",
" <td>96.200003</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>41.703263</td>\n",
" <td>29.062280</td>\n",
" <td>0.081741</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>25.840049</td>\n",
" <td>19.685020</td>\n",
" <td>33880.224028</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000e+06</td>\n",
" <td>10100.000000</td>\n",
" <td>1001.000000</td>\n",
" <td>1.001020e+09</td>\n",
" <td>13.495030</td>\n",
" <td>-166.722478</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>11.000000</td>\n",
" <td>11.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>11.000000</td>\n",
" <td>12.000000</td>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1.039279e+06</td>\n",
" <td>19740.000000</td>\n",
" <td>17053.000000</td>\n",
" <td>1.705396e+10</td>\n",
" <td>34.983064</td>\n",
" <td>-96.380764</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>18.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>17.000000</td>\n",
" <td>14.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>16.000000</td>\n",
" <td>13.000000</td>\n",
" <td>24226.250000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1.073499e+06</td>\n",
" <td>32580.000000</td>\n",
" <td>29095.000000</td>\n",
" <td>2.909501e+10</td>\n",
" <td>39.312214</td>\n",
" <td>-86.490946</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>40.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>29.000000</td>\n",
" <td>24.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>25.000000</td>\n",
" <td>19.000000</td>\n",
" <td>49850.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.108144e+06</td>\n",
" <td>39300.000000</td>\n",
" <td>41015.000000</td>\n",
" <td>4.101395e+10</td>\n",
" <td>41.799999</td>\n",
" <td>-79.052250</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>82.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>53.000000</td>\n",
" <td>46.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>43.000000</td>\n",
" <td>22.000000</td>\n",
" <td>78774.750000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.163400e+06</td>\n",
" <td>99999.000000</td>\n",
" <td>69120.000000</td>\n",
" <td>5.604595e+10</td>\n",
" <td>65.160556</td>\n",
" <td>145.751129</td>\n",
" <td>106.000000</td>\n",
" <td>13.000000</td>\n",
" <td>24.000000</td>\n",
" <td>5881.000000</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>449.000000</td>\n",
" <td>191.000000</td>\n",
" <td>5.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>187.000000</td>\n",
" <td>62.000000</td>\n",
" <td>127185.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 114 columns</p>\n",
"</div>"
],
"text/plain": [
" NHPDPropertyID CBSACode CountyCode CensusTract Latitude \\\n",
"count 8.228700e+04 72919.000000 82229.000000 8.222400e+04 82287.000000 \n",
"mean 1.074656e+06 30447.401281 28953.566759 2.895217e+10 38.483402 \n",
"std 4.021620e+04 11096.935896 15256.967835 1.525466e+10 4.975471 \n",
"min 1.000000e+06 10100.000000 1001.000000 1.001020e+09 13.495030 \n",
"25% 1.039279e+06 19740.000000 17053.000000 1.705396e+10 34.983064 \n",
"50% 1.073499e+06 32580.000000 29095.000000 2.909501e+10 39.312214 \n",
"75% 1.108144e+06 39300.000000 41015.000000 4.101395e+10 41.799999 \n",
"max 1.163400e+06 99999.000000 69120.000000 5.604595e+10 65.160556 \n",
"\n",
" Longitude ActiveSubsidies TotalInconclusiveSubsidies \\\n",
"count 82287.000000 82287.000000 82287.000000 \n",
"mean -90.228069 1.386355 0.077485 \n",
"std 15.636717 0.895868 0.287468 \n",
"min -166.722478 0.000000 0.000000 \n",
"25% -96.380764 1.000000 0.000000 \n",
"50% -86.490946 1.000000 0.000000 \n",
"75% -79.052250 2.000000 0.000000 \n",
"max 145.751129 106.000000 13.000000 \n",
"\n",
" TotalInactiveSubsidies TotalUnits ... NumberInconclusivePBV \\\n",
"count 82287.000000 82287.000000 ... 82287.0 \n",
"mean 0.369098 66.711145 ... 0.0 \n",
"std 0.734841 96.200003 ... 0.0 \n",
"min 0.000000 1.000000 ... 0.0 \n",
"25% 0.000000 18.000000 ... 0.0 \n",
"50% 0.000000 40.000000 ... 0.0 \n",
"75% 1.000000 82.000000 ... 0.0 \n",
"max 24.000000 5881.000000 ... 0.0 \n",
"\n",
" NumberInactivePBV Pbv_1_AssistedUnits Pbv_2_AssistedUnits \\\n",
"count 82287.0 2784.000000 173.000000 \n",
"mean 0.0 43.463003 35.011561 \n",
"std 0.0 41.703263 29.062280 \n",
"min 0.0 11.000000 11.000000 \n",
"25% 0.0 17.000000 14.000000 \n",
"50% 0.0 29.000000 24.000000 \n",
"75% 0.0 53.000000 46.000000 \n",
"max 0.0 449.000000 191.000000 \n",
"\n",
" NumberActiveMR NumberInconclusiveMR NumberInactiveMR \\\n",
"count 82287.000000 82287.0 82287.0 \n",
"mean 0.006234 0.0 0.0 \n",
"std 0.081741 0.0 0.0 \n",
"min 0.000000 0.0 0.0 \n",
"25% 0.000000 0.0 0.0 \n",
"50% 0.000000 0.0 0.0 \n",
"75% 0.000000 0.0 0.0 \n",
"max 5.000000 0.0 0.0 \n",
"\n",
" Mr_1_AssistedUnits Mr_2_AssistedUnits OldNHPDPropertyID \n",
"count 500.000000 9.000000 58370.000000 \n",
"mean 34.208000 26.333333 52459.141083 \n",
"std 25.840049 19.685020 33880.224028 \n",
"min 11.000000 12.000000 4.000000 \n",
"25% 16.000000 13.000000 24226.250000 \n",
"50% 25.000000 19.000000 49850.500000 \n",
"75% 43.000000 22.000000 78774.750000 \n",
"max 187.000000 62.000000 127185.000000 \n",
"\n",
"[8 rows x 114 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_set.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Status [WIP]\n",
"- Find out how housing and subsidies work in the US.\n",
"- Map each census tract to its geographical boundaries.\n",
"- List all of the different subsidies.\n",
"- Compute metrics per subsidy and compare them using statistical plots."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
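
The notebook's problem statement asks for NHPD to be aggregated at the census tract level, but the committed cells stop at loading and describing the data. The following is a minimal sketch of what that aggregation step could look like, not the author's implementation. It assumes the columns `NHPDPropertyID`, `CensusTract`, `TotalUnits` and `ActiveSubsidies` seen in the outputs above; the helper name `aggregate_by_tract`, the output column names, and the sample rows are hypothetical. Because `CensusTract` is read as a float (its minimum prints as `1.001020e+09`), the sketch also assumes the value is an 11-digit tract GEOID whose leading zero was lost, and restores it as a zero-padded string.

```python
import pandas as pd


def aggregate_by_tract(properties: pd.DataFrame) -> pd.DataFrame:
    """Summarise property-level NHPD rows per census tract (illustrative helper)."""
    # Rows without a tract cannot be assigned to any geography, so drop them.
    df = properties.dropna(subset=["CensusTract"]).copy()
    # Assumption: CensusTract is an 11-digit GEOID stored as a float, which
    # loses the leading zero for some states. Restore it as a string.
    df["tract_geoid"] = (
        df["CensusTract"].astype("int64").astype(str).str.zfill(11)
    )
    return (
        df.groupby("tract_geoid")
        .agg(
            n_properties=("NHPDPropertyID", "count"),
            total_units=("TotalUnits", "sum"),
            active_subsidies=("ActiveSubsidies", "sum"),
        )
        .reset_index()
    )


# Made-up sample rows, only to exercise the helper.
sample = pd.DataFrame(
    {
        "NHPDPropertyID": [1, 2, 3, 4],
        "CensusTract": [1097003605.0, 1097003605.0, 36103158607.0, None],
        "TotalUnits": [40, 18, 82, 10],
        "ActiveSubsidies": [1, 2, 1, 0],
    }
)
tract_summary = aggregate_by_tract(sample)
print(tract_summary)
```

Keeping the tract identifier as a string rather than a number should make later joins against tract-level eviction data less error-prone, since such identifiers are usually distributed as zero-padded text.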