Mirror of https://github.com/DOI-DO/j40-cejst-2.git (synced 2025-10-03 01:43:18 -07:00)
Data folder restructuring in preparation for 361 (#376)
* initial checkin
* gitignore and docker-compose update
* readme update and error on hud
* encoding issue
* one more small README change
* data roadmap re-structure
* pyproject sort
* small update to score output folders
* checkpoint
* couple of last fixes
This commit is contained in:
parent 3032a8305d
commit 543d147e61
66 changed files with 130 additions and 108 deletions
33  data/data-pipeline/Dockerfile  Normal file
@@ -0,0 +1,33 @@
FROM ubuntu:20.04

# Install packages
RUN apt-get update && apt-get install -y \
    build-essential \
    make \
    gcc \
    git \
    unzip \
    wget \
    python3-dev \
    python3-pip

# tippecanoe
ENV TZ=America/Los_Angeles
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get install -y software-properties-common libsqlite3-dev zlib1g-dev
RUN apt-add-repository -y ppa:git-core/ppa
RUN mkdir -p /tmp/tippecanoe-src && git clone https://github.com/mapbox/tippecanoe.git /tmp/tippecanoe-src
WORKDIR /tmp/tippecanoe-src
RUN /bin/sh -c make && make install

## gdal
RUN add-apt-repository ppa:ubuntugis/ppa
RUN apt-get -y install gdal-bin

# Prepare python packages
WORKDIR /data-pipeline
RUN pip3 install --upgrade pip setuptools wheel
COPY . .

COPY requirements.txt .
RUN pip3 install -r requirements.txt
163  data/data-pipeline/README.md  Normal file
@@ -0,0 +1,163 @@
# Justice 40 Score application

<details open="open">
<summary>Table of Contents</summary>

<!-- TOC -->

- [About this application](#about-this-application)
- [Score comparison workflow](#score-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
- [Step 1: Run the ETL script for each data source](#step-1-run-the-etl-script-for-each-data-source)
- [Step 2: Calculate the Justice40 score experiments](#step-2-calculate-the-justice40-score-experiments)
- [Step 3: Compare the Justice40 score experiments to other indices](#step-3-compare-the-justice40-score-experiments-to-other-indices)
- [Data Sources](#data-sources)
- [Running using Docker](#running-using-docker)
- [Log visualization](#log-visualization)
- [Local development](#local-development)
- [Downloading Census Block Groups GeoJSON and Generating CBG CSVs](#downloading-census-block-groups-geojson-and-generating-cbg-csvs)
- [Generating mbtiles](#generating-mbtiles)
- [Serve the map locally](#serve-the-map-locally)
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Miscellaneous](#miscellaneous)

<!-- /TOC -->

</details>

## About this application

This application is used to compare experimental versions of the Justice40 score to established environmental justice indices, such as EJSCREEN, CalEnviroScreen, and so on.

_**NOTE:** These scores **do not** represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time._

### Score comparison workflow

The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.

#### Workflow Diagram

TODO add mermaid diagram

#### Step 0: Set up your environment

1. After cloning the project locally, change to this directory: `cd data/data-pipeline`
1. Choose whether you'd like to run this application using Docker or if you'd like to install the dependencies locally so you can contribute to the project.
   - **With Docker:** Follow these [installation instructions](https://docs.docker.com/get-docker/) and skip down to the [Running using Docker section](#running-using-docker) for more information
   - **For Local Development:** Skip down to the [Local Development section](#local-development) for more detailed installation instructions

#### Step 1: Run the ETL script for each data source

1. Call the `etl-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"`
   - With Poetry: `poetry run python application.py etl-run`
1. The `etl-run` command will execute the corresponding ETL script for each data source in `etl/sources/`. For example, `etl/sources/ejscreen/etl.py` is the ETL script for EJSCREEN data.
1. Each ETL script will extract the data from its original source, then format the data into `.csv` files that get stored in the relevant folder in `data/dataset/`. For example, HUD Housing data is stored in `data/dataset/hud_housing/usa.csv`.

_**NOTE:** You have the option to pass the name of a specific data source to the `etl-run` command, which will limit the execution of the ETL process to that specific data source._
_For example: `poetry run python application.py etl-run --dataset ejscreen` would only run the ETL process for EJSCREEN data._

#### Step 2: Calculate the Justice40 score experiments

1. Call the `score-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"`
   - With Poetry: `poetry run python application.py score-run`
1. The `score-run` command will execute the `etl/score/etl_score.py` script, which loads the data from each of the source files added to the `data/dataset/` directory by the ETL scripts in Step 1.
1. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways (see the short pandas sketch below):
   - Their [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
   - They are normalized using [min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling), which adjusts the scale of the data so that the Census Block Group with the highest value for that column is set to 1, the Census Block Group with the lowest value is set to 0, and all of the other values are adjusted to fit within that range based on how close they were to the highest or lowest value.
1. The standardized columns are then used to calculate each of the Justice40 score experiments described in greater detail below, and the results are exported to a `.csv` file in [`data/score/csv`](data/score/csv).
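
To make the two standardizations concrete, here is a minimal pandas sketch of the idea; the column name is a placeholder, and the real ETL applies this to every input indicator:

```python
import pandas as pd

# "some_indicator" stands in for any numeric column in the merged
# Census Block Group dataframe.
df = pd.DataFrame({"some_indicator": [10.0, 20.0, 40.0, 80.0]})

# Percentile rank: the share of Census Block Groups with a lower value.
df["some_indicator (percentile)"] = df["some_indicator"].rank(pct=True)

# Min-max normalization: the lowest value maps to 0 and the highest to 1.
min_value = df["some_indicator"].min(skipna=True)
max_value = df["some_indicator"].max(skipna=True)
df["some_indicator (min-max normalized)"] = (df["some_indicator"] - min_value) / (
    max_value - min_value
)
```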

#### Step 3: Compare the Justice40 score experiments to other indices

1. TODO: Describe the steps for this

### Data Sources

- **[EJSCREEN](etl/sources/ejscreen):** TODO Add description of data source
- **[Census](etl/sources/census):** TODO Add description of data source
- **[American Communities Survey](etl/sources/census_acs):** TODO Add description of data source
- **[Housing and Transportation](etl/sources/housing_and_transportation):** TODO Add description of data source
- **[HUD Housing](etl/sources/hud_housing):** TODO Add description of data source
- **[HUD Recap](etl/sources/hud_recap):** TODO Add description of data source
- **[CalEnviroScreen](etl/sources/calenviroscreen):** TODO Add description of data source

## Running using Docker

We use Docker to install the necessary libraries in a container that can be run in any operating system.

To build the Docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`.

Once completed, run `docker-compose up`, then open a new tab or terminal window and run any command for the application using this format:

`docker exec j40_data_pipeline_1 python3 application.py [command]`

Here's a list of commands:

- Get help: `docker exec j40_data_pipeline_1 python3 application.py --help`
- Clean up the census data directories: `docker exec j40_data_pipeline_1 python3 application.py census-cleanup`
- Clean up the data directories: `docker exec j40_data_pipeline_1 python3 application.py data-cleanup`
- Generate census data: `docker exec j40_data_pipeline_1 python3 application.py census-data-download`
- Run all ETL processes: `docker exec j40_data_pipeline_1 python3 application.py etl-run`
- Generate Score: `docker exec j40_data_pipeline_1 python3 application.py score-run`

## Local development

You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also, to generate tiles for a local map, you will need [Mapbox tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to those repos for specific installation instructions for your OS.

### Windows Users

- If you want to download Census data or run tile generation, please install TippeCanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
- If you want to generate tiles, you need some prerequisites for Geopandas as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally.

### Setting up Poetry

- Start a terminal
- Change to this directory (`/data/data-pipeline`)
- Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with `poetry install`

### Downloading Census Block Groups GeoJSON and Generating CBG CSVs

- Make sure you have Docker running on your machine
- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
- Then run `poetry run python application.py census-data-download`

Note: Census files are not kept in the repository and the download directories are ignored by Git.

### Generating mbtiles

- TBD

### Serve the map locally

- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`

### Running Jupyter notebooks

- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab

### Activating variable-enabled Markdown for Jupyter notebooks

- Change to this directory (i.e. `cd data/data-pipeline`)
- Activate a Poetry Shell (see above)
- Run `jupyter contrib nbextension install --user`
- Run `jupyter nbextension enable python-markdown/main`
- Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See the button near the top right of the Notebook screen.)

For more information, see the [nbextensions docs](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) and the [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/python-markdown).

## Miscellaneous

- To export packages from Poetry to `requirements.txt`, run `poetry export --without-hashes > requirements.txt`
0  data/data-pipeline/__init__.py  Normal file
92  data/data-pipeline/application.py  Normal file
@@ -0,0 +1,92 @@
import click

from config import settings
from etl.sources.census.etl_utils import reset_data_directories as census_reset
from utils import (
    get_module_logger,
    data_folder_cleanup,
    score_folder_cleanup,
    temp_folder_cleanup,
)
from etl.sources.census.etl import download_census_csvs
from etl.runner import etl_runner, score_generate

logger = get_module_logger(__name__)


@click.group()
def cli():
    """Defines a click group for the commands below"""

    pass


@cli.command(
    help="Clean up all census data folders",
)
def census_cleanup():
    """CLI command to clean up the census data folder"""

    data_path = settings.APP_ROOT / "data"

    # census directories
    logger.info(f"Initializing all census data")
    census_reset(data_path)

    logger.info("Cleaned up all census data files")


@cli.command(
    help="Clean up all data folders",
)
def data_cleanup():
    """CLI command to clean up all the data folders"""

    data_folder_cleanup()
    score_folder_cleanup()
    temp_folder_cleanup()

    logger.info("Cleaned up all data folders")


@cli.command(
    help="Census data download",
)
def census_data_download():
    """CLI command to download all census shape files from the Census FTP and extract the geojson
    to generate national and by state Census Block Group CSVs"""

    logger.info("Downloading census data")
    data_path = settings.APP_ROOT / "data"
    download_census_csvs(data_path)

    logger.info("Completed downloading census data")


@cli.command(
    help="Run all ETL processes or a specific one",
)
@click.option("-d", "--dataset", required=False, type=str)
def etl_run(dataset: str):
    """Run a specific or all ETL processes

    Args:
        dataset (str): Name of the ETL module to be run (optional)

    Returns:
        None
    """

    etl_runner(dataset)


@cli.command(
    help="Generate Score",
)
def score_run():
    """CLI command to generate the score"""
    score_generate()


if __name__ == "__main__":
    cli()
15  data/data-pipeline/config.py  Normal file
@@ -0,0 +1,15 @@
from dynaconf import Dynaconf
from pathlib import Path

settings = Dynaconf(
    envvar_prefix="DYNACONF",
    settings_files=["settings.toml", ".secrets.toml"],
    environments=True,
)

# set root dir
settings.APP_ROOT = Path.cwd()

# To set an environment use:
# Linux/OSX: export ENV_FOR_DYNACONF=staging
# Windows: set ENV_FOR_DYNACONF=staging
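A brief usage sketch (not part of this commit) of how these settings are consumed elsewhere in the pipeline: `APP_ROOT` anchors all local data paths, and values such as `AWS_JUSTICE40_DATA_URL` are expected to come from `settings.toml`:

```python
from config import settings

# All local data paths are anchored at the working directory captured above.
data_path = settings.APP_ROOT / "data"

# Values defined in settings.toml / .secrets.toml are read as attributes,
# e.g. the S3 base URL used by the census ETL utilities.
fips_zip_url = settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip"
```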
0  data/data-pipeline/data/census/__init__.py  Normal file
0  data/data-pipeline/data/census/csv/__init__.py  Normal file
53  data/data-pipeline/data/census/csv/fips_states_2010.csv  Normal file
@@ -0,0 +1,53 @@
fips,state_name,state_abbreviation,region,division
01,Alabama,AL,South,East South Central
02,Alaska,AK,West,Pacific
04,Arizona,AZ,West,Mountain
05,Arkansas,AR,South,West South Central
06,California,CA,West,Pacific
08,Colorado,CO,West,Mountain
09,Connecticut,CT,Northeast,New England
10,Delaware,DE,South,South Atlantic
11,District of Columbia,DC,South,South Atlantic
12,Florida,FL,South,South Atlantic
13,Georgia,GA,South,South Atlantic
15,Hawaii,HI,West,Pacific
16,Idaho,ID,West,Mountain
17,Illinois,IL,Midwest,East North Central
18,Indiana,IN,Midwest,East North Central
19,Iowa,IA,Midwest,West North Central
20,Kansas,KS,Midwest,West North Central
21,Kentucky,KY,South,East South Central
22,Louisiana,LA,South,West South Central
23,Maine,ME,Northeast,New England
24,Maryland,MD,South,South Atlantic
25,Massachusetts,MA,Northeast,New England
26,Michigan,MI,Midwest,East North Central
27,Minnesota,MN,Midwest,West North Central
28,Mississippi,MS,South,East South Central
29,Missouri,MO,Midwest,West North Central
30,Montana,MT,West,Mountain
31,Nebraska,NE,Midwest,West North Central
32,Nevada,NV,West,Mountain
33,New Hampshire,NH,Northeast,New England
34,New Jersey,NJ,Northeast,Middle Atlantic
35,New Mexico,NM,West,Mountain
36,New York,NY,Northeast,Middle Atlantic
37,North Carolina,NC,South,South Atlantic
38,North Dakota,ND,Midwest,West North Central
39,Ohio,OH,Midwest,East North Central
40,Oklahoma,OK,South,West South Central
41,Oregon,OR,West,Pacific
42,Pennsylvania,PA,Northeast,Middle Atlantic
44,Rhode Island,RI,Northeast,New England
45,South Carolina,SC,South,South Atlantic
46,South Dakota,SD,Midwest,West North Central
47,Tennessee,TN,South,East South Central
48,Texas,TX,South,West South Central
49,Utah,UT,West,Mountain
50,Vermont,VT,Northeast,New England
51,Virginia,VA,South,South Atlantic
53,Washington,WA,West,Pacific
54,West Virginia,WV,South,South Atlantic
55,Wisconsin,WI,Midwest,East North Central
56,Wyoming,WY,West,Mountain
72,Puerto Rico,PR,Puerto Rico,Puerto Rico
0  data/data-pipeline/data/census/geojson/__init__.py  Normal file
0  data/data-pipeline/data/census/shp/__init__.py  Normal file
0  data/data-pipeline/data/dataset/__init__.py  Normal file
0  data/data-pipeline/data/score/csv/__init__.py  Normal file
0  data/data-pipeline/data/score/geojson/__init__.py  Normal file
0  data/data-pipeline/data/tiles/__init__.py  Normal file
0  data/data-pipeline/data/tmp/__init__.py  Normal file
0  data/data-pipeline/etl/__init__.py  Normal file
63  data/data-pipeline/etl/base.py  Normal file
@@ -0,0 +1,63 @@
from pathlib import Path
import pathlib

from config import settings
from utils import unzip_file_from_url, remove_all_from_dir


class ExtractTransformLoad(object):
    """
    A class used to instantiate an ETL object to retrieve and process data from
    datasets.

    Attributes:
        DATA_PATH (pathlib.Path): Local path where all data will be stored
        TMP_PATH (pathlib.Path): Local path where temporary data will be stored
        GEOID_FIELD_NAME (str): The common column name for a Census Block Group identifier
        GEOID_TRACT_FIELD_NAME (str): The common column name for a Census Tract identifier
    """

    DATA_PATH: Path = settings.APP_ROOT / "data"
    TMP_PATH: Path = DATA_PATH / "tmp"
    GEOID_FIELD_NAME: str = "GEOID10"
    GEOID_TRACT_FIELD_NAME: str = "GEOID10_TRACT"

    def get_yaml_config(self) -> None:
        """Reads the YAML configuration file for the dataset and stores
        the properties in the instance (upcoming feature)"""

        pass

    def check_ttl(self) -> None:
        """Checks if the ETL process can be run based on the TTL value on the
        YAML config (upcoming feature)"""

        pass

    def extract(
        self, source_url: str = None, extract_path: Path = None
    ) -> None:
        """Extract the data from
        a remote source. By default it provides code to get the file from a source url,
        unzips it and stores it on an extract_path."""

        # this can be accessed via super().extract()
        if source_url and extract_path:
            unzip_file_from_url(source_url, self.TMP_PATH, extract_path)

    def transform(self) -> None:
        """Transform the data extracted into a format that can be consumed by the
        score generator"""

        raise NotImplementedError

    def load(self) -> None:
        """Saves the transformed data in the specified local data folder or remote AWS S3
        bucket"""

        raise NotImplementedError

    def cleanup(self) -> None:
        """Clears out any files stored in the TMP folder"""

        remove_all_from_dir(self.TMP_PATH)
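For orientation (not part of this commit), a minimal hypothetical subclass shows how the concrete ETL classes in this change plug into this base class; the source name, URL, and output folder below are placeholders:

```python
import pandas as pd

from etl.base import ExtractTransformLoad


class ExampleETL(ExtractTransformLoad):
    """Hypothetical data source, shown only to illustrate the extract/transform/load contract."""

    def __init__(self):
        self.EXAMPLE_ZIP_URL = "https://example.com/example_data.zip"  # placeholder URL
        self.EXAMPLE_CSV = self.TMP_PATH / "example_data.csv"
        self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "example"
        self.df: pd.DataFrame

    def extract(self) -> None:
        # The base class downloads the archive and unzips it into TMP_PATH.
        super().extract(self.EXAMPLE_ZIP_URL, self.TMP_PATH)

    def transform(self) -> None:
        self.df = pd.read_csv(self.EXAMPLE_CSV, dtype={"GEOID10": "string"})

    def load(self) -> None:
        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.OUTPUT_PATH / "usa.csv", index=False)
```

To be picked up by the `etl-run` command, such a class would also need an entry in the `dataset_list` inside `etl/runner.py`.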
114  data/data-pipeline/etl/runner.py  Normal file
@@ -0,0 +1,114 @@
import importlib

from etl.score.etl_score import ScoreETL
from etl.score.etl_score_post import PostScoreETL


def etl_runner(dataset_to_run: str = None) -> None:
    """Runs all etl processes or a specific one

    Args:
        dataset_to_run (str): Run a specific ETL process. If missing, runs all processes (optional)

    Returns:
        None
    """

    # this list comes from YAMLs
    dataset_list = [
        {
            "name": "census_acs",
            "module_dir": "census_acs",
            "class_name": "CensusACSETL",
        },
        {
            "name": "ejscreen",
            "module_dir": "ejscreen",
            "class_name": "EJScreenETL",
        },
        {
            "name": "housing_and_transportation",
            "module_dir": "housing_and_transportation",
            "class_name": "HousingTransportationETL",
        },
        {
            "name": "hud_housing",
            "module_dir": "hud_housing",
            "class_name": "HudHousingETL",
        },
        {
            "name": "calenviroscreen",
            "module_dir": "calenviroscreen",
            "class_name": "CalEnviroScreenETL",
        },
        {
            "name": "hud_recap",
            "module_dir": "hud_recap",
            "class_name": "HudRecapETL",
        },
    ]

    if dataset_to_run:
        dataset_element = next(
            (item for item in dataset_list if item["name"] == dataset_to_run),
            None,
        )
        if dataset_element is None:
            raise ValueError("Invalid dataset name")
        else:
            # reset the list to just the dataset
            dataset_list = [dataset_element]

    # Run the ETLs for the dataset_list
    for dataset in dataset_list:
        etl_module = importlib.import_module(
            f"etl.sources.{dataset['module_dir']}.etl"
        )
        etl_class = getattr(etl_module, dataset["class_name"])
        etl_instance = etl_class()

        # run extract
        etl_instance.extract()

        # run transform
        etl_instance.transform()

        # run load
        etl_instance.load()

        # cleanup
        etl_instance.cleanup()

    # update the front end JSON/CSV of list of data sources
    pass


def score_generate() -> None:
    """Generates the score and saves it on the local data directory

    Args:
        None

    Returns:
        None
    """

    # Score Gen
    score_gen = ScoreETL()
    score_gen.extract()
    score_gen.transform()
    score_gen.load()

    # Post Score Processing
    score_post = PostScoreETL()
    score_post.extract()
    score_post.transform()
    score_post.load()
    score_post.cleanup()


def _find_dataset_index(dataset_list, key, value):
    for i, element in enumerate(dataset_list):
        if element[key] == value:
            return i
    return -1
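As a usage sketch, mirroring what `application.py` already does rather than adding new behavior:

```python
from etl.runner import etl_runner, score_generate

etl_runner()            # run every registered ETL
etl_runner("ejscreen")  # or run a single one by its "name" key in dataset_list

score_generate()        # then build the score and the post-processed CSVs
```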
0  data/data-pipeline/etl/score/__init__.py  Normal file
410  data/data-pipeline/etl/score/etl_score.py  Normal file
@@ -0,0 +1,410 @@
import collections
import functools
import pandas as pd

from etl.base import ExtractTransformLoad
from utils import get_module_logger
from etl.sources.census.etl_utils import get_state_fips_codes

logger = get_module_logger(__name__)


class ScoreETL(ExtractTransformLoad):
    def __init__(self):
        # Define some global parameters
        self.BUCKET_SOCIOECONOMIC = "Socioeconomic Factors"
        self.BUCKET_SENSITIVE = "Sensitive populations"
        self.BUCKET_ENVIRONMENTAL = "Environmental effects"
        self.BUCKET_EXPOSURES = "Exposures"
        self.BUCKETS = [
            self.BUCKET_SOCIOECONOMIC,
            self.BUCKET_SENSITIVE,
            self.BUCKET_ENVIRONMENTAL,
            self.BUCKET_EXPOSURES,
        ]

        # A few specific field names
        # TODO: clean this up, I name some fields but not others.
        self.UNEMPLOYED_FIELD_NAME = "Unemployed civilians (percent)"
        self.LINGUISTIC_ISOLATION_FIELD_NAME = "Linguistic isolation (percent)"
        self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
        self.POVERTY_FIELD_NAME = (
            "Poverty (Less than 200% of federal poverty line)"
        )
        self.HIGH_SCHOOL_FIELD_NAME = "Percent individuals age 25 or over with less than high school degree"

        # There's another aggregation level (a second level of "buckets").
        self.AGGREGATION_POLLUTION = "Pollution Burden"
        self.AGGREGATION_POPULATION = "Population Characteristics"

        self.PERCENTILE_FIELD_SUFFIX = " (percentile)"
        self.MIN_MAX_FIELD_SUFFIX = " (min-max normalized)"

        self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv" / "full"

        # dataframes
        self.df: pd.DataFrame
        self.ejscreen_df: pd.DataFrame
        self.census_df: pd.DataFrame
        self.housing_and_transportation_df: pd.DataFrame
        self.hud_housing_df: pd.DataFrame

    def extract(self) -> None:
        # EJSCREEN csv Load
        ejscreen_csv = self.DATA_PATH / "dataset" / "ejscreen_2019" / "usa.csv"
        self.ejscreen_df = pd.read_csv(
            ejscreen_csv, dtype={"ID": "string"}, low_memory=False
        )
        self.ejscreen_df.rename(
            columns={"ID": self.GEOID_FIELD_NAME}, inplace=True
        )

        # Load census data
        census_csv = self.DATA_PATH / "dataset" / "census_acs_2019" / "usa.csv"
        self.census_df = pd.read_csv(
            census_csv,
            dtype={self.GEOID_FIELD_NAME: "string"},
            low_memory=False,
        )

        # Load housing and transportation data
        housing_and_transportation_index_csv = (
            self.DATA_PATH
            / "dataset"
            / "housing_and_transportation_index"
            / "usa.csv"
        )
        self.housing_and_transportation_df = pd.read_csv(
            housing_and_transportation_index_csv,
            dtype={self.GEOID_FIELD_NAME: "string"},
            low_memory=False,
        )

        # Load HUD housing data
        hud_housing_csv = self.DATA_PATH / "dataset" / "hud_housing" / "usa.csv"
        self.hud_housing_df = pd.read_csv(
            hud_housing_csv,
            dtype={self.GEOID_TRACT_FIELD_NAME: "string"},
            low_memory=False,
        )

    def transform(self) -> None:
        logger.info(f"Transforming Score Data")

        # Join all the data sources that use census block groups
        census_block_group_dfs = [
            self.ejscreen_df,
            self.census_df,
            self.housing_and_transportation_df,
        ]

        census_block_group_df = functools.reduce(
            lambda left, right: pd.merge(
                left=left, right=right, on=self.GEOID_FIELD_NAME, how="outer"
            ),
            census_block_group_dfs,
        )

        # Sanity check the join.
        if (
            len(census_block_group_df[self.GEOID_FIELD_NAME].str.len().unique())
            != 1
        ):
            raise ValueError(
                f"One of the input CSVs uses {self.GEOID_FIELD_NAME} with a different length."
            )

        # Join all the data sources that use census tracts
        # TODO: when there's more than one data source using census tract, reduce/merge them here.
        census_tract_df = self.hud_housing_df

        # Calculate the tract for the CBG data.
        census_block_group_df[
            self.GEOID_TRACT_FIELD_NAME
        ] = census_block_group_df[self.GEOID_FIELD_NAME].str[0:11]

        self.df = census_block_group_df.merge(
            census_tract_df, on=self.GEOID_TRACT_FIELD_NAME
        )

        if len(census_block_group_df) > 220333:
            raise ValueError("Too many rows in the join.")

        # Define a named tuple that will be used for each data set input.
        DataSet = collections.namedtuple(
            typename="DataSet",
            field_names=["input_field", "renamed_field", "bucket"],
        )

        data_sets = [
            # The following data sets have `bucket=None`, because it's not used in the bucket based score ("Score C").
            DataSet(
                input_field=self.GEOID_FIELD_NAME,
                # Use the name `GEOID10` to enable geoplatform.gov's workflow.
                renamed_field=self.GEOID_FIELD_NAME,
                bucket=None,
            ),
            DataSet(
                input_field=self.HOUSING_BURDEN_FIELD_NAME,
                renamed_field=self.HOUSING_BURDEN_FIELD_NAME,
                bucket=None,
            ),
            DataSet(
                input_field="ACSTOTPOP",
                renamed_field="Total population",
                bucket=None,
            ),
            # The following data sets have buckets, because they're used in the score
            DataSet(
                input_field="CANCER",
                renamed_field="Air toxics cancer risk",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="RESP",
                renamed_field="Respiratory hazard index",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="DSLPM",
                renamed_field="Diesel particulate matter",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="PM25",
                renamed_field="Particulate matter (PM2.5)",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="OZONE",
                renamed_field="Ozone",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="PTRAF",
                renamed_field="Traffic proximity and volume",
                bucket=self.BUCKET_EXPOSURES,
            ),
            DataSet(
                input_field="PRMP",
                renamed_field="Proximity to RMP sites",
                bucket=self.BUCKET_ENVIRONMENTAL,
            ),
            DataSet(
                input_field="PTSDF",
                renamed_field="Proximity to TSDF sites",
                bucket=self.BUCKET_ENVIRONMENTAL,
            ),
            DataSet(
                input_field="PNPL",
                renamed_field="Proximity to NPL sites",
                bucket=self.BUCKET_ENVIRONMENTAL,
            ),
            DataSet(
                input_field="PWDIS",
                renamed_field="Wastewater discharge",
                bucket=self.BUCKET_ENVIRONMENTAL,
            ),
            DataSet(
                input_field="PRE1960PCT",
                renamed_field="Percent pre-1960s housing (lead paint indicator)",
                bucket=self.BUCKET_ENVIRONMENTAL,
            ),
            DataSet(
                input_field="UNDER5PCT",
                renamed_field="Individuals under 5 years old",
                bucket=self.BUCKET_SENSITIVE,
            ),
            DataSet(
                input_field="OVER64PCT",
                renamed_field="Individuals over 64 years old",
                bucket=self.BUCKET_SENSITIVE,
            ),
            DataSet(
                input_field=self.LINGUISTIC_ISOLATION_FIELD_NAME,
                renamed_field=self.LINGUISTIC_ISOLATION_FIELD_NAME,
                bucket=self.BUCKET_SENSITIVE,
            ),
            DataSet(
                input_field="LINGISOPCT",
                renamed_field="Percent of households in linguistic isolation",
                bucket=self.BUCKET_SOCIOECONOMIC,
            ),
            DataSet(
                input_field="LOWINCPCT",
                renamed_field=self.POVERTY_FIELD_NAME,
                bucket=self.BUCKET_SOCIOECONOMIC,
            ),
            DataSet(
                input_field="LESSHSPCT",
                renamed_field=self.HIGH_SCHOOL_FIELD_NAME,
                bucket=self.BUCKET_SOCIOECONOMIC,
            ),
            DataSet(
                input_field=self.UNEMPLOYED_FIELD_NAME,
                renamed_field=self.UNEMPLOYED_FIELD_NAME,
                bucket=self.BUCKET_SOCIOECONOMIC,
            ),
            DataSet(
                input_field="ht_ami",
                renamed_field="Housing + Transportation Costs % Income for the Regional Typical Household",
                bucket=self.BUCKET_SOCIOECONOMIC,
            ),
        ]

        # Rename columns:
        renaming_dict = {
            data_set.input_field: data_set.renamed_field
            for data_set in data_sets
        }

        self.df.rename(
            columns=renaming_dict,
            inplace=True,
            errors="raise",
        )

        columns_to_keep = [data_set.renamed_field for data_set in data_sets]
        self.df = self.df[columns_to_keep]

        # Convert all columns to numeric.
        for data_set in data_sets:
            # Skip GEOID_FIELD_NAME, because it's a string.
            if data_set.renamed_field == self.GEOID_FIELD_NAME:
                continue
            self.df[f"{data_set.renamed_field}"] = pd.to_numeric(
                self.df[data_set.renamed_field]
            )

        # calculate percentiles
        for data_set in data_sets:
            self.df[
                f"{data_set.renamed_field}{self.PERCENTILE_FIELD_SUFFIX}"
            ] = self.df[data_set.renamed_field].rank(pct=True)

        # Math:
        # (
        #     Observed value
        #     - minimum of all values
        # )
        # divided by
        # (
        #     Maximum of all values
        #     - minimum of all values
        # )
        for data_set in data_sets:
            # Skip GEOID_FIELD_NAME, because it's a string.
            if data_set.renamed_field == self.GEOID_FIELD_NAME:
                continue

            min_value = self.df[data_set.renamed_field].min(skipna=True)

            max_value = self.df[data_set.renamed_field].max(skipna=True)

            logger.info(
                f"For data set {data_set.renamed_field}, the min value is {min_value} and the max value is {max_value}."
            )

            self.df[f"{data_set.renamed_field}{self.MIN_MAX_FIELD_SUFFIX}"] = (
                self.df[data_set.renamed_field] - min_value
            ) / (max_value - min_value)

        # Graph distributions and correlations.
        min_max_fields = [
            f"{data_set.renamed_field}{self.MIN_MAX_FIELD_SUFFIX}"
            for data_set in data_sets
            if data_set.renamed_field != self.GEOID_FIELD_NAME
        ]

        # Calculate score "A" and score "B"
        self.df["Score A"] = self.df[
            [
                "Poverty (Less than 200% of federal poverty line) (percentile)",
                "Percent individuals age 25 or over with less than high school degree (percentile)",
            ]
        ].mean(axis=1)
        self.df["Score B"] = (
            self.df[
                "Poverty (Less than 200% of federal poverty line) (percentile)"
            ]
            * self.df[
                "Percent individuals age 25 or over with less than high school degree (percentile)"
            ]
        )

        # Calculate "CalEnviroScreen for the US" score
        # Average all the percentile values in each bucket into a single score for each of the four buckets.
        for bucket in self.BUCKETS:
            fields_in_bucket = [
                f"{data_set.renamed_field}{self.PERCENTILE_FIELD_SUFFIX}"
                for data_set in data_sets
                if data_set.bucket == bucket
            ]
            self.df[f"{bucket}"] = self.df[fields_in_bucket].mean(axis=1)

        # Combine the score from the two Exposures and Environmental Effects buckets into a single score called "Pollution Burden". The math for this score is: (1.0 * Exposures Score + 0.5 * Environment Effects score) / 1.5.
        self.df[self.AGGREGATION_POLLUTION] = (
            1.0 * self.df[f"{self.BUCKET_EXPOSURES}"]
            + 0.5 * self.df[f"{self.BUCKET_ENVIRONMENTAL}"]
        ) / 1.5

        # Average the score from the two Sensitive populations and Socioeconomic factors buckets into a single score called "Population Characteristics".
        self.df[self.AGGREGATION_POPULATION] = self.df[
            [f"{self.BUCKET_SENSITIVE}", f"{self.BUCKET_SOCIOECONOMIC}"]
        ].mean(axis=1)

        # Multiply the "Pollution Burden" score and the "Population Characteristics" together to produce the cumulative impact score.
        self.df["Score C"] = (
            self.df[self.AGGREGATION_POLLUTION]
            * self.df[self.AGGREGATION_POPULATION]
        )

        if len(census_block_group_df) > 220333:
            raise ValueError("Too many rows in the join.")

        fields_to_use_in_score = [
            self.UNEMPLOYED_FIELD_NAME,
            self.LINGUISTIC_ISOLATION_FIELD_NAME,
            self.HOUSING_BURDEN_FIELD_NAME,
            self.POVERTY_FIELD_NAME,
            self.HIGH_SCHOOL_FIELD_NAME,
        ]

        fields_min_max = [
            f"{field}{self.MIN_MAX_FIELD_SUFFIX}"
            for field in fields_to_use_in_score
        ]
        fields_percentile = [
            f"{field}{self.PERCENTILE_FIELD_SUFFIX}"
            for field in fields_to_use_in_score
        ]

        # Calculate "Score D", which uses min-max normalization
        # and calculate "Score E", which uses percentile normalization for the same fields
        self.df["Score D"] = self.df[fields_min_max].mean(axis=1)
        self.df["Score E"] = self.df[fields_percentile].mean(axis=1)

        # Calculate correlations
        self.df[fields_min_max].corr()

        # Create percentiles for the scores
        for score_field in [
            "Score A",
            "Score B",
            "Score C",
            "Score D",
            "Score E",
        ]:
            self.df[f"{score_field}{self.PERCENTILE_FIELD_SUFFIX}"] = self.df[
                score_field
            ].rank(pct=True)
            self.df[f"{score_field} (top 25th percentile)"] = (
                self.df[f"{score_field}{self.PERCENTILE_FIELD_SUFFIX}"] >= 0.75
            )

    def load(self) -> None:
        logger.info(f"Saving Score CSV")

        # write nationwide csv
        self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.SCORE_CSV_PATH / f"usa.csv", index=False)
112  data/data-pipeline/etl/score/etl_score_post.py  Normal file
@@ -0,0 +1,112 @@
import pandas as pd

from etl.base import ExtractTransformLoad
from utils import get_module_logger

logger = get_module_logger(__name__)


class PostScoreETL(ExtractTransformLoad):
    """
    A class used to instantiate an ETL object to retrieve and process data from
    datasets.
    """

    def __init__(self):
        self.CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
        self.CENSUS_COUNTIES_TXT = self.TMP_PATH / "Gaz_counties_national.txt"
        self.CENSUS_COUNTIES_COLS = ["USPS", "GEOID", "NAME"]
        self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv"
        self.STATE_CSV = (
            self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
        )
        self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
        self.TILR_SCORE_CSV = self.SCORE_CSV_PATH / "tile" / "usa.csv"

        self.TILES_SCORE_COLUMNS = [
            "GEOID10",
            "Score E (percentile)",
            "Score E (top 25th percentile)",
            "GEOID",
            "State Abbreviation",
            "County Name",
        ]
        self.TILES_SCORE_CSV_PATH = self.SCORE_CSV_PATH / "tiles"
        self.TILES_SCORE_CSV = self.TILES_SCORE_CSV_PATH / "usa.csv"

        self.counties_df: pd.DataFrame
        self.states_df: pd.DataFrame
        self.score_df: pd.DataFrame
        self.score_county_state_merged: pd.DataFrame
        self.score_for_tiles: pd.DataFrame

    def extract(self) -> None:
        super().extract(
            self.CENSUS_COUNTIES_ZIP_URL,
            self.TMP_PATH,
        )

        logger.info(f"Reading Counties CSV")
        self.counties_df = pd.read_csv(
            self.CENSUS_COUNTIES_TXT,
            sep="\t",
            dtype={"GEOID": "string", "USPS": "string"},
            low_memory=False,
            encoding="latin-1",
        )

        logger.info(f"Reading States CSV")
        self.states_df = pd.read_csv(
            self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
        )
        self.score_df = pd.read_csv(
            self.FULL_SCORE_CSV, dtype={"GEOID10": "string"}
        )

    def transform(self) -> None:
        logger.info(f"Transforming data sources for Score + County CSV")

        # rename some of the columns to prepare for merge
        self.counties_df = self.counties_df[["USPS", "GEOID", "NAME"]]
        self.counties_df.rename(
            columns={"USPS": "State Abbreviation", "NAME": "County Name"},
            inplace=True,
        )

        # remove unnecessary columns
        self.states_df.rename(
            columns={
                "fips": "State Code",
                "state_name": "State Name",
                "state_abbreviation": "State Abbreviation",
            },
            inplace=True,
        )
        self.states_df.drop(["region", "division"], axis=1, inplace=True)

        # add the county-level GEOID column (first 5 digits of the block group GEOID)
        self.score_df["GEOID"] = self.score_df.GEOID10.str[:5]

        # merge state and counties
        county_state_merged = self.counties_df.join(
            self.states_df, rsuffix=" Other"
        )
        del county_state_merged["State Abbreviation Other"]

        # merge county and score
        self.score_county_state_merged = self.score_df.join(
            county_state_merged, rsuffix="_OTHER"
        )
        del self.score_county_state_merged["GEOID_OTHER"]

    def load(self) -> None:
        logger.info(f"Saving Full Score CSV with County Information")
        self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.score_county_state_merged.to_csv(self.FULL_SCORE_CSV, index=False)

        logger.info(f"Saving Tile Score CSV")
        # TODO: check which are the columns we'll use
        # Related to: https://github.com/usds/justice40-tool/issues/302
        score_tiles = self.score_county_state_merged[self.TILES_SCORE_COLUMNS]
        self.TILES_SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
        score_tiles.to_csv(self.TILES_SCORE_CSV, index=False)
0  data/data-pipeline/etl/sources/__init__.py  Normal file
0  data/data-pipeline/etl/sources/calenviroscreen/README.md  Normal file
69  data/data-pipeline/etl/sources/calenviroscreen/etl.py  Normal file
@@ -0,0 +1,69 @@
import pandas as pd

from etl.base import ExtractTransformLoad
from utils import get_module_logger

logger = get_module_logger(__name__)


class CalEnviroScreenETL(ExtractTransformLoad):
    def __init__(self):
        self.CALENVIROSCREEN_FTP_URL = "https://justice40-data.s3.amazonaws.com/CalEnviroScreen/CalEnviroScreen_4.0_2021.zip"
        self.CALENVIROSCREEN_CSV = self.TMP_PATH / "CalEnviroScreen_4.0_2021.csv"
        self.CSV_PATH = self.DATA_PATH / "dataset" / "calenviroscreen4"

        # Defining some variable names
        self.CALENVIROSCREEN_SCORE_FIELD_NAME = "calenviroscreen_score"
        self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME = "calenviroscreen_percentile"
        self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME = (
            "calenviroscreen_priority_community"
        )

        # Choosing constants.
        # None of these numbers are final, but just for the purposes of comparison.
        self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD = 75

        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info(f"Downloading CalEnviroScreen Data")
        super().extract(
            self.CALENVIROSCREEN_FTP_URL,
            self.TMP_PATH,
        )

    def transform(self) -> None:
        logger.info(f"Transforming CalEnviroScreen Data")

        # Data from https://calenviroscreen-oehha.hub.arcgis.com/#Data, specifically:
        # https://oehha.ca.gov/media/downloads/calenviroscreen/document/calenviroscreen40resultsdatadictionaryd12021.zip
        # Load comparison index (CalEnviroScreen 4)
        self.df = pd.read_csv(
            self.CALENVIROSCREEN_CSV, dtype={"Census Tract": "string"}
        )

        self.df.rename(
            columns={
                "Census Tract": self.GEOID_TRACT_FIELD_NAME,
                "DRAFT CES 4.0 Score": self.CALENVIROSCREEN_SCORE_FIELD_NAME,
                "DRAFT CES 4.0 Percentile": self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME,
            },
            inplace=True,
        )

        # Add a leading "0" to the Census Tract to match our format in other data frames.
        self.df[self.GEOID_TRACT_FIELD_NAME] = (
            "0" + self.df[self.GEOID_TRACT_FIELD_NAME]
        )

        # Calculate the top K% of prioritized communities
        self.df[self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME] = (
            self.df[self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME]
            >= self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD
        )

    def load(self) -> None:
        logger.info(f"Saving CalEnviroScreen CSV")
        # write nationwide csv
        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.CSV_PATH / f"data06.csv", index=False)
0  data/data-pipeline/etl/sources/census/README.md  Normal file
0  data/data-pipeline/etl/sources/census/__init__.py  Normal file
111  data/data-pipeline/etl/sources/census/etl.py  Normal file
@@ -0,0 +1,111 @@
import csv
import os
import json
from pathlib import Path

from .etl_utils import get_state_fips_codes
from utils import unzip_file_from_url, get_module_logger

logger = get_module_logger(__name__)


def download_census_csvs(data_path: Path) -> None:
    """Download all census shape files from the Census FTP and extract the geojson
    to generate national and by state Census Block Group CSVs

    Args:
        data_path (pathlib.Path): Name of the directory where the files and directories will
        be created

    Returns:
        None
    """

    # the fips_states_2010.csv is generated from data here
    # https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html
    state_fips_codes = get_state_fips_codes(data_path)
    geojson_dir_path = data_path / "census" / "geojson"

    for fips in state_fips_codes:
        # check if file exists
        shp_file_path = (
            data_path / "census" / "shp" / fips / f"tl_2010_{fips}_bg10.shp"
        )

        logger.info(f"Checking if {fips} file exists")
        if not os.path.isfile(shp_file_path):
            logger.info(f"Downloading and extracting {fips} shape file")
            # 2020 tiger data is here: https://www2.census.gov/geo/tiger/TIGER2020/BG/
            # But using 2010 for now
            cbg_state_url = f"https://www2.census.gov/geo/tiger/TIGER2010/BG/2010/tl_2010_{fips}_bg10.zip"
            unzip_file_from_url(
                cbg_state_url,
                data_path / "tmp",
                data_path / "census" / "shp" / fips,
            )

            cmd = (
                "ogr2ogr -f GeoJSON data/census/geojson/"
                + fips
                + ".json data/census/shp/"
                + fips
                + "/tl_2010_"
                + fips
                + "_bg10.shp"
            )
            os.system(cmd)

    # generate CBG CSV table for pandas
    ## load in memory
    cbg_national = []  # in-memory global list
    cbg_per_state: dict = {}  # in-memory dict per state
    for file in os.listdir(geojson_dir_path):
        if file.endswith(".json"):
            logger.info(f"Ingesting geoid10 for file {file}")
            with open(geojson_dir_path / file) as f:
                geojson = json.load(f)
                for feature in geojson["features"]:
                    geoid10 = feature["properties"]["GEOID10"]
                    cbg_national.append(str(geoid10))
                    geoid10_state_id = geoid10[:2]
                    if not cbg_per_state.get(geoid10_state_id):
                        cbg_per_state[geoid10_state_id] = []
                    cbg_per_state[geoid10_state_id].append(geoid10)

    csv_dir_path = data_path / "census" / "csv"
    ## write to individual state csv
    for state_id in cbg_per_state:
        geoid10_list = cbg_per_state[state_id]
        with open(
            csv_dir_path / f"{state_id}.csv", mode="w", newline=""
        ) as cbg_csv_file:
            cbg_csv_file_writer = csv.writer(
                cbg_csv_file,
                delimiter=",",
                quotechar='"',
                quoting=csv.QUOTE_MINIMAL,
            )

            for geoid10 in geoid10_list:
                cbg_csv_file_writer.writerow(
                    [
                        geoid10,
                    ]
                )

    ## write US csv
    with open(csv_dir_path / "us.csv", mode="w", newline="") as cbg_csv_file:
        cbg_csv_file_writer = csv.writer(
            cbg_csv_file,
            delimiter=",",
            quotechar='"',
            quoting=csv.QUOTE_MINIMAL,
        )
        for geoid10 in cbg_national:
            cbg_csv_file_writer.writerow(
                [
                    geoid10,
                ]
            )

    logger.info("Census block groups downloading complete")
55  data/data-pipeline/etl/sources/census/etl_utils.py  Normal file
@@ -0,0 +1,55 @@
from pathlib import Path
import csv
import os
from config import settings

from utils import (
    remove_files_from_dir,
    remove_all_dirs_from_dir,
    unzip_file_from_url,
    get_module_logger,
)

logger = get_module_logger(__name__)


def reset_data_directories(data_path: Path) -> None:
    census_data_path = data_path / "census"

    # csv
    csv_path = census_data_path / "csv"
    remove_files_from_dir(csv_path, ".csv")

    # geojson
    geojson_path = census_data_path / "geojson"
    remove_files_from_dir(geojson_path, ".json")

    # shp
    shp_path = census_data_path / "shp"
    remove_all_dirs_from_dir(shp_path)


def get_state_fips_codes(data_path: Path) -> list:
    fips_csv_path = data_path / "census" / "csv" / "fips_states_2010.csv"

    # check if file exists
    if not os.path.isfile(fips_csv_path):
        logger.info(f"Downloading fips from S3 repository")
        unzip_file_from_url(
            settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
            data_path / "tmp",
            data_path / "census" / "csv",
        )

    fips_state_list = []
    with open(fips_csv_path) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        line_count = 0

        for row in csv_reader:
            if line_count == 0:
                line_count += 1
            else:
                fips = row[0].strip()
                fips_state_list.append(fips)
    return fips_state_list
0  data/data-pipeline/etl/sources/census_acs/README.md  Normal file
0  data/data-pipeline/etl/sources/census_acs/__init__.py  Normal file
108  data/data-pipeline/etl/sources/census_acs/etl.py  Normal file
@@ -0,0 +1,108 @@
import pandas as pd
import censusdata

from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger

logger = get_module_logger(__name__)


class CensusACSETL(ExtractTransformLoad):
    def __init__(self):
        self.ACS_YEAR = 2019
        self.OUTPUT_PATH = (
            self.DATA_PATH / "dataset" / f"census_acs_{self.ACS_YEAR}"
        )
        self.UNEMPLOYED_FIELD_NAME = "Unemployed civilians (percent)"
        self.LINGUISTIC_ISOLATION_FIELD_NAME = "Linguistic isolation (percent)"
        self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME = (
            "Linguistic isolation (total)"
        )
        self.LINGUISTIC_ISOLATION_FIELDS = [
            "C16002_001E",
            "C16002_004E",
            "C16002_007E",
            "C16002_010E",
            "C16002_013E",
        ]
        self.df: pd.DataFrame

    def _fips_from_censusdata_censusgeo(
        self, censusgeo: censusdata.censusgeo
    ) -> str:
        """Create a FIPS code from the proprietary censusgeo index."""
        fips = "".join([value for (key, value) in censusgeo.params()])
        return fips

    def extract(self) -> None:
        dfs = []
        for fips in get_state_fips_codes(self.DATA_PATH):
            logger.info(
                f"Downloading data for state/territory with FIPS code {fips}"
            )

            dfs.append(
                censusdata.download(
                    src="acs5",
                    year=self.ACS_YEAR,
                    geo=censusdata.censusgeo(
                        [("state", fips), ("county", "*"), ("block group", "*")]
                    ),
                    var=[
                        # Employment fields
                        "B23025_005E",
                        "B23025_003E",
                    ]
                    + self.LINGUISTIC_ISOLATION_FIELDS,
                )
            )

        self.df = pd.concat(dfs)

        self.df[self.GEOID_FIELD_NAME] = self.df.index.to_series().apply(
            func=self._fips_from_censusdata_censusgeo
        )

    def transform(self) -> None:
        logger.info(f"Starting Census ACS Transform")

        # Calculate percent unemployment.
        # TODO: remove small-sample data that should be `None` instead of a high-variance fraction.
        self.df[self.UNEMPLOYED_FIELD_NAME] = (
            self.df.B23025_005E / self.df.B23025_003E
        )

        # Calculate linguistic isolation.
        individual_limited_english_fields = [
            "C16002_004E",
            "C16002_007E",
            "C16002_010E",
            "C16002_013E",
        ]

        self.df[self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME] = self.df[
            individual_limited_english_fields
        ].sum(axis=1, skipna=True)
        self.df[self.LINGUISTIC_ISOLATION_FIELD_NAME] = (
            self.df[self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME].astype(float)
            / self.df["C16002_001E"]
        )

        self.df[self.LINGUISTIC_ISOLATION_FIELD_NAME].describe()

    def load(self) -> None:
        logger.info(f"Saving Census ACS Data")

        # mkdir census
        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

        columns_to_include = [
            self.GEOID_FIELD_NAME,
            self.UNEMPLOYED_FIELD_NAME,
            self.LINGUISTIC_ISOLATION_FIELD_NAME,
        ]

        self.df[columns_to_include].to_csv(
            path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False
        )
0  data/data-pipeline/etl/sources/ejscreen/README.md  Normal file
0  data/data-pipeline/etl/sources/ejscreen/__init__.py  Normal file
37  data/data-pipeline/etl/sources/ejscreen/etl.py  Normal file
@@ -0,0 +1,37 @@
import pandas as pd

from etl.base import ExtractTransformLoad
from utils import get_module_logger

logger = get_module_logger(__name__)


class EJScreenETL(ExtractTransformLoad):
    def __init__(self):
        self.EJSCREEN_FTP_URL = "https://gaftp.epa.gov/EJSCREEN/2019/EJSCREEN_2019_StatePctile.csv.zip"
        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctiles.csv"
        self.CSV_PATH = self.DATA_PATH / "dataset" / "ejscreen_2019"
        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Downloading EJScreen Data")
        super().extract(
            self.EJSCREEN_FTP_URL,
            self.TMP_PATH,
        )

    def transform(self) -> None:
        logger.info("Transforming EJScreen Data")
        self.df = pd.read_csv(
            self.EJSCREEN_CSV,
            dtype={"ID": "string"},
            # EJSCREEN writes the word "None" for NA data.
            na_values=["None"],
            low_memory=False,
        )

    def load(self) -> None:
        logger.info("Saving EJScreen CSV")
        # Write the nationwide CSV.
        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.CSV_PATH / "usa.csv", index=False)
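A quick sketch (not repository code) of why `na_values=["None"]` matters in the read above: it tells pandas to treat the literal string "None" as missing, so numeric columns parse cleanly.

import io
import pandas as pd

# Hypothetical two-row extract shaped like an EJSCREEN column.
sample = io.StringIO("ID,PM25\n010010201001,9.7\n010010201002,None\n")
df = pd.read_csv(sample, dtype={"ID": "string"}, na_values=["None"])
print(df["PM25"].dtype)  # float64, with the second row parsed as NaN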
@@ -0,0 +1,62 @@
import pandas as pd

from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger, unzip_file_from_url

logger = get_module_logger(__name__)


class HousingTransportationETL(ExtractTransformLoad):
    def __init__(self):
        self.HOUSING_FTP_URL = (
            "https://htaindex.cnt.org/download/download.php?focus=blkgrp&geoid="
        )
        self.OUTPUT_PATH = (
            self.DATA_PATH / "dataset" / "housing_and_transportation_index"
        )
        self.df: pd.DataFrame

    def extract(self) -> None:
        # Download each state / territory individually.
        dfs = []
        zip_file_dir = self.TMP_PATH / "housing_and_transportation_index"
        for fips in get_state_fips_codes(self.DATA_PATH):
            logger.info(
                f"Downloading housing data for state/territory with FIPS code {fips}"
            )

            # Puerto Rico has no data, so skip it.
            if fips == "72":
                continue

            unzip_file_from_url(
                f"{self.HOUSING_FTP_URL}{fips}", self.TMP_PATH, zip_file_dir
            )

            # Read the newly unzipped file for this state or territory.
            tmp_csv_file_path = (
                zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
            )
            tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)

            dfs.append(tmp_df)

        self.df = pd.concat(dfs)

    def transform(self) -> None:
        logger.info("Transforming Housing and Transportation Data")

        # Rename and reformat the block group ID.
        self.df.rename(columns={"blkgrp": self.GEOID_FIELD_NAME}, inplace=True)
        self.df[self.GEOID_FIELD_NAME] = self.df[
            self.GEOID_FIELD_NAME
        ].str.replace('"', "")

    def load(self) -> None:
        logger.info("Saving Housing and Transportation Data")

        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False)
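A small illustration (not repository code) of the block group ID clean-up in `transform` above: the raw htaindex files evidently wrap GEOIDs in literal double quotes, which `str.replace` strips before the field is used for joins.

import pandas as pd

# Hypothetical raw values as they might appear in the downloaded CSV.
raw = pd.Series(['"010010201001"', '"010010201002"'], name="blkgrp")
cleaned = raw.str.replace('"', "")
print(cleaned.tolist())  # ['010010201001', '010010201002']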
0
data/data-pipeline/etl/sources/hud_housing/README.md
Normal file
0
data/data-pipeline/etl/sources/hud_housing/__init__.py
Normal file
187
data/data-pipeline/etl/sources/hud_housing/etl.py
Normal file
@@ -0,0 +1,187 @@
import pandas as pd

from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger, unzip_file_from_url, remove_all_from_dir

logger = get_module_logger(__name__)


class HudHousingETL(ExtractTransformLoad):
    def __init__(self):
        self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "hud_housing"
        self.GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
        self.HOUSING_FTP_URL = "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
        self.HOUSING_ZIP_FILE_DIR = self.TMP_PATH / "hud_housing"

        # We measure households earning less than 80% of HUD Area Median Family Income by county
        # and paying greater than 30% of their income to housing costs.
        self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
        self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME = "HOUSING_BURDEN_NUMERATOR"
        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = (
            "HOUSING_BURDEN_DENOMINATOR"
        )

        # Note: some variable definitions.
        # HUD-adjusted median family income (HAMFI).
        # The four housing problems are: incomplete kitchen facilities, incomplete plumbing
        # facilities, more than 1 person per room, and cost burden greater than 30%.
        # Table 8 is the desired table.

        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Extracting HUD Housing Data")
        super().extract(
            self.HOUSING_FTP_URL,
            self.HOUSING_ZIP_FILE_DIR,
        )
    def transform(self) -> None:
        logger.info("Transforming HUD Housing Data")

        # Path of the Table 8 CSV within the unzipped archive:
        tmp_csv_file_path = (
            self.HOUSING_ZIP_FILE_DIR
            / "2012thru2016-140-csv"
            / "2012thru2016-140-csv"
            / "140"
            / "Table8.csv"
        )
        self.df = pd.read_csv(
            filepath_or_buffer=tmp_csv_file_path,
            encoding="latin-1",
        )

        # Rename and reformat the tract ID.
        self.df.rename(
            columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True
        )

        # The CHAS data has census tract IDs such as `14000US01001020100`,
        # whereas the rest of our data uses `01001020100` for the same tract.
        # Strip everything up to and including the characters `US`:
        self.df[self.GEOID_TRACT_FIELD_NAME] = self.df[
            self.GEOID_TRACT_FIELD_NAME
        ].str.replace(r"^.*?US", "", regex=True)

        # Calculate housing burden.
        # This is quite a number of steps. It does not appear to be accessible nationally
        # in a simpler format, though. See "CHAS data dictionary 12-16.xlsx".

        # Owner-occupied numerator fields.
        OWNER_OCCUPIED_NUMERATOR_FIELDS = [
            # Key: Column Name; Line_Type; Tenure; Household income; Cost burden; Facilities
            # T8_est7; Subtotal; Owner occupied; less than or equal to 30% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est7",
            # T8_est10; Subtotal; Owner occupied; less than or equal to 30% of HAMFI; greater than 50%; All
            "T8_est10",
            # T8_est20; Subtotal; Owner occupied; greater than 30% but less than or equal to 50% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est20",
            # T8_est23; Subtotal; Owner occupied; greater than 30% but less than or equal to 50% of HAMFI; greater than 50%; All
            "T8_est23",
            # T8_est33; Subtotal; Owner occupied; greater than 50% but less than or equal to 80% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est33",
            # T8_est36; Subtotal; Owner occupied; greater than 50% but less than or equal to 80% of HAMFI; greater than 50%; All
            "T8_est36",
        ]

        # These rows have the values where HAMFI was not computed, because of no or negative income.
        OWNER_OCCUPIED_NOT_COMPUTED_FIELDS = [
            # Key: Column Name; Line_Type; Tenure; Household income; Cost burden; Facilities
            # T8_est13; Subtotal; Owner occupied; less than or equal to 30% of HAMFI; not computed (no/negative income); All
            "T8_est13",
            # T8_est26; Subtotal; Owner occupied; greater than 30% but less than or equal to 50% of HAMFI; not computed (no/negative income); All
            "T8_est26",
            # T8_est39; Subtotal; Owner occupied; greater than 50% but less than or equal to 80% of HAMFI; not computed (no/negative income); All
            "T8_est39",
            # T8_est52; Subtotal; Owner occupied; greater than 80% but less than or equal to 100% of HAMFI; not computed (no/negative income); All
            "T8_est52",
            # T8_est65; Subtotal; Owner occupied; greater than 100% of HAMFI; not computed (no/negative income); All
            "T8_est65",
        ]

        # T8_est2; Subtotal; Owner occupied; All; All; All
        OWNER_OCCUPIED_POPULATION_FIELD = "T8_est2"

        # Renter-occupied numerator fields.
        RENTER_OCCUPIED_NUMERATOR_FIELDS = [
            # Key: Column Name; Line_Type; Tenure; Household income; Cost burden; Facilities
            # T8_est73; Subtotal; Renter occupied; less than or equal to 30% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est73",
            # T8_est76; Subtotal; Renter occupied; less than or equal to 30% of HAMFI; greater than 50%; All
            "T8_est76",
            # T8_est86; Subtotal; Renter occupied; greater than 30% but less than or equal to 50% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est86",
            # T8_est89; Subtotal; Renter occupied; greater than 30% but less than or equal to 50% of HAMFI; greater than 50%; All
            "T8_est89",
            # T8_est99; Subtotal; Renter occupied; greater than 50% but less than or equal to 80% of HAMFI; greater than 30% but less than or equal to 50%; All
            "T8_est99",
            # T8_est102; Subtotal; Renter occupied; greater than 50% but less than or equal to 80% of HAMFI; greater than 50%; All
            "T8_est102",
        ]

        # These rows have the values where HAMFI was not computed, because of no or negative income.
        RENTER_OCCUPIED_NOT_COMPUTED_FIELDS = [
            # Key: Column Name; Line_Type; Tenure; Household income; Cost burden; Facilities
            # T8_est79; Subtotal; Renter occupied; less than or equal to 30% of HAMFI; not computed (no/negative income); All
            "T8_est79",
            # T8_est92; Subtotal; Renter occupied; greater than 30% but less than or equal to 50% of HAMFI; not computed (no/negative income); All
            "T8_est92",
            # T8_est105; Subtotal; Renter occupied; greater than 50% but less than or equal to 80% of HAMFI; not computed (no/negative income); All
            "T8_est105",
            # T8_est118; Subtotal; Renter occupied; greater than 80% but less than or equal to 100% of HAMFI; not computed (no/negative income); All
            "T8_est118",
            # T8_est131; Subtotal; Renter occupied; greater than 100% of HAMFI; not computed (no/negative income); All
            "T8_est131",
        ]

        # T8_est68; Subtotal; Renter occupied; All; All; All
        RENTER_OCCUPIED_POPULATION_FIELD = "T8_est68"
        # Math:
        # (
        #     # of owner-occupied units meeting the criteria
        #     + # of renter-occupied units meeting the criteria
        # )
        # divided by
        # (
        #     total # of owner-occupied units
        #     + total # of renter-occupied units
        #     - # of owner-occupied units with HAMFI not computed
        #     - # of renter-occupied units with HAMFI not computed
        # )

        self.df[self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME] = self.df[
            OWNER_OCCUPIED_NUMERATOR_FIELDS
        ].sum(axis=1) + self.df[RENTER_OCCUPIED_NUMERATOR_FIELDS].sum(axis=1)

        self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME] = (
            self.df[OWNER_OCCUPIED_POPULATION_FIELD]
            + self.df[RENTER_OCCUPIED_POPULATION_FIELD]
            - self.df[OWNER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)
            - self.df[RENTER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)
        )

        # TODO: add small sample size checks.
        self.df[self.HOUSING_BURDEN_FIELD_NAME] = self.df[
            self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME
        ].astype(float) / self.df[
            self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME
        ].astype(float)
    def load(self) -> None:
        logger.info("Saving HUD Housing Data")

        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

        # Keep only the fields we need.
        self.df[
            [
                self.GEOID_TRACT_FIELD_NAME,
                self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME,
                self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME,
                self.HOUSING_BURDEN_FIELD_NAME,
            ]
        ].to_csv(path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False)
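To make the housing burden math above concrete, here is a small worked example with made-up numbers (illustrative only, not repository code or real CHAS values):

# One hypothetical tract: 120 burdened owner-occupied units plus 180 burdened
# renter-occupied units, out of 1,000 owner and 800 renter units, where 40 + 30
# units have HAMFI not computed and are excluded from the denominator.
numerator = 120 + 180                       # 300
denominator = 1000 + 800 - 40 - 30          # 1730
housing_burden = numerator / denominator    # ~0.173, i.e. about 17% of units are burdened
print(round(housing_burden, 3))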
0
data/data-pipeline/etl/sources/hud_recap/README.md
Normal file
0
data/data-pipeline/etl/sources/hud_recap/__init__.py
Normal file
63
data/data-pipeline/etl/sources/hud_recap/etl.py
Normal file
@@ -0,0 +1,63 @@
import pandas as pd
import requests

from etl.base import ExtractTransformLoad
from utils import get_module_logger

logger = get_module_logger(__name__)


class HudRecapETL(ExtractTransformLoad):
    def __init__(self):
        self.HUD_RECAP_CSV_URL = "https://opendata.arcgis.com/api/v3/datasets/56de4edea8264fe5a344da9811ef5d6e_0/downloads/data?format=csv&spatialRefId=4326"
        self.HUD_RECAP_CSV = (
            self.TMP_PATH
            / "Racially_or_Ethnically_Concentrated_Areas_of_Poverty__R_ECAPs_.csv"
        )
        self.CSV_PATH = self.DATA_PATH / "dataset" / "hud_recap"

        # Defining some variable names
        self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME = "hud_recap_priority_community"

        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Downloading HUD Recap Data")
        download = requests.get(self.HUD_RECAP_CSV_URL, verify=None)
        file_contents = download.content
        with open(self.HUD_RECAP_CSV, "wb") as csv_file:
            csv_file.write(file_contents)

    def transform(self) -> None:
        logger.info("Transforming HUD Recap Data")

        # Load the downloaded R/ECAP data.
        self.df = pd.read_csv(self.HUD_RECAP_CSV, dtype={"Census Tract": "string"})

        self.df.rename(
            columns={
                "GEOID": self.GEOID_TRACT_FIELD_NAME,
                # Interestingly, there's no data dictionary for the RECAP data that I could find.
                # However, this site (http://www.schousing.com/library/Tax%20Credit/2020/QAP%20Instructions%20(2).pdf)
                # suggests:
                # "If RCAP_Current for the tract in which the site is located is 1, the tract is an R/ECAP. If RCAP_Current is 0, it is not."
                "RCAP_Current": self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME,
            },
            inplace=True,
        )

        # Convert to boolean.
        self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME] = self.df[
            self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME
        ].astype("bool")

        self.df.sort_values(by=self.GEOID_TRACT_FIELD_NAME, inplace=True)

    def load(self) -> None:
        logger.info("Saving HUD Recap CSV")
        # Write the nationwide CSV.
        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.CSV_PATH / "usa.csv", index=False)
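One caveat worth flagging on the `.astype("bool")` conversion above, shown with a quick illustrative check (not repository code): in pandas, missing values in a float column become True when cast to plain bool, so tracts with no RCAP_Current value would be treated as priority communities unless they are filled first.

import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])
print(s.astype("bool").tolist())            # [False, True, True] -- NaN becomes True
print(s.fillna(0).astype("bool").tolist())  # [False, True, False]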
161
data/data-pipeline/ipython/county_lookup.ipynb
Normal file
|
@ -0,0 +1,161 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7185e18d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import csv\n",
|
||||
"from pathlib import Path\n",
|
||||
"import os\n",
|
||||
"import sys"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "174bbd09",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"module_path = os.path.abspath(os.path.join(\"..\"))\n",
|
||||
"if module_path not in sys.path:\n",
|
||||
" sys.path.append(module_path)\n",
|
||||
" \n",
|
||||
"from utils import unzip_file_from_url\n",
|
||||
"from etl.sources.census.etl_utils import get_state_fips_codes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "dd090fcc",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"DATA_PATH = Path.cwd().parent / \"data\"\n",
|
||||
"TMP_PATH: Path = DATA_PATH / \"tmp\"\n",
|
||||
"STATE_CSV = DATA_PATH / \"census\" / \"csv\" / \"fips_states_2010.csv\"\n",
|
||||
"SCORE_CSV = DATA_PATH / \"score\" / \"csv\" / \"usa.csv\"\n",
|
||||
"COUNTY_SCORE_CSV = DATA_PATH / \"score\" / \"csv\" / \"usa-county.csv\"\n",
|
||||
"CENSUS_COUNTIES_ZIP_URL = \"https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2020_Gazetteer/2020_Gaz_counties_national.zip\"\n",
|
||||
"CENSUS_COUNTIES_TXT = TMP_PATH / \"2020_Gaz_counties_national.txt\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "cf2e266b",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"unzip_file_from_url(CENSUS_COUNTIES_ZIP_URL, TMP_PATH, TMP_PATH)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9ff96da8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"counties_df = pd.read_csv(CENSUS_COUNTIES_TXT, sep=\"\\t\", dtype={\"GEOID\": \"string\", \"USPS\": \"string\"}, low_memory=False)\n",
|
||||
"counties_df = counties_df[['USPS', 'GEOID', 'NAME']]\n",
|
||||
"counties_df.rename(columns={\"USPS\": \"State Abbreviation\", \"NAME\": \"County Name\"}, inplace=True)\n",
|
||||
"counties_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5af103da",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"states_df = pd.read_csv(STATE_CSV, dtype={\"fips\": \"string\", \"state_abbreviation\": \"string\"})\n",
|
||||
"states_df.rename(columns={\"fips\": \"State Code\", \"state_name\": \"State Name\", \"state_abbreviation\": \"State Abbreviation\"}, inplace=True)\n",
|
||||
"states_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c8680258",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"county_state_merged = counties_df.join(states_df, rsuffix=' Other')\n",
|
||||
"del county_state_merged[\"State Abbreviation Other\"]\n",
|
||||
"county_state_merged.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "58dca55a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"score_df = pd.read_csv(SCORE_CSV, dtype={\"GEOID10\": \"string\"})\n",
|
||||
"score_df[\"GEOID\"] = score_df.GEOID10.str[:5]\n",
|
||||
"score_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "45e04d42",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"score_county_state_merged = score_df.join(county_state_merged, rsuffix='_OTHER')\n",
|
||||
"del score_county_state_merged[\"GEOID_OTHER\"]\n",
|
||||
"score_county_state_merged.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a5a0b32b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"score_county_state_merged.to_csv(COUNTY_SCORE_CSV, index=False)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b690937e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
901
data/data-pipeline/ipython/scoring_comparison.ipynb
Normal file
|
@ -0,0 +1,901 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "54615cef",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Before running this script as it currently stands, you'll need to run these notebooks (in any order):\n",
|
||||
"# * score_calc.ipynb\n",
|
||||
"# * calenviroscreen_etl.ipynb\n",
|
||||
"# * hud_recap_etl.ipynb\n",
|
||||
"\n",
|
||||
"import collections\n",
|
||||
"import functools\n",
|
||||
"import IPython\n",
|
||||
"import numpy as np\n",
|
||||
"import os\n",
|
||||
"import pandas as pd\n",
|
||||
"import pathlib\n",
|
||||
"import pypandoc\n",
|
||||
"import requests\n",
|
||||
"import string\n",
|
||||
"import sys\n",
|
||||
"import typing\n",
|
||||
"import us\n",
|
||||
"import zipfile\n",
|
||||
"\n",
|
||||
"from datetime import datetime\n",
|
||||
"from tqdm.notebook import tqdm_notebook\n",
|
||||
"\n",
|
||||
"module_path = os.path.abspath(os.path.join(\"..\"))\n",
|
||||
"if module_path not in sys.path:\n",
|
||||
" sys.path.append(module_path)\n",
|
||||
"\n",
|
||||
"from utils import remove_all_from_dir, get_excel_column_name\n",
|
||||
"\n",
|
||||
"# Turn on TQDM for pandas so that we can have progress bars when running `apply`.\n",
|
||||
"tqdm_notebook.pandas()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "49a63129",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Suppress scientific notation in pandas (this shows up for census tract IDs)\n",
|
||||
"pd.options.display.float_format = \"{:.2f}\".format\n",
|
||||
"\n",
|
||||
"# Set some global parameters\n",
|
||||
"DATA_DIR = pathlib.Path.cwd().parent / \"data\"\n",
|
||||
"TEMP_DATA_DIR = pathlib.Path.cwd().parent / \"data\" / \"tmp\"\n",
|
||||
"COMPARISON_OUTPUTS_DIR = TEMP_DATA_DIR / \"comparison_outputs\"\n",
|
||||
"\n",
|
||||
"# Make the dirs if they don't exist\n",
|
||||
"TEMP_DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
|
||||
"COMPARISON_OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
"CEJST_PRIORITY_COMMUNITY_THRESHOLD = 0.75\n",
|
||||
"\n",
|
||||
"# Name fields using variables. (This makes it easy to reference the same fields frequently without using strings\n",
|
||||
"# and introducing the risk of misspelling the field name.)\n",
|
||||
"\n",
|
||||
"GEOID_FIELD_NAME = \"GEOID10\"\n",
|
||||
"GEOID_TRACT_FIELD_NAME = \"GEOID10_TRACT\"\n",
|
||||
"GEOID_STATE_FIELD_NAME = \"GEOID10_STATE\"\n",
|
||||
"CENSUS_BLOCK_GROUP_POPULATION_FIELD = \"Total population\"\n",
|
||||
"\n",
|
||||
"CEJST_SCORE_FIELD = \"cejst_score\"\n",
|
||||
"CEJST_PERCENTILE_FIELD = \"cejst_percentile\"\n",
|
||||
"CEJST_PRIORITY_COMMUNITY_FIELD = \"cejst_priority_community\"\n",
|
||||
"\n",
|
||||
"# Define some suffixes\n",
|
||||
"POPULATION_SUFFIX = \" (priority population)\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2b26dccf",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load CEJST score data\n",
|
||||
"cejst_data_path = DATA_DIR / \"score\" / \"csv\" / \"usa.csv\"\n",
|
||||
"cejst_df = pd.read_csv(cejst_data_path, dtype={GEOID_FIELD_NAME: \"string\"})\n",
|
||||
"\n",
|
||||
"# score_used = \"Score A\"\n",
|
||||
"\n",
|
||||
"# # Rename unclear name \"id\" to \"census_block_group_id\", as well as other renamings.\n",
|
||||
"# cejst_df.rename(\n",
|
||||
"# columns={\n",
|
||||
"# \"Total population\": CENSUS_BLOCK_GROUP_POPULATION_FIELD,\n",
|
||||
"# score_used: CEJST_SCORE_FIELD,\n",
|
||||
"# f\"{score_used} (percentile)\": CEJST_PERCENTILE_FIELD,\n",
|
||||
"# },\n",
|
||||
"# inplace=True,\n",
|
||||
"# errors=\"raise\",\n",
|
||||
"# )\n",
|
||||
"\n",
|
||||
"# Create the CBG's Census Tract ID by dropping the last number from the FIPS CODE of the CBG.\n",
|
||||
"# The CBG ID is the last one character.\n",
|
||||
"# For more information, see https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html.\n",
|
||||
"cejst_df.loc[:, GEOID_TRACT_FIELD_NAME] = (\n",
|
||||
" cejst_df.loc[:, GEOID_FIELD_NAME].astype(str).str[:-1]\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"cejst_df.loc[:, GEOID_STATE_FIELD_NAME] = (\n",
|
||||
" cejst_df.loc[:, GEOID_FIELD_NAME].astype(str).str[0:2]\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"cejst_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "08962382",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load CalEnviroScreen 4.0\n",
|
||||
"CALENVIROSCREEN_SCORE_FIELD = \"calenviroscreen_score\"\n",
|
||||
"CALENVIROSCREEN_PERCENTILE_FIELD = \"calenviroscreen_percentile\"\n",
|
||||
"CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD = \"calenviroscreen_priority_community\"\n",
|
||||
"\n",
|
||||
"calenviroscreen_data_path = DATA_DIR / \"dataset\" / \"calenviroscreen4\" / \"data06.csv\"\n",
|
||||
"calenviroscreen_df = pd.read_csv(\n",
|
||||
" calenviroscreen_data_path, dtype={GEOID_TRACT_FIELD_NAME: \"string\"}\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Convert priority community field to a bool.\n",
|
||||
"calenviroscreen_df[CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD] = calenviroscreen_df[\n",
|
||||
" CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD\n",
|
||||
"].astype(bool)\n",
|
||||
"\n",
|
||||
"calenviroscreen_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "42bd28d4",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load HUD data\n",
|
||||
"hud_recap_data_path = DATA_DIR / \"dataset\" / \"hud_recap\" / \"usa.csv\"\n",
|
||||
"hud_recap_df = pd.read_csv(\n",
|
||||
" hud_recap_data_path, dtype={GEOID_TRACT_FIELD_NAME: \"string\"}\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"hud_recap_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d77cd872",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Join all dataframes that use tracts\n",
|
||||
"census_tract_dfs = [calenviroscreen_df, hud_recap_df]\n",
|
||||
"\n",
|
||||
"census_tract_df = functools.reduce(\n",
|
||||
" lambda left, right: pd.merge(\n",
|
||||
" left=left, right=right, on=GEOID_TRACT_FIELD_NAME, how=\"outer\"\n",
|
||||
" ),\n",
|
||||
" census_tract_dfs,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"if census_tract_df[GEOID_TRACT_FIELD_NAME].str.len().unique() != [11]:\n",
|
||||
" raise ValueError(\"Some of the census tract data has the wrong length.\")\n",
|
||||
"\n",
|
||||
"if len(census_tract_df) > 74134:\n",
|
||||
" raise ValueError(\"Too many rows in the join.\")\n",
|
||||
"\n",
|
||||
"census_tract_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "813e5656",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Join tract indices and CEJST data.\n",
|
||||
"# Note: we're joining on the census *tract*, so there will be multiple CBG entries joined to the same census tract row from CES,\n",
|
||||
"# creating multiple rows of the same CES data.\n",
|
||||
"merged_df = cejst_df.merge(\n",
|
||||
" census_tract_df,\n",
|
||||
" how=\"left\",\n",
|
||||
" on=GEOID_TRACT_FIELD_NAME,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"if len(merged_df) > 220333:\n",
|
||||
" raise ValueError(\"Too many rows in the join.\")\n",
|
||||
"\n",
|
||||
"merged_df.head()\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# merged_df.to_csv(\n",
|
||||
"# path_or_buf=COMPARISON_OUTPUTS_DIR / \"merged.csv\", na_rep=\"\", index=False\n",
|
||||
"# )"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8a801121",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"cejst_priority_communities_fields = [\n",
|
||||
" \"Score A (top 25th percentile)\",\n",
|
||||
" \"Score B (top 25th percentile)\",\n",
|
||||
" \"Score C (top 25th percentile)\",\n",
|
||||
" \"Score D (top 25th percentile)\",\n",
|
||||
" \"Score E (top 25th percentile)\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"comparison_priority_communities_fields = [\n",
|
||||
" \"calenviroscreen_priority_community\",\n",
|
||||
" \"hud_recap_priority_community\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9fef0da9",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def get_state_distributions(\n",
|
||||
" df: pd.DataFrame, priority_communities_fields: typing.List[str]\n",
|
||||
") -> pd.DataFrame:\n",
|
||||
" \"\"\"For each boolean field of priority communities, calculate distribution across states and territories.\"\"\"\n",
|
||||
"\n",
|
||||
" # Ensure each field is boolean.\n",
|
||||
" for priority_communities_field in priority_communities_fields:\n",
|
||||
" if df[priority_communities_field].dtype != bool:\n",
|
||||
" print(f\"Converting {priority_communities_field} to boolean.\")\n",
|
||||
"\n",
|
||||
" # Calculate the population included as priority communities per CBG. Will either be 0 or the population.\n",
|
||||
" df[f\"{priority_communities_field}{POPULATION_SUFFIX}\"] = (\n",
|
||||
" df[priority_communities_field] * df[CENSUS_BLOCK_GROUP_POPULATION_FIELD]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" def calculate_state_comparison(frame: pd.DataFrame) -> pd.DataFrame:\n",
|
||||
" \"\"\"\n",
|
||||
" This method will be applied to a `group_by` object. Inherits some parameters from outer scope.\n",
|
||||
" \"\"\"\n",
|
||||
" state_id = frame[GEOID_STATE_FIELD_NAME].unique()[0]\n",
|
||||
"\n",
|
||||
" summary_dict = {}\n",
|
||||
" summary_dict[GEOID_STATE_FIELD_NAME] = state_id\n",
|
||||
" summary_dict[\"State name\"] = us.states.lookup(state_id).name\n",
|
||||
" summary_dict[\"Total CBGs in state\"] = len(frame)\n",
|
||||
" summary_dict[\"Total population in state\"] = frame[\n",
|
||||
" CENSUS_BLOCK_GROUP_POPULATION_FIELD\n",
|
||||
" ].sum()\n",
|
||||
"\n",
|
||||
" for priority_communities_field in priority_communities_fields:\n",
|
||||
" summary_dict[f\"{priority_communities_field}{POPULATION_SUFFIX}\"] = frame[\n",
|
||||
" f\"{priority_communities_field}{POPULATION_SUFFIX}\"\n",
|
||||
" ].sum()\n",
|
||||
"\n",
|
||||
" summary_dict[f\"{priority_communities_field} (total CBGs)\"] = frame[\n",
|
||||
" f\"{priority_communities_field}\"\n",
|
||||
" ].sum()\n",
|
||||
"\n",
|
||||
" # Calculate some combinations of other variables.\n",
|
||||
" summary_dict[f\"{priority_communities_field} (percent CBGs)\"] = (\n",
|
||||
" summary_dict[f\"{priority_communities_field} (total CBGs)\"]\n",
|
||||
" / summary_dict[\"Total CBGs in state\"]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" summary_dict[f\"{priority_communities_field} (percent population)\"] = (\n",
|
||||
" summary_dict[f\"{priority_communities_field}{POPULATION_SUFFIX}\"]\n",
|
||||
" / summary_dict[\"Total population in state\"]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" df = pd.DataFrame(summary_dict, index=[0])\n",
|
||||
"\n",
|
||||
" return df\n",
|
||||
"\n",
|
||||
" grouped_df = df.groupby(GEOID_STATE_FIELD_NAME)\n",
|
||||
"\n",
|
||||
" # Run the comparison function on the groups.\n",
|
||||
" state_distribution_df = grouped_df.progress_apply(calculate_state_comparison)\n",
|
||||
"\n",
|
||||
" return state_distribution_df\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def write_state_distribution_excel(\n",
|
||||
" state_distribution_df: pd.DataFrame, file_path: pathlib.PosixPath\n",
|
||||
") -> None:\n",
|
||||
" \"\"\"Write the dataframe to excel with special formatting.\"\"\"\n",
|
||||
" # Create a Pandas Excel writer using XlsxWriter as the engine.\n",
|
||||
" writer = pd.ExcelWriter(file_path, engine=\"xlsxwriter\")\n",
|
||||
"\n",
|
||||
" # Convert the dataframe to an XlsxWriter Excel object. We also turn off the\n",
|
||||
" # index column at the left of the output dataframe.\n",
|
||||
" state_distribution_df.to_excel(writer, sheet_name=\"Sheet1\", index=False)\n",
|
||||
"\n",
|
||||
" # Get the xlsxwriter workbook and worksheet objects.\n",
|
||||
" workbook = writer.book\n",
|
||||
" worksheet = writer.sheets[\"Sheet1\"]\n",
|
||||
" worksheet.autofilter(\n",
|
||||
" 0, 0, state_distribution_df.shape[0], state_distribution_df.shape[1]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" for column in state_distribution_df.columns:\n",
|
||||
" # Special formatting for columns that capture the percent of population considered priority.\n",
|
||||
" if \"(percent population)\" in column:\n",
|
||||
" # Turn the column index into excel ranges (e.g., column #95 is \"CR\" and the range may be \"CR2:CR53\").\n",
|
||||
" column_index = state_distribution_df.columns.get_loc(column)\n",
|
||||
" column_character = get_excel_column_name(column_index)\n",
|
||||
" column_ranges = (\n",
|
||||
" f\"{column_character}2:{column_character}{len(state_distribution_df)+1}\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Add green to red conditional formatting.\n",
|
||||
" worksheet.conditional_format(\n",
|
||||
" column_ranges,\n",
|
||||
" # Min: green, max: red.\n",
|
||||
" {\n",
|
||||
" \"type\": \"2_color_scale\",\n",
|
||||
" \"min_color\": \"#00FF7F\",\n",
|
||||
" \"max_color\": \"#C82538\",\n",
|
||||
" },\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # TODO: text wrapping not working, fix.\n",
|
||||
" text_wrap = workbook.add_format({\"text_wrap\": True})\n",
|
||||
"\n",
|
||||
" # Make these columns wide enough that you can read them.\n",
|
||||
" worksheet.set_column(\n",
|
||||
" f\"{column_character}:{column_character}\", 40, text_wrap\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" writer.save()\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"state_distribution_df = get_state_distributions(\n",
|
||||
" df=merged_df,\n",
|
||||
" priority_communities_fields=cejst_priority_communities_fields\n",
|
||||
" + comparison_priority_communities_fields,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"state_distribution_df.to_csv(\n",
|
||||
" path_or_buf=COMPARISON_OUTPUTS_DIR / \"Priority CBGs by state.csv\",\n",
|
||||
" na_rep=\"\",\n",
|
||||
" index=False,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"write_state_distribution_excel(\n",
|
||||
" state_distribution_df=state_distribution_df,\n",
|
||||
" file_path=COMPARISON_OUTPUTS_DIR / \"Priority CBGs by state.xlsx\",\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"state_distribution_df.head()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d46667cf",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# This cell defines a couple of comparison functions. It does not run them.\n",
|
||||
"\n",
|
||||
"# Define a namedtuple for column names, which need to be shared between multiple parts of this comparison pipeline.\n",
|
||||
"# Named tuples are useful here because they provide guarantees that for each instance, all properties are defined and\n",
|
||||
"# can be accessed as properties (rather than as strings).\n",
|
||||
"\n",
|
||||
"# Note: if you'd like to add a field used throughout the comparison process, add it in three places.\n",
|
||||
"# For an example `new_field`,\n",
|
||||
"# 1. in this namedtuple, add the field as a string in `field_names` (e.g., `field_names=[..., \"new_field\"])`)\n",
|
||||
"# 2. in the function `get_comparison_field_names`, define how the field name should be created from input data\n",
|
||||
"# (e.g., `...new_field=f\"New field compares {method_a_name} to {method_b_name}\")\n",
|
||||
"# 3. In the function `get_comparison_markdown_content`, add some reporting on the new field to the markdown content.\n",
|
||||
"# (e.g., `The statistics indicate that {calculation_based_on_new_field} percent of census tracts are different between scores.`)\n",
|
||||
"ComparisonFieldNames = collections.namedtuple(\n",
|
||||
" typename=\"ComparisonFieldNames\",\n",
|
||||
" field_names=[\n",
|
||||
" \"any_tract_has_at_least_one_method_a_cbg\",\n",
|
||||
" \"method_b_tract_has_at_least_one_method_a_cbg\",\n",
|
||||
" \"method_b_tract_has_100_percent_method_a_cbg\",\n",
|
||||
" \"method_b_non_priority_tract_has_at_least_one_method_a_cbg\",\n",
|
||||
" \"method_b_non_priority_tract_has_100_percent_method_a_cbg\",\n",
|
||||
" ],\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# Define a namedtuple for indices.\n",
|
||||
"Index = collections.namedtuple(\n",
|
||||
" typename=\"Index\",\n",
|
||||
" field_names=[\n",
|
||||
" \"method_name\",\n",
|
||||
" \"priority_communities_field\",\n",
|
||||
" # Note: this field only used by indices defined at the census tract level.\n",
|
||||
" \"other_census_tract_fields_to_keep\",\n",
|
||||
" ],\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_comparison_field_names(\n",
|
||||
" method_a_name: str,\n",
|
||||
" method_b_name: str,\n",
|
||||
") -> ComparisonFieldNames:\n",
|
||||
" comparison_field_names = ComparisonFieldNames(\n",
|
||||
" any_tract_has_at_least_one_method_a_cbg=(\n",
|
||||
" f\"Any tract has at least one {method_a_name} Priority CBG?\"\n",
|
||||
" ),\n",
|
||||
" method_b_tract_has_at_least_one_method_a_cbg=(\n",
|
||||
" f\"{method_b_name} priority tract has at least one {method_a_name} CBG?\"\n",
|
||||
" ),\n",
|
||||
" method_b_tract_has_100_percent_method_a_cbg=(\n",
|
||||
" f\"{method_b_name} tract has 100% {method_a_name} priority CBGs?\"\n",
|
||||
" ),\n",
|
||||
" method_b_non_priority_tract_has_at_least_one_method_a_cbg=(\n",
|
||||
" f\"Non-priority {method_b_name} tract has at least one {method_a_name} priority CBG?\"\n",
|
||||
" ),\n",
|
||||
" method_b_non_priority_tract_has_100_percent_method_a_cbg=(\n",
|
||||
" f\"Non-priority {method_b_name} tract has 100% {method_a_name} priority CBGs?\"\n",
|
||||
" ),\n",
|
||||
" )\n",
|
||||
" return comparison_field_names\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_df_with_only_shared_states(\n",
|
||||
" df: pd.DataFrame,\n",
|
||||
" field_a: str,\n",
|
||||
" field_b: str,\n",
|
||||
" state_field=GEOID_STATE_FIELD_NAME,\n",
|
||||
") -> pd.DataFrame:\n",
|
||||
" \"\"\"\n",
|
||||
" Useful for looking at shared geographies across two fields.\n",
|
||||
"\n",
|
||||
" For a data frame and two fields, return a data frame only for states where there are non-null\n",
|
||||
" values for both fields in that state (or territory).\n",
|
||||
"\n",
|
||||
" This is useful, for example, when running a comparison of CalEnviroScreen (only in California) against\n",
|
||||
" a draft score that's national, and returning only the data for California for the entire data frame.\n",
|
||||
" \"\"\"\n",
|
||||
" field_a_states = df.loc[df[field_a].notnull(), state_field].unique()\n",
|
||||
" field_b_states = df.loc[df[field_b].notnull(), state_field].unique()\n",
|
||||
"\n",
|
||||
" shared_states = list(set(field_a_states) & set(field_b_states))\n",
|
||||
"\n",
|
||||
" df = df.loc[df[state_field].isin(shared_states), :]\n",
|
||||
"\n",
|
||||
" return df\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_comparison_df(\n",
|
||||
" df: pd.DataFrame,\n",
|
||||
" method_a_priority_census_block_groups_field: str,\n",
|
||||
" method_b_priority_census_tracts_field: str,\n",
|
||||
" other_census_tract_fields_to_keep: typing.Optional[typing.List[str]],\n",
|
||||
" comparison_field_names: ComparisonFieldNames,\n",
|
||||
" output_dir: pathlib.PosixPath,\n",
|
||||
") -> None:\n",
|
||||
" \"\"\"Produces a comparison report for any two given boolean columns representing priority fields.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" df: a pandas dataframe including the data for this comparison.\n",
|
||||
" method_a_priority_census_block_groups_field: the name of a boolean column in `df`, such as the CEJST priority\n",
|
||||
" community field that defines communities at the level of census block groups (CBGs).\n",
|
||||
" method_b_priority_census_tracts_field: the name of a boolean column in `df`, such as the CalEnviroScreen priority\n",
|
||||
" community field that defines communities at the level of census tracts.\n",
|
||||
" other_census_tract_fields_to_keep (optional): a list of field names to preserve at the census tract level\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" df: a pandas dataframe with one row with the results of this comparison\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" def calculate_comparison(frame: pd.DataFrame) -> pd.DataFrame:\n",
|
||||
" \"\"\"\n",
|
||||
" This method will be applied to a `group_by` object.\n",
|
||||
"\n",
|
||||
" Note: It inherits from outer scope `method_a_priority_census_block_groups_field`, `method_b_priority_census_tracts_field`,\n",
|
||||
" and `other_census_tract_fields_to_keep`.\n",
|
||||
" \"\"\"\n",
|
||||
" # Keep all the tract values at the Census Tract Level\n",
|
||||
" for field in other_census_tract_fields_to_keep:\n",
|
||||
" if len(frame[field].unique()) != 1:\n",
|
||||
" raise ValueError(\n",
|
||||
" f\"There are different values per CBG for field {field}.\"\n",
|
||||
" \"`other_census_tract_fields_to_keep` can only be used for fields at the census tract level.\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" df = frame.loc[\n",
|
||||
" frame.index[0],\n",
|
||||
" [\n",
|
||||
" GEOID_TRACT_FIELD_NAME,\n",
|
||||
" method_b_priority_census_tracts_field,\n",
|
||||
" ]\n",
|
||||
" + other_census_tract_fields_to_keep,\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
" # Convenience constant for whether the tract is or is not a method B priority community.\n",
|
||||
" is_a_method_b_priority_tract = frame.loc[\n",
|
||||
" frame.index[0], [method_b_priority_census_tracts_field]\n",
|
||||
" ][0]\n",
|
||||
"\n",
|
||||
" # Recall that NaN values are not falsy, so we need to check if `is_a_method_b_priority_tract` is True.\n",
|
||||
" is_a_method_b_priority_tract = is_a_method_b_priority_tract is True\n",
|
||||
"\n",
|
||||
" # Calculate whether the tract (whether or not it is a comparison priority tract) includes CBGs that are priority\n",
|
||||
" # according to the current CBG score.\n",
|
||||
" df[comparison_field_names.any_tract_has_at_least_one_method_a_cbg] = (\n",
|
||||
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Calculate comparison\n",
|
||||
" # A comparison priority tract has at least one CBG that is a priority CBG.\n",
|
||||
" df[comparison_field_names.method_b_tract_has_at_least_one_method_a_cbg] = (\n",
|
||||
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
|
||||
" if is_a_method_b_priority_tract\n",
|
||||
" else None\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # A comparison priority tract has all of its contained CBGs as CBG priority CBGs.\n",
|
||||
" df[comparison_field_names.method_b_tract_has_100_percent_method_a_cbg] = (\n",
|
||||
" frame.loc[:, method_a_priority_census_block_groups_field].mean() == 1\n",
|
||||
" if is_a_method_b_priority_tract\n",
|
||||
" else None\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Calculate the inverse\n",
|
||||
" # A tract that is _not_ a comparison priority has at least one CBG priority CBG.\n",
|
||||
" df[\n",
|
||||
" comparison_field_names.method_b_non_priority_tract_has_at_least_one_method_a_cbg\n",
|
||||
" ] = (\n",
|
||||
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
|
||||
" if not is_a_method_b_priority_tract\n",
|
||||
" else None\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # A tract that is _not_ a comparison priority has all of its contained CBGs as CBG priority CBGs.\n",
|
||||
" df[\n",
|
||||
" comparison_field_names.method_b_non_priority_tract_has_100_percent_method_a_cbg\n",
|
||||
" ] = (\n",
|
||||
" frame.loc[:, method_a_priority_census_block_groups_field].mean() == 1\n",
|
||||
" if not is_a_method_b_priority_tract\n",
|
||||
" else None\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" return df\n",
|
||||
"\n",
|
||||
" # Group all data by the census tract.\n",
|
||||
" grouped_df = df.groupby(GEOID_TRACT_FIELD_NAME)\n",
|
||||
"\n",
|
||||
" # Run the comparison function on the groups.\n",
|
||||
" comparison_df = grouped_df.progress_apply(calculate_comparison)\n",
|
||||
"\n",
|
||||
" return comparison_df\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_comparison_markdown_content(\n",
|
||||
" original_df: pd.DataFrame,\n",
|
||||
" comparison_df: pd.DataFrame,\n",
|
||||
" comparison_field_names: ComparisonFieldNames,\n",
|
||||
" method_a_name: str,\n",
|
||||
" method_b_name: str,\n",
|
||||
" method_a_priority_census_block_groups_field: str,\n",
|
||||
" method_b_priority_census_tracts_field: str,\n",
|
||||
" state_field: str = GEOID_STATE_FIELD_NAME,\n",
|
||||
") -> str:\n",
|
||||
" # Prepare some constants for use in the following Markdown content.\n",
|
||||
" total_cbgs = len(original_df)\n",
|
||||
"\n",
|
||||
" # List of all states/territories in their FIPS codes:\n",
|
||||
" state_ids = sorted(original_df[state_field].unique())\n",
|
||||
" state_names = \", \".join([us.states.lookup(state_id).name for state_id in state_ids])\n",
|
||||
"\n",
|
||||
" # Note: using squeeze throughout do reduce result of `sum()` to a scalar.\n",
|
||||
" # TODO: investigate why sums are sometimes series and sometimes scalar.\n",
|
||||
" method_a_priority_cbgs = (\n",
|
||||
" original_df.loc[:, method_a_priority_census_block_groups_field].sum().squeeze()\n",
|
||||
" )\n",
|
||||
" method_a_priority_cbgs_percent = f\"{method_a_priority_cbgs / total_cbgs:.0%}\"\n",
|
||||
"\n",
|
||||
" total_tracts_count = len(comparison_df)\n",
|
||||
"\n",
|
||||
" method_b_priority_tracts_count = comparison_df.loc[\n",
|
||||
" :, method_b_priority_census_tracts_field\n",
|
||||
" ].sum()\n",
|
||||
"\n",
|
||||
" method_b_priority_tracts_count_percent = (\n",
|
||||
" f\"{method_b_priority_tracts_count / total_tracts_count:.0%}\"\n",
|
||||
" )\n",
|
||||
" method_b_non_priority_tracts_count = (\n",
|
||||
" total_tracts_count - method_b_priority_tracts_count\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" method_a_tracts_count = (\n",
|
||||
" comparison_df.loc[\n",
|
||||
" :, comparison_field_names.any_tract_has_at_least_one_method_a_cbg\n",
|
||||
" ]\n",
|
||||
" .sum()\n",
|
||||
" .squeeze()\n",
|
||||
" )\n",
|
||||
" method_a_tracts_count_percent = f\"{method_a_tracts_count / total_tracts_count:.0%}\"\n",
|
||||
"\n",
|
||||
" # Method A priority community stats\n",
|
||||
" method_b_tracts_with_at_least_one_method_a_cbg = comparison_df.loc[\n",
|
||||
" :, comparison_field_names.method_b_tract_has_at_least_one_method_a_cbg\n",
|
||||
" ].sum()\n",
|
||||
" method_b_tracts_with_at_least_one_method_a_cbg_percent = f\"{method_b_tracts_with_at_least_one_method_a_cbg / method_b_priority_tracts_count:.0%}\"\n",
|
||||
"\n",
|
||||
" method_b_tracts_with_at_100_percent_method_a_cbg = comparison_df.loc[\n",
|
||||
" :, comparison_field_names.method_b_tract_has_100_percent_method_a_cbg\n",
|
||||
" ].sum()\n",
|
||||
" method_b_tracts_with_at_100_percent_method_a_cbg_percent = f\"{method_b_tracts_with_at_100_percent_method_a_cbg / method_b_priority_tracts_count:.0%}\"\n",
|
||||
"\n",
|
||||
" # Method A non-priority community stats\n",
|
||||
" method_b_non_priority_tracts_with_at_least_one_method_a_cbg = comparison_df.loc[\n",
|
||||
" :,\n",
|
||||
" comparison_field_names.method_b_non_priority_tract_has_at_least_one_method_a_cbg,\n",
|
||||
" ].sum()\n",
|
||||
"\n",
|
||||
" method_b_non_priority_tracts_with_at_least_one_method_a_cbg_percent = f\"{method_b_non_priority_tracts_with_at_least_one_method_a_cbg / method_b_non_priority_tracts_count:.0%}\"\n",
|
||||
"\n",
|
||||
" method_b_non_priority_tracts_with_100_percent_method_a_cbg = comparison_df.loc[\n",
|
||||
" :,\n",
|
||||
" comparison_field_names.method_b_non_priority_tract_has_100_percent_method_a_cbg,\n",
|
||||
" ].sum()\n",
|
||||
" method_b_non_priority_tracts_with_100_percent_method_a_cbg_percent = f\"{method_b_non_priority_tracts_with_100_percent_method_a_cbg / method_b_non_priority_tracts_count:.0%}\"\n",
|
||||
"\n",
|
||||
" # Create markdown content for comparisons.\n",
|
||||
" markdown_content = f\"\"\"\n",
|
||||
"# {method_a_name} compared to {method_b_name}\n",
|
||||
"\n",
|
||||
"(This report was calculated on {datetime.today().strftime('%Y-%m-%d')}.)\n",
|
||||
"\n",
|
||||
"This report analyzes the following US states and territories: {state_names}.\n",
|
||||
"\n",
|
||||
"Recall that census tracts contain one or more census block groups, with up to nine census block groups per tract.\n",
|
||||
"\n",
|
||||
"Within the geographic area analyzed, there are {method_b_priority_tracts_count} census tracts designated as priority communities by {method_b_name}, out of {total_tracts_count} total tracts ({method_b_priority_tracts_count_percent}). \n",
|
||||
"\n",
|
||||
"Within the geographic region analyzed, there are {method_a_priority_cbgs} census block groups considered as priority communities by {method_a_name}, out of {total_cbgs} CBGs ({method_a_priority_cbgs_percent}). They occupy {method_a_tracts_count} census tracts ({method_a_tracts_count_percent}) of the geographic area analyzed.\n",
|
||||
"\n",
|
||||
"Out of every {method_b_name} priority census tract, {method_b_tracts_with_at_least_one_method_a_cbg} ({method_b_tracts_with_at_least_one_method_a_cbg_percent}) of these census tracts have at least one census block group within them that is considered a priority community by {method_a_name}.\n",
|
||||
"\n",
|
||||
"Out of every {method_b_name} priority census tract, {method_b_tracts_with_at_100_percent_method_a_cbg} ({method_b_tracts_with_at_100_percent_method_a_cbg_percent}) of these census tracts have 100% of the included census block groups within them considered priority communities by {method_a_name}.\n",
|
||||
"\n",
|
||||
"Out of every census tract that is __not__ marked as a priority community by {method_b_name}, {method_b_non_priority_tracts_with_at_least_one_method_a_cbg} ({method_b_non_priority_tracts_with_at_least_one_method_a_cbg_percent}) of these census tracts have at least one census block group within them that is considered a priority community by the current version of the CEJST score.\n",
|
||||
"\n",
|
||||
"Out of every census tract that is __not__ marked as a priority community by {method_b_name}, {method_b_non_priority_tracts_with_100_percent_method_a_cbg} ({method_b_non_priority_tracts_with_100_percent_method_a_cbg_percent}) of these census tracts have 100% of the included census block groups within them considered priority communities by the current version of the CEJST score.\n",
|
||||
"\"\"\"\n",
|
||||
"\n",
|
||||
" return markdown_content\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def write_markdown_and_docx_content(\n",
|
||||
" markdown_content: str, file_dir: pathlib.PosixPath, file_name_without_extension: str\n",
|
||||
") -> pathlib.PosixPath:\n",
|
||||
" \"\"\"Write Markdown content to both .md and .docx files.\"\"\"\n",
|
||||
" # Set the file paths for both files.\n",
|
||||
" markdown_file_path = file_dir / f\"{file_name_without_extension}.md\"\n",
|
||||
" docx_file_path = file_dir / f\"{file_name_without_extension}.docx\"\n",
|
||||
"\n",
|
||||
" # Write the markdown content to file.\n",
|
||||
" with open(markdown_file_path, \"w\") as text_file:\n",
|
||||
" text_file.write(markdown_content)\n",
|
||||
"\n",
|
||||
" # Convert markdown file to Word doc.\n",
|
||||
" pypandoc.convert_file(\n",
|
||||
" source_file=str(markdown_file_path),\n",
|
||||
" to=\"docx\",\n",
|
||||
" outputfile=str(docx_file_path),\n",
|
||||
" extra_args=[],\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" return docx_file_path\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def execute_comparison(\n",
|
||||
" df: pd.DataFrame,\n",
|
||||
" method_a_name: str,\n",
|
||||
" method_b_name: str,\n",
|
||||
" method_a_priority_census_block_groups_field: str,\n",
|
||||
" method_b_priority_census_tracts_field: str,\n",
|
||||
" other_census_tract_fields_to_keep: typing.Optional[typing.List[str]],\n",
|
||||
") -> pathlib.PosixPath:\n",
|
||||
" \"\"\"Execute an individual comparison by creating the data frame and writing the report.\n",
|
||||
"\n",
|
||||
" Args:\n",
|
||||
" df: a pandas dataframe including the data for this comparison.\n",
|
||||
" method_a_priority_census_block_groups_field: the name of a boolean column in `df`, such as the CEJST priority\n",
|
||||
" community field that defines communities at the level of census block groups (CBGs).\n",
|
||||
" method_b_priority_census_tracts_field: the name of a boolean column in `df`, such as the CalEnviroScreen priority\n",
|
||||
" community field that defines communities at the level of census tracts.\n",
|
||||
" other_census_tract_fields_to_keep (optional): a list of field names to preserve at the census tract level\n",
|
||||
"\n",
|
||||
" Returns:\n",
|
||||
" df: a pandas dataframe with one row with the results of this comparison\n",
|
||||
"\n",
|
||||
" \"\"\"\n",
|
||||
" comparison_field_names = get_comparison_field_names(\n",
|
||||
" method_a_name=method_a_name, method_b_name=method_b_name\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Create or use a directory for outputs grouped by Method A.\n",
|
||||
" output_dir = COMPARISON_OUTPUTS_DIR / method_a_name\n",
|
||||
" output_dir.mkdir(parents=True, exist_ok=True)\n",
|
||||
"\n",
|
||||
" df_with_only_shared_states = get_df_with_only_shared_states(\n",
|
||||
" df=df,\n",
|
||||
" field_a=method_a_priority_census_block_groups_field,\n",
|
||||
" field_b=method_b_priority_census_tracts_field,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" comparison_df = get_comparison_df(\n",
|
||||
" df=df_with_only_shared_states,\n",
|
||||
" method_a_priority_census_block_groups_field=method_a_priority_census_block_groups_field,\n",
|
||||
" method_b_priority_census_tracts_field=method_b_priority_census_tracts_field,\n",
|
||||
" comparison_field_names=comparison_field_names,\n",
|
||||
" other_census_tract_fields_to_keep=other_census_tract_fields_to_keep,\n",
|
||||
" output_dir=output_dir,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Choose output path\n",
|
||||
" file_path = (\n",
|
||||
" output_dir / f\"Comparison Output - {method_a_name} and {method_b_name}.csv\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Write comparison to CSV.\n",
|
||||
" comparison_df.to_csv(\n",
|
||||
" path_or_buf=file_path,\n",
|
||||
" na_rep=\"\",\n",
|
||||
" index=False,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" markdown_content = get_comparison_markdown_content(\n",
|
||||
" original_df=df_with_only_shared_states,\n",
|
||||
" comparison_df=comparison_df,\n",
|
||||
" comparison_field_names=comparison_field_names,\n",
|
||||
" method_a_name=method_a_name,\n",
|
||||
" method_b_name=method_b_name,\n",
|
||||
" method_a_priority_census_block_groups_field=method_a_priority_census_block_groups_field,\n",
|
||||
" method_b_priority_census_tracts_field=method_b_priority_census_tracts_field,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" comparison_docx_file_path = write_markdown_and_docx_content(\n",
|
||||
" markdown_content=markdown_content,\n",
|
||||
" file_dir=output_dir,\n",
|
||||
" file_name_without_extension=f\"Comparison report - {method_a_name} and {method_b_name}\",\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" return comparison_docx_file_path\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def execute_comparisons(\n",
|
||||
" df: pd.DataFrame,\n",
|
||||
" census_block_group_indices: typing.List[Index],\n",
|
||||
" census_tract_indices: typing.List[Index],\n",
|
||||
"):\n",
|
||||
" \"\"\"Create multiple comparison reports.\"\"\"\n",
|
||||
" comparison_docx_file_paths = []\n",
|
||||
" for cbg_index in census_block_group_indices:\n",
|
||||
" for census_tract_index in census_tract_indices:\n",
|
||||
" print(\n",
|
||||
" f\"Running comparisons for {cbg_index.method_name} against {census_tract_index.method_name}...\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" comparison_docx_file_path = execute_comparison(\n",
|
||||
" df=df,\n",
|
||||
" method_a_name=cbg_index.method_name,\n",
|
||||
" method_b_name=census_tract_index.method_name,\n",
|
||||
" method_a_priority_census_block_groups_field=cbg_index.priority_communities_field,\n",
|
||||
" method_b_priority_census_tracts_field=census_tract_index.priority_communities_field,\n",
|
||||
" other_census_tract_fields_to_keep=census_tract_index.other_census_tract_fields_to_keep,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" comparison_docx_file_paths.append(comparison_docx_file_path)\n",
|
||||
"\n",
|
||||
" return comparison_docx_file_paths"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "48d9bf6b",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Actually execute the functions\n",
|
||||
"\n",
|
||||
"# # California only\n",
|
||||
"# cal_df = merged_df[merged_df[GEOID_TRACT_FIELD_NAME].astype(str).str[0:2] == \"06\"]\n",
|
||||
"# # cal_df = cal_df[0:1000]\n",
|
||||
"# print(len(cal_df))\n",
|
||||
"\n",
|
||||
"census_block_group_indices = [\n",
|
||||
" Index(\n",
|
||||
" method_name=\"Score A\",\n",
|
||||
" priority_communities_field=\"Score A (top 25th percentile)\",\n",
|
||||
" other_census_tract_fields_to_keep=[],\n",
|
||||
" ),\n",
|
||||
" # Index(\n",
|
||||
" # method_name=\"Score B\",\n",
|
||||
" # priority_communities_field=\"Score B (top 25th percentile)\",\n",
|
||||
" # other_census_tract_fields_to_keep=[],\n",
|
||||
" # ),\n",
|
||||
" Index(\n",
|
||||
" method_name=\"Score C\",\n",
|
||||
" priority_communities_field=\"Score C (top 25th percentile)\",\n",
|
||||
" other_census_tract_fields_to_keep=[],\n",
|
||||
" ),\n",
|
||||
" Index(\n",
|
||||
" method_name=\"Score D\",\n",
|
||||
" priority_communities_field=\"Score D (top 25th percentile)\",\n",
|
||||
" other_census_tract_fields_to_keep=[],\n",
|
||||
" ),\n",
|
||||
" # Index(\n",
|
||||
" # method_name=\"Score E\",\n",
|
||||
" # priority_communities_field=\"Score E (top 25th percentile)\",\n",
|
||||
" # other_census_tract_fields_to_keep=[],\n",
|
||||
" # ),\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"census_tract_indices = [\n",
|
||||
" Index(\n",
|
||||
" method_name=\"CalEnviroScreen 4.0\",\n",
|
||||
" priority_communities_field=\"calenviroscreen_priority_community\",\n",
|
||||
" other_census_tract_fields_to_keep=[\n",
|
||||
" CALENVIROSCREEN_SCORE_FIELD,\n",
|
||||
" CALENVIROSCREEN_PERCENTILE_FIELD,\n",
|
||||
" ],\n",
|
||||
" ),\n",
|
||||
" Index(\n",
|
||||
" method_name=\"HUD RECAP\",\n",
|
||||
" priority_communities_field=\"hud_recap_priority_community\",\n",
|
||||
" other_census_tract_fields_to_keep=[],\n",
|
||||
" ),\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"file_paths = execute_comparisons(\n",
|
||||
" df=merged_df,\n",
|
||||
" census_block_group_indices=census_block_group_indices,\n",
|
||||
" census_tract_indices=census_tract_indices,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(file_paths)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
1927
data/data-pipeline/poetry.lock
generated
Normal file
File diff suppressed because it is too large
26
data/data-pipeline/pyproject.toml
Normal file
|
@ -0,0 +1,26 @@
|
|||
[tool.poetry]
|
||||
name = "score"
|
||||
version = "0.1.0"
|
||||
description = "ETL and Generation of Justice 40 Score"
|
||||
authors = ["Your Name <you@example.com>"]
|
||||
|
||||
[tool.poetry.dependencies]
|
||||
python = "^3.7.1"
|
||||
CensusData = "^1.13"
|
||||
click = "^8.0.1"
|
||||
dynaconf = "^3.1.4"
|
||||
ipython = "^7.24.1"
|
||||
jupyter = "^1.0.0"
|
||||
jupyter-contrib-nbextensions = "^0.5.1"
|
||||
numpy = "^1.21.0"
|
||||
pandas = "^1.2.5"
|
||||
requests = "^2.25.1"
|
||||
types-requests = "^2.25.0"
|
||||
|
||||
[tool.poetry.dev-dependencies]
|
||||
mypy = "^0.910"
|
||||
black = {version = "^21.6b0", allow-prereleases = true}
|
||||
|
||||
[build-system]
|
||||
requires = ["poetry-core>=1.0.0"]
|
||||
build-backend = "poetry.core.masonry.api"
|
83
data/data-pipeline/requirements.txt
Normal file
|
@ -0,0 +1,83 @@
|
|||
appnope==0.1.2; sys_platform == "darwin" and python_version >= "3.7"
|
||||
argon2-cffi==20.1.0; python_version >= "3.6"
|
||||
async-generator==1.10; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
attrs==21.2.0; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.5"
|
||||
backcall==0.2.0; python_version >= "3.7"
|
||||
bleach==3.3.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
censusdata==1.13; python_version >= "2.7"
|
||||
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
|
||||
cffi==1.14.6; implementation_name == "pypy" and python_version >= "3.6"
|
||||
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
|
||||
click==8.0.1; python_version >= "3.6"
|
||||
colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and sys_platform == "win32" and platform_system == "Windows" or sys_platform == "win32" and python_version >= "3.7" and python_full_version >= "3.5.0" and platform_system == "Windows"
|
||||
debugpy==1.3.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
decorator==5.0.9; python_version >= "3.7"
|
||||
defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
dynaconf==3.1.4
|
||||
entrypoints==0.3; python_version >= "3.7"
|
||||
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
|
||||
importlib-metadata==3.10.1; python_version < "3.8" and python_version >= "3.7"
|
||||
ipykernel==6.0.1; python_version >= "3.7"
|
||||
ipython-genutils==0.2.0; python_version >= "3.7"
|
||||
ipython==7.25.0; python_version >= "3.7"
|
||||
ipywidgets==7.6.3
|
||||
jedi==0.18.0; python_version >= "3.7"
|
||||
jinja2==3.0.1; python_version >= "3.7"
|
||||
jsonschema==3.2.0; python_version >= "3.5"
|
||||
jupyter-client==6.2.0; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
jupyter-console==6.4.0; python_version >= "3.6"
|
||||
jupyter-contrib-core==0.3.3
|
||||
jupyter-contrib-nbextensions==0.5.1
|
||||
jupyter-core==4.7.1; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
jupyter-highlight-selected-word==0.2.0
|
||||
jupyter-latex-envs==1.4.6
|
||||
jupyter-nbextensions-configurator==0.4.1
|
||||
jupyter==1.0.0
|
||||
jupyterlab-pygments==0.1.2; python_version >= "3.7"
|
||||
jupyterlab-widgets==1.0.0; python_version >= "3.6"
|
||||
lxml==4.6.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
|
||||
markupsafe==2.0.1; python_version >= "3.7"
|
||||
matplotlib-inline==0.1.2; platform_system == "Darwin" and python_version >= "3.7"
|
||||
mistune==0.8.4; python_version >= "3.7"
|
||||
nbclient==0.5.3; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
nbconvert==6.1.0; python_version >= "3.7"
|
||||
nbformat==5.1.3; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
nest-asyncio==1.5.1; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
notebook==6.4.0; python_version >= "3.6"
|
||||
numpy==1.21.0; python_version >= "3.7"
|
||||
packaging==21.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
pandas==1.3.0; python_full_version >= "3.7.1"
|
||||
pandocfilters==1.4.3; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
|
||||
parso==0.8.2; python_version >= "3.7"
|
||||
pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7"
|
||||
pickleshare==0.7.5; python_version >= "3.7"
|
||||
prometheus-client==0.11.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
prompt-toolkit==3.0.19; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
ptyprocess==0.7.0; sys_platform != "win32" and python_version >= "3.7" and os_name != "nt"
|
||||
py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0"
|
||||
pycparser==2.20; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
|
||||
pygments==2.9.0; python_version >= "3.7"
|
||||
pyparsing==2.4.7; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
pyrsistent==0.18.0; python_version >= "3.6"
|
||||
python-dateutil==2.8.1; python_full_version >= "3.7.1" and python_version >= "3.7"
|
||||
pytz==2021.1; python_full_version >= "3.7.1" and python_version >= "2.7"
|
||||
pywin32==301; sys_platform == "win32" and python_version >= "3.6"
|
||||
pywinpty==1.1.3; os_name == "nt" and python_version >= "3.6"
|
||||
pyyaml==5.4.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
|
||||
pyzmq==22.1.0; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
qtconsole==5.1.1; python_version >= "3.6"
|
||||
qtpy==1.9.0; python_version >= "3.6"
|
||||
requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
|
||||
send2trash==1.7.1; python_version >= "3.6"
|
||||
six==1.16.0; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.6") and (python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.5")
|
||||
terminado==0.10.1; python_version >= "3.6"
|
||||
testpath==0.5.0; python_version >= "3.7"
|
||||
tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
traitlets==5.0.5; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
types-requests==2.25.0
|
||||
typing-extensions==3.10.0.0; python_version < "3.8" and python_version >= "3.6"
|
||||
urllib3==1.26.6; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4" and python_version >= "2.7"
|
||||
wcwidth==0.2.5; python_full_version >= "3.6.1" and python_version >= "3.7"
|
||||
webencodings==0.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
|
||||
widgetsnbextension==3.5.1
|
||||
zipp==3.5.0; python_version < "3.8" and python_version >= "3.6"
|
8
data/data-pipeline/settings.toml
Normal file
|
@ -0,0 +1,8 @@
|
|||
[default]
|
||||
AWS_JUSTICE40_DATA_URL = "https://justice40-data.s3.amazonaws.com"
|
||||
|
||||
[development]
|
||||
|
||||
[staging]
|
||||
|
||||
[production]
|
0
data/data-pipeline/tile/__init__.py
Normal file
63
data/data-pipeline/tile/generate.py
Normal file
|
@ -0,0 +1,63 @@
|
|||
import logging
import os
from pathlib import Path
import shutil
|
||||
|
||||
from etl.sources.census.etl_utils import get_state_fips_codes
|
||||
|
||||
|
||||
def generate_tiles(data_path: Path) -> None:
|
||||
|
||||
# remove existing mbtiles file
|
||||
mb_tiles_path = data_path / "tiles" / "block2010.mbtiles"
|
||||
if os.path.exists(mb_tiles_path):
|
||||
os.remove(mb_tiles_path)
|
||||
|
||||
# remove existing mvt directory
|
||||
mvt_tiles_path = data_path / "tiles" / "mvt"
|
||||
if os.path.exists(mvt_tiles_path):
|
||||
shutil.rmtree(mvt_tiles_path)
|
||||
|
||||
# remove existing score json files
|
||||
score_geojson_dir = data_path / "score" / "geojson"
|
||||
files_in_directory = os.listdir(score_geojson_dir)
|
||||
filtered_files = [file for file in files_in_directory if file.endswith(".json")]
|
||||
for file in filtered_files:
|
||||
path_to_file = os.path.join(score_geojson_dir, file)
|
||||
os.remove(path_to_file)
|
||||
|
||||
# join the state shapefile with the score csv
|
||||
state_fips_codes = get_state_fips_codes()
|
||||
for fips in state_fips_codes:
|
||||
cmd = (
|
||||
"ogr2ogr -f GeoJSON "
|
||||
+ f"-sql \"SELECT * FROM tl_2010_{fips}_bg10 LEFT JOIN 'data/score/csv/data{fips}.csv'.data{fips} ON tl_2010_{fips}_bg10.GEOID10 = data{fips}.ID\" "
|
||||
+ f"data/score/geojson/{fips}.json data/census/shp/{fips}/tl_2010_{fips}_bg10.dbf"
|
||||
)
|
||||
os.system(cmd)
|
||||
|
||||
# get a list of all json files to plug into the tippecanoe commands below
|
||||
# (workaround since *.json doesn't seem to work)
|
||||
geojson_list = ""
|
||||
geojson_path = data_path / "score" / "geojson"
|
||||
for file in os.listdir(geojson_path):
|
||||
if file.endswith(".json"):
|
||||
geojson_list += f"data/score/geojson/{file} "
|
||||
|
||||
if geojson_list == "":
logging.error(
"No GeoJSON files found. Please run scripts/download_cbg.py first"
)
return
|
||||
|
||||
# generate mbtiles file
|
||||
cmd = (
|
||||
"tippecanoe --drop-densest-as-needed -zg -o /home/data/tiles/block2010.mbtiles --extend-zooms-if-still-dropping -l cbg2010 -s_srs EPSG:4269 -t_srs EPSG:4326 "
|
||||
+ geojson_list
|
||||
)
|
||||
os.system(cmd)
|
||||
|
||||
# generate mvts
|
||||
cmd = (
|
||||
"tippecanoe --drop-densest-as-needed --no-tile-compression -zg -e /home/data/tiles/mvt "
|
||||
+ geojson_list
|
||||
)
|
||||
os.system(cmd)
|
1177
data/data-pipeline/utils.py
Normal file
File diff suppressed because it is too large
153
data/data-roadmap/README.md
Normal file
|
@ -0,0 +1,153 @@
|
|||
# Overview
|
||||
|
||||
This document describes our "data roadmap", which serves several purposes.
|
||||
|
||||
# Data roadmap goals
|
||||
|
||||
The goals of the data roadmap are as follows:
|
||||
|
||||
- Tracking data sets being considered for inclusion in the Climate and Economic Justice Screening Tool (CEJST), either as a data set that is included in the cumulative impacts score or a reference data set that is not included in the score
|
||||
|
||||
- Prioritizing data sets, so that it's obvious to developers working on the CEJST which data sets to incorporate next into the tool
|
||||
|
||||
- Gathering important details about each data set, such as its geographic resolution and the year it was last updated, so that the CEJST team can make informed decisions about what data to prioritize
|
||||
|
||||
- Tracking the problem areas that each data set relates to (e.g., a certain data set may relate to the problem of pesticide exposure amongst migrant farm workers)
|
||||
|
||||
- Enabling members of the public to submit ideas for problem areas or data sets to be considered for inclusion in the CEJST, with easy-to-use and accessible tools
|
||||
|
||||
- Enabling members of the public to submit revisions to the information about each problem area or data set, with easy-to-use and accessible tools
|
||||
|
||||
- Enabling the CEJST development team to review suggestions before incorporating them officially into the data roadmap, to filter out potential noise and spam, or consider how requests may lead to changes in software features and documentation
|
||||
|
||||
# User stories
|
||||
|
||||
These goals can map onto several user stories for the data roadmap, such as:
|
||||
|
||||
- As a community member, I want to suggest a new idea for a dataset.
|
||||
- As a community member, I want to understand what happened with my suggestion for a new dataset.
|
||||
- As a community member, I want to edit the details of a dataset proposal to add more information.
|
||||
- As a WHEJAC board member, I want to vote on what data sources should be prioritized next.
|
||||
- As a product manager, I want to filter based on characteristics of the data.
|
||||
- As a developer, I want to know what to work on next.
|
||||
|
||||
# Data set descriptions
|
||||
|
||||
There are lots of details that are important to track for each data set. This
|
||||
information helps us prepare to integrate a data set into the tool and prioritize
|
||||
between different options for data in the data roadmap.
|
||||
|
||||
In order to support a process of peer review on edits and updates, these details are
|
||||
tracked in one `YAML` file per data set description in the directory
|
||||
[data_roadmap/data_set_descriptions](data_roadmap/data_set_descriptions).
|
||||
|
||||
Each data set description includes a number of fields, some of which are required.
|
||||
The schema defining these fields is written in [Yamale](https://github.com/23andMe/Yamale)
|
||||
and lives at [data_roadmap/data_set_description_schema.yaml](data_roadmap/data_set_description_schema.yaml).
|
||||
|
||||
Because `Yamale` does not provide a method for describing fields, we've created an
|
||||
additional file that includes written descriptions of the meaning of each field in
|
||||
the schema. These live in [data_roadmap/data_set_description_field_descriptions.yaml](data_roadmap/data_set_description_field_descriptions.yaml).
|
||||
|
||||
In order to provide a helpful starting point for people who are ready to contribute
|
||||
ideas for a new data set for consideration, there is an auto-generated data set
|
||||
description template that lives at [data_roadmap/data_set_description_template.yaml](data_roadmap/data_set_description_template.yaml).
|
||||
|
||||
# Steps to add a new data set description: the "easy" way
|
||||
|
||||
Soon we will create a Google Form that contributors can use to submit ideas for new
|
||||
data sets. The Google Form will match the schema of the data set descriptions. Please
|
||||
see [this ticket](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/39)
|
||||
for tracking this work.
|
||||
|
||||
# Steps to add a new data set description: the git-savvy way
|
||||
|
||||
For those who are comfortable using `git` and `YAML`, these are the steps to
|
||||
contribute a new data set description to the data roadmap:
|
||||
|
||||
1. Research and learn about the data set you're proposing for consideration.
|
||||
|
||||
2. Clone the repository and learn about the [contribution guidelines for this
|
||||
project](../docs/CONTRIBUTING.md).
|
||||
|
||||
3. In your local version of the repository, copy the template from
|
||||
`data_roadmap/data_set_description_template.yaml` into a new file that lives in
|
||||
`data_roadmap/data_set_descriptions` and has the name of the data set as the name of the file.
|
||||
|
||||
4. Edit this file to ensure it has all of the appropriate details about the data set.
|
||||
|
||||
5. If you'd like, you can run the validations in `run_validations_and_write_template`
to ensure your contribution is valid according to the schema (see the sketch just
after this list). These checks will also run automatically on each commit.
|
||||
|
||||
6. Create a pull request with your new data set description and submit it for peer
|
||||
review.
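For reference, here is a minimal sketch of running those validations locally. It assumes the `data_roadmap` package has been installed, for example with `pip install -e .` from the `data/data-roadmap` directory:

```python
# Run the schema validations against every data set description and
# regenerate the auto-generated template file.
from data_roadmap.utils.utils_data_set_description_schema import (
    run_validations_and_write_template,
)

run_validations_and_write_template()
```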
|
||||
|
||||
Thank you for contributing!
|
||||
|
||||
# Tooling proposal and milestones
|
||||
|
||||
There is no single tool that supports all the goals and user stories described above.
|
||||
Therefore we've proposed combining a number of tools in a way that can support them all.
|
||||
|
||||
We've also proposed various "milestones" that will allow us to iteratively and
|
||||
sequentially build the data roadmap in a way that supports the entire vision but
|
||||
starts with small and achievable steps. These milestones are proposed in order.
|
||||
|
||||
This work is most accurately tracked in [this epic](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/38).
|
||||
We've also verbally described them below.
|
||||
|
||||
## Milestone: YAML files for data sets and linter (Done)
|
||||
|
||||
To start, we'll create a folder in this repository that can
|
||||
house YAML files, one per data set. Each file will describe the characteristics of the data.
|
||||
|
||||
The benefit of using a YAML file for this is that it's easy to subject changes to these files to peer review through the pull request process. This allows external collaborators from the open source community to submit suggested changes, which can be reviewed by the core CEJST team.
|
||||
|
||||
We'll use a Python-based script to load all the files in the directory, and then run a schema validator to ensure all the files have valid entries.
|
||||
|
||||
For schema validation, we propose using [Yamale](https://github.com/23andMe/Yamale). This provides a lightweight schema and validator, and [integrates nicely with GitHub actions](https://github.com/nrkno/yaml-schema-validator-github-action).
|
||||
|
||||
If there's an improper format in any of the files, the schema validator will throw an error.
|
||||
|
||||
As part of this milestone, we will also set this up to run automatically with each commit to any branch as part of CI/CD.
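As a rough sketch, the core of that validation loop looks like the following (the real implementation lives in `data/data-roadmap/utils/utils_data_set_description_schema.py`; the paths here assume you run it from the `data/data-roadmap` directory):

```python
# Minimal sketch of the linting flow: load the Yamale schema once, then
# validate every data set description YAML file against it.
import pathlib

import yamale

schema = yamale.make_schema(path="data_roadmap/data_set_description_schema.yaml")

for file_path in pathlib.Path("data_roadmap/data_set_descriptions").glob("*.yaml"):
    print(f"Validating {file_path}...")
    data = yamale.make_data(str(file_path))
    # Raises yamale.YamaleError if the file does not match the schema.
    yamale.validate(schema=schema, data=data)
```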
|
||||
|
||||
## Milestone: Google forms integration
|
||||
|
||||
To make it easy for non-engineer members of the public and advisory bodies such as the WHEJAC to submit suggestions for data sets, we will configure a Google Form that maps to the schema of the data set files.
|
||||
|
||||
This will enable members of the public to fill out a simple form suggesting data sets without needing to understand GitHub or other engineering concepts.
|
||||
|
||||
At first, these responses can simply go into the resulting Google Sheet and be manually copied and converted into data set description files. Later, we can write a script that automatically converts new entries in the Google Sheet into data set description files. This can be set up to run as a trigger on the addition of new rows to the Google Sheet.
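As a hedged sketch of that later automation, converting one sheet row into a data set description file could look roughly like this (how the row is pulled from the Google Sheet is out of scope here; the column names simply mirror the required fields in the schema):

```python
# Illustrative only: turn one form/sheet response (already fetched as a dict)
# into a data set description YAML file that matches the schema.
import datetime
import pathlib

import yaml


def write_description_from_row(
    row: dict, output_dir: str = "data_roadmap/data_set_descriptions"
) -> None:
    description = {
        "name": row["name"],
        "source": row["source"],
        "data_formats": row["data_formats"],
        "spatial_resolution": row["spatial_resolution"],
        "public_status": row["public_status"],
        "sponsor": row["sponsor"],
        "last_updated_date": datetime.date.fromisoformat(row["last_updated_date"]),
        "frequency_of_updates": row["frequency_of_updates"],
    }
    file_path = pathlib.Path(output_dir) / f"{row['name']}.yaml"
    with open(file_path, "w") as file:
        yaml.safe_dump(description, file, sort_keys=False)
```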
|
||||
|
||||
## Milestone: Post data in tabular format
|
||||
|
||||
Add a script that runs the schema validator on all files and, if successful, posts the results in a tabular format. There are straightforward packages to post a Python dictionary / `pandas` dataframe to Google Sheets and/or Airtable. As part of this milestone, we will also set this up to run automatically with each commit to `main` as part of CI/CD.
|
||||
|
||||
This will make it easier to filter the data to answer questions like, "Which data sources are available at the census block group level?"
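For instance, a small sketch of that flattening step using `pandas` (writing a local CSV here; pushing the same dataframe to Google Sheets or Airtable would use their respective client libraries):

```python
# Sketch: collect every data set description into one table for filtering.
import pathlib

import pandas as pd
import yaml

rows = []
for file_path in pathlib.Path("data_roadmap/data_set_descriptions").glob("*.yaml"):
    with open(file_path) as file:
        rows.append(yaml.safe_load(file))

roadmap_df = pd.DataFrame(rows)

# Example filter: which data sources are available at the census block group level?
cbg_df = roadmap_df[roadmap_df["spatial_resolution"] == "Census block group"]

roadmap_df.to_csv("data_roadmap_summary.csv", index=False)
```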
|
||||
|
||||
## Milestone: Tickets created for incorporating data sets
|
||||
|
||||
For each data set that is being considered for inclusion soon in the tool, the project management team will create a ticket for "Incorporating \_\_\_ data set into the database", with a link to the data set detail document. This ticket will be created in the ticket tracking system used by the open source repository, which is ZenHub. This project management system will be public.
|
||||
|
||||
At the initial launch, we are not planning for members of the open source community to be able to create tickets, but we would like to consider a process for members of the open source community creating tickets that can go through review by the CEJST team.
|
||||
|
||||
This will help developers know what to work on next, and open source community members can also pick up tickets and work to integrate the data sets.
|
||||
|
||||
## Milestone: Add problem areas
|
||||
|
||||
We'll need to somehow track "problem areas" that describe problems in climate, environmental, and economic justice, even without specific proposals of data sets. For instance, a problem area may be "food insecurity", and a number of data sets can have this as their problem area.
|
||||
|
||||
We can change the linter to validate that every data set description maps to one or more known problem areas.
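As a purely illustrative sketch (a `problem_areas` field does not exist in the schema yet, so everything here is an assumption), that extra lint check might look like:

```python
# Hypothetical check: every problem area referenced by a data set description
# must come from a known list. `problem_areas` is not yet part of the schema.
KNOWN_PROBLEM_AREAS = {"food insecurity", "pesticide exposure"}


def validate_problem_areas(description: dict, file_name: str) -> None:
    unknown = set(description.get("problem_areas", [])) - KNOWN_PROBLEM_AREAS
    if unknown:
        raise ValueError(
            f"{file_name} references unknown problem areas: {sorted(unknown)}"
        )
```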
|
||||
|
||||
The benefit of this is that some non-data-focused members of the public or the WHEJAC advisory body may want to suggest we prioritize certain problem areas, with or without ideas for specific data sets that may best address that problem area.
|
||||
|
||||
It is not clear at this time the best path forward for implementing these problem area descriptions. One option is to create a folder for descriptions of problem areas, which contains YAML files that get validated according to a schema. Another option would be simply to add these as an array in the description of data sets, or add labels to the tickets once data sets are tracked in GitHub tickets.
|
||||
|
||||
## Milestone: Add prioritization voting for WHEJAC and members of the public
|
||||
|
||||
This milestone is currently the least well-defined. It's important that members of advisory bodies like the WHEJAC and members of the public be able to "upvote" certain data sets for inclusion in the tool.
|
||||
|
||||
One potential for this is to use the [Stanford Participatory Budgeting Platform](https://pbstanford.org/). Here's an [example of voting on proposals within a limited budget](https://pbstanford.org/nyc8/knapsack).
|
||||
|
||||
For instance, going into a quarterly planning cycle, the CEJST development team could estimate the amount of time (in developer-weeks) that it would take to clean, analyze, and incorporate each potential data set. For instance, incorporating some already-cleaned census data may take 1 week of a developer's time, while incorporating new asthma data from CMS that's never been publicly released could take 5 weeks. Given a "budget" of the number of developer weeks available (e.g., 2 developers for 13 weeks, or 26 developer-weeks), advisors can vote on their top priorities for inclusion in the tool within the available "budget".
|
0
data/data-roadmap/__init__.py
Normal file
|
39
data/data-roadmap/data_set_description_field_descriptions.yaml
Normal file
|
@ -0,0 +1,39 @@
|
|||
# There is no method for adding field descriptions to `yamale` schemas.
|
||||
# Therefore, we've created a dictionary here of fields and their descriptions.
|
||||
name: A short name of the data set.
|
||||
source: The URL pointing towards the data set itself or more information about the
|
||||
data set.
|
||||
relevance_to_environmental_justice: It's useful to spell out why this data is
|
||||
relevant to EJ issues and/or can be used to identify EJ communities.
|
||||
spatial_resolution: Dev team needs to know if the resolution is granular enough to be useful
|
||||
public_status: Whether a dataset has already gone through public release process
|
||||
(like Census data) or may need a lengthy review process (like Medicaid data).
|
||||
sponsor: Whether there's a federal agency or non-governmental agency that is working
|
||||
to provide and maintain this data.
|
||||
subjective_rating_of_data_quality: Sometimes we don't have statistics on data
|
||||
quality, but we know it is likely to be accurate or not. How much has it been
|
||||
vetted by an agency; is this the de facto data set for the topic?
|
||||
estimated_margin_of_error: Estimated margin of error on measurement, if known. Often
|
||||
more narrow geographic measures have a higher margin of error due to a smaller sample
|
||||
for each measurement.
|
||||
known_data_quality_issues: It can be helpful to write out known problems.
|
||||
geographic_coverage_percent: We want to think about data that is comprehensive across
|
||||
America.
|
||||
geographic_coverage_description: A verbal description of geographic coverage.
|
||||
data_formats: Developers need to know what formats the data is available in
|
||||
last_updated_date: When was the data last updated / refreshed? (In format YYYY-MM-DD.
|
||||
If exact date is not known, use YYYY-01-01.)
|
||||
frequency_of_updates: How often is this data updated? Is it updated on a reliable
|
||||
cadence?
|
||||
documentation: Link to docs. Also, is the documentation good enough? Can we get the
|
||||
info we need?
|
||||
data_can_go_in_cloud: Some datasets can not legally go in the cloud
|
||||
|
||||
discussion: Review of other topics, such as
|
||||
peer review (Overview or links out to peer review done on this dataset),
|
||||
where and how data is available (e.g., Geoplatform.gov? Is it available from multiple
|
||||
sources?),
|
||||
risk assessment of the data (e.g. a vendor-processed version of the dataset might not
|
||||
be open or good enough),
|
||||
legal considerations (Legal disclaimers, assumption of risk, proprietary?),
|
||||
accreditation (Is this source accredited?)
|
24
data/data-roadmap/data_set_description_schema.yaml
Normal file
|
@ -0,0 +1,24 @@
|
|||
# `yamale` schema for descriptions of data sets.
|
||||
name: str(required=True)
|
||||
source: str(required=True)
|
||||
relevance_to_environmental_justice: str(required=False)
|
||||
data_formats: enum('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ',
|
||||
'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS', required=True)
|
||||
spatial_resolution: enum('State/territory', 'County', 'Zip code', 'Census tract',
|
||||
'Census block group', 'Exact address or lat/long', 'Other', required=True)
|
||||
public_status: enum('Not Released', 'Public', 'Public for certain audiences', 'Other',
|
||||
required=True)
|
||||
sponsor: str(required=True)
|
||||
subjective_rating_of_data_quality: enum('Low Quality', 'Medium Quality', 'High
|
||||
Quality', required=False)
|
||||
estimated_margin_of_error: num(required=False)
|
||||
known_data_quality_issues: str(required=False)
|
||||
geographic_coverage_percent: num(required=False)
|
||||
geographic_coverage_description: str(required=False)
|
||||
last_updated_date: day(min='2001-01-01', max='2100-01-01', required=True)
|
||||
frequency_of_updates: enum('Less than annually', 'Approximately annually',
|
||||
'Once every 1-6 months',
|
||||
'Daily or more frequently than daily', 'Unknown', required=True)
|
||||
documentation: str(required=False)
|
||||
data_can_go_in_cloud: bool(required=False)
|
||||
discussion: str(required=False)
|
94
data/data-roadmap/data_set_description_template.yaml
Normal file
|
@ -0,0 +1,94 @@
|
|||
# Note: This template is automatically generated by the function
|
||||
# `write_data_set_description_template_file` from the schema
|
||||
# and field descriptions files. Do not manually edit this file.
|
||||
|
||||
name:
|
||||
# Description: A short name of the data set.
|
||||
# Required field: True
|
||||
# Field type: str
|
||||
|
||||
source:
|
||||
# Description: The URL pointing towards the data set itself or more information about the data set.
|
||||
# Required field: True
|
||||
# Field type: str
|
||||
|
||||
relevance_to_environmental_justice:
|
||||
# Description: It's useful to spell out why this data is relevant to EJ issues and/or can be used to identify EJ communities.
|
||||
# Required field: False
|
||||
# Field type: str
|
||||
|
||||
data_formats:
|
||||
# Description: Developers need to know what formats the data is available in
|
||||
# Required field: True
|
||||
# Field type: enum
|
||||
# Valid choices are one of the following: ('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ', 'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS')
|
||||
|
||||
spatial_resolution:
|
||||
# Description: Dev team needs to know if the resolution is granular enough to be useful
|
||||
# Required field: True
|
||||
# Field type: enum
|
||||
# Valid choices are one of the following: ('State/territory', 'County', 'Zip code', 'Census tract', 'Census block group', 'Exact address or lat/long', 'Other')
|
||||
|
||||
public_status:
|
||||
# Description: Whether a dataset has already gone through public release process (like Census data) or may need a lengthy review process (like Medicaid data).
|
||||
# Required field: True
|
||||
# Field type: enum
|
||||
# Valid choices are one of the following: ('Not Released', 'Public', 'Public for certain audiences', 'Other')
|
||||
|
||||
sponsor:
|
||||
# Description: Whether there's a federal agency or non-governmental agency that is working to provide and maintain this data.
|
||||
# Required field: True
|
||||
# Field type: str
|
||||
|
||||
subjective_rating_of_data_quality:
|
||||
# Description: Sometimes we don't have statistics on data quality, but we know it is likely to be accurate or not. How much has it been vetted by an agency; is this the de facto data set for the topic?
|
||||
# Required field: False
|
||||
# Field type: enum
|
||||
# Valid choices are one of the following: ('Low Quality', 'Medium Quality', 'High Quality')
|
||||
|
||||
estimated_margin_of_error:
|
||||
# Description: Estimated margin of error on measurement, if known. Often more narrow geographic measures have a higher margin of error due to a smaller sample for each measurement.
|
||||
# Required field: False
|
||||
# Field type: num
|
||||
|
||||
known_data_quality_issues:
|
||||
# Description: It can be helpful to write out known problems.
|
||||
# Required field: False
|
||||
# Field type: str
|
||||
|
||||
geographic_coverage_percent:
|
||||
# Description: We want to think about data that is comprehensive across America.
|
||||
# Required field: False
|
||||
# Field type: num
|
||||
|
||||
geographic_coverage_description:
|
||||
# Description: A verbal description of geographic coverage.
|
||||
# Required field: False
|
||||
# Field type: str
|
||||
|
||||
last_updated_date:
|
||||
# Description: When was the data last updated / refreshed? (In format YYYY-MM-DD. If exact date is not known, use YYYY-01-01.)
|
||||
# Required field: True
|
||||
# Field type: day
|
||||
|
||||
frequency_of_updates:
|
||||
# Description: How often is this data updated? Is it updated on a reliable cadence?
|
||||
# Required field: True
|
||||
# Field type: enum
|
||||
# Valid choices are one of the following: ('Less than annually', 'Approximately annually', 'Once every 1-6 months', 'Daily or more frequently than daily', 'Unknown')
|
||||
|
||||
documentation:
|
||||
# Description: Link to docs. Also, is the documentation good enough? Can we get the info we need?
|
||||
# Required field: False
|
||||
# Field type: str
|
||||
|
||||
data_can_go_in_cloud:
|
||||
# Description: Some datasets can not legally go in the cloud
|
||||
# Required field: False
|
||||
# Field type: bool
|
||||
|
||||
discussion:
|
||||
# Description: Review of other topics, such as peer review (Overview or links out to peer review done on this dataset), where and how data is available (e.g., Geoplatform.gov? Is it available from multiple sources?), risk assessment of the data (e.g. a vendor-processed version of the dataset might not be open or good enough), legal considerations (Legal disclaimers, assumption of risk, proprietary?), accreditation (Is this source accredited?)
|
||||
# Required field: False
|
||||
# Field type: str
|
||||
|
35
data/data-roadmap/data_set_descriptions/PM25.yaml
Normal file
|
@ -0,0 +1,35 @@
|
|||
name: Particulate Matter 2.5
|
||||
|
||||
source: https://gaftp.epa.gov/EJSCREEN/
|
||||
|
||||
relevance_to_environmental_justice: Particulate matter has a lot of adverse impacts
|
||||
on health.
|
||||
|
||||
data_formats: CSV/XLSX
|
||||
|
||||
spatial_resolution: Census block group
|
||||
|
||||
public_status: Public
|
||||
|
||||
sponsor: EPA
|
||||
|
||||
subjective_rating_of_data_quality: Medium Quality
|
||||
|
||||
estimated_margin_of_error:
|
||||
|
||||
known_data_quality_issues: Many PM 2.5 stations are known to be pretty far apart, so
|
||||
averaging them can lead to data quality loss.
|
||||
|
||||
geographic_coverage_percent:
|
||||
|
||||
geographic_coverage_description:
|
||||
|
||||
last_updated_date: 2017-01-01
|
||||
|
||||
frequency_of_updates: Less than annually
|
||||
|
||||
documentation: https://www.epa.gov/sites/production/files/2015-05/documents/ejscreen_technical_document_20150505.pdf#page=13
|
||||
|
||||
data_can_go_in_cloud: True
|
||||
|
||||
discussion:
|
0
data/data-roadmap/data_set_descriptions/__init__.py
Normal file
1
data/data-roadmap/requirements.txt
Normal file
|
@ -0,0 +1 @@
|
|||
yamale==3.0.6
|
21
data/data-roadmap/setup.py
Normal file
|
@ -0,0 +1,21 @@
|
|||
"""Setup script for `data_roadmap` package."""
|
||||
import os
|
||||
|
||||
from setuptools import find_packages
|
||||
from setuptools import setup
|
||||
|
||||
# TODO: replace this with `poetry`. https://github.com/usds/justice40-tool/issues/57
|
||||
_PACKAGE_DIRECTORY = os.path.abspath(os.path.dirname(__file__))
|
||||
|
||||
with open(os.path.join(_PACKAGE_DIRECTORY, "requirements.txt")) as f:
|
||||
requirements = f.readlines()
|
||||
|
||||
setup(
|
||||
name="data_roadmap",
|
||||
description="Data roadmap package",
|
||||
author="CEJST Development Team",
|
||||
author_email="justice40open@usds.gov",
|
||||
install_requires=requirements,
|
||||
include_package_data=True,
|
||||
packages=find_packages(),
|
||||
)
|
151
data/data-roadmap/utils/utils_data_set_description_schema.py
Normal file
|
@ -0,0 +1,151 @@
|
|||
import importlib_resources
|
||||
import pathlib
|
||||
import yamale
|
||||
import yaml
|
||||
|
||||
# Set directories.
|
||||
DATA_ROADMAP_DIRECTORY = importlib_resources.files("data_roadmap")
|
||||
UTILS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "utils"
|
||||
DATA_SET_DESCRIPTIONS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "data_set_descriptions"
|
||||
|
||||
# Set file paths.
|
||||
DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH = (
|
||||
DATA_ROADMAP_DIRECTORY / "data_set_description_schema.yaml"
|
||||
)
|
||||
DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH = (
|
||||
DATA_ROADMAP_DIRECTORY / "data_set_description_field_descriptions.yaml"
|
||||
)
|
||||
DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH = (
|
||||
DATA_ROADMAP_DIRECTORY / "data_set_description_template.yaml"
|
||||
)
|
||||
|
||||
|
||||
def load_data_set_description_schema(
|
||||
file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH,
|
||||
) -> yamale.schema.schema.Schema:
|
||||
"""Load from file the data set description schema."""
|
||||
schema = yamale.make_schema(path=file_path)
|
||||
|
||||
return schema
|
||||
|
||||
|
||||
def load_data_set_description_field_descriptions(
|
||||
file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH,
|
||||
) -> dict:
|
||||
"""Load from file the descriptions of fields in the data set description."""
|
||||
# Load field descriptions.
|
||||
with open(file_path, "r") as stream:
|
||||
data_set_description_field_descriptions = yaml.safe_load(stream=stream)
|
||||
|
||||
return data_set_description_field_descriptions
|
||||
|
||||
|
||||
def validate_descriptions_for_schema(
|
||||
schema: yamale.schema.schema.Schema,
|
||||
field_descriptions: dict,
|
||||
) -> None:
|
||||
"""Validate descriptions for schema.
|
||||
|
||||
Checks that every field in the `yamale` schema also has a field
|
||||
description in the `field_descriptions` dict.
|
||||
"""
|
||||
for field_name in schema.dict.keys():
|
||||
if field_name not in field_descriptions:
|
||||
raise ValueError(
|
||||
f"Field `{field_name}` does not have a "
|
||||
f"description. Please add one to file `{DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH}`"
|
||||
)
|
||||
|
||||
for field_name in field_descriptions.keys():
|
||||
if field_name not in schema.dict.keys():
|
||||
raise ValueError(
|
||||
f"Field `{field_name}` has a description but is not in the " f"schema."
|
||||
)
|
||||
|
||||
|
||||
def validate_all_data_set_descriptions(
|
||||
data_set_description_schema: yamale.schema.schema.Schema,
|
||||
) -> None:
|
||||
"""Validate data set descriptions.
|
||||
|
||||
Validate each file in the `data_set_descriptions` directory
against the provided schema.
|
||||
|
||||
"""
|
||||
data_set_description_file_paths_generator = DATA_SET_DESCRIPTIONS_DIRECTORY.glob(
|
||||
"*.yaml"
|
||||
)
|
||||
|
||||
# Validate each file
|
||||
for file_path in data_set_description_file_paths_generator:
|
||||
print(f"Validating {file_path}...")
|
||||
|
||||
# Create a yamale Data object
|
||||
data_set_description = yamale.make_data(file_path)
|
||||
|
||||
# TODO: explore collecting all errors and raising them at once. - Lucas
|
||||
yamale.validate(schema=data_set_description_schema, data=data_set_description)
|
||||
|
||||
|
||||
def write_data_set_description_template_file(
|
||||
data_set_description_schema: yamale.schema.schema.Schema,
|
||||
data_set_description_field_descriptions: dict,
|
||||
template_file_path: str = DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH,
|
||||
) -> None:
|
||||
"""Write an example data set description with helpful comments."""
|
||||
template_file_lines = []
|
||||
|
||||
# Write comments at the top of the template
|
||||
template_file_lines.append(
|
||||
"# Note: This template is automatically generated by the function\n"
|
||||
"# `write_data_set_description_template_file` from the schema\n"
|
||||
"# and field descriptions files. Do not manually edit this file.\n\n"
|
||||
)
|
||||
|
||||
schema_dict = data_set_description_schema.dict
|
||||
for field_name, field_schema in schema_dict.items():
|
||||
template_file_lines.append(f"{field_name}: \n")
|
||||
template_file_lines.append(
|
||||
f"# Description: {data_set_description_field_descriptions[field_name]}\n"
|
||||
)
|
||||
template_file_lines.append(f"# Required field: {field_schema.is_required}\n")
|
||||
template_file_lines.append(f"# Field type: {field_schema.get_name()}\n")
|
||||
if type(field_schema) is yamale.validators.validators.Enum:
|
||||
template_file_lines.append(
|
||||
f"# Valid choices are one of the following: {field_schema.enums}\n"
|
||||
)
|
||||
|
||||
# Add an empty linebreak to separate fields.
|
||||
template_file_lines.append("\n")
|
||||
|
||||
with open(template_file_path, "w") as file:
|
||||
file.writelines(template_file_lines)
|
||||
|
||||
|
||||
def run_validations_and_write_template() -> None:
|
||||
"""Run validations of schema and descriptions, and write a template file."""
|
||||
# Load the schema and a separate dictionary
|
||||
data_set_description_schema = load_data_set_description_schema()
|
||||
data_set_description_field_descriptions = (
|
||||
load_data_set_description_field_descriptions()
|
||||
)
|
||||
|
||||
validate_descriptions_for_schema(
|
||||
schema=data_set_description_schema,
|
||||
field_descriptions=data_set_description_field_descriptions,
|
||||
)
|
||||
|
||||
# Validate all data set descriptions in the directory against schema.
|
||||
validate_all_data_set_descriptions(
|
||||
data_set_description_schema=data_set_description_schema
|
||||
)
|
||||
|
||||
# Write an example template for data set descriptions.
|
||||
write_data_set_description_template_file(
|
||||
data_set_description_schema=data_set_description_schema,
|
||||
data_set_description_field_descriptions=data_set_description_field_descriptions,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
run_validations_and_write_template()
|
|
@ -0,0 +1,248 @@
|
|||
import unittest
|
||||
from unittest import mock
|
||||
|
||||
import yamale
|
||||
from data_roadmap.utils.utils_data_set_description_schema import (
|
||||
load_data_set_description_schema,
|
||||
load_data_set_description_field_descriptions,
|
||||
validate_descriptions_for_schema,
|
||||
validate_all_data_set_descriptions,
|
||||
write_data_set_description_template_file,
|
||||
)
|
||||
|
||||
|
||||
class UtilsDataSetDescriptionSchema(unittest.TestCase):
|
||||
@mock.patch("yamale.make_schema")
|
||||
def test_load_data_set_description_schema(self, make_schema_mock):
|
||||
load_data_set_description_schema(file_path="mock.yaml")
|
||||
|
||||
make_schema_mock.assert_called_once_with(path="mock.yaml")
|
||||
|
||||
@mock.patch("yaml.safe_load")
|
||||
def test_load_data_set_description_field_descriptions(self, yaml_safe_load_mock):
|
||||
# Note: this isn't a great test, we could mock the actual YAML to
|
||||
# make it better. - Lucas
|
||||
mock_dict = {
|
||||
"name": "The name of the thing.",
|
||||
"age": "The age of the thing.",
|
||||
"height": "The height of the thing.",
|
||||
"awesome": "The awesome of the thing.",
|
||||
"field": "The field of the thing.",
|
||||
}
|
||||
|
||||
yaml_safe_load_mock.return_value = mock_dict
|
||||
|
||||
field_descriptions = load_data_set_description_field_descriptions()
|
||||
|
||||
yaml_safe_load_mock.assert_called_once()
|
||||
|
||||
self.assertDictEqual(field_descriptions, mock_dict)
|
||||
|
||||
def test_validate_descriptions_for_schema(self):
|
||||
# Test when all descriptions are present.
|
||||
field_descriptions = {
|
||||
"name": "The name of the thing.",
|
||||
"age": "The age of the thing.",
|
||||
"height": "The height of the thing.",
|
||||
"awesome": "The awesome of the thing.",
|
||||
"field": "The field of the thing.",
|
||||
}
|
||||
|
||||
schema = yamale.make_schema(
|
||||
content="""
|
||||
name: str()
|
||||
age: int(max=200)
|
||||
height: num()
|
||||
awesome: bool()
|
||||
field: enum('option 1', 'option 2')
|
||||
"""
|
||||
)
|
||||
|
||||
# Should pass.
|
||||
validate_descriptions_for_schema(
|
||||
schema=schema, field_descriptions=field_descriptions
|
||||
)
|
||||
|
||||
field_descriptions_missing_one = {
|
||||
"name": "The name of the thing.",
|
||||
"age": "The age of the thing.",
|
||||
"height": "The height of the thing.",
|
||||
"awesome": "The awesome of the thing.",
|
||||
}
|
||||
|
||||
# Should fail because of the missing field description.
|
||||
with self.assertRaises(ValueError) as context_manager:
|
||||
validate_descriptions_for_schema(
|
||||
schema=schema, field_descriptions=field_descriptions_missing_one
|
||||
)
|
||||
|
||||
# Using `assertIn` because the file path is returned in the error
|
||||
# message, and it varies based on environment.
|
||||
self.assertIn(
|
||||
"Field `field` does not have a description. Please add one to file",
|
||||
str(context_manager.exception),
|
||||
)
|
||||
|
||||
field_descriptions_extra_one = {
|
||||
"name": "The name of the thing.",
|
||||
"age": "The age of the thing.",
|
||||
"height": "The height of the thing.",
|
||||
"awesome": "The awesome of the thing.",
|
||||
"field": "The field of the thing.",
|
||||
"extra": "Extra description.",
|
||||
}
|
||||
|
||||
# Should fail because of the extra field description.
|
||||
with self.assertRaises(ValueError) as context_manager:
|
||||
validate_descriptions_for_schema(
|
||||
schema=schema, field_descriptions=field_descriptions_extra_one
|
||||
)
|
||||
|
||||
# Unlike above, this error message does not include a file path,
# so we can check the full string exactly.
self.assertEqual(
|
||||
"Field `extra` has a description but is not in the schema.",
|
||||
str(context_manager.exception),
|
||||
)
|
||||
|
||||
def test_validate_all_data_set_descriptions(self):
|
||||
# Setup a few examples of `yamale` data *before* we mock the `make_data`
|
||||
# function.
|
||||
valid_data = yamale.make_data(
|
||||
content="""
|
||||
name: Bill
|
||||
age: 26
|
||||
height: 6.2
|
||||
awesome: True
|
||||
field: option 1
|
||||
"""
|
||||
)
|
||||
|
||||
invalid_data_1 = yamale.make_data(
|
||||
content="""
|
||||
name: Bill
|
||||
age: asdf
|
||||
height: 6.2
|
||||
awesome: asdf
|
||||
field: option 1
|
||||
"""
|
||||
)
|
||||
|
||||
invalid_data_2 = yamale.make_data(
|
||||
content="""
|
||||
age: 26
|
||||
height: 6.2
|
||||
awesome: True
|
||||
field: option 1
|
||||
"""
|
||||
)
|
||||
|
||||
# Mock `make_data`.
|
||||
with mock.patch.object(
|
||||
yamale, "make_data", return_value=None
|
||||
) as yamale_make_data_mock:
|
||||
schema = yamale.make_schema(
|
||||
content="""
|
||||
name: str()
|
||||
age: int(max=200)
|
||||
height: num()
|
||||
awesome: bool()
|
||||
field: enum('option 1', 'option 2')
|
||||
"""
|
||||
)
|
||||
|
||||
# Make the `make_data` method return valid data.
|
||||
yamale_make_data_mock.return_value = valid_data
|
||||
|
||||
# Should pass.
|
||||
validate_all_data_set_descriptions(data_set_description_schema=schema)
|
||||
|
||||
# Make some of the data invalid.
|
||||
yamale_make_data_mock.return_value = invalid_data_1
|
||||
|
||||
# Should fail because of the invalid field values.
|
||||
with self.assertRaises(yamale.YamaleError) as context_manager:
|
||||
validate_all_data_set_descriptions(data_set_description_schema=schema)
|
||||
|
||||
self.assertEqual(
|
||||
str(context_manager.exception),
|
||||
"""Error validating data
|
||||
age: 'asdf' is not a int.
|
||||
awesome: 'asdf' is not a bool.""",
|
||||
)
|
||||
|
||||
# Make some of the data missing.
|
||||
yamale_make_data_mock.return_value = invalid_data_2
|
||||
|
||||
# Should fail because of the missing fields.
|
||||
with self.assertRaises(yamale.YamaleError) as context_manager:
|
||||
validate_all_data_set_descriptions(data_set_description_schema=schema)
|
||||
|
||||
self.assertEqual(
|
||||
str(context_manager.exception),
|
||||
"""Error validating data
|
||||
name: Required field missing""",
|
||||
)
|
||||
|
||||
@mock.patch("builtins.open", new_callable=mock.mock_open)
|
||||
def test_write_data_set_description_template_file(self, builtins_writelines_mock):
|
||||
schema = yamale.make_schema(
|
||||
content="""
|
||||
name: str()
|
||||
age: int(max=200)
|
||||
height: num()
|
||||
awesome: bool()
|
||||
field: enum('option 1', 'option 2')
|
||||
"""
|
||||
)
|
||||
|
||||
data_set_description_field_descriptions = {
|
||||
"name": "The name of the thing.",
|
||||
"age": "The age of the thing.",
|
||||
"height": "The height of the thing.",
|
||||
"awesome": "The awesome of the thing.",
|
||||
"field": "The field of the thing.",
|
||||
}
|
||||
|
||||
write_data_set_description_template_file(
|
||||
data_set_description_schema=schema,
|
||||
data_set_description_field_descriptions=data_set_description_field_descriptions,
|
||||
template_file_path="mock_template.yaml",
|
||||
)
|
||||
|
||||
call_to_writelines = builtins_writelines_mock.mock_calls[2][1][0]
|
||||
|
||||
self.assertListEqual(
|
||||
call_to_writelines,
|
||||
[
|
||||
"# Note: This template is automatically generated by the function\n"
|
||||
"# `write_data_set_description_template_file` from the schema\n"
|
||||
"# and field descriptions files. Do not manually edit this file.\n\n",
|
||||
"name: \n",
|
||||
"# Description: The name of the thing.\n",
|
||||
"# Required field: True\n",
|
||||
"# Field type: str\n",
|
||||
"\n",
|
||||
"age: \n",
|
||||
"# Description: The age of the thing.\n",
|
||||
"# Required field: True\n",
|
||||
"# Field type: int\n",
|
||||
"\n",
|
||||
"height: \n",
|
||||
"# Description: The height of the thing.\n",
|
||||
"# Required field: True\n",
|
||||
"# Field type: num\n",
|
||||
"\n",
|
||||
"awesome: \n",
|
||||
"# Description: The awesome of the thing.\n",
|
||||
"# Required field: True\n",
|
||||
"# Field type: bool\n",
|
||||
"\n",
|
||||
"field: \n",
|
||||
"# Description: The field of the thing.\n",
|
||||
"# Required field: True\n",
|
||||
"# Field type: enum\n",
|
||||
"# Valid choices are one of the following: ('option 1', 'option 2')\n",
|
||||
"\n",
|
||||
],
|
||||
)
|