Data folder restructuring in preparation for 361 (#376)

* initial checkin

* gitignore and docker-compose update

* readme update and error on hud

* encoding issue

* one more small README change

* data roadmap re-structure

* pyproject sort

* small update to score output folders

* checkpoint

* couple of last fixes
Jorge Escobar 2021-07-20 14:55:39 -04:00 committed by GitHub
commit 543d147e61
66 changed files with 130 additions and 108 deletions

View file: Dockerfile

@@ -0,0 +1,33 @@
FROM ubuntu:20.04
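# Build this image from the repository root with `docker-compose build` (see the README)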
# Install packages
RUN apt-get update && apt-get install -y \
build-essential \
make \
gcc \
git \
unzip \
wget \
python3-dev \
python3-pip
# tippecanoe
ENV TZ=America/Los_Angeles
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get install -y software-properties-common libsqlite3-dev zlib1g-dev
RUN apt-add-repository -y ppa:git-core/ppa
RUN mkdir -p /tmp/tippecanoe-src && git clone https://github.com/mapbox/tippecanoe.git /tmp/tippecanoe-src
WORKDIR /tmp/tippecanoe-src
RUN make && make install
## gdal
RUN add-apt-repository ppa:ubuntugis/ppa
RUN apt-get -y install gdal-bin
# Prepare python packages
WORKDIR /data-pipeline
RUN pip3 install --upgrade pip setuptools wheel
# Copy requirements first so dependency installation is cached across code changes
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .

View file: README.md

@@ -0,0 +1,163 @@
# Justice40 Score application
<details open="open">
<summary>Table of Contents</summary>
<!-- TOC -->
- [About this application](#about-this-application)
- [Score comparison workflow](#score-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
- [Step 1: Run the ETL script for each data source](#step-1-run-the-etl-script-for-each-data-source)
- [Step 2: Calculate the Justice40 score experiments](#step-2-calculate-the-justice40-score-experiments)
- [Step 3: Compare the Justice40 score experiments to other indices](#step-3-compare-the-justice40-score-experiments-to-other-indices)
- [Data Sources](#data-sources)
- [Running using Docker](#running-using-docker)
- [Log visualization](#log-visualization)
- [Local development](#local-development)
- [Downloading Census Block Groups GeoJSON and Generating CBG CSVs](#downloading-census-block-groups-geojson-and-generating-cbg-csvs)
- [Generating mbtiles](#generating-mbtiles)
- [Serve the map locally](#serve-the-map-locally)
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Miscellaneous](#miscellaneous)
<!-- /TOC -->
</details>
## About this application
This application is used to compare experimental versions of the Justice40 score to established environmental justice indices, such as EJSCREEN, CalEnviroScreen, and so on.
_**NOTE:** These scores **do not** represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time._
### Score comparison workflow
The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.
#### Workflow Diagram
TODO add mermaid diagram
#### Step 0: Set up your environment
1. After cloning the project locally, change to this directory: `cd data/data-pipeline`
1. Choose whether you'd like to run this application using Docker or if you'd like to install the dependencies locally so you can contribute to the project.
- **With Docker:** Follow these [installation instructions](https://docs.docker.com/get-docker/) and skip down to the [Running using Docker section](#running-using-docker) for more information
- **For Local Development:** Skip down to the [Local Development section](#local-development) for more detailed installation instructions
#### Step 1: Run the ETL script for each data source
1. Call the `etl-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
- With Docker: `docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"`
- With Poetry: `poetry run python application.py etl-run`
1. The `etl-run` command will execute the corresponding ETL script for each data source in `etl/sources/`. For example, `etl/sources/ejscreen/etl.py` is the ETL script for EJSCREEN data. (Each script follows a common pattern; see the sketch after this list.)
1. Each ETL script will extract the data from its original source, then format the data into `.csv` files that get stored in the relevant folder in `data/dataset/`. For example, HUD Housing data is stored in `data/dataset/hud_housing/usa.csv`.
_**NOTE:** You have the option to pass the name of a specific data source to the `etl-run` command, which will limit the execution of the ETL process to that specific data source._
_For example: `poetry run python application.py etl-run ejscreen` would only run the ETL process for EJSCREEN data._
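Each ETL script follows a common pattern, defined by the `ExtractTransformLoad` base class in `etl/base.py`: one class per data source implements `extract`, `transform`, and `load`, and `etl_runner` calls them in that order (followed by `cleanup`). Here is a minimal sketch of what a new source could look like; the class name, URL, and output folder are hypothetical, for illustration only:

```python
import pandas as pd

from etl.base import ExtractTransformLoad


class ExampleETL(ExtractTransformLoad):
    def __init__(self):
        self.EXAMPLE_URL = "https://example.com/example-data.zip"  # hypothetical
        self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "example"
        self.df: pd.DataFrame

    def extract(self) -> None:
        # Download the zip into the shared tmp folder and unzip it there.
        super().extract(self.EXAMPLE_URL, self.TMP_PATH)

    def transform(self) -> None:
        self.df = pd.read_csv(self.TMP_PATH / "example.csv")

    def load(self) -> None:
        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.OUTPUT_PATH / "usa.csv", index=False)
```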
#### Step 2: Calculate the Justice40 score experiments
1. Call the `score-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
- With Docker: `docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"`
- With Poetry: `poetry run python application.py score-run`
1. The `score-run` command will execute the `etl/score/etl.py` script which loads the data from each of the source files added to the `data/dataset/` directory by the ETL scripts in Step 1.
1. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
- Their [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
- They are normalized using [min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling), which rescales each column so that the Census Block Group with the highest value is set to 1, the one with the lowest value is set to 0, and all other values fall proportionally in between (see the sketch after this list).
1. The standardized columns are then used to calculate each of the Justice40 score experiments described in greater detail below, and the results are exported to a `.csv` file in [`data/score/csv`](data/score/csv)
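In pandas terms, the two standardizations amount to the following (a simplified sketch, where `df` stands for the merged dataframe and `col` for any numeric column):

```python
# Percentile rank: the share of Census Block Groups with a lower value.
df[f"{col} (percentile)"] = df[col].rank(pct=True)

# Min-max normalization: rescale the column to the [0, 1] range.
min_value = df[col].min(skipna=True)
max_value = df[col].max(skipna=True)
df[f"{col} (min-max normalized)"] = (df[col] - min_value) / (max_value - min_value)
```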
#### Step 3: Compare the Justice40 score experiments to other indices
1. TODO: Describe the steps for this
### Data Sources
- **[EJSCREEN](etl/sources/ejscreen):** TODO Add description of data source
- **[Census](etl/sources/census):** TODO Add description of data source
- **[American Communities Survey](etl/sources/census_acs):** TODO Add description of data source
- **[Housing and Transportation](etl/sources/housing_and_transportation):** TODO Add description of data source
- **[HUD Housing](etl/sources/hud_housing):** TODO Add description of data source
- **[HUD Recap](etl/sources/hud_recap):** TODO Add description of data source
- **[CalEnviroScreen](etl/sources/calenviroscreen):** TODO Add description of data source
## Running using Docker
We use Docker to install the necessary libraries in a container that can be run on any operating system.
To build the Docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`.
Once the build completes, run `docker-compose up`, then open a new tab or terminal window and run any command for the application using this format:
`docker exec j40_data_pipeline_1 python3 application.py [command]`
Here's a list of commands:
- Get help: `docker exec j40_data_pipeline_1 python3 application.py --help`
- Clean up the census data directories: `docker exec j40_data_pipeline_1 python3 application.py census-cleanup`
- Clean up the data directories: `docker exec j40_data_pipeline_1 python3 application.py data-cleanup`
- Generate census data: `docker exec j40_data_pipeline_1 python3 application.py census-data-download`
- Run all ETL processes: `docker exec j40_data_pipeline_1 python3 application.py etl-run`
- Generate Score: `docker exec j40_data_pipeline_1 python3 application.py score-run`
## Local development
You can run the Python code locally without Docker for development, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. To generate tiles for a local map, you will also need [Mapbox tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to those repos for installation instructions specific to your OS.
### Windows Users
- If you want to download Census data or run tile generation, please install Tippecanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
- If you want to generate tiles, you need some prerequisites for Geopandas, as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally.
### Setting up Poetry
- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with `poetry install`
### Downloading Census Block Groups GeoJSON and Generating CBG CSVs
- Make sure you have Docker running on your machine
- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
- Then run `poetry run python application.py census-data-download`
Note: Census files are not kept in the repository, and the download directories are ignored by Git.
### Generating mbtiles
- TBD
### Serve the map locally
- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`
### Running Jupyter notebooks
- Start a terminal
- Change to this directory (i.e. `cd data/data-pipeline`)
- Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab
### Activating variable-enabled Markdown for Jupyter notebooks
- Change to this directory (i.e. `cd data/data-pipeline`)
- Activate a Poetry shell: `poetry shell`
- Run `jupyter contrib nbextension install --user`
- Run `jupyter nbextension enable python-markdown/main`
- Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See the button near the top right of the notebook screen.)
For more information, see the [nbextensions docs](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) and the [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/python-markdown).
## Miscellaneous
- To export packages from Poetry to `requirements.txt`, run `poetry export --without-hashes > requirements.txt`

View file: application.py

@@ -0,0 +1,92 @@
import click
from config import settings
from etl.sources.census.etl_utils import reset_data_directories as census_reset
from utils import (
get_module_logger,
data_folder_cleanup,
score_folder_cleanup,
temp_folder_cleanup,
)
from etl.sources.census.etl import download_census_csvs
from etl.runner import etl_runner, score_generate
logger = get_module_logger(__name__)
@click.group()
def cli():
"""Defines a click group for the commands below"""
pass
@cli.command(
help="Clean up all census data folders",
)
def census_cleanup():
"""CLI command to clean up the census data folder"""
data_path = settings.APP_ROOT / "data"
# census directories
logger.info(f"Initializing all census data")
census_reset(data_path)
logger.info("Cleaned up all census data files")
@cli.command(
help="Clean up all data folders",
)
def data_cleanup():
"""CLI command to clean up the all the data folders"""
data_folder_cleanup()
score_folder_cleanup()
temp_folder_cleanup()
logger.info("Cleaned up all data folders")
@cli.command(
help="Census data download",
)
def census_data_download():
"""CLI command to download all census shape files from the Census FTP and extract the geojson
to generate national and by state Census Block Group CSVs"""
logger.info("Downloading census data")
data_path = settings.APP_ROOT / "data"
download_census_csvs(data_path)
logger.info("Completed downloading census data")
@cli.command(
help="Run all ETL processes or a specific one",
)
@click.option("-d", "--dataset", required=False, type=str)
def etl_run(dataset: str):
"""Run a specific or all ETL processes
Args:
dataset (str): Name of the ETL module to be run (optional)
Returns:
None
"""
etl_runner(dataset)
@cli.command(
help="Generate Score",
)
def score_run():
"""CLI command to generate the score"""
score_generate()
if __name__ == "__main__":
cli()

View file: config.py

@@ -0,0 +1,15 @@
from dynaconf import Dynaconf
from pathlib import Path
settings = Dynaconf(
envvar_prefix="DYNACONF",
settings_files=["settings.toml", ".secrets.toml"],
environments=True,
)
# set root dir
settings.APP_ROOT = Path.cwd()
# To set an environment use:
# Linux/OSX: export ENV_FOR_DYNACONF=staging
# Windows: set ENV_FOR_DYNACONF=staging

View file: fips_states_2010.csv

@@ -0,0 +1,53 @@
fips,state_name,state_abbreviation,region,division
01,Alabama,AL,South,East South Central
02,Alaska,AK,West,Pacific
04,Arizona,AZ,West,Mountain
05,Arkansas,AR,South,West South Central
06,California,CA,West,Pacific
08,Colorado,CO,West,Mountain
09,Connecticut,CT,Northeast,New England
10,Delaware,DE,South,South Atlantic
11,District of Columbia,DC,South,South Atlantic
12,Florida,FL,South,South Atlantic
13,Georgia,GA,South,South Atlantic
15,Hawaii,HI,West,Pacific
16,Idaho,ID,West,Mountain
17,Illinois,IL,Midwest,East North Central
18,Indiana,IN,Midwest,East North Central
19,Iowa,IA,Midwest,West North Central
20,Kansas,KS,Midwest,West North Central
21,Kentucky,KY,South,East South Central
22,Louisiana,LA,South,West South Central
23,Maine,ME,Northeast,New England
24,Maryland,MD,South,South Atlantic
25,Massachusetts,MA,Northeast,New England
26,Michigan,MI,Midwest,East North Central
27,Minnesota,MN,Midwest,West North Central
28,Mississippi,MS,South,East South Central
29,Missouri,MO,Midwest,West North Central
30,Montana,MT,West,Mountain
31,Nebraska,NE,Midwest,West North Central
32,Nevada,NV,West,Mountain
33,New Hampshire,NH,Northeast,New England
34,New Jersey,NJ,Northeast,Middle Atlantic
35,New Mexico,NM,West,Mountain
36,New York,NY,Northeast,Middle Atlantic
37,North Carolina,NC,South,South Atlantic
38,North Dakota,ND,Midwest,West North Central
39,Ohio,OH,Midwest,East North Central
40,Oklahoma,OK,South,West South Central
41,Oregon,OR,West,Pacific
42,Pennsylvania,PA,Northeast,Middle Atlantic
44,Rhode Island,RI,Northeast,New England
45,South Carolina,SC,South,South Atlantic
46,South Dakota,SD,Midwest,West North Central
47,Tennessee,TN,South,East South Central
48,Texas,TX,South,West South Central
49,Utah,UT,West,Mountain
50,Vermont,VT,Northeast,New England
51,Virginia,VA,South,South Atlantic
53,Washington,WA,West,Pacific
54,West Virginia,WV,South,South Atlantic
55,Wisconsin,WI,Midwest,East North Central
56,Wyoming,WY,West,Mountain
72,Puerto Rico,PR,Puerto Rico,Puerto Rico

View file: etl/base.py

@@ -0,0 +1,63 @@
from pathlib import Path
from config import settings
from utils import unzip_file_from_url, remove_all_from_dir
class ExtractTransformLoad:
"""
A class used to instantiate an ETL object to retrieve and process data from
datasets.
Attributes:
DATA_PATH (pathlib.Path): Local path where all data will be stored
TMP_PATH (pathlib.Path): Local path where temporary data will be stored
GEOID_FIELD_NAME (str): The common column name for a Census Block Group identifier
GEOID_TRACT_FIELD_NAME (str): The common column name for a Census Tract identifier
"""
DATA_PATH: Path = settings.APP_ROOT / "data"
TMP_PATH: Path = DATA_PATH / "tmp"
GEOID_FIELD_NAME: str = "GEOID10"
GEOID_TRACT_FIELD_NAME: str = "GEOID10_TRACT"
def get_yaml_config(self) -> None:
"""Reads the YAML configuration file for the dataset and stores
the properies in the instance (upcoming feature)"""
pass
def check_ttl(self) -> None:
"""Checks if the ETL process can be run based on a the TLL value on the
YAML config (upcoming feature)"""
pass
def extract(
self, source_url: str = None, extract_path: Path = None
) -> None:
"""Extract the data from
a remote source. By default it provides code to get the file from a source url,
unzips it and stores it on an extract_path."""
# this can be accessed via super().extract()
if source_url and extract_path:
unzip_file_from_url(source_url, self.TMP_PATH, extract_path)
def transform(self) -> None:
"""Transform the data extracted into a format that can be consumed by the
score generator"""
raise NotImplementedError
def load(self) -> None:
"""Saves the transformed data in the specified local data folder or remote AWS S3
bucket"""
raise NotImplementedError
def cleanup(self) -> None:
"""Clears out any files stored in the TMP folder"""
remove_all_from_dir(self.TMP_PATH)

View file: etl/runner.py

@@ -0,0 +1,114 @@
import importlib
from etl.score.etl_score import ScoreETL
from etl.score.etl_score_post import PostScoreETL
def etl_runner(dataset_to_run: str = None) -> None:
"""Runs all etl processes or a specific one
Args:
dataset_to_run (str): Run a specific ETL process. If missing, runs all processes (optional)
Returns:
None
"""
# this list comes from YAMLs
dataset_list = [
{
"name": "census_acs",
"module_dir": "census_acs",
"class_name": "CensusACSETL",
},
{
"name": "ejscreen",
"module_dir": "ejscreen",
"class_name": "EJScreenETL",
},
{
"name": "housing_and_transportation",
"module_dir": "housing_and_transportation",
"class_name": "HousingTransportationETL",
},
{
"name": "hud_housing",
"module_dir": "hud_housing",
"class_name": "HudHousingETL",
},
{
"name": "calenviroscreen",
"module_dir": "calenviroscreen",
"class_name": "CalEnviroScreenETL",
},
{
"name": "hud_recap",
"module_dir": "hud_recap",
"class_name": "HudRecapETL",
},
]
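# To run a single dataset, pass its "name" to the etl-run command; to add a new source, add an entry here with its module directory and ETL class name.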
if dataset_to_run:
dataset_element = next(
(item for item in dataset_list if item["name"] == dataset_to_run),
None,
)
if not dataset_element:
raise ValueError("Invalid dataset name")
else:
# reset the list to just the dataset
dataset_list = [dataset_element]
# Run the ETLs for the dataset_list
for dataset in dataset_list:
etl_module = importlib.import_module(
f"etl.sources.{dataset['module_dir']}.etl"
)
etl_class = getattr(etl_module, dataset["class_name"])
etl_instance = etl_class()
# run extract
etl_instance.extract()
# run transform
etl_instance.transform()
# run load
etl_instance.load()
# cleanup
etl_instance.cleanup()
# update the front end JSON/CSV of list of data sources
pass
def score_generate() -> None:
"""Generates the score and saves it on the local data directory
Args:
None
Returns:
None
"""
# Score Gen
score_gen = ScoreETL()
score_gen.extract()
score_gen.transform()
score_gen.load()
# Post Score Processing
score_post = PostScoreETL()
score_post.extract()
score_post.transform()
score_post.load()
score_post.cleanup()
def _find_dataset_index(dataset_list, key, value):
for i, element in enumerate(dataset_list):
if element[key] == value:
return i
return -1

View file: etl/score/etl_score.py

@@ -0,0 +1,410 @@
import collections
import functools
import pandas as pd
from etl.base import ExtractTransformLoad
from utils import get_module_logger
from etl.sources.census.etl_utils import get_state_fips_codes
logger = get_module_logger(__name__)
class ScoreETL(ExtractTransformLoad):
def __init__(self):
# Define some global parameters
self.BUCKET_SOCIOECONOMIC = "Socioeconomic Factors"
self.BUCKET_SENSITIVE = "Sensitive populations"
self.BUCKET_ENVIRONMENTAL = "Environmental effects"
self.BUCKET_EXPOSURES = "Exposures"
self.BUCKETS = [
self.BUCKET_SOCIOECONOMIC,
self.BUCKET_SENSITIVE,
self.BUCKET_ENVIRONMENTAL,
self.BUCKET_EXPOSURES,
]
# A few specific field names
# TODO: clean this up, I name some fields but not others.
self.UNEMPLOYED_FIELD_NAME = "Unemployed civilians (percent)"
self.LINGUISTIC_ISOLATION_FIELD_NAME = "Linguistic isolation (percent)"
self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
self.POVERTY_FIELD_NAME = (
"Poverty (Less than 200% of federal poverty line)"
)
self.HIGH_SCHOOL_FIELD_NAME = "Percent individuals age 25 or over with less than high school degree"
# There's another aggregation level (a second level of "buckets").
self.AGGREGATION_POLLUTION = "Pollution Burden"
self.AGGREGATION_POPULATION = "Population Characteristics"
self.PERCENTILE_FIELD_SUFFIX = " (percentile)"
self.MIN_MAX_FIELD_SUFFIX = " (min-max normalized)"
self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv" / "full"
# dataframes
self.df: pd.DataFrame
self.ejscreen_df: pd.DataFrame
self.census_df: pd.DataFrame
self.housing_and_transportation_df: pd.DataFrame
self.hud_housing_df: pd.DataFrame
def extract(self) -> None:
# Load EJSCREEN CSV
ejscreen_csv = self.DATA_PATH / "dataset" / "ejscreen_2019" / "usa.csv"
self.ejscreen_df = pd.read_csv(
ejscreen_csv, dtype={"ID": "string"}, low_memory=False
)
self.ejscreen_df.rename(
columns={"ID": self.GEOID_FIELD_NAME}, inplace=True
)
# Load census data
census_csv = self.DATA_PATH / "dataset" / "census_acs_2019" / "usa.csv"
self.census_df = pd.read_csv(
census_csv,
dtype={self.GEOID_FIELD_NAME: "string"},
low_memory=False,
)
# Load housing and transportation data
housing_and_transportation_index_csv = (
self.DATA_PATH
/ "dataset"
/ "housing_and_transportation_index"
/ "usa.csv"
)
self.housing_and_transportation_df = pd.read_csv(
housing_and_transportation_index_csv,
dtype={self.GEOID_FIELD_NAME: "string"},
low_memory=False,
)
# Load HUD housing data
hud_housing_csv = self.DATA_PATH / "dataset" / "hud_housing" / "usa.csv"
self.hud_housing_df = pd.read_csv(
hud_housing_csv,
dtype={self.GEOID_TRACT_FIELD_NAME: "string"},
low_memory=False,
)
def transform(self) -> None:
logger.info(f"Transforming Score Data")
# Join all the data sources that use census block groups
census_block_group_dfs = [
self.ejscreen_df,
self.census_df,
self.housing_and_transportation_df,
]
census_block_group_df = functools.reduce(
lambda left, right: pd.merge(
left=left, right=right, on=self.GEOID_FIELD_NAME, how="outer"
),
census_block_group_dfs,
)
# Sanity check the join.
if (
len(census_block_group_df[self.GEOID_FIELD_NAME].str.len().unique())
!= 1
):
raise ValueError(
f"One of the input CSVs uses {self.GEOID_FIELD_NAME} with a different length."
)
# Join all the data sources that use census tracts
# TODO: when there's more than one data source using census tract, reduce/merge them here.
census_tract_df = self.hud_housing_df
# Calculate the tract for the CBG data.
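# (A block group GEOID10 has 12 digits: 2 state + 3 county + 6 tract + 1 block group, so the first 11 digits identify the tract.)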
census_block_group_df[
self.GEOID_TRACT_FIELD_NAME
] = census_block_group_df[self.GEOID_FIELD_NAME].str[0:11]
self.df = census_block_group_df.merge(
census_tract_df, on=self.GEOID_TRACT_FIELD_NAME
)
if len(census_block_group_df) > 220333:
raise ValueError("Too many rows in the join.")
# Define a named tuple that will be used for each data set input.
DataSet = collections.namedtuple(
typename="DataSet",
field_names=["input_field", "renamed_field", "bucket"],
)
data_sets = [
# The following data sets have `bucket=None`, because it's not used in the bucket based score ("Score C").
DataSet(
input_field=self.GEOID_FIELD_NAME,
# Use the name `GEOID10` to enable geoplatform.gov's workflow.
renamed_field=self.GEOID_FIELD_NAME,
bucket=None,
),
DataSet(
input_field=self.HOUSING_BURDEN_FIELD_NAME,
renamed_field=self.HOUSING_BURDEN_FIELD_NAME,
bucket=None,
),
DataSet(
input_field="ACSTOTPOP",
renamed_field="Total population",
bucket=None,
),
# The following data sets have buckets, because they're used in the score
DataSet(
input_field="CANCER",
renamed_field="Air toxics cancer risk",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="RESP",
renamed_field="Respiratory hazard index",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="DSLPM",
renamed_field="Diesel particulate matter",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="PM25",
renamed_field="Particulate matter (PM2.5)",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="OZONE",
renamed_field="Ozone",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="PTRAF",
renamed_field="Traffic proximity and volume",
bucket=self.BUCKET_EXPOSURES,
),
DataSet(
input_field="PRMP",
renamed_field="Proximity to RMP sites",
bucket=self.BUCKET_ENVIRONMENTAL,
),
DataSet(
input_field="PTSDF",
renamed_field="Proximity to TSDF sites",
bucket=self.BUCKET_ENVIRONMENTAL,
),
DataSet(
input_field="PNPL",
renamed_field="Proximity to NPL sites",
bucket=self.BUCKET_ENVIRONMENTAL,
),
DataSet(
input_field="PWDIS",
renamed_field="Wastewater discharge",
bucket=self.BUCKET_ENVIRONMENTAL,
),
DataSet(
input_field="PRE1960PCT",
renamed_field="Percent pre-1960s housing (lead paint indicator)",
bucket=self.BUCKET_ENVIRONMENTAL,
),
DataSet(
input_field="UNDER5PCT",
renamed_field="Individuals under 5 years old",
bucket=self.BUCKET_SENSITIVE,
),
DataSet(
input_field="OVER64PCT",
renamed_field="Individuals over 64 years old",
bucket=self.BUCKET_SENSITIVE,
),
DataSet(
input_field=self.LINGUISTIC_ISOLATION_FIELD_NAME,
renamed_field=self.LINGUISTIC_ISOLATION_FIELD_NAME,
bucket=self.BUCKET_SENSITIVE,
),
DataSet(
input_field="LINGISOPCT",
renamed_field="Percent of households in linguistic isolation",
bucket=self.BUCKET_SOCIOECONOMIC,
),
DataSet(
input_field="LOWINCPCT",
renamed_field=self.POVERTY_FIELD_NAME,
bucket=self.BUCKET_SOCIOECONOMIC,
),
DataSet(
input_field="LESSHSPCT",
renamed_field=self.HIGH_SCHOOL_FIELD_NAME,
bucket=self.BUCKET_SOCIOECONOMIC,
),
DataSet(
input_field=self.UNEMPLOYED_FIELD_NAME,
renamed_field=self.UNEMPLOYED_FIELD_NAME,
bucket=self.BUCKET_SOCIOECONOMIC,
),
DataSet(
input_field="ht_ami",
renamed_field="Housing + Transportation Costs % Income for the Regional Typical Household",
bucket=self.BUCKET_SOCIOECONOMIC,
),
]
# Rename columns:
renaming_dict = {
data_set.input_field: data_set.renamed_field
for data_set in data_sets
}
self.df.rename(
columns=renaming_dict,
inplace=True,
errors="raise",
)
columns_to_keep = [data_set.renamed_field for data_set in data_sets]
self.df = self.df[columns_to_keep]
# Convert all columns to numeric.
for data_set in data_sets:
# Skip GEOID_FIELD_NAME, because it's a string.
if data_set.renamed_field == self.GEOID_FIELD_NAME:
continue
self.df[f"{data_set.renamed_field}"] = pd.to_numeric(
self.df[data_set.renamed_field]
)
# calculate percentiles
for data_set in data_sets:
self.df[
f"{data_set.renamed_field}{self.PERCENTILE_FIELD_SUFFIX}"
] = self.df[data_set.renamed_field].rank(pct=True)
# Math:
# (
# Observed value
# - minimum of all values
# )
# divided by
# (
# Maximum of all values
# - minimum of all values
# )
for data_set in data_sets:
# Skip GEOID_FIELD_NAME, because it's a string.
if data_set.renamed_field == self.GEOID_FIELD_NAME:
continue
min_value = self.df[data_set.renamed_field].min(skipna=True)
max_value = self.df[data_set.renamed_field].max(skipna=True)
logger.info(
f"For data set {data_set.renamed_field}, the min value is {min_value} and the max value is {max_value}."
)
self.df[f"{data_set.renamed_field}{self.MIN_MAX_FIELD_SUFFIX}"] = (
self.df[data_set.renamed_field] - min_value
) / (max_value - min_value)
# Graph distributions and correlations.
min_max_fields = [
f"{data_set.renamed_field}{self.MIN_MAX_FIELD_SUFFIX}"
for data_set in data_sets
if data_set.renamed_field != self.GEOID_FIELD_NAME
]
# Calculate score "A" and score "B"
self.df["Score A"] = self.df[
[
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
]
].mean(axis=1)
self.df["Score B"] = (
self.df[
"Poverty (Less than 200% of federal poverty line) (percentile)"
]
* self.df[
"Percent individuals age 25 or over with less than high school degree (percentile)"
]
)
# Calculate "CalEnviroScreen for the US" score
# Average all the percentile values in each bucket into a single score for each of the four buckets.
for bucket in self.BUCKETS:
fields_in_bucket = [
f"{data_set.renamed_field}{self.PERCENTILE_FIELD_SUFFIX}"
for data_set in data_sets
if data_set.bucket == bucket
]
self.df[f"{bucket}"] = self.df[fields_in_bucket].mean(axis=1)
# Combine the score from the two Exposures and Environmental Effects buckets into a single score called "Pollution Burden". The math for this score is: (1.0 * Exposures Score + 0.5 * Environment Effects score) / 1.5.
self.df[self.AGGREGATION_POLLUTION] = (
1.0 * self.df[f"{self.BUCKET_EXPOSURES}"]
+ 0.5 * self.df[f"{self.BUCKET_ENVIRONMENTAL}"]
) / 1.5
# Average the score from the two Sensitive populations and Socioeconomic factors buckets into a single score called "Population Characteristics".
self.df[self.AGGREGATION_POPULATION] = self.df[
[f"{self.BUCKET_SENSITIVE}", f"{self.BUCKET_SOCIOECONOMIC}"]
].mean(axis=1)
# Multiply the "Pollution Burden" score and the "Population Characteristics" together to produce the cumulative impact score.
self.df["Score C"] = (
self.df[self.AGGREGATION_POLLUTION]
* self.df[self.AGGREGATION_POPULATION]
)
if len(census_block_group_df) > 220333:
raise ValueError("Too many rows in the join.")
fields_to_use_in_score = [
self.UNEMPLOYED_FIELD_NAME,
self.LINGUISTIC_ISOLATION_FIELD_NAME,
self.HOUSING_BURDEN_FIELD_NAME,
self.POVERTY_FIELD_NAME,
self.HIGH_SCHOOL_FIELD_NAME,
]
fields_min_max = [
f"{field}{self.MIN_MAX_FIELD_SUFFIX}"
for field in fields_to_use_in_score
]
fields_percentile = [
f"{field}{self.PERCENTILE_FIELD_SUFFIX}"
for field in fields_to_use_in_score
]
# Calculate "Score D", which uses min-max normalization
# and calculate "Score E", which uses percentile normalization for the same fields
self.df["Score D"] = self.df[fields_min_max].mean(axis=1)
self.df["Score E"] = self.df[fields_percentile].mean(axis=1)
# Calculate correlations
self.df[fields_min_max].corr()
# Create percentiles for the scores
for score_field in [
"Score A",
"Score B",
"Score C",
"Score D",
"Score E",
]:
self.df[f"{score_field}{self.PERCENTILE_FIELD_SUFFIX}"] = self.df[
score_field
].rank(pct=True)
self.df[f"{score_field} (top 25th percentile)"] = (
self.df[f"{score_field}{self.PERCENTILE_FIELD_SUFFIX}"] >= 0.75
)
def load(self) -> None:
logger.info(f"Saving Score CSV")
# write nationwide csv
self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
self.df.to_csv(self.SCORE_CSV_PATH / "usa.csv", index=False)

View file: etl/score/etl_score_post.py

@@ -0,0 +1,112 @@
import pandas as pd
from etl.base import ExtractTransformLoad
from utils import get_module_logger
logger = get_module_logger(__name__)
class PostScoreETL(ExtractTransformLoad):
"""
A class used to post-process the score: it merges county and state
information into the score CSV and saves the columns needed for tiles.
"""
def __init__(self):
self.CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
self.CENSUS_COUNTIES_TXT = self.TMP_PATH / "Gaz_counties_national.txt"
self.CENSUS_COUNTIES_COLS = ["USPS", "GEOID", "NAME"]
self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv"
self.STATE_CSV = (
self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
)
self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
self.TILE_SCORE_CSV = self.SCORE_CSV_PATH / "tile" / "usa.csv"
self.TILES_SCORE_COLUMNS = [
"GEOID10",
"Score E (percentile)",
"Score E (top 25th percentile)",
"GEOID",
"State Abbreviation",
"County Name",
]
self.TILES_SCORE_CSV_PATH = self.SCORE_CSV_PATH / "tiles"
self.TILES_SCORE_CSV = self.TILES_SCORE_CSV_PATH / "usa.csv"
self.counties_df: pd.DataFrame
self.states_df: pd.DataFrame
self.score_df: pd.DataFrame
self.score_county_state_merged: pd.DataFrame
self.score_for_tiles: pd.DataFrame
def extract(self) -> None:
super().extract(
self.CENSUS_COUNTIES_ZIP_URL,
self.TMP_PATH,
)
logger.info(f"Reading Counties CSV")
self.counties_df = pd.read_csv(
self.CENSUS_COUNTIES_TXT,
sep="\t",
dtype={"GEOID": "string", "USPS": "string"},
low_memory=False,
encoding="latin-1",
)
logger.info(f"Reading States CSV")
self.states_df = pd.read_csv(
self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
)
self.score_df = pd.read_csv(
self.FULL_SCORE_CSV, dtype={"GEOID10": "string"}
)
def transform(self) -> None:
logger.info(f"Transforming data sources for Score + County CSV")
# rename some of the columns to prepare for merge
self.counties_df = self.counties_df[["USPS", "GEOID", "NAME"]]
self.counties_df.rename(
columns={"USPS": "State Abbreviation", "NAME": "County Name"},
inplace=True,
)
# remove unnecessary columns
self.states_df.rename(
columns={
"fips": "State Code",
"state_name": "State Name",
"state_abbreviation": "State Abbreviation",
},
inplace=True,
)
self.states_df.drop(["region", "division"], axis=1, inplace=True)
# add a county GEOID column (the first 5 digits of the block group GEOID)
self.score_df["GEOID"] = self.score_df.GEOID10.str[:5]
# merge state and counties
county_state_merged = self.counties_df.join(
self.states_df, rsuffix=" Other"
)
del county_state_merged["State Abbreviation Other"]
# merge county and score
self.score_county_state_merged = self.score_df.join(
county_state_merged, rsuffix="_OTHER"
)
del self.score_county_state_merged["GEOID_OTHER"]
def load(self) -> None:
logger.info(f"Saving Full Score CSV with County Information")
self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
self.score_county_state_merged.to_csv(self.FULL_SCORE_CSV, index=False)
logger.info(f"Saving Tile Score CSV")
# TODO: check which are the columns we'll use
# Related to: https://github.com/usds/justice40-tool/issues/302
score_tiles = self.score_county_state_merged[self.TILES_SCORE_COLUMNS]
self.TILES_SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
score_tiles.to_csv(self.TILES_SCORE_CSV, index=False)

View file: etl/sources/calenviroscreen/etl.py

@@ -0,0 +1,69 @@
import pandas as pd
from etl.base import ExtractTransformLoad
from utils import get_module_logger
logger = get_module_logger(__name__)
class CalEnviroScreenETL(ExtractTransformLoad):
def __init__(self):
self.CALENVIROSCREEN_FTP_URL = "https://justice40-data.s3.amazonaws.com/CalEnviroScreen/CalEnviroScreen_4.0_2021.zip"
self.CALENVIROSCREEN_CSV = self.TMP_PATH / "CalEnviroScreen_4.0_2021.csv"
self.CSV_PATH = self.DATA_PATH / "dataset" / "calenviroscreen4"
# Defining some variable names
self.CALENVIROSCREEN_SCORE_FIELD_NAME = "calenviroscreen_score"
self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME = "calenviroscreen_percentile"
self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME = (
"calenviroscreen_priority_community"
)
# Choosing constants.
# None of these numbers is final; they are chosen just for comparison purposes.
self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD = 75
self.df: pd.DataFrame
def extract(self) -> None:
logger.info(f"Downloading CalEnviroScreen Data")
super().extract(
self.CALENVIROSCREEN_FTP_URL,
self.TMP_PATH,
)
def transform(self) -> None:
logger.info(f"Transforming CalEnviroScreen Data")
# Data from https://calenviroscreen-oehha.hub.arcgis.com/#Data, specifically:
# https://oehha.ca.gov/media/downloads/calenviroscreen/document/calenviroscreen40resultsdatadictionaryd12021.zip
# Load comparison index (CalEnviroScreen 4)
self.df = pd.read_csv(
self.CALENVIROSCREEN_CSV, dtype={"Census Tract": "string"}
)
self.df.rename(
columns={
"Census Tract": self.GEOID_TRACT_FIELD_NAME,
"DRAFT CES 4.0 Score": self.CALENVIROSCREEN_SCORE_FIELD_NAME,
"DRAFT CES 4.0 Percentile": self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME,
},
inplace=True,
)
# Add a leading "0" to the Census Tract to match our format in other data frames.
self.df[self.GEOID_TRACT_FIELD_NAME] = (
"0" + self.df[self.GEOID_TRACT_FIELD_NAME]
)
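# e.g. "6001400100" becomes "06001400100"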
# Mark tracts at or above the percentile threshold as priority communities
self.df[self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME] = (
self.df[self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME]
>= self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD
)
def load(self) -> None:
logger.info(f"Saving CalEnviroScreen CSV")
# write nationwide csv
self.CSV_PATH.mkdir(parents=True, exist_ok=True)
self.df.to_csv(self.CSV_PATH / f"data06.csv", index=False)

View file: etl/sources/census/etl.py

@@ -0,0 +1,111 @@
import csv
import os
import json
from pathlib import Path
from .etl_utils import get_state_fips_codes
from utils import unzip_file_from_url, get_module_logger
logger = get_module_logger(__name__)
def download_census_csvs(data_path: Path) -> None:
"""Download all census shape files from the Census FTP and extract the geojson
to generate national and by state Census Block Group CSVs
Args:
data_path (pathlib.Path): Name of the directory where the files and directories will
be created
Returns:
None
"""
# the fips_states_2010.csv is generated from data here
# https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html
state_fips_codes = get_state_fips_codes(data_path)
geojson_dir_path = data_path / "census" / "geojson"
for fips in state_fips_codes:
# check if file exists
shp_file_path = (
data_path / "census" / "shp" / fips / f"tl_2010_{fips}_bg10.shp"
)
logger.info(f"Checking if {fips} file exists")
if not os.path.isfile(shp_file_path):
logger.info(f"Downloading and extracting {fips} shape file")
# 2020 tiger data is here: https://www2.census.gov/geo/tiger/TIGER2020/BG/
# But using 2010 for now
cbg_state_url = f"https://www2.census.gov/geo/tiger/TIGER2010/BG/2010/tl_2010_{fips}_bg10.zip"
unzip_file_from_url(
cbg_state_url,
data_path / "tmp",
data_path / "census" / "shp" / fips,
)
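# Convert the shapefile to GeoJSON with GDAL's ogr2ogr (GDAL must be installed locally; see the README)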
cmd = (
f"ogr2ogr -f GeoJSON data/census/geojson/{fips}.json "
f"data/census/shp/{fips}/tl_2010_{fips}_bg10.shp"
)
os.system(cmd)
# generate CBG CSV table for pandas
## load in memory
cbg_national = [] # in-memory global list
cbg_per_state: dict = {} # in-memory dict per state
for file in os.listdir(geojson_dir_path):
if file.endswith(".json"):
logger.info(f"Ingesting geoid10 for file {file}")
with open(geojson_dir_path / file) as f:
geojson = json.load(f)
for feature in geojson["features"]:
geoid10 = feature["properties"]["GEOID10"]
cbg_national.append(str(geoid10))
geoid10_state_id = geoid10[:2]
if not cbg_per_state.get(geoid10_state_id):
cbg_per_state[geoid10_state_id] = []
cbg_per_state[geoid10_state_id].append(geoid10)
csv_dir_path = data_path / "census" / "csv"
## write to individual state csv
for state_id in cbg_per_state:
geoid10_list = cbg_per_state[state_id]
with open(
csv_dir_path / f"{state_id}.csv", mode="w", newline=""
) as cbg_csv_file:
cbg_csv_file_writer = csv.writer(
cbg_csv_file,
delimiter=",",
quotechar='"',
quoting=csv.QUOTE_MINIMAL,
)
for geoid10 in geoid10_list:
cbg_csv_file_writer.writerow(
[
geoid10,
]
)
## write US csv
with open(csv_dir_path / "us.csv", mode="w", newline="") as cbg_csv_file:
cbg_csv_file_writer = csv.writer(
cbg_csv_file,
delimiter=",",
quotechar='"',
quoting=csv.QUOTE_MINIMAL,
)
for geoid10 in cbg_national:
cbg_csv_file_writer.writerow(
[
geoid10,
]
)
logger.info("Census block groups downloading complete")

View file: etl/sources/census/etl_utils.py

@@ -0,0 +1,55 @@
from pathlib import Path
import csv
import os
from config import settings
from utils import (
remove_files_from_dir,
remove_all_dirs_from_dir,
unzip_file_from_url,
get_module_logger,
)
logger = get_module_logger(__name__)
def reset_data_directories(data_path: Path) -> None:
census_data_path = data_path / "census"
# csv
csv_path = census_data_path / "csv"
remove_files_from_dir(csv_path, ".csv")
# geojson
geojson_path = census_data_path / "geojson"
remove_files_from_dir(geojson_path, ".json")
# shp
shp_path = census_data_path / "shp"
remove_all_dirs_from_dir(shp_path)
def get_state_fips_codes(data_path: Path) -> list:
fips_csv_path = data_path / "census" / "csv" / "fips_states_2010.csv"
# check if file exists
if not os.path.isfile(fips_csv_path):
logger.info(f"Downloading fips from S3 repository")
unzip_file_from_url(
settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
data_path / "tmp",
data_path / "census" / "csv",
)
fips_state_list = []
with open(fips_csv_path) as csv_file:
csv_reader = csv.reader(csv_file, delimiter=",")
next(csv_reader)  # skip the header row
for row in csv_reader:
fips_state_list.append(row[0].strip())
return fips_state_list

View file: etl/sources/census_acs/etl.py

@@ -0,0 +1,108 @@
import pandas as pd
import censusdata
from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger
logger = get_module_logger(__name__)
class CensusACSETL(ExtractTransformLoad):
def __init__(self):
self.ACS_YEAR = 2019
self.OUTPUT_PATH = (
self.DATA_PATH / "dataset" / f"census_acs_{self.ACS_YEAR}"
)
self.UNEMPLOYED_FIELD_NAME = "Unemployed civilians (percent)"
self.LINGUISTIC_ISOLATION_FIELD_NAME = "Linguistic isolation (percent)"
self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME = (
"Linguistic isolation (total)"
)
self.LINGUISTIC_ISOLATION_FIELDS = [
"C16002_001E",
"C16002_004E",
"C16002_007E",
"C16002_010E",
"C16002_013E",
]
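# These variables come from ACS table C16002 (household limited English speaking status); C16002_001E is the all-households denominator.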
self.df: pd.DataFrame
def _fips_from_censusdata_censusgeo(
self, censusgeo: censusdata.censusgeo
) -> str:
"""Create a FIPS code from the proprietary censusgeo index."""
fips = "".join([value for (key, value) in censusgeo.params()])
return fips
def extract(self) -> None:
dfs = []
for fips in get_state_fips_codes(self.DATA_PATH):
logger.info(
f"Downloading data for state/territory with FIPS code {fips}"
)
dfs.append(
censusdata.download(
src="acs5",
year=self.ACS_YEAR,
geo=censusdata.censusgeo(
[("state", fips), ("county", "*"), ("block group", "*")]
),
var=[
# Employment fields
"B23025_005E",
"B23025_003E",
]
+ self.LINGUISTIC_ISOLATION_FIELDS,
)
)
self.df = pd.concat(dfs)
self.df[self.GEOID_FIELD_NAME] = self.df.index.to_series().apply(
func=self._fips_from_censusdata_censusgeo
)
def transform(self) -> None:
logger.info(f"Starting Census ACS Transform")
# Calculate percent unemployment.
# TODO: remove small-sample data that should be `None` instead of a high-variance fraction.
self.df[self.UNEMPLOYED_FIELD_NAME] = (
self.df.B23025_005E / self.df.B23025_003E
)
# Calculate linguistic isolation.
individual_limited_english_fields = [
"C16002_004E",
"C16002_007E",
"C16002_010E",
"C16002_013E",
]
self.df[self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME] = self.df[
individual_limited_english_fields
].sum(axis=1, skipna=True)
self.df[self.LINGUISTIC_ISOLATION_FIELD_NAME] = (
self.df[self.LINGUISTIC_ISOLATION_TOTAL_FIELD_NAME].astype(float)
/ self.df["C16002_001E"]
)
self.df[self.LINGUISTIC_ISOLATION_FIELD_NAME].describe()
def load(self) -> None:
logger.info(f"Saving Census ACS Data")
# make the output directory if it doesn't exist
self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
columns_to_include = [
self.GEOID_FIELD_NAME,
self.UNEMPLOYED_FIELD_NAME,
self.LINGUISTIC_ISOLATION_FIELD_NAME,
]
self.df[columns_to_include].to_csv(
path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False
)

View file: etl/sources/ejscreen/etl.py

@@ -0,0 +1,37 @@
import pandas as pd
from etl.base import ExtractTransformLoad
from utils import get_module_logger
logger = get_module_logger(__name__)
class EJScreenETL(ExtractTransformLoad):
def __init__(self):
self.EJSCREEN_FTP_URL = "https://gaftp.epa.gov/EJSCREEN/2019/EJSCREEN_2019_StatePctile.csv.zip"
self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctiles.csv"
self.CSV_PATH = self.DATA_PATH / "dataset" / "ejscreen_2019"
self.df: pd.DataFrame
def extract(self) -> None:
logger.info(f"Downloading EJScreen Data")
super().extract(
self.EJSCREEN_FTP_URL,
self.TMP_PATH,
)
def transform(self) -> None:
logger.info(f"Transforming EJScreen Data")
self.df = pd.read_csv(
self.EJSCREEN_CSV,
dtype={"ID": "string"},
# EJSCREEN writes the word "None" for NA data.
na_values=["None"],
low_memory=False,
)
def load(self) -> None:
logger.info(f"Saving EJScreen CSV")
# write nationwide csv
self.CSV_PATH.mkdir(parents=True, exist_ok=True)
self.df.to_csv(self.CSV_PATH / f"usa.csv", index=False)

View file: etl/sources/housing_and_transportation/etl.py

@@ -0,0 +1,62 @@
import pandas as pd
from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger, unzip_file_from_url
logger = get_module_logger(__name__)
class HousingTransportationETL(ExtractTransformLoad):
def __init__(self):
self.HOUSING_FTP_URL = (
"https://htaindex.cnt.org/download/download.php?focus=blkgrp&geoid="
)
self.OUTPUT_PATH = (
self.DATA_PATH / "dataset" / "housing_and_transportation_index"
)
self.df: pd.DataFrame
def extract(self) -> None:
# Download each state / territory individually
dfs = []
zip_file_dir = self.TMP_PATH / "housing_and_transportation_index"
for fips in get_state_fips_codes(self.DATA_PATH):
logger.info(
f"Downloading housing data for state/territory with FIPS code {fips}"
)
# Puerto Rico has no data, so skip
if fips == "72":
continue
unzip_file_from_url(
f"{self.HOUSING_FTP_URL}{fips}", self.TMP_PATH, zip_file_dir
)
# New file name:
tmp_csv_file_path = (
zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
)
tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
dfs.append(tmp_df)
self.df = pd.concat(dfs)
self.df.head()
def transform(self) -> None:
logger.info(f"Transforming Housing and Transportation Data")
# Rename and reformat block group ID
self.df.rename(columns={"blkgrp": self.GEOID_FIELD_NAME}, inplace=True)
self.df[self.GEOID_FIELD_NAME] = self.df[
self.GEOID_FIELD_NAME
].str.replace('"', "")
def load(self) -> None:
logger.info(f"Saving Housing and Transportation Data")
self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
self.df.to_csv(path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False)

View file: etl/sources/hud_housing/etl.py

@@ -0,0 +1,187 @@
import pandas as pd
from etl.base import ExtractTransformLoad
from etl.sources.census.etl_utils import get_state_fips_codes
from utils import get_module_logger, unzip_file_from_url, remove_all_from_dir
logger = get_module_logger(__name__)
class HudHousingETL(ExtractTransformLoad):
def __init__(self):
self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "hud_housing"
self.GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
self.HOUSING_FTP_URL = "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
self.HOUSING_ZIP_FILE_DIR = self.TMP_PATH / "hud_housing"
# We measure households earning less than 80% of HUD Area Median Family Income by census tract
# and paying greater than 30% of their income to housing costs.
self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME = "HOUSING_BURDEN_NUMERATOR"
self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = (
"HOUSING_BURDEN_DENOMINATOR"
)
# Note: some variable definitions.
# HUD-adjusted median family income (HAMFI).
# The four housing problems are: incomplete kitchen facilities, incomplete plumbing facilities, more than 1 person per room, and cost burden greater than 30%.
# Table 8 is the desired table.
self.df: pd.DataFrame
def extract(self) -> None:
logger.info(f"Extracting HUD Housing Data")
super().extract(
self.HOUSING_FTP_URL,
self.HOUSING_ZIP_FILE_DIR,
)
def transform(self) -> None:
logger.info(f"Transforming HUD Housing Data")
# New file name:
tmp_csv_file_path = (
self.HOUSING_ZIP_FILE_DIR
/ "2012thru2016-140-csv"
/ "2012thru2016-140-csv"
/ "140"
/ "Table8.csv"
)
self.df = pd.read_csv(
filepath_or_buffer=tmp_csv_file_path,
encoding="latin-1",
)
# Rename and reformat block group ID
self.df.rename(
columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True
)
# The CHAS data has census tract ids such as `14000US01001020100`,
# whereas the rest of our data uses `01001020100` for the same tract.
# Strip everything up to and including `US`:
self.df[self.GEOID_TRACT_FIELD_NAME] = self.df[
self.GEOID_TRACT_FIELD_NAME
].str.replace(r"^.*?US", "", regex=True)
# Calculate housing burden
# This is quite a number of steps. It does not appear to be accessible nationally in a simpler format, though.
# See "CHAS data dictionary 12-16.xlsx"
# Owner occupied numerator fields
OWNER_OCCUPIED_NUMERATOR_FIELDS = [
# Key: Column Name Line_Type Tenure Household income Cost burden Facilities
# T8_est7 Subtotal Owner occupied less than or equal to 30% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est7",
# T8_est10 Subtotal Owner occupied less than or equal to 30% of HAMFI greater than 50% All
"T8_est10",
# T8_est20 Subtotal Owner occupied greater than 30% but less than or equal to 50% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est20",
# T8_est23 Subtotal Owner occupied greater than 30% but less than or equal to 50% of HAMFI greater than 50% All
"T8_est23",
# T8_est33 Subtotal Owner occupied greater than 50% but less than or equal to 80% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est33",
# T8_est36 Subtotal Owner occupied greater than 50% but less than or equal to 80% of HAMFI greater than 50% All
"T8_est36",
]
# These rows have the values where HAMFI was not computed, b/c of no or negative income.
OWNER_OCCUPIED_NOT_COMPUTED_FIELDS = [
# Key: Column Name Line_Type Tenure Household income Cost burden Facilities
# T8_est13 Subtotal Owner occupied less than or equal to 30% of HAMFI not computed (no/negative income) All
"T8_est13",
# T8_est26 Subtotal Owner occupied greater than 30% but less than or equal to 50% of HAMFI not computed (no/negative income) All
"T8_est26",
# T8_est39 Subtotal Owner occupied greater than 50% but less than or equal to 80% of HAMFI not computed (no/negative income) All
"T8_est39",
# T8_est52 Subtotal Owner occupied greater than 80% but less than or equal to 100% of HAMFI not computed (no/negative income) All
"T8_est52",
# T8_est65 Subtotal Owner occupied greater than 100% of HAMFI not computed (no/negative income) All
"T8_est65",
]
# T8_est2 Subtotal Owner occupied All All All
OWNER_OCCUPIED_POPULATION_FIELD = "T8_est2"
# Renter occupied numerator fields
RENTER_OCCUPIED_NUMERATOR_FIELDS = [
# Key: Column Name Line_Type Tenure Household income Cost burden Facilities
# T8_est73 Subtotal Renter occupied less than or equal to 30% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est73",
# T8_est76 Subtotal Renter occupied less than or equal to 30% of HAMFI greater than 50% All
"T8_est76",
# T8_est86 Subtotal Renter occupied greater than 30% but less than or equal to 50% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est86",
# T8_est89 Subtotal Renter occupied greater than 30% but less than or equal to 50% of HAMFI greater than 50% All
"T8_est89",
# T8_est99 Subtotal Renter occupied greater than 50% but less than or equal to 80% of HAMFI greater than 30% but less than or equal to 50% All
"T8_est99",
# T8_est102 Subtotal Renter occupied greater than 50% but less than or equal to 80% of HAMFI greater than 50% All
"T8_est102",
]
# These rows have the values where HAMFI was not computed, b/c of no or negative income.
RENTER_OCCUPIED_NOT_COMPUTED_FIELDS = [
# Key: Column Name Line_Type Tenure Household income Cost burden Facilities
# T8_est79 Subtotal Renter occupied less than or equal to 30% of HAMFI not computed (no/negative income) All
"T8_est79",
# T8_est92 Subtotal Renter occupied greater than 30% but less than or equal to 50% of HAMFI not computed (no/negative income) All
"T8_est92",
# T8_est105 Subtotal Renter occupied greater than 50% but less than or equal to 80% of HAMFI not computed (no/negative income) All
"T8_est105",
# T8_est118 Subtotal Renter occupied greater than 80% but less than or equal to 100% of HAMFI not computed (no/negative income) All
"T8_est118",
# T8_est131 Subtotal Renter occupied greater than 100% of HAMFI not computed (no/negative income) All
"T8_est131",
]
# T8_est68 Subtotal Renter occupied All All All
RENTER_OCCUPIED_POPULATION_FIELD = "T8_est68"
# Math:
# (
# # of Owner Occupied Units Meeting Criteria
# + # of Renter Occupied Units Meeting Criteria
# )
# divided by
# (
# Total # of Owner Occupied Units
# + Total # of Renter Occupied Units
# - # of Owner Occupied Units with HAMFI Not Computed
# - # of Renter Occupied Units with HAMFI Not Computed
# )
self.df[self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME] = self.df[
OWNER_OCCUPIED_NUMERATOR_FIELDS
].sum(axis=1) + self.df[RENTER_OCCUPIED_NUMERATOR_FIELDS].sum(axis=1)
self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME] = (
self.df[OWNER_OCCUPIED_POPULATION_FIELD]
+ self.df[RENTER_OCCUPIED_POPULATION_FIELD]
- self.df[OWNER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)
- self.df[RENTER_OCCUPIED_NOT_COMPUTED_FIELDS].sum(axis=1)
)
# TODO: add small sample size checks
self.df[self.HOUSING_BURDEN_FIELD_NAME] = self.df[
self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME
].astype(float) / self.df[
self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME
].astype(
float
)
def load(self) -> None:
logger.info(f"Saving HUD Housing Data")
self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
# Drop unnecessary fields
self.df[
[
self.GEOID_TRACT_FIELD_NAME,
self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME,
self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME,
self.HOUSING_BURDEN_FIELD_NAME,
]
].to_csv(path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False)

View file: etl/sources/hud_recap/etl.py

@@ -0,0 +1,63 @@
import pandas as pd
import requests
from etl.base import ExtractTransformLoad
from utils import get_module_logger
logger = get_module_logger(__name__)
class HudRecapETL(ExtractTransformLoad):
def __init__(self):
self.HUD_RECAP_CSV_URL = "https://opendata.arcgis.com/api/v3/datasets/56de4edea8264fe5a344da9811ef5d6e_0/downloads/data?format=csv&spatialRefId=4326"
self.HUD_RECAP_CSV = (
self.TMP_PATH
/ "Racially_or_Ethnically_Concentrated_Areas_of_Poverty__R_ECAPs_.csv"
)
self.CSV_PATH = self.DATA_PATH / "dataset" / "hud_recap"
# Defining some variable names
self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME = "hud_recap_priority_community"
self.df: pd.DataFrame
def extract(self) -> None:
logger.info(f"Downloading HUD Recap Data")
download = requests.get(self.HUD_RECAP_CSV_URL, verify=None)
file_contents = download.content
csv_file = open(self.HUD_RECAP_CSV, "wb")
csv_file.write(file_contents)
csv_file.close()
def transform(self) -> None:
logger.info(f"Transforming HUD Recap Data")
# Load comparison index (CalEnviroScreen 4)
self.df = pd.read_csv(self.HUD_RECAP_CSV, dtype={"Census Tract": "string"})
self.df.rename(
columns={
"GEOID": self.GEOID_TRACT_FIELD_NAME,
# Interestingly, there's no data dictionary for the RECAP data that I could find.
# However, this site (http://www.schousing.com/library/Tax%20Credit/2020/QAP%20Instructions%20(2).pdf)
# suggests:
# "If RCAP_Current for the tract in which the site is located is 1, the tract is an R/ECAP. If RCAP_Current is 0, it is not."
"RCAP_Current": self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME,
},
inplace=True,
)
# Convert to boolean
self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME] = self.df[
self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME
].astype("bool")
self.df.sort_values(by=self.GEOID_TRACT_FIELD_NAME, inplace=True)
def load(self) -> None:
logger.info("Saving HUD Recap CSV")
# write nationwide csv
self.CSV_PATH.mkdir(parents=True, exist_ok=True)
self.df.to_csv(self.CSV_PATH / "usa.csv", index=False)
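# A minimal usage sketch, assuming the standard lifecycle defined on the
# ExtractTransformLoad base class:
#
#   etl = HudRecapETL()
#   etl.extract()
#   etl.transform()
#   etl.load()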


@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "7185e18d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import csv\n",
"from pathlib import Path\n",
"import os\n",
"import sys"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "174bbd09",
"metadata": {},
"outputs": [],
"source": [
"module_path = os.path.abspath(os.path.join(\"..\"))\n",
"if module_path not in sys.path:\n",
" sys.path.append(module_path)\n",
" \n",
"from utils import unzip_file_from_url\n",
"from etl.sources.census.etl_utils import get_state_fips_codes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd090fcc",
"metadata": {},
"outputs": [],
"source": [
"DATA_PATH = Path.cwd().parent / \"data\"\n",
"TMP_PATH: Path = DATA_PATH / \"tmp\"\n",
"STATE_CSV = DATA_PATH / \"census\" / \"csv\" / \"fips_states_2010.csv\"\n",
"SCORE_CSV = DATA_PATH / \"score\" / \"csv\" / \"usa.csv\"\n",
"COUNTY_SCORE_CSV = DATA_PATH / \"score\" / \"csv\" / \"usa-county.csv\"\n",
"CENSUS_COUNTIES_ZIP_URL = \"https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2020_Gazetteer/2020_Gaz_counties_national.zip\"\n",
"CENSUS_COUNTIES_TXT = TMP_PATH / \"2020_Gaz_counties_national.txt\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf2e266b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"unzip_file_from_url(CENSUS_COUNTIES_ZIP_URL, TMP_PATH, TMP_PATH)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ff96da8",
"metadata": {},
"outputs": [],
"source": [
"counties_df = pd.read_csv(CENSUS_COUNTIES_TXT, sep=\"\\t\", dtype={\"GEOID\": \"string\", \"USPS\": \"string\"}, low_memory=False)\n",
"counties_df = counties_df[['USPS', 'GEOID', 'NAME']]\n",
"counties_df.rename(columns={\"USPS\": \"State Abbreviation\", \"NAME\": \"County Name\"}, inplace=True)\n",
"counties_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5af103da",
"metadata": {},
"outputs": [],
"source": [
"states_df = pd.read_csv(STATE_CSV, dtype={\"fips\": \"string\", \"state_abbreviation\": \"string\"})\n",
"states_df.rename(columns={\"fips\": \"State Code\", \"state_name\": \"State Name\", \"state_abbreviation\": \"State Abbreviation\"}, inplace=True)\n",
"states_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8680258",
"metadata": {},
"outputs": [],
"source": [
"county_state_merged = counties_df.join(states_df, rsuffix=' Other')\n",
"del county_state_merged[\"State Abbreviation Other\"]\n",
"county_state_merged.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58dca55a",
"metadata": {},
"outputs": [],
"source": [
"score_df = pd.read_csv(SCORE_CSV, dtype={\"GEOID10\": \"string\"})\n",
"score_df[\"GEOID\"] = score_df.GEOID10.str[:5]\n",
"score_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45e04d42",
"metadata": {},
"outputs": [],
"source": [
"score_county_state_merged = score_df.join(county_state_merged, rsuffix='_OTHER')\n",
"del score_county_state_merged[\"GEOID_OTHER\"]\n",
"score_county_state_merged.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5a0b32b",
"metadata": {},
"outputs": [],
"source": [
"score_county_state_merged.to_csv(COUNTY_SCORE_CSV, index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b690937e",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@ -0,0 +1,901 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "54615cef",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Before running this script as it currently stands, you'll need to run these notebooks (in any order):\n",
"# * score_calc.ipynb\n",
"# * calenviroscreen_etl.ipynb\n",
"# * hud_recap_etl.ipynb\n",
"\n",
"import collections\n",
"import functools\n",
"import IPython\n",
"import numpy as np\n",
"import os\n",
"import pandas as pd\n",
"import pathlib\n",
"import pypandoc\n",
"import requests\n",
"import string\n",
"import sys\n",
"import typing\n",
"import us\n",
"import zipfile\n",
"\n",
"from datetime import datetime\n",
"from tqdm.notebook import tqdm_notebook\n",
"\n",
"module_path = os.path.abspath(os.path.join(\"..\"))\n",
"if module_path not in sys.path:\n",
" sys.path.append(module_path)\n",
"\n",
"from utils import remove_all_from_dir, get_excel_column_name\n",
"\n",
"# Turn on TQDM for pandas so that we can have progress bars when running `apply`.\n",
"tqdm_notebook.pandas()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49a63129",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Suppress scientific notation in pandas (this shows up for census tract IDs)\n",
"pd.options.display.float_format = \"{:.2f}\".format\n",
"\n",
"# Set some global parameters\n",
"DATA_DIR = pathlib.Path.cwd().parent / \"data\"\n",
"TEMP_DATA_DIR = pathlib.Path.cwd().parent / \"data\" / \"tmp\"\n",
"COMPARISON_OUTPUTS_DIR = TEMP_DATA_DIR / \"comparison_outputs\"\n",
"\n",
"# Make the dirs if they don't exist\n",
"TEMP_DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
"COMPARISON_OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"CEJST_PRIORITY_COMMUNITY_THRESHOLD = 0.75\n",
"\n",
"# Name fields using variables. (This makes it easy to reference the same fields frequently without using strings\n",
"# and introducing the risk of misspelling the field name.)\n",
"\n",
"GEOID_FIELD_NAME = \"GEOID10\"\n",
"GEOID_TRACT_FIELD_NAME = \"GEOID10_TRACT\"\n",
"GEOID_STATE_FIELD_NAME = \"GEOID10_STATE\"\n",
"CENSUS_BLOCK_GROUP_POPULATION_FIELD = \"Total population\"\n",
"\n",
"CEJST_SCORE_FIELD = \"cejst_score\"\n",
"CEJST_PERCENTILE_FIELD = \"cejst_percentile\"\n",
"CEJST_PRIORITY_COMMUNITY_FIELD = \"cejst_priority_community\"\n",
"\n",
"# Define some suffixes\n",
"POPULATION_SUFFIX = \" (priority population)\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b26dccf",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Load CEJST score data\n",
"cejst_data_path = DATA_DIR / \"score\" / \"csv\" / \"usa.csv\"\n",
"cejst_df = pd.read_csv(cejst_data_path, dtype={GEOID_FIELD_NAME: \"string\"})\n",
"\n",
"# score_used = \"Score A\"\n",
"\n",
"# # Rename unclear name \"id\" to \"census_block_group_id\", as well as other renamings.\n",
"# cejst_df.rename(\n",
"# columns={\n",
"# \"Total population\": CENSUS_BLOCK_GROUP_POPULATION_FIELD,\n",
"# score_used: CEJST_SCORE_FIELD,\n",
"# f\"{score_used} (percentile)\": CEJST_PERCENTILE_FIELD,\n",
"# },\n",
"# inplace=True,\n",
"# errors=\"raise\",\n",
"# )\n",
"\n",
"# Create the CBG's Census Tract ID by dropping the last number from the FIPS CODE of the CBG.\n",
"# The CBG ID is the last one character.\n",
"# For more information, see https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html.\n",
"cejst_df.loc[:, GEOID_TRACT_FIELD_NAME] = (\n",
" cejst_df.loc[:, GEOID_FIELD_NAME].astype(str).str[:-1]\n",
")\n",
"\n",
"cejst_df.loc[:, GEOID_STATE_FIELD_NAME] = (\n",
" cejst_df.loc[:, GEOID_FIELD_NAME].astype(str).str[0:2]\n",
")\n",
"\n",
"cejst_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08962382",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Load CalEnviroScreen 4.0\n",
"CALENVIROSCREEN_SCORE_FIELD = \"calenviroscreen_score\"\n",
"CALENVIROSCREEN_PERCENTILE_FIELD = \"calenviroscreen_percentile\"\n",
"CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD = \"calenviroscreen_priority_community\"\n",
"\n",
"calenviroscreen_data_path = DATA_DIR / \"dataset\" / \"calenviroscreen4\" / \"data06.csv\"\n",
"calenviroscreen_df = pd.read_csv(\n",
" calenviroscreen_data_path, dtype={GEOID_TRACT_FIELD_NAME: \"string\"}\n",
")\n",
"\n",
"# Convert priority community field to a bool.\n",
"calenviroscreen_df[CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD] = calenviroscreen_df[\n",
" CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD\n",
"].astype(bool)\n",
"\n",
"calenviroscreen_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42bd28d4",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Load HUD data\n",
"hud_recap_data_path = DATA_DIR / \"dataset\" / \"hud_recap\" / \"usa.csv\"\n",
"hud_recap_df = pd.read_csv(\n",
" hud_recap_data_path, dtype={GEOID_TRACT_FIELD_NAME: \"string\"}\n",
")\n",
"\n",
"hud_recap_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d77cd872",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Join all dataframes that use tracts\n",
"census_tract_dfs = [calenviroscreen_df, hud_recap_df]\n",
"\n",
"census_tract_df = functools.reduce(\n",
" lambda left, right: pd.merge(\n",
" left=left, right=right, on=GEOID_TRACT_FIELD_NAME, how=\"outer\"\n",
" ),\n",
" census_tract_dfs,\n",
")\n",
"\n",
"if census_tract_df[GEOID_TRACT_FIELD_NAME].str.len().unique() != [11]:\n",
" raise ValueError(\"Some of the census tract data has the wrong length.\")\n",
"\n",
"if len(census_tract_df) > 74134:\n",
" raise ValueError(\"Too many rows in the join.\")\n",
"\n",
"census_tract_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "813e5656",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Join tract indices and CEJST data.\n",
"# Note: we're joining on the census *tract*, so there will be multiple CBG entries joined to the same census tract row from CES,\n",
"# creating multiple rows of the same CES data.\n",
"merged_df = cejst_df.merge(\n",
" census_tract_df,\n",
" how=\"left\",\n",
" on=GEOID_TRACT_FIELD_NAME,\n",
")\n",
"\n",
"\n",
"if len(merged_df) > 220333:\n",
" raise ValueError(\"Too many rows in the join.\")\n",
"\n",
"merged_df.head()\n",
"\n",
"\n",
"# merged_df.to_csv(\n",
"# path_or_buf=COMPARISON_OUTPUTS_DIR / \"merged.csv\", na_rep=\"\", index=False\n",
"# )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a801121",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"cejst_priority_communities_fields = [\n",
" \"Score A (top 25th percentile)\",\n",
" \"Score B (top 25th percentile)\",\n",
" \"Score C (top 25th percentile)\",\n",
" \"Score D (top 25th percentile)\",\n",
" \"Score E (top 25th percentile)\",\n",
"]\n",
"\n",
"comparison_priority_communities_fields = [\n",
" \"calenviroscreen_priority_community\",\n",
" \"hud_recap_priority_community\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fef0da9",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"def get_state_distributions(\n",
" df: pd.DataFrame, priority_communities_fields: typing.List[str]\n",
") -> pd.DataFrame:\n",
" \"\"\"For each boolean field of priority communities, calculate distribution across states and territories.\"\"\"\n",
"\n",
" # Ensure each field is boolean.\n",
" for priority_communities_field in priority_communities_fields:\n",
" if df[priority_communities_field].dtype != bool:\n",
" print(f\"Converting {priority_communities_field} to boolean.\")\n",
"\n",
" # Calculate the population included as priority communities per CBG. Will either be 0 or the population.\n",
" df[f\"{priority_communities_field}{POPULATION_SUFFIX}\"] = (\n",
" df[priority_communities_field] * df[CENSUS_BLOCK_GROUP_POPULATION_FIELD]\n",
" )\n",
"\n",
" def calculate_state_comparison(frame: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" This method will be applied to a `group_by` object. Inherits some parameters from outer scope.\n",
" \"\"\"\n",
" state_id = frame[GEOID_STATE_FIELD_NAME].unique()[0]\n",
"\n",
" summary_dict = {}\n",
" summary_dict[GEOID_STATE_FIELD_NAME] = state_id\n",
" summary_dict[\"State name\"] = us.states.lookup(state_id).name\n",
" summary_dict[\"Total CBGs in state\"] = len(frame)\n",
" summary_dict[\"Total population in state\"] = frame[\n",
" CENSUS_BLOCK_GROUP_POPULATION_FIELD\n",
" ].sum()\n",
"\n",
" for priority_communities_field in priority_communities_fields:\n",
" summary_dict[f\"{priority_communities_field}{POPULATION_SUFFIX}\"] = frame[\n",
" f\"{priority_communities_field}{POPULATION_SUFFIX}\"\n",
" ].sum()\n",
"\n",
" summary_dict[f\"{priority_communities_field} (total CBGs)\"] = frame[\n",
" f\"{priority_communities_field}\"\n",
" ].sum()\n",
"\n",
" # Calculate some combinations of other variables.\n",
" summary_dict[f\"{priority_communities_field} (percent CBGs)\"] = (\n",
" summary_dict[f\"{priority_communities_field} (total CBGs)\"]\n",
" / summary_dict[\"Total CBGs in state\"]\n",
" )\n",
"\n",
" summary_dict[f\"{priority_communities_field} (percent population)\"] = (\n",
" summary_dict[f\"{priority_communities_field}{POPULATION_SUFFIX}\"]\n",
" / summary_dict[\"Total population in state\"]\n",
" )\n",
"\n",
" df = pd.DataFrame(summary_dict, index=[0])\n",
"\n",
" return df\n",
"\n",
" grouped_df = df.groupby(GEOID_STATE_FIELD_NAME)\n",
"\n",
" # Run the comparison function on the groups.\n",
" state_distribution_df = grouped_df.progress_apply(calculate_state_comparison)\n",
"\n",
" return state_distribution_df\n",
"\n",
"\n",
"def write_state_distribution_excel(\n",
" state_distribution_df: pd.DataFrame, file_path: pathlib.PosixPath\n",
") -> None:\n",
" \"\"\"Write the dataframe to excel with special formatting.\"\"\"\n",
" # Create a Pandas Excel writer using XlsxWriter as the engine.\n",
" writer = pd.ExcelWriter(file_path, engine=\"xlsxwriter\")\n",
"\n",
" # Convert the dataframe to an XlsxWriter Excel object. We also turn off the\n",
" # index column at the left of the output dataframe.\n",
" state_distribution_df.to_excel(writer, sheet_name=\"Sheet1\", index=False)\n",
"\n",
" # Get the xlsxwriter workbook and worksheet objects.\n",
" workbook = writer.book\n",
" worksheet = writer.sheets[\"Sheet1\"]\n",
" worksheet.autofilter(\n",
" 0, 0, state_distribution_df.shape[0], state_distribution_df.shape[1]\n",
" )\n",
"\n",
" for column in state_distribution_df.columns:\n",
" # Special formatting for columns that capture the percent of population considered priority.\n",
" if \"(percent population)\" in column:\n",
" # Turn the column index into excel ranges (e.g., column #95 is \"CR\" and the range may be \"CR2:CR53\").\n",
" column_index = state_distribution_df.columns.get_loc(column)\n",
" column_character = get_excel_column_name(column_index)\n",
" column_ranges = (\n",
" f\"{column_character}2:{column_character}{len(state_distribution_df)+1}\"\n",
" )\n",
"\n",
" # Add green to red conditional formatting.\n",
" worksheet.conditional_format(\n",
" column_ranges,\n",
" # Min: green, max: red.\n",
" {\n",
" \"type\": \"2_color_scale\",\n",
" \"min_color\": \"#00FF7F\",\n",
" \"max_color\": \"#C82538\",\n",
" },\n",
" )\n",
"\n",
" # TODO: text wrapping not working, fix.\n",
" text_wrap = workbook.add_format({\"text_wrap\": True})\n",
"\n",
" # Make these columns wide enough that you can read them.\n",
" worksheet.set_column(\n",
" f\"{column_character}:{column_character}\", 40, text_wrap\n",
" )\n",
"\n",
" writer.save()\n",
"\n",
"\n",
"state_distribution_df = get_state_distributions(\n",
" df=merged_df,\n",
" priority_communities_fields=cejst_priority_communities_fields\n",
" + comparison_priority_communities_fields,\n",
")\n",
"\n",
"state_distribution_df.to_csv(\n",
" path_or_buf=COMPARISON_OUTPUTS_DIR / \"Priority CBGs by state.csv\",\n",
" na_rep=\"\",\n",
" index=False,\n",
")\n",
"\n",
"write_state_distribution_excel(\n",
" state_distribution_df=state_distribution_df,\n",
" file_path=COMPARISON_OUTPUTS_DIR / \"Priority CBGs by state.xlsx\",\n",
")\n",
"\n",
"state_distribution_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d46667cf",
"metadata": {},
"outputs": [],
"source": [
"# This cell defines a couple of comparison functions. It does not run them.\n",
"\n",
"# Define a namedtuple for column names, which need to be shared between multiple parts of this comparison pipeline.\n",
"# Named tuples are useful here because they provide guarantees that for each instance, all properties are defined and\n",
"# can be accessed as properties (rather than as strings).\n",
"\n",
"# Note: if you'd like to add a field used throughout the comparison process, add it in three places.\n",
"# For an example `new_field`,\n",
"# 1. in this namedtuple, add the field as a string in `field_names` (e.g., `field_names=[..., \"new_field\"])`)\n",
"# 2. in the function `get_comparison_field_names`, define how the field name should be created from input data\n",
"# (e.g., `...new_field=f\"New field compares {method_a_name} to {method_b_name}\")\n",
"# 3. In the function `get_comparison_markdown_content`, add some reporting on the new field to the markdown content.\n",
"# (e.g., `The statistics indicate that {calculation_based_on_new_field} percent of census tracts are different between scores.`)\n",
"ComparisonFieldNames = collections.namedtuple(\n",
" typename=\"ComparisonFieldNames\",\n",
" field_names=[\n",
" \"any_tract_has_at_least_one_method_a_cbg\",\n",
" \"method_b_tract_has_at_least_one_method_a_cbg\",\n",
" \"method_b_tract_has_100_percent_method_a_cbg\",\n",
" \"method_b_non_priority_tract_has_at_least_one_method_a_cbg\",\n",
" \"method_b_non_priority_tract_has_100_percent_method_a_cbg\",\n",
" ],\n",
")\n",
"\n",
"# Define a namedtuple for indices.\n",
"Index = collections.namedtuple(\n",
" typename=\"Index\",\n",
" field_names=[\n",
" \"method_name\",\n",
" \"priority_communities_field\",\n",
" # Note: this field only used by indices defined at the census tract level.\n",
" \"other_census_tract_fields_to_keep\",\n",
" ],\n",
")\n",
"\n",
"\n",
"def get_comparison_field_names(\n",
" method_a_name: str,\n",
" method_b_name: str,\n",
") -> ComparisonFieldNames:\n",
" comparison_field_names = ComparisonFieldNames(\n",
" any_tract_has_at_least_one_method_a_cbg=(\n",
" f\"Any tract has at least one {method_a_name} Priority CBG?\"\n",
" ),\n",
" method_b_tract_has_at_least_one_method_a_cbg=(\n",
" f\"{method_b_name} priority tract has at least one {method_a_name} CBG?\"\n",
" ),\n",
" method_b_tract_has_100_percent_method_a_cbg=(\n",
" f\"{method_b_name} tract has 100% {method_a_name} priority CBGs?\"\n",
" ),\n",
" method_b_non_priority_tract_has_at_least_one_method_a_cbg=(\n",
" f\"Non-priority {method_b_name} tract has at least one {method_a_name} priority CBG?\"\n",
" ),\n",
" method_b_non_priority_tract_has_100_percent_method_a_cbg=(\n",
" f\"Non-priority {method_b_name} tract has 100% {method_a_name} priority CBGs?\"\n",
" ),\n",
" )\n",
" return comparison_field_names\n",
"\n",
"\n",
"def get_df_with_only_shared_states(\n",
" df: pd.DataFrame,\n",
" field_a: str,\n",
" field_b: str,\n",
" state_field=GEOID_STATE_FIELD_NAME,\n",
") -> pd.DataFrame:\n",
" \"\"\"\n",
" Useful for looking at shared geographies across two fields.\n",
"\n",
" For a data frame and two fields, return a data frame only for states where there are non-null\n",
" values for both fields in that state (or territory).\n",
"\n",
" This is useful, for example, when running a comparison of CalEnviroScreen (only in California) against\n",
" a draft score that's national, and returning only the data for California for the entire data frame.\n",
" \"\"\"\n",
" field_a_states = df.loc[df[field_a].notnull(), state_field].unique()\n",
" field_b_states = df.loc[df[field_b].notnull(), state_field].unique()\n",
"\n",
" shared_states = list(set(field_a_states) & set(field_b_states))\n",
"\n",
" df = df.loc[df[state_field].isin(shared_states), :]\n",
"\n",
" return df\n",
"\n",
"\n",
"def get_comparison_df(\n",
" df: pd.DataFrame,\n",
" method_a_priority_census_block_groups_field: str,\n",
" method_b_priority_census_tracts_field: str,\n",
" other_census_tract_fields_to_keep: typing.Optional[typing.List[str]],\n",
" comparison_field_names: ComparisonFieldNames,\n",
" output_dir: pathlib.PosixPath,\n",
") -> None:\n",
" \"\"\"Produces a comparison report for any two given boolean columns representing priority fields.\n",
"\n",
" Args:\n",
" df: a pandas dataframe including the data for this comparison.\n",
" method_a_priority_census_block_groups_field: the name of a boolean column in `df`, such as the CEJST priority\n",
" community field that defines communities at the level of census block groups (CBGs).\n",
" method_b_priority_census_tracts_field: the name of a boolean column in `df`, such as the CalEnviroScreen priority\n",
" community field that defines communities at the level of census tracts.\n",
" other_census_tract_fields_to_keep (optional): a list of field names to preserve at the census tract level\n",
"\n",
" Returns:\n",
" df: a pandas dataframe with one row with the results of this comparison\n",
" \"\"\"\n",
"\n",
" def calculate_comparison(frame: pd.DataFrame) -> pd.DataFrame:\n",
" \"\"\"\n",
" This method will be applied to a `group_by` object.\n",
"\n",
" Note: It inherits from outer scope `method_a_priority_census_block_groups_field`, `method_b_priority_census_tracts_field`,\n",
" and `other_census_tract_fields_to_keep`.\n",
" \"\"\"\n",
" # Keep all the tract values at the Census Tract Level\n",
" for field in other_census_tract_fields_to_keep:\n",
" if len(frame[field].unique()) != 1:\n",
" raise ValueError(\n",
" f\"There are different values per CBG for field {field}.\"\n",
" \"`other_census_tract_fields_to_keep` can only be used for fields at the census tract level.\"\n",
" )\n",
"\n",
" df = frame.loc[\n",
" frame.index[0],\n",
" [\n",
" GEOID_TRACT_FIELD_NAME,\n",
" method_b_priority_census_tracts_field,\n",
" ]\n",
" + other_census_tract_fields_to_keep,\n",
" ]\n",
"\n",
" # Convenience constant for whether the tract is or is not a method B priority community.\n",
" is_a_method_b_priority_tract = frame.loc[\n",
" frame.index[0], [method_b_priority_census_tracts_field]\n",
" ][0]\n",
"\n",
" # Recall that NaN values are not falsy, so we need to check if `is_a_method_b_priority_tract` is True.\n",
" is_a_method_b_priority_tract = is_a_method_b_priority_tract is True\n",
"\n",
" # Calculate whether the tract (whether or not it is a comparison priority tract) includes CBGs that are priority\n",
" # according to the current CBG score.\n",
" df[comparison_field_names.any_tract_has_at_least_one_method_a_cbg] = (\n",
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
" )\n",
"\n",
" # Calculate comparison\n",
" # A comparison priority tract has at least one CBG that is a priority CBG.\n",
" df[comparison_field_names.method_b_tract_has_at_least_one_method_a_cbg] = (\n",
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
" if is_a_method_b_priority_tract\n",
" else None\n",
" )\n",
"\n",
" # A comparison priority tract has all of its contained CBGs as CBG priority CBGs.\n",
" df[comparison_field_names.method_b_tract_has_100_percent_method_a_cbg] = (\n",
" frame.loc[:, method_a_priority_census_block_groups_field].mean() == 1\n",
" if is_a_method_b_priority_tract\n",
" else None\n",
" )\n",
"\n",
" # Calculate the inverse\n",
" # A tract that is _not_ a comparison priority has at least one CBG priority CBG.\n",
" df[\n",
" comparison_field_names.method_b_non_priority_tract_has_at_least_one_method_a_cbg\n",
" ] = (\n",
" frame.loc[:, method_a_priority_census_block_groups_field].sum() > 0\n",
" if not is_a_method_b_priority_tract\n",
" else None\n",
" )\n",
"\n",
" # A tract that is _not_ a comparison priority has all of its contained CBGs as CBG priority CBGs.\n",
" df[\n",
" comparison_field_names.method_b_non_priority_tract_has_100_percent_method_a_cbg\n",
" ] = (\n",
" frame.loc[:, method_a_priority_census_block_groups_field].mean() == 1\n",
" if not is_a_method_b_priority_tract\n",
" else None\n",
" )\n",
"\n",
" return df\n",
"\n",
" # Group all data by the census tract.\n",
" grouped_df = df.groupby(GEOID_TRACT_FIELD_NAME)\n",
"\n",
" # Run the comparison function on the groups.\n",
" comparison_df = grouped_df.progress_apply(calculate_comparison)\n",
"\n",
" return comparison_df\n",
"\n",
"\n",
"def get_comparison_markdown_content(\n",
" original_df: pd.DataFrame,\n",
" comparison_df: pd.DataFrame,\n",
" comparison_field_names: ComparisonFieldNames,\n",
" method_a_name: str,\n",
" method_b_name: str,\n",
" method_a_priority_census_block_groups_field: str,\n",
" method_b_priority_census_tracts_field: str,\n",
" state_field: str = GEOID_STATE_FIELD_NAME,\n",
") -> str:\n",
" # Prepare some constants for use in the following Markdown content.\n",
" total_cbgs = len(original_df)\n",
"\n",
" # List of all states/territories in their FIPS codes:\n",
" state_ids = sorted(original_df[state_field].unique())\n",
" state_names = \", \".join([us.states.lookup(state_id).name for state_id in state_ids])\n",
"\n",
" # Note: using squeeze throughout do reduce result of `sum()` to a scalar.\n",
" # TODO: investigate why sums are sometimes series and sometimes scalar.\n",
" method_a_priority_cbgs = (\n",
" original_df.loc[:, method_a_priority_census_block_groups_field].sum().squeeze()\n",
" )\n",
" method_a_priority_cbgs_percent = f\"{method_a_priority_cbgs / total_cbgs:.0%}\"\n",
"\n",
" total_tracts_count = len(comparison_df)\n",
"\n",
" method_b_priority_tracts_count = comparison_df.loc[\n",
" :, method_b_priority_census_tracts_field\n",
" ].sum()\n",
"\n",
" method_b_priority_tracts_count_percent = (\n",
" f\"{method_b_priority_tracts_count / total_tracts_count:.0%}\"\n",
" )\n",
" method_b_non_priority_tracts_count = (\n",
" total_tracts_count - method_b_priority_tracts_count\n",
" )\n",
"\n",
" method_a_tracts_count = (\n",
" comparison_df.loc[\n",
" :, comparison_field_names.any_tract_has_at_least_one_method_a_cbg\n",
" ]\n",
" .sum()\n",
" .squeeze()\n",
" )\n",
" method_a_tracts_count_percent = f\"{method_a_tracts_count / total_tracts_count:.0%}\"\n",
"\n",
" # Method A priority community stats\n",
" method_b_tracts_with_at_least_one_method_a_cbg = comparison_df.loc[\n",
" :, comparison_field_names.method_b_tract_has_at_least_one_method_a_cbg\n",
" ].sum()\n",
" method_b_tracts_with_at_least_one_method_a_cbg_percent = f\"{method_b_tracts_with_at_least_one_method_a_cbg / method_b_priority_tracts_count:.0%}\"\n",
"\n",
" method_b_tracts_with_at_100_percent_method_a_cbg = comparison_df.loc[\n",
" :, comparison_field_names.method_b_tract_has_100_percent_method_a_cbg\n",
" ].sum()\n",
" method_b_tracts_with_at_100_percent_method_a_cbg_percent = f\"{method_b_tracts_with_at_100_percent_method_a_cbg / method_b_priority_tracts_count:.0%}\"\n",
"\n",
" # Method A non-priority community stats\n",
" method_b_non_priority_tracts_with_at_least_one_method_a_cbg = comparison_df.loc[\n",
" :,\n",
" comparison_field_names.method_b_non_priority_tract_has_at_least_one_method_a_cbg,\n",
" ].sum()\n",
"\n",
" method_b_non_priority_tracts_with_at_least_one_method_a_cbg_percent = f\"{method_b_non_priority_tracts_with_at_least_one_method_a_cbg / method_b_non_priority_tracts_count:.0%}\"\n",
"\n",
" method_b_non_priority_tracts_with_100_percent_method_a_cbg = comparison_df.loc[\n",
" :,\n",
" comparison_field_names.method_b_non_priority_tract_has_100_percent_method_a_cbg,\n",
" ].sum()\n",
" method_b_non_priority_tracts_with_100_percent_method_a_cbg_percent = f\"{method_b_non_priority_tracts_with_100_percent_method_a_cbg / method_b_non_priority_tracts_count:.0%}\"\n",
"\n",
" # Create markdown content for comparisons.\n",
" markdown_content = f\"\"\"\n",
"# {method_a_name} compared to {method_b_name}\n",
"\n",
"(This report was calculated on {datetime.today().strftime('%Y-%m-%d')}.)\n",
"\n",
"This report analyzes the following US states and territories: {state_names}.\n",
"\n",
"Recall that census tracts contain one or more census block groups, with up to nine census block groups per tract.\n",
"\n",
"Within the geographic area analyzed, there are {method_b_priority_tracts_count} census tracts designated as priority communities by {method_b_name}, out of {total_tracts_count} total tracts ({method_b_priority_tracts_count_percent}). \n",
"\n",
"Within the geographic region analyzed, there are {method_a_priority_cbgs} census block groups considered as priority communities by {method_a_name}, out of {total_cbgs} CBGs ({method_a_priority_cbgs_percent}). They occupy {method_a_tracts_count} census tracts ({method_a_tracts_count_percent}) of the geographic area analyzed.\n",
"\n",
"Out of every {method_b_name} priority census tract, {method_b_tracts_with_at_least_one_method_a_cbg} ({method_b_tracts_with_at_least_one_method_a_cbg_percent}) of these census tracts have at least one census block group within them that is considered a priority community by {method_a_name}.\n",
"\n",
"Out of every {method_b_name} priority census tract, {method_b_tracts_with_at_100_percent_method_a_cbg} ({method_b_tracts_with_at_100_percent_method_a_cbg_percent}) of these census tracts have 100% of the included census block groups within them considered priority communities by {method_a_name}.\n",
"\n",
"Out of every census tract that is __not__ marked as a priority community by {method_b_name}, {method_b_non_priority_tracts_with_at_least_one_method_a_cbg} ({method_b_non_priority_tracts_with_at_least_one_method_a_cbg_percent}) of these census tracts have at least one census block group within them that is considered a priority community by the current version of the CEJST score.\n",
"\n",
"Out of every census tract that is __not__ marked as a priority community by {method_b_name}, {method_b_non_priority_tracts_with_100_percent_method_a_cbg} ({method_b_non_priority_tracts_with_100_percent_method_a_cbg_percent}) of these census tracts have 100% of the included census block groups within them considered priority communities by the current version of the CEJST score.\n",
"\"\"\"\n",
"\n",
" return markdown_content\n",
"\n",
"\n",
"def write_markdown_and_docx_content(\n",
" markdown_content: str, file_dir: pathlib.PosixPath, file_name_without_extension: str\n",
") -> pathlib.PosixPath:\n",
" \"\"\"Write Markdown content to both .md and .docx files.\"\"\"\n",
" # Set the file paths for both files.\n",
" markdown_file_path = file_dir / f\"{file_name_without_extension}.md\"\n",
" docx_file_path = file_dir / f\"{file_name_without_extension}.docx\"\n",
"\n",
" # Write the markdown content to file.\n",
" with open(markdown_file_path, \"w\") as text_file:\n",
" text_file.write(markdown_content)\n",
"\n",
" # Convert markdown file to Word doc.\n",
" pypandoc.convert_file(\n",
" source_file=str(markdown_file_path),\n",
" to=\"docx\",\n",
" outputfile=str(docx_file_path),\n",
" extra_args=[],\n",
" )\n",
"\n",
" return docx_file_path\n",
"\n",
"\n",
"def execute_comparison(\n",
" df: pd.DataFrame,\n",
" method_a_name: str,\n",
" method_b_name: str,\n",
" method_a_priority_census_block_groups_field: str,\n",
" method_b_priority_census_tracts_field: str,\n",
" other_census_tract_fields_to_keep: typing.Optional[typing.List[str]],\n",
") -> pathlib.PosixPath:\n",
" \"\"\"Execute an individual comparison by creating the data frame and writing the report.\n",
"\n",
" Args:\n",
" df: a pandas dataframe including the data for this comparison.\n",
" method_a_priority_census_block_groups_field: the name of a boolean column in `df`, such as the CEJST priority\n",
" community field that defines communities at the level of census block groups (CBGs).\n",
" method_b_priority_census_tracts_field: the name of a boolean column in `df`, such as the CalEnviroScreen priority\n",
" community field that defines communities at the level of census tracts.\n",
" other_census_tract_fields_to_keep (optional): a list of field names to preserve at the census tract level\n",
"\n",
" Returns:\n",
" df: a pandas dataframe with one row with the results of this comparison\n",
"\n",
" \"\"\"\n",
" comparison_field_names = get_comparison_field_names(\n",
" method_a_name=method_a_name, method_b_name=method_b_name\n",
" )\n",
"\n",
" # Create or use a directory for outputs grouped by Method A.\n",
" output_dir = COMPARISON_OUTPUTS_DIR / method_a_name\n",
" output_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
" df_with_only_shared_states = get_df_with_only_shared_states(\n",
" df=df,\n",
" field_a=method_a_priority_census_block_groups_field,\n",
" field_b=method_b_priority_census_tracts_field,\n",
" )\n",
"\n",
" comparison_df = get_comparison_df(\n",
" df=df_with_only_shared_states,\n",
" method_a_priority_census_block_groups_field=method_a_priority_census_block_groups_field,\n",
" method_b_priority_census_tracts_field=method_b_priority_census_tracts_field,\n",
" comparison_field_names=comparison_field_names,\n",
" other_census_tract_fields_to_keep=other_census_tract_fields_to_keep,\n",
" output_dir=output_dir,\n",
" )\n",
"\n",
" # Choose output path\n",
" file_path = (\n",
" output_dir / f\"Comparison Output - {method_a_name} and {method_b_name}.csv\"\n",
" )\n",
"\n",
" # Write comparison to CSV.\n",
" comparison_df.to_csv(\n",
" path_or_buf=file_path,\n",
" na_rep=\"\",\n",
" index=False,\n",
" )\n",
"\n",
" markdown_content = get_comparison_markdown_content(\n",
" original_df=df_with_only_shared_states,\n",
" comparison_df=comparison_df,\n",
" comparison_field_names=comparison_field_names,\n",
" method_a_name=method_a_name,\n",
" method_b_name=method_b_name,\n",
" method_a_priority_census_block_groups_field=method_a_priority_census_block_groups_field,\n",
" method_b_priority_census_tracts_field=method_b_priority_census_tracts_field,\n",
" )\n",
"\n",
" comparison_docx_file_path = write_markdown_and_docx_content(\n",
" markdown_content=markdown_content,\n",
" file_dir=output_dir,\n",
" file_name_without_extension=f\"Comparison report - {method_a_name} and {method_b_name}\",\n",
" )\n",
"\n",
" return comparison_docx_file_path\n",
"\n",
"\n",
"def execute_comparisons(\n",
" df: pd.DataFrame,\n",
" census_block_group_indices: typing.List[Index],\n",
" census_tract_indices: typing.List[Index],\n",
"):\n",
" \"\"\"Create multiple comparison reports.\"\"\"\n",
" comparison_docx_file_paths = []\n",
" for cbg_index in census_block_group_indices:\n",
" for census_tract_index in census_tract_indices:\n",
" print(\n",
" f\"Running comparisons for {cbg_index.method_name} against {census_tract_index.method_name}...\"\n",
" )\n",
"\n",
" comparison_docx_file_path = execute_comparison(\n",
" df=df,\n",
" method_a_name=cbg_index.method_name,\n",
" method_b_name=census_tract_index.method_name,\n",
" method_a_priority_census_block_groups_field=cbg_index.priority_communities_field,\n",
" method_b_priority_census_tracts_field=census_tract_index.priority_communities_field,\n",
" other_census_tract_fields_to_keep=census_tract_index.other_census_tract_fields_to_keep,\n",
" )\n",
"\n",
" comparison_docx_file_paths.append(comparison_docx_file_path)\n",
"\n",
" return comparison_docx_file_paths"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48d9bf6b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Actually execute the functions\n",
"\n",
"# # California only\n",
"# cal_df = merged_df[merged_df[GEOID_TRACT_FIELD_NAME].astype(str).str[0:2] == \"06\"]\n",
"# # cal_df = cal_df[0:1000]\n",
"# print(len(cal_df))\n",
"\n",
"census_block_group_indices = [\n",
" Index(\n",
" method_name=\"Score A\",\n",
" priority_communities_field=\"Score A (top 25th percentile)\",\n",
" other_census_tract_fields_to_keep=[],\n",
" ),\n",
" # Index(\n",
" # method_name=\"Score B\",\n",
" # priority_communities_field=\"Score B (top 25th percentile)\",\n",
" # other_census_tract_fields_to_keep=[],\n",
" # ),\n",
" Index(\n",
" method_name=\"Score C\",\n",
" priority_communities_field=\"Score C (top 25th percentile)\",\n",
" other_census_tract_fields_to_keep=[],\n",
" ),\n",
" Index(\n",
" method_name=\"Score D\",\n",
" priority_communities_field=\"Score D (top 25th percentile)\",\n",
" other_census_tract_fields_to_keep=[],\n",
" ),\n",
" # Index(\n",
" # method_name=\"Score E\",\n",
" # priority_communities_field=\"Score E (top 25th percentile)\",\n",
" # other_census_tract_fields_to_keep=[],\n",
" # ),\n",
"]\n",
"\n",
"census_tract_indices = [\n",
" Index(\n",
" method_name=\"CalEnviroScreen 4.0\",\n",
" priority_communities_field=\"calenviroscreen_priority_community\",\n",
" other_census_tract_fields_to_keep=[\n",
" CALENVIROSCREEN_SCORE_FIELD,\n",
" CALENVIROSCREEN_PERCENTILE_FIELD,\n",
" ],\n",
" ),\n",
" Index(\n",
" method_name=\"HUD RECAP\",\n",
" priority_communities_field=\"hud_recap_priority_community\",\n",
" other_census_tract_fields_to_keep=[],\n",
" ),\n",
"]\n",
"\n",
"file_paths = execute_comparisons(\n",
" df=merged_df,\n",
" census_block_group_indices=census_block_group_indices,\n",
" census_tract_indices=census_tract_indices,\n",
")\n",
"\n",
"print(file_paths)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

data/data-pipeline/poetry.lock (generated, 1927 lines): diff suppressed because it is too large.


@ -0,0 +1,26 @@
[tool.poetry]
name = "score"
version = "0.1.0"
description = "ETL and Generation of Justice 40 Score"
authors = ["Your Name <you@example.com>"]
[tool.poetry.dependencies]
python = "^3.7.1"
CensusData = "^1.13"
click = "^8.0.1"
dynaconf = "^3.1.4"
ipython = "^7.24.1"
jupyter = "^1.0.0"
jupyter-contrib-nbextensions = "^0.5.1"
numpy = "^1.21.0"
pandas = "^1.2.5"
requests = "^2.25.1"
types-requests = "^2.25.0"
[tool.poetry.dev-dependencies]
mypy = "^0.910"
black = {version = "^21.6b0", allow-prereleases = true}
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"


@ -0,0 +1,83 @@
appnope==0.1.2; sys_platform == "darwin" and python_version >= "3.7"
argon2-cffi==20.1.0; python_version >= "3.6"
async-generator==1.10; python_full_version >= "3.6.1" and python_version >= "3.7"
attrs==21.2.0; python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.5"
backcall==0.2.0; python_version >= "3.7"
bleach==3.3.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
censusdata==1.13; python_version >= "2.7"
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
cffi==1.14.6; implementation_name == "pypy" and python_version >= "3.6"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
click==8.0.1; python_version >= "3.6"
colorama==0.4.4; python_version >= "3.7" and python_full_version < "3.0.0" and sys_platform == "win32" and platform_system == "Windows" or sys_platform == "win32" and python_version >= "3.7" and python_full_version >= "3.5.0" and platform_system == "Windows"
debugpy==1.3.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
decorator==5.0.9; python_version >= "3.7"
defusedxml==0.7.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
dynaconf==3.1.4
entrypoints==0.3; python_version >= "3.7"
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "2.7"
importlib-metadata==3.10.1; python_version < "3.8" and python_version >= "3.7"
ipykernel==6.0.1; python_version >= "3.7"
ipython-genutils==0.2.0; python_version >= "3.7"
ipython==7.25.0; python_version >= "3.7"
ipywidgets==7.6.3
jedi==0.18.0; python_version >= "3.7"
jinja2==3.0.1; python_version >= "3.7"
jsonschema==3.2.0; python_version >= "3.5"
jupyter-client==6.2.0; python_full_version >= "3.6.1" and python_version >= "3.7"
jupyter-console==6.4.0; python_version >= "3.6"
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.1
jupyter-core==4.7.1; python_full_version >= "3.6.1" and python_version >= "3.7"
jupyter-highlight-selected-word==0.2.0
jupyter-latex-envs==1.4.6
jupyter-nbextensions-configurator==0.4.1
jupyter==1.0.0
jupyterlab-pygments==0.1.2; python_version >= "3.7"
jupyterlab-widgets==1.0.0; python_version >= "3.6"
lxml==4.6.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
markupsafe==2.0.1; python_version >= "3.7"
matplotlib-inline==0.1.2; platform_system == "Darwin" and python_version >= "3.7"
mistune==0.8.4; python_version >= "3.7"
nbclient==0.5.3; python_full_version >= "3.6.1" and python_version >= "3.7"
nbconvert==6.1.0; python_version >= "3.7"
nbformat==5.1.3; python_full_version >= "3.6.1" and python_version >= "3.7"
nest-asyncio==1.5.1; python_full_version >= "3.6.1" and python_version >= "3.7"
notebook==6.4.0; python_version >= "3.6"
numpy==1.21.0; python_version >= "3.7"
packaging==21.0; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
pandas==1.3.0; python_full_version >= "3.7.1"
pandocfilters==1.4.3; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.7"
parso==0.8.2; python_version >= "3.7"
pexpect==4.8.0; sys_platform != "win32" and python_version >= "3.7"
pickleshare==0.7.5; python_version >= "3.7"
prometheus-client==0.11.0; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
prompt-toolkit==3.0.19; python_full_version >= "3.6.1" and python_version >= "3.7"
ptyprocess==0.7.0; sys_platform != "win32" and python_version >= "3.7" and os_name != "nt"
py==1.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and implementation_name == "pypy" or implementation_name == "pypy" and python_version >= "3.6" and python_full_version >= "3.4.0"
pycparser==2.20; python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0" and python_version >= "3.6"
pygments==2.9.0; python_version >= "3.7"
pyparsing==2.4.7; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
pyrsistent==0.18.0; python_version >= "3.6"
python-dateutil==2.8.1; python_full_version >= "3.7.1" and python_version >= "3.7"
pytz==2021.1; python_full_version >= "3.7.1" and python_version >= "2.7"
pywin32==301; sys_platform == "win32" and python_version >= "3.6"
pywinpty==1.1.3; os_name == "nt" and python_version >= "3.6"
pyyaml==5.4.1; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.6.0"
pyzmq==22.1.0; python_full_version >= "3.6.1" and python_version >= "3.7"
qtconsole==5.1.1; python_version >= "3.6"
qtpy==1.9.0; python_version >= "3.6"
requests==2.25.1; (python_version >= "2.7" and python_full_version < "3.0.0") or (python_full_version >= "3.5.0")
send2trash==1.7.1; python_version >= "3.6"
six==1.16.0; python_full_version >= "3.7.1" and python_version >= "3.7" and (python_version >= "3.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.6") and (python_version >= "3.5" and python_full_version < "3.0.0" or python_full_version >= "3.3.0" and python_version >= "3.5")
terminado==0.10.1; python_version >= "3.6"
testpath==0.5.0; python_version >= "3.7"
tornado==6.1; python_full_version >= "3.6.1" and python_version >= "3.7"
traitlets==5.0.5; python_full_version >= "3.6.1" and python_version >= "3.7"
types-requests==2.25.0
typing-extensions==3.10.0.0; python_version < "3.8" and python_version >= "3.6"
urllib3==1.26.6; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version < "4" and python_version >= "2.7"
wcwidth==0.2.5; python_full_version >= "3.6.1" and python_version >= "3.7"
webencodings==0.5.1; python_version >= "3.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0" and python_version >= "3.7"
widgetsnbextension==3.5.1
zipp==3.5.0; python_version < "3.8" and python_version >= "3.6"


@ -0,0 +1,8 @@
[default]
AWS_JUSTICE40_DATA_URL = "https://justice40-data.s3.amazonaws.com"
[development]
[staging]
[production]


@ -0,0 +1,63 @@
import logging
import os
from pathlib import Path
import shutil
from etl.sources.census.etl_utils import get_state_fips_codes
def generate_tiles(data_path: Path) -> None:
# remove existing mbtiles file
mb_tiles_path = data_path / "tiles" / "block2010.mbtiles"
if os.path.exists(mb_tiles_path):
os.remove(mb_tiles_path)
# remove existing mvt directory
mvt_tiles_path = data_path / "tiles" / "mvt"
if os.path.exists(mvt_tiles_path):
shutil.rmtree(mvt_tiles_path)
# remove existing score json files
score_geojson_dir = data_path / "score" / "geojson"
files_in_directory = os.listdir(score_geojson_dir)
filtered_files = [file for file in files_in_directory if file.endswith(".json")]
for file in filtered_files:
path_to_file = os.path.join(score_geojson_dir, file)
os.remove(path_to_file)
# join each state's shapefile with its score CSV via an ogr2ogr SQL query
state_fips_codes = get_state_fips_codes()
for fips in state_fips_codes:
cmd = (
"ogr2ogr -f GeoJSON "
+ f"-sql \"SELECT * FROM tl_2010_{fips}_bg10 LEFT JOIN 'data/score/csv/data{fips}.csv'.data{fips} ON tl_2010_{fips}_bg10.GEOID10 = data{fips}.ID\" "
+ f"data/score/geojson/{fips}.json data/census/shp/{fips}/tl_2010_{fips}_bg10.dbf"
)
os.system(cmd)
# get a list of all json files to plug in the docker commands below
# (workaround since *.json doesn't seem to work)
geojson_list = ""
geojson_path = data_path / "score" / "geojson"
for file in os.listdir(geojson_path):
if file.endswith(".json"):
geojson_list += f"data/score/geojson/{file} "
if geojson_list == "":
logging.error(
"No GeoJson files found. Please run scripts/download_cbg.py first"
)
# generate mbtiles file
cmd = (
"tippecanoe --drop-densest-as-needed -zg -o /home/data/tiles/block2010.mbtiles --extend-zooms-if-still-dropping -l cbg2010 -s_srs EPSG:4269 -t_srs EPSG:4326 "
+ geojson_list
)
os.system(cmd)
# generate mvts
cmd = (
"tippecanoe --drop-densest-as-needed --no-tile-compression -zg -e /home/data/tiles/mvt "
+ geojson_list
)
os.system(cmd)

data/data-pipeline/utils.py (1177 lines): diff suppressed because it is too large.

data/data-roadmap/README.md (new file, 153 lines)

@ -0,0 +1,153 @@
# Overview
This document describes our "data roadmap", which serves several purposes.
# Data roadmap goals
The goals of the data roadmap are as follows:
- Tracking data sets being considered for inclusion in the Climate and Economic Justice Screening Tool (CEJST), either as a data set that is included in the cumulative impacts score or a reference data set that is not included in the score
- Prioritizing data sets, so that it's obvious to developers working on the CEJST which data sets to incorporate next into the tool
- Gathering important details about each data set, such as its geographic resolution and the year it was last updated, so that the CEJST team can make informed decisions about what data to prioritize
- Tracking the problem areas that each data set relates to (e.g., a certain data set may relate to the problem of pesticide exposure amongst migrant farm workers)
- Enabling members of the public to submit ideas for problem areas or data sets to be considered for inclusion in the CEJST, with easy-to-use and accessible tools
- Enabling members of the public to submit revisions to the information about each problem area or data set, with easy-to-use and accessible tools
- Enabling the CEJST development team to review suggestions before incorporating them officially into the data roadmap, to filter out potential noise and spam, or consider how requests may lead to changes in software features and documentation
# User stories
These goals can map onto several user stories for the data roadmap, such as:
- As a community member, I want to suggest a new idea for a dataset.
- As a community member, I want to understand what happened with my suggestion for a new dataset.
- As a community member, I want to edit the details of a dataset proposal to add more information.
- As a WHEJAC board member, I want to vote on what data sources should be prioritized next.
- As a product manager, I want to filter based on characteristics of the data.
- As a developer, I want to know what to work on next.
# Data set descriptions
There are lots of details that are important to track for each data set. This
information helps us prepare to integrate a data set into the tool and prioritize
between different options for data in the data roadmap.
In order to support a process of peer review on edits and updates, these details are
tracked in one `YAML` file per data set description in the directory
[data_roadmap/data_set_descriptions](data_roadmap/data_set_descriptions).
Each data set description includes a number of fields, some of which are required.
The schema defining these fields is written in [Yamale](https://github.com/23andMe/Yamale)
and lives at [data_roadmap/data_set_description_schema.yaml](data_roadmap/data_set_description_schema.yaml).
Because `Yamale` does not provide a method for describing fields, we've created an
additional file that includes written descriptions of the meaning of each field in
the schema. These live in [data_roadmap/data_set_description_field_descriptions.yaml](data_roadmap/data_set_description_field_descriptions.yaml).
In order to provide a helpful starting point for people who are ready to contribute
ideas for a new data set for consideration, there is an auto-generated data set
description template that lives at [data_roadmap/data_set_description_template.yaml](data_roadmap/data_set_description_template.yaml).
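For illustration, a minimal data set description might look like the following (all field values here are invented; the schema and field descriptions above are the authoritative reference):

```yaml
name: Example air quality data set
source: https://www.example.gov/air-quality-data
relevance_to_environmental_justice: Fine particulate matter exposure is
  disproportionately concentrated in overburdened communities.
data_formats: CSV/XLSX
spatial_resolution: Census tract
public_status: Released
last_updated_date: 2021-01-01
```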
# Steps to add a new data set description: the "easy" way
Soon we will create a Google Form that contributors can use to submit ideas for new
data sets. The Google Form will match the schema of the data set descriptions. Please
see [this ticket](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/39)
for tracking this work.
# Steps to add a new data set description: the git-savvy way
For those who are comfortable using `git` and `YAML`, these are the steps to
contribute a new data set description to the data roadmap:
1. Research and learn about the data set you're proposing for consideration.
2. Clone the repository and learn about the [contribution guidelines for this
project](../docs/CONTRIBUTING.md).
3. In your local version of the repository, copy the template from
`data_roadmap/data_set_description_template.yaml` into a new file that lives in
`data_roadmap/data_set_descriptions` and has the name of the data set as the name of the file.
4. Edit this file to ensure it has all of the appropriate details about the data set.
5. If you'd like, you can run the validations in `run_validations_and_write_template`
to ensure your contribution is valid according to the schema. These checks will also
run automatically on each commit.
6. Create a pull request with your new data set description and submit it for peer
review.
Thank you for contributing!
# Tooling proposal and milestones
There is no single tool that supports all the goals and user stories described above.
Therefore we've proposed combining a number of tools in a way that can support them all.
We've also proposed various "milestones" that will allow us to iteratively and
sequentially build the data roadmap in a way that supports the entire vision but
starts with small and achievable steps. These milestones are proposed in order.
This work is most accurately tracked in [this epic](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/38).
We've also verbally described them below.
## Milestone: YAML files for data sets and linter (Done)
To start, we'll create a folder in this repository that can
house YAML files, one per data set. Each file will describe the characteristics of the data.
The benefit of using a YAML file for this is that it's easy to subject changes to these files to peer review through the pull request process. This allows external collaborators from the open source community to submit suggested changes, which can be reviewed by the core CEJST team.
We'll use a Python-based script to load all the files in the directory, and then run a schema validator to ensure all the files have valid entries.
For schema validation, we propose using [Yamale](https://github.com/23andMe/Yamale). This provides a lightweight schema and validator, and [integrates nicely with GitHub actions](https://github.com/nrkno/yaml-schema-validator-github-action).
If there's an improper format in any of the files, the schema validator will throw an error.
As part of this milestone, we will also set this up to run automatically with each commit to any branch as part of CI/CD.
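As a sketch, the validation script could look roughly like this (file paths are illustrative):

```python
import pathlib

import yamale

schema = yamale.make_schema("data_roadmap/data_set_description_schema.yaml")

# Validate every data set description in the directory against the schema.
for description_path in pathlib.Path("data_roadmap/data_set_descriptions").glob("*.yaml"):
    data = yamale.make_data(str(description_path))
    yamale.validate(schema, data)  # raises an error if the file is invalid
```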
## Milestone: Google forms integration
To make it easy for non-engineer members of the public and advisory bodies such as the WHEJAC to submit suggestions for data sets, we will configure a Google Form that maps to the schema of the data set files.
This will enable members of the public to fill out a simple form suggesting data without needing to understand GitHub or other engineering concepts.
At first, these responses can just go into a resulting Google Sheet and be manually copied and converted into data set description files. Later, we can write a script that converts new entries in the Google Sheet automatically into data set files; this can be set up to run as a trigger on the addition of new rows to the Google Sheet, as sketched below.
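A hypothetical conversion step (assuming the form fields are named to match the schema's field names) might look like:

```python
import pathlib

import yaml

def write_data_set_description(row: dict, output_dir: pathlib.Path) -> pathlib.Path:
    """Convert one Google Sheet row (a dict of form responses) into a YAML description file."""
    file_name = row["name"].lower().replace(" ", "_") + ".yaml"
    output_path = output_dir / file_name
    output_path.write_text(yaml.safe_dump(row, sort_keys=False))
    return output_path
```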
## Milestone: Post data in tabular format
Add a script that runs the schema validator on all files and, if successful, posts the results in a tabular format. There are straightforward packages to post a Python dictionary / `pandas` dataframe to Google Sheets and/or Airtable. As part of this milestone, we will also set this up to run automatically with each commit to `main` as part of CI/CD.
This will make it easier to filter the data to answer questions like, "Which data sources are available at the census block group level?"
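For example, collecting the validated descriptions into a single table could be as simple as the following sketch (the upload step to Google Sheets or Airtable is omitted, and the `spatial_resolution` value is invented):

```python
import pathlib

import pandas as pd
import yaml

# Load every data set description into one table, one row per data set.
descriptions = [
    yaml.safe_load(path.read_text())
    for path in pathlib.Path("data_roadmap/data_set_descriptions").glob("*.yaml")
]
roadmap_df = pd.DataFrame(descriptions)

# Example filter: which data sources are available at the census block group level?
cbg_sources = roadmap_df[roadmap_df["spatial_resolution"] == "Census block group"]
```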
## Milestone: Tickets created for incorporating data sets
For each data set that is being considered for inclusion soon in the tool, the project management team will create a ticket for "Incorporating \_\_\_ data set into the database", with a link to the data set detail document. This ticket will be created in the ticket tracking system used by the open source repository, which is ZenHub. This project management system will be public.
At the initial launch, we are not planning for members of the open source community to be able to create tickets, but we would like to consider a process for members of the open source community creating tickets that can go through review by the CEJST team.
This will help developers know what to work on next, and open source community members can also pick up tickets and work to integrate the data sets.
## Milestone: Add problem areas
We'll need to somehow track "problem areas" that describe problems in climate, environmental, and economic justice, even without specific proposals of data sets. For instance, a problem area may be "food insecurity", and a number of data sets can have this as their problem area.
We can change the linter to validate that every data set description maps to one or more known problem areas.
The benefit of this is that some non-data-focused members of the public or the WHEJAC advisory body may want to suggest we prioritize certain problem areas, with or without ideas for specific data sets that may best address that problem area.
The best path forward for implementing these problem area descriptions is not yet clear. One option is to create a folder for descriptions of problem areas, which contains YAML files that get validated according to a schema. Another option would be simply to add these as an array in the description of data sets, or to add labels to the tickets once data sets are tracked in GitHub tickets.
## Milestone: Add prioritization voting for the WHEJAC and members of the public
This milestone is currently the least well-defined. It's important that members of advisory bodies like the WHEJAC and members of the public be able to "upvote" certain data sets for inclusion in the tool.
One potential approach is to use the [Stanford Participatory Budgeting Platform](https://pbstanford.org/). Here's an [example of voting on proposals within a limited budget](https://pbstanford.org/nyc8/knapsack).
For instance, going into a quarterly planning cycle, the CEJST development team could estimate the amount of time (in developer-weeks) it would take to clean, analyze, and incorporate each potential data set. Incorporating some already-cleaned census data may take 1 week of a developer's time, while incorporating new asthma data from CMS that's never been publicly released could take 5 weeks. Given a "budget" of available developer-weeks (e.g., 2 developers for 13 weeks, or 26 developer-weeks), advisors can vote on their top priorities for inclusion in the tool within that "budget", as in the sketch below.
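To make the arithmetic concrete, here is an illustrative sketch of selecting the highest-vote set of data sets that fits within the budget (all names, estimates, and vote counts below are made up):

```python
# An illustrative sketch, not part of this change: budget-constrained
# selection of data sets by advisor votes. All figures are made up.
from itertools import combinations

candidates = {
    # name: (estimated developer-weeks, number of advisor votes)
    "census data": (1, 10),
    "CMS asthma data": (5, 14),
    "PM 2.5": (2, 9),
    "flood risk": (4, 7),
}
BUDGET_WEEKS = 8


def best_plan(candidates: dict, budget: int) -> tuple:
    """Exhaustively pick the subset with the most votes that fits the budget."""
    best = (0, ())
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            weeks = sum(candidates[name][0] for name in subset)
            votes = sum(candidates[name][1] for name in subset)
            if weeks <= budget and votes > best[0]:
                best = (votes, subset)
    return best


print(best_plan(candidates, BUDGET_WEEKS))
# e.g. -> (33, ('census data', 'CMS asthma data', 'PM 2.5'))
```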

View file

View file
@ -0,0 +1,39 @@
# There is no method for adding field descriptions to `yamale` schemas.
# Therefore, we've created a dictionary here of fields and their descriptions.
name: A short name of the data set.
source: The URL pointing towards the data set itself or more information about the
  data set.
relevance_to_environmental_justice: It's useful to spell out why this data is
  relevant to EJ issues and/or can be used to identify EJ communities.
spatial_resolution: Dev team needs to know if the resolution is granular enough
  to be useful.
public_status: Whether a dataset has already gone through a public release process
  (like Census data) or may need a lengthy review process (like Medicaid data).
sponsor: Whether there's a federal agency or non-governmental agency that is working
  to provide and maintain this data.
subjective_rating_of_data_quality: Sometimes we don't have statistics on data
  quality, but we know it is likely to be accurate or not. How much has it been
  vetted by an agency; is this the de facto data set for the topic?
estimated_margin_of_error: Estimated margin of error on measurement, if known. Often
  narrower geographic measures have a higher margin of error due to a smaller sample
  for each measurement.
known_data_quality_issues: It can be helpful to write out known problems.
geographic_coverage_percent: We want to think about data that is comprehensive across
  America.
geographic_coverage_description: A verbal description of geographic coverage.
data_formats: Developers need to know what formats the data is available in.
last_updated_date: When was the data last updated / refreshed? (In format YYYY-MM-DD.
  If exact date is not known, use YYYY-01-01.)
frequency_of_updates: How often is this data updated? Is it updated on a reliable
  cadence?
documentation: Link to docs. Also, is the documentation good enough? Can we get the
  info we need?
data_can_go_in_cloud: Some datasets cannot legally go in the cloud.
discussion: Review of other topics, such as
  peer review (Overview or links out to peer review done on this dataset),
  where and how data is available (e.g., Geoplatform.gov? Is it available from multiple
  sources?),
  risk assessment of the data (e.g. a vendor-processed version of the dataset might not
  be open or good enough),
  legal considerations (Legal disclaimers, assumption of risk, proprietary?),
  accreditation (Is this source accredited?)

View file
@ -0,0 +1,24 @@
# `yamale` schema for descriptions of data sets.
name: str(required=True)
source: str(required=True)
relevance_to_environmental_justice: str(required=False)
data_formats: enum('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ',
  'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS', required=True)
spatial_resolution: enum('State/territory', 'County', 'Zip code', 'Census tract',
  'Census block group', 'Exact address or lat/long', 'Other', required=True)
public_status: enum('Not Released', 'Public', 'Public for certain audiences', 'Other',
  required=True)
sponsor: str(required=True)
subjective_rating_of_data_quality: enum('Low Quality', 'Medium Quality',
  'High Quality', required=False)
estimated_margin_of_error: num(required=False)
known_data_quality_issues: str(required=False)
geographic_coverage_percent: num(required=False)
geographic_coverage_description: str(required=False)
last_updated_date: day(min='2001-01-01', max='2100-01-01', required=True)
frequency_of_updates: enum('Less than annually', 'Approximately annually',
  'Once every 1-6 months',
  'Daily or more frequently than daily', 'Unknown', required=True)
documentation: str(required=False)
data_can_go_in_cloud: bool(required=False)
discussion: str(required=False)

View file
@ -0,0 +1,94 @@
# Note: This template is automatically generated by the function
# `write_data_set_description_template_file` from the schema
# and field descriptions files. Do not manually edit this file.

name:
# Description: A short name of the data set.
# Required field: True
# Field type: str

source:
# Description: The URL pointing towards the data set itself or more information about the data set.
# Required field: True
# Field type: str

relevance_to_environmental_justice:
# Description: It's useful to spell out why this data is relevant to EJ issues and/or can be used to identify EJ communities.
# Required field: False
# Field type: str

data_formats:
# Description: Developers need to know what formats the data is available in.
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ', 'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS')

spatial_resolution:
# Description: Dev team needs to know if the resolution is granular enough to be useful.
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('State/territory', 'County', 'Zip code', 'Census tract', 'Census block group', 'Exact address or lat/long', 'Other')

public_status:
# Description: Whether a dataset has already gone through a public release process (like Census data) or may need a lengthy review process (like Medicaid data).
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('Not Released', 'Public', 'Public for certain audiences', 'Other')

sponsor:
# Description: Whether there's a federal agency or non-governmental agency that is working to provide and maintain this data.
# Required field: True
# Field type: str

subjective_rating_of_data_quality:
# Description: Sometimes we don't have statistics on data quality, but we know it is likely to be accurate or not. How much has it been vetted by an agency; is this the de facto data set for the topic?
# Required field: False
# Field type: enum
# Valid choices are one of the following: ('Low Quality', 'Medium Quality', 'High Quality')

estimated_margin_of_error:
# Description: Estimated margin of error on measurement, if known. Often narrower geographic measures have a higher margin of error due to a smaller sample for each measurement.
# Required field: False
# Field type: num

known_data_quality_issues:
# Description: It can be helpful to write out known problems.
# Required field: False
# Field type: str

geographic_coverage_percent:
# Description: We want to think about data that is comprehensive across America.
# Required field: False
# Field type: num

geographic_coverage_description:
# Description: A verbal description of geographic coverage.
# Required field: False
# Field type: str

last_updated_date:
# Description: When was the data last updated / refreshed? (In format YYYY-MM-DD. If exact date is not known, use YYYY-01-01.)
# Required field: True
# Field type: day

frequency_of_updates:
# Description: How often is this data updated? Is it updated on a reliable cadence?
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('Less than annually', 'Approximately annually', 'Once every 1-6 months', 'Daily or more frequently than daily', 'Unknown')

documentation:
# Description: Link to docs. Also, is the documentation good enough? Can we get the info we need?
# Required field: False
# Field type: str

data_can_go_in_cloud:
# Description: Some datasets cannot legally go in the cloud.
# Required field: False
# Field type: bool

discussion:
# Description: Review of other topics, such as peer review (Overview or links out to peer review done on this dataset), where and how data is available (e.g., Geoplatform.gov? Is it available from multiple sources?), risk assessment of the data (e.g. a vendor-processed version of the dataset might not be open or good enough), legal considerations (Legal disclaimers, assumption of risk, proprietary?), accreditation (Is this source accredited?)
# Required field: False
# Field type: str

View file
@ -0,0 +1,35 @@
name: Particulate Matter 2.5
source: https://gaftp.epa.gov/EJSCREEN/
relevance_to_environmental_justice: Particulate matter has a lot of adverse impacts
  on health.
data_formats: CSV/XLSX
spatial_resolution: Census block group
public_status: Public
sponsor: EPA
subjective_rating_of_data_quality: Medium Quality
estimated_margin_of_error:
known_data_quality_issues: Many PM 2.5 stations are known to be pretty far apart, so
  averaging them can lead to data quality loss.
geographic_coverage_percent:
geographic_coverage_description:
last_updated_date: 2017-01-01
frequency_of_updates: Less than annually
documentation: https://www.epa.gov/sites/production/files/2015-05/documents/ejscreen_technical_document_20150505.pdf#page=13
data_can_go_in_cloud: True
discussion:

View file
@ -0,0 +1 @@
yamale==3.0.6

View file
@ -0,0 +1,21 @@
"""Setup script for `data_roadmap` package."""
import os
from setuptools import find_packages
from setuptools import setup
# TODO: replace this with `poetry`. https://github.com/usds/justice40-tool/issues/57
_PACKAGE_DIRECTORY = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(_PACKAGE_DIRECTORY, "requirements.txt")) as f:
requirements = f.readlines()
setup(
name="data_roadmap",
description="Data roadmap package",
author="CEJST Development Team",
author_email="justice40open@usds.gov",
install_requires=requirements,
include_package_data=True,
packages=find_packages(),
)

View file

@ -0,0 +1,151 @@
import importlib_resources
import pathlib

import yamale
import yaml

# Set directories.
DATA_ROADMAP_DIRECTORY = importlib_resources.files("data_roadmap")
UTILS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "utils"
DATA_SET_DESCRIPTIONS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "data_set_descriptions"

# Set file paths.
DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_schema.yaml"
)
DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_field_descriptions.yaml"
)
DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_template.yaml"
)


def load_data_set_description_schema(
    file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH,
) -> yamale.schema.schema.Schema:
    """Load from file the data set description schema."""
    schema = yamale.make_schema(path=file_path)

    return schema


def load_data_set_description_field_descriptions(
    file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH,
) -> dict:
    """Load from file the descriptions of fields in the data set description."""
    # Load field descriptions.
    with open(file_path, "r") as stream:
        data_set_description_field_descriptions = yaml.safe_load(stream=stream)

    return data_set_description_field_descriptions


def validate_descriptions_for_schema(
    schema: yamale.schema.schema.Schema,
    field_descriptions: dict,
) -> None:
    """Validate descriptions for schema.

    Checks that every field in the `yamale` schema also has a field
    description in the `field_descriptions` dict, and vice versa.
    """
    for field_name in schema.dict.keys():
        if field_name not in field_descriptions:
            raise ValueError(
                f"Field `{field_name}` does not have a "
                f"description. Please add one to file `{DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH}`"
            )

    for field_name in field_descriptions.keys():
        if field_name not in schema.dict.keys():
            raise ValueError(
                f"Field `{field_name}` has a description but is not in the " f"schema."
            )


def validate_all_data_set_descriptions(
    data_set_description_schema: yamale.schema.schema.Schema,
) -> None:
    """Validate data set descriptions.

    Validate each file in the `data_set_descriptions` directory
    against the provided schema.
    """
    data_set_description_file_paths_generator = DATA_SET_DESCRIPTIONS_DIRECTORY.glob(
        "*.yaml"
    )

    # Validate each file.
    for file_path in data_set_description_file_paths_generator:
        print(f"Validating {file_path}...")

        # Create a yamale Data object.
        data_set_description = yamale.make_data(file_path)

        # TODO: explore collecting all errors and raising them at once. - Lucas
        yamale.validate(schema=data_set_description_schema, data=data_set_description)


def write_data_set_description_template_file(
    data_set_description_schema: yamale.schema.schema.Schema,
    data_set_description_field_descriptions: dict,
    template_file_path: str = DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH,
) -> None:
    """Write an example data set description with helpful comments."""
    template_file_lines = []

    # Write comments at the top of the template.
    template_file_lines.append(
        "# Note: This template is automatically generated by the function\n"
        "# `write_data_set_description_template_file` from the schema\n"
        "# and field descriptions files. Do not manually edit this file.\n\n"
    )

    schema_dict = data_set_description_schema.dict
    for field_name, field_schema in schema_dict.items():
        template_file_lines.append(f"{field_name}: \n")
        template_file_lines.append(
            f"# Description: {data_set_description_field_descriptions[field_name]}\n"
        )
        template_file_lines.append(f"# Required field: {field_schema.is_required}\n")
        template_file_lines.append(f"# Field type: {field_schema.get_name()}\n")

        if type(field_schema) is yamale.validators.validators.Enum:
            template_file_lines.append(
                f"# Valid choices are one of the following: {field_schema.enums}\n"
            )

        # Add an empty linebreak to separate fields.
        template_file_lines.append("\n")

    with open(template_file_path, "w") as file:
        file.writelines(template_file_lines)


def run_validations_and_write_template() -> None:
    """Run validations of schema and descriptions, and write a template file."""
    # Load the schema and a separate dictionary of field descriptions.
    data_set_description_schema = load_data_set_description_schema()
    data_set_description_field_descriptions = (
        load_data_set_description_field_descriptions()
    )

    validate_descriptions_for_schema(
        schema=data_set_description_schema,
        field_descriptions=data_set_description_field_descriptions,
    )

    # Validate all data set descriptions in the directory against the schema.
    validate_all_data_set_descriptions(
        data_set_description_schema=data_set_description_schema
    )

    # Write an example template for data set descriptions.
    write_data_set_description_template_file(
        data_set_description_schema=data_set_description_schema,
        data_set_description_field_descriptions=data_set_description_field_descriptions,
    )


if __name__ == "__main__":
    run_validations_and_write_template()

View file
@ -0,0 +1,248 @@
import unittest
from unittest import mock

import yamale

from data_roadmap.utils.utils_data_set_description_schema import (
    load_data_set_description_schema,
    load_data_set_description_field_descriptions,
    validate_descriptions_for_schema,
    validate_all_data_set_descriptions,
    write_data_set_description_template_file,
)


class UtilsDataSetDescriptionSchema(unittest.TestCase):
    @mock.patch("yamale.make_schema")
    def test_load_data_set_description_schema(self, make_schema_mock):
        load_data_set_description_schema(file_path="mock.yaml")

        make_schema_mock.assert_called_once_with(path="mock.yaml")

    @mock.patch("yaml.safe_load")
    def test_load_data_set_description_field_descriptions(self, yaml_safe_load_mock):
        # Note: this isn't a great test, we could mock the actual YAML to
        # make it better. - Lucas
        mock_dict = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        yaml_safe_load_mock.return_value = mock_dict

        field_descriptions = load_data_set_description_field_descriptions()

        yaml_safe_load_mock.assert_called_once()

        self.assertDictEqual(field_descriptions, mock_dict)

    def test_validate_descriptions_for_schema(self):
        # Test when all descriptions are present.
        field_descriptions = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        schema = yamale.make_schema(
            content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
        )

        # Should pass.
        validate_descriptions_for_schema(
            schema=schema, field_descriptions=field_descriptions
        )

        field_descriptions_missing_one = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
        }

        # Should fail because of the missing field description.
        with self.assertRaises(ValueError) as context_manager:
            validate_descriptions_for_schema(
                schema=schema, field_descriptions=field_descriptions_missing_one
            )

        # Using `assertIn` because the file path is returned in the error
        # message, and it varies based on environment.
        self.assertIn(
            "Field `field` does not have a description. Please add one to file",
            str(context_manager.exception),
        )

        field_descriptions_extra_one = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
            "extra": "Extra description.",
        }

        # Should fail because of the extra field description.
        with self.assertRaises(ValueError) as context_manager:
            validate_descriptions_for_schema(
                schema=schema, field_descriptions=field_descriptions_extra_one
            )

        # This error message contains no file path, so we can compare the
        # full string.
        self.assertEqual(
            "Field `extra` has a description but is not in the schema.",
            str(context_manager.exception),
        )

    def test_validate_all_data_set_descriptions(self):
        # Set up a few examples of `yamale` data *before* we mock the
        # `make_data` function.
        valid_data = yamale.make_data(
            content="""
name: Bill
age: 26
height: 6.2
awesome: True
field: option 1
"""
        )

        invalid_data_1 = yamale.make_data(
            content="""
name: Bill
age: asdf
height: 6.2
awesome: asdf
field: option 1
"""
        )

        invalid_data_2 = yamale.make_data(
            content="""
age: 26
height: 6.2
awesome: True
field: option 1
"""
        )

        # Mock `make_data`.
        with mock.patch.object(
            yamale, "make_data", return_value=None
        ) as yamale_make_data_mock:
            schema = yamale.make_schema(
                content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
            )

            # Make the `make_data` method return valid data.
            yamale_make_data_mock.return_value = valid_data

            # Should pass.
            validate_all_data_set_descriptions(data_set_description_schema=schema)

            # Make some of the data invalid.
            yamale_make_data_mock.return_value = invalid_data_1

            # Should fail because of the invalid field values.
            with self.assertRaises(yamale.YamaleError) as context_manager:
                validate_all_data_set_descriptions(data_set_description_schema=schema)

            self.assertEqual(
                str(context_manager.exception),
                """Error validating data
age: 'asdf' is not a int.
awesome: 'asdf' is not a bool.""",
            )

            # Make some of the data missing.
            yamale_make_data_mock.return_value = invalid_data_2

            # Should fail because of the missing fields.
            with self.assertRaises(yamale.YamaleError) as context_manager:
                validate_all_data_set_descriptions(data_set_description_schema=schema)

            self.assertEqual(
                str(context_manager.exception),
                """Error validating data
name: Required field missing""",
            )

    @mock.patch("builtins.open", new_callable=mock.mock_open)
    def test_write_data_set_description_template_file(self, builtins_writelines_mock):
        schema = yamale.make_schema(
            content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
        )

        data_set_description_field_descriptions = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        write_data_set_description_template_file(
            data_set_description_schema=schema,
            data_set_description_field_descriptions=data_set_description_field_descriptions,
            template_file_path="mock_template.yaml",
        )

        # `mock_calls[2]` is the call to `writelines`; `[1][0]` pulls out its
        # first positional argument, the list of lines written.
        call_to_writelines = builtins_writelines_mock.mock_calls[2][1][0]

        self.assertListEqual(
            call_to_writelines,
            [
                "# Note: This template is automatically generated by the function\n"
                "# `write_data_set_description_template_file` from the schema\n"
                "# and field descriptions files. Do not manually edit this file.\n\n",
                "name: \n",
                "# Description: The name of the thing.\n",
                "# Required field: True\n",
                "# Field type: str\n",
                "\n",
                "age: \n",
                "# Description: The age of the thing.\n",
                "# Required field: True\n",
                "# Field type: int\n",
                "\n",
                "height: \n",
                "# Description: The height of the thing.\n",
                "# Required field: True\n",
                "# Field type: num\n",
                "\n",
                "awesome: \n",
                "# Description: The awesome of the thing.\n",
                "# Required field: True\n",
                "# Field type: bool\n",
                "\n",
                "field: \n",
                "# Description: The field of the thing.\n",
                "# Required field: True\n",
                "# Field type: enum\n",
                "# Valid choices are one of the following: ('option 1', 'option 2')\n",
                "\n",
            ],
        )