Data Unit Tests (#509)

* Fixes #341 -
As a J40 developer, I want to write Unit Tests for the ETL files,
so that tests are run on each commit

* Location bug

* Adding Load tests

* Fixing XLSX filename

* Adding downloadable zip test

* updating pickle

* Fixing pylint warnings

* Update readme to correct some typos and reorganize test content structure

* Removing unused schemas file, adding details to readme around pickles, per PR feedback

* Update test to pass with Score D added to score file; update path in readme

* fix requirements.txt after merge

* fix poetry.lock after merge

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
Nat Hillard 2021-09-10 14:17:34 -04:00 committed by GitHub
parent 88c8209bb0
commit 536a35d6a0
17 changed files with 676 additions and 242 deletions

.gitignore
View file

@ -149,3 +149,4 @@ node_modules
# pyenv
.python-version
.DS_Store
temp_dir

View file

@ -8,6 +8,10 @@
- [Justice 40 Score application](#justice-40-score-application)
- [About this application](#about-this-application)
- [Using the data](#using-the-data)
- [1. Source data](#1-source-data)
- [2. Extract-Transform-Load (ETL) the data](#2-extract-transform-load-etl-the-data)
- [3. Combined dataset](#3-combined-dataset)
- [4. Tileset](#4-tileset)
- [Score generation and comparison workflow](#score-generation-and-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
@ -28,6 +32,15 @@
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Miscellaneous](#miscellaneous)
- [Testing](#testing)
- [Background](#background)
- [Configuration / Fixtures](#configuration--fixtures)
- [Updating Pickles](#updating-pickles)
    - [Future Enhancements](#future-enhancements)
- [ETL Unit Tests](#etl-unit-tests)
- [Extract Tests](#extract-tests)
- [Transform Tests](#transform-tests)
- [Load Tests](#load-tests)
<!-- /TOC -->
@ -46,15 +59,17 @@ One of our primary development principles is that the entire data pipeline shoul
In the sub-sections below, we outline what each stage of the data provenance looks like and where you can find the data output by that stage. If you'd like to actually perform each step in your own environment, skip down to [Score generation and comparison workflow](#score-generation-and-comparison-workflow).
#### 1. Source data
If you would like to find and use the raw source data, you can find the source URLs in the `etl.py` files located within each directory in `data/data-pipeline/etl/sources`.
#### 2. Extract-Transform-Load (ETL) the data
The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:
* EJScreen: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv
* Census ACS 2019: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv
* Housing and Transportation Index: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv
* HUD Housing: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv
The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:
- EJScreen: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv>
- Census ACS 2019: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv>
- Housing and Transportation Index: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv>
- HUD Housing: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv>
Each CSV may have a different column name for the census tract or census block group identifier. You can find what the name is in the ETL code. Please note that when you view these files you should make sure that your text editor or spreadsheet software does not remove the initial `0` from this identifier field (many IDs begin with `0`).
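For example, if you open one of these CSVs with pandas, reading the identifier column with a string dtype preserves the leading `0`. The column name below is illustrative only; check each dataset's ETL code for its actual identifier column:

```python
import pandas as pd

# Reading the identifier column as a string keeps the leading "0" intact.
# "GEOID10" is illustrative; check each dataset's ETL code for its actual
# census tract / block group identifier column name.
df = pd.read_csv(
    "usa.csv",
    dtype={"GEOID10": "string"},
)
print(df["GEOID10"].head())
```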
@ -242,3 +257,78 @@ see [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nb
## Miscellaneous
- To export packages from Poetry to `requirements.txt` run `poetry export --without-hashes > requirements.txt`
## Testing
### Background
For this project, we make use of [pytest](https://docs.pytest.org/en/latest/) for testing purposes. To run tests, simply run `poetry run pytest` in this directory (i.e. `justice40-tool/data/data-pipeline`).
### Configuration / Fixtures
Test data is configured via [fixtures](https://docs.pytest.org/en/latest/explanation/fixtures.html).
These fixtures utilize [pickle files](https://docs.python.org/3/library/pickle.html) to store dataframes to disk. We do this because two dataframes can be counted as unequal even when their columns hold the same "visible" values, if the column types do not match; a minimal illustration of this appears after the list below.
In a bit more detail:
1. Pandas dataframes are typed, and by default, types are inferred when you create one from scratch. If you create a dataframe using the `DataFrame` [constructors](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), there is no guarantee that types will be correct without explicit `dtype` annotations. Explicit `dtype` annotations are possible, but this leads us to point #2:
2. Our transformations and dataframes in the source code under test don't always require specific types, and it is often sufficient in the code itself to rely on the `object` type. We attempted adding explicit typing based on the "logical" type of given columns, but in practice it resulted in non-matching dataframes that _actually_ had the same values -- in particular, it was very common to have one dataframe column of type `string` and another of type `object` that carried the same values. In other words, even if we did create a "correctly" typed dataframe (according to our logical assumptions about what types should be), it was still counted as mismatched against the dataframes actually used in our program. To fix this "the right way", it is necessary to explicitly annotate types at the point of the `read_csv` call, which has other potential unintended side effects and would need to be done carefully.
3. For larger dataframes (some of these have 150+ values), it was initially deemed too difficult/time-consuming to manually annotate all types, and further, to modify those type annotations based on what is expected in the source code under test.
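As a minimal illustration of the type-mismatch problem above (a toy example, not taken from the pipeline itself):

```python
import pandas as pd
import pandas.testing as pdt

# Same "visible" values, different dtypes: one column is "string", the other "object".
left = pd.DataFrame({"GEOID": pd.Series(["01001", "01003"], dtype="string")})
right = pd.DataFrame({"GEOID": pd.Series(["01001", "01003"], dtype="object")})

try:
    pdt.assert_frame_equal(left, right)
except AssertionError as err:
    print(err)  # the frames are reported as unequal purely because the dtypes differ

# Relaxing the dtype check makes the comparison pass.
pdt.assert_frame_equal(left, right, check_dtype=False)
```

Pickling a dataframe and reading it back preserves its exact dtypes, which is why the fixtures snapshot dataframes with pickle rather than re-creating them by hand; it is also why `test_transform_score` below passes `check_dtype=False` to `assert_frame_equal`.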
#### Updating Pickles
If you update the input or output of various methods, it is necessary to create new pickles so that data is validated correctly. To do this:
1. Drop a breakpoint just before the dataframe will otherwise be written to / read from disk. If you're using VSCode, use one of the named run targets within `data-pipeline` such as `Score Full Run`, and put a breakpoint in the margin just before the actionable step. More on using breakpoints in VSCode [here](https://code.visualstudio.com/docs/editor/debugging#_breakpoints). If you are not using VSCode, you can put the line `breakpoint()` in your code and it will stop where you have placed the line in whatever calling context you are using.
1. In your editor/terminal, run `df.to_pickle("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl")` to write the pickle to the appropriate location on disk.
1. Be sure to do this for all inputs/outputs that have changed as a result of your modification. It is often necessary to do this several times for cascading operations.
1. To inspect your pickle, open a python interpreter, then run `import pickle` followed by `pickle.load(open("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl", "rb"))` to get the file contents.
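A condensed, self-contained sketch of that workflow (the snapshot path is copied from above; the toy dataframe stands in for whatever `df` exists at your breakpoint):

```python
import pandas as pd

# Illustrative snapshot path, copied from the step above.
SNAPSHOT = "data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl"

# At your breakpoint, `df` is whatever dataframe the code under test just
# produced; a toy frame stands in for it so this sketch runs on its own.
df = pd.DataFrame({"GEOID10": pd.Series(["010010201001"], dtype="string")})

# 1. Write the new pickle to the snapshot location used by the fixtures.
df.to_pickle(SNAPSHOT)

# 2. Inspect it later; dtypes survive the round trip exactly.
snapshot_df = pd.read_pickle(SNAPSHOT)
print(snapshot_df.dtypes)
```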
#### Future Enhancements
Pickles have several downsides for which we should consider alternatives:
1. They are opaque - it is necessary to open a python interpreter (as written above) to confirm their contents.
2. They are a bit harder for newcomers to python to grok.
3. They potentially encode flawed typing assumptions (see above) which are paved over for future test runs.
In the future, we could adopt any of the below strategies to work around this:
1. We could use [pytest-snapshot](https://pypi.org/project/pytest-snapshot/) to automatically store the output of each test as data changes. This would make it so that you could avoid having to generate a pickle for each method - instead, you would only need to call `generate` once, and only when the dataframe had changed.
Additionally, you could use a pandas type schema annotation such as [pandera](https://pandera.readthedocs.io/en/stable/schema_models.html?highlight=inputschema#basic-usage) to annotate input/output schemas for given functions, and your unit tests could use these to validate explicitly. This could be of very high value for making expectations explicit; a sketch appears below.
Alternatively, or in conjunction, you could move toward using a more strictly-typed container format for read/writes such as SQL/SQLite, and use something like [SQLModel](https://github.com/tiangolo/sqlmodel) to handle more explicit type guarantees.
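As an illustration of the pandera option, a schema for the transformed counties dataframe might look like the following. Note that pandera is not currently a project dependency, and the checks themselves are assumptions; only the column names are taken from the fixtures above:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for the transformed counties dataframe; the column names
# mirror the counties fixture, but the checks themselves are assumptions.
counties_schema = pa.DataFrameSchema(
    {
        "State Abbreviation": pa.Column(str),
        "GEOID": pa.Column(str, pa.Check.str_length(5, 5)),
        "County Name": pa.Column(str),
    }
)

counties_df = pd.DataFrame(
    {
        "State Abbreviation": ["AL", "AL"],
        "GEOID": ["01001", "01003"],
        "County Name": ["AutaugaCounty", "BaldwinCounty"],
    }
)

# Raises a SchemaError if the dataframe does not conform.
counties_schema.validate(counties_df)
```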
### ETL Unit Tests
ETL unit tests are typically organized into three buckets:
- Extract Tests
- Transform Tests, and
- Load Tests
These are tested using different strategies, explained below.
#### Extract Tests
Extract tests rely on the limited data transformations that occur as data is loaded from source files.
In tests, we use fake, limited CSVs read via `StringIO`, taken from the first several rows of the files of interest, and ensure data types are correct.
Down the line, we could use a tool like [Pandera](https://pandera.readthedocs.io/) to enforce schemas, both for the tests and the classes themselves.
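A hedged sketch of the `StringIO` approach described above (the columns and dtypes mirror the state CSV fixture; the example is illustrative, not a test from the suite):

```python
from io import StringIO

import pandas as pd
import pandas.api.types as ptypes

# A fake, truncated copy of the source file, kept inline for the test.
fake_state_csv = StringIO(
    "fips,state_name,state_abbreviation,region,division\n"
    "01,Alabama,AL,South,East South Central\n"
)

extracted = pd.read_csv(
    fake_state_csv, dtype={"fips": "string", "state_abbreviation": "string"}
)

# The identifier columns should come out as strings, not numbers.
assert ptypes.is_string_dtype(extracted["fips"])
assert ptypes.is_string_dtype(extracted["state_abbreviation"])
```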
#### Transform Tests
Transform tests are the heart of ETL unit tests, and compare ideal dataframes with their actual counterparts.
See the [Fixtures](#configuration--fixtures) section above for information about where the data comes from.
#### Load Tests
These make use of [tmp_path_factory](https://docs.pytest.org/en/latest/how-to/tmp_path.html) to create a temporary file system located under `temp_dir`, and validate whether the correct files are written to the correct locations.
Additional future modifications could include the use of Pandera and/or other schema validation tools, and/or a more explicit test that the data written to file can be read back in and yields the same dataframe (sketched below).
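One possible shape for that round-trip check, using pytest's built-in `tmp_path` fixture (this test does not exist in the suite and is only a sketch):

```python
import pandas as pd
import pandas.testing as pdt


def test_score_csv_round_trip(tmp_path):
    # Stand-in for the merged score dataframe written by the load step.
    expected = pd.DataFrame(
        {
            "GEOID10": pd.Series(["010010201001"], dtype="string"),
            "Score E (percentile)": [0.35],
        }
    )

    out_path = tmp_path / "usa.csv"
    expected.to_csv(out_path, index=False)

    # CSV does not preserve dtypes, so they are re-declared on read.
    actual = pd.read_csv(out_path, dtype={"GEOID10": "string"})
    pdt.assert_frame_equal(actual, expected)
```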

View file

@ -0,0 +1,108 @@
from pathlib import Path
import pandas as pd
from data_pipeline.config import settings
# Base Paths
DATA_PATH = Path(settings.APP_ROOT) / "data"
TMP_PATH = DATA_PATH / "tmp"
# Remote Paths
CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
# Local Paths
CENSUS_COUNTIES_FILE_NAME = TMP_PATH / "Gaz_counties_national.txt"
# Census paths
DATA_CENSUS_DIR = DATA_PATH / "census"
DATA_CENSUS_CSV_DIR = DATA_CENSUS_DIR / "csv"
DATA_CENSUS_CSV_FILE_PATH = DATA_CENSUS_CSV_DIR / "us.csv"
DATA_CENSUS_CSV_STATE_FILE_PATH = DATA_CENSUS_CSV_DIR / "fips_states_2010.csv"
# Score paths
DATA_SCORE_DIR = DATA_PATH / "score"
## Score CSV Paths
DATA_SCORE_CSV_DIR = DATA_SCORE_DIR / "csv"
DATA_SCORE_CSV_FULL_DIR = DATA_SCORE_CSV_DIR / "full"
DATA_SCORE_CSV_FULL_FILE_PATH = DATA_SCORE_CSV_FULL_DIR / "usa.csv"
FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH = (
DATA_SCORE_CSV_FULL_DIR / "usa_counties.csv"
)
## Score Tile paths
DATA_SCORE_TILES_DIR = DATA_SCORE_DIR / "tiles"
DATA_SCORE_TILES_FILE_PATH = DATA_SCORE_TILES_DIR / "usa.csv"
# Downloadable paths
SCORE_DOWNLOADABLE_DIR = DATA_SCORE_DIR / "downloadable"
SCORE_DOWNLOADABLE_CSV_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "usa.csv"
SCORE_DOWNLOADABLE_EXCEL_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "usa.xlsx"
SCORE_DOWNLOADABLE_ZIP_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "Screening Tool Data.zip"
# Column subsets
CENSUS_COUNTIES_COLUMNS = ["USPS", "GEOID", "NAME"]
TILES_SCORE_COLUMNS = [
"GEOID10",
"State Name",
"County Name",
"Total population",
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
# columns to round floats to 2 decimals
TILES_SCORE_FLOAT_COLUMNS = [
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line)",
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
]
TILES_ROUND_NUM_DECIMALS = 2
DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_BASIC = [
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Poverty (Less than 200% of federal poverty line)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
"Respiratory hazard index",
"Diesel particulate matter",
"Particulate matter (PM2.5)",
"Traffic proximity and volume",
"Proximity to RMP sites",
"Wastewater discharge",
"Percent pre-1960s housing (lead paint indicator)",
"Total population",
]
# For every indicator above, we want to include percentile and min-max normalized variants also
DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_FULL = list(
pd.core.common.flatten(
[
[p, f"{p} (percentile)", f"{p} (min-max normalized)"]
for p in DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_BASIC
]
)
)
# Finally we augment with the GEOID10, county, and state
DOWNLOADABLE_SCORE_COLUMNS = [
"GEOID10",
"County Name",
"State Name",
*DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_FULL,
]

View file

@ -6,6 +6,8 @@ import pandas as pd
from data_pipeline.etl.base import ExtractTransformLoad
from data_pipeline.utils import get_module_logger, get_zip_info
from . import constants
## zlib is not available on all systems
try:
import zlib # noqa # pylint: disable=unused-import
@ -25,108 +27,19 @@ class PostScoreETL(ExtractTransformLoad):
"""
def __init__(self):
self.CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
self.CENSUS_COUNTIES_TXT = self.TMP_PATH / "Gaz_counties_national.txt"
self.CENSUS_COUNTIES_COLS = ["USPS", "GEOID", "NAME"]
self.CENSUS_USA_CSV = self.DATA_PATH / "census" / "csv" / "us.csv"
self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv"
self.DOWNLOADABLE_INFO_PATH = self.DATA_PATH / "score" / "downloadable"
self.input_counties_df: pd.DataFrame
self.input_states_df: pd.DataFrame
self.input_score_df: pd.DataFrame
self.input_national_cbg_df: pd.DataFrame
self.STATE_CSV = (
self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
)
self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
self.FULL_SCORE_CSV_PLUS_COUNTIES = (
self.SCORE_CSV_PATH / "full" / "usa_counties.csv"
)
self.TILES_SCORE_COLUMNS = [
"GEOID10",
"State Name",
"County Name",
"Total population",
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
self.TILES_SCORE_CSV_PATH = self.SCORE_CSV_PATH / "tiles"
self.TILES_SCORE_CSV = self.TILES_SCORE_CSV_PATH / "usa.csv"
# columns to round floats to 2 decimals
self.TILES_SCORE_FLOAT_COLUMNS = [
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
self.TILES_ROUND_NUM_DECIMALS = 2
self.DOWNLOADABLE_SCORE_INDICATORS_BASIC = [
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Poverty (Less than 200% of federal poverty line)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
"Respiratory hazard index",
"Diesel particulate matter",
"Particulate matter (PM2.5)",
"Traffic proximity and volume",
"Proximity to RMP sites",
"Wastewater discharge",
"Percent pre-1960s housing (lead paint indicator)",
"Total population",
]
# For every indicator above, we want to include percentile and min-max normalized variants also
self.DOWNLOADABLE_SCORE_INDICATORS_FULL = list(
pd.core.common.flatten(
[
[p, f"{p} (percentile)", f"{p} (min-max normalized)"]
for p in self.DOWNLOADABLE_SCORE_INDICATORS_BASIC
]
)
)
# Finally we augment with the GEOID10, county, and state
self.DOWNLOADABLE_SCORE_COLUMNS = [
"GEOID10",
"County Name",
"State Name",
*self.DOWNLOADABLE_SCORE_INDICATORS_FULL,
]
self.DOWNLOADABLE_SCORE_CSV = self.DOWNLOADABLE_INFO_PATH / "usa.csv"
self.DOWNLOADABLE_SCORE_EXCEL = self.DOWNLOADABLE_INFO_PATH / "usa.xlsx"
self.DOWNLOADABLE_SCORE_ZIP = (
self.DOWNLOADABLE_INFO_PATH / "Screening Tool Data.zip"
)
self.counties_df: pd.DataFrame
self.states_df: pd.DataFrame
self.score_df: pd.DataFrame
self.score_county_state_merged: pd.DataFrame
self.score_for_tiles: pd.DataFrame
def extract(self) -> None:
super().extract(
self.CENSUS_COUNTIES_ZIP_URL,
self.TMP_PATH,
)
self.output_score_county_state_merged_df: pd.DataFrame
self.output_score_tiles_df: pd.DataFrame
self.output_downloadable_df: pd.DataFrame
def _extract_counties(self, county_path: Path) -> pd.DataFrame:
logger.info("Reading Counties CSV")
self.counties_df = pd.read_csv(
self.CENSUS_COUNTIES_TXT,
return pd.read_csv(
county_path,
sep="\t",
dtype={
"GEOID": "string",
@ -136,134 +49,213 @@ class PostScoreETL(ExtractTransformLoad):
encoding="latin-1",
)
def _extract_states(self, state_path: Path) -> pd.DataFrame:
logger.info("Reading States CSV")
self.states_df = pd.read_csv(
self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
return pd.read_csv(
state_path, dtype={"fips": "string", "state_abbreviation": "string"}
)
self.score_df = pd.read_csv(
self.FULL_SCORE_CSV,
def _extract_score(self, score_path: Path) -> pd.DataFrame:
logger.info("Reading Score CSV")
return pd.read_csv(
score_path,
dtype={"GEOID10": "string", "Total population": "int64"},
)
def transform(self) -> None:
logger.info("Transforming data sources for Score + County CSV")
# rename some of the columns to prepare for merge
self.counties_df = self.counties_df[["USPS", "GEOID", "NAME"]]
self.counties_df.rename(
columns={"USPS": "State Abbreviation", "NAME": "County Name"},
inplace=True,
)
# remove unnecessary columns
self.states_df.rename(
columns={
"fips": "State Code",
"state_name": "State Name",
"state_abbreviation": "State Abbreviation",
},
inplace=True,
)
self.states_df.drop(["region", "division"], axis=1, inplace=True)
# add the tract level column
self.score_df["GEOID"] = self.score_df.GEOID10.str[:5]
# merge state with counties
county_state_merged = self.counties_df.merge(
self.states_df, on="State Abbreviation", how="left"
)
# merge state + county with score
self.score_county_state_merged = self.score_df.merge(
county_state_merged, on="GEOID", how="left"
)
# check if there are census cbgs without score
logger.info("Removing CBG rows without score")
## load cbgs
cbg_usa_df = pd.read_csv(
self.CENSUS_USA_CSV,
def _extract_national_cbg(self, national_cbg_path: Path) -> pd.DataFrame:
logger.info("Reading national CBG")
return pd.read_csv(
national_cbg_path,
names=["GEOID10"],
dtype={"GEOID10": "string"},
low_memory=False,
header=None,
)
def extract(self) -> None:
logger.info("Starting Extraction")
super().extract(
constants.CENSUS_COUNTIES_ZIP_URL,
constants.TMP_PATH,
)
self.input_counties_df = self._extract_counties(
constants.CENSUS_COUNTIES_FILE_NAME
)
self.input_states_df = self._extract_states(
constants.DATA_CENSUS_CSV_STATE_FILE_PATH
)
self.input_score_df = self._extract_score(
constants.DATA_SCORE_CSV_FULL_FILE_PATH
)
self.input_national_cbg_df = self._extract_national_cbg(
constants.DATA_CENSUS_CSV_FILE_PATH
)
def _transform_counties(self, initial_counties_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the counties dataframe
"""
# Rename some of the columns to prepare for merge
new_df = initial_counties_df[constants.CENSUS_COUNTIES_COLUMNS]
new_df.rename(
columns={"USPS": "State Abbreviation", "NAME": "County Name"}, inplace=True
)
return new_df
def _transform_states(self, initial_states_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the states dataframe
"""
# remove unnecessary columns
new_df = initial_states_df.rename(
columns={
"fips": "State Code",
"state_name": "State Name",
"state_abbreviation": "State Abbreviation",
}
)
new_df.drop(["region", "division"], axis=1, inplace=True)
return new_df
def _transform_score(self, initial_score_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the score dataframe
"""
# Add the tract level column
new_df = initial_score_df.copy()
new_df["GEOID"] = initial_score_df.GEOID10.str[:5]
return new_df
def _create_score_data(
self,
national_cbg_df: pd.DataFrame,
counties_df: pd.DataFrame,
states_df: pd.DataFrame,
score_df: pd.DataFrame,
) -> pd.DataFrame:
# merge state with counties
logger.info("Merging state with county info")
county_state_merged = counties_df.merge(
states_df, on="State Abbreviation", how="left"
)
# merge state + county with score
score_county_state_merged = score_df.merge(
county_state_merged, on="GEOID", how="left"
)
# check if there are census cbgs without score
logger.info("Removing CBG rows without score")
# merge census cbgs with score
merged_df = cbg_usa_df.merge(
self.score_county_state_merged,
on="GEOID10",
how="left",
merged_df = national_cbg_df.merge(
score_county_state_merged, on="GEOID10", how="left"
)
# recast population to integer
merged_df["Total population"] = (
score_county_state_merged["Total population"] = (
merged_df["Total population"].fillna(0.0).astype(int)
)
# list the null score cbgs
null_cbg_df = merged_df[merged_df["Score E (percentile)"].isnull()]
# subsctract data sets
# subtract data sets
# this follows the XOR pattern outlined here:
# https://stackoverflow.com/a/37313953
removed_df = pd.concat(
de_duplicated_df = pd.concat(
[merged_df, null_cbg_df, null_cbg_df]
).drop_duplicates(keep=False)
# set the score to the new df
self.score_county_state_merged = removed_df
def _save_full_csv(self):
logger.info("Saving Full Score CSV with County Information")
self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
self.score_county_state_merged.to_csv(
self.FULL_SCORE_CSV_PLUS_COUNTIES, index=False
)
def _save_tile_csv(self):
logger.info("Saving Tile Score CSV")
score_tiles = self.score_county_state_merged[self.TILES_SCORE_COLUMNS]
return de_duplicated_df
def _create_tile_data(
self, score_county_state_merged_df: pd.DataFrame
) -> pd.DataFrame:
score_tiles = score_county_state_merged_df[constants.TILES_SCORE_COLUMNS]
decimals = pd.Series(
[self.TILES_ROUND_NUM_DECIMALS]
* len(self.TILES_SCORE_FLOAT_COLUMNS),
index=self.TILES_SCORE_FLOAT_COLUMNS,
[constants.TILES_ROUND_NUM_DECIMALS]
* len(constants.TILES_SCORE_FLOAT_COLUMNS),
index=constants.TILES_SCORE_FLOAT_COLUMNS,
)
score_tiles = score_tiles.round(decimals)
return score_tiles.round(decimals)
self.TILES_SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
score_tiles.to_csv(self.TILES_SCORE_CSV, index=False)
def _create_downloadable_data(
self, score_county_state_merged_df: pd.DataFrame
) -> pd.DataFrame:
return score_county_state_merged_df[constants.DOWNLOADABLE_SCORE_COLUMNS]
def _save_downloadable_zip(self):
def transform(self) -> None:
logger.info("Transforming data sources for Score + County CSV")
transformed_counties = self._transform_counties(self.input_counties_df)
transformed_states = self._transform_states(self.input_states_df)
transformed_score = self._transform_score(self.input_score_df)
output_score_county_state_merged_df = self._create_score_data(
self.input_national_cbg_df,
transformed_counties,
transformed_states,
transformed_score,
)
self.output_score_tiles_df = self._create_tile_data(
output_score_county_state_merged_df
)
self.output_downloadable_df = self._create_downloadable_data(
output_score_county_state_merged_df
)
self.output_score_county_state_merged_df = output_score_county_state_merged_df
def _load_score_csv(
self, score_county_state_merged: pd.DataFrame, score_csv_path: Path
) -> None:
logger.info("Saving Full Score CSV with County Information")
score_csv_path.parent.mkdir(parents=True, exist_ok=True)
score_county_state_merged.to_csv(score_csv_path, index=False)
def _load_tile_csv(
self, score_tiles_df: pd.DataFrame, tile_score_path: Path
) -> None:
logger.info("Saving Tile Score CSV")
# TODO: check which are the columns we'll use
# Related to: https://github.com/usds/justice40-tool/issues/302
tile_score_path.mkdir(parents=True, exist_ok=True)
score_tiles_df.to_csv(tile_score_path, index=False)
def _load_downloadable_zip(
self, downloadable_df: pd.DataFrame, downloadable_info_path: Path
) -> None:
logger.info("Saving Downloadable CSV")
logger.info(list(self.score_county_state_merged.columns))
logger.info(self.DOWNLOADABLE_SCORE_COLUMNS)
downloadable_tiles = self.score_county_state_merged[
self.DOWNLOADABLE_SCORE_COLUMNS
]
self.DOWNLOADABLE_INFO_PATH.mkdir(parents=True, exist_ok=True)
downloadable_info_path.mkdir(parents=True, exist_ok=True)
csv_path = downloadable_info_path / "usa.csv"
excel_path = downloadable_info_path / "usa.xlsx"
zip_path = downloadable_info_path / "Screening Tool Data.zip"
logger.info("Writing downloadable csv")
downloadable_tiles.to_csv(self.DOWNLOADABLE_SCORE_CSV, index=False)
downloadable_df.to_csv(csv_path, index=False)
logger.info("Writing downloadable excel")
downloadable_tiles.to_excel(self.DOWNLOADABLE_SCORE_EXCEL, index=False)
downloadable_df.to_excel(excel_path, index=False)
logger.info("Compressing files")
files_to_compress = [
self.DOWNLOADABLE_SCORE_CSV,
self.DOWNLOADABLE_SCORE_EXCEL,
]
with zipfile.ZipFile(self.DOWNLOADABLE_SCORE_ZIP, "w") as zf:
files_to_compress = [csv_path, excel_path]
with zipfile.ZipFile(zip_path, "w") as zf:
for f in files_to_compress:
zf.write(f, arcname=Path(f).name, compress_type=compression)
zip_info = get_zip_info(self.DOWNLOADABLE_SCORE_ZIP)
zip_info = get_zip_info(zip_path)
logger.info(json.dumps(zip_info, indent=4, sort_keys=True, default=str))
def load(self) -> None:
self._save_full_csv()
self._save_tile_csv()
self._save_downloadable_zip()
self._load_score_csv(
self.output_score_county_state_merged_df,
constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH,
)
self._load_tile_csv(
self.output_score_tiles_df, constants.DATA_SCORE_TILES_FILE_PATH
)
self._load_downloadable_zip(
self.output_downloadable_df, constants.SCORE_DOWNLOADABLE_DIR
)

View file

@ -0,0 +1,118 @@
import os
from importlib import reload
from pathlib import Path
import pandas as pd
import pytest
from data_pipeline import config
from data_pipeline.etl.score import etl_score_post, tests
from data_pipeline.etl.score.etl_score_post import PostScoreETL
def pytest_configure():
pytest.SNAPSHOT_DIR = Path(__file__).parent / "snapshots"
@pytest.fixture(scope="session")
def root(tmp_path_factory):
basetemp = Path.cwd() / "temp_dir"
os.environ["PYTEST_DEBUG_TEMPROOT"] = str(
basetemp
) # this sets the location of the temp directory inside the project folder
basetemp.mkdir(parents=True, exist_ok=True)
root = tmp_path_factory.mktemp("root", numbered=False)
return root
@pytest.fixture(autouse=True)
def settings_override(monkeypatch, root):
reload(config)
monkeypatch.setattr(config.settings, "APP_ROOT", root)
return config.settings
@pytest.fixture()
def etl(monkeypatch, root):
reload(etl_score_post)
tmp_path = root / "tmp"
tmp_path.mkdir(parents=True, exist_ok=True)
etl = PostScoreETL()
monkeypatch.setattr(etl, "DATA_PATH", root)
monkeypatch.setattr(etl, "TMP_PATH", tmp_path)
return etl
@pytest.fixture(scope="session")
def sample_data_dir():
base_dir = Path(tests.__file__).resolve().parent
return base_dir / "sample_data"
@pytest.fixture()
def county_data_initial(sample_data_dir):
return sample_data_dir / "county_data_initial.csv"
@pytest.fixture()
def state_data_initial(sample_data_dir):
return sample_data_dir / "state_data_initial.csv"
@pytest.fixture()
def score_data_initial(sample_data_dir):
return sample_data_dir / "score_data_initial.csv"
@pytest.fixture()
def counties_transformed_expected():
return pd.DataFrame.from_dict(
data={
"State Abbreviation": pd.Series(["AL", "AL"], dtype="string"),
"GEOID": pd.Series(["01001", "01003"], dtype="string"),
"County Name": pd.Series(
["AutaugaCounty", "BaldwinCounty"], dtype="object"
),
},
)
@pytest.fixture()
def states_transformed_expected():
return pd.DataFrame.from_dict(
data={
"State Code": pd.Series(["01", "02", "04"], dtype="string"),
"State Name": pd.Series(["Alabama", "Alaska", "Arizona"], dtype="object"),
"State Abbreviation": pd.Series(["AL", "AK", "AZ"], dtype="string"),
},
)
@pytest.fixture()
def score_transformed_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "score_transformed_expected.pkl")
@pytest.fixture()
def national_cbg_df():
return pd.DataFrame.from_dict(
data={
"GEOID10": pd.Series(["010010201001", "010010201002"], dtype="string"),
},
)
@pytest.fixture()
def score_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "score_data_expected.pkl")
@pytest.fixture()
def tile_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "tile_data_expected.pkl")
@pytest.fixture()
def downloadable_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "downloadable_data_expected.pkl")

View file

@ -0,0 +1,3 @@
USPS GEOID ANSICODE NAME POP10 HU10 ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
AL 01001 00161526 AutaugaCounty 54571 22135 1539582278 25775735 594.436 9.952 32.536382 -86.644490
AL 01003 00161527 BaldwinCounty 182265 104061 4117521611 1133190229 1589.784 437.527 30.659218 -87.746067

View file

@ -0,0 +1,3 @@
GEOID10,Housing burden (percent),Total population,Air toxics cancer risk,Respiratory hazard index,Diesel particulate matter,Particulate matter (PM2.5),Ozone,Traffic proximity and volume,Proximity to RMP sites,Proximity to TSDF sites,Proximity to NPL sites,Wastewater discharge,Percent pre-1960s housing (lead paint indicator),Individuals under 5 years old,Individuals over 64 years old,Linguistic isolation (percent),Percent of households in linguistic isolation,Poverty (Less than 200% of federal poverty line),Percent individuals age 25 or over with less than high school degree,Unemployed civilians (percent),Housing + Transportation Costs % Income for the Regional Typical Household,GEOID10 (percentile),Housing burden (percent) (percentile),Total population (percentile),Air toxics cancer risk (percentile),Respiratory hazard index (percentile),Diesel particulate matter (percentile),Particulate matter (PM2.5) (percentile),Ozone (percentile),Traffic proximity and volume (percentile),Proximity to RMP sites (percentile),Proximity to TSDF sites (percentile),Proximity to NPL sites (percentile),Wastewater discharge (percentile),Percent pre-1960s housing (lead paint indicator) (percentile),Individuals under 5 years old (percentile),Individuals over 64 years old (percentile),Linguistic isolation (percent) (percentile),Percent of households in linguistic isolation (percentile),Poverty (Less than 200% of federal poverty line) (percentile),Percent individuals age 25 or over with less than high school degree (percentile),Unemployed civilians (percent) (percentile),Housing + Transportation Costs % Income for the Regional Typical Household (percentile),Housing burden (percent) (min-max normalized),Total population (min-max normalized),Air toxics cancer risk (min-max normalized),Respiratory hazard index (min-max normalized),Diesel particulate matter (min-max normalized),Particulate matter (PM2.5) (min-max normalized),Ozone (min-max normalized),Traffic proximity and volume (min-max normalized),Proximity to RMP sites (min-max normalized),Proximity to TSDF sites (min-max normalized),Proximity to NPL sites (min-max normalized),Wastewater discharge (min-max normalized),Percent pre-1960s housing (lead paint indicator) (min-max normalized),Individuals under 5 years old (min-max normalized),Individuals over 64 years old (min-max normalized),Linguistic isolation (percent) (min-max normalized),Percent of households in linguistic isolation (min-max normalized),Poverty (Less than 200% of federal poverty line) (min-max normalized),Percent individuals age 25 or over with less than high school degree (min-max normalized),Unemployed civilians (percent) (min-max normalized),Housing + Transportation Costs % Income for the Regional Typical Household (min-max normalized),Score A,Score B,Socioeconomic Factors,Sensitive populations,Environmental effects,Exposures,Pollution Burden,Population Characteristics,Score C,Score D,Score E,Score A (percentile),Score A (top 25th percentile),Score A (top 30th percentile),Score A (top 35th percentile),Score A (top 40th percentile),Score B (percentile),Score B (top 25th percentile),Score B (top 30th percentile),Score B (top 35th percentile),Score B (top 40th percentile),Score C (percentile),Score C (top 25th percentile),Score C (top 30th percentile),Score C (top 35th percentile),Score C (top 40th percentile),Score D (percentile),Score D (top 25th percentile),Score D (top 30th percentile),Score D (top 35th percentile),Score D (top 40th percentile),Score E (percentile),Score E (top 25th 
percentile),Score E (top 30th percentile),Score E (top 35th percentile),Score E (top 40th percentile),Poverty (Less than 200% of federal poverty line) (top 25th percentile),Poverty (Less than 200% of federal poverty line) (top 30th percentile),Poverty (Less than 200% of federal poverty line) (top 35th percentile),Poverty (Less than 200% of federal poverty line) (top 40th percentile)
010010201001,0.15,692,49.3770316066,0.788051737456,0.2786630687,9.99813169399,40.1217287582,91.0159000855,0.0852006888915,0.0655778245369,0.0709415490545,0.0,0.29,0.0491329479769,0.0953757225434,0.0,0.04,0.293352601156,0.195011337868,0.028125,55.0,4.53858477849437e-06,0.15696279879978475,0.12089201345236528,0.9797143208291796,0.9829416396964773,0.34627219635208273,0.9086451463612172,0.28414902233020944,0.3410837232734089,0.13480504509083976,0.13460988594536452,0.5500810137382961,0.18238709002315753,0.5188510118774764,0.4494787435381899,0.25320991408459015,0.2596066814778244,0.7027453899325112,0.46606500161119757,0.7623733167523703,0.3628393561824028,0.5794871072813119,0.10909090909090909,0.013340530536705737,0.028853697167088285,0.18277886087526787,0.045859591901569303,0.5883290826337872,0.3121515260630353,0.0024222132770710053,0.004621252164336263,0.00015416214761450488,0.007893014211979786,0.0,0.29,0.09433526011570838,0.0953757225434,0.0,0.04,0.293352601156,0.195011337868,0.028125,0.2711864406779661,0.6142191591817839,0.3553155211005275,0.5747020343519587,0.3207651130335348,0.3041468093350269,0.640467674807096,0.5283607196497396,0.4477335736927467,0.23656483320764937,0.12511596962298183,0.4015694309647159,0.6357808408182161,False,False,False,True,0.6315486105122701,False,False,False,True,0.5104500914524833,False,False,False,False,0.44267994354000534,False,False,False,False,0.3517176274094212,False,False,False,False,False,False,False,False
010010201002,0.15,1153,49.3770316066,0.788051737456,0.2786630687,9.99813169399,40.1217287582,2.61874365577,0.0737963352265,0.0604962870646,0.0643436665275,0.0,0.094623655914,0.0416305290546,0.150043365134,0.0,0.0,0.182133564614,0.039119804401,0.0287878787878787,57.0,9.07716955698874e-06,0.15696279879978475,0.42875102685480615,0.9797143208291796,0.9829416396964773,0.34627219635208273,0.9086451463612172,0.28414902233020944,0.09634507767787849,0.11004706512415299,0.1228504127842856,0.5178479846414291,0.18238709002315753,0.28270163797524656,0.3660890561105236,0.5188963977252613,0.2596066814778244,0.25592171848974055,0.2701365660159849,0.2207635715031339,0.3696173450745396,0.6379947997334159,0.10909090909090909,0.022227791486736582,0.028853697167088285,0.18277886087526787,0.045859591901569303,0.5883290826337872,0.3121515260630353,6.96928300032502e-05,0.004002684465613169,0.00014221633002379553,0.007158928457599425,0.0,0.094623655914,0.07993061578488315,0.150043365134,0.0,0.0,0.182133564614,0.039119804401,0.0287878787878787,0.2824858757062147,0.24545006875955938,0.05963631310728093,0.350886800163363,0.38153071177120307,0.2431668381096544,0.5996779005411742,0.4808408797306676,0.36620875596728303,0.17608814038438173,0.07182643137875756,0.2554173925742535,0.21102603786087423,False,False,False,False,0.2509565067420677,False,False,False,False,0.2850458170133389,False,False,False,False,0.16239056337452856,False,False,False,False,0.11055992520412285,False,False,False,False,False,False,False,False

View file

@ -0,0 +1,4 @@
fips,state_name,state_abbreviation,region,division
01,Alabama,AL,South,East South Central
02,Alaska,AK,West,Pacific
04,Arizona,AZ,West,Mountain

View file

@ -0,0 +1,116 @@
# pylint: disable=W0212
## Above disables warning about access to underscore-prefixed methods
from importlib import reload
import pandas.api.types as ptypes
import pandas.testing as pdt
from data_pipeline.etl.score import constants
# See conftest.py for all fixtures used in these tests
# Extract Tests
def test_extract_counties(etl, county_data_initial):
reload(constants)
extracted = etl._extract_counties(county_data_initial)
assert all(
ptypes.is_string_dtype(extracted[col])
for col in constants.CENSUS_COUNTIES_COLUMNS
)
def test_extract_states(etl, state_data_initial):
extracted = etl._extract_states(state_data_initial)
string_cols = ["fips", "state_abbreviation"]
assert all(ptypes.is_string_dtype(extracted[col]) for col in string_cols)
def test_extract_score(etl, score_data_initial):
extracted = etl._extract_score(score_data_initial)
string_cols = ["GEOID10"]
assert all(ptypes.is_string_dtype(extracted[col]) for col in string_cols)
# Transform Tests
def test_transform_counties(etl, county_data_initial, counties_transformed_expected):
extracted_counties = etl._extract_counties(county_data_initial)
counties_transformed_actual = etl._transform_counties(extracted_counties)
pdt.assert_frame_equal(counties_transformed_actual, counties_transformed_expected)
def test_transform_states(etl, state_data_initial, states_transformed_expected):
extracted_states = etl._extract_states(state_data_initial)
states_transformed_actual = etl._transform_states(extracted_states)
pdt.assert_frame_equal(states_transformed_actual, states_transformed_expected)
def test_transform_score(etl, score_data_initial, score_transformed_expected):
extracted_score = etl._extract_score(score_data_initial)
score_transformed_actual = etl._transform_score(extracted_score)
pdt.assert_frame_equal(
score_transformed_actual, score_transformed_expected, check_dtype=False
)
# pylint: disable=too-many-arguments
def test_create_score_data(
etl,
national_cbg_df,
counties_transformed_expected,
states_transformed_expected,
score_transformed_expected,
score_data_expected,
):
score_data_actual = etl._create_score_data(
national_cbg_df,
counties_transformed_expected,
states_transformed_expected,
score_transformed_expected,
)
pdt.assert_frame_equal(
score_data_actual,
score_data_expected,
)
def test_create_tile_data(etl, score_data_expected, tile_data_expected):
output_tiles_df_actual = etl._create_tile_data(score_data_expected)
pdt.assert_frame_equal(
output_tiles_df_actual,
tile_data_expected,
)
def test_create_downloadable_data(etl, score_data_expected, downloadable_data_expected):
output_downloadable_df_actual = etl._create_downloadable_data(score_data_expected)
pdt.assert_frame_equal(
output_downloadable_df_actual,
downloadable_data_expected,
)
def test_load_score_csv(etl, score_data_expected):
reload(constants)
etl._load_score_csv(
score_data_expected,
constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH,
)
assert constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH.is_file()
def test_load_tile_csv(etl, tile_data_expected):
reload(constants)
etl._load_score_csv(tile_data_expected, constants.DATA_SCORE_TILES_FILE_PATH)
assert constants.DATA_SCORE_TILES_FILE_PATH.is_file()
def test_load_downloadable_zip(etl, downloadable_data_expected):
reload(constants)
etl._load_downloadable_zip(
downloadable_data_expected, constants.SCORE_DOWNLOADABLE_DIR
)
assert constants.SCORE_DOWNLOADABLE_DIR.is_dir()
assert constants.SCORE_DOWNLOADABLE_CSV_FILE_PATH.is_file()
assert constants.SCORE_DOWNLOADABLE_EXCEL_FILE_PATH.is_file()
assert constants.SCORE_DOWNLOADABLE_ZIP_FILE_PATH.is_file()

View file

@ -59,14 +59,6 @@ typed-ast = {version = ">=1.4.0,<1.5", markers = "implementation_name == \"cpyth
typing-extensions = {version = ">=3.7.4", markers = "python_version < \"3.8\""}
wrapt = ">=1.11,<1.13"
[[package]]
name = "async-generator"
version = "1.10"
description = "Async generators and context managers for Python 3.5+"
category = "main"
optional = false
python-versions = ">=3.5"
[[package]]
name = "atomicwrites"
version = "1.4.0"
@ -449,7 +441,7 @@ argcomplete = {version = ">=1.12.3", markers = "python_version < \"3.8.0\""}
debugpy = ">=1.0.0,<2.0"
importlib-metadata = {version = "<5", markers = "python_version < \"3.8.0\""}
ipython = ">=7.23.1,<8.0"
jupyter-client = "<7.0"
jupyter-client = "<8.0"
matplotlib-inline = ">=0.1.0,<0.2.0"
tornado = ">=4.2,<7.0"
traitlets = ">=4.1.0,<6.0"
@ -891,14 +883,13 @@ python-versions = "*"
[[package]]
name = "nbclient"
version = "0.5.3"
version = "0.5.4"
description = "A client library for executing notebooks. Formerly nbconvert's ExecutePreprocessor."
category = "main"
optional = false
python-versions = ">=3.6.1"
[package.dependencies]
async-generator = "*"
jupyter-client = ">=6.1.5"
nbformat = ">=5.0"
nest-asyncio = "*"
@ -1026,7 +1017,7 @@ pyparsing = ">=2.0.2"
[[package]]
name = "pandas"
version = "1.3.1"
version = "1.3.2"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
@ -1185,7 +1176,7 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
[[package]]
name = "pygments"
version = "2.9.0"
version = "2.10.0"
description = "Pygments is a syntax highlighting package written in Python."
category = "main"
optional = false
@ -1263,6 +1254,20 @@ toml = "*"
[package.extras]
testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "requests", "xmlschema"]
[[package]]
name = "pytest-mock"
version = "3.6.1"
description = "Thin-wrapper around the mock package for easier use with pytest"
category = "dev"
optional = false
python-versions = ">=3.6"
[package.dependencies]
pytest = ">=5.0"
[package.extras]
dev = ["pre-commit", "tox", "pytest-asyncio"]
[[package]]
name = "python-dateutil"
version = "2.8.2"
@ -1342,11 +1347,11 @@ test = ["flaky", "pytest", "pytest-qt"]
[[package]]
name = "qtpy"
version = "1.9.0"
version = "1.10.0"
description = "Provides an abstraction layer on top of the various Qt bindings (PyQt5, PyQt4 and PySide) and additional custom QWidgets."
category = "main"
optional = false
python-versions = "*"
python-versions = ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*"
[[package]]
name = "regex"
@ -1703,10 +1708,6 @@ astroid = [
{file = "astroid-2.6.6-py3-none-any.whl", hash = "sha256:ab7f36e8a78b8e54a62028ba6beef7561db4cdb6f2a5009ecc44a6f42b5697ef"},
{file = "astroid-2.6.6.tar.gz", hash = "sha256:3975a0bd5373bdce166e60c851cfcbaf21ee96de80ec518c1f4cb3e94c3fb334"},
]
async-generator = [
{file = "async_generator-1.10-py3-none-any.whl", hash = "sha256:01c7bf666359b4967d2cda0000cc2e4af16a0ae098cbffcb8472fb9e8ad6585b"},
{file = "async_generator-1.10.tar.gz", hash = "sha256:6ebb3d106c12920aaae42ccb6f787ef5eefdcdd166ea3d628fa8476abe712144"},
]
atomicwrites = [
{file = "atomicwrites-1.4.0-py2.py3-none-any.whl", hash = "sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197"},
{file = "atomicwrites-1.4.0.tar.gz", hash = "sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a"},
@ -2254,8 +2255,8 @@ mypy-extensions = [
{file = "mypy_extensions-0.4.3.tar.gz", hash = "sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"},
]
nbclient = [
{file = "nbclient-0.5.3-py3-none-any.whl", hash = "sha256:e79437364a2376892b3f46bedbf9b444e5396cfb1bc366a472c37b48e9551500"},
{file = "nbclient-0.5.3.tar.gz", hash = "sha256:db17271330c68c8c88d46d72349e24c147bb6f34ec82d8481a8f025c4d26589c"},
{file = "nbclient-0.5.4-py3-none-any.whl", hash = "sha256:95a300c6fbe73721736cf13972a46d8d666f78794b832866ed7197a504269e11"},
{file = "nbclient-0.5.4.tar.gz", hash = "sha256:6c8ad36a28edad4562580847f9f1636fe5316a51a323ed85a24a4ad37d4aefce"},
]
nbconvert = [
{file = "nbconvert-6.1.0-py3-none-any.whl", hash = "sha256:37cd92ff2ae6a268e62075ff8b16129e0be4939c4dfcee53dc77cc8a7e06c684"},
@ -2312,25 +2313,25 @@ packaging = [
{file = "packaging-21.0.tar.gz", hash = "sha256:7dc96269f53a4ccec5c0670940a4281106dd0bb343f47b7471f779df49c2fbe7"},
]
pandas = [
{file = "pandas-1.3.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:1ee8418d0f936ff2216513aa03e199657eceb67690995d427a4a7ecd2e68f442"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5d9acfca191140a518779d1095036d842d5e5bc8e8ad8b5eaad1aff90fe1870d"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e323028ab192fcfe1e8999c012a0fa96d066453bb354c7e7a4a267b25e73d3c8"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9d06661c6eb741ae633ee1c57e8c432bb4203024e263fe1a077fa3fda7817fdb"},
{file = "pandas-1.3.1-cp37-cp37m-win32.whl", hash = "sha256:23c7452771501254d2ae23e9e9dac88417de7e6eff3ce64ee494bb94dc88c300"},
{file = "pandas-1.3.1-cp37-cp37m-win_amd64.whl", hash = "sha256:7150039e78a81eddd9f5a05363a11cadf90a4968aac6f086fd83e66cf1c8d1d6"},
{file = "pandas-1.3.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:5c09a2538f0fddf3895070579082089ff4ae52b6cb176d8ec7a4dacf7e3676c1"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:905fc3e0fcd86b0a9f1f97abee7d36894698d2592b22b859f08ea5a8fe3d3aab"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5ee927c70794e875a59796fab8047098aa59787b1be680717c141cd7873818ae"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0c976e023ed580e60a82ccebdca8e1cc24d8b1fbb28175eb6521025c127dab66"},
{file = "pandas-1.3.1-cp38-cp38-win32.whl", hash = "sha256:22f3fcc129fb482ef44e7df2a594f0bd514ac45aabe50da1a10709de1b0f9d84"},
{file = "pandas-1.3.1-cp38-cp38-win_amd64.whl", hash = "sha256:45656cd59ae9745a1a21271a62001df58342b59c66d50754390066db500a8362"},
{file = "pandas-1.3.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:114c6789d15862508900a25cb4cb51820bfdd8595ea306bab3b53cd19f990b65"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:527c43311894aff131dea99cf418cd723bfd4f0bcf3c3da460f3b57e52a64da5"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fdb3b33dde260b1766ea4d3c6b8fbf6799cee18d50a2a8bc534cf3550b7c819a"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c28760932283d2c9f6fa5e53d2f77a514163b9e67fd0ee0879081be612567195"},
{file = "pandas-1.3.1-cp39-cp39-win32.whl", hash = "sha256:be12d77f7e03c40a2466ed00ccd1a5f20a574d3c622fe1516037faa31aa448aa"},
{file = "pandas-1.3.1-cp39-cp39-win_amd64.whl", hash = "sha256:9e1fe6722cbe27eb5891c1977bca62d456c19935352eea64d33956db46139364"},
{file = "pandas-1.3.1.tar.gz", hash = "sha256:341935a594db24f3ff07d1b34d1d231786aa9adfa84b76eab10bf42907c8aed3"},
{file = "pandas-1.3.2-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:ba7ceb8abc6dbdb1e34612d1173d61e4941f1a1eb7e6f703b2633134ae6a6c89"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fcb71b1935249de80e3a808227189eee381d4d74a31760ced2df21eedc92a8e3"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fa54dc1d3e5d004a09ab0b1751473698011ddf03e14f1f59b84ad9a6ac630975"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:34ced9ce5d5b17b556486da7256961b55b471d64a8990b56e67a84ebeb259416"},
{file = "pandas-1.3.2-cp37-cp37m-win32.whl", hash = "sha256:a56246de744baf646d1f3e050c4653d632bc9cd2e0605f41051fea59980e880a"},
{file = "pandas-1.3.2-cp37-cp37m-win_amd64.whl", hash = "sha256:53b17e4debba26b7446b1e4795c19f94f0c715e288e08145e44bdd2865e819b3"},
{file = "pandas-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:f07a9745ca075ae73a5ce116f5e58f691c0dc9de0bff163527858459df5c176f"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c9e8e0ce5284ebebe110efd652c164ed6eab77f5de4c3533abc756302ee77765"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:59a78d7066d1c921a77e3306aa0ebf6e55396c097d5dfcc4df8defe3dcecb735"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:132def05e73d292c949b02e7ef873debb77acc44a8b119d215921046f0c3a91d"},
{file = "pandas-1.3.2-cp38-cp38-win32.whl", hash = "sha256:69e1b2f5811f46827722fd641fdaeedb26002bd1e504eacc7a8ec36bdc25393e"},
{file = "pandas-1.3.2-cp38-cp38-win_amd64.whl", hash = "sha256:7996d311413379136baf0f3cf2a10e331697657c87ced3f17ac7c77f77fe34a3"},
{file = "pandas-1.3.2-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:1738154049062156429a5cf2fd79a69c9f3fa4f231346a7ec6fd156cd1a9a621"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9cce01f6d655b4add966fcd36c32c5d1fe84628e200626b3f5e2f40db2d16a0f"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1099e2a0cd3a01ec62cca183fc1555833a2d43764950ef8cb5948c8abfc51014"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0cd5776be891331a3e6b425b5abeab9596abea18435c5982191356f9b24ae731"},
{file = "pandas-1.3.2-cp39-cp39-win32.whl", hash = "sha256:66a95361b81b4ba04b699ecd2416b0591f40cd1e24c60a8bfe0d19009cfa575a"},
{file = "pandas-1.3.2-cp39-cp39-win_amd64.whl", hash = "sha256:89f40e5d21814192802421df809f948247d39ffe171e45fe2ab4abf7bd4279d8"},
{file = "pandas-1.3.2.tar.gz", hash = "sha256:cbcb84d63867af3411fa063af3de64902665bb5b3d40b25b2059e40603594e87"},
]
pandocfilters = [
{file = "pandocfilters-1.4.3.tar.gz", hash = "sha256:bc63fbb50534b4b1f8ebe1860889289e8af94a23bff7445259592df25a3906eb"},
@ -2443,8 +2444,8 @@ pyflakes = [
{file = "pyflakes-2.3.1.tar.gz", hash = "sha256:f5bc8ecabc05bb9d291eb5203d6810b49040f6ff446a756326104746cc00c1db"},
]
pygments = [
{file = "Pygments-2.9.0-py3-none-any.whl", hash = "sha256:d66e804411278594d764fc69ec36ec13d9ae9147193a1740cd34d272ca383b8e"},
{file = "Pygments-2.9.0.tar.gz", hash = "sha256:a18f47b506a429f6f4b9df81bb02beab9ca21d0a5fee38ed15aef65f0545519f"},
{file = "Pygments-2.10.0-py3-none-any.whl", hash = "sha256:b8e67fe6af78f492b3c4b3e2970c0624cbf08beb1e493b2c99b9fa1b67a20380"},
{file = "Pygments-2.10.0.tar.gz", hash = "sha256:f398865f7eb6874156579fdf36bc840a03cab64d1cde9e93d68f46a425ec52c6"},
]
pylint = [
{file = "pylint-2.9.6-py3-none-any.whl", hash = "sha256:2e1a0eb2e8ab41d6b5dbada87f066492bb1557b12b76c47c2ee8aa8a11186594"},
@ -2506,6 +2507,10 @@ pytest = [
{file = "pytest-6.2.4-py3-none-any.whl", hash = "sha256:91ef2131a9bd6be8f76f1f08eac5c5317221d6ad1e143ae03894b862e8976890"},
{file = "pytest-6.2.4.tar.gz", hash = "sha256:50bcad0a0b9c5a72c8e4e7c9855a3ad496ca6a881a3641b4260605450772c54b"},
]
pytest-mock = [
{file = "pytest-mock-3.6.1.tar.gz", hash = "sha256:40217a058c52a63f1042f0784f62009e976ba824c418cced42e88d5f40ab0e62"},
{file = "pytest_mock-3.6.1-py3-none-any.whl", hash = "sha256:30c2f2cc9759e76eee674b81ea28c9f0b94f8f0445a1b87762cadf774f0df7e3"},
]
python-dateutil = [
{file = "python-dateutil-2.8.2.tar.gz", hash = "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86"},
{file = "python_dateutil-2.8.2-py2.py3-none-any.whl", hash = "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"},
@ -2565,13 +2570,6 @@ pyyaml = [
{file = "PyYAML-5.4.1.tar.gz", hash = "sha256:607774cbba28732bfa802b54baa7484215f530991055bb562efbed5b2f20a45e"},
]
pyzmq = [
{file = "pyzmq-22.2.1-cp310-cp310-macosx_10_15_universal2.whl", hash = "sha256:d60a407663b7c2af781ab7f49d94a3d379dd148bb69ea8d9dd5bc69adf18097c"},
{file = "pyzmq-22.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:631f932fb1fa4b76f31adf976f8056519bc6208a3c24c184581c3dd5be15066e"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:0471d634c7fe48ff7d3849798da6c16afc71676dd890b5ae08eb1efe735c6fec"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:f520e9fee5d7a2e09b051d924f85b977c6b4e224e56c0551c3c241bbeeb0ad8d"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c1b6619ceb33a8907f1cb82ff8afc8a133e7a5f16df29528e919734718600426"},
{file = "pyzmq-22.2.1-cp310-cp310-win32.whl", hash = "sha256:31c5dfb6df5148789835128768c01bf6402eb753d06f524f12f6786caf96fb44"},
{file = "pyzmq-22.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:4842a8263cbaba6fce401bbe4e2b125321c401a01714e42624dabc554bfc2629"},
{file = "pyzmq-22.2.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:b921758f8b5098faa85f341bbdd5e36d5339de5e9032ca2b07d8c8e7bec5069b"},
{file = "pyzmq-22.2.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:240b83b3a8175b2f616f80092cbb019fcd5c18598f78ffc6aa0ae9034b300f14"},
{file = "pyzmq-22.2.1-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:da7f7f3bb08bcf59a6b60b4e53dd8f08bb00c9e61045319d825a906dbb3c8fb7"},
@ -2608,8 +2606,8 @@ qtconsole = [
{file = "qtconsole-5.1.1.tar.gz", hash = "sha256:bbc34bca14f65535afcb401bc74b752bac955e5313001ba640383f7e5857dc49"},
]
qtpy = [
{file = "QtPy-1.9.0-py2.py3-none-any.whl", hash = "sha256:fa0b8363b363e89b2a6f49eddc162a04c0699ae95e109a6be3bb145a913190ea"},
{file = "QtPy-1.9.0.tar.gz", hash = "sha256:2db72c44b55d0fe1407be8fba35c838ad0d6d3bb81f23007886dc1fc0f459c8d"},
{file = "QtPy-1.10.0-py2.py3-none-any.whl", hash = "sha256:f683ce6cd825ba8248a798bf1dfa1a07aca387c88ae44fa5479537490aace7be"},
{file = "QtPy-1.10.0.tar.gz", hash = "sha256:3d20f010caa3b2c04835d6a2f66f8873b041bdaf7a76085c2a0d7890cdd65ea9"},
]
regex = [
{file = "regex-2021.8.3-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:8764a78c5464ac6bde91a8c87dd718c27c1cabb7ed2b4beaf36d3e8e390567f9"},

View file

@ -33,6 +33,7 @@ pylint = "^2.9.6"
pytest = "^6.2.4"
safety = "^1.10.3"
tox = "^3.24.0"
pytest-mock = "^3.6.1"
[build-system]
build-backend = "poetry.core.masonry.api"