Mirror of https://github.com/DOI-DO/j40-cejst-2.git, synced 2025-02-22 17:44:20 -08:00
Data folder restructuring in preparation for 361 (#376)
* initial checkin
* gitignore and docker-compose update
* readme update and error on hud
* encoding issue
* one more small README change
* data roadmap re-structure
* pyproject sort
* small update to score output folders
* checkpoint
* couple of last fixes
Parent: 3032a8305d
Commit: 543d147e61
66 changed files with 130 additions and 108 deletions
.gitignore (vendored): 14 changes
@@ -130,15 +130,15 @@ dmypy.json
 cython_debug/

 # Ignore dynaconf secret files
-score/.secrets.*
+*/data-pipeline/.secrets.*

 # ignore data
-score/data
-score/data/census
-score/data/tiles
-score/data/tmp
-score/data/dataset
-score/data/score
+*/data-pipeline/data
+*/data-pipeline/data/census
+*/data-pipeline/data/tiles
+*/data-pipeline/data/tmp
+*/data-pipeline/data/dataset
+*/data-pipeline/data/score

 # node
 node_modules
@@ -25,7 +25,7 @@ RUN add-apt-repository ppa:ubuntugis/ppa
 RUN apt-get -y install gdal-bin

 # Prepare python packages
-WORKDIR /score
+WORKDIR /data-pipeline
 RUN pip3 install --upgrade pip setuptools wheel
 COPY . .

@@ -93,30 +93,30 @@ We use Docker to install the necessary libraries in a container that can be run

 To build the docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`.

-After that, to run commands type the following:
+Once completed, run `docker-compose up` and then open a new tab or terminal window, and then run any command for the application using this format:
+
+`docker exec j40_data_pipeline_1 python3 application.py [command]`

-- Get help: `docker run --rm -it j40_score /bin/sh -c "python3 application.py --help"`
-- Clean up the census data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-cleanup"`
-- Clean up the data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py data-cleanup"`
-- Generate census data: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-data-download"`
-- Run all ETL processes: `docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"`
-- Generate Score: `docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"`
-
-## Log visualization
-
-If you want to visualize logs while running a command, the following temporary workaround can be used:
-
-- Run `docker-compose up` on the root of the repo
-- Open a new tab on your terminal
-- Then run any command for the application using this format: `docker exec j40_score_1 python3 application.py [command]`
+Here's a list of commands:
+
+- Get help: `docker exec j40_data_pipeline_1 python3 application.py --help`
+- Clean up the census data directories: `docker exec j40_data_pipeline_1 python3 application.py census-cleanup`
+- Clean up the data directories: `docker exec j40_data_pipeline_1 python3 application.py data-cleanup`
+- Generate census data: `docker exec j40_data_pipeline_1 python3 application.py census-data-download`
+- Run all ETL processes: `docker exec j40_data_pipeline_1 python3 application.py etl-run`
+- Generate Score: `docker exec j40_data_pipeline_1 python3 application.py score-run`

 ## Local development

-You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also to generate tiles for a local map, you will need [Mapbox tippeanoe](https://github.com/mapbox/tippecanoe)
+You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also to generate tiles for a local map, you will need [Mapbox tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to the repos for specific instructions for your OS.

-Note: If you are using Windows, please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install Geopandas locally. If you want to install TippeCanoe, [follow these instrcutions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+### Windows Users
+
+- If you want to download Census data or run tile generation, please install TippeCanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+- If you want to generate tiles, you need some pre-requisites for Geopandas as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally.
+
+### Setting up Poetry

 - Start a terminal
+- Change to this directory (`/data/data-pipeline`)
 - Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
 - We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
 - Install Poetry requirements with `poetry install`
@@ -125,7 +125,7 @@ Note: If you are using Windows, please follow [these instructions](https://stack

 - Make sure you have Docker running in your machine
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
 - Then run `poetry run python application.py census-data-download`
 Note: Census files are not kept in the repository and the download directories are ignored by Git
@@ -137,18 +137,18 @@ Note: If you are using Windows, please follow [these instructions](https://stack
 ### Serve the map locally

 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`

 ### Running Jupyter notebooks

 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab

 ### Activating variable-enabled Markdown for Jupyter notebooks

-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Activate a Poetry Shell (see above)
 - Run `jupyter contrib nbextension install --user`
 - Run `jupyter nbextension enable python-markdown/main`
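The README above routes every pipeline action through `application.py [command]`. The entry point itself is not part of this diff; purely as a hypothetical sketch of how such a command registry could be wired with click (one of the project's declared dependencies), using only command names that appear in the README:

```python
# Hypothetical sketch only: command names come from the README above;
# the bodies and wiring are placeholders, not the project's actual code.
import click


@click.group()
def cli():
    """Justice40 data pipeline commands."""


@cli.command(name="census-cleanup")
def census_cleanup():
    """Clean up the census data directories."""
    click.echo("Cleaning census data directories...")


@cli.command(name="etl-run")
def etl_run():
    """Run all ETL processes."""
    click.echo("Running all ETL processes...")


@cli.command(name="score-run")
def score_run():
    """Generate the score."""
    click.echo("Generating score...")


if __name__ == "__main__":
    cli()
```

Each subcommand then becomes reachable as `python3 application.py etl-run`, which is exactly the shape the `docker exec` examples rely on.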
@@ -20,8 +20,8 @@ class PostScoreETL(ExtractTransformLoad):
         self.STATE_CSV = (
             self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
         )
-        self.SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
-        self.COUNTY_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa-county.csv"
+        self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
+        self.TILE_SCORE_CSV = self.SCORE_CSV_PATH / "tile" / "usa.csv"

         self.TILES_SCORE_COLUMNS = [
             "GEOID10",
@@ -59,7 +59,9 @@ class PostScoreETL(ExtractTransformLoad):
         self.states_df = pd.read_csv(
             self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
         )
-        self.score_df = pd.read_csv(self.SCORE_CSV, dtype={"GEOID10": "string"})
+        self.score_df = pd.read_csv(
+            self.FULL_SCORE_CSV, dtype={"GEOID10": "string"}
+        )

     def transform(self) -> None:
         logger.info(f"Transforming data sources for Score + County CSV")
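Both `read_csv` calls keep explicit string dtypes for the identifier columns. That detail matters: FIPS and GEOID values carry leading zeros, and a default numeric parse drops them silently. A self-contained illustration with invented data:

```python
import io

import pandas as pd

# One invented row; "010010201001" is a block-group-style id with a leading zero.
csv_text = "GEOID10,score\n010010201001,0.75\n"

# Default inference parses GEOID10 as an integer and loses the leading zero.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["GEOID10"].iloc[0])  # 10010201001

# An explicit string dtype preserves the identifier exactly.
string_df = pd.read_csv(io.StringIO(csv_text), dtype={"GEOID10": "string"})
print(string_df["GEOID10"].iloc[0])  # 010010201001
```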
@@ -98,11 +100,9 @@ class PostScoreETL(ExtractTransformLoad):
         del self.score_county_state_merged["GEOID_OTHER"]

     def load(self) -> None:
-        logger.info(f"Saving Score + County CSV")
+        logger.info(f"Saving Full Score CSV with County Information")
         self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
-        # self.score_county_state_merged.to_csv(
-        #     self.COUNTY_SCORE_CSV, index=False
-        # )
+        self.score_county_state_merged.to_csv(self.FULL_SCORE_CSV, index=False)

         logger.info(f"Saving Tile Score CSV")
         # TODO: check which are the columns we'll use
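The rewritten `load` drops the commented-out county output and writes the merged frame to the new `FULL_SCORE_CSV` location. The pattern is: create the output tree idempotently, then write without the pandas index. A minimal standalone sketch (paths and data invented):

```python
from pathlib import Path

import pandas as pd

# Invented output location standing in for SCORE_CSV_PATH / "full".
full_dir = Path("data/score/csv/full")
full_dir.mkdir(parents=True, exist_ok=True)  # safe to rerun

# Invented stand-in for the merged score + county frame.
score_county_state_merged = pd.DataFrame(
    {"GEOID10": ["010010201001"], "Score": [0.42]}
)
# index=False keeps the row index out of the published CSV.
score_county_state_merged.to_csv(full_dir / "usa.csv", index=False)
```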
@@ -3,7 +3,14 @@ import csv
 import os
 from config import settings

-from utils import remove_files_from_dir, remove_all_dirs_from_dir, unzip_file_from_url
+from utils import (
+    remove_files_from_dir,
+    remove_all_dirs_from_dir,
+    unzip_file_from_url,
+    get_module_logger,
+)
+
+logger = get_module_logger(__name__)


 def reset_data_directories(data_path: Path) -> None:
@@ -27,6 +34,7 @@ def get_state_fips_codes(data_path: Path) -> list:

     # check if file exists
     if not os.path.isfile(fips_csv_path):
+        logger.info(f"Downloading fips from S3 repository")
         unzip_file_from_url(
             settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
             data_path / "tmp",
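`unzip_file_from_url` itself is outside this diff. Assuming it downloads a zip archive and extracts it into a destination directory (the call site above suggests roughly that shape, though the real helper takes additional arguments), a hypothetical stand-in would be:

```python
import io
import zipfile
from pathlib import Path

import requests


def unzip_file_from_url(file_url: str, unzipped_file_path: Path) -> None:
    """Download a zip archive and extract it into unzipped_file_path.

    Hypothetical stand-in for the project's helper; the real signature
    differs (the diff shows extra arguments at the call site).
    """
    response = requests.get(file_url, timeout=60)
    response.raise_for_status()
    unzipped_file_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(unzipped_file_path)
```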
@@ -9,7 +9,7 @@ logger = get_module_logger(__name__)
 class EJScreenETL(ExtractTransformLoad):
     def __init__(self):
         self.EJSCREEN_FTP_URL = "https://gaftp.epa.gov/EJSCREEN/2019/EJSCREEN_2019_StatePctile.csv.zip"
-        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctile.csv"
+        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctiles.csv"
         self.CSV_PATH = self.DATA_PATH / "dataset" / "ejscreen_2019"
         self.df: pd.DataFrame
@@ -25,12 +25,19 @@ class HousingTransportationETL(ExtractTransformLoad):
             logger.info(
                 f"Downloading housing data for state/territory with FIPS code {fips}"
             )

+            # Puerto Rico has no data, so skip
+            if fips == "72":
+                continue
+
             unzip_file_from_url(
                 f"{self.HOUSING_FTP_URL}{fips}", self.TMP_PATH, zip_file_dir
             )

             # New file name:
-            tmp_csv_file_path = zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            tmp_csv_file_path = (
+                zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            )
             tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)

             dfs.append(tmp_df)
@@ -44,9 +51,9 @@ class HousingTransportationETL(ExtractTransformLoad):

         # Rename and reformat block group ID
         self.df.rename(columns={"blkgrp": self.GEOID_FIELD_NAME}, inplace=True)
-        self.df[self.GEOID_FIELD_NAME] = self.df[self.GEOID_FIELD_NAME].str.replace(
-            '"', ""
-        )
+        self.df[self.GEOID_FIELD_NAME] = self.df[
+            self.GEOID_FIELD_NAME
+        ].str.replace('"', "")

     def load(self) -> None:
         logger.info(f"Saving Housing and Transportation Data")
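Two behavioral details are easy to miss in the reflow above: the loop now skips Puerto Rico (FIPS `72`), for which the source publishes no data, and the block group IDs arrive wrapped in literal quotation marks that the rename/replace step strips. A small sketch with invented sample frames:

```python
import pandas as pd

state_fips = ["01", "02", "72"]  # invented subset; "72" is Puerto Rico
dfs = []
for fips in state_fips:
    # Puerto Rico has no data in the source, so skip it.
    if fips == "72":
        continue
    # Invented stand-in for the per-state CSV the ETL downloads.
    dfs.append(pd.DataFrame({"blkgrp": [f'"{fips}0010201001"']}))

df = pd.concat(dfs)
df.rename(columns={"blkgrp": "GEOID10"}, inplace=True)
# The raw files wrap IDs in literal double quotes; strip them.
df["GEOID10"] = df["GEOID10"].str.replace('"', "", regex=False)
print(df["GEOID10"].tolist())  # ['010010201001', '020010201001']
```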
@@ -11,16 +11,16 @@ class HudHousingETL(ExtractTransformLoad):
     def __init__(self):
         self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "hud_housing"
         self.GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
-        self.HOUSING_FTP_URL = (
-            "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
-        )
+        self.HOUSING_FTP_URL = "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
         self.HOUSING_ZIP_FILE_DIR = self.TMP_PATH / "hud_housing"

         # We measure households earning less than 80% of HUD Area Median Family Income by county
         # and paying greater than 30% of their income to housing costs.
         self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
         self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME = "HOUSING_BURDEN_NUMERATOR"
-        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = "HOUSING_BURDEN_DENOMINATOR"
+        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = (
+            "HOUSING_BURDEN_DENOMINATOR"
+        )

         # Note: some variable definitions.
         # HUD-adjusted median family income (HAMFI).
@@ -47,10 +47,15 @@ class HudHousingETL(ExtractTransformLoad):
             / "140"
             / "Table8.csv"
         )
-        self.df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
+        self.df = pd.read_csv(
+            filepath_or_buffer=tmp_csv_file_path,
+            encoding="latin-1",
+        )

         # Rename and reformat block group ID
-        self.df.rename(columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True)
+        self.df.rename(
+            columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True
+        )

         # The CHAS data has census tract ids such as `14000US01001020100`
         # Whereas the rest of our data uses, for the same tract, `01001020100`.
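The `encoding="latin-1"` argument is the substantive fix here (the "encoding issue" in the commit message): the CHAS Table8 CSV is not valid UTF-8, so the default decoder raises on read. The surrounding comments also describe the tract-id mismatch; a standalone sketch of both the read and one plausible normalization (the trailing-slice step is an assumption, only the rename appears in this hunk):

```python
import io

import pandas as pd

# Invented bytes standing in for the CHAS download, encoded as latin-1.
raw = 'geoid,households\n"14000US01001020100",100\n'.encode("latin-1")

# Reading with the matching encoding avoids UnicodeDecodeError.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
df.rename(columns={"geoid": "GEOID10_TRACT"}, inplace=True)

# CHAS ids look like `14000US01001020100`; the rest of the data uses the
# bare 11-digit tract id, so keep only the trailing characters.
df["GEOID10_TRACT"] = df["GEOID10_TRACT"].str[-11:]
print(df["GEOID10_TRACT"].iloc[0])  # 01001020100
```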
@@ -160,7 +165,9 @@ class HudHousingETL(ExtractTransformLoad):
         # TODO: add small sample size checks
         self.df[self.HOUSING_BURDEN_FIELD_NAME] = self.df[
             self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME
-        ].astype(float) / self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME].astype(
+        ].astype(float) / self.df[
+            self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME
+        ].astype(
             float
         )
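The reformatted expression computes the same quantity as before: housing burden as burdened households divided by all counted households, with both sides cast to float first so integer or string-typed columns cannot distort the ratio. A worked example with invented counts:

```python
import pandas as pd

df = pd.DataFrame(
    {
        # Invented counts per tract: burdened households / all households.
        "HOUSING_BURDEN_NUMERATOR": [25, 40],
        "HOUSING_BURDEN_DENOMINATOR": [100, 160],
    }
)

# Cast both sides to float before dividing, as the ETL does.
df["Housing burden (percent)"] = df["HOUSING_BURDEN_NUMERATOR"].astype(
    float
) / df["HOUSING_BURDEN_DENOMINATOR"].astype(float)

print(df["Housing burden (percent)"].tolist())  # [0.25, 0.25]
```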
@@ -6,16 +6,16 @@ authors = ["Your Name <you@example.com>"]

 [tool.poetry.dependencies]
 python = "^3.7.1"
+CensusData = "^1.13"
+click = "^8.0.1"
+dynaconf = "^3.1.4"
 ipython = "^7.24.1"
 jupyter = "^1.0.0"
 jupyter-contrib-nbextensions = "^0.5.1"
 numpy = "^1.21.0"
 pandas = "^1.2.5"
 requests = "^2.25.1"
-click = "^8.0.1"
-dynaconf = "^3.1.4"
 types-requests = "^2.25.0"
-CensusData = "^1.13"

 [tool.poetry.dev-dependencies]
 mypy = "^0.910"
@@ -1,13 +1,13 @@
 version: "3.4"
 services:
   score:
-    image: j40_score
-    container_name: j40_score_1
-    build: score
+    image: j40_data_pipeline
+    container_name: j40_data_pipeline_1
+    build: data/data-pipeline
     ports:
       - 8888:8888
     volumes:
-      - ./score:/score
+      - ./data/data-pipeline:/data-pipeline
     stdin_open: true
     tty: true
     environment: