Mirror of https://github.com/DOI-DO/j40-cejst-2.git, synced 2025-02-22 09:41:26 -08:00
Data folder restructuring in preparation for 361 (#376)
* initial checkin
* gitignore and docker-compose update
* readme update and error on hud
* encoding issue
* one more small README change
* data roadmap re-structure
* pyproject sort
* small update to score output folders
* checkpoint
* couple of last fixes
This commit is contained in:
parent 3032a8305d
commit 543d147e61

66 changed files with 130 additions and 108 deletions
.gitignore (vendored): 14 changed lines
@@ -130,15 +130,15 @@ dmypy.json
 cython_debug/
 
 # Ignore dynaconf secret files
-score/.secrets.*
+*/data-pipeline/.secrets.*
 
 # ignore data
-score/data
-score/data/census
-score/data/tiles
-score/data/tmp
-score/data/dataset
-score/data/score
+*/data-pipeline/data
+*/data-pipeline/data/census
+*/data-pipeline/data/tiles
+*/data-pipeline/data/tmp
+*/data-pipeline/data/dataset
+*/data-pipeline/data/score
 
 # node
 node_modules
@@ -25,7 +25,7 @@ RUN add-apt-repository ppa:ubuntugis/ppa
 RUN apt-get -y install gdal-bin
 
 # Prepare python packages
-WORKDIR /score
+WORKDIR /data-pipeline
 RUN pip3 install --upgrade pip setuptools wheel
 COPY . .
@@ -93,30 +93,30 @@ We use Docker to install the necessary libraries in a container that can be run
 
 To build the docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`.
 
-After that, to run commands type the following:
+Once completed, run `docker-compose up`, then open a new tab or terminal window and run any command for the application using this format:
+`docker exec j40_data_pipeline_1 python3 application.py [command]`
 
-- Get help: `docker run --rm -it j40_score /bin/sh -c "python3 application.py --help"`
-- Clean up the census data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-cleanup"`
-- Clean up the data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py data-cleanup"`
-- Generate census data: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-data-download"`
-- Run all ETL processes: `docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"`
-- Generate Score: `docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"`
+Here's a list of commands:
 
-## Log visualization
-
-If you want to visualize logs while running a command, the following temporary workaround can be used:
-
-- Run `docker-compose up` on the root of the repo
-- Open a new tab on your terminal
-- Then run any command for the application using this format: `docker exec j40_score_1 python3 application.py [command]`
+- Get help: `docker exec j40_data_pipeline_1 python3 application.py --help`
+- Clean up the census data directories: `docker exec j40_data_pipeline_1 python3 application.py census-cleanup`
+- Clean up the data directories: `docker exec j40_data_pipeline_1 python3 application.py data-cleanup`
+- Generate census data: `docker exec j40_data_pipeline_1 python3 application.py census-data-download`
+- Run all ETL processes: `docker exec j40_data_pipeline_1 python3 application.py etl-run`
+- Generate Score: `docker exec j40_data_pipeline_1 python3 application.py score-run`
 
 ## Local development
 
-You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also to generate tiles for a local map, you will need [Mapbox tippeanoe](https://github.com/mapbox/tippecanoe)
+You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also, to generate tiles for a local map, you will need [Mapbox Tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to those repos for installation instructions specific to your OS.
 
-Note: If you are using Windows, please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install Geopandas locally. If you want to install TippeCanoe, [follow these instrcutions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+### Windows Users
+- If you want to download Census data or run tile generation, please install Tippecanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+- If you want to generate tiles, you need some prerequisites for Geopandas as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally.
 
 ### Setting up Poetry
 
 - Start a terminal
 - Change to this directory (`/data/data-pipeline`)
 - Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
 - We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
 - Install Poetry requirements with `poetry install`
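For orientation, the `application.py` used in all of these commands is the pipeline's command-line entry point, and `click` is among the project's dependencies (see the pyproject.toml diff below). A minimal sketch of how such a click-based CLI might be wired, with command names taken from the list above; the decorators and bodies are assumptions, not the repository's actual code:

```python
# Hypothetical sketch of a click-based application.py; only the command
# names are taken from the README above, everything else is assumed.
import click


@click.group()
def cli():
    """Top-level command group for the data pipeline."""


@cli.command(name="census-cleanup")
def census_cleanup():
    """Clean up the census data directories."""
    click.echo("Cleaning census data directories...")


@cli.command(name="etl-run")
def etl_run():
    """Run all ETL processes."""
    click.echo("Running ETL processes...")


if __name__ == "__main__":
    cli()
```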
@@ -125,7 +125,7 @@ Note: If you are using Windows, please follow [these instructions](https://stack
 
 - Make sure you have Docker running on your machine
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
 - Then run `poetry run python application.py census-data-download`
 Note: Census files are not kept in the repository and the download directories are ignored by Git
@@ -137,18 +137,18 @@ Note: If you are using Windows, please follow [these instructions](https://stack
 ### Serve the map locally
 
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`
 
 ### Running Jupyter notebooks
 
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab
 
 ### Activating variable-enabled Markdown for Jupyter notebooks
 
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Activate a Poetry Shell (see above)
 - Run `jupyter contrib nbextension install --user`
 - Run `jupyter nbextension enable python-markdown/main`
@@ -20,8 +20,8 @@ class PostScoreETL(ExtractTransformLoad):
         self.STATE_CSV = (
             self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
         )
-        self.SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
-        self.COUNTY_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa-county.csv"
+        self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
+        self.TILR_SCORE_CSV = self.SCORE_CSV_PATH / "tile" / "usa.csv"
 
         self.TILES_SCORE_COLUMNS = [
             "GEOID10",
@@ -59,7 +59,9 @@ class PostScoreETL(ExtractTransformLoad):
         self.states_df = pd.read_csv(
             self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
         )
-        self.score_df = pd.read_csv(self.SCORE_CSV, dtype={"GEOID10": "string"})
+        self.score_df = pd.read_csv(
+            self.FULL_SCORE_CSV, dtype={"GEOID10": "string"}
+        )
 
     def transform(self) -> None:
        logger.info(f"Transforming data sources for Score + County CSV")
@@ -98,11 +100,9 @@ class PostScoreETL(ExtractTransformLoad):
         del self.score_county_state_merged["GEOID_OTHER"]
 
     def load(self) -> None:
-        logger.info(f"Saving Score + County CSV")
+        logger.info(f"Saving Full Score CSV with County Information")
         self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
-        # self.score_county_state_merged.to_csv(
-        #     self.COUNTY_SCORE_CSV, index=False
-        # )
+        self.score_county_state_merged.to_csv(self.FULL_SCORE_CSV, index=False)
 
         logger.info(f"Saving Tile Score CSV")
         # TODO: check which are the columns we'll use
@@ -3,7 +3,14 @@ import csv
 import os
 from config import settings
 
-from utils import remove_files_from_dir, remove_all_dirs_from_dir, unzip_file_from_url
+from utils import (
+    remove_files_from_dir,
+    remove_all_dirs_from_dir,
+    unzip_file_from_url,
+    get_module_logger,
+)
+
+logger = get_module_logger(__name__)
 
 
 def reset_data_directories(data_path: Path) -> None:
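Several of the touched modules route status messages through `get_module_logger`. Its implementation lives in utils.py and is not part of this diff; a minimal sketch of such a helper might look like this:

```python
# Minimal sketch of the get_module_logger helper imported above; the actual
# implementation is not shown in this diff, so details here are assumed.
import logging


def get_module_logger(module_name: str) -> logging.Logger:
    """Return a stream-logging logger named after the calling module."""
    logger = logging.getLogger(module_name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(name)s] %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```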
@@ -27,6 +34,7 @@ def get_state_fips_codes(data_path: Path) -> list:
 
     # check if file exists
     if not os.path.isfile(fips_csv_path):
+        logger.info(f"Downloading fips from S3 repository")
         unzip_file_from_url(
             settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
             data_path / "tmp",
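The `unzip_file_from_url` helper called here and in the housing ETL below is also defined in utils.py, outside this diff. A sketch of a download-and-extract helper consistent with the call sites, where the signature and body are assumptions:

```python
# Hypothetical sketch of unzip_file_from_url; the name comes from the
# imports above, but the signature and body are assumed, not shown here.
import zipfile
from pathlib import Path
from typing import Optional

import requests


def unzip_file_from_url(
    file_url: str,
    download_path: Path,
    unzipped_file_path: Optional[Path] = None,
) -> None:
    """Download a zip archive and extract its contents."""
    download_path.mkdir(parents=True, exist_ok=True)
    zip_file_path = download_path / "downloaded.zip"  # temp name (assumed)
    response = requests.get(file_url)
    response.raise_for_status()
    zip_file_path.write_bytes(response.content)
    with zipfile.ZipFile(zip_file_path) as zip_file:
        zip_file.extractall(unzipped_file_path or download_path)
```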
@@ -9,7 +9,7 @@ logger = get_module_logger(__name__)
 class EJScreenETL(ExtractTransformLoad):
     def __init__(self):
         self.EJSCREEN_FTP_URL = "https://gaftp.epa.gov/EJSCREEN/2019/EJSCREEN_2019_StatePctile.csv.zip"
-        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctile.csv"
+        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctiles.csv"
         self.CSV_PATH = self.DATA_PATH / "dataset" / "ejscreen_2019"
         self.df: pd.DataFrame
@@ -25,12 +25,19 @@ class HousingTransportationETL(ExtractTransformLoad):
             logger.info(
                 f"Downloading housing data for state/territory with FIPS code {fips}"
             )
+
+            # Puerto Rico has no data, so skip
+            if fips == "72":
+                continue
+
             unzip_file_from_url(
                 f"{self.HOUSING_FTP_URL}{fips}", self.TMP_PATH, zip_file_dir
             )
 
             # New file name:
-            tmp_csv_file_path = zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            tmp_csv_file_path = (
+                zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            )
             tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
 
             dfs.append(tmp_df)
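The loop above appends one per-state frame to `dfs`; combining them into `self.df` presumably happens just after this hunk ends, along these lines (an assumed continuation, not shown in the diff):

```python
# Assumed continuation after the download loop (outside this hunk):
self.df = pd.concat(dfs)
```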
@@ -44,9 +51,9 @@ class HousingTransportationETL(ExtractTransformLoad):
 
         # Rename and reformat block group ID
         self.df.rename(columns={"blkgrp": self.GEOID_FIELD_NAME}, inplace=True)
-        self.df[self.GEOID_FIELD_NAME] = self.df[self.GEOID_FIELD_NAME].str.replace(
-            '"', ""
-        )
+        self.df[self.GEOID_FIELD_NAME] = self.df[
+            self.GEOID_FIELD_NAME
+        ].str.replace('"', "")
 
     def load(self) -> None:
         logger.info(f"Saving Housing and Transportation Data")
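For context on the quote-stripping above: the HTA index CSVs wrap block group ids in literal double quotes, so after loading they still carry the quote characters. A small illustration with made-up values:

```python
import pandas as pd

# Made-up ids showing the embedded quote characters the ETL removes:
geoids = pd.Series(['"010010201001"', '"010010201002"'])
print(geoids.str.replace('"', "", regex=False).tolist())
# ['010010201001', '010010201002']
```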
@@ -11,16 +11,16 @@ class HudHousingETL(ExtractTransformLoad):
     def __init__(self):
         self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "hud_housing"
         self.GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
-        self.HOUSING_FTP_URL = (
-            "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
-        )
+        self.HOUSING_FTP_URL = "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
         self.HOUSING_ZIP_FILE_DIR = self.TMP_PATH / "hud_housing"
 
         # We measure households earning less than 80% of HUD Area Median Family Income by county
         # and paying greater than 30% of their income to housing costs.
         self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
         self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME = "HOUSING_BURDEN_NUMERATOR"
-        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = "HOUSING_BURDEN_DENOMINATOR"
+        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = (
+            "HOUSING_BURDEN_DENOMINATOR"
+        )
 
         # Note: some variable definitions.
         # HUD-adjusted median family income (HAMFI).
@@ -47,10 +47,15 @@ class HudHousingETL(ExtractTransformLoad):
             / "140"
             / "Table8.csv"
         )
-        self.df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
+        self.df = pd.read_csv(
+            filepath_or_buffer=tmp_csv_file_path,
+            encoding="latin-1",
+        )
 
         # Rename and reformat block group ID
-        self.df.rename(columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True)
+        self.df.rename(
+            columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True
+        )
 
         # The CHAS data has census tract ids such as `14000US01001020100`
         # Whereas the rest of our data uses, for the same tract, `01001020100`.
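The closing comments point at the `14000US` prefix mismatch in CHAS tract ids. The usual fix is to keep only the trailing 11-digit GEOID; a sketch of what that transform might look like, since the actual line falls outside this hunk:

```python
# Hypothetical sketch: strip the CHAS "14000US" prefix so tract ids match
# the 11-digit GEOIDs used by the rest of the pipeline.
self.df[self.GEOID_TRACT_FIELD_NAME] = self.df[
    self.GEOID_TRACT_FIELD_NAME
].str.replace(r"^.*US", "", regex=True)
```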
@@ -160,7 +165,9 @@ class HudHousingETL(ExtractTransformLoad):
         # TODO: add small sample size checks
         self.df[self.HOUSING_BURDEN_FIELD_NAME] = self.df[
             self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME
-        ].astype(float) / self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME].astype(
+        ].astype(float) / self.df[
+            self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME
+        ].astype(
             float
         )
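The wrapped form above is just line-length reformatting; semantically it is an element-wise ratio of the two columns. Written without the wrapping, as a hypothetical simplification rather than code from the diff:

```python
# Equivalent unwrapped computation of the housing burden ratio
# (hypothetical simplification, not part of the commit):
numerator = self.df[self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME].astype(float)
denominator = self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME].astype(float)
self.df[self.HOUSING_BURDEN_FIELD_NAME] = numerator / denominator
```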
@@ -6,16 +6,16 @@ authors = ["Your Name <you@example.com>"]
 
 [tool.poetry.dependencies]
 python = "^3.7.1"
+CensusData = "^1.13"
+click = "^8.0.1"
+dynaconf = "^3.1.4"
 ipython = "^7.24.1"
 jupyter = "^1.0.0"
 jupyter-contrib-nbextensions = "^0.5.1"
 numpy = "^1.21.0"
 pandas = "^1.2.5"
 requests = "^2.25.1"
-click = "^8.0.1"
-dynaconf = "^3.1.4"
 types-requests = "^2.25.0"
-CensusData = "^1.13"
 
 [tool.poetry.dev-dependencies]
 mypy = "^0.910"
@@ -1,13 +1,13 @@
 version: "3.4"
 services:
   score:
-    image: j40_score
-    container_name: j40_score_1
-    build: score
+    image: j40_data_pipeline
+    container_name: j40_data_pipeline_1
+    build: data/data-pipeline
     ports:
       - 8888:8888
     volumes:
-      - ./score:/score
+      - ./data/data-pipeline:/data-pipeline
     stdin_open: true
     tty: true
     environment: