Data folder restructuring in preparation for #361 (#376)

* initial checkin

* gitignore and docker-compose update

* readme update and error on hud

* encoding issue

* one more small README change

* data roadmap re-structure

* pyproject sort

* small update to score output folders

* checkpoint

* couple of last fixes
Jorge Escobar 2021-07-20 14:55:39 -04:00 committed by GitHub
parent 3032a8305d
commit 543d147e61
66 changed files with 130 additions and 108 deletions

.gitignore

@@ -130,15 +130,15 @@ dmypy.json
 cython_debug/
 # Ignore dynaconf secret files
-score/.secrets.*
+*/data-pipeline/.secrets.*
 # ignore data
-score/data
-score/data/census
-score/data/tiles
-score/data/tmp
-score/data/dataset
-score/data/score
+*/data-pipeline/data
+*/data-pipeline/data/census
+*/data-pipeline/data/tiles
+*/data-pipeline/data/tmp
+*/data-pipeline/data/dataset
+*/data-pipeline/data/score
 # node
 node_modules

data/data-pipeline/Dockerfile

@@ -25,7 +25,7 @@ RUN add-apt-repository ppa:ubuntugis/ppa
 RUN apt-get -y install gdal-bin
 # Prepare python packages
-WORKDIR /score
+WORKDIR /data-pipeline
 RUN pip3 install --upgrade pip setuptools wheel
 COPY . .

data/data-pipeline/README.md

@@ -93,30 +93,30 @@ We use Docker to install the necessary libraries in a container that can be run
 To build the docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`.
-After that, to run commands type the following:
+Once completed, run `docker-compose up` and then open a new tab or terminal window, and then run any command for the application using this format:
+`docker exec j40_data_pipeline_1 python3 application.py [command]`
-- Get help: `docker run --rm -it j40_score /bin/sh -c "python3 application.py --help"`
-- Clean up the census data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-cleanup"`
-- Clean up the data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py data-cleanup"`
-- Generate census data: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-data-download"`
-- Run all ETL processes: `docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"`
-- Generate Score: `docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"`
+Here's a list of commands:
-## Log visualization
-If you want to visualize logs while running a command, the following temporary workaround can be used:
-- Run `docker-compose up` on the root of the repo
-- Open a new tab on your terminal
-- Then run any command for the application using this format: `docker exec j40_score_1 python3 application.py [command]`
+- Get help: `docker exec j40_data_pipeline_1 python3 application.py --help`
+- Clean up the census data directories: `docker exec j40_data_pipeline_1 python3 application.py census-cleanup`
+- Clean up the data directories: `docker exec j40_data_pipeline_1 python3 application.py data-cleanup`
+- Generate census data: `docker exec j40_data_pipeline_1 python3 application.py census-data-download`
+- Run all ETL processes: `docker exec j40_data_pipeline_1 python3 application.py etl-run`
+- Generate Score: `docker exec j40_data_pipeline_1 python3 application.py score-run`
 ## Local development
-You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also, to generate tiles for a local map, you will need [Mapbox Tippecanoe](https://github.com/mapbox/tippecanoe)
+You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also, to generate tiles for a local map, you will need [Mapbox Tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to the repos for specific instructions for your OS.
-Note: If you are using Windows, please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install Geopandas locally. If you want to install Tippecanoe, [follow these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+### Windows Users
+- If you want to download Census data or run tile generation, please install Tippecanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows).
+- If you want to generate tiles, you need some prerequisites for Geopandas as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally.
 ### Setting up Poetry
 - Start a terminal
 - Change to this directory (`/data/data-pipeline`)
 - Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
 - We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
 - Install Poetry requirements with `poetry install`
@@ -125,7 +125,7 @@ Note: If you are using Windows, please follow [these instructions](https://stack
 - Make sure you have Docker running on your machine
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
 - Then run `poetry run python application.py census-data-download`
 Note: Census files are not kept in the repository and the download directories are ignored by Git
@@ -137,18 +137,18 @@ Note: If you are using Windows, please follow [these instructions](https://stack
 ### Serve the map locally
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`
 ### Running Jupyter notebooks
 - Start a terminal
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab
 ### Activating variable-enabled Markdown for Jupyter notebooks
-- Change to this directory (i.e. `cd score`)
+- Change to this directory (i.e. `cd data/data-pipeline`)
 - Activate a Poetry Shell (see above)
 - Run `jupyter contrib nbextension install --user`
 - Run `jupyter nbextension enable python-markdown/main`

data/data-pipeline/etl/score/etl_score_post.py

@@ -20,8 +20,8 @@ class PostScoreETL(ExtractTransformLoad):
         self.STATE_CSV = (
             self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
         )
-        self.SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
-        self.COUNTY_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa-county.csv"
+        self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
+        self.TILE_SCORE_CSV = self.SCORE_CSV_PATH / "tile" / "usa.csv"
         self.TILES_SCORE_COLUMNS = [
             "GEOID10",
@@ -59,7 +59,9 @@ class PostScoreETL(ExtractTransformLoad):
         self.states_df = pd.read_csv(
             self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
         )
-        self.score_df = pd.read_csv(self.SCORE_CSV, dtype={"GEOID10": "string"})
+        self.score_df = pd.read_csv(
+            self.FULL_SCORE_CSV, dtype={"GEOID10": "string"}
+        )
     def transform(self) -> None:
         logger.info(f"Transforming data sources for Score + County CSV")
@@ -98,11 +100,9 @@ class PostScoreETL(ExtractTransformLoad):
         del self.score_county_state_merged["GEOID_OTHER"]
     def load(self) -> None:
-        logger.info(f"Saving Score + County CSV")
+        logger.info(f"Saving Full Score CSV with County Information")
         self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
-        # self.score_county_state_merged.to_csv(
-        #     self.COUNTY_SCORE_CSV, index=False
-        # )
+        self.score_county_state_merged.to_csv(self.FULL_SCORE_CSV, index=False)
         logger.info(f"Saving Tile Score CSV")
         # TODO: check which are the columns we'll use
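
Reading the score with `dtype={"GEOID10": "string"}`, as the extract hunk above does, is what keeps census identifiers intact. A minimal standalone sketch (toy data, not repo code) of what the string dtype prevents:

```python
# Toy illustration: census GEOIDs can begin with "0" (Alabama's FIPS prefix
# is "01"), and pandas' default numeric inference drops that leading zero.
import io

import pandas as pd

csv_data = "GEOID10,score\n010010201001,0.5\n"

inferred = pd.read_csv(io.StringIO(csv_data))
print(inferred["GEOID10"].iloc[0])  # 10010201001 -- leading zero lost

as_string = pd.read_csv(io.StringIO(csv_data), dtype={"GEOID10": "string"})
print(as_string["GEOID10"].iloc[0])  # 010010201001 -- preserved
```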

data/data-pipeline/etl/sources/census/etl_utils.py

@@ -3,7 +3,14 @@ import csv
 import os
 from config import settings
-from utils import remove_files_from_dir, remove_all_dirs_from_dir, unzip_file_from_url
+from utils import (
+    remove_files_from_dir,
+    remove_all_dirs_from_dir,
+    unzip_file_from_url,
+    get_module_logger,
+)
+logger = get_module_logger(__name__)
 def reset_data_directories(data_path: Path) -> None:
@@ -27,6 +34,7 @@ def get_state_fips_codes(data_path: Path) -> list:
     # check if file exists
     if not os.path.isfile(fips_csv_path):
+        logger.info(f"Downloading fips from S3 repository")
         unzip_file_from_url(
             settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
             data_path / "tmp",
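
For context, here is a hypothetical sketch of what a helper like `unzip_file_from_url` plausibly does; the repo's actual signature and implementation are not shown in this diff, so the parameter names below are assumptions:

```python
# Hypothetical sketch only -- not the repo's actual helper.
import urllib.request
import zipfile
from pathlib import Path


def unzip_file_from_url(file_url: str, download_path: Path, unzipped_path: Path) -> None:
    """Download a zip archive and extract its contents."""
    download_path.mkdir(parents=True, exist_ok=True)
    unzipped_path.mkdir(parents=True, exist_ok=True)
    zip_file_path = download_path / "downloaded.zip"
    # Fetch the archive to a temporary location...
    urllib.request.urlretrieve(file_url, zip_file_path)
    # ...then extract everything into the target directory.
    with zipfile.ZipFile(zip_file_path) as zip_file:
        zip_file.extractall(unzipped_path)
```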

data/data-pipeline/etl/sources/ejscreen/etl.py

@@ -9,7 +9,7 @@ logger = get_module_logger(__name__)
 class EJScreenETL(ExtractTransformLoad):
     def __init__(self):
         self.EJSCREEN_FTP_URL = "https://gaftp.epa.gov/EJSCREEN/2019/EJSCREEN_2019_StatePctile.csv.zip"
-        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctile.csv"
+        self.EJSCREEN_CSV = self.TMP_PATH / "EJSCREEN_2019_StatePctiles.csv"
         self.CSV_PATH = self.DATA_PATH / "dataset" / "ejscreen_2019"
         self.df: pd.DataFrame

data/data-pipeline/etl/sources/housing_and_transportation/etl.py

@@ -25,12 +25,19 @@ class HousingTransportationETL(ExtractTransformLoad):
             logger.info(
                 f"Downloading housing data for state/territory with FIPS code {fips}"
             )
+            # Puerto Rico has no data, so skip
+            if fips == "72":
+                continue
             unzip_file_from_url(
                 f"{self.HOUSING_FTP_URL}{fips}", self.TMP_PATH, zip_file_dir
             )
             # New file name:
-            tmp_csv_file_path = zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            tmp_csv_file_path = (
+                zip_file_dir / f"htaindex_data_blkgrps_{fips}.csv"
+            )
             tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
             dfs.append(tmp_df)
@@ -44,9 +51,9 @@ class HousingTransportationETL(ExtractTransformLoad):
         # Rename and reformat block group ID
         self.df.rename(columns={"blkgrp": self.GEOID_FIELD_NAME}, inplace=True)
-        self.df[self.GEOID_FIELD_NAME] = self.df[self.GEOID_FIELD_NAME].str.replace(
-            '"', ""
-        )
+        self.df[self.GEOID_FIELD_NAME] = self.df[
+            self.GEOID_FIELD_NAME
+        ].str.replace('"', "")
     def load(self) -> None:
         logger.info(f"Saving Housing and Transportation Data")

data/data-pipeline/etl/sources/hud_housing/etl.py

@@ -11,16 +11,16 @@ class HudHousingETL(ExtractTransformLoad):
     def __init__(self):
         self.OUTPUT_PATH = self.DATA_PATH / "dataset" / "hud_housing"
         self.GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
-        self.HOUSING_FTP_URL = (
-            "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
-        )
+        self.HOUSING_FTP_URL = "https://www.huduser.gov/portal/datasets/cp/2012thru2016-140-csv.zip"
         self.HOUSING_ZIP_FILE_DIR = self.TMP_PATH / "hud_housing"
         # We measure households earning less than 80% of HUD Area Median Family Income by county
         # and paying greater than 30% of their income to housing costs.
         self.HOUSING_BURDEN_FIELD_NAME = "Housing burden (percent)"
         self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME = "HOUSING_BURDEN_NUMERATOR"
-        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = "HOUSING_BURDEN_DENOMINATOR"
+        self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME = (
+            "HOUSING_BURDEN_DENOMINATOR"
+        )
         # Note: some variable definitions.
         # HUD-adjusted median family income (HAMFI).
@@ -47,10 +47,15 @@ class HudHousingETL(ExtractTransformLoad):
             / "140"
             / "Table8.csv"
         )
-        self.df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)
+        self.df = pd.read_csv(
+            filepath_or_buffer=tmp_csv_file_path,
+            encoding="latin-1",
+        )
         # Rename and reformat block group ID
-        self.df.rename(columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True)
+        self.df.rename(
+            columns={"geoid": self.GEOID_TRACT_FIELD_NAME}, inplace=True
+        )
         # The CHAS data has census tract ids such as `14000US01001020100`
         # Whereas the rest of our data uses, for the same tract, `01001020100`.
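
The comment above describes a prefix strip performed further down in the file. A hypothetical one-liner for that step (the actual code is outside this hunk):

```python
# CHAS tract ids look like "14000US01001020100"; the rest of the pipeline
# expects the bare 11-digit tract id "01001020100".
import pandas as pd

df = pd.DataFrame({"GEOID10_TRACT": ["14000US01001020100"]})
df["GEOID10_TRACT"] = df["GEOID10_TRACT"].str.replace("14000US", "", regex=False)
print(df["GEOID10_TRACT"].iloc[0])  # 01001020100
```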
@@ -160,7 +165,9 @@ class HudHousingETL(ExtractTransformLoad):
         # TODO: add small sample size checks
         self.df[self.HOUSING_BURDEN_FIELD_NAME] = self.df[
             self.HOUSING_BURDEN_NUMERATOR_FIELD_NAME
-        ].astype(float) / self.df[self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME].astype(
+        ].astype(float) / self.df[
+            self.HOUSING_BURDEN_DENOMINATOR_FIELD_NAME
+        ].astype(
             float
         )
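
The reformatted division above is a plain ratio: the numerator the earlier comments define (households under 80% of HUD Area Median Family Income paying over 30% of income to housing), over a denominator that presumably counts total relevant households. The same arithmetic on made-up numbers (not repo data):

```python
# Toy numbers: 25 of 100 qualifying households burdened in one tract, 0 of 50
# in another; the field stores the share as a fraction despite its name.
import pandas as pd

df = pd.DataFrame(
    {
        "HOUSING_BURDEN_NUMERATOR": [25, 0],
        "HOUSING_BURDEN_DENOMINATOR": [100, 50],
    }
)
df["Housing burden (percent)"] = df["HOUSING_BURDEN_NUMERATOR"].astype(
    float
) / df["HOUSING_BURDEN_DENOMINATOR"].astype(float)
print(df["Housing burden (percent)"].tolist())  # [0.25, 0.0]
```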

data/data-pipeline/pyproject.toml

@@ -6,16 +6,16 @@ authors = ["Your Name <you@example.com>"]
 [tool.poetry.dependencies]
 python = "^3.7.1"
+CensusData = "^1.13"
+click = "^8.0.1"
+dynaconf = "^3.1.4"
 ipython = "^7.24.1"
 jupyter = "^1.0.0"
 jupyter-contrib-nbextensions = "^0.5.1"
 numpy = "^1.21.0"
 pandas = "^1.2.5"
 requests = "^2.25.1"
-click = "^8.0.1"
-dynaconf = "^3.1.4"
 types-requests = "^2.25.0"
-CensusData = "^1.13"
 [tool.poetry.dev-dependencies]
 mypy = "^0.910"

docker-compose.yml

@@ -1,13 +1,13 @@
 version: "3.4"
 services:
   score:
-    image: j40_score
-    container_name: j40_score_1
-    build: score
+    image: j40_data_pipeline
+    container_name: j40_data_pipeline_1
+    build: data/data-pipeline
     ports:
       - 8888:8888
     volumes:
-      - ./score:/score
+      - ./data/data-pipeline:/data-pipeline
     stdin_open: true
     tty: true
     environment: