Combine + Tilefy (#806)

* init

* score-post

* added score csv S3 download; remove poetry cmds from readme

* working census tile fetch

* PR review

* GitHub Actions work
Jorge Escobar 2021-11-01 18:05:05 -04:00 committed by GitHub
commit 1b17af84c8
13 changed files with 560 additions and 371 deletions

@@ -104,18 +104,18 @@ TODO add mermaid diagram
1. Call the `etl-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
- With Docker: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run`
- With Poetry: `poetry run python3 data_pipeline/application.py etl-run`
2. This command will execute the corresponding ETL script for each data source in `data_pipeline/etl/sources/`. For example, `data_pipeline/etl/sources/ejscreen/etl.py` is the ETL script for EJSCREEN data.
3. Each ETL script will extract the data from its original source, then format the data into `.csv` files that get stored in the relevant folder in `data_pipeline/data/dataset/`. For example, HUD Housing data is stored in `data_pipeline/data/dataset/hud_housing/usa.csv`.
_**NOTE:** You have the option to pass the name of a specific data source to the `etl-run` command using the `-d` flag, which will limit the execution of the ETL process to that specific data source._
_For example: `poetry run python3 data_pipeline/application.py etl-run -d ejscreen` would only run the ETL process for EJSCREEN data. A sketch of the shape of a per-source ETL script follows below._
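Here is that sketch; the source URL, column names, and paths are illustrative assumptions, not the repository's actual API.

```python
# Hypothetical sketch of a per-source ETL script in the spirit of
# data_pipeline/etl/sources/ejscreen/etl.py. The source URL and
# column names are placeholders, not the repository's actual code.
from pathlib import Path

import pandas as pd

SOURCE_URL = "https://example.com/ejscreen.csv"  # placeholder upstream URL
OUTPUT_PATH = Path("data_pipeline/data/dataset/ejscreen/usa.csv")


def etl() -> None:
    # Extract: pull the raw data from the upstream source.
    raw = pd.read_csv(SOURCE_URL)

    # Transform: keep the columns the score needs, keyed on the
    # Census Block Group GEOID (column names assumed).
    output = raw[["GEOID10", "PM25"]].rename(columns={"PM25": "pm25"})

    # Load: write the formatted CSV where the score step expects it.
    OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    output.to_csv(OUTPUT_PATH, index=False)


if __name__ == "__main__":
    etl()
```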
#### Step 3: Calculate the Justice40 score experiments
1. Call the `score-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
- With Docker: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run`
- With Poetry: `poetry run python3 data_pipeline/application.py score-run`
1. The `score-run` command will execute the `etl/score/etl.py` script, which loads the data from each of the source files added to the `data/dataset/` directory by the ETL scripts in Step 1.
1. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
- Their [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column. A toy example of this merge-and-rank step appears below.
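Here is that toy example (all column names and values are hypothetical):

```python
import pandas as pd

# Two toy source datasets keyed on the Census Block Group GEOID
# (column names and values are hypothetical).
ejscreen = pd.DataFrame(
    {"GEOID10": ["010010201001", "010010201002", "010010202001"],
     "pm25": [8.1, 9.4, 7.2]}
)
hud = pd.DataFrame(
    {"GEOID10": ["010010201001", "010010201002", "010010202001"],
     "housing_burden": [0.30, 0.52, 0.18]}
)

# Merge into a single dataframe on the common GEOID key.
merged = ejscreen.merge(hud, on="GEOID10")

# Percentile rank: the fraction of block groups with a lower value
# for each column.
for col in ("pm25", "housing_burden"):
    merged[f"{col} (percentile)"] = merged[col].rank(pct=True)

print(merged)
```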
@@ -203,32 +203,58 @@ To install the above-named executables:
### Windows Users
If you want to run tile generation, please install TippeCanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows). You also need some prerequisites for Geopandas, as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally. It's definitely easier if you have access to WSL (Windows Subsystem for Linux) and install these packages using commands similar to those in our [Dockerfile](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/Dockerfile).
### Setting up Poetry
- Start a terminal
- Change to this directory (`/data/data-pipeline/`)
- Make sure you have at least Python 3.7 installed: `python -V` or `python3 -V`
- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with `poetry install`
### The Application entrypoint
After installing the Poetry dependencies, you can see a list of commands with the following steps (a sketch of how such an entrypoint is wired up follows the list):
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py --help`
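Here is that sketch, assuming a click-style CLI; the real `application.py` defines far more commands and options than the two shown here.

```python
# Minimal sketch of a click-based entrypoint in the spirit of
# data_pipeline/application.py; the real module registers more
# commands and options than shown here.
import click


@click.group()
def cli():
    """Command group; `--help` lists every registered command."""


@cli.command(name="etl-run", help="Run ETL for all data sources, or one with -d.")
@click.option("-d", "--dataset", default=None, help="Limit the run to one data source.")
def etl_run(dataset):
    click.echo(f"Running ETL for: {dataset or 'all data sources'}")


@cli.command(name="census-data-download", help="Download and prepare census data.")
def census_data_download():
    click.echo("Downloading census data...")


if __name__ == "__main__":
    cli()
```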
### Downloading Census Block Groups GeoJSON and Generating CBG CSVs (not normally required)
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- If you want to clear out all data and tiles from all directories, you can run: `poetry run python3 data_pipeline/application.py data-cleanup`.
- Then run `poetry run python3 data_pipeline/application.py census-data-download`
Note: Census files are hosted in the Justice40 S3 bucket, and you can skip this step by passing the `-s aws` or `--data-source aws` flag in the scripts below. A sketch of fetching one of these hosted files appears below.
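Here is that sketch; the bucket name comes from the `aws s3 sync` example later in this README, while the object key and destination path are assumptions.

```python
# Hedged sketch: pull a pre-generated census artifact from the
# Justice40 S3 bucket instead of rebuilding it locally. The object
# key and destination path are assumptions, not the repo's layout.
import urllib.request
from pathlib import Path

BUCKET_URL = "https://justice40-data.s3.amazonaws.com"
OBJECT_KEY = "data-pipeline/data/census/csv/us.csv"  # assumed key
DEST = Path("data_pipeline/data/census/csv/us.csv")

DEST.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(f"{BUCKET_URL}/{OBJECT_KEY}", DEST)
```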
### Run all ETL, score and map generation processes
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py data-full-run -s aws`
- Note: The `-s` flag is optional if you have generated/downloaded the census data
### Run both ETL and score generation processes
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py score-full-run`
### Run all ETL processes
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py etl-run`
### Generating Map Tiles
- Make sure you have Docker running on your machine
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py generate-map-tiles -s aws`
- If you have S3 keys, you can sync to the dev repo by doing `aws s3 sync ./data_pipeline/data/score/tiles/ s3://justice40-data/data-pipeline/data/score/tiles --acl public-read --delete`
- Note: The `-s` flag is optional if you have generated/downloaded the score data. A rough sketch of the underlying TippeCanoe invocation follows below.
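Here is that rough sketch, assuming tile generation shells out to TippeCanoe as the Windows notes above suggest; paths and zoom flags are assumptions.

```python
# Rough sketch of a TippeCanoe invocation like the one behind
# generate-map-tiles; paths and zoom flags are assumptions.
import subprocess

subprocess.run(
    [
        "tippecanoe",
        "-zg",                       # let tippecanoe pick a sensible max zoom
        "--drop-densest-as-needed",  # thin dense features instead of failing
        "-o", "data_pipeline/data/score/tiles/usa.mbtiles",  # assumed output
        "data_pipeline/data/score/geojson/usa.json",         # assumed input
    ],
    check=True,  # raise if tippecanoe exits non-zero
)
```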
### Serve the map locally
@@ -248,8 +274,7 @@ If you want to run tile generation, please install TippeCanoe [following these i
- Activate a Poetry Shell (see above)
- Run `jupyter contrib nbextension install --user`
- Run `jupyter nbextension enable python-markdown/main`
- Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See button near top right of Notebook screen.)
For more information, see the [nbextensions docs](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) and the [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/python-markdown).