
Justice 40 Score application

About this application

This application is used to compare experimental versions of the Justice40 score to established environmental justice indices, such as EJSCREEN, CalEnviroScreen, and so on.

NOTE: These scores do not represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time.

Score comparison workflow

The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.

Workflow Diagram

TODO add mermaid diagram

Step 0: Set up your environment

  1. After cloning the project locally, change to this directory: cd score
  2. Choose whether you'd like to run this application using Docker or if you'd like to install the dependencies locally so you can contribute to the project.

Step 1: Run the ETL script for each data source

  1. Call the etl-run command using the application manager, application.py. NOTE: This may take several minutes to execute.
    • With Docker: docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"
    • With Poetry: poetry run python application.py etl-run
  2. The etl-run command will execute the corresponding ETL script for each data source in etl/sources/. For example, etl/sources/ejscreen/etl.py is the ETL script for EJSCREEN data.
  3. Each ETL script will extract the data from its original source, then format the data into .csv files that get stored in the relevant folder in data/dataset/. For example, HUD Housing data is stored in data/dataset/hud_housing/usa.csv

NOTE: You have the option to pass the name of a specific data source to the etl-run command, which will limit the execution of the ETL process to that specific data source. For example: poetry run python application.py etl-run ejscreen would only run the ETL process for EJSCREEN data.
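As a rough sketch of the shape one of these ETL scripts takes, the snippet below walks through extract, transform, and load stages for a made-up source. The helper names, column names, and the example_source path are all illustrative stand-ins, not the repository's actual code:

```python
import csv
import pathlib

def extract(source_rows):
    """Stand-in for downloading the data from its original source."""
    return list(source_rows)

def transform(rows):
    """Rename columns and keep only the fields the score needs."""
    return [{"GEOID10": r["id"], "value": float(r["val"])} for r in rows]

def load(rows, out_path):
    """Write the formatted rows to a CSV under data/dataset/."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["GEOID10", "value"])
        writer.writeheader()
        writer.writerows(rows)

# Toy input: one Census Block Group row from a hypothetical source.
raw = [{"id": "010010201001", "val": "3.2"}]
out = pathlib.Path("data/dataset/example_source/usa.csv")
load(transform(extract(raw)), out)
```

The key invariant is the output location: whatever a source's original format, its ETL script ends by writing a CSV keyed on GEOID into data/dataset/, which is what Step 2 reads.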

Step 2: Calculate the Justice40 score experiments

  1. Call the score-run command using the application manager, application.py. NOTE: This may take several minutes to execute.
    • With Docker: docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"
    • With Poetry: poetry run python application.py score-run
  2. The score-run command will execute the etl/score/etl.py script which loads the data from each of the source files added to the data/dataset/ directory by the ETL scripts in Step 1.
  3. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
    • Their percentile rank is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
    • They are normalized using min-max normalization, which adjusts the scale of the data so that the Census Block Group with the highest value for that column is set to 1, the Census Block Group with the lowest value is set to 0, and all of the other values are adjusted to fit within that range based on how close they were to the highest or lowest value.
  4. The standardized columns are then used to calculate each of the Justice40 score experiments described in greater detail below, and the results are exported to a .csv file in data/score/csv

Step 3: Compare the Justice40 score experiments to other indices

  1. TODO: Describe the steps for this

Data Sources

Running using Docker

We use Docker to install the necessary libraries in a container that can be run in any operating system.

To build the Docker container for the first time, make sure you're in the root directory of the repository and run docker-compose build.

After that, to run commands type the following:

  • Get help: docker run --rm -it j40_score /bin/sh -c "python3 application.py --help"
  • Clean up the census data directories: docker run --rm -it j40_score /bin/sh -c "python3 application.py census-cleanup"
  • Clean up the data directories: docker run --rm -it j40_score /bin/sh -c "python3 application.py data-cleanup"
  • Generate census data: docker run --rm -it j40_score /bin/sh -c "python3 application.py census-data-download"
  • Run all ETL processes: docker run --rm -it j40_score /bin/sh -c "python3 application.py etl-run"
  • Generate Score: docker run --rm -it j40_score /bin/sh -c "python3 application.py score-run"

Log visualization

If you want to visualize logs while running a command, the following temporary workaround can be used:

  • Run docker-compose up in the root of the repo
  • Open a new tab in your terminal
  • Then run any command for the application using this format: docker exec j40_score_1 python3 application.py [command]

Local development

You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the GDAL library installed locally. Also, to generate tiles for a local map, you will need Mapbox's tippecanoe.

Note: If you are using Windows, please follow these instructions to install GeoPandas locally. If you want to install tippecanoe, follow these instructions.

  • Start a terminal
  • Make sure you have Python 3.9 installed: python -V or python3 -V
  • We use Poetry for managing dependencies and building the application. Please follow the instructions on their site to download and install it.
  • Install Poetry requirements with poetry install

Downloading Census Block Groups GeoJSON and Generating CBG CSVs

  • Make sure you have Docker running on your machine
  • Start a terminal
  • Change to this directory (i.e. cd score)
  • If you want to clear out all data and tiles from all directories, you can run: poetry run python application.py data-cleanup.
  • Then run poetry run python application.py census-data-download. Note: Census files are not kept in the repository, and the download directories are ignored by Git

Generating mbtiles

  • TBD

Serve the map locally

  • Start a terminal
  • Change to this directory (i.e. cd score)
  • Run: docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl

Running Jupyter notebooks

  • Start a terminal
  • Change to this directory (i.e. cd score)
  • Run poetry run jupyter notebook. Your browser should open with a Jupyter Notebook tab

Activating variable-enabled Markdown for Jupyter notebooks

  • Change to this directory (i.e. cd score)
  • Activate a Poetry Shell (see above)
  • Run jupyter contrib nbextension install --user
  • Run jupyter nbextension enable python-markdown/main
  • Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See button near top right of Notebook screen.)

For more information, see nbextensions docs and see python-markdown docs.

Miscellaneous

  • To export packages from Poetry to requirements.txt, run: poetry export --without-hashes > requirements.txt