Justice 40 Score application
About this application
This application is used to compare experimental versions of the Justice40 score to established environmental justice indices, such as EJSCREEN, CalEnviroScreen, and so on.
NOTE: These scores do not represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time.
Using the data
One of our primary development principles is that the entire data pipeline should be open and replicable end-to-end. As part of this, in addition to all code being open, we also strive to make data visible and available for use at every stage of our pipeline. You can follow the instructions below in this README to spin up the data pipeline yourself in your own environment; you can also access the data we've already processed on our S3 bucket.
In the sub-sections below, we outline what each stage of the data provenance looks like and where you can find the data output by that stage. If you'd like to actually perform each step in your own environment, skip down to Score generation and comparison workflow.
1. Source data
If you would like to find and use the raw source data, you can find the source URLs in the etl.py files located within each directory in data/data-pipeline/etl/sources.
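As a rough aid for locating those files, the sketch below lists the etl.py script inside each source directory. The helper name find_etl_scripts is hypothetical and not part of the pipeline; pass it the sources path quoted above, adjusted to wherever your checkout lives.

```python
from pathlib import Path


def find_etl_scripts(sources_dir):
    """Return the etl.py found directly inside each source directory, sorted by path.

    Hypothetical helper for browsing the repository; not part of the pipeline.
    """
    root = Path(sources_dir)
    # "*/etl.py" matches exactly one directory level below the sources folder.
    return sorted(path for path in root.glob("*/etl.py") if path.is_file())
```

Calling find_etl_scripts("data/data-pipeline/etl/sources") from the repository root should then give one path per data source, which you can open to read the source URLs.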
2. Extract-Transform-Load (ETL) the data
The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in data/data-pipeline/etl/sources, and the output of this process is a number of CSVs available at the following locations:
- EJScreen: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv
- Census ACS 2019: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv
- Housing and Transportation Index: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv
- HUD Housing: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv
Each CSV may have a different column name for the census tract or census block group identifier. You can find what the name is in the ETL code. Please note that when you view these files you should make sure that your text editor or spreadsheet software does not remove the initial 0 from this identifier field (many IDs begin with 0).
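If a tool has already stripped the leading zeros, the identifiers can usually be repaired by padding them back to their fixed width. The sketch below uses only the Python standard library and assumes the standard Census GEOID widths (11 digits for a tract, 12 for a block group). The function names and the GEOID10 column name in the usage note are illustrative assumptions; check the ETL code for the real column name in each file.

```python
import csv
import io

# Standard Census GEOID widths: state (2) + county (3) + tract (6) [+ block group (1)].
TRACT_WIDTH = 11
BLOCK_GROUP_WIDTH = 12


def restore_geoid(value, width=BLOCK_GROUP_WIDTH):
    """Left-pad an identifier with zeros in case a tool stripped them."""
    text = str(value).strip()
    if not text.isdigit():
        raise ValueError(f"not a numeric identifier: {value!r}")
    if len(text) > width:
        raise ValueError(f"identifier longer than {width} digits: {value!r}")
    return text.zfill(width)


def read_rows(csv_text, id_column, width=BLOCK_GROUP_WIDTH):
    """Parse CSV text, keeping every field as a string and repairing the ID column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row[id_column] = restore_geoid(row[id_column], width)
        rows.append(dict(row))
    return rows
```

For example, read_rows("GEOID10,value\n10010201001,0.5\n", "GEOID10") turns the damaged ID 10010201001 back into 010010201001. The caller has to supply the width, because a block group ID that lost one zero is indistinguishable from a tract ID by length alone.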
3. Combined dataset
The CSV with the combined data from all of these sources will be available soon!
4. Tileset
Once we have all the data from the previous stages, we convert it to tiles to make it usable on a map. We only need a subset of the data to display in our client UI, so we do not include all data from the combined CSV in the tileset.
Link to the tile server coming soon!
Score generation and comparison workflow
The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.
Workflow Diagram
TODO add mermaid diagram
Step 0: Set up your environment
- Choose whether you'd like to run this application using Docker or install the dependencies locally so you can contribute to the project.
  - With Docker: Follow these installation instructions and skip down to the Running using Docker section for more information.
  - For Local Development: Skip down to the Local development section for more detailed installation instructions.
Step 1: Run the script to download census data or download from the Justice40 S3 URL
- Call the census-data-download command using the application manager application.py. NOTE: This may take several minutes to execute.
  - With Docker: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application census-data-download
  - With Poetry: poetry run download_census (Install GDAL as described below)
- If you have a high speed internet connection and don't want to generate the census data or install GDAL locally, you can download a zip version of the Census file here. Then unzip and move the contents inside the data/data-pipeline/data_pipeline/data/census/ folder.
Step 2: Run the ETL script for each data source
- Call the etl-run command using the application manager application.py. NOTE: This may take several minutes to execute.
  - With Docker: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application etl-run
  - With Poetry: poetry run etl
- This command will execute the corresponding ETL script for each data source in data_pipeline/etl/sources/. For example, data_pipeline/etl/sources/ejscreen/etl.py is the ETL script for EJSCREEN data.
- Each ETL script will extract the data from its original source, then format the data into .csv files that get stored in the relevant folder in data_pipeline/data/dataset/. For example, HUD Housing data is stored in data_pipeline/data/dataset/hud_housing/usa.csv.
NOTE: You have the option to pass the name of a specific data source to the etl-run command using the -d flag, which will limit the execution of the ETL process to that specific data source.
For example: poetry run etl -- -d ejscreen would only run the ETL process for EJSCREEN data.
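To script this step, the two invocation styles can be assembled as in the sketch below. The helper etl_command is hypothetical. The Poetry form mirrors the example above; the Docker form with -d is an assumption based on the note that etl-run accepts the flag.

```python
import shlex

# Prefix shared by every Docker invocation shown in this README.
DOCKER_PREFIX = [
    "docker", "exec", "j40_data_pipeline_1",
    "python3", "-m", "data_pipeline.application",
]


def etl_command(dataset=None, use_docker=False):
    """Build the argument list for an ETL run, optionally limited to one data source."""
    if use_docker:
        command = DOCKER_PREFIX + ["etl-run"]
        if dataset:
            # Assumption: the flag is passed straight to etl-run inside the container.
            command += ["-d", dataset]
    else:
        command = ["poetry", "run", "etl"]
        if dataset:
            # "--" separates Poetry's own options from the script's options.
            command += ["--", "-d", dataset]
    return command


# Render the Poetry form as a shell line; pass the list to subprocess.run to execute it.
print(shlex.join(etl_command("ejscreen")))
```

Returning a list rather than a string keeps the command safe to hand to subprocess.run without shell quoting issues.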
Step 3: Calculate the Justice40 score experiments
- Call the score-run command using the application manager application.py. NOTE: This may take several minutes to execute.
  - With Docker: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application score-run
  - With Poetry: poetry run score
- The score-run command will execute the etl/score/etl.py script, which loads the data from each of the source files added to the data/dataset/ directory by the ETL scripts in Step 2.
- These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
  - Their percentile rank is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
  - They are normalized using min-max normalization, which adjusts the scale of the data so that the Census Block Group with the highest value for that column is set to 1, the Census Block Group with the lowest value is set to 0, and all of the other values are adjusted to fit within that range based on how close they were to the highest or lowest value.
- The standardized columns are then used to calculate each of the Justice40 score experiments described in greater detail below, and the results are exported to a .csv file in data/score/csv.
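The two standardizations described above can be illustrated with a minimal pure-Python sketch. It is not the pipeline's own code (which operates on a dataframe), the function names are hypothetical, and the handling of single-value or constant columns is an assumption made only to keep the example well defined.

```python
from bisect import bisect_left


def percentile_rank(values):
    """For each value, the fraction of the other entries that are strictly lower."""
    others = len(values) - 1
    if others <= 0:
        return [0.0 for _ in values]
    ordered = sorted(values)
    # bisect_left gives the number of entries strictly below the value.
    return [bisect_left(ordered, value) / others for value in values]


def min_max_normalize(values):
    """Rescale so the lowest value maps to 0 and the highest maps to 1."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # Assumption: a constant column carries no ranking information, so map it to 0.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


column = [30, 10, 20]
print(percentile_rank(column))    # [1.0, 0.0, 0.5]
print(min_max_normalize(column))  # [1.0, 0.0, 0.5]
```

The two results coincide here only because the sample values are evenly spaced. On skewed data, percentile rank depends solely on ordering, while min-max normalization preserves the relative distances between values.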
Step 4: Compare the Justice40 score experiments to other indices
We are building a comparison tool to enable easy (or at least straightforward) comparison of the Justice40 score with other existing indices. The goal is that, as we experiment and iterate on a scoring methodology, we can understand how our score overlaps with or differs from other indices that communities, nonprofits, and governments use to inform decision making.
Right now, our comparison tool exists simply as a python notebook in data/data-pipeline/data_pipeline/ipython/scoring_comparison.ipynb.
To run this comparison tool:
- Make sure you've gone through the above steps to run the data ETL and score generation.
- From the package directory (data/data-pipeline/data_pipeline/), navigate to the ipython directory: cd ipython.
- Ensure you have pandoc installed on your computer. If you're on a Mac, run brew install pandoc; for other OSes, see pandoc's installation guide.
- Start the notebooks: jupyter notebook
- In your browser, navigate to one of the URLs returned by the above command.
- Select scoring_comparison.ipynb from the options in your browser.
- Run through the steps in the notebook. You can step through them one at a time by clicking the "Run" button for each cell, or open the "Cell" menu and click "Run all" to run them all at once.
- Reports and spreadsheets generated by the comparison tool will be available in data/data-pipeline/data_pipeline/data/comparison_outputs.
NOTE: This may take several minutes or over an hour to fully execute and generate the reports.
Data Sources
- EJSCREEN: TODO Add description of data source
- Census: TODO Add description of data source
- American Communities Survey: TODO Add description of data source
- Housing and Transportation: TODO Add description of data source
- HUD Housing: TODO Add description of data source
- HUD Recap: TODO Add description of data source
- CalEnviroScreen: TODO Add description of data source
Running using Docker
We use Docker to install the necessary libraries in a container that can be run in any operating system.
Important: To run the data Docker containers, you need to increase the memory resource of your container to at least 8096 MB.
To build the docker container the first time, make sure you're in the root directory of the repository and run docker-compose build --no-cache.
Once completed, run docker-compose up and then open a new tab or terminal window, and then run any command for the application using this format:
docker exec j40_data_pipeline_1 python3 -m data_pipeline.application [command]
Here's a list of commands:
- Get help: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application --help
- Generate census data: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application census-data-download
- Run all ETL and Generate score: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application score-full-run
- Clean up the data directories: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application data-cleanup
- Run all ETL processes: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application etl-run
- Generate Score: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application score-run
- Generate Score with Geojson and high and low versions: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application geo-score
- Generate Map Tiles: docker exec j40_data_pipeline_1 python3 -m data_pipeline.application generate-map-tiles
Local development
You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the GDAL library installed locally. Also to generate tiles for a local map, you will need Mapbox tippecanoe. Please refer to the repos for specific instructions for your OS.
VSCode
If you are using VSCode, you can make use of the .vscode folder checked in under data/data-pipeline/.vscode. To do this, open this directory with code data/data-pipeline.
Here's what's included:
- launch.json - launch commands that allow for debugging the various commands in application.py. Note that because we are using the otherwise excellent Click CLI, and Click in turn uses console_scripts to parse and execute command line options, it is necessary to run the equivalent of python -m data_pipeline.application [command] within launch.json to be able to set and hit breakpoints (this is what is currently implemented). Otherwise, you may find that the script times out after 5 seconds. More about this here.
- settings.json - these ensure that you're using the default linter (pylint), formatter (flake8), and test library (pytest) that the team is using.
- tasks.json - these enable you to use Terminal->Run Task to run our preferred formatters and linters within your project.
Users are instructed to only add settings to this file that should be shared across the team, and not to add settings here that only apply to local development environments (particularly full absolute paths which can differ between setups). If you are looking to add something to this file, check in with the rest of the team to ensure the proposed settings should be shared.
MacOS
To install the above-named executables:
- gdal: brew install gdal
- Tippecanoe: brew install tippecanoe
Windows Users
If you want to run tile generation, please install Tippecanoe following these instructions. You also need some prerequisites for Geopandas as specified in the Poetry requirements. Please follow these instructions to install the Geopandas dependency locally.
Setting up Poetry
- Start a terminal
- Change to this directory (/data/data-pipeline/)
- Make sure you have Python 3.9 installed: python -V or python3 -V
- We use Poetry for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with poetry install
Downloading Census Block Groups GeoJSON and Generating CBG CSVs
- Make sure you have Docker running on your machine
- Start a terminal
- Change to the package directory (i.e. cd data/data-pipeline/data_pipeline/)
- If you want to clear out all data and tiles from all directories, you can run: poetry run cleanup_data.
- Then run poetry run download_census. Note: Census files are not kept in the repository and the download directories are ignored by Git.
Generating Map Tiles
- Make sure you have Docker running on your machine
- Start a terminal
- Change to the package directory (i.e. cd data/data-pipeline/data_pipeline)
- Then run poetry run generate_tiles
- If you have S3 keys, you can sync to the dev repo by doing aws s3 sync ./data_pipeline/data/score/tiles/ s3://justice40-data/data-pipeline/data/score/tiles --acl public-read --delete
Serve the map locally
- Start a terminal
- Change to the package directory (i.e. cd data/data-pipeline/data_pipeline)
- For USA high zoom: docker run --rm -it -v ${PWD}/data/score/tiles/high:/data -p 8080:80 maptiler/tileserver-gl
Running Jupyter notebooks
- Start a terminal
- Change to the package directory (i.e. cd data/data-pipeline/data_pipeline)
- Run poetry run jupyter notebook. Your browser should open with a Jupyter Notebook tab
Activating variable-enabled Markdown for Jupyter notebooks
- Change to the package directory (i.e. cd data/data-pipeline/data_pipeline)
- Activate a Poetry Shell (see above)
- Run jupyter contrib nbextension install --user
- Run jupyter nbextension enable python-markdown/main
- Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See button near top right of Notebook screen.)
For more information, see nbextensions docs and see python-markdown docs.
Miscellaneous
- To export packages from Poetry to requirements.txt, run poetry export --without-hashes > requirements.txt