Mirror of https://github.com/DOI-DO/j40-cejst-2.git (synced 2025-02-23 01:54:18 -08:00)
updating readme
This commit is contained in: parent 6ec4dafafd, commit aeb71137e4
2 changed files with 31 additions and 33 deletions
README.md

@@ -7,9 +7,10 @@

- [Justice 40 Score application](#justice-40-score-application)
- [About this application](#about-this-application)
- [Score generation and comparison workflow](#score-generation-and-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
- [(Optional) Step 0: Run the script to download census data](#optional-step-0-run-the-script-to-download-census-data)
- [Step 1: Run the ETL script for each data source](#step-1-run-the-etl-script-for-each-data-source)
- [Step 2: Calculate the Justice40 score experiments](#step-2-calculate-the-justice40-score-experiments)
- [Step 3: Compare the Justice40 score experiments to other indices](#step-3-compare-the-justice40-score-experiments-to-other-indices)

@@ -19,7 +20,7 @@

- [Windows Users](#windows-users)
- [Setting up Poetry](#setting-up-poetry)
- [Downloading Census Block Groups GeoJSON and Generating CBG CSVs](#downloading-census-block-groups-geojson-and-generating-cbg-csvs)
- [Generating Map Tiles](#generating-map-tiles)
- [Serve the map locally](#serve-the-map-locally)
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)

@@ -35,12 +36,7 @@ This application is used to compare experimental versions of the Justice40 score

_**NOTE:** These scores **do not** represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time._

### Score generation and comparison workflow

The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.

@@ -50,31 +46,31 @@ TODO add mermaid diagram

#### Step 0: Set up your environment

1. After cloning the project locally, change to this directory: `cd data/data-pipeline`
1. Choose whether you'd like to run this application using Docker or install the dependencies locally so you can contribute to the project.
   - **With Docker:** Follow these [installation instructions](https://docs.docker.com/get-docker/) and skip down to the [Running with Docker section](#running-with-docker) for more information
   - **For Local Development:** Skip down to the [Local Development section](#local-development) for more detailed installation instructions

#### (Optional) Step 0: Run the script to download census data

1. See instructions below for downloading census data, which is a prerequisite for running the score code

#### Step 1: Run the ETL script for each data source

1. Call the `etl-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it j40_data_pipeline /bin/sh -c "python3 application.py etl-run"`
   - With Poetry: `poetry run etl`
2. This command will execute the corresponding ETL script for each data source in `data_pipeline/etl/sources/`. For example, `data_pipeline/etl/sources/ejscreen/etl.py` is the ETL script for EJSCREEN data.
3. Each ETL script will extract the data from its original source, then format the data into `.csv` files that get stored in the relevant folder in `data_pipeline/data/dataset/`. For example, HUD Housing data is stored in `data_pipeline/data/dataset/hud_housing/usa.csv`.

_**NOTE:** You have the option to pass the name of a specific data source to the `etl-run` command using the `-d` flag, which will limit the execution of the ETL process to that specific data source._
_For example: `poetry run etl -- -d ejscreen` would only run the ETL process for EJSCREEN data._
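
To make that structure concrete, here is a minimal, hypothetical sketch of the shape a per-source ETL script under `data_pipeline/etl/sources/` could take. The source URL, column names, and function layout are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of a per-source ETL script, e.g.
# data_pipeline/etl/sources/example_source/etl.py.
# The URL, paths, and column names below are invented for illustration.
from pathlib import Path

import pandas as pd

SOURCE_URL = "https://example.gov/data/example_source.csv"  # placeholder
OUTPUT_DIR = Path("data_pipeline/data/dataset/example_source")


def extract() -> pd.DataFrame:
    """Download the raw data from its original source."""
    return pd.read_csv(SOURCE_URL, dtype={"GEOID10": str})


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the fields the score needs."""
    return df[["GEOID10", "some_indicator"]]


def load(df: pd.DataFrame) -> None:
    """Write the cleaned data to the dataset folder as a CSV."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    df.to_csv(OUTPUT_DIR / "usa.csv", index=False)


if __name__ == "__main__":
    load(transform(extract()))
```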

#### Step 2: Calculate the Justice40 score experiments

1. Call the `score-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it j40_data_pipeline /bin/sh -c "python3 application.py score-run"`
   - With Poetry: `poetry run score`
1. The `score-run` command will execute the `etl/score/etl.py` script, which loads the data from each of the source files added to the `data/dataset/` directory by the ETL scripts in Step 1.
1. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
   - Their [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
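
To illustrate the percentile-rank step, pandas can compute it directly with `rank(pct=True)`; the dataframe and column names here are invented for the example.

```python
# Illustrative percentile-rank calculation with pandas; the GEOIDs and
# column values are made up for demonstration purposes.
import pandas as pd

df = pd.DataFrame(
    {
        "GEOID10": ["010010201001", "010010201002", "010010202001"],
        "lead_paint_pct": [0.12, 0.45, 0.30],
    }
)

# rank(pct=True) returns each row's rank divided by the number of rows,
# i.e. the fraction of Census Block Groups with a value at or below it.
df["lead_paint_pct_percentile"] = df["lead_paint_pct"].rank(pct=True)
print(df)
```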

@@ -85,15 +81,16 @@ _For example: `poetry run python application.py etl-run -d ejscreen` would only

We are building a comparison tool to enable easy (or at least straightforward) comparison of the Justice40 score with other existing indices. The goal is that, as we experiment and iterate on a scoring methodology, we can understand how our score overlaps with or differs from other indices that communities, nonprofits, and governments use to inform decision making.

Right now, our comparison tool exists simply as a Python notebook in `data/data-pipeline/data_pipeline/ipython/scoring_comparison.ipynb`.
To run this comparison tool:

1. Make sure you've gone through the above steps to run the data ETL and score generation.
1. From the package directory (`data/data-pipeline/data_pipeline/`), navigate to the `ipython` directory: `cd ipython`.
1. Ensure you have `pandoc` installed on your computer. If you're on a Mac, run `brew install pandoc`; for other OSes, see pandoc's [installation guide](https://pandoc.org/installing.html).
1. Install the extra dependencies:

   ```
   pip install pypandoc
   pip install requests
   pip install us
   ```

@@ -101,13 +98,14 @@ To run this comparison tool:

   ```
   pip install dynaconf
   pip install xlsxwriter
   ```

1. Start the notebooks: `jupyter notebook`
1. In your browser, navigate to one of the URLs returned by the above command.
1. Select `scoring_comparison.ipynb` from the options in your browser.
1. Run through the steps in the notebook. You can step through them one at a time by clicking the "Run" button for each cell, or open the "Cell" menu and click "Run all" to run them all at once.
1. Reports and spreadsheets generated by the comparison tool will be available in `data/data-pipeline/data_pipeline/data/comparison_outputs`.

_NOTE:_ This may take several minutes or over an hour to fully execute and generate the reports.
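
If you'd rather execute the whole notebook non-interactively, `nbformat` and `nbconvert` (both installed alongside Jupyter) can run it from a script. This is an optional alternative to the browser steps above, not part of the documented workflow.

```python
# Optional, non-interactive alternative: execute the comparison notebook
# from a script using nbformat + nbconvert's ExecutePreprocessor.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("scoring_comparison.ipynb", as_version=4)
# Run all cells in the current directory; this can take a long time.
ExecutePreprocessor(timeout=None).preprocess(nb, {"metadata": {"path": "."}})
nbformat.write(nb, "scoring_comparison_executed.ipynb")
```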

### Data Sources

@@ -151,7 +149,7 @@ You can run the Python code locally without Docker to develop, using Poetry. How

### Setting up Poetry

- Start a terminal
- Change to this directory (`/data/data-pipeline/`)
- Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with `poetry install`

@@ -160,33 +158,33 @@

- Make sure you have Docker running on your machine
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline/`)
- If you want to clear out all data and tiles from all directories, you can run: `poetry run cleanup_data`.
- Then run `poetry run download_census`

Note: Census files are not kept in the repository and the download directories are ignored by Git

### Generating Map Tiles

- Make sure you have Docker running on your machine
- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run generate_tiles`
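
For context, map tiles like these are commonly built by converting GeoJSON into an `.mbtiles` file with a tool such as [tippecanoe](https://github.com/mapbox/tippecanoe). The sketch below shows what that conversion step could look like; the paths, zoom level, and tool choice are assumptions for illustration, not necessarily what `generate_tiles` actually runs.

```python
# Hypothetical sketch: build an .mbtiles file from score GeoJSON by
# shelling out to tippecanoe. Paths and zoom levels are illustrative.
import subprocess

subprocess.run(
    [
        "tippecanoe",
        "--output", "data/score/tiles/high/usa_high.mbtiles",
        "--maximum-zoom", "11",
        "data/score/geojson/usa.json",
    ],
    check=True,
)
```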

### Serve the map locally

- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- For USA high zoom: `docker run --rm -it -v ${PWD}/data/score/tiles/high:/data -p 8080:80 maptiler/tileserver-gl`

### Running Jupyter notebooks

- Start a terminal
- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab

### Activating variable-enabled Markdown for Jupyter notebooks

- Change to the package directory (i.e. `cd data/data-pipeline/data_pipeline`)
- Activate a Poetry Shell (see above)
- Run `jupyter contrib nbextension install --user`
- Run `jupyter nbextension enable python-markdown/main`
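
Once enabled, the python-markdown extension lets Markdown cells interpolate Python variables with double braces. A small illustration (the variable name is invented):

```python
# In a notebook code cell, define a variable:
threshold_percentile = 0.9

# A Markdown cell in the same notebook can then reference it inline
# using the python-markdown extension's {{...}} syntax, e.g.:
#
#   Census Block Groups above the {{threshold_percentile}} percentile
#   are highlighted in the report.
#
# When the Markdown cell is rendered, the placeholder is replaced with
# the variable's current value (0.9).
```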

pyproject.toml

@@ -2,9 +2,6 @@

authors = ["Your Name <you@example.com>"]
description = "ETL and Generation of Justice 40 Score"
name = "data-pipeline"
version = "0.1.0"

[tool.poetry.dependencies]

@@ -103,5 +100,8 @@ authorized_licenses = [

]

[tool.poetry.scripts]
cleanup_data = 'data_pipeline.application:data_cleanup'
download_census = 'data_pipeline.application:census_data_download'
etl = 'data_pipeline.application:etl_run'
generate_tiles = 'data_pipeline.application:generate_map_tiles'
score = 'data_pipeline.application:score_run'
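
Each `[tool.poetry.scripts]` entry maps a console command to a callable, which is why `poetry run etl` now replaces `poetry run python application.py etl-run`. A minimal sketch of how such entry points could look in `data_pipeline/application.py` (the bodies are illustrative stubs, not the project's actual implementation):

```python
# Hypothetical stubs for the callables referenced by [tool.poetry.scripts].
# Poetry generates console commands that import and call these functions.

def etl_run() -> None:
    """Run every ETL script under data_pipeline/etl/sources/."""
    print("Running all ETL scripts...")


def score_run() -> None:
    """Merge the datasets and calculate the Justice40 score experiments."""
    print("Calculating score experiments...")
```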