mirror of
https://github.com/DOI-DO/j40-cejst-2.git
synced 2025-02-23 01:54:18 -08:00
Modularization + Poetry + Docker (#213)
* reorg
* added configuration management; initial click cmds
* reset dirs completed
* major modularization effort
* prepping mbtiles
* first round of PR review updates
* round 2 of feedback review
* checkpoint
* habemus dockerfile 🎉
* updated docker-compose with long-running container
* census generation works
* logging working
* updated README
* updated README
* last small update to README
* added instructions for log visualization
* census etl update for reusable fips module
* ejscreen etl updated
* further modularization
* score modularization
* tmp cleanup
This commit is contained in:
parent
6f4087d247
commit
67c73dde2a
29 changed files with 2383 additions and 433 deletions
.gitignore (vendored, 7 changes)

@@ -1,5 +1,6 @@
 *.env
 .idea
+.vscode

 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -128,7 +129,11 @@ dmypy.json
 # Cython debug symbols
 cython_debug/

-# temporary census data
+# Ignore dynaconf secret files
+score/.secrets.*
+
+# ignore data
+score/data
 score/data/census
 score/data/tiles
 score/data/tmp
docker-compose.yml (new file, 15 lines)

@@ -0,0 +1,15 @@
version: "3.4"
services:
  score:
    image: j40_score
    container_name: j40_score_1
    build: score
    ports:
      - 8888:8888
    volumes:
      - ./score:/score
    stdin_open: true
    tty: true
    environment:
      ENV_FOR_DYNACONF: development
      PYTHONUNBUFFERED: 1
score/.vscode/settings.json (vendored, deleted, 4 lines)

@@ -1,4 +0,0 @@
{
    "python.pythonPath": "venv\\Scripts\\python.exe",
    "python.dataScience.sendSelectionToInteractiveWindow": false
}
score/Dockerfile (new file, 33 lines)

@@ -0,0 +1,33 @@
FROM ubuntu:20.04

# Install packages
RUN apt-get update && apt-get install -y \
    build-essential \
    make \
    gcc \
    git \
    unzip \
    wget \
    python3-dev \
    python3-pip

# tippecanoe
ENV TZ=America/Los_Angeles
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get install -y software-properties-common libsqlite3-dev zlib1g-dev
RUN apt-add-repository -y ppa:git-core/ppa
RUN mkdir -p /tmp/tippecanoe-src && git clone https://github.com/mapbox/tippecanoe.git /tmp/tippecanoe-src
WORKDIR /tmp/tippecanoe-src
RUN /bin/sh -c make && make install

## gdal
RUN add-apt-repository ppa:ubuntugis/ppa
RUN apt-get -y install gdal-bin

# Prepare python packages
WORKDIR /score
RUN pip3 install --upgrade pip setuptools wheel
COPY . .

COPY requirements.txt .
RUN pip3 install -r requirements.txt
@@ -1,26 +1,63 @@
-# Justice 40 Score generator
+# Justice 40 Score application

-## Setup
+## Running using Docker
+
+We use Docker to install the necessary libraries in a container that can be run in any operating system.
+
+To build the docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build`
+
+Then run commands, opening a new terminal window or tab:
+
+- Get help: `docker run --rm -it j40_score /bin/sh -c "python3 application.py --help"`
+- Clean up the data directories: `docker run --rm -it j40_score /bin/sh -c "python3 application.py data-cleanup"`
+- Generate census data: `docker run --rm -it j40_score /bin/sh -c "python3 application.py census-data-download"`
+
+## Log visualization
+
+If you want to visualize logs, the following temporary workaround can be used:
+
+- Run `docker-compose up` on the root of the repo
+- Open a new tab on your terminal
+- Then run any command for the application using this format: `docker exec j40_score_1 python3 application.py [command]`
+
+## Local development
+
+You can run the Python code locally to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. Also to generate tiles for a local map, you will need [Mapbox tippecanoe](https://github.com/mapbox/tippecanoe)
+
 - Start a terminal
 - Make sure you have Python 3.9 installed: `python -V` or `python3 -V`
-- Create a `virtualenv` in this folder: `python -m venv venv`
-- Activate the virtualenv
-  - Windows: `./venv/Scripts/activate`
-  - Mac/Linux: `source venv/bin/activate`
-- Install packages: `pip install -r requirements.txt`
-  - If you are a Windows user, you might need to install Build Tools for Visual Studio. [Instructions here](https://stackoverflow.com/a/54136652)
+- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
+- Install Poetry requirements with `poetry install`

-## Running the Jupyter notebook
+### Downloading Census Block Groups GeoJSON and Generating CBG CSVs
+
+- Make sure you have Docker running in your machine
+- Start a terminal
+- Change to this directory (i.e. `cd score`)
+- If you want to clear out all data and tiles from all directories, you can run: `poetry run python application.py data-cleanup`.
+- Then run `poetry run python application.py census-data-download`
+
+Note: Census files are not kept in the repository and the download directories are ignored by Git
+
+### Generating mbtiles
+
+- TBD
+
+### Serve the map locally
+
 - Start a terminal
 - Change to this directory (i.e. `cd score`)
-- Activate your virtualenv (see above)
-- Type `jupyter notebook`. Your browser should open with a Jupyter Notebook tab
+- Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 maptiler/tileserver-gl`

-## Activating variable-enabled Markdown for Jupyter notebooks
+### Running Jupyter notebooks
+
+- Start a terminal
+- Change to this directory (i.e. `cd score`)
+- Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab
+
+### Activating variable-enabled Markdown for Jupyter notebooks
+
 - Change to this directory (i.e. `cd score`)
+- Activate a Poetry Shell (see above)
 - Run `jupyter contrib nbextension install --user`
 - Run `jupyter nbextension enable python-markdown/main`
 - Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See button near
@@ -29,21 +66,6 @@
 For more information, see [nbextensions docs](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) and
 see [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/python-markdown).

-## Downloading Census Block Groups GeoJSON and Generating CBG CSVs
+## Miscellaneous

-- Make sure you have Docker running in your machine
-- Start a terminal
-- Change to this directory (i.e. `cd score`)
-- Activate your virtualenv (see above)
-- Run `python scripts/download_cbg.py`
-Note: Census files are not kept in the repository and the download directories are ignored by Git
-
-## Generating mbtiles
-
-- Change to this directory (i.e. `cd score`)
-- Activate your virtualenv (see above)
-- Run the following script: `python .\scripts\generate_mbtiles.py`
-
-## Serve the map locally
-
-- Run: `docker run --rm -it -v ${PWD}/data/tiles:/data -p 8080:80 klokantech/tileserver-gl`
+- To export packages from Poetry to `requirements.txt` run `poetry export --without-hashes > requirements.txt`
score/application.py (new file, 60 lines)

@@ -0,0 +1,60 @@
from config import settings
import click
from pathlib import Path
import sys

from etl.sources.census.etl_utils import reset_data_directories as census_reset
from utils import remove_files_from_dir, remove_all_from_dir, get_module_logger
from etl.sources.census.etl import download_census_csvs


settings.APP_ROOT = Path.cwd()
logger = get_module_logger(__name__)


@click.group()
def cli():
    pass


@cli.command(
    help="Clean up all data folders",
)
def data_cleanup():

    data_path = settings.APP_ROOT / "data"

    # census directories
    logger.info(f"Initializing all census data")
    census_reset(data_path)

    # dataset directory
    logger.info(f"Initializing all dataset directories")
    remove_all_from_dir(data_path / "dataset")

    # score directory
    logger.info(f"Initializing all score data")
    remove_files_from_dir(data_path / "score" / "csv", ".csv")
    remove_files_from_dir(data_path / "score" / "geojson", ".json")

    # cleanup tmp dir
    logger.info(f"Initializing all temp directories")
    remove_all_from_dir(data_path / "tmp")

    logger.info("Cleaned up all data files")


@cli.command(
    help="Census data download",
)
def census_data_download():
    logger.info("Downloading census data")
    data_path = settings.APP_ROOT / "data"
    download_census_csvs(data_path)

    logger.info("Completed downloading census data")
    exit()


if __name__ == "__main__":
    cli()
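Editor's note: a minimal sketch (not part of this commit) of one way to exercise the click commands above without Docker, for example from a quick test. It assumes it is run from the score/ directory so that application.py is importable; click's CliRunner is standard click test API.

# Sketch only: invoke the CLI defined in application.py programmatically.
from click.testing import CliRunner

from application import cli

runner = CliRunner()
# "data-cleanup" is the click name of the data_cleanup() command above.
result = runner.invoke(cli, ["data-cleanup"])
print(result.exit_code)
print(result.output)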
score/config.py (new file, 11 lines)

@@ -0,0 +1,11 @@
from dynaconf import Dynaconf

settings = Dynaconf(
    envvar_prefix="DYNACONF",
    settings_files=["settings.toml", ".secrets.toml"],
    environments=True,
)

# To set an environment use:
# Linux/OSX: export ENV_FOR_DYNACONF=staging
# Windows: set ENV_FOR_DYNACONF=staging
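Editor's note: a minimal sketch (not part of this commit) of how the Dynaconf settings object above is consumed elsewhere in this change set.

# Sketch only: reading values loaded from settings.toml for the active
# environment (ENV_FOR_DYNACONF=development is set in docker-compose.yml).
from pathlib import Path

from config import settings

print(settings.AWS_JUSTICE40_DATA_URL)  # defined under [default] in settings.toml

# application.py attaches the repository root at runtime:
settings.APP_ROOT = Path.cwd()
print(settings.APP_ROOT / "data")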
@@ -1,52 +0,0 @@
fips,state_name
01 ,Alabama
02 ,Alaska
04 ,Arizona
05 ,Arkansas
06 ,California
08 ,Colorado
09 ,Connecticut
10 ,Delaware
11 ,District of Columbia
12 ,Florida
13 ,Georgia
15 ,Hawaii
16 ,Idaho
17 ,Illinois
18 ,Indiana
19 ,Iowa
20 ,Kansas
21 ,Kentucky
22 ,Louisiana
23 ,Maine
24 ,Maryland
25 ,Massachusetts
26 ,Michigan
27 ,Minnesota
28 ,Mississippi
29 ,Missouri
30 ,Montana
31 ,Nebraska
32 ,Nevada
33 ,New Hampshire
34 ,New Jersey
35 ,New Mexico
36 ,New York
37 ,North Carolina
38 ,North Dakota
39 ,Ohio
40 ,Oklahoma
41 ,Oregon
42 ,Pennsylvania
44 ,Rhode Island
45 ,South Carolina
46 ,South Dakota
47 ,Tennessee
48 ,Texas
49 ,Utah
50 ,Vermont
51 ,Virginia
53 ,Washington
54 ,West Virginia
55 ,Wisconsin
56 ,Wyoming
score/etl/sources/census/__init__.py (new file, 0 lines)
score/etl/sources/census/etl.py (new file, 93 lines)

@@ -0,0 +1,93 @@
import csv
import os
import json
from pathlib import Path

from .etl_utils import get_state_fips_codes
from utils import unzip_file_from_url, get_module_logger

logger = get_module_logger(__name__)


def download_census_csvs(data_path: Path) -> None:
    # the fips_states_2010.csv is generated from data here
    # https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html
    state_fips_codes = get_state_fips_codes(data_path)
    for fips in state_fips_codes:
        # check if file exists
        shp_file_path = data_path / "census" / "shp" / fips / f"tl_2010_{fips}_bg10.shp"

        if not os.path.isfile(shp_file_path):
            logger.info(f"Downloading {fips}")

            # 2020 tiger data is here: https://www2.census.gov/geo/tiger/TIGER2020/BG/
            # But using 2010 for now
            cbg_state_url = f"https://www2.census.gov/geo/tiger/TIGER2010/BG/2010/tl_2010_{fips}_bg10.zip"
            unzip_file_from_url(
                cbg_state_url, data_path / "tmp", data_path / "census" / "shp" / fips
            )

        geojson_dir_path = data_path / "census" / "geojson"

        cmd = (
            "ogr2ogr -f GeoJSON data/census/geojson/"
            + fips
            + ".json data/census/shp/"
            + fips
            + "/tl_2010_"
            + fips
            + "_bg10.shp"
        )
        os.system(cmd)

    # generate CBG CSV table for pandas
    ## load in memory
    cbg_national = []  # in-memory global list
    cbg_per_state: dict = {}  # in-memory dict per state
    for file in os.listdir(geojson_dir_path):
        if file.endswith(".json"):
            logger.info(f"Ingesting geoid10 for file {file}")
            with open(geojson_dir_path / file) as f:
                geojson = json.load(f)
                for feature in geojson["features"]:
                    geoid10 = feature["properties"]["GEOID10"]
                    cbg_national.append(str(geoid10))
                    geoid10_state_id = geoid10[:2]
                    if not cbg_per_state.get(geoid10_state_id):
                        cbg_per_state[geoid10_state_id] = []
                    cbg_per_state[geoid10_state_id].append(geoid10)

    csv_dir_path = data_path / "census" / "csv"
    ## write to individual state csv
    for state_id in cbg_per_state:
        geoid10_list = cbg_per_state[state_id]
        with open(
            csv_dir_path / f"{state_id}.csv", mode="w", newline=""
        ) as cbg_csv_file:
            cbg_csv_file_writer = csv.writer(
                cbg_csv_file,
                delimiter=",",
                quotechar='"',
                quoting=csv.QUOTE_MINIMAL,
            )

            for geoid10 in geoid10_list:
                cbg_csv_file_writer.writerow(
                    [
                        geoid10,
                    ]
                )

    ## write US csv
    with open(csv_dir_path / "us.csv", mode="w", newline="") as cbg_csv_file:
        cbg_csv_file_writer = csv.writer(
            cbg_csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
        )
        for geoid10 in cbg_national:
            cbg_csv_file_writer.writerow(
                [
                    geoid10,
                ]
            )

    logger.info("Census block groups downloading complete")
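Editor's note: a minimal sketch (not part of this commit) of reading back the CBG CSVs that download_census_csvs() writes above; each row holds a single GEOID10 value, whose first two characters are the state FIPS code.

# Sketch only: count the nationwide block groups written to data/census/csv/us.csv.
import csv
from pathlib import Path

data_path = Path.cwd() / "data"
csv_dir_path = data_path / "census" / "csv"

with open(csv_dir_path / "us.csv", newline="") as f:
    geoids = [row[0] for row in csv.reader(f) if row]

print(f"{len(geoids)} census block groups nationwide")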
score/etl/sources/census/etl_utils.py (new file, 47 lines)

@@ -0,0 +1,47 @@
from pathlib import Path
import csv
import os
from config import settings

from utils import remove_files_from_dir, remove_all_dirs_from_dir, unzip_file_from_url


def reset_data_directories(data_path: Path) -> None:
    census_data_path = data_path / "census"

    # csv
    csv_path = census_data_path / "csv"
    remove_files_from_dir(csv_path, ".csv")

    # geojson
    geojson_path = census_data_path / "geojson"
    remove_files_from_dir(geojson_path, ".json")

    # shp
    shp_path = census_data_path / "shp"
    remove_all_dirs_from_dir(shp_path)


def get_state_fips_codes(data_path: Path) -> list:
    fips_csv_path = data_path / "census" / "csv" / "fips_states_2010.csv"

    # check if file exists
    if not os.path.isfile(fips_csv_path):
        unzip_file_from_url(
            settings.AWS_JUSTICE40_DATA_URL + "/Census/fips_states_2010.zip",
            data_path / "tmp",
            data_path / "census" / "csv",
        )

    fips_state_list = []
    with open(fips_csv_path) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        line_count = 0

        for row in csv_reader:
            if line_count == 0:
                line_count += 1
            else:
                fips = row[0].strip()
                fips_state_list.append(fips)
    return fips_state_list
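Editor's note: a minimal sketch (not part of this commit) of the etl_utils helpers above, mirroring how the notebooks in this change set call them.

# Sketch only: iterate over state FIPS codes and reset the census data dirs.
from pathlib import Path

from etl.sources.census.etl_utils import get_state_fips_codes, reset_data_directories

DATA_PATH = Path.cwd() / "data"

# Downloads fips_states_2010.csv from AWS_JUSTICE40_DATA_URL if it is missing,
# then returns the two-digit state/territory FIPS codes as strings.
for fips in get_state_fips_codes(DATA_PATH):
    print(f"Would process state/territory with FIPS code {fips}")

# Clears data/census/csv, data/census/geojson and data/census/shp.
reset_data_directories(DATA_PATH)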
score/etl/sources/ejscreen/__init__.py (new file, 0 lines)
@@ -12,11 +12,17 @@
     "import csv\n",
     "from pathlib import Path\n",
     "import os\n",
+    "import sys\n",
+    "\n",
+    "module_path = os.path.abspath(os.path.join('..'))\n",
+    "if module_path not in sys.path:\n",
+    "    sys.path.append(module_path)\n",
+    "\n",
+    "from etl.sources.census.etl_utils import get_state_fips_codes\n",
     "\n",
     "ACS_YEAR = 2019\n",
     "\n",
     "DATA_PATH = Path.cwd().parent / \"data\"\n",
-    "FIPS_CSV_PATH = DATA_PATH / \"fips_states_2010.csv\"\n",
     "OUTPUT_PATH = DATA_PATH / \"dataset\" / f\"census_acs_{ACS_YEAR}\"\n",
     "\n",
     "GEOID_FIELD_NAME = \"GEOID10\"\n",
@@ -57,27 +63,19 @@
     "\n",
     "\n",
     "dfs = []\n",
-    "with open(FIPS_CSV_PATH) as csv_file:\n",
-    "    csv_reader = csv.reader(csv_file, delimiter=\",\")\n",
-    "    line_count = 0\n",
-    "\n",
-    "    for row in csv_reader:\n",
-    "        if line_count == 0:\n",
-    "            line_count += 1\n",
-    "        else:\n",
-    "            fips = row[0].strip()\n",
-    "            print(f\"Downloading data for state/territory with FIPS code {fips}\")\n",
-    "\n",
-    "            dfs.append(\n",
-    "                censusdata.download(\n",
-    "                    src=\"acs5\",\n",
-    "                    year=ACS_YEAR,\n",
-    "                    geo=censusdata.censusgeo(\n",
-    "                        [(\"state\", fips), (\"county\", \"*\"), (\"block group\", \"*\")]\n",
-    "                    ),\n",
-    "                    var=[\"B23025_005E\", \"B23025_003E\"],\n",
-    "                )\n",
-    "            )\n",
+    "for fips in get_state_fips_codes(DATA_PATH):\n",
+    "    print(f\"Downloading data for state/territory with FIPS code {fips}\")\n",
     "\n",
+    "    dfs.append(\n",
+    "        censusdata.download(\n",
+    "            src=\"acs5\",\n",
+    "            year=ACS_YEAR,\n",
+    "            geo=censusdata.censusgeo(\n",
+    "                [(\"state\", fips), (\"county\", \"*\"), (\"block group\", \"*\")]\n",
+    "            ),\n",
+    "            var=[\"B23025_005E\", \"B23025_003E\"],\n",
+    "        )\n",
+    "    )\n",
     "\n",
     "df = pd.concat(dfs)\n",
     "\n",
@@ -8,33 +8,25 @@
    "outputs": [],
    "source": [
     "from pathlib import Path\n",
-    "import requests\n",
-    "import zipfile\n",
     "import numpy as np\n",
     "import pandas as pd\n",
     "import csv\n",
+    "import sys\n",
+    "import os\n",
     "\n",
-    "data_path = Path.cwd().parent / \"data\"\n",
-    "fips_csv_path = data_path / \"fips_states_2010.csv\"\n",
-    "csv_path = data_path / \"dataset\" / \"ejscreen_2020\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "67a58c24",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "download = requests.get(\n",
-    "    \"https://gaftp.epa.gov/EJSCREEN/2020/EJSCREEN_2020_StatePctile.csv.zip\",\n",
-    "    verify=False,\n",
-    ")\n",
-    "file_contents = download.content\n",
-    "zip_file_path = data_path / \"tmp\"\n",
-    "zip_file = open(zip_file_path / \"downloaded.zip\", \"wb\")\n",
-    "zip_file.write(file_contents)\n",
-    "zip_file.close()"
+    "module_path = os.path.abspath(os.path.join('..'))\n",
+    "if module_path not in sys.path:\n",
+    "    sys.path.append(module_path)\n",
+    "\n",
+    "from etl.sources.census.etl_utils import get_state_fips_codes\n",
+    "from utils import unzip_file_from_url, remove_all_from_dir\n",
+    "\n",
+    "DATA_PATH = Path.cwd().parent / \"data\"\n",
+    "TMP_PATH = DATA_PATH / \"tmp\"\n",
+    "EJSCREEN_FTP_URL = \"https://gaftp.epa.gov/EJSCREEN/2020/EJSCREEN_2020_StatePctile.csv.zip\"\n",
+    "EJSCREEN_CSV = TMP_PATH / \"EJSCREEN_2020_StatePctile.csv\"\n",
+    "CSV_PATH = DATA_PATH / \"dataset\" / \"ejscreen_2020\"\n",
+    "print(DATA_PATH)"
    ]
   },
   {
@@ -44,9 +36,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "with zipfile.ZipFile(zip_file_path / \"downloaded.zip\", \"r\") as zip_ref:\n",
-    "    zip_ref.extractall(zip_file_path)\n",
-    "ejscreen_csv = data_path / \"tmp\" / \"EJSCREEN_2020_StatePctile.csv\""
+    "# download file from ejscreen ftp\n",
+    "unzip_file_from_url(EJSCREEN_FTP_URL, TMP_PATH, TMP_PATH)"
    ]
   },
   {
@@ -58,7 +49,7 @@
   },
    "outputs": [],
    "source": [
-    "df = pd.read_csv(ejscreen_csv, dtype={\"ID\": \"string\"}, low_memory=False)"
+    "df = pd.read_csv(EJSCREEN_CSV, dtype={\"ID\": \"string\"}, low_memory=False)"
    ]
   },
   {
@@ -69,8 +60,8 @@
    "outputs": [],
    "source": [
     "# write nationwide csv\n",
-    "csv_path.mkdir(parents=True, exist_ok=True)\n",
-    "df.to_csv(csv_path / f\"usa.csv\", index=False)"
+    "CSV_PATH.mkdir(parents=True, exist_ok=True)\n",
+    "df.to_csv(CSV_PATH / f\"usa.csv\", index=False)"
    ]
   },
   {
@@ -81,19 +72,11 @@
    "outputs": [],
    "source": [
     "# write per state csvs\n",
-    "with open(fips_csv_path) as csv_file:\n",
-    "    csv_reader = csv.reader(csv_file, delimiter=\",\")\n",
-    "    line_count = 0\n",
-    "\n",
-    "    for row in csv_reader:\n",
-    "        if line_count == 0:\n",
-    "            line_count += 1\n",
-    "        else:\n",
-    "            fips = row[0].strip()\n",
-    "            print(f\"Generating data{fips} csv\")\n",
-    "            df1 = df[df.ID.str[:2] == fips]\n",
-    "            # we need to name the file data01.csv for ogr2ogr csv merge to work\n",
-    "            df1.to_csv(csv_path / f\"data{fips}.csv\", index=False)"
+    "for fips in get_state_fips_codes(DATA_PATH):\n",
+    "    print(f\"Generating data{fips} csv\")\n",
+    "    df1 = df[df.ID.str[:2] == fips]\n",
+    "    # we need to name the file data01.csv for ogr2ogr csv merge to work\n",
+    "    df1.to_csv(CSV_PATH / f\"data{fips}.csv\", index=False)"
    ]
   },
   {
@@ -102,6 +85,17 @@
    "id": "81b977f8",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "# cleanup\n",
+    "remove_all_from_dir(TMP_PATH)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d4f74d7",
+   "metadata": {},
+   "outputs": [],
    "source": []
   }
  ],
@@ -10,15 +10,22 @@
     "import pandas as pd\n",
     "import censusdata\n",
     "import csv\n",
-    "import requests\n",
-    "import zipfile\n",
-    "\n",
     "from pathlib import Path\n",
+    "import os\n",
+    "import sys\n",
+    "\n",
+    "module_path = os.path.abspath(os.path.join('..'))\n",
+    "if module_path not in sys.path:\n",
+    "    sys.path.append(module_path)\n",
+    "\n",
+    "from etl.sources.census.etl_utils import get_state_fips_codes\n",
+    "from utils import unzip_file_from_url, remove_all_from_dir\n",
     "\n",
     "ACS_YEAR = 2019\n",
     "\n",
     "DATA_PATH = Path.cwd().parent / \"data\"\n",
-    "FIPS_CSV_PATH = DATA_PATH / \"fips_states_2010.csv\"\n",
+    "TMP_PATH = DATA_PATH / \"tmp\"\n",
+    "HOUSING_FTP_URL = \"https://htaindex.cnt.org/download/download.php?focus=blkgrp&geoid=\"\n",
     "OUTPUT_PATH = DATA_PATH / \"dataset\" / \"housing_and_transportation_index\"\n",
     "\n",
     "GEOID_FIELD_NAME = \"GEOID10\""
@@ -31,44 +38,18 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# https://htaindex.cnt.org/download/download.php?focus=blkgrp&geoid=01\n",
-    "\n",
     "# Download each state / territory individually\n",
     "dfs = []\n",
-    "with open(FIPS_CSV_PATH) as csv_file:\n",
-    "    csv_reader = csv.reader(csv_file, delimiter=\",\")\n",
-    "    line_count = 0\n",
-    "\n",
-    "    for row in csv_reader:\n",
-    "        if line_count == 0:\n",
-    "            line_count += 1\n",
-    "        else:\n",
-    "            fips = row[0].strip()\n",
-    "\n",
-    "            print(f\"Downloading data for state/territory with FIPS code {fips}\")\n",
-    "\n",
-    "            download = requests.get(\n",
-    "                f\"https://htaindex.cnt.org/download/download.php?focus=blkgrp&geoid={fips}\",\n",
-    "                verify=False,\n",
-    "            )\n",
-    "            file_contents = download.content\n",
-    "            zip_file_dir = DATA_PATH / \"tmp\" / \"housing_and_transportation_index\"\n",
-    "\n",
-    "            # Make the directory if it doesn't exist\n",
-    "            zip_file_dir.mkdir(parents=True, exist_ok=True)\n",
-    "            zip_file_path = zip_file_dir / f\"{fips}-downloaded.zip\"\n",
-    "            zip_file = open(zip_file_path, \"wb\")\n",
-    "            zip_file.write(file_contents)\n",
-    "            zip_file.close()\n",
-    "\n",
-    "            with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n",
-    "                zip_ref.extractall(zip_file_dir)\n",
-    "\n",
-    "            # New file name:\n",
-    "            tmp_csv_file_path = zip_file_dir / f\"htaindex_data_blkgrps_{fips}.csv\"\n",
-    "            tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)\n",
-    "\n",
-    "            dfs.append(tmp_df)\n",
-    "\n",
+    "zip_file_dir = TMP_PATH / \"housing_and_transportation_index\"\n",
+    "for fips in get_state_fips_codes(DATA_PATH):\n",
+    "    print(f\"Downloading housing data for state/territory with FIPS code {fips}\")\n",
+    "    unzip_file_from_url(f\"{HOUSING_FTP_URL}{fips}\", TMP_PATH, zip_file_dir)\n",
+    "    \n",
+    "    # New file name:\n",
+    "    tmp_csv_file_path = zip_file_dir / f\"htaindex_data_blkgrps_{fips}.csv\"\n",
+    "    tmp_df = pd.read_csv(filepath_or_buffer=tmp_csv_file_path)\n",
     "\n",
+    "    dfs.append(tmp_df)\n",
     "\n",
     "df = pd.concat(dfs)\n",
     "\n",
@@ -105,6 +86,17 @@
    "id": "ef5bb862",
    "metadata": {},
    "outputs": [],
+   "source": [
+    "# cleanup\n",
+    "remove_all_from_dir(TMP_PATH)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9269e497",
+   "metadata": {},
+   "outputs": [],
    "source": []
   }
  ],
@@ -17,6 +17,14 @@
     "from pathlib import Path\n",
     "import pandas as pd\n",
     "import csv\n",
+    "import os\n",
+    "import sys\n",
+    "\n",
+    "module_path = os.path.abspath(os.path.join('..'))\n",
+    "if module_path not in sys.path:\n",
+    "    sys.path.append(module_path)\n",
+    "\n",
+    "from etl.sources.census.etl_utils import get_state_fips_codes\n",
     "\n",
     "# Define some global parameters\n",
     "GEOID_FIELD_NAME = \"GEOID10\"\n",
@@ -37,9 +45,8 @@
     "\n",
     "PERCENTILE_FIELD_SUFFIX = \" (percentile)\"\n",
     "\n",
-    "data_path = Path.cwd().parent / \"data\"\n",
-    "fips_csv_path = data_path / \"fips_states_2010.csv\"\n",
-    "score_csv_path = data_path / \"score\" / \"csv\"\n",
+    "DATA_PATH = Path.cwd().parent / \"data\"\n",
+    "SCORE_CSV_PATH = DATA_PATH / \"score\" / \"csv\"\n",
     "\n",
     "# Tell pandas to display all columns\n",
     "pd.set_option(\"display.max_columns\", None)"
@@ -55,7 +62,7 @@
    "outputs": [],
    "source": [
     "# EJSCreen csv Load\n",
-    "ejscreen_csv = data_path / \"dataset\" / \"ejscreen_2020\" / \"usa.csv\"\n",
+    "ejscreen_csv = DATA_PATH / \"dataset\" / \"ejscreen_2020\" / \"usa.csv\"\n",
     "ejscreen_df = pd.read_csv(ejscreen_csv, dtype={\"ID\": \"string\"}, low_memory=False)\n",
     "ejscreen_df.rename(columns={\"ID\": GEOID_FIELD_NAME}, inplace=True)\n",
     "ejscreen_df.head()"
@@ -69,7 +76,7 @@
    "outputs": [],
    "source": [
     "# Load census data\n",
-    "census_csv = data_path / \"dataset\" / \"census_acs_2019\" / \"usa.csv\"\n",
+    "census_csv = DATA_PATH / \"dataset\" / \"census_acs_2019\" / \"usa.csv\"\n",
     "census_df = pd.read_csv(\n",
     "    census_csv, dtype={GEOID_FIELD_NAME: \"string\"}, low_memory=False\n",
     ")\n",
@@ -85,7 +92,7 @@
    "source": [
     "# Load housing and transportation data\n",
     "housing_and_transportation_index_csv = (\n",
-    "    data_path / \"dataset\" / \"housing_and_transportation_index\" / \"usa.csv\"\n",
+    "    DATA_PATH / \"dataset\" / \"housing_and_transportation_index\" / \"usa.csv\"\n",
     ")\n",
     "housing_and_transportation_df = pd.read_csv(\n",
     "    housing_and_transportation_index_csv,\n",
@@ -352,7 +359,7 @@
    "outputs": [],
    "source": [
     "# write nationwide csv\n",
-    "df.to_csv(score_csv_path / f\"usa.csv\", index=False)"
+    "df.to_csv(SCORE_CSV_PATH / f\"usa.csv\", index=False)"
    ]
   },
   {
@@ -363,19 +370,11 @@
    "outputs": [],
    "source": [
     "# write per state csvs\n",
-    "with open(fips_csv_path) as csv_file:\n",
-    "    csv_reader = csv.reader(csv_file, delimiter=\",\")\n",
-    "    line_count = 0\n",
-    "\n",
-    "    for row in csv_reader:\n",
-    "        if line_count == 0:\n",
-    "            line_count += 1\n",
-    "        else:\n",
-    "            states_fips = row[0].strip()\n",
-    "            print(f\"Generating data{states_fips} csv\")\n",
-    "            df1 = df[df[\"GEOID10\"].str[:2] == states_fips]\n",
-    "            # we need to name the file data01.csv for ogr2ogr csv merge to work\n",
-    "            df1.to_csv(score_csv_path / f\"data{states_fips}.csv\", index=False)"
+    "for states_fips in get_state_fips_codes(DATA_PATH):\n",
+    "    print(f\"Generating data{states_fips} csv\")\n",
+    "    df1 = df[df[\"GEOID10\"].str[:2] == states_fips]\n",
+    "    # we need to name the file data01.csv for ogr2ogr csv merge to work\n",
+    "    df1.to_csv(SCORE_CSV_PATH / f\"data{states_fips}.csv\", index=False)"
    ]
   },
   {
score/poetry.lock (generated, new file, 1766 lines)
File diff suppressed because it is too large
score/pyproject.toml (new file, 26 lines)

@@ -0,0 +1,26 @@
[tool.poetry]
name = "score"
version = "0.1.0"
description = "ETL and Generation of Justice 40 Score"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
ipython = "^7.24.1"
jupyter = "^1.0.0"
jupyter-contrib-nbextensions = "^0.5.1"
numpy = "^1.21.0"
pandas = "^1.2.5"
requests = "^2.25.1"
click = "^8.0.1"
dynaconf = "^3.1.4"
types-requests = "^2.25.0"
CensusData = "^1.13"

[tool.poetry.dev-dependencies]
mypy = "^0.910"
black = {version = "^21.6b0", allow-prereleases = true}

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
Binary file not shown.
@@ -1,114 +0,0 @@
import csv
import requests
import zipfile
import os
import json
from pathlib import Path

from utils import get_state_fips_codes

data_path = Path.cwd() / "data"

with requests.Session() as s:
    # the fips_states_2010.csv is generated from data here
    # https://www.census.gov/geographies/reference-files/time-series/geo/tallies.html
    state_fips_codes = get_state_fips_codes()
    for fips in state_fips_codes:
        # check if file exists
        shp_file_path = data_path.joinpath(
            "census", "shp", fips, f"tl_2010_{fips}_bg10.shp"
        )
        if not os.path.isfile(shp_file_path):
            print(f"downloading {row[1]}")

            # 2020 tiger data is here: https://www2.census.gov/geo/tiger/TIGER2020/BG/
            # But using 2010 for now
            cbg_state_url = f"https://www2.census.gov/geo/tiger/TIGER2010/BG/2010/tl_2010_{fips}_bg10.zip"
            download = s.get(cbg_state_url)
            file_contents = download.content
            zip_file_path = data_path / "census" / "downloaded.zip"
            zip_file = open(zip_file_path, "wb")
            zip_file.write(file_contents)
            zip_file.close()

            print(f"extracting {row[1]}")

            with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
                shp_dir_path = data_path / "census" / "shp" / fips
                zip_ref.extractall(shp_dir_path)

        geojson_dir_path = data_path.joinpath(
            "census",
            "geojson",
        )
        if not os.path.isfile(geojson_dir_path.joinpath(fips + ".json")):
            # ogr2ogr
            print(f"encoding GeoJSON for {row[1]}")

            # PWD is different for Windows
            if os.name == "nt":
                pwd = "%cd%"
            else:
                pwd = "${PWD}"
            cmd = (
                'docker run --rm -it -v "'
                + pwd
                + '"/:/home osgeo/gdal:alpine-ultrasmall-latest ogr2ogr -f GeoJSON /home/data/census/geojson/'
                + fips
                + ".json /home/data/census/shp/"
                + fips
                + "/tl_2010_"
                + fips
                + "_bg10.shp"
            )
            print(cmd)
            os.system(cmd)

# generate CBG CSV table for pandas
## load in memory
cbg_national_list = []  # in-memory global list
cbg_per_state_list = {}  # in-memory dict per state
for file in os.listdir(geojson_dir_path):
    if file.endswith(".json"):
        print(f"ingesting geoid10 for file {file}")
        with open(geojson_dir_path.joinpath(file)) as f:
            geojson = json.load(f)
            for feature in geojson["features"]:
                geoid10 = feature["properties"]["GEOID10"]
                cbg_national_list.append(str(geoid10))
                geoid10_state_id = geoid10[:2]
                if not cbg_per_state_list.get(geoid10_state_id):
                    cbg_per_state_list[geoid10_state_id] = []
                cbg_per_state_list[geoid10_state_id].append(geoid10)

csv_dir_path = data_path / "census" / "csv"
## write to individual state csv
for state_id in cbg_per_state_list:
    geoid10_list = cbg_per_state_list[state_id]
    with open(
        csv_dir_path.joinpath(f"{state_id}.csv"), mode="w", newline=""
    ) as cbg_csv_file:
        cbg_csv_file_writer = csv.writer(
            cbg_csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
        )

        for geoid10 in geoid10_list:
            cbg_csv_file_writer.writerow(
                [
                    geoid10,
                ]
            )

## write US csv
with open(csv_dir_path.joinpath("us.csv"), mode="w", newline="") as cbg_csv_file:
    cbg_csv_file_writer = csv.writer(
        cbg_csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
    )
    for geoid10 in cbg_national_list:
        cbg_csv_file_writer.writerow(
            [
                geoid10,
            ]
        )

print("Census block groups downloading complete")
@@ -1,91 +0,0 @@
import os
from pathlib import Path
import shutil

from utils import get_state_fips_codes

data_path = Path.cwd() / "data"

# remove existing mbtiles file
mb_tiles_path = data_path / "tiles" / "block2010.mbtiles"
if os.path.exists(mb_tiles_path):
    os.remove(mb_tiles_path)

# remove existing mvt directory
mvt_tiles_path = data_path / "tiles" / "mvt"
if os.path.exists(mvt_tiles_path):
    shutil.rmtree(mvt_tiles_path)

# Merge scores into json
# TODO: for this first pass, just merging ACS EJScren indicators
# Per https://github.com/usds/justice40-tool/issues/102

if os.name == "nt":
    pwd = "%cd%"
else:
    pwd = "${PWD}"

# remove existing score json files
score_geojson_dir = data_path / "score" / "geojson"
files_in_directory = os.listdir(score_geojson_dir)
filtered_files = [file for file in files_in_directory if file.endswith(".json")]
for file in filtered_files:
    path_to_file = os.path.join(score_geojson_dir, file)
    os.remove(path_to_file)

# join the state shape sqllite with the score csv
state_fips_codes = get_state_fips_codes()
for fips in state_fips_codes:
    cmd = (
        'docker run --rm -v "'
        + pwd
        + '"/:/home '
        + "osgeo/gdal:alpine-small-latest ogr2ogr -f GeoJSON "
        + f"-sql \"SELECT * FROM tl_2010_{fips}_bg10 LEFT JOIN '/home/data/score/csv/data{fips}.csv'.data{fips} ON tl_2010_{fips}_bg10.GEOID10 = data{fips}.ID\" "
        + f"/home/data/score/geojson/{fips}.json /home/data/census/shp/{fips}/tl_2010_{fips}_bg10.dbf"
    )
    print(cmd)
    os.system(cmd)

# get a list of all json files to plug in the docker commands below
# (workaround since *.json doesn't seem to work)
geojson_list = ""
geojson_path = data_path / "score" / "geojson"
for file in os.listdir(geojson_path):
    if file.endswith(".json"):
        geojson_list += f"/home/data/score/geojson/{file} "

if geojson_list == "":
    print("No GeoJson files found. Please run scripts/download_cbg.py first")


# generate mbtiles file
# PWD is different for Windows
if os.name == "nt":
    pwd = "%cd%"
else:
    pwd = "${PWD}"
cmd = (
    'docker run --rm -it -v "'
    + pwd
    + '"/:/home klokantech/tippecanoe tippecanoe --drop-densest-as-needed -zg -o /home/data/tiles/block2010.mbtiles --extend-zooms-if-still-dropping -l cbg2010 -s_srs EPSG:4269 -t_srs EPSG:4326 '
    + geojson_list
)
print(cmd)
os.system(cmd)

# if AWS creds are present, generate uncompressed tiles
# docker run --rm -it -v ${PWD}:/data tippecanoe tippecanoe --no-tile-compression -zg -e /data/tiles_custom -l blocks /data/tabblock2010_01_pophu_joined.json
# PWD is different for Windows
if os.name == "nt":
    pwd = "%cd%"
else:
    pwd = "${PWD}"
cmd = (
    'docker run --rm -it -v "'
    + pwd
    + '"/:/home klokantech/tippecanoe tippecanoe --drop-densest-as-needed --no-tile-compression -zg -e /home/data/tiles/mvt '
    + geojson_list
)
print(cmd)
os.system(cmd)
@@ -1,20 +0,0 @@
# common usage functions
import csv
from pathlib import Path


def get_state_fips_codes():
    data_path = Path.cwd() / "data"
    fips_csv_path = data_path / "fips_states_2010.csv"
    fips_state_list = []
    with open(fips_csv_path) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        line_count = 0

        for row in csv_reader:
            if line_count == 0:
                line_count += 1
            else:
                fips = row[0].strip()
                fips_state_list.append(fips)
    return fips_state_list
score/settings.toml (new file, 8 lines)

@@ -0,0 +1,8 @@
[default]
AWS_JUSTICE40_DATA_URL = "https://justice40-data.s3.amazonaws.com"

[development]

[staging]

[production]
score/tile/__init__.py (new file, 0 lines)
score/tile/generate.py (new file, 86 lines)

@@ -0,0 +1,86 @@
import os
import logging  # needed by logging.error below; not present in the original file
from pathlib import Path
import shutil

from etl.sources.census.etl_utils import get_state_fips_codes


def generate_tiles(data_path: Path) -> None:

    # remove existing mbtiles file
    mb_tiles_path = data_path / "tiles" / "block2010.mbtiles"
    if os.path.exists(mb_tiles_path):
        os.remove(mb_tiles_path)

    # remove existing mvt directory
    mvt_tiles_path = data_path / "tiles" / "mvt"
    if os.path.exists(mvt_tiles_path):
        shutil.rmtree(mvt_tiles_path)

    # Merge scores into json

    if os.name == "nt":
        pwd = "%cd%"
    else:
        pwd = "${PWD}"

    # remove existing score json files
    score_geojson_dir = data_path / "score" / "geojson"
    files_in_directory = os.listdir(score_geojson_dir)
    filtered_files = [file for file in files_in_directory if file.endswith(".json")]
    for file in filtered_files:
        path_to_file = os.path.join(score_geojson_dir, file)
        os.remove(path_to_file)

    # join the state shape sqllite with the score csv
    state_fips_codes = get_state_fips_codes()
    for fips in state_fips_codes:
        cmd = (
            'docker run --rm -v "'
            + pwd
            + '"/:/home '
            + "osgeo/gdal:alpine-small-latest ogr2ogr -f GeoJSON "
            + f"-sql \"SELECT * FROM tl_2010_{fips}_bg10 LEFT JOIN '/home/data/score/csv/data{fips}.csv'.data{fips} ON tl_2010_{fips}_bg10.GEOID10 = data{fips}.ID\" "
            + f"/home/data/score/geojson/{fips}.json /home/data/census/shp/{fips}/tl_2010_{fips}_bg10.dbf"
        )
        os.system(cmd)

    # get a list of all json files to plug in the docker commands below
    # (workaround since *.json doesn't seem to work)
    geojson_list = ""
    geojson_path = data_path / "score" / "geojson"
    for file in os.listdir(geojson_path):
        if file.endswith(".json"):
            geojson_list += f"/home/data/score/geojson/{file} "

    if geojson_list == "":
        logging.error(
            "No GeoJson files found. Please run scripts/download_cbg.py first"
        )

    # generate mbtiles file
    # PWD is different for Windows
    if os.name == "nt":
        pwd = "%cd%"
    else:
        pwd = "${PWD}"
    cmd = (
        'docker run --rm -it -v "'
        + pwd
        + '"/:/home klokantech/tippecanoe tippecanoe --drop-densest-as-needed -zg -o /home/data/tiles/block2010.mbtiles --extend-zooms-if-still-dropping -l cbg2010 -s_srs EPSG:4269 -t_srs EPSG:4326 '
        + geojson_list
    )
    os.system(cmd)

    # PWD is different for Windows
    if os.name == "nt":
        pwd = "%cd%"
    else:
        pwd = "${PWD}"
    cmd = (
        'docker run --rm -it -v "'
        + pwd
        + '"/:/home klokantech/tippecanoe tippecanoe --drop-densest-as-needed --no-tile-compression -zg -e /home/data/tiles/mvt '
        + geojson_list
    )
    os.system(cmd)
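Editor's note: an alternative sketch (not part of this commit) showing the same tippecanoe-in-Docker invocation built with subprocess.run and a list of arguments instead of os.system() string concatenation; this sidesteps the ${PWD} / %cd% quoting above. Image name and flags are copied from the command in generate_tiles.

# Sketch only, not the committed implementation.
import subprocess
from pathlib import Path


def run_tippecanoe(geojson_files: list) -> None:
    # geojson_files are container paths, e.g. "/home/data/score/geojson/01.json"
    cmd = [
        "docker", "run", "--rm", "-v", f"{Path.cwd()}:/home",
        "klokantech/tippecanoe", "tippecanoe",
        "--drop-densest-as-needed", "-zg",
        "-o", "/home/data/tiles/block2010.mbtiles",
        "--extend-zooms-if-still-dropping", "-l", "cbg2010",
        "-s_srs", "EPSG:4269", "-t_srs", "EPSG:4326",
    ] + geojson_files
    subprocess.run(cmd, check=True)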
score/utils.py (new file, 76 lines)

@@ -0,0 +1,76 @@
from pathlib import Path
import os
import logging
import shutil
import requests
import zipfile


def get_module_logger(module_name):
    """
    To use this, do logger = get_module_logger(__name__)
    """
    logger = logging.getLogger(module_name)
    handler = logging.StreamHandler()
    formatter = logging.Formatter(
        "%(asctime)s [%(name)-12s] %(levelname)-8s %(message)s"
    )
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger


logger = get_module_logger(__name__)


def remove_files_from_dir(files_path: Path, extension: str = None) -> None:
    for file in os.listdir(files_path):
        if extension:
            if not file.endswith(extension):
                continue
        else:
            # don't remove __init__ files as they conserve dir structure
            if file == "__init__.py":
                continue
        os.remove(files_path / file)
        logger.info(f"Removing {file}")


def remove_all_from_dir(files_path: Path) -> None:
    for file in os.listdir(files_path):
        # don't remove __init__ files as they conserve dir structure
        if file == "__init__.py":
            continue
        if os.path.isfile(files_path / file):
            os.remove(files_path / file)
        else:
            shutil.rmtree(files_path / file)
        logger.info(f"Removing {file}")


def remove_all_dirs_from_dir(dir_path: Path) -> None:
    for filename in os.listdir(dir_path):
        file_path = os.path.join(dir_path, filename)
        if os.path.isdir(file_path):
            shutil.rmtree(file_path)
            logging.info(f"Removing directory {file_path}")


def unzip_file_from_url(
    file_url: str, download_path: Path, zip_file_directory: Path, verify: bool = False
) -> None:
    logger.info(f"Downloading {file_url}")
    download = requests.get(file_url, verify=verify)
    file_contents = download.content
    zip_file_path = download_path / "downloaded.zip"
    zip_file = open(zip_file_path, "wb")
    zip_file.write(file_contents)
    zip_file.close()

    logger.info(f"Extracting {zip_file_path}")
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(zip_file_directory)

    # cleanup temporary file
    os.remove(zip_file_path)
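Editor's note: a minimal sketch (not part of this commit) of the helpers above, mirroring how etl.py uses them: fetch a zip, extract it, and log progress. The URL pattern and FIPS code come from the census ETL in this change set.

# Sketch only: download and extract one state's block-group shapefile.
from pathlib import Path

from utils import get_module_logger, unzip_file_from_url, remove_all_from_dir

logger = get_module_logger(__name__)

data_path = Path.cwd() / "data"
fips = "01"  # Alabama, as an example two-digit FIPS code
cbg_state_url = (
    f"https://www2.census.gov/geo/tiger/TIGER2010/BG/2010/tl_2010_{fips}_bg10.zip"
)

# Downloads to data/tmp/downloaded.zip, extracts into data/census/shp/01,
# then removes the temporary zip.
unzip_file_from_url(cbg_state_url, data_path / "tmp", data_path / "census" / "shp" / fips)
logger.info("Download complete")

# Later, clear the temp directory (any __init__.py files are kept).
remove_all_from_dir(data_path / "tmp")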