2021-06-28 16:16:14 -04:00
|
|
|
import csv
|
2021-08-05 15:35:54 -04:00
|
|
|
import os
|
2021-10-13 16:00:33 -04:00
|
|
|
import sys
|
2021-08-02 12:16:38 -04:00
|
|
|
from pathlib import Path
|
2021-06-28 16:16:14 -04:00
|
|
|
|
2021-08-05 15:35:54 -04:00
|
|
|
import pandas as pd
|
|
|
|
from data_pipeline.config import settings
|
|
|
|
from data_pipeline.utils import (
|
|
|
|
get_module_logger,
|
2021-07-20 14:55:39 -04:00
|
|
|
remove_all_dirs_from_dir,
|
2021-08-05 15:35:54 -04:00
|
|
|
remove_files_from_dir,
|
2021-07-20 14:55:39 -04:00
|
|
|
unzip_file_from_url,
|
2021-10-13 16:00:33 -04:00
|
|
|
zip_directory,
|
2021-07-20 14:55:39 -04:00
|
|
|
)
|
|
|
|
|
|
|
|
# Module-level logger for this ETL utility module (project logging helper).
logger = get_module_logger(__name__)
|
2021-06-28 16:16:14 -04:00
|
|
|
|
|
|
|
|
2021-11-16 10:05:09 -05:00
|
|
|
def reset_data_directories(
    data_path: Path,
) -> None:
    """Empties all census folders"""
    census_root = data_path / "census"

    # Clear generated CSVs, keeping the FIPS reference file in place.
    remove_files_from_dir(
        census_root / "csv", ".csv", exception_list=["fips_states_2010.csv"]
    )

    # Clear generated GeoJSON files.
    remove_files_from_dir(census_root / "geojson", ".json")

    # Clear every extracted shapefile directory.
    remove_all_dirs_from_dir(census_root / "shp")
def get_state_fips_codes(data_path: Path) -> list:
    """Returns a list with state data.

    Reads the state FIPS codes from the first column of
    fips_states_2010.csv, downloading the file from the Justice40 S3
    data repository first if it is not present locally.

    Args:
        data_path (Path): base data directory (expects census/csv under it)

    Returns:
        list: state FIPS codes as strings, whitespace-stripped
    """
    fips_csv_path = data_path / "census" / "csv" / "fips_states_2010.csv"

    # Fetch the reference CSV from S3 when it is missing locally.
    if not fips_csv_path.is_file():
        logger.info("Downloading fips from S3 repository")
        unzip_file_from_url(
            settings.AWS_JUSTICE40_DATASOURCES_URL + "/fips_states_2010.zip",
            data_path / "tmp",
            data_path / "census" / "csv",
        )

    with open(fips_csv_path, encoding="utf-8") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        # Skip the header row; returns None (harmlessly) on an empty file.
        next(csv_reader, None)
        # First column holds the FIPS code; strip stray whitespace.
        return [row[0].strip() for row in csv_reader]
def get_state_information(data_path: Path) -> pd.DataFrame:
    """Load the full state file as a dataframe.

    Useful because of the state regional information.

    Args:
        data_path (Path): base data directory (expects census/csv under it)

    Returns:
        pd.DataFrame: the fips_states_2010.csv contents, with the "fips"
        column normalized to two-character, zero-padded strings
    """
    fips_csv_path = data_path / "census" / "csv" / "fips_states_2010.csv"

    df = pd.read_csv(fips_csv_path)

    # Left pad the FIPS codes with 0s; the vectorized string accessor
    # replaces the original per-row Python lambda (same result, faster).
    df["fips"] = df["fips"].astype(str).str.zfill(2)

    return df
def check_census_data_source(
    census_data_path: Path, census_data_source: str
) -> None:
    """Checks if census data is present, and exits gracefully if it doesn't exist. It will download it from S3
    if census_data_source is set to "aws"

    Args:
        census_data_path (str): Path for Census data
        census_data_source (str): Source for the census data
                Options:
                - local: fetch census data from the local data directory
                - aws: fetch census from AWS S3 J40 data repository

    Returns:
        None
    """
    CENSUS_DATA_S3_URL = settings.AWS_JUSTICE40_DATASOURCES_URL + "/census.zip"
    DATA_PATH = settings.APP_ROOT / "data"

    # download from s3 if census_data_source is aws
    if census_data_source == "aws":
        logger.info("Fetching Census data from AWS S3")
        unzip_file_from_url(
            CENSUS_DATA_S3_URL,
            DATA_PATH / "tmp",
            DATA_PATH,
        )
    else:
        # check if census data is found locally; us.json is the marker file
        # this check uses to decide whether census data is present at all
        if not (census_data_path / "geojson" / "us.json").is_file():
            logger.info(
                # Fixed mismatched quoting in the original message ("'-s aws`").
                "No local census data found. Please use '-s aws' to fetch from AWS"
            )
            # NOTE(review): exits with status 0 even though the data is
            # missing — confirm callers don't rely on a non-zero exit here.
            sys.exit()
def zip_census_data():
    """Zip the whole census data directory into the data/tmp folder."""
    logger.info("Compressing census files to data/tmp folder")

    data_root = settings.APP_ROOT / "data"

    # Archive the census folder into the tmp staging area.
    zip_directory(data_root / "census", data_root / "tmp")