j40-cejst-2/data/data-pipeline/data_pipeline/etl/sources/calenviroscreen/etl.py

import pandas as pd

from data_pipeline.etl.base import ExtractTransformLoad
from data_pipeline.utils import get_module_logger
from data_pipeline.config import settings

logger = get_module_logger(__name__)


class CalEnviroScreenETL(ExtractTransformLoad):
    """ETL for the CalEnviroScreen 4.0 (2021) results, loaded as a comparison index."""

    def __init__(self):
        self.CALENVIROSCREEN_FTP_URL = (
            settings.AWS_JUSTICE40_DATASOURCES_URL
            + "/CalEnviroScreen_4.0_2021.zip"
        )
        self.CALENVIROSCREEN_CSV = (
            self.TMP_PATH / "CalEnviroScreen_4.0_2021.csv"
        )
        self.CSV_PATH = self.DATA_PATH / "dataset" / "calenviroscreen4"

        # Defining some variable names
        self.CALENVIROSCREEN_SCORE_FIELD_NAME = "calenviroscreen_score"
        self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME = (
            "calenviroscreen_percentile"
        )
        self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME = (
            "calenviroscreen_priority_community"
        )

        # Choosing constants.
        # None of these numbers is final; they are used only for comparison purposes.
        self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD = 75

        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Downloading CalEnviroScreen Data")
        super().extract(
            self.CALENVIROSCREEN_FTP_URL,
            self.TMP_PATH,
        )

    def transform(self) -> None:
        logger.info("Transforming CalEnviroScreen Data")

        # Data from https://calenviroscreen-oehha.hub.arcgis.com/#Data, specifically:
        # https://oehha.ca.gov/media/downloads/calenviroscreen/document/calenviroscreen40resultsdatadictionaryd12021.zip
        # Load comparison index (CalEnviroScreen 4)
        self.df = pd.read_csv(
            self.CALENVIROSCREEN_CSV, dtype={"Census Tract": "string"}
        )

        self.df.rename(
            columns={
                "Census Tract": self.GEOID_TRACT_FIELD_NAME,
                "DRAFT CES 4.0 Score": self.CALENVIROSCREEN_SCORE_FIELD_NAME,
                "DRAFT CES 4.0 Percentile": self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME,
            },
            inplace=True,
        )

        # Add a leading "0" to the Census Tract to match our format in other data frames.
        self.df[self.GEOID_TRACT_FIELD_NAME] = (
            "0" + self.df[self.GEOID_TRACT_FIELD_NAME]
        )

        # Flag tracts at or above the percentile threshold as priority communities
        self.df[self.CALENVIROSCREEN_PRIORITY_COMMUNITY_FIELD_NAME] = (
            self.df[self.CALENVIROSCREEN_PERCENTILE_FIELD_NAME]
            >= self.CALENVIROSCREEN_PRIORITY_COMMUNITY_THRESHOLD
        )

    def load(self) -> None:
        logger.info("Saving CalEnviroScreen CSV")
        # Write the California CSV ("06" is California's state FIPS code)
        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.CSV_PATH / "data06.csv", index=False)
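
The transform step above can be tried in isolation. The following is a minimal sketch, not the pipeline's own code: it assumes pandas is installed, uses made-up rows rather than real CalEnviroScreen values, and takes `GEOID10_TRACT` as a stand-in for `GEOID_TRACT_FIELD_NAME`, whose actual value is defined on the base class and is not shown in this file. The helper name `transform_sample` is hypothetical.

```python
import io

import pandas as pd

# Assumed value for illustration; the real constant lives on ExtractTransformLoad.
GEOID_TRACT_FIELD_NAME = "GEOID10_TRACT"
SCORE_FIELD_NAME = "calenviroscreen_score"
PERCENTILE_FIELD_NAME = "calenviroscreen_percentile"
PRIORITY_FIELD_NAME = "calenviroscreen_priority_community"
PRIORITY_THRESHOLD = 75

# Made-up sample rows in the shape of the source CSV (not real data).
SAMPLE_CSV = """Census Tract,DRAFT CES 4.0 Score,DRAFT CES 4.0 Percentile
6000000001,80.0,99.0
6000000002,50.0,75.0
6000000003,10.0,5.0
"""


def transform_sample(csv_text: str) -> pd.DataFrame:
    # Read the tract ID as a string so it is not parsed as an integer.
    df = pd.read_csv(io.StringIO(csv_text), dtype={"Census Tract": "string"})
    df = df.rename(
        columns={
            "Census Tract": GEOID_TRACT_FIELD_NAME,
            "DRAFT CES 4.0 Score": SCORE_FIELD_NAME,
            "DRAFT CES 4.0 Percentile": PERCENTILE_FIELD_NAME,
        }
    )
    # Prepend "0" so the 10-character IDs become 11-character tract GEOIDs.
    df[GEOID_TRACT_FIELD_NAME] = "0" + df[GEOID_TRACT_FIELD_NAME]
    # Flag tracts whose percentile is at or above the threshold.
    df[PRIORITY_FIELD_NAME] = df[PERCENTILE_FIELD_NAME] >= PRIORITY_THRESHOLD
    return df


result = transform_sample(SAMPLE_CSV)
print(result[[GEOID_TRACT_FIELD_NAME, PRIORITY_FIELD_NAME]])
```

Because the comparison is `>=`, a tract sitting exactly at the 75th percentile is flagged as a priority community. The unconditional `"0" +` prefix relies on every tract ID in the file being 10 characters long; if that cannot be guaranteed, `str.zfill(11)` on the column would be one alternative.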