j40-cejst-2/data/data-pipeline/data_pipeline/etl/sources/hud_recap/etl.py

import pandas as pd
import requests

from data_pipeline.etl.base import ExtractTransformLoad
from data_pipeline.utils import get_module_logger
from data_pipeline.config import settings

logger = get_module_logger(__name__)


class HudRecapETL(ExtractTransformLoad):
    def __init__(self):
        # pylint: disable=line-too-long
        self.HUD_RECAP_CSV_URL = "https://opendata.arcgis.com/api/v3/datasets/56de4edea8264fe5a344da9811ef5d6e_0/downloads/data?format=csv&spatialRefId=4326"  # noqa: E501
        self.HUD_RECAP_CSV = (
            self.get_tmp_path()
            / "Racially_or_Ethnically_Concentrated_Areas_of_Poverty__R_ECAPs_.csv"
        )
        self.CSV_PATH = self.DATA_PATH / "dataset" / "hud_recap"

        # Definining some variable names
        self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME = (
            "hud_recap_priority_community"
        )

        self.df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Downloading HUD Recap Data")
        download = requests.get(
            self.HUD_RECAP_CSV_URL,
            verify=None,
            timeout=settings.REQUESTS_DEFAULT_TIMOUT,
        )
        file_contents = download.content
        csv_file = open(self.HUD_RECAP_CSV, "wb")
        csv_file.write(file_contents)
        csv_file.close()

    def transform(self) -> None:
        logger.info("Transforming HUD Recap Data")

        # Load comparison index (CalEnviroScreen 4)
        self.df = pd.read_csv(self.HUD_RECAP_CSV, dtype={"GEOID": "string"})

        self.df.rename(
            columns={
                "GEOID": self.GEOID_TRACT_FIELD_NAME,
                # Interestingly, there's no data dictionary for the RECAP data that I could find.
                # However, this site (http://www.schousing.com/library/Tax%20Credit/2020/QAP%20Instructions%20(2).pdf)
                # suggests:
                # "If RCAP_Current for the tract in which the site is located is 1, the tract is an R/ECAP. If RCAP_Current is 0, it is not."
                "RCAP_Current": self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME,
            },
            inplace=True,
        )

        # Convert to boolean
        self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME] = self.df[
            self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME
        ].astype("bool")

        self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME].value_counts()

        self.df.sort_values(by=self.GEOID_TRACT_FIELD_NAME, inplace=True)

    def load(self) -> None:
        logger.info("Saving HUD Recap CSV")
        # write nationwide csv
        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
        self.df.to_csv(self.CSV_PATH / "usa.csv", index=False)
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00			`import pandas as pd`
			`import requests`

Data directory should adopt standard Poetry-suggested python package structure (#457) * Fixes #456 - Our data directory should adopt standard python package structure * a few missed references * updating readme * updating requirements * Running Black * Fixes for flake8 * updating pylint 2021-08-05 15:35:54 -04:00			`from data_pipeline.etl.base import ExtractTransformLoad`
			`from data_pipeline.utils import get_module_logger`
Score tests (#1847) * update Python version on README; tuple typing fix * Alaska tribal points fix (#1821) * Bump mistune from 0.8.4 to 2.0.3 in /data/data-pipeline (#1777) Bumps [mistune](https://github.com/lepture/mistune) from 0.8.4 to 2.0.3. - [Release notes](https://github.com/lepture/mistune/releases) - [Changelog](https://github.com/lepture/mistune/blob/master/docs/changes.rst) - [Commits](https://github.com/lepture/mistune/compare/v0.8.4...v2.0.3) --- updated-dependencies: - dependency-name: mistune dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * poetry update * initial pass of score tests * add threshold tests * added ses threshold (not donut, not island) * testing suite -- stopping for the day * added test for lead proxy indicator * Refactor score tests to make them less verbose and more direct (#1865) * Cleanup tests slightly before refactor (#1846) * Refactor score calculations tests * Feedback from review * Refactor output tests like calculatoin tests (#1846) (#1870) * Reorganize files (#1846) * Switch from lru_cache to fixture scorpes (#1846) * Add tests for all factors (#1846) * Mark smoketests and run as part of be deply (#1846) * Update renamed var (#1846) * Switch from named tuple to dataclass (#1846) This is annoying, but pylint in python3.8 was crashing parsing the named tuple. We weren't using any namedtuple-specific features, so I made the type a dataclass just to get pylint to behave. * Add default timout to requests (#1846) * Fix type (#1846) * Fix merge mistake on poetry.lock (#1846) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov> Co-authored-by: Jorge Escobar <83969469+esfoobar-usds@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Bowen <83967628+mattbowen-usds@users.noreply.github.com> Co-authored-by: matt bowen <matthew.r.bowen@omb.eop.gov> 2022-08-26 15:23:20 -04:00			`from data_pipeline.config import settings`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00
			`logger = get_module_logger(__name__)`


			`class HudRecapETL(ExtractTransformLoad):`
			`def __init__(self):`
Issue 308 python linting (#443) * Adds flake8, pylint, liccheck, flake8 to dependencies for data-pipeline * Sets up and runs black autoformatting * Adds flake8 to tox linting * Fixes flake8 error F541 f string missing placeholders * Fixes flake8 E501 line too long * Fixes flake8 F401 imported but not used * Adds pylint to tox and disables the following pylint errors: - C0114: module docstrings - R0201: method could have been a function - R0903: too few public methods - C0103: name case styling - W0511: fix me - W1203: f-string interpolation in logging * Adds utils.py to tox.ini linting, runs black on utils.py * Fixes import related pylint errors: C0411 and C0412 * Fixes or ignores remaining pylint errors (for discussion later) * Adds safety and liccheck to tox.ini 2021-08-02 12:16:38 -04:00			`# pylint: disable=line-too-long`
			`self.HUD_RECAP_CSV_URL = "https://opendata.arcgis.com/api/v3/datasets/56de4edea8264fe5a344da9811ef5d6e_0/downloads/data?format=csv&spatialRefId=4326" # noqa: E501`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00			`self.HUD_RECAP_CSV = (`
Run ETL processes in parallel (#1253) * WIP on parallelizing * switching to get_tmp_path for nri * switching to get_tmp_path everywhere necessary * fixing linter errors * moving heavy ETLs to front of line * add hold * moving cdc places up * removing unnecessary print * moving h&t up * adding parallel to geo post * better census labels * switching to concurrent futures * fixing output 2022-02-11 14:04:53 -05:00			`self.get_tmp_path()`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00			`/ "Racially_or_Ethnically_Concentrated_Areas_of_Poverty__R_ECAPs_.csv"`
			`)`
			`self.CSV_PATH = self.DATA_PATH / "dataset" / "hud_recap"`

			`# Definining some variable names`
Score F, testing methodology (#510) * fixing dependency issue * fixing more dependencies * including fraction of state AMI * wip * nitpick whitespace * etl working now * wip on scoring * fix rename error * reducing metrics * fixing score f * fixing readme * adding dependency * passing tests; * linting/black * removing unnecessary sample * fixing error * adding verify flag on etl/base Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov> 2021-08-24 15:40:54 -05:00			`self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME = (`
			`"hud_recap_priority_community"`
			`)`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00
			`self.df: pd.DataFrame`

			`def extract(self) -> None:`
Issue 308 python linting (#443) * Adds flake8, pylint, liccheck, flake8 to dependencies for data-pipeline * Sets up and runs black autoformatting * Adds flake8 to tox linting * Fixes flake8 error F541 f string missing placeholders * Fixes flake8 E501 line too long * Fixes flake8 F401 imported but not used * Adds pylint to tox and disables the following pylint errors: - C0114: module docstrings - R0201: method could have been a function - R0903: too few public methods - C0103: name case styling - W0511: fix me - W1203: f-string interpolation in logging * Adds utils.py to tox.ini linting, runs black on utils.py * Fixes import related pylint errors: C0411 and C0412 * Fixes or ignores remaining pylint errors (for discussion later) * Adds safety and liccheck to tox.ini 2021-08-02 12:16:38 -04:00			`logger.info("Downloading HUD Recap Data")`
Score tests (#1847) * update Python version on README; tuple typing fix * Alaska tribal points fix (#1821) * Bump mistune from 0.8.4 to 2.0.3 in /data/data-pipeline (#1777) Bumps [mistune](https://github.com/lepture/mistune) from 0.8.4 to 2.0.3. - [Release notes](https://github.com/lepture/mistune/releases) - [Changelog](https://github.com/lepture/mistune/blob/master/docs/changes.rst) - [Commits](https://github.com/lepture/mistune/compare/v0.8.4...v2.0.3) --- updated-dependencies: - dependency-name: mistune dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * poetry update * initial pass of score tests * add threshold tests * added ses threshold (not donut, not island) * testing suite -- stopping for the day * added test for lead proxy indicator * Refactor score tests to make them less verbose and more direct (#1865) * Cleanup tests slightly before refactor (#1846) * Refactor score calculations tests * Feedback from review * Refactor output tests like calculatoin tests (#1846) (#1870) * Reorganize files (#1846) * Switch from lru_cache to fixture scorpes (#1846) * Add tests for all factors (#1846) * Mark smoketests and run as part of be deply (#1846) * Update renamed var (#1846) * Switch from named tuple to dataclass (#1846) This is annoying, but pylint in python3.8 was crashing parsing the named tuple. We weren't using any namedtuple-specific features, so I made the type a dataclass just to get pylint to behave. * Add default timout to requests (#1846) * Fix type (#1846) * Fix merge mistake on poetry.lock (#1846) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov> Co-authored-by: Jorge Escobar <83969469+esfoobar-usds@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matt Bowen <83967628+mattbowen-usds@users.noreply.github.com> Co-authored-by: matt bowen <matthew.r.bowen@omb.eop.gov> 2022-08-26 15:23:20 -04:00			`download = requests.get(`
			`self.HUD_RECAP_CSV_URL,`
			`verify=None,`
			`timeout=settings.REQUESTS_DEFAULT_TIMOUT,`
			`)`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00			`file_contents = download.content`
			`csv_file = open(self.HUD_RECAP_CSV, "wb")`
			`csv_file.write(file_contents)`
			`csv_file.close()`

			`def transform(self) -> None:`
Issue 308 python linting (#443) * Adds flake8, pylint, liccheck, flake8 to dependencies for data-pipeline * Sets up and runs black autoformatting * Adds flake8 to tox linting * Fixes flake8 error F541 f string missing placeholders * Fixes flake8 E501 line too long * Fixes flake8 F401 imported but not used * Adds pylint to tox and disables the following pylint errors: - C0114: module docstrings - R0201: method could have been a function - R0903: too few public methods - C0103: name case styling - W0511: fix me - W1203: f-string interpolation in logging * Adds utils.py to tox.ini linting, runs black on utils.py * Fixes import related pylint errors: C0411 and C0412 * Fixes or ignores remaining pylint errors (for discussion later) * Adds safety and liccheck to tox.ini 2021-08-02 12:16:38 -04:00			`logger.info("Transforming HUD Recap Data")`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00
			`# Load comparison index (CalEnviroScreen 4)`
Analysis by region (#385) * Adding regional comparisons * Small ETL fixes 2021-07-26 08:02:25 -07:00			`self.df = pd.read_csv(self.HUD_RECAP_CSV, dtype={"GEOID": "string"})`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00
			`self.df.rename(`
			`columns={`
			`"GEOID": self.GEOID_TRACT_FIELD_NAME,`
			`# Interestingly, there's no data dictionary for the RECAP data that I could find.`
			`# However, this site (http://www.schousing.com/library/Tax%20Credit/2020/QAP%20Instructions%20(2).pdf)`
			`# suggests:`
			`# "If RCAP_Current for the tract in which the site is located is 1, the tract is an R/ECAP. If RCAP_Current is 0, it is not."`
			`"RCAP_Current": self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME,`
			`},`
			`inplace=True,`
			`)`

			`# Convert to boolean`
			`self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME] = self.df[`
			`self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME`
			`].astype("bool")`

			`self.df[self.HUD_RECAP_PRIORITY_COMMUNITY_FIELD_NAME].value_counts()`

			`self.df.sort_values(by=self.GEOID_TRACT_FIELD_NAME, inplace=True)`

			`def load(self) -> None:`
Issue 308 python linting (#443) * Adds flake8, pylint, liccheck, flake8 to dependencies for data-pipeline * Sets up and runs black autoformatting * Adds flake8 to tox linting * Fixes flake8 error F541 f string missing placeholders * Fixes flake8 E501 line too long * Fixes flake8 F401 imported but not used * Adds pylint to tox and disables the following pylint errors: - C0114: module docstrings - R0201: method could have been a function - R0903: too few public methods - C0103: name case styling - W0511: fix me - W1203: f-string interpolation in logging * Adds utils.py to tox.ini linting, runs black on utils.py * Fixes import related pylint errors: C0411 and C0412 * Fixes or ignores remaining pylint errors (for discussion later) * Adds safety and liccheck to tox.ini 2021-08-02 12:16:38 -04:00			`logger.info("Saving HUD Recap CSV")`
ETL Classes for Data Sets (#260) * first commit * checkpoint * checkpoint * first extract module 🎉 * completed census acs etl class * completed ejscreen etl * completed etl * score generation ready * improving census load and separation * score generation working 🎉 * completed etls * new score generation * PR reviews * run specific etl; starting docstrings * docstrings work * more docstrings * completed docstrings * adding pyenv version * more reasonable poetry req for python * PR comments 2021-07-12 15:50:44 -04:00			`# write nationwide csv`
			`self.CSV_PATH.mkdir(parents=True, exist_ok=True)`
Issue 308 python linting (#443) * Adds flake8, pylint, liccheck, flake8 to dependencies for data-pipeline * Sets up and runs black autoformatting * Adds flake8 to tox linting * Fixes flake8 error F541 f string missing placeholders * Fixes flake8 E501 line too long * Fixes flake8 F401 imported but not used * Adds pylint to tox and disables the following pylint errors: - C0114: module docstrings - R0201: method could have been a function - R0903: too few public methods - C0103: name case styling - W0511: fix me - W1203: f-string interpolation in logging * Adds utils.py to tox.ini linting, runs black on utils.py * Fixes import related pylint errors: C0411 and C0412 * Fixes or ignores remaining pylint errors (for discussion later) * Adds safety and liccheck to tox.ini 2021-08-02 12:16:38 -04:00			`self.df.to_csv(self.CSV_PATH / "usa.csv", index=False)`