j40-cejst-2/data/data-pipeline/data_pipeline/etl/base.py

from pathlib import Path
from typing import Optional

from data_pipeline.config import settings
from data_pipeline.utils import unzip_file_from_url, remove_all_from_dir


class ExtractTransformLoad:
    """
    A class used to instantiate an ETL object to retrieve and process data from
    datasets.

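    Attributes:
        DATA_PATH (pathlib.Path): Local path where all data will be stored
        TMP_PATH (pathlib.Path): Local path where temporary data will be stored
        FILES_PATH (pathlib.Path): Local path to the application's `files` directory
        GEOID_FIELD_NAME (str): The common column name for a Census Block Group identifier
        GEOID_TRACT_FIELD_NAME (str): The common column name for a Census Tract identifier
        EXPECTED_MAX_CENSUS_BLOCK_GROUPS (int): Upper bound on the number of Census Block Groups
        EXPECTED_MAX_CENSUS_TRACTS (int): Upper bound on the number of Census Tracts
    """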

    DATA_PATH: Path = settings.APP_ROOT / "data"
    TMP_PATH: Path = DATA_PATH / "tmp"
    FILES_PATH: Path = settings.APP_ROOT / "files"
    GEOID_FIELD_NAME: str = "GEOID10"
    GEOID_TRACT_FIELD_NAME: str = "GEOID10_TRACT"
    # TODO: investigate. Census says there are only 217,740 CBGs in the US. This might be from CBGs at different time periods.
    EXPECTED_MAX_CENSUS_BLOCK_GROUPS: int = 220405
    EXPECTED_MAX_CENSUS_TRACTS: int = 73076

    def get_yaml_config(self) -> None:
        """Reads the YAML configuration file for the dataset and stores
        the properties in the instance (upcoming feature)"""

        pass

    def check_ttl(self) -> None:
        """Checks if the ETL process can be run based on the TTL value in the
        YAML config (upcoming feature)"""

        pass

    def extract(
        self,
        source_url: Optional[str] = None,
        extract_path: Optional[Path] = None,
        verify: Optional[bool] = True,
    ) -> None:
        """Extract the data from a remote source. By default, it downloads the
        zip file at source_url, unzips it, and stores the contents in
        extract_path. It does nothing unless both arguments are provided."""

        # Subclasses can reuse this default behavior via super().extract()
        if source_url and extract_path:
            unzip_file_from_url(
                source_url, self.TMP_PATH, extract_path, verify=verify
            )

    def transform(self) -> None:
        """Transform the data extracted into a format that can be consumed by the
        score generator"""

        raise NotImplementedError

    def load(self) -> None:
        """Saves the transformed data in the specified local data folder or remote AWS S3
        bucket"""

        raise NotImplementedError

    def cleanup(self) -> None:
        """Clears out any files stored in the TMP folder"""

        remove_all_from_dir(self.TMP_PATH)
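
The class above leaves `transform` and `load` to subclasses and provides defaults for `extract` and `cleanup`. The following is a minimal, self-contained sketch of that pattern, not code from the repository. `ExtractTransformLoadSketch`, `ExampleETL`, `run_etl`, the file names, and the sample row are all hypothetical stand-ins, and the call order in `run_etl` is an assumption about how a runner might drive the four hooks.

```python
import tempfile
from pathlib import Path


class ExtractTransformLoadSketch:
    """Hypothetical stand-in for the base class, reduced to its four hooks."""

    # Assumption: a scratch folder under the system temp directory.
    TMP_PATH: Path = Path(tempfile.gettempdir()) / "etl_sketch_tmp"

    def extract(self) -> None:
        # The real base class downloads and unzips a file; the sketch does nothing.
        pass

    def transform(self) -> None:
        raise NotImplementedError

    def load(self) -> None:
        raise NotImplementedError

    def cleanup(self) -> None:
        # Simplified replacement for remove_all_from_dir: delete plain files only.
        if self.TMP_PATH.exists():
            for child in self.TMP_PATH.iterdir():
                if child.is_file():
                    child.unlink()


class ExampleETL(ExtractTransformLoadSketch):
    """Hypothetical dataset ETL that overrides the dataset-specific steps."""

    def __init__(self, output_dir: Path) -> None:
        self.output_dir = output_dir
        self.rows: list = []

    def extract(self) -> None:
        # Write a tiny raw file instead of downloading one.
        self.TMP_PATH.mkdir(parents=True, exist_ok=True)
        (self.TMP_PATH / "raw.csv").write_text("GEOID10,value\n10010201001,3\n")

    def transform(self) -> None:
        header, *records = (self.TMP_PATH / "raw.csv").read_text().splitlines()
        # Left-pad the identifier to the 12 digits of a block group GEOID.
        self.rows = [header]
        for record in records:
            geoid, value = record.split(",")
            self.rows.append(f"{geoid.zfill(12)},{value}")

    def load(self) -> None:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        (self.output_dir / "usa.csv").write_text("\n".join(self.rows) + "\n")


def run_etl(etl: ExtractTransformLoadSketch) -> None:
    """Call the hooks in an assumed order: extract, transform, load, cleanup."""
    etl.extract()
    etl.transform()
    etl.load()
    etl.cleanup()
```

Calling `run_etl(ExampleETL(Path(tempfile.mkdtemp())))` writes a `usa.csv` file in the output folder and clears the scratch folder. A subclass that forgets to override `transform` or `load` fails with `NotImplementedError`, which is the behavior the base class above relies on to flag incomplete ETLs.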