import importlib

from etl.score.etl_score import ScoreETL
from etl.score.etl_score_post import PostScoreETL


def etl_runner(dataset_to_run: str = None) -> None:
    """Runs all etl processes or a specific one

    Args:
        dataset_to_run (str): Run a specific ETL process. If missing, runs all processes (optional)

    Returns:
        None
    """
    # this list comes from YAMLs
    dataset_list = [
        {
            "name": "census_acs",
            "module_dir": "census_acs",
            "class_name": "CensusACSETL",
        },
        {
            "name": "ejscreen",
            "module_dir": "ejscreen",
            "class_name": "EJScreenETL",
        },
        {
            "name": "housing_and_transportation",
            "module_dir": "housing_and_transportation",
            "class_name": "HousingTransportationETL",
        },
        {
            "name": "hud_housing",
            "module_dir": "hud_housing",
            "class_name": "HudHousingETL",
        },
        {
            "name": "calenviroscreen",
            "module_dir": "calenviroscreen",
            "class_name": "CalEnviroScreenETL",
        },
        {
            "name": "hud_recap",
            "module_dir": "hud_recap",
            "class_name": "HudRecapETL",
        },
    ]

    if dataset_to_run:
        dataset_element = next(
            (item for item in dataset_list if item["name"] == dataset_to_run),
            None,
        )
        if not dataset_element:
            raise ValueError("Invalid dataset name")
        else:
            # reset the list to just the dataset
            dataset_list = [dataset_element]

    # Run the ETLs for the dataset_list
    for dataset in dataset_list:
        etl_module = importlib.import_module(
            f"etl.sources.{dataset['module_dir']}.etl"
        )
        etl_class = getattr(etl_module, dataset["class_name"])
        etl_instance = etl_class()

        # run extract
        etl_instance.extract()

        # run transform
        etl_instance.transform()

        # run load
        etl_instance.load()

        # cleanup
        etl_instance.cleanup()

    # update the front end JSON/CSV of list of data sources
    pass

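# Example usage (a sketch, not part of the original module; assumes each listed
# ETL class lives in etl/sources/<module_dir>/etl.py and follows the
# extract/transform/load/cleanup interface used in the loop above):
#
#   etl_runner()              # run every ETL in dataset_list
#   etl_runner("census_acs")  # run only the census_acs ETL
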
def score_generate() -> None:
    """Generates the score and saves it on the local data directory

    Args:
        None

    Returns:
        None
    """
    # Score Gen
    score_gen = ScoreETL()
    score_gen.extract()
    score_gen.transform()
    score_gen.load()

    # Post Score Processing
    score_post = PostScoreETL()
    score_post.extract()
    score_post.transform()
    score_post.load()
    score_post.cleanup()


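# A typical full pipeline run (a sketch; the ordering assumes the score ETL
# consumes the outputs of the dataset ETLs above):
#
#   etl_runner()        # refresh every source dataset
#   score_generate()    # build the score, then post-process it

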
def _find_dataset_index(dataset_list, key, value):
    """Returns the index of the first element whose `key` equals `value`, or -1."""
    for i, element in enumerate(dataset_list):
        if element[key] == value:
            return i
    return -1
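
# Example usage (a sketch; assumes a `dataset_list` shaped like the one built
# inside etl_runner, and mirrors that function's name lookup):
#
#   index = _find_dataset_index(dataset_list, "name", "ejscreen")
#   if index == -1:
#       raise ValueError("Invalid dataset name")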