Refactor DOE Energy Burden and COI to use YAML (#1796)

* added tribalId for Supplemental dataset (#1804)

* Setting zoom levels for tribal map (#1810)

* NRI dataset and initial score YAML configuration (#1534)

* update be staging gha

* NRI dataset and initial score YAML configuration

* checkpoint

* adding data checks for release branch

* passing tests

* adding INPUT_EXTRACTED_FILE_NAME to base class

* lint

* columns to keep and tests

* update be staging gha

* checkpoint

* update be staging gha

* NRI dataset and initial score YAML configuration

* checkpoint

* adding data checks for release branch

* passing tests

* adding INPUT_EXTRACTED_FILE_NAME to base class

* lint

* columns to keep and tests

* checkpoint

* PR Review

* removing source url

* tests

* stop execution of ETL if there's a YAML schema issue

* update be staging gha

* adding source url as class var again

* clean up

* force cache bust

* gha cache bust

* dynamically set score vars from YAML

* docstrings

* removing last updated year - optional reverse percentile

* passing tests

* sort order

* column ordering

* PR review

* class level vars

* Updating DatasetsConfig

* fix pylint errors

* moving metadata hint back to code

Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>

* Correct copy typo (#1809)

* Add basic test suite for COI (#1518)

* Update COI to use new yaml (#1518)

* Add tests for DOE energy burden (#1518)

* Add dataset config for energy burden (#1518)

* Refactor ETL to use datasets.yml (#1518)

* Add fake GEOIDs to COI tests (#1518)

* Refactor _setup_etl_instance_and_run_extract to base (#1518)

For the three classes we've done so far, a generic
_setup_etl_instance_and_run_extract works fine. For the moment we can
reuse the same setup method until we decide future classes need more
flexibility; they can always subclass it if they do.
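
A minimal sketch of what such a generic helper could look like, assuming the subclass downloads a single zip from `SOURCE_URL`. The fixture names (`_DATA_DIRECTORY_FOR_TEST`, `_SAMPLE_DATA_ZIP_FILE_NAME`, `_get_instance_of_tested_class`) and the `mock_etl`/`mock_paths` pytest fixtures are illustrative placeholders, not necessarily the suite's real names:

```python
# Illustrative sketch only: attribute and fixture names are assumptions.
from unittest import mock

import requests


class TestETLBase:
    def _setup_etl_instance_and_run_extract(self, mock_etl, mock_paths):
        """Instantiate the tested ETL class and run extract() against a local fixture.

        Generic on purpose: it assumes extract() downloads one zip from
        SOURCE_URL. Classes that need more can override this method.
        """
        with mock.patch("data_pipeline.utils.requests") as requests_mock:
            # Build a canned HTTP response from the zipped sample data so
            # extract() never touches the network.
            zip_fixture = (
                self._DATA_DIRECTORY_FOR_TEST / self._SAMPLE_DATA_ZIP_FILE_NAME
            )
            response_mock = requests.Response()
            response_mock.status_code = 200
            response_mock._content = zip_fixture.read_bytes()
            requests_mock.get = mock.MagicMock(return_value=response_mock)

            etl = self._get_instance_of_tested_class()
            etl.extract()

        return etl
```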

* Add output-path tests (#1518)

* Update YAML to match constant (#1518)

* Don't blindly set float format (#1518)

* Add defaults for extract (#1518)

* Run YAML load on all subclasses (#1518)

* Update description fields (#1518)

* Update YAML per final format (#1518)

* Update fixture tract IDs (#1518)

* Update base class refactor (#1518)

Now that NRI is final, I needed to make a few updates to my
refactored code.

* Remove old comment (#1518)

* Fix type signature and return (#1518)

* Update per code review (#1518)

Co-authored-by: Jorge Escobar <83969469+esfoobar-usds@users.noreply.github.com>
Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
Co-authored-by: Vim <86254807+vim-usds@users.noreply.github.com>
commit 97e17546cc
Authored by Matt Bowen on 2022-08-10 16:02:59 -04:00, committed by Emma Nechamkin
28 changed files with 455 additions and 189 deletions

@@ -2,63 +2,48 @@ from pathlib import Path
 import pandas as pd
 
 from data_pipeline.config import settings
-from data_pipeline.etl.base import ExtractTransformLoad
-from data_pipeline.utils import get_module_logger, unzip_file_from_url
+from data_pipeline.etl.base import ExtractTransformLoad, ValidGeoLevel
+from data_pipeline.utils import get_module_logger
 
 logger = get_module_logger(__name__)
 
 
 class DOEEnergyBurden(ExtractTransformLoad):
-    def __init__(self):
-        self.DOE_FILE_URL = (
-            settings.AWS_JUSTICE40_DATASOURCES_URL
-            + "/DOE_LEAD_AMI_TRACT_2018_ALL.csv.zip"
-        )
+    NAME = "doe_energy_burden"
+    SOURCE_URL: str = (
+        settings.AWS_JUSTICE40_DATASOURCES_URL
+        + "/DOE_LEAD_AMI_TRACT_2018_ALL.csv.zip"
+    )
+    GEO_LEVEL = ValidGeoLevel.CENSUS_TRACT
+
+    REVISED_ENERGY_BURDEN_FIELD_NAME: str
+
+    def __init__(self):
         self.OUTPUT_PATH: Path = (
             self.DATA_PATH / "dataset" / "doe_energy_burden"
         )
-        self.TRACT_INPUT_COLUMN_NAME = "FIP"
         self.INPUT_ENERGY_BURDEN_FIELD_NAME = "BURDEN"
         self.REVISED_ENERGY_BURDEN_FIELD_NAME = "Energy burden"
 
         # Constants for output
         self.COLUMNS_TO_KEEP = [
             self.GEOID_TRACT_FIELD_NAME,
             self.REVISED_ENERGY_BURDEN_FIELD_NAME,
         ]
 
-        self.raw_df: pd.DataFrame
         self.output_df: pd.DataFrame
 
-    def extract(self) -> None:
-        logger.info("Starting data download.")
-
-        unzip_file_from_url(
-            file_url=self.DOE_FILE_URL,
-            download_path=self.get_tmp_path(),
-            unzipped_file_path=self.get_tmp_path() / "doe_energy_burden",
-        )
-
-        self.raw_df = pd.read_csv(
+    def transform(self) -> None:
+        logger.info("Starting DOE Energy Burden transforms.")
+
+        raw_df: pd.DataFrame = pd.read_csv(
             filepath_or_buffer=self.get_tmp_path()
             / "doe_energy_burden"
             / "DOE_LEAD_AMI_TRACT_2018_ALL.csv",
             # The following need to remain as strings for all of their digits, not get converted to numbers.
             dtype={
-                self.TRACT_INPUT_COLUMN_NAME: "string",
+                self.INPUT_GEOID_TRACT_FIELD_NAME: "string",
             },
             low_memory=False,
         )
 
-    def transform(self) -> None:
-        logger.info("Starting transforms.")
-
-        output_df = self.raw_df.rename(
+        logger.info("Renaming columns and ensuring output format is correct")
+        output_df = raw_df.rename(
             columns={
                 self.INPUT_ENERGY_BURDEN_FIELD_NAME: self.REVISED_ENERGY_BURDEN_FIELD_NAME,
-                self.TRACT_INPUT_COLUMN_NAME: self.GEOID_TRACT_FIELD_NAME,
+                self.INPUT_GEOID_TRACT_FIELD_NAME: self.GEOID_TRACT_FIELD_NAME,
             }
         )
@@ -75,7 +60,4 @@ class DOEEnergyBurden(ExtractTransformLoad):
     def load(self) -> None:
         logger.info("Saving DOE Energy Burden CSV")
 
-        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
-        self.output_df[self.COLUMNS_TO_KEEP].to_csv(
-            path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False
-        )
+        super().load()
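
For context, the diff leans on two base-class conventions this PR introduces: per-dataset metadata now lives in `datasets.yml` and is validated before the ETL runs (a bad entry halts execution, per the "stop execution of ETL if there's a YAML schema issue" commit), and a generic `load()` writes `COLUMNS_TO_KEEP` to the output path, which is why `DOEEnergyBurden.load()` collapses to `super().load()`. A rough sketch of that pattern, where the YAML keys and validation are simplified stand-ins for the real `DatasetsConfig` schema:

```python
# Hedged sketch of the base-class pattern; the YAML keys and error handling
# are simplified assumptions, not the repository's exact DatasetsConfig schema.
from pathlib import Path

import pandas as pd
import yaml


class ExtractTransformLoadSketch:
    NAME: str  # must match a dataset entry in datasets.yml
    COLUMNS_TO_KEEP: list
    OUTPUT_PATH: Path
    output_df: pd.DataFrame

    @classmethod
    def yaml_config_load(cls, datasets_yml: Path) -> dict:
        """Find this dataset's entry in datasets.yml; stop the ETL if it is malformed."""
        config = yaml.safe_load(datasets_yml.read_text())
        for dataset in config["datasets"]:
            if dataset.get("module_name") == cls.NAME:
                break
        else:
            # Fail fast rather than run an ETL with missing configuration.
            raise RuntimeError(f"No datasets.yml entry found for {cls.NAME}")
        for key in ("input_geoid_tract_field_name", "load_fields"):
            if key not in dataset:
                raise RuntimeError(f"{cls.NAME}: datasets.yml entry missing `{key}`")
        return dataset

    def load(self) -> None:
        """Generic load step: subclasses only set output_df and COLUMNS_TO_KEEP."""
        self.OUTPUT_PATH.mkdir(parents=True, exist_ok=True)
        self.output_df[self.COLUMNS_TO_KEEP].to_csv(
            path_or_buf=self.OUTPUT_PATH / "usa.csv", index=False
        )
```

With the write centralized this way, per-dataset `load()` overrides disappear unless a dataset genuinely needs a different output shape.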