j40-cejst-2/data/data-pipeline/data_pipeline/etl/sources/eamlis/etl.py

from pathlib import Path
import geopandas as gpd
import pandas as pd
from data_pipeline.config import settings

from data_pipeline.etl.base import ExtractTransformLoad, ValidGeoLevel
from data_pipeline.etl.sources.geo_utils import add_tracts_for_geometries
from data_pipeline.utils import get_module_logger

logger = get_module_logger(__name__)


class AbandonedMineETL(ExtractTransformLoad):
    """Data from the Office of Surface Mining Reclamation and Enforcement's
    eAMLIS (Abandoned Mine Land Inventory System). These are the locations of
    abandoned mines; the output flags each census tract that contains one.
    """

    # Metadata for the baseclass
    NAME = "eamlis"
    GEO_LEVEL = ValidGeoLevel.CENSUS_TRACT
    AML_BOOLEAN: str
    LOAD_YAML_CONFIG: bool = True

    PUERTO_RICO_EXPECTED_IN_DATA = False
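    # State FIPS codes expected to be absent from the data (DE, DC, FL, HI,
    # ME, MN, NE, NH, NJ, NY, SC, VT, WI).
    EXPECTED_MISSING_STATES = [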
        "10",
        "11",
        "12",
        "15",
        "23",
        "27",
        "31",
        "33",
        "34",
        "36",
        "45",
        "50",
        "55",
    ]

    # Define these for easy code completion
    def __init__(self):
        self.SOURCE_URL = (
            settings.AWS_JUSTICE40_DATASOURCES_URL
            + "/eAMLIS export of all data.tsv.zip"
        )

        self.TRACT_INPUT_COLUMN_NAME = self.INPUT_GEOID_TRACT_FIELD_NAME

        self.OUTPUT_PATH: Path = (
            self.DATA_PATH / "dataset" / "abandoned_mine_land_inventory_system"
        )

        self.COLUMNS_TO_KEEP = [
            self.GEOID_TRACT_FIELD_NAME,
            self.AML_BOOLEAN,
        ]

        self.output_df: pd.DataFrame

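    def transform(self) -> None:
        """Flag each census tract containing at least one abandoned mine."""
        logger.info("Starting eAMLIS transforms.")
        df = pd.read_csv(
            self.get_tmp_path() / "eAMLIS export of all data.tsv",
            sep="\t",
            low_memory=False,
        )
        # Build point geometries from the longitude/latitude columns (WGS 84).
        gdf = gpd.GeoDataFrame(
            df,
            geometry=gpd.points_from_xy(
                x=df["Longitude"],
                y=df["Latitude"],
            ),
            crs="epsg:4326",
        )
        # Drop records that share the same point before the tract lookup.
        gdf = gdf.drop_duplicates(subset=["geometry"], keep="last")
        # Attach a tract ID to each point, then keep one row per tract.
        gdf_tracts = add_tracts_for_geometries(gdf)
        gdf_tracts = gdf_tracts.drop_duplicates(self.GEOID_TRACT_FIELD_NAME)
        # Each remaining tract has a mine, so the flag is always True.
        gdf_tracts[self.AML_BOOLEAN] = True
        self.output_df = gdf_tracts[self.COLUMNS_TO_KEEP]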