Add FUDS ETL (#1817)

* Add spatial join method (#1871)

Since we'll need to figure out the tracts for a large number of points
in future tickets, add a utility to handle grabbing the tract geometries
and adding tract data to a point dataset.
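
A minimal sketch of the kind of point-to-tract spatial join described
above, assuming geopandas. The geometries, the `site` column, and the
`GEOID10_TRACT` column name are hypothetical, and this is not the utility
added in this commit:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical tract polygons; real tract geometries would come from census data.
tracts = gpd.GeoDataFrame(
    {"GEOID10_TRACT": ["01001000100", "01001000200"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Hypothetical point dataset; site "c" falls outside every tract.
points = gpd.GeoDataFrame(
    {"site": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# An inner join keeps only the points that land inside some tract polygon.
joined = gpd.sjoin(points, tracts, how="inner", predicate="within")
site_to_tract = dict(zip(joined["site"], joined["GEOID10_TRACT"]))
```

With an inner join, points that match no tract are dropped, so the joined
frame can be shorter than the input.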

* Add FUDS, also jupyter lab (#1871)

* Add YAML configs for FUDS (#1871)

* Allow input geoid to be optional (#1871)

* Add FUDS ETL, tests, test-data notebook (#1871)

This adds the ETL class for Formerly Used Defense Sites (FUDS). This is
different from most other ETLs since these FUDS are not provided by
tract, but instead by geographic point, so we need to assign FUDS to
tracts and then do calculations from there.

* Floats -> Ints, as I intended (#1871)

* Floats -> Ints, as I intended (#1871)

* Formatting fixes (#1871)

* Add test false positive GEOIDs (#1871)

* Add gdal binaries (#1871)

* Refactor pandas code to be more idiomatic (#1871)

Per Emma, the more pandas-y way of doing my counts is to use np.where to
add the values I need, then groupby and size. It is definitely more
compact, and I think also more correct!
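
A rough sketch of that np.where, groupby, size pattern on toy data. The
`tract` and `status` column names and the rows are hypothetical, and this
is an illustration rather than the code in this commit:

```python
import numpy as np
import pandas as pd

# Hypothetical sites that have already been assigned to tracts.
sites = pd.DataFrame(
    {
        "tract": ["A", "A", "B", "C", "C", "C"],
        "ELIGIBILITY": [
            "Eligible", "Ineligible", "Eligible",
            "Ineligible", "Eligible", "Eligible",
        ],
        "HASPROJECTS": ["Yes", "No", "No", "No", "Yes", "Yes"],
    }
)

# np.where labels each row; a site counts as eligible only if it is
# marked Eligible and also has projects.
sites["status"] = np.where(
    (sites["ELIGIBILITY"] == "Eligible") & (sites["HASPROJECTS"] == "Yes"),
    "eligible",
    "ineligible",
)

# groupby + size counts rows per (tract, status); unstack pivots the
# status labels into columns and fills missing combinations with 0.
counts = sites.groupby(["tract", "status"]).size().unstack(fill_value=0)
```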

* Update configs per Emma suggestions (#1871)

* Type fixed! (#1871)

* Remove spurious import from vscode (#1871)

* Snapshot update after changing col name (#1871)

* Move up GDAL (#1871)

* Adjust geojson strategy (#1871)

* Try running census separately first (#1871)

* Fix import order (#1871)

* Cleanup cache strategy (#1871)

* Download census data from S3 instead of re-calculating (#1871)

* Clarify pandas code per Emma (#1871)
Matt Bowen 2022-08-16 13:28:39 -04:00 committed by GitHub
commit d5fbb802e8
22 changed files with 2534 additions and 416 deletions

@@ -0,0 +1,98 @@
from pathlib import Path
import geopandas as gpd
import pandas as pd
import numpy as np

from data_pipeline.etl.base import ExtractTransformLoad, ValidGeoLevel
from data_pipeline.utils import get_module_logger, download_file_from_url
from data_pipeline.etl.sources.geo_utils import add_tracts_for_geometries

logger = get_module_logger(__name__)


class USArmyFUDS(ExtractTransformLoad):
    """The Formerly Used Defense Sites (FUDS)"""

    NAME: str = "us_army_fuds"

    ELIGIBLE_FUDS_COUNT_FIELD_NAME: str
    INELIGIBLE_FUDS_COUNT_FIELD_NAME: str
    ELIGIBLE_FUDS_BINARY_FIELD_NAME: str
    GEO_LEVEL: ValidGeoLevel = ValidGeoLevel.CENSUS_TRACT

    def __init__(self):
        self.FILE_URL: str = (
            "https://opendata.arcgis.com/api/v3/datasets/"
            "3f8354667d5b4b1b8ad7a6e00c3cf3b1_1/downloads/"
            "data?format=geojson&spatialRefId=4326&where=1%3D1"
        )

        self.OUTPUT_PATH: Path = self.DATA_PATH / "dataset" / "us_army_fuds"

        # Constants for output
        self.COLUMNS_TO_KEEP = [
            self.GEOID_TRACT_FIELD_NAME,
            self.ELIGIBLE_FUDS_COUNT_FIELD_NAME,
            self.INELIGIBLE_FUDS_COUNT_FIELD_NAME,
            self.ELIGIBLE_FUDS_BINARY_FIELD_NAME,
        ]
        self.DOWNLOAD_FILE_NAME = self.get_tmp_path() / "fuds.geojson"

        self.raw_df: gpd.GeoDataFrame
        self.output_df: pd.DataFrame

    def extract(self) -> None:
        logger.info("Starting FUDS data download.")

        download_file_from_url(
            file_url=self.FILE_URL,
            download_file_name=self.DOWNLOAD_FILE_NAME,
            verify=True,
        )

    def transform(self) -> None:
        logger.info("Starting FUDS transform.")
        # before we try to do any transformation, get the tract data
        # so it's loaded and the census ETL is out of scope

        logger.info("Loading FUDS data as GeoDataFrame for transform")
        raw_df = gpd.read_file(
            filename=self.DOWNLOAD_FILE_NAME,
            low_memory=False,
        )

        # Note that df_with_tracts will not be exactly the same length as
        # raw_df because some bases lack coordinates or have coordinates in
        # Mexico or in the ocean. See the following dataframe:
        # raw_df[~raw_df.OBJECTID.isin(df_with_tracts.OBJECTID)][
        #     ['OBJECTID', 'CLOSESTCITY', 'COUNTY', 'ELIGIBILITY',
        #      'STATE', 'LATITUDE', 'LONGITUDE']]
        logger.debug("Adding tracts to FUDS data")
        df_with_tracts = add_tracts_for_geometries(raw_df)
        self.output_df = pd.DataFrame()

        # This creates a boolean series directly, without needing np.where
        df_with_tracts["tmp_fuds"] = (
            df_with_tracts.ELIGIBILITY == "Eligible"
        ) & (df_with_tracts.HASPROJECTS == "Yes")

        self.output_df[
            self.ELIGIBLE_FUDS_COUNT_FIELD_NAME
        ] = df_with_tracts.groupby(self.GEOID_TRACT_FIELD_NAME)[
            "tmp_fuds"
        ].sum()

        self.output_df[self.INELIGIBLE_FUDS_COUNT_FIELD_NAME] = (
            df_with_tracts[~df_with_tracts.tmp_fuds]
            .groupby(self.GEOID_TRACT_FIELD_NAME)
            .size()
        )
        self.output_df = (
            self.output_df.fillna(0).astype("int64").sort_index().reset_index()
        )

        self.output_df[self.ELIGIBLE_FUDS_BINARY_FIELD_NAME] = np.where(
            self.output_df[self.ELIGIBLE_FUDS_COUNT_FIELD_NAME] > 0.0,
            True,
            False,
        )