Add FUDS ETL (#1817)
* Add spatial join method (#1871)
Since we'll need to figure out the tracts for a large number of points
in future tickets, add a utility to handle grabbing the tract geometries
and adding tract data to a point dataset.
* Add FUDS, also jupyter lab (#1871)
* Add YAML configs for FUDS (#1871)
* Allow input geoid to be optional (#1871)
* Add FUDS ETL, tests, test-data notebook (#1871)
This adds the ETL class for Formerly Used Defense Sites (FUDS). This is
different from most other ETLs since these FUDS are not provided by
tract, but instead by geographic point, so we need to assign FUDS to
tracts and then do calculations from there.
* Floats -> Ints, as I intended (#1871)
* Floats -> Ints, as I intended (#1871)
* Formatting fixes (#1871)
* Add test false positive GEOIDs (#1871)
* Add gdal binaries (#1871)
* Refactor pandas code to be more idiomatic (#1871)
Per Emma, the more pandas-y way of doing my counts is using np.where to
add the values I need, then groupby and size. It is definitely more
compact, and also I think more correct!
* Update configs per Emma suggestions (#1871)
* Type fixed! (#1871)
* Remove spurious import from vscode (#1871)
* Snapshot update after changing col name (#1871)
* Move up GDAL (#1871)
* Adjust geojson strategy (#1871)
* Try running census separately first (#1871)
* Fix import order (#1871)
* Cleanup cache strategy (#1871)
* Download census data from S3 instead of re-calculating (#1871)
* Clarify pandas code per Emma (#1871)
2022-08-16 13:28:39 -04:00
"""Utilities for turning geographies into tracts, using census data"""
from functools import lru_cache
from pathlib import Path
from typing import Optional

import geopandas as gpd
from data_pipeline.etl.sources.tribal.etl import TribalETL
from data_pipeline.utils import get_module_logger

from .census.etl import CensusETL

logger = get_module_logger(__name__)


@lru_cache()
def get_tract_geojson(
    _tract_data_path: Optional[Path] = None,
) -> gpd.GeoDataFrame:
    logger.info("Loading tract geometry data from census ETL")
    GEOJSON_PATH = _tract_data_path
    if GEOJSON_PATH is None:
        GEOJSON_PATH = CensusETL.NATIONAL_TRACT_JSON_PATH
        if not GEOJSON_PATH.exists():
            logger.debug("Census data has not been computed, running")
            census_etl = CensusETL()
            census_etl.extract()
            census_etl.transform()
            census_etl.load()
    tract_data = gpd.read_file(
        GEOJSON_PATH,
        include_fields=["GEOID10"],
    )
    tract_data = tract_data.rename(
        columns={"GEOID10": "GEOID10_TRACT"}, errors="raise"
    )
    return tract_data


@lru_cache()
def get_tribal_geojson(
    _tribal_data_path: Optional[Path] = None,
) -> gpd.GeoDataFrame:
    logger.info("Loading Tribal geometry data from Tribal ETL")
    GEOJSON_PATH = _tribal_data_path
    if GEOJSON_PATH is None:
        GEOJSON_PATH = TribalETL().NATIONAL_TRIBAL_GEOJSON_PATH
        if not GEOJSON_PATH.exists():
            logger.debug("Tribal data has not been computed, running")
            tribal_etl = TribalETL()
            tribal_etl.extract()
            tribal_etl.transform()
            tribal_etl.load()
    tribal_data = gpd.read_file(
        GEOJSON_PATH,
    )
    return tribal_data
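

# Illustrative sketch only, not part of this module: both loaders above follow
# the same "use the given path, else fall back to a default and build it if it
# is missing, then cache the result" pattern. The stdlib-only example below
# shows that pattern in isolation. DEFAULT_PATH, build_default_file and
# load_records are hypothetical names, and a small JSON file stands in for the
# real GeoJSON output of the ETL classes.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path
from typing import Optional

# Hypothetical default location, standing in for a path such as
# CensusETL.NATIONAL_TRACT_JSON_PATH.
DEFAULT_PATH = Path(tempfile.mkdtemp()) / "records.json"
BUILD_COUNT = 0  # counts how often the expensive build step runs


def build_default_file() -> None:
    """Stand-in for the extract/transform/load sequence."""
    global BUILD_COUNT
    BUILD_COUNT += 1
    DEFAULT_PATH.write_text(json.dumps([{"GEOID10": "01001020100"}]))


@lru_cache()
def load_records(data_path: Optional[Path] = None) -> tuple:
    path = data_path
    if path is None:
        path = DEFAULT_PATH
        # Only the default file is rebuilt; a caller-supplied path is used as-is.
        if not path.exists():
            build_default_file()
    # Return an immutable tuple so callers cannot alter the cached value.
    return tuple(record["GEOID10"] for record in json.loads(path.read_text()))


first = load_records()
second = load_records()  # served from the cache; the file is not re-read
```

# Because lru_cache hands back the same object on every hit, a caller that
# mutates a cached mutable result (such as a GeoDataFrame) in place would
# affect later callers too; copying before mutating is one way to avoid that.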


def add_tracts_for_geometries(
    df: gpd.GeoDataFrame, tract_data: Optional[gpd.GeoDataFrame] = None
) -> gpd.GeoDataFrame:
    """Adds tract-geoids to dataframe df that contains spatial geometries

    Depends on CensusETL for the geodata to do its conversion

    Args:
        df (GeoDataFrame): a geopandas GeoDataFrame with a point geometry column
        tract_data (GeoDataFrame): optional override to directly pass a
            geodataframe of the tract boundaries. Also helps simplify testing.

    Returns:
        GeoDataFrame: the above dataframe, with an additional GEOID10_TRACT column that
            maps the points in DF to census tracts and a geometry column for later
            spatial analysis
    """
    logger.debug("Appending tract data to dataframe")

    if tract_data is None:
        tract_data = get_tract_geojson()
    else:
        logger.debug("Using existing tract data.")

    assert (
        tract_data.crs == df.crs
    ), f"Dataframe must be projected to {tract_data.crs}"
    df = gpd.sjoin(
        df,
        tract_data[["GEOID10_TRACT", "geometry"]],
        how="inner",
        # "op" is the older keyword; geopandas 0.10+ calls it "predicate".
        op="intersects",
    )
    return df
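

# Illustrative sketch only, not part of this module: a toy version of the
# point-to-tract join performed by add_tracts_for_geometries, runnable without
# the census data. The two square "tracts", the site names and the GEOID values
# are made up for the example. It assumes a geopandas release that accepts the
# "predicate" keyword (0.10 or later).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up square tracts that do not touch each other.
tracts = gpd.GeoDataFrame(
    {"GEOID10_TRACT": ["TRACT_A", "TRACT_B"]},
    geometry=[box(0, 0, 1, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

# Three made-up points: one per tract, plus one outside both.
points = gpd.GeoDataFrame(
    {"site": ["inside_a", "inside_b", "outside"]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Same precondition as the function above: both frames share one CRS.
assert tracts.crs == points.crs

joined = gpd.sjoin(
    points,
    tracts[["GEOID10_TRACT", "geometry"]],
    how="inner",
    predicate="intersects",
)
```

# With how="inner", points that fall in no tract are dropped from the result,
# and a point lying exactly on a boundary shared by two tracts would match
# both and so appear twice.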