Add Michigan EJ Screen into data-pipeline's ETL and provide automated scoring and statistics outputs (#1091)

* draft wip
* initial commit
* clear output from notebook
* revert to 65ceb7900f
* draft wip
* initial commit
* clear output from notebook
* revert to 65ceb7900f
* make michigan prefix for readable
* standardize Michigan names and move all constants from class into field names module
* standardize Michigan names and move all constants from class into field names module
* include only pertinent columns for scoring comparison tool
* michigan EJSCREEN standardization
* final PR feedback
* added exposition and summary of Michigan EJSCREEN
* added exposition and summary of Michigan EJSCREEN
* fix typo

Co-authored-by: Saran Ahluwalia <ahlusar.ahluwalia@gmail.com>
2025-02-22 09:41:26 -08:00 · 2021-12-31 15:38:52 -05:00 · 2021-12-31 15:38:52 -05:00 · a4137fdc98
commit a4137fdc98
parent 24f8eb93c4
6 changed files with 142 additions and 3 deletions
--- a/data/data-pipeline/data_pipeline/etl/constants.py
+++ b/data/data-pipeline/data_pipeline/etl/constants.py
@@ -99,6 +99,11 @@ DATASET_LIST = [
        "module_dir": "tree_equity_score",
        "class_name": "TreeEquityScoreETL",
    },
+    {
+        "name": "michigan_ejscreen",
+        "module_dir": "michigan_ejscreen",
+        "class_name": "MichiganEnviroScreenETL",
+    },
 ]
 CENSUS_INFO = {
    "name": "census",
--- a/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/README.md
+++ b/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/README.md
@@ -0,0 +1,32 @@
+# Michigan EJSCREEN
+
+The Michigan EJSCREEN description and publication can be found [here](https://deepblue.lib.umich.edu/bitstream/handle/2027.42/149105/AssessingtheStateofEnvironmentalJusticeinMichigan_344.pdf).
+
+
+#### Some notes about the input source data column fields:
+
+Two columns from the input data are used, `EJ_Score_Cal_Min` and `Pct_CalMin`, both of which are referenced in the source codebase. To our knowledge, these columns reflect the authors' adoption and comparative quantitative analysis of two different approaches. "Cal" refers to CalEPA's CalEnviroScreen, which omits racial and ethnic data. "Min" refers to the Minnesota Pollution Control Agency’s (MPCA) approach, which includes this data. Please see pages 37-39 in the above reference for further details. Briefly, the authors combined CalEnviroScreen's methodology with the MPCA's methodology. The scores and percentile rankings in the input data sheet are the same as those in Appendix I of the cited report and in the latest version of the mapping [tool](https://www.arcgis.com/apps/webappviewer/index.html?id=dc4f0647dda34959963488d3f519fd24).
+
+#### Additional information on the adoption of the methodology from CalEnviroScreen and MPCA
+
+Both CalEPA's CalEnviroScreen and the Minnesota Pollution Control Agency’s (MPCA) methodology are adopted, both for comparative purposes and to identify areas of concern. The latter, in particular, is used to identify tribal areas. According to the authors, to make permitting decisions, MPCA assesses whether the community, measured at the census tract level, meets at least one of the following criteria:
+
+* The non-white share of the population is at least 50%
+* "More than 40% of the households have a household income of less than 185% of the federal
+poverty level (FPL)"
+* The facility is within the boundaries of a “tribal community” (MPCA 2015).
+
+Furthermore, the authors state that the MPCA methodology included data on tribal community boundaries, as defined by the US Census Bureau, and data on poverty, race, and ethnicity. However, the authors also note that the MPCA's methodology does not rank any census tracts.
+
+In addition, although the CalEPA does not analyze data on race and ethnicity in CalEnviroScreen, the researchers incorporated race and ethnicity data in their assessment of environmental justice in Michigan. To justify the incorporation of race and ethnic data, the team compared the tract rankings with and without the data.
+
+A Spearman's rank-order correlation was calculated for the 2,741 census tracts within Michigan with the two variables being environmental justice scores using the CalEPA methodology 1) without racial and ethnic data and 2) with racial and ethnic data. These scores were then ranked and the Spearman rank-order correlation was calculated. These statistics are not included in the output of this ETL process. Please see Chapter 5 and Chapter 6 for further details.
+
+Finally, please see pages 104-106 for the justification for using the upper quartile to identify communities in Michigan with the potential for environmental justice concerns. The authors also note that CalEPA designates the top 25% scoring tracts as "disadvantaged communities".
+
+Sources:
+
+* Minnesota Pollution Control Agency. (2015, December 15). Environmental Justice Framework Report.
+Retrieved from https://www.pca.state.mn.us/sites/default/files/p-gen5-05.pdf.
+
+* Faust, J., L. August, K. Bangia, V. Galaviz, J. Leichty, S. Prasad… and L. Zeise. (2017, January). Update to the California Communities Environmental Health Screening Tool CalEnviroScreen 3.0. Retrieved from OEHHA website: https://oehha.ca.gov/media/downloads/calenviroscreen/report/ces3report.pdf
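
The Spearman comparison described in the README is not computed by this ETL. As a minimal sketch of that statistic, assuming hypothetical column names and made-up scores for five tracts (none of them taken from the report), the rank-then-correlate steps could look like this:

```python
import pandas as pd

# Made-up scores for five tracts; column names are hypothetical.
scores = pd.DataFrame(
    {
        "score_without_race": [10.0, 20.0, 30.0, 40.0, 50.0],
        "score_with_race": [12.0, 18.0, 35.0, 33.0, 60.0],
    }
)


def spearman_rank_correlation(a: pd.Series, b: pd.Series) -> float:
    """Rank both series (ties share the average rank), then take the
    Pearson correlation of the ranks, which is Spearman's coefficient."""
    return a.rank(method="average").corr(b.rank(method="average"))


rho = spearman_rank_correlation(
    scores["score_without_race"], scores["score_with_race"]
)
print(f"Spearman rank-order correlation: {rho:.2f}")
```

On the real data, the two inputs would be the tract-level scores computed with and without the racial and ethnic data.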
--- a/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/__init__.py
+++ b/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/__init__.py
--- a/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/etl.py
+++ b/data/data-pipeline/data_pipeline/etl/sources/michigan_ejscreen/etl.py
@@ -0,0 +1,69 @@
+import pandas as pd
+
+from data_pipeline.etl.base import ExtractTransformLoad
+from data_pipeline.utils import get_module_logger
+from data_pipeline.score import field_names
+from data_pipeline.config import settings
+
+logger = get_module_logger(__name__)
+
+
+class MichiganEnviroScreenETL(ExtractTransformLoad):
+    """Michigan EJ Screen class that ingests dataset represented
+    here: https://www.arcgis.com/apps/webappviewer/index.html?id=dc4f0647dda34959963488d3f519fd24
+    This class ingests the data presented in "Assessing the State of Environmental
+    Justice in Michigan." Please see the README in this module for further details.
+    """
+
+    def __init__(self):
+        self.MICHIGAN_EJSCREEN_S3_URL = (
+            settings.AWS_JUSTICE40_DATASOURCES_URL
+            + "/michigan_ejscore_12212021.csv"
+        )
+
+        self.CSV_PATH = self.DATA_PATH / "dataset" / "michigan_ejscreen"
+        self.MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_THRESHOLD: float = 0.75
+
+        self.COLUMNS_TO_KEEP = [
+            self.GEOID_TRACT_FIELD_NAME,
+            field_names.MICHIGAN_EJSCREEN_SCORE_FIELD,
+            field_names.MICHIGAN_EJSCREEN_PERCENTILE_FIELD,
+            field_names.MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_FIELD,
+        ]
+
+        self.df: pd.DataFrame
+
+    def extract(self) -> None:
+        logger.info("Downloading Michigan EJSCREEN Data")
+        self.df = pd.read_csv(
+            filepath_or_buffer=self.MICHIGAN_EJSCREEN_S3_URL,
+            dtype={"GEO_ID": "string"},
+            low_memory=False,
+        )
+
+    def transform(self) -> None:
+        logger.info("Transforming Michigan EJSCREEN Data")
+
+        self.df.rename(
+            columns={
+                "GEO_ID": self.GEOID_TRACT_FIELD_NAME,
+                "EJ_Score_Cal_Min": field_names.MICHIGAN_EJSCREEN_SCORE_FIELD,
+                "Pct_CalMin": field_names.MICHIGAN_EJSCREEN_PERCENTILE_FIELD,
+            },
+            inplace=True,
+        )
+        # Flag tracts in the top quartile as priority communities.
+        # Please see pp. 104-109 of the source:
+        # https://deepblue.lib.umich.edu/bitstream/handle/2027.42/149105/AssessingtheStateofEnvironmentalJusticeinMichigan_344.pdf
+        self.df[field_names.MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_FIELD] = (
+            self.df[field_names.MICHIGAN_EJSCREEN_PERCENTILE_FIELD]
+            >= self.MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_THRESHOLD
+        )
+
+    def load(self) -> None:
+        logger.info("Saving Michigan Environmental Screening Tool to CSV")
+        # Write the Michigan tract-level CSV
+        self.CSV_PATH.mkdir(parents=True, exist_ok=True)
+        self.df[self.COLUMNS_TO_KEEP].to_csv(
+            self.CSV_PATH / "michigan_ejscreen.csv", index=False
+        )
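
The transform step above reduces to one vectorized comparison. A minimal sketch on made-up rows, assuming `Pct_CalMin` is expressed on a 0-1 scale (as the 0.75 threshold suggests) and using hypothetical tract IDs and a hypothetical output column name:

```python
import pandas as pd

# Same value as MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_THRESHOLD in the ETL.
PRIORITY_THRESHOLD = 0.75

# Made-up rows; the tract IDs and percentiles are illustrative only.
df = pd.DataFrame(
    {
        "GEO_ID": ["26001000100", "26001000200", "26001000300", "26001000400"],
        "Pct_CalMin": [0.10, 0.75, 0.90, None],
    }
).astype({"GEO_ID": "string"})

# Tracts at or above the threshold are flagged as priority communities.
df["is_priority"] = df["Pct_CalMin"] >= PRIORITY_THRESHOLD

print(df)
```

Because the comparison is `>=`, a tract exactly at 0.75 is flagged. A tract with a missing percentile compares as `False`, so it is reported as not prioritized rather than as unknown.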
--- a/data/data-pipeline/data_pipeline/ipython/scoring_comparison.ipynb
+++ b/data/data-pipeline/data_pipeline/ipython/scoring_comparison.ipynb
@@ -295,6 +295,25 @@
    "energy_definition_alternative_draft_df"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe4a2939",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load Michigan EJSCREEN\n",
+    "michigan_ejscreen_data_path = (\n",
+    "    DATA_DIR / \"dataset\" / \"michigan_ejscreen\" / \"michigan_ejscreen.csv\"\n",
+    ")\n",
+    "michigan_ejscreen_df = pd.read_csv(\n",
+    "    michigan_ejscreen_data_path,\n",
+    "    dtype={ExtractTransformLoad.GEOID_TRACT_FIELD_NAME: \"string\"},\n",
+    ")\n",
+    "\n",
+    "michigan_ejscreen_df.head()"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -311,6 +330,7 @@
    "    persistent_poverty_df,\n",
    "    mapping_inequality_df,\n",
    "    energy_definition_alternative_draft_df,\n",
+    "    michigan_ejscreen_df\n",
    "]\n",
    "\n",
    "merged_df = functools.reduce(\n",
@@ -456,6 +476,14 @@
    "            priority_communities_field=field_names.ENERGY_RELATED_COMMUNITIES_DEFINITION_ALTERNATIVE,\n",
    "            other_census_tract_fields_to_keep=[],\n",
    "        ),\n",
+    "        Index(\n",
+    "            method_name=\"Michigan EJSCREEN\",\n",
+    "            priority_communities_field=field_names.MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_FIELD,\n",
+    "            other_census_tract_fields_to_keep=[\n",
+    "                field_names.MICHIGAN_EJSCREEN_SCORE_FIELD,\n",
+    "                field_names.MICHIGAN_EJSCREEN_PERCENTILE_FIELD,\n",
+    "            ],\n",
+    "        ),        \n",
    "    ]\n",
    "    # Insert indices for each of the HOLC factors.\n",
    "    # Note: since these involve no renaming, we write them using list comprehension.\n",
@@ -1298,7 +1326,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
@@ -1312,7 +1340,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.6"
+   "version": "3.6.2"
  }
 },
 "nbformat": 4,
--- a/data/data-pipeline/data_pipeline/score/field_names.py
+++ b/data/data-pipeline/data_pipeline/score/field_names.py
@@ -206,13 +206,18 @@ EJSCREEN_AREAS_OF_CONCERN_STATE_90TH_PERCENTILE_COMMUNITIES_FIELD = (
 EJSCREEN_AREAS_OF_CONCERN_STATE_95TH_PERCENTILE_COMMUNITIES_FIELD = (
    "EJSCREEN Areas of Concern, State, 95th percentile (communities)"
 )
-
 # Mapping inequality data.
 HOLC_GRADE_D_TRACT_PERCENT_FIELD: str = "Percent of tract that is HOLC Grade D"
 HOLC_GRADE_D_TRACT_20_PERCENT_FIELD: str = "Tract is >20% HOLC Grade D"
 HOLC_GRADE_D_TRACT_50_PERCENT_FIELD: str = "Tract is >50% HOLC Grade D"
 HOLC_GRADE_D_TRACT_75_PERCENT_FIELD: str = "Tract is >75% HOLC Grade D"

+# Michigan Environmental Screening Tool ETL Constants
+MICHIGAN_EJSCREEN_SCORE_FIELD: str = "Michigan EJSCREEN Score Field"
+MICHIGAN_EJSCREEN_PERCENTILE_FIELD: str = "Michigan EJSCREEN Percentile Field"
+MICHIGAN_EJSCREEN_PRIORITY_COMMUNITY_FIELD: str = (
+    "Michigan EJSCREEN Priority Community"
+)

 # Child Opportunity Index data
 # Summer days with maximum temperature above 90F.