mirror of
https://github.com/DOI-DO/j40-cejst-2.git
synced 2025-02-22 09:41:26 -08:00
Updating comparison tool to be easier for pairwise comparisons (#1400)
Creating pairwise comparison tool to compare two lists of prioritized tracts to each other.
This commit is contained in:
parent 78d8b5ec3b
commit cb963cff5f
8 changed files with 1129 additions and 1 deletion
70
data/data-pipeline/data_pipeline/comparison_tool/README.md
Normal file
@ -0,0 +1,70 @@
# Score comparisons

## Comparison tool

TODO once the comparison tool has been refactored.

## Single comparator score comparisons

The goal of this directory is to create interactive 1-to-1 comparisons between a comparator's DAC (disadvantaged community) list and the CEJST list. That means that, when this tool is run, you will have comparisons of two true/false classifications.

This uses `papermill` to parameterize a jupyter notebook, and is meant to be a *lightweight* entry into this analysis. The tool as a whole creates a series of comparisons against CEJST data -- but after it runs, you'll have the notebook to re-run and add to if you are so inclined.

To run:

` $ python src/run_tract_comparison.py --template_notebook=TEMPLATE.ipynb --parameter_yaml=PARAMETERS.yaml`

For example, if I am running this from the `comparison_tool` directory within the `justice40-tool` repo, I would run something like:

` $ poetry run python3 src/run_tract_comparison.py --template_notebook=src/tract_comparison__template.ipynb --parameter_yaml=src/comparison_configs.yml`

__What is the template notebook?__

This gets filled in by the parameters in the yaml file and then executed. Even after execution, it remains runnable and interactive. You do not need to change anything in it, with one caveat: depending on how you run `jupyter lab`, you might need to add `import sys` and then `sys.path.append("../../../../")` to run the notebook live, as sketched below.
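A minimal sketch of that workaround, placed at the top of the notebook before the package imports (the number of `../` segments depends on the directory `jupyter lab` was launched from):

```python
# Import-path workaround for running the template notebook interactively,
# when the data_pipeline package is not importable from jupyter's directory.
import sys

sys.path.append("../../../../")

# The template notebook's package imports should now resolve:
from data_pipeline.score import field_names
from data_pipeline.comparison_tool.src import utils
```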
__What is the output?__

When you run this, you'll get back three files:

1. The filled-in parameter notebook that you can run live, with the date appended. This means that if you run the script twice in one day, the notebook will get overwritten, but if you run the script on two consecutive days, you will get two separate notebooks saved.
2. A graph that shows the relative average of the specified `ADDITIONAL_DEMO_COLUMNS` and `DEMOGRAPHIC_COLUMNS`, segmented by CEJST and the comparator you include. This gets overwritten with every run.
3. An Excel file with many tabs that has summary statistics from the comparison of the two classifications (CEJST and the comparator).

In more detail, the Excel file contains the following tabs:

- `Summary`: out of all tracts (including tracts with missing values, if you keep them), how many tracts are classified TRUE/FALSE by the comparator and CEJST, by population and by count.
- `Tract level stats`: overall, for all tracts classified as TRUE for CEJST and the comparator, how do the demographics of those tracts compare? Here, we think of "demographics" loosely -- whatever columns you include in the parameter yaml will show up. For example, if the additional demographic columns in my yaml included `percent of households in linguistic isolation`, I'd see the average percent of households in linguistic isolation for the comparator-identified tracts (where the comparator is TRUE) and for CEJST-identified tracts.
- `Population level stats`: the same demographic variables, looking at population within tracts. Since not all tracts have the same number of people, this will be slightly different. This also includes segments of the population -- where you can investigate the disjoint sets of tracts identified by a single method (e.g., you could specifically look at tracts identified by CEJST but not by the comparator).
- `Segmented tract level stats`: a segmented version of the tract-level stats.
- (Optional -- requires a non-disjoint set of tracts) `Comparator and CEJST overlap`: shows the overlap from the vantage point of the comparator ("what share of the tracts that the comparator identifies are also identified in CEJST?"). Also lists the states the comparator has information for.

__What parameters go in the yaml file?__

- ADDITIONAL_DEMO_COLUMNS: list, demographic columns from the score file that you want to run analyses on. All columns here will appear in the Excel file and the graph.
- COMPARATOR_COLUMN: the name of the column that has a boolean (*must be TRUE / FALSE*) for whether or not the tract is prioritized. You provide this!
- COMPARATOR_FILE: the path to the comparator data (a flat csv; see "Cleaning data" below).
- DEMOGRAPHIC_COLUMNS: list, demographic columns from another file that you'd like to include in the analysis.
- DEMOGRAPHIC_FILE: the file that has the census demographic information. This name suggests, in theory, that you've run our pipeline and are using the ACS output -- but any file with `GEOID10_TRACT` as the census tract ID field will work.
- OUTPUT_DATA_PATH: where you want the output to be. Convention: `output` plus a folder named after the data source. Note that the folder name of the data source gets read as the "data name" for some of the outputs.
- SCORE_COLUMN: the name of the boolean CEJST score column.
- SCORE_FILE: the full CEJST score file. This requires that you've run our pipeline, but in theory, the downloaded file should also work, provided the columns are named appropriately.
- TOTAL_POPULATION_COLUMN: the column name for total population. We currently use `Total Population` in our pipeline.
- OTHER_COMPARATOR_COLUMNS: list, other columns from the comparator file you might want to read in for analysis. This is an optional argument. You will keep these columns to perform analysis once you have the notebook -- they will not be included in the Excel printout.
- KEEP_MISSING_VALUES_FOR_SEGMENTATION: whether or not to keep tracts with missing values when segmenting. `true` keeps missing values as their own segment.

A complete example, `src/donut_hole_dacs.yaml`, is included in this commit.
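For orientation, here is a condensed sketch of how `src/run_tract_comparison.py` (added in this commit) consumes these parameters. The output notebook name is simplified here; the real script timestamps it:

```python
# Condensed sketch of the runner: load the yaml, resolve paths, and hand the
# dict to papermill, which injects each key as a variable in the template
# notebook's "parameters" cell.
import os

import papermill as pm
import yaml

with open("src/donut_hole_dacs.yaml", "r", encoding="utf8") as param_stream:
    params = yaml.safe_load(param_stream)

# Every *_PATH / *_FILE value becomes absolute so the notebook can run from
# anywhere; OUTPUT_NAME is derived from the output folder's name.
params = {
    key: os.path.abspath(value) if ("PATH" in key or "FILE" in key) else value
    for key, value in params.items()
}
params["OUTPUT_NAME"] = params["OUTPUT_DATA_PATH"].split("/")[-1]

pm.execute_notebook(
    "src/tract_comparison__template.ipynb",
    os.path.join(params["OUTPUT_DATA_PATH"], "donut_hole_dac__notebook.ipynb"),  # simplified name
    parameters=params,
)
```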
__Cleaning data__

Comparator data should live in a flat csv, just like the CEJST data. Right now, each comparator has a folder in `comparison_tool/data` that contains a notebook to clean the data (this is because the data is often quirky, so live inspection is easier), the `raw` data, and the `clean` data. We can also point the `yaml` to an `ETL` output, for files in which there are multiple important columns, if you want to use one of the data sources the CEJST team has already included in the pipeline (which are already compatible with the tool).

When you make your own output for comparison, make sure to follow the steps below.

When you clean the data, it's important that you (see the sketch after this list):

1. Ensure the tract-level ID is named the same as the field name in score M (specified in `field_names`). Right now, this is `GEOID10_TRACT`.
2. Ensure the identification column is a `bool`.
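A minimal pandas sketch of those two steps, using hypothetical raw column names (`tract_id`, `is_prioritized`) and hypothetical paths:

```python
# Hypothetical cleaning sketch. Only the output column names matter to the
# tool: the tract ID must be GEOID10_TRACT, and the identification column
# (your COMPARATOR_COLUMN) must be a bool.
import pandas as pd

raw_df = pd.read_csv(
    "raw/my_comparator.csv",  # hypothetical path
    dtype={"tract_id": str},  # read tract IDs as strings to keep leading zeros
)

clean_df = raw_df.rename(columns={"tract_id": "GEOID10_TRACT"})
clean_df["is_prioritized"] = clean_df["is_prioritized"].astype(bool)

clean_df.to_csv("clean/my_comparator.csv", index=False)
```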
You will provide the path to the comparator data in the parameter yaml file (`COMPARATOR_FILE`).
__How to use the shell script__

We have also included a shell script, `run_all_comparisons.sh`. This script includes all of the commands that we have run to generate pairwise comparisons.

To run: `$ bash run_all_comparisons.sh`

To add to it: create a new line and include the command line for each notebook run.
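For example, to add a run for a hypothetical parameter file `src/my_comparator.yaml`, you would append the line:

`poetry run python3 src/run_tract_comparison.py --template_notebook=src/tract_comparison__template.ipynb --parameter_yaml=src/my_comparator.yaml`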
3
data/data-pipeline/data_pipeline/comparison_tool/run_all_comparisons.sh
Executable file
@ -0,0 +1,3 @@
#! /bin/bash

poetry run python3 src/run_tract_comparison.py --template_notebook=src/tract_comparison__template.ipynb --parameter_yaml=src/donut_hole_dacs.yaml
24
data/data-pipeline/data_pipeline/comparison_tool/src/donut_hole_dacs.yaml
Normal file

@ -0,0 +1,24 @@
ADDITIONAL_DEMO_COLUMNS:
  - Urban Heuristic Flag
  - Percent of individuals below 200% Federal Poverty Line
  - Percent individuals age 25 or over with less than high school degree
  - Unemployment (percent)
  - Percent of households in linguistic isolation
COMPARATOR_COLUMN: donut_hole_dac_additional_constraints
COMPARATOR_FILE: data/donut_hole_dac/donut_hole_data.csv
DEMOGRAPHIC_COLUMNS:
  - Percent Black or African American alone
  - Percent American Indian and Alaska Native alone
  - Percent Asian alone
  - Percent Native Hawaiian and Other Pacific alone
  - Percent Two or more races
  - Percent Non-Hispanic White
  - Percent Hispanic or Latino
DEMOGRAPHIC_FILE: ../../data_pipeline/data/dataset/census_acs_2019/usa.csv
OUTPUT_DATA_PATH: output/donut_hole_dac
SCORE_FILE: ../../data_pipeline/data/score/csv/full/usa.csv
OTHER_COMPARATOR_COLUMNS:
  - donut_hole_dac
  - P200_PFS
  - HSEF
KEEP_MISSING_VALUES_FOR_SEGMENTATION: false
82
data/data-pipeline/data_pipeline/comparison_tool/src/run_tract_comparison.py
Normal file

@ -0,0 +1,82 @@
"""
This script is a 'starter' for parameterized work pertaining to generating comparisons.
Here, we only compare DAC lists that operate at the tract level. We can change/expand this as we move forward.

Why papermill? Papermill is an easy way to parameterize notebooks. While doing comparison work,
I often had to do a lot of one-off analysis in a notebook that would then get discarded. With parameterized notebooks,
we can run each type of analysis and then store it for posterity. The first template is just for agencies, but you could
imagine generating interactive outputs for many types of analysis, like demographic work.

To see more: https://buildmedia.readthedocs.org/media/pdf/papermill/latest/papermill.pdf

To run:
` $ python src/run_tract_comparison.py --template_notebook=TEMPLATE.ipynb --parameter_yaml=PARAMETERS.yaml`
"""

import os
import datetime
import argparse
import yaml

import papermill as pm


def _read_param_file(param_file: str) -> dict:
    """Reads params and enforces a few constraints:

    1. There's a defined output path
    2. All relative paths are converted to absolute paths
    """
    with open(param_file, "r", encoding="utf8") as param_stream:
        param_dict = yaml.safe_load(param_stream)
    assert (
        "OUTPUT_DATA_PATH" in param_dict
    ), "Error: you need to specify an output data path"
    # convert any paths to absolute paths
    updated_param_dict = {}
    for variable, value in param_dict.items():
        if ("PATH" in variable) or ("FILE" in variable):
            updated_param_dict[variable] = os.path.abspath(value)
        else:
            updated_param_dict[variable] = value
    # configure output name
    updated_param_dict["OUTPUT_NAME"] = updated_param_dict[
        "OUTPUT_DATA_PATH"
    ].split("/")[-1]

    return updated_param_dict


def _configure_output(updated_param_dict: dict, input_template: str) -> str:
    """Configures the output directory and creates the output path"""
    if not os.path.exists(updated_param_dict["OUTPUT_DATA_PATH"]):
        os.mkdir(updated_param_dict["OUTPUT_DATA_PATH"])

    output_notebook_path = os.path.join(
        updated_param_dict["OUTPUT_DATA_PATH"],
        updated_param_dict["OUTPUT_NAME"]
        + f"__{input_template.replace('template', datetime.datetime.now().strftime('%Y-%m-%d'))}",
    )
    return output_notebook_path
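# A worked example of the naming scheme above, with a hypothetical run date:
#
#     _configure_output(
#         {"OUTPUT_DATA_PATH": "output/donut_hole_dac", "OUTPUT_NAME": "donut_hole_dac"},
#         "tract_comparison__template.ipynb",
#     )
#     -> "output/donut_hole_dac/donut_hole_dac__tract_comparison__2022-04-01.ipynb"
#
# A second run on the same day overwrites this notebook; runs on different
# days are saved separately.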
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--template_notebook",
        help="Please specify which notebook to run",
        required=True,
    )
    parser.add_argument(
        "--parameter_yaml",
        help="Please specify which parameter file to use",
        required=True,
    )
    args = parser.parse_args()
    updated_param_dict = _read_param_file(args.parameter_yaml)
    notebook_name = args.template_notebook.split("/")[-1]
    output_notebook_path = _configure_output(updated_param_dict, notebook_name)
    pm.execute_notebook(
        args.template_notebook,
        output_notebook_path,
        parameters=updated_param_dict,
    )
430
data/data-pipeline/data_pipeline/comparison_tool/src/tract_comparison__template.ipynb
Normal file

@ -0,0 +1,430 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81788485-12d4-41f4-b5db-aa4874a32501",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "import datetime\n",
    "import sys\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from data_pipeline.score import field_names\n",
    "from data_pipeline.comparison_tool.src import utils\n",
    "\n",
    "pd.options.display.float_format = \"{:,.3f}\".format\n",
    "%load_ext lab_black"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2820ead7-981d-40e0-8a07-80f9f1c48b9c",
   "metadata": {},
   "source": [
    "# Comparator definition comparison\n",
    "\n",
    "This notebook answers a few questions:\n",
    "1. How many tracts are flagged and what's the size of overlap by comparator?\n",
    "2. What are the demographics of each set of tracts by \"category\" of score (CEJST but not comparator, comparator but not CEJST, CEJST and comparator)?\n",
    "3. What are the overall demographics of ALL comparator vs ALL CEJST?\n",
    "\n",
    "It produces a single Excel file of the stats listed, but is interactive even after run-time. This notebook focuses on 1:1 comparison. It can be pointed in the YAML to either a simple output (tract and boolean for highlight) or to the output from an ETL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2093584e-312b-44f7-87d7-6099d9fe000f",
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "## These are parameters and get overridden by the \"injected parameters\" cell below\n",
    "ADDITIONAL_DEMO_COLUMNS = []\n",
    "COMPARATOR_COLUMN = None\n",
    "COMPARATOR_FILE = None\n",
    "DEMOGRAPHIC_COLUMNS = []\n",
    "DEMOGRAPHIC_FILE = None\n",
    "OUTPUT_DATA_PATH = None\n",
    "SCORE_FILE = None\n",
    "OTHER_COMPARATOR_COLUMNS = None\n",
    "OUTPUT_NAME = None\n",
    "KEEP_MISSING_VALUES_FOR_SEGMENTATION = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd38aaaa",
   "metadata": {},
   "outputs": [],
   "source": [
    "## These are constants for all runs\n",
    "GEOID_COLUMN = field_names.GEOID_TRACT_FIELD\n",
    "SCORE_COLUMN = field_names.SCORE_M_COMMUNITIES\n",
    "TOTAL_POPULATION_COLUMN = field_names.TOTAL_POP_FIELD"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a3940be",
   "metadata": {},
   "source": [
    "__Date and time of last run__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f19a1e3-fe05-425d-9160-4d4e193fadf7",
   "metadata": {},
   "outputs": [],
   "source": [
    "datetime.datetime.now()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "269b0f06-edee-4fce-8755-a8e7fe92e340",
   "metadata": {},
   "source": [
    "__Configure output (autocreated)__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acace944-18fa-4b05-a581-8147e8e09299",
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTPUT_EXCEL = os.path.join(\n",
    "    OUTPUT_DATA_PATH,\n",
    "    f\"{OUTPUT_NAME}__{datetime.datetime.now().strftime('%Y-%m-%d')}.xlsx\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "806b8a7a",
   "metadata": {},
   "source": [
    "__Validate new data__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d089469",
   "metadata": {},
   "outputs": [],
   "source": [
    "utils.validate_new_data(\n",
    "    file_path=COMPARATOR_FILE, score_col=COMPARATOR_COLUMN\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5517217-f107-4fee-b4ca-be61f6b2b7c3",
   "metadata": {},
   "source": [
    "__Read in data__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdefecf2-100b-4ad2-b0bd-101c0bad5a92",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparator_cols = [COMPARATOR_COLUMN] + OTHER_COMPARATOR_COLUMNS if OTHER_COMPARATOR_COLUMNS else [COMPARATOR_COLUMN]\n",
    "\n",
    "#papermill_description=Loading_data\n",
    "joined_df = pd.concat(\n",
    "    [\n",
    "        utils.read_file(\n",
    "            file_path=SCORE_FILE,\n",
    "            columns=[TOTAL_POPULATION_COLUMN, SCORE_COLUMN] + ADDITIONAL_DEMO_COLUMNS,\n",
    "            geoid=GEOID_COLUMN,\n",
    "        ),\n",
    "        utils.read_file(\n",
    "            file_path=COMPARATOR_FILE,\n",
    "            columns=comparator_cols,\n",
    "            geoid=GEOID_COLUMN,\n",
    "        ),\n",
    "        utils.read_file(\n",
    "            file_path=DEMOGRAPHIC_FILE,\n",
    "            columns=DEMOGRAPHIC_COLUMNS,\n",
    "            geoid=GEOID_COLUMN,\n",
    "        ),\n",
    "    ],\n",
    "    axis=1,\n",
    ").reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da71a62c-9c2c-46bd-815f-37cdb9ca18a1",
   "metadata": {},
   "source": [
    "## High-level summary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfaf9912-1ffe-49e7-a38a-2f815b07826d",
   "metadata": {},
   "source": [
    "What *shares* of tracts and population highlighted by the comparator are covered by CEJST?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93d2990c-edc0-404b-aeea-b62e3ca7b308",
   "metadata": {},
   "outputs": [],
   "source": [
    "#papermill_description=Summary_stats\n",
    "population_df = utils.produce_summary_stats(\n",
    "    joined_df=joined_df,\n",
    "    comparator_column=COMPARATOR_COLUMN,\n",
    "    score_column=SCORE_COLUMN,\n",
    "    population_column=TOTAL_POPULATION_COLUMN,\n",
    "    geoid_column=GEOID_COLUMN,\n",
    ")\n",
    "population_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c945e85-874d-4f1f-921e-ff2f0de42767",
   "metadata": {},
   "source": [
    "## Tract-level stats\n",
    "\n",
    "First, this walks through overall stats for disadvantaged communities under the comparator definition and under the CEJST's definition. Next, this walks through stats by group (e.g., CEJST and not comparator). This is at the tract level, so these are averages across tracts; tracts are not population-weighted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "618d2bac-8800-4fd7-a75d-f4abaae30923",
   "metadata": {},
   "outputs": [],
   "source": [
    "#papermill_description=Tract_stats\n",
    "tract_level_by_identification_df = pd.concat(\n",
    "    [\n",
    "        utils.get_demo_series(\n",
    "            grouping_column=COMPARATOR_COLUMN,\n",
    "            joined_df=joined_df,\n",
    "            demo_columns=ADDITIONAL_DEMO_COLUMNS + DEMOGRAPHIC_COLUMNS,\n",
    "        ),\n",
    "        utils.get_demo_series(\n",
    "            grouping_column=SCORE_COLUMN,\n",
    "            joined_df=joined_df,\n",
    "            demo_columns=ADDITIONAL_DEMO_COLUMNS + DEMOGRAPHIC_COLUMNS,\n",
    "        ),\n",
    "    ],\n",
    "    axis=1,\n",
    ")\n",
    "\n",
    "tract_level_by_identification_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f37bfa5-5e9e-46d1-97af-8d0e7d53c80b",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(11, 11))\n",
    "sns.barplot(\n",
    "    y=\"Variable\",\n",
    "    x=\"Avg in tracts\",\n",
    "    hue=\"Definition\",\n",
    "    data=tract_level_by_identification_df.sort_values(by=COMPARATOR_COLUMN, ascending=False)\n",
    "    .stack()\n",
    "    .reset_index()\n",
    "    .rename(\n",
    "        columns={\"level_0\": \"Variable\", \"level_1\": \"Definition\", 0: \"Avg in tracts\"}\n",
    "    ),\n",
    "    palette=\"Blues\",\n",
    ")\n",
    "plt.xlim(0, 1)\n",
    "plt.title(\"Tract level averages by identification strategy\")\n",
    "plt.savefig(os.path.join(OUTPUT_DATA_PATH, \"tract_lvl_avg.jpg\"), bbox_inches='tight')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "713f2a04-1b2a-472c-b036-454cd5c9a495",
   "metadata": {},
   "outputs": [],
   "source": [
    "#papermill_description=Tract_stats_grouped\n",
    "tract_level_by_grouping_df = utils.get_tract_level_grouping(\n",
    "    joined_df=joined_df,\n",
    "    score_column=SCORE_COLUMN,\n",
    "    comparator_column=COMPARATOR_COLUMN,\n",
    "    demo_columns=ADDITIONAL_DEMO_COLUMNS + DEMOGRAPHIC_COLUMNS,\n",
    "    keep_missing_values=KEEP_MISSING_VALUES_FOR_SEGMENTATION,\n",
    ")\n",
    "\n",
    "tract_level_by_grouping_formatted_df = utils.format_multi_index_for_excel(\n",
    "    df=tract_level_by_grouping_df\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d166f53-4242-46f1-aedf-51dac517de94",
   "metadata": {},
   "outputs": [],
   "source": [
    "tract_level_by_grouping_formatted_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "372ac341-b03b-404a-9462-5087b88579b7",
   "metadata": {},
   "source": [
    "## Population-weighted stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d172e3c8-60ab-44f0-8034-dda69f1146da",
   "metadata": {},
   "outputs": [],
   "source": [
    "#papermill_description=Population_stats\n",
    "population_weighted_stats_df = pd.concat(\n",
    "    [\n",
    "        utils.construct_weighted_statistics(\n",
    "            input_df=joined_df,\n",
    "            weighting_column=COMPARATOR_COLUMN,\n",
    "            demographic_columns=DEMOGRAPHIC_COLUMNS + ADDITIONAL_DEMO_COLUMNS,\n",
    "            population_column=TOTAL_POPULATION_COLUMN,\n",
    "        ),\n",
    "        utils.construct_weighted_statistics(\n",
    "            input_df=joined_df,\n",
    "            weighting_column=SCORE_COLUMN,\n",
    "            demographic_columns=DEMOGRAPHIC_COLUMNS + ADDITIONAL_DEMO_COLUMNS,\n",
    "            population_column=TOTAL_POPULATION_COLUMN,\n",
    "        ),\n",
    "    ],\n",
    "    axis=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33c96e68",
   "metadata": {},
   "outputs": [],
   "source": [
    "population_weighted_stats_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "751d2d21",
   "metadata": {},
   "source": [
    "## Final information about overlap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eef9447",
   "metadata": {},
   "outputs": [],
   "source": [
    "comparator_and_cejst_proportion_series, states = utils.get_final_summary_info(\n",
    "    population=population_df,\n",
    "    comparator_file=COMPARATOR_FILE,\n",
    "    geoid_col=GEOID_COLUMN,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c45e7b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "states_text = \"States included in comparator: \" + states\n",
    "states_text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9fb4dfa-876d-4f25-9552-9831f51ec05c",
   "metadata": {},
   "source": [
    "## Print to excel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67e2c71e-68af-44b4-bab0-33bc54afe7a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "#papermill_description=Writing_excel\n",
    "utils.write_single_comparison_excel(\n",
    "    output_excel=OUTPUT_EXCEL,\n",
    "    population_df=population_df,\n",
    "    tract_level_by_identification_df=tract_level_by_identification_df,\n",
    "    population_weighted_stats_df=population_weighted_stats_df,\n",
    "    tract_level_by_grouping_formatted_df=tract_level_by_grouping_formatted_df,\n",
    "    comparator_and_cejst_proportion_series=comparator_and_cejst_proportion_series,\n",
    "    states_text=states_text,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
391
data/data-pipeline/data_pipeline/comparison_tool/src/utils.py
Normal file
@ -0,0 +1,391 @@
import pathlib
import pandas as pd
import xlsxwriter

from data_pipeline.score import field_names
from data_pipeline.etl.sources.census.etl_utils import get_state_information

# Some excel parameters
DEFAULT_COLUMN_WIDTH = 18
# the 31 is a limit from excel on how long the tab name can be
MSFT_TAB_NAME_LIMIT = 31

# FIPS information
DATA_PATH = pathlib.Path(__file__).parents[2] / "data"
FIPS_MAP = (
    get_state_information(data_path=DATA_PATH)
    .set_index("fips")["state_abbreviation"]
    .to_dict()
)


def validate_new_data(
    file_path: str, score_col: str, geoid: str = field_names.GEOID_TRACT_FIELD
):
    """Ensures user-provided data meets the constraints.

    Constraints are:
    (1) Boolean series for score column
    (2) GEOID column is named the same thing as in the rest of our code

    Note this only reads the first 10 rows of the file for speed
    """
    checking_df = pd.read_csv(
        file_path, usecols=[score_col, geoid], dtype={geoid: str}, nrows=10
    )

    assert (
        geoid in checking_df.columns
    ), f"Error: change your geoid column in the data to {field_names.GEOID_TRACT_FIELD}"
    # a boolean column should have very few distinct values (True, False, and
    # possibly a missing marker)
    assert (
        checking_df[score_col].nunique() <= 3
    ), f"Error: there are too many values possible in {score_col}"
    assert (True in checking_df[score_col].unique()) & (
        False in checking_df[score_col].unique()
    ), f"Error: {score_col} should be a boolean"


def read_file(
    file_path: str, columns: list, geoid: str = field_names.GEOID_TRACT_FIELD
) -> pd.DataFrame:
    """Reads standardized csvs in

    Parameters:
        file_path: the file to read
        columns: the columns to include
        geoid: the geoid column name (if we change this in usa.csv, we will need to change this slightly)

    Returns:
        dataframe that has been read in with geographic index
    """
    assert (
        geoid == field_names.GEOID_TRACT_FIELD
    ), f"Field name specified for geoid is incorrect. Use {field_names.GEOID_TRACT_FIELD}"
    return pd.read_csv(
        file_path, usecols=columns + [geoid], dtype={geoid: str}
    ).set_index(geoid)


def produce_summary_stats(
    joined_df: pd.DataFrame,
    comparator_column: str,
    score_column: str,
    population_column: str,
    geoid_column: str = field_names.GEOID_TRACT_FIELD,
) -> pd.DataFrame:
    """Produces high-level overview dataframe

    Parameters:
        joined_df: the big df
        comparator_column: the column name for the comparator identification bool
        score_column: the column name for the CEJST score bool
        population_column: the column that includes population count per tract
        geoid_column: the geoid10_tract column

    Returns:
        population_df: the high-level overview df
    """
    # Because this reports high-level statistics across all census tracts, it
    # makes sense for the statistics to force all census tracts to be included.
    temp_joined_df = joined_df.fillna({comparator_column: "missing"})

    population_df = temp_joined_df.groupby(
        [comparator_column, score_column]
    ).agg({population_column: ["sum"], geoid_column: ["count"]})

    population_df["share_of_tracts"] = (
        population_df[geoid_column] / population_df[geoid_column].sum()
    )

    population_df["share_of_population_in_tracts"] = (
        population_df[population_column]
        / population_df[population_column].sum()
    )

    population_df.columns = [
        "Population",
        "Count of tracts",
        "Share of tracts",
        "Share of population",
    ]
    return population_df


def get_demo_series(
    grouping_column: str,
    joined_df: pd.DataFrame,
    demo_columns: list,
) -> pd.DataFrame:
    """Helper function to produce demographic information"""
    # Drop rows where the grouping column is missing so the boolean mask below
    # only selects tracts the grouping actually classifies.
    full_df = joined_df.dropna(subset=[grouping_column])
    return (
        full_df[full_df[grouping_column]][demo_columns]
        .mean()
        .T.rename(grouping_column)
    )


def get_tract_level_grouping(
    joined_df: pd.DataFrame,
    score_column: str,
    comparator_column: str,
    demo_columns: list,
    keep_missing_values: bool = True,
) -> pd.DataFrame:
    """Function to produce segmented statistics (tract level)

    Here, we are thinking about the following segments:
    1. CEJST and comparator
    2. Not CEJST and comparator
    3. Not CEJST and not comparator
    4. CEJST and not comparator

    If the "keep_missing_values" flag is set:
    5. Missing from CEJST and comparator (this should never be true -- it would be a tract we had not seen in CEJST!)
    6. Missing from comparator and not highlighted by CEJST
    7. Missing from comparator and highlighted by CEJST

    This will make sure that comparisons are "apples to apples".
    """
    group_list = [score_column, comparator_column]
    use_df = joined_df.copy()

    if keep_missing_values:
        use_df = use_df.fillna({score_column: "nan", comparator_column: "nan"})
    grouping_df = use_df.groupby(group_list)[demo_columns].mean().reset_index()

    # this will work whether or not there are "nans" present
    grouping_df[score_column] = grouping_df[score_column].map(
        {
            True: "CEJST",
            False: "Not CEJST",
            "nan": "No CEJST classification",
        }
    )
    grouping_df[comparator_column] = grouping_df[comparator_column].map(
        {
            True: "Comparator",
            False: "Not Comparator",
            "nan": "No Comparator classification",
        }
    )
    return grouping_df.set_index([score_column, comparator_column]).T


def format_multi_index_for_excel(
    df: pd.DataFrame, rename_str: str = "Variable"
) -> pd.DataFrame:
    """Helper function for multiindex printing"""
    df = df.reset_index()
    df.columns = [rename_str] + [
        ", ".join(col_tuple).strip()
        for col_tuple in df.columns[1:].to_flat_index()
    ]
    return df


def get_final_summary_info(
    population: pd.DataFrame,
    comparator_file: str,
    geoid_col: str,
) -> tuple[pd.DataFrame, str]:
    """Creates summary table.

    This creates a series that tells us what share (%) of census tracts identified
    by the comparator are also in CEJST and what states the comparator covers.
    """
    try:
        comparator_and_cejst_proportion_series = (
            population.loc[(True, True)] / population.loc[(True,)].sum()
        )
    except KeyError:
        # for when we are looking at a disjoint set, like donut holes
        comparator_and_cejst_proportion_series = pd.DataFrame()

    # we pull all state fips codes (the first two characters of the tract
    # geoid) from the comparator column -- this is a very quick read
    states_represented = (
        pd.read_csv(
            comparator_file, usecols=[geoid_col], dtype={geoid_col: str}
        )[geoid_col]
        .str[:2]
        .unique()
    )
    # We join all states into a single string here so they can be printed in a
    # single cell in the excel file.
    states = ", ".join(
        [
            FIPS_MAP[state]
            if (state in FIPS_MAP)
            else f"Comparator code missing: (fips {state})"
            for state in states_represented
        ]
    )
    return comparator_and_cejst_proportion_series, states


def construct_weighted_statistics(
    input_df: pd.DataFrame,
    weighting_column: str,
    demographic_columns: list,
    population_column: str,
) -> pd.DataFrame:
    """Function to produce population-weighted stats

    Parameters:
        input_df: this gets copied and is the big frame
        weighting_column: the column to group by for the comparator weights (e.g., grouped by this column, the sum of the weights is 1)
        demographic_columns: the columns to get weighted stats for
        population_column: the population column

    Returns:
        population-weighted comparator statistics
    """
    comparator_weighted_joined_df = input_df.copy()
    # each tract's weight is its share of its group's total population
    comparator_weighted_joined_df[
        "tmp_weight"
    ] = comparator_weighted_joined_df.groupby(weighting_column)[
        population_column
    ].transform(
        lambda x: x / x.sum()
    )
    comparator_weighted_joined_df[
        demographic_columns
    ] = comparator_weighted_joined_df[demographic_columns].transform(
        lambda x: x * comparator_weighted_joined_df["tmp_weight"]
    )
    return (
        comparator_weighted_joined_df.groupby(weighting_column)[
            demographic_columns
        ]
        .sum()
        .T
    ).rename(columns={True: weighting_column, False: "not " + weighting_column})
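# A toy check of construct_weighted_statistics (invented numbers), to make the
# weighting concrete:
#
#     toy_df = pd.DataFrame(
#         {
#             "flagged": [True, True, False],
#             "Total Population": [100, 300, 200],
#             "Percent Hispanic or Latino": [0.10, 0.50, 0.20],
#         }
#     )
#     construct_weighted_statistics(
#         input_df=toy_df,
#         weighting_column="flagged",
#         demographic_columns=["Percent Hispanic or Latino"],
#         population_column="Total Population",
#     )
#
# For the flagged group, the population weights are 100/400 and 300/400, so
# the weighted average is 0.10 * 0.25 + 0.50 * 0.75 = 0.40, versus an
# unweighted tract average of 0.30.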
def write_excel_tab(
    writer: pd.ExcelWriter,
    worksheet_name: str,
    df: pd.DataFrame,
    text_format: xlsxwriter.format.Format,
    use_index: bool = True,
):
    """Helper function to set tab width"""
    df.to_excel(writer, sheet_name=worksheet_name, index=use_index)
    worksheet = writer.sheets[worksheet_name]
    for i, column_name in enumerate(df.columns):
        # We set variable names to be extra wide; all other columns can take
        # cues from their headers
        if not column_name == "Variable":
            worksheet.set_column(i, i + 1, len(column_name) + 2, text_format)
        else:
            worksheet.set_column(i, i + 1, DEFAULT_COLUMN_WIDTH, text_format)


def write_excel_tab_about_comparator_scope(
    writer: pd.ExcelWriter,
    worksheet_name: str,
    comparator_and_cejst_proportion_series: pd.Series,
    text_format: xlsxwriter.format.Format,
    merge_format: xlsxwriter.format.Format,
    states_text: str,
):
    """Writes a single tab for the excel file about high-level comparator stats"""
    comparator_and_cejst_proportion_series.to_excel(
        writer, sheet_name=worksheet_name
    )
    worksheet = writer.sheets[worksheet_name[:MSFT_TAB_NAME_LIMIT]]
    worksheet.set_column(0, 1, DEFAULT_COLUMN_WIDTH, text_format)

    # merge the cells for states text
    row_merge = len(comparator_and_cejst_proportion_series) + 2
    # changes the row height based on how long the states text is
    worksheet.set_row(row_merge, len(states_text) // 2)
    worksheet.merge_range(
        first_row=row_merge,
        last_row=row_merge,
        first_col=0,
        last_col=1,
        data=states_text,
        cell_format=merge_format,
    )


def write_single_comparison_excel(
    output_excel: str,
    population_df: pd.DataFrame,
    tract_level_by_identification_df: pd.DataFrame,
    population_weighted_stats_df: pd.DataFrame,
    tract_level_by_grouping_formatted_df: pd.DataFrame,
    comparator_and_cejst_proportion_series: pd.Series,
    states_text: str,
):
    """Writes the comparison excel file.

    Writing excel from python is always a huge pain. Making the functions truly generalizable is not worth
    the payoff and (in my experience) is extremely hard to maintain.
    """
    with pd.ExcelWriter(output_excel) as writer:
        workbook = writer.book
        text_format = workbook.add_format(
            {
                "bold": False,
                "text_wrap": True,
                "valign": "middle",
                "num_format": "#,##0.00",
            }
        )
        merge_format = workbook.add_format(
            {
                "border": 1,
                "align": "center",
                "valign": "vcenter",
                "text_wrap": True,
            }
        )
        write_excel_tab(
            writer=writer,
            worksheet_name="Summary",
            df=population_df.reset_index(),
            text_format=text_format,
            use_index=False,
        )
        write_excel_tab(
            writer=writer,
            worksheet_name="Tract level stats",
            df=tract_level_by_identification_df.reset_index().rename(
                columns={"index": "Description of variable"}
            ),
            text_format=text_format,
            use_index=False,
        )

        write_excel_tab(
            writer=writer,
            worksheet_name="Population level stats",
            df=population_weighted_stats_df.reset_index().rename(
                columns={"index": "Description of variable"}
            ),
            text_format=text_format,
            use_index=False,
        )
        write_excel_tab(
            writer=writer,
            worksheet_name="Segmented tract level stats",
            df=tract_level_by_grouping_formatted_df,
            text_format=text_format,
            use_index=False,
        )
        if not comparator_and_cejst_proportion_series.empty:
            write_excel_tab_about_comparator_scope(
                writer=writer,
                worksheet_name="Comparator and CEJST overlap",
                comparator_and_cejst_proportion_series=comparator_and_cejst_proportion_series.rename(
                    "Comparator and CEJST overlap"
                ),
                text_format=text_format,
                states_text=states_text,
                merge_format=merge_format,
            )
128
data/data-pipeline/poetry.lock
generated
@ -1,3 +1,14 @@
[[package]]
name = "ansiwrap"
version = "0.8.4"
description = "textwrap, but savvy to ANSI colors and styles"
category = "dev"
optional = false
python-versions = "*"

[package.dependencies]
textwrap3 = ">=0.9.2"

[[package]]
name = "appnope"
version = "0.1.2"

@ -1094,6 +1105,36 @@ category = "main"
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"

[[package]]
name = "papermill"
version = "2.3.4"
description = "Parametrize and run Jupyter and nteract Notebooks"
category = "dev"
optional = false
python-versions = ">=3.6"

[package.dependencies]
ansiwrap = "*"
click = "*"
entrypoints = "*"
nbclient = ">=0.2.0"
nbformat = ">=5.1.2"
pyyaml = "*"
requests = "*"
tenacity = "*"
tqdm = ">=4.32.2"

[package.extras]
all = ["boto3", "azure-datalake-store (>=0.0.30)", "azure-storage-blob (>=12.1.0)", "requests (>=2.21.0)", "gcsfs (>=0.2.0)", "pyarrow", "black (>=19.3b0)"]
azure = ["azure-datalake-store (>=0.0.30)", "azure-storage-blob (>=12.1.0)", "requests (>=2.21.0)"]
black = ["black (>=19.3b0)"]
dev = ["boto3", "botocore", "codecov", "coverage", "google-compute-engine", "ipython (>=5.0)", "ipywidgets", "notebook", "mock", "moto", "pytest (>=4.1)", "pytest-cov (>=2.6.1)", "pytest-mock (>=1.10)", "pytest-env (>=0.6.2)", "requests (>=2.21.0)", "check-manifest", "attrs (>=17.4.0)", "pre-commit", "flake8", "tox", "bumpversion", "recommonmark", "pip (>=18.1)", "wheel (>=0.31.0)", "setuptools (>=38.6.0)", "twine (>=1.11.0)", "azure-datalake-store (>=0.0.30)", "azure-storage-blob (>=12.1.0)", "gcsfs (>=0.2.0)", "pyarrow", "black (>=19.3b0)"]
gcs = ["gcsfs (>=0.2.0)"]
github = ["PyGithub (>=1.55)"]
hdfs = ["pyarrow"]
s3 = ["boto3"]
test = ["boto3", "botocore", "codecov", "coverage", "google-compute-engine", "ipython (>=5.0)", "ipywidgets", "notebook", "mock", "moto", "pytest (>=4.1)", "pytest-cov (>=2.6.1)", "pytest-mock (>=1.10)", "pytest-env (>=0.6.2)", "requests (>=2.21.0)", "check-manifest", "attrs (>=17.4.0)", "pre-commit", "flake8", "tox", "bumpversion", "recommonmark", "pip (>=18.1)", "wheel (>=0.31.0)", "setuptools (>=38.6.0)", "twine (>=1.11.0)", "azure-datalake-store (>=0.0.30)", "azure-storage-blob (>=12.1.0)", "gcsfs (>=0.2.0)", "pyarrow", "black (>=19.3b0)"]

[[package]]
name = "parso"
version = "0.8.3"

@ -1470,6 +1511,31 @@ dparse = ">=0.5.1"
packaging = "*"
requests = "*"

[[package]]
name = "scipy"
version = "1.6.1"
description = "SciPy: Scientific Library for Python"
category = "dev"
optional = false
python-versions = ">=3.7"

[package.dependencies]
numpy = ">=1.16.5"

[[package]]
name = "seaborn"
version = "0.11.2"
description = "seaborn: statistical data visualization"
category = "dev"
optional = false
python-versions = ">=3.6"

[package.dependencies]
matplotlib = ">=2.2"
numpy = ">=1.15"
pandas = ">=0.23"
scipy = ">=1.0"

[[package]]
name = "semantic-version"
version = "2.9.0"

@ -1540,6 +1606,17 @@ category = "main"
optional = false
python-versions = ">=3.6"

[[package]]
name = "tenacity"
version = "8.0.1"
description = "Retry code until it succeeds"
category = "dev"
optional = false
python-versions = ">=3.6"

[package.extras]
doc = ["reno", "sphinx", "tornado (>=4.5)"]

[[package]]
name = "terminado"
version = "0.13.3"

@ -1567,6 +1644,14 @@ python-versions = ">= 3.5"
[package.extras]
test = ["pytest"]

[[package]]
name = "textwrap3"
version = "0.9.2"
description = "textwrap from Python 3.6 backport (plus a few tweaks)"
category = "dev"
optional = false
python-versions = "*"

[[package]]
name = "toml"
version = "0.10.2"

@ -1783,9 +1868,13 @@ testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-
[metadata]
lock-version = "1.1"
python-versions = "^3.8"
content-hash = "48853003143c9035134427a9cb0e359401b1d49ffe55953a711a79d264994125"
content-hash = "a317df20c1eeacc023aca30a4c25af3da8fa45f63c24ad47d5ab518d08bd9311"

[metadata.files]
ansiwrap = [
    {file = "ansiwrap-0.8.4-py2.py3-none-any.whl", hash = "sha256:7b053567c88e1ad9eed030d3ac41b722125e4c1271c8a99ade797faff1f49fb1"},
    {file = "ansiwrap-0.8.4.zip", hash = "sha256:ca0c740734cde59bf919f8ff2c386f74f9a369818cdc60efe94893d01ea8d9b7"},
]
appnope = [
    {file = "appnope-0.1.2-py2.py3-none-any.whl", hash = "sha256:93aa393e9d6c54c5cd570ccadd8edad61ea0c4b9ea7a01409020c9aa019eb442"},
    {file = "appnope-0.1.2.tar.gz", hash = "sha256:dd83cd4b5b460958838f6eb3000c660b1f9caf2a5b1de4264e941512f603258a"},

@ -2464,6 +2553,10 @@ pandocfilters = [
    {file = "pandocfilters-1.5.0-py2.py3-none-any.whl", hash = "sha256:33aae3f25fd1a026079f5d27bdd52496f0e0803b3469282162bafdcbdf6ef14f"},
    {file = "pandocfilters-1.5.0.tar.gz", hash = "sha256:0b679503337d233b4339a817bfc8c50064e2eff681314376a47cb582305a7a38"},
]
papermill = [
    {file = "papermill-2.3.4-py3-none-any.whl", hash = "sha256:81eb9aa3dbace9772cd6287f5af8deef64c6659d9ace0b2761db05068233bf77"},
    {file = "papermill-2.3.4.tar.gz", hash = "sha256:be12d2728989c0ae17b42fcb05b623500004e94b34f56bd153355ccebb84a59a"},
]
parso = [
    {file = "parso-0.8.3-py2.py3-none-any.whl", hash = "sha256:c001d4636cd3aecdaf33cbb40aebb59b094be2a74c556778ef5576c175e19e75"},
    {file = "parso-0.8.3.tar.gz", hash = "sha256:8c07be290bb59f03588915921e29e8a50002acaf2cdc5fa0e0114f91709fafa0"},

@ -2788,6 +2881,31 @@ safety = [
    {file = "safety-1.10.3-py2.py3-none-any.whl", hash = "sha256:5f802ad5df5614f9622d8d71fedec2757099705c2356f862847c58c6dfe13e84"},
    {file = "safety-1.10.3.tar.gz", hash = "sha256:30e394d02a20ac49b7f65292d19d38fa927a8f9582cdfd3ad1adbbc66c641ad5"},
]
scipy = [
    {file = "scipy-1.6.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:a15a1f3fc0abff33e792d6049161b7795909b40b97c6cc2934ed54384017ab76"},
    {file = "scipy-1.6.1-cp37-cp37m-manylinux1_i686.whl", hash = "sha256:e79570979ccdc3d165456dd62041d9556fb9733b86b4b6d818af7a0afc15f092"},
    {file = "scipy-1.6.1-cp37-cp37m-manylinux1_x86_64.whl", hash = "sha256:a423533c55fec61456dedee7b6ee7dce0bb6bfa395424ea374d25afa262be261"},
    {file = "scipy-1.6.1-cp37-cp37m-manylinux2014_aarch64.whl", hash = "sha256:33d6b7df40d197bdd3049d64e8e680227151673465e5d85723b3b8f6b15a6ced"},
    {file = "scipy-1.6.1-cp37-cp37m-win32.whl", hash = "sha256:6725e3fbb47da428794f243864f2297462e9ee448297c93ed1dcbc44335feb78"},
    {file = "scipy-1.6.1-cp37-cp37m-win_amd64.whl", hash = "sha256:5fa9c6530b1661f1370bcd332a1e62ca7881785cc0f80c0d559b636567fab63c"},
    {file = "scipy-1.6.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:bd50daf727f7c195e26f27467c85ce653d41df4358a25b32434a50d8870fc519"},
    {file = "scipy-1.6.1-cp38-cp38-manylinux1_i686.whl", hash = "sha256:f46dd15335e8a320b0fb4685f58b7471702234cba8bb3442b69a3e1dc329c345"},
    {file = "scipy-1.6.1-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:0e5b0ccf63155d90da576edd2768b66fb276446c371b73841e3503be1d63fb5d"},
    {file = "scipy-1.6.1-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:2481efbb3740977e3c831edfd0bd9867be26387cacf24eb5e366a6a374d3d00d"},
    {file = "scipy-1.6.1-cp38-cp38-win32.whl", hash = "sha256:68cb4c424112cd4be886b4d979c5497fba190714085f46b8ae67a5e4416c32b4"},
    {file = "scipy-1.6.1-cp38-cp38-win_amd64.whl", hash = "sha256:5f331eeed0297232d2e6eea51b54e8278ed8bb10b099f69c44e2558c090d06bf"},
    {file = "scipy-1.6.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:0c8a51d33556bf70367452d4d601d1742c0e806cd0194785914daf19775f0e67"},
    {file = "scipy-1.6.1-cp39-cp39-manylinux1_i686.whl", hash = "sha256:83bf7c16245c15bc58ee76c5418e46ea1811edcc2e2b03041b804e46084ab627"},
    {file = "scipy-1.6.1-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:794e768cc5f779736593046c9714e0f3a5940bc6dcc1dba885ad64cbfb28e9f0"},
    {file = "scipy-1.6.1-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:5da5471aed911fe7e52b86bf9ea32fb55ae93e2f0fac66c32e58897cfb02fa07"},
    {file = "scipy-1.6.1-cp39-cp39-win32.whl", hash = "sha256:8e403a337749ed40af60e537cc4d4c03febddcc56cd26e774c9b1b600a70d3e4"},
    {file = "scipy-1.6.1-cp39-cp39-win_amd64.whl", hash = "sha256:a5193a098ae9f29af283dcf0041f762601faf2e595c0db1da929875b7570353f"},
    {file = "scipy-1.6.1.tar.gz", hash = "sha256:c4fceb864890b6168e79b0e714c585dbe2fd4222768ee90bc1aa0f8218691b11"},
]
seaborn = [
    {file = "seaborn-0.11.2-py3-none-any.whl", hash = "sha256:85a6baa9b55f81a0623abddc4a26b334653ff4c6b18c418361de19dbba0ef283"},
    {file = "seaborn-0.11.2.tar.gz", hash = "sha256:cf45e9286d40826864be0e3c066f98536982baf701a7caa386511792d61ff4f6"},
]
semantic-version = [
    {file = "semantic_version-2.9.0-py2.py3-none-any.whl", hash = "sha256:db2504ab37902dd2c9876ece53567aa43a5b2a417fbe188097b2048fff46da3d"},
    {file = "semantic_version-2.9.0.tar.gz", hash = "sha256:abf54873553e5e07a6fd4d5f653b781f5ae41297a493666b59dcf214006a12b2"},

@ -2836,6 +2954,10 @@ soupsieve = [
    {file = "soupsieve-2.3.1-py3-none-any.whl", hash = "sha256:1a3cca2617c6b38c0343ed661b1fa5de5637f257d4fe22bd9f1338010a1efefb"},
    {file = "soupsieve-2.3.1.tar.gz", hash = "sha256:b8d49b1cd4f037c7082a9683dfa1801aa2597fb11c3a1155b7a5b94829b4f1f9"},
]
tenacity = [
    {file = "tenacity-8.0.1-py3-none-any.whl", hash = "sha256:f78f4ea81b0fabc06728c11dc2a8c01277bfc5181b321a4770471902e3eb844a"},
    {file = "tenacity-8.0.1.tar.gz", hash = "sha256:43242a20e3e73291a28bcbcacfd6e000b02d3857a9a9fff56b297a27afdc932f"},
]
terminado = [
    {file = "terminado-0.13.3-py3-none-any.whl", hash = "sha256:874d4ea3183536c1782d13c7c91342ef0cf4e5ee1d53633029cbc972c8760bd8"},
    {file = "terminado-0.13.3.tar.gz", hash = "sha256:94d1cfab63525993f7d5c9b469a50a18d0cdf39435b59785715539dd41e36c0d"},

@ -2844,6 +2966,10 @@ testpath = [
    {file = "testpath-0.6.0-py3-none-any.whl", hash = "sha256:8ada9f80a2ac6fb0391aa7cdb1a7d11cfa8429f693eda83f74dde570fe6fa639"},
    {file = "testpath-0.6.0.tar.gz", hash = "sha256:2f1b97e6442c02681ebe01bd84f531028a7caea1af3825000f52345c30285e0f"},
]
textwrap3 = [
    {file = "textwrap3-0.9.2-py2.py3-none-any.whl", hash = "sha256:bf5f4c40faf2a9ff00a9e0791fed5da7415481054cef45bb4a3cfb1f69044ae0"},
    {file = "textwrap3-0.9.2.zip", hash = "sha256:5008eeebdb236f6303dcd68f18b856d355f6197511d952ba74bc75e40e0c3414"},
]
toml = [
    {file = "toml-0.10.2-py2.py3-none-any.whl", hash = "sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b"},
    {file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"},
2
data/data-pipeline/pyproject.toml

@ -53,6 +53,8 @@ tox-poetry = "^0.4.1"
pandas-vet = "^0.2.2"
pytest-snapshot = "^0.8.1"
nb-black = "^1.0.7"
seaborn = "^0.11.2"
papermill = "^2.3.4"

[build-system]
build-backend = "poetry.core.masonry.api"