Data Unit Tests (#509)

* Fixes #341 -
As a J40 developer, I want to write Unit Tests for the ETL files,
so that tests are run on each commit

* Location bug

* Adding Load tests

* Fixing XLSX filename

* Adding downloadable zip test

* updating pickle

* Fixing pylint warnings

* Update readme to correct some typos and reorganize test content structure

* Removing unused schemas file, adding details to readme around pickles, per PR feedback

* Update test to pass with Score D added to score file; update path in readme

* fix requirements.txt after merge

* fix poetry.lock after merge

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
Nat Hillard 2021-09-10 14:17:34 -04:00 committed by GitHub
parent 88c8209bb0
commit 536a35d6a0
17 changed files with 676 additions and 242 deletions

.gitignore
View file

@ -149,3 +149,4 @@ node_modules
# pyenv
.python-version
.DS_Store
temp_dir

View file

@ -8,6 +8,10 @@
- [Justice 40 Score application](#justice-40-score-application)
- [About this application](#about-this-application)
- [Using the data](#using-the-data)
- [1. Source data](#1-source-data)
- [2. Extract-Transform-Load (ETL) the data](#2-extract-transform-load-etl-the-data)
- [3. Combined dataset](#3-combined-dataset)
- [4. Tileset](#4-tileset)
- [Score generation and comparison workflow](#score-generation-and-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
@ -28,6 +32,15 @@
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Miscellaneous](#miscellaneous)
- [Testing](#testing)
- [Background](#background)
- [Configuration / Fixtures](#configuration--fixtures)
- [Updating Pickles](#updating-pickles)
    - [Future Enhancements](#future-enhancements)
- [ETL Unit Tests](#etl-unit-tests)
- [Extract Tests](#extract-tests)
- [Transform Tests](#transform-tests)
- [Load Tests](#load-tests)
<!-- /TOC -->
@ -46,15 +59,17 @@ One of our primary development principles is that the entire data pipeline shoul
In the sub-sections below, we outline what each stage of the data provenance looks like and where you can find the data output by that stage. If you'd like to actually perform each step in your own environment, skip down to [Score generation and comparison workflow](#score-generation-and-comparison-workflow).
#### 1. Source data
If you would like to find and use the raw source data, you can find the source URLs in the `etl.py` files located within each directory in `data/data-pipeline/etl/sources`.
#### 2. Extract-Transform-Load (ETL) the data
The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:
* EJScreen: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv
* Census ACS 2019: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv
* Housing and Transportation Index: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv
* HUD Housing: https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv
The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:
- EJScreen: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv>
- Census ACS 2019: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv>
- Housing and Transportation Index: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv>
- HUD Housing: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv>
Each CSV may have a different column name for the census tract or census block group identifier. You can find what the name is in the ETL code. Please note that when you view these files you should make sure that your text editor or spreadsheet software does not remove the initial `0` from this identifier field (many IDs begin with `0`).
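For example, if you open one of these CSVs with pandas, reading the identifier column with a string dtype preserves the leading `0`. The column name below is illustrative only; check each dataset's ETL code for its actual identifier column:

```python
import pandas as pd

# Reading the identifier column as a string keeps the leading "0" intact.
# "GEOID10" is illustrative; check each dataset's ETL code for its actual
# census tract / block group identifier column name.
df = pd.read_csv(
    "usa.csv",
    dtype={"GEOID10": "string"},
)
print(df["GEOID10"].head())
```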
@ -242,3 +257,78 @@ see [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nb
## Miscellaneous
- To export packages from Poetry to `requirements.txt` run `poetry export --without-hashes > requirements.txt`
## Testing
### Background
For this project, we make use of [pytest](https://docs.pytest.org/en/latest/) for testing purposes. To run tests, simply run `poetry run pytest` in this directory (i.e. `justice40-tool/data/data-pipeline`).
### Configuration / Fixtures
Test data is configured via [fixtures](https://docs.pytest.org/en/latest/explanation/fixtures.html).
These fixtures utilize [pickle files](https://docs.python.org/3/library/pickle.html) to store dataframes to disk. We do this because two dataframes can be counted as unequal even when their columns hold the same "visible" values, if the column types do not match; a minimal illustration of this appears after the list below.
In a bit more detail:
1. Pandas dataframes are typed, and by default, types are inferred when you create one from scratch. If you create a dataframe using the `DataFrame` [constructors](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), there is no guarantee that types will be correct without explicit `dtype` annotations. Explicit `dtype` annotations are possible, but this leads us to point #2:
2. Our transformations and dataframes in the source code under test don't always require specific types, and it is often sufficient in the code itself to rely on the `object` type. We attempted adding explicit typing based on the "logical" type of given columns, but in practice it resulted in non-matching dataframes that _actually_ had the same values -- in particular, it was very common to have one dataframe column of type `string` and another of type `object` that carried the same values. In other words, even if we did create a "correctly" typed dataframe (according to our logical assumptions about what types should be), it was still counted as mismatched against the dataframes actually used in our program. To fix this "the right way", it is necessary to explicitly annotate types at the point of the `read_csv` call, which has other potential unintended side effects and would need to be done carefully.
3. For larger dataframes (some of these have 150+ values), it was initially deemed too difficult/time-consuming to manually annotate all types, and further, to modify those type annotations based on what is expected in the source code under test.
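As a minimal illustration of the type-mismatch problem above (a toy example, not taken from the pipeline itself):

```python
import pandas as pd
import pandas.testing as pdt

# Same "visible" values, different dtypes: one column is "string", the other "object".
left = pd.DataFrame({"GEOID": pd.Series(["01001", "01003"], dtype="string")})
right = pd.DataFrame({"GEOID": pd.Series(["01001", "01003"], dtype="object")})

try:
    pdt.assert_frame_equal(left, right)
except AssertionError as err:
    print(err)  # the frames are reported as unequal purely because the dtypes differ

# Relaxing the dtype check makes the comparison pass.
pdt.assert_frame_equal(left, right, check_dtype=False)
```

Pickling a dataframe and reading it back preserves its exact dtypes, which is why the fixtures snapshot dataframes with pickle rather than re-creating them by hand; it is also why `test_transform_score` below passes `check_dtype=False` to `assert_frame_equal`.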
#### Updating Pickles
If you update the input or output of various methods, it is necessary to create new pickles so that data is validated correctly. To do this:
1. Drop a breakpoint just before the dataframe will otherwise be written to / read from disk. If you're using VSCode, use one of the named run targets within `data-pipeline` such as `Score Full Run`, and put a breakpoint in the margin just before the actionable step. More on using breakpoints in VSCode [here](https://code.visualstudio.com/docs/editor/debugging#_breakpoints). If you are not using VSCode, you can put the line `breakpoint()` in your code and it will stop where you have placed the line in whatever calling context you are using.
1. In your editor/terminal, run `df.to_pickle("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl")` to write the pickle to the appropriate location on disk.
1. Be sure to do this for all inputs/outputs that have changed as a result of your modification. It is often necessary to do this several times for cascading operations.
1. To inspect your pickle, open a python interpreter, then run `import pickle` followed by `pickle.load(open("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl", "rb"))` to get the file contents.
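A condensed, self-contained sketch of that workflow (the snapshot path is copied from above; the toy dataframe stands in for whatever `df` exists at your breakpoint):

```python
import pandas as pd

# Illustrative snapshot path, copied from the step above.
SNAPSHOT = "data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl"

# At your breakpoint, `df` is whatever dataframe the code under test just
# produced; a toy frame stands in for it so this sketch runs on its own.
df = pd.DataFrame({"GEOID10": pd.Series(["010010201001"], dtype="string")})

# 1. Write the new pickle to the snapshot location used by the fixtures.
df.to_pickle(SNAPSHOT)

# 2. Inspect it later; dtypes survive the round trip exactly.
snapshot_df = pd.read_pickle(SNAPSHOT)
print(snapshot_df.dtypes)
```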
#### Future Enhancements
Pickles have several downsides for which we should consider alternatives:
1. They are opaque - it is necessary to open a python interpreter (as written above) to confirm their contents.
2. They are a bit harder for newcomers to python to grok.
3. They potentially encode flawed typing assumptions (see above) which are paved over for future test runs.
In the future, we could adopt any of the below strategies to work around this:
1. We could use [pytest-snapshot](https://pypi.org/project/pytest-snapshot/) to automatically store the output of each test as data changes. This would make it so that you could avoid having to generate a pickle for each method - instead, you would only need to call `generate` once, and only when the dataframe had changed.
Additionally, you could use a pandas type schema annotation such as [pandera](https://pandera.readthedocs.io/en/stable/schema_models.html?highlight=inputschema#basic-usage) to annotate input/output schemas for given functions, and your unit tests could use these to validate explicitly. This could be of very high value for making expectations explicit; a sketch appears below.
Alternatively, or in conjunction, you could move toward using a more strictly-typed container format for read/writes such as SQL/SQLite, and use something like [SQLModel](https://github.com/tiangolo/sqlmodel) to handle more explicit type guarantees.
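As an illustration of the pandera option, a schema for the transformed counties dataframe might look like the following. Note that pandera is not currently a project dependency, and the checks themselves are assumptions; only the column names are taken from the fixtures above:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for the transformed counties dataframe; the column names
# mirror the counties fixture, but the checks themselves are assumptions.
counties_schema = pa.DataFrameSchema(
    {
        "State Abbreviation": pa.Column(str),
        "GEOID": pa.Column(str, pa.Check.str_length(5, 5)),
        "County Name": pa.Column(str),
    }
)

counties_df = pd.DataFrame(
    {
        "State Abbreviation": ["AL", "AL"],
        "GEOID": ["01001", "01003"],
        "County Name": ["AutaugaCounty", "BaldwinCounty"],
    }
)

# Raises a SchemaError if the dataframe does not conform.
counties_schema.validate(counties_df)
```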
### ETL Unit Tests
ETL unit tests are typically organized into three buckets:
- Extract Tests
- Transform Tests, and
- Load Tests
These are tested using different strategies, explained below.
#### Extract Tests
Extract tests rely on the limited data transformations that occur as data is loaded from source files.
In tests, we use fake, limited CSVs read via `StringIO`, taken from the first several rows of the files of interest, and ensure data types are correct.
Down the line, we could use a tool like [Pandera](https://pandera.readthedocs.io/) to enforce schemas, both for the tests and the classes themselves.
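A hedged sketch of the `StringIO` approach described above (the columns and dtypes mirror the state CSV fixture; the example is illustrative, not a test from the suite):

```python
from io import StringIO

import pandas as pd
import pandas.api.types as ptypes

# A fake, truncated copy of the source file, kept inline for the test.
fake_state_csv = StringIO(
    "fips,state_name,state_abbreviation,region,division\n"
    "01,Alabama,AL,South,East South Central\n"
)

extracted = pd.read_csv(
    fake_state_csv, dtype={"fips": "string", "state_abbreviation": "string"}
)

# The identifier columns should come out as strings, not numbers.
assert ptypes.is_string_dtype(extracted["fips"])
assert ptypes.is_string_dtype(extracted["state_abbreviation"])
```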
#### Transform Tests
Transform tests are the heart of ETL unit tests, and compare ideal dataframes with their actual counterparts.
See the [Fixtures](#configuration--fixtures) section above for information about where the data comes from.
#### Load Tests
These make use of [tmp_path_factory](https://docs.pytest.org/en/latest/how-to/tmp_path.html) to create a temporary file system located under `temp_dir`, and validate whether the correct files are written to the correct locations.
Additional future modifications could include the use of Pandera and/or other schema validation tools, and/or a more explicit test that the data written to file can be read back in and yields the same dataframe (sketched below).
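One possible shape for that round-trip check, using pytest's built-in `tmp_path` fixture (this test does not exist in the suite and is only a sketch):

```python
import pandas as pd
import pandas.testing as pdt


def test_score_csv_round_trip(tmp_path):
    # Stand-in for the merged score dataframe written by the load step.
    expected = pd.DataFrame(
        {
            "GEOID10": pd.Series(["010010201001"], dtype="string"),
            "Score E (percentile)": [0.35],
        }
    )

    out_path = tmp_path / "usa.csv"
    expected.to_csv(out_path, index=False)

    # CSV does not preserve dtypes, so they are re-declared on read.
    actual = pd.read_csv(out_path, dtype={"GEOID10": "string"})
    pdt.assert_frame_equal(actual, expected)
```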

View file

@ -0,0 +1,108 @@
from pathlib import Path
import pandas as pd
from data_pipeline.config import settings
# Base Paths
DATA_PATH = Path(settings.APP_ROOT) / "data"
TMP_PATH = DATA_PATH / "tmp"
# Remote Paths
CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
# Local Paths
CENSUS_COUNTIES_FILE_NAME = TMP_PATH / "Gaz_counties_national.txt"
# Census paths
DATA_CENSUS_DIR = DATA_PATH / "census"
DATA_CENSUS_CSV_DIR = DATA_CENSUS_DIR / "csv"
DATA_CENSUS_CSV_FILE_PATH = DATA_CENSUS_CSV_DIR / "us.csv"
DATA_CENSUS_CSV_STATE_FILE_PATH = DATA_CENSUS_CSV_DIR / "fips_states_2010.csv"
# Score paths
DATA_SCORE_DIR = DATA_PATH / "score"
## Score CSV Paths
DATA_SCORE_CSV_DIR = DATA_SCORE_DIR / "csv"
DATA_SCORE_CSV_FULL_DIR = DATA_SCORE_CSV_DIR / "full"
DATA_SCORE_CSV_FULL_FILE_PATH = DATA_SCORE_CSV_FULL_DIR / "usa.csv"
FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH = (
DATA_SCORE_CSV_FULL_DIR / "usa_counties.csv"
)
## Score Tile paths
DATA_SCORE_TILES_DIR = DATA_SCORE_DIR / "tiles"
DATA_SCORE_TILES_FILE_PATH = DATA_SCORE_TILES_DIR / "usa.csv"
# Downloadable paths
SCORE_DOWNLOADABLE_DIR = DATA_SCORE_DIR / "downloadable"
SCORE_DOWNLOADABLE_CSV_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "usa.csv"
SCORE_DOWNLOADABLE_EXCEL_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "usa.xlsx"
SCORE_DOWNLOADABLE_ZIP_FILE_PATH = SCORE_DOWNLOADABLE_DIR / "Screening Tool Data.zip"
# Column subsets
CENSUS_COUNTIES_COLUMNS = ["USPS", "GEOID", "NAME"]
TILES_SCORE_COLUMNS = [
"GEOID10",
"State Name",
"County Name",
"Total population",
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
# columns to round floats to 2 decimals
TILES_SCORE_FLOAT_COLUMNS = [
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line)",
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
]
TILES_ROUND_NUM_DECIMALS = 2
DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_BASIC = [
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Poverty (Less than 200% of federal poverty line)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
"Respiratory hazard index",
"Diesel particulate matter",
"Particulate matter (PM2.5)",
"Traffic proximity and volume",
"Proximity to RMP sites",
"Wastewater discharge",
"Percent pre-1960s housing (lead paint indicator)",
"Total population",
]
# For every indicator above, we want to include percentile and min-max normalized variants also
DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_FULL = list(
pd.core.common.flatten(
[
[p, f"{p} (percentile)", f"{p} (min-max normalized)"]
for p in DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_BASIC
]
)
)
# Finally we augment with the GEOID10, county, and state
DOWNLOADABLE_SCORE_COLUMNS = [
"GEOID10",
"County Name",
"State Name",
*DOWNLOADABLE_SCORE_INDICATOR_COLUMNS_FULL,
]

View file

@ -6,6 +6,8 @@ import pandas as pd
from data_pipeline.etl.base import ExtractTransformLoad
from data_pipeline.utils import get_module_logger, get_zip_info
from . import constants
## zlib is not available on all systems
try:
import zlib # noqa # pylint: disable=unused-import
@ -25,108 +27,19 @@ class PostScoreETL(ExtractTransformLoad):
"""
def __init__(self):
self.CENSUS_COUNTIES_ZIP_URL = "https://www2.census.gov/geo/docs/maps-data/data/gazetteer/Gaz_counties_national.zip"
self.CENSUS_COUNTIES_TXT = self.TMP_PATH / "Gaz_counties_national.txt"
self.CENSUS_COUNTIES_COLS = ["USPS", "GEOID", "NAME"]
self.CENSUS_USA_CSV = self.DATA_PATH / "census" / "csv" / "us.csv"
self.SCORE_CSV_PATH = self.DATA_PATH / "score" / "csv"
self.DOWNLOADABLE_INFO_PATH = self.DATA_PATH / "score" / "downloadable"
self.input_counties_df: pd.DataFrame
self.input_states_df: pd.DataFrame
self.input_score_df: pd.DataFrame
self.input_national_cbg_df: pd.DataFrame
self.STATE_CSV = (
self.DATA_PATH / "census" / "csv" / "fips_states_2010.csv"
)
self.FULL_SCORE_CSV = self.SCORE_CSV_PATH / "full" / "usa.csv"
self.FULL_SCORE_CSV_PLUS_COUNTIES = (
self.SCORE_CSV_PATH / "full" / "usa_counties.csv"
)
self.TILES_SCORE_COLUMNS = [
"GEOID10",
"State Name",
"County Name",
"Total population",
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
self.TILES_SCORE_CSV_PATH = self.SCORE_CSV_PATH / "tiles"
self.TILES_SCORE_CSV = self.TILES_SCORE_CSV_PATH / "usa.csv"
# columns to round floats to 2 decimals
self.TILES_SCORE_FLOAT_COLUMNS = [
"Score D (percentile)",
"Score D (top 25th percentile)",
"Score E (percentile)",
"Score E (top 25th percentile)",
"Poverty (Less than 200% of federal poverty line) (percentile)",
"Percent individuals age 25 or over with less than high school degree (percentile)",
"Linguistic isolation (percent) (percentile)",
"Unemployed civilians (percent) (percentile)",
"Housing burden (percent) (percentile)",
]
self.TILES_ROUND_NUM_DECIMALS = 2
self.DOWNLOADABLE_SCORE_INDICATORS_BASIC = [
"Percent individuals age 25 or over with less than high school degree",
"Linguistic isolation (percent)",
"Poverty (Less than 200% of federal poverty line)",
"Unemployed civilians (percent)",
"Housing burden (percent)",
"Respiratory hazard index",
"Diesel particulate matter",
"Particulate matter (PM2.5)",
"Traffic proximity and volume",
"Proximity to RMP sites",
"Wastewater discharge",
"Percent pre-1960s housing (lead paint indicator)",
"Total population",
]
# For every indicator above, we want to include percentile and min-max normalized variants also
self.DOWNLOADABLE_SCORE_INDICATORS_FULL = list(
pd.core.common.flatten(
[
[p, f"{p} (percentile)", f"{p} (min-max normalized)"]
for p in self.DOWNLOADABLE_SCORE_INDICATORS_BASIC
]
)
)
# Finally we augment with the GEOID10, county, and state
self.DOWNLOADABLE_SCORE_COLUMNS = [
"GEOID10",
"County Name",
"State Name",
*self.DOWNLOADABLE_SCORE_INDICATORS_FULL,
]
self.DOWNLOADABLE_SCORE_CSV = self.DOWNLOADABLE_INFO_PATH / "usa.csv"
self.DOWNLOADABLE_SCORE_EXCEL = self.DOWNLOADABLE_INFO_PATH / "usa.xlsx"
self.DOWNLOADABLE_SCORE_ZIP = (
self.DOWNLOADABLE_INFO_PATH / "Screening Tool Data.zip"
)
self.counties_df: pd.DataFrame
self.states_df: pd.DataFrame
self.score_df: pd.DataFrame
self.score_county_state_merged: pd.DataFrame
self.score_for_tiles: pd.DataFrame
def extract(self) -> None:
super().extract(
self.CENSUS_COUNTIES_ZIP_URL,
self.TMP_PATH,
)
self.output_score_county_state_merged_df: pd.DataFrame
self.output_score_tiles_df: pd.DataFrame
self.output_downloadable_df: pd.DataFrame
def _extract_counties(self, county_path: Path) -> pd.DataFrame:
logger.info("Reading Counties CSV")
self.counties_df = pd.read_csv(
self.CENSUS_COUNTIES_TXT,
return pd.read_csv(
county_path,
sep="\t",
dtype={
"GEOID": "string",
@ -136,134 +49,213 @@ class PostScoreETL(ExtractTransformLoad):
encoding="latin-1",
)
def _extract_states(self, state_path: Path) -> pd.DataFrame:
logger.info("Reading States CSV")
self.states_df = pd.read_csv(
self.STATE_CSV, dtype={"fips": "string", "state_code": "string"}
return pd.read_csv(
state_path, dtype={"fips": "string", "state_abbreviation": "string"}
)
self.score_df = pd.read_csv(
self.FULL_SCORE_CSV,
def _extract_score(self, score_path: Path) -> pd.DataFrame:
logger.info("Reading Score CSV")
return pd.read_csv(
score_path,
dtype={"GEOID10": "string", "Total population": "int64"},
)
def transform(self) -> None:
logger.info("Transforming data sources for Score + County CSV")
# rename some of the columns to prepare for merge
self.counties_df = self.counties_df[["USPS", "GEOID", "NAME"]]
self.counties_df.rename(
columns={"USPS": "State Abbreviation", "NAME": "County Name"},
inplace=True,
)
# remove unnecessary columns
self.states_df.rename(
columns={
"fips": "State Code",
"state_name": "State Name",
"state_abbreviation": "State Abbreviation",
},
inplace=True,
)
self.states_df.drop(["region", "division"], axis=1, inplace=True)
# add the tract level column
self.score_df["GEOID"] = self.score_df.GEOID10.str[:5]
# merge state with counties
county_state_merged = self.counties_df.merge(
self.states_df, on="State Abbreviation", how="left"
)
# merge state + county with score
self.score_county_state_merged = self.score_df.merge(
county_state_merged, on="GEOID", how="left"
)
# check if there are census cbgs without score
logger.info("Removing CBG rows without score")
## load cbgs
cbg_usa_df = pd.read_csv(
self.CENSUS_USA_CSV,
def _extract_national_cbg(self, national_cbg_path: Path) -> pd.DataFrame:
logger.info("Reading national CBG")
return pd.read_csv(
national_cbg_path,
names=["GEOID10"],
dtype={"GEOID10": "string"},
low_memory=False,
header=None,
)
def extract(self) -> None:
logger.info("Starting Extraction")
super().extract(
constants.CENSUS_COUNTIES_ZIP_URL,
constants.TMP_PATH,
)
self.input_counties_df = self._extract_counties(
constants.CENSUS_COUNTIES_FILE_NAME
)
self.input_states_df = self._extract_states(
constants.DATA_CENSUS_CSV_STATE_FILE_PATH
)
self.input_score_df = self._extract_score(
constants.DATA_SCORE_CSV_FULL_FILE_PATH
)
self.input_national_cbg_df = self._extract_national_cbg(
constants.DATA_CENSUS_CSV_FILE_PATH
)
def _transform_counties(self, initial_counties_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the counties dataframe
"""
# Rename some of the columns to prepare for merge
new_df = initial_counties_df[constants.CENSUS_COUNTIES_COLUMNS]
new_df.rename(
columns={"USPS": "State Abbreviation", "NAME": "County Name"}, inplace=True
)
return new_df
def _transform_states(self, initial_states_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the states dataframe
"""
# remove unnecessary columns
new_df = initial_states_df.rename(
columns={
"fips": "State Code",
"state_name": "State Name",
"state_abbreviation": "State Abbreviation",
}
)
new_df.drop(["region", "division"], axis=1, inplace=True)
return new_df
def _transform_score(self, initial_score_df: pd.DataFrame) -> pd.DataFrame:
"""
Necessary modifications to the score dataframe
"""
# Add the tract level column
new_df = initial_score_df.copy()
new_df["GEOID"] = initial_score_df.GEOID10.str[:5]
return new_df
def _create_score_data(
self,
national_cbg_df: pd.DataFrame,
counties_df: pd.DataFrame,
states_df: pd.DataFrame,
score_df: pd.DataFrame,
) -> pd.DataFrame:
# merge state with counties
logger.info("Merging state with county info")
county_state_merged = counties_df.merge(
states_df, on="State Abbreviation", how="left"
)
# merge state + county with score
score_county_state_merged = score_df.merge(
county_state_merged, on="GEOID", how="left"
)
# check if there are census cbgs without score
logger.info("Removing CBG rows without score")
# merge census cbgs with score
merged_df = cbg_usa_df.merge(
self.score_county_state_merged,
on="GEOID10",
how="left",
merged_df = national_cbg_df.merge(
score_county_state_merged, on="GEOID10", how="left"
)
# recast population to integer
merged_df["Total population"] = (
score_county_state_merged["Total population"] = (
merged_df["Total population"].fillna(0.0).astype(int)
)
# list the null score cbgs
null_cbg_df = merged_df[merged_df["Score E (percentile)"].isnull()]
# subsctract data sets
# subtract data sets
# this follows the XOR pattern outlined here:
# https://stackoverflow.com/a/37313953
removed_df = pd.concat(
de_duplicated_df = pd.concat(
[merged_df, null_cbg_df, null_cbg_df]
).drop_duplicates(keep=False)
# set the score to the new df
self.score_county_state_merged = removed_df
def _save_full_csv(self):
logger.info("Saving Full Score CSV with County Information")
self.SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
self.score_county_state_merged.to_csv(
self.FULL_SCORE_CSV_PLUS_COUNTIES, index=False
)
def _save_tile_csv(self):
logger.info("Saving Tile Score CSV")
score_tiles = self.score_county_state_merged[self.TILES_SCORE_COLUMNS]
return de_duplicated_df
def _create_tile_data(
self, score_county_state_merged_df: pd.DataFrame
) -> pd.DataFrame:
score_tiles = score_county_state_merged_df[constants.TILES_SCORE_COLUMNS]
decimals = pd.Series(
[self.TILES_ROUND_NUM_DECIMALS]
* len(self.TILES_SCORE_FLOAT_COLUMNS),
index=self.TILES_SCORE_FLOAT_COLUMNS,
[constants.TILES_ROUND_NUM_DECIMALS]
* len(constants.TILES_SCORE_FLOAT_COLUMNS),
index=constants.TILES_SCORE_FLOAT_COLUMNS,
)
score_tiles = score_tiles.round(decimals)
return score_tiles.round(decimals)
self.TILES_SCORE_CSV_PATH.mkdir(parents=True, exist_ok=True)
score_tiles.to_csv(self.TILES_SCORE_CSV, index=False)
def _create_downloadable_data(
self, score_county_state_merged_df: pd.DataFrame
) -> pd.DataFrame:
return score_county_state_merged_df[constants.DOWNLOADABLE_SCORE_COLUMNS]
def _save_downloadable_zip(self):
def transform(self) -> None:
logger.info("Transforming data sources for Score + County CSV")
transformed_counties = self._transform_counties(self.input_counties_df)
transformed_states = self._transform_states(self.input_states_df)
transformed_score = self._transform_score(self.input_score_df)
output_score_county_state_merged_df = self._create_score_data(
self.input_national_cbg_df,
transformed_counties,
transformed_states,
transformed_score,
)
self.output_score_tiles_df = self._create_tile_data(
output_score_county_state_merged_df
)
self.output_downloadable_df = self._create_downloadable_data(
output_score_county_state_merged_df
)
self.output_score_county_state_merged_df = output_score_county_state_merged_df
def _load_score_csv(
self, score_county_state_merged: pd.DataFrame, score_csv_path: Path
) -> None:
logger.info("Saving Full Score CSV with County Information")
score_csv_path.parent.mkdir(parents=True, exist_ok=True)
score_county_state_merged.to_csv(score_csv_path, index=False)
def _load_tile_csv(
self, score_tiles_df: pd.DataFrame, tile_score_path: Path
) -> None:
logger.info("Saving Tile Score CSV")
# TODO: check which are the columns we'll use
# Related to: https://github.com/usds/justice40-tool/issues/302
tile_score_path.mkdir(parents=True, exist_ok=True)
score_tiles_df.to_csv(tile_score_path, index=False)
def _load_downloadable_zip(
self, downloadable_df: pd.DataFrame, downloadable_info_path: Path
) -> None:
logger.info("Saving Downloadable CSV")
logger.info(list(self.score_county_state_merged.columns))
logger.info(self.DOWNLOADABLE_SCORE_COLUMNS)
downloadable_tiles = self.score_county_state_merged[
self.DOWNLOADABLE_SCORE_COLUMNS
]
self.DOWNLOADABLE_INFO_PATH.mkdir(parents=True, exist_ok=True)
downloadable_info_path.mkdir(parents=True, exist_ok=True)
csv_path = downloadable_info_path / "usa.csv"
excel_path = downloadable_info_path / "usa.xlsx"
zip_path = downloadable_info_path / "Screening Tool Data.zip"
logger.info("Writing downloadable csv")
downloadable_tiles.to_csv(self.DOWNLOADABLE_SCORE_CSV, index=False)
downloadable_df.to_csv(csv_path, index=False)
logger.info("Writing downloadable excel")
downloadable_tiles.to_excel(self.DOWNLOADABLE_SCORE_EXCEL, index=False)
downloadable_df.to_excel(excel_path, index=False)
logger.info("Compressing files")
files_to_compress = [
self.DOWNLOADABLE_SCORE_CSV,
self.DOWNLOADABLE_SCORE_EXCEL,
]
with zipfile.ZipFile(self.DOWNLOADABLE_SCORE_ZIP, "w") as zf:
files_to_compress = [csv_path, excel_path]
with zipfile.ZipFile(zip_path, "w") as zf:
for f in files_to_compress:
zf.write(f, arcname=Path(f).name, compress_type=compression)
zip_info = get_zip_info(self.DOWNLOADABLE_SCORE_ZIP)
zip_info = get_zip_info(zip_path)
logger.info(json.dumps(zip_info, indent=4, sort_keys=True, default=str))
def load(self) -> None:
self._save_full_csv()
self._save_tile_csv()
self._save_downloadable_zip()
self._load_score_csv(
self.output_score_county_state_merged_df,
constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH,
)
self._load_tile_csv(
self.output_score_tiles_df, constants.DATA_SCORE_TILES_FILE_PATH
)
self._load_downloadable_zip(
self.output_downloadable_df, constants.SCORE_DOWNLOADABLE_DIR
)

View file

@ -0,0 +1,118 @@
import os
from importlib import reload
from pathlib import Path
import pandas as pd
import pytest
from data_pipeline import config
from data_pipeline.etl.score import etl_score_post, tests
from data_pipeline.etl.score.etl_score_post import PostScoreETL
def pytest_configure():
pytest.SNAPSHOT_DIR = Path(__file__).parent / "snapshots"
@pytest.fixture(scope="session")
def root(tmp_path_factory):
basetemp = Path.cwd() / "temp_dir"
os.environ["PYTEST_DEBUG_TEMPROOT"] = str(
basetemp
) # this sets the location of the temp directory inside the project folder
basetemp.mkdir(parents=True, exist_ok=True)
root = tmp_path_factory.mktemp("root", numbered=False)
return root
@pytest.fixture(autouse=True)
def settings_override(monkeypatch, root):
reload(config)
monkeypatch.setattr(config.settings, "APP_ROOT", root)
return config.settings
@pytest.fixture()
def etl(monkeypatch, root):
reload(etl_score_post)
tmp_path = root / "tmp"
tmp_path.mkdir(parents=True, exist_ok=True)
etl = PostScoreETL()
monkeypatch.setattr(etl, "DATA_PATH", root)
monkeypatch.setattr(etl, "TMP_PATH", tmp_path)
return etl
@pytest.fixture(scope="session")
def sample_data_dir():
base_dir = Path(tests.__file__).resolve().parent
return base_dir / "sample_data"
@pytest.fixture()
def county_data_initial(sample_data_dir):
return sample_data_dir / "county_data_initial.csv"
@pytest.fixture()
def state_data_initial(sample_data_dir):
return sample_data_dir / "state_data_initial.csv"
@pytest.fixture()
def score_data_initial(sample_data_dir):
return sample_data_dir / "score_data_initial.csv"
@pytest.fixture()
def counties_transformed_expected():
return pd.DataFrame.from_dict(
data={
"State Abbreviation": pd.Series(["AL", "AL"], dtype="string"),
"GEOID": pd.Series(["01001", "01003"], dtype="string"),
"County Name": pd.Series(
["AutaugaCounty", "BaldwinCounty"], dtype="object"
),
},
)
@pytest.fixture()
def states_transformed_expected():
return pd.DataFrame.from_dict(
data={
"State Code": pd.Series(["01", "02", "04"], dtype="string"),
"State Name": pd.Series(["Alabama", "Alaska", "Arizona"], dtype="object"),
"State Abbreviation": pd.Series(["AL", "AK", "AZ"], dtype="string"),
},
)
@pytest.fixture()
def score_transformed_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "score_transformed_expected.pkl")
@pytest.fixture()
def national_cbg_df():
return pd.DataFrame.from_dict(
data={
"GEOID10": pd.Series(["010010201001", "010010201002"], dtype="string"),
},
)
@pytest.fixture()
def score_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "score_data_expected.pkl")
@pytest.fixture()
def tile_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "tile_data_expected.pkl")
@pytest.fixture()
def downloadable_data_expected():
return pd.read_pickle(pytest.SNAPSHOT_DIR / "downloadable_data_expected.pkl")

View file

@ -0,0 +1,3 @@
USPS GEOID ANSICODE NAME POP10 HU10 ALAND AWATER ALAND_SQMI AWATER_SQMI INTPTLAT INTPTLONG
AL 01001 00161526 AutaugaCounty 54571 22135 1539582278 25775735 594.436 9.952 32.536382 -86.644490
AL 01003 00161527 BaldwinCounty 182265 104061 4117521611 1133190229 1589.784 437.527 30.659218 -87.746067

View file

@ -0,0 +1,3 @@
GEOID10,Housing burden (percent),Total population,Air toxics cancer risk,Respiratory hazard index,Diesel particulate matter,Particulate matter (PM2.5),Ozone,Traffic proximity and volume,Proximity to RMP sites,Proximity to TSDF sites,Proximity to NPL sites,Wastewater discharge,Percent pre-1960s housing (lead paint indicator),Individuals under 5 years old,Individuals over 64 years old,Linguistic isolation (percent),Percent of households in linguistic isolation,Poverty (Less than 200% of federal poverty line),Percent individuals age 25 or over with less than high school degree,Unemployed civilians (percent),Housing + Transportation Costs % Income for the Regional Typical Household,GEOID10 (percentile),Housing burden (percent) (percentile),Total population (percentile),Air toxics cancer risk (percentile),Respiratory hazard index (percentile),Diesel particulate matter (percentile),Particulate matter (PM2.5) (percentile),Ozone (percentile),Traffic proximity and volume (percentile),Proximity to RMP sites (percentile),Proximity to TSDF sites (percentile),Proximity to NPL sites (percentile),Wastewater discharge (percentile),Percent pre-1960s housing (lead paint indicator) (percentile),Individuals under 5 years old (percentile),Individuals over 64 years old (percentile),Linguistic isolation (percent) (percentile),Percent of households in linguistic isolation (percentile),Poverty (Less than 200% of federal poverty line) (percentile),Percent individuals age 25 or over with less than high school degree (percentile),Unemployed civilians (percent) (percentile),Housing + Transportation Costs % Income for the Regional Typical Household (percentile),Housing burden (percent) (min-max normalized),Total population (min-max normalized),Air toxics cancer risk (min-max normalized),Respiratory hazard index (min-max normalized),Diesel particulate matter (min-max normalized),Particulate matter (PM2.5) (min-max normalized),Ozone (min-max normalized),Traffic proximity and volume (min-max normalized),Proximity to RMP sites (min-max normalized),Proximity to TSDF sites (min-max normalized),Proximity to NPL sites (min-max normalized),Wastewater discharge (min-max normalized),Percent pre-1960s housing (lead paint indicator) (min-max normalized),Individuals under 5 years old (min-max normalized),Individuals over 64 years old (min-max normalized),Linguistic isolation (percent) (min-max normalized),Percent of households in linguistic isolation (min-max normalized),Poverty (Less than 200% of federal poverty line) (min-max normalized),Percent individuals age 25 or over with less than high school degree (min-max normalized),Unemployed civilians (percent) (min-max normalized),Housing + Transportation Costs % Income for the Regional Typical Household (min-max normalized),Score A,Score B,Socioeconomic Factors,Sensitive populations,Environmental effects,Exposures,Pollution Burden,Population Characteristics,Score C,Score D,Score E,Score A (percentile),Score A (top 25th percentile),Score A (top 30th percentile),Score A (top 35th percentile),Score A (top 40th percentile),Score B (percentile),Score B (top 25th percentile),Score B (top 30th percentile),Score B (top 35th percentile),Score B (top 40th percentile),Score C (percentile),Score C (top 25th percentile),Score C (top 30th percentile),Score C (top 35th percentile),Score C (top 40th percentile),Score D (percentile),Score D (top 25th percentile),Score D (top 30th percentile),Score D (top 35th percentile),Score D (top 40th percentile),Score E (percentile),Score E (top 25th 
percentile),Score E (top 30th percentile),Score E (top 35th percentile),Score E (top 40th percentile),Poverty (Less than 200% of federal poverty line) (top 25th percentile),Poverty (Less than 200% of federal poverty line) (top 30th percentile),Poverty (Less than 200% of federal poverty line) (top 35th percentile),Poverty (Less than 200% of federal poverty line) (top 40th percentile)
010010201001,0.15,692,49.3770316066,0.788051737456,0.2786630687,9.99813169399,40.1217287582,91.0159000855,0.0852006888915,0.0655778245369,0.0709415490545,0.0,0.29,0.0491329479769,0.0953757225434,0.0,0.04,0.293352601156,0.195011337868,0.028125,55.0,4.53858477849437e-06,0.15696279879978475,0.12089201345236528,0.9797143208291796,0.9829416396964773,0.34627219635208273,0.9086451463612172,0.28414902233020944,0.3410837232734089,0.13480504509083976,0.13460988594536452,0.5500810137382961,0.18238709002315753,0.5188510118774764,0.4494787435381899,0.25320991408459015,0.2596066814778244,0.7027453899325112,0.46606500161119757,0.7623733167523703,0.3628393561824028,0.5794871072813119,0.10909090909090909,0.013340530536705737,0.028853697167088285,0.18277886087526787,0.045859591901569303,0.5883290826337872,0.3121515260630353,0.0024222132770710053,0.004621252164336263,0.00015416214761450488,0.007893014211979786,0.0,0.29,0.09433526011570838,0.0953757225434,0.0,0.04,0.293352601156,0.195011337868,0.028125,0.2711864406779661,0.6142191591817839,0.3553155211005275,0.5747020343519587,0.3207651130335348,0.3041468093350269,0.640467674807096,0.5283607196497396,0.4477335736927467,0.23656483320764937,0.12511596962298183,0.4015694309647159,0.6357808408182161,False,False,False,True,0.6315486105122701,False,False,False,True,0.5104500914524833,False,False,False,False,0.44267994354000534,False,False,False,False,0.3517176274094212,False,False,False,False,False,False,False,False
010010201002,0.15,1153,49.3770316066,0.788051737456,0.2786630687,9.99813169399,40.1217287582,2.61874365577,0.0737963352265,0.0604962870646,0.0643436665275,0.0,0.094623655914,0.0416305290546,0.150043365134,0.0,0.0,0.182133564614,0.039119804401,0.0287878787878787,57.0,9.07716955698874e-06,0.15696279879978475,0.42875102685480615,0.9797143208291796,0.9829416396964773,0.34627219635208273,0.9086451463612172,0.28414902233020944,0.09634507767787849,0.11004706512415299,0.1228504127842856,0.5178479846414291,0.18238709002315753,0.28270163797524656,0.3660890561105236,0.5188963977252613,0.2596066814778244,0.25592171848974055,0.2701365660159849,0.2207635715031339,0.3696173450745396,0.6379947997334159,0.10909090909090909,0.022227791486736582,0.028853697167088285,0.18277886087526787,0.045859591901569303,0.5883290826337872,0.3121515260630353,6.96928300032502e-05,0.004002684465613169,0.00014221633002379553,0.007158928457599425,0.0,0.094623655914,0.07993061578488315,0.150043365134,0.0,0.0,0.182133564614,0.039119804401,0.0287878787878787,0.2824858757062147,0.24545006875955938,0.05963631310728093,0.350886800163363,0.38153071177120307,0.2431668381096544,0.5996779005411742,0.4808408797306676,0.36620875596728303,0.17608814038438173,0.07182643137875756,0.2554173925742535,0.21102603786087423,False,False,False,False,0.2509565067420677,False,False,False,False,0.2850458170133389,False,False,False,False,0.16239056337452856,False,False,False,False,0.11055992520412285,False,False,False,False,False,False,False,False

View file

@ -0,0 +1,4 @@
fips,state_name,state_abbreviation,region,division
01,Alabama,AL,South,East South Central
02,Alaska,AK,West,Pacific
04,Arizona,AZ,West,Mountain

View file

@ -0,0 +1,116 @@
# pylint: disable=W0212
## Above disables warning about access to underscore-prefixed methods
from importlib import reload
import pandas.api.types as ptypes
import pandas.testing as pdt
from data_pipeline.etl.score import constants
# See conftest.py for all fixtures used in these tests
# Extract Tests
def test_extract_counties(etl, county_data_initial):
reload(constants)
extracted = etl._extract_counties(county_data_initial)
assert all(
ptypes.is_string_dtype(extracted[col])
for col in constants.CENSUS_COUNTIES_COLUMNS
)
def test_extract_states(etl, state_data_initial):
extracted = etl._extract_states(state_data_initial)
string_cols = ["fips", "state_abbreviation"]
assert all(ptypes.is_string_dtype(extracted[col]) for col in string_cols)
def test_extract_score(etl, score_data_initial):
extracted = etl._extract_score(score_data_initial)
string_cols = ["GEOID10"]
assert all(ptypes.is_string_dtype(extracted[col]) for col in string_cols)
# Transform Tests
def test_transform_counties(etl, county_data_initial, counties_transformed_expected):
extracted_counties = etl._extract_counties(county_data_initial)
counties_transformed_actual = etl._transform_counties(extracted_counties)
pdt.assert_frame_equal(counties_transformed_actual, counties_transformed_expected)
def test_transform_states(etl, state_data_initial, states_transformed_expected):
extracted_states = etl._extract_states(state_data_initial)
states_transformed_actual = etl._transform_states(extracted_states)
pdt.assert_frame_equal(states_transformed_actual, states_transformed_expected)
def test_transform_score(etl, score_data_initial, score_transformed_expected):
extracted_score = etl._extract_score(score_data_initial)
score_transformed_actual = etl._transform_score(extracted_score)
pdt.assert_frame_equal(
score_transformed_actual, score_transformed_expected, check_dtype=False
)
# pylint: disable=too-many-arguments
def test_create_score_data(
etl,
national_cbg_df,
counties_transformed_expected,
states_transformed_expected,
score_transformed_expected,
score_data_expected,
):
score_data_actual = etl._create_score_data(
national_cbg_df,
counties_transformed_expected,
states_transformed_expected,
score_transformed_expected,
)
pdt.assert_frame_equal(
score_data_actual,
score_data_expected,
)
def test_create_tile_data(etl, score_data_expected, tile_data_expected):
output_tiles_df_actual = etl._create_tile_data(score_data_expected)
pdt.assert_frame_equal(
output_tiles_df_actual,
tile_data_expected,
)
def test_create_downloadable_data(etl, score_data_expected, downloadable_data_expected):
output_downloadable_df_actual = etl._create_downloadable_data(score_data_expected)
pdt.assert_frame_equal(
output_downloadable_df_actual,
downloadable_data_expected,
)
def test_load_score_csv(etl, score_data_expected):
reload(constants)
etl._load_score_csv(
score_data_expected,
constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH,
)
assert constants.FULL_SCORE_CSV_FULL_PLUS_COUNTIES_FILE_PATH.is_file()
def test_load_tile_csv(etl, tile_data_expected):
reload(constants)
etl._load_score_csv(tile_data_expected, constants.DATA_SCORE_TILES_FILE_PATH)
assert constants.DATA_SCORE_TILES_FILE_PATH.is_file()
def test_load_downloadable_zip(etl, downloadable_data_expected):
reload(constants)
etl._load_downloadable_zip(
downloadable_data_expected, constants.SCORE_DOWNLOADABLE_DIR
)
assert constants.SCORE_DOWNLOADABLE_DIR.is_dir()
assert constants.SCORE_DOWNLOADABLE_CSV_FILE_PATH.is_file()
assert constants.SCORE_DOWNLOADABLE_EXCEL_FILE_PATH.is_file()
assert constants.SCORE_DOWNLOADABLE_ZIP_FILE_PATH.is_file()

View file

@ -59,14 +59,6 @@ typed-ast = {version = ">=1.4.0,<1.5", markers = "implementation_name == \"cpyth
typing-extensions = {version = ">=3.7.4", markers = "python_version < \"3.8\""}
wrapt = ">=1.11,<1.13"
[[package]]
name = "async-generator"
version = "1.10"
description = "Async generators and context managers for Python 3.5+"
category = "main"
optional = false
python-versions = ">=3.5"
[[package]]
name = "atomicwrites"
version = "1.4.0"
@ -449,7 +441,7 @@ argcomplete = {version = ">=1.12.3", markers = "python_version < \"3.8.0\""}
debugpy = ">=1.0.0,<2.0"
importlib-metadata = {version = "<5", markers = "python_version < \"3.8.0\""}
ipython = ">=7.23.1,<8.0"
jupyter-client = "<7.0"
jupyter-client = "<8.0"
matplotlib-inline = ">=0.1.0,<0.2.0"
tornado = ">=4.2,<7.0"
traitlets = ">=4.1.0,<6.0"
@ -891,14 +883,13 @@ python-versions = "*"
[[package]]
name = "nbclient"
version = "0.5.3"
version = "0.5.4"
description = "A client library for executing notebooks. Formerly nbconvert's ExecutePreprocessor."
category = "main"
optional = false
python-versions = ">=3.6.1"
[package.dependencies]
async-generator = "*"
jupyter-client = ">=6.1.5"
nbformat = ">=5.0"
nest-asyncio = "*"
@ -1026,7 +1017,7 @@ pyparsing = ">=2.0.2"
[[package]]
name = "pandas"
version = "1.3.1"
version = "1.3.2"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
@ -1185,7 +1176,7 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
[[package]]
name = "pygments"
version = "2.9.0"
version = "2.10.0"
description = "Pygments is a syntax highlighting package written in Python."
category = "main"
optional = false
@ -1263,6 +1254,20 @@ toml = "*"
[package.extras]
testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "requests", "xmlschema"]
[[package]]
name = "pytest-mock"
version = "3.6.1"
description = "Thin-wrapper around the mock package for easier use with pytest"
category = "dev"
optional = false
python-versions = ">=3.6"
[package.dependencies]
pytest = ">=5.0"
[package.extras]
dev = ["pre-commit", "tox", "pytest-asyncio"]
[[package]]
name = "python-dateutil"
version = "2.8.2"
@ -1342,11 +1347,11 @@ test = ["flaky", "pytest", "pytest-qt"]
[[package]]
name = "qtpy"
version = "1.9.0"
version = "1.10.0"
description = "Provides an abstraction layer on top of the various Qt bindings (PyQt5, PyQt4 and PySide) and additional custom QWidgets."
category = "main"
optional = false
python-versions = "*"
python-versions = ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*"
[[package]]
name = "regex"
@ -1703,10 +1708,6 @@ astroid = [
{file = "astroid-2.6.6-py3-none-any.whl", hash = "sha256:ab7f36e8a78b8e54a62028ba6beef7561db4cdb6f2a5009ecc44a6f42b5697ef"},
{file = "astroid-2.6.6.tar.gz", hash = "sha256:3975a0bd5373bdce166e60c851cfcbaf21ee96de80ec518c1f4cb3e94c3fb334"},
]
async-generator = [
{file = "async_generator-1.10-py3-none-any.whl", hash = "sha256:01c7bf666359b4967d2cda0000cc2e4af16a0ae098cbffcb8472fb9e8ad6585b"},
{file = "async_generator-1.10.tar.gz", hash = "sha256:6ebb3d106c12920aaae42ccb6f787ef5eefdcdd166ea3d628fa8476abe712144"},
]
atomicwrites = [
{file = "atomicwrites-1.4.0-py2.py3-none-any.whl", hash = "sha256:6d1784dea7c0c8d4a5172b6c620f40b6e4cbfdf96d783691f2e1302a7b88e197"},
{file = "atomicwrites-1.4.0.tar.gz", hash = "sha256:ae70396ad1a434f9c7046fd2dd196fc04b12f9e91ffb859164193be8b6168a7a"},
@ -2254,8 +2255,8 @@ mypy-extensions = [
{file = "mypy_extensions-0.4.3.tar.gz", hash = "sha256:2d82818f5bb3e369420cb3c4060a7970edba416647068eb4c5343488a6c604a8"},
]
nbclient = [
{file = "nbclient-0.5.3-py3-none-any.whl", hash = "sha256:e79437364a2376892b3f46bedbf9b444e5396cfb1bc366a472c37b48e9551500"},
{file = "nbclient-0.5.3.tar.gz", hash = "sha256:db17271330c68c8c88d46d72349e24c147bb6f34ec82d8481a8f025c4d26589c"},
{file = "nbclient-0.5.4-py3-none-any.whl", hash = "sha256:95a300c6fbe73721736cf13972a46d8d666f78794b832866ed7197a504269e11"},
{file = "nbclient-0.5.4.tar.gz", hash = "sha256:6c8ad36a28edad4562580847f9f1636fe5316a51a323ed85a24a4ad37d4aefce"},
]
nbconvert = [
{file = "nbconvert-6.1.0-py3-none-any.whl", hash = "sha256:37cd92ff2ae6a268e62075ff8b16129e0be4939c4dfcee53dc77cc8a7e06c684"},
@ -2312,25 +2313,25 @@ packaging = [
{file = "packaging-21.0.tar.gz", hash = "sha256:7dc96269f53a4ccec5c0670940a4281106dd0bb343f47b7471f779df49c2fbe7"},
]
pandas = [
{file = "pandas-1.3.1-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:1ee8418d0f936ff2216513aa03e199657eceb67690995d427a4a7ecd2e68f442"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5d9acfca191140a518779d1095036d842d5e5bc8e8ad8b5eaad1aff90fe1870d"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e323028ab192fcfe1e8999c012a0fa96d066453bb354c7e7a4a267b25e73d3c8"},
{file = "pandas-1.3.1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9d06661c6eb741ae633ee1c57e8c432bb4203024e263fe1a077fa3fda7817fdb"},
{file = "pandas-1.3.1-cp37-cp37m-win32.whl", hash = "sha256:23c7452771501254d2ae23e9e9dac88417de7e6eff3ce64ee494bb94dc88c300"},
{file = "pandas-1.3.1-cp37-cp37m-win_amd64.whl", hash = "sha256:7150039e78a81eddd9f5a05363a11cadf90a4968aac6f086fd83e66cf1c8d1d6"},
{file = "pandas-1.3.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:5c09a2538f0fddf3895070579082089ff4ae52b6cb176d8ec7a4dacf7e3676c1"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:905fc3e0fcd86b0a9f1f97abee7d36894698d2592b22b859f08ea5a8fe3d3aab"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5ee927c70794e875a59796fab8047098aa59787b1be680717c141cd7873818ae"},
{file = "pandas-1.3.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0c976e023ed580e60a82ccebdca8e1cc24d8b1fbb28175eb6521025c127dab66"},
{file = "pandas-1.3.1-cp38-cp38-win32.whl", hash = "sha256:22f3fcc129fb482ef44e7df2a594f0bd514ac45aabe50da1a10709de1b0f9d84"},
{file = "pandas-1.3.1-cp38-cp38-win_amd64.whl", hash = "sha256:45656cd59ae9745a1a21271a62001df58342b59c66d50754390066db500a8362"},
{file = "pandas-1.3.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:114c6789d15862508900a25cb4cb51820bfdd8595ea306bab3b53cd19f990b65"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:527c43311894aff131dea99cf418cd723bfd4f0bcf3c3da460f3b57e52a64da5"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fdb3b33dde260b1766ea4d3c6b8fbf6799cee18d50a2a8bc534cf3550b7c819a"},
{file = "pandas-1.3.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c28760932283d2c9f6fa5e53d2f77a514163b9e67fd0ee0879081be612567195"},
{file = "pandas-1.3.1-cp39-cp39-win32.whl", hash = "sha256:be12d77f7e03c40a2466ed00ccd1a5f20a574d3c622fe1516037faa31aa448aa"},
{file = "pandas-1.3.1-cp39-cp39-win_amd64.whl", hash = "sha256:9e1fe6722cbe27eb5891c1977bca62d456c19935352eea64d33956db46139364"},
{file = "pandas-1.3.1.tar.gz", hash = "sha256:341935a594db24f3ff07d1b34d1d231786aa9adfa84b76eab10bf42907c8aed3"},
{file = "pandas-1.3.2-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:ba7ceb8abc6dbdb1e34612d1173d61e4941f1a1eb7e6f703b2633134ae6a6c89"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fcb71b1935249de80e3a808227189eee381d4d74a31760ced2df21eedc92a8e3"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fa54dc1d3e5d004a09ab0b1751473698011ddf03e14f1f59b84ad9a6ac630975"},
{file = "pandas-1.3.2-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:34ced9ce5d5b17b556486da7256961b55b471d64a8990b56e67a84ebeb259416"},
{file = "pandas-1.3.2-cp37-cp37m-win32.whl", hash = "sha256:a56246de744baf646d1f3e050c4653d632bc9cd2e0605f41051fea59980e880a"},
{file = "pandas-1.3.2-cp37-cp37m-win_amd64.whl", hash = "sha256:53b17e4debba26b7446b1e4795c19f94f0c715e288e08145e44bdd2865e819b3"},
{file = "pandas-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:f07a9745ca075ae73a5ce116f5e58f691c0dc9de0bff163527858459df5c176f"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c9e8e0ce5284ebebe110efd652c164ed6eab77f5de4c3533abc756302ee77765"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:59a78d7066d1c921a77e3306aa0ebf6e55396c097d5dfcc4df8defe3dcecb735"},
{file = "pandas-1.3.2-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:132def05e73d292c949b02e7ef873debb77acc44a8b119d215921046f0c3a91d"},
{file = "pandas-1.3.2-cp38-cp38-win32.whl", hash = "sha256:69e1b2f5811f46827722fd641fdaeedb26002bd1e504eacc7a8ec36bdc25393e"},
{file = "pandas-1.3.2-cp38-cp38-win_amd64.whl", hash = "sha256:7996d311413379136baf0f3cf2a10e331697657c87ced3f17ac7c77f77fe34a3"},
{file = "pandas-1.3.2-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:1738154049062156429a5cf2fd79a69c9f3fa4f231346a7ec6fd156cd1a9a621"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9cce01f6d655b4add966fcd36c32c5d1fe84628e200626b3f5e2f40db2d16a0f"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1099e2a0cd3a01ec62cca183fc1555833a2d43764950ef8cb5948c8abfc51014"},
{file = "pandas-1.3.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0cd5776be891331a3e6b425b5abeab9596abea18435c5982191356f9b24ae731"},
{file = "pandas-1.3.2-cp39-cp39-win32.whl", hash = "sha256:66a95361b81b4ba04b699ecd2416b0591f40cd1e24c60a8bfe0d19009cfa575a"},
{file = "pandas-1.3.2-cp39-cp39-win_amd64.whl", hash = "sha256:89f40e5d21814192802421df809f948247d39ffe171e45fe2ab4abf7bd4279d8"},
{file = "pandas-1.3.2.tar.gz", hash = "sha256:cbcb84d63867af3411fa063af3de64902665bb5b3d40b25b2059e40603594e87"},
]
pandocfilters = [
{file = "pandocfilters-1.4.3.tar.gz", hash = "sha256:bc63fbb50534b4b1f8ebe1860889289e8af94a23bff7445259592df25a3906eb"},
@ -2443,8 +2444,8 @@ pyflakes = [
{file = "pyflakes-2.3.1.tar.gz", hash = "sha256:f5bc8ecabc05bb9d291eb5203d6810b49040f6ff446a756326104746cc00c1db"},
]
pygments = [
{file = "Pygments-2.9.0-py3-none-any.whl", hash = "sha256:d66e804411278594d764fc69ec36ec13d9ae9147193a1740cd34d272ca383b8e"},
{file = "Pygments-2.9.0.tar.gz", hash = "sha256:a18f47b506a429f6f4b9df81bb02beab9ca21d0a5fee38ed15aef65f0545519f"},
{file = "Pygments-2.10.0-py3-none-any.whl", hash = "sha256:b8e67fe6af78f492b3c4b3e2970c0624cbf08beb1e493b2c99b9fa1b67a20380"},
{file = "Pygments-2.10.0.tar.gz", hash = "sha256:f398865f7eb6874156579fdf36bc840a03cab64d1cde9e93d68f46a425ec52c6"},
]
pylint = [
{file = "pylint-2.9.6-py3-none-any.whl", hash = "sha256:2e1a0eb2e8ab41d6b5dbada87f066492bb1557b12b76c47c2ee8aa8a11186594"},
@ -2506,6 +2507,10 @@ pytest = [
{file = "pytest-6.2.4-py3-none-any.whl", hash = "sha256:91ef2131a9bd6be8f76f1f08eac5c5317221d6ad1e143ae03894b862e8976890"},
{file = "pytest-6.2.4.tar.gz", hash = "sha256:50bcad0a0b9c5a72c8e4e7c9855a3ad496ca6a881a3641b4260605450772c54b"},
]
pytest-mock = [
{file = "pytest-mock-3.6.1.tar.gz", hash = "sha256:40217a058c52a63f1042f0784f62009e976ba824c418cced42e88d5f40ab0e62"},
{file = "pytest_mock-3.6.1-py3-none-any.whl", hash = "sha256:30c2f2cc9759e76eee674b81ea28c9f0b94f8f0445a1b87762cadf774f0df7e3"},
]
python-dateutil = [
{file = "python-dateutil-2.8.2.tar.gz", hash = "sha256:0123cacc1627ae19ddf3c27a5de5bd67ee4586fbdd6440d9748f8abb483d3e86"},
{file = "python_dateutil-2.8.2-py2.py3-none-any.whl", hash = "sha256:961d03dc3453ebbc59dbdea9e4e11c5651520a876d0f4db161e8674aae935da9"},
@ -2565,13 +2570,6 @@ pyyaml = [
{file = "PyYAML-5.4.1.tar.gz", hash = "sha256:607774cbba28732bfa802b54baa7484215f530991055bb562efbed5b2f20a45e"},
]
pyzmq = [
{file = "pyzmq-22.2.1-cp310-cp310-macosx_10_15_universal2.whl", hash = "sha256:d60a407663b7c2af781ab7f49d94a3d379dd148bb69ea8d9dd5bc69adf18097c"},
{file = "pyzmq-22.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:631f932fb1fa4b76f31adf976f8056519bc6208a3c24c184581c3dd5be15066e"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:0471d634c7fe48ff7d3849798da6c16afc71676dd890b5ae08eb1efe735c6fec"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:f520e9fee5d7a2e09b051d924f85b977c6b4e224e56c0551c3c241bbeeb0ad8d"},
{file = "pyzmq-22.2.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c1b6619ceb33a8907f1cb82ff8afc8a133e7a5f16df29528e919734718600426"},
{file = "pyzmq-22.2.1-cp310-cp310-win32.whl", hash = "sha256:31c5dfb6df5148789835128768c01bf6402eb753d06f524f12f6786caf96fb44"},
{file = "pyzmq-22.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:4842a8263cbaba6fce401bbe4e2b125321c401a01714e42624dabc554bfc2629"},
{file = "pyzmq-22.2.1-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:b921758f8b5098faa85f341bbdd5e36d5339de5e9032ca2b07d8c8e7bec5069b"},
{file = "pyzmq-22.2.1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:240b83b3a8175b2f616f80092cbb019fcd5c18598f78ffc6aa0ae9034b300f14"},
{file = "pyzmq-22.2.1-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:da7f7f3bb08bcf59a6b60b4e53dd8f08bb00c9e61045319d825a906dbb3c8fb7"},
@ -2608,8 +2606,8 @@ qtconsole = [
{file = "qtconsole-5.1.1.tar.gz", hash = "sha256:bbc34bca14f65535afcb401bc74b752bac955e5313001ba640383f7e5857dc49"},
]
qtpy = [
{file = "QtPy-1.9.0-py2.py3-none-any.whl", hash = "sha256:fa0b8363b363e89b2a6f49eddc162a04c0699ae95e109a6be3bb145a913190ea"},
{file = "QtPy-1.9.0.tar.gz", hash = "sha256:2db72c44b55d0fe1407be8fba35c838ad0d6d3bb81f23007886dc1fc0f459c8d"},
{file = "QtPy-1.10.0-py2.py3-none-any.whl", hash = "sha256:f683ce6cd825ba8248a798bf1dfa1a07aca387c88ae44fa5479537490aace7be"},
{file = "QtPy-1.10.0.tar.gz", hash = "sha256:3d20f010caa3b2c04835d6a2f66f8873b041bdaf7a76085c2a0d7890cdd65ea9"},
]
regex = [
{file = "regex-2021.8.3-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:8764a78c5464ac6bde91a8c87dd718c27c1cabb7ed2b4beaf36d3e8e390567f9"},

View file

@ -33,6 +33,7 @@ pylint = "^2.9.6"
pytest = "^6.2.4"
safety = "^1.10.3"
tox = "^3.24.0"
pytest-mock = "^3.6.1"
[build-system]
build-backend = "poetry.core.masonry.api"