Data Unit Tests (#509)
* Fixes #341 - As a J40 developer, I want to write Unit Tests for the ETL files, so that tests are run on each commit
* Location bug
* Adding Load tests
* Fixing XLSX filename
* Adding downloadable zip test
* updating pickle
* Fixing pylint warnings
* Update readme to correct some typos and reorganize test content structure
* Removing unused schemas file, adding details to readme around pickles, per PR feedback
* Update test to pass with Score D added to score file; update path in readme
* fix requirements.txt after merge
* fix poetry.lock after merge

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
parent 88c8209bb0
commit 536a35d6a0
17 changed files with 676 additions and 242 deletions
@@ -8,6 +8,10 @@

- [Justice 40 Score application](#justice-40-score-application)
- [About this application](#about-this-application)
- [Using the data](#using-the-data)
- [1. Source data](#1-source-data)
- [2. Extract-Transform-Load (ETL) the data](#2-extract-transform-load-etl-the-data)
- [3. Combined dataset](#3-combined-dataset)
- [4. Tileset](#4-tileset)
- [Score generation and comparison workflow](#score-generation-and-comparison-workflow)
- [Workflow Diagram](#workflow-diagram)
- [Step 0: Set up your environment](#step-0-set-up-your-environment)

@@ -28,6 +32,15 @@

- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Miscellaneous](#miscellaneous)
- [Testing](#testing)
- [Background](#background)
- [Configuration / Fixtures](#configuration--fixtures)
- [Updating Pickles](#updating-pickles)
- [Future Enhancements](#future-enhancements)
- [ETL Unit Tests](#etl-unit-tests)
- [Extract Tests](#extract-tests)
- [Transform Tests](#transform-tests)
- [Load Tests](#load-tests)

<!-- /TOC -->
@@ -46,15 +59,17 @@ One of our primary development principles is that the entire data pipeline shoul

In the sub-sections below, we outline what each stage of the data provenance looks like and where you can find the data output by that stage. If you'd like to actually perform each step in your own environment, skip down to [Score generation and comparison workflow](#score-generation-and-comparison-workflow).

#### 1. Source data

If you would like to find and use the raw source data, you can find the source URLs in the `etl.py` files located within each directory in `data/data-pipeline/etl/sources`.

#### 2. Extract-Transform-Load (ETL) the data

The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:

- EJScreen: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv>
- Census ACS 2019: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv>
- Housing and Transportation Index: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv>
- HUD Housing: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv>
Each CSV may have a different column name for the census tract or census block group identifier. You can find the name in the ETL code. Please note that when you view these files, you should make sure your text editor or spreadsheet software does not strip the leading `0` from this identifier field (many IDs begin with `0`).
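If you load one of these CSVs with pandas, the same caveat applies: read the identifier column as a string so it is not coerced to an integer and stripped of its leading `0`. A minimal sketch, assuming a hypothetical identifier column name (check the relevant `etl.py` for the real one):

```python
import pandas as pd

# Hypothetical column name -- look up the actual identifier column in the
# relevant etl.py file before using this.
ID_COLUMN = "GEOID10_TRACT"

# Reading the identifier as a string preserves leading zeros such as "01073001100".
df = pd.read_csv("usa.csv", dtype={ID_COLUMN: str})
```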
@@ -242,3 +257,78 @@ see [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nb
## Miscellaneous
- To export packages from Poetry to `requirements.txt` run `poetry export --without-hashes > requirements.txt`
## Testing
### Background
This project uses [pytest](https://docs.pytest.org/en/latest/) for testing. To run tests, run `poetry run pytest` in this directory (i.e. `justice40-tool/data/data-pipeline`).
### Configuration / Fixtures
Test data is configured via [fixtures](https://docs.pytest.org/en/latest/explanation/fixtures.html).
These fixtures use [pickle files](https://docs.python.org/3/library/pickle.html) to store dataframes on disk. We do this because asserting equality on two dataframes fails when their column types are mismatched, even if the columns hold the same "visible" values.
In a bit more detail:
1. Pandas dataframes are typed, and by default, types are inferred when you create one from scratch. If you create a dataframe using the `DataFrame` [constructor](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), there is no guarantee that types will be correct without explicit `dtype` annotations. Explicit `dtype` annotations are possible, but this leads us to point #2:
2. The transformations and dataframes in the source code under test don't always require specific types; it is often sufficient for the code to rely on the `object` type. We attempted to add explicit typing based on the "logical" type of each column, but in practice this produced non-matching dataframes that _actually_ held the same values -- in particular, it was very common to have one dataframe column of type `string` and another of type `object` carrying the same values. In other words, even a "correctly" typed dataframe (according to our logical assumptions about what the types should be) was still counted as mismatched against the dataframes our program actually uses. Fixing this "the right way" requires explicitly annotating types at the point of the `read_csv` call, which has other potential unintended side effects and would need to be done carefully.
3. For larger dataframes (some of these have 150+ values), it was initially deemed too difficult and time-consuming to manually annotate all the types, and further, to keep those annotations in sync with what the source code under test expects.
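As an illustration of the pattern, a pickle-backed fixture looks roughly like the sketch below; the fixture name and file name here are hypothetical, not the actual test fixtures.

```python
import pandas as pd
import pytest

# Hypothetical snapshot path -- the real pickles live under
# data_pipeline/etl/score/tests/snapshots/ with names specific to each test.
SNAPSHOT_PATH = "data_pipeline/etl/score/tests/snapshots/transform_expected.pkl"


@pytest.fixture()
def transform_expected() -> pd.DataFrame:
    # Loading the expected dataframe from a pickle preserves the exact dtypes
    # it was written with, so equality assertions don't fail on
    # `object`-vs-`string` mismatches.
    return pd.read_pickle(SNAPSHOT_PATH)
```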
#### Updating Pickles
If you update the input or output of various methods, you must create new pickles so that the data is validated correctly. To do this:
1. Drop a breakpoint just before the dataframe will otherwise be written to / read from disk. If you're using VSCode, use one of the named run targets within `data-pipeline`, such as `Score Full Run`, and put a breakpoint in the margin just before the actionable step. More on using breakpoints in VSCode [here](https://code.visualstudio.com/docs/editor/debugging#_breakpoints). If you are not using VSCode, you can add the line `breakpoint()` to your code and execution will stop wherever you placed it, in whatever calling context you are using.
1. In your editor/terminal, run `df.to_pickle("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl")` to write the pickle to the appropriate location on disk.
1. Be sure to do this for all inputs/outputs that have changed as a result of your modification. It is often necessary to do this several times for cascading operations.
1. To inspect your pickle, open a python interpreter, then run `pickle.load(open("data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl", "rb"))` to get file contents.
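Put together, the write-and-inspect round trip from the steps above looks roughly like this; the path is the same placeholder as above, and `df` stands for whatever dataframe is in scope at your breakpoint. (`pd.read_pickle` is equivalent to the `pickle.load(...)` call for dataframes.)

```python
import pandas as pd

SNAPSHOT = "data_pipeline/etl/score/tests/snapshots/YOUR_OUT_PATH_HERE.pkl"

# At the breakpoint, write out the dataframe the code currently produces.
df.to_pickle(SNAPSHOT)

# Later (e.g. in a separate interpreter), read it back to confirm the contents.
df_check = pd.read_pickle(SNAPSHOT)
print(df_check.head())
print(df_check.dtypes)
```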
#### Future Enhancements
Pickles have several downsides that we should consider alternatives for:
1. They are opaque - it is necessary to open a Python interpreter (as written above) to confirm their contents
2. They are a bit harder for newcomers to Python to grok.
3. They potentially encode flawed typing assumptions (see above) which are paved over for future test runs.
In the future, we could adopt any of the below strategies to work around this:
1. We could use [pytest-snapshot](https://pypi.org/project/pytest-snapshot/) to automatically store the output of each test as data changes. This would mean you could avoid generating a pickle for each method - instead, you would only need to call `generate` once, and only when the dataframe has changed.
Additionally, you could use a pandas type schema annotation such as [pandera](https://pandera.readthedocs.io/en/stable/schema_models.html?highlight=inputschema#basic-usage) to annotate input/output schemas for given functions, and your unit tests could use these to validate explicitly. This could be of very high value for annotating expectations.
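As a very rough sketch of what that could look like (the schema, column names, and function here are all hypothetical, not part of the current codebase):

```python
import pandera as pa
from pandera.typing import DataFrame, Series


# Hypothetical schema -- column names are illustrative only.
class TractSchema(pa.SchemaModel):
    GEOID10_TRACT: Series[str] = pa.Field(unique=True)
    total_population: Series[float] = pa.Field(ge=0, nullable=True)


@pa.check_types
def transform(df: DataFrame[TractSchema]) -> DataFrame[TractSchema]:
    # The decorator validates the input and output dataframes against the
    # schema at call time, so unit tests get type checks "for free".
    return df
```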
Alternatively, or in conjunction, you could move toward using a more strictly-typed container format for read/writes such as SQL/SQLite, and use something like [SQLModel](https://github.com/tiangolo/sqlmodel) to handle more explicit type guarantees.
### ETL Unit Tests
ETL unit tests are typically organized into three buckets:
- Extract Tests
- Transform Tests, and
- Load Tests
These are tested using different strategies, explained below.
#### Extract Tests
Extract tests rely on the limited data transformations that occur as data is loaded from source files.
In tests, we use fake, limited CSVs read via `StringIO`, taken from the first several rows of the files of interest, and ensure data types are correct.
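A rough sketch of that pattern (the columns and dtype expectations here are made up for illustration, not taken from a real source file):

```python
from io import StringIO

import pandas as pd

# A fake, truncated CSV standing in for the first few rows of a real source file.
FAKE_SOURCE_CSV = (
    "GEOID10_TRACT,some_measure\n"
    "01073001100,12.5\n"
    "01073001400,3.0\n"
)


def test_extract_preserves_types():
    df = pd.read_csv(StringIO(FAKE_SOURCE_CSV), dtype={"GEOID10_TRACT": str})
    # The identifier keeps its leading zero because it is read as a string.
    assert df["GEOID10_TRACT"].iloc[0] == "01073001100"
    assert df["some_measure"].dtype == "float64"
```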
Down the line, we could use a tool like [Pandera](https://pandera.readthedocs.io/) to enforce schemas, both for the tests and the classes themselves.
#### Transform Tests
Transform tests are the heart of ETL unit tests, and compare ideal dataframes with their actual counterparts.
See the [Fixtures](#configuration--fixtures) section above for information about where the data comes from.
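The comparison itself typically boils down to `pandas.testing.assert_frame_equal`; a minimal sketch (with inline dataframes standing in for the pickle fixtures):

```python
import pandas as pd


def test_transform_output_matches_expected():
    # In the real tests, both dataframes come from pickle fixtures; they are
    # built inline here purely to show the comparison.
    actual = pd.DataFrame({"GEOID10_TRACT": ["01073001100"], "score": [0.75]})
    expected = pd.DataFrame({"GEOID10_TRACT": ["01073001100"], "score": [0.75]})

    # Fails on dtype mismatches as well as value mismatches, which is why the
    # expected dataframes are stored as pickles rather than rebuilt by hand.
    pd.testing.assert_frame_equal(actual, expected)
```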
#### Load Tests
These make use of [tmp_path_factory](https://docs.pytest.org/en/latest/how-to/tmp_path.html) to create a temporary file system located under `temp_dir`, and validate whether the correct files are written to the correct locations.
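A sketch of that pattern, with a hypothetical writer function standing in for the real load step:

```python
import pandas as pd
import pytest


def write_score_csv(df: pd.DataFrame, out_dir) -> None:
    # Hypothetical stand-in for the load step under test, which writes its
    # output CSV into a configurable directory.
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / "usa.csv", index=False)


def test_load_writes_expected_file(tmp_path_factory: pytest.TempPathFactory) -> None:
    temp_dir = tmp_path_factory.mktemp("data") / "score"
    df = pd.DataFrame({"GEOID10_TRACT": ["01073001100"], "score": [0.75]})

    write_score_csv(df, temp_dir)

    # Validate that the expected file landed in the expected location.
    assert (temp_dir / "usa.csv").exists()
```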
Additional future modifications could include the use of Pandera and/or other schema validation tools, and/or a more explicit test that the data written to file can be read back in and yields the same dataframe.