mirror of
https://github.com/DOI-DO/j40-cejst-2.git
synced 2025-02-23 10:04:18 -08:00
# Justice 40 Score application
|
|
|
|
<details open="open">
|
|
<summary>Table of Contents</summary>
|
|
|
|
<!-- TOC -->
|
|
|
|
- [Justice 40 Score application](#justice-40-score-application)
|
|
- [About this application](#about-this-application)
|
|
- [Using the data](#using-the-data)
|
|
- [1. Source data](#1-source-data)
|
|
- [2. Extract-Transform-Load (ETL) the data](#2-extract-transform-load-etl-the-data)
|
|
- [3. Combined dataset](#3-combined-dataset)
|
|
- [4. Tileset](#4-tileset)
|
|
- [5. Shapefiles](#5-shapefiles)
|
|
- [Score generation and comparison workflow](#score-generation-and-comparison-workflow)
|
|
- [Workflow Diagram](#workflow-diagram)
|
|
- [Step 0: Set up your environment](#step-0-set-up-your-environment)
|
|
- [Step 1: Run the script to download census data or download from the Justice40 S3 URL](#step-1-run-the-script-to-download-census-data-or-download-from-the-justice40-s3-url)
- [Step 2: Run the ETL script for each data source](#step-2-run-the-etl-script-for-each-data-source)
- [Table of commands](#table-of-commands)
- [ETL steps](#etl-steps)
- [Step 3: Calculate the Justice40 score experiments](#step-3-calculate-the-justice40-score-experiments)
- [Step 4: Compare the Justice40 score experiments to other indices](#step-4-compare-the-justice40-score-experiments-to-other-indices)
- [Data Sources](#data-sources)
- [Running using Docker](#running-using-docker)
- [Local development](#local-development)
- [VSCode](#vscode)
- [MacOS](#macos)
- [Windows Users](#windows-users)
- [Setting up Poetry](#setting-up-poetry)
- [Running tox](#running-tox)
- [The Application entrypoint](#the-application-entrypoint)
- [Downloading Census Block Groups GeoJSON and Generating CBG CSVs (not normally required)](#downloading-census-block-groups-geojson-and-generating-cbg-csvs-not-normally-required)
- [Run all ETL, score and map generation processes](#run-all-etl-score-and-map-generation-processes)
- [Run both ETL and score generation processes](#run-both-etl-and-score-generation-processes)
- [Run all ETL processes](#run-all-etl-processes)
- [Generating Map Tiles](#generating-map-tiles)
- [Serve the map locally](#serve-the-map-locally)
- [Running Jupyter notebooks](#running-jupyter-notebooks)
- [Activating variable-enabled Markdown for Jupyter notebooks](#activating-variable-enabled-markdown-for-jupyter-notebooks)
- [Testing](#testing)
- [Background](#background)
- [Score and post-processing tests](#score-and-post-processing-tests)
- [Updating Pickles](#updating-pickles)
- [Future Enhancements](#future-enhancements)
- [Fixtures used in ETL "snapshot tests"](#fixtures-used-in-etl-snapshot-tests)
- [Other ETL Unit Tests](#other-etl-unit-tests)
- [Extract Tests](#extract-tests)
- [Transform Tests](#transform-tests)
- [Load Tests](#load-tests)
- [Smoketests](#smoketests)

<!-- /TOC -->

</details>

## About this application

This application is used to compare experimental versions of the Justice40 score to established environmental justice indices, such as EJSCREEN, CalEnviroScreen, and so on.

_**NOTE:** These scores **do not** represent final versions of the Justice40 scores and are merely used for comparative purposes. As a result, the specific input columns and formulas used to calculate them are likely to change over time._

### Using the data

One of our primary development principles is that the entire data pipeline should be open and replicable end-to-end. As part of this, in addition to all code being open, we also strive to make data visible and available for use at every stage of our pipeline. You can follow the instructions below in this README to spin up the data pipeline yourself in your own environment; you can also access the data we've already processed on our S3 bucket.

In the sub-sections below, we outline what each stage of the data provenance looks like and where you can find the data output by that stage. If you'd like to actually perform each step in your own environment, skip down to [Score generation and comparison workflow](#score-generation-and-comparison-workflow).

#### 1. Source data

If you would like to find and use the raw source data, you can find the source URLs in the `etl.py` files located within each directory in `data/data-pipeline/etl/sources`.

#### 2. Extract-Transform-Load (ETL) the data

The first step of processing we perform is a simple ETL process for each of the source datasets. Code is available in `data/data-pipeline/etl/sources`, and the output of this process is a number of CSVs available at the following locations:

- EJScreen: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/ejscreen_2019/usa.csv>
- Census ACS 2019: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/census_acs_2019/usa.csv>
- Housing and Transportation Index: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/housing_and_transportation_index/usa.csv>
- HUD Housing: <https://justice40-data.s3.amazonaws.com/data-pipeline/data/dataset/hud_housing/usa.csv>

Each CSV may have a different column name for the census tract or census block group identifier. You can find what the name is in the ETL code. Please note that when you view these files you should make sure that your text editor or spreadsheet software does not remove the initial `0` from this identifier field (many IDs begin with `0`).
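
If you load these CSVs with pandas rather than a spreadsheet, the same caveat applies: by default the identifier is inferred as an integer and the leading `0` is lost. The sketch below shows one way to keep it as text. The two-row sample and the `read_with_string_ids` helper are made up for illustration, and the column name `GEOID10_TRACT` is an assumption here; check the ETL code for the name used by the file you are reading.

```python
import io

import pandas as pd

# Hypothetical two-row extract; the real files are far larger and the
# identifier column name varies by dataset (check the ETL code).
SAMPLE_CSV = (
    "GEOID10_TRACT,Total population\n"
    "01001020100,1993\n"
    "06037101110,4219\n"
)


def read_with_string_ids(source, id_column="GEOID10_TRACT"):
    """Read a CSV while keeping the geographic identifier as text."""
    return pd.read_csv(source, dtype={id_column: "string"})


df = read_with_string_ids(io.StringIO(SAMPLE_CSV))
naive_df = pd.read_csv(io.StringIO(SAMPLE_CSV))  # identifier inferred as an integer

print(df["GEOID10_TRACT"].tolist())        # leading zeros preserved
print(naive_df["GEOID10_TRACT"].tolist())  # leading zeros lost
```
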

#### 3. Combined dataset

The CSV with the combined data from all of these sources [can be accessed here](https://justice40-data.s3.amazonaws.com/data-pipeline/data/score/csv/full/usa.csv).

#### 4. Tileset

Once we have all the data from the previous stages, we convert it to tiles to make it usable on a map. We render the map on the client side, which can be seen using `docker-compose up`.

#### 5. Shapefiles

If you want to use the shapefiles in mapping applications, you can access them [here](https://justice40-data.s3.amazonaws.com/data-pipeline/data/score/shapefile/usa.zip).

### Score generation and comparison workflow

The descriptions below provide a more detailed outline of what happens at each step of the ETL and score calculation workflow.

#### Workflow Diagram

TODO add mermaid diagram

#### Step 0: Set up your environment

1. Choose whether you'd like to run this application using Docker or if you'd like to install the dependencies locally so you can contribute to the project.
   - **With Docker:** Follow these [installation instructions](https://docs.docker.com/get-docker/) and skip down to the [Running using Docker section](#running-using-docker) for more information
   - **For Local Development:** Skip down to the [Local Development section](#local-development) for more detailed installation instructions

#### Step 1: Run the script to download census data or download from the Justice40 S3 URL

1. Call the `census-data-download` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application census-data-download`
   - With Poetry: `poetry run download_census` (Install GDAL as described [below](#local-development))
2. If you have a high speed internet connection and don't want to generate the census data or install `GDAL` locally, you can download a zip version of the Census file [here](https://justice40-data.s3.amazonaws.com/data-sources/census.zip). Then unzip and move the contents inside the `data/data-pipeline/data_pipeline/data/census/` folder.
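
If you take the download route, the unzip step can also be scripted. This is a minimal sketch using only the Python standard library; `extract_census_zip` is a hypothetical helper, not part of the pipeline, and it assumes the archive's top level already holds the files that belong in the `census/` folder. If the archive wraps them in an extra directory, move the contents up by hand afterwards.

```python
import zipfile
from pathlib import Path


def extract_census_zip(zip_path, census_dir):
    """Unpack a downloaded census archive into the given folder.

    Returns the sorted list of member names that were extracted.
    """
    census_dir = Path(census_dir)
    census_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(census_dir)
        return sorted(archive.namelist())


# Example call, run from the repository root after downloading census.zip:
# extract_census_zip("census.zip", "data/data-pipeline/data_pipeline/data/census")
```
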

#### Step 2: Run the ETL script for each data source

##### Table of commands

| VS code command | actual command | run time | what it does | where it writes to | notes |
|---|---|---|---|---|---|
| ETL run | etl-run | | Downloads the data set files | data/dataset | check if there are any changes in data_pipeline/etl/sources. if there are none this can be skipped. |
| Score run | score-run | 6 mins | consume the etl outputs and combine into a score csv full. | data/score/csv/full/usa.csv | |
| Generate Score post | generate-score-post | 9 mins | 1. combines the score/csv/full with counties. 2. downloadable assets (xls, csv, zip), 3. creates the tiles/csv | data/score/csv/tiles/usa.csv, data/score/downloadable | check destination folder to see if newly created |
| Combine score and geoJson | geo-score | 26 mins | 1. combine the data/score/csv/tiles/usa.csv with the census tiger geojson data 2. aggregates into super tracts for usa-low layer | data/score/geojson (usa high / low) | |
| Generate Map Tiles | generate-map-tiles | 35 mins | ogr-ogr pbf / mvt tiles generator that consume the geojson usa high / usa low | data/score/tiles/ high or low / {zoomLevel} | |
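
The rows of the table are listed in the order the stages feed into each other. As a rough sketch, the sequence can be written down in Python so it is easy to print or script. The stage names come from the "actual command" column; the Poetry prefix mirrors the commands used elsewhere in this README, and `build_stage_commands` is a hypothetical helper rather than something the repository provides.

```python
import shlex

# Stage order and subcommand names are taken from the table of commands.
PIPELINE_STAGES = [
    "etl-run",
    "score-run",
    "generate-score-post",
    "geo-score",
    "generate-map-tiles",
]

# Assumed invocation prefix, matching the Poetry commands in this README.
POETRY_PREFIX = ["poetry", "run", "python3", "data_pipeline/application.py"]


def build_stage_commands(stages=PIPELINE_STAGES):
    """Return one argument list per pipeline stage, in execution order."""
    return [POETRY_PREFIX + [stage] for stage in stages]


for command in build_stage_commands():
    print(shlex.join(command))
```

To actually run the stages, each argument list could be passed to `subprocess.run(command, check=True)` so that a failing stage stops the sequence.
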

##### ETL steps

1. Call the `etl-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run`
   - With Poetry: `poetry run python3 data_pipeline/application.py etl-run`
2. This command will execute the corresponding ETL script for each data source in `data_pipeline/etl/sources/`. For example, `data_pipeline/etl/sources/ejscreen/etl.py` is the ETL script for EJSCREEN data.
3. Each ETL script will extract the data from its original source, then format the data into `.csv` files that get stored in the relevant folder in `data_pipeline/data/dataset/`. For example, HUD Housing data is stored in `data_pipeline/data/dataset/hud_housing/usa.csv`.

_**NOTE:** You have the option to pass the name of a specific data source to the `etl-run` command using the `-d` flag, which will limit the execution of the ETL process to that specific data source._
_For example: `poetry run python3 data_pipeline/application.py etl-run -d ejscreen` would only run the ETL process for EJSCREEN data._

#### Step 3: Calculate the Justice40 score experiments

1. Call the `score-run` command using the application manager `application.py`. **NOTE:** This may take several minutes to execute.
   - With Docker: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run`
   - With Poetry: `poetry run python3 data_pipeline/application.py score-run`
1. The `score-run` command will execute the `etl/score/etl.py` script, which loads the data from each of the source files added to the `data/dataset/` directory by the ETL scripts in Step 2.
1. These data sets are merged into a single dataframe using their Census Block Group GEOID as a common key, and the data in each of the columns is standardized in two ways:
   - Their [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) is calculated, which tells us what percentage of other Census Block Groups have a lower value for that particular column.
   - They are normalized using [min-max normalization](https://en.wikipedia.org/wiki/Feature_scaling), which adjusts the scale of the data so that the Census Block Group with the highest value for that column is set to 1, the Census Block Group with the lowest value is set to 0, and all of the other values are adjusted to fit within that range based on how close they were to the highest or lowest value.
1. The standardized columns are then used to calculate each of the Justice40 score experiments described in greater detail below, and the results are exported to a `.csv` file in [`data_pipeline/data/score/csv`](data_pipeline/data/score/csv)
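
The two standardizations described in the list above can be illustrated with a small, dependency-free sketch. It follows the definitions as worded in this README (share of *other* values that are lower, and rescaling to the 0 to 1 range); it is not the pipeline's own code, which may treat ties and missing values differently. The function names and sample values are made up for illustration.

```python
def percentile_rank(values):
    """Fraction of the other values that are strictly lower than each value."""
    others = len(values) - 1
    if others <= 0:
        return [0.0 for _ in values]
    # A value is never lower than itself, so the count covers only the others.
    return [sum(other < value for other in values) / others for value in values]


def min_max_normalize(values):
    """Rescale values so the lowest maps to 0 and the highest maps to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # All values are identical; there is no range to spread them over.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]


sample = [10.0, 20.0, 30.0, 50.0]
ranks = percentile_rank(sample)
scaled = min_max_normalize(sample)
print(ranks)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The percentile rank depends only on the ordering of the values, while min-max normalization keeps their relative spacing, which is why the two outputs differ for the same column. The nested loop in `percentile_rank` is quadratic and only meant for small illustrations.
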

#### Step 4: Compare the Justice40 score experiments to other indices

We are building a comparison tool to enable easy (or at least straightforward) comparison of the Justice40 score with other existing indices. The goal is that, as we experiment and iterate on a scoring methodology, we can understand how our score overlaps with or differs from other indices that communities, nonprofits, and governments use to inform decision making.

Right now, our comparison tool exists simply as a Python notebook in `data/data-pipeline/data_pipeline/ipython/scoring_comparison.ipynb`.

To run this comparison tool:

1. Make sure you've gone through the above steps to run the data ETL and score generation.
1. From the package directory (`data/data-pipeline/data_pipeline/`), navigate to the `ipython` directory: `cd ipython`.
1. Ensure you have `pandoc` installed on your computer. If you're on a Mac, run `brew install pandoc`; for other OSes, see pandoc's [installation guide](https://pandoc.org/installing.html).
1. Start the notebooks: `jupyter notebook`
1. In your browser, navigate to one of the URLs returned by the above command.
1. Select `scoring_comparison.ipynb` from the options in your browser.
1. Run through the steps in the notebook. You can step through them one at a time by clicking the "Run" button for each cell, or open the "Cell" menu and click "Run all" to run them all at once.
1. Reports and spreadsheets generated by the comparison tool will be available in `data/data-pipeline/data_pipeline/data/comparison_outputs`.

_NOTE:_ This may take several minutes or over an hour to fully execute and generate the reports.

### Data Sources

- **[EJSCREEN](data_pipeline/etl/sources/ejscreen):** TODO Add description of data source
- **[Census](data_pipeline/etl/sources/census):** TODO Add description of data source
- **[American Communities Survey](data_pipeline/etl/sources/census_acs):** TODO Add description of data source
- **[Housing and Transportation](data_pipeline/etl/sources/housing_and_transportation):** TODO Add description of data source
- **[HUD Housing](data_pipeline/etl/sources/hud_housing):** TODO Add description of data source
- **[HUD Recap](data_pipeline/etl/sources/hud_recap):** TODO Add description of data source
- **[CalEnviroScreen](data_pipeline/etl/sources/calenviroscreen):** TODO Add description of data source

## Running using Docker

We use Docker to install the necessary libraries in a container that can be run in any operating system.

_Important_: To be able to run the data Docker containers, you need to increase the memory resource of your container to at least 8096 MB.

To build the docker container the first time, make sure you're in the root directory of the repository and run `docker-compose build --no-cache`.

Once completed, run `docker-compose up`. Docker will spin up 3 containers: the client container, the static server container and the data container. Once all data is generated, you can see the application using a browser and navigating to `http://localhost:8000`.

If you want to run specific data tasks, you can open a terminal window, navigate to the root folder for this repository and then execute any command for the application using this format:

`docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application [command]`

Here's a list of commands:

- Get help: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application --help`
- Generate census data: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application census-data-download`
- Run all ETL and Generate score: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-full-run`
- Clean up the data directories: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application data-cleanup`
- Run all ETL processes: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run`
- Generate Score: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application score-run`
- Combine Score with Geojson and generate high and low zoom map tile sets: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application geo-score`
- Generate Map Tiles: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application generate-map-tiles`
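
Every command in the list above shares the same prefix and differs only in the final subcommand. A small Python sketch can make that pattern explicit and save retyping the volume mount. The image name, mount paths, and subcommands are copied from the commands above; `docker_command` itself is a hypothetical helper, and `/path/to/justice40-tool` stands in for wherever you cloned the repository.

```python
import os
import shlex

IMAGE = "j40_data_pipeline"
CONTAINER_DATA_DIR = "/data_pipeline/data"
HOST_DATA_SUBDIR = "data/data-pipeline/data_pipeline/data"


def docker_command(subcommand, *extra_args, repo_root=None):
    """Build the `docker run` argument list for one pipeline subcommand."""
    # Default to the current directory, like ${PWD} in the shell commands.
    repo_root = repo_root if repo_root is not None else os.getcwd()
    mount = f"{repo_root}/{HOST_DATA_SUBDIR}:{CONTAINER_DATA_DIR}"
    return [
        "docker", "run", "--rm", "-it",
        "-v", mount,
        IMAGE,
        "python3", "-m", "data_pipeline.application",
        subcommand, *extra_args,
    ]


print(shlex.join(docker_command("score-run", repo_root="/path/to/justice40-tool")))
```

The resulting list can be handed to `subprocess.run(...)` from the repository root, or printed and pasted into a terminal.
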

## Local development

You can run the Python code locally without Docker to develop, using Poetry. However, to generate the census data you will need the [GDAL library](https://github.com/OSGeo/gdal) installed locally. For score generation, you will need [libspatialindex](https://libspatialindex.org/en/latest/). And to generate tiles for a local map, you will need [Mapbox tippecanoe](https://github.com/mapbox/tippecanoe). Please refer to the repos for specific instructions for your OS.

### VSCode

If you are using VSCode, you can make use of the `.vscode` folder checked in under `data/data-pipeline/.vscode`. To do this, open this directory with `code data/data-pipeline`.

Here's what's included:

1. `launch.json` - launch commands that allow for debugging the various commands in `application.py`. Note that because we are using the otherwise excellent [Click CLI](https://click.palletsprojects.com/en/8.0.x/), and Click in turn uses `console_scripts` to parse and execute command line options, it is necessary to run the equivalent of `python -m data_pipeline.application [command]` within `launch.json` to be able to set and hit breakpoints (this is what is currently implemented). Otherwise, you may find that the script times out after 5 seconds. More about this [here](https://stackoverflow.com/questions/64556874/how-can-i-debug-python-console-script-command-line-apps-with-the-vscode-debugger).

2. `settings.json` - these ensure that you're using the default linter (`pylint`), formatter (`flake8`), and test library (`pytest`) that the team is using.

3. `tasks.json` - these enable you to use `Terminal->Run Task` to run our preferred formatters and linters within your project.

Users are instructed to only add settings to this file that should be shared across the team, and not to add settings here that only apply to local development environments (particularly full absolute paths which can differ between setups). If you are looking to add something to this file, check in with the rest of the team to ensure the proposed settings should be shared.

### MacOS

To install the above-named executables:

- gdal: `brew install gdal`
- Tippecanoe: `brew install tippecanoe`
- spatialindex: `brew install spatialindex`

Note: For MacOS Monterey or M1 Macs, [you might need to follow these steps](https://stackoverflow.com/a/70880741) to install Scipy.

### Windows Users

If you want to run tile generation, please install TippeCanoe [following these instructions](https://github.com/GISupportICRC/ArcGIS2Mapbox#installing-tippecanoe-on-windows). You also need some pre-requisites for Geopandas as specified in the Poetry requirements. Please follow [these instructions](https://stackoverflow.com/questions/56958421/pip-install-geopandas-on-windows) to install the Geopandas dependency locally. It's definitely easier if you have access to WSL (Windows Subsystem Linux), and install these packages using commands similar to our [Dockerfile](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/Dockerfile).

### Setting up Poetry

- Start a terminal
- Change to this directory (`/data/data-pipeline/`)
- Make sure you have at least Python 3.8 installed: `python -V` or `python3 -V`
- We use [Poetry](https://python-poetry.org/) for managing dependencies and building the application. Please follow the instructions on their site to download.
- Install Poetry requirements with `poetry install`

### Running tox

Our full test and check suite is run using tox. This can be run using commands such
as `poetry run tox`.

Each run can take a while to build the whole environment. If you'd like to save time,
you can use the previously built environment by running `poetry run tox -e lint`
which will drastically speed up the linting process.

### Configuring pre-commit hooks

<!-- markdown-link-check-disable -->
To promote consistent code style and quality, we use git pre-commit hooks to
automatically lint and reformat our code before every commit we make to the codebase.
Pre-commit hooks are defined in the file [`.pre-commit-config.yaml`](../.pre-commit-config.yaml).
<!-- markdown-link-check-enable -->

1. First, install [`pre-commit`](https://pre-commit.com/) globally:

       $ brew install pre-commit

2. While in the `data/data-pipeline` directory, run `pre-commit install` to install
   the specific git hooks used in this repository.

Now, any time you commit code to the repository, the hooks will run on all modified files automatically. If you wish,
you can force a re-run on all files with `pre-commit run --all-files`.

#### Conflicts between backend and frontend git hooks
<!-- markdown-link-check-disable -->
In the front-end part of the codebase (the `justice40-tool/client` folder), we use
`Husky` to run pre-commit hooks for the front-end. This is different than the
`pre-commit` framework we use for the backend. The frontend `Husky` hooks are
configured at
[client/.husky](client/.husky).

It is not possible to run both our `Husky` hooks and `pre-commit` hooks on every
commit; either one or the other will run.

<!-- markdown-link-check-enable -->

`Husky` is installed every time you run `npm install`. To use the `Husky` front-end
hooks during front-end development, simply run `npm install`.

However, running `npm install` overwrites the backend hooks setup by `pre-commit`.
To restore the backend hooks after running `npm install`, do the following:

1. Run `pre-commit install` while in the `data/data-pipeline` directory.
2. The terminal should respond with an error message such as:
   ```
   [ERROR] Cowardly refusing to install hooks with `core.hooksPath` set.
   hint: `git config --unset-all core.hooksPath`
   ```

   This error is caused by having previously run `npm install` which used `Husky` to
   overwrite the hooks path.

3. Follow the hint and run `git config --unset-all core.hooksPath`.
4. Run `pre-commit install` again.

Now `pre-commit` and the backend hooks should take precedence.

### The Application entrypoint

After installing the poetry dependencies, you can see a list of commands with the following steps:

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py --help`

### Downloading Census Block Groups GeoJSON and Generating CBG CSVs (not normally required)

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- If you want to clear out all data and tiles from all directories, you can run: `poetry run python3 data_pipeline/application.py data-cleanup`.
- Then run `poetry run python3 data_pipeline/application.py census-data-download`

Note: Census files are hosted in the Justice40 S3 and you can skip this step by passing the `-s aws` or `--data-source aws` flag in the scripts below.

### Run all ETL, score and map generation processes

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py data-full-run -s aws`
- Note: The `-s` flag is optional if you have generated/downloaded the census data

### Run both ETL and score generation processes

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py score-full-run`

### Run all ETL processes

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py etl-run`

### Generating Map Tiles

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Then run `poetry run python3 data_pipeline/application.py generate-map-tiles -s aws`
- If you have S3 keys, you can sync to the dev repo by doing `aws s3 sync ./data_pipeline/data/score/tiles/ s3://justice40-data/data-pipeline/data/score/tiles --acl public-read --delete`
- Note: The `-s` flag is optional if you have generated/downloaded the score data

### Serve the map locally

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- For USA high zoom: `docker run --rm -it -v ${PWD}/data/score/tiles/high:/data -p 8080:80 maptiler/tileserver-gl`

### Running Jupyter notebooks

- Start a terminal
- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Run `poetry run jupyter notebook`. Your browser should open with a Jupyter Notebook tab

### Activating variable-enabled Markdown for Jupyter notebooks

- Change to the package directory (i.e., `cd data/data-pipeline/data_pipeline`)
- Activate a Poetry Shell (see above)
- Run `jupyter contrib nbextension install --user`
- Run `jupyter nbextension enable python-markdown/main`
- Make sure you've loaded the Jupyter notebook in a "Trusted" state. (See button near top right of Notebook screen.)

For more information, see [nbextensions docs](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html) and
see [python-markdown docs](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/python-markdown).

## Testing

### Background

<!-- markdown-link-check-disable -->
For this project, we make use of [pytest](https://docs.pytest.org/en/latest/) for testing purposes.
<!-- markdown-link-check-enable-->

To run tests, simply run `poetry run pytest` in this directory (i.e., `justice40-tool/data/data-pipeline`).

Test data is configured via [fixtures](https://docs.pytest.org/en/latest/explanation/fixtures.html).

### Score and post-processing tests

The fixtures used in the score post-processing tests are slightly different. These fixtures utilize [pickle files](https://docs.python.org/3/library/pickle.html) to store dataframes to disk. This is ultimately because if you assert equality on two dataframes, even if column values have the same "visible" value, if their types are mismatching they will be counted as not being equal.

In a bit more detail:

1. Pandas dataframes are typed, and by default, types are inferred when you create one from scratch. If you create a dataframe using the `DataFrame` [constructors](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), there is no guarantee that types will be correct, without explicit `dtype` annotations. Explicit `dtype` annotations are possible, but, and this leads us to point #2:

2. Our transformations/dataframes in the source code under test don't always require specific types, and it is often sufficient in the code itself to just rely on the `object` type. I attempted adding explicit typing based on the "logical" type of given columns, but in practice it resulted in non-matching dataframes that _actually_ had the same value -- in particular it was very common to have one dataframe column of type `string` and another of type `object` that carried the same values. So, that is to say, even if we did create a "correctly" typed dataframe (according to our logical assumptions about what types should be), they were still counted as mismatched against the dataframes that are actually used in our program. To fix this "the right way", it is necessary to explicitly annotate types at the point of the `read_csv` call, which definitely has other potential unintended side effects and would need to be done carefully.

3. For larger dataframes (some of these have 150+ values), it was initially deemed too difficult/time consuming to manually annotate all types, and further, to modify those type annotations based on what is expected in the source code under test.
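
The `string` versus `object` mismatch described in point 2 is easy to reproduce in isolation. The sketch below builds two single-column dataframes that hold the same text values under different dtypes and compares them with `pandas.testing.assert_frame_equal`, which checks dtypes by default. The column name and values are made up for illustration.

```python
import pandas as pd
import pandas.testing as pdt

values = ["01001020100", "06037101110"]

# Same visible values, but stored under two different dtypes.
object_df = pd.DataFrame({"GEOID10_TRACT": pd.Series(values, dtype="object")})
string_df = pd.DataFrame({"GEOID10_TRACT": pd.Series(values, dtype="string")})

same_visible_values = (
    object_df["GEOID10_TRACT"].tolist() == string_df["GEOID10_TRACT"].tolist()
)

try:
    pdt.assert_frame_equal(object_df, string_df)
    frames_equal = True
except AssertionError:
    # Raised because the column dtypes differ, not because the values do.
    frames_equal = False

print(same_visible_values, frames_equal)
```

Pickling a dataframe that the code under test actually produced sidesteps this, because the stored dtypes are whatever the program generated rather than what a hand-written fixture assumed.
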

#### Updating Pickles

If you update the score in any way, it is necessary to create new pickles so that data is validated correctly.

It starts with the `data_pipeline/etl/score/tests/sample_data/score_data_initial.csv`, which is the first two rows of the `score/full/usa.csv`.

To update this file, run a full score generation, then open a Python shell from the `data-pipeline` directory (e.g. `poetry run python3`), and then update the file with the following commands:

```
import pickle
from pathlib import Path
import pandas as pd
data_path = Path.cwd()

# score data expected
score_csv_path = data_path / "data_pipeline" / "data" / "score" / "csv" / "full" / "usa.csv"
score_initial_df = pd.read_csv(score_csv_path, dtype={"GEOID10_TRACT": "string"}, low_memory=False, nrows=2)
score_initial_df.to_csv(data_path / "data_pipeline" / "etl" / "score" / "tests" / "sample_data" /"score_data_initial.csv", index=False)
```

Now you can move on to updating individual pickles for the tests. Note that it is helpful to do them in this order:

We have four pickle files that correspond to expected files:

- `score_data_expected.pkl`: Initial score without counties
- `score_transformed_expected.pkl`: Intermediate score with `etl._extract_score` and `etl._transform_score` applied. There's no file for this intermediate process, so we need to capture the pickle mid-process.
- `tile_data_expected.pkl`: Score with columns to be baked in tiles
- `downloadable_data_expected.pkl`: Downloadable csv

To update the pickles, let's go one by one:

For the `score_transformed_expected.pkl`, put a breakpoint on [this line](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/tests/test_score_post.py#L62), before the `pdt.assert_frame_equal` and run:
`pytest data_pipeline/etl/score/tests/test_score_post.py::test_transform_score`

Once on the breakpoint, capture the df to a pickle as follows:

```
import pickle
from pathlib import Path
data_path = Path.cwd()
score_transformed_actual.to_pickle(data_path / "data_pipeline" / "etl" / "score" / "tests" / "snapshots" / "score_transformed_expected.pkl", protocol=4)
```

Then take out the breakpoint and re-run the test: `pytest data_pipeline/etl/score/tests/test_score_post.py::test_transform_score`

For the `score_data_expected.pkl`, put a breakpoint on [this line](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/tests/test_score_post.py#L78), before the `pdt.assert_frame_equal` and run:
`pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_score_data`

Once on the breakpoint, capture the df to a pickle as follows:

```
import pickle
from pathlib import Path
data_path = Path.cwd()
score_data_actual.to_pickle(data_path / "data_pipeline" / "etl" / "score" / "tests" / "snapshots" / "score_data_expected.pkl", protocol=4)
```

Then take out the breakpoint and re-run the test: `pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_score_data`

For the `tile_data_expected.pkl`, put a breakpoint on [this line](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/tests/test_score_post.py#L90), before the `pdt.assert_frame_equal` and run:
`pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_tile_data`

Once on the breakpoint, capture the df to a pickle as follows:

```
import pickle
from pathlib import Path
data_path = Path.cwd()
output_tiles_df_actual.to_pickle(data_path / "data_pipeline" / "etl" / "score" / "tests" / "snapshots" / "tile_data_expected.pkl", protocol=4)
```

Then take out the breakpoint and re-run the test: `pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_tile_data`

For the `downloadable_data_expected.pkl`, put a breakpoint on [this line](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/tests/test_score_post.py#L98), before the `pdt.assert_frame_equal` and run:
`pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_downloadable_data`

Once on the breakpoint, capture the df to a pickle as follows:

```
import pickle
from pathlib import Path
data_path = Path.cwd()
output_downloadable_df_actual.to_pickle(data_path / "data_pipeline" / "etl" / "score" / "tests" / "snapshots" / "downloadable_data_expected.pkl", protocol=4)
```

Then take out the breakpoint and re-run the test: `pytest data_pipeline/etl/score/tests/test_score_post.py::test_create_downloadable_data`

#### Future Enhancements
|
|
|
|
Pickles have several downsides, and we should consider alternatives:
|
|
|
|
1. They are opaque: you need to open a Python interpreter (as described above) to confirm their contents.
2. They are a bit harder for newcomers to Python to grok.
3. They potentially encode flawed typing assumptions (see above), which are then paved over in future test runs.
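
To illustrate the first downside, the sketch below shows what inspecting a pickle involves. It is self-contained: it writes a tiny stand-in dataframe to a temporary pickle and reads it back. The column names, values, and file name are made up for illustration; in practice you would point `pd.read_pickle` at one of the files under `snapshots/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a real snapshot; columns and values are illustrative only.
example_df = pd.DataFrame(
    {"GEOID10_TRACT": ["01001020100", "01001020200"], "score": [0.25, 0.75]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot_path = Path(tmp_dir) / "example_expected.pkl"
    example_df.to_pickle(snapshot_path, protocol=4)

    # The file is binary, so loading it is the practical way to see its contents.
    loaded_df = pd.read_pickle(snapshot_path)

# The dtypes are where flawed typing assumptions tend to hide.
print(loaded_df.dtypes)
print(loaded_df.head())
```

Checking `dtypes` as well as the values is the useful part here, since a column silently stored with the wrong type is exactly what the third downside describes.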
|
|
|
|
In the future, we could adopt any of the strategies below to work around these downsides:
|
|
|
|
1. We could use [pytest-snapshot](https://pypi.org/project/pytest-snapshot/) to automatically store the output of each test as data changes. This would make it so that you could avoid having to generate a pickle for each method - instead, you would only need to call `generate` once, and only when the dataframe has changed.
|
|
|
|
<!-- markdown-link-check-disable -->
|
|
Additionally, you could use a pandas schema-annotation library such as [pandera](https://pandera.readthedocs.io/en/stable/schema_models.html?highlight=inputschema#basic-usage) to annotate input/output schemas for given functions, and your unit tests could use these to validate explicitly. This could be of very high value for annotating expectations.
|
|
<!-- markdown-link-check-enable-->
|
|
|
|
Alternatively, or in conjunction, you could move toward using a more strictly-typed container format for read/writes such as SQL/SQLite, and use something like [SQLModel](https://github.com/tiangolo/sqlmodel) to handle more explicit type guarantees.
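
As a rough sketch of the typed-container idea, the example below uses only the standard-library `sqlite3` module rather than SQLModel. The `score` table and its `geoid` and `burden` columns are hypothetical, not the project's schema. It shows that declared column types keep a tract identifier as text on the way in and out.

```python
import sqlite3

# Hypothetical two-column table; the real score has many more fields.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE score (geoid TEXT PRIMARY KEY, burden REAL NOT NULL)"
)
connection.executemany(
    "INSERT INTO score (geoid, burden) VALUES (?, ?)",
    [("01001020100", 0.25), ("01001020200", 0.75)],
)

rows = connection.execute(
    "SELECT geoid, burden FROM score ORDER BY geoid"
).fetchall()
connection.close()

# The TEXT column keeps the leading zero that an integer column would drop.
for geoid, burden in rows:
    print(geoid, type(geoid).__name__, burden, type(burden).__name__)
```

Because the schema lives in the `CREATE TABLE` statement, it can be read and reviewed as plain text, unlike the types baked into a pickle.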
|
|
|
|
### Fixtures used in ETL "snapshot tests"
|
|
|
|
ETLs are tested for the results of their extract, transform, and load steps by
borrowing the concept of "snapshot testing" from the world of front-end development.
|
|
|
|
Snapshots are easy to update and demonstrate the results of a series of changes to
the code base. They are good for making sure no results have changed if you don't
expect them to change, and they are good when you expect the results to change
significantly in a way that would be tedious to update in traditional unit tests.
|
|
|
|
However, snapshot tests are also dangerous. An unthinking developer may update the
snapshot fixtures and unknowingly encode a bug into the supposed intended output of
the test.
|
|
|
|
To update the snapshot fixtures of an ETL class, follow these steps:
|
|
|
|
1. If you need to manually update the fixtures, update the "furthest upstream" source
   that is called by `_setup_etl_instance_and_run_extract`. For instance, this may
   involve creating a new zip file that imitates the source data. (e.g., for the
   National Risk Index test, update
   `data_pipeline/tests/sources/national_risk_index/data/NRI_Table_CensusTracts.zip`
   which is a 64 KB imitation of the 405 MB source NRI data.)
2. Run `pytest . -rsx --update_snapshots` to update snapshots for all files, or you
   can pass a specific file name to pytest to be more precise (e.g., `pytest
   data_pipeline/tests/sources/national_risk_index/test_etl.py -rsx --update_snapshots`)
3. Re-run pytest without the `update_snapshots` flag (e.g., `pytest . -rsx`) to
   ensure the tests now pass.
4. Carefully check the `git diff` for the updates to all test fixtures to make sure
   these are as expected. This part is very important. For instance, if you changed a
   column name, you would only expect the column name to change in the output. If
   you modified the calculation of some data, spot check the results to see if the
   numbers in the updated fixtures are as expected.
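
The following minimal sketch shows why the review in step 4 matters. It is not the repository's implementation of `--update_snapshots`; the `check_snapshot` helper and the file name are hypothetical. In update mode the helper records whatever the code currently produces, correct or not, and only compare mode can fail.

```python
import tempfile
from pathlib import Path


def check_snapshot(actual: str, snapshot_path: Path, update: bool) -> None:
    """Compare output with a stored snapshot, or overwrite it when asked."""
    if update:
        # Update mode: the current output becomes the new expected result.
        snapshot_path.write_text(actual)
        return
    expected = snapshot_path.read_text()
    assert actual == expected, f"output differs from {snapshot_path.name}"


with tempfile.TemporaryDirectory() as tmp_dir:
    snapshot = Path(tmp_dir) / "transform.csv"
    first_output = "GEOID10_TRACT,score\n01001020100,0.25\n"

    check_snapshot(first_output, snapshot, update=True)  # record the snapshot
    check_snapshot(first_output, snapshot, update=False)  # unchanged, so it passes

    # A changed calculation is caught only when running in compare mode.
    changed_output = first_output.replace("0.25", "0.30")
    try:
        check_snapshot(changed_output, snapshot, update=False)
        mismatch_detected = False
    except AssertionError:
        mismatch_detected = True

print(mismatch_detected)
```

Had `changed_output` been passed with `update=True`, it would have been stored without complaint, which is how a bug can end up encoded in a fixture.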
|
|
|
|
### Other ETL Unit Tests
|
|
|
|
Outside of the ETL snapshot tests discussed above, ETL unit tests are typically
organized into three buckets:
|
|
|
|
- Extract Tests
- Transform Tests, and
- Load Tests
|
|
|
|
These are tested using different strategies explained below.
|
|
|
|
#### Extract Tests
|
|
|
|
Extract tests rely on the limited data transformations that occur as data is loaded from source files.
|
|
|
|
In tests, we use fake, limited CSVs read via `StringIO`, taken from the first several rows of the files of interest, and ensure data types are correct.
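
A minimal sketch of this pattern, assuming pandas is used for the read, might look like the following. The column names and values are invented, not taken from a real source file.

```python
from io import StringIO

import pandas as pd

# A few fake rows standing in for the top of a source file (values made up).
FAKE_CSV = StringIO(
    "GEOID10_TRACT,energy_burden\n"
    "01001020100,0.04\n"
    "01001020200,0.07\n"
)

# Read identifiers as strings so leading zeros survive the load.
extracted_df = pd.read_csv(FAKE_CSV, dtype={"GEOID10_TRACT": str})

# The extract test then checks that each column has the type we rely on.
assert extracted_df["GEOID10_TRACT"].tolist() == ["01001020100", "01001020200"]
assert pd.api.types.is_float_dtype(extracted_df["energy_burden"])
```

Without the `dtype` argument, pandas would parse the identifiers as integers and drop the leading zero, which is the kind of error these tests are meant to catch.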
|
|
|
|
Down the line, we could use a tool like [Pandera](https://pandera.readthedocs.io/) to enforce schemas, both for the tests and the classes themselves.
|
|
|
|
#### Transform Tests
|
|
|
|
Transform tests are the heart of ETL unit tests, and compare ideal dataframes with their actual counterparts.
|
|
|
|
See the [Fixtures](#configuration--fixtures) section above for information about where the data comes from.
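
As an illustration of comparing an ideal dataframe with its actual counterpart, here is a small sketch. The `transform` function, the `is_high_poverty` column, and the 20% threshold are hypothetical and exist only to give the comparison something to check.

```python
import pandas as pd
import pandas.testing as pdt


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform: flag tracts at or above a 20% poverty rate."""
    out = df.copy()
    out["is_high_poverty"] = out["poverty_rate"] >= 0.20
    return out


input_df = pd.DataFrame(
    {"GEOID10_TRACT": ["01001020100", "01001020200"], "poverty_rate": [0.25, 0.10]}
)
expected_df = pd.DataFrame(
    {
        "GEOID10_TRACT": ["01001020100", "01001020200"],
        "poverty_rate": [0.25, 0.10],
        "is_high_poverty": [True, False],
    }
)

actual_df = transform(input_df)

# Raises AssertionError on any difference in values, dtypes, or column order.
pdt.assert_frame_equal(actual_df, expected_df)
```

In the real tests the expected dataframe comes from a fixture rather than being built inline, but the comparison step has the same shape.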
|
|
|
|
#### Load Tests
|
|
|
|
Load tests use [tmp_path_factory](https://docs.pytest.org/en/latest/how-to/tmp_path.html) to create a temporary file system under `temp_dir` and validate that the correct files are written to the correct locations.
|
|
|
|
Additional future modifications could include the use of Pandera and/or other schema validation tools, and/or a more explicit test that the data written to file can be read back in and yield the same dataframe.
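
A round-trip check of that kind could be sketched as below. The `load` function and the `usa.csv` file name are hypothetical stand-ins for an ETL's load step, and `tempfile` is used so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd
import pandas.testing as pdt


def load(df: pd.DataFrame, output_dir: Path) -> Path:
    """Hypothetical load step: write the dataframe to `usa.csv`."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_path = output_dir / "usa.csv"
    df.to_csv(output_path, index=False)
    return output_path


output_df = pd.DataFrame(
    {"GEOID10_TRACT": ["01001020100", "01001020200"], "score": [0.25, 0.75]}
)

# In a pytest test, the `tmp_path` fixture would replace TemporaryDirectory.
with tempfile.TemporaryDirectory() as tmp_dir:
    written_path = load(output_df, Path(tmp_dir) / "dataset")
    file_was_written = written_path.is_file()
    # Read identifiers back as strings, or the leading zeros are lost.
    round_trip_df = pd.read_csv(written_path, dtype={"GEOID10_TRACT": str})

# First check the file landed where expected, then that its contents survive.
assert file_was_written
pdt.assert_frame_equal(round_trip_df, output_df)
```

The second assertion is the stricter one: a file can exist in the right place and still lose type information when it is read back.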
|
|
|
|
### Smoketests
|
|
|
|
To ensure the score and tiles process correctly, there is a suite of "smoke tests" that can be run after the ETL and score steps have run and outputs like the frontend GEOJSON have been created.
These tests are implemented as pytest tests but are skipped by default. To run them:
|
|
|
|
1. Generate a full score with `poetry run python3 data_pipeline/application.py score-full-run`
2. Generate the tile data with `poetry run python3 data_pipeline/application.py generate-score-post`
3. Generate the frontend GEOJSON with `poetry run python3 data_pipeline/application.py geo-score`
4. Select the smoke tests for pytest with `poetry run pytest data_pipeline/tests -k smoketest`
|