* working notebook
* updating notebook
* wip
* fixing broken tests
* adding tribal overlap files
* WIP
* WIP
* WIP, calculated count and names
* working
* partial cleanup
* partial cleanup
* updating field names
* fixing bug
* removing pyogrio
* removing unused imports
* updating test fixtures to be more realistic
* cleaning up notebook
* fixing black
* fixing flake8 errors
* adding tox instructions
* updating etl_score
* suppressing warning
* Use projected CRSes, ignore geom types (#1900)
I looked into this a bit, and in general the geometry type mismatch
changes very little about the calculation; we have a mix of
multipolygons and polygons. The fastest thing to do is just not keep
geom type; I did some runs with it set to both True and False, and
they're the same within 9 digits of precision. Logically we just want to
overlaps, regardless of how the actual geometries are encoded between
the frames, so we can in this case ignore the geom types and feel OKAY.
I also moved to projected CRSes, since we are actually trying to do area
calculations and so like, we should. Again, the change is small in
magnitude but logically more sound.
* Readd CDC dataset config (#1900)
* adding comments to fips code
* delete unnecessary loggers
Co-authored-by: matt bowen <matthew.r.bowen@omb.eop.gov>
* Add notebook to generate test data (#1780)
* Add Abandoned Mine Land data (#1780)
Using a similar structure but simpler apporach compared to FUDs, add an
indicator for whether a tract has an abandonded mine.
* Adding some detail to dataset readmes
Just a thought!
* Apply feedback from revieiw (#1780)
* Fixup bad string that broke test (#1780)
* Update a string that I should have renamed (#1780)
* Reduce number of threads to reduce memory pressure (#1780)
* Try not running geo data (#1780)
* Run the high-memory sets separately (#1780)
* Actually deduplicate (#1780)
* Add flag for memory intensive ETLs (#1780)
* Document new flag for datasets (#1780)
* Add flag for new datasets fro rebase (#1780)
Co-authored-by: Emma Nechamkin <97977170+emma-nechamkin@users.noreply.github.com>
* Add spatial join method (#1871)
Since we'll need to figure out the tracts for a large number of points
in future tickets, add a utility to handle grabbing the tract geometries
and adding tract data to a point dataset.
* Add FUDS, also jupyter lab (#1871)
* Add YAML configs for FUDS (#1871)
* Allow input geoid to be optional (#1871)
* Add FUDS ETL, tests, test-datae noteobook (#1871)
This adds the ETL class for Formerly Used Defense Sites (FUDS). This is
different from most other ETLs since these FUDS are not provided by
tract, but instead by geographic point, so we need to assign FUDS to
tracts and then do calculations from there.
* Floats -> Ints, as I intended (#1871)
* Floats -> Ints, as I intended (#1871)
* Formatting fixes (#1871)
* Add test false positive GEOIDs (#1871)
* Add gdal binaries (#1871)
* Refactor pandas code to be more idiomatic (#1871)
Per Emma, the more pandas-y way of doing my counts is using np.where to
add the values i need, then groupby and size. It is definitely more
compact, and also I think more correct!
* Update configs per Emma suggestions (#1871)
* Type fixed! (#1871)
* Remove spurious import from vscode (#1871)
* Snapshot update after changing col name (#1871)
* Move up GDAL (#1871)
* Adjust geojson strategy (#1871)
* Try running census separately first (#1871)
* Fix import order (#1871)
* Cleanup cache strategy (#1871)
* Download census data from S3 instead of re-calculating (#1871)
* Clarify pandas code per Emma (#1871)
Imputes income field with a light refactor. Needs more refactor and more tests (I spotchecked). Next ticket will check and address but a lot of "narwhal" architecture is here.
* starting tribal pr
* further pipeline work
* bia merge working
* alaska villages and tribal geo generate
* tribal folders
* adding data full run
* tile generation
* tribal tile deploy
* WIP on parallelizing
* switching to get_tmp_path for nri
* switching to get_tmp_path everywhere necessary
* fixing linter errors
* moving heavy ETLs to front of line
* add hold
* moving cdc places up
* removing unnecessary print
* moving h&t up
* adding parallel to geo post
* better census labels
* switching to concurrent futures
* fixing output
* draft wip
* initial commit
* clear output from notebook
* revert to 65ceb7900f
* draft wip
* initial commit
* clear output from notebook
* revert to 65ceb7900f
* make michigan prefix for readable
* standardize Michigan names and move all constants from class into field names module
* standardize Michigan names and move all constants from class into field names module
* include only pertinent columns for scoring comparison tool
* michigan EJSCREEN standardization
* final PR feedback
* added exposition and summary of Michigan EJSCREEN
* added exposition and summary of Michigan EJSCREEN
* fix typo
Co-authored-by: Saran Ahluwalia <ahlusar.ahluwalia@gmail.com>
* Adds four fields:
* Summer days above 90F
* Percent low access to healthy food
* Percent impenetrable surface areas
* Low third grade reading proficiency
* Each of these four gets added into Definition L in various factors.
* Additionally, I add college attendance fields to the ETL for Census ACS.
* This PR also introduces the notion of "reverse percentiles", relevant to ticket #970.
This ended up being a pretty large task. Here's what this PR does:
1. Pulls in Vincent's data from island areas into the score ETL. This is from the 2010 decennial census, the last census of any kind in the island areas.
2. Grabs a few new fields from 2010 island areas decennial census.
3. Calculates area median income for island areas.
4. Stops using EJSCREEN as the source of our high school education data and directly pulls that from census (this was related to this project so I went ahead and fixed it).
5. Grabs a bunch of data from the 2010 ACS in the states/Puerto Rico/DC, so that we can create percentiles comparing apples-to-apples (ish) from 2010 island areas decennial census data to 2010 ACS data. This required creating a new class because all the ACS fields are different between 2010 and 2019, so it wasn't as simple as looping over a year parameter.
6. Creates a combined population field of island areas and mainland so we can use those stats in our comparison tool, and updates the comparison tool accordingly.
* Adds dev dependencies to requirements.txt and re-runs black on codebase
* Adds test and code for national risk index etl, still in progress
* Removes test_data from .gitignore
* Adds test data to nation_risk_index tests
* Creates tests and ETL class for NRI data
* Adds tests for load() and transform() methods of NationalRiskIndexETL
* Updates README.md with info about the NRI dataset
* Adds to dos
* Moves tests and test data into a tests/ dir in national_risk_index
* Moves tmp_dir for tests into data/tmp/tests/
* Promotes fixtures to conftest and relocates national_risk_index tests:
The relocation of national_risk_index tests is necessary because tests
can only use fixtures specified in conftests within the same package
* Fixes issue with df.equals() in test_transform()
* Files reformatted by black
* Commit changes to other files after re-running black
* Fixes unused import that caused lint checks to fail
* Moves tests/ directory to app root for data_pipeline
Error this addresses:
File "/Users/lucas/Documents/usds/repos/justice40-tool/data/data-pipeline/data_pipeline/etl/runner.py", line 71, in etl_runner
f"data_pipeline.etl.sources.{dataset['module_dir']}.etl"
TypeError: 'NoneType' object is not subscriptable