Commit graph

52 commits

Author SHA1 Message Date
lucasmbrown-usds
5ff988ab29 updating pylint 2022-09-30 13:43:41 -04:00
lucasmbrown-usds
07c4c030d3 fixing merge conflicts 2022-09-30 13:43:31 -04:00
Matt Bowen
6e0ef33d81
Add tribal count notebook (#1917) (#1919)
* Add tribal count notebook (#1917)

* test without caching

* added comment

Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
2022-09-23 14:33:15 -04:00
Lucas Merrill Brown
aca226165c
Issue 1900: Tribal overlap with Census tracts (#1903)
* working notebook

* updating notebook

* wip

* fixing broken tests

* adding tribal overlap files

* WIP

* WIP

* WIP, calculated count and names

* working

* partial cleanup

* partial cleanup

* updating field names

* fixing bug

* removing pyogrio

* removing unused imports

* updating test fixtures to be more realistic

* cleaning up notebook

* fixing black

* fixing flake8 errors

* adding tox instructions

* updating etl_score

* suppressing warning

* Use projected CRSes, ignore geom types (#1900)

I looked into this a bit, and in general the geometry type mismatch
changes very little about the calculation; we have a mix of
multipolygons and polygons. The fastest thing to do is just not keep
geom type; I did some runs with it set to both True and False, and
they're the same within 9 digits of precision. Logically we just want the
overlaps, regardless of how the actual geometries are encoded between
the frames, so we can in this case ignore the geom types and feel OKAY.

I also moved to projected CRSes, since we are actually trying to do area
calculations and so like, we should. Again, the change is small in
magnitude but logically more sound.

* Readd CDC dataset config (#1900)

* adding comments to fips code

* delete unnecessary loggers

Co-authored-by: matt bowen <matthew.r.bowen@omb.eop.gov>
2022-09-20 14:53:12 -04:00
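
The reasoning in the "Use projected CRSes" note above can be illustrated without any GIS library. The sketch below is a stand-in under stated assumptions, not the notebook's code: it uses a spherical Lambert cylindrical equal-area projection (the commit does not say which projected CRS was chosen) and the shoelace formula. The box coordinates and helper names are hypothetical. It shows that two one-degree boxes have the same "area" in raw degrees, while their projected areas differ sharply with latitude.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean spherical radius; an approximation


def shoelace_area(points):
    """Planar polygon area from an ordered list of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def to_equal_area(lon_deg, lat_deg):
    """Project lon/lat degrees to metres (Lambert cylindrical equal-area, sphere)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.sin(math.radians(lat_deg))
    return (x, y)


def box(lon_min, lat_min, size=1.0):
    """Corners of a size-by-size degree box, counter-clockwise."""
    return [
        (lon_min, lat_min),
        (lon_min + size, lat_min),
        (lon_min + size, lat_min + size),
        (lon_min, lat_min + size),
    ]


# Hypothetical boxes: one at the equator, one at a high latitude.
equator_box = box(-100.0, 0.0)
alaska_box = box(-150.0, 64.0)

# In unprojected degrees both boxes appear to have the same size.
degrees_equator = shoelace_area(equator_box)
degrees_alaska = shoelace_area(alaska_box)

# After projecting, the high-latitude box is much smaller on the ground.
metres_equator = shoelace_area([to_equal_area(*p) for p in equator_box])
metres_alaska = shoelace_area([to_equal_area(*p) for p in alaska_box])
```

Because an overlap share is a ratio of areas, measuring both areas in a projected, area-preserving coordinate system keeps the ratio meaningful wherever the tract sits.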
Emma Nechamkin
b0b7ff0eec
just testing that the boolean is preserved on gha (#1867)
* updated with hopefully a fix; coercing aml, fuds, hrs to booleans for the raw value to preserve null character.
2022-08-31 12:55:03 -04:00
Emma Nechamkin
1c4d3e4142
Score tests (#1847)
* update Python version on README; tuple typing fix

* Alaska tribal points fix (#1821)

* Bump mistune from 0.8.4 to 2.0.3 in /data/data-pipeline (#1777)

Bumps [mistune](https://github.com/lepture/mistune) from 0.8.4 to 2.0.3.
- [Release notes](https://github.com/lepture/mistune/releases)
- [Changelog](https://github.com/lepture/mistune/blob/master/docs/changes.rst)
- [Commits](https://github.com/lepture/mistune/compare/v0.8.4...v2.0.3)

---
updated-dependencies:
- dependency-name: mistune
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* poetry update

* initial pass of score tests

* add threshold tests

* added ses threshold (not donut, not island)

* testing suite -- stopping for the day

* added test for lead proxy indicator

* Refactor score tests to make them less verbose and more direct (#1865)

* Cleanup tests slightly before refactor (#1846)

* Refactor score calculations tests

* Feedback from review

* Refactor output tests like calculation tests (#1846) (#1870)

* Reorganize files (#1846)

* Switch from lru_cache to fixture scopes (#1846)

* Add tests for all factors (#1846)

* Mark smoketests and run as part of be deploy (#1846)

* Update renamed var (#1846)

* Switch from named tuple to dataclass (#1846)

This is annoying, but pylint in python3.8 was crashing parsing the named
tuple. We weren't using any namedtuple-specific features, so I made the
type a dataclass just to get pylint to behave.

* Add default timeout to requests (#1846)

* Fix type (#1846)

* Fix merge mistake on poetry.lock (#1846)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov>
Co-authored-by: Jorge Escobar <83969469+esfoobar-usds@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Bowen <83967628+mattbowen-usds@users.noreply.github.com>
Co-authored-by: matt bowen <matthew.r.bowen@omb.eop.gov>
2022-08-26 15:23:20 -04:00
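
The "Switch from named tuple to dataclass" item in the commit above can be sketched as follows. The record and field names are hypothetical and are not taken from the test suite; the point is only that attribute access carries over unchanged, while tuple-specific behaviour such as indexing does not.

```python
from dataclasses import dataclass
from typing import NamedTuple


class PercentileCaseTuple(NamedTuple):
    """Hypothetical test-case record written as a named tuple."""

    percentile_column: str
    threshold: float


@dataclass(frozen=True)
class PercentileCase:
    """The same hypothetical record as a frozen dataclass."""

    percentile_column: str
    threshold: float


old_case = PercentileCaseTuple("poverty_percentile", 0.9)
new_case = PercentileCase("poverty_percentile", 0.9)

# Attribute access works the same way on both types.
same_fields = (
    old_case.percentile_column == new_case.percentile_column
    and old_case.threshold == new_case.threshold
)

# Tuple-only behaviour such as indexing is what a dataclass gives up.
try:
    new_case[0]
    supports_indexing = True
except TypeError:
    supports_indexing = False
```

Using `frozen=True` keeps the record immutable, which is the closest match to a named tuple when no tuple-specific features are needed.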
Emma Nechamkin
6418335219
Updates backend constants to N (#1854) 2022-08-23 16:19:00 -04:00
Matt Bowen
6e41e0d9f0
Add donut hole calculation to score (#1828)
Adds adjacency index to the pipeline. Requires thorough QA
2022-08-18 12:04:46 -04:00
Matt Bowen
49623e4da0
Add abandoned mine lands data (#1824)
* Add notebook to generate test data (#1780)

* Add Abandoned Mine Land data (#1780)

Using a similar structure but simpler approach compared to FUDS, add an
indicator for whether a tract has an abandoned mine.

* Adding some detail to dataset readmes

Just a thought!

* Apply feedback from review (#1780)

* Fixup bad string that broke test (#1780)

* Update a string that I should have renamed (#1780)

* Reduce number of threads to reduce memory pressure (#1780)

* Try not running geo data (#1780)

* Run the high-memory sets separately (#1780)

* Actually deduplicate (#1780)

* Add flag for memory intensive ETLs (#1780)

* Document new flag for datasets (#1780)

* Add flag for new datasets from rebase (#1780)

Co-authored-by: Emma Nechamkin <97977170+emma-nechamkin@users.noreply.github.com>
2022-08-17 11:33:59 -04:00
Matt Bowen
d5fbb802e8
Add FUDS ETL (#1817)
* Add spatial join method (#1871)

Since we'll need to figure out the tracts for a large number of points
in future tickets, add a utility to handle grabbing the tract geometries
and adding tract data to a point dataset.

* Add FUDS, also jupyter lab (#1871)

* Add YAML configs for FUDS (#1871)

* Allow input geoid to be optional (#1871)

* Add FUDS ETL, tests, test-data notebook (#1871)

This adds the ETL class for Formerly Used Defense Sites (FUDS). This is
different from most other ETLs since these FUDS are not provided by
tract, but instead by geographic point, so we need to assign FUDS to
tracts and then do calculations from there.

* Floats -> Ints, as I intended (#1871)

* Floats -> Ints, as I intended (#1871)

* Formatting fixes (#1871)

* Add test false positive GEOIDs (#1871)

* Add gdal binaries (#1871)

* Refactor pandas code to be more idiomatic (#1871)

Per Emma, the more pandas-y way of doing my counts is using np.where to
add the values i need, then groupby and size. It is definitely more
compact, and also I think more correct!

* Update configs per Emma suggestions (#1871)

* Type fixed! (#1871)

* Remove spurious import from vscode (#1871)

* Snapshot update after changing col name (#1871)

* Move up GDAL (#1871)

* Adjust geojson strategy (#1871)

* Try running census separately first (#1871)

* Fix import order (#1871)

* Cleanup cache strategy (#1871)

* Download census data from S3 instead of re-calculating (#1871)

* Clarify pandas code per Emma (#1871)
2022-08-16 13:28:39 -04:00
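
The FUDS commit above describes two steps: assigning point records to tracts, then counting records per tract. The sketch below illustrates both in pure Python. It is not the pipeline's utility: the commit describes the counting step with `np.where`, `groupby` and `size`, whereas this example swaps in a ray-casting point-in-polygon test and `collections.Counter`. The GEOIDs, coordinates and the `eligible` flag are all hypothetical.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that a horizontal ray from the point crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(point, tracts):
    """Return the GEOID of the first tract containing the point, else None."""
    for geoid, polygon in tracts.items():
        if point_in_polygon(point["x"], point["y"], polygon):
            return geoid
    return None


# Hypothetical square tracts keyed by made-up GEOIDs.
tracts = {
    "00000000001": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "00000000002": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

# Hypothetical site points; "eligible" stands in for a status flag.
sites = [
    {"x": 0.5, "y": 0.5, "eligible": True},
    {"x": 1.5, "y": 1.0, "eligible": False},
    {"x": 3.0, "y": 1.0, "eligible": True},
    {"x": 9.0, "y": 9.0, "eligible": True},  # falls outside every tract
]

site_counts = Counter()
eligible_counts = Counter()
for site in sites:
    geoid = assign_tract(site, tracts)
    if geoid is None:
        continue
    site_counts[geoid] += 1
    # Similar in spirit to flagging rows, then grouping by tract and counting.
    if site["eligible"]:
        eligible_counts[geoid] += 1
```

Points that fall outside every tract are dropped here; whether the real ETL drops or reports them is not stated in the commit.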
Vim
eb3004c0d5
Fix on large AK tracts that are off screen (#1740)
* Change low to high transition and global zoom

- change the low to high transition from 7 to 5. This can not go any lower as high tiles on AWS only go to zoom level 5
- reduce the zoom level globally on all census tracts

* Remove geolocation from feature flag

- geolocation is now available to all

* Add python notebook that sorts all tracts by area

- add a column of the required zoom level for the tract to be fully contained in the viewport

* Place geolocation back to behind a feature flag

* Differentiate zoom levels b/w shortcuts and tracts
2022-07-13 19:01:43 -07:00
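
The notebook item above mentions a column holding the zoom level required for a tract to be fully contained in the viewport. One way such a value could be computed is sketched below, assuming Web Mercator tiling, a 512-pixel tile and a 512 by 512 viewport. None of these values or function names come from the project, and the sketch ignores bounding boxes that cross the antimeridian or have zero width or height.

```python
import math


def mercator_fraction(lon_deg, lat_deg):
    """Map lon/lat to Web Mercator world coordinates in the range [0, 1]."""
    x = (lon_deg + 180.0) / 360.0
    lat_rad = math.radians(lat_deg)
    y = (1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0
    return x, y


def max_zoom_containing(bounds, viewport_px=(512, 512), tile_px=512):
    """Largest integer zoom at which the bounding box fits in the viewport.

    bounds is (west, south, east, north) in degrees. The viewport and tile
    sizes are assumptions, not values from the project.
    """
    west, south, east, north = bounds
    # y grows southward in tile coordinates, so the south edge has the larger y.
    x_min, y_max = mercator_fraction(west, south)
    x_max, y_min = mercator_fraction(east, north)
    width = x_max - x_min
    height = y_max - y_min
    # At zoom z the world is tile_px * 2**z pixels wide and tall.
    zoom_for_width = math.log2(viewport_px[0] / (tile_px * width))
    zoom_for_height = math.log2(viewport_px[1] / (tile_px * height))
    return math.floor(min(zoom_for_width, zoom_for_height))


# Hypothetical bounding boxes spanning a quarter and an eighth of the world's width.
quarter_world = max_zoom_containing((0.0, 0.0, 90.0, 10.0))
eighth_world = max_zoom_containing((0.0, 0.0, 45.0, 10.0))
```

Under these assumptions the two calls evaluate to 2 and 3: halving the width of the box allows one more zoom level, which is why very large tracts need a lower zoom to stay on screen.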
Emma Nechamkin
1b76a68838
FEMA data check (#1270)
We wanted to implement a slightly different FEMA AG LOSS indicator. Here, we take the 90th percentile only of tracts that have agvalue, and we also floor the denominator of the rate calculation (loss / total value) at $408k.
2022-02-17 16:53:04 -05:00
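
One possible reading of the FEMA description above is sketched here. The field names (`ag_loss`, `ag_value`), the sample data and the percentile method (`statistics.quantiles` with the inclusive method) are assumptions made for illustration; only the $408k floor and the restriction to tracts with an agricultural value come from the commit message.

```python
from statistics import quantiles

DENOMINATOR_FLOOR = 408_000  # dollars, per the commit description


def loss_rate(expected_loss, total_value):
    """Loss divided by total value, with the denominator floored."""
    return expected_loss / max(total_value, DENOMINATOR_FLOOR)


def flag_high_ag_loss(tracts):
    """Return the IDs of tracts whose rate is at or above the 90th percentile.

    Only tracts with a positive agricultural value take part in the
    percentile calculation; the rest are never flagged.
    """
    rates = {
        geoid: loss_rate(row["ag_loss"], row["ag_value"])
        for geoid, row in tracts.items()
        if row["ag_value"] > 0
    }
    if len(rates) < 2:
        return set()
    # The last of the nine decile cut points is the 90th percentile.
    cutoff = quantiles(rates.values(), n=10, method="inclusive")[-1]
    return {geoid for geoid, rate in rates.items() if rate >= cutoff}


# Hypothetical tracts: eleven ordinary ones plus two edge cases.
tracts = {
    f"tract_{i:02d}": {"ag_loss": i * 10_000, "ag_value": 1_000_000}
    for i in range(11)
}
tracts["tiny_value"] = {"ag_loss": 500, "ag_value": 1_000}
tracts["no_agriculture"] = {"ag_loss": 0, "ag_value": 0}

flagged = flag_high_ag_loss(tracts)
```

Without the floor, the `tiny_value` tract would have a rate of 0.5 and dominate the ranking; with the floor its rate is roughly 0.001, so it is not flagged.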
Lucas Merrill Brown
a0d6e55f0a
Run ETL processes in parallel (#1253)
* WIP on parallelizing

* switching to get_tmp_path for nri

* switching to get_tmp_path everywhere necessary

* fixing linter errors

* moving heavy ETLs to front of line

* add hold

* moving cdc places up

* removing unnecessary print

* moving h&t up

* adding parallel to geo post

* better census labels

* switching to concurrent futures

* fixing output
2022-02-11 14:04:53 -05:00
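
The parallelization steps listed above (heavy ETLs at the front of the line, a switch to concurrent futures) might look roughly like this. It is a minimal sketch: the dataset names and the `run_etl` stand-in are hypothetical, and the commit does not say whether threads or processes are used, so a thread pool is assumed here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_etl(name):
    """Stand-in for one dataset's extract/transform/load cycle."""
    return f"{name}: done"


# Hypothetical dataset names; the heavy ones are listed first so they
# start early instead of becoming the tail of the run.
datasets = ["heavy_dataset_a", "heavy_dataset_b", "small_dataset_c", "small_dataset_d"]


def run_all(names, max_workers=2):
    """Run every ETL on a pool and collect results keyed by dataset name."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_etl, name): name for name in names}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results


results = run_all(datasets)
```

Submitting the slowest jobs first matters because the total run time is bounded below by the longest single job; starting it last would leave the other workers idle while it finishes.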
Lucas Merrill Brown
43e005cc10
Issue 1075: Add refactored ETL tests to NRI (#1088)
* Adds a substantially refactored ETL test to the National Risk Index, to be used as a model for other tests
2022-02-08 19:05:32 -05:00
Emma Nechamkin
6a00b29f5d
Adding VA and CO ETL from mapping for environmental justice (#1177)
Adding the mapping for environmental justice data, which contains information about VA and CO, to the ETL pipeline.
2022-02-04 10:00:41 -05:00
Lucas Merrill Brown
18f299c5f8
Issue 1141: Definition M (#1151) 2022-01-18 14:56:55 -05:00
Saran Ahluwalia
a07bf752b0
Notebook investigating NHPD as a source for providing contemporary foreclosure data (#1012)
Co-authored-by: Saran Ahluwalia <sarahluw@cisco.com>
2022-01-18 13:08:27 -05:00
Saran Ahluwalia
87e08f5fe1
CDC SVI Index: Additions to data-pipeline and comparison tool (#1096)
* wip

* working

* working

* rename

* documentation

* add link

* add readme

* update fieldnames

* typo

* add comparison tool

* revise wording

* variable change for FIPS

* wording

* wording in readme

* cleanup wording

* update comparison tool

* final tune up

* grammar and punctuation in the documentation

* period

* cleanup comments

* added revisions

* parallelism

* PR feedback from Lucas

* remove extraneous fields from comparison tool

* style

* updates

* remove themes

* formatting

* remove references to percentile rank

* remove references to percentile rank

* typo in fieldnames

* updates based on feedback from Lucas

* fieldnames formatting

* fix broken markdown link

Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
2022-01-14 14:52:37 -05:00
Saran Ahluwalia
95a14adb35
Added Census Tract Aggregated Micro-data from EPA Risk-Screening Environmental Indicators (RSEI) model (#1101)
* added initial source code - todo is comparison tool

* added values

* rename fields

* check geoid

* added black

* added revisions

* added clean up to comments

* more comments

* formatting

* cleanup and address PR feedback

* fix changes

* final path changes

* style

* PR feedback

* added final PR comment

* fix flake 8

* add revisions
2022-01-14 13:50:49 -05:00
Saran Ahluwalia
a98ea35f74
Maryland EJSCREEN Addition to comparison tool (#1143)
* finalized

* cleanup notebook

* cleanup

* run black
2022-01-14 13:26:48 -05:00
Saran Ahluwalia
2604b66cf7
Fix errors and improve code quality and readability in Health Scores (#1147)
* run black on health_score.py

* to_numpy() versus values - see https://pandas.pydata.org/pandas-docs/version/0.24.0rc1/api/generated/pandas.Series.to_numpy.html
2022-01-14 13:11:47 -05:00
Saran Ahluwalia
98ff4bd9d8
Add experimental Jupyter notebook with Health Scoring Methodology Example for Health Scores (#989)
Co-authored-by: Saran Ahluwalia <sarahluw@cisco.com>
2022-01-13 14:43:27 -05:00
Lucas Merrill Brown
114e6b765a
Issue 1129: remove deprecated field other_census_tract_fields_to_keep (#1130) 2022-01-12 10:16:09 -05:00
Saran Ahluwalia
a4137fdc98
Add Michigan EJ Screen into data-pipeline's ETL and provide automated scoring and statistics outputs (#1091)
* draft wip

* initial commit

* clear output from notebook

* revert to 65ceb7900f

* draft wip

* initial commit

* clear output from notebook

* revert to 65ceb7900f

* make michigan prefix more readable

* standardize Michigan names and move all constants from class into field names module

* standardize Michigan names and move all constants from class into field names module

* include only pertinent columns for scoring comparison tool

* michigan EJSCREEN standardization

* final PR feedback

* added exposition and summary of Michigan EJSCREEN

* added exposition and summary of Michigan EJSCREEN

* fix typo

Co-authored-by: Saran Ahluwalia <ahlusar.ahluwalia@gmail.com>
2021-12-31 15:38:52 -05:00
Lucas Merrill Brown
beb0eea5cc
Alternative definition of DACs for comparison (#1068)
* Alternative energy-related definition of DACs
2021-12-27 12:05:59 -05:00
Lucas Merrill Brown
0d10534725
Issue 1044: Add low HS education fields to tiles and download (#1046) 2021-12-14 15:41:06 -05:00
Lucas Merrill Brown
7fcecaee42
Issue 970: reverse percentiles for AMI and life expectancy (#1018)
* switching to low

* fixing score-etl-post

* updating comments

* fixing comparison

* create separate field for clarity

* comment fix

* removing healthy food

* fixing bug in score post

* running black and adding comment

* Update pickles and add helpful notes to README

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
2021-12-10 10:16:22 -05:00
Lucas Merrill Brown
780d1126ff
Creating notebook to compare two score files for differences (#984) 2021-12-07 16:20:41 -05:00
Lucas Merrill Brown
5706837956
Add NATA cancer risk and respiratory hazard to definition L (#1001) 2021-12-07 12:45:45 -05:00
Lucas Merrill Brown
c5dff6e5f7
Issue 242: Add HOLC Grades to data inputs (#978)
* Add mapping inequality data to data inputs

* Add mapping inequality data to comparison tool
2021-12-04 12:23:01 -05:00
Lucas Merrill Brown
1d101c93d2
Issue 844: Add island areas to Definition L (#957)
This ended up being a pretty large task. Here's what this PR does:

1. Pulls in Vincent's data from island areas into the score ETL. This is from the 2010 decennial census, the last census of any kind in the island areas.
2. Grabs a few new fields from 2010 island areas decennial census.
3. Calculates area median income for island areas.
4. Stops using EJSCREEN as the source of our high school education data and directly pulls that from census (this was related to this project so I went ahead and fixed it).
5. Grabs a bunch of data from the 2010 ACS in the states/Puerto Rico/DC, so that we can create percentiles comparing apples-to-apples (ish) from 2010 island areas decennial census data to 2010 ACS data. This required creating a new class because all the ACS fields are different between 2010 and 2019, so it wasn't as simple as looping over a year parameter.
6. Creates a combined population field of island areas and mainland so we can use those stats in our comparison tool, and updates the comparison tool accordingly.
2021-12-03 15:46:10 -05:00
Vincent La
84874ee4a5
[ISS-751] Updating comments for geocorr ETL (#913) 2021-12-03 10:10:05 -05:00
Lucas Merrill Brown
5c65eed28f
Issue 838: Update comparison tool to use tracts (#934)
* Updating comparison tool to use tracts, and rely more heavily on `field_names`
2021-11-30 18:46:29 -05:00
Saran Ahluwalia
ec8f3543e5
Remove Index related to FEMA (#917)
Co-authored-by: Saran Ahluwalia <sarahluw@cisco.com>
2021-11-24 16:50:09 -05:00
Lucas Merrill Brown
e8d64df510
Fixing missing FEMA fields (#892) 2021-11-15 11:06:44 -05:00
Lucas Merrill Brown
05ebf9b48c
Add median house value to Definition L (#882)
* Added house value to ETL

* Adding house value to score formula and comp tool
2021-11-13 10:29:23 -05:00
Lucas Merrill Brown
03e59f2abd
Definition L updates (#862)
* Changing FEMA risk measure 

* Adding "basic stats" feature to comparison tool 

* Tweaking Definition L
2021-11-05 15:43:52 -04:00
Lucas Merrill Brown
8372b47d42
Various updates to Definition L (#850)
* removing percentiles as separate field names

* adding RMP
2021-11-04 12:17:45 -04:00
Lucas Merrill Brown
1d541be447
Add EJSCREEN Areas of Concern (#843)
* Adding ej screen areas of concern

* Uses it where user has local files, but not otherwise

Co-authored-by: VincentLaUSDS <vincent.la@omb.eop.gov>
2021-11-02 15:38:42 -04:00
Shelby Switzer
7bd1a9e59e
Big ole score refactor (#815)
* WIP

* Create ScoreCalculator

This calculates all the factors for score L for now (with placeholder
formulae because this is a WIP). I think ideally we'll want to
refactor all the score code to be extracted into this or similar
classes.

* Add factor logic for score L

Updated factor logic to match score L factors methodology.
Still need to get the Score L field itself working.

Cleanup needed: Pull field names into constants file, extract all score
calculation into score calculator

* Update thresholds and get score L calc working

* Update header name for consistency and update comparison tool

* Initial move of score to score calculator

* WIP big refactor

* Continued WIP on score refactor

* WIP score refactor

* Get to a working score-run

* Refactor to pass df to score init

This makes it easier to pass df around within a class with multiple
methods that require df.

* Updates from Black

* Updates from linting

* Use named imports instead of wildcard; log more

* Additional refactors

* move more field names to field_names constants file
* import constants without a relative path (would break docker)
* run linting
* raise error if add_columns is not implemented in a child class

* Refactor dict to namedtuple in score c

* Update L to use all percentile field

* change high school ed field in L back

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
2021-11-02 14:12:53 -04:00
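
Two items from the refactor above, passing the data to the score at construction and raising an error when `add_columns` is not implemented in a child class, can be sketched as below. The class names, the threshold rule and the field names are hypothetical, and a list of dicts stands in for the `df` mentioned in the commit.

```python
class Score:
    """Hypothetical base class: holds the data and defines the contract."""

    def __init__(self, rows):
        # The data is passed once at construction and shared by all methods.
        self.rows = rows

    def add_columns(self):
        raise NotImplementedError("Child classes must implement add_columns()")


class ScoreExample(Score):
    """Hypothetical child score with a made-up threshold rule."""

    THRESHOLD = 0.9

    def add_columns(self):
        for row in self.rows:
            row["score_example"] = row["burden_percentile"] >= self.THRESHOLD
        return self.rows


rows = [{"burden_percentile": 0.95}, {"burden_percentile": 0.40}]
scored = ScoreExample(rows).add_columns()

# Calling the base class directly fails loudly instead of silently doing nothing.
try:
    Score(rows).add_columns()
    base_is_usable = True
except NotImplementedError:
    base_is_usable = False
```

Holding the data on the instance avoids threading it through every helper method, which is the convenience the commit message describes.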
Shelby Switzer
7b87e0ec99
Add Score L (#812)
* Create ScoreCalculator

This calculates all the factors for score L for now (with placeholder
formulae because this is a WIP). I think ideally we'll want to
refactor all the score code to be extracted into this or similar
classes.

* Add factor logic for score L

Updated factor logic to match score L factors methodology.
Still need to get the Score L field itself working.

Cleanup needed: Pull field names into constants file, extract all score
calculation into score calculator

Co-authored-by: Shelby Switzer <shelby.switzer@cms.hhs.gov>
Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
2021-10-28 16:07:41 -04:00
Lucas Merrill Brown
b1a4d26be8
Adding persistent poverty tracts (#738)
* persistent poverty working

* fixing left-padding

* running black and adding persistent poverty to comp tool

* fixing bug

* running black and fixing linter

* fixing linter

* fixing linter error
2021-09-22 17:57:08 -04:00
Vincent La
7709836a12
Ticket 355: Adding map to Urban vs Rural Census Tracts (#696)
* Adding urban vs rural notebook

* Adding new code

* Adding settings

* Adding usa.csv

* Adding etl

* Adding etl

* Adding to etl_score

* quick changes to notebook

* Ensuring notebook can run

* Adding urban vs rural notebook

* Adding new code

* Adding settings

* Adding usa.csv

* Adding etl

* Adding etl

* Adding to etl_score

* quick changes to notebook

* Ensuring notebook can run

* adding urban to comparison tool

* renaming file

* adding urban rural to more comp tool outputs

* updating requirements and poetry

* Adding ej screen notebook

* removing ej screen notebook since it's in justice40-tool-iss-719

Co-authored-by: La <ryy0@cdc.gov>
Co-authored-by: lucasmbrown-usds <lucas.m.brown@omb.eop.gov>
2021-09-22 12:31:03 -04:00
Lucas Merrill Brown
a1a988da46
Minor updates to scoring comparison tool (#686)
* Formatting updates for output XLSX
2021-09-16 14:06:33 -05:00
Lucas Merrill Brown
e94d05882c
Issue 675 & 676: Adding life expectancy and DOE energy burden data (#683)
* Adding two new data sources.
2021-09-15 09:59:28 -05:00
Lucas Merrill Brown
52e70653f0
Prototype H (#682) 2021-09-14 16:16:41 -05:00
Lucas Merrill Brown
1083e953da
Prototype G (#672)
* wip

* cleanup

* cleanup 2

* fixing import ordering linter error

* updating backend to use score G

* adding percentile to score output

* update tippecanoe compression

Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov>
2021-09-14 10:48:11 -04:00
Lucas Merrill Brown
65ceb7900f
Score F, testing methodology (#510)
* fixing dependency issue

* fixing more dependencies

* including fraction of state AMI

* wip

* nitpick whitespace

* etl working now

* wip on scoring

* fix rename error

* reducing metrics

* fixing score f

* fixing readme

* adding dependency

* passing tests;

* linting/black

* removing unnecessary sample

* fixing error

* adding verify flag on etl/base

Co-authored-by: Jorge Escobar <jorge.e.escobar@omb.eop.gov>
2021-08-24 16:40:54 -04:00
lucasmbrown-usds
ebe6180f7c wip 2021-08-09 22:24:14 -05:00
lucasmbrown-usds
cf13036d20 clearing output 2021-08-09 21:31:07 -05:00