Add ability to cache ETL data sources (#2169)

* Add a rough prototype allowing a developer to pre-download data sources for all ETLs
* Update code to be more production-ish
* Move fetch to Extract part of ETL
* Create a downloader to house all downloading operations
* Remove unnecessary "name" in data source
* Format source files with black
* Fix issues from pylint and get the tests working with the new folder structure
* Clean up files with black
* Fix unzip test
* Add caching notes to README
* Fix tests (linting and case sensitivity bug)
* Address PR comments and add API keys for census where missing
* Merging comparator changes from main into this branch for the sake of the PR
* Add note on using cache (-u) during pipeline
parent 4d9c1dd11e
commit 6f39033dde

52 changed files with 1787 additions and 686 deletions

@@ -92,7 +92,6 @@ If you want to run specific data tasks, you can open a terminal window, navigate

- Generate Map Tiles: `docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application generate-map-tiles`
To learn more about these commands and when they should be run, refer to [Running for Local Development](#running-for-local-development).
</details>
---
@@ -136,6 +135,9 @@ Once you've downloaded the census data, run the following commands – in order
Many commands have options. For example, you can run a single dataset with `etl-run` by passing the command line parameter `-d name-of-dataset-to-run`. Please use the `--help` option to find out more.
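
For instance, a sketch of running a single dataset, following the Docker invocation pattern shown earlier in this README (the dataset name `census_decennial` is purely illustrative, not necessarily a real dataset name):

```sh
# Run the ETL step for one dataset; the dataset name here is illustrative.
docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -d census_decennial

# List the options that etl-run accepts.
docker run --rm -it j40_data_pipeline python3 -m data_pipeline.application etl-run --help
```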
> :bulb: **NOTE**
> One important command line option is enabling cached data sources. Pass the command line parameter `-u` to many commands (e.g. `etl-run`) to use locally cached data sources within the ETL portion of the pipeline. This will ensure that you don't download many GB of data with each run of the data pipeline.
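
For example, a minimal sketch of a cached run, again assuming the Docker invocation pattern used above:

```sh
# Reuse locally cached data sources (-u) instead of re-downloading them.
docker run --rm -it -v ${PWD}/data/data-pipeline/data_pipeline/data:/data_pipeline/data j40_data_pipeline python3 -m data_pipeline.application etl-run -u
```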
## How Scoring Works
Scores are generated by running the `score-run` command via Poetry or Docker. This command executes [`data_pipeline/etl/score/etl_score.py`](data_pipeline/etl/score/etl_score.py). During execution,