# Rollback Plan

> Note: This guide is up to date as of [this commit](https://github.com/usds/justice40-tool/commit/0d57dd572be027a2fc8b1625958ed68c4b900653), 12/16/2021. If you want to sanity check that this guide is still relevant, go to [Rollback Details](#rollback-details).

## When To Roll Back?

If the final map data generated by the pipeline is incorrect or missing, follow
the emergency steps on this page to restore the data to a known good state.

## Rollback Theory

Rollback relies on two facts:

1. The S3 bucket containing our data uses [S3 bucket versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html), which lets us revert specific files to a previous version.
2. The [Combine and Tileify step](https://github.com/usds/justice40-tool/blob/main/.github/workflows/combine-tilefy.yml) reads only two files from S3 as input, so reverting those two files and rerunning the job regenerates the map from the previous data.
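
To make the first point concrete, here is a minimal sketch of how a versioned store behaves during a rollback. `VersionedBucket` is a hypothetical in-memory toy written for this guide, not an AWS API, and the version IDs it produces (`v1`, `v2`, ...) are made up. It only illustrates the idea that a write adds a new version on top of the old ones, so a known good version can be fetched and published again.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class VersionedBucket:
    """Toy in-memory model of a versioned bucket (hypothetical, not an AWS API)."""

    history: Dict[str, List[Tuple[str, bytes]]] = field(default_factory=dict)

    def put(self, key: str, body: bytes) -> str:
        # A write never replaces old data; it appends a new version.
        versions = self.history.setdefault(key, [])
        version_id = f"v{len(versions) + 1}"
        versions.append((version_id, body))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        versions = self.history[key]
        if version_id is None:
            return versions[-1][1]  # the live (most recent) version
        for candidate_id, body in versions:
            if candidate_id == version_id:
                return body
        raise KeyError(version_id)


KEY = "data-sources/census.zip"
bucket = VersionedBucket()
good_id = bucket.put(KEY, b"known good data")
bucket.put(KEY, b"bad data")  # the pipeline publishes a bad file

# Roll back: fetch the known good version and upload it again.
bucket.put(KEY, bucket.get(KEY, good_id))

print(bucket.get(KEY))           # b'known good data'
print(len(bucket.history[KEY]))  # 3
```

Note that the bad version is still in the history after the rollback, which is why the procedure is safe to repeat if the wrong version was picked.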

If you feel confident in this and want to do the rollback now, proceed to [Rollback Steps](#rollback-steps).

If you want to understand more deeply what's going on, and sanity check that the
code hasn't changed since this guide was written, go to [Rollback Details](#rollback-details).

## Rollback Steps

### 1. List Object Versions

```shell
aws s3api list-object-versions \
  --bucket justice40-data \
  --prefix data-sources/census.zip \
  --query "Versions[*].{Key:Key,VersionId:VersionId,LastModified:LastModified}"
```

You should get something like:

```
[
    ...
    {
        "Key": "data-sources/census.zip",
        "VersionId": "<version_id>",
        "LastModified": "2021-11-29T18:57:40+00:00"
    }
    ...
]
```

Do the same thing with the score file:

```shell
aws s3api list-object-versions \
  --bucket justice40-data \
  --prefix data-pipeline/data/score/csv/tiles/usa.csv \
  --query "Versions[*].{Key:Key,VersionId:VersionId,LastModified:LastModified}"
```
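
If the listing is long, a small script can help narrow down the candidates. The sketch below assumes the listing was saved as JSON in the shape shown above, and that you know a cutoff time before which the data was still good. The function name `latest_version_before` and the `example-*` version IDs are hypothetical. The newest version before the cutoff is only a candidate, so you still need to confirm that it is the known good one.

```python
import json
from datetime import datetime, timezone


def latest_version_before(listing_json: str, cutoff: datetime) -> dict:
    """Return the newest listed version last modified at or before `cutoff`."""
    versions = json.loads(listing_json)

    def modified(version: dict) -> datetime:
        return datetime.fromisoformat(version["LastModified"])

    candidates = [v for v in versions if modified(v) <= cutoff]
    if not candidates:
        raise ValueError("no version at or before the cutoff")
    return max(candidates, key=modified)


# Sample listing; the version IDs and two of the timestamps are made up.
SAMPLE = """
[
  {"Key": "data-sources/census.zip", "VersionId": "example-bad",
   "LastModified": "2021-12-15T10:00:00+00:00"},
  {"Key": "data-sources/census.zip", "VersionId": "example-good",
   "LastModified": "2021-11-29T18:57:40+00:00"},
  {"Key": "data-sources/census.zip", "VersionId": "example-older",
   "LastModified": "2021-10-01T12:00:00+00:00"}
]
"""

cutoff = datetime(2021, 12, 1, tzinfo=timezone.utc)  # must be timezone-aware
chosen = latest_version_before(SAMPLE, cutoff)
print(chosen["VersionId"])  # example-good
```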

### 2. Download Previous Version

Based on the output from the commands above, select the `<version_id>` of the known good version of each file. Then run the following command to download the census file:

```shell
aws s3api get-object \
  --bucket justice40-data \
  --key data-sources/census.zip \
  --version-id <version_id> \
  census.zip
```

Do the same for the score file:

```shell
aws s3api get-object \
  --bucket justice40-data \
  --key data-pipeline/data/score/csv/tiles/usa.csv \
  --version-id <version_id> \
  usa.csv
```

> Note: Unlike `curl`, this command reports no download progress while it runs. To verify that the download is progressing, open another terminal and check that the output file is growing.
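
Once both downloads finish, it is worth checking the files before uploading anything. The sketch below is one possible sanity check that uses only the Python standard library: it tests the zip archive for corruption and counts the rows in the CSV. The helper names `check_zip` and `check_csv` are hypothetical, and the checks only catch truncated or corrupt files. They cannot tell you whether the data itself is the version you want.

```python
import csv
import zipfile
from pathlib import Path
from typing import List, Tuple


def check_zip(path: Path) -> int:
    """Return the number of members in the archive, raising if it is corrupt."""
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise ValueError(f"corrupt member in {path}: {bad_member}")
        return len(archive.namelist())


def check_csv(path: Path) -> Tuple[List[str], int]:
    """Return the header row and the number of data rows, raising if empty."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For the files in this guide, that would mean calling `check_zip(Path("census.zip"))` and `check_csv(Path("usa.csv"))` from the directory you downloaded them into, then comparing the results against what you expect from a good run.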

### 3. Upload New Version

After you've verified that the local files are correct, make them the live
versions by running a normal S3 copy. Because the bucket is versioned, this adds
a new current version on top of the bad one instead of deleting it:

```shell
aws s3 cp census.zip s3://justice40-data/data-sources/census.zip
aws s3 cp usa.csv s3://justice40-data/data-pipeline/data/score/csv/tiles/usa.csv
```

### 4. Rerun Combine and Tileify

Run the [Combine and Tileify GitHub Action](https://github.com/usds/justice40-tool/actions/workflows/combine-tilefy.yml) to regenerate the map tiles from the data you just rolled back.

## Rollback Details

> Note: The links to the relevant code are included alongside the relevant snippets as a defense against this page becoming outdated. Make sure the code matches what the links point to, and verify that things still work as this guide assumes.

The relevant step that consumes the files from s3 should be in the [Combine and Tileify Job](https://github.com/usds/justice40-tool/blob/main/.github/workflows/combine-tilefy.yml#L56-L59):

```yaml
- name: Run Scripts
  run: |
    poetry run python3 data_pipeline/application.py geo-score -s aws
    poetry run python3 data_pipeline/application.py generate-map-tiles
```

This runs the [`score_geo` task](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/application.py#L166):

```python
score_geo(data_source=data_source)
```

This uses the [`GeoScoreETL` object to run the etl](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/runner.py#L121-L124):

```python
score_geo = GeoScoreETL(data_source=data_source)
score_geo.extract()
score_geo.transform()
score_geo.load()
```

In the extract step, `GeoScoreETL` calls [`check_census_data_source` and `check_score_data_source`](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/etl_score_geo.py#L57-L67):

```python
# check census data
check_census_data_source(
    census_data_path=self.DATA_PATH / "census",
    census_data_source=self.DATA_SOURCE,
)

# check score data
check_score_data_source(
    score_csv_data_path=self.SCORE_CSV_PATH,
    score_data_source=self.DATA_SOURCE,
)
```

The `check_score_data_source` function downloads [one file from s3](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/score/etl_utils.py#L32-L43):

```python
TILE_SCORE_CSV_S3_URL = (
    settings.AWS_JUSTICE40_DATAPIPELINE_URL
    + "/data/score/csv/tiles/usa.csv"
)
TILE_SCORE_CSV = score_csv_data_path / "tiles" / "usa.csv"

# download from s3 if census_data_source is aws
if score_data_source == "aws":
    logger.info("Fetching Score Tile data from AWS S3")
    download_file_from_url(
        file_url=TILE_SCORE_CSV_S3_URL, download_file_name=TILE_SCORE_CSV
    )
```

You can confirm that the file exists at that location:

```
% aws s3 ls s3://justice40-data/data-pipeline/data/score/csv/tiles/usa.csv
2021-12-13 15:23:49 27845542 usa.csv
% curl --head https://justice40-data.s3.amazonaws.com/data-pipeline/data/score/csv/tiles/usa.csv
HTTP/1.1 200 OK
...
```

The `check_census_data_source` function downloads [one file from s3](https://github.com/usds/justice40-tool/blob/main/data/data-pipeline/data_pipeline/etl/sources/census/etl_utils.py#L99-L109):

```python
CENSUS_DATA_S3_URL = settings.AWS_JUSTICE40_DATASOURCES_URL + "/census.zip"
DATA_PATH = settings.APP_ROOT / "data"

# download from s3 if census_data_source is aws
if census_data_source == "aws":
    logger.info("Fetching Census data from AWS S3")
    unzip_file_from_url(
        CENSUS_DATA_S3_URL,
        DATA_PATH / "tmp",
        DATA_PATH,
    )
```

You can confirm that the file exists at that location:

```
% aws s3 ls s3://justice40-data/data-sources/census.zip
2021-11-29 13:57:40 845390373 census.zip
% curl --head https://justice40-data.s3.amazonaws.com/data-sources/census.zip
HTTP/1.1 200 OK
...
```
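
To tie these URLs back to the `--bucket` and `--key` arguments used in the rollback steps, the sketch below splits each download URL into its bucket and key. The two base URLs are assumptions inferred from the `curl` commands above, not values read from the project's settings, and `bucket_and_key` is a hypothetical helper that only handles URLs of the form `https://<bucket>.s3.amazonaws.com/<key>`.

```python
from typing import Tuple
from urllib.parse import urlparse

S3_HOST_SUFFIX = ".s3.amazonaws.com"


def bucket_and_key(url: str) -> Tuple[str, str]:
    """Split an https://<bucket>.s3.amazonaws.com/<key> URL into (bucket, key)."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith(S3_HOST_SUFFIX):
        raise ValueError(f"unexpected host in {url}")
    bucket = parsed.netloc[: -len(S3_HOST_SUFFIX)]
    key = parsed.path.lstrip("/")
    return bucket, key


# Assumed values, inferred from the curl commands in this guide.
DATAPIPELINE_URL = "https://justice40-data.s3.amazonaws.com/data-pipeline"
DATASOURCES_URL = "https://justice40-data.s3.amazonaws.com/data-sources"

score = bucket_and_key(DATAPIPELINE_URL + "/data/score/csv/tiles/usa.csv")
census = bucket_and_key(DATASOURCES_URL + "/census.zip")

print(score)   # ('justice40-data', 'data-pipeline/data/score/csv/tiles/usa.csv')
print(census)  # ('justice40-data', 'data-sources/census.zip')
```

If the settings in the repository ever point somewhere else, the keys in the rollback commands would need to change to match.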

This confirms that the pipeline needs those two files. You can also confirm this in the [Combine and Tileify GitHub Action logs](https://github.com/usds/justice40-tool/actions/workflows/combine-tilefy.yml).

If you feel confident in this and want to do the rollback now, proceed to [Rollback Steps](#rollback-steps).