Mirror of https://github.com/DOI-DO/j40-cejst-2.git (synced 2025-02-22 01:31:25 -08:00)
Yamale schema validation for data set descriptions (#34)

* https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/40

parent 0a2ab57e76
commit f2503e71fb

12 changed files with 818 additions and 57 deletions
.gitignore (vendored, 128 lines)

@@ -1 +1,129 @@

*.env
.idea

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

data-roadmap/README.md

@@ -1,86 +1,106 @@

-# Data roadmap proposal
+# Overview

-This document describes how we setup a "data roadmap", which serves several purposes.
+This document describes our "data roadmap", which serves several purposes.

# Data roadmap goals

The goals of the data roadmap are as follows:

-* Tracking data sets being considered for inclusion in the Climate and Economic Justice Screening Tool (CEJST), either as a data set that is included in the cumulative impacts score or a reference data set that is not included in the score
+- Tracking data sets being considered for inclusion in the Climate and Economic Justice Screening Tool (CEJST), either as a data set that is included in the cumulative impacts score or a reference data set that is not included in the score

-* Prioritizing data sets, so that it's obvious to developers working on the CEJST which data sets to incorporate next into the tool
+- Prioritizing data sets, so that it's obvious to developers working on the CEJST which data sets to incorporate next into the tool

-* Gathering important details about each data set, such as its geographic resolution and the year it was last updated, so that the CEJST team can make informed decisions about what data to prioritize
+- Gathering important details about each data set, such as its geographic resolution and the year it was last updated, so that the CEJST team can make informed decisions about what data to prioritize

-* Tracking the problem areas that each data set relates to (e.g., a certain data set may relate to the problem of pesticide exposure amongst migrant farm workers)
+- Tracking the problem areas that each data set relates to (e.g., a certain data set may relate to the problem of pesticide exposure amongst migrant farm workers)

-* Enabling members of the public to submit ideas for problem areas or data sets to be considered for inclusion in the CEJST, with easy-to-use and accessible tools
+- Enabling members of the public to submit ideas for problem areas or data sets to be considered for inclusion in the CEJST, with easy-to-use and accessible tools

-* Enabling members of the public to submit revisions to the information about each problem area or data set, with easy-to-use and accessible tools
+- Enabling members of the public to submit revisions to the information about each problem area or data set, with easy-to-use and accessible tools

-* Enabling the CEJST development team to review suggestions before incorporating them officially into the data roadmap, to filter out potential noise and spam, or consider how requests may lead to changes in software features and documentation
+- Enabling the CEJST development team to review suggestions before incorporating them officially into the data roadmap, to filter out potential noise and spam, or consider how requests may lead to changes in software features and documentation

# User stories

These goals can map onto several user stories for the data roadmap, such as:

-* As a community member, I want to suggest a new idea for a dataset.
-* As a community member, I want to understand what happened with my suggestion for a new dataset.
-* As a community member, I want to edit the details of a dataset proposal to add more information.
-* As a WHEJAC board member, I want to vote on what data sources should be prioritized next.
-* As a product manager, I want to filter based on characteristics of the data.
-* As a developer, I want to know what to work on next.
+- As a community member, I want to suggest a new idea for a dataset.
+- As a community member, I want to understand what happened with my suggestion for a new dataset.
+- As a community member, I want to edit the details of a dataset proposal to add more information.
+- As a WHEJAC board member, I want to vote on what data sources should be prioritized next.
+- As a product manager, I want to filter based on characteristics of the data.
+- As a developer, I want to know what to work on next.

# Data set descriptions

-There are lots of details that are important to track for each data set.
+There are lots of details that are important to track for each data set. This
+information helps us prepare to integrate a data set into the tool and prioritize
+between different options for data in the data roadmap.

-For instance, here is an incomplete list of information it would be helpful to know about any given data set:
+In order to support a process of peer review on edits and updates, these details are
+tracked in one `YAML` file per data set description in the directory
+[data_roadmap/data_set_descriptions](data_roadmap/data_set_descriptions).

-| Characteristic | Answer type | Why is this important? |
-| --- | --- | --- |
-| Name | Text | |
-| Source | Text / URL | |
-| Sponsor | Text | It can be helpful to name if there's a federal agency or non-governmental agency that is working to provide and maintain this data |
-| Public status | Not Released / Freely Public / Public for certain audiences / Other | It can be helpful to know whether a dataset has already gone through public release process (like ACS data) or may need a lengthy review process (like Medicaid data). |
-| Subjective rating of data quality | Low/Medium/High | Sometimes we don't have statistics on data quality, but we know it is likely to be accurate or not. |
-| Estimated margin of error | Numeric | Estimated margin of error on measurement, if known. Often more narrow geographic measures have a higher margin of error due to a smaller sample for each measurement. |
-| Known data quality issues | Description | It can be helpful to write out known problems. |
-| Geographic coverage: percent | Estimated % of American population / surface area covered by the data | We want to think about data that is comprehensive across America. |
-| Geographic coverage: Description | Description of American population / surface area covered by the data | |
-| Relevance to environmental justice | Text | It's useful to spell out why this data is relevant. |
-| Importance to identifying EJ communities | Text | It's useful to spell out if this is an important indicator. |
-| How this data could be used outside of Justice40 | Text | Are there any other benefits to making this data available outside the tool? |
-| Ready to use? | Yes/No/Other | In our subjective assessment, do we think a current draft of the data would be ready for publication in the tool as a data source still undergoing review and assessment (e.g., as a "beta" data source)? Data does not require additional collection or cleaning, although it may require some minor transformations. |
-| Estimated timeline for getting to first draft | # of weeks of a full-time, individual data scientist | In our subjective assessment, how long do we think it would take one Justice40 data scientist working full-time to clean and publish this data as a first draft? |
-| Data formats | Text / Array | Developers need to know what formats the data is available in |
-| Spatial resolution | Text | Dev team needs to know if the resolution is granular enough to be useful |
-| Level of Confidence | Text | How much has it been vetted by an agency; is this the de facto data set for the topic? |
-| Last updated at | Date | When was the data last updated / refreshed? We need a way to capture recency if not able to get a date |
-| Frequency of updates | Text | How often is this data updated? Is it updated on a reliable cadence? |
-| Peer review | Text | Overview or links out to peer review done on this dataset |
-| Where/how is data available | Text | Is it available on Geoplatform.gov? Is it available from multiple sources? Do we need to SFTP into a server to get it? (Maybe this should be two separate columns) |
-| Category / taxonomy keywords | Array | Since the taxonomy for these indicators and datasets isn't an exact science and different audiences may use different ways to refer to the data set categories, let's use a list of keywords instead of a single word |
-| Accreditation | Text | Is this source accredited? |
-| Can data go in cloud? | Yes/No | Some datasets can not legally go in the cloud |
-| Documentation | Text / URL | Link to docs. Also, is the documentation good enough? Can we get the info we need? |
-| Legal weasels | Text | Legal disclaimers, assumption of risk, proprietary? |
-| Risk assessment of the data | Text or High/Medium/Low | E.g. a vendor-processed version of the dataset might not be open or good enough |
+Each data set description includes a number of fields, some of which are required.
+The schema defining these fields is written in [Yamale](https://github.com/23andMe/Yamale)
+and lives at [data_roadmap/data_set_description_schema.yaml](data_roadmap/data_set_description_schema.yaml).

-Other ideas for consideration:
+Because `Yamale` does not provide a method for describing fields, we've created an
+additional file that includes written descriptions of the meaning of each field in
+the schema. These live in [data_roadmap/data_set_description_field_descriptions.yaml](data_roadmap/data_set_description_field_descriptions.yaml).

-* Transparency
-* Reproducibility
-* A way to capture the owner vs host of the data
-* How likely the data owner/host is to be able to update the data at the proposed frequency
+In order to provide a helpful starting point for people who are ready to contribute
+ideas for a new data set for consideration, there is an auto-generated data set
+description template that lives at [data_roadmap/data_set_description_template.yaml](data_roadmap/data_set_description_template.yaml).
+
+# Steps to add a new data set description: the "easy" way
+
+Soon we will create a Google Form that contributors can use to submit ideas for new
+data sets. The Google Form will match the schema of the data set descriptions. Please
+see [this ticket](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/39)
+for tracking this work.
+
+# Steps to add a new data set description: the git-savvy way
+
+For those who are comfortable using `git` and `Markdown`, these are the steps to
+contribute a new data set description to the data roadmap:
+
+1. Research and learn about the data set you're proposing for consideration.
+
+2. Clone the repository and learn about the [contribution guidelines for this
+project](../docs/CONTRIBUTING.md).
+
+3. In your local version of the repository, copy the template from
+`data_roadmap/data_set_description_template.yaml` into a new file that lives in
+`data_roadmap/data_set_descriptions` and has the name of the data set as the name of the file.
+
+4. Edit this file to ensure it has all of the appropriate details about the data set.
+
+5. If you'd like, you can run the validations in `run_validations_and_write_template`
+to ensure your contribution is valid according to the schema. These checks will also
+run automatically on each commit.
+
+6. Create a pull request with your new data set description and submit it for peer
+review.
+
+Thank you for contributing!

# Tooling proposal and milestones

-There is no single tool that supports all the goals and user stories above. Therefore we've proposed combining a number of tools in a way that can support them all.
+There is no single tool that supports all the goals and user stories described above.
+Therefore we've proposed combining a number of tools in a way that can support them all.

We've also proposed various "milestones" that will allow us to iteratively and sequentially build the data roadmap in a way that supports the entire vision but starts with small and achievable steps. These milestones are proposed in order.

-## Milestone: YAML files for data sets and linter
-
-This work is most accurately tracked in [this epic](https://app.zenhub.com/workspaces/justice40-60993f6e05473d0010ec44e3/issues/usds/justice40-tool/38).
-We've also verbally described them below.
-
-To start, we'll create a folder in the appropriate repository [is this `justice40-tool`?] that can house YAML files, one per data set. Each file will describe the characteristics of the data.
+## Milestone: YAML files for data sets and linter (Done)
+
+To start, we'll create a folder in this repository that can
+house YAML files, one per data set. Each file will describe the characteristics of the data.

The benefit of using a YAML file for this is that it's easy to subject changes to these files to peer review through the pull request process. This allows external collaborators from the open source community to submit suggested changes, which can be reviewed by the core CEJST team.

@@ -108,7 +128,7 @@ This will make it easier to filter the data to answer questions like, "which dat

## Milestone: Tickets created for incorporating data sets

-For each data set that is being considered for inclusion soon in the tool, the project management team will create a ticket for "Incorporating ___ data set into the database", with a link to the data set detail document. This ticket will be created in the ticket tracking system used by the open source repository, which is ZenHub. This project management system will be public.
+For each data set that is being considered for inclusion soon in the tool, the project management team will create a ticket for "Incorporating \_\_\_ data set into the database", with a link to the data set detail document. This ticket will be created in the ticket tracking system used by the open source repository, which is ZenHub. This project management system will be public.

At the initial launch, we are not planning for members of the open source community to be able to create tickets, but we would like to consider a process for members of the open source community creating tickets that can go through review by the CEJST team.

@@ -122,7 +142,7 @@ We can change the linter to validate that every data set description maps to one

The benefit of this is that some non-data-focused members of the public or the WHEJAC advisory body may want to suggest we prioritize certain problem areas, with or without ideas for specific data sets that may best address that problem area.

It is not clear at this time what the best path forward is for implementing these problem area descriptions. One option is to create a folder for descriptions of problem areas, which contains YAML files that get validated according to a schema. Another option would be simply to add these as an array in the description of data sets, or to add labels to the tickets once data sets are tracked in GitHub tickets.

## Milestone: Add prioritization voting for WHEJAC and members of the public

data-roadmap/data_roadmap/__init__.py (new file, 0 lines)

data-roadmap/data_roadmap/data_set_description_field_descriptions.yaml (new file, 39 lines)

@@ -0,0 +1,39 @@

# There is no method for adding field descriptions to `yamale` schemas.
# Therefore, we've created a dictionary here of fields and their descriptions.
name: A short name of the data set.
source: The URL pointing towards the data set itself or more information about the
  data set.
relevance_to_environmental_justice: It's useful to spell out why this data is
  relevant to EJ issues and/or can be used to identify EJ communities.
spatial_resolution: Dev team needs to know if the resolution is granular enough to be useful
public_status: Whether a dataset has already gone through public release process
  (like Census data) or may need a lengthy review process (like Medicaid data).
sponsor: Whether there's a federal agency or non-governmental agency that is working
  to provide and maintain this data.
subjective_rating_of_data_quality: Sometimes we don't have statistics on data
  quality, but we know it is likely to be accurate or not. How much has it been
  vetted by an agency; is this the de facto data set for the topic?
estimated_margin_of_error: Estimated margin of error on measurement, if known. Often
  more narrow geographic measures have a higher margin of error due to a smaller sample
  for each measurement.
known_data_quality_issues: It can be helpful to write out known problems.
geographic_coverage_percent: We want to think about data that is comprehensive across
  America.
geographic_coverage_description: A verbal description of geographic coverage.
data_formats: Developers need to know what formats the data is available in
last_updated_date: When was the data last updated / refreshed? (In format YYYY-MM-DD.
  If exact date is not known, use YYYY-01-01.)
frequency_of_updates: How often is this data updated? Is it updated on a reliable
  cadence?
documentation: Link to docs. Also, is the documentation good enough? Can we get the
  info we need?
data_can_go_in_cloud: Some datasets can not legally go in the cloud

discussion: Review of other topics, such as
  peer review (Overview or links out to peer review done on this dataset),
  where and how data is available (e.g., Geoplatform.gov? Is it available from multiple
  sources?),
  risk assessment of the data (e.g. a vendor-processed version of the dataset might not
  be open or good enough),
  legal considerations (Legal disclaimers, assumption of risk, proprietary?),
  accreditation (Is this source accredited?)
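
The utilities later in this commit enforce a one-to-one match between these field
descriptions and the schema fields. Below is a condensed sketch of that check, using
the same `schema.dict` attribute the utilities rely on; the file paths are
illustrative and mirror this commit's layout:

import yaml
import yamale

schema = yamale.make_schema(path="data_roadmap/data_set_description_schema.yaml")
with open("data_roadmap/data_set_description_field_descriptions.yaml") as stream:
    field_descriptions = yaml.safe_load(stream)

# Every schema field needs a description, and every description a schema field;
# the real utilities raise ValueError with a pointer to the offending field.
assert set(schema.dict.keys()) == set(field_descriptions.keys())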

data-roadmap/data_roadmap/data_set_description_schema.yaml (new file, 24 lines)

@@ -0,0 +1,24 @@

# `yamale` schema for descriptions of data sets.
name: str(required=True)
source: str(required=True)
relevance_to_environmental_justice: str(required=False)
data_formats: enum('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ',
  'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS', required=True)
spatial_resolution: enum('State/territory', 'County', 'Zip code', 'Census tract',
  'Census block group', 'Exact address or lat/long', 'Other', required=True)
public_status: enum('Not Released', 'Public', 'Public for certain audiences', 'Other',
  required=True)
sponsor: str(required=True)
subjective_rating_of_data_quality: enum('Low Quality', 'Medium Quality',
  'High Quality', required=False)
estimated_margin_of_error: num(required=False)
known_data_quality_issues: str(required=False)
geographic_coverage_percent: num(required=False)
geographic_coverage_description: str(required=False)
last_updated_date: day(min='2001-01-01', max='2100-01-01', required=True)
frequency_of_updates: enum('Less than annually', 'Approximately annually',
  'Once every 1-6 months',
  'Daily or more frequently than daily', 'Unknown', required=True)
documentation: str(required=False)
data_can_go_in_cloud: bool(required=False)
discussion: str(required=False)
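
For readers new to `Yamale`, here is a minimal sketch of how one description file is
checked against this schema, using the same three calls the utilities further down
this page rely on; the paths are illustrative and mirror this commit's layout:

import yamale

schema = yamale.make_schema(path="data_roadmap/data_set_description_schema.yaml")
data = yamale.make_data(path="data_roadmap/data_set_descriptions/PM25.yaml")

# Raises yamale.YamaleError when a required field is missing or a value falls
# outside one of the enums declared in the schema.
yamale.validate(schema=schema, data=data)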

data-roadmap/data_roadmap/data_set_description_template.yaml (new file, 94 lines)

@@ -0,0 +1,94 @@

# Note: This template is automatically generated by the function
# `write_data_set_description_template_file` from the schema
# and field descriptions files. Do not manually edit this file.

name:
# Description: A short name of the data set.
# Required field: True
# Field type: str

source:
# Description: The URL pointing towards the data set itself or more information about the data set.
# Required field: True
# Field type: str

relevance_to_environmental_justice:
# Description: It's useful to spell out why this data is relevant to EJ issues and/or can be used to identify EJ communities.
# Required field: False
# Field type: str

data_formats:
# Description: Developers need to know what formats the data is available in
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('GeoJSON', 'Esri Shapefile (SHP, DBF, SHX)', 'GML', 'KML/KMZ', 'GPX', 'CSV/XLSX', 'GDB', 'MBTILES', 'LAS')

spatial_resolution:
# Description: Dev team needs to know if the resolution is granular enough to be useful
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('State/territory', 'County', 'Zip code', 'Census tract', 'Census block group', 'Exact address or lat/long', 'Other')

public_status:
# Description: Whether a dataset has already gone through public release process (like Census data) or may need a lengthy review process (like Medicaid data).
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('Not Released', 'Public', 'Public for certain audiences', 'Other')

sponsor:
# Description: Whether there's a federal agency or non-governmental agency that is working to provide and maintain this data.
# Required field: True
# Field type: str

subjective_rating_of_data_quality:
# Description: Sometimes we don't have statistics on data quality, but we know it is likely to be accurate or not. How much has it been vetted by an agency; is this the de facto data set for the topic?
# Required field: False
# Field type: enum
# Valid choices are one of the following: ('Low Quality', 'Medium Quality', 'High Quality')

estimated_margin_of_error:
# Description: Estimated margin of error on measurement, if known. Often more narrow geographic measures have a higher margin of error due to a smaller sample for each measurement.
# Required field: False
# Field type: num

known_data_quality_issues:
# Description: It can be helpful to write out known problems.
# Required field: False
# Field type: str

geographic_coverage_percent:
# Description: We want to think about data that is comprehensive across America.
# Required field: False
# Field type: num

geographic_coverage_description:
# Description: A verbal description of geographic coverage.
# Required field: False
# Field type: str

last_updated_date:
# Description: When was the data last updated / refreshed? (In format YYYY-MM-DD. If exact date is not known, use YYYY-01-01.)
# Required field: True
# Field type: day

frequency_of_updates:
# Description: How often is this data updated? Is it updated on a reliable cadence?
# Required field: True
# Field type: enum
# Valid choices are one of the following: ('Less than annually', 'Approximately annually', 'Once every 1-6 months', 'Daily or more frequently than daily', 'Unknown')

documentation:
# Description: Link to docs. Also, is the documentation good enough? Can we get the info we need?
# Required field: False
# Field type: str

data_can_go_in_cloud:
# Description: Some datasets can not legally go in the cloud
# Required field: False
# Field type: bool

discussion:
# Description: Review of other topics, such as peer review (Overview or links out to peer review done on this dataset), where and how data is available (e.g., Geoplatform.gov? Is it available from multiple sources?), risk assessment of the data (e.g. a vendor-processed version of the dataset might not be open or good enough), legal considerations (Legal disclaimers, assumption of risk, proprietary?), accreditation (Is this source accredited?)
# Required field: False
# Field type: str

data-roadmap/data_roadmap/data_set_descriptions/PM25.yaml (new file, 35 lines)

@@ -0,0 +1,35 @@

name: Particulate Matter 2.5

source: https://gaftp.epa.gov/EJSCREEN/

relevance_to_environmental_justice: Particulate matter has a lot of adverse impacts
  on health.

data_formats: CSV/XLSX

spatial_resolution: Census block group

public_status: Public

sponsor: EPA

subjective_rating_of_data_quality: Medium Quality

estimated_margin_of_error:

known_data_quality_issues: Many PM 2.5 stations are known to be pretty far apart, so
  averaging them can lead to data quality loss.

geographic_coverage_percent:

geographic_coverage_description:

last_updated_date: 2017-01-01

frequency_of_updates: Less than annually

documentation: https://www.epa.gov/sites/production/files/2015-05/documents/ejscreen_technical_document_20150505.pdf#page=13

data_can_go_in_cloud: True

discussion:

data-roadmap/data_roadmap/utils/utils_data_set_description_schema.py (new file, 151 lines)

@@ -0,0 +1,151 @@

import importlib_resources
import pathlib
import yamale
import yaml

# Set directories.
DATA_ROADMAP_DIRECTORY = importlib_resources.files("data_roadmap")
UTILS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "utils"
DATA_SET_DESCRIPTIONS_DIRECTORY = DATA_ROADMAP_DIRECTORY / "data_set_descriptions"

# Set file paths.
DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_schema.yaml"
)
DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_field_descriptions.yaml"
)
DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH = (
    DATA_ROADMAP_DIRECTORY / "data_set_description_template.yaml"
)


def load_data_set_description_schema(
    file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_SCHEMA_FILE_PATH,
) -> yamale.schema.schema.Schema:
    """Load from file the data set description schema."""
    schema = yamale.make_schema(path=file_path)

    return schema


def load_data_set_description_field_descriptions(
    file_path: pathlib.PosixPath = DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH,
) -> dict:
    """Load from file the descriptions of fields in the data set description."""
    # Load field descriptions.
    with open(file_path, "r") as stream:
        data_set_description_field_descriptions = yaml.safe_load(stream=stream)

    return data_set_description_field_descriptions


def validate_descriptions_for_schema(
    schema: yamale.schema.schema.Schema,
    field_descriptions: dict,
) -> None:
    """Validate descriptions for schema.

    Checks that every field in the `yamale` schema also has a field
    description in the `field_descriptions` dict, and vice versa.
    """
    for field_name in schema.dict.keys():
        if field_name not in field_descriptions:
            raise ValueError(
                f"Field `{field_name}` does not have a "
                f"description. Please add one to file `{DATA_SET_DESCRIPTION_FIELD_DESCRIPTIONS_FILE_PATH}`"
            )

    for field_name in field_descriptions.keys():
        if field_name not in schema.dict.keys():
            raise ValueError(
                f"Field `{field_name}` has a description but is not in the schema."
            )


def validate_all_data_set_descriptions(
    data_set_description_schema: yamale.schema.schema.Schema,
) -> None:
    """Validate data set descriptions.

    Validate each file in the `data_set_descriptions` directory
    against the provided schema.
    """
    data_set_description_file_paths_generator = DATA_SET_DESCRIPTIONS_DIRECTORY.glob(
        "*.yaml"
    )

    # Validate each file.
    for file_path in data_set_description_file_paths_generator:
        print(f"Validating {file_path}...")

        # Create a yamale Data object.
        data_set_description = yamale.make_data(file_path)

        # TODO: explore collecting all errors and raising them at once. - Lucas
        yamale.validate(schema=data_set_description_schema, data=data_set_description)


def write_data_set_description_template_file(
    data_set_description_schema: yamale.schema.schema.Schema,
    data_set_description_field_descriptions: dict,
    template_file_path: str = DATA_SET_DESCRIPTION_TEMPLATE_FILE_PATH,
) -> None:
    """Write an example data set description with helpful comments."""
    template_file_lines = []

    # Write comments at the top of the template.
    template_file_lines.append(
        "# Note: This template is automatically generated by the function\n"
        "# `write_data_set_description_template_file` from the schema\n"
        "# and field descriptions files. Do not manually edit this file.\n\n"
    )

    schema_dict = data_set_description_schema.dict
    for field_name, field_schema in schema_dict.items():
        template_file_lines.append(f"{field_name}: \n")
        template_file_lines.append(
            f"# Description: {data_set_description_field_descriptions[field_name]}\n"
        )
        template_file_lines.append(f"# Required field: {field_schema.is_required}\n")
        template_file_lines.append(f"# Field type: {field_schema.get_name()}\n")
        if type(field_schema) is yamale.validators.validators.Enum:
            template_file_lines.append(
                f"# Valid choices are one of the following: {field_schema.enums}\n"
            )

        # Add an empty linebreak to separate fields.
        template_file_lines.append("\n")

    with open(template_file_path, "w") as file:
        file.writelines(template_file_lines)


def run_validations_and_write_template() -> None:
    """Run validations of schema and descriptions, and write a template file."""
    # Load the schema and a separate dictionary of field descriptions.
    data_set_description_schema = load_data_set_description_schema()
    data_set_description_field_descriptions = (
        load_data_set_description_field_descriptions()
    )

    # Validate that the schema and the field descriptions cover the same fields.
    validate_descriptions_for_schema(
        schema=data_set_description_schema,
        field_descriptions=data_set_description_field_descriptions,
    )

    # Validate all data set descriptions in the directory against the schema.
    validate_all_data_set_descriptions(
        data_set_description_schema=data_set_description_schema
    )

    # Write an example template for data set descriptions.
    write_data_set_description_template_file(
        data_set_description_schema=data_set_description_schema,
        data_set_description_field_descriptions=data_set_description_field_descriptions,
    )


if __name__ == "__main__":
    run_validations_and_write_template()
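
The module doubles as a script via the `__main__` guard above; once the package is
installed (for example with the setup.py at the end of this commit), the same entry
point can also be called from Python:

from data_roadmap.utils.utils_data_set_description_schema import (
    run_validations_and_write_template,
)

# Cross-checks the schema against the field descriptions, validates every YAML
# file in `data_set_descriptions/`, and regenerates the auto-written template.
run_validations_and_write_template()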

@@ -0,0 +1,248 @@

import unittest
from unittest import mock

import yamale
from data_roadmap.utils.utils_data_set_description_schema import (
    load_data_set_description_schema,
    load_data_set_description_field_descriptions,
    validate_descriptions_for_schema,
    validate_all_data_set_descriptions,
    write_data_set_description_template_file,
)


class UtilsDataSetDescriptionSchema(unittest.TestCase):
    @mock.patch("yamale.make_schema")
    def test_load_data_set_description_schema(self, make_schema_mock):
        load_data_set_description_schema(file_path="mock.yaml")

        make_schema_mock.assert_called_once_with(path="mock.yaml")

    @mock.patch("yaml.safe_load")
    def test_load_data_set_description_field_descriptions(self, yaml_safe_load_mock):
        # Note: this isn't a great test, we could mock the actual YAML to
        # make it better. - Lucas
        mock_dict = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        yaml_safe_load_mock.return_value = mock_dict

        field_descriptions = load_data_set_description_field_descriptions()

        yaml_safe_load_mock.assert_called_once()

        self.assertDictEqual(field_descriptions, mock_dict)

    def test_validate_descriptions_for_schema(self):
        # Test when all descriptions are present.
        field_descriptions = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        schema = yamale.make_schema(
            content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
        )

        # Should pass.
        validate_descriptions_for_schema(
            schema=schema, field_descriptions=field_descriptions
        )

        field_descriptions_missing_one = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
        }

        # Should fail because of the missing field description.
        with self.assertRaises(ValueError) as context_manager:
            validate_descriptions_for_schema(
                schema=schema, field_descriptions=field_descriptions_missing_one
            )

        # Using `assertIn` because the file path is returned in the error
        # message, and it varies based on environment.
        self.assertIn(
            "Field `field` does not have a description. Please add one to file",
            str(context_manager.exception),
        )

        field_descriptions_extra_one = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
            "extra": "Extra description.",
        }

        # Should fail because of the extra field description.
        with self.assertRaises(ValueError) as context_manager:
            validate_descriptions_for_schema(
                schema=schema, field_descriptions=field_descriptions_extra_one
            )

        # The full message is deterministic here, so check exact equality.
        self.assertEqual(
            "Field `extra` has a description but is not in the schema.",
            str(context_manager.exception),
        )

    def test_validate_all_data_set_descriptions(self):
        # Setup a few examples of `yamale` data *before* we mock the `make_data`
        # function.
        valid_data = yamale.make_data(
            content="""
name: Bill
age: 26
height: 6.2
awesome: True
field: option 1
"""
        )

        invalid_data_1 = yamale.make_data(
            content="""
name: Bill
age: asdf
height: 6.2
awesome: asdf
field: option 1
"""
        )

        invalid_data_2 = yamale.make_data(
            content="""
age: 26
height: 6.2
awesome: True
field: option 1
"""
        )

        # Mock `make_data`.
        with mock.patch.object(
            yamale, "make_data", return_value=None
        ) as yamale_make_data_mock:
            schema = yamale.make_schema(
                content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
            )

            # Make the `make_data` method return valid data.
            yamale_make_data_mock.return_value = valid_data

            # Should pass.
            validate_all_data_set_descriptions(data_set_description_schema=schema)

            # Make some of the data invalid.
            yamale_make_data_mock.return_value = invalid_data_1

            # Should fail because of the invalid field values.
            with self.assertRaises(yamale.YamaleError) as context_manager:
                validate_all_data_set_descriptions(data_set_description_schema=schema)

            self.assertEqual(
                str(context_manager.exception),
                """Error validating data
age: 'asdf' is not a int.
awesome: 'asdf' is not a bool.""",
            )

            # Make some of the data missing.
            yamale_make_data_mock.return_value = invalid_data_2

            # Should fail because of the missing fields.
            with self.assertRaises(yamale.YamaleError) as context_manager:
                validate_all_data_set_descriptions(data_set_description_schema=schema)

            self.assertEqual(
                str(context_manager.exception),
                """Error validating data
name: Required field missing""",
            )

    @mock.patch("builtins.open", new_callable=mock.mock_open)
    def test_write_data_set_description_template_file(self, builtins_writelines_mock):
        schema = yamale.make_schema(
            content="""
name: str()
age: int(max=200)
height: num()
awesome: bool()
field: enum('option 1', 'option 2')
"""
        )

        data_set_description_field_descriptions = {
            "name": "The name of the thing.",
            "age": "The age of the thing.",
            "height": "The height of the thing.",
            "awesome": "The awesome of the thing.",
            "field": "The field of the thing.",
        }

        write_data_set_description_template_file(
            data_set_description_schema=schema,
            data_set_description_field_descriptions=data_set_description_field_descriptions,
            template_file_path="mock_template.yaml",
        )

        call_to_writelines = builtins_writelines_mock.mock_calls[2][1][0]

        self.assertListEqual(
            call_to_writelines,
            [
                "# Note: This template is automatically generated by the function\n"
                "# `write_data_set_description_template_file` from the schema\n"
                "# and field descriptions files. Do not manually edit this file.\n\n",
                "name: \n",
                "# Description: The name of the thing.\n",
                "# Required field: True\n",
                "# Field type: str\n",
                "\n",
                "age: \n",
                "# Description: The age of the thing.\n",
                "# Required field: True\n",
                "# Field type: int\n",
                "\n",
                "height: \n",
                "# Description: The height of the thing.\n",
                "# Required field: True\n",
                "# Field type: num\n",
                "\n",
                "awesome: \n",
                "# Description: The awesome of the thing.\n",
                "# Required field: True\n",
                "# Field type: bool\n",
                "\n",
                "field: \n",
                "# Description: The field of the thing.\n",
                "# Required field: True\n",
                "# Field type: enum\n",
                "# Valid choices are one of the following: ('option 1', 'option 2')\n",
                "\n",
            ],
        )
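
As a usage note, these tests can be run with `python -m unittest` from the
`data-roadmap` directory, or loaded programmatically as sketched below; the import
path is hypothetical, since the test file's own location is not shown in this view:

import unittest

# Hypothetical module path for the test file; adjust to wherever it actually lives.
from data_roadmap.utils.test_utils_data_set_description_schema import (
    UtilsDataSetDescriptionSchema,
)

suite = unittest.TestLoader().loadTestsFromTestCase(UtilsDataSetDescriptionSchema)
unittest.TextTestRunner().run(suite)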

data-roadmap/requirements.txt (new file, 1 line)

@@ -0,0 +1 @@

yamale==3.0.6

data-roadmap/setup.py (new file, 21 lines)

@@ -0,0 +1,21 @@

"""Setup script for `data_roadmap` package."""
import os

from setuptools import find_packages
from setuptools import setup

# TODO: replace this with `poetry`. https://github.com/usds/justice40-tool/issues/57
_PACKAGE_DIRECTORY = os.path.abspath(os.path.dirname(__file__))

with open(os.path.join(_PACKAGE_DIRECTORY, "requirements.txt")) as f:
    requirements = f.readlines()

setup(
    name="data_roadmap",
    description="Data roadmap package",
    author="CEJST Development Team",
    author_email="justice40open@usds.gov",
    install_requires=requirements,
    include_package_data=True,
    packages=find_packages(),
)