Generate Data Documentation

1. Set up environment

!pip install -U athena-intelligence

import json
import os
import pandas as pd
from IPython.display import Markdown, display

# Read the API key from the environment rather than hard-coding it
ATHENA_API_KEY = os.environ["ATHENA_API_KEY"]

from athena import Model, Tools
from athena.client import Athena

athena = Athena(
    api_key=ATHENA_API_KEY,
)
2. Get datasets

Call the dataset.get method to retrieve datasets. The optional pagination parameters (page and page_size) let you run bulk workflows across many datasets.

datasets = athena.dataset.get(page=1, page_size=5)
datasets

Athena returns a JSON object containing a list of datasets, each with the following fields: dataset id, name, database id, and schema details (dialect, CREATE statement, and first 3 rows). The response also includes pagination info.

To access the raw JSON, use .json():

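# Parse the raw JSON and load the dataset records into a DataFrame
data = json.loads(datasets.json())
datasets_list = data['datasets']
# Show full column contents so long schema details are not truncated
pd.set_option('display.max_colwidth', None)
df_datasets = pd.DataFrame(datasets_list)
df_datasets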
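
The pagination parameters mentioned above can drive a simple loop over every page. The following is a minimal sketch, not part of the Athena client: iter_all_datasets and fake_fetch_page are hypothetical names, and the stopping rule (a page shorter than page_size is the last one) is an assumption about how the API paginates. A stand-in fetch function is used so the sketch runs without an API key.

```python
from typing import Callable, Iterator, List


def iter_all_datasets(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 5,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield dataset records page by page until a short or empty page appears."""
    for page in range(1, max_pages + 1):
        records = fetch_page(page, page_size)
        yield from records
        # Assumption: a page with fewer than page_size records is the last one.
        if len(records) < page_size:
            break


# Hypothetical stand-in for the real call so the sketch runs offline.
_fake_catalog = [{"id": i, "name": f"table_{i}"} for i in range(12)]


def fake_fetch_page(page: int, page_size: int) -> List[dict]:
    start = (page - 1) * page_size
    return _fake_catalog[start:start + page_size]


all_datasets = list(iter_all_datasets(fake_fetch_page, page_size=5))
print(len(all_datasets))
```

With the client from this guide, fetch_page could plausibly wrap json.loads(athena.dataset.get(page=p, page_size=n).json())['datasets']. The max_pages cap is there only to guard against an endless loop if the stopping assumption does not hold.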
3. Document individual datasets with athena.message.submit_and_poll

With the datasets loaded, we can proceed with the documentation workflow. We'll start by defining a function that sends one dataset's name and schema details to Athena with a documentation prompt and stores the response in a shared list.

# Collects one entry per documented dataset
documentation_responses = []

def generate_documentation_for_dataset(dataset_name, dataset_schema_details):
    # Submit the documentation prompt for one dataset and wait for the response
    print(f"Generating documentation for dataset: {dataset_name}")
    message = athena.message.submit_and_poll(
        content=
        f"""
**Task:** Generate comprehensive documentation for a dataset.

**Objective:**
Create output template documentation for a table, detailing its schema, fields, and relevant metadata. The documentation should follow the structure provided below and adhere to the specified markdown format and tone. Use metadata and other available information to produce the documentation tailored to the context. This documentation will serve as a guide for understanding the dataset's structure, purpose, and usage within the organization. It should be clear, concise, and informative, catering to both technical and non-technical stakeholders.

**Instructions:**
1. Explore information available on the dataset {dataset_name}:
- dataset metadata:

{dataset_schema_details}


2. For each section of the documentation, provide clear, concise information as outlined in the output template. Use professional language and ensure the documentation is accessible to a broad audience.
3. Include a brief example value or description where requested to illustrate the type of content expected.
4. Only include factual statements. When making assumptions or inferences, clearly label them as such.

**Output Template:**

## Athena Generated Dataset Documentation

### TABLE: `[TABLE NAME]`

**Generated on: [CURRENT DATE]**

#### Dataset Description:
Provide a comprehensive explanation of the table's purpose, detailing what one row represents and the business process or workflow it supports.

#### Field Report:
Document each field in the table, including its name, description, data type, and an example value.

| Field Name | Field Description | Field Type | Example Value |
| ---------- | ----------------- | ---------- | ------------- |
| [FIELD NAME] | [FIELD DESCRIPTION] | [FIELD TYPE] | [EXAMPLE VALUE] |
| ...additional fields as necessary... |

#### Sample Query and First Three Rows:
Include a sample SQL query that returns the first three rows of data, followed by the results of the query.

#### Use Cases & Guidelines:
Describe the organization's use cases and guidelines for using this dataset, highlighting any best practices or restrictions.

#### Other Notes & Considerations:
List any additional notes or considerations relevant to the dataset's use or interpretation.

**End of Template**

Please ensure all information is accurate and up-to-date, reflecting the current state of the dataset as of [CURRENT DATE].
        """,
        model=Model.MIXTRAL_SMALL_8_X_7_B_0211,
        tools=[],
    )
    # Keep only the generated text, keyed by dataset name
    message_json = json.loads(message.json())
    documentation_responses.append({'dataset_name': dataset_name, 'documentation_message': message_json['content']})

Now we can kick off the workflow.

# Iterate over each row in the DataFrame
for index, row in df_datasets.iterrows():
    dataset_name = row['name']
    dataset_schema_details = row['schema_details']

    # Generate documentation for the current dataset
    generate_documentation_for_dataset(dataset_name, dataset_schema_details)
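
In a bulk run, a single failed request would stop the loop above. The following is a minimal sketch of one way to keep the batch going and record which datasets failed. document_all and fake_generate are hypothetical names introduced for illustration, and the sketch assumes the generating function raises an exception on failure, which this guide does not confirm for the Athena client.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def document_all(
    datasets: Iterable[Tuple[str, str]],
    generate: Callable[[str, str], None],
) -> List[Dict[str, str]]:
    """Run generate for each (name, schema) pair and collect failures."""
    failures = []
    for name, schema in datasets:
        try:
            generate(name, schema)
        except Exception as exc:  # broad on purpose: keep the batch going
            failures.append({"dataset_name": name, "error": str(exc)})
    return failures


# Hypothetical stand-in for generate_documentation_for_dataset.
def fake_generate(name: str, schema: str) -> None:
    if not schema:
        raise ValueError("empty schema details")


failures = document_all(
    [("orders", "CREATE TABLE orders (...)"), ("broken", "")],
    fake_generate,
)
print(failures)
```

In the notebook, the pairs could come from zip(df_datasets['name'], df_datasets['schema_details']), with generate_documentation_for_dataset passed as generate.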

Convert the results to Markdown so you can read and copy the generated documentation.

def json_to_markdown_document(json_list):
    markdown_document = ""
    if not json_list:
        return "No data available"

    for item in json_list:
        for key, value in item.items():
            markdown_document += f"**{key}:** {value}\n\n"
        markdown_document += "---\n\n"  # Separator line between items

    return markdown_document

# Convert the list of dictionaries to Markdown
markdown_document = json_to_markdown_document(documentation_responses)

# Display the Markdown in the notebook
display(Markdown(markdown_document))
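
If copying from the notebook output is inconvenient, the generated documentation can also be written to disk. The following is a minimal sketch that writes one Markdown file per dataset. save_documentation is a hypothetical helper, and the file-name sanitizing rule is an arbitrary choice. It expects the same dictionary keys used in documentation_responses above.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List


def save_documentation(responses: List[Dict[str, str]], out_dir: Path) -> List[Path]:
    """Write one Markdown file per dataset and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for item in responses:
        # Replace characters that are awkward in file names.
        safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", item["dataset_name"])
        path = out_dir / f"{safe_name}.md"
        path.write_text(item["documentation_message"], encoding="utf-8")
        written.append(path)
    return written


# Example with made-up content; in the notebook, pass documentation_responses.
sample = [{"dataset_name": "sales/orders", "documentation_message": "## Docs\n"}]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_documentation(sample, Path(tmp) / "docs")
    saved_names = [p.name for p in paths]
    saved_text = paths[0].read_text(encoding="utf-8")
print(saved_names)
```

The example writes to a temporary directory so it leaves nothing behind. For real use, pass a persistent directory such as Path("docs").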
4. Generate documentation and ERD diagrams for multiple datasets

Now that we have documented all individual tables, we can ask Athena to process the generated documentation and produce a higher-level description of the whole body of data, including joins and other notable relationships between tables.

def generate_high_level_documentation(markdown_document):
    # Submit the combined dataset-level documentation and wait for the response
    print("Generating description for provided dataset-level documentation")
    message = athena.message.submit_and_poll(
        content=
        f"""
**Task:** Generate high-level comprehensive documentation for a body of datasets.

**Objective:**
Create high-level output documentation for multiple related tables, detailing their schema, fields, relationships, and relevant metadata. The documentation should follow the structure provided below and adhere to the specified markdown format and tone. Use the provided markdown document and other available information to produce the documentation tailored to the context. This documentation will serve as a guide for understanding the structure, purpose, and usage of the datasets within the organization. It should be clear, concise, and informative, catering to both technical and non-technical stakeholders.

**Instructions:**
1. Explore information available in the provided markdown document:
- Provided markdown document:

{markdown_document}


2. For each section of the documentation, provide clear, concise information as outlined in the output template. Use professional language and ensure the documentation is accessible to a broad audience.
3. Include diagrams such as Entity-Relationship Diagrams (ERD) and other helpful diagrams to explore relationships in the data.
4. Discuss possible analyses and how the datasets can be joined for these analyses.
5. Only include factual statements. When making assumptions or inferences, clearly label them as such.
6. Pay attention to Mermaid diagram dialect and double-check yourself.

**Output Template:**

## Athena Generated High-Level Dataset Documentation

### Overview of Datasets

Provide a brief overview of the datasets included in the markdown document, summarizing their purpose and how they relate to each other.

### Entity-Relationship Diagram (ERD)

Include an ERD that visually represents the relationships between the datasets.

### Possible Analyses

Discuss potential analyses that could be performed using these datasets, highlighting how they can be joined and what insights might be derived.

### Other Helpful Diagrams

Include other diagrams that may help in understanding the relationships between the datasets, such as flowcharts or sequence diagrams.

### Guidelines for Use

Describe the organization's guidelines for using these datasets together, including any best practices or restrictions.

### Other Notes & Considerations

List any additional notes or considerations relevant to the use or interpretation of these datasets as a whole.

**End of Template**

Please ensure all information is accurate and up-to-date, reflecting the current state of the datasets as of [CURRENT DATE].
        """,
        model=Model.MIXTRAL_SMALL_8_X_7_B_0211,
        tools=[],
    )
    # Return only the generated text
    message_json = json.loads(message.json())
    return message_json['content']

high_level_documentation = generate_high_level_documentation(markdown_document)
display(Markdown(high_level_documentation))
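
Since the prompt asks for Mermaid diagrams, it can be handy to pull them out of the response and render or review them separately. The following is a minimal sketch, assuming the model wraps each diagram in a fenced block tagged mermaid. That output format is an assumption, not something the API guarantees. extract_mermaid_blocks is a hypothetical helper, and the sample text is made up.

```python
import re
from typing import List

FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block

MERMAID_BLOCK = re.compile(
    FENCE + r"mermaid[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_mermaid_blocks(markdown_text: str) -> List[str]:
    """Return the body of every mermaid-fenced block in markdown_text."""
    return [m.strip() for m in MERMAID_BLOCK.findall(markdown_text)]


# Made-up response text standing in for high_level_documentation.
sample = (
    "### Entity-Relationship Diagram (ERD)\n"
    + FENCE + "mermaid\n"
    + "erDiagram\n"
    + "    CUSTOMER ||--o{ ORDER : places\n"
    + FENCE + "\n"
    + "Some prose.\n"
)
blocks = extract_mermaid_blocks(sample)
print(blocks[0])
```

In the notebook, calling extract_mermaid_blocks(high_level_documentation) would return the diagram sources, which can then be pasted into a Mermaid renderer to check that they parse.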