Generate Data Documentation
Set up environment
!pip install -U athena-intelligence

import json
import os
import pandas as pd
from IPython.display import Markdown

ATHENA_API_KEY = os.environ["ATHENA_API_KEY"]

from athena import Model, Tools
from athena.client import Athena

athena = Athena(
    api_key=ATHENA_API_KEY,
)
Get datasets
Call the dataset.get method to get datasets. Use the optional pagination parameters to run bulk workflows over datasets.
datasets = athena.dataset.get(page=1, page_size=5)
datasets
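For bulk workflows, the pagination parameters can be driven by a small loop. The following is a minimal sketch, not the library's own helper: `iter_all_pages` and `fake_fetch_page` are hypothetical names, and the stop condition (a page shorter than `page_size` means the end) is an assumption about how the endpoint behaves. With the real client, the fetch function would wrap `athena.dataset.get(page=..., page_size=...)`.

```python
from typing import Callable, Iterator, List


def iter_all_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 5,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield items page by page until a short (or empty) page is returned."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break  # assumption: a short page means nothing is left to fetch


# Stand-in data source so the sketch runs without an API key.
_FAKE_DATASETS = [{"name": f"table_{i}"} for i in range(12)]


def fake_fetch_page(page: int, page_size: int) -> List[dict]:
    # Hypothetical replacement for the real dataset.get call.
    start = (page - 1) * page_size
    return _FAKE_DATASETS[start:start + page_size]


all_datasets = list(iter_all_pages(fake_fetch_page, page_size=5))
print(len(all_datasets))
```

The `max_pages` cap is a guard against looping forever if the stop condition never triggers.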
Athena returns a JSON object containing pagination info and a list of datasets. Each dataset has the following fields: dataset id, name, database id, and schema details (dialect, CREATE statement, and first 3 rows).
To access the raw JSON, use .json():
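# Parse the raw JSON and load the list of datasets into a DataFrame
data = json.loads(datasets.json())
datasets_list = data['datasets']

# Show full column contents so schema details are not truncated
pd.set_option('display.max_colwidth', None)
df_datasets = pd.DataFrame(datasets_list)
df_datasets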
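To make the response shape more concrete, here is an illustrative payload and how it can be read with the standard library alone. Only the keys `datasets`, `name`, and `schema_details` appear in this guide's code; every other key name and all values below are assumptions made for illustration, and the real response may nest or name things differently.

```python
import json

# Hypothetical payload: key names other than "datasets", "name", and
# "schema_details" are illustrative guesses, not the documented schema.
raw = json.dumps({
    "datasets": [
        {
            "id": "ds_1",
            "name": "orders",
            "database_id": "db_1",
            "schema_details": {
                "dialect": "postgresql",
                "create_statement": "CREATE TABLE orders (id INT, total NUMERIC)",
                "first_rows": [[1, 9.99], [2, 24.50], [3, 5.00]],
            },
        }
    ],
    "page": 1,
    "page_size": 5,
})

data = json.loads(raw)

# Pull out one (name, dialect) pair per dataset.
summary = [(d["name"], d["schema_details"]["dialect"]) for d in data["datasets"]]
print(summary)
```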
Document individual datasets with athena.message.submit_and_poll
With datasets loaded, we can proceed with the documentation workflow. We’ll start by defining a function that takes a single dataset's name and schema details and sends them to Athena with a documentation prompt. Each response is appended to a shared list.
documentation_responses = []

def generate_documentation_for_dataset(dataset_name, dataset_schema_details):
    # Submit the documentation prompt for one dataset and wait for the response
    print(f"Generating documentation for dataset: {dataset_name}")
    message = athena.message.submit_and_poll(
        content=f"""
**Task:** Generate comprehensive documentation for a dataset.

**Objective:**
Create output template documentation for a table, detailing its schema, fields, and relevant metadata. The documentation should follow the structure provided below and adhere to the specified markdown format and tone. Use metadata and other available information to produce the documentation tailored to the context. This documentation will serve as a guide for understanding the dataset's structure, purpose, and usage within the organization. It should be clear, concise, and informative, catering to both technical and non-technical stakeholders.

**Instructions:**
1. Explore information available on the dataset {dataset_name}:
    - dataset metadata:

    {dataset_schema_details}


2. For each section of the documentation, provide clear, concise information as outlined in the output template. Use professional language and ensure the documentation is accessible to a broad audience.
3. Include a brief example value or description where requested to illustrate the type of content expected.
4. Only include factual statements. When making assumptions or inferences, clearly label them as such.

**Output Template:**

## Athena Generated Dataset Documentation

### TABLE: `[TABLE NAME]`

**Generated on: [CURRENT DATE]**

#### Dataset Description:
Provide a comprehensive explanation of the table's purpose, detailing what one row represents and the business process or workflow it supports.

#### Field Report:
Document each field in the table, including its name, description, data type, and an example value.

| Field Name | Field Description | Field Type | Example Value |
| ---------- | ----------------- | ---------- | ------------- |
| [FIELD NAME] | [FIELD DESCRIPTION] | [FIELD TYPE] | [EXAMPLE VALUE] |
| ...additional fields as necessary... |

#### Sample Query and First Three Rows:
Include a sample SQL query that returns the first three rows of data, followed by the results of the query.

#### Use Cases & Guidelines:
Describe the organization's use cases and guidelines for using this dataset, highlighting any best practices or restrictions.

#### Other Notes & Considerations:
List any additional notes or considerations relevant to the dataset's use or interpretation.

**End of Template**

Please ensure all information is accurate and up-to-date, reflecting the current state of the dataset as of [CURRENT DATE].
""",
        model=Model.MIXTRAL_SMALL_8_X_7_B_0211,
        tools=[],
    )
    # Keep only the generated text, keyed by dataset name
    message_json = json.loads(message.json())
    documentation_responses.append({'dataset_name': dataset_name, 'documentation_message': message_json['content']})
Now we can kick off the workflow.
# Iterate over each row in the DataFrame
for index, row in df_datasets.iterrows():
    dataset_name = row['name']
    dataset_schema_details = row['schema_details']

    # Generate documentation for the current dataset
    generate_documentation_for_dataset(dataset_name, dataset_schema_details)
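When many datasets are processed in one run, a single failed request would otherwise stop the whole loop. The sketch below shows one way to keep going and record failures separately. `document_all` and `fake_generate` are hypothetical names, and the stand-in generator replaces the Athena call so the example runs on its own; it assumes the generator returns the documentation text rather than appending to a global list.

```python
from typing import Callable, Dict, List, Tuple


def document_all(
    datasets: List[Dict[str, str]],
    generate: Callable[[str, str], str],
) -> Tuple[List[dict], List[dict]]:
    """Run `generate` per dataset, collecting successes and failures separately."""
    succeeded, failed = [], []
    for ds in datasets:
        try:
            doc = generate(ds["name"], ds["schema_details"])
            succeeded.append({"dataset_name": ds["name"], "documentation_message": doc})
        except Exception as exc:  # keep going and record what went wrong
            failed.append({"dataset_name": ds["name"], "error": str(exc)})
    return succeeded, failed


def fake_generate(name: str, schema: str) -> str:
    # Stand-in for the Athena call; fails for one dataset on purpose.
    if name == "broken":
        raise RuntimeError("request timed out")
    return f"## Docs for {name}"


ok, errors = document_all(
    [
        {"name": "orders", "schema_details": "..."},
        {"name": "broken", "schema_details": "..."},
        {"name": "users", "schema_details": "..."},
    ],
    fake_generate,
)
print(len(ok), len(errors))
```

The failed entries can then be retried or inspected without regenerating documentation for the datasets that succeeded.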
Convert results to markdown to read and copy generated documentation.
def json_to_markdown_document(json_list):
    markdown_document = ""
    if not json_list:
        return "No data available"

    for item in json_list:
        for key, value in item.items():
            markdown_document += f"**{key}:** {value}\n\n"
        markdown_document += "---\n\n"  # Separator line between items

    return markdown_document

# Convert the list of dictionaries to Markdown
markdown_document = json_to_markdown_document(documentation_responses)

# Display the Markdown in the notebook
display(Markdown(markdown_document))
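If copying from the notebook output is inconvenient, the same responses can be written to disk, one Markdown file per dataset. This is a minimal sketch under stated assumptions: `write_docs` is a hypothetical helper, the input uses the `dataset_name` and `documentation_message` keys from the list built earlier, and the file-name sanitizing rule is an arbitrary choice.

```python
import re
import tempfile
from pathlib import Path


def write_docs(responses, out_dir):
    """Write one <dataset_name>.md file per response and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for item in responses:
        # Replace anything that is not alphanumeric, dash, underscore, or dot.
        safe_name = re.sub(r"[^A-Za-z0-9_.-]+", "_", item["dataset_name"])
        path = out_path / f"{safe_name}.md"
        path.write_text(item["documentation_message"], encoding="utf-8")
        written.append(path)
    return written


# Sample input standing in for documentation_responses.
sample = [{"dataset_name": "sales/orders", "documentation_message": "## Docs"}]
paths = write_docs(sample, tempfile.mkdtemp())
print([p.name for p in paths])
```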
Generate documentation and ERD diagrams for multiple datasets
Now that we have documented each table, we can ask Athena to process the generated documentation and produce a higher-level description of the whole body of data, including joins and other notable relationships between tables.
def generate_high_level_documentation(markdown_document):
    # Submit the combined dataset documentation and wait for the high-level summary
    print("Generating description for provided dataset-level documentation")
    message = athena.message.submit_and_poll(
        content=f"""
**Task:** Generate high-level comprehensive documentation for a body of datasets.

**Objective:**
Create high-level output documentation for multiple related tables, detailing their schema, fields, relationships, and relevant metadata. The documentation should follow the structure provided below and adhere to the specified markdown format and tone. Use the provided markdown document and other available information to produce the documentation tailored to the context. This documentation will serve as a guide for understanding the structure, purpose, and usage of the datasets within the organization. It should be clear, concise, and informative, catering to both technical and non-technical stakeholders.

**Instructions:**
1. Explore information available in the provided markdown document:
    - Provided markdown document:

    {markdown_document}


2. For each section of the documentation, provide clear, concise information as outlined in the output template. Use professional language and ensure the documentation is accessible to a broad audience.
3. Include diagrams such as Entity-Relationship Diagrams (ERD) and other helpful diagrams to explore relationships in the data.
4. Discuss possible analyses and how the datasets can be joined for these analyses.
5. Only include factual statements. When making assumptions or inferences, clearly label them as such.
6. Pay attention to Mermaid diagram dialect and double-check yourself.

**Output Template:**

## Athena Generated High-Level Dataset Documentation

### Overview of Datasets

Provide a brief overview of the datasets included in the markdown document, summarizing their purpose and how they relate to each other.

### Entity-Relationship Diagram (ERD)

Include an ERD that visually represents the relationships between the datasets.

### Possible Analyses

Discuss potential analyses that could be performed using these datasets, highlighting how they can be joined and what insights might be derived.

### Other Helpful Diagrams

Include other diagrams that may help in understanding the relationships between the datasets, such as flowcharts or sequence diagrams.

### Guidelines for Use

Describe the organization's guidelines for using these datasets together, including any best practices or restrictions.

### Other Notes & Considerations

List any additional notes or considerations relevant to the use or interpretation of these datasets as a whole.

**End of Template**

Please ensure all information is accurate and up-to-date, reflecting the current state of the datasets as of [CURRENT DATE].
""",
        model=Model.MIXTRAL_SMALL_8_X_7_B_0211,
        tools=[],
    )
    # Return only the generated text
    message_json = json.loads(message.json())
    return message_json['content']
high_level_documentation = generate_high_level_documentation(markdown_document)
display(Markdown(high_level_documentation))
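Since the prompt asks for Mermaid diagrams, it can be useful to pull the diagram source out of the response so it can be checked in a Mermaid renderer. The sketch below assumes the model wraps each diagram in a fenced block labeled `mermaid`, which is not guaranteed. `extract_mermaid_blocks` is a hypothetical helper, and the sample text stands in for `high_level_documentation`.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

# Match a mermaid-labeled fence and capture everything up to the closing fence.
MERMAID_BLOCK = re.compile(FENCE + r"mermaid[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_mermaid_blocks(markdown_text):
    """Return the body of every mermaid-fenced block, stripped of whitespace."""
    return [body.strip() for body in MERMAID_BLOCK.findall(markdown_text)]


# Sample text standing in for the generated high-level documentation.
sample = (
    "### Entity-Relationship Diagram (ERD)\n\n"
    + FENCE + "mermaid\n"
    + "erDiagram\n    CUSTOMER ||--o{ ORDER : places\n"
    + FENCE + "\n\nSome prose.\n"
)
blocks = extract_mermaid_blocks(sample)
print(blocks)
```

An empty result means the response contained no diagram in the assumed format, which is a cue to review the output by hand or rerun the prompt.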