Using Ask TK to Generate SQL (Preview)

This tutorial explains how to use the Ask TK function in Yellowbrick Manager to generate SQL from natural language questions. Ask TK uses an external large language model (LLM) to interpret the question, together with additional context about the database schema, and generates a SQL expression. The purpose of this feature is to provide a SQL starting point that can be validated and built upon: it is a productivity tool, not a tool for generating SQL that can be blindly trusted.

Prerequisites

  • Permission to connect Ask TK to external API endpoints. A user with the consumeradmin role must configure this for you; a separate how-to guide details the steps needed to do this.
  • The URL for the OpenAI-compatible API endpoint to connect to. This could be a locally running service hosted by Ollama, or a commercial service such as OpenAI.
  • An API key for the service. This tutorial expects you to use OpenAI as the LLM provider, but you can complete the tutorial using a different service, if required.
  • The name of the LLM to connect to and the name of an embedding model to use. For this tutorial you will use the gpt-4o and text-embedding-3-small models.
  • A populated database schema to ask questions against. In this tutorial you will use the NOAA sample data set.
  • A running Yellowbrick compute cluster. A single compute node will suffice.

Overview

In this tutorial you will accomplish the following tasks:

  1. Configure the connection to the LLM
  2. Load the sample NOAA data set
  3. Configure the context that you will provide with your questions
  4. Ask TK a question about the sample data
  5. Use the embedding model to find relevant tables
  6. Tune the LLM prompt to add additional instructions

Part 1: Configure the LLM Connection

To configure Ask TK to use an external LLM service, navigate to Hi 'User' → Profile → Ask TK from the menu at the top right. Start by entering your API key, model name, and the URL for the API service. Be sure to specify the complete URL including the protocol, for example https://api.openai.com. Ask TK expects the endpoint to be compatible with the OpenAI API. Note that if you are using a model hosted by Ollama, the key can be any arbitrary string, but a key must always be provided.

Decide on the number of Most Common Values to use. The default is 100. This value determines the maximum number of sampled values from each column in the schema that are sent to the LLM to help it generate the SQL. For this tutorial, leave the value at 100. For other data sets you might decide that you don't want potentially sensitive data to be sent to an external LLM; in that case, set the value to 0.
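Conceptually, this setting caps how many of the most frequent distinct values per column are shared with the LLM. The Python sketch below illustrates the idea only; the function, column data, and sampling logic are illustrative assumptions, not Ask TK's actual implementation:

```python
from collections import Counter

def most_common_values(column_values, limit=100):
    """Return up to `limit` of the most frequent values in a column.

    With limit=0, no sample values are shared with the LLM at all,
    which keeps potentially sensitive data out of the prompt.
    """
    if limit == 0:
        return []
    return [value for value, _ in Counter(column_values).most_common(limit)]

# Example: the three most common station states in a small sample.
states = ["NC", "NC", "VA", "NC", "VA", "SC"]
print(most_common_values(states, limit=3))  # ['NC', 'VA', 'SC']
```

Setting the limit to 0 corresponds to the privacy-preserving option described above: no sampled data leaves the database.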

Part 2: Loading the Sample Data

The NOAA data set contains weather observations stretching back several hundred years. The schema consists of four main tables: countries, observations, stations, and states. You will load the sample data into these tables and then use Ask TK to generate SQL from natural language questions posed about the data.

Navigate to Query Editor → Sample Catalog → NOAA data set. Click the Setup button. If you have never loaded sample data before, Yellowbrick Manager will offer to create a database in which to load the data. You can accept this offer if the yellowbrick_trial database doesn't already exist. Alternatively, to load the data into a different database, create that database first and configure Yellowbrick Manager to use it before clicking the Setup button.

The Query Editor will load the setup script into a new tab. Run this script to set up the NOAA schema in the current database. Once the setup script has finished running, navigate back to Sample Catalog → NOAA and click the Load button. This will open a new tab in the Query Editor containing the load script. Run the load script. The NOAA data set is quite large and takes around 10 minutes to load using a compute cluster with a single SMALL compute node.

Create a new tab in the Query Editor. The SQL output from the LLM will be pasted to this tab in response to a question.

Part 3: Configuring the Prompt Context

To generate a more useful response to a question, you must provide the LLM with information about the database schema that you want to analyze. Ask TK can inject schema information into the prompt sent to the LLM to help it generate a SQL expression. This schema information includes table names, column definitions and sample values.
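As an illustration, the injected context might resemble the output of this Python sketch. The function, the rendered layout, and the sample data are all hypothetical; Ask TK's actual prompt format may differ:

```python
def build_schema_context(tables):
    """Render table names, column definitions, and sample values into a
    text block suitable for inclusion in an LLM prompt.
    (Illustrative format only; not Ask TK's actual prompt layout.)
    """
    lines = []
    for table, columns in tables.items():
        lines.append(f"Table {table}:")
        for name, ctype, samples in columns:
            lines.append(f"  {name} {ctype} -- sample values: {', '.join(samples)}")
    return "\n".join(lines)

# Hypothetical fragment of the NOAA schema.
tables = {
    "stations": [
        ("id", "VARCHAR", ["USC00311241"]),
        ("state", "VARCHAR", ["NC", "VA"]),
    ],
}
print(build_schema_context(tables))
```

Sample values are what allow the model to match literals in a question (such as a state code) to the right columns and predicates.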

Before asking a question, ensure that the correct database and search path are set in the Query Editor; setting the search path is required. For this data set, set the search path to noaa_ghcn_pds and then navigate to Ask TK. The Ask TK dialog enables you to submit a question, view and modify the generated prompt, and configure which tables should be included in the prompt.

To select the tables to include, navigate to the Tables menu at the top left of the Ask TK dialog and select observations, states and stations.

Part 4: Ask a Question

Enter the following question into the Ask TK dialog:

What was the total rainfall in CARY, NC in inches on January 10th 2024? Note that NOAA rain observations are recorded in tenths of a millimeter.

Click Generate SQL. After a few seconds, Ask TK will paste the generated SQL expression into the current Query Editor tab. Close the Ask TK dialog to view the query:

SELECT
    SUM(o.data_value) / 254.0 AS total_rainfall_inches
FROM
    noaa_ghcn_pds.observations o
JOIN
    noaa_ghcn_pds.stations s ON o.id = s.id
WHERE
    s.name = 'CARY'
    AND s.state = 'NC'
    AND o.year_date = 20240110
    AND o.element = 'PRCP';

Notice how gpt-4o has used sample values from the data to infer the correct predicates, and the table schema to choose the correct join condition between the observations and stations tables.

Run the query. You can validate the result using the NOAA website: select the month of January and the year 2024 to retrieve the actual daily weather observations for the Cary, NC weather station. You will see that the rainfall on January 10th was 1.51 inches, which matches the result of the SQL query.
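The division by 254.0 in the generated query is the unit conversion: NOAA records precipitation in tenths of a millimeter, and 1 inch = 25.4 mm = 254 tenths of a millimeter. A quick Python check, using a hypothetical summed data_value of 384 that is consistent with the 1.51-inch result:

```python
TENTHS_MM_PER_INCH = 25.4 * 10  # 1 inch = 25.4 mm = 254 tenths of a millimeter

def tenths_mm_to_inches(total_tenths_mm):
    """Convert a NOAA PRCP total (tenths of a millimeter) to inches."""
    return total_tenths_mm / TENTHS_MM_PER_INCH

# A summed data_value of 384 rounds to the 1.51 inches reported by NOAA.
print(round(tenths_mm_to_inches(384), 2))  # 1.51
```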

Navigate to Ask TK → View Generated Prompt to review the prompt sent to the LLM. You will see the database schema information for the three selected tables included in the prompt. The dialog also features a prompt template, which can be modified and reset back to the default as needed. Note that each query tab has its own Ask TK dialog, so you can ask different questions and configure different prompts across query tabs.

Part 5: Automatically Pick Relevant Tables

For schemas with many tables, rather than manually selecting which tables to include in the prompt, you can ask an AI model to select what it thinks are the most relevant tables for you. Ask TK does this by using the configured embedding model to create vector embeddings for the question and for each table name in the schema, and then performing a cosine similarity search to determine the tables that most closely match the question.
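The selection step can be sketched as follows. The embeddings below are made-up three-dimensional vectors; in practice they would come from the configured embedding model (here, text-embedding-3-small) and have far more dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_tables(question_vec, table_vecs, limit=2):
    """Rank tables by cosine similarity to the question embedding."""
    ranked = sorted(table_vecs.items(),
                    key=lambda kv: cosine_similarity(question_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:limit]]

# Mock embeddings for the question and three NOAA table names.
question = [0.9, 0.1, 0.0]
tables = {
    "observations": [0.8, 0.2, 0.1],
    "stations":     [0.7, 0.1, 0.2],
    "countries":    [0.1, 0.1, 0.9],
}
print(top_tables(question, tables))  # ['observations', 'stations']
```

Only the top-ranked tables are then included in the prompt, keeping the schema context small even for wide schemas.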

Navigate to Hi 'User' → Profile → Ask TK to set the embedding model and the maximum number of relevant tables to return. For OpenAI, use the text-embedding-3-small model. Set the maximum number of tables to consider to 2. Generate a new SQL query and view the generated prompt. With the embedding model configured, you will see that the two tables that most closely match the question have been included in the prompt instead of the three tables you manually selected in Part 3.

Part 6: Tuning the Prompt

While the default prompt generally provides good results, there might be cases where it is beneficial to provide the LLM with additional information to help it return better queries. Additional context can be added to the question itself, as you did in Part 4 when you told the LLM that NOAA rainfall totals are measured in tenths of a millimeter. Alternatively, the prompt in the Ask TK dialog can be edited to provide this additional context without having to add it to each question.
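The difference between the two approaches can be sketched in Python. The template text and the rendering function below are hypothetical, not Ask TK's actual template:

```python
# Hypothetical stand-in for the default prompt template.
DEFAULT_TEMPLATE = "Generate a SQL query for the question below.\n{schema}\n{question}"

# A standing instruction added to the prompt applies to every question
# asked from the query tab, so it need not be repeated per question.
EXTRA_CONTEXT = "Note that NOAA rain observations are recorded in tenths of a millimeter."

def render_prompt(template, schema, question, extra_context=""):
    """Render the final prompt from a template, schema context, and question."""
    prompt = template.format(schema=schema, question=question)
    if extra_context:
        prompt += "\n" + extra_context
    return prompt

prompt = render_prompt(
    DEFAULT_TEMPLATE,
    "<schema context>",
    "What was the total rainfall in CARY, NC in inches on January 10th 2024?",
    EXTRA_CONTEXT,
)
print(EXTRA_CONTEXT in prompt)  # True
```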

Each query tab can have a different prompt. In addition, the default prompt template can be altered for all query tabs by navigating to Hi 'User' → Profile → Ask TK Advanced and making modifications to the prompt template there.

Test this in a new query tab by appending the following to the prompt:

Note that NOAA rain observations are recorded in tenths of a millimeter.

and asking the question:

What was the total rainfall in CARY, NC in inches on January 10th 2024?

Ask TK should generate a SQL expression that returns the rainfall total in inches.

Conclusion

Congratulations on completing this tutorial. You have learned how to configure Ask TK to generate SQL expressions from natural language questions about your databases in Yellowbrick!