MaTouch Works With OpenAI & ChatGPT

by Lan_Makerfabs


The MaTouch AI ESP32S3 2.8" TFT ST7789V board integrates I2S voice input, an I2S speaker, a 3-megapixel OV3660 camera, and a 320×240 display. Combined with the ESP32-S3's powerful processor and Wi-Fi capability, this makes the board a great platform for AI development on the ESP32.

Recently, we successfully connected the MaTouch AI 2.8" board to OpenAI, enabling real-time voice interaction. With just your voice, you can talk directly to the device — it listens, understands, thinks, and responds with natural speech output.

Supplies

Hardware:

  1. MaTouch AI ESP32S3 2.8" TFT ST7789V × 1
  2. Type-C USB Cable × 1

Software:

  1. ESP-IDF Development Environment
  2. OpenAI API Key

What Is OpenAI?


OpenAI provides a powerful suite of AI models capable of understanding natural language and generating human-like responses, with APIs for speech-to-text (STT), text-to-speech (TTS), and chat completion. By integrating OpenAI’s API, developers can easily bring intelligent conversational abilities into embedded systems -- turning traditional hardware into truly “smart” devices.

How to Implement It on MaTouch?

[Diagram: chained-agent pipeline (STT -> GPT -> TTS)]

The MaTouch AI board communicates with OpenAI through three major steps: Speech-to-Text (STT) -- AI model processing (GPT) -- Text-to-Speech (TTS).

  1. Speech-to-Text (STT)

The user’s voice is recorded through the microphone and sent to OpenAI’s STT model, which converts the audio into accurate text in real time.

  2. AI Model Processing (GPT)

The recognized text is transmitted to OpenAI’s AI model (such as GPT-3.5). The model understands the context and generates a response.

  3. Text-to-Speech (TTS)

The AI-generated text is sent to OpenAI’s TTS model, which produces a voice response. The MaTouch AI board then plays this voice output through the I2S speaker.
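
To make the data flow concrete, here is the chain condensed into its essential calls, extracted from the full ai_task() function shown later (error handling and the board's record/playback plumbing are omitted; the shortened handle names are ours):

OpenAI_t *openai = OpenAICreate(OPENAI_API_KEY);
OpenAI_AudioTranscription_t *stt = openai->audioTranscriptionCreate(openai);
OpenAI_ChatCompletion_t *gpt = openai->chatCreate(openai);
OpenAI_AudioSpeech_t *tts = openai->audioSpeechCreate(openai);

// 1. STT: the recorded WAV buffer becomes text.
char *text = stt->file(stt, audio_buffer, sizeof(wav_header_t) + AUDIO_BUFFER_SIZE, OPENAI_AUDIO_INPUT_FORMAT_WAV);

// 2. GPT: the text goes to the chat model, which returns a reply.
OpenAI_StringResponse_t *result = gpt->multiModalMessage(gpt, "text", text, false);
char *response = result->getData(result, 0);

// 3. TTS: the reply is synthesized and streamed to the I2S speaker.
tts->speechStream(tts, response, on_stream);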

Set Up the ESP-IDF Development Environment


Before you begin, please ensure that ESP-IDF is installed on your computer. If not, follow the Get Started with ESP-IDF guide to complete the installation.

Get an OpenAI API Key

[Screenshots: log in, start building, create organization, create and copy the API key, purchase credit, and view the API settings]
  1. Sign in or register on the OpenAI platform.
  2. Click “Start building” and fill in the relevant information.
  3. Enter the project name and key name, or keep the defaults.
  4. Copy your key and click “Continue”.
  5. Ensure your account has sufficient credit; otherwise, the key will not work.
  6. You can return to the overview page, click the Settings button, and view the API information.
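
In the firmware, the key you copied ends up in the OPENAI_API_KEY macro used by ai_task() below. A minimal sketch of how such a key is typically defined (the header name and location here are assumptions, not necessarily this project's layout):

// secrets.h (hypothetical file name) -- the actual project may define the key
// elsewhere, e.g. in main.c or via menuconfig.
#define OPENAI_API_KEY "sk-..." // paste the key you copied from the OpenAI dashboard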

How Does the Code Work?

ai_task() is the core function that implements the entire AI dialogue system. It completes three steps:

  1. Speech-to-Text (STT) -- Sends audio recorded by the microphone to OpenAI to obtain text results.
  2. Language Understanding and Generation -- Passes the recognized text to the GPT model to generate a brief response.
  3. Text-to-Speech (TTS) -- Converts the GPT response back into speech and plays it aloud.
static void ai_task(void *arg)
{
    OpenAI_t *openai = OpenAICreate(OPENAI_API_KEY);
    assert(openai);

    OpenAI_AudioTranscription_t *audioTranscription = openai->audioTranscriptionCreate(openai);
    assert(audioTranscription);
    OpenAI_AudioSpeech_t *audioSpeech = openai->audioSpeechCreate(openai);
    assert(audioSpeech);
    OpenAI_ChatCompletion_t *chatCompletion = openai->chatCreate(openai);
    assert(chatCompletion);

    audioTranscription->setResponseFormat(audioTranscription, OPENAI_AUDIO_RESPONSE_FORMAT_JSON);
    audioTranscription->setTemperature(audioTranscription, 0.2); // float between 0 and 1. Higher values give more random results.
    audioTranscription->setLanguage(audioTranscription, "en");

    audioSpeech->setModel(audioSpeech, "tts-1-hd");
    audioSpeech->setVoice(audioSpeech, "alloy"); // other voices such as "nova" are available
    audioSpeech->setResponseFormat(audioSpeech, OPENAI_AUDIO_OUTPUT_FORMAT_WAV);
    audioSpeech->setSpeed(audioSpeech, 1.0);

    chatCompletion->setModel(chatCompletion, "gpt-3.5-turbo"); // Model to use for completion. Default is gpt-3.5-turbo.
    chatCompletion->setSystem(chatCompletion, "You are a helpful assistant. Keep your answers brief and concise. Respond in a single sentence whenever possible."); // Description of the required assistant
    chatCompletion->setMaxTokens(chatCompletion, 512); // The maximum number of tokens to generate in the completion.
    chatCompletion->setTemperature(chatCompletion, 0.2); // float between 0 and 1. Higher values give more random results.
    chatCompletion->setStop(chatCompletion, "\r"); // Up to 4 sequences where the API will stop generating further tokens.
    chatCompletion->setPresencePenalty(chatCompletion, 0); // float between -2.0 and 2.0. Positive values increase the model's likelihood to talk about new topics.
    chatCompletion->setFrequencyPenalty(chatCompletion, 0); // float between -2.0 and 2.0. Positive values decrease the model's likelihood to repeat the same line verbatim.
    chatCompletion->setUser(chatCompletion, "OpenAI-ESP32"); // A unique identifier representing your end-user, which can help OpenAI monitor and detect abuse.

    while (true) {
        // Block until the recording task signals that a new clip is ready.
        xEventGroupWaitBits(record_event_group, AS_EVENT_RECORD_AI, pdTRUE, pdFALSE, portMAX_DELAY);
        ESP_LOGI("AS", "Starting AI...");

        // Step 1 -- STT: send the recorded WAV buffer to OpenAI and get text back.
        char *text = audioTranscription->file(audioTranscription, audio_buffer, sizeof(wav_header_t) + AUDIO_BUFFER_SIZE, OPENAI_AUDIO_INPUT_FORMAT_WAV);
        if (text == NULL) {
            ESP_LOGE(TAG, "Failed to transcribe audio");
            continue;
        }
        ESP_LOGI(TAG, "Text: %s", text);
        set_chat_text_status(text);

        // Step 2 -- GPT: pass the transcription to the chat model.
        OpenAI_StringResponse_t *result = chatCompletion->multiModalMessage(chatCompletion, "text", text, false);
        if (result->getLen(result) == 1) {
            ESP_LOGI(TAG, "Received message. Tokens: %"PRIu32"", result->getUsage(result));
            char *response = result->getData(result, 0);
            ESP_LOGI(TAG, "%s", response);
            set_chat_text_status(response);
            // Step 3 -- TTS: synthesize the reply and stream it to the speaker.
            audioSpeech->speechStream(audioSpeech, response, on_stream);
        } else if (result->getLen(result) > 1) {
            ESP_LOGI(TAG, "Received %"PRIu32" messages. Tokens: %"PRIu32"", result->getLen(result), result->getUsage(result));
            for (int i = 0; i < result->getLen(result); ++i) {
                char *response = result->getData(result, i);
                ESP_LOGI(TAG, "Message[%d]: %s", i, response);
            }
        } else if (result->getError(result)) {
            ESP_LOGE(TAG, "Error! %s", result->getError(result));
        } else {
            ESP_LOGE(TAG, "Unknown error!");
        }

        free(text);
        result->deleteResponse(result);

        // Re-enable the on-screen speak button for the next round.
        bsp_display_lock(0);
        lv_obj_remove_state(speak_button, LV_STATE_DISABLED);
        bsp_display_unlock();
    }
}

Key Parts of the Code

  1. Create the OpenAI client.

Initializes the OpenAI client using your API key. This client handles all communication with the OpenAI cloud services.

OpenAI_t *openai = OpenAICreate(OPENAI_API_KEY);
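
OpenAICreate() returns NULL if the client cannot be created, which is why the full listing above asserts on the result before creating any of the modules below.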
  2. Create the functional modules

Audio Transcription (STT) – Converts recorded speech into text.

Chat Completion (GPT) – Generates a response based on recognized text.

Audio Speech (TTS) – Converts the response text back into speech.

OpenAI_AudioTranscription_t *audioTranscription = openai->audioTranscriptionCreate(openai);
OpenAI_ChatCompletion_t *chatCompletion = openai->chatCreate(openai);
OpenAI_AudioSpeech_t *audioSpeech = openai->audioSpeechCreate(openai);
  3. Set module parameters

STT settings: Set the language to English and the temperature to 0.2 (lower values give more stable, deterministic output).

audioTranscription->setLanguage(audioTranscription, "en");
audioTranscription->setTemperature(audioTranscription, 0.2);

GPT settings: Set the chat model to gpt-3.5-turbo and give it a system prompt defining it as a brief, concise assistant.

chatCompletion->setModel(chatCompletion, "gpt-3.5-turbo");
chatCompletion->setSystem(chatCompletion, "You are a helpful assistant. Keep your answers brief and concise. Respond in a single sentence whenever possible.");

TTS settings: Set the voice model to tts-1-hd and the voice type to alloy.

audioSpeech->setModel(audioSpeech, "tts-1-hd");
audioSpeech->setVoice(audioSpeech, "alloy"); // other voices such as "nova" are available
  4. Speech-to-Text (STT)

Sends the recorded audio buffer to OpenAI for transcription, receiving text output.

char *text = audioTranscription->file(audioTranscription, audio_buffer, sizeof(wav_header_t) + AUDIO_BUFFER_SIZE, OPENAI_AUDIO_INPUT_FORMAT_WAV);
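
The buffer starts with a WAV header followed by the raw samples, which is why its length is sizeof(wav_header_t) + AUDIO_BUFFER_SIZE. The project's wav_header_t definition is not shown in the excerpt; for reference, a standard 44-byte PCM WAV header looks like the sketch below (an assumption about this project's exact struct):

#include <stdint.h>

// Hypothetical layout of a standard 44-byte PCM WAV header; the project's
// actual wav_header_t may differ in field names or extra chunks.
typedef struct __attribute__((packed)) {
    char     riff[4];         // "RIFF"
    uint32_t file_size;       // total file size minus 8 bytes
    char     wave[4];         // "WAVE"
    char     fmt[4];          // "fmt "
    uint32_t fmt_size;        // 16 for PCM
    uint16_t audio_format;    // 1 = PCM
    uint16_t num_channels;    // 1 = mono
    uint32_t sample_rate;     // e.g. 16000
    uint32_t byte_rate;       // sample_rate * num_channels * bits_per_sample / 8
    uint16_t block_align;     // num_channels * bits_per_sample / 8
    uint16_t bits_per_sample; // e.g. 16
    char     data[4];         // "data"
    uint32_t data_size;       // number of audio bytes that follow
} wav_header_t;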
  5. Text-to-Response (ChatCompletion)

Passes the transcribed text to the GPT model, which generates a text response.

OpenAI_StringResponse_t *result = chatCompletion->multiModalMessage(chatCompletion, "text", text, false);
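
A note on memory: as the full listing shows, the returned OpenAI_StringResponse_t is released with deleteResponse() after use, and the transcription string from the STT step is freed separately with free().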
  6. Text-to-Speech (TTS)

Converts the GPT-generated response into speech and plays it through the speaker.

audioSpeech->speechStream(audioSpeech, response, on_stream);
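
speechStream() delivers the synthesized audio to the on_stream callback, which is not shown in the excerpt above. A minimal sketch of what such a callback could look like, assuming a (data, length) signature and an already-initialized ESP-IDF I2S TX channel handle (both are assumptions; the project's actual callback and audio path may differ):

#include "freertos/FreeRTOS.h"
#include "driver/i2s_std.h"
#include "esp_log.h"

extern i2s_chan_handle_t i2s_tx_handle; // assumed: a TX channel configured elsewhere by the board code

// Hypothetical stream callback: write each audio chunk OpenAI sends to the I2S speaker.
static void on_stream(uint8_t *data, size_t len)
{
    size_t written = 0;
    if (i2s_channel_write(i2s_tx_handle, data, len, &written, portMAX_DELAY) != ESP_OK) {
        ESP_LOGW("AS", "I2S write failed");
    }
}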

Upload the Code

[Screenshots: pasting the API key, selecting ESP32-S3, Wi-Fi settings, partition table CSV, flash size, save, and upload]
  1. Open the stt_llm_tts project in VS Code.
  2. Paste the key you copied earlier from OpenAI into the code.
  3. Set the target chip to ESP32S3.
  4. Enter your WiFi credentials.
  5. Set Partition to “Custom partition table CSV”.
  6. Set Flash size to 16MB.
  7. Enable “Support for external, SPI-connected RAM”, set the mode to “Octal Mode PSRAM”, and click “Save”.
  8. Connect the board to the PC with a Type-C USB cable, select the corresponding port, and click “Flash Device”.
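
If you prefer the terminal over the VS Code extension, the equivalent ESP-IDF workflow is idf.py set-target esp32s3, idf.py menuconfig for the partition, flash, and PSRAM options, and idf.py -p <PORT> flash monitor to build, flash, and watch the log output.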


Result

[Video: MaTouch AI ESP32S3 x OpenAI]
  1. Click “RECORD” to start a conversation.
  2. Click “PLAY RECORD” to play back the most recent recording.