Build a Voice-Controlled AI Assistant on ESP32 Using MCP

by Mukesh_Sankhla in Circuits > Assistive Tech



AI Can Talk… But Can It Control Hardware? | ESP32 + MCP Explained | Project Mino

Most AI demos today can talk really well, but they can’t do real work.

In this project, I’ll show you how to build a voice-controlled AI assistant using an ESP32 and Xiaozhi that can safely control real hardware and software automations. This assistant doesn’t just chat; it turns lights ON and OFF, reads sensor data, and even creates and fetches meetings from Google Calendar.

The key idea behind this project is Model Context Protocol (MCP). MCP acts as a bridge between an AI model and physical systems, allowing the AI to call predefined tools using structured data instead of guessing commands.

Using the DFRobot ESP32-S3 AI Cam, we combine voice input, AI decision-making, and real execution on an embedded device. The result is a reliable, predictable, and secure AI assistant that actually works in the real world.

This Instructable walks you through the complete process, from hardware setup and enclosure design to MCP tools and real-world automation.

Supplies


CAD & 3D Printing


I designed a custom enclosure in Autodesk Fusion 360 to give the project a clean, product-like finish.

The enclosure consists of three parts:

  1. Main housing – holds all the electronics
  2. Button extension – brings the ESP32-S3 on-board button outside the enclosure
  3. Top cover – closes the assembly and includes the camera cutout

The design is compact, lightweight, and comfortable to hold, roughly the size of a soap bar.

I 3D-printed all parts using a Bambu Lab P1S printer with yellow PLA filament.

You can:

  1. Download the STL files and print them directly, or
  2. Download the Fusion 360 (STEP) files and modify the design as needed

⚠️ Note: This design is shared for educational and personal use only, not for commercial purposes.

Flash Xiaozhi Firmware


To flash the Xiaozhi firmware onto the ESP32-S3 AI Cam, follow these steps.

1. Download Required Files

  1. ESP Flash Download Tool: https://docs.espressif.com/projects/esp-test-tools/en/latest/esp32/production_stage/tools/flash_download_tool.html
  2. Mino Project Repository: https://github.com/MukeshSankhla/Mino-ESP32_MCP (contains the firmware and all project-related files)
2. Prepare the Flasher Tool

  1. Extract all downloaded files
  2. Open the ESP Flash Download Tool by double-clicking it
  3. Select the chip type as ESP32-S3

3. Flash the Firmware

You will now be on the flashing screen:

  1. Click the three dots (⋯) and select the firmware .bin file (xiaozhi_v1.9.4.bin) from the project folder
  2. Set the address to 0x00
  3. Check the enable checkbox
  4. Select the correct COM port
  5. Click Erase and wait until it shows Finished
  6. Click Start to begin flashing, then wait until the process completes

Once finished, the firmware is successfully flashed onto the ESP32-S3 AI Cam.

Circuit Connection


Now, follow the circuit diagram and make the required connections using a soldering iron and wires.

Power Connections

Battery to BMS (Input):

  1. Connect the Li-Po battery to the IP5306 BMS input: red wire → positive (+), black wire → negative (−)
  2. Double-check polarity before soldering

Power Switch Connection:

  1. Connect the mini switch in series with the output side of the IP5306 BMS
  2. This switch will control power delivery to the ESP32-S3 AI Cam

Power Connection to ESP32-S3 AI Cam


Now connect the output of the IP5306 BMS to the ESP32-S3 AI Cam.

The ESP32-S3 AI Cam comes with a 2-pin battery terminal block, but I removed it to make the overall assembly slimmer by about 3 mm.

Connection Steps:

  1. Solder the BMS output wires directly to the battery solder pads on the ESP32-S3 AI Cam: positive (+) to PW+, negative (−) to PW−
  2. Ensure the solder joints are solid and there are no short circuits
  3. Turn ON the power switch to verify the connection

If the board powers up correctly, the power wiring is complete.

ESP32-S3 Assembly

  1. Take the main housing and the button extension, and place the button extension into its cutout in the housing.
  2. Take the ESP32-S3 AI Cam board with the speaker connected.
  3. Place the speaker into its dedicated slot inside the housing.
  4. Align the ESP32-S3 board with the designed standoffs in the housing.
  5. Secure the board using 4x M2 screws.
  6. Press the button extension to make sure it moves freely and properly presses the on-board button.
  7. If it feels tight, lightly sand the button extension until it presses and releases smoothly.

BMS Assembly

  1. Place the IP5306 BMS module upside down inside the housing.
  2. Align the Type-C connector with the cutout provided on the enclosure.
  3. Secure the BMS using two M2 screws.

Switch Assembly

  1. Use quick glue to secure the mini switch inside the housing.
  2. Route the wires neatly to avoid pinching or stress.
  3. Fix the battery in place using double-sided tape.

Final Assembly

  1. Place the cover onto the housing, aligning the camera hole carefully.
  2. Flip the assembly over and secure it using three M2 screws.

That’s it — the build is complete! 🎉

Configuration

  1. Power on the Mino.
  2. It will speak instructions and create a Wi-Fi hotspot named Xiaozhi…
  3. On your phone or laptop, open Wi-Fi settings and connect to the Xiaozhi hotspot.
  4. Open a browser and go to 192.168.1.4.
  5. The Wi-Fi configuration page will open.
  6. Enter your Wi-Fi SSID and Password, then tap Connect.
  7. A green check mark confirms successful connection.

Once connected, the device will speak a 6-digit pairing code.

  1. Go to https://xiaozhi.me/ and create an account (or log in).
  2. Open the Console, click Add Device, and enter the 6-digit code.
  3. The device will now appear in your console.

From here, select Configure Role to customize the device: change the agent's name, language, voice profile, and role, select the LLM/AI model, and more.

ESP32 & MCP


Model Context Protocol (MCP) is a standard way for an AI model to safely interact with real systems.

AI models (LLMs) are great at understanding language, but they cannot directly control hardware. They work on probabilities and guesses, while hardware needs strict and predictable instructions.

MCP solves this by acting as a bridge between the AI and the ESP32.

Think of MCP like USB for AI models:

  1. USB defines how devices talk to a computer
  2. MCP defines how an AI talks to hardware and software tools

How MCP Runs on the ESP32

In this project:

  1. The LLM runs in the cloud
  2. The ESP32-S3 acts as an MCP server
  3. MCP communication happens using structured JSON

The ESP32 exposes specific actions as tools, such as:

  1. Turning LEDs ON or OFF
  2. Reading sensor data
  3. Creating or fetching Google Calendar events

Each MCP tool has:

  1. A name
  2. A description (for the AI)
  3. A strict JSON input schema
  4. A defined execution and response
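Put together, a tool descriptor carrying those four pieces might look like the following (shown in the MCP-style shape with an `inputSchema` field; the exact field names in the Xiaozhi firmware may differ):

```json
{
  "name": "room_light",
  "description": "Control LED connected to ESP32",
  "inputSchema": {
    "type": "object",
    "properties": {
      "state": { "type": "string", "enum": ["on", "off"] }
    },
    "required": ["state"]
  }
}
```

The description tells the AI when to pick the tool; the schema tells it exactly what arguments are legal.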

The AI selects a tool and sends a valid JSON request.

The ESP32 parses this request and executes only the allowed action—nothing more.

This makes the system safe, predictable, and reliable.

LED Control Example

The LED is a simple example to show how MCP works.

  1. The user says: “Turn on the room light”
  2. The AI selects the room_light tool and sends a JSON command:

{ "state": "on" }

  1. The ESP32 receives the JSON, validates the input, and executes the action using digitalWrite()
  2. The ESP32 sends a response back: success if the LED turns ON, or an error if something fails
  3. The AI confirms the result to the user

Why This Matters

Without MCP:

  1. AI guesses commands
  2. APIs are unpredictable
  3. Hardware control is unsafe

With MCP:

  1. Every action is predefined
  2. Inputs are validated
  3. Execution is deterministic

This is how AI moves from chatting to real-world execution on embedded devices like the ESP32.

Basic MCP Example (LED + DHT11)


In this example, we use a DFRobot FireBeetle ESP32-S3, which has:

  1. An on-board LED connected to GPIO 21
  2. A DHT11 temperature & humidity sensor connected to GPIO 3

This sketch demonstrates how ESP32 exposes real hardware as MCP tools that an AI can call safely.

What This Code Does (High Level)

  1. Connects the ESP32 to Wi-Fi
  2. Opens a WebSocket connection to the MCP server
  3. Registers two MCP tools:
  4. room_light → Control the LED
  5. room_climate → Read temperature & humidity
  6. Waits for AI requests and executes them on real hardware

Required Libraries

Make sure these libraries are installed in Arduino IDE:


#include <WebSocketMCP.h>
#include <ArduinoJson.h>
#include <DHT11.h>

Wi-Fi Configuration


const char* WIFI_SSID = "YOUR_WIFI_SSID";
const char* WIFI_PASS = "YOUR_WIFI_PASSWORD";

🔧 User Action:

Replace these with your own Wi-Fi credentials.

MCP Endpoint


const char* MCP_ENDPOINT = "wss://api.xiaozhi.me/mcp/?token=...";

This is the secure WebSocket endpoint that connects your ESP32 to the AI.

How to Get Your MCP Endpoint

  1. Go to xiaozhi.me
  2. Open Configure Role
  3. Scroll to MCP Settings
  4. Click Get MCP Endpoint
  5. Copy and paste it here

Hardware Configuration


#define LED_PIN 21
#define DHT_PIN 3
  1. LED is connected to GPIO 21
  2. DHT11 data pin is connected to GPIO 3

DHT11 dht11(DHT_PIN);

Creates a DHT11 sensor instance.

MCP Client Initialization


WebSocketMCP mcp;

This object handles:

  1. MCP connection
  2. Tool registration
  3. Message parsing
  4. Responses to AI

MCP Tool 1: LED Control (room_light)

Tool Definition:


mcp.registerTool(
"room_light",
"Control LED connected to ESP32",
"{\"type\":\"object\",\"properties\":{\"state\":{\"type\":\"string\",\"enum\":[\"on\",\"off\"]}},\"required\":[\"state\"]}",

This tool:

  1. Is named room_light
  2. Accepts only one parameter
  3. state must be "on" or "off"

No other values are allowed.

Tool Execution Logic


if (state == "on") {
digitalWrite(LED_PIN, HIGH);
} else if (state == "off") {
digitalWrite(LED_PIN, LOW);
}
  1. "on" → LED turns ON
  2. "off" → LED turns OFF

If the JSON is invalid or the value is wrong, an error is returned to the AI.

Tool Response


{
"success": true,
"device": "LED",
"state": "on"
}

This response tells the AI exactly what happened.

MCP Tool 2: Climate Sensor (room_climate)

Tool Definition:


mcp.registerTool(
"room_climate",
"Read temperature and humidity from DHT11",
"{\"type\":\"object\",\"properties\":{}}",

This tool:

  1. Takes no input
  2. Simply reads the DHT11 sensor

Sensor Reading


int result = dht11.readTemperatureHumidity(temperature, humidity);

If the read fails, an error is returned.

If successful, temperature and humidity are sent back to the AI.

Tool Response Example


{
"success": true,
"temperature_c": 28,
"humidity_percent": 60
}

MCP Connection Callback


void onMcpConnectionChange(bool connected)
  1. When MCP connects, the tools are registered
  2. When MCP disconnects, the status is printed on the Serial Monitor

This ensures tools are available only when MCP is active.

Setup Function

In setup():

  1. Serial communication starts
  2. LED pin is configured
  3. Wi-Fi connection is established
  4. MCP client is started

mcp.begin(MCP_ENDPOINT, onMcpConnectionChange);

Loop Function


void loop() {
mcp.loop();
}

This keeps the MCP connection alive and listens for AI tool calls.

How the Full Flow Works

  1. User speaks to AI
  2. AI selects an MCP tool
  3. AI sends structured JSON
  4. ESP32 validates input
  5. Hardware action is executed
  6. ESP32 sends response
  7. AI confirms result to user

Google Calendar Demo (ESP32 + MCP)


In this step, the ESP32 becomes a real Google Calendar assistant, not just a voice demo.

The same ESP32-S3 board runs:

  1. MCP client (connected to Xiaozhi AI)
  2. Custom calendar tools (set_meeting, get_meetings)
  3. Google Calendar integration via Google Apps Script

When you speak a command, the AI decides which tool to call, and the ESP32 executes it.

1. set_meeting – Create a Google Calendar Event

This function is used when the AI hears something like:

“Create a meeting tomorrow at 2:30 PM for 60 minutes”

What the AI Sends to ESP32 (via MCP)

The AI does not send epoch time.

It sends human-readable structured data:


{
"title": "Project Review",
"time": "14:30",
"date": "18/01/2026",
"duration": 60
}

This is important because LLMs are bad at time math.

What the ESP32 Does (Step-by-Step)

1. Validate Inputs


if (timeStr.length() == 0 || dateStr.length() == 0)

Ensures time and date are present.

2. Convert Time + Date → Epoch (IST → UTC)


long long epochMs = convertToEpochMs(timeStr, dateStr);

Inside convertToEpochMs():

  1. Accepts multiple formats
  2. Builds a tm structure
  3. Assumes IST
  4. Converts to UTC epoch
  5. Returns milliseconds

✅ This fixes the biggest AI scheduling bug.

3. Build HTTP Request


?action=create
&title=Project%20Review
&start_epoch=1768726800000
&duration=60

The ESP32 sends this to Google Apps Script.

4. Google Apps Script Creates the Event


var start = new Date(startEpoch);
var end = new Date(start.getTime() + durationMin * 60000);

CalendarApp.getDefaultCalendar().createEvent(
title, start, end
);

✔ Event is now live in Google Calendar.

Response Back to AI


{
"success": true,
"meeting": "created",
"title": "Project Review",
"scheduled_time": "14:30 IST",
"scheduled_date": "18/01/2026"
}

AI speaks the confirmation.

2. get_meetings – Retrieve Calendar Events

Used when the AI hears:

“What meetings do I have tomorrow evening from 4 to 5?”

What the AI Sends to ESP32


{
"start_time": "16:00",
"start_date": "18/01/2026",
"end_time": "17:00",
"end_date": "18/01/2026"
}

Again — no epoch from AI.

What the ESP32 Does

1. Validate Time Range

Checks all fields exist and:


startEpoch < endEpoch

2. Convert Both Times to Epoch


startEpochMs = convertToEpochMs(start_time, start_date);
endEpochMs = convertToEpochMs(end_time, end_date);

Both are:

  1. Parsed as IST
  2. Converted to UTC
  3. Sent in milliseconds

3. Build Request


?action=get
&start_epoch=1768732200000
&end_epoch=1768735800000

Google Apps Script Fetches Meetings


var events = CalendarApp
.getDefaultCalendar()
.getEvents(startTime, endTime);

Each event is converted into JSON:


{
"title": "Project Review",
"start_readable": "Sun Jan 18 2026 16:00:00 GMT+0530",
"end_readable": "Sun Jan 18 2026 16:30:00 GMT+0530"
}

Response Back to ESP32 → AI


{
"success": true,
"count": 1,
"meetings": [ ... ]
}

The AI can now:

  1. Read meetings aloud
  2. Summarize schedule
  3. Make decisions (free/busy logic)

Get the Google Apps Script Web URL

To connect ESP32 with Google Calendar, we need a public Web App URL from Google Apps Script.

1. Create a New Script

  1. Go to https://script.google.com/
  2. Click New Project
  3. Delete the default code
  4. Copy–paste the provided Apps Script code

2. Save the Script

  1. Click Save
  2. Give the project a name (e.g., ESP32 Calendar MCP)

3. Deploy as Web App

  1. Click Deploy → New deployment
  2. Select Web app

Set the options:

  1. Execute as: Me
  2. Who has access: Anyone

Then click Deploy

On first deploy, Google will ask for permission — approve it.

4. Copy the Web URL

  1. After deployment, Google shows a Web App URL
  2. Copy this URL

5. Paste URL in ESP32 Code

Replace CALENDAR_URL in the ESP32 sketch:


const char* CALENDAR_URL = "PASTE_YOUR_WEB_APP_URL_HERE";


Script:

function doGet(e) {
var action = e.parameter.action || "create";
if (action === "create") {
return createMeeting(e);
} else if (action === "get") {
return getMeetings(e);
}
return ContentService.createTextOutput(JSON.stringify({
success: false,
error: "Invalid action. Use action=create or action=get"
})).setMimeType(ContentService.MimeType.JSON);
}

// Create meeting function
function createMeeting(e) {
var title = e.parameter.title || "ESP32 Meeting";
var startEpoch = Number(e.parameter.start_epoch);
var durationMin = Number(e.parameter.duration || 30);

if (!startEpoch || isNaN(startEpoch)) {
return ContentService.createTextOutput(JSON.stringify({
success: false,
error: "Invalid epoch"
})).setMimeType(ContentService.MimeType.JSON);
}

var start = new Date(startEpoch);
var end = new Date(start.getTime() + durationMin * 60000);

try {
var event = CalendarApp.getDefaultCalendar().createEvent(
title,
start,
end,
{ description: "Created from ESP32" }
);

return ContentService
.createTextOutput(JSON.stringify({
success: true,
message: "Meeting created",
title: title,
start: start.toString(),
end: end.toString(),
id: event.getId()
}))
.setMimeType(ContentService.MimeType.JSON);
} catch (error) {
return ContentService
.createTextOutput(JSON.stringify({
success: false,
error: error.toString()
}))
.setMimeType(ContentService.MimeType.JSON);
}
}

// Get meetings function
function getMeetings(e) {
var startEpoch = Number(e.parameter.start_epoch);
var endEpoch = Number(e.parameter.end_epoch);
if (!startEpoch || !endEpoch || isNaN(startEpoch) || isNaN(endEpoch)) {
return ContentService.createTextOutput(JSON.stringify({
success: false,
error: "Invalid start_epoch or end_epoch"
})).setMimeType(ContentService.MimeType.JSON);
}
try {
var startTime = new Date(startEpoch);
var endTime = new Date(endEpoch);
var events = CalendarApp.getDefaultCalendar().getEvents(startTime, endTime);
var meetings = events.map(function(event) {
return {
title: event.getTitle(),
start: event.getStartTime().getTime(),
end: event.getEndTime().getTime(),
start_readable: event.getStartTime().toString(),
end_readable: event.getEndTime().toString(),
description: event.getDescription() || "",
location: event.getLocation() || ""
};
});
return ContentService
.createTextOutput(JSON.stringify({
success: true,
count: meetings.length,
search_range: {
start: startTime.toString(),
end: endTime.toString()
},
meetings: meetings
}))
.setMimeType(ContentService.MimeType.JSON);
} catch (error) {
return ContentService
.createTextOutput(JSON.stringify({
success: false,
error: error.toString()
}))
.setMimeType(ContentService.MimeType.JSON);
}
}

Xiaozhi MCP Light (Relay Example)


In this step, we demonstrate real AI-controlled hardware execution using Xiaozhi MCP.

Instead of a camera board, we use a DFRobot Beetle ESP32-C3, connected to a 10A relay module on GPIO 0.

This relay can control real loads like lights, fans, or appliances.

This example proves that MCP is not limited to one ESP32 — multiple ESP32 devices can expose tools independently.

Hardware Used:

  1. DFRobot Beetle ESP32-C3
  2. 10A Relay Module
  3. Relay control pin → GPIO 0

When the relay pin goes HIGH, the relay turns ON.

When it goes LOW, the relay turns OFF.

What This Example Does

The ESP32 exposes a single MCP tool: office_light

This tool allows the AI to:

  1. Turn the relay ON
  2. Turn the relay OFF

The AI does not toggle GPIOs directly.

It calls a structured tool, and the ESP32 executes it safely.

How the MCP Flow Works

  1. Voice command: the user says something like “Turn on the office light”
  2. Xiaozhi AI understands the intent, selects the MCP tool office_light, and sends structured JSON: { "state": "on" }
  3. The ESP32 receives the tool call, sets GPIO 0 HIGH or LOW, and switches the relay instantly
  4. The ESP32 sends the execution status back, and the AI confirms the action

This is true AI → hardware control, not keywords or if-else logic.

Conclusion


In this project, we built a real voice-controlled AI system on ESP32 — not a chatbot, but an execution engine.

Using MCP (Model Context Protocol), the ESP32 exposes its hardware and services as structured tools that an AI can safely call. This allowed us to:

  1. Control real hardware (LEDs, sensors, relays)
  2. Convert natural language into deterministic actions
  3. Create and fetch Google Calendar meetings
  4. Handle time, timezone, and epoch conversion directly on the device

What you’ve seen in this project are just a few examples of what MCP enables.

The real power is that any hardware or software capability can be exposed as an MCP tool — from home automation and factory sensors to cloud services, dashboards, and industrial control systems.

The possibilities are truly endless when AI is combined with structured, secure execution.

The key takeaway is the architecture:

  1. The AI decides what needs to be done
  2. MCP defines how it can be done
  3. ESP32 executes it safely in the real world

If you understand this flow, you’re no longer just building IoT projects — you’re designing AI-driven automation systems.


Special Thanks

A big thank you to DFRobot for providing all the hardware components used in this project and supporting open, educational innovation.


Happy building 🚀