Week 17 - Wildcard Week (AI Horse BOOK v2 Voice Edition)

1) Task & Motivation

AI Horse BOOK is designed to be a mechanical pony that holds a natural conversation with people, not a toy operated by a remote control. The v1 prototype successfully drove the pony's mechanical linkages via Bluetooth commands from an ESP32 remote, but its interaction model was still limited to button presses: the pony could only act when a button was pushed and could not respond to what a person said.

The goal of v2 is to equip AI Horse BOOK with open-ended Mandarin voice interaction, enabling two core capabilities:

  • Intelligent Motion: Voice commands drive the N20 motor, which actuates the pony's mechanical linkages to perform actions

  • Music Performance: Voice commands trigger the MIDI module to play corresponding melodies

2) Technical Architecture & Data Pipeline

This solution extends the open-source "XiaoZhi" voice module with custom development and uses MQTT to decouple speech recognition from mechanical control. The complete data flow is as follows:

  • Microphone → XiaoZhi Module (ASR + LLM) → MQTT Broker → ESP32 Main Controller → N20 Motor + MIDI Module

  • XiaoZhi Voice Module: Handles wake-word detection, audio capture, cloud-based ASR, and LLM semantic understanding, then publishes parsed commands to the MQTT Broker

  • MQTT Communication Layer: Serves as the message bridge between the voice module and execution modules, enabling logical decoupling and easy expansion to additional command types

  • ESP32 Main Controller: Subscribes to MQTT topics and processes incoming commands:

      • Motion commands → Drives the N20 worm-gear motor → Pony mechanical linkages perform the action

      • Music commands → Controls the MIDI module via UART/I²C → Plays the corresponding melody
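
The routing step on the main controller can be sketched as follows. This is a minimal host-side Python illustration, not the project's firmware: it assumes payloads use the JSON shape {"cmd":"walk", "param":"fast"} mentioned under Issues & Resolutions, and the command names in MOTION_CMDS and MUSIC_CMDS as well as the function route_command are hypothetical placeholders.

```python
import json

# Hypothetical command vocabulary; the real set depends on what XiaoZhi publishes.
MOTION_CMDS = {"walk", "stop"}
MUSIC_CMDS = {"play"}


def route_command(payload):
    """Parse one MQTT payload (bytes) and pick the subsystem that should handle it.

    Returns a (target, cmd, param) tuple where target is "motor", "midi",
    or "ignored" for malformed or unknown commands.
    """
    try:
        msg = json.loads(payload.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return ("ignored", "", "")
    if not isinstance(msg, dict):
        return ("ignored", "", "")

    cmd = msg.get("cmd")
    param = msg.get("param", "")
    if not isinstance(cmd, str) or not isinstance(param, str):
        return ("ignored", "", "")

    if cmd in MOTION_CMDS:
        return ("motor", cmd, param)
    if cmd in MUSIC_CMDS:
        return ("midi", cmd, param)
    return ("ignored", cmd, param)


if __name__ == "__main__":
    print(route_command(b'{"cmd":"walk", "param":"fast"}'))
    print(route_command(b'{"cmd":"play", "param":"march"}'))
```

Keeping the routing decision in a pure function like this makes it testable without a Broker; the MQTT subscribe callback would only need to pass the received payload in and act on the returned target.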

3) MIDI Module Integration

To give the pony the ability to "perform music," a MIDI module is added to the system:

Item | Description
Module Selection | Standard MIDI synthesizer module (e.g., DFRobot MIDI or VS1053 series)
Communication | UART or I²C, connected to the ESP32 main controller
Trigger Logic | Voice command (e.g., "play music", "play march") → XiaoZhi parses → MQTT publish → ESP32 triggers corresponding MIDI track
Output | MIDI audio output to a small speaker or headphone jack

The MIDI module and N20 motor can work in coordination, for example by playing rhythmic music while the pony walks, so that sound and motion reinforce each other.
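
For the UART path, the bytes sent to a synthesizer are standard MIDI channel messages. The sketch below builds Note On, Note Off, and Program Change messages and turns a short melody into a timed event list. It assumes the chosen module accepts raw MIDI messages over its serial input, which should be checked against the module's documentation; the helper names and the example melody are hypothetical.

```python
def _check(value, limit, name):
    """Raise ValueError unless 0 <= value < limit."""
    if not 0 <= value < limit:
        raise ValueError(f"{name} out of range: {value}")


def note_on(note, velocity=100, channel=0):
    """MIDI Note On: status byte 0x90 | channel, then note and velocity."""
    _check(channel, 16, "channel")
    _check(note, 128, "note")
    _check(velocity, 128, "velocity")
    return bytes([0x90 | channel, note, velocity])


def note_off(note, channel=0):
    """MIDI Note Off: status byte 0x80 | channel, then note and release velocity 0."""
    _check(channel, 16, "channel")
    _check(note, 128, "note")
    return bytes([0x80 | channel, note, 0])


def program_change(program, channel=0):
    """MIDI Program Change: status byte 0xC0 | channel, then instrument number."""
    _check(channel, 16, "channel")
    _check(program, 128, "program")
    return bytes([0xC0 | channel, program])


def melody_to_events(melody, channel=0):
    """Convert [(note, duration_ms), ...] into [(delay_ms, message), ...].

    delay_ms is how long to wait before sending the message.
    """
    events = []
    for note, duration_ms in melody:
        events.append((0, note_on(note, channel=channel)))
        events.append((duration_ms, note_off(note, channel=channel)))
    return events


if __name__ == "__main__":
    # Hypothetical three-note phrase: C4, E4, G4, 250 ms each.
    for delay_ms, message in melody_to_events([(60, 250), (64, 250), (67, 250)]):
        print(delay_ms, message.hex())
```

On the ESP32 the same byte sequences would be written to the UART connected to the module. Classic serial MIDI runs at 31250 baud, though some modules use a different rate or a higher-level command set.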

4) Issues & Resolutions

Issue | Resolution
MQTT Connection Stability: WiFi fluctuations or Broker unavailability prevent voice commands from reaching the execution side | ① Use a reliable MQTT Broker (e.g., EMQX cloud or self-hosted); ② Implement auto-reconnect logic on the ESP32; ③ Retain v1 Bluetooth remote as an offline backup control channel
Protocol Alignment: Command format output by XiaoZhi does not match what the ESP32 expects | Define a unified command protocol (e.g., JSON format: {"cmd":"walk", "param":"fast"}) and ensure both ends parse accordingly
MIDI-Motor Sync Latency: Time gap between music and motion after voice command processing | Add timestamps or sequence numbers to MQTT messages for ordered execution on the ESP32; alternatively, preload MIDI data into cache to reduce real-time parsing overhead
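
Two of the resolutions above can be illustrated with small helpers. The first computes capped exponential backoff delays, one common way to pace auto-reconnect attempts. The second drops duplicate or out-of-order messages using a sequence number. Both are sketches under assumptions: the base and cap values are arbitrary, and the idea of a "seq" field in the message is hypothetical, since the source only proposes adding timestamps or sequence numbers without fixing a format.

```python
def backoff_delays(base_s=1.0, cap_s=30.0, attempts=6):
    """Return the wait time in seconds before each reconnect attempt.

    The delay doubles after every failed attempt and is capped at cap_s.
    """
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


class SequenceFilter:
    """Accept only messages whose sequence number is newer than the last one seen."""

    def __init__(self):
        self.last_seq = -1

    def accept(self, seq):
        # Duplicates and late arrivals have seq <= last_seq and are rejected.
        if seq <= self.last_seq:
            return False
        self.last_seq = seq
        return True


if __name__ == "__main__":
    print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

    seq_filter = SequenceFilter()
    for seq in [1, 2, 2, 1, 3]:
        print(seq, seq_filter.accept(seq))
```

Dropping stale messages is only one way to use sequence numbers; buffering and reordering them is another, at the cost of extra latency.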

5) Conclusion

The v2 prototype successfully upgrades AI Horse BOOK from remote-triggered to voice-interactive operation:

  • The decoupled architecture based on the open-source XiaoZhi module + MQTT cleanly separates speech recognition, motion control, and music performance, allowing each subsystem to function independently

  • The introduction of MQTT provides flexibility for future expansion to more intelligent commands (e.g., lighting control, facial expression display)

  • The v1 Bluetooth remote remains as a backup channel, ensuring basic operability even when WiFi is unstable or the MQTT Broker is unreachable