Week 17 - Wildcard Week (AI Horse BOOK v2 Voice Edition)
1) Task & Motivation
AI Horse BOOK is designed to be a mechanical pony that engages in natural conversation with people, not a toy operated by a remote control. The v1 prototype successfully drove the pony's mechanical linkages via Bluetooth commands from an ESP32 remote, but its interaction model remained stuck at the "button-trigger" stage and lacked intelligent responsiveness.
The goal of v2 is to equip AI Horse BOOK with open-ended Mandarin voice interaction, enabling two core capabilities:
- Intelligent Motion: Voice commands drive the N20 motor, which actuates the pony's mechanical linkages to perform actions.
- Music Performance: Voice commands trigger the MIDI module to play the corresponding melodies.
2) Technical Architecture & Data Pipeline
This solution customizes the open-source "XiaoZhi" voice module and uses MQTT to decouple speech recognition from mechanical control. The complete data flow is as follows:
Microphone → XiaoZhi Module (ASR + LLM) → MQTT Broker → ESP32 Main Controller → N20 Motor + MIDI Module

- XiaoZhi Voice Module: Handles wake-word detection, audio capture, cloud-based ASR, and LLM semantic understanding, then publishes parsed commands to the MQTT Broker.
- MQTT Communication Layer: Serves as the message bridge between the voice module and the execution modules, enabling logical decoupling and easy expansion to additional command types.
- ESP32 Main Controller: Subscribes to MQTT topics and processes incoming commands:
  - Motion commands → Drives the N20 worm-gear motor → Pony mechanical linkages perform the action
  - Music commands → Controls the MIDI module via UART/I²C → Plays the corresponding melody
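The routing step on the main controller can be sketched in a few lines. This is a minimal illustration in plain Python, not the firmware used in the project: the topic name `horse/cmd`, the command names `walk` and `play`, and the two placeholder handlers are all assumptions made for the example, and the payload is assumed to be a JSON object with `cmd` and `param` fields.

```python
import json

# Hypothetical topic name; the write-up does not specify the MQTT topics used.
COMMAND_TOPIC = "horse/cmd"


def drive_motor(param):
    """Placeholder for the N20 motor driver; returns a description of the action."""
    return f"motor:{param}"


def play_melody(param):
    """Placeholder for the MIDI module driver; returns a description of the action."""
    return f"midi:{param}"


# Map each (assumed) command name to the handler that executes it.
HANDLERS = {
    "walk": drive_motor,
    "play": play_melody,
}


def handle_message(topic, payload):
    """Decode one MQTT message and route it to the matching handler.

    Returns the handler's result, or None if the message is ignored.
    """
    if topic != COMMAND_TOPIC:
        return None
    try:
        message = json.loads(payload)
    except (ValueError, TypeError):
        return None  # payload is not valid JSON
    if not isinstance(message, dict):
        return None  # valid JSON, but not an object
    cmd = message.get("cmd")
    if not isinstance(cmd, str):
        return None  # missing or malformed command name
    handler = HANDLERS.get(cmd)
    if handler is None:
        return None  # unknown command
    return handler(message.get("param"))
```

Keeping the handlers in a lookup table means that a new command type only needs one new entry, which matches the expansion goal of the MQTT layer.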
3) MIDI Module Integration
To give the pony the ability to "perform music," a MIDI module is added to the system:
| Item | Description |
|---|---|
| Module Selection | Standard MIDI synthesizer module (e.g., DFRobot MIDI or VS1053 series) |
| Communication | UART or I²C, connected to the ESP32 main controller |
| Trigger Logic | Voice command (e.g., "play music", "play march") → XiaoZhi parses → MQTT publish → ESP32 triggers corresponding MIDI track |
| Output | MIDI audio output to a small speaker or headphone jack |
The MIDI module and N20 motor can work in coordination, for example by playing rhythmic music while the pony walks, for a more immersive audio-visual experience.
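To make the "trigger a melody" step concrete, the sketch below builds the raw bytes of standard MIDI Note On and Note Off messages. It assumes the chosen module accepts standard MIDI channel messages on its serial input, which should be checked against the module's datasheet. The function names and the `(note, duration_ms)` melody format are hypothetical, introduced only for this example.

```python
def _check_range(name, value, upper):
    """Raise ValueError if value is outside 0..upper (inclusive)."""
    if not 0 <= value <= upper:
        raise ValueError(f"{name} must be in 0..{upper}, got {value}")


def note_on(note, velocity=100, channel=0):
    """Build a 3-byte MIDI Note On message (status 0x90 plus channel)."""
    _check_range("note", note, 127)
    _check_range("velocity", velocity, 127)
    _check_range("channel", channel, 15)
    return bytes([0x90 | channel, note, velocity])


def note_off(note, channel=0):
    """Build a 3-byte MIDI Note Off message (status 0x80 plus channel)."""
    _check_range("note", note, 127)
    _check_range("channel", channel, 15)
    return bytes([0x80 | channel, note, 0])


def melody_messages(melody, channel=0):
    """Expand (note, duration_ms) pairs into (message, wait_ms) steps.

    The caller sends each message, then waits wait_ms before the next one.
    """
    steps = []
    for note, duration_ms in melody:
        # Sound the note and hold it for its duration.
        steps.append((note_on(note, channel=channel), duration_ms))
        # Release the note; the next note follows without a gap.
        steps.append((note_off(note, channel=channel), 0))
    return steps
```

For example, `melody_messages([(60, 250), (64, 250)])` describes middle C followed by the E above it, each held for 250 ms. On the ESP32, each message would be written to the UART connected to the module.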
4) Issues & Resolutions
| Issue | Resolution |
|---|---|
| MQTT Connection Stability: WiFi fluctuations or Broker unavailability prevent voice commands from reaching the execution side | ① Use a reliable MQTT Broker (e.g., EMQX cloud or self-hosted); ② Implement auto-reconnect logic on the ESP32; ③ Retain v1 Bluetooth remote as an offline backup control channel |
| Protocol Alignment: The command format output by XiaoZhi does not match what the ESP32 expects | Define a unified command protocol (e.g., JSON format: `{"cmd":"walk", "param":"fast"}`) and have both ends serialize and parse it the same way |
| MIDI-Motor Sync Latency: Time gap between music and motion after voice command processing | Add timestamps or sequence numbers to MQTT messages for ordered execution on the ESP32; alternatively, preload MIDI data into cache to reduce real-time parsing overhead |
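Two of the resolutions above, auto-reconnect and sequence numbers, can be illustrated with a short sketch. The delay values, the helper name `backoff_delays`, and the class `SequenceGate` are assumptions chosen for the example. Dropping stale messages is only one possible policy for ordered execution; buffering and reordering would be another.

```python
def backoff_delays(base_s=1.0, cap_s=30.0, attempts=6):
    """Return the wait (in seconds) before each reconnect attempt.

    The wait doubles after every failed attempt and is capped at cap_s,
    so a long Broker outage does not cause a tight retry loop.
    """
    return [min(cap_s, base_s * (2 ** i)) for i in range(attempts)]


class SequenceGate:
    """Accept a message only if its sequence number is newer than the last one seen."""

    def __init__(self):
        self.last_seq = None  # no message accepted yet

    def accept(self, seq):
        """Return True if seq should be executed, False if it is stale or a duplicate."""
        if self.last_seq is not None and seq <= self.last_seq:
            return False
        self.last_seq = seq
        return True
```

With the defaults, the reconnect waits grow from 1 s to the 30 s cap. A real deployment would also need to reset the gate when the publisher restarts and its counter begins again from zero.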
5) Conclusion
The v2 prototype successfully upgrades AI Horse BOOK from remote-triggered to voice-interactive operation:
- The decoupled architecture based on the open-source XiaoZhi module + MQTT cleanly separates speech recognition, motion control, and music performance, allowing each subsystem to function independently.
- The introduction of MQTT provides flexibility for future expansion to more intelligent commands (e.g., lighting control, facial expression display).
- The v1 Bluetooth remote remains as a backup channel, ensuring basic operability even when the wireless network is unstable.