Week 17: Wildcard week (Forest Fairy v2 voice)
Fab Academy’s Wildcard week asks for a digital process not already covered in another assignment: documented workflows, problems, fixes, and files someone else could reuse. My individual assignment here is the v2 voice axis of Forest Fairy (森之精灵): upgrading the three plant-companion spirits from a fixed phrase menu (documented on Week 16) toward open-ended Mandarin speech using cloud ASR and TTS on Alibaba Cloud, with microphone and speaker hardware salvaged from a 小智 (XiaoZhi) module after I could not realistically fork their full firmware stack.
Individual assignment
1) Task and motivation
Forest Fairy is meant to feel like a small forest spirit you can talk to, not a remote control with twenty hidden buttons. On the v1 bench I can wake 灵葭 and hit roughly twenty Mandarin command slots baked into ASRPRO, but I cannot say an arbitrary sentence and expect a sensible reply on the TFT. That ceiling showed up every time I wanted the companion to answer a real question instead of playing a menu.
For v2 I want all three spirits on the same trajectory: smarter dialogue, any utterance the mic can pick up, and answers that still land on the display stack I already integrated. The chain I care about is:
microphone → speech recognition → language model → screen (and speaker).
Building studio-grade ASR inside an ESP32 hub was never realistic; tuning Tianwen offline tables already ate weeks on v1. My first shortcut was XiaoZhi, a commercial voice module with wake, streaming ASR, and hooks, because I hoped I could stay inside their ecosystem, improve our TFT UI, and get real-time back-and-forth. That fork did not survive contact with their repository: the codebase was too large and too entangled for the time I had. I could wire the module in, but not reshape it into our Forest Fairy protocol fast enough to count as a reproducible class deliverable.
I changed strategy: harvest XiaoZhi’s microphone and speaker, solder them into the bench I already trust, and own the software path from capture through cloud speech APIs to the ESP32 display front-end.
2) Why this counts as wildcard: v2 cloud voice path
Not covered elsewhere: Week 16 documents offline, table-driven ASRPRO (mailbox + snid). Week 15 documents TFT/UI integration. Neither week documents a record → upload → Alibaba ASR → LLM → Alibaba TTS → playout loop. That cloud speech workflow is what I am claiming for Wildcard week.
The diagram below is the pipeline I actually got working on the bench, swapping the ASRPRO mailbox for captured audio and REST calls, while keeping the ESP32‑WROOM + ILI9341 UI path from v1 as the place assistant text appears.
┌──────────────┐   audio file / stream   ┌────────────────────┐
│ Microphone   │ ───────────────────────▶│ Alibaba Cloud ASR  │
│ (from XiaoZhi│                         │ speech → text      │
│  harvest)    │◀────────────────────────│                    │
└──────────────┘                         └─────────┬──────────┘
                                                   │ UTF-8 text
                                                   ▼
                                         ┌────────────────────┐
                                         │ ESP32 integration  │
                                         │ (WROOM + hub path) │
                                         └─────────┬──────────┘
                                                   │ HTTPS
                                                   ▼
                                         ┌────────────────────┐
                                         │ Cloud LLM          │
                                         │ (assistant reply)  │
                                         └─────────┬──────────┘
                                                   │ reply text
                         ┌─────────────────────────┴─────────────────────────┐
                         ▼                                                   ▼
               ┌────────────────────┐                              ┌────────────────────┐
               │ ILI9341 UI         │                              │ Alibaba Cloud TTS  │
               │ (0x55 TLV / v1)    │                              │ text → speech      │
               └────────────────────┘                              └─────────┬──────────┘
                                                                             ▼
                                                                   ┌────────────────────┐
                                                                   │ Speaker            │
                                                                   │ (from XiaoZhi      │
                                                                   │  harvest)          │
                                                                   └────────────────────┘
DATA_FLOW.md still applies on the WROOM side.
Practical split: I record or buffer utterances on the MCU side, send them to Alibaba’s ASR service, and receive transcript text. That text goes into the firmware path that already knows how to call a large language model; the model’s reply string then goes to Alibaba’s text-to-speech API so the harvested speaker plays Mandarin audio back. When Wi-Fi drops, the v1 offline menu on ASRPRO (documented on Week 16) remains my fallback. I am not pretending v2 can run without the cloud yet.
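To make the hand-offs concrete, here is a minimal host-side sketch of one voice turn in Python. It is not the bench firmware: every stage (`asr`, `llm`, `tts`, `show`, `play`, `offline_menu`) is a hypothetical callable passed in by the caller, and I am assuming a network drop surfaces as `ConnectionError` or `TimeoutError`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    transcript: Optional[str]
    reply_text: Optional[str]
    reply_audio: Optional[bytes]
    used_fallback: bool


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    show: Callable[[str], None],
    play: Callable[[bytes], None],
    offline_menu: Callable[[], None],
) -> TurnResult:
    """Run one voice turn; hand over to the offline menu if a cloud hop fails."""
    try:
        transcript = asr(audio)        # speech -> UTF-8 text
        reply_text = llm(transcript)   # transcript -> assistant reply
        show(reply_text)               # reply text reaches the display first
        reply_audio = tts(reply_text)  # reply text -> synthesized speech
    except (ConnectionError, TimeoutError):
        # Assumption: Wi-Fi loss shows up as one of these two exceptions.
        offline_menu()
        return TurnResult(None, None, None, used_fallback=True)
    play(reply_audio)
    return TurnResult(transcript, reply_text, reply_audio, used_fallback=False)
```

Keeping the stages injectable means the same function can be exercised on a laptop with stubs before any real service or speaker is attached.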
XiaoZhi lesson: treating the module as a black-box ASR computer sounded fast until I opened the repo. I kept the transducers (mic + speaker) because they were already matched to a small voice product, and threw away the dependency on their full application image.
3) Plan
- Bench-test XiaoZhi beside the existing S3 + WROOM stack; confirm whether I can own UI + protocol (§4; outcome: hardware only).
- Desolder microphone and speaker from XiaoZhi; rewire into Forest Fairy power and signal domains.
- Implement capture → audio file → Alibaba ASR → text on the integration MCU.
- Route transcript text through the existing LLM + TFT path (WROOM display front-end, hub HTTPS).
- Close the loop with Alibaba TTS → amplifier → speaker; regression-test I²C/display after each wiring change.
- Archive firmware and API notes in-repo when the tree is cleaned for publication (placeholder below).
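The capture → audio file step in the plan can be sketched on the host side with Python's standard `wave` module. The format here (16 kHz, mono, 16-bit PCM) is my assumption for illustration, not a value taken from the bench, and `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container and return the file bytes."""
    frame_size = channels * sample_width
    if len(pcm) % frame_size != 0:
        raise ValueError("PCM length is not a whole number of frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)   # bytes per sample, so 2 means 16-bit
        w.setframerate(sample_rate)
        w.writeframes(pcm)             # header sizes are filled in on close
    return buf.getvalue()
```

One second of silence, `pcm_to_wav(b"\x00\x00" * 16000)`, gives a file that can be opened in any audio tool to check the header before anything is uploaded.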
4) Build diary: hardware harvest and cloud loop
XiaoZhi on the bench (abandoned software fork)
I spent time reading XiaoZhi examples and UART hooks. Every path wanted their build system, asset bundles, and assumptions about hosts I was not using. I stopped trying to “patch their app” and treated the board as a donor for analog front-end parts instead.
Microphone and speaker modules
Moving analog parts between boards is tedious work: pad cleanup, short harnesses, and re-checking ground reference so I do not reintroduce the I²C/display noise I debugged on Week 16. I photographed each module before heat-sinking the next joint.
System wiring (v2 bench)
Aliyun onboarding log (why I switched and how I registered)
Once the basic cloud loop ran, latency became the problem. ASR → LLM → TTS worked, but each round trip still felt like waiting on a batch job, not talking to something on the desk. I stopped tuning every link in isolation and moved the chain to Alibaba Cloud Bailian, where a managed multimodal workflow already wires ASR, model, and TTS together. Forest Fairy then hooks into that endpoint instead of hand-rolling each REST hop.
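Before blaming any single hop for the delay, it helps to time each one. The sketch below is a generic pattern in Python, not code from the bench: the stage names and callables are placeholders, and `timed_chain` is a hypothetical helper.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def timed_chain(
    stages: List[Tuple[str, Callable[[Any], Any]]],
    first_input: Any,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, Dict[str, float]]:
    """Run stages in order, feeding each output to the next, and record seconds per stage."""
    timings: Dict[str, float] = {}
    value = first_input
    for name, stage in stages:
        start = clock()
        value = stage(value)
        timings[name] = clock() - start
    return value, timings
```

With real stages plugged in, `max(timings, key=timings.get)` names the slowest hop, which is the number to compare before and after moving to a managed workflow.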
I recorded the platform steps below because this is part of what made the Week 17 workflow reproducible: not only wiring and firmware, but also account-side configuration and key management.
Registration produced an app key, a workspace key, and an API key for firmware integration. These are now mapped into my local secret configuration and are not committed to the repository.
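For host-side test scripts, one way to keep the three keys out of the repository is to read them from environment variables. This is a sketch under assumptions: the variable names `FF_APP_KEY`, `FF_WORKSPACE_KEY`, and `FF_API_KEY` are hypothetical, and the firmware itself uses a separate local secret configuration that is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CloudKeys:
    app_key: str
    workspace_key: str
    api_key: str


def load_keys(env: Mapping[str, str] = os.environ) -> CloudKeys:
    """Read the three keys from the environment; fail loudly if any is missing or empty."""
    names = ("FF_APP_KEY", "FF_WORKSPACE_KEY", "FF_API_KEY")  # hypothetical names
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return CloudKeys(*(env[n] for n in names))


def redact(secret: str, keep: int = 4) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= keep:
        return "*" * len(secret)
    return "*" * (len(secret) - keep) + secret[-keep:]
```

Logging `redact(keys.api_key)` instead of the key itself makes it safe to paste bench logs into documentation.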
After registration, the path on the bench is: mic capture → Bailian voice chain → reply text/audio → ESP32 display + speaker. Delay is still there, but I spend less time on glue code and more on prompts, error handling, and the Week 20 demo script.
Cloud ASR / LLM / TTS: what “done” means
The acceptance test I use for v2 voice is conversational, not a slot ID:
- Speak a sentence that is not in the ASRPRO snid table.
- Confirm the firmware obtains a transcript from Alibaba ASR.
- Confirm the ESP32‑WROOM path reaches the cloud LLM and returns reply text.
- See reply text on the ILI9341 chat surface (same UI family as v1).
- Hear the reply from the harvested speaker via Alibaba TTS.
That round trip is working on my bench. I am still cleaning secrets, error handling, and file naming before I publish a single clone-and-flash tree. Until then, this page is the architecture record; the photos are the wiring reference if you rebuild the bench.
Source code (to be linked):
code/week17-individual/. Firmware and API wrapper notes will land here; check the repo or this page again once the export is committed.
Bench demo: v2 open-speech round-trip
The clip below is the on-bench system test I recorded after the cloud loop above was stable: I speak a sentence outside the ASRPRO slot table, the firmware sends audio to Alibaba ASR, the LLM reply appears on the ILI9341, and the harvested speaker plays Alibaba TTS.
5) Conclusion
Wildcard week, for Forest Fairy, is the v2 voice upgrade: a cloud speech workflow that Week 16 deliberately does not cover. v1 remains the integrated ASRPRO → API → ILI9341 baseline; v2 retires the “menu only” feeling by sending real audio to Alibaba ASR and speaking answers back through Alibaba TTS.
I assumed the cloud APIs would be the hard part. They were not. XiaoZhi ate the time: I kept hoping their repo would be a shortcut until I admitted I could not reshape it fast enough. What I kept was simpler: donor mic and speaker, explicit audio files, REST speech services, and the display stack from v1. Slower to document than a vendor demo, but I can name every hop.
Reusable for the final project: three spirits sharing one open-speech pattern, v1 offline commands as fallback, and TFT + DATA_FLOW.md unchanged in spirit. The bench demo clip above is the current evidence that the round trip works. Next improvements: lower latency (streaming ASR instead of file hand-off) and in-repo firmware with a redacted secrets.h once the published tree matches the bench.
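As a first step toward streaming instead of file hand-off, captured audio can be cut into short fixed-duration frames that are sent as they arrive. The sketch below shows only the chunking; the 20 ms frame length and the 16 kHz, 16-bit mono format are assumptions, `pcm_frames` is a hypothetical helper, and the framing a particular streaming ASR service expects is not covered here.

```python
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16000,
               sample_width: int = 2, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration chunks of mono PCM; the last chunk may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    if frame_bytes <= 0:
        raise ValueError("frame_ms too small for this sample rate")
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

At these defaults each frame is 640 bytes, so the recognizer could start working after 20 ms of audio instead of waiting for the whole utterance to be written out.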