Week 17: Wildcard week (Forest Fairy v2 voice)

Fab Academy’s Wildcard week asks for a digital process not already covered in another assignment, documented with workflows, problems, fixes, and files someone else could reuse. My individual assignment is the v2 voice axis of Forest Fairy (森之精灵): upgrading the three plant-companion spirits from a fixed phrase menu (documented in Week 16) toward open-ended Mandarin speech using cloud ASR and TTS on Alibaba Cloud. The microphone and speaker are salvaged from a 小智 (XiaoZhi) module, after I found I could not realistically fork their full firmware stack.

Individual assignment

1) Task and motivation

Forest Fairy is meant to feel like a small forest spirit you can talk to, not a remote control with twenty hidden buttons. On the v1 bench I can wake 灵葭 and hit roughly twenty Mandarin command slots baked into ASRPRO, but I cannot say an arbitrary sentence and expect a sensible reply on the TFT. That ceiling showed up every time I wanted the companion to answer a real question instead of playing a menu.

For v2 I want all three spirits on the same trajectory: smarter dialogue, any utterance the mic can pick up, and answers that still land on the display stack I already integrated. The chain I care about is:

microphone → speech recognition → language model → screen (and speaker).

Building studio-grade ASR inside an ESP32 hub was never realistic; tuning Tianwen offline tables already ate weeks on v1. My first shortcut was XiaoZhi, a commercial voice module with wake, streaming ASR, and hooks, because I hoped I could stay inside their ecosystem, improve our TFT UI, and get real-time back-and-forth. That fork did not survive contact with their repository: the codebase was too large and too entangled for the time I had. I could wire the module in, but not reshape it into our Forest Fairy protocol fast enough to count as a reproducible class deliverable.

I changed strategy: harvest XiaoZhi’s microphone and speaker, solder them into the bench I already trust, and own the software path from capture through cloud speech APIs to the ESP32 display front-end.

2) Why this counts as wildcard: v2 cloud voice path

Not covered elsewhere: Week 16 documents offline, table-driven ASRPRO (mailbox + snid). Week 15 documents TFT/UI integration. Neither week documents a record → upload → Alibaba ASR → LLM → Alibaba TTS → playout loop. That cloud speech workflow is what I am claiming for Wildcard week.

The diagram below is the pipeline I actually got working on the bench, swapping the ASRPRO mailbox for captured audio and REST calls, while keeping the ESP32‑WROOM + ILI9341 UI path from v1 as the place assistant text appears.

  ┌─────────────┐   audio file / stream   ┌────────────────────┐
  │ Microphone  │ ───────────────────────▶│ Alibaba Cloud ASR  │
  │ (XiaoZhi    │                         │ speech → text      │
  │  harvest)   │                         │                    │
  └─────────────┘                         └─────────┬──────────┘
                                                    │ UTF-8 text
                                                    ▼
                                          ┌────────────────────┐
                                          │ ESP32 integration  │
                                          │ (WROOM + hub path) │
                                          └─────────┬──────────┘
                                                    │ HTTPS
                                                    ▼
                                          ┌────────────────────┐
                                          │ Cloud LLM          │
                                          │ (assistant reply)  │
                                          └─────────┬──────────┘
                                                    │ reply text
                          ┌─────────────────────────┴─────────────────────────┐
                          ▼                                                   ▼
                ┌────────────────────┐                             ┌────────────────────┐
                │ ILI9341 UI         │                             │ Alibaba Cloud TTS  │
                │ (0x55 TLV / v1)    │                             │ text → speech      │
                └────────────────────┘                             └─────────┬──────────┘
                                                                             ▼
                                                                   ┌────────────────────┐
                                                                   │ Speaker            │
                                                                   │ (from XiaoZhi      │
                                                                   │  harvest)          │
                                                                   └────────────────────┘
                
Forest Fairy v2, open speech: capture and file hand-off locally; Alibaba handles ASR and TTS; the ESP32 stack forwards text to the LLM and UI. Display discipline from DATA_FLOW.md still applies on the WROOM side.

Practical split: I record or buffer utterances on the MCU side, send them to Alibaba’s ASR service, and receive transcript text. That text goes into the firmware path that already knows how to call a large language model; the model’s reply string then goes to Alibaba’s text-to-speech API so the harvested speaker plays Mandarin audio back. When Wi-Fi drops, the v1 offline menu on ASRPRO (documented in Week 16) remains my fallback. I am not pretending v2 is fully edge-only yet.
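The split above can be sketched as a single turn function. This is a host-side Python illustration, not the bench firmware: the names `run_turn` and `TurnResult` and the four injected callables are hypothetical, and the real Alibaba and LLM calls are deliberately left as parameters so that no API details are invented here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    reply_audio: Optional[bytes]  # None when TTS was skipped or failed
    used_fallback: bool


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    offline_fallback: Callable[[], str],
) -> TurnResult:
    """Run one utterance through ASR -> LLM -> TTS with an offline fallback."""
    try:
        transcript = asr(audio)
        reply = llm(transcript)
    except OSError:
        # ConnectionError and TimeoutError are OSError subclasses, so a
        # dropped network lands here and the offline path answers instead.
        return TurnResult("", offline_fallback(), None, True)
    try:
        speech: Optional[bytes] = tts(reply)
    except OSError:
        speech = None  # keep the text for the display even without audio
    return TurnResult(transcript, reply, speech, False)
```

Passing the services in as callables keeps the order of hops visible in one place and lets the fallback behaviour be exercised with stubs before any network code exists.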

XiaoZhi lesson: treating the module as a black-box ASR computer sounded fast until I opened the repo. I kept the transducers (mic + speaker) because they were already matched to a small voice product, and threw away the dependency on their full application image.

3) Plan

  1. Bench-test XiaoZhi beside the existing S3 + WROOM stack; confirm whether I can own UI + protocol (§4; outcome: hardware only).
  2. Desolder microphone and speaker from XiaoZhi; rewire into Forest Fairy power and signal domains.
  3. Implement capture → audio file → Alibaba ASR → text on the integration MCU.
  4. Route transcript text through the existing LLM + TFT path (WROOM display front-end, hub HTTPS).
  5. Close the loop with Alibaba TTS → amplifier → speaker; regression-test I²C/display after each wiring change.
  6. Archive firmware and API notes in-repo when the tree is cleaned for publication (placeholder below).
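Step 3 of the plan hinges on turning captured samples into an audio file a speech service can accept. A minimal sketch using Python's standard `wave` module follows; the 16 kHz, mono, 16-bit format is an assumption chosen for illustration, not a value taken from the bench, and `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container, in memory."""
    frame_size = channels * sample_width
    if len(pcm) % frame_size != 0:
        raise ValueError("PCM length is not a whole number of frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    # wave does not close a file object it did not open itself,
    # so the buffer is still readable here.
    return buf.getvalue()
```

On the microcontroller the same idea applies: a fixed header describing rate, channels, and sample width is written in front of the raw sample buffer.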

4) Build diary: hardware harvest and cloud loop

XiaoZhi on the bench (abandoned software fork)

Breadboard with XiaoZhi voice module wired beside ESP32 display and hub boards
XiaoZhi beside the v1 stack (2026‑05‑25): I could power the module and probe buses, but integrating streaming recognition + our ILI9341 UI inside their codebase was not tractable in the week I had. This photo marks the pivot point: keep transducers, replace software.

I spent time reading XiaoZhi examples and UART hooks. Every path wanted their build system, asset bundles, and assumptions about hosts I was not using. I stopped trying to “patch their app” and treated the board as a donor for analog front-end parts instead.

Microphone and speaker modules

Close-up of desoldered microphone module from XiaoZhi ready to wire into Forest Fairy
Microphone harvest: XiaoZhi’s mic block after desoldering, reused on the Forest Fairy bench for capture into our record/ASR path.
Close-up of desoldered speaker module from XiaoZhi ready for TTS audio output
Speaker harvest: the matching speaker block for playout after Alibaba TTS returns audio.

Moving analog parts between boards is tedious work: pad cleanup, short harnesses, and re-checking ground reference so I do not reintroduce the I²C/display noise I debugged in Week 16. I photographed each module before heat-sinking the next joint.

System wiring (v2 bench)

Overview photo of Forest Fairy v2 bench wiring: ESP32 boards, harvested mic and speaker, and interconnects
v2 bench overview: ESP32 hub + WROOM display stack, harvested mic/speaker, and shared power/ground discipline. This is the state where open speech round-trip (ASR → LLM → TTS) succeeds.

Alibaba Cloud (Aliyun) onboarding log (why I switched and how I registered)

Once the basic cloud loop ran, latency became the problem. ASR → LLM → TTS worked, but each round trip still felt like waiting on a batch job, not talking to something on the desk. I stopped tuning every link in isolation and moved the chain to Alibaba Cloud Bailian, where a managed multimodal workflow already wires ASR, model, and TTS together. Forest Fairy then hooks into that endpoint instead of hand-rolling each REST hop.
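One way to see which hop dominates the wait is to time each stage of a turn separately. The sketch below is a generic per-stage timer and not a measurement from the bench; `StageTimer` is a hypothetical name, and the clock is injectable so the logic can be checked without real network calls.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict


class StageTimer:
    """Accumulate wall-clock time per named pipeline stage."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self.clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = self.clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        return sum(self.durations.values())

    def slowest(self) -> str:
        return max(self.durations, key=self.durations.get)
```

Wrapping each hop in `with timer.stage("asr"):`, `with timer.stage("llm"):`, and `with timer.stage("tts"):` gives a per-turn breakdown that can be compared before and after a change to the chain.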

I recorded the platform steps below because this is part of what made the Week 17 workflow reproducible: not only wiring and firmware, but also account-side configuration and key management.

Alibaba Cloud Bailian app center showing multimodal interaction development toolkit
Step 1 - Enter Bailian application workspace: I opened the multimodal interaction toolkit in Alibaba Cloud and confirmed I was in the correct workspace before creating anything.
Alibaba Cloud create application page in Bailian
Step 2 - Create application: I started a new app instance for Forest Fairy v2 so the voice chain could be managed as a dedicated project instead of test snippets.
Alibaba Cloud app editor interface with configuration fields
Step 3 - Edit application settings: I configured app behavior and model-side parameters so the returned style matches the Mandarin interactive tone I want on the ILI9341 UI.
Alibaba Cloud page confirming Forest Fairy app creation successful
Step 4 - Confirm creation: the platform shows the Forest Fairy app was created successfully, which marks the handoff from local trial to a stable cloud endpoint.
Alibaba Cloud console showing app key, workspace key, and API key retrieval
Step 5 - Retrieve credentials: I collected app key, workspace key, and API key for firmware integration. These are now mapped into my local secret configuration and are not committed into the repository.
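Keeping the three credentials out of the repository can be sketched as follows. This is an illustration of the pattern only: the environment variable names (`FF_BAILIAN_APP_KEY` and so on) and the `BailianCredentials` type are hypothetical, not the names used in the bench configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names, one per credential from the console.
ENV_NAMES = {
    "app_key": "FF_BAILIAN_APP_KEY",
    "workspace_key": "FF_BAILIAN_WORKSPACE_KEY",
    "api_key": "FF_BAILIAN_API_KEY",
}


@dataclass(frozen=True)
class BailianCredentials:
    app_key: str
    workspace_key: str
    api_key: str

    def __repr__(self) -> str:
        # Never echo secret values into logs or tracebacks.
        return "BailianCredentials(app_key=***, workspace_key=***, api_key=***)"


def load_credentials(env: Optional[Mapping[str, str]] = None) -> BailianCredentials:
    """Read all three keys from the environment, failing loudly if any is absent."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return BailianCredentials(**{field: env[name] for field, name in ENV_NAMES.items()})
```

Failing at startup with the names of the missing variables is friendlier than a rejected request later in the chain, and the masked `repr` keeps keys out of serial logs.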

After registration, the path on the bench is: mic capture → Bailian voice chain → reply text/audio → ESP32 display + speaker. Delay is still there, but I spend less time on glue code and more on prompts, error handling, and the Week 20 demo script.

Cloud ASR / LLM / TTS: what “done” means

The acceptance test I use for v2 voice is conversational, not a slot ID:

  1. Speak a sentence that is not in the ASRPRO snid table.
  2. Confirm the firmware obtains a transcript from Alibaba ASR.
  3. Confirm the ESP32‑WROOM path reaches the cloud LLM and returns reply text.
  4. See reply text on the ILI9341 chat surface (same UI family as v1).
  5. Hear the reply from the harvested speaker via Alibaba TTS.
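The five steps above can be written down as a checklist function so each bench attempt is logged the same way. This is a sketch under assumptions: `TurnRecord` and its fields are hypothetical, the slot phrases passed in stand for whatever the ASRPRO table actually contains, and steps 4 and 5 still rely on a person looking at the screen and listening.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TurnRecord:
    """Observations from one acceptance attempt."""
    spoken: str         # sentence said into the mic
    transcript: str     # text returned by cloud ASR
    reply_text: str     # text returned by the LLM
    shown_on_tft: bool  # operator saw the reply on the ILI9341
    audio_bytes: int    # size of the TTS audio that was played


def acceptance_failures(turn: TurnRecord, slot_phrases: Iterable[str]) -> List[str]:
    """Return the failed steps; an empty list means the turn passed."""
    failures = []
    if turn.spoken.strip() in {p.strip() for p in slot_phrases}:
        failures.append("step 1: sentence is already in the offline slot table")
    if not turn.transcript.strip():
        failures.append("step 2: no transcript from ASR")
    if not turn.reply_text.strip():
        failures.append("step 3: no reply text from the LLM")
    if not turn.shown_on_tft:
        failures.append("step 4: reply not seen on the TFT")
    if turn.audio_bytes <= 0:
        failures.append("step 5: no TTS audio played")
    return failures
```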

That round trip is working on my bench. I am still cleaning secrets, error handling, and file naming before I publish a single clone-and-flash tree. Until then, this page is the architecture record; the photos are the wiring reference if you rebuild the bench.

Source code (to be linked): code/week17-individual/. Firmware and API wrapper notes will land here; check the repo or this page again once the export is committed.

Bench demo: v2 open-speech round-trip

The clip below is the on-bench system test I recorded after the cloud loop above was stable: I speak a sentence outside the ASRPRO slot table, the firmware sends audio to Alibaba ASR, the LLM reply appears on the ILI9341, and the harvested speaker plays Alibaba TTS.

v2 acceptance clip (2026‑06‑05): live mic → cloud ASR/LLM/TTS → TFT text + speaker audio on the Forest Fairy v2 bench wiring shown above.

5) Conclusion

Wildcard week, for Forest Fairy, is the v2 voice upgrade: a cloud speech workflow that Week 16 deliberately does not cover. v1 remains the integrated ASRPRO → API → ILI9341 baseline; v2 retires the “menu only” feeling by sending real audio to Alibaba ASR and speaking answers back through Alibaba TTS.

I assumed the cloud APIs would be the hard part. They were not. XiaoZhi ate the time: I kept hoping their repo would be a shortcut until I admitted I could not reshape it fast enough. What I kept was simpler: donor mic and speaker, explicit audio files, REST speech services, and the display stack from v1. Slower to document than a vendor demo, but I can name every hop.

Reusable for the final project: three spirits sharing one open-speech pattern; v1 offline commands as fallback; TFT + DATA_FLOW.md unchanged in spirit. The bench demo clip above is the current evidence that the round-trip works; next improvements: lower latency (streaming ASR instead of file hand-off) and in-repo firmware with redacted secrets.h once the published tree matches the bench.
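The move from file hand-off to streaming mostly means sending audio in small frames as it is captured instead of waiting for the whole utterance. A minimal sketch of the framing step follows; the 100 ms frame length and the 16 kHz, 16-bit format are assumptions for illustration, `pcm_frames` is a hypothetical helper, and the actual streaming transport is left out because it depends on the service chosen.

```python
from typing import Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16000,
               sample_width: int = 2, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration chunks of mono PCM; the last chunk may be shorter."""
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * sample_width  # always whole samples
    if frame_bytes <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

With these defaults each frame is 3200 bytes, so recognition can start after the first tenth of a second rather than after the full recording.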