3.1. Requirements Analysis
3.1.1. Hardware Requirements
- Main Control Unit: Xiao ESP32-S3
- Sensors/Devices:
  - Microphone: Capture audio so that specific trigger phrases (e.g.,
    "How many people are ahead?") can be recognized.
  - Camera: Capture photos and send them to the main control unit for
    processing.
  - Display: Show the recognition results (e.g., text or images).
  - Speaker: Play audio feedback describing the recognition results.
  - Wi-Fi Module: Connect to the internet for processing and cloud
    interaction (e.g., uploading photos or remote recognition).
3.1.2. Software Requirements
- User Interface:
  - Provide a simple and intuitive interface for interaction.
  - Display recognition results clearly.
- Voice Recognition Module:
  - Capture audio using the microphone.
  - Recognize trigger words or phrases (e.g., “How many people are ahead?”).
- Image Processing Module:
  - Capture a photo and send it to the processing unit.
  - Use an image recognition model (e.g., one run with TensorFlow Lite) to
    analyze the photo and recognize objects or scenes.
- Display Module:
  - Show the recognition results (e.g., text, image).
- Voice Feedback Module (optional):
  - Play audio feedback based on the recognition results.
- Wi-Fi Module:
  - Connect to the internet, possibly for remote control or uploading
    recognition results.
- Power Management:
  - Provide stable power to ensure continuous operation of the device.
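The trigger-phrase check in the Voice Recognition Module could be sketched as below. This is only an illustration: it assumes some earlier speech-to-text step has already produced a text transcript, which the requirements do not specify, and the names `TRIGGER_PHRASES`, `normalize`, and `is_trigger` are hypothetical.

```python
import re

# Hypothetical list of trigger phrases; only this one appears in the requirements.
TRIGGER_PHRASES = ("how many people are ahead",)


def normalize(text: str) -> str:
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def is_trigger(transcript: str, phrases=TRIGGER_PHRASES) -> bool:
    """Return True if the normalized transcript contains any trigger phrase."""
    cleaned = normalize(transcript)
    return any(phrase in cleaned for phrase in phrases)
```

Normalizing before matching means that differences in capitalization, punctuation, or spacing in the transcript do not prevent a match. A keyword-spotting model running on the device would replace this text comparison entirely.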
3.1.3. Functional Requirements
- Startup & Connection: The device should automatically connect to the Wi-Fi
network upon startup and wait for microphone input.
- Voice Trigger: Trigger photo capture and image recognition via specific
voice commands.
- Image Recognition: Recognize people, objects, or scenes in the captured
photo and return results.
- Voice Feedback (optional): Play feedback information via the speaker based
on the recognition results.
- Display Results: Display the recognition results on the screen.
- Latency: The time from voice trigger to displayed result should be as low
as possible so that the interaction feels smooth.
- Accuracy: Voice and image recognition must be accurate enough that the
device reliably detects trigger phrases and correctly identifies objects.
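The functional requirements above describe one cycle: voice trigger, photo capture, image recognition, display, and optional voice feedback. A minimal host-side sketch of that cycle follows. Everything in it is an assumption made for illustration: the `Detection` type, the `"person"` label, the 0.5 confidence threshold, and the callables passed to `handle_utterance` stand in for drivers and models that the document does not define.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detection:
    label: str    # class name reported by the (assumed) recognition model
    score: float  # confidence in [0, 1]


def summarize(detections: List[Detection], min_score: float = 0.5) -> str:
    """Turn detections into the message that is displayed and optionally spoken."""
    people = sum(
        1 for d in detections if d.label == "person" and d.score >= min_score
    )
    noun = "person" if people == 1 else "people"
    return f"{people} {noun} ahead"


def handle_utterance(
    transcript: str,
    is_trigger: Callable[[str], bool],
    capture: Callable[[], bytes],
    recognize: Callable[[bytes], List[Detection]],
    show: Callable[[str], None],
    speak: Optional[Callable[[str], None]] = None,
) -> Optional[float]:
    """Run one trigger -> capture -> recognize -> display cycle.

    Returns the elapsed time in seconds, or None if the utterance
    was not a trigger phrase.
    """
    if not is_trigger(transcript):
        return None
    start = time.monotonic()
    photo = capture()
    message = summarize(recognize(photo))
    show(message)
    if speak is not None:  # voice feedback is optional in the requirements
        speak(message)
    return time.monotonic() - start
```

Passing the hardware-facing steps in as callables keeps the control flow testable without a camera or display attached. The returned elapsed time gives a simple way to track the latency requirement during development.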
3.2. System Architecture

3.3. Software Design
3.3.1. Voice Recognition
3.3.2. Image Recognition
3.3.3. Display
3.3.4. Voice Feedback
3.3.5. Wi-Fi Connection
3.3.6. Power Management
3.3.7. User Interface
3.3.8. Main Control Unit