3. Software

3.1. Requirements Analysis

3.1.1. Hardware Requirements

  • Main Control Unit: Xiao ESP32-S3
  • Sensors/Devices:
    • Microphone: Capture audio so that the software can detect specific trigger phrases (e.g., "How many people are ahead?").
    • Camera: Capture photos and send them to the main control unit for processing.
    • Display: Show the results of recognition (e.g., text or images).
    • Speaker: Play audio feedback with recognition results.
    • Wi-Fi Module: Connect to the internet for processing and cloud interaction (e.g., uploading photos or remote recognition).

3.1.2. Software Requirements

  • User Interface:
    • Simple and intuitive interface for interaction.
    • Display recognition results clearly.
  • Voice Recognition Module:
    • Capture audio using the microphone.
    • Recognize trigger words or phrases (e.g., "How many people are ahead?").
  • Image Processing Module:
    • Capture a photo and send it to the processing unit.
    • Use an image recognition model (e.g., one run with TensorFlow Lite) to analyze the photo and recognize objects or scenes.
  • Display Module:
    • Show the recognition results (e.g., text, image).
  • Voice Feedback Module (optional):
    • Play audio feedback based on the recognition results.
  • Wi-Fi Module:
    • Connect to the internet, possibly for remote control or uploading recognition results.
  • Power Management:
    • Provide stable power to ensure continuous operation of the device.
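The trigger-phrase check in the Voice Recognition Module can be illustrated with a minimal host-side sketch. It assumes that a speech-to-text step (not shown, and not specified in the requirements) already yields a transcript string. The names TRIGGER_PHRASES, normalize, and is_trigger are hypothetical and exist only for illustration.

```python
import re

# Hypothetical trigger list; only this one phrase appears in the requirements.
TRIGGER_PHRASES = ["How many people are ahead?"]


def normalize(text: str) -> str:
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def is_trigger(transcript: str, phrases=TRIGGER_PHRASES) -> bool:
    """Return True if any trigger phrase occurs in the transcript as whole words."""
    # Pad with spaces so a phrase only matches on word boundaries.
    spoken = f" {normalize(transcript)} "
    return any(f" {normalize(p)} " in spoken for p in phrases)
```

Normalizing both sides makes the comparison tolerant of capitalization, punctuation, and filler words around the phrase. It does not handle misrecognized words, which would need a fuzzier match.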

3.1.3. Functional Requirements

  • Startup & Connection: The device should automatically connect to the Wi-Fi network upon startup and wait for microphone input.
  • Voice Trigger: Trigger photo capture and image recognition via specific voice commands.
  • Image Recognition: Recognize people, objects, or scenes in the captured photo and return results.
  • Voice Feedback (optional): Play feedback information via the speaker based on the recognition results.
  • Display Results: Display the recognition results on the screen.
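The functional flow above can be sketched as a small control loop. This is a host-side illustration under stated assumptions, not firmware for the Xiao ESP32-S3. Every hardware interaction is passed in as a plain callable, and the names Device, handle_once, and make_demo_device are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Device:
    # Each callable stands in for a hardware or network driver (assumed interfaces).
    connect_wifi: Callable[[], bool]
    listen: Callable[[], str]            # returns the recognized speech as text
    capture_photo: Callable[[], bytes]
    recognize: Callable[[bytes], str]    # returns a textual recognition result
    show: Callable[[str], None]
    speak: Optional[Callable[[str], None]] = None  # voice feedback is optional
    trigger: str = "how many people are ahead"

    def start(self) -> bool:
        """Startup & Connection: join the Wi-Fi network before listening."""
        return self.connect_wifi()

    def handle_once(self) -> Optional[str]:
        """Process one utterance; return the result, or None if not triggered."""
        heard = self.listen()
        if self.trigger not in heard.lower():
            return None                                   # no voice trigger
        result = self.recognize(self.capture_photo())     # image recognition
        self.show(result)                                 # display results
        if self.speak is not None:
            self.speak(result)                            # optional voice feedback
        return result


def make_demo_device(utterance: str) -> Tuple[Device, List[str]]:
    """Build a Device wired to fake drivers so the flow can run on a PC."""
    shown: List[str] = []
    device = Device(
        connect_wifi=lambda: True,
        listen=lambda: utterance,
        capture_photo=lambda: b"fake-jpeg",
        recognize=lambda photo: "3 people",   # made-up result for the demo
        show=shown.append,
    )
    return device, shown
```

Passing the drivers in as callables keeps the control flow testable without hardware. On the device, the same steps would be bound to the real microphone, camera, display, and speaker code.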

3.1.4. Performance Requirements

  • Latency: Voice recognition and image processing should respond quickly enough that the result follows the spoken command without a noticeable delay.
  • Accuracy: Voice and image recognition must be accurate enough that the device reliably detects trigger phrases and correctly identifies the objects in a photo.
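One way to make these two requirements measurable is to time each processing stage and to score recognition results against labeled samples. The sketch below is a host-side illustration. The class StageTimer, the function accuracy, and any budget values passed to over_budget are hypothetical, since the requirements do not state numeric targets.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage, in milliseconds."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed

    def total_ms(self):
        return sum(self.durations_ms.values())

    def over_budget(self, budget_ms):
        """Return the stages whose measured time exceeds the given budget."""
        return {
            name: ms
            for name, ms in self.durations_ms.items()
            if ms > budget_ms.get(name, float("inf"))
        }


def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

A run would wrap each step, for example `with timer.stage("image_recognition"): ...`, and then compare `timer.durations_ms` against whatever per-stage budget the project settles on.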

3.2. System Architecture

[Figure: System Architecture]

3.3. Software Design

3.3.1. Voice Recognition
3.3.2. Image Recognition
3.3.3. Display
3.3.4. Voice Feedback
3.3.5. Wi-Fi Connection
3.3.6. Power Management
3.3.7. User Interface
3.3.8. Main Control Unit