Recreate a silent speech interface for under $40. Inspired by MIT's AlterEgo (Kapur et al., 2018) and Inner Speech Recognition research (Nieto et al., 2022).
Complete hardware assembly. Every wire explained.
Prices from Amazon.in (March 2026). Total: ₹3,196 (~$38)
Your breadboards have specific layouts. Let's confirm the orientation.
Looking down at it with rows numbered on the left:
Note: Your long board has reversed columns (J to A, not A to J).
From zero to reading muscle signals. Step by step.
Before we start, let's understand what we're doing and make sure you're ready.
Understand the theory before you dive in.
This guide explains what we're doing and why. It doesn't include hands-on exercises.
Ready to start training? → Go to the ML Walkthrough (Hands-On Lab) →
We are not reading thoughts. We are reading the body's shadow of thoughts.
When you speak silently, the muscle signals are tiny. Barely visible above the noise.
We don't jump straight to silent speech. We train in steps:
Machine learning needs examples. Hundreds of them.
We use the ESP32 to capture the muscle voltage 250 times per second (250 Hz). This creates a time-series CSV file for every word you speak.
Human speech muscles move fast. If we measure too slowly, we miss the subtle twitches that distinguish "P" from "B".
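The capture step can be sketched in a few lines of Python. In a real session you would read each sample from the ESP32 over the serial port (for example with the pyserial library); here `read_sample` is a hypothetical stand-in that generates a fake ADC value so the sketch runs on its own:

```python
import csv
import math

SAMPLE_RATE = 250  # Hz, matching the ESP32 capture rate

def read_sample(i):
    # Stand-in for reading one ADC value from the ESP32.
    # A real capture would use pyserial and parse each line the board prints.
    return int(2048 + 500 * math.sin(2 * math.pi * 10 * i / SAMPLE_RATE))

def record_word(label, seconds, path):
    """Record `seconds` worth of samples and save them as a time-series CSV."""
    n = SAMPLE_RATE * seconds
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["t_seconds", "adc_value", "label"])
        for i in range(n):
            writer.writerow([i / SAMPLE_RATE, read_sample(i), label])
    return n

rows = record_word("ONE", seconds=2, path="one_000.csv")
print(rows)  # 500 samples = 2 seconds at 250 Hz
```

One file per spoken word keeps labeling trivial: the label rides along in its own CSV column.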
Raw signals are full of static (from lights, wifi, movement). We need to filter them.
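A common cleanup recipe is a band-pass filter (keep the muscle-signal band, drop slow drift and very high-frequency noise) plus a notch filter at the mains frequency to kill hum from lights and wiring. This sketch assumes SciPy is installed; the exact cutoffs (5-120 Hz, 50 Hz notch) are reasonable defaults, not values prescribed by this guide:

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

FS = 250  # sampling rate in Hz

def clean_emg(raw):
    """Band-pass to the muscle-signal band, then notch out mains hum."""
    # Keep roughly 5-120 Hz: removes slow electrode drift and high noise.
    b, a = butter(4, (5, 120), btype="bandpass", fs=FS)
    x = filtfilt(b, a, raw)
    # Notch at 50 Hz to remove interference from lights and mains wiring.
    # (Use 60 Hz if your mains runs at 60 Hz.)
    bn, an = iirnotch(50, Q=30, fs=FS)
    return filtfilt(bn, an, x)

# Demo: a 10 Hz "muscle" wave buried under strong 50 Hz hum.
t = np.arange(0, 2, 1 / FS)
raw = np.sin(2 * np.pi * 10 * t) + 3 * np.sin(2 * np.pi * 50 * t)
clean = clean_emg(raw)
```

`filtfilt` runs the filter forward and backward, so the cleaned signal stays time-aligned with the raw one.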
This is the magic step. Neural networks don't understand "waves". They understand patterns.
We convert the signal's waveform into a heatmap (a Spectrogram). This lets the AI "see" the word like a picture.
The AI looks at this heatmap and says: "Oh, a bright spot in the bottom-left corner? That usually means 'ONE'."
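The wave-to-heatmap step is a short-time Fourier transform: slice the signal into overlapping frames, take the frequency content of each, and stack the results into an image. A minimal sketch using only NumPy (frame and hop sizes are illustrative choices, not fixed by this guide):

```python
import numpy as np

def spectrogram(signal, frame=64, hop=16):
    """Turn a 1-D signal into a time-frequency heatmap (magnitude STFT)."""
    window = np.hanning(frame)  # taper each slice to reduce edge artifacts
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    # One column of the heatmap per time slice; one row per frequency bin.
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

fs = 250
t = np.arange(0, 2, 1 / fs)
sig = np.sin(2 * np.pi * 30 * t)  # a steady 30 Hz "muscle" tone
heat = spectrogram(sig)
print(heat.shape)  # (frequency_bins, time_frames)
```

In this demo the bright band sits at the row nearest 30 Hz: the "picture" of a steady tone is a single horizontal stripe.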
We show the AI thousands of these heatmaps, labeled "ONE", "TWO", etc.
It starts by guessing randomly. If it guesses "TWO" when the label is "ONE", we tell it to adjust its internal math slightly. Over thousands of repetitions (Epochs), it gets accurate.
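The guess-check-adjust loop above can be shown end to end with a tiny NumPy classifier. The "heatmaps" here are synthetic blobs standing in for real recordings, and the model is plain softmax regression rather than the neural network you will actually train, but the epoch loop and the adjustment rule are the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for labeled heatmaps: class 0 ("ONE") and class 1 ("TWO")
# are Gaussian blobs with different means.
X = np.vstack([rng.normal(-1, 1, (200, 20)), rng.normal(1, 1, (200, 20))])
y = np.array([0] * 200 + [1] * 200)

W = np.zeros((20, 2))  # the model's "internal math": one weight per input
b = np.zeros(2)

def accuracy():
    return np.mean(np.argmax(X @ W + b, axis=1) == y)

before = accuracy()  # starts at chance level
for epoch in range(200):  # each pass over the data is one epoch
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = p.copy()
    grad[np.arange(len(y)), y] -= 1  # wrong guesses push the weights to adjust
    W -= 0.01 * (X.T @ grad) / len(y)
    b -= 0.01 * grad.mean(axis=0)
after = accuracy()  # climbs toward 100% as the epochs accumulate
print(before, after)
```

The "slight adjustment" is the gradient step: every wrong guess nudges the weights a tiny amount in the direction that would have made the right answer more likely.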
Why do we record "Overt" (loud) speech if we want "Silent" speech?
Because loudness doesn't change the shape of the word much. We do the heavy lifting of training on the loud (easy) data, then fine-tune on the silent (hard) data.
It's like learning to drive a truck. Once you know how to drive a truck (Loud), learning to drive a car (Silent) is easy. You transfer the knowledge.
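The two-stage recipe looks like this in miniature. A linear classifier stands in for the real network, and the "silent" data is simulated as a weaker, scarcer copy of the "loud" data; the point is the pattern: a long pre-training run on plentiful data, then a short, gentle (smaller learning rate) fine-tune on the scarce data:

```python
import numpy as np

rng = np.random.default_rng(1)

def train(X, y, W, lr, epochs):
    """Plain softmax-regression updates (same recipe for both stages)."""
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = p.copy()
        grad[np.arange(len(y)), y] -= 1
        W = W - lr * (X.T @ grad) / len(y)
    return W

# Lots of easy "overt" (loud) data...
X_loud = np.vstack([rng.normal(-1, 1, (300, 10)), rng.normal(1, 1, (300, 10))])
y_loud = np.array([0] * 300 + [1] * 300)
# ...and a little hard "silent" data: same words, weaker signal, fewer samples.
X_silent, y_silent = 0.3 * X_loud[::20], y_loud[::20]

W = train(X_loud, y_loud, np.zeros((10, 2)), lr=0.1, epochs=100)  # heavy lifting
W = train(X_silent, y_silent, W, lr=0.01, epochs=20)              # fine-tune
acc = np.mean(np.argmax(X_silent @ W, axis=1) == y_silent)
```

Starting the fine-tune from the pre-trained weights (instead of zeros) is the knowledge transfer: the truck driver sitting down in the car.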
How do we know if the model is good? We use a Confusion Matrix.
| | Pred: A | Pred: B |
|---|---|---|
| Actual: A | Correct | Error |
| Actual: B | Error | Correct |
It tells you exactly what mistakes are happening. If the model thinks "FIVE" is "NINE", you know you need to pronounce those two more distinctly.
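Building the matrix yourself takes only a few lines of NumPy. Rows are the actual labels, columns the predictions, so every off-diagonal cell names one specific confusion:

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows = actual label, columns = predicted label."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        m[a, p] += 1
    return m

actual    = [0, 0, 0, 1, 1, 1]  # 0 = "A", 1 = "B"
predicted = [0, 0, 1, 1, 1, 0]
m = confusion_matrix(actual, predicted, 2)
print(m)
# Diagonal = correct guesses; each off-diagonal cell counts one
# specific kind of mix-up (here: one A called B, one B called A).
```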
Let's train a model to recognize your silent speech. We start small, then grow.
Machine learning is overwhelming if you try to do everything at once. We will use a Curriculum Learning approach:
Phase 1: Just one word ("ONE"). Validation.
Phase 2: Three words ("ONE", "TWO", "THREE"). Classification.
Phase 3: All digits (0-9). Scaling up.
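In code, the curriculum is just a lookup from phase to vocabulary. The `PHASES` dict and `labels_for` helper below are hypothetical names for illustration, not part of the project's actual scripts:

```python
# Each phase only adds words once the previous phase's model works.
PHASES = {
    1: ["ONE"],                  # validation: can we see a signal at all?
    2: ["ONE", "TWO", "THREE"],  # first real classification task
    3: ["ZERO", "ONE", "TWO", "THREE", "FOUR",
        "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"],  # scaling up
}

def labels_for(phase):
    """Words the model must distinguish in a given phase."""
    return PHASES[phase]
```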
We focus on Mouthing (silent lip movement) first. It provides the best balance of signal strength and silence.
Ensure your system is ready.
| What | Status |
|---|---|
| Hardware | Built & Connected via USB |
| Arduino IDE | Installed & Opens |
| Python 3 | Installed (python3 --version) |