CP194 Capstone — Oral Defense
A $40 Dual-Channel sEMG System for
Silent Speech Classification
ESP32 + 2× AD8232 | 6 Commands | 2 Studies
Advisor: Prof. Patrick Watson • Second Reader: Prof. Shekhar
Subvocal speech recognition — classifying words a person "says" silently, with no audible output — requires reading facial and laryngeal muscle signals (sEMG). Current systems cost $1,000+ and require custom fabrication.
"Can we replicate MIT AlterEgo with cheaper, off-the-shelf parts?"
"Can a $40 system classify 6 silent commands above chance?"
More honest. More measurable. Doesn't assume the gap can be fully closed.
Random guessing 1 of 6 commands: 1 ÷ 6 = 16.7%. If we beat this at p < 0.001, the hardware is reading something real — not random noise.
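The chance baseline and the p < 0.001 claim can be checked with an exact one-sided binomial test. A sketch, assuming independent trials; the 466/900 example count is illustrative of the 51.8% result, not a figure from the papers:

```python
from math import comb

def binom_p_value(k, n, p=1/6):
    """Exact one-sided tail P(X >= k) for X ~ Binomial(n, p):
    the chance of getting k or more of n trials right by guessing 1 of 6."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
               for i in range(k, n + 1))

# 6 commands -> chance = 1/6 ~ 16.7%.
# e.g. 466 correct of 900 trials (51.8%) is vanishingly unlikely by chance:
assert binom_p_value(466, 900) < 0.001
```

Any accuracy well above 16.7% over hundreds of trials drives this tail probability far below 0.001, which is what licenses "reading something real."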
$1,250+ • 7 electrodes • 24-bit ADC
Custom fabrication required; not reproducible.
$40 • 2 electrodes • 12-bit ADC
Off-the-shelf AD8232 cardiac sensors + ESP32.
Apple acquires Q.ai — $2 billion
Whispered speech + facial muscle detection. AirPods + Vision Pro. Jan 29, 2026.
Merge Labs (Sam Altman + OpenAI) — $250M seed
Goal: "a natural, nonverbal way for anyone to interact with AI." Jan 15, 2026.
Both companies are building what SOMACH demonstrates at $40. The market is no longer speculative.
Four phases of increasing signal complexity.
Tried EEG for silent speech. Hit the acquisition noise floor. Contributed to the NETO paper. Pivoted to sEMG.
Spring 2025 • San Francisco
Phone-as-Xbox-Kinect. Walk/jump/punch → Hollow Knight controller. Android + real-time pipeline.
Sep 2025 • Taipei
Pixel Watch HAR. CNN-LSTM. 94% binary walking accuracy.
Oct–Nov 2025 • Taipei
AD8232 muscle sensing. 18-model benchmark. RF: 74.3% accuracy.
Nov–Dec 2025 • Taipei
Subvocalization. 6-class silent speech. 2 studies, 3 papers.
Dec 2025–Mar 2026 • Hyderabad
Each phase built skills for the next: Android dev → ML pipelines → Hardware integration → Study design
Originally designed as a 3-channel system. The third AD8232 failed mid-project, so the mastoid processes behind both ears were repurposed as a shared ground reference, leaving 2 active channels.
CH 1
AD8232 #1
Mentalis/Chin
MCU
ESP32
250 Hz ADC
USB Serial
Python
pyserial
CH 2
AD8232 #2
Throat/Under-chin
Y-splitter for shared ground reference. 3.5mm electrode cables. Ag/AgCl gel electrodes.
Key insight: The AD8232's bandpass (0.5–40 Hz) matches the range MIT AlterEgo and OpenBCI use for sEMG signal processing — likely coincidental, but it turns out to be sufficient to capture the onset burst that carries the discriminative signal. No hardware modification needed.
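A minimal capture sketch for the ESP32 → pyserial link above, assuming the firmware streams ASCII `ch1,ch2` pairs at 250 Hz over USB serial; the port name, baud rate, and line format here are assumptions, not the project's exact protocol:

```python
import csv

def parse_sample(line):
    """Validate one serial line of the assumed 'ch1,ch2' form; None if garbled."""
    parts = line.strip().split(",")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return int(parts[0]), int(parts[1])
    return None

def record(port="/dev/ttyUSB0", seconds=2.0, rate=250, out="trial.csv"):
    """Log one trial of 2-channel samples to a CSV file."""
    import serial  # pyserial; imported lazily so parsing stays testable offline
    n = int(seconds * rate)
    with serial.Serial(port, 115200, timeout=1) as ser, \
         open(out, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["ch1", "ch2"])
        while n > 0:
            sample = parse_sample(ser.readline().decode(errors="ignore"))
            if sample is not None:  # drop garbled or partial lines
                w.writerow(sample)
                n -= 1
```

Dropping malformed lines at parse time is what feeds the NaN/zero validation step downstream a clean stream.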
Capture
250 Hz, 2-ch serial
Validate
NaN/zero check, trim
Normalize
Z-score per session
Train
5-fold stratified CV
Evaluate
Accuracy ± SE, confusion
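The normalize/train/evaluate steps above can be sketched with NumPy; the array shapes and helper names are mine, not the repo's:

```python
import numpy as np

def zscore_per_session(X):
    """Normalize step: scale trials (trials x time x channels) by the
    session's own mean and std, removing inter-session baseline drift."""
    mu = X.mean(axis=(0, 1), keepdims=True)
    sd = X.std(axis=(0, 1), keepdims=True) + 1e-8
    return (X - mu) / sd

def stratified_folds(labels, k=5, seed=0):
    """Train step: index splits that spread each class evenly across k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(labels):
        for j, i in enumerate(rng.permutation(np.flatnonzero(labels == c))):
            folds[j % k].append(i)
    return [np.sort(np.array(f)) for f in folds]

def mean_and_se(fold_accs):
    """Evaluate step: accuracy +/- standard error across folds."""
    a = np.asarray(fold_accs, dtype=float)
    return a.mean(), a.std(ddof=1) / np.sqrt(len(a))
```

Stratification matters here because with 6 classes and small per-class counts, a plain random split can leave a fold starved of one command.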
Chin (mentalis) + Under-chin (mylohyoid)
4-phase progression: Overt → Mouthing → Exaggerated closed-mouth → Covert subvocalization. Model trained on overt, evaluated on covert.
900 CSVs • 6 classes • CN V (trigeminal)
Chin (mentalis) + Throat (laryngeal)
5-phase Speech Intensity Curriculum: highest energy → lowest. Inspired by the NETO (EEG-to-text) paper's decreasing-difficulty design, not Bengio-style curriculum learning. Descending motor intensity.
1,500 CSVs • 6 classes • CN X (vagus)
Figure — Signal Amplitude Across Curriculum Phases
The 12-bit ADC noise floor (~10 µV) is reached at covert speech — explaining the resolution ceiling.
51.8%
± 2.8% (5-fold CV)
48.9%
± 3.1% (5-fold CV)
Both studies: statistically significant above chance (p < 0.001)
Proves a $40 system can capture discriminative sEMG features during silent speech.
Figure — Study A vs Study B: Electrode Configuration Comparison
Left: single-session accuracy. Center: per-class F1. Right: cross-study transfer (near chance).
Feature-importance analysis: 100% of the discriminative weight concentrates in the first ~20 timesteps (~80 ms). An onset-masking experiment confirmed it: zeroing the first 80 ms collapses accuracy to chance.
62.0%
Baseline
17.6%
Onset Masked
16.7%
Random Chance
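The masking check is simple to reproduce: zero the first 80 ms of every trial (20 samples at 250 Hz) before evaluation. A sketch; the (trials × time × channels) layout is my assumption:

```python
import numpy as np

RATE_HZ = 250

def mask_onset(X, ms=80, rate=RATE_HZ):
    """Zero out the first `ms` milliseconds of each trial
    (trials x time x channels). If accuracy then falls to ~1/6,
    the classifier was leaning entirely on the onset burst."""
    cut = int(round(ms / 1000 * rate))  # 80 ms -> 20 samples at 250 Hz
    Xm = X.copy()                       # leave the original trials intact
    Xm[:, :cut, :] = 0.0
    return Xm
```

Running the same trained model on `mask_onset(X_test)` versus `X_test` is the whole ablation.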
The 12-bit ADC (4,096 levels) can detect the initial motor command spike — the moment the brain fires the signal to the muscles. But it lacks the resolution to read sustained articulation patterns. MIT's 24-bit ADC (16.7M levels) sees both onset and articulation. Our system does onset classification, not continuous silent speech recognition.
Figure — Study A: Training vs. Test Accuracy (Generalization Gap)
Train reaches ~99%. Test plateaus at 51.8%. The gap is the onset-only signal: memorized in-session, fails to generalize.
~50%
Both Study A and Study B
25–31%
Near chance — transfer fails
Study A: CN V (Trigeminal Nerve)
Chin + under-chin → mentalis + mylohyoid muscles. Jaw elevation, lip protrusion.
Study B: CN X (Vagus Nerve)
Chin + throat → mentalis + laryngeal muscles. Vocal fold tension, glottal closure.
Conclusion: Different cranial nerves produce fundamentally different signal patterns. Electrode placement is not interchangeable — it must be standardized for any cross-session or cross-subject generalization.
Figure — Study A: Held-Out Confusion Matrix (5-Fold CV, n=900)
UP and SILENCE are strong. DOWN/LEFT/RIGHT bleed into each other — same onset pattern, different articulation below our noise floor.
From 49.7% to 93.5% via software alone.
Figure — Confidence Gating: Accuracy vs. Coverage Trade-off
θ=0.60 sweet spot: 64.1% accuracy on 62.1% of samples. Higher threshold = more accurate, fewer answers.
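Confidence gating itself is a few lines: answer only when the top-class probability clears a threshold θ. A sketch; the 0.60 operating point is from the figure above, but the helper and its signature are mine:

```python
import numpy as np

def gated_accuracy(probs, labels, theta=0.60):
    """Trade coverage for accuracy: keep only predictions whose top-class
    probability is >= theta. Returns (accuracy on kept, fraction kept)."""
    keep = probs.max(axis=1) >= theta
    coverage = float(keep.mean())
    if not keep.any():
        return float("nan"), 0.0  # gate rejected everything
    preds = probs.argmax(axis=1)
    acc = float((preds[keep] == np.asarray(labels)[keep]).mean())
    return acc, coverage
```

Sweeping θ over the validation set and plotting `(coverage, accuracy)` pairs reproduces the trade-off curve in the figure.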
arXiv Papers
P1: Curriculum, P2: Electrode, P3: EMG Benchmark
CSV Data Files
87 MB • Open dataset
Phase Repos
GitHub • MIT License
Blog Posts
Full journey documentation
Python Scripts
E2E pipeline
Arduino Sketches
ESP32 firmware
Instructables Pages
Reproducibility guide
Websites Deployed
somach.vercel.app + Kaggle dataset
Terminal UI Tool
Custom recorder + real-time display built for the study
live_demo.py — Watch the model classify silent speech in real time.
250
Hz Input
2
sEMG Channels
6
Commands
UP • DOWN • LEFT • RIGHT • SILENCE • NOISE
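A sketch of the demo loop: buffer a one-second window from the 250 Hz, 2-channel stream, classify it, and emit a command only when the confidence gate clears. The `model` callable and the non-overlapping window policy are placeholders, not `live_demo.py` itself:

```python
import collections
import numpy as np

COMMANDS = ["UP", "DOWN", "LEFT", "RIGHT", "SILENCE", "NOISE"]

def run_demo(samples, model, window=250, theta=0.60):
    """Consume (ch1, ch2) samples; each full 1 s window is classified, and a
    command is emitted only when top-class confidence clears the gate."""
    buf = collections.deque(maxlen=window)
    emitted = []
    for s in samples:
        buf.append(s)
        if len(buf) == window:
            probs = model(np.asarray(buf, dtype=float))  # (window, 2) -> (6,)
            if probs.max() >= theta:
                emitted.append(COMMANDS[int(probs.argmax())])
            buf.clear()  # non-overlapping windows keep the sketch simple
    return emitted
```

A real-time version would read `samples` from the serial port and overlap windows for lower latency; the classify-then-gate structure is the same.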
RESEARCH ROADMAP
24-bit ADC (ADS1299) to capture sustained articulation, not just onset. Bridge the onset→articulation gap — the same upgrade that separates MIT from this work.
Current data: single subject (n=1). Expand to 5–10 subjects with standardized electrode placement protocol. Required before any real-world generalization claim.
4,033 CSVs → Kaggle/HuggingFace. Enable community replication and set a benchmark for low-cost sEMG silent speech — a field with no open standard yet.
COMPANY ROADMAP
One Python package abstracting 8+ EEG/EMG headsets into a single API. Developers build once; it runs on SOMACH, OpenBCI, Muse, and future form factors.
Manufacture and distribute 100 SOMACH units for independent user studies. Open-source hardware BOM. Prove cross-subject generalization at scale.
Apple and Merge Labs are shipping. The 12–18 month window before consumer hardware arrives is the window to establish an open-source developer ecosystem that won't disappear when AirPods do this natively.
ORAL DEFENSE — MARCH 10, 2026 — UNANIMOUS PASS
"Congratulations Carl, this was an easy pass for us. Hardware projects are difficult, and yours worked. Your documentation throughout has been exceptional. This was one of my favorite projects in a long time to advise."
— Prof. Patrick Watson, CS/ML (Advisor)
"Getting hardware to work under less than ideal circumstances — low budget — is amazing. You could make a strong case for why you should be part of the MIT team. I would call yours an ideal capstone trajectory."
— Prof. Shekhar, Signal Processing (Second Reader)
"Carl did this specific thing, which is a very difficult engineering challenge, and here's how he got it to work."
— Prof. Patrick Watson
Minerva University Class of 2026
kho@uni.minerva.edu
Thank you.