AgentSkillsCN

text-to-speech

Alex语音合成能力,可将文档内容朗读出来。

SKILL.md
--- frontmatter
name: "text-to-speech"
description: "Alex's voice synthesis capability for reading documents aloud"

Text-to-Speech Skill ⭐ Flagship

Domain: AI Accessibility & Communication
Inheritance: inheritable (promote to Master Alex for all heirs)
Version: 2.0.0
Last Updated: 2026-02-05
Author: Alex (Master Alex)
Status: ⭐ Flagship Skill - Core Alex capability


Why This is a Flagship Skill

Text-to-Speech gives Alex a voice. This transforms Alex from a text-only assistant into a multimodal companion that can:

  • Read documents aloud while you walk, drive, or rest your eyes
  • Proofread by ear - catch errors your eyes miss
  • Accessibility - full document access for vision-impaired users
  • Rehearsal - practice presentations with natural-sounding narration
  • Export knowledge - create MP3s for offline learning

Zero cost, zero dependencies - uses Microsoft Edge TTS (free, no API key) with native TypeScript.


User Experience

🎯 Quick Start: Read Any Document

Keyboard shortcut (fastest):

  1. Open any document in VS Code
  2. (Optional) Select specific text to read only that portion
  3. Press Ctrl+Alt+R (Windows/Linux) or Cmd+Alt+R (macOS)
  4. Audio begins playing through the webview player

Command palette:

  • Ctrl+Shift+P → "Alex: Read Aloud"

📊 Status Bar Feedback

The status bar shows real-time progress during TTS operations:

StateDisplayClick Action
Connecting$(loading~spin) Connecting...-
Synthesizing$(loading~spin) Synthesizing...-
Streaming$(loading~spin) Receiving... 45KB-
Playing$(unmute) Playing 35%Stop
Paused$(unmute) PausedStop

🎵 Webview Audio Player

A sleek panel opens with full playback controls:

code
┌─────────────────────────────────────────────────────────┐
│  Alex TTS Player                                    [×] │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ▶️ ⏹️   ═══════════●══════════   1:23 / 4:56          │
│                                                         │
│  🔊 ────────●────────                                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

Features:

  • Progress bar with scrubbing (click/drag to seek)
  • Play/Pause button - toggle playback
  • Stop button - ends playback and closes panel
  • Volume slider - adjust playback volume
  • Time display - current position / total duration
  • Auto-close - panel closes when playback ends

🎤 Voice Selection

Choose Alex's voice before reading:

  1. Ctrl+Shift+P → "Alex: Read with Voice Selection"
  2. Quick pick appears:
VoiceCharacterBest For
Default (GuyNeural)Professional, clearTechnical docs, code review
Warm (ChristopherNeural)Friendly, conversationalTutorials, READMEs
British (RyanNeural)AuthoritativeFormal documents, presentations
Friendly (DavisNeural)Casual, approachableChat logs, informal content
  1. Select voice → reading begins immediately

💾 Save as MP3

Export any document to audio file:

  1. Ctrl+Shift+P → "Alex: Save as Audio"
  2. Save dialog opens (default name based on document)
  3. Progress notification shows synthesis progress
  4. Success notification with options:
    • Open File - plays in default audio player
    • Open Folder - reveals in file explorer

Use cases:

  • Create podcasts from documentation
  • Generate audio for offline learning
  • Archive presentations as audio

⏹️ Stop Reading

Multiple ways to stop playback:

  • Click status bar (shows $(unmute) icon during playback)
  • Press Escape when reading
  • Click stop button in webview player
  • Close webview panel
  • Ctrl+Shift+P → "Alex: Stop Reading"

📝 Smart Markdown Processing

Alex automatically strips markdown formatting for natural speech:

You WriteAlex Reads
# Heading"Heading." (pause)
**bold text**"bold text" (slight emphasis)
[link text](url)"link text"
`code`"code"
> blockquote"Quote: ..."
---(long pause)

Symbol conversion:

SymbolSpoken As
~5 minutes"about 5 minutes"
50%"50 percent"
A → B"A leads to B"
±5%"plus or minus 5 percent"

For Master Alex (Promotion Notes)

This skill gives Alex a voice. Version 2.0 uses native TypeScript WebSocket integration with Microsoft Edge TTS, eliminating external dependencies. Reading documents aloud with natural-sounding neural voices.

Version 2.0 Changes:

  • Native TypeScript implementation (no Python/MCP dependencies)
  • Direct WebSocket connection to Edge TTS endpoint
  • Webview-based audio player (cross-platform)
  • Integrated as VS Code commands
  • Status bar progress feedback

Why promote to Master:

  • Universal utility across all projects
  • Zero-cost implementation (uses free Edge TTS API)
  • No external dependencies (Python, MCP server)
  • Accessibility benefits for vision-impaired users
  • Integrated into VS Code extension

Dependencies (v2.0):

  • ws npm package (WebSocket client)
  • VS Code webview API (for audio playback)

Overview

Alex's voice synthesis capability using Microsoft Edge TTS via native TypeScript. Enables reading markdown documents, code files, and text aloud with natural-sounding voices. Fully integrated into the VS Code extension.


Architecture (v2.0)

text
┌─────────────────────────────────────────────────────────────┐
│                 Alex VS Code Extension                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Commands:                                                   │
│  • Alex: Read Aloud (Ctrl+Alt+R)                            │
│  • Alex: Read with Voice Selection                          │
│  • Alex: Save as Audio                                      │
│  • Alex: Stop Reading                                       │
│                     │                                        │
│                     ▼                                        │
│  ┌─────────────────────────────────────────────┐            │
│  │           ttsService.ts                       │            │
│  │   Native WebSocket to Edge TTS               │            │
│  │   • SSML generation                          │            │
│  │   • Markdown stripping                       │            │
│  │   • Progress callbacks                       │            │
│  └─────────────────┬───────────────────────────┘            │
│                    │                                         │
│                    ▼                                         │
│  ┌─────────────────────────────────────────────┐            │
│  │           audioPlayer.ts                      │            │
│  │   Webview-based playback                     │            │
│  │   • Cross-platform HTML5 Audio               │            │
│  │   • Play/pause/stop controls                 │            │
│  │   • Progress tracking                        │            │
│  └─────────────────────────────────────────────┘            │
│                                                              │
└──────────────────────┬──────────────────────────────────────┘
                       │ WebSocket (wss://)
                       ▼
┌─────────────────────────────────────────────────────────────┐
│               Microsoft Edge TTS Endpoint                    │
│   wss://speech.platform.bing.com/consumer/speech/...        │
├─────────────────────────────────────────────────────────────┤
│  • 400+ neural voices, 90+ languages                        │
│  • Free, no API key required                                │
│  • MP3 output (24kHz, 48kbps)                               │
│  • SSML support for prosody control                         │
└─────────────────────────────────────────────────────────────┘

Alex Voice Presets

PresetVoice IDCharacter
Defaulten-US-GuyNeuralProfessional male, clear articulation
Warmen-US-ChristopherNeuralFriendly, conversational
Britishen-GB-RyanNeuralBritish accent, authoritative
Friendlyen-US-DavisNeuralCasual, approachable

Voice Selection Rationale

Alex's default voice (GuyNeural) was chosen for:

  • Clarity: Excellent pronunciation of technical terms
  • Neutrality: Not too formal, not too casual
  • Distinctiveness: Recognizable as "Alex's voice"
  • Consistency: Same voice across all platforms

VS Code Commands

Alex: Read Aloud

Command: alex.readAloud
Keybinding: Ctrl+Alt+R (Windows/Linux), Cmd+Alt+R (macOS)

Reads the current selection or entire document aloud using Alex's default voice.

Behavior:

  • If text is selected, reads only the selection
  • If no selection, reads the entire document
  • Markdown files are stripped of formatting for natural speech
  • Progress shown in status bar
  • Click status bar to stop playback

Alex: Read with Voice Selection

Command: alex.readWithVoice

Quick pick to select a voice preset before reading.

Alex: Save as Audio

Command: alex.saveAsAudio

Generate and save speech to an MP3 file. Opens a save dialog for output location.

Alex: Stop Reading

Command: alex.stopReading
Keybinding: Escape (when reading)

Immediately stops current playback.


Implementation Details

Core Files (src/tts/)

FilePurpose
ttsService.tsWebSocket connection, SSML generation, synthesis
audioPlayer.tsWebview panel, playback controls, system fallback
index.tsModule exports

Text Preprocessing

The prepareTextForSpeech() function strips markdown:

MarkdownSpeech Output
# Heading"Heading." (pause)
**bold**"bold" (emphasis via prosody)
*italic*"italic"
`code`"code"
[link]\(url\)"link"
- item"Item."
> quote"Quote: ..."
---(long pause)

Code Block Handling

markdown
```python
def hello():
    print("Hello")
code

Becomes: "Python code block. Definition hello. Print hello. End code block."

### Symbol-to-Speech Transformations

Symbols are converted to natural speech equivalents:

| Symbol | Spoken As | Example |
|--------|-----------|--------|
| `~` | "approximately" or "about" | ~2 min → "about 2 minutes" |
| `&` | "and" | A & B → "A and B" |
| `@` | "at" | user@email → "user at email" |
| `%` | "percent" | 50% → "50 percent" |
| `+` | "plus" | +10% → "plus 10 percent" |
| `→` | "leads to" or "becomes" | A → B → "A becomes B" |
| `—` | (pause) | word—word → "word (pause) word" |
| `#` | (context-dependent) | #1 → "number 1"; ## → (heading marker) |
| `<` / `>` | "less than" / "greater than" | x > 5 → "x greater than 5" |
| `≥` / `≤` | "greater than or equal" / "less than or equal" | |
| `µ` | "micro" | µg → "microgram" |
| `°` | "degrees" | 37°C → "37 degrees celsius" |
| `±` | "plus or minus" | ±5% → "plus or minus 5 percent" |

**Design Principle**: Would a human reading this aloud say the symbol name, or translate it to meaning? Almost always the latter.

---

## Installation (v2.0)

TTS v2 is built into the Alex VS Code extension. No separate installation required.

### Package Dependencies

The extension automatically includes:
- `ws` (WebSocket client for Edge TTS connection)
- `fs-extra` (file operations for audio saving)

### Verification

After extension update, verify TTS works:

1. Open any document
2. Press `Ctrl+Alt+R` (Windows/Linux) or `Cmd+Alt+R` (macOS)
3. Status bar should show "$(unmute) Synthesizing..."
4. Audio should play through webview panel

---

## Usage Patterns

### Read Current Document

Press Ctrl+Alt+R to read document aloud Select text first to read only selection

code

### Generate Audio File

Command Palette → "Alex: Save as Audio" Choose output location → MP3 saved

code

### Voice Customization

Command Palette → "Alex: Read with Voice Selection" Choose: Default | Warm | British | Friendly

code

---

## Edge TTS Technical Reference

### WebSocket Endpoint

wss://speech.platform.bing.com/consumer/speech/synthesize/readaloud/edge/v1 ?TrustedClientToken=6A5AA1D4EAFF4E9FB37E23D68491D6F4 &ConnectionId=[UUID]

code

### Audio Format

- **Codec**: MP3
- **Sample Rate**: 24kHz
- **Bitrate**: 48kbps mono

### SSML Template

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-GuyNeural">
    <prosody rate="+0%" pitch="+0Hz" volume="+0%">
      Text content here
    </prosody>
  </voice>
</speak>

Popular Voice IDs

LanguageVoiceStyle
en-USGuyNeuralProfessional male
en-USJennyNeuralProfessional female
en-USAriaNeuralNews anchor style
en-GBRyanNeuralBritish male
en-GBSoniaNeuralBritish female
en-AUWilliamNeuralAustralian male
en-INNeerjaNeuralIndian English

Accessibility Benefits

Use CaseBenefit
Vision impairedFull document access via audio
MultitaskingReview code while walking/driving
LearningAuditory reinforcement of reading
ProofreadingCatch errors by hearing text
Long documentsListen during breaks

Version History

v2.0.0 (2026-02-06)

  • Native TypeScript implementation
  • Removed Python/MCP server dependencies
  • Webview-based cross-platform audio player
  • VS Code command integration
  • Status bar progress feedback

v1.1.0 (2026-02-05)

  • Added Alex voice presets
  • Enhanced markdown stripping
  • Symbol to speech conversion

v1.0.0 (2026-02-04)

  • Initial implementation via MCP server
  • Python edge-tts integration
  • Basic markdown support

Synapses

  • accessibility: Primary use case enabler
  • vscode-extension-patterns: Extension command patterns
  • markdown-mermaid: Source content processing
  • academic-research: Document reading for research projects
  • gamma-presentations: Audio playback of pitch content for rehearsal and delivery
  • project-management: Stakeholder pitch presentations generated as audio files

Future Enhancements

FeatureStatusNotes
Real-time streamingPlannedStart playing before full generation
SSML supportPlannedFine-grained prosody control
Section navigationPlanned"Skip to next heading"
Bookmark resumePlannedResume from last position
Speed presetsPlanned1x, 1.5x, 2x reading speeds