# PARLE Speech-to-Speech

Complete Speech-to-Speech pipeline for teaching Portuguese, deployable across multiple platforms.

## Deployment Options

| Platform | Directory | Protocol | GPU |
| --- | --- | --- | --- |
| Modal | `modal/` | HTTP + SSE streaming | L4, A10G, A100, H100 |
| TensorDock / VAST.ai | `app.py` | WebSocket | RTX 3090, RTX 4090 |

## Pipeline (Modal - Current)

```
Audio -> Whisper v3 Turbo (STT) -> Ministral 3B (LLM) -> Qwen3-TTS (TTS) -> Audio
```

All three models run in a single GPU container (the "Moshi pattern") to minimize latency.
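
To make the pattern concrete, here is a minimal, hypothetical sketch of a Modal class that loads a model once per container start and keeps it warm on the GPU. The app name, class, and method names are illustrative only, not the actual API of `modal/modal_parle_whisper.py`; only the STT stage is shown.

```python
# Hypothetical sketch of the "Moshi pattern": models are loaded once per
# container and stay resident on the GPU between requests.
import modal

app = modal.App("parle-sketch")  # illustrative name
image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")                 # Whisper needs ffmpeg to decode bytes
    .pip_install("torch", "transformers")
)

@app.cls(gpu="L4", image=image)
class SpeechPipeline:
    @modal.enter()
    def load_models(self):
        # Runs once per container start. In the real app the LLM and TTS
        # models would be loaded here the same way, so each request pays
        # only inference time, never model load time.
        from transformers import pipeline
        self.stt = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3-turbo",
            device=0,
        )

    @modal.method()
    def transcribe(self, audio: bytes) -> str:
        # Audio bytes in, transcript out; in the full pipeline the
        # transcript feeds the LLM, whose reply feeds the streaming TTS.
        return self.stt(audio)["text"]
```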

### Models

| Component | Model | Role |
| --- | --- | --- |
| STT | `openai/whisper-large-v3-turbo` | Speech-to-text |
| LLM | `mistralai/Ministral-3-3B-Instruct-2512-BF16` | Response generation |
| TTS | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | Text-to-speech (streaming) |

## Quick Deploy (Modal)

```bash
pip install modal
export MODAL_TOKEN_ID="ak-..."
export MODAL_TOKEN_SECRET="as-..."
python3 -m modal deploy modal/modal_parle_whisper.py
```

See `modal/README.md` for full documentation.

## Pipeline (TensorDock/VAST.ai - Legacy)

```
Audio -> Whisper (STT) -> CEFR Classifier -> Gemma 3 4B (LLM) -> Kokoro (TTS) -> Audio
```
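
The CEFR stage is what distinguishes the legacy pipeline: the learner's transcript is classified into a proficiency level before the LLM responds, so the reply can be pitched at that level. A minimal sketch of the step, assuming the classifier works with the standard transformers text-classification pipeline and that its labels are CEFR levels (the exact label format is an assumption):

```python
# Hedged sketch: classify a transcript's CEFR level and fold it into the
# prompt that conditions the tutor's reply.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
)

transcript = "Eu gosto muito de aprender novas linguas."
level = classifier(transcript)[0]["label"]  # e.g. "B1" (label set assumed)

# The detected level shapes the reply before it reaches Gemma 3 4B.
prompt = (
    f"You are a Portuguese tutor. The student is at CEFR level {level}. "
    f"Reply at that level.\n\nStudent: {transcript}"
)
```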

### Models

| Component | Model | Role |
| --- | --- | --- |
| STT | `openai/whisper-small` | Speech-to-text |
| LLM | `RedHatAI/gemma-3-4b-it-quantized.w4a16` | Response generation |
| TTS | `hexgrad/Kokoro-82M` | Text-to-speech |
| CEFR | `marcosremar2/cefr-classifier-pt-mdeberta-v3-enem` | Level classification |

## SSE Streaming Protocol (Modal)

`POST /api/stream-audio` (multipart/form-data with an audio file). The server responds with a Server-Sent Events stream:

```
event: status     data: {"stage": "stt"}
event: transcript data: {"transcript": "...", "stt_ms": N}
event: status     data: {"stage": "llm"}
event: response   data: {"response": "...", "llm_ms": N}
event: status     data: {"stage": "tts"}
event: audio      data: {"chunk": "<base64 WAV>", "index": N}
event: complete   data: {"transcript": "...", "response": "...", "timing": {...}}
```
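
For a quick end-to-end test, a small Python client can post a recording and print events as they arrive. This is a hedged sketch: the URL is a placeholder (use the endpoint `modal deploy` prints for your workspace), the multipart field name `audio` is an assumption about the server's schema, and the SSE framing is parsed by hand rather than with a client library.

```python
# Hypothetical client for /api/stream-audio; URL and field name assumed.
import base64
import json
import requests

URL = "https://YOUR-WORKSPACE--parle.modal.run/api/stream-audio"  # placeholder

with open("sample.wav", "rb") as f:
    resp = requests.post(URL, files={"audio": f}, stream=True)

event = None
chunks = []
for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("event:"):
        event = line.split(":", 1)[1].strip()
    elif line.startswith("data:"):
        payload = json.loads(line.split(":", 1)[1].strip())
        if event == "audio":
            chunks.append(base64.b64decode(payload["chunk"]))  # one WAV chunk
        else:
            print(event, payload)

# Each decoded chunk is a standalone base64 WAV; play or concatenate as needed.
```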

## WebSocket Protocol (TensorDock)

```javascript
const ws = new WebSocket('ws://HOST:PORT/ws/stream');
ws.onopen = () => ws.send(audioBlob);  // send recorded audio once connected
ws.onmessage = (event) => {
  if (event.data instanceof Blob) playAudioChunk(event.data);  // binary: audio chunk
  else console.log(JSON.parse(event.data));                    // text: JSON status
};
```
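
The same exchange from Python, as a hedged sketch using the `websockets` library; `HOST:PORT` is a placeholder exactly as in the JavaScript example, and the binary-vs-text framing mirrors it.

```python
# Hypothetical Python client for ws://HOST:PORT/ws/stream: send one audio
# blob, then read JSON status messages and binary audio chunks.
import asyncio
import json
import websockets

async def converse(audio_path: str) -> None:
    async with websockets.connect("ws://HOST:PORT/ws/stream") as ws:
        with open(audio_path, "rb") as f:
            await ws.send(f.read())          # send recorded audio
        async for message in ws:
            if isinstance(message, bytes):   # binary frame: audio chunk
                print(f"audio chunk: {len(message)} bytes")
            else:                            # text frame: JSON status
                print(json.loads(message))

asyncio.run(converse("sample.wav"))
```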

## Environment Variables

No API keys are hardcoded. All credentials are passed via environment variables:

| Variable | Platform | Description |
| --- | --- | --- |
| `MODAL_TOKEN_ID` | Modal | Modal API token ID |
| `MODAL_TOKEN_SECRET` | Modal | Modal API token secret |
| `TENSORDOCK_API_TOKEN` | TensorDock | TensorDock API token |
| `TENSORDOCK_INSTANCE_ID` | TensorDock | Instance ID for auto-stop |

## License

MIT