# PARLE Speech-to-Speech

Complete Speech-to-Speech pipeline for teaching Portuguese, deployable across multiple platforms.

## Deployment Options

| Platform | Directory | Protocol | GPU |
| --- | --- | --- | --- |
| Modal | `modal/` | HTTP + SSE streaming | L4, A10G, A100, H100 |
| TensorDock / VAST.ai | `app.py` | WebSocket | RTX 3090, RTX 4090 |

## Pipeline (Modal - Current)

```
Audio -> Whisper v3 Turbo (STT) -> Ministral 3B (LLM) -> Qwen3-TTS (TTS) -> Audio
```

All three models run in a single GPU container (the "Moshi pattern") to minimize latency.
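
To make the pattern concrete, here is a minimal, hypothetical sketch of a Modal class that loads a model once per container start and keeps it warm on the GPU. The app name, class, and method names are illustrative only, not the actual API of `modal/modal_parle_whisper.py`; only the STT stage is shown.

```python
# Hypothetical sketch of the "Moshi pattern": models are loaded once per
# container and stay resident on the GPU between requests.
import modal

app = modal.App("parle-sketch")  # illustrative name
image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")                 # Whisper needs ffmpeg to decode bytes
    .pip_install("torch", "transformers")
)

@app.cls(gpu="L4", image=image)
class SpeechPipeline:
    @modal.enter()
    def load_models(self):
        # Runs once per container start. In the real app the LLM and TTS
        # models would be loaded here the same way, so each request pays
        # only inference time, never model load time.
        from transformers import pipeline
        self.stt = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3-turbo",
            device=0,
        )

    @modal.method()
    def transcribe(self, audio: bytes) -> str:
        # Audio bytes in, transcript out; in the full pipeline the
        # transcript feeds the LLM, whose reply feeds the streaming TTS.
        return self.stt(audio)["text"]
```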

### Models

| Component | Model | Role |
| --- | --- | --- |
| STT | `openai/whisper-large-v3-turbo` | Speech-to-text |
| LLM | `mistralai/Ministral-3-3B-Instruct-2512-BF16` | Response generation |
| TTS | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | Text-to-speech (streaming) |

## Quick Deploy (Modal)

```bash
pip install modal
export MODAL_TOKEN_ID="ak-..."
export MODAL_TOKEN_SECRET="as-..."
python3 -m modal deploy modal/modal_parle_whisper.py
```

See `modal/README.md` for full documentation.

## Pipeline (TensorDock/VAST.ai - Legacy)

```
Audio -> Whisper (STT) -> CEFR Classifier -> Gemma 3 4B (LLM) -> Kokoro (TTS) -> Audio
```
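
The CEFR stage is what distinguishes the legacy pipeline: the learner's transcript is classified into a proficiency level before the LLM responds, so the reply can be pitched at that level. A minimal sketch of the step, assuming the classifier works with the standard transformers text-classification pipeline and that its labels are CEFR levels (the exact label format is an assumption):

```python
# Hedged sketch: classify a transcript's CEFR level and fold it into the
# prompt that conditions the tutor's reply.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="marcosremar2/cefr-classifier-pt-mdeberta-v3-enem",
)

transcript = "Eu gosto muito de aprender novas linguas."
level = classifier(transcript)[0]["label"]  # e.g. "B1" (label set assumed)

# The detected level shapes the reply before it reaches Gemma 3 4B.
prompt = (
    f"You are a Portuguese tutor. The student is at CEFR level {level}. "
    f"Reply at that level.\n\nStudent: {transcript}"
)
```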

### Models

| Component | Model | Role |
| --- | --- | --- |
| STT | `openai/whisper-small` | Speech-to-text |
| LLM | `RedHatAI/gemma-3-4b-it-quantized.w4a16` | Response generation |
| TTS | `hexgrad/Kokoro-82M` | Text-to-speech |
| CEFR | `marcosremar2/cefr-classifier-pt-mdeberta-v3-enem` | Level classification |

## SSE Streaming Protocol (Modal)

`POST /api/stream-audio` (multipart/form-data with an audio file). The server responds with a Server-Sent Events stream:

```
event: status     data: {"stage": "stt"}
event: transcript data: {"transcript": "...", "stt_ms": N}
event: status     data: {"stage": "llm"}
event: response   data: {"response": "...", "llm_ms": N}
event: status     data: {"stage": "tts"}
event: audio      data: {"chunk": "<base64 WAV>", "index": N}
event: complete   data: {"transcript": "...", "response": "...", "timing": {...}}
```
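
For a quick end-to-end test, a small Python client can post a recording and print events as they arrive. This is a hedged sketch: the URL is a placeholder (use the endpoint `modal deploy` prints for your workspace), the multipart field name `audio` is an assumption about the server's schema, and the SSE framing is parsed by hand rather than with a client library.

```python
# Hypothetical client for /api/stream-audio; URL and field name assumed.
import base64
import json
import requests

URL = "https://YOUR-WORKSPACE--parle.modal.run/api/stream-audio"  # placeholder

with open("sample.wav", "rb") as f:
    resp = requests.post(URL, files={"audio": f}, stream=True)

event = None
chunks = []
for line in resp.iter_lines(decode_unicode=True):
    if line.startswith("event:"):
        event = line.split(":", 1)[1].strip()
    elif line.startswith("data:"):
        payload = json.loads(line.split(":", 1)[1].strip())
        if event == "audio":
            chunks.append(base64.b64decode(payload["chunk"]))  # one WAV chunk
        else:
            print(event, payload)

# Each decoded chunk is a standalone base64 WAV; play or concatenate as needed.
```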

## WebSocket Protocol (TensorDock)

```javascript
const ws = new WebSocket('ws://HOST:PORT/ws/stream');
ws.onopen = () => ws.send(audioBlob);  // send recorded audio once connected
ws.onmessage = (event) => {
  if (event.data instanceof Blob) playAudioChunk(event.data);  // binary: audio chunk
  else console.log(JSON.parse(event.data));                    // text: JSON status
};
```
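
The same exchange from Python, as a hedged sketch using the `websockets` library; `HOST:PORT` is a placeholder exactly as in the JavaScript example, and the binary-vs-text framing mirrors it.

```python
# Hypothetical Python client for ws://HOST:PORT/ws/stream: send one audio
# blob, then read JSON status messages and binary audio chunks.
import asyncio
import json
import websockets

async def converse(audio_path: str) -> None:
    async with websockets.connect("ws://HOST:PORT/ws/stream") as ws:
        with open(audio_path, "rb") as f:
            await ws.send(f.read())          # send recorded audio
        async for message in ws:
            if isinstance(message, bytes):   # binary frame: audio chunk
                print(f"audio chunk: {len(message)} bytes")
            else:                            # text frame: JSON status
                print(json.loads(message))

asyncio.run(converse("sample.wav"))
```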

## Environment Variables

No API keys are hardcoded. All credentials are passed via environment variables:

| Variable | Platform | Description |
| --- | --- | --- |
| `MODAL_TOKEN_ID` | Modal | Modal API token ID |
| `MODAL_TOKEN_SECRET` | Modal | Modal API token secret |
| `TENSORDOCK_API_TOKEN` | TensorDock | TensorDock API token |
| `TENSORDOCK_INSTANCE_ID` | TensorDock | Instance ID for auto-stop |

## License

MIT