Inside AI Agents / for engineers and architects

One turn, three journeys, decided up front

Every turn is classified before any expensive work begins, by rule where the case is clear and on the cheapest model otherwise. The classification picks the mechanism that fits the work, so a long task is never discovered halfway through an answer.

How a question flows

The fork that keeps it responsive.

Immediate questions run inline and stream back. Long questions offload to a queue and return a job id, so the conversation never blocks. Voice holds a live duplex stream. The decision is deterministic where it can be and consults a model only when the input is genuinely ambiguous.

Diagram: Turn arrives (HTTP or WebSocket) -> Classify (cheapest model), which forks three ways:
immediate -> Immediate (run graph inline, stream) -> Reply + AGUI (same connection)
long -> Long (enqueue, return job id) -> Worker resumes thread (push findings later)
voice -> Voice (duplex Sonic on Fargate) -> Audio + Tamil captions (translate hop)
Indigo: inline path. Gold: durable offload, the job id returns immediately while the worker resumes the thread later. Teal: the live voice path.
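
The three-way fork can be sketched as a small dispatch table. This is a minimal illustration rather than the system's actual code: `Route`, `Turn`, `handle_turn` and the three handler functions are hypothetical names, and the handlers return placeholder dictionaries instead of streaming, enqueueing or opening a real voice session.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Route(Enum):
    IMMEDIATE = "immediate"
    LONG = "long"
    VOICE = "voice"


@dataclass
class Turn:
    thread_id: str
    text: str
    is_audio: bool = False


def run_inline(turn: Turn) -> dict:
    # Immediate path: answer on the same connection (placeholder reply).
    return {"route": Route.IMMEDIATE.value, "reply": f"answer to: {turn.text}"}


def enqueue_long(turn: Turn) -> dict:
    # Long path: only a job id goes back to the caller (placeholder id).
    return {"route": Route.LONG.value, "job_id": f"job-{turn.thread_id}"}


def open_voice(turn: Turn) -> dict:
    # Voice path: a live duplex stream would be opened here (placeholder).
    return {"route": Route.VOICE.value, "stream": "duplex"}


HANDLERS: Dict[Route, Callable[[Turn], dict]] = {
    Route.IMMEDIATE: run_inline,
    Route.LONG: enqueue_long,
    Route.VOICE: open_voice,
}


def handle_turn(turn: Turn, classify: Callable[[Turn], Route]) -> dict:
    # The route is chosen first, so no handler starts work it cannot finish.
    route = classify(turn)
    return HANDLERS[route](turn)
```

Because the classifier is passed in as a function, the same dispatch works whether the route came from a rule or from a model.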

Why the long path is a queue, not an await

Asynchronous request handling keeps one request from blocking a worker, but it does not survive a dropped socket, a restart, a deploy or a timeout. Tying a multi-minute research chain to the lifetime of a connection is the exact fragility this design avoids. Offload is a durability concern, solved by the queue and a job id one layer down, not by awaiting harder.
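
A rough sketch of the difference: the submit call returns a job id as soon as the work is enqueued, and a separate worker step does the slow part and writes findings back to the thread. The in-memory `QUEUE` and `THREADS` below are stand-ins chosen for illustration; they are not durable, and a real deployment would presumably use a persistent queue and thread store so that jobs outlive a restart or a dropped socket. All names are hypothetical.

```python
import uuid
from collections import deque
from typing import Deque, Dict, List

# Stand-ins for durable infrastructure (assumption: in-memory only here).
QUEUE: Deque[dict] = deque()
THREADS: Dict[str, List[str]] = {}


def submit_long(thread_id: str, query: str) -> str:
    """Enqueue the work and return a job id without waiting for the result."""
    job_id = uuid.uuid4().hex
    QUEUE.append({"job_id": job_id, "thread_id": thread_id, "query": query})
    return job_id


def worker_step() -> bool:
    """Take one job, do the slow work, and push findings onto the thread."""
    if not QUEUE:
        return False
    job = QUEUE.popleft()
    # Placeholder for the multi-minute research chain.
    findings = f"findings for {job['query']!r} (job {job['job_id']})"
    THREADS.setdefault(job["thread_id"], []).append(findings)
    return True
```

The request handler only ever calls `submit_long`, so its lifetime is independent of how long the research takes.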

Determinism

Heuristic first, model only on doubt

Clear cases are settled by rule, with no model call. The classifier consults a model only for genuinely ambiguous input, so the fork stays cheap and fast.
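
One way to picture this is a rule function that returns a route or `None`, with the model called only on `None`. The rules below (an audio flag, a keyword pattern, a 12-word cutoff) are invented for illustration and are not the system's real heuristics; `ask_model` is a hypothetical callable standing in for the cheap classifier model.

```python
import re
from typing import Callable, Optional

# Illustrative hint words only; the real rule set is not given in the text.
LONG_HINTS = re.compile(r"\b(research|compare|survey|in depth|report)\b", re.I)


def classify_by_rule(text: str, is_audio: bool) -> Optional[str]:
    """Return a route for clear cases, or None when the rules cannot decide."""
    if is_audio:
        return "voice"
    if LONG_HINTS.search(text):
        return "long"
    if len(text.split()) <= 12:
        return "immediate"
    return None


def classify(text: str, is_audio: bool, ask_model: Callable[[str], str]) -> str:
    """Heuristic first; fall back to the model only on doubt."""
    route = classify_by_rule(text, is_audio)
    if route is not None:
        return route
    return ask_model(text)
```

The rule function is pure and needs no network access, so it can be unit-tested and tuned without spending anything on model calls.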

Value: the cheapest possible decision on the highest-volume step.

Safety of retries

Idempotency on the offload

The job id is paired with a key derived from the thread and the query. A retried submission, or a duplicate produced by at-least-once delivery, collapses to the same job.
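
A minimal sketch of such a key, assuming it is a SHA-256 hash over the thread id and a whitespace- and case-normalized query. The normalization, the separator byte, the in-memory `JOBS_BY_KEY` map and the job id format are all assumptions made for illustration; the text only states that the key is derived from the thread and the query.

```python
import hashlib
from typing import Dict

# Stand-in for a persistent key -> job id table (assumption: in-memory only).
JOBS_BY_KEY: Dict[str, str] = {}


def idempotency_key(thread_id: str, query: str) -> str:
    """Derive a stable key from the thread and the query."""
    # Assumed normalization: collapse whitespace and ignore case.
    normalized = " ".join(query.split()).lower()
    # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{thread_id}\x1f{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def submit_once(thread_id: str, query: str) -> str:
    """Return the existing job id for this key, or record a new one."""
    key = idempotency_key(thread_id, query)
    # setdefault keeps the first job id stored for the key.
    return JOBS_BY_KEY.setdefault(key, f"job-{key[:12]}")
```

In a shared store the check-and-insert would need to be atomic, for example through a unique constraint on the key, so that two concurrent retries cannot both create a job.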

Value: the same research never runs or bills twice.