The fork that keeps it responsive.
Immediate questions run inline and stream back. Long questions offload to a queue and return a job id so the conversation never blocks. Voice holds a live duplex stream. The decision is deterministic where it can be, and consults a model only when the input is genuinely ambiguous.
Why the long path is a queue, not an await
Asynchronous request handling keeps one request from blocking a worker, but it does not survive a dropped socket, a restart, a deploy or a timeout. Tying a multi-minute research chain to the lifetime of a connection is the exact fragility this design avoids. Offload is a durability concern, solved by the queue and a job id one layer down, not by awaiting harder.
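As a rough illustration of the difference, the sketch below writes the job to a store and hands back a job id before any work starts, so a dropped connection loses nothing. It is a minimal sketch under assumptions, not the actual implementation: the SQLite table and the names `open_store`, `submit`, `run_next` and `status` are hypothetical, and a production system would likely use a dedicated queue.

```python
import sqlite3
import uuid


def open_store(path=":memory:"):
    """Open the job store. Pass a file path for jobs to survive a restart."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id TEXT PRIMARY KEY, query TEXT NOT NULL, "
        "status TEXT NOT NULL, result TEXT)"
    )
    db.commit()
    return db


def submit(db, query):
    """Record the job and return its id immediately; no work happens here."""
    job_id = uuid.uuid4().hex
    db.execute(
        "INSERT INTO jobs (id, query, status) VALUES (?, ?, 'queued')",
        (job_id, query),
    )
    db.commit()
    return job_id


def run_next(db, handler):
    """Worker side: run the oldest queued job, independent of any connection."""
    row = db.execute(
        "SELECT id, query FROM jobs WHERE status = 'queued' "
        "ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, query = row
    result = handler(query)  # the long-running step (hypothetical handler)
    db.execute(
        "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
        (result, job_id),
    )
    db.commit()
    return job_id


def status(db, job_id):
    """Client side: poll by job id from any later connection."""
    return db.execute(
        "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
```

The caller keeps only the job id. Whether the socket drops or the process restarts between `submit` and `run_next`, the row is still there to be picked up, provided the store is on disk.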
Heuristic first, model only on doubt
Clear cases are settled by rule, locally and without a model call. The classifier consults a model only for genuinely ambiguous input, so the fork stays cheap and fast.
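One way to picture the rule-first fork is the sketch below. The hint words, the 12-word threshold and the `ask_model` callback are all assumptions made for illustration; the source does not specify the actual rules.

```python
from typing import Callable, Optional

VOICE, INLINE, OFFLOAD = "voice", "inline", "offload"

# Hypothetical signals that a request needs a long research chain.
LONG_HINTS = ("research", "compare", "report", "in depth")


def classify_by_rule(text: str, is_audio: bool = False) -> Optional[str]:
    """Return a route for clear cases, or None when the rules cannot decide."""
    if is_audio:
        return VOICE
    lowered = text.lower()
    if any(hint in lowered for hint in LONG_HINTS):
        return OFFLOAD
    if len(lowered.split()) <= 12:  # assumed cutoff for "immediate"
        return INLINE
    return None


def route(
    text: str,
    is_audio: bool = False,
    ask_model: Optional[Callable[[str], str]] = None,
) -> str:
    """Rules first; the model is consulted only when the rules return None."""
    decision = classify_by_rule(text, is_audio)
    if decision is not None:
        return decision
    if ask_model is None:
        return INLINE  # assumed fallback when no model is wired in
    return ask_model(text)
```

Because `ask_model` is reached only on the `None` branch, the model's cost scales with the share of ambiguous traffic rather than with total volume.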
Value: the cheapest possible decision on the highest-volume step.
Idempotency on the offload
The job id is paired with a key derived from the thread and the query. A retried submission or at-least-once delivery collapses to the same job.
Value: the same research never runs or bills twice.
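A minimal sketch of the key derivation and the collapse, assuming a SHA-256 hash over the thread id and a whitespace- and case-normalized query. The normalization, the `JobRegistry` class and the in-memory dict are hypothetical stand-ins; a real store would enforce the same rule with a unique constraint on the key.

```python
import hashlib
import uuid


def idempotency_key(thread_id: str, query: str) -> str:
    """Derive a stable key from the thread and the query."""
    normalized = " ".join(query.split()).lower()  # assumed normalization
    # The unit separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{thread_id}\x1f{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class JobRegistry:
    """In-memory stand-in for the job store (hypothetical)."""

    def __init__(self):
        self._by_key = {}

    def submit(self, thread_id: str, query: str):
        """Return (job_id, created). A repeat returns the existing job id."""
        key = idempotency_key(thread_id, query)
        if key in self._by_key:
            return self._by_key[key], False
        job_id = uuid.uuid4().hex
        self._by_key[key] = job_id
        return job_id, True
```

A client retry and a redelivered queue message both arrive with the same thread and query, so both resolve to the job id issued the first time and no second run is started.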