Multi-LLM Orchestration: Architecture Instead of a Subscription Upgrade Against Rate Limits

Multi-LLM orchestration means using several language models and providers in parallel instead of tying yourself to a single subscription. This article covers when that makes sense and what an architecture that absorbs rate limits and outages can look like.

Why a single model is not enough

The available LLMs have different strengths. Claude delivers on reasoning and the synthesis of complex relationships. Gemini processes long documents and code bases in one go. Codex works on code tasks: boilerplate, refactoring, unit tests. A local open-source model covers repetitive text work, with no token costs and no cloud dependency. Complex reasoning is beyond it.

Many users nonetheless stick with one model, because every tool switch creates friction. The convenience ends as soon as the preferred model throttles or is unsuitable for a particular task.

A bottleneck limits everything upstream of it. That holds in special-purpose machinery construction just as it does in technical writing and in AI-supported work. What works is distributing the work across several models, each contributing its respective strength.

The architecture: roles instead of a toolbox

The concept of a multi-LLM orchestrator follows the same logic as classic process consulting: clear roles, clear responsibilities, a set of rules on paper rather than only in one person’s head.

Claude (Opus 4.6) is the orchestrator. It knows all running projects, rules, and customer contexts. It decides which model takes on which task. The Claude allowance is expensive and stays reserved for the tasks that demand deep reasoning. Boilerplate, research, and text drafts move to the specialists.

Gemini (Google) is the researcher. Long documents, extensive contexts, analyses over large volumes of data. Integrated via the slash command /gemini, it uses its own allowance.

Codex (OpenAI) is the code specialist. Boilerplate, refactoring, debugging. Integrated via /codex, it uses a ChatGPT allowance.

A local open-source model takes on the repetitive text work. Raw newsletter drafts, documentation blocks, simple rewrites. Which concrete model runs is interchangeable as far as the architecture is concerned.

A status line shows the live limits of the three cloud providers. It displays the limits and reset timers of the respective models as well as the current activities. The orchestrator uses this information for the routing decision: when the Claude allowance is tight, more tasks move to the specialists.
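The routing decision can be illustrated with a minimal sketch. The task types, the 0.3 threshold, and the names Quota, SPECIALISTS, and route are assumptions made for illustration; the article does not specify how the orchestrator weighs the limits.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Remaining share of a provider allowance: 0.0 (exhausted) to 1.0 (full)."""
    remaining: float


# Hypothetical mapping of task types to the roles described above.
SPECIALISTS = {"research": "gemini", "code": "codex", "text": "local"}

# Hypothetical threshold below which the Claude allowance counts as tight.
CLAUDE_TIGHT = 0.3


def route(task_type: str, quotas: dict[str, Quota]) -> str:
    """Pick a model from the task type and the live quotas."""

    def available(model: str) -> bool:
        # The local model has no provider limit.
        return model == "local" or quotas[model].remaining > 0.0

    if task_type == "reasoning":
        # Deep reasoning stays with the orchestrator while allowance is left.
        preferred = ["claude", "gemini", "codex"]
    elif task_type in SPECIALISTS:
        preferred = [SPECIALISTS[task_type], "claude"]
    else:
        # Unclassified work: Claude while its allowance is comfortable,
        # otherwise the cloud specialist with the most allowance left.
        cloud = sorted(("gemini", "codex"),
                       key=lambda m: quotas[m].remaining, reverse=True)
        if quotas["claude"].remaining >= CLAUDE_TIGHT:
            preferred = ["claude", *cloud]
        else:
            preferred = [*cloud, "claude"]

    for model in preferred:
        if available(model):
            return model
    return "local"
```

With a tight Claude allowance, only reasoning tasks still reach Claude in this sketch; everything else lands on a specialist.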

Why memory is the actual core

The model architecture alone is not yet a solution. That becomes clear as soon as a specialised agent works without context: it improvises. It starts scanning structures and accesses areas that should be none of its business. It lacks the set of rules; it does not know its limits.

A shared memory system is the prerequisite for several models to work in a coordinated way.

The system consists of a SQLite database in WAL mode, an MCP memory server as a Claude Code subprocess, and a cross-agent CLI through which Gemini and Codex access the same database via Bash. As soon as Claude writes memory files, a PostToolUse hook synchronises the database automatically.
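As a rough sketch of the storage layer, the following opens a SQLite database in WAL mode and upserts one memory file into it. The table layout and the function names open_memory_db and sync_memory_file are assumptions; the article does not show the actual schema or the hook code.

```python
import sqlite3
from pathlib import Path


def open_memory_db(path: str) -> sqlite3.Connection:
    """Open (or create) the shared memory database in WAL mode."""
    conn = sqlite3.connect(path)
    # WAL lets several readers work while one process writes, which suits
    # multiple agents reaching the same file through a CLI.
    conn.execute("PRAGMA journal_mode=WAL")
    # Assumed schema: one row per memory file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        "name TEXT PRIMARY KEY, category TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def sync_memory_file(conn: sqlite3.Connection, file_path: Path, category: str) -> None:
    """Insert or update one memory file; a hook could call this after a write."""
    conn.execute(
        "INSERT INTO memory (name, category, body) VALUES (?, ?, ?) "
        "ON CONFLICT(name) DO UPDATE SET "
        "category = excluded.category, body = excluded.body",
        (file_path.stem, category, file_path.read_text(encoding="utf-8")),
    )
    conn.commit()
```

Keying rows on the file name keeps repeated syncs idempotent: writing the same file twice updates the row instead of duplicating it.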

The content is structured into four categories:

User: who the user is, how they work, what their preferences are.
Project: the status of running projects, open points, where the work was left off.
Feedback: rules drawn from mistakes and corrections. "Never write directly into a deployment folder." "Always do auth fixes with a side-effect check." Such rules are written down once and are then available in every session.
Reference: where things are. Which server has which address. Where credentials are.

There are currently 75 entries in the database. The search runs on text substring matching. An embedding-based variant is prepared but, owing to a Torch version conflict, is not permanently active.

The result: a specialised agent accesses structured context information at start-up. Project status, rules, infrastructure references. The repetition of the basics in every session is dropped.

Configuration symmetry

The third element is the set of rules. At every start, Claude reads a CLAUDE.md with infrastructure, standing orders, project overviews, and behavioural rules. A migration script mirrors this document into GEMINI.md and AGENTS.md, so that Gemini and Codex have the same rule base.
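A mirroring step of this kind can be sketched in a few lines. The generated-file comment and the function name mirror_rules are assumptions; the real migration script may well transform the content rather than copy it.

```python
from pathlib import Path

# Target rule files named in the article, with the agent each one serves.
TARGETS = {"GEMINI.md": "Gemini", "AGENTS.md": "Codex"}


def mirror_rules(root: Path) -> list[Path]:
    """Copy CLAUDE.md into the rule files read by the other agents.

    Returns the files that were actually (re)written.
    """
    rules = (root / "CLAUDE.md").read_text(encoding="utf-8")
    written = []
    for filename, agent in TARGETS.items():
        target = root / filename
        # Hypothetical marker so nobody edits the generated copy by hand.
        content = (
            f"<!-- Generated from CLAUDE.md for {agent}; edit the source instead. -->\n"
            f"{rules}"
        )
        # Skip unchanged targets so repeated runs do not touch the files.
        if not target.exists() or target.read_text(encoding="utf-8") != content:
            target.write_text(content, encoding="utf-8")
            written.append(target)
    return written
```

Treating CLAUDE.md as the single source and the other two files as generated output avoids three rule sets drifting apart.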

The current setup contains 13 slash commands on Claude, mirrored as 25 Gemini skills, plus 8 subagents for specialised tasks. An identity-lint hook blocks incorrect spellings of names and ASCII umlauts, because these errors are not acceptable in customer documents and no model reliably avoids them without an explicit rule.
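The core check of such a lint can be sketched as a word list with required spellings. The entries below, including the example name, are hypothetical, and how the check is wired into a hook is left out because the article does not describe it.

```python
import re

# Hypothetical rule set: forbidden ASCII spelling (lowercase) -> required spelling.
FORBIDDEN = {
    "mueller": "Müller",
    "fuer": "für",
    "ueber": "über",
}

# Whole-word matching avoids false positives such as "neue" or "aktuell",
# which contain "ue" legitimately.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, FORBIDDEN)) + r")\b", re.IGNORECASE
)


def lint(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, found spelling, required spelling) per violation."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            found = match.group(0)
            findings.append((lineno, found, FORBIDDEN[found.lower()]))
    return findings


def is_allowed(text: str) -> bool:
    """A hook could block the write whenever this returns False."""
    return not lint(text)
```

An explicit word list is deliberately conservative: it catches only known mistakes, but it never flags correct German words.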

What stays only in your head is gone for the model after the session. What is written in a file applies again at the next start. Many users do without this documentation and explain the same rules to the AI every day.

Limits of the system

Embeddings partly unavailable. The Torch version conflict between LM Studio and the sentence-transformers library is not yet resolved. As long as LM Studio is not actively running, the system falls back to text substring search. That works, but without semantic matching.

Gemini does not reliably resolve ~/ paths on Windows. The workaround with absolute paths is functional and documented.
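One way to apply the absolute-path workaround is to expand every ~/ path before a prompt is handed to another agent. The function names to_absolute and expand_in_prompt are hypothetical; the sketch uses only the standard library.

```python
import os
import re

# A token that starts with "~/" at the beginning of the text or after whitespace.
_TILDE_PATH = re.compile(r"(?<!\S)~/\S*")


def to_absolute(path_text: str) -> str:
    """Expand a leading ~ and return an absolute, normalised path."""
    return os.path.abspath(os.path.expanduser(path_text))


def expand_in_prompt(prompt: str) -> str:
    """Replace every ~/ path in a prompt with its absolute form."""
    # A function replacement keeps Windows backslashes from being
    # interpreted as regex escape sequences.
    return _TILDE_PATH.sub(lambda m: to_absolute(m.group(0)), prompt)
```

Doing the expansion on the calling side means the receiving agent never has to interpret the tilde itself.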

Memory drift. Entries go out of date as projects change. A memory system needs active maintenance, just like a process manual. Anyone who sets it up once and never touches it again has an outdated database after a few months.
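A simple maintenance aid is a report of entries that have not been reviewed for a while. The sketch below assumes each entry carries a last-reviewed date and uses a hypothetical 90-day threshold; the article mentions neither.

```python
from datetime import date, timedelta


def stale_entries(
    entries: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Return the names of entries last reviewed before the cutoff, oldest first.

    entries maps an entry name to its (assumed) last-reviewed date.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [(reviewed, name) for name, reviewed in entries.items() if reviewed < cutoff]
    return [name for _, name in sorted(stale)]
```

The list is only a prompt for a human review; whether an old entry is actually outdated still has to be decided per entry.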

Codex has its quirks with auth and JWT fixes: a fix on one route can cause silent errors on other API routes. To prevent that, the set of rules has to make side-effect checks mandatory.

Local open-source models fail at complex reasoning. For architecture decisions they remain unsuitable. The cost-benefit ratio of input to output also has to be checked constantly: the smaller the open-source model, the more rework may be required.

Four models work well together on clearly defined tasks with available context. Unclear tasks without a set of rules deliver correspondingly unclear results.

Who the approach makes sense for

The setup is intended for individuals or small teams who already actively use several AI subscriptions: Claude, Google AI Pro, ChatGPT. The orchestration distributes the load across the already-paid allowances and creates no additional costs. An estimated 40 to 50% of the delegable work moves to Gemini or Codex. That is an observation from production use, not a controlled measurement.

Author’s personal note

The redundant approach makes sense in every professional scenario in which automated processes or service delivery depend on AI systems. In many companies there are few or no manual fallbacks, because the whole point is to make those very (manual) processes obsolete with such a system.

Precisely when a "single source" system fails, the damage can be immense. Entire processes can then grind to a halt. Business owners urgently need to apply the "second-source" or "third-source" quality standards familiar from procurement here, and thus ensure the best possible failover resilience. Anyone who subscribes to monitoring services such as https://status.claude.com/ gets a very good sense of how often such services are not available.

Also not to be overlooked is the fact that all the relevant services (apart from "le Chat" by Mistral, France) are American. Depending on a single country in this respect is a strategy that stops working, at the latest, when that country needs the computing power for its own purposes, for example in national emergencies and war (at the beginning of the Iran war this was very palpable). Equally conceivable is political and economic influence by such countries through deliberately making the raw material "computing power" scarce.

For enterprise scenarios with multi-user requirements, central IT governance, and formal compliance requirements, the setup is unsuitable. It lacks user management, audit logs, and an RBAC structure. Companies with these requirements need a different architecture.

For the individual user doing daily AI work, the practical question is: is there a set of rules that makes specialised agents controllable?

Conclusion

A single model limit creates a bottleneck in the workflow. Bottlenecks can be remedied by distributing across several resources. Distribution only works with a shared set of rules. This logic applies equally in human teams and in AI teams.

The full technical article with the technical architecture is available as a PDF download.
