How LLM Inference Works at Parallel
TL;DR: To switch models or providers, just change model: ModelNameEnum.XXX in your code. That's it. The infrastructure handles everything else.
The One-Line Model Change
// Using GPT-4.1 on Azure OpenAI
const response = await this.llmRunnableService.invoke({
  promptName: PromptName.MY_AGENT,
  state,
  model: ModelNameEnum.GPT_4_1_EU, // ← Change this...
})

// Using Claude on Anthropic
const response = await this.llmRunnableService.invoke({
  promptName: PromptName.MY_AGENT,
  state,
  model: ModelNameEnum.CLAUDE_SONNET_4_5, // ← ...to this
})
No code changes. No provider-specific logic. No configuration updates. The application layer stays completely provider-agnostic.
Available Models
| Model | Enum Value | Provider | Context Window |
|---|---|---|---|
| GPT-4o | ModelNameEnum.GPT_4O | Azure OpenAI | 128K tokens |
| GPT-4.1 | ModelNameEnum.GPT_4_1 | Azure OpenAI | 1M tokens |
| GPT-4.1 EU | ModelNameEnum.GPT_4_1_EU | Azure OpenAI | 1M tokens |
| GPT-4.1 APIM | ModelNameEnum.GPT_4_1_APIM | Azure APIM (load balanced) | 1M tokens |
| Claude Sonnet 4.5 | ModelNameEnum.CLAUDE_SONNET_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Claude Haiku 4.5 | ModelNameEnum.CLAUDE_HAIKU_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Mistral Large 3 | ModelNameEnum.MISTRAL_LARGE_3 | Mistral (Azure AI Foundry) | 256K tokens |
Architecture Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ llmRunnableService.invoke({ model: ModelNameEnum.XXX, ... }) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLMRunnableService │
│ │
│ • Routes to correct provider based on model name │
│ • Fetches prompts from LangChain Hub │
│ • Validates token counts against model limits │
│ • Post-processes responses (encoding cleanup, schema unwrapping) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ AzureOpenAIService │ │ AnthropicService │ │ MistralService │ ...
│ │ │ │ │ │
│ Map<Model,Instance> │ │ Map<Model,Instance> │ │ Map<Model,Instance> │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ AzureOpenAI │ │ Anthropic │ │ Mistral │
│ InstanceService │ │ InstanceService │ │ InstanceService │
│ │ │ │ │ │
│ • AzureChatOpenAI │ │ • ChatAnthropic │ │ • AzureChatOpenAI │
│ • Rate limiter │ │ • Rate limiter │ │ • Rate limiter │
│ • Config │ │ • Config │ │ • Config │
└─────────────────────┘  └─────────────────────┘  └─────────────────────┘
Three-Layer Design
1. Model Registry (ModelNameEnum + ModelService)
- Defines all available models as enum values
- Groups models by provider
- Provides token limits per model
2. Instance Layer (per-provider InstanceService)
- One instance per model, created at startup
- Handles provider-specific configuration (endpoints, API keys)
- Includes rate limiting via Bottleneck (60 RPM)
- Returns a LangChain BaseChatModel
3. Service Orchestration (LLMRunnableService)
- Single entry point for all LLM calls
- Routes requests to the correct provider
- Handles prompt loading, token validation, response processing
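Taken together, the three layers can be pictured roughly as the following shapes. This is a sketch only: apart from the names already introduced above (ModelNameEnum, ModelService, getModelLimit, InstanceService, Bottleneck, BaseChatModel, LLMRunnableService), the exact fields and signatures are assumptions, not the real implementation.
import Bottleneck from 'bottleneck'
import { BaseChatModel } from '@langchain/core/language_models/chat_models'

// ModelNameEnum and PromptName come from this codebase (see the sections above and below).

// Layer 1: the model registry. ModelNameEnum lists the models; ModelService knows their limits.
declare class ModelService {
  constructor(modelName: ModelNameEnum)
  getModelLimit(): number // max context tokens for this model
}

// Layer 2: one pre-configured instance per model (field names are illustrative).
interface InstanceService {
  model: BaseChatModel // provider-specific LangChain chat model
  limiter: Bottleneck  // per-instance rate limiter (60 RPM)
}

// Layer 3: the single entry point the application layer talks to.
declare class LLMRunnableService {
  invoke(args: { promptName: PromptName; state: object; model: ModelNameEnum }): Promise<unknown>
}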
Execution Flow
When you call llmRunnableService.invoke():
┌─────────────────────────────────────────────────────────────────┐
│ 1. GET INSTANCE │
│ getInstanceServiceForModel(model) │
│ → Checks which provider has this model loaded │
│ → Returns the pre-configured InstanceService │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. FETCH PROMPT │
│ LangchainHubService.getPrompt(promptName, promptTag) │
│ → Pulls prompt template from LangChain Hub │
│ → Supports version tags for prompt management │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. FILTER STATE │
│ Only keeps state keys that match prompt.inputVariables │
│ (e.g., medicalInformation, documents, etc.) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. TOKEN VALIDATION │
│ countTokens(model, prompt + state) │
│ If > model limit: │
│ → Reduces state by truncating document content │
│ Uses appropriate tokenizer per provider │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 5. BUILD & EXECUTE RUNNABLE │
│ runnable = prompt.pipe(instanceModel) │
│ response = await runnable.invoke(state) │
│ → Rate-limited by Bottleneck │
│ → Retries on failure (maxRetries: 3) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 6. POST-PROCESS RESPONSE │
│ • Unwrap schema-formatted responses │
│ • Remove title/description fields │
│ • Fix encoding issues (unicode escapes, null bytes, etc.) │
└─────────────────────────────────────────────────────────────────┘
↓
Return typed response
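Condensed into code, the six steps above amount to roughly the following. This is a sketch of the inside of LLMRunnableService.invoke(); InvokeArgs, fitToTokenLimit, postProcess, and the service field names are illustrative assumptions, not the actual implementation.
// Sketch: method excerpt from LLMRunnableService. Helper names below are assumptions.
async invoke({ promptName, promptTag, state, model }: InvokeArgs): Promise<unknown> {
  // 1. Resolve the pre-configured instance for this model
  const instanceService = await this.getInstanceServiceForModel(model)

  // 2. Pull the prompt template from LangChain Hub
  const prompt = await this.langchainHubService.getPrompt(promptName, promptTag)

  // 3. Keep only the state keys the prompt declares as input variables
  const filteredState = Object.fromEntries(
    Object.entries(state).filter(([key]) => prompt.inputVariables.includes(key)),
  )

  // 4. Pre-flight token check; truncate document content if over the model limit
  const safeState = this.fitToTokenLimit(model, prompt, filteredState)

  // 5. Compose prompt and model, then execute (rate-limited, maxRetries: 3)
  const runnable = prompt.pipe(instanceService.model) // property name assumed
  const response = await runnable.invoke(safeState)

  // 6. Unwrap schema output and clean up encoding artifacts
  return this.postProcess(response)
}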
Why Not Traditional Dependency Injection?
You might wonder: "Why not inject LLMService with different implementations for each provider?"
Traditional DI patterns don't fit well here for several reasons:
1. Dynamic Runtime Configuration
Each model requires different environment variables discovered at runtime:
// Azure OpenAI models use per-model env vars
AZURE_DEPLOYMENT_NAME_GPT_4O=...
AZURE_API_KEY_GPT_4O=...
// Anthropic uses shared credentials
ANTHROPIC_API_KEY=...
ANTHROPIC_API_INSTANCE_NAME=...
// Mistral has its own configuration
MISTRAL_API_KEY=...
MISTRAL_API_INSTANCE_NAME=...
A DI container can't elegantly handle "create N instances with configuration discovered at startup."
2. Multiple Instances Per Provider
We need one instance per model, not one instance per provider. A single AzureOpenAIService manages multiple AzureOpenAIInstanceService instances, each with its own rate limiter and configuration.
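For illustration, a provider service might populate its registry at startup like this. This is a sketch: resolveAzureConfig() is a hypothetical placeholder for reading that model's environment variables (a concrete sketch of it appears after the Configuration Reference), and the AzureOpenAIInstanceService constructor signature is assumed.
// Sketch only: a single AzureOpenAIService holds one instance per model, not one per provider.
export class AzureOpenAIService {
  private readonly modelToInstanceMap = new Map<ModelNameEnum, AzureOpenAIInstanceService>()

  constructor() {
    for (const key of AZURE_OPENAI_MODEL_KEYS) {
      const modelName = ModelNameEnum[key]
      const config = resolveAzureConfig(key) // deployment name, API key, instance name, API version
      this.modelToInstanceMap.set(modelName, new AzureOpenAIInstanceService(modelName, config))
    }
  }
}
Each AzureOpenAIInstanceService then owns its own AzureChatOpenAI client, Bottleneck limiter, and config, as shown in the architecture diagram.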
3. Graceful Degradation
If a model's environment variables are missing, we log a warning and continue. The app doesn't crash—it just doesn't have that model available:
try {
  const instanceService = new AnthropicInstanceService(...)
  this.modelToInstanceMap.set(modelName, instanceService)
} catch (error) {
  this.loggerService.warn(`Failed to load Anthropic instance...`)
  // Continue loading other models
}
4. Runtime Model Selection
The calling code chooses which model to use at runtime:
model: shouldUseFastModel ? ModelNameEnum.CLAUDE_HAIKU_4_5 : ModelNameEnum.GPT_4_1_EU
This isn't a compile-time decision—it's based on business logic, feature flags, or request parameters.
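For example, the choice can hang off a request parameter and feed straight into invoke(). Illustrative only: request.priority is a made-up field standing in for whatever business rule applies.
// Illustrative: pick the model per request, then pass it to the same entry point
const model = request.priority === 'low'
  ? ModelNameEnum.CLAUDE_HAIKU_4_5 // fast, cheap
  : ModelNameEnum.GPT_4_1_EU       // large context, stronger reasoning

const response = await this.llmRunnableService.invoke({
  promptName: PromptName.MY_AGENT,
  state,
  model,
})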
The Actual Pattern: Service Locator with Registry
We use a service locator pattern with a registry approach:
- Provider services act as registries: Map<ModelNameEnum, InstanceService>
- LLMRunnableService queries each provider to find who owns a given model
- The caller doesn't know or care which provider backs their model
private async getInstanceServiceForModel(model: ModelNameEnum): Promise<InstanceService> {
  if (this.anthropicService.hasModel(model)) {
    return this.anthropicService.getInstanceService(model)
  }
  if (this.mistralService.hasModel(model)) {
    return this.mistralService.getInstanceService(model)
  }
  if (this.azureOpenAIService.hasModel(model)) {
    return this.azureOpenAIService.getInstanceService(model)
  }
  if (this.azureAPIMService.hasModel(model)) {
    return this.azureAPIMService.getInstanceService(model)
  }
  throw new Error(`No provider found for model: ${model}`)
}
Configuration Reference
Azure OpenAI Models
Per-model environment variables:
AZURE_DEPLOYMENT_NAME_GPT_4O=gpt-4o
AZURE_API_VERSION_GPT_4O=2024-02-15-preview
AZURE_API_KEY_GPT_4O=xxx
AZURE_INSTANCE_NAME_GPT_4O=my-azure-instance
# Repeat for GPT_4_1, GPT_4_1_EU...
Azure APIM (Load Balanced)
AZURE_DEPLOYMENT_NAME_GPT_4_1_APIM=gpt-4.1-apim
AZURE_API_VERSION_GPT_4_1_APIM=2025-01-01-preview
AZURE_API_KEY_GPT_4_1_APIM=xxx
AZURE_INSTANCE_NAME_GPT_4_1_APIM=apim-parallel
Anthropic (Azure AI Foundry)
Shared credentials for all Anthropic models:
ANTHROPIC_API_KEY=xxx
ANTHROPIC_API_INSTANCE_NAME=my-ai-foundry-instance
Mistral (Azure AI Foundry)
Shared credentials for all Mistral models:
MISTRAL_API_KEY=xxx
MISTRAL_API_INSTANCE_NAME=my-ai-foundry-instance
MISTRAL_API_VERSION=2024-05-01-preview
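To make the naming convention concrete, the per-model Azure OpenAI variables listed above could be resolved at startup with a helper like this. This is hypothetical, not the actual implementation; it only illustrates the env-var naming scheme.
// Hypothetical helper: derives env-var names from the model key, e.g. AZURE_API_KEY_GPT_4O.
// Throwing on a missing variable is what lets the provider skip that model (see Graceful Degradation above).
function resolveAzureConfig(modelKey: keyof typeof ModelNameEnum) {
  const read = (prefix: string): string => {
    const value = process.env[`${prefix}_${modelKey}`]
    if (!value) throw new Error(`Missing env var ${prefix}_${modelKey}`)
    return value
  }
  return {
    deploymentName: read('AZURE_DEPLOYMENT_NAME'),
    apiVersion: read('AZURE_API_VERSION'),
    apiKey: read('AZURE_API_KEY'),
    instanceName: read('AZURE_INSTANCE_NAME'),
  }
}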
Adding a New Model
1. Add to ModelNameEnum
export enum ModelNameEnum {
  // ... existing models
  MY_NEW_MODEL = 'my-new-model',
}

// Add to the appropriate provider's key list
export const AZURE_OPENAI_MODEL_KEYS: (keyof typeof ModelNameEnum)[] = [
  'GPT_4O', 'GPT_4_1', 'GPT_4_1_EU', 'MY_NEW_MODEL'
]
2. Add Token Limit in ModelService
getModelLimit(): number {
  switch (this.modelName) {
    // ... existing cases
    case ModelNameEnum.MY_NEW_MODEL:
      return 128000
  }
}
3. Set Environment Variables
AZURE_DEPLOYMENT_NAME_MY_NEW_MODEL=my-deployment
AZURE_API_VERSION_MY_NEW_MODEL=2024-02-15-preview
AZURE_API_KEY_MY_NEW_MODEL=xxx
AZURE_INSTANCE_NAME_MY_NEW_MODEL=my-instance
4. Use It
await this.llmRunnableService.invoke({
  promptName: PromptName.MY_AGENT,
  state,
  model: ModelNameEnum.MY_NEW_MODEL,
})
Key Design Decisions
| Aspect | Implementation |
|---|---|
| Provider Agnosticism | Application code uses ModelNameEnum, never provider-specific classes |
| Prompt Management | Externalized in LangChain Hub with version tags |
| Multi-model Support | One instance per model, selected at call time |
| Rate Limiting | Per-instance Bottleneck (60 RPM, prevents throttling) |
| Token Safety | Pre-flight token counting with automatic state reduction |
| Structured Output | Response unwrapping + encoding cleanup |
| Batch Processing | batch_invoke() for parallel calls with concurrency control |
| LangChain Integration | All providers return BaseChatModel, enabling .pipe() chains |
Usage Example: Multi-Model Workflow
@Injectable()
export class MyWorkflowService {
  constructor(
    private readonly llmRunnableService: LLMRunnableService,
  ) {}

  async process(data: MyData) {
    // Fast model for simple tasks
    const summary = await this.llmRunnableService.invoke({
      promptName: PromptName.SUMMARIZER,
      state: { content: data.content },
      model: ModelNameEnum.CLAUDE_HAIKU_4_5, // Fast & cheap
    })

    // Powerful model for complex reasoning
    const analysis = await this.llmRunnableService.invoke({
      promptName: PromptName.ANALYZER,
      state: { summary, context: data.context },
      model: ModelNameEnum.GPT_4_1_EU, // 1M context window
    })

    // Batch processing with concurrency control
    const results = await this.llmRunnableService.batch_invoke({
      promptName: PromptName.ITEM_PROCESSOR,
      states: data.items.map(item => ({ item })),
      model: ModelNameEnum.GPT_4_1_APIM, // Load balanced
      batchOptions: { maxConcurrency: 5 },
    })

    return { summary, analysis, results }
  }
}