
How LLM Inference Works at Parallel

TL;DR: To switch models or providers, just change model: ModelNameEnum.XXX in your code. That's it. The infrastructure handles everything else.

The One-Line Model Change

typescript
// Using GPT-4.1 on Azure OpenAI
const response = await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.GPT_4_1_EU,  // ← Change this...
})

// Using Claude on Anthropic
const response = await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.CLAUDE_SONNET_4_5,  // ← ...to this
})

No other code changes. No provider-specific logic. No configuration updates. The application layer stays completely provider-agnostic.


Available Models

| Model | Enum Value | Provider | Context Window |
|---|---|---|---|
| GPT-4o | ModelNameEnum.GPT_4O | Azure OpenAI | 128K tokens |
| GPT-4.1 | ModelNameEnum.GPT_4_1 | Azure OpenAI | 1M tokens |
| GPT-4.1 EU | ModelNameEnum.GPT_4_1_EU | Azure OpenAI | 1M tokens |
| GPT-4.1 APIM | ModelNameEnum.GPT_4_1_APIM | Azure APIM (load balanced) | 1M tokens |
| Claude Sonnet 4.5 | ModelNameEnum.CLAUDE_SONNET_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Claude Haiku 4.5 | ModelNameEnum.CLAUDE_HAIKU_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Mistral Large 3 | ModelNameEnum.MISTRAL_LARGE_3 | Mistral (Azure AI Foundry) | 256K tokens |

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          APPLICATION LAYER                                   │
│                                                                             │
│   llmRunnableService.invoke({ model: ModelNameEnum.XXX, ... })              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│                          LLMRunnableService                                  │
│                                                                             │
│   • Routes to correct provider based on model name                          │
│   • Fetches prompts from LangChain Hub                                      │
│   • Validates token counts against model limits                             │
│   • Post-processes responses (encoding cleanup, schema unwrapping)          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│  AzureOpenAIService │ │  AnthropicService   │ │   MistralService    │ ...
│                     │ │                     │ │                     │
│ Map<Model,Instance> │ │ Map<Model,Instance> │ │ Map<Model,Instance> │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
          │                       │                       │
          ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ AzureOpenAI         │ │ Anthropic           │ │ Mistral             │
│ InstanceService     │ │ InstanceService     │ │ InstanceService     │
│                     │ │                     │ │                     │
│ • AzureChatOpenAI   │ │ • ChatAnthropic     │ │ • AzureChatOpenAI   │
│ • Rate limiter      │ │ • Rate limiter      │ │ • Rate limiter      │
│ • Config            │ │ • Config            │ │ • Config            │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘

Three-Layer Design

  1. Model Registry (ModelNameEnum + ModelService)

    • Defines all available models as enum values
    • Groups models by provider
    • Provides token limits per model
  2. Instance Layer (per-provider InstanceService; see the sketch after this list)

    • One instance per model, created at startup
    • Handles provider-specific configuration (endpoints, API keys)
    • Includes rate limiting via Bottleneck (60 RPM)
    • Returns a LangChain BaseChatModel
  3. Service Orchestration (LLMRunnableService)

    • Single entry point for all LLM calls
    • Routes requests to the correct provider
    • Handles prompt loading, token validation, response processing
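
A minimal sketch of the instance layer (item 2 above), assuming the Bottleneck package and LangChain's AzureChatOpenAI client; the constructor options and method names here are illustrative, not the production implementation:

typescript
import Bottleneck from 'bottleneck'
import { AzureChatOpenAI } from '@langchain/openai'
import type { BaseChatModel } from '@langchain/core/language_models/chat_models'

// Illustrative per-model wrapper: one instance per model, created at startup.
export class AzureOpenAIInstanceService {
    // Bottleneck limiter budgeted at 60 requests per minute
    private readonly limiter = new Bottleneck({
        reservoir: 60,
        reservoirRefreshAmount: 60,
        reservoirRefreshInterval: 60_000,
    })

    private readonly model: BaseChatModel

    constructor(modelKey: string) {
        // Provider-specific configuration comes from per-model env vars (see Configuration Reference).
        this.model = new AzureChatOpenAI({
            azureOpenAIApiDeploymentName: process.env[`AZURE_DEPLOYMENT_NAME_${modelKey}`],
            azureOpenAIApiKey: process.env[`AZURE_API_KEY_${modelKey}`],
            azureOpenAIApiInstanceName: process.env[`AZURE_INSTANCE_NAME_${modelKey}`],
            azureOpenAIApiVersion: process.env[`AZURE_API_VERSION_${modelKey}`],
            maxRetries: 3,
        })
    }

    // Exposes the LangChain chat model so callers can .pipe() prompts into it.
    getModel(): BaseChatModel {
        return this.model
    }

    // Every invocation is funneled through the rate limiter.
    schedule<T>(fn: () => Promise<T>): Promise<T> {
        return this.limiter.schedule(fn)
    }
}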

Execution Flow

When you call llmRunnableService.invoke():

┌─────────────────────────────────────────────────────────────────┐
│  1. GET INSTANCE                                                │
│     getInstanceServiceForModel(model)                           │
│     → Checks which provider has this model loaded               │
│     → Returns the pre-configured InstanceService                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  2. FETCH PROMPT                                                │
│     LangchainHubService.getPrompt(promptName, promptTag)        │
│     → Pulls prompt template from LangChain Hub                  │
│     → Supports version tags for prompt management               │
└─────────────────────────────────────────────────────────────────┘
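
In code, the prompt fetch is a thin wrapper around the LangChain Hub client. A rough sketch, assuming the hub module from the langchain package; the tag convention shown here is an assumption:

typescript
import * as hub from 'langchain/hub'
import type { ChatPromptTemplate } from '@langchain/core/prompts'

// Pulls a versioned prompt template, e.g. "my-agent" or "my-agent:production".
async function getPrompt(promptName: string, promptTag?: string): Promise<ChatPromptTemplate> {
    const reference = promptTag ? `${promptName}:${promptTag}` : promptName
    return hub.pull<ChatPromptTemplate>(reference)
}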

┌─────────────────────────────────────────────────────────────────┐
│  3. FILTER STATE                                                │
│     Only keeps state keys that match prompt.inputVariables      │
│     (e.g., medicalInformation, documents, etc.)                 │
└─────────────────────────────────────────────────────────────────┘
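
The filtering itself is a small operation over the prompt's declared input variables; a minimal sketch (the function name is hypothetical):

typescript
// Keep only the state keys the prompt template declares as input variables.
function filterState(
    state: Record<string, unknown>,
    inputVariables: string[],
): Record<string, unknown> {
    return Object.fromEntries(
        Object.entries(state).filter(([key]) => inputVariables.includes(key)),
    )
}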

┌─────────────────────────────────────────────────────────────────┐
│  4. TOKEN VALIDATION                                            │
│     countTokens(model, prompt + state)                          │
│     If > model limit:                                           │
│       → Reduces state by truncating document content            │
│     Uses appropriate tokenizer per provider                     │
└─────────────────────────────────────────────────────────────────┘
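
A rough sketch of this pre-flight check, assuming the js-tiktoken package; the encoding choice and truncation strategy are illustrative, since the real code selects a tokenizer per provider:

typescript
import { getEncoding } from 'js-tiktoken'

// Illustrative encoding; production code picks the tokenizer per provider/model.
const encoder = getEncoding('cl100k_base')

function countTokens(text: string): number {
    return encoder.encode(text).length
}

// If prompt + state exceed the model limit, shrink the biggest contributor
// (document content) until the request fits the remaining token budget.
function truncateToBudget(documentText: string, tokenBudget: number): string {
    const tokens = encoder.encode(documentText)
    return tokens.length <= tokenBudget ? documentText : encoder.decode(tokens.slice(0, tokenBudget))
}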

┌─────────────────────────────────────────────────────────────────┐
│  5. BUILD & EXECUTE RUNNABLE                                    │
│     runnable = prompt.pipe(instanceModel)                       │
│     response = await runnable.invoke(state)                     │
│     → Rate-limited by Bottleneck                                │
│     → Retries on failure (maxRetries: 3)                        │
└─────────────────────────────────────────────────────────────────┘
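
Step 5 boils down to piping the prompt into the pre-configured chat model and invoking it through the instance's rate limiter. A sketch, reusing the hypothetical schedule() helper from the instance-layer sketch above:

typescript
import type { BaseChatModel } from '@langchain/core/language_models/chat_models'
import type { ChatPromptTemplate } from '@langchain/core/prompts'

// Compose prompt and model into a runnable, then execute it with the filtered state.
async function runPrompt(
    prompt: ChatPromptTemplate,
    chatModel: BaseChatModel,                               // configured with maxRetries: 3
    filteredState: Record<string, unknown>,
    schedule: <T>(fn: () => Promise<T>) => Promise<T>,      // Bottleneck wrapper from the instance
) {
    const runnable = prompt.pipe(chatModel)
    return schedule(() => runnable.invoke(filteredState))
}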

┌─────────────────────────────────────────────────────────────────┐
│  6. POST-PROCESS RESPONSE                                       │
│     • Unwrap schema-formatted responses                         │
│     • Remove title/description fields                           │
│     • Fix encoding issues (unicode escapes, null bytes, etc.)   │
└─────────────────────────────────────────────────────────────────┘

                        Return typed response

Why Not Traditional Dependency Injection?

You might wonder: "Why not inject LLMService with different implementations for each provider?"

Traditional DI patterns don't fit well here for several reasons:

1. Dynamic Runtime Configuration

Each model requires different environment variables discovered at runtime:

env
# Azure OpenAI models use per-model env vars
AZURE_DEPLOYMENT_NAME_GPT_4O=...
AZURE_API_KEY_GPT_4O=...

# Anthropic uses shared credentials
ANTHROPIC_API_KEY=...
ANTHROPIC_API_INSTANCE_NAME=...

# Mistral has its own configuration
MISTRAL_API_KEY=...
MISTRAL_API_INSTANCE_NAME=...

A DI container can't elegantly handle "create N instances with configuration discovered at startup."
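
Instead, startup discovery can simply iterate the provider's model keys and skip anything that isn't configured. A sketch of that loop, using the AZURE_OPENAI_MODEL_KEYS list and instance map described elsewhere in this document (the method name and availability check are illustrative):

typescript
// Illustrative startup registration inside the Azure OpenAI provider service.
private registerAvailableModels(): void {
    for (const key of AZURE_OPENAI_MODEL_KEYS) {
        // A missing API key means the model simply isn't available in this environment.
        if (!process.env[`AZURE_API_KEY_${key}`]) {
            continue
        }
        this.modelToInstanceMap.set(ModelNameEnum[key], new AzureOpenAIInstanceService(key))
    }
}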

2. Multiple Instances Per Provider

We need one instance per model, not one instance per provider. A single AzureOpenAIService manages multiple AzureOpenAIInstanceService instances, each with its own rate limiter and configuration.

3. Graceful Degradation

If a model's environment variables are missing, we log a warning and continue. The app doesn't crash—it just doesn't have that model available:

typescript
try {
    const instanceService = new AnthropicInstanceService(...)
    this.modelToInstanceMap.set(modelName, instanceService)
} catch (error) {
    this.loggerService.warn(`Failed to load Anthropic instance...`)
    // Continue loading other models
}

4. Runtime Model Selection

The calling code chooses which model to use at runtime:

typescript
model: shouldUseFastModel ? ModelNameEnum.CLAUDE_HAIKU_4_5 : ModelNameEnum.GPT_4_1_EU

This isn't a compile-time decision—it's based on business logic, feature flags, or request parameters.

The Actual Pattern: Service Locator with Registry

We use a service locator pattern with a registry approach:

  • Provider services act as registries: Map<ModelNameEnum, InstanceService>
  • LLMRunnableService queries each provider to find who owns a given model
  • The caller doesn't know or care which provider backs their model

typescript
private async getInstanceServiceForModel(model: ModelNameEnum): Promise<InstanceService> {
    if (this.anthropicService.hasModel(model)) {
        return this.anthropicService.getInstanceService(model)
    }
    if (this.mistralService.hasModel(model)) {
        return this.mistralService.getInstanceService(model)
    }
    if (this.azureOpenAIService.hasModel(model)) {
        return this.azureOpenAIService.getInstanceService(model)
    }
    if (this.azureAPIMService.hasModel(model)) {
        return this.azureAPIMService.getInstanceService(model)
    }
    throw new Error(`No provider found for model: ${model}`)
}
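
On the provider side, each service is a thin registry over its instances. A sketch of that shape, based on the Map described above (the error message and exact typings are illustrative):

typescript
// Sketch of a provider-side registry; each provider service follows the same shape.
export class AnthropicService {
    private readonly modelToInstanceMap = new Map<ModelNameEnum, AnthropicInstanceService>()

    hasModel(model: ModelNameEnum): boolean {
        return this.modelToInstanceMap.has(model)
    }

    getInstanceService(model: ModelNameEnum): AnthropicInstanceService {
        const instanceService = this.modelToInstanceMap.get(model)
        if (!instanceService) {
            throw new Error(`Model not registered with this provider: ${model}`)
        }
        return instanceService
    }
}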

Configuration Reference

Azure OpenAI Models

Per-model environment variables:

env
AZURE_DEPLOYMENT_NAME_GPT_4O=gpt-4o
AZURE_API_VERSION_GPT_4O=2024-02-15-preview
AZURE_API_KEY_GPT_4O=xxx
AZURE_INSTANCE_NAME_GPT_4O=my-azure-instance

# Repeat for GPT_4_1, GPT_4_1_EU...

Azure APIM (Load Balanced)

env
AZURE_DEPLOYMENT_NAME_GPT_4_1_APIM=gpt-4.1-apim
AZURE_API_VERSION_GPT_4_1_APIM=2025-01-01-preview
AZURE_API_KEY_GPT_4_1_APIM=xxx
AZURE_INSTANCE_NAME_GPT_4_1_APIM=apim-parallel

Anthropic (Azure AI Foundry)

Shared credentials for all Anthropic models:

env
ANTHROPIC_API_KEY=xxx
ANTHROPIC_API_INSTANCE_NAME=my-ai-foundry-instance

Mistral (Azure AI Foundry)

Shared credentials for all Mistral models:

env
MISTRAL_API_KEY=xxx
MISTRAL_API_INSTANCE_NAME=my-ai-foundry-instance
MISTRAL_API_VERSION=2024-05-01-preview

Adding a New Model

1. Add to ModelNameEnum

typescript
export enum ModelNameEnum {
    // ... existing models
    MY_NEW_MODEL = 'my-new-model',
}

// Add to the appropriate provider's key list
export const AZURE_OPENAI_MODEL_KEYS: (keyof typeof ModelNameEnum)[] = [
    'GPT_4O', 'GPT_4_1', 'GPT_4_1_EU', 'MY_NEW_MODEL'
]

2. Add Token Limit in ModelService

typescript
getModelLimit(): number {
    switch (this.modelName) {
        // ... existing cases
        case ModelNameEnum.MY_NEW_MODEL:
            return 128000
    }
}

3. Set Environment Variables

env
AZURE_DEPLOYMENT_NAME_MY_NEW_MODEL=my-deployment
AZURE_API_VERSION_MY_NEW_MODEL=2024-02-15-preview
AZURE_API_KEY_MY_NEW_MODEL=xxx
AZURE_INSTANCE_NAME_MY_NEW_MODEL=my-instance

4. Use It

typescript
await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.MY_NEW_MODEL,
})

Key Design Decisions

| Aspect | Implementation |
|---|---|
| Provider Agnosticism | Application code uses ModelNameEnum, never provider-specific classes |
| Prompt Management | Externalized in LangChain Hub with version tags |
| Multi-model Support | One instance per model, selected at call time |
| Rate Limiting | Per-instance Bottleneck (60 RPM, prevents throttling) |
| Token Safety | Pre-flight token counting with automatic state reduction |
| Structured Output | Response unwrapping + encoding cleanup |
| Batch Processing | batch_invoke() for parallel calls with concurrency control |
| LangChain Integration | All providers return BaseChatModel, enabling .pipe() chains |

Usage Example: Multi-Model Workflow

typescript
@Injectable()
export class MyWorkflowService {
    constructor(
        private readonly llmRunnableService: LLMRunnableService,
    ) {}

    async process(data: MyData) {
        // Fast model for simple tasks
        const summary = await this.llmRunnableService.invoke({
            promptName: PromptName.SUMMARIZER,
            state: { content: data.content },
            model: ModelNameEnum.CLAUDE_HAIKU_4_5,  // Fast & cheap
        })

        // Powerful model for complex reasoning
        const analysis = await this.llmRunnableService.invoke({
            promptName: PromptName.ANALYZER,
            state: { summary, context: data.context },
            model: ModelNameEnum.GPT_4_1_EU,  // 1M context window
        })

        // Batch processing with concurrency control
        const results = await this.llmRunnableService.batch_invoke({
            promptName: PromptName.ITEM_PROCESSOR,
            states: data.items.map(item => ({ item })),
            model: ModelNameEnum.GPT_4_1_APIM,  // Load balanced
            batchOptions: { maxConcurrency: 5 },
        })

        return { summary, analysis, results }
    }
}