
How LLM Inference Works at Parallel

TL;DR: To switch models or providers, just change model: ModelNameEnum.XXX in your code. That's it. The infrastructure handles everything else.

The One-Line Model Change

typescript
// Using GPT-4.1 on Azure OpenAI
const response = await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.GPT_4_1_EU,  // ← Change this...
})

// Using Claude on Anthropic
const response = await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.CLAUDE_SONNET_4_5,  // ← ...to this
})

No other code changes. No provider-specific logic. No configuration updates. The application layer stays completely provider-agnostic.


Available Models

| Model | Enum Value | Provider | Context Window |
|---|---|---|---|
| GPT-4o | ModelNameEnum.GPT_4O | Azure OpenAI | 128K tokens |
| GPT-4.1 | ModelNameEnum.GPT_4_1 | Azure OpenAI | 1M tokens |
| GPT-4.1 EU | ModelNameEnum.GPT_4_1_EU | Azure OpenAI | 1M tokens |
| GPT-4.1 APIM | ModelNameEnum.GPT_4_1_APIM | Azure APIM (load balanced) | 1M tokens |
| Claude Sonnet 4.5 | ModelNameEnum.CLAUDE_SONNET_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Claude Haiku 4.5 | ModelNameEnum.CLAUDE_HAIKU_4_5 | Anthropic (Azure AI Foundry) | 200K tokens |
| Mistral Large 3 | ModelNameEnum.MISTRAL_LARGE_3 | Mistral (Azure AI Foundry) | 256K tokens |

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                          APPLICATION LAYER                                   │
│                                                                             │
│   llmRunnableService.invoke({ model: ModelNameEnum.XXX, ... })              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────────────┐
│                          LLMRunnableService                                  │
│                                                                             │
│   • Routes to correct provider based on model name                          │
│   • Fetches prompts from LangChain Hub                                      │
│   • Validates token counts against model limits                             │
│   • Post-processes responses (encoding cleanup, schema unwrapping)          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                    ┌───────────────┼───────────────┐
                    │               │               │
                    ▼               ▼               ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│  AzureOpenAIService │ │  AnthropicService   │ │   MistralService    │ ...
│                     │ │                     │ │                     │
│ Map<Model,Instance> │ │ Map<Model,Instance> │ │ Map<Model,Instance> │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
          │                       │                       │
          ▼                       ▼                       ▼
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ AzureOpenAI         │ │ Anthropic           │ │ Mistral             │
│ InstanceService     │ │ InstanceService     │ │ InstanceService     │
│                     │ │                     │ │                     │
│ • AzureChatOpenAI   │ │ • ChatAnthropic     │ │ • AzureChatOpenAI   │
│ • Rate limiter      │ │ • Rate limiter      │ │ • Rate limiter      │
│ • Config            │ │ • Config            │ │ • Config            │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘

Three-Layer Design

  1. Model Registry (ModelNameEnum + ModelService)

    • Defines all available models as enum values
    • Groups models by provider
    • Provides token limits per model
  2. Instance Layer (per-provider InstanceService; see the sketch after this list)

    • One instance per model, created at startup
    • Handles provider-specific configuration (endpoints, API keys)
    • Includes rate limiting via Bottleneck (60 RPM)
    • Returns a LangChain BaseChatModel
  3. Service Orchestration (LLMRunnableService)

    • Single entry point for all LLM calls
    • Routes requests to the correct provider
    • Handles prompt loading, token validation, response processing
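
A minimal sketch of the instance layer (item 2 above), assuming the Bottleneck package and LangChain's AzureChatOpenAI client; the constructor options and method names here are illustrative, not the production implementation:

typescript
import Bottleneck from 'bottleneck'
import { AzureChatOpenAI } from '@langchain/openai'
import type { BaseChatModel } from '@langchain/core/language_models/chat_models'

// Illustrative per-model wrapper: one instance per model, created at startup.
export class AzureOpenAIInstanceService {
    // Bottleneck limiter budgeted at 60 requests per minute
    private readonly limiter = new Bottleneck({
        reservoir: 60,
        reservoirRefreshAmount: 60,
        reservoirRefreshInterval: 60_000,
    })

    private readonly model: BaseChatModel

    constructor(modelKey: string) {
        // Provider-specific configuration comes from per-model env vars (see Configuration Reference).
        this.model = new AzureChatOpenAI({
            azureOpenAIApiDeploymentName: process.env[`AZURE_DEPLOYMENT_NAME_${modelKey}`],
            azureOpenAIApiKey: process.env[`AZURE_API_KEY_${modelKey}`],
            azureOpenAIApiInstanceName: process.env[`AZURE_INSTANCE_NAME_${modelKey}`],
            azureOpenAIApiVersion: process.env[`AZURE_API_VERSION_${modelKey}`],
            maxRetries: 3,
        })
    }

    // Exposes the LangChain chat model so callers can .pipe() prompts into it.
    getModel(): BaseChatModel {
        return this.model
    }

    // Every invocation is funneled through the rate limiter.
    schedule<T>(fn: () => Promise<T>): Promise<T> {
        return this.limiter.schedule(fn)
    }
}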

Execution Flow

When you call llmRunnableService.invoke():

┌─────────────────────────────────────────────────────────────────┐
│  1. GET INSTANCE                                                │
│     getInstanceServiceForModel(model)                           │
│     → Checks which provider has this model loaded               │
│     → Returns the pre-configured InstanceService                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  2. FETCH PROMPT                                                │
│     LangchainHubService.getPrompt(promptName, promptTag)        │
│     → Pulls prompt template from LangChain Hub                  │
│     → Supports version tags for prompt management               │
└─────────────────────────────────────────────────────────────────┘
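
In code, the prompt fetch is a thin wrapper around the LangChain Hub client. A rough sketch, assuming the hub module from the langchain package; the tag convention shown here is an assumption:

typescript
import * as hub from 'langchain/hub'
import type { ChatPromptTemplate } from '@langchain/core/prompts'

// Pulls a versioned prompt template, e.g. "my-agent" or "my-agent:production".
async function getPrompt(promptName: string, promptTag?: string): Promise<ChatPromptTemplate> {
    const reference = promptTag ? `${promptName}:${promptTag}` : promptName
    return hub.pull<ChatPromptTemplate>(reference)
}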

┌─────────────────────────────────────────────────────────────────┐
│  3. FILTER STATE                                                │
│     Only keeps state keys that match prompt.inputVariables      │
│     (e.g., medicalInformation, documents, etc.)                 │
└─────────────────────────────────────────────────────────────────┘
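
The filtering itself is a small operation over the prompt's declared input variables; a minimal sketch (the function name is hypothetical):

typescript
// Keep only the state keys the prompt template declares as input variables.
function filterState(
    state: Record<string, unknown>,
    inputVariables: string[],
): Record<string, unknown> {
    return Object.fromEntries(
        Object.entries(state).filter(([key]) => inputVariables.includes(key)),
    )
}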

┌─────────────────────────────────────────────────────────────────┐
│  4. TOKEN VALIDATION                                            │
│     countTokens(model, prompt + state)                          │
│     If > model limit:                                           │
│       → Reduces state by truncating document content            │
│     Uses appropriate tokenizer per provider                     │
└─────────────────────────────────────────────────────────────────┘
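
A rough sketch of this pre-flight check, assuming the js-tiktoken package; the encoding choice and truncation strategy are illustrative, since the real code selects a tokenizer per provider:

typescript
import { getEncoding } from 'js-tiktoken'

// Illustrative encoding; production code picks the tokenizer per provider/model.
const encoder = getEncoding('cl100k_base')

function countTokens(text: string): number {
    return encoder.encode(text).length
}

// If prompt + state exceed the model limit, shrink the biggest contributor
// (document content) until the request fits the remaining token budget.
function truncateToBudget(documentText: string, tokenBudget: number): string {
    const tokens = encoder.encode(documentText)
    return tokens.length <= tokenBudget ? documentText : encoder.decode(tokens.slice(0, tokenBudget))
}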

┌─────────────────────────────────────────────────────────────────┐
│  5. BUILD & EXECUTE RUNNABLE                                    │
│     runnable = prompt.pipe(instanceModel)                       │
│     response = await runnable.invoke(state)                     │
│     → Rate-limited by Bottleneck                                │
│     → Retries on failure (maxRetries: 3)                        │
└─────────────────────────────────────────────────────────────────┘
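
Step 5 boils down to piping the prompt into the pre-configured chat model and invoking it through the instance's rate limiter. A sketch, reusing the hypothetical schedule() helper from the instance-layer sketch above:

typescript
import type { BaseChatModel } from '@langchain/core/language_models/chat_models'
import type { ChatPromptTemplate } from '@langchain/core/prompts'

// Compose prompt and model into a runnable, then execute it with the filtered state.
async function runPrompt(
    prompt: ChatPromptTemplate,
    chatModel: BaseChatModel,                               // configured with maxRetries: 3
    filteredState: Record<string, unknown>,
    schedule: <T>(fn: () => Promise<T>) => Promise<T>,      // Bottleneck wrapper from the instance
) {
    const runnable = prompt.pipe(chatModel)
    return schedule(() => runnable.invoke(filteredState))
}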

┌─────────────────────────────────────────────────────────────────┐
│  6. POST-PROCESS RESPONSE                                       │
│     • Unwrap schema-formatted responses                         │
│     • Remove title/description fields                           │
│     • Fix encoding issues (unicode escapes, null bytes, etc.)   │
└─────────────────────────────────────────────────────────────────┘

                        Return typed response

Why Not Traditional Dependency Injection?

You might wonder: "Why not inject LLMService with different implementations for each provider?"

Traditional DI patterns don't fit well here for several reasons:

1. Dynamic Runtime Configuration

Each model requires different environment variables discovered at runtime:

env
# Azure OpenAI models use per-model env vars
AZURE_DEPLOYMENT_NAME_GPT_4O=...
AZURE_API_KEY_GPT_4O=...

# Anthropic uses shared credentials
ANTHROPIC_API_KEY=...
ANTHROPIC_API_INSTANCE_NAME=...

# Mistral has its own configuration
MISTRAL_API_KEY=...
MISTRAL_API_INSTANCE_NAME=...

A DI container can't elegantly handle "create N instances with configuration discovered at startup."
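
Instead, startup discovery can simply iterate the provider's model keys and skip anything that isn't configured. A sketch of that loop, using the AZURE_OPENAI_MODEL_KEYS list and instance map described elsewhere in this document (the method name and availability check are illustrative):

typescript
// Illustrative startup registration inside the Azure OpenAI provider service.
private registerAvailableModels(): void {
    for (const key of AZURE_OPENAI_MODEL_KEYS) {
        // A missing API key means the model simply isn't available in this environment.
        if (!process.env[`AZURE_API_KEY_${key}`]) {
            continue
        }
        this.modelToInstanceMap.set(ModelNameEnum[key], new AzureOpenAIInstanceService(key))
    }
}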

2. Multiple Instances Per Provider

We need one instance per model, not one instance per provider. A single AzureOpenAIService manages multiple AzureOpenAIInstanceService instances, each with its own rate limiter and configuration.

3. Graceful Degradation

If a model's environment variables are missing, we log a warning and continue. The app doesn't crash—it just doesn't have that model available:

typescript
try {
    const instanceService = new AnthropicInstanceService(...)
    this.modelToInstanceMap.set(modelName, instanceService)
} catch (error) {
    this.loggerService.warn(`Failed to load Anthropic instance...`)
    // Continue loading other models
}

4. Runtime Model Selection

The calling code chooses which model to use at runtime:

typescript
model: shouldUseFastModel ? ModelNameEnum.CLAUDE_HAIKU_4_5 : ModelNameEnum.GPT_4_1_EU

This isn't a compile-time decision—it's based on business logic, feature flags, or request parameters.

The Actual Pattern: Service Locator with Registry

We use a service locator pattern with a registry approach:

  • Provider services act as registries: Map<ModelNameEnum, InstanceService>
  • LLMRunnableService queries each provider to find who owns a given model
  • The caller doesn't know or care which provider backs their model

typescript
private async getInstanceServiceForModel(model: ModelNameEnum): Promise<InstanceService> {
    if (this.anthropicService.hasModel(model)) {
        return this.anthropicService.getInstanceService(model)
    }
    if (this.mistralService.hasModel(model)) {
        return this.mistralService.getInstanceService(model)
    }
    if (this.azureOpenAIService.hasModel(model)) {
        return this.azureOpenAIService.getInstanceService(model)
    }
    if (this.azureAPIMService.hasModel(model)) {
        return this.azureAPIMService.getInstanceService(model)
    }
    throw new Error(`No provider found for model: ${model}`)
}
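
On the provider side, each service is a thin registry over its instances. A sketch of that shape, based on the Map described above (the error message and exact typings are illustrative):

typescript
// Sketch of a provider-side registry; each provider service follows the same shape.
export class AnthropicService {
    private readonly modelToInstanceMap = new Map<ModelNameEnum, AnthropicInstanceService>()

    hasModel(model: ModelNameEnum): boolean {
        return this.modelToInstanceMap.has(model)
    }

    getInstanceService(model: ModelNameEnum): AnthropicInstanceService {
        const instanceService = this.modelToInstanceMap.get(model)
        if (!instanceService) {
            throw new Error(`Model not registered with this provider: ${model}`)
        }
        return instanceService
    }
}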

Configuration Reference

Azure OpenAI Models

Per-model environment variables:

env
AZURE_DEPLOYMENT_NAME_GPT_4O=gpt-4o
AZURE_API_VERSION_GPT_4O=2024-02-15-preview
AZURE_API_KEY_GPT_4O=xxx
AZURE_INSTANCE_NAME_GPT_4O=my-azure-instance

# Repeat for GPT_4_1, GPT_4_1_EU...

Azure APIM (Load Balanced)

env
AZURE_DEPLOYMENT_NAME_GPT_4_1_APIM=gpt-4.1-apim
AZURE_API_VERSION_GPT_4_1_APIM=2025-01-01-preview
AZURE_API_KEY_GPT_4_1_APIM=xxx
AZURE_INSTANCE_NAME_GPT_4_1_APIM=apim-parallel

Anthropic (Azure AI Foundry)

Shared credentials for all Anthropic models:

env
ANTHROPIC_API_KEY=xxx
ANTHROPIC_API_INSTANCE_NAME=my-ai-foundry-instance

Mistral (Azure AI Foundry)

Shared credentials for all Mistral models:

env
MISTRAL_API_KEY=xxx
MISTRAL_API_INSTANCE_NAME=my-ai-foundry-instance
MISTRAL_API_VERSION=2024-05-01-preview

Adding a New Model

1. Add to ModelNameEnum

typescript
export enum ModelNameEnum {
    // ... existing models
    MY_NEW_MODEL = 'my-new-model',
}

// Add to the appropriate provider's key list
export const AZURE_OPENAI_MODEL_KEYS: (keyof typeof ModelNameEnum)[] = [
    'GPT_4O', 'GPT_4_1', 'GPT_4_1_EU', 'MY_NEW_MODEL'
]

2. Add Token Limit in ModelService

typescript
getModelLimit(): number {
    switch (this.modelName) {
        // ... existing cases
        case ModelNameEnum.MY_NEW_MODEL:
            return 128000
    }
}

3. Set Environment Variables

env
AZURE_DEPLOYMENT_NAME_MY_NEW_MODEL=my-deployment
AZURE_API_VERSION_MY_NEW_MODEL=2024-02-15-preview
AZURE_API_KEY_MY_NEW_MODEL=xxx
AZURE_INSTANCE_NAME_MY_NEW_MODEL=my-instance

4. Use It

typescript
await this.llmRunnableService.invoke({
    promptName: PromptName.MY_AGENT,
    state,
    model: ModelNameEnum.MY_NEW_MODEL,
})

Key Design Decisions

| Aspect | Implementation |
|---|---|
| Provider Agnosticism | Application code uses ModelNameEnum, never provider-specific classes |
| Prompt Management | Externalized in LangChain Hub with version tags |
| Multi-model Support | One instance per model, selected at call time |
| Rate Limiting | Per-instance Bottleneck (60 RPM, prevents throttling) |
| Token Safety | Pre-flight token counting with automatic state reduction |
| Structured Output | Response unwrapping + encoding cleanup |
| Batch Processing | batch_invoke() for parallel calls with concurrency control |
| LangChain Integration | All providers return BaseChatModel, enabling .pipe() chains |

Usage Example: Multi-Model Workflow

typescript
@Injectable()
export class MyWorkflowService {
    constructor(
        private readonly llmRunnableService: LLMRunnableService,
    ) {}

    async process(data: MyData) {
        // Fast model for simple tasks
        const summary = await this.llmRunnableService.invoke({
            promptName: PromptName.SUMMARIZER,
            state: { content: data.content },
            model: ModelNameEnum.CLAUDE_HAIKU_4_5,  // Fast & cheap
        })

        // Powerful model for complex reasoning
        const analysis = await this.llmRunnableService.invoke({
            promptName: PromptName.ANALYZER,
            state: { summary, context: data.context },
            model: ModelNameEnum.GPT_4_1_EU,  // 1M context window
        })

        // Batch processing with concurrency control
        const results = await this.llmRunnableService.batch_invoke({
            promptName: PromptName.ITEM_PROCESSOR,
            states: data.items.map(item => ({ item })),
            model: ModelNameEnum.GPT_4_1_APIM,  // Load balanced
            batchOptions: { maxConcurrency: 5 },
        })

        return { summary, analysis, results }
    }
}