Kimi Linear 48B A3B Instruct

Kimi Linear 48B A3B Instruct is Moonshot AI's efficient hybrid linear attention model released in October 2025, featuring Kimi Delta Attention (KDA) for 75% less KV cache usage and up to 6x faster decoding at 1M context.

Overview

Kimi Linear introduces a novel hybrid architecture that combines Kimi Delta Attention (KDA) with Multi-Head Latent Attention (MLA) in a 3:1 ratio. This design achieves superior performance while dramatically reducing memory usage and increasing throughput, making it ideal for long-context and agentic workloads.

Key Features

Kimi Delta Attention (KDA): Linear attention with channel-wise gating for fine-grained memory control
75% KV cache reduction: Dramatically lower memory requirements
6x decoding throughput: At 1M context length compared to full attention
1M token context: Native support for extremely long sequences
Drop-in replacement: Can replace full attention architectures

Technical Specifications

Specification	Value
Total Parameters	48B
Active Parameters	3B
Architecture	Hybrid KDA + MLA (3:1 ratio)
Context Length	1M tokens
Attention Design	Channel-wise gating with DPLR matrices
Position Encoding	NoPE (No Position Embeddings)

Availability

Open Weights: Hugging Face (moonshotai/Kimi-Linear-48B-A3B-Instruct)
Inference: vLLM with FLA (Flash Linear Attention) kernel
KDA Kernel: Open-sourced in FLA library

Use Cases

Long-context document analysis
Agentic workflows with extended trajectories
Memory-constrained deployments
High-throughput inference scenarios
Test-time scaling applications

Role in Series

Moonshot AI's Kimi models offer different capabilities:

Kimi Linear: Most efficient, best for long context and throughput (this model)
Kimi K2 0905: Agentic coding specialist, 256K context
Kimi K2 Thinking: Extended reasoning with tool orchestration
Kimi K2.5: Maximum capability with vision and Agent Swarm

MoonshotAI: Kimi Linear 48B A3B Instruct

Model Type