inclusionAI/Ling-2.6-flash

Ling-2.6-flash (BailingMoeV2_5) instruct model with 104B total / 7.4B active params, hybrid linear + MLA attention, 128K context, optimized for agent workloads

MoE | 104B total / 7.4B active | 131,072-token context | vLLM 0.20.2+ | text

Overview

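Ling-2.6-flash is an instruct-tuned Mixture-of-Experts (MoE) model built on the BailingMoeV2_5 architecture. It has 104B total parameters, of which 7.4B are active, combines linear attention with MLA (Multi-head Latent Attention) layers, and supports a 131,072-token (128K) context window.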

Deployment Configurations

Docker (AMD MI300X / MI325X / MI355X, TP=2)

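MI300X, MI325X, and MI355X GPUs each have large per-GPU HBM capacity (192 GB or more), so two GPUs with tensor parallelism (TP=2) are enough to serve the model at its full 131,072-token context.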
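The sizing claim above can be sanity-checked with a rough calculation. The sketch below assumes BF16 weights (2 bytes per parameter) and an even split of weights across the two tensor-parallel ranks; neither assumption comes from this recipe. The HBM figures are the vendor-published capacities as commonly quoted and should be verified against AMD's specifications. KV cache, activations, and runtime overhead are not modeled.

```python
# Back-of-the-envelope weight-memory check for TP=2.
# Assumptions (not from the recipe): BF16 weights, even sharding across ranks.
TOTAL_PARAMS = 104e9   # total parameters, as stated for Ling-2.6-flash
BYTES_PER_PARAM = 2    # assumed BF16
TP_SIZE = 2            # matches --tensor-parallel-size 2

# Per-GPU HBM capacity in GB (assumed from public spec sheets).
HBM_GB = {"MI300X": 192, "MI325X": 256, "MI355X": 288}


def weights_per_gpu_gb(total_params, bytes_per_param, tp_size):
    """Approximate weight memory held by each GPU, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / tp_size / 1e9


per_gpu = weights_per_gpu_gb(TOTAL_PARAMS, BYTES_PER_PARAM, TP_SIZE)
for gpu, hbm in HBM_GB.items():
    headroom = hbm - per_gpu
    print(f"{gpu}: ~{per_gpu:.0f} GB of weights per GPU, "
          f"~{headroom:.0f} GB left for KV cache and overhead")
```

Under these assumptions each GPU holds roughly 104 GB of weights, which leaves the remainder of its HBM for the KV cache at long context lengths.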

docker run --rm -it \
  --cap-add=SYS_PTRACE \
  --ipc=host \
  --privileged=true \
  --shm-size=128GB \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.20.2 \
    inclusionAI/Ling-2.6-flash \
    --tensor-parallel-size 2 \
    --trust-remote-code
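
Loading a model of this size can take several minutes, so a client script may want to wait until the server answers before sending requests. The following is a minimal sketch that polls the server's /health endpoint. It assumes the server is reachable on localhost port 8000, the same address the client example in this guide uses; wait_until_ready, http_probe, and the default timeout values are illustrative names and numbers, not part of vLLM.

```python
import http.client
import time
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=1800.0,
                     interval_s=5.0, probe=http_probe, sleep=time.sleep):
    """Poll `<base_url>/health` until it succeeds or `timeout_s` elapses.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() before creating the OpenAI client returns True once the server is up, or False if the timeout elapses first.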

Client Usage

Text Generation

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the API key is not checked by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
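
For longer generations it can be more pleasant to print tokens as they arrive. The sketch below is one possible variant of the request above that uses the OpenAI client's stream=True option. collect_stream and stream_poem are hypothetical helper names introduced for illustration; the endpoint and model name are the ones used in this guide.

```python
def collect_stream(chunks, on_token=None):
    """Concatenate the text deltas of a streamed chat completion.

    `chunks` is any iterable of chat-completion chunk objects;
    `on_token`, if given, is called with each non-empty text piece.
    """
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_token is not None:
                on_token(delta)
    return "".join(parts)


def stream_poem(base_url="http://localhost:8000/v1"):
    """Send the same prompt as above, printing text as it is generated."""
    # Imported here so collect_stream itself has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    stream = client.chat.completions.create(
        model="inclusionAI/Ling-2.6-flash",
        messages=[{"role": "user", "content": "Write a poem about the ocean."}],
        max_tokens=512,
        temperature=0.7,
        stream=True,  # yields incremental chunks instead of one response
    )
    return collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```

Calling stream_poem() against a running server prints the poem incrementally and returns the full text once the stream ends.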
