inclusionAI/Ling-2.6-1T

Ling-2.6-1T (BailingMoeV2_5) FP8 instruct model with 1T total / 50B active params, hybrid linear + MLA attention, 262K context

MoE | 1T / 50B | 262,144 ctx | vLLM 0.20.2+ | text
Overview

Ling-2.6-1T is inclusionAI's flagship FP8 instruct model, built on the BailingMoeV2_5 architecture. It has 1T total parameters with 50B active, uses hybrid linear + MLA (Multi-head Latent Attention) attention, and supports a 262K-token context window.

Deployment Configurations

Docker (AMD MI300X / MI325X / MI355X, TP=8)

TP=8 has been verified on an MI300X-class node at the full 262K context length derived from the model config. MI325X and MI355X have more HBM per GPU than MI300X, so the same configuration should also fit on those nodes.
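
As a rough back-of-the-envelope check of why TP=8 is a sensible choice, the sketch below estimates the weight footprint per GPU. It assumes about one byte per parameter for every weight (FP8) and even sharding across GPUs, and it ignores any layers kept in higher precision, activations, and runtime overhead. The per-GPU HBM capacities are taken from AMD's public spec sheets, not from this recipe, and the helper names are illustrative only, so treat the headroom figures as approximate.

```python
# Rough weight-memory estimate for Ling-2.6-1T at FP8 with tensor parallelism.
# Assumptions (not from the recipe): 1 byte per parameter for every weight,
# even sharding across GPUs, and HBM sizes from AMD's public spec sheets.

TOTAL_PARAMS = 1_000_000_000_000  # 1T total parameters
BYTES_PER_PARAM = 1               # FP8
TP_SIZE = 8                       # --tensor-parallel-size

HBM_GB = {"MI300X": 192, "MI325X": 256, "MI355X": 288}


def weights_per_gpu_gb(total_params=TOTAL_PARAMS,
                       bytes_per_param=BYTES_PER_PARAM,
                       tp_size=TP_SIZE):
    """Approximate weight footprint per GPU in GB (10^9 bytes)."""
    return total_params * bytes_per_param / tp_size / 1e9


def headroom_gb(gpu, tp_size=TP_SIZE):
    """HBM left per GPU for KV cache, activations, and runtime overhead."""
    return HBM_GB[gpu] - weights_per_gpu_gb(tp_size=tp_size)


for gpu in HBM_GB:
    print(f"{gpu}: ~{weights_per_gpu_gb():.0f} GB weights, "
          f"~{headroom_gb(gpu):.0f} GB headroom per GPU")
```

Under these assumptions, TP=4 would need roughly 250 GB of weights per GPU, which exceeds the HBM of an MI300X and leaves almost nothing on an MI325X.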

docker run --rm -it \
  --cap-add=SYS_PTRACE \
  --ipc=host \
  --privileged=true \
  --shm-size=128GB \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.20.2 \
    inclusionAI/Ling-2.6-1T \
    --tensor-parallel-size 8 \
    --trust-remote-code
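
Loading a model of this size can take a while, so it may help to wait for the server before sending requests. The sketch below polls the `/health` endpoint of vLLM's OpenAI-compatible server and then lists the served models via `/v1/models`. It assumes the default port 8000 (the same one the client example uses); the helper names, the one-hour limit, and the poll interval are illustrative choices, not part of the recipe.

```python
import json
import time
import urllib.request


def server_ready(base_url="http://localhost:8000", timeout_s=5.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False


def wait_for_server(base_url="http://localhost:8000", max_wait_s=3600.0, poll_s=10.0):
    """Poll /health until it succeeds or max_wait_s elapses; return the outcome."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if server_ready(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def served_models(base_url="http://localhost:8000"):
    """Return the model IDs reported by the OpenAI-compatible /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return [model["id"] for model in json.load(resp)["data"]]
```

Once `wait_for_server()` returns `True`, `served_models()` should include `inclusionAI/Ling-2.6-1T`, which is the name to pass as `model` in client requests.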

Client Usage

Text Generation

from openai import OpenAI

# vLLM does not check the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-1T",  # must match the model name passed to the server
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
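
For long prompts, the prompt and the completion share the same 262,144-token window, so `max_tokens` may need to be capped. The helper below is a minimal sketch of that arithmetic. It assumes the server runs with the full model-derived context length and that you already know the prompt's token count (for example from `response.usage.prompt_tokens` on an earlier call); the function name is hypothetical.

```python
CONTEXT_WINDOW = 262_144  # tokens, as listed for Ling-2.6-1T


def clamp_max_tokens(prompt_tokens, requested, context_window=CONTEXT_WINDOW):
    """Cap the requested completion length so prompt + completion fits the window."""
    if prompt_tokens >= context_window:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room "
            f"in a {context_window}-token context"
        )
    return min(requested, context_window - prompt_tokens)


# A short prompt leaves the request unchanged; a near-full prompt is capped.
print(clamp_max_tokens(prompt_tokens=100, requested=512))
print(clamp_max_tokens(prompt_tokens=262_000, requested=512))
```

The result can be passed directly as `max_tokens` in `client.chat.completions.create(...)`.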

References