inclusionAI/Ling-2.6-1T
Ling-2.6-1T (BailingMoeV2_5) FP8 instruct model with 1T total / 50B active params, hybrid linear + MLA attention, 262K context
Overview
Ling-2.6-1T is inclusionAI's BailingMoeV2_5 FP8 flagship model with 1T total / 50B active parameters, hybrid linear + MLA attention, and a 262K context window.
Deployment Configurations
Docker (AMD MI300X / MI325X / MI355X, TP=8)
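Tensor parallelism across 8 GPUs (TP=8) has been verified on an MI300X-class node at the full 262K context length, which is derived from the model configuration rather than set on the command line. MI325X and MI355X have more HBM per GPU, so the same command should leave additional memory headroom on those nodes.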
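To make the memory argument concrete, the following is a rough back-of-envelope sketch of the weight footprint per GPU at TP=8. It assumes a round 1e12 parameters at 1 byte each (FP8) split evenly across GPUs, and uses vendor-published HBM capacities (192 GB, 256 GB, 288 GB) that do not come from this guide. It ignores any tensors kept in higher precision, the KV cache, activations, and runtime overhead, so treat the result as an estimate only.

```python
# Back-of-envelope weight memory per GPU for an FP8 checkpoint under
# tensor parallelism.
# Assumptions (not from this guide): exactly 1e12 parameters, 1 byte per
# parameter, an even split across GPUs, and decimal gigabytes.

TOTAL_PARAMS = 1_000_000_000_000  # "1T total" taken as a round number
BYTES_PER_PARAM = 1               # FP8 stores one byte per weight
TP_SIZE = 8                       # matches --tensor-parallel-size 8

# Vendor-published HBM capacity per GPU, in GB (assumed, not from this guide).
HBM_GB = {"MI300X": 192, "MI325X": 256, "MI355X": 288}


def weights_per_gpu_gb(total_params: int, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight bytes held by each GPU, in decimal GB."""
    return total_params * bytes_per_param / tp_size / 1e9


def headroom_gb(hbm_gb: float, weights_gb: float) -> float:
    """HBM left after weights; must still cover KV cache and activations."""
    return hbm_gb - weights_gb


per_gpu = weights_per_gpu_gb(TOTAL_PARAMS, BYTES_PER_PARAM, TP_SIZE)
for gpu, hbm in HBM_GB.items():
    left = headroom_gb(hbm, per_gpu)
    print(f"{gpu}: ~{per_gpu:.0f} GB weights, ~{left:.0f} GB left of {hbm} GB")
```

Under these assumptions each GPU holds roughly 125 GB of weights, which leaves about 67 GB on an MI300X for everything else, and correspondingly more on MI325X and MI355X.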
docker run --rm -it \
  --cap-add=SYS_PTRACE \
  --ipc=host \
  --privileged=true \
  --shm-size=128GB \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.20.2 \
  inclusionAI/Ling-2.6-1T \
  --tensor-parallel-size 8 \
  --trust-remote-code
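A model of this size can take a long time to load, so it may help to wait until the server lists the model before sending requests. The sketch below polls the OpenAI-compatible `GET /v1/models` endpoint using only the standard library. The helper names (`model_ids`, `list_models`, `wait_for_model`) and the default timeout and polling interval are illustrative assumptions, not part of vLLM.

```python
import json
import time
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str, timeout_s: float = 5.0) -> list[str]:
    """Return the model ids reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return model_ids(json.load(resp))


def wait_for_model(
    base_url: str,
    model: str,
    timeout_s: float = 1800.0,  # assumed budget; tune for your storage and node
    interval_s: float = 10.0,
) -> bool:
    """Poll until `model` is listed, or return False once `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if model in list_models(base_url):
                return True
        except (OSError, ValueError, KeyError):
            # Connection refused or timed out, non-JSON body, or an
            # unexpected payload shape: the server is not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

For the command above, calling `wait_for_model("http://localhost:8000/v1", "inclusionAI/Ling-2.6-1T")` returns `True` once the server reports the model, and `False` if the time budget runs out first.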
Client Usage
Text Generation
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="inclusionAI/Ling-2.6-1T",
    messages=[{"role": "user", "content": "Write a poem about the ocean."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
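For longer generations it can be more pleasant to print tokens as they arrive. The following is a minimal sketch of a streaming variant of the request above, using the `stream=True` option of the OpenAI Python client. The `stream_text` helper is a hypothetical name introduced here for illustration, and it takes the client as an argument so that it can be exercised without a running server.

```python
from typing import Iterator


def stream_text(
    client,
    prompt: str,
    model: str = "inclusionAI/Ling-2.6-1T",
    max_tokens: int = 512,
) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.7,
        stream=True,  # ask the server for incremental chunks
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (for example a
        # role-only first chunk or the final chunk), so skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

With the `client` created in the example above, `for piece in stream_text(client, "Write a poem about the ocean."): print(piece, end="", flush=True)` prints the reply incrementally instead of waiting for the full response.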