Tsinghua and Z.ai unveil IndexCache to speed long-context DeepSeek-style sparse attention inference by up to 1.82x

March 27, 2026, 7:00 PM · VentureBeat

Editor's summary

Researchers at Tsinghua University and Z.ai developed IndexCache, an optimizer for DeepSeek Sparse Attention models that reduces redundant computation during long-context inference. In tests with 200,000-token contexts, it delivered up to 1.82x faster time-to-first-token and 1.48x higher generation throughput, including early validation on the 744B-parameter GLM-5 model.
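To put the reported multipliers in more familiar terms, a speedup factor can be converted into a percentage reduction in time. The short sketch below does only that arithmetic; the function names are hypothetical, and the two constants are the "up to" figures quoted above, so the results are best-case values rather than typical ones.

```python
def remaining_fraction(speedup: float) -> float:
    """Fraction of the baseline time still needed after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def percent_reduction(speedup: float) -> float:
    """Percentage of baseline time saved by a given speedup factor."""
    return (1.0 - remaining_fraction(speedup)) * 100.0


# Best-case figures quoted in the summary (200,000-token contexts).
TTFT_SPEEDUP = 1.82      # time-to-first-token
THROUGHPUT_GAIN = 1.48   # generation throughput

if __name__ == "__main__":
    print(f"Time-to-first-token: {percent_reduction(TTFT_SPEEDUP):.1f}% lower")
    print(f"Time per generated token: {percent_reduction(THROUGHPUT_GAIN):.1f}% lower")
```

Under these best-case figures, a hypothetical 10-second wait for the first token would shrink to roughly 5.5 seconds (10 / 1.82).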

Insights

IndexCache targets one of the biggest production bottlenecks for long-context LLMs: the cost of repeatedly scoring large token histories. If broadly adopted, optimizations like this could make long-context AI more practical for enterprise document analysis, agent workflows, and reasoning-heavy applications without requiring entirely new model architectures.
