Study finds transformers can share key-value projections to cut KV cache memory with limited language model quality loss
An arXiv paper systematically tests transformer attention variants that share or collapse query, key, and value projections across synthetic, vision, and language tasks. The strong...


