EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
Abstract
EpiCache is a KV cache management framework for long conversational question answering that reduces memory usage and improves accuracy through block-wise prefill, episodic KV compression, and adaptive layer-wise budget allocation.
Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
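To make the abstract's three components concrete, here is a minimal, illustrative Python sketch of the general approach it describes: conversation segments are clustered into topical episodes, a fixed KV token budget is split across layers according to a per-layer sensitivity score, and each episode keeps only the cached tokens most attended by an episode-representative query. All function names, tensor shapes, and the sensitivity proxy are assumptions made for illustration; this is not the EpiCache implementation (the code has not yet been released).

```python
"""
Minimal sketch of episodic KV cache compression in the spirit of EpiCache.
NOT the authors' implementation; every name, shape, and heuristic below is an
illustrative assumption. It shows three ideas from the abstract:
  1) cluster conversation segments into topical "episodes",
  2) allocate a fixed KV budget across layers by a per-layer sensitivity score,
  3) per episode, evict KV entries and keep the tokens most attended by that
     episode's representative query.
"""
import numpy as np


def cluster_episodes(seg_embeddings: np.ndarray, num_episodes: int, iters: int = 20):
    """Plain k-means over per-segment embeddings; returns an episode id per segment."""
    rng = np.random.default_rng(0)
    centers = seg_embeddings[rng.choice(len(seg_embeddings), num_episodes, replace=False)]
    for _ in range(iters):
        # Assign each segment to its nearest centroid.
        dists = np.linalg.norm(seg_embeddings[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids (keep the old centroid if a cluster emptied out).
        for k in range(num_episodes):
            members = seg_embeddings[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels


def allocate_layer_budgets(sensitivity: np.ndarray, total_budget: int, min_per_layer: int = 8):
    """Split a total token budget across layers in proportion to a sensitivity score
    (e.g., how much a layer's output changes when its cache is pruned).
    Rounding may leave the sum slightly off the total; ignored for this sketch."""
    weights = sensitivity / sensitivity.sum()
    return np.maximum(min_per_layer, np.floor(weights * total_budget).astype(int))


def evict_for_episode(attn_scores: np.ndarray, layer_budgets: np.ndarray):
    """attn_scores: [num_layers, seq_len] attention mass each layer places on each
    cached token for an episode-representative query. Returns, per layer, the
    indices of the tokens to KEEP under that layer's budget."""
    keep = []
    for layer, scores in enumerate(attn_scores):
        k = min(int(layer_budgets[layer]), scores.shape[0])
        top = np.argsort(scores)[-k:]   # the most-attended tokens survive eviction
        keep.append(np.sort(top))       # preserve original token order
    return keep


if __name__ == "__main__":
    # Toy run with random stand-ins for embeddings and attention statistics.
    num_segments, emb_dim, num_layers, seq_len = 40, 64, 8, 512
    rng = np.random.default_rng(1)
    episodes = cluster_episodes(rng.normal(size=(num_segments, emb_dim)), num_episodes=4)
    budgets = allocate_layer_budgets(rng.random(num_layers) + 0.1, total_budget=1024)
    kept = evict_for_episode(rng.random((num_layers, seq_len)), budgets)
    print("episode sizes:", np.bincount(episodes))
    print("per-layer budgets:", budgets, "-> kept per layer:", [len(i) for i in kept])
```

In this sketch the block-wise prefill from the abstract is left out; the point is only how an episode-specific eviction decision can be combined with sensitivity-weighted per-layer budgets under one fixed total budget.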
Community
Episodic KV cache management for multi-turn conversations on resource-constrained devices.
Code release soon!
This is super interesting and valuable work!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Retrospective Sparse Attention for Efficient Long-Context Generation (2025)
- KVCompose: Efficient Structured KV Cache Compression with Composite Tokens (2025)
- CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation (2025)
- StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding (2025)
- PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference (2025)
- HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs (2025)
- SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning (2025)