GEEK HAUS
2026/05/29/memos-memory-model-lets-teams-upgrade-their-llm

MeMo's memory model lets teams upgrade their LLM without retraining it — and performance jumps 26%

·VentureBeat

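Editor's summary

MeMo, proposed by researchers from several universities, is a framework that encodes new knowledge into a separate, small memory model instead of injecting it directly into the main LLM. The approach applies to both open-source and closed-source models, and in experiments it raised performance by 26% while reducing RAG pipeline noise, context window constraints, and the cost of full retraining.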

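Context

MeMo offers a practical alternative between fine-tuning and RAG for enterprises that must keep updating knowledge while operating LLMs. In particular, its applicability to closed-source models points to the potential spread of modular memory architectures in an enterprise AI market where demands for model ownership and data control are strong.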

Article

Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI: current solutions are either too expensive, too slow, or constrained by context window limits.

MeMo, a framework from researchers at multiple universities, encodes new knowledge into a dedicated smaller memory model that operates separately from the main LLM.

The modular architecture works with both open- and closed-source models and sidesteps the complexity of RAG pipelines and full model retraining.

Experiments show that MeMo handles complex queries reliably even when retrieval pipelines are noisy. It avoids the catastrophic forgetting associated with direct fine-tuning and provides a cost-effective pathway for continuous knowledge updates.

The challenge of updating LLM memory

Large language models are frozen after training and their internal knowledge remains static until they undergo subsequent, computationally massive updates. Currently, developers rely on three main approaches to integrate external knowledge into an LLM, each with distinct drawbacks:

Non-parametric methods, such as retrieval-augmented generation (RAG) and in-context learning, retrieve relevant documents from an external database and insert them directly into the model's prompt. While popular, these methods are limited by context window sizes. As Armando Solar-Lezama, a co-author of the paper, told VentureBeat, “Vector databases have a fundamentally difficult job of encoding the full semantics of a chunk of text in a single vector, and then match that vector to a query, even when the relevance of the chunk... may only be apparent in the context of other chunks.” The researchers note that the semantic similarity of embeddings often does not correspond to what a user's query actually requires. Processing thousands of retrieved tokens also creates substantial computational overhead and inference latency. Most problematically, RAG systems are highly sensitive to noise.
Irrelevant or poorly retrieved passages often degrade the model's final response.

Parametric methods, like continual pretraining or supervised fine-tuning, attempt to internalize new knowledge directly into the LLM's weights. Updating modern, massive LLMs is prohibitively expensive and typically impossible for proprietary, closed-source models hidden behind APIs. Fine-tuning is also prone to causing catastrophic forgetting. Forcing the model to adapt to new corporate data often erodes its previously acquired reasoning capabilities and safety guardrails.

Latent memory methods, such as context compression, offer a middle ground. They compress knowledge into compact "soft tokens" or representations that are added to the model's context during inference. The fatal flaw here is "representation coupling." The compressed memory is strictly bound to the model architecture that produced it; you can't transfer a latent memory trained on an open-source model to a closed-source one.

How MeMo works

The MeMo (Memory as a Model) framework introduces a modular architecture featuring two separate components. The MEMORY model is a small language model trained specifically to encode new knowledge into its parameters. The EXECUTIVE model is a frozen, off-the-shelf LLM that functions as the reasoning engine. When a user asks a question, the EXECUTIVE model treats the MEMORY model as an external oracle, issuing targeted sub-queries to gather facts and synthesizing those facts into a final answer.

The core design principle driving MeMo is the concept of "reflections." Reflections are targeted question-answer (QA) pairs designed to capture every possible angle of a knowledge corpus. Rather than forcing the AI to process a massive, unstructured document corpus during training, MeMo uses a GENERATOR model to distill the raw text into thousands of targeted QA pairs.
The MEMORY model is then fine-tuned on this dataset to answer questions using only its parametric knowledge without the need to read retrieved context.

At inference time, the interaction between the two models follows a structured, three-stage protocol:

1. The EXECUTIVE model decomposes a user's complex query into a set of atomic sub-questions. The MEMORY model answers each independently to establish the basic facts.
2. Using those initial clues, the EXECUTIVE model issues follow-up queries to narrow down candidate entities until it confidently converges on a specific target.
3. Finally, the EXECUTIVE model queries the MEMORY model for supporting facts about that target entity and synthesizes the retrieved snippets into a cohesive answer.

This architecture merges the strengths of the three existing AI memory paradigms while bypassing their pitfalls. It leverages off-the-shelf frontier models by keeping memory storage separate from reasoning, guaranteeing compatibility with both open-weight and closed API models. It internalizes knowledge directly into parameters, but isolates the updates to a smaller, dedicated MEMORY model to pr
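
The three-stage inference protocol described above can be illustrated with a minimal Python sketch. It is not MeMo's actual implementation: the article does not specify any API, so the `TextModel` interface, the `DECOMPOSE:` / `NARROW:` / `TARGET:` / `SYNTHESIZE:` prompt markers, the `max_rounds` limit, the fallback when no target is found, and the toy facts about "Acme" are all assumptions made for illustration. Both models are treated as plain text-in, text-out functions, which is what lets the EXECUTIVE side be either an open-weight model or a closed API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any model is a function from prompt text to reply text.
TextModel = Callable[[str], str]


@dataclass
class Trace:
    """Intermediate results of one query, kept for inspection."""
    sub_answers: List[str] = field(default_factory=list)
    target: str = ""
    evidence: str = ""
    answer: str = ""


def answer_with_memory(query: str, executive: TextModel, memory: TextModel,
                       max_rounds: int = 3) -> Trace:
    trace = Trace()

    # Stage 1: the executive splits the query into atomic sub-questions
    # (one per line, an assumed format) and the memory answers each one.
    plan = executive(f"DECOMPOSE: {query}")
    sub_questions = [line.strip() for line in plan.splitlines() if line.strip()]
    trace.sub_answers = [memory(q) for q in sub_questions]

    # Stage 2: the executive asks follow-ups until it names a target entity.
    clues = list(trace.sub_answers)
    for _ in range(max_rounds):
        reply = executive("NARROW: " + " | ".join(clues))
        if reply.startswith("TARGET:"):
            trace.target = reply[len("TARGET:"):].strip()
            break
        clues.append(memory(reply))  # reply is treated as a follow-up question

    # Stage 3: gather supporting facts about the target, then synthesize.
    # Assumed fallback: with no target, reuse the clues collected so far.
    if trace.target:
        trace.evidence = memory(f"What are the supporting facts about {trace.target}?")
    else:
        trace.evidence = " ".join(clues)
    trace.answer = executive(f"SYNTHESIZE: {query}\nFACTS: {trace.evidence}")
    return trace


# Toy stand-ins so the sketch runs without any real model (made-up facts).
def toy_memory(question: str) -> str:
    facts = {
        "Which company did Jane Roe found?": "Jane Roe founded Acme.",
        "Where is Acme based?": "Acme is based in Oslo.",
        "What are the supporting facts about Acme?":
            "Acme was founded by Jane Roe and is based in Oslo.",
    }
    return facts.get(question, "unknown")


def toy_executive(prompt: str) -> str:
    if prompt.startswith("DECOMPOSE:"):
        return "Which company did Jane Roe found?"
    if prompt.startswith("NARROW:"):
        return "TARGET: Acme" if "Oslo" in prompt else "Where is Acme based?"
    # SYNTHESIZE: echo the facts as the final answer.
    return prompt.split("FACTS: ", 1)[1]


if __name__ == "__main__":
    result = answer_with_memory(
        "Where is the company founded by Jane Roe based?", toy_executive, toy_memory
    )
    print(result.answer)
```

In a real deployment the two stubs would be replaced by calls to a fine-tuned small model and a frozen LLM. The control flow only exchanges strings, so the memory can be retrained or swapped without touching the executive.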
