Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts
Abstract
LLMs processing mixed contexts tend to prioritize less prevalent information, which can degrade response quality; RW-Steering, a two-stage fine-tuning approach, improves LLM safety by enabling the model to identify and ignore inappropriate signals.
Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with varying amounts of inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed contexts? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing both relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To address this, we introduce RW-Steering, a two-stage fine-tuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
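For readers unfamiliar with the Rescorla-Wagner model, the sketch below shows the classic update rule from animal associative learning that the paper takes as its starting point; the paper's LLM-specific adaptation is not reproduced here, and the function name, salience values, and cue schedule are illustrative assumptions.

```python
# Classic Rescorla-Wagner update: every cue present on a trial shares the same
# prediction error (lam - total), which is how competing signals interact.
# All parameter values and the cue schedule below are illustrative assumptions.

def rescorla_wagner_step(V, present, alpha, beta, lam):
    """One trial: update the associative strengths V[i] of the cues present."""
    total = sum(V[i] for i in present)            # combined prediction of the present cues
    for i in present:
        V[i] += alpha[i] * beta * (lam - total)   # delta V_i = alpha_i * beta * (lambda - sum V)
    return V

# Two competing "contextual signals": cue 0 is prevalent (every trial),
# cue 1 is rare (every tenth trial).
V = [0.0, 0.0]
alpha = [0.3, 0.3]
for t in range(50):
    present = [0, 1] if t % 10 == 0 else [0]
    V = rescorla_wagner_step(V, present, alpha, beta=0.5, lam=1.0)
print(V)  # final associative strengths of the prevalent and rare cues
```

In the classic model, cues compete for a shared prediction error; the paper adapts an analogous formulation to quantify how relevant and inappropriate passages compete for influence over an LLM's response.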
Community
We introduce RW-Steering, a Rescorla–Wagner-inspired two-stage fine-tuning method that addresses the mixed-context vulnerability of LLMs. Even minimal inappropriate content can disproportionately degrade responses; RW-Steering enables models to detect and discount such interference, outperforming alignment and filtering baselines.
▶ Towards robust and safe LLMs: RW-Steering generalizes across diverse contamination levels, improves response quality by 39.8%, and provides a practical solution for safer, more reliable LLMs (an illustrative probing sketch follows below).
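Purely as a hypothetical illustration (not the paper's Poisoned Context Testbed or released code), the sketch below shows one way to probe the behavior curve mentioned above: assemble contexts at several contamination ratios and score the resulting responses. `query_llm` and `score_response` are assumed placeholders for a model call and a response-quality judge; the ratios, context size, and sample counts are arbitrary choices.

```python
import random

def build_mixed_context(relevant, inappropriate, ratio, k=10):
    """Sample k passages, a `ratio` fraction of them inappropriate, then shuffle."""
    n_bad = round(ratio * k)
    passages = random.sample(inappropriate, n_bad) + random.sample(relevant, k - n_bad)
    random.shuffle(passages)
    return "\n\n".join(passages)

def behavior_curve(query, relevant, inappropriate, query_llm, score_response):
    """Average response quality at each contamination ratio (hypothetical harness)."""
    curve = {}
    for ratio in (0.0, 0.1, 0.25, 0.5, 0.75):
        scores = []
        for _ in range(20):  # average over random context samples
            context = build_mixed_context(relevant, inappropriate, ratio)
            answer = query_llm(f"Context:\n{context}\n\nQuestion: {query}")
            scores.append(score_response(query, answer))
        curve[ratio] = sum(scores) / len(scores)
    return curve
```

A flat or gently declining curve would indicate robustness to contamination, while a sharp drop at small ratios would reflect the vulnerability the paper reports for unsteered models.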
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models (2025)
- AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models (2025)
- Conflict-Aware Soft Prompting for Retrieval-Augmented Generation (2025)
- Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs (2025)
- UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases (2025)
- Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models (2025)
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)