Exploring SLERP Abliteration
Abliteration is a technique for the targeted removal or disabling of specific components or mechanisms within a large language model. It customarily targets the behaviors responsible for generating refusals or safety responses, although other behaviors can be addressed as well. A full discussion of the topic is beyond the scope of this article.
In conventional abliteration of LLMs, a straightforward vector difference is used to compute the refusal direction between the mean activations for notionally harmless and harmful responses. This approach corresponds to linear interpolation:
refusal_dir = harmful_mean - harmless_mean
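Here, harmful_mean and harmless_mean are the mean activations (typically residual-stream states at a chosen layer) over sets of harmful and harmless prompts. A minimal sketch, with random tensors standing in for activations collected from a real model:

import torch

# Stand-ins for activations collected at a chosen layer; in practice these
# come from running harmful/harmless prompt sets through the model
# (shape: [num_prompts, hidden_dim])
harmful_acts = torch.randn(128, 4096)
harmless_acts = torch.randn(128, 4096)

harmful_mean = harmful_acts.mean(dim=0)
harmless_mean = harmless_acts.mean(dim=0)

# Conventional (linear) refusal direction
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()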
However, we propose that Spherical Linear Interpolation (SLERP) could be a viable alternative: we are dealing with high-dimensional spaces in which behavior may be better captured on a hypersphere. SLERP preserves angular relationships, and so better respects language model embeddings that encode semantic meaning on a hypersphere (cosine similarity being a common metric).
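For reference, given unit vectors v0 and v1 separated by the angle omega = arccos(v0 · v1), SLERP is defined as:

slerp(v0, v1, t) = sin((1 - t) * omega) / sin(omega) * v0 + sin(t * omega) / sin(omega) * v1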
SLERP implementation:
import torch

def slerp(v0, v1, t):
    """Spherical linear interpolation between two vectors."""
    # Normalize copies of the inputs to measure the angle between them;
    # the interpolation itself is applied to the original vectors
    v0_norm = v0 / v0.norm()
    v1_norm = v1 / v1.norm()
    # Dot product of the unit vectors (cosine of the angle between them)
    dot = torch.sum(v0_norm * v1_norm)
    # Clamp the dot product to the valid input range of acos
    dot = torch.clamp(dot, -1.0, 1.0)
    # Angle between the vectors
    omega = torch.acos(dot)
    # Edge case: nearly parallel vectors, fall back to linear interpolation
    # (nearly antiparallel vectors, omega ~ pi, are also degenerate)
    if omega < 1e-6:
        return (1 - t) * v0 + t * v1
    # Standard SLERP formula
    sin_omega = torch.sin(omega)
    return torch.sin((1 - t) * omega) / sin_omega * v0 + torch.sin(t * omega) / sin_omega * v1
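A quick sanity check of the function on toy two-dimensional vectors (hypothetical values), confirming that the endpoints are recovered and the midpoint lies on the unit arc:

v0 = torch.tensor([1.0, 0.0])
v1 = torch.tensor([0.0, 1.0])

print(slerp(v0, v1, 0.0))  # tensor([1., 0.]): returns v0
print(slerp(v0, v1, 1.0))  # tensor([0., 1.]): returns v1
print(slerp(v0, v1, 0.5))  # ~tensor([0.7071, 0.7071]): midpoint on the arc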
Alternative refusal-direction calculation via SLERP:
# Normalize means (important for SLERP)
harmful_mean_norm = harmful_mean / harmful_mean.norm()
harmless_mean_norm = harmless_mean / harmless_mean.norm()
# Using t=1 gives the full direction from harmless to harmful
refusal_dir = slerp(harmless_mean_norm, harmful_mean_norm, 1.0) - harmless_mean_norm
refusal_dir = refusal_dir / refusal_dir.norm()
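Once computed, the direction is ablated in the usual way, for example by orthogonalizing weight matrices that write to the residual stream against it. A minimal sketch, with random tensors standing in for a real unit refusal direction and projection weight:

# Stand-ins: in practice, use the SLERP-derived refusal_dir above and the
# model's actual output-projection weights
hidden_dim = 4096
refusal_dir = torch.randn(hidden_dim)
refusal_dir = refusal_dir / refusal_dir.norm()
W = torch.randn(hidden_dim, hidden_dim)

# Remove the refusal component from every output of W:
# W' = W - r (r^T W), where r is the unit refusal direction
W_ablated = W - torch.outer(refusal_dir, refusal_dir) @ W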
The snippets above can be dropped into any Python implementation of abliteration with little modification.
A working SLERP implementation, built on the Transformers library, is available on GitHub.
Code snippets were generated with the assistance of Claude 3.7 Sonnet.
Limitations
Extensive testing and benchmarking against linear abliteration have not yet been performed, owing in part to scarce computing resources, although a basic proof of concept has been promising. The source code has been made available so that others can explore this research direction more deeply.
References
- Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, "Refusal in LLMs is mediated by a single direction", LessWrong, 2024.
- Maxime Labonne, "Uncensor any LLM with abliteration", Hugging Face, 2024.
- Sumandora, "Remove Refusals With Transformers", GitHub, 2024.