arXiv:2308.10248

Activation Addition: Steering Language Models Without Optimization

Published on Aug 20, 2023

Abstract

Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them from the activation differences that result from pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet, and replicate the effect on Llama-13B and GPT-J-6B. Our approach yields inference-time control over high-level properties of output and preserves performance on off-target topics. The method requires far less compute and implementation effort than finetuning or RLHF, allows users to specify steering in natural language, and incurs overhead that scales naturally with model size.
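To make the idea concrete, here is a minimal sketch of activation addition on GPT-2 via Hugging Face transformers: a steering vector is computed as the activation difference between a contrast prompt pair, then added back into the residual stream during generation. The layer index, injection coefficient, contrast prompts, and truncation to the shorter prompt are illustrative assumptions for this sketch, not the paper's exact settings.

```python
# Minimal sketch of activation addition (ActAdd-style) on GPT-2.
# LAYER, COEFF, and the "Love"/"Hate" contrast pair are illustrative
# assumptions, not the paper's reported configuration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # transformer block whose input we modify (assumed)
COEFF = 5.0  # steering strength (assumed)
block = model.transformer.h[LAYER]

def resid_in(prompt: str) -> torch.Tensor:
    """Return the residual-stream activations entering block LAYER."""
    cache = {}
    def grab(module, args):
        cache["h"] = args[0].detach()
    handle = block.register_forward_pre_hook(grab)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

# Steering vector: activation difference for a contrast prompt pair,
# truncated to the shorter prompt's length for simplicity.
h_pos, h_neg = resid_in("Love"), resid_in("Hate")
n = min(h_pos.shape[1], h_neg.shape[1])
steer = COEFF * (h_pos[:, :n] - h_neg[:, :n])

def add_steering(module, args):
    hidden = args[0]
    if hidden.shape[1] == 1:       # skip cached single-token decode steps
        return None
    hidden = hidden.clone()
    k = min(n, hidden.shape[1])
    hidden[:, :k] += steer[:, :k]  # bias the front sequence positions
    return (hidden,) + args[1:]

handle = block.register_forward_pre_hook(add_steering)
inputs = tokenizer("I went up to my friend and said", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Running the same prompt with and without the hook contrasts steered and unsteered completions; the paper's reported experiments use larger models (GPT-2 on OpenWebText and ConceptNet, with replications on Llama-13B and GPT-J-6B).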
