Language Models use Lookbacks to Track Beliefs

Nikhil Prakash1, Natalie Shapira1, Arnab Sen Sharma1, Christoph Riedl1, Yonatan Belinkov2, Tamar Rott Shaham3, David Bau1, Atticus Geiger4
1Northeastern University, 2Technion - IIT, 3MIT CSAIL, 4Pr(Ai)2R Group

Under Review

How Do Language Models Track Characters' Beliefs?

The ability to infer the mental states of others, known as Theory of Mind (ToM), is an essential aspect of social and collective intelligence. Consequently, numerous studies have explored this capability in contemporary language models (LMs). However, there is no clear consensus on the extent of these capabilities, largely because existing research relies primarily on behavioral testing. That is, it remains unclear whether LMs are leveraging surface-level statistical patterns or have genuinely learned to represent and track mental states. To address this, the present work investigates the belief-tracking mechanisms that may underlie the early signs of ToM in LMs.

Our investigation uncovered a consistent algorithmic pattern, which we refer to as the Lookback Mechanism. The LM first generates reference information, specifically Ordering IDs (OIDs), for each character-object-state triple. It then reasons over this reference information using three distinct lookbacks. The first, the Binding Lookback, identifies the correct state OID. The second, the Answer Lookback, uses the OID obtained from the Binding Lookback to retrieve the corresponding answer state token value. Finally, when an explicit visibility condition between characters is present, the LM employs an additional lookback, called the Visibility Lookback, to integrate information about the observed character, thereby augmenting the observing character's awareness.

CausalToM Dataset

Existing datasets for evaluating the ToM capabilities of LMs are designed for behavioral testing and do not support constructing the counterfactual pairs needed for causal analysis. To address this, we constructed CausalToM, a structured dataset of simple stories in which each story involves two characters, each interacting with a distinct object and causing that object to take on a unique state. We analyze the LM's ability to track characters' beliefs in two distinct settings (a minimal story template is sketched after the list below):

  1. No Visibility: Both characters are unaware of each other's actions (left).
  2. Explicit Visibility: Explicit information about whether a character can/cannot observe the other's actions is provided (right).
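The sketch below shows how such a counterfactual story pair might be templated. The character, object, and state names and the exact phrasing are illustrative assumptions based on the description above, not the actual CausalToM generation code.

```python
# Illustrative CausalToM-style story pair; names, objects, states, and phrasing
# are assumptions, not the actual dataset code.
STORY = (
    "{c1} and {c2} are working in a busy restaurant. "
    "{c1} grabs a {o1} and fills it with {s1}. "
    "{c2} grabs a {o2} and fills it with {s2}."
)
VISIBILITY = " {c2} can observe {c1}'s actions."
QUESTION = " What does {c2} believe the {o1} contains?"

def make_pair(c1="Bob", c2="Carla", o1="cup", o2="bottle",
              s1="coffee", s2="water", s1_cf="milk", visible=False):
    """Return (original, counterfactual) prompts that differ in a single state token."""
    template = STORY + (VISIBILITY if visible else "") + QUESTION
    original = template.format(c1=c1, c2=c2, o1=o1, o2=o2, s1=s1, s2=s2)
    counterfactual = template.format(c1=c1, c2=c2, o1=o1, o2=o2, s1=s1_cf, s2=s2)
    return original, counterfactual

original, counterfactual = make_pair(visible=True)
print(original)        # with visibility, Carla should believe the cup contains "coffee"
print(counterfactual)  # ...and "milk" in the counterfactual
```

Because the two prompts differ only in one state token, any change in the model's answer can be attributed to that token, which is what makes interchange interventions on internal activations interpretable.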

In this work, we investigate the internal mechanisms that enable Llama-3-70B-Instruct to reason about and answer questions concerning characters' beliefs about the state of each object.

The Lookback Mechanism

Our investigations of belief tracking uncovered a recurring pattern of computation that we call a lookback. In a lookback mechanism, source information is copied via attention into two instances: an address copy in the residual stream of a recalled token, and a pointer copy in the residual stream of a lookback token that occurs later in the text. The LM places a payload alongside the address in the recalled token's residual stream, and that payload can be brought forward to the lookback token via attention when needed.

Figure 1: The lookback mechanism is used to perform conditional reasoning. The source token contains information that is copied via attention into two instances, a pointer and an address. Alongside the address in the residual stream sits the payload. If necessary, the model can retrieve the payload by dereferencing the pointer. Solid lines indicate movement via residual connections or attention heads, while the dotted line indicates the attention "looking back" from the pointer to the address.
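The toy example below illustrates the address/pointer/payload picture with a single attention-style dot-product lookup. The vector dimensions and contents are made up for illustration and are not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy residual-stream width

# Source information (e.g., an ordering ID) that is duplicated into two copies.
source = rng.normal(size=d)
payload = rng.normal(size=d)

# Toy residual streams: [first half | second half] = [address or pointer | payload slot].
recalled   = np.concatenate([source, payload])                   # address alongside payload
distractor = np.concatenate([rng.normal(size=d), rng.normal(size=d)])
lookback   = np.concatenate([source, np.zeros(d)])               # pointer, empty payload slot

def dereference(query_resid, key_value_resids):
    """Toy QK-circuit: the query reads the pointer, keys read addresses, values carry payloads."""
    q = query_resid[:d]
    keys = np.stack([r[:d] for r in key_value_resids])
    values = np.stack([r[d:] for r in key_value_resids])
    scores = keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

retrieved, weights = dereference(lookback, [recalled, distractor])
print("attention weights:", weights.round(3))          # nearly all weight on the recalled token
print("payload error:", np.linalg.norm(retrieved - payload).round(3))
```

Because the pointer and address are copies of the same source information, their dot product is large, so attention concentrates on the recalled token and carries its payload forward.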

We identified three key lookback mechanisms that collectively perform belief tracking:

  1. Binding Lookback: Identifies the reference information of the correct state token, i.e., its Ordering ID (OID).
  2. Answer Lookback: Uses the answer state OID from the binding lookback to retrieve the answer state token value.
  3. Visibility Lookback: When an explicit visibility condition between characters is mentioned, this lookback employs additional reference information, called the Visibility ID, to retrieve information about the observed character, augmenting the observing character's awareness.

Belief Tracking with No Visibility Between Characters

When presented with belief tracking tasks in which characters have no visibility of each other's actions, the LM solves the task using Ordering ID assignment, the Binding Lookback, and the Answer Lookback.

Belief Tracking Diagram
Figure 2: Belief tracking with no visibility between characters. The LM assigns Ordering IDs (OIDs) to each character, object, and state that encode their order of appearance. (a) Binding lookback: Address copies of the character and object OIDs are placed alongside the state OID payload in the residual stream of the state tokens, while pointer copies are moved to the final token's residual stream. (b) Answer lookback: An address copy of the state OID is placed alongside the state token payload in the residual stream of the state tokens, while a pointer copy is moved to the final token's residual stream via the binding lookback.

Ordering ID Assignment

The LM processes the input tokens by assigning an Ordering ID (OID) to each crucial token, including the character, object, and state tokens. These OIDs, encoded in a low-rank subspace of the internal activations, serve as reference information indicating whether an entity is the first or second of its type, regardless of its token value.
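One simple way to test whether such OIDs are linearly decodable is to fit a probe on residual-stream activations at the entity tokens. The sketch below uses synthetic stand-in activations; in the actual analysis the activations would be collected from Llama-3-70B-Instruct at the character, object, and state token positions, and the choice of layer is an assumption.

```python
# Sketch: probe whether an ordering ID (first vs. second entity of a type) is
# linearly decodable from residual-stream activations. The activations here are
# synthetic stand-ins for real LM activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def collect_activations(n=200, d=128, seed=0):
    """Placeholder for real residual-stream activations with OID labels (0=first, 1=second)."""
    rng = np.random.default_rng(seed)
    oid_direction = rng.normal(size=d)            # pretend the OID lives in a rank-1 subspace
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d)) + np.outer(labels - 0.5, oid_direction)
    return acts, labels

X, y = collect_activations()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("OID probe accuracy:", probe.score(X_te, y_te))
```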

Binding Lookback

The Binding Lookback is the first operation applied to these OIDs. The character and object OIDs, serving as the source information, are duplicated into two instances each. One copy, referred to as the address, is placed in the residual stream of the state token (the recalled token), alongside the state OID as the payload to transfer. The other copy, known as the pointer, is moved into the residual stream of the final token (the lookback token). These pointer and address copies are then used to form the QK-circuit at the lookback token, which dereferences the state OID payload, transferring it from the state token to the final token.
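Causal claims of this kind are typically tested with interchange interventions (activation patching): copy the residual stream at a chosen layer and token position from a counterfactual story into the original run and check whether the predicted state flips. The sketch below uses HuggingFace transformers with a Llama-style module layout; the layer index, token position, and prompts are illustrative assumptions, and a smaller Llama-family model can be substituted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model studied in this work; layer/position below are hypothetical choices.
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

LAYER, POSITION = 30, -12  # hypothetical layer and state-token position

original = ("Bob and Carla are working in a busy restaurant. Bob grabs a cup and fills it with "
            "coffee. Carla grabs a bottle and fills it with water. "
            "Question: What does Carla believe the bottle contains? Answer:")
counterfactual = original.replace("water", "milk")

def capture_resid(prompt, layer, position):
    """Run `prompt` and capture the residual stream at (layer, position)."""
    store = {}
    def hook(module, inputs, output):
        store["resid"] = output[0][:, position, :].detach().clone()
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt").to(model.device))
    handle.remove()
    return store["resid"]

def run_with_patch(prompt, layer, position, new_resid):
    """Run `prompt`, overwriting the residual stream at (layer, position) with `new_resid`."""
    def hook(module, inputs, output):
        output[0][:, position, :] = new_resid
        return output
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt").to(model.device)).logits
    handle.remove()
    return tok.decode(logits[0, -1].argmax())

cf_resid = capture_resid(counterfactual, LAYER, POSITION)
print("patched prediction:", run_with_patch(original, LAYER, POSITION, cf_resid))
```

If the patched position carries the binding lookback's payload, the prediction on the original prompt should move toward the counterfactual state.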

Answer Lookback

The LM answers the question using the Answer Lookback. The state OID of the correct answer serves as the source information and is copied into two instances. One instance, the address copy of the state OID, remains in the residual stream of the state token (the recalled token), with the state token itself as the payload. The other instance, the pointer copy of the state OID, is transferred to the residual stream of the final token as the payload of the Binding Lookback. This pointer is then dereferenced, bringing the state token payload into the residual stream of the final token.
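A simple way to watch the answer arrive at the final token is a logit-lens-style readout: project the final-token residual stream at each layer through the model's final norm and unembedding and see when the correct state token becomes the top prediction. The model loading, prompt, and Llama-style module paths below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed; any Llama-style layout works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

prompt = ("Bob and Carla are working in a busy restaurant. Bob grabs a cup and fills it with "
          "coffee. Carla grabs a bottle and fills it with water. "
          "Question: What does Carla believe the bottle contains? Answer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project the final-token residual stream at each layer through the final norm and unembedding.
for layer, hidden in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(hidden[:, -1, :]))
    if layer % 10 == 0 or layer == len(out.hidden_states) - 1:
        print(f"layer {layer:3d}: top token = {tok.decode(logits[0].argmax())!r}")
```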

Impact of Visibility Conditions on Belief Tracking

When provided with the additional information that one character (the observing character) can observe the actions of another (the observed character), the LM employs another lookback mechanism, which we refer to as the Visibility Lookback, to incorporate information about the observed character.

Visibility Lookback Diagram
Figure 3: Visibility Lookback - When one character (the observing character) can see another (the observed character), the LM assigns a visibility ID to the visibility sentence. An address copy of this visibility ID remains in the visibility sentence's residual stream. A pointer copy of the visibility ID is transferred to the question's residual stream (lookback tokens). This mechanism allows the model to incorporate the observed character's knowledge into the observing character's belief state.

As illustrated above, the LM first generates a Visibility ID in the residual stream of the visibility sentence, which serves as the source information. The address copy of the visibility ID remains in the residual stream of the visibility sentence, while its pointer copy is transferred to the residual streams of the question tokens, the lookback tokens. The LM then forms a QK-circuit at the lookback tokens and dereferences the visibility ID pointer to bring forward the payload, which contains vital information about the observed character.
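One coarse check consistent with this picture is to measure how much attention the question tokens pay back to the visibility sentence. The sketch below is illustrative only: the layer index and token spans are assumptions and would need to be read off from the actual tokenization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # assumed; any Llama-style layout works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16,
                                             device_map="auto", attn_implementation="eager")
model.eval()

prompt = ("Bob and Carla are working in a busy restaurant. Bob grabs a cup and fills it with "
          "coffee. Carla grabs a bottle and fills it with water. "
          "Carla can observe Bob's actions. "
          "Question: What does Carla believe the cup contains? Answer:")
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

LAYER = 40                                   # hypothetical layer
attn = out.attentions[LAYER][0]              # (heads, query_pos, key_pos)

# Hypothetical token spans; in practice, read them off from the tokenization.
visibility_span = list(range(34, 42))        # tokens of "Carla can observe Bob's actions."
question_span = list(range(42, inputs["input_ids"].shape[1]))

mass = attn[:, question_span][:, :, visibility_span].sum(-1).mean(-1)
print("per-head attention mass from question tokens onto the visibility sentence:")
print(mass.float().cpu().numpy().round(3))
```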

Related Works

Our work builds upon insights from previous research that has investigated large language models from related perspectives:

Finetuning (Prakash et al. 2024) Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau. Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. ICLR 2024.
Notes: Investigated how LLMs perform entity tracking and found that they primarily use position-based reference information to associate entities with their corresponding attributes. It also identified a submechanism similar to the Answer Lookback, through which the model dereferences the reference information to fetch the correct attribute token value.

OID (Dai et al. 2024) Qin Dai, Benjamin Heinzerling, Kentaro Inui. Representational Analysis of Binding in Language Models. EMNLP 2024.
Notes:

DAS (Geiger et al. 2024) Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. Conference on Causal Learning and Reasoning (CLeaR 2024).
Notes: Proposed Distributed Alignment Search (DAS) method to find the alignment between high-level and low-level causal models by learning an orthogonal matrix using gradient descent.
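A minimal, self-contained sketch of the DAS idea on synthetic data: learn an orthogonal map of a toy residual stream so that interchanging a few learned coordinates between two runs interchanges the hypothesized causal variable. The toy encoder, dimensions, and parametrization are stand-ins, not the paper's setup.

```python
import torch

torch.manual_seed(0)
d, k = 64, 4                                         # toy residual width, aligned-subspace size
true_basis = torch.linalg.qr(torch.randn(d, d)).Q    # hidden ground-truth basis

def encode(z):
    """Toy 'residual stream' whose first k coordinates (in the hidden basis) are the causal variable."""
    return true_basis @ z

# Learn an orthogonal matrix R by gradient descent; here it is parametrized as the matrix
# exponential of a skew-symmetric matrix (an assumption; DAS uses an orthogonality parametrization).
A = torch.nn.Parameter(torch.zeros(d, d))
opt = torch.optim.Adam([A], lr=1e-2)
mask = torch.zeros(d); mask[:k] = 1.0

for step in range(1000):
    z_base, z_source = torch.randn(d), torch.randn(d)
    z_target = z_base.clone(); z_target[:k] = z_source[:k]   # what the variable swap should produce

    R = torch.matrix_exp(A - A.T)                             # orthogonal
    patched = (1 - mask) * (R @ encode(z_base)) + mask * (R @ encode(z_source))
    loss = ((R.T @ patched - encode(z_target)) ** 2).mean()   # interchange-intervention loss

    opt.zero_grad(); loss.backward(); opt.step()

print("final interchange loss:", round(loss.item(), 4))
```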

DCM (Davies et al. 2023) Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau. Discovering Variable Binding Circuitry with Desiderata. Workshop on Challenges in Deployable Generative AI at the International Conference on Machine Learning (ICML 2023).
Notes: Proposed Desiderata-based Component Masking (DCM) method to localize components responsible for variable binding in Llama-13B.
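A toy sketch of the DCM idea: learn a sparse mask over stand-in "head outputs" so that patching only the masked components from a counterfactual run satisfies the desideratum. The components, readout, and loss weights are synthetic stand-ins for attention heads in an LM.

```python
import torch

torch.manual_seed(0)
n_heads, d = 16, 8
direction = torch.randn(d); direction /= direction.norm()
CAUSAL_HEAD = 3                                   # the toy component that actually carries the variable

def head_outputs(value):
    """Stand-in per-head outputs; only CAUSAL_HEAD encodes the variable of interest."""
    outs = 0.1 * torch.randn(n_heads, d)
    outs[CAUSAL_HEAD] = value * direction
    return outs

mask_logits = torch.nn.Parameter(torch.zeros(n_heads))
opt = torch.optim.Adam([mask_logits], lr=5e-2)

for step in range(300):
    base, source = head_outputs(-1.0), head_outputs(+1.0)
    m = torch.sigmoid(mask_logits).unsqueeze(-1)
    patched = (1 - m) * base + m * source          # patch only the masked components
    prediction = patched.sum(0) @ direction
    # Desideratum: after patching, the readout should match the source run's value (+1),
    # while a sparsity penalty keeps the mask small.
    loss = (prediction - 1.0) ** 2 + 0.05 * torch.sigmoid(mask_logits).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print("learned mask:", torch.sigmoid(mask_logits).detach().round(decimals=2))
```

After training, the mask should concentrate on the single component that carries the variable, which is the behavior DCM relies on to localize binding circuitry.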