Sparse Autoencoders Are the SR 11-7 Tool I Wish I'd Had
When the Federal Reserve published SR 11-7 in 2011, “model risk” mostly meant regression scorecards and a few gradient boosters. Documentation was tractable: you could write down every variable, every weight, every monotonicity constraint, and a validator could read it end to end. Fifteen years later, banks want to deploy LLMs into customer-facing pipelines — fraud narrative summarization, complaint triage, agent assist — and the documentation gap is the binding constraint. You cannot put a 70-billion-parameter dense network in a model inventory and call it auditable.
This post is about a tool that has moved from interpretability research into something I now reach for in MRM conversations: sparse autoencoders (SAEs). They don’t make LLMs simple. They make LLMs legible — they turn the soup of a residual stream into a finite, named set of features you can monitor, threshold, and document. That is exactly what SR 11-7 has always asked for, and exactly what I couldn’t deliver to validators in 2022.
What an SAE actually does
A pretrained transformer’s residual stream is a high-dimensional vector that, at any given layer, smears hundreds of overlapping concepts across its dimensions. This is polysemanticity: a single neuron fires for “uppercase letters in code” and “Spanish soccer commentary” and “the third token of a list.” That makes per-neuron interpretation a dead end.
The SAE’s job is to learn an overcomplete dictionary — many more features than dimensions — under a sparsity penalty that forces each input to activate only a handful of features. If it works, each feature ends up monosemantic: it fires for one human-recognizable concept and nothing else.
The training objective is just reconstruction plus an L1 penalty on the sparse code:
$$\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$
The reconstruction term keeps the code $z$ informative; the sparsity term keeps it sparse enough that each feature can be named. Empirically, a $\lambda$ on the order of the 1e-3 default in the code below, paired with a dictionary size 8–64× the model’s hidden dimension, produces features that humans can label.
A toy forward pass
Here is the entire SAE forward pass, which is small enough to fit in a docstring and which validators tend to appreciate seeing in code form alongside the math:
import torch
import torch.nn as nn
import torch.nn.functional as F
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features, bias=False)
        self.dec = nn.Linear(d_features, d_model, bias=False)
        self.bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        z = F.relu(self.enc(x - self.bias))      # sparse code
        x_hat = self.dec(z) + self.bias          # reconstruction
        return x_hat, z

    def loss(self, x: torch.Tensor, lam: float = 1e-3):
        x_hat, z = self(x)
        recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
        sparsity = z.abs().sum(dim=-1).mean()
        return recon + lam * sparsity
Train this on residual-stream activations sampled from your base model and you’ll come out the other side with a dictionary of features. The interesting work — and where most teams underinvest — is the labeling pipeline that turns each feature index into a human description.
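As a sketch of what that labeling pipeline looks like, the function below pulls the top-activating snippets for one feature, assuming `acts` is a matrix of stored residual-stream activations aligned with a list of `texts`; the `call_labeling_llm` helper at the end is hypothetical and stands in for whatever stronger model drafts labels before human review:

import torch

@torch.no_grad()
def top_activating_examples(sae, acts, texts, feature_idx, k=20):
    """Return the k texts whose activations fire a given feature hardest.

    acts:  (n_examples, d_model) residual-stream activations, one row per text
    texts: list of n_examples strings aligned with acts
    """
    _, z = sae(acts)                          # (n_examples, d_features) sparse codes
    scores = z[:, feature_idx]
    top = scores.topk(k).indices.tolist()
    return [(texts[i], scores[i].item()) for i in top]

def draft_label(examples):
    """Draft a one-line label from top-activating snippets; a human then verifies it."""
    prompt = "Name the single concept these snippets share:\n" + "\n".join(t for t, _ in examples)
    return call_labeling_llm(prompt)          # hypothetical helper, not a real API

The expensive part is not this code; it is the human pass over every label you intend to rely on in a control.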
Why this matters for SR 11-7
I’ve sat through many model risk reviews where the validator’s question was, in essence, what does this model think it knows, and how do you know when that breaks? For a logistic regression you point at the coefficients. For a gradient booster you point at SHAP. For a transformer you used to wave at attention maps and hope.
SAEs concretely change how you can meet three of the SR 11-7 sub-requirements:
- Conceptual soundness. You can document, per layer, which features dominate the model’s representation of inputs in your domain — credit narratives, regulatory disclosures, complaint text. That documentation survives base-model updates better than weight-level analysis, because what you label is features rather than raw weights (see the fine-tune caveat below).
- Outcomes analysis. Instead of monitoring only loss or accuracy, you monitor feature activations. A drift in the firing rate of “mentions of foreclosure” between Q3 and Q4 is a tractable, dashboardable signal (sketched in code just after this list).
- Ongoing monitoring. When a feature behaves anomalously — fires in unexpected contexts, or stops firing altogether — you have a unit of monitoring that maps to a unit of meaning, not just a unit of math.
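To make that concrete, here is a minimal monitoring sketch, assuming `sae` is an instance of the toy SparseAutoencoder above, `acts` is a batch of residual-stream activations from in-domain text, and the threshold and tolerance values are placeholders you would calibrate with your validators:

import torch

@torch.no_grad()
def firing_rate(sae, acts, feature_idx, threshold=0.0):
    """Fraction of examples in a batch where a named feature fires above threshold."""
    _, z = sae(acts)                                   # sparse codes, (n, d_features)
    return (z[:, feature_idx] > threshold).float().mean().item()

def drift_alert(rate_now, rate_baseline, tol=0.25):
    """Illustrative control: flag the feature when its firing rate moves too far."""
    rel_change = abs(rate_now - rate_baseline) / max(rate_baseline, 1e-8)
    return rel_change > tol                            # True -> route to an analyst

The specific tolerance matters less than the fact that the alert condition is stated in terms of a labeled concept an analyst can act on.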
The single biggest unlock isn’t that SAEs make LLMs interpretable in some philosophical sense. It’s that they give you a vocabulary you can put in a control. Validators are pragmatic: they want to know what triggers an alert and what an analyst does next. Features answer both.
What still doesn’t work
I want to be honest about the limits, because the worst thing a tool like this can do is enable theatre.
- Feature labeling is expensive and human-bottlenecked. Pipelines that auto-generate labels using a stronger LLM still need human verification for any feature you put in a control. Plan for it.
- Reconstructions are lossy. A small fraction of model behavior — often the long tail you most want to monitor — sits in directions the SAE doesn’t capture. Don’t claim full coverage.
- Features are not always stable across fine-tunes. If you retrain the base model, you re-run the SAE pipeline and re-label. That’s a real operational tax.
These caveats are exactly the kind of thing that belongs in the limitations section of model documentation. SR 11-7 doesn’t reward perfection — it rewards calibrated honesty about what your model can and cannot do.
Where I’d start
If you’re at a regulated lender and want to pilot this:
- Pick one narrow use case with bounded text — fraud narrative summarization is a good start.
- Train an SAE on the residual stream of the LLM serving that use case, on a few hundred million tokens of in-domain text (a minimal capture-and-train sketch follows this list).
- Label the top 1,000 most-active features by activation frequency. Most signal lives in the top 5%.
- Wire a handful of those features into your monitoring stack as time-series. Start with three or four obviously-domain-relevant features.
- Document the pipeline end-to-end in a model card you’d be comfortable handing to a validator.
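For the capture, training, and ranking steps, here is a minimal sketch assuming a Hugging Face-style causal LM; the layer index, per-text loop, and thresholds are illustrative choices, not a fixed recipe:

import torch

@torch.no_grad()
def collect_residual_stream(model, tokenizer, texts, layer_idx=12, device="cuda"):
    """Grab residual-stream activations at one layer, one row per token position."""
    rows = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True).to(device)
        out = model(**ids, output_hidden_states=True)
        rows.append(out.hidden_states[layer_idx][0])   # (seq, d_model) for this text
    return torch.cat(rows, dim=0)                      # (total_tokens, d_model)

def train_step(sae, acts, opt, lam=1e-3):
    """One optimization step on a batch of activations, using the loss defined above."""
    opt.zero_grad()
    loss = sae.loss(acts, lam=lam)
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def rank_features_by_frequency(sae, acts, threshold=0.0):
    """Order features by how often they fire; label the head of this list first."""
    _, z = sae(acts)
    freq = (z > threshold).float().mean(dim=0)         # firing rate per feature
    return freq.argsort(descending=True)

The ranked list from the last function is what feeds the labeling and monitoring sketches earlier in the post.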
You won’t get perfect coverage. You will get something better than “we cannot decompose the model into features,” which is where most banks are today. That delta is exactly what makes this an SR 11-7 tool, even if the original papers were never written with the Fed in mind.
FAQ
Does using SAEs satisfy SR 11-7 on its own?
No. SR 11-7 is a governance framework, not a technique. SAEs give you something concrete to put in the “conceptual soundness” and “ongoing monitoring” sections of model documentation, but you still need independent validation, change management, and use-test evidence.
How big does the SAE need to be?
The dictionary size needs to be a substantial multiple of the model's hidden dimension — typical setups use 8x to 64x. For a 4096-dim model, that's 32k to 262k features. Cost scales linearly with dictionary size at training time, but inference cost on the base model is unchanged.
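A quick back-of-envelope, based on the toy class above (one encoder matrix, one decoder matrix, one shared bias); the fp32 figure is straight parameter-count arithmetic, not a measured benchmark:

def sae_size(d_model=4096, expansion=32):
    """Rough sizing for the toy SAE above at a given dictionary expansion factor."""
    d_features = d_model * expansion                 # 8x-64x is the typical range
    params = 2 * d_model * d_features + d_model      # enc + dec weights + shared bias
    fp32_gb = params * 4 / 1e9
    return d_features, params, fp32_gb

# sae_size(4096, 64) -> (262144, ~2.1e9 params, ~8.6 GB of fp32 weights)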
What about non-LLM models — gradient boosters, scorecards?
For tabular models you usually don't need SAEs at all. SHAP, partial dependence, and monotonic constraints already give you regulator-ready explanations. SAEs solve a problem specific to dense neural representations where individual neurons are polysemantic.