Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

A closed-form projector built from cross-attention activations rather than text-encoder anchors. PURE erases target concepts while preserving the rest, even under paraphrased and adversarial prompts.

¹CSE, POSTECH · ²GSAI, POSTECH
Preprint · May 2026
Comparison between text and activation bases
FIG. 01 Prior closed-form methods build a shared text-space basis from anchor prompt embeddings. PURE instead captures layer-specific cross-attention activations during denoising and constructs an activation-space basis for each U-Net cross-attention layer.
01

Overview

Concept unlearning erases a target concept from a pretrained text-to-image diffusion model without retraining. Existing closed-form methods rely on the text encoder's response to short anchor prompts, so paraphrased prompts that evoke the concept without naming it slip past the edit. PURE (Projection in U-Net Rendering for Erasure) builds forget and retain bases directly from per-layer cross-attention activations along a short denoising trajectory, then applies a single closed-form linear projector to the cross-attention K/V weights. On the HUB benchmark spanning 10 concepts across style, IP, celebrity, and NSFW, PURE achieves the best forget-retain trade-off in every category.

02

Why activation bases?

We build a basis from a small set of short anchor prompts, either from text-encoder embeddings or from cross-attention activations, then train a binary classifier in that basis. We measure recall on a held-out set of natural prompts that describe the same concept in longer, more varied form than the anchors.

A text-space basis catches only a small fraction of these held-out prompts. A cross-attention activation basis recovers 5 to 7 times more of them across the artistic style, intellectual property, and celebrity categories.

This is the core motivation for PURE: erase the target where the model actually represents it.
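
The probing protocol described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the features are random stand-ins rather than real text-encoder or cross-attention outputs, the classifier type is not specified in the text so a least-squares linear probe is swapped in, and `top_k_basis`, `fit_linear_probe`, `predict`, and `recall` are hypothetical helper names.

```python
import numpy as np

def top_k_basis(anchors, k):
    """Top-k right-singular vectors of the anchor feature matrix, shape (d, k)."""
    _, _, Vt = np.linalg.svd(anchors, full_matrices=False)
    return Vt[:k].T

def fit_linear_probe(Z, y):
    """Least-squares linear probe on basis coordinates Z (n, k); y in {0, 1}.
    The choice of probe is an assumption, not taken from the source."""
    A = np.hstack([Z, np.ones((len(Z), 1))])        # append a bias column
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w

def predict(Z, w):
    A = np.hstack([Z, np.ones((len(Z), 1))])
    return (A @ w > 0).astype(int)

def recall(y_true, y_pred):
    """Fraction of true positives that the probe flags as positive."""
    positives = y_true == 1
    return float((y_pred[positives] == 1).mean())

# Synthetic stand-in features (NOT real model outputs).
rng = np.random.default_rng(0)
d = 16
concept = np.zeros(d)
concept[0] = 2.0                                    # toy "concept direction"

def sample(n, has_concept):
    noise = 0.1 * rng.normal(size=(n, d))
    return noise + (concept if has_concept else 0.0)

anchors = sample(5, True)                           # short anchor prompts
basis = top_k_basis(anchors, k=1)

X_train = np.vstack([sample(20, True), sample(20, False)])
y_train = np.array([1] * 20 + [0] * 20)
w = fit_linear_probe(X_train @ basis, y_train)

X_heldout = sample(30, True)                        # held-out concept prompts
heldout_recall = recall(np.ones(30, dtype=int), predict(X_heldout @ basis, w))
```

In a real comparison, the same protocol would be run twice, once with text-encoder embeddings and once with cross-attention activations as the feature source, and the two held-out recall values compared.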

FIG. 02 Binary-probing recall on natural prompts (↑).
03

Method

Cross-attention in the diffusion U-Net. At every cross-attention layer \(\ell\), image features form queries \(Q^\ell\), and the text embedding \(e\) is projected into keys and values by learned weights \(W_K^\ell\) and \(W_V^\ell\). The post-attention activation at one query position is

$$h^\ell = \mathrm{softmax}\!\left(\frac{Q^\ell {K^\ell}^{\!\top}}{\sqrt{d^\ell}}\right) V^\ell.$$

\(W_K^\ell\) and \(W_V^\ell\) are the two matrices that decide how text content flows into the image. PURE edits exactly these, in closed form, given a small forget anchor set \(\mathcal{A}_f\) (short phrasings of the target concept) and a retain anchor set \(\mathcal{A}_r\) (phrasings of neighboring concepts to preserve). The full procedure takes three steps and uses no gradient descent.
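
As a minimal sketch of the cross-attention formula above, the following NumPy snippet computes post-attention activations for a single head. The shapes, the random inputs, and the output-by-input weight layout (`W_K` of shape `(d, d_text)`, so keys are `e @ W_K.T`) are assumptions made for illustration; a real U-Net layer is multi-head and has additional projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def cross_attention(Q, e, W_K, W_V):
    """Single-head cross-attention.

    Q:   (n_query, d)       image-side queries
    e:   (n_token, d_text)  text embeddings
    W_K, W_V: (d, d_text)   key/value weights, output-by-input (assumed layout)
    """
    K = e @ W_K.T                                   # (n_token, d)
    V = e @ W_V.T                                   # (n_token, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n_query, n_token)
    return attn @ V                                 # (n_query, d)

rng = np.random.default_rng(0)
n_query, n_token, d, d_text = 16, 8, 32, 64         # hypothetical sizes
Q = rng.normal(size=(n_query, d))
e = rng.normal(size=(n_token, d_text))
W_K = rng.normal(size=(d, d_text))
W_V = rng.normal(size=(d, d_text))
h = cross_attention(Q, e, W_K, W_V)                 # post-attention activations
```

Text content reaches `h` only through `W_K` and `W_V`, which is why editing those two matrices is enough to change what a prompt can render.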

STEP
01
ACTIVATION COLLECTION

Run a short denoising trajectory

For each anchor prompt, run a short denoising trajectory with a few random latents. At every cross-attention layer \(\ell\) and every step, read the post-attention activation and mean-pool over the spatial axis.

Stack the rows into per-layer activation matrices \(H_F^\ell\) (from forget anchors) and \(H_R^\ell\) (from retain anchors).
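
The pooling and stacking in this step can be sketched as follows. The snippet does not hook a real U-Net: `fake_activations` is a random stand-in for the per-step layer outputs, and `pool_and_stack` as well as all sizes are hypothetical.

```python
import numpy as np

def pool_and_stack(activations):
    """Mean-pool each (n_spatial, d) activation over space and stack the rows.

    `activations` holds one array per (anchor prompt, latent, denoising step)
    for a single cross-attention layer.
    """
    rows = [a.mean(axis=0) for a in activations]    # each row has shape (d,)
    return np.stack(rows, axis=0)                   # (n_samples, d)

# Stand-in for hooked layer outputs: 3 prompts x 2 latents x 5 steps (hypothetical).
rng = np.random.default_rng(0)
n_prompts, n_latents, n_steps, n_spatial, d = 3, 2, 5, 64, 32
fake_activations = [
    rng.normal(size=(n_spatial, d))
    for _ in range(n_prompts * n_latents * n_steps)
]
H_F = pool_and_stack(fake_activations)              # per-layer forget matrix
```

The same function would be called once per layer with retain-anchor activations to obtain the retain matrix.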

STEP
02
SUBSPACE ESTIMATION

SVD on activation matrices

Take the SVD of \(H_F^\ell\) and \(H_R^\ell\), keep the top right-singular vectors up to a cumulative-variance threshold, and form orthonormal bases \(V_F^\ell, V_R^\ell\) (subspace bases, distinct from the value matrix \(V^\ell\)) and projectors

$$P_F^\ell = V_F^\ell {V_F^\ell}^{\!\top}, \qquad P_R^\ell = V_R^\ell {V_R^\ell}^{\!\top}.$$
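
A minimal sketch of the subspace estimation, assuming that "cumulative variance" means the cumulative share of squared singular values and that the activation matrix is not centered first. The threshold value, the function names, and the low-rank test matrix are illustrative assumptions.

```python
import numpy as np

def activation_basis(H, var_threshold=0.95):
    """Orthonormal basis (d, k) from the top right-singular vectors of H (n, d).

    k is the smallest rank whose cumulative squared singular values reach
    `var_threshold` of the total. The default threshold is an assumption.
    """
    _, s, Vt = np.linalg.svd(H, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(energy, var_threshold)) + 1
    k = min(k, len(s))                              # guard against round-off
    return Vt[:k].T

def projector(V):
    """Orthogonal projector onto the column span of an orthonormal V."""
    return V @ V.T

# Toy activation matrix of rank 3 (random stand-in, not real activations).
rng = np.random.default_rng(0)
H = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 32))
V_F = activation_basis(H, var_threshold=0.999)
P_F = projector(V_F)
```

Because the rows of `H` are samples and the columns are activation features, the right-singular vectors are the ones that span the feature space.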
STEP
03
CLOSED-FORM EDIT

Left-multiply the K/V weights

Compose the edit operator and apply it once to each layer's cross-attention key and value matrices:

$$E^\ell = I - P_F^\ell\,(I - P_R^\ell)$$
$$W_K^\ell \leftarrow E^\ell W_K^\ell, \qquad W_V^\ell \leftarrow E^\ell W_V^\ell.$$

Because the basis is built from what the U-Net renders rather than what the user happens to type, the edit generalizes to paraphrased and adversarial prompts that the anchor set does not literally contain.
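
The edit itself is a pair of matrix products. The toy example below uses coordinate axes as forget and retain bases so the effect is easy to read off; the sizes, the axis choices, and the output-by-input weight layout are assumptions for illustration.

```python
import numpy as np

def edit_operator(P_F, P_R):
    """E = I - P_F (I - P_R)."""
    I = np.eye(P_F.shape[0])
    return I - P_F @ (I - P_R)

def apply_edit(W, E):
    # W is assumed to be stored as (d, d_text), so left-multiplication
    # acts on the output (activation) side of the weight.
    return E @ W

d, d_text = 6, 4                                    # hypothetical sizes
I = np.eye(d)
V_F = I[:, [0, 1]]      # forget subspace: axes 0 and 1
V_R = I[:, [1, 2]]      # retain subspace: axes 1 and 2 (axis 1 is shared)
E = edit_operator(V_F @ V_F.T, V_R @ V_R.T)

rng = np.random.default_rng(0)
W_K = rng.normal(size=(d, d_text))
W_K_edited = apply_edit(W_K, E)
```

In this toy case only axis 0, which is in the forget subspace but not in the retain subspace, is zeroed out; the shared axis 1 and every other axis pass through unchanged.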

Relationship to CURE. PURE inherits the projection-and-cancellation form of CURE and edits the same cross-attention K/V matrices. The change is what gets projected: per-layer cross-attention activations rather than text-encoder embeddings. The switch in basis source forces the projector to be applied by left-multiplication instead of right-multiplication.
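
One way to read the change in multiplication side, assuming weights are stored output-by-input so that a key is \(k = W_K^\ell e\) with \(W_K^\ell \in \mathbb{R}^{d^\ell \times d_{\text{text}}}\), and taking \(d^\ell\) as the activation width. The symbols \(d_{\text{text}}\) and \(P_{\text{text}}\) are introduced here for illustration only. A text-space projector acts on the input side of the weight, while an activation-space operator acts on its output side:

$$\underbrace{W_K^\ell\, P_{\text{text}}}_{\text{text-space edit}},\;\; P_{\text{text}} \in \mathbb{R}^{d_{\text{text}} \times d_{\text{text}}} \qquad \text{vs.} \qquad \underbrace{E^\ell\, W_K^\ell}_{\text{activation-space edit}},\;\; E^\ell \in \mathbb{R}^{d^\ell \times d^\ell}.$$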

04

Results

We report the H-mean, the harmonic mean of four metrics (target proportion, within-category retention, attack robustness, and quality), on the HUB benchmark. PURE scores highest in every category and raises the average from 0.543 for the next-best baseline (CURE) to 0.640, a gain of 9.7 points.
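
To illustrate why a harmonic mean is a strict summary, the sketch below combines four scores. It assumes each metric has already been oriented so that higher is better; the exact HUB definitions are not reproduced here, and the input values are made up for illustration.

```python
from statistics import harmonic_mean

def h_mean(target_score, retention, robustness, quality):
    """Harmonic mean of four scores in (0, 1], each assumed higher-is-better."""
    return harmonic_mean([target_score, retention, robustness, quality])

balanced = h_mean(0.5, 0.5, 0.5, 0.5)    # all four metrics equal
lopsided = h_mean(0.9, 0.9, 0.9, 0.1)    # one weak metric (hypothetical values)
arithmetic = (0.9 + 0.9 + 0.9 + 0.1) / 4
```

The lopsided case has an arithmetic mean of 0.7 but a harmonic mean of about 0.3, so a method cannot score well by excelling at forgetting while failing at retention, or the reverse.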

| Method | Style | IP | Celebrity | NSFW | Average |
|---|---|---|---|---|---|
| SD (no edit) | 0.462 | 0.331 | 0.469 | 0.482 | 0.436 |
| ESD | 0.599 | 0.551 | 0.640 | 0.312 | 0.526 |
| MACE | 0.614 | 0.578 | 0.584 | 0.296 | 0.518 |
| Receler | 0.525 | 0.377 | 0.610 | 0.518 | 0.508 |
| UCE | 0.328 | 0.654 | 0.657 | 0.470 | 0.527 |
| CURE | 0.565 | 0.571 | 0.572 | 0.465 | 0.543 |
| PURE | 0.655 | 0.683 | 0.693 | 0.528 | 0.640 |
Table 1. H-mean on the HUB benchmark (higher is better). PURE achieves the best score in every category.
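
The Average column can be checked against the category scores with a few lines of arithmetic. The values below are copied from Table 1; reading the Average as the arithmetic mean of the four category columns is an inference from the numbers, and the variable names are illustrative.

```python
# Rows of Table 1: (Style, IP, Celebrity, NSFW, reported Average).
table1 = {
    "SD (no edit)": (0.462, 0.331, 0.469, 0.482, 0.436),
    "ESD":          (0.599, 0.551, 0.640, 0.312, 0.526),
    "MACE":         (0.614, 0.578, 0.584, 0.296, 0.518),
    "Receler":      (0.525, 0.377, 0.610, 0.518, 0.508),
    "UCE":          (0.328, 0.654, 0.657, 0.470, 0.527),
    "CURE":         (0.565, 0.571, 0.572, 0.465, 0.543),
    "PURE":         (0.655, 0.683, 0.693, 0.528, 0.640),
}

def category_mean(row):
    """Arithmetic mean of the four category scores in a table row."""
    return sum(row[:4]) / 4

# Largest disagreement between recomputed and reported averages.
max_rounding_gap = max(abs(category_mean(r) - r[4]) for r in table1.values())

# Gain of PURE over CURE in points, computed from the reported averages.
gain_points = 100 * (table1["PURE"][4] - table1["CURE"][4])
```

Every recomputed mean agrees with the reported Average to within three-decimal rounding, and the gap between the reported PURE and CURE averages is 9.7 points.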
FIG. 03 Qualitative comparison on HUB forget and retain prompts. Each pair of consecutive rows shows a forget prompt on top and its corresponding retain prompt below. Training-based methods often damage neighboring concepts while suppressing the target; prior closed-form methods preserve them but leave noticeable leakage. PURE achieves stronger target suppression while preserving retain-image quality across categories.
05

Ablation: anchor set size

How sensitive is the method to how many anchor prompts we collect? We sweep the size of the forget anchor set \(|\mathcal{A}_f|\) and the retain anchor set \(|\mathcal{A}_r|\) independently and measure target detection on the held-out prompts (lower target = better forget) and retention on the related concepts (higher retain = better preservation).

The text-basis variant is brittle in both directions. Adding more forget anchors damages retention; adding more retain anchors causes the target to leak back in. The activation-basis variant stays stable across the entire sweep.

Forget anchor scaling
FIG. 04(a) Sweep over \(|\mathcal{A}_f|\): more forget anchors damage text-basis retention; activation-basis retention is stable.
Retain anchor scaling
FIG. 04(b) Sweep over \(|\mathcal{A}_r|\): with the text basis, more retain anchors let the target leak back in; the activation basis keeps the target suppressed.
Qualitative comparison as forget anchor set size grows
FIG. 05 Qualitative sweep over \(|\mathcal{A}_f|\). Forget: "Pikachu runs up a mountain with the sun setting behind it." Retain: "Mario standing in a Mushroom Kingdom street." The text-basis edit increasingly damages the retain image as more forget anchors are added; the activation-basis edit preserves it.
Qualitative comparison as retain anchor set size grows
FIG. 06 Qualitative sweep over \(|\mathcal{A}_r|\). Forget: "Pikachu sitting on a pile of hay in a rustic barn." Retain: "Snoopy sitting on a vintage motorcycle in a sunny desert landscape." Under the text basis, Pikachu reappears as the retain set grows; the activation basis keeps it suppressed.
06

Citation

If you find this work useful, please cite:

@article{moon2026pure,
  title   = {Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models},
  author  = {Saemi Moon and Suhyeon Jun and Seoyeon Lee and Dongwoo Kim},
  journal = {arXiv preprint arXiv:2605.25765},
  year    = {2026}
}