PANDORAEraser: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal

1 University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 University of Dayton, Ohio, United States
{vdkhoi, nvloc}@selab.hcmus.edu.vn, tamnguyen@udayton.edu, {tmtriet, ltnghia}@fit.hcmus.edu.vn

Abstract

Removing objects from natural images remains a formidable challenge, often hindered by the inability to synthesize semantically appropriate content in the foreground while preserving background integrity. Existing methods rely on fine-tuning, prompt engineering, or inference-time optimization, yet still struggle to maintain texture consistency, produce rigid or unnatural results, lack precise foreground-background disentanglement, and fail to flexibly handle multiple objects, which ultimately limits their scalability and practical applicability. In this paper, we propose a zero-shot object removal framework that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, no prompts, and no optimization. At its core is our Pixel-wise Attention Dissolution, which performs fine-grained, pixel-wise dissolution of object information by nullifying the most correlated keys for each masked pixel. This operation causes the object to vanish from the self-attention flow, allowing the coherent background context to seamlessly dominate the reconstruction. To complement this, we introduce Localized Attentional Disentanglement Guidance, which steers the denoising process toward latent manifolds that favor clean object removal. Together, Pixel-wise Attention Dissolution and Localized Attentional Disentanglement Guidance enable precise, non-rigid, scalable, and prompt-free multi-object erasure in a single pass. Experiments show that our method outperforms state-of-the-art methods, including fine-tuned and prompt-guided baselines, in both visual fidelity and semantic plausibility.

Object Removal

We propose a zero-shot object removal framework that operates directly on pre-trained diffusion models in a single pass, without any fine-tuning, prompt engineering, or inference-time optimization, thus fully leveraging their latent generative capacity for inpainting.

Approach

Our framework performs zero-shot object removal directly on a pre-trained diffusion model. Given an input image I_s and a binary mask M specifying the target objects, the model produces an edited image I_t where the masked regions are erased and seamlessly reconstructed with contextually consistent background. The process begins with latent inversion to map the input image into the noise space while preserving unaffected regions in the denoising process. We then apply Pixel-wise Attention Dissolution (PAD) to disconnect masked query pixels from their most correlated keys, effectively dissolving object information at the attention level. Next, Localized Attentional Disentanglement Guidance (LADG) steers the denoising trajectory in latent space away from the object regions, refining the reconstruction to suppress residual artifacts.
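The PAD step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `pad_self_attention`, the `top_k` parameter, and the choice to rank keys over all positions (rather than only those inside the mask) are assumptions made for illustration. For each masked query pixel, the sketch sets the scores of its most correlated keys to negative infinity before the softmax, so those keys receive zero attention weight.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries map to weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pad_self_attention(q, k, v, mask, top_k):
    """Self-attention with top-k key nullification for masked queries.

    q, k, v : (N, d) arrays, one row per flattened pixel (hypothetical layout).
    mask    : (N,) boolean array, True for pixels that belong to the object.
    top_k   : number of most correlated keys to nullify per masked query
              (an assumed hyperparameter, not taken from the source).
    """
    n, d = q.shape
    if not 0 <= top_k < n:
        raise ValueError("top_k must satisfy 0 <= top_k < number of keys")

    scores = q @ k.T / np.sqrt(d)            # (N, N) query-key similarities
    edited = scores.copy()

    masked_rows = np.flatnonzero(mask)
    if masked_rows.size and top_k > 0:
        # Indices of the top_k largest scores in each masked query row.
        top_idx = np.argpartition(-scores[masked_rows], top_k - 1, axis=1)[:, :top_k]
        # Nullify those keys so the softmax assigns them zero weight.
        edited[masked_rows[:, None], top_idx] = -np.inf

    attn = softmax(edited, axis=-1)          # rows still sum to 1
    return attn @ v, attn
```

Unmasked query rows are left untouched in this sketch, so background pixels attend exactly as they would in ordinary self-attention, while masked pixels are forced to draw on their remaining, less correlated keys.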

Together, PAD and LADG enable precise, pixel-level control for single- and multi-object removal in a single forward pass, without any fine-tuning, prompt engineering, or inference-time optimization.
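The text above does not give a formula for LADG, so the following is only a generic sketch of mask-localized guidance, written in the style of classifier-free-guidance extrapolation. Everything in it is an assumption: the function name `localized_guidance`, the two noise predictions `eps_removed` (from a branch with dissolved attention) and `eps_original` (from an unmodified branch), and the `scale` parameter are hypothetical and may differ from what the paper actually does.

```python
import numpy as np


def localized_guidance(eps_removed, eps_original, mask, scale):
    """Push the noise prediction away from the object inside the mask only.

    eps_removed  : (C, H, W) prediction from the object-dissolved branch (assumed).
    eps_original : (C, H, W) prediction from the unmodified branch (assumed).
    mask         : (H, W) array, nonzero where the object should be erased.
    scale        : guidance strength (hypothetical hyperparameter).
    """
    m = (np.asarray(mask) != 0).astype(eps_removed.dtype)   # (H, W), broadcasts over C
    # Extrapolate along the direction that separates the two predictions,
    # but only at masked locations; the background prediction is unchanged.
    return eps_removed + scale * m * (eps_removed - eps_original)
```

With `scale = 0` the function returns `eps_removed` unchanged, which makes it easy to compare results with and without guidance in this toy setting.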

PANDORA Pipeline Diagram

Qualitative Comparison

Figure: Qualitative comparison of object removal methods on various object removal scenarios. From left to right: original image with a mask, and results from different methods. Single-object removal: top two rows. Multi-object cases: middle two rows. Mass-similar objects: bottom two rows. Zero-shot methods are shown in the last four columns, with the last two columns showing our PANDORA method.

Quantitative Comparison

| Method | FID↓ | LPIPS↓ | MSE↓ | CLIP score↑ |
|---|---|---|---|---|
| Fine-tuning-based methods (SD 2.1 backbone, except LaMa) | | | | |
| PowerPaint | 22.81 | 0.1322 | 0.0104 | 24.15 |
| LaMa | 0.71 | 0.0012 | 0.0001 | 24.5 |
| SD2-Inpaint | 17.93 | 0.1106 | 0.0073 | 24.06 |
| SD2-Inpaint-wprompt | 18.01 | 0.1098 | 0.0072 | 24.32 |
| Zero-shot methods (no retraining, SD 2.1 backbone) | | | | |
| CPAM | 25.25 | 0.0953 | 0.0048 | 24.49 |
| PANDORA w/o PAD (Ours) | 27.3 | 0.0985 | 0.005 | 24.58 |
| PANDORA w/o LADG (Ours) | 30.8 | 0.1007 | 0.0055 | 24.65 |
| PANDORA (Ours) | 35.1 | 0.1064 | 0.0059 | 24.69 |
| Zero-shot methods (no retraining, SD 1.5 backbone) | | | | |
| CPAM | 29.54 | 0.1564 | 0.0138 | 24.32 |
| Attentive Eraser | 118.09 | 0.2567 | 0.027 | 24.42 |
| PANDORA w/o PAD (Ours) | 35.59 | 0.1702 | 0.0156 | 24.4 |
| PANDORA w/o LADG (Ours) | 42.17 | 0.1844 | 0.0171 | 24.55 |
| PANDORA (Ours) | 44.98 | 0.1895 | 0.0184 | 24.57 |

Lower is better: FID, LPIPS, MSE. Higher is better: CLIP score.

Quantitative comparison of fine-tuned and zero-shot object removal methods averaged across all dataset types. PANDORA consistently achieves the best object removal quality with competitive background realism, without any retraining or textual prompts, demonstrating strong generalization across both Stable Diffusion v1.5 and v2.1 backbones. Removing LADG slightly reduces removal quality, while removing PAD causes a significant degradation.

Acknowledgment

💰

Funding and GPU Support

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2023.31. This research used the GPUs provided by the Intelligent Systems Lab at the Faculty of Information Technology, University of Science, VNU-HCM.

🙏

User Study Participants

We extend our heartfelt gratitude to all participants who took part in our comprehensive user study. Your valuable time, thoughtful feedback, and detailed evaluations were instrumental in validating the effectiveness and usability of our PANDORA framework. Your insights helped us understand the practical impact of our zero-shot object removal approach and provided crucial evidence of its superiority over existing methods.

🎨

Website Design Inspiration

This website design is inspired by ObjectDrop. We thank the authors for their excellent work and creative design approach.

🚀

Demo Design Inspiration

Our Gradio demo design is inspired by MimicBrush. We thank the authors for their excellent work and creative design approach.