Creating realistic human-object interactions with diffusion models remains a key challenge, often requiring tedious trial-and-error and iterative manual refinement. Current approaches either fail to integrate new content seamlessly while maintaining global scene consistency, or require time-consuming editing and prompt engineering, making them impractical for large-scale applications.
To address this challenge, we propose GeCHO, an inpainting framework specifically architected to generate spatially consistent and contextually aware human-object interactions. Our method improves local object fidelity and global scene consistency by leveraging cross-attention maps for automated, annotation-free object placement and using ControlNet to ensure precise spatial localization.
We demonstrate the practical impact of our approach through two key applications: natural image inpainting, where we achieve contextual object placement with flexible spatial control, and human-object interaction (HOI) detection, where we address long-tail distributions through synthetic data generation. Our results show that GeCHO achieves 35.50% top-1 zero-shot action recognition accuracy, outperforming the strong Add-SD baseline (28.14%), confirming its superior ability to synthesize coherent interactions rather than simple isolated objects.
Fig. 1. Overview of the GeCHO framework. (Top) Two-stage training: (I) ControlNet training for object inpainting with U-Net frozen, then (II) U-Net LoRA fine-tuning with frozen ControlNet. (Bottom) Two inference modes: (a) user-controlled generation with an explicit object mask Mobj, and (b) free-form generation where Mobj is automatically extracted from cross-attention maps.
We train a dedicated ControlNet for object generation: a copy of the inpainting U-Net, trained while the base U-Net stays frozen. It receives a binary object mask Mobj as spatial conditioning together with a dedicated text prompt, giving fine-grained control over where the object appears while the base U-Net maintains overall scene coherence.
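As a rough illustration, this kind of mask-conditioned inpainting can be set up with the diffusers library. The checkpoint paths below are placeholders (GeCHO's trained ControlNet is not assumed to be a published checkpoint), and this stock pipeline feeds a single prompt to both networks, whereas GeCHO gives the ControlNet its own object prompt.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline
from diffusers.utils import load_image

# Checkpoint paths are placeholders, not published GeCHO weights.
controlnet = ControlNetModel.from_pretrained(
    "path/to/gecho-object-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

scene = load_image("scene.png")        # image containing the person
m_obj = load_image("object_mask.png")  # binary object mask M_obj (white = object region)

result = pipe(
    prompt="a person riding a skateboard",  # single prompt here; GeCHO uses a separate object prompt
    image=scene,
    mask_image=m_obj,                       # inpainting is restricted to the object region
    control_image=m_obj,                    # the same mask acts as ControlNet spatial conditioning
    num_inference_steps=50,
).images[0]
result.save("inpainted.png")
```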
Combining the ControlNet with the U-Net introduces domain shifts and morphing artifacts. We apply parameter-efficient LoRA fine-tuning only on the ResNet convolutional blocks — where ControlNet outputs and skip connections are merged — eliminating these inconsistencies while preserving general inpainting capability.
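A minimal sketch of how such a targeted adapter could be configured with peft through the diffusers adapter API; the rank and checkpoint name are illustrative, not the paper's exact settings.

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Inpainting U-Net (checkpoint name is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", subfolder="unet"
)
unet.requires_grad_(False)  # base weights stay frozen

# LoRA only on the ResNet-block convolutions, where the ControlNet residuals
# and skip connections get merged; attention layers are left untouched.
lora_config = LoraConfig(
    r=8,                                 # rank is illustrative, not the paper's value
    lora_alpha=8,
    target_modules=["conv1", "conv2"],   # ResnetBlock2D convolutions in diffusers
)
unet.add_adapter(lora_config)

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable:,}")
```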
When no explicit object mask is provided, we extract and threshold cross-attention maps from the U-Net to dynamically derive Mobj,t at each denoising step. This enables large-scale automatic HOI synthesis without any mask annotation, while keeping ControlNet and U-Net generation spatially aligned throughout the diffusion process.
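A simplified sketch of this masking step; the aggregation over heads and layers and the threshold value are our assumptions, and collecting the attention probabilities (e.g. via custom attention processors) is omitted.

```python
import torch
import torch.nn.functional as F

def object_mask_from_attention(attn_probs, obj_token_ids, latent_hw, threshold=0.35):
    """Derive a binary object mask M_obj,t from U-Net cross-attention maps.

    attn_probs    : list of tensors (heads, query_pixels, text_tokens), one per
                    cross-attention layer, collected at the current denoising step
    obj_token_ids : indices of the object word(s) in the tokenized prompt
    latent_hw     : (H, W) of the latent grid the mask should match
    threshold     : cut-off on the normalized attention (illustrative value)
    """
    maps = []
    for probs in attn_probs:
        # average over heads and over the object tokens -> one value per query pixel
        m = probs[:, :, obj_token_ids].mean(dim=(0, 2))
        side = int(m.numel() ** 0.5)                    # attention grids are square
        m = m.reshape(1, 1, side, side)
        maps.append(F.interpolate(m, size=latent_hw, mode="bilinear"))
    avg = torch.cat(maps).mean(dim=0)                   # average across layers
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)
    return (avg > threshold).float()                    # binary M_obj,t
```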
We compare against Paint-By-Example (image-reference inpainting) and SD Inpaint (Stable Diffusion 2.1 inpainting baseline) on object insertion into HICO-DET test scenes. The inpainting mask covers only the object region (M = Mobj).
Example interactions shown: 'wield' ↔ 'baseball bat', 'ride' ↔ 'skateboard', 'hold' ↔ 'cup', 'carry' ↔ 'surfboard'.
GeCHO consistently generates objects that align with the subject's posture, scale, and the scene's depth cues. SD Inpaint often defaults to background continuation; Paint-By-Example struggles with larger masks.
We compare against Add-SD, a state-of-the-art instruction-based object addition method trained on paired image datasets. For each scene we show one representative generation per method. The red mask indicates the inpainting region used by GeCHO; Add-SD operates on the full image.
Example interactions shown: 'ride' ↔ 'skateboard', 'jump' ↔ 'bicycle', 'hold' ↔ 'laptop'.
GeCHO produces more diverse and interaction-faithful outputs. Add-SD can place objects plausibly but struggles to synthesize the intended action.
We evaluate using CLIP-based metrics: CLIPtext distance (lower is better), zero-shot action recognition (ZS Action Acc@1/5), and zero-shot object detection (ZS Obj Acc@1/5).
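For intuition only, the zero-shot top-k protocol can be approximated with an off-the-shelf CLIP as below; the model choice and prompt templates are assumptions, not the paper's exact evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_topk(image: Image.Image, candidate_texts, gt_index, k=1):
    """True if the ground-truth text is among the top-k CLIP matches for the image."""
    inputs = processor(text=candidate_texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]   # similarity to each candidate text
    return gt_index in logits.topk(k).indices.tolist()

# e.g. zero-shot action recognition over candidate verb prompts (templates illustrative):
# actions = [f"a person {verb} something" for verb in ["riding", "holding", "carrying"]]
# hit = zero_shot_topk(Image.open("generated.png"), actions, gt_index=0, k=1)
```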
Comparison with inpainting baselines (generation restricted to the object mask, M = Mobj):

| Method | CLIPtext ↓ | ZS Action Acc@1 ↑ | ZS Action Acc@5 ↑ | ZS Obj Acc@1 ↑ | ZS Obj Acc@5 ↑ |
|---|---|---|---|---|---|
| Paint-By-Example | 0.7143 | 19.98% | 75.70% | 74.72% | 91.73% |
| SD Inpaint | 0.6757 | 25.87% | 81.30% | 79.40% | 91.34% |
| GeCHO (Ours) | 0.6747 | 26.12% | 81.85% | 83.33% | 93.79% |

Comparison with instruction-based editing and object-addition methods (full-image editing):

| Method | CLIPtext ↓ | ZS Action Acc@1 ↑ | ZS Action Acc@5 ↑ | ZS Obj Acc@1 ↑ | ZS Obj Acc@5 ↑ |
|---|---|---|---|---|---|
| LaMa removed objects | 0.7110 | 25.15% | 81.81% | 49.28% | 76.90% |
| MGIE | 0.7138 | 24.94% | 82.14% | 51.19% | 79.30% |
| MagicBrush | 0.7074 | 26.56% | 80.75% | 52.93% | 80.48% |
| InstructPix2Pix | 0.7040 | 26.89% | 83.58% | 78.79% | 91.73% |
| Add-SD | 0.6810 | 28.14% | 83.55% | 86.37% | 95.44% |
| GeCHO (Ours) | 0.6574 | 35.50% | 86.12% | 84.95% | 94.60% |
GeCHO achieves +7.36 percentage points over Add-SD on zero-shot action recognition Acc@1, confirming our approach synthesizes coherent interactions, not just isolated objects.
We validate GeCHO as a synthetic data generator for HOI detection, tackling the severe long-tail distribution in HICO-DET (343 out of 600 categories have fewer than 50 training images). Using GeCHO we synthesize 76,929 images — 2× the original training set — with an acceptance rate of 69% under uniform sampling (17 valid images/min on a single RTX A6000).
We train GEN-VLKT under three configurations: real-only, synthetic pre-training + real fine-tuning, and joint synthetic+real training. Pre-training with GeCHO data consistently improves mAP, especially on rare categories (+1.40 mAPrare in the Default setting and +1.28 in the Known-Object setting), demonstrating the value of our generated data for learning interaction representations.
Data synthesis with GeCHO + GroundedSAM filtering → GEN-VLKT training.
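Schematically, the acceptance filter reduces to generating a candidate and keeping it only if the target object is detected; every function name below is a hypothetical placeholder standing in for the real GeCHO and GroundedSAM components.

```python
def synthesize_and_filter(scene, verb, obj_label, generate_fn, detect_fn, min_score=0.5):
    """Generate one HOI candidate and keep it only if the target object is detected.

    generate_fn(scene, verb, obj_label) -> candidate image (stands in for GeCHO inference)
    detect_fn(image, label)             -> list of (box, score) pairs from an
                                           open-vocabulary detector (e.g. GroundedSAM)
    min_score                           -> illustrative acceptance threshold
    """
    image = generate_fn(scene, verb, obj_label)
    detections = detect_fn(image, obj_label)
    if any(score >= min_score for _, score in detections):
        return image   # accepted: the intended object is present and recognizable
    return None        # rejected: caller re-samples another scene/prompt pair
```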
Distribution of HOI triplets in HICO-DET (blue) and our synthetic complement (red). Our inverse-proportional sampling boosts the under-represented tail categories.
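A minimal sketch of the sampling idea, where each HOI category is drawn with weight inversely proportional to its frequency in the real training set; the smoothing term is our assumption.

```python
import random
from collections import Counter

def inverse_proportional_sampler(hoi_labels, num_samples, smoothing=1.0):
    """Draw HOI categories with probability inversely proportional to their
    frequency in the real training set, so tail categories are synthesized more
    often. The smoothing term avoids extreme weights for near-empty categories."""
    counts = Counter(hoi_labels)
    categories = list(counts)
    weights = [1.0 / (counts[c] + smoothing) for c in categories]
    return random.choices(categories, weights=weights, k=num_samples)

# e.g. draw the categories for 76,929 synthetic images, favouring the tail:
# targets = inverse_proportional_sampler(train_triplets, num_samples=76_929)
```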
| Data | Training strategy | mAPfull (DEF) ↑ | mAPrare (DEF) ↑ | mAPfull (KO) ↑ | mAPrare (KO) ↑ |
|---|---|---|---|---|---|
| — | Real only (baseline) | 33.60 | 28.94 | 36.65 | 32.48 |
| GeCHO | Synth pre-train + real fine-tune | 33.86 | 30.34 | 36.99 | 33.76 |
| GeCHO | Synth+real joint training | 32.64 | 27.86 | 35.94 | 31.01 |
| SD3 | Synth pre-train + real fine-tune | 32.72 | 27.84 | 35.91 | 30.99 |
| SD3 | Synth+real joint training | 30.59 | 26.73 | 34.17 | 31.09 |
DEF = Default setting (all test images); KO = Known-Object (only images containing the target object). GeCHO pre-training outperforms both the real-only baseline and SD3-based augmentation, particularly on rare interaction categories.
We ablate the two main components of our pipeline: the ControlNet and the LoRA fine-tuning step.
Example interactions shown: 'eat' ↔ 'dining table', 'hold' ↔ 'potted plant', 'kiss' ↔ 'horse', 'hold' ↔ 'donut'.
Without ControlNet, the base model often deviates from the intended interaction. Adding ControlNet improves spatial consistency, but artifacts appear where ControlNet activations interact with the U-Net. The LoRA fine-tuning step (Full GeCHO) harmonizes both networks, yielding coherent lighting, poses, and object fidelity.
We compare three strategies for providing Mobj at inference: a fixed user-specified mask, Token-Attention (dynamic mask from full-prompt cross-attention), and Prompt-Attention (dynamic mask from an isolated object-prompt).
| Mobj strategy | CLIPtext ↓ | ZS Action Acc@1 ↑ | ZS Action Acc@5 ↑ | ZS Obj Acc@1 ↑ | ZS Obj Acc@5 ↑ |
|---|---|---|---|---|---|
| Fixed Mask | 0.6747 | 26.12% | 81.85% | 83.33% | 93.79% |
| Token-Attention | 0.6715 | 27.37% | 82.73% | 83.79% | 93.60% |
| Prompt-Attention | 0.6715 | 27.71% | 82.83% | 82.71% | 92.90% |
Dynamic masking improves action accuracy (+1.6 pp) while maintaining object recognition. Token-Attention better preserves spatial relationships; Prompt-Attention allows flexible word selection at the cost of occasional object placement imprecision.
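To make the distinction concrete, here is a hypothetical helper showing which token indices each dynamic strategy would feed into the attention-mask extraction sketched earlier; the prompts and tokenizer choice are illustrative.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def token_indices(prompt: str, word: str):
    """Indices of `word`'s sub-tokens within the tokenized prompt (illustrative helper)."""
    prompt_ids = tokenizer(prompt).input_ids
    word_ids = set(tokenizer(word, add_special_tokens=False).input_ids)
    return [i for i, t in enumerate(prompt_ids) if t in word_ids]

# Token-Attention: full scene prompt, attention aggregated on the object word only.
token_attention_ids = token_indices("a person riding a skateboard in a park", "skateboard")

# Prompt-Attention: a separate object-only prompt, attention aggregated over all its word tokens.
prompt_attention_ids = token_indices("a skateboard", "skateboard")

# Either index set is then passed to object_mask_from_attention (see the earlier sketch),
# together with the cross-attention maps of the corresponding prompt.
```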
@ARTICLE{11456932,
author={Minelli, Giovanni and Benericetti, Andrea and Taccari, Leonardo and Sambo, Francesco and Salti, Samuele},
journal={IEEE Access},
title={GeCHO: Generation of Contextualized Human--Object Interactions},
year={2026},
volume={14},
pages={48872--48886},
doi={10.1109/ACCESS.2026.3678513}
}