Pico-Banana-400k

原始链接: https://github.com/apple/pico-banana-400k

## Pico-Banana-400K: A Dataset for Text-Guided Image Editing

Pico-Banana-400K is a new large-scale dataset of roughly 400K text-image-edit triplets intended to advance research on text-guided image editing. It is built from Open Images photographs, with edits generated and verified by the Nano-Banana model, and spans 35 edit operations across 8 semantic categories, from color adjustments to complex object and style changes.

The dataset is split into ~257K samples for supervised fine-tuning (SFT), ~56K samples for preference learning (using failed edits), and ~72K samples for multi-turn editing. Instructions are generated by Gemini-2.5-Flash with an emphasis on concise, natural language, and a robust self-evaluation pipeline built on Gemini-2.5-Pro ensures high-quality edits by assessing instruction compliance, realism, and technical quality.

Pico-Banana-400K covers diverse content, including people, objects, and text-rich scenes, and is freely available for research and non-commercial use under the CC BY-NC-ND 4.0 license. It is intended as a general-purpose resource for developing more controllable, instruction-following image editing models.

## Pico-Banana-400K: Apple's Image Editing Dataset

Apple has released Pico-Banana-400K, a dataset for training and studying image editing models, built by editing Open Images inputs with Nano-Banana and filtering the results with Gemini-2.5-Pro. The project aims to make it easier for developers to build and test their own image editing systems.

Discussion centered on evaluation methodology: users compared Gemini 2.5 Pro, GPT-5, and Qwen3 VL, and found Gemini the most consistent. Several users are building similar automated evaluation systems for generative AI; one shared a website ([https://genai-showdown.specr.net/image-editing](https://genai-showdown.specr.net/image-editing)) and received a feature request for synchronized sliders, which was implemented shortly after.

The conversation also touched on training techniques, such as inverse tasks and synthetic data, and on the challenges of AI-industry naming conventions. Concerns were raised about the dataset's license (CC BY-NC-ND) and the copyright implications of AI-generated content. Finally, users noted the dataset's reliance on Google's Gemini and Nano-Banana, contrasting it with models such as Flux and Qwen Image Edit.

## Original Text

Pico-Banana-400K is a large-scale dataset of ~400K text–image–edit triplets designed to advance research in text-guided image editing.
Each example contains:

  • an original image (from Open Images),
  • a human-like edit instruction, and
  • the edited result generated and verified by the Nano-Banana model.
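
In code, such a triplet can be represented as a small record. The field names below are hypothetical illustrations; the released manifest files define the actual schema.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    source_url: str         # original image from Open Images
    instruction: str        # natural-language edit instruction
    edited_image_url: str   # Nano-Banana edited result
```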

The dataset spans 35 edit operations across 8 semantic categories, covering diverse transformations—from low-level color adjustments to high-level object, scene, and stylistic edits.


| Feature | Description |
|---|---|
| Total Samples | ~257K single-turn text-image-edit triplets for SFT; ~56K single-turn text-image (positive)-image (negative)-edit examples for preference learning; ~72K multi-turn text-image-edit sequences for multi-turn applications |
| Source | Open Images |
| Edit Operations | 35 across 8 semantic categories |
| Categories | Pixel & Photometric, Object-Level, Scene Composition, Stylistic, Text & Symbol, Human-Centric, Scale & Perspective, Spatial/Layout |
| Image Resolution | 512–1024 px |
| Prompt Generator | Gemini-2.5-Flash |
| Editing Model | Nano-Banana |
| Self-Evaluation | Automated judging pipeline using Gemini-2.5-Pro for edit quality |

🏗️ Dataset Construction

Pico-Banana-400K is built using a two-stage multimodal generation pipeline:

  1. Instruction Generation
    Each Open Images sample is passed to Gemini-2.5-Flash, which writes concise, natural-language editing instructions grounded in visible content. We also provide short instructions summarized by Qwen-2.5-Instruct-7B. Example:
    ```json
    {
      "instruction": "Change the red car to blue."
    }
    ```
    
  2. Editing + Self-Evaluation
    The Nano-Banana model performs the edit, and the automated judge (Gemini-2.5-Pro) then evaluates the result using a structured quality prompt that measures:
      • Instruction Compliance (40%)
      • Editing Realism (25%)
      • Preservation Balance (20%)
      • Technical Quality (15%)
    Only edits scoring above a strict threshold (~0.7) are labeled as successful and form the main dataset; the remaining ~56K are retained as failure cases for robustness and preference learning.
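
The weighted scoring step can be sketched as follows. The weights come from the rubric above, but the exact aggregation used in the real pipeline is not specified, so treat this as an illustration assuming each criterion is scored in [0, 1].

```python
# Rubric weights from the self-evaluation prompt described above.
WEIGHTS = {
    "instruction_compliance": 0.40,
    "editing_realism": 0.25,
    "preservation_balance": 0.20,
    "technical_quality": 0.15,
}

def overall_score(scores: dict) -> float:
    """Weighted sum of the four per-criterion scores (each in [0, 1])."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def is_successful(scores: dict, threshold: float = 0.7) -> bool:
    """Edits clearing ~0.7 join the SFT set; failures feed preference learning."""
    return overall_score(scores) >= threshold
```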

Pico-Banana-400K contains ~400K image editing examples, covering a wide visual and semantic range drawn from real-world imagery.


🧭 Category Distribution

| Category | Description | Percentage |
|---|---|---|
| Object-Level Semantic | Add, remove, replace, or relocate objects | 35% |
| Scene Composition & Multi-Subject | Contextual and environmental transformations | 20% |
| Human-Centric | Edits involving clothing, expression, or appearance | 18% |
| Stylistic | Domain and artistic style transfer | 10% |
| Text & Symbol | Edits involving visible text, signs, or symbols | 8% |
| Pixel & Photometric | Brightness, contrast, and tonal adjustments | 5% |
| Scale & Perspective | Zoom, viewpoint, or framing changes | 2% |
| Spatial / Layout | Outpainting, composition, or canvas extension | 2% |

  • Single-Turn SFT samples (successful edits): ~257K
  • Single-Turn Preference samples (failure cases): ~56K
  • Multi-Turn SFT samples (successful cases): ~72K
  • Gemini-generated instructions: concise, natural, and image-aware
  • Edit coverage: 35 edit types across 8 semantic categories
  • Image diversity: includes humans, objects, text-rich scenes, and more, from Open Images

Below are representative examples from different categories:

| Category | Example |
|---|---|
| Object-Level | “Replace the red apple with a green one.” |
| Scene Composition | “Add sunlight streaming through the window.” |
| Human-Centric | “Change the person’s expression to smiling.” |
| Text & Symbol | “Uppercase the text on the billboard.” |
| Stylistic | “Convert the image to a Van Gogh painting style.” |

Pico-Banana-400K provides both breadth (diverse edit operations) and depth (quality-controlled multimodal supervision), making it a strong foundation for training and evaluating text-guided image editing models.

Pico-Banana-400K serves as a versatile resource for advancing controllable and instruction-aware image editing.
Beyond single-step editing, the dataset enables multi-turn, conversational editing and reward-based training paradigms.
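
A multi-turn session can be thought of as an ordered list of instruction/result pairs applied to one source image. Below is a minimal sketch; the field names are hypothetical, and the real multi-turn manifest may organize turns differently.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EditTurn:
    instruction: str        # instruction for this turn
    edited_image_url: str   # result after applying this turn

@dataclass
class EditSession:
    source_url: str                          # original Open Images photo
    turns: List[EditTurn] = field(default_factory=list)

    def final_image(self) -> str:
        """Result after the last turn, or the source image if no edits yet."""
        return self.turns[-1].edited_image_url if self.turns else self.source_url
```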

📦 Dataset Download Guide

The Pico-Banana-400K dataset is hosted on Apple’s public CDN.
You can download each component (single-turn, multi-turn, and preference data) using the provided manifest files.


🖼️ 1. Single-Turn Edited Images

Manifest files: sft link and preference link

🖼️ 2. Multi-Turn Edited Images

Manifest file: multi-turn link

URLs for downloading the source images are provided along with the edit instructions in sft link, preference link, and multi-turn link. If you hit rate limits with Flickr when downloading images, you can either request a higher rate limit from Flickr or follow the steps below.
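
If you download from the source URLs directly, a simple exponential backoff loop helps you stay under the rate limit. This is an illustrative sketch, not part of the released tooling:

```python
import time
import urllib.error
import urllib.request

def download_with_backoff(url: str, dest: str, max_retries: int = 5,
                          base_delay: float = 1.0) -> bool:
    """Fetch one source image, backing off exponentially on HTTP 429
    (the rate-limit response) instead of failing outright."""
    for attempt in range(max_retries):
        try:
            urllib.request.urlretrieve(url, dest)
            return True
        except urllib.error.HTTPError as exc:
            if exc.code != 429:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False
```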

Another way to obtain the source images is to download the packed files train_0.tar.gz and train_1.tar.gz from Open Images, then map them to the URLs we provide. We also provide a sample mapping script here. Due to legal requirements, we cannot provide the source image files directly.

```shell
# Install the AWS CLI (https://aws.amazon.com/cli/)

# Download the Open Images packed files
aws s3 --no-sign-request --endpoint-url https://s3.amazonaws.com cp s3://open-images-dataset/tar/train_0.tar.gz .
aws s3 --no-sign-request --endpoint-url https://s3.amazonaws.com cp s3://open-images-dataset/tar/train_1.tar.gz .

# Create a folder for the extracted images
mkdir openimage_source_images

# Extract the tar files
tar -xvzf train_0.tar.gz -C openimage_source_images
tar -xvzf train_1.tar.gz -C openimage_source_images

# Download the metadata CSV (ImageID ↔ OriginalURL mapping)
wget https://storage.googleapis.com/openimages/2018_04/train/train-images-boxable-with-rotation.csv

# Map URLs to local paths
python map_openimage_url_to_local.py  # modify the is_multi_turn variable and file paths as needed
```
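
The mapping step might look roughly like the sketch below, assuming the metadata CSV's `ImageID` and `OriginalURL` columns and that the extracted files are named `<ImageID>.jpg`; the released map_openimage_url_to_local.py is the authoritative version.

```python
import csv
from pathlib import Path

def build_url_to_local_map(csv_path: str, image_dir: str) -> dict:
    """Map each OriginalURL in the Open Images metadata CSV to the
    locally extracted file <ImageID>.jpg, keeping only files that exist."""
    image_dir = Path(image_dir)
    mapping = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            local = image_dir / f"{row['ImageID']}.jpg"
            if local.exists():
                mapping[row["OriginalURL"]] = str(local)
    return mapping
```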

Pico-Banana-400K is released under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) license.

  • ✅ Free for research and non-commercial use
  • ❌ Commercial use and derivative redistribution are not permitted
  • 🖼️ Source images follow the Open Images (CC BY 2.0) license

By using this dataset, you agree to comply with the terms of both licenses.

If you use 🍌 Pico-Banana-400K in your research, please cite it as follows:

@inproceedings{Qian2025PicoBanana400KAL,
  title={Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing},
  author={Yusu Qian and Eli Bocek-Rivele and Liangchen Song and Jialing Tong and Yinfei Yang and Jiasen Lu and Wenze Hu and Zhe Gan},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:282272484}
}
