Timon Harz

December 13, 2024

Open Preference Dataset for Text-to-Image Generation by Hugging Face Community


The Data is Better Together community has just released an essential new dataset to advance open-source development. Recognizing the lack of open preference datasets for text-to-image generation, we've launched an Apache 2.0 licensed dataset focused on preference pairs across various image generation categories. This dataset integrates diverse model families and includes varying prompt complexities.

TL;DR: All results are available in this collection on the Hugging Face Hub, with pre- and post-processing code hosted on GitHub. A ready-to-use preference dataset and a flux-dev-lora-finetune are also included. To show your support, don’t forget to like the collection and follow us on the Hub.

The Input Dataset

To create a solid input dataset for this sprint, we began with base prompts, filtering them for toxicity and enhancing them with categories and complexities through synthetic data generation using distilabel. Flux and Stable Diffusion models were used to generate the resulting images, culminating in the open-image-preferences-v1 dataset.

Input Prompts

Imgsys, a generative image model arena by fal.ai, lets users compare the outputs of two models for a given prompt and pick the one they prefer. While the generated images are not publicly available, the prompts are hosted on Hugging Face. These prompts reflect real-life, everyday image generation use cases, but they included duplicates and toxic content that required filtering.
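
To get a feel for this preprocessing step, the sketch below loads the prompts from the Hub and drops exact duplicates; the repo id and the "prompt" column name are placeholders, not the actual dataset used in the sprint.

```python
from datasets import load_dataset

# Placeholder repo id and column name; substitute the actual prompt dataset on the Hub.
prompts = load_dataset("fal/imgsys-prompts", split="train")

# Drop exact duplicate prompts before toxicity filtering.
seen = set()
deduplicated = prompts.filter(lambda row: not (row["prompt"] in seen or seen.add(row["prompt"])))
print(f"kept {len(deduplicated)} of {len(prompts)} prompts")
```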

Reducing Toxicity

We prioritized removing NSFW content from the dataset using a multi-model approach that combined text-based and image-based classifiers. After filtering, a manual review confirmed that no harmful content remained. A minimal sketch of the image-level classification step follows the pipeline below.

The Filtering Pipeline:

  1. Classify images as NSFW

  2. Remove positive samples

  3. Manual review by the Argilla team

  4. Repeat the process based on feedback
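
As an illustration of step 1, here is a minimal sketch of an image-level NSFW check built on a transformers image-classification pipeline; the checkpoint, the label name, and the 0.5 threshold are assumptions for illustration, not the exact classifiers used in the sprint.

```python
from transformers import pipeline

# Example open checkpoint; any NSFW image classifier on the Hub can be swapped in.
nsfw_classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def is_nsfw(image, threshold=0.5):
    # The pipeline returns a list of {"label": ..., "score": ...} dicts per image.
    return any(
        pred["label"].lower() == "nsfw" and pred["score"] >= threshold
        for pred in nsfw_classifier(image)
    )

# `generated` is assumed to be a datasets.Dataset with a PIL image in its "image" column.
# Positive samples are dropped (step 2); the remainder goes on to manual review (step 3).
safe_subset = generated.filter(lambda row: not is_nsfw(row["image"]))
```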

Synthetic Prompt Enhancement

To enhance the dataset’s diversity and quality, we synthetically generated new prompts, rewriting them based on various categories and complexities using a distilabel pipeline. This approach significantly improved the dataset’s robustness and adaptability.
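
The sketch below shows what such a rewrite step can look like in distilabel; the instruction template and the Llama 3.1 model id are illustrative assumptions, not the exact pipeline used for the sprint.

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

# Build the rewrite instruction in plain Python so TextGeneration can consume it
# through its default "instruction" input column.
base_prompts = ["a cat sitting on a windowsill", "a castle in the mountains"]
rows = [
    {
        "instruction": (
            "Rewrite this image prompt in the 'Anime' style and make it highly "
            f"detailed, keeping the original subject: {prompt}"
        )
    }
    for prompt in base_prompts
]

with Pipeline(name="prompt-enhancement") as pipeline:
    load = LoadDataFromDicts(data=rows)
    rewrite = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"),
    )
    load >> rewrite

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
```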

Prompt Categories

In text-to-text generation, InstructGPT has established core task categories, but there has been no direct equivalent for text-to-image generation. To address this, we sourced categories from two primary inputs: google/sdxl and Microsoft. This resulted in the following core categories: ["Cinematic", "Photographic", "Anime", "Manga", "Digital art", "Pixel art", "Fantasy art", "Neonpunk", "3D Model", "Painting", "Animation", "Illustration"]. Additionally, we introduced mutually exclusive sub-categories to diversify the prompts further. These categories and sub-categories were randomly sampled, ensuring a balanced distribution across the dataset.
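
As a small illustration, balanced random sampling over these styles could look like the sketch below; the sub-category axes and values are invented examples, since the exact sub-category list is not reproduced here.

```python
import random

categories = [
    "Cinematic", "Photographic", "Anime", "Manga", "Digital art", "Pixel art",
    "Fantasy art", "Neonpunk", "3D Model", "Painting", "Animation", "Illustration",
]
# Hypothetical mutually exclusive sub-categories, for illustration only.
sub_categories = {"lighting": ["daylight", "night"], "framing": ["close-up", "wide shot"]}

def sample_style(rng=random):
    style = {"category": rng.choice(categories)}
    style.update({axis: rng.choice(values) for axis, values in sub_categories.items()})
    return style

print(sample_style())  # e.g. {'category': 'Anime', 'lighting': 'night', 'framing': 'wide shot'}
```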

Prompt Complexities

The Deita paper demonstrated that varying the complexity and diversity of prompts enhances model generation and fine-tuning. However, humans don't always take the time to craft detailed prompts. To address this, we used each prompt in both a complex and a simplified version, yielding two distinct data points per prompt for preference generation.
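
Concretely, one base prompt then yields a pair of data points along these lines (both prompt texts are invented for illustration):

```python
prompt_pair = {
    "category": "Neonpunk",
    "simplified_prompt": "a rainy city street at night with neon signs",
    "complex_prompt": (
        "a rain-soaked city street at night, glowing neon signs reflecting in puddles, "
        "dense fog, cinematic composition, highly detailed, vibrant magenta and cyan palette"
    ),
}
```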

Image Generation

The ArtificialAnalysis/Text-to-Image-Leaderboard offers insights into top-performing image models. We selected two high-performing models based on their license and availability on the Hugging Face Hub. To ensure diversity, we chose models from different families, specifically stabilityai/stable-diffusion-3.5-large and black-forest-labs/FLUX.1-dev. These models were then used to generate images for both the simplified and complex prompts within the same stylistic categories.
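
A minimal sketch of generating one image per model with diffusers is shown below; both checkpoints are gated on the Hub, and the sampler settings are illustrative defaults rather than the exact settings used for the dataset.

```python
import torch
from diffusers import FluxPipeline, StableDiffusion3Pipeline

# Accept both model licenses on the Hub and log in with `huggingface-cli login` first.
flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
sd35 = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a rain-soaked city street at night, glowing neon signs, cinematic composition"
flux_image = flux(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
sd35_image = sd35(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
flux_image.save("flux.png")
sd35_image.save("sd35.png")
```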

The Results

A raw export of all the annotated data includes responses to a multiple-choice question, where each annotator selected whether one model outperformed the other, both models performed equally well, or both models performed poorly. From this, we were able to analyze annotator alignment, evaluate model performance across different categories, and even fine-tune a model, which is already available for experimentation on the Hub! The annotated dataset can be found in the collection on the Hugging Face Hub.

Annotator Alignment

Annotator alignment helps assess the reliability of a task. When a task is too difficult, annotators may disagree, and when it's too easy, their answers may be too consistent. Striking the right balance is challenging, but we achieved it during this sprint. We conducted this analysis using the Hugging Face datasets SQL console. Overall, SD3.5-XL was slightly more likely to win in our test setup.

Model Performance

With the annotator alignment in mind, both models performed well on their own. To further assess their strengths, we analyzed performance across categories. In summary, FLUX-dev performed better for anime-related prompts, while SD3.5-XL excelled in art and cinematic scenarios. A small sketch for reproducing this per-category breakdown follows the list.

  • Tie: Photographic, Animation

  • FLUX-dev better: 3D Model, Anime, Manga

  • SD3.5-XL better: Cinematic, Digital art, Fantasy art, Illustration, Neonpunk, Painting, Pixel art
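
A rough local reproduction of this breakdown might look like the sketch below; the repo id and the "chosen_model" and "category" column names are assumptions, since the exact schema of the annotated export is not reproduced here.

```python
from collections import Counter

from datasets import load_dataset

# Placeholder repo id; substitute the annotated results dataset from the collection.
results = load_dataset("data-is-better-together/open-image-preferences-v1-results", split="train")

# Overall outcome counts (per-model wins, ties, both-bad), assuming a "chosen_model" column.
print(Counter(results["chosen_model"]))

# Per-category breakdown, assuming a "category" column.
per_category = Counter(zip(results["category"], results["chosen_model"]))
for (category, outcome), count in sorted(per_category.items()):
    print(f"{category:15s} {outcome:12s} {count}")
```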

Model Fine-Tune

To evaluate the dataset’s quality without investing excessive time and resources, we fine-tuned the black-forest-labs/FLUX.1-dev model using a LoRA approach, based on the diffusers example on GitHub. In this fine-tuning process, we included the selected samples as expected completions and excluded the rejected ones. Interestingly, the fine-tuned model performed significantly better in art and cinematic scenarios, where the original model had underperformed. You can try the fine-tuned adapter here.
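
Trying the adapter locally follows the standard diffusers LoRA loading flow, sketched below; the adapter repo id is an assumption, so substitute the one linked from the collection.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Adapter repo id assumed for illustration; use the LoRA published with the sprint.
pipe.load_lora_weights("data-is-better-together/open-image-preferences-v1-flux-dev-lora")

image = pipe(
    "a fantasy art painting of a castle at dusk, dramatic clouds",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lora_sample.png")
```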

The Community

In short, the community annotated 10K preference pairs with an overlap of two to three annotators per example, resulting in over 30K responses in less than two weeks, with contributions from more than 250 community members. The image leaderboard highlights that some community members provided over 5K preferences. We want to express our gratitude to everyone who participated in this sprint, with special recognition for the top three contributors, who will each receive a month of Hugging Face Pro membership. Make sure to follow them on the Hub: aashish1904, prithivMLmods, and Malalatiana.

What’s Next?

After another successful community sprint, we will continue organizing these events on the Hugging Face Hub. To stay updated, make sure to follow the Data Is Better Together organization. We also encourage community members to take the initiative themselves, and we are happy to offer guidance and to reshare content on social media and within the organization on the Hub. There are several ways to contribute:

  • Join and participate in other sprints.

  • Propose your own sprints or requests for high-quality datasets.

  • Fine-tune models based on the preference dataset. For example, you might do a full SFT fine-tune of SDXL or FLUX-schnell, or even a DPO/ORPO fine-tune.

  • Evaluate the improved performance of the LoRA adapter compared to the original SD3.5-XL and FLUX-dev models.

Press contact

Timon Harz

oneboardhq@outlook.com
