Thumbnail generation requires specific capabilities: models must render crisp text, generate realistic faces, and compose multiple elements into a cohesive image. We don't think current vibe- or arena-based evals measure these skills well, so we've built our own.
We give each image model the same prompts (derived from production thumbnail templates), then score their outputs against standard and prompt-specific criteria. Thumbnail Bench relies on human evaluation of boolean (true/false) criteria, because large language models are still poor at identifying subtle visual defects.
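To make the scoring concrete, here's a minimal sketch of how boolean-criteria results could be aggregated into a per-image score. The criterion names, data structure, and equal weighting are our own illustration, not the exact Thumbnail Bench rubric.

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    """One boolean (pass/fail) judgment from a human reviewer."""
    name: str      # e.g. "text matches the prompt exactly" (hypothetical example)
    passed: bool

def score_output(results: list[CriterionResult]) -> float:
    """Fraction of criteria the thumbnail satisfied, as a 0-100 score."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)

# Example: one model's output for a single prompt, judged by a human.
judgments = [
    CriterionResult("text matches the prompt exactly", True),
    CriterionResult("subject makes eye contact with the viewer", False),
    CriterionResult("hands have the correct number of fingers", True),
]
print(f"Score: {score_output(judgments):.1f}/100")  # Score: 66.7/100
```

Boolean checks keep the aggregation this simple: a reviewer only answers yes or no, and the score is just the pass rate.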
For our first go at this, Thumbnail Bench focuses mostly on prompt following. That keeps evaluation faster and less subjective.
We framed the criteria as boolean true/false checks hoping that would make them easier for LLMs to evaluate, but at least for now, they can't do it reliably. GPT-4.1, GPT-5, and Sonnet 4.5 all fail to pick up on things like eye contact, malformed hands, and plastic-looking skin.
We keep a half dozen standard criteria that apply to most thumbnails, and we also use an LLM to generate criteria specific to each image prompt, as sketched below.
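Here's a rough sketch of how the two sets of criteria could be combined. The standard checklist shown is a hypothetical stand-in (the real list isn't published here), and the LLM call is abstracted behind a `generate` callable rather than tied to any particular SDK.

```python
from typing import Callable

# Hypothetical standard criteria, not the exact Thumbnail Bench list.
STANDARD_CRITERIA = [
    "All text in the image is spelled correctly and legible",
    "Faces look natural, without plastic-looking skin",
    "Hands have the correct number of fingers",
    "The subject makes eye contact with the viewer",
    "Elements are composed without awkward overlaps",
    "The image is free of obvious compression artifacts",
]

def build_criteria(image_prompt: str, generate: Callable[[str], str]) -> list[str]:
    """Combine the standard checklist with LLM-generated, prompt-specific criteria.

    `generate` is any text-completion callable; it is asked to return one
    true/false criterion per line.
    """
    instruction = (
        "Given this thumbnail prompt, list 3-5 true/false criteria a human "
        f"could check in the final image:\n\n{image_prompt}"
    )
    specific = [
        line.strip("- ").strip()
        for line in generate(instruction).splitlines()
        if line.strip()
    ]
    return STANDARD_CRITERIA + specific

# Usage with a stub in place of a real LLM call:
stub = lambda _: "- The headline reads 'I Quit'\n- Exactly two people appear in frame"
print(build_criteria("Two founders arguing, bold text 'I Quit'", stub))
```

The prompt-specific criteria are what let one benchmark cover very different thumbnail briefs while still being scored the same boolean way.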
Building these benchmarks has given us plenty of ideas for the next iteration, but they're already useful to us. If you're generating thumbnails for your own channel, we hope they're useful to you, too. If you have a suggestion or request, please reach out.
Right now we're focused on text-to-image generation with production-ready prompts. We're not yet testing the ability to come up with thumbnail concepts or to generate thumbnails from creators' own images, and we're not evaluating editing capabilities. We hope to add those benchmarks in the near future.
Seedream 4 tops some benchmarks, but its images look mushy and over-compressed to us, which makes those scores seem suspect.