Thumbnail generation requires specific capabilities: models must render crisp text, generate realistic faces, and compose multiple elements into a cohesive image. We don't think current vibe- or arena-based evals measure these skills well, so we've built our own.
We give each image model the same prompts (derived from production thumbnail templates), then score their outputs against standard and prompt-specific criteria. Thumbnail Bench relies on human evaluation of boolean (true/false) criteria, because large language models are still poor at identifying subtle visual defects.
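To make the scoring concrete, here's a minimal sketch of how boolean-criteria results could be aggregated into a per-image score. The criterion names, data structure, and equal weighting are our own illustration, not the exact Thumbnail Bench rubric.

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    """One boolean (pass/fail) judgment from a human reviewer."""
    name: str      # e.g. "text matches the prompt exactly" (hypothetical example)
    passed: bool

def score_output(results: list[CriterionResult]) -> float:
    """Fraction of criteria the thumbnail satisfied, as a 0-100 score."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)

# Example: one model's output for a single prompt, judged by a human.
judgments = [
    CriterionResult("text matches the prompt exactly", True),
    CriterionResult("subject makes eye contact with the viewer", False),
    CriterionResult("hands have the correct number of fingers", True),
]
print(f"Score: {score_output(judgments):.1f}/100")  # Score: 66.7/100
```

Boolean checks keep the aggregation this simple: a reviewer only answers yes or no, and the score is just the pass rate.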
For our first go at this, Thumbnail Bench focuses mostly on prompt following. That keeps evaluation faster and less subjective.
We framed the criteria as boolean true/false checks hoping that would make them easier for LLMs to evaluate, but at least for now, they can't do it reliably. GPT-4.1, GPT-5, and Sonnet 4.5 all fail to pick up on things like eye contact, malformed hands, and plastic-looking skin.
We keep a half dozen standard criteria that apply to most thumbnails, and we also use an LLM to generate criteria specific to each image prompt, as sketched below.
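Here's a rough sketch of how the two sets of criteria could be combined. The standard checklist shown is a hypothetical stand-in (the real list isn't published here), and the LLM call is abstracted behind a `generate` callable rather than tied to any particular SDK.

```python
from typing import Callable

# Hypothetical standard criteria, not the exact Thumbnail Bench list.
STANDARD_CRITERIA = [
    "All text in the image is spelled correctly and legible",
    "Faces look natural, without plastic-looking skin",
    "Hands have the correct number of fingers",
    "The subject makes eye contact with the viewer",
    "Elements are composed without awkward overlaps",
    "The image is free of obvious compression artifacts",
]

def build_criteria(image_prompt: str, generate: Callable[[str], str]) -> list[str]:
    """Combine the standard checklist with LLM-generated, prompt-specific criteria.

    `generate` is any text-completion callable; it is asked to return one
    true/false criterion per line.
    """
    instruction = (
        "Given this thumbnail prompt, list 3-5 true/false criteria a human "
        f"could check in the final image:\n\n{image_prompt}"
    )
    specific = [
        line.strip("- ").strip()
        for line in generate(instruction).splitlines()
        if line.strip()
    ]
    return STANDARD_CRITERIA + specific

# Usage with a stub in place of a real LLM call:
stub = lambda _: "- The headline reads 'I Quit'\n- Exactly two people appear in frame"
print(build_criteria("Two founders arguing, bold text 'I Quit'", stub))
```

The prompt-specific criteria are what let one benchmark cover very different thumbnail briefs while still being scored the same boolean way.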
Building these benchmarks has given us plenty of ideas for the next iteration, but they're already useful to us. If you're generating thumbnails for your own channel, we hope they're useful to you, too. If you have a suggestion or request, please reach out.
Right now we're focused on text-to-image generation with production-ready prompts. We're not yet testing the ability to come up with thumbnail concepts or to generate thumbnails from creators' own images, and we're not evaluating editing capabilities. We hope to add those benchmarks in the near future.
Seedream 4 tops some benchmarks, but its images look mushy and over-compressed to us, which makes those scores seem suspect.