Summary for ChatGPT 4o
ChatGPT 4o emerges as a specialized powerhouse rather than a jack-of-all-trades. It ranks mid-tier overall with a score of 7.62, distinguished by exceptional performance in typography and design tasks but held back by weak photorealism and strict content policies.
Key Findings:
- ‼️ High Refusal Rate: The model refused 11% of prompts, particularly those involving children or specific copyrighted styles (e.g., Ghibli).
- ✅ Design Superiority: It achieved its highest score in Graphic Design (9.1), making it a top-tier choice for commercial assets.
- ⚠️ The 'Plastic' Look: A recurring critique across Photorealistic People & Portraits is an overly smooth, airbrushed skin texture that detracts from realism.
- ✍️ Text Proficiency: It excels at integrating legible text into images, though it occasionally hallucinates prompt instructions into the visual output.
Deep Dive: Patterns & Technical Capabilities
ChatGPT 4o exhibits distinct behaviors that set it apart from other models on the leaderboard. Its underlying engine (DALL-E 3) prioritizes prompt adherence and safety, often at the expense of raw photorealism and creative flexibility.
1. The "Smoothness" Artifact
Across multiple categories, particularly Photorealistic People & Portraits and Hands & Anatomy, evaluators consistently noted an "airbrushed" or "plastic" quality. In the Group Selfie and Freckled Woman prompts, for example, the composition was strong, but the lack of skin micro-texture (pores, imperfections) capped the realism scores. This makes the model less suitable for high-end photographic simulation than competitors such as Nano Banana Pro.
2. Text Rendering: A Double-Edged Sword
ChatGPT 4o is remarkably good at rendering specific text, scoring perfect 10s on prompts like Magazine Cover and Minimalist Logo. However, this capability sometimes backfires. In the Movie Poster prompt, the model rendered its own instructions ("A MOVIE POSTER...") as literal text on the poster, ruining the result.
3. Safety Guardrails & Refusals
The model has very strict safety filters. It refused to generate images for prompts involving children, such as the Toddler Photo, and copyrighted styles like Disney Princess. While this ensures safety, it significantly hampers utility for users looking for specific stylistic emulations or innocuous family-oriented imagery.
4. Anatomical Struggles in Complex Scenes
While simple poses are handled well, the model struggles with complex anatomical interactions. In the Market Scene and Busy Intersection prompts, it lost coherence, rendering distorted faces in crowds and gibberish text on background signs, resulting in low scores (4/10).
Best Model Analysis by Use Case
Based on the data, here is where ChatGPT 4o excels and where it should be avoided:
🎨 Best Use Case: Graphic Design & Typography
This is the model's strongest suit. If you need clean, vector-style assets or images containing specific text, ChatGPT 4o is a market leader.
- Top Performance: Spring Sale Post (Score: 10) and Mascot Design (Score: 10).
- Why: It understands layout, font styles (serif vs. sans-serif), and color palettes perfectly.
📝 Good For: Text-Heavy Imagery
For signs, book covers, and labels, this model is highly reliable.
- Example: Neon Sign and Book Cover.
- Caveat: Keep prompts simple where possible; with complex instructions, the model may render your directions as literal text in the image.
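One lightweight mitigation for the instruction-leak caveat above is to scrub meta-instruction prefixes from prompts before submitting them. The helper below is a hypothetical sketch, not part of any official SDK: the function name and phrase list are assumptions, and the list would need tuning for real workloads.

```python
import re

# Leading phrases that image models sometimes render literally instead of
# obeying. Illustrative only, not an exhaustive list.
META_PREFIXES = [
    r"^a movie poster with the text\s*:?",
    r"^an image of\s*:?",
    r"^generate (an? )?(image|picture) of\s*:?",
]

def scrub_meta_instructions(prompt: str) -> str:
    """Strip leading meta-instruction phrases so the model is less likely
    to render them as literal text in the output image."""
    cleaned = prompt.strip()
    for pattern in META_PREFIXES:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE).strip()
    return cleaned

print(scrub_meta_instructions("Generate an image of: a neon diner sign"))
# → a neon diner sign
```

The same idea can be applied in reverse during review: if the generated image contains text matching one of these phrases, flag it for regeneration.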
⚠️ Mixed Results: Architecture & Interiors
The model creates beautiful, well-lit spaces but leans towards illustration rather than photorealism.
⛔ Not Recommended: Specific Art Styles & Crowds
- Avoid for: Ghibli or Disney style prompts. The model will refuse these requests (e.g., Kiki's Delivery Service).
- Avoid for: Complex Scenes. The model struggles to maintain detail across multiple subjects, often resulting in distorted faces and low scores (Average Score: 6.22).