Image Battle

Compare AI Image Generators for your use-case

The Making of ImageBattle.ai: A Journey to Learn (Vibe)Coding

In an era where AI is rapidly changing the development landscape, I found myself feeling increasingly disconnected from the hands-on coding experience I once loved. It had been years since I personally shipped something to production, and I was starting to feel like a COBOL developer trying to navigate the world of modern web development. The rise of AI-assisted coding was creating a new paradigm that I wanted to experience firsthand.

The Problem I Wanted to Solve

As someone who uses AI image generation in a bunch of hobby projects, I found myself constantly testing different models to find the best one for specific needs. Existing comparison tools like Artificial Analysis offered scores and rankings but lacked a crucial feature: an easy way to visually compare images in a gallery mode. What's more, I discovered that even the best generators occasionally produce subpar results. I wondered: could an "LLM as a Judge" approach work? Could a vision LLM effectively evaluate multiple images and determine which was best? This curiosity led to the creation of ImageBattle.ai – a tool designed to help designers, product managers, engineers, and creators identify the optimal image generation model for their specific needs.

Building ImageBattle: The Development Journey

Step 1: Finding the Right Name

Since I was creating with AI, obviously I had to use ChatGPT for the naming as well. After exploring various options with it, ImageBattle.ai emerged as the clear winner – simple, descriptive, and memorable.

Step 2: Creating a Product Specification

As CPO @ Grab, I dutifully had to create a PRD before going into coding ;) Using the fantastic ChatPRD, I drafted an initial Product Requirements Document. This structured approach helped clarify the core features and user experience I wanted to create.

Step 3: Deep Research for Categories and Prompts

To ensure comprehensive testing across different use cases, I leveraged several AI tools for research.

This multi-tool approach helped me create a diverse set of test categories and high-quality prompts that would effectively showcase the strengths and weaknesses of each model.

Step 4: Implementing the Image Generation Models

Integrating multiple AI models presented varying levels of challenge:

  • The Straightforward Ones: DALL-E 3, Google Imagen, and Replicate (which supports Black Forest Labs and Recraft) were relatively simple to implement since I had prior experience with their APIs (a minimal sketch of one such wrapper follows this list).

  • The New Additions: For models I hadn't used before, I simply fed API documentation into Gemini 2.5 Pro, which generated the necessary PHP classes for API integration.

  • The Tricky One - Reve: This model had no public API, but I discovered a community GitHub project (reve-sdk). I uploaded the entire codebase to Gemini 2.5 Pro (unfortunately Google doesn't allow sharing the conversation link), which transformed it into a working PHP library with just one minor bug to fix. This process saved countless hours of reverse-engineering work.

  • The Challenge of OpenAI: Their robust anti-scraping measures made automated generation difficult, and I didn't want to go too deep down the rabbit hole of circumventing their security (though a very tempting adventure using some browser automation). After experiencing high refusal rates even in the UI, I hired a freelancer on Fiverr who manually generated all 100 prompts for $100, delivering results within 24 hours.
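
To make this a bit more concrete, here is a minimal sketch of what one of those wrapper classes can look like. The class name and method are my own illustration rather than the actual ImageBattle code, but the endpoint and payload follow OpenAI's public Images API for DALL-E 3:

```php
<?php
// Hypothetical wrapper class in the spirit of the per-model clients described above.
// Endpoint and payload follow OpenAI's public Images API (model "dall-e-3").

class DallE3Generator
{
    public function __construct(private string $apiKey) {}

    /** Generate one image for $prompt and return its URL. */
    public function generate(string $prompt, string $size = '1024x1024'): string
    {
        $payload = json_encode([
            'model'  => 'dall-e-3',
            'prompt' => $prompt,
            'n'      => 1,
            'size'   => $size,
        ]);

        $ch = curl_init('https://api.openai.com/v1/images/generations');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $payload,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER     => [
                'Content-Type: application/json',
                'Authorization: Bearer ' . $this->apiKey,
            ],
        ]);
        $response = curl_exec($ch);
        if ($response === false) {
            throw new RuntimeException('cURL error: ' . curl_error($ch));
        }
        curl_close($ch);

        $data = json_decode($response, true);
        return $data['data'][0]['url']
            ?? throw new RuntimeException('Unexpected response: ' . $response);
    }
}

// Usage (API key read from the environment):
// $url = (new DallE3Generator(getenv('OPENAI_API_KEY')))->generate('A watercolor fox in a snowy forest');
```

Every other model then only needs its own small class with the same generate() shape – which is also why feeding API docs into Gemini worked so well for the models I hadn't used before.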

Step 5: Ranking and Evaluating the Output

For evaluation, I leveraged Gemini 2.5 Pro's multimodal capabilities. For each prompt, I fed all generated images (currently comparing 11 models) into Gemini and had it evaluate them across multiple criteria. The results were surprisingly effective – the model noticed subtle details and provided insightful comparisons, with only occasional artifacts in the analysis.
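
For the curious, the judging call itself is conceptually simple. Here is a hedged sketch – the helper name and ranking criteria are illustrative, the request shape follows the public Gemini REST API (generateContent with inline image parts), and the exact model id may differ from what the API expects:

```php
<?php
// "LLM as a Judge" sketch: all candidate images for one prompt are sent to Gemini
// in a single request together with an evaluation instruction.
// Helper name, criteria and model id are assumptions, not the actual ImageBattle code.

function rankImagesWithGemini(string $apiKey, string $prompt, array $imagePaths): string
{
    $parts = [[
        'text' => "Prompt: {$prompt}\n" .
                  "Rank the following images by prompt adherence, detail and aesthetics. " .
                  "Return a JSON array of {index, score, reasoning}.",
    ]];

    // Attach every generated image as an inline (base64-encoded) part.
    foreach ($imagePaths as $path) {
        $parts[] = [
            'inline_data' => [
                'mime_type' => 'image/png',
                'data'      => base64_encode(file_get_contents($path)),
            ],
        ];
    }

    $url = 'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent?key=' . $apiKey;

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode(['contents' => [['parts' => $parts]]]),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($response, true);
    return $data['candidates'][0]['content']['parts'][0]['text'] ?? '';
}
```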

Step 6: Meta-analysis Using Gemini 2.5 Pro

The final step involved conducting meta-analysis across all evaluations. I compiled all output files into large JSONs and let Gemini 2.5 Pro analyze the data to identify patterns and insights. The model's ability to handle these massive contexts (prompts ranging from 50,000-250,000 tokens) was remarkable, producing high-quality analysis that revealed meaningful distinctions between the models.
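
A rough sketch of that compilation step – the file layout, function name, and analysis prompt are illustrative, not the exact ImageBattle code:

```php
<?php
// Merge all per-prompt evaluation files into one large JSON blob and wrap it in a
// meta-analysis instruction. The resulting string is sent to Gemini with the same
// text-only generateContent call as in the previous sketch.

function buildMetaAnalysisPrompt(string $evaluationDir): string
{
    $evaluations = [];
    foreach (glob($evaluationDir . '/*.json') as $file) {
        $evaluations[basename($file, '.json')] = json_decode(file_get_contents($file), true);
    }

    return "You are analysing image-generator benchmark results.\n" .
           "For each model, summarise strengths, weaknesses and recurring failure modes, " .
           "and produce an overall ranking per category.\n\n" .
           "Evaluations:\n" . json_encode($evaluations, JSON_PRETTY_PRINT);
}
```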

Interesting Discoveries Along the Way

Gemini 2.5 Pro is unbelievable for data analysis as well as vision tasks

Google really cooked with this model release. The capabilities are absolutely wild at a price that's super cheap. The 1 million token context window – with really high recall up to at least 300-500k tokens – is fantastic, and the vision capability is absolutely insane: you can give it 10-15 images and have it rank and evaluate them all in one shot with high accuracy. Kudos Google, great to have you back!

OpenAI ChatGPT image generation is king

Not only are the Ghibli-style images great – across super complex prompts OpenAI is in a league of its own in many cases. The SimCity 2000 user interface is wild (though it has 2 Golden Gate Bridges hahah) – see Ultra Hard -> SimCity 2000 -> ChatGPT on the site.

Congrats OpenAI for dominating image generation!

MidJourney 7 is promising

I implemented the MidJourney 7 Alpha, and it produces some insanely good images – for example, the person with facial tattoos is clearly in a league of its own with the details (see the MidJourney v7 tattoo example). It's still an Alpha, so there are many artifacts and struggles, especially with text and complex prompts, but I'm sure it'll be great in a few weeks.

The "Ghost in the Machine" - Weird Artifacts

During testing, I encountered some fascinating AI perception phenomena. In one case, multiple AI vision models "saw" a watch displaying 10:45 in an image where the watch showed a different time. This consistent misinterpretation across models raised interesting questions about how AI vision systems process information (there's a ChatGPT convo documenting it).

Gemini 2.5 Pro Token Limits

While Gemini 2.5 Pro demonstrated impressive analytical capabilities, I discovered its limitations when exceeding 500,000 tokens. Beyond this threshold, the model began hallucinating, creating incorrect IDs and fabricating information. This led me to optimize my prompts, keeping them under 200,000 tokens for reliable analysis.
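
A simple way to enforce such a budget is to count tokens before sending anything. Here is a sketch using the Gemini countTokens endpoint – the helper name and budget value are illustrative, and the model id is an assumption:

```php
<?php
// Check a prompt against a token budget before running the (expensive) analysis call.
// Uses the public countTokens endpoint, which returns {"totalTokens": N}.

function fitsTokenBudget(string $apiKey, string $prompt, int $budget = 200000): bool
{
    $url = 'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:countTokens?key=' . $apiKey;

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode([
            'contents' => [['parts' => [['text' => $prompt]]]],
        ]),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    $total = json_decode($response, true)['totalTokens'] ?? PHP_INT_MAX;
    return $total <= $budget;
}

// If the merged evaluations exceed the budget, split them (for example per category)
// and run the meta-analysis in smaller batches.
```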

The Joy of Coding Again

Perhaps the most meaningful outcome of this project was rediscovering the joy of hands-on development. Building ImageBattle.ai reconnected me with the creative satisfaction that comes from turning an idea into a functional product. In an age where AI is increasingly handling coding tasks, there's still immense value in vibe-coding solutions together with AI. This project hopefully demonstrated how modern developers can leverage AI as a powerful assistant while maintaining the core creativity and problem-solving that makes programming so rewarding and fun.

I've found that ImageBattle.ai not only serves as a useful tool for comparing AI image generators but also represents a blueprint for how developers can work alongside AI to create better tools faster than ever before. If you're interested in exploring the platform, check out ImageBattle.ai and see for yourself how different AI image generators compare across various prompts and categories.

About Image Battle

About the Creator

Image Battle was created by Philipp Kandal.

I hope you find Image Battle useful and insightful! If you have any feedback, please don't hesitate to contact me.