How does AI 'create' an image?
It's not magic.

I've always been fascinated by tech. From biotech to future tech and everything in between, I've wanted to try it all and then break it down so I understand how it works. Even so, if you had told me 30 years ago that one day, a small handheld device would be able to create an image out of thin air and a text prompt, I wouldn't have believed it.
Yet here we are, and your phone can turn what you say into a picture through AI. It's often not a great picture (and can even be a disturbing mess), but it's still a piece of machinery doing something that used to require a human. Technically, it still does: it requires a lot of humans to spend a lot of time beforehand.
The work happens before you use it
Modern AI works using a neural network. You might recognize that the word neural means related to the nervous system, and that's not accidental. Computers aren't organic and don't have a nervous system, but they can mimic the process and function in their own way. That's where everything starts: with a convolutional neural network.
These networks specialize in recognizing patterns and objects — not in the same way we do, but in a way that's almost as cool, even if not nearly as complex as a human eye and brain.
You don't remember an exact replica of everything you've ever learned or can recognize. You know a shirt is a shirt regardless of what color it is, for example, because your brain knows what a shirt is; you don't have to see every shirt in the world to recognize one.
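The "recognize the pattern, not the exact example" idea can be sketched with a single artificial neuron, the basic building block that networks stack into layers. This is a toy, not a real convolutional network; the "shirt" features and weights here are invented for illustration.

```python
import math

# A toy illustration (not a real CNN): a single artificial "neuron"
# scores an input by a weighted sum of its features. Stacked in layers,
# this same basic operation is what lets a network learn patterns.

def neuron(features, weights, bias):
    """Weighted sum of features plus a bias, squashed to a 0..1 score."""
    total = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 / (1 + math.exp(-total))  # sigmoid activation

# Invented "shirt-ness" features: [has_sleeves, has_collar, has_buttons].
# In a real network these weights would be learned from training data.
weights = [2.0, 1.0, 0.5]
bias = -1.5

shirt = [1, 1, 1]
sock = [0, 0, 0]

print(neuron(shirt, weights, bias))  # high score: the pattern matches
print(neuron(sock, weights, bias))   # low score: it doesn't
```

Notice the neuron never stores any particular shirt; it only stores weights that describe what shirt-like inputs have in common, which is why color doesn't matter.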
AI does something similar. It's trained by processing hundreds of millions of images, each with a description stating exactly what the image is. Take this one, for example:
This is a cheeseburger and a side of fries. But it can be described in much more detail:
This is a photograph of food. It has a cheeseburger with two pieces of bacon and Swiss cheese, and a bun that looks moist. There are visible grill lines on the meat patty, and some of the meat patty's juices have soaked into the bun. There is also a wire basket that is a replica of a deep fryer basket holding at least 13 pieces of what look to be sliced potatoes. They have been fried, and at least one of them is slightly burned.
On a different, smaller plate are the remnants of an unknown appetizer with a small dish of unmelted butter in the center. There is also a small square plate with a fork and knife laid on it and a goblet off to the side filled partially with an unknown liquid. The tabletop is brown wood and there are reflections of red and yellow light near the top.
This is how images should be described as they are fed into an AI training algorithm. Every detail is analyzed, and nothing is insignificant because the computers doing the "looking" are looking for a pattern inside the visual noise of the photo.
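Structurally, that training data is just a long list of image-plus-description pairs that a training loop walks through. A minimal sketch, with invented filenames and captions:

```python
# A sketch of how training data is organized: each image is paired with
# a detailed text description. The filenames and captions are invented.

training_pairs = [
    ("burger_001.jpg",
     "A photograph of a cheeseburger with bacon and Swiss cheese, "
     "grill lines visible on the patty, and fries in a wire basket."),
    ("dog_042.jpg",
     "A black dog lying on green grass in bright sunlight."),
]

def training_examples(pairs):
    """Yield (image, caption) pairs the way a training loop would consume them."""
    for image_file, caption in pairs:
        # In a real pipeline the image would be decoded into pixels and the
        # caption tokenized into numbers; here we just pass them through.
        yield image_file, caption

for image, caption in training_examples(training_pairs):
    print(image, "->", caption)
```

The richer each caption is, the more patterns the model can associate with the pixels, which is why the exhaustive description above beats "a cheeseburger and fries."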
Eventually, the model will be able to take a prompt and recreate the right noise patterns to build an image, because it has enough of the right kind of data. Everything in an analyzed image is relevant, not just the cheeseburger that you and I would notice.
With enough analyzed data, it can serve as a path or set of instructions to create a new image that fulfills a user request. It's not taking bits and pieces of images it has already seen and piecing them together like a puzzle; it's simply creating patterns of visual noise. With enough training, those patterns end up looking like an image.
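That "patterns of visual noise" idea is roughly how diffusion-style generators work: start from pure random noise and refine it step by step until a pattern emerges. The sketch below is a toy; a real model learns which noise to remove at each step, while here we cheat and nudge the noise toward a known target pattern just to show the start-from-noise, refine-gradually loop.

```python
import random

# Toy sketch of the "noise to image" loop behind diffusion models.
# The target pattern stands in for a real image; everything is invented.

random.seed(0)

target = [0.0, 1.0, 0.0, 1.0]                      # stand-in "real image"
canvas = [random.uniform(-1, 1) for _ in target]   # pure noise to start

def refine(canvas, target, steps=50, rate=0.2):
    """Move the noisy canvas a little closer to the target each step."""
    for _ in range(steps):
        canvas = [c + rate * (t - c) for c, t in zip(canvas, target)]
    return canvas

result = refine(canvas, target)
error = max(abs(r - t) for r, t in zip(result, target))
print(round(error, 4))  # tiny: the noise has been shaped into the pattern
```

Nothing in this loop copies pieces of existing images; the canvas only ever contains noise that gets progressively reshaped, which mirrors the point above.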
This also explains why some models get some things really wrong. AI can only create based on what it was trained on; if you train using 100,000,000 photos of black dogs but never include a brown one, the AI can never create an image of a brown dog, no matter how you try to tell it to do so.
Bias exists because AI is trained on web data, and certain things are overrepresented while others are underrepresented. This makes its way into the results because, as we discussed, AI can only recreate what it was trained on. Ask AI to create an image of a scientist wearing a shirt with the Croatian flag and blue sneakers, and the scientist will probably be Caucasian simply because of how the training data was distributed.
You could ask for an image of a black scientist with the same shirt and shoes sitting in a wheelchair, and you would likely be presented with one. Just as in training, a good description matters a lot.
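The brown-dog problem can be sketched with a toy "generator" that can only sample what its training data contained. The color counts here are invented, but the mechanism is the point: whatever is absent from training can never come out, and whatever dominates training dominates the output.

```python
import random

# Toy illustration of training-data bias: a "generator" that can only
# sample from what it saw during training. The counts are invented.

training_dog_colors = ["black"] * 95 + ["white"] * 5   # no brown dogs at all

def generate_dog_color(rng):
    """Sample a dog color with the same distribution as the training data."""
    return rng.choice(training_dog_colors)

rng = random.Random(42)
samples = [generate_dog_color(rng) for _ in range(1000)]
print(samples.count("black"), samples.count("white"), samples.count("brown"))
# "brown" never appears: it was never in the training data
```

Overrepresentation works the same way: black dogs dominate the output here for no reason other than their share of the training set.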
AI will continue to get better, and image generation will be part of it. Researchers still face plenty of hurdles, not only in fine-tuning algorithms and using representative data, but also in ethically working around inherent bias and incomplete training data.
We've come a long way in just a few years, and things do not look to be slowing down anytime soon.
Jerry is an amateur woodworker and struggling shade tree mechanic. There's nothing he can't take apart, but many things he can't reassemble. You'll find him writing and speaking his loud opinion on Android Central and occasionally on Threads.