A New Approach to Synthetic Image Generation

We configured a GPT-3 model to work with DALL·E 2 to create higher-quality artwork from dynamic text inputs.

Jul 12, 2022

If you’ve been paying attention to your news feeds lately, you’ve probably heard about DALL·E 2, an AI model that can generate original images and art from a text description. It pairs an understanding of human language with an understanding of imagery, and the images you can create with it are mind-blowing.

But coming up with a good text input for DALL·E 2 is an art form in itself: knowledge of art direction, familiarity with art history, and a mastery of language are just some of the prerequisites for generating consistently high-quality images. This sort of prompt engineering can feel unapproachable for the less art-savvy, and it doesn't scale when every image needs a hand-written prompt. That gap is an opportunity for other AI models to fill.

We’ve configured a GPT-3 model to write DALL·E 2 prompts from open-ended text inputs, and the resulting artwork is, to our eyes, markedly better. This multi-model process represents a new approach to synthetic image generation, one that can scale to the wide range of dynamic text inputs that exist today.

How It Works

To demonstrate how this works, we’ll showcase a potential use case: generating artwork for a story. We’ve given GPT-3 some background on its task and examples to demonstrate the prompt structure we’d like it to follow.
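The setup described above (task background plus a few worked examples) can be sketched as a small prompt builder. This is a minimal illustration, not the prompt we actually used: the names `FewShotExample`, `TASK_BACKGROUND`, and `build_art_director_prompt`, the wording of the background text, and the `Text:` / `Art direction:` labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FewShotExample:
    """One worked example: a source text and the art direction we want back."""
    text: str
    art_direction: str


# Hypothetical task description; the exact wording given to GPT-3 is an assumption.
TASK_BACKGROUND = (
    "You are an art director. Given a passage of text, describe an artwork "
    "inspired by it: the subject, the mood, and the style of a named artist."
)


def build_art_director_prompt(
    excerpt: str,
    examples: Sequence[FewShotExample],
    background: str = TASK_BACKGROUND,
) -> str:
    """Assemble background, worked examples, and the new excerpt into one prompt."""
    parts = [background.strip(), ""]
    for example in examples:
        parts += [
            f"Text: {example.text.strip()}",
            f"Art direction: {example.art_direction.strip()}",
            "",
        ]
    # End on an open "Art direction:" label so the model completes it.
    parts += [f"Text: {excerpt.strip()}", "Art direction:"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Illustrative example pair, shortened from the Richard III pairing in this post.
    demo_examples = [
        FewShotExample(
            text="I shall despair. There is no creature loves me;",
            art_direction=(
                "This artwork is a portrait of despair. The figure is slumped "
                "over, their head in their hands."
            ),
        )
    ]
    print(build_art_director_prompt("Sing, goddess, of the anger of Achilles", demo_examples))
```

Ending the prompt on an open label is one common way to steer a completion model toward the structure shown in the examples; other layouts would work as well.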

Let’s take this excerpt from the opening passage of one of the world’s most celebrated stories, The Iliad:

“Sing, goddess, of the anger, the anger of Achilles born of Peleus, the anger all-a-damaging that brought countless pains to men, sending many stout souls to Hades before their time, making heroes into food for dogs and vultures.”

When we directly input this text into DALL·E 2, it generates mediocre images like this:

But when we feed the text to our GPT-3 Art Director, it translates the passage into a vividly descriptive text input for DALL·E 2:

This artwork is inspired by the Iliad, and specifically the anger of Achilles. The artwork itself is dark and foreboding, with a sense of doom and despair hanging over it. The figures are caught in the throes of battle, their bodies contorted in pain and rage. The artwork is in the style of Peter Paul Rubens.

Once we feed this prompt into DALL·E 2, it generates high-quality images with more depth and nuance, like this one.
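The two-step flow (story text to GPT-3, GPT-3's art direction to DALL·E 2) can be sketched as a function that chains two model calls. This is a rough sketch under stated assumptions: `generate_artwork`, `ArtResult`, `write_art_direction`, and `render_images` are hypothetical names, and the two models are passed in as plain callables rather than called through any real client library, so the example runs with stand-ins.

```python
from typing import Callable, List, NamedTuple


class ArtResult(NamedTuple):
    """Everything produced for one excerpt."""
    excerpt: str
    art_direction: str
    images: List[str]


def generate_artwork(
    excerpt: str,
    write_art_direction: Callable[[str], str],
    render_images: Callable[[str], List[str]],
) -> ArtResult:
    """Run the excerpt through the language model, then the image model."""
    # Step 1: the language model turns the raw excerpt into an art direction.
    art_direction = write_art_direction(excerpt).strip()
    if not art_direction:
        raise ValueError("the language model returned an empty art direction")
    # Step 2: the image model receives the art direction, not the raw excerpt.
    return ArtResult(excerpt, art_direction, render_images(art_direction))


if __name__ == "__main__":
    # Stand-ins for the two models (assumptions for illustration only).
    def fake_gpt3(text: str) -> str:
        return "This artwork is dark and foreboding, in the style of Peter Paul Rubens."

    def fake_dalle(prompt: str) -> List[str]:
        return [f"<image for: {prompt}>"]

    print(generate_artwork("Sing, goddess, of the anger of Achilles", fake_gpt3, fake_dalle))
```

Keeping the models behind callables means the same function works for any excerpt without a person writing the image prompt by hand.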

Now let’s look at some more story excerpts and image pairings that have been generated using this technique.

Richard III (Shakespeare)

Text excerpt:

“I shall despair. There is no creature loves me;
And if I die, no soul shall pity me:
Nay, wherefore should they, since that I myself
Find in myself no pity to myself?”

GPT-3’s Art Direction:

This artwork is a portrait of despair. The figure is slumped over, their head in their hands. The background is dark and featureless, emphasizing the loneliness and isolation of the figure. The artwork is in the style of Edvard Munch's The Scream.

The Great Gatsby (F. Scott Fitzgerald)

Text excerpt:

“If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away.”

GPT-3’s Art Direction:

This artwork is inspired by the novel The Great Gatsby. It depicts the protagonist, Jay Gatsby, in a moment of triumph. He is surrounded by the trappings of his wealth and success, but there is an emptiness to his eyes. The artwork is in the style of Andy Warhol's portraits of celebrities.

The Hitchhiker’s Guide to the Galaxy (Douglas Adams)

Text excerpt:

“The Restaurant at the End of the Universe is one of the most extraordinary ventures in the entire history of catering.
It is built on the fragmented remains of an eventually ruined planet which is (wioll haven be) enclosed in a vast time bubble and projected forward in time to the precise moment of the End of the Universe.
This is, many would say, impossible.
In it, guests take (willan on-take) their places at table and eat (willan on-eat) sumptuous meals whilst watching (willing watchen) the whole of creation explode around them.
This is, many would say, equally impossible.”

GPT-3’s Art Direction:

This artwork depicts the restaurant at the end of the universe. The restaurant is in the style of Art Deco, with a sleek and modern look. The guests are seated at their tables, enjoying their meals, while the universe explodes around them. This artwork is in the style of Salvador Dali.

Conclusion

DALL·E 2 is currently in private beta. But as it evolves and becomes available for commercial use, leveraging a large language model like GPT-3 as an intermediary could unlock a range of dynamic use cases. Because this approach eliminates the need for human prompt engineering on an image-by-image basis, it enables the generation of high-quality synthetic images at scale.

