The artificial intelligence research group OpenAI has created a new version of DALL-E, its text-to-image generator. DALL-E 2 is a higher-resolution, lower-latency version of the original system, producing images that depict descriptions written by users. It also adds new capabilities, such as editing an existing image. As with OpenAI’s previous work, the tool is not being released directly to the public, but researchers can sign up online to preview the system, and OpenAI hopes to later make it available for use in third-party apps.
The original DALL-E, a portmanteau of the artist “Salvador Dalí” and the robot “WALL-E”, debuted in January 2021. It was a limited but fascinating test of AI’s ability to visually represent concepts, from mundane depictions of a mannequin in a flannel shirt to “a giraffe made of a turtle” or an illustration of a radish walking a dog. At the time, OpenAI said it would continue to build on the system while examining potential pitfalls, such as bias in image generation or the production of misinformation. It is attempting to address those issues with technical safeguards and a new content policy, while also reducing its computing load and pushing forward the core capabilities of the model.
One of the new features in DALL-E 2, inpainting, applies DALL-E’s text-to-image capabilities at a more granular level. Users can start with an existing image, select an area, and tell the model to edit it. You can block out a painting on a living room wall and replace it with a different picture, for example, or add a vase of flowers to a coffee table. The model can fill in (or remove) objects while accounting for details like the direction of shadows in a room. Another feature, variations, is like an image search tool for pictures that don’t exist. Users can upload a starting image and then generate a range of variations similar to it; they can also blend two images, generating pictures that have elements of both. The generated images are 1,024 x 1,024 pixels, a leap over the 256 x 256 pixels the original model delivered.
DALL-E 2 builds on CLIP, a computer vision system that OpenAI also announced last year. “DALL-E 1 just took our GPT-3 approach from language and applied it to produce an image: We compressed images into a series of words and learned to predict what comes next,” says OpenAI research scientist Prafulla Dhariwal, referring to the GPT model used by many text-based AI apps. But that word-matching didn’t necessarily capture the qualities humans found most important, and the predictive process limited the realism of the images. CLIP was designed to look at images and summarize their contents the way a human would, and OpenAI iterated on this process to create “unCLIP”, an inverted version that starts with the description and works its way toward an image. DALL-E 2 generates the image using a process called diffusion, which Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with greater and greater detail.
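The coarse-to-fine refinement Dhariwal describes can be sketched in a few lines. This is a deliberately simplified toy, not the actual model: real diffusion models learn a neural network that predicts and removes noise at each step, whereas here the “denoiser” simply nudges the noise toward a known target, just to illustrate the iterative structure of starting from random dots and gradually resolving an image.

```python
import random

random.seed(0)

# A stand-in "image": a flattened 8x8 gradient of pixel values in [0, 1].
# (Hypothetical data chosen only for illustration.)
target = [i / 63 for i in range(64)]

# Step 0: pure noise -- the "bag of dots" the generation starts from.
x = [random.gauss(0.0, 1.0) for _ in range(64)]

# Each iteration removes a little noise, pulling the sample toward the
# target. A real model would instead apply a learned denoising network.
for _ in range(50):
    x = [xi + 0.1 * (ti - xi) for xi, ti in zip(x, target)]

# After many small steps, the sample closely matches the target image.
error = sum(abs(xi - ti) for xi, ti in zip(x, target)) / len(x)
```

After 50 steps the mean pixel error has shrunk by a factor of roughly 0.9⁵⁰, which is the basic intuition: each step is easy, and detail accumulates gradually.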
Interestingly, a preprint on unCLIP says it is partially resistant to a very funny weakness of CLIP: the fact that people can fool the model’s identification abilities by labeling one object (such as a Granny Smith apple) with a word indicating something else (such as an iPod). The variations tool, the authors say, “still generates images of apples with high probability,” even when using a mislabeled image that CLIP cannot identify as a Granny Smith; conversely, “the model never produces images of iPods, despite the very high relative predicted probability of this caption.”
The full DALL-E model was never released publicly, but other developers have honed their own tools that mimic some of its functions over the past year. One of the most popular is Wombo’s Dream mobile app, which generates pictures of whatever users describe in a variety of art styles. OpenAI isn’t releasing any new models today, but developers could use its technical findings to update their own work.
OpenAI has implemented some built-in safeguards. The model was trained on data that had some objectionable material weeded out, ideally limiting its ability to produce objectionable content. There is a watermark indicating the AI-generated nature of the work, although it could theoretically be cropped out. As a preventive anti-abuse measure, the model also can’t generate recognizable faces based on a name; even asking for something like the Mona Lisa would apparently return a variant of the painting’s actual face.
DALL-E 2 will be tested by vetted partners with some caveats. Users are barred from uploading or generating images that are “not G-rated” and “could cause harm,” including anything involving hate symbols, nudity, obscene gestures, or “major conspiracies or events related to major ongoing geopolitical events.” They must also disclose the AI’s role in generating the images, and they can’t share generated images with other people through an app or website, so you won’t initially see a DALL-E-powered version of something like Dream. But OpenAI hopes to later add it to the group’s API toolset, allowing it to power third-party apps. “Our hope is to keep doing a staged process here, so we can keep evaluating from the feedback we get how to release this technology safely,” says Dhariwal.
Additional reporting by James Vincent.