This blog shows how to generate stunning advertising creatives with PixArt-δ.
Text-to-image generation models have been a game-changer: with nothing more than a prompt, they can produce detailed, usable images. Many models have been introduced for this purpose, but most struggle with three core challenges: capturing pixel-level dependencies, aligning the text with the generated image, and ensuring high aesthetic quality. PIXART-α was introduced as an answer to these challenges and is commonly compared against models such as SDXL and its LoRA and LCM-LoRA variants. The same team has since improved on it with PIXART-δ, which adds ControlNet and ControlNet-LCM variants.
Text-to-image synthesis offers numerous applications; here, we'll explore its use in generating images for an airline crew training school and show how it can enhance the advertising on its website.
In this blog post, we'll delve into how PIXART-δ addresses three key challenges in text-to-image synthesis and how it handles advertisement-related image generation. But first, let's familiarize ourselves with PIXART-α!
PIXART-α is a Transformer-based text-to-image (T2I) diffusion model that represents an innovative approach to generating high-quality images from textual descriptions. Let’s take a look at its architecture.
PIXART-α adopts the Diffusion Transformer (DiT) as its base architecture. DiT is designed to handle the unique challenges of T2I tasks, which makes it well suited for generating images from text descriptions. The model consists of Transformer blocks, the fundamental building blocks that process both textual and visual information; they enable the model to capture dependencies between words in the text and pixels in the image, which facilitates the generation of coherent, realistic images.
PIXART-α incorporates a multi-head cross-attention mechanism into each Transformer block. This mechanism allows the model to flexibly interact with the textual conditions extracted from the language model while processing image data. It aligns the textual descriptions with the corresponding image features, which enhances the quality of the generated images.
The model utilizes adaptive layer normalization (adaLN) modules within each Transformer block. These modules adaptively adjust the normalization parameters based on the input data, which makes the model more flexible and effective at handling different types of input conditions. PIXART-α also incorporates re-parameterization techniques to improve its efficiency and scalability: certain parameters are initialized to specific values that facilitate training and inference, which leads to better performance and reduced computational complexity.
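To make this concrete, here is a simplified PyTorch sketch of the block structure described above: self-attention over image tokens, multi-head cross-attention to the text-encoder tokens, and an adaLN-style modulation driven by the timestep embedding. The class name, dimensions, and layer choices are illustrative only, not PIXART-α's actual implementation.

import torch
import torch.nn as nn

class PixArtStyleBlock(nn.Module):
    """Illustrative Transformer block: self-attention over image tokens,
    cross-attention to text tokens, and an MLP, modulated adaLN-style
    by the diffusion timestep embedding."""
    def __init__(self, dim=256, heads=8, text_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # adaLN: predict shift/scale/gate values from the timestep embedding
        self.adaln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_tokens, t_emb):
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.self_attn(h, h, h)[0]          # self-attention
        x = x + self.cross_attn(x, text_tokens, text_tokens)[0]          # text cross-attention
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)

block = PixArtStyleBlock()
x = torch.randn(2, 64, 256)      # image tokens
text = torch.randn(2, 20, 256)   # text tokens (already projected to the model width)
t = torch.randn(2, 256)          # timestep embedding
print(block(x, text, t).shape)   # torch.Size([2, 64, 256])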
PIXART-α is trained in three stages, each aimed at one of the challenges mentioned earlier.

Stage 1: Pixel Dependency Learning: This stage focuses on understanding the intricate pixel-level dependencies within images. The goal is to generate realistic images by capturing the distribution of pixel-level features. The approach involves training a class-conditional image generation model using a pre-trained model as a starting point. By leveraging pre-trained weights and designing the model architecture to be compatible with these weights, the training process becomes more efficient.
Stage 2: Text-Image Alignment Learning: The primary challenge in transitioning from pre-trained class-guided image generation to text-to-image generation is achieving accurate alignment between textual descriptions and images. In this stage, a dataset consisting of precise text-image pairs with high concept density is constructed. This dataset helps the model learn to align textual descriptions with images effectively, thus improving the overall quality of generated images.
Stage 3: High-Resolution and Aesthetic Image Generation: In the final stage, the model is fine-tuned using high-quality aesthetic data for generating high-resolution images. By incorporating additional datasets and leveraging prior knowledge gained from the previous stages, the adaptation process in this stage converges faster, which results in higher-quality generated images.
When it comes to evaluating the quality of generated images, the key metric is the Fréchet Inception Distance (FID). FID measures the similarity between the distributions of real and generated images based on features extracted from a pre-trained neural network. PIXART-α achieves a low FID score, indicating high fidelity: its images closely match the statistics of real images, and on this metric it outperforms many other T2I models.
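For intuition, FID compares the mean and covariance of feature vectors from real versus generated images. A minimal NumPy/SciPy sketch of the formula follows; in practice the features come from a pre-trained InceptionV3 network (2048-dimensional activations), while the small random arrays below are only placeholders.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^(1/2))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))

# Placeholder features; real FID uses InceptionV3 activations of the two image sets.
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64)
print(frechet_distance(real, fake))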
In terms of training efficiency, measured in GPU days and the total volume of training images, PIXART-α reaches this fidelity with far lower resource consumption than other T2I models. For evaluating how well generated images align with their textual descriptions, metrics such as attribute binding, object relationships, and overall compositionality are used, and the model demonstrates its capability to accurately translate textual descriptions into visually coherent images. In user studies, PIXART-α also scores higher on user satisfaction than comparable T2I models.
Now let’s understand PIXART-δ.
PIXART-δ is a cutting-edge text-to-image generation model designed to significantly enhance the efficiency and control of image synthesis from textual descriptions. It stands out by integrating Latent Consistency Models (LCM) to achieve a remarkable 4-step sampling acceleration while still generating high-quality images at an impressively fast pace.
Let's delve into its key features:
The ControlNet-UNet architecture is inspired by the classic UNet model, which is widely used for image segmentation tasks. UNet's architecture is fundamentally a convolutional neural network (CNN) that expands the typical structure with a symmetrically designed decoder path to perform precise localization. This makes it exceptionally suitable for tasks requiring detailed control over the pixel-level attributes of images.
How It Works: Following the original ControlNet recipe, a trainable copy of the network's "encoder" half processes the control input (such as an edge map) while the base model stays frozen; the copy's outputs are injected into the corresponding "decoder" blocks through the skip connections, so the control signal can steer generation without disturbing the pretrained weights.
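To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of the skip-connection wiring that the UNet-style design relies on. It uses plain linear blocks instead of convolutions, and the class name, depth, and dimensions are illustrative rather than PIXART-δ's actual code.

import torch
import torch.nn as nn

class TinyUNetStyle(nn.Module):
    """Symmetric encoder/decoder over a stack of blocks; each decoder block
    receives the matching encoder output through a skip connection."""
    def __init__(self, dim=64, depth=3):
        super().__init__()
        self.encoder = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        self.decoder = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth)])

    def forward(self, x):
        skips = []
        for block in self.encoder:
            x = torch.relu(block(x))
            skips.append(x)                                   # remember encoder features
        for block in self.decoder:
            x = block(torch.cat([x, skips.pop()], dim=-1))    # fuse via skip connection
        return x

x = torch.randn(2, 16, 64)
print(TinyUNetStyle()(x).shape)   # torch.Size([2, 16, 64])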
The ControlNet-Transformer variant is built upon the Transformer architecture, which has been very successful in natural language processing and is increasingly used in image-related tasks.
How It Works: Instead of forcing a UNet-style split onto the Transformer, this variant attaches a trainable copy to each of the first N base blocks. Each copy processes the control signal and feeds its output back into the frozen network through a connection initialized to zero, so at the start of training the model behaves exactly like the original and then gradually learns to respect the control input.
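Here is a minimal sketch of that zero-initialized "copy block" idea in PyTorch. The class name, the use of a plain linear layer as a stand-in for a Transformer block, and the shapes are all illustrative assumptions, not PIXART-δ's actual implementation.

import copy
import torch
import torch.nn as nn

class ControlledBlock(nn.Module):
    """Wraps a frozen base block with a trainable copy fed by a control signal.
    The zero-initialized projection means the control branch contributes nothing
    at the start of training, preserving the pretrained behaviour."""
    def __init__(self, base_block, dim):
        super().__init__()
        self.control_copy = copy.deepcopy(base_block)   # trainable copy of the block
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep the pretrained block frozen
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)            # zero-initialized connection
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x, control):
        base_out = self.base(x)
        control_out = self.control_copy(x + control)     # control branch sees the condition
        return base_out + self.zero_proj(control_out)

dim = 64
block = ControlledBlock(nn.Linear(dim, dim), dim)
x, control = torch.randn(2, 16, dim), torch.randn(2, 16, dim)
print(block(x, control).shape)   # torch.Size([2, 16, 64])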
The 1024px results in PIXART-δ are impressive, showcasing the model's ability to generate high-resolution images with fine detail and controllability. By leveraging the multi-scale image generation capabilities of PIXART-α, PIXART-δ achieves remarkable results even at the high resolution of 1024x1024 pixels. The generated images stay faithful to the input prompts while maintaining high visual quality and coherence, and developers can exert precise control over the geometric composition and visual elements of the generated images, allowing customization and fine-tuning to specific requirements.
PIXART-δ differs from PIXART-α in several key aspects: it integrates Latent Consistency Models for roughly 4-step sampling, which makes inference dramatically faster, and it adds ControlNet-based conditioning, which gives fine-grained control over the composition of the generated images.
In this blog post, we experiment with the standard PIXART-α checkpoint (PixArt-XL-2-1024-MS) and the LCM-accelerated checkpoint from the PIXART-δ work (PixArt-LCM-XL-2-1024-MS). So roll up your sleeves and fire up a Colab notebook; let's see which model generates the better images.
Install the dependencies.
!pip install -q diffusers transformers sentencepiece accelerate
Import torch and PixArtAlphaPipeline from Diffusers.
import torch
from diffusers import PixArtAlphaPipeline
Load the standard PIXART-α model (PixArt-XL-2-1024-MS) and move it to the GPU.
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)
pipe = pipe.to('cuda')
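Optionally, fix the random seed so both checkpoints start from the same noise, which makes the later comparison fairer. The seed value below is arbitrary.

# Optional: a fixed generator makes runs reproducible across the two checkpoints.
generator = torch.Generator(device="cuda").manual_seed(42)
# Then pass it to each generation call, e.g. pipe(prompt, generator=generator).images[0]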
Pass the prompt, and observe the results.
Prompt 1:
prompt = "A real air hostess, airplane background, wearing red dress and hat, photorealistic"
pipe(prompt).images[0]
Result:
Prompt 2:
prompt = "A real airplane crew member, man, white shirt and black tie, photorealistic"
pipe(prompt).images[0]
Result:
Prompt 3:
prompt = "A real airplane pilot, woman in airplane, white shirt, airplane background, photorealistic"
pipe(prompt).images[0]
Result:
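Since these creatives are destined for a website, you'll likely want to write them to disk rather than only render them inline. The pipeline returns standard PIL images, so saving takes one call; this reuses the pipe and prompt variables from the cells above, and the file name is just an example.

# Capture the returned PIL image and save it for the ad campaign.
image = pipe(prompt).images[0]
image.save("airline_crew_creative.png")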
Now initialize the LCM model (PixArt-LCM-XL-2-1024-MS) and move it to the GPU.
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-LCM-XL-2-1024-MS", torch_dtype=torch.float16)
pipe = pipe.to('cuda')
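One note before prompting: LCM-distilled checkpoints are meant to be sampled in very few steps and without classifier-free guidance. The calls below use the pipeline defaults; if the outputs look off, a common LCM recipe (worth verifying against the model card) is roughly 4 steps with guidance disabled, reusing the prompt variable from above.

# LCM checkpoints are typically sampled with ~4 steps and no classifier-free guidance.
image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]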
Pass the prompt and observe the results.
Prompt 1:
prompt = "A real air hostess, airplane background, wearing red dress and hat, photorealistic"
pipe(prompt).images[0]
Result:
Prompt 2:
prompt = "A real airplane crew member, man, white shirt and black tie, photorealistic"
pipe(prompt).images[0]
Result:
Prompt 3:
prompt = "A real airplane pilot, woman in airplane, white shirt, airplane background, photorealistic"
pipe(prompt).images[0]
Result:
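To make the comparison easier, you can tile each model's three outputs into a single grid image. Here is a small sketch that assumes a recent diffusers release (which ships a make_image_grid helper) and reuses whichever pipeline is currently loaded.

from diffusers.utils import make_image_grid

prompts = [
    "A real air hostess, airplane background, wearing red dress and hat, photorealistic",
    "A real airplane crew member, man, white shirt and black tie, photorealistic",
    "A real airplane pilot, woman in airplane, white shirt, airplane background, photorealistic",
]

# Generate the three creatives with the currently loaded checkpoint and tile them.
images = [pipe(p).images[0] for p in prompts]
grid = make_image_grid(images, rows=1, cols=3)
grid.save("pixart_creatives_grid.png")

# Rerun this cell after loading the other checkpoint to compare the two grids.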
Comparing the two models, the LCM checkpoint gave the better results. We also saw how easy it is to use PIXART-δ in a Colab notebook: the images from the PIXART-δ LCM version were noticeably more realistic than those from the standard PIXART-α version. It is encouraging to see that PIXART-δ builds its training strategy around the specific challenges T2I models face. Thanks for reading this blog!
https://arxiv.org/pdf/2310.00426.pdf
https://arxiv.org/pdf/2401.05252.pdf
https://github.com/PixArt-alpha/PixArt-alpha
Superteams.ai connects top AI talent with companies seeking accelerated product and content development. Superteamers offer individual or team-based solutions for projects involving cutting-edge technologies like LLMs, image synthesis, audio or voice synthesis, and other open-source AI solutions. Working with a network of over 500 AI researchers and developers, Superteams has facilitated diverse projects such as 3D e-commerce model generation, advertising creative generation, enterprise-grade RAG pipelines, geospatial applications, and more. Focusing on talent from India and the global South, Superteams offers competitive solutions for companies worldwide. To explore partnership opportunities, please write to founders@superteams.ai or visit this link.