Learn how to effectively fine-tune Stable Diffusion with Dreambooth, Textual Inversion, and LoRA
Stable Diffusion has opened tremendous possibilities for creators, designers, artists, AI engineers, and photo editors to translate their imagination into reality through appropriate prompts. But classical Stable Diffusion alone is insufficient for specific use cases.
Peter is a creative director, and his team was trying to create posters for Taylor Swift’s Eras Tour. Being an engineering graduate and a tech enthusiast, he asked his team to produce the posters in volume using AI within the upcoming week. After consulting his techie friends, the team tried Stable Diffusion to generate some Taylor Swift images.
He was a bit disappointed as Stable Diffusion gave this result. The face and body parts were unclear and blurry. Being a Swiftie, he was adamant about designing 200 posters by the end of the week.
Out of anxiety, he asked his techie friend Noah for help. She explained the ins and outs of Stable Diffusion to him and shared this link for further reading: https://stable-diffusion-art.com/how-stable-diffusion-work/
Noah advised him to perform fine-tuning for his use case.
Large models like Stable Diffusion are typically trained on general text-image pairs, which can make them unsuitable for understanding specific subjects and generating those subjects in various contexts. For instance, they may not excel at creating photorealistic images of Taylor Swift's face in cinematic lighting.
To adapt Stable Diffusion to specific use cases, we reuse the pre-trained model's parameters as a starting point and continue training on a small set of reference images.
These are the most common techniques for fine-tuning the model:
Peter was amazed when he discovered that he could personalize Stable Diffusion to generate Taylor Swift images using only 4-6 images and fine-tuning.
Dreambooth associates a unique identifier, such as "Swiftie" in this case, with the subject image, which is the image of Taylor Swift. Here is the workflow:
In this context, "Swiftie" serves as the identifier. The subject statement containing the identifier is transformed into text embeddings. Noise is added to the original image for n steps to produce the model's input, and the same image noised for n-1 steps serves as the ground truth. The model's output is compared with the ground truth to compute the loss, the gradients are updated, and training continues for 200 epochs.
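To make the mechanics concrete, here is a minimal sketch of a single training step in the noise-prediction form used by libraries such as diffusers. The function and variable names are illustrative, not taken from any official script.

import torch
import torch.nn.functional as F

def training_step(unet, noise_scheduler, latents, prompt_embeds, optimizer):
    # Sample Gaussian noise and a random timestep for each image in the batch.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )

    # Forward process: create the noisy latent x_t from the clean latent x_0.
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The UNet predicts the added noise, conditioned on the prompt embedding
    # that contains the identifier token (e.g. "Swiftie").
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=prompt_embeds).sample

    # The prediction is compared with the ground truth to compute the loss,
    # and the gradient update nudges the model toward the new subject.
    loss = F.mse_loss(noise_pred.float(), noise.float())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()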
However, there are two factors that can potentially disrupt the training process:
In order to mitigate these issues, various solutions can be implemented:
To address these challenges, Peter employed prior preservation loss. These are the steps for Swiftie model training:
These steps were implemented to fine-tune the model and achieve the desired results.
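At the heart of those steps is the prior-preservation loss itself. As a rough illustration of what it looks like in code, here is a hedged sketch; the batching convention (instance examples first, generated class examples second) and the weight value are assumptions that mirror common open-source training scripts.

import torch.nn.functional as F

def prior_preservation_loss(noise_pred, noise, prior_loss_weight=1.0):
    # The batch is assumed to hold instance examples ("a photo of Swiftie")
    # in its first half and generated class examples ("a photo of a woman")
    # in its second half.
    pred_instance, pred_class = noise_pred.chunk(2, dim=0)
    target_instance, target_class = noise.chunk(2, dim=0)

    # The instance loss teaches the model the new subject.
    instance_loss = F.mse_loss(pred_instance.float(), target_instance.float())
    # The prior loss keeps the model's general notion of the class from drifting.
    prior_loss = F.mse_loss(pred_class.float(), target_class.float())

    return instance_loss + prior_loss_weight * prior_loss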
accelerate config
When running accelerate config, you are presented with various options. Since Peter only had access to an 8GB GPU, he made the following choices to optimize his training setup:
By configuring his training environment in this manner, Peter aimed to maximize the efficiency of his training process while effectively managing the limitations posed by his 8GB GPU.
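In code, those memory-conscious choices roughly correspond to the sketch below. The base model id, learning rate, and the use of bitsandbytes' 8-bit Adam optimizer are illustrative assumptions, not Peter's exact settings.

import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

# Load the UNet that will be fine-tuned (the model id is an illustrative choice).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Gradient checkpointing trades extra compute for a much smaller memory footprint.
unet.enable_gradient_checkpointing()

# bitsandbytes' 8-bit Adam keeps optimizer states small enough for an 8GB card.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=5e-6)

# Mixed precision (fp16) is usually selected while answering accelerate config,
# and the accelerate launcher applies it automatically during training.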
It took a few hours for Peter's team to train the Swiftie model, and here is the result.
Prompt - Swiftie in cinematic lighting
Wow! Peter's clients were astonished. How did he manage to deliver such outstanding results in so little time? Thanks to Dreambooth.
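For reference, generating such an image from a fine-tuned checkpoint can look like this minimal sketch; the local folder name "swiftie-model" is an assumption about where the training run saved its weights.

import torch
from diffusers import StableDiffusionPipeline

# Load the Dreambooth-fine-tuned weights saved by the training run
# (the "swiftie-model" path is illustrative).
pipe = StableDiffusionPipeline.from_pretrained("swiftie-model", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# The identifier token used during training now triggers the learned subject.
image = pipe("Swiftie in cinematic lighting", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("swiftie_poster.png")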
As a technology enthusiast, Peter found great enjoyment in the fine-tuning process and harbored a desire to delve deeper into the experimentation. His curiosity led him to wonder whether he could incorporate the name "Taylor Swift" into Stable Diffusion's lexicon. So he sought guidance from his friend, Noah.
Noah proceeded to introduce Peter to the concept of Textual Inversion.
To understand textual inversion, it helps first to grasp how diffusion models are trained: Gaussian noise is incrementally added to an image over a series of time steps (the forward process), and the model then learns to denoise it, ultimately recovering the original image (the reverse process).
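A small sketch of the forward (noising) process makes this concrete; the linear beta schedule and tensor shapes below are illustrative assumptions.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta)

def add_noise(x0, t):
    # Closed form of the forward process: jump straight to timestep t.
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)  # stand-in for an image (or latent)
xt = add_noise(x0, t=500)       # heavily noised version; the model learns to reverse this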
Textual inversion, on the other hand, involves the integration of embeddings that establish a new relationship between text and images into a pre-trained Stable Diffusion model.
Eager to learn more, Peter posed a question to his tech-savvy acquaintance: "How can I generate these embeddings to incorporate 'Taylor Swift' into Stable Diffusion's vocabulary?"
In response, Noah advised Peter to run the training process known as textual inversion. This approach entails feeding 3-5 images along with their associated text into the model and optimizing a new text embedding so that the frozen model reproduces those images; because the images are "inverted" back into the textual embedding space, the technique earns the name "Textual Inversion."
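A hedged sketch of that training setup follows, using the Hugging Face transformers CLIP tokenizer and text encoder; the placeholder token, initializer word, base model id, and learning rate are illustrative assumptions.

import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative base model
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Register a new placeholder token and grow the embedding table to fit it.
tokenizer.add_tokens("<taylor-swift>")
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<taylor-swift>")

# Initialise the new vector from a semantically related word.
embeddings = text_encoder.get_input_embeddings().weight
init_id = tokenizer.encode("woman", add_special_tokens=False)[0]
with torch.no_grad():
    embeddings[new_id] = embeddings[init_id].clone()

# Everything stays frozen except the token embeddings; real training scripts also
# zero the gradients of every row other than the new token's after each backward
# pass, so only the new embedding vector is learned with the usual diffusion
# noise-prediction loss.
text_encoder.requires_grad_(False)
embeddings.requires_grad_(True)
optimizer = torch.optim.AdamW([embeddings], lr=5e-4)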
This approach differs from Dreambooth in several key ways:
Eager to explore this approach further, Peter expressed a strong desire to get started as soon as possible. His tech-savvy friend recommended that he leverage pre-trained embeddings, which he promptly obtained from the https://civitai.com/models/4631/taylor-swift repository.
Peter proceeded to incorporate these embeddings into the web user interface (webUI) of the AUTOMATIC1111 repository, available at https://github.com/AUTOMATIC1111/stable-diffusion-webui, to experiment with their application.
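For readers who prefer the diffusers library over the webUI, a downloaded embedding file can be loaded in a similar way; the file name and trigger token below are assumptions about the downloaded file, and the base model id is illustrative.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the downloaded Textual Inversion embedding and bind it to a trigger token.
pipe.load_textual_inversion("taylor_swift.pt", token="taylor-swift")

image = pipe("portrait of taylor-swift, cinematic lighting").images[0]
image.save("taylor_swift.png")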
The outcomes were as follows for the "Taylor Swift" prompt:
LoRA (Low-Rank Adaptation) is faster to train. Rather than updating all of the model's weights, it injects small low-rank matrices alongside existing layers (typically the attention layers) and trains only those. When fine-tuning finishes, the learned low-rank weights are merged on top of the original weights. Because this update is far smaller than a full Dreambooth checkpoint, training takes less time.
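To illustrate the idea, here is a generic sketch of low-rank adaptation applied to a single linear layer; it shows the principle rather than the exact layer structure used inside Stable Diffusion.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen pre-trained linear layer and learns only a low-rank update.
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)     # the original weights stay frozen
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Original output plus the small, trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Example: wrap one 768 -> 768 projection, a size typical of SD attention blocks.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(1, 77, 768))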
Subsequently, Peter posed a question to Noah, inquiring about the necessity of adding extra layers in between. He wondered if it might be possible to introduce an entirely new model into Stable Diffusion for further fine-tuning.
In response, Noah confirmed that this had indeed been explored, through an innovative technique known as "hypernetworks."
The aim is to train an additional small network that injects intermediate layers into the Stable Diffusion model for fine-tuning. The approach hasn't gained much popularity yet due to its time complexity and overfitting issues.
So, the most effective techniques to fine-tune Stable Diffusion models are:
Dreambooth:
1) Basically fine-tuning the whole model with 3-5 images of a subject, continuously updating the weights of the pre-trained model via gradient descent.
2) It is the most effective technique; with prior-preservation loss, its tendency to overfit and to suffer language drift is kept in check.
3) It is storage inefficient as the whole model weights have to be stored.
Textual Inversion:
1) Fine-tuning only the text embedding, so the gradient updates capture the text-image relationship directly without interfering with the model weights.
2) Output weights are a tiny embedding.
3) Because the model weights stay fixed, it captures only a fainter idea of the subject, whereas Dreambooth captures a better one.
LoRA:
1) Training only the small low-rank layers injected into the model, so only those intermediate weights get updated.
2) It is the quickest to train.
3) It shows excellent results with 20-30 images and requires less VRAM to train.
4) When training on 1000+ images, Dreambooth is a better option, because it is difficult to set LoRA's parameters at that scale.