This article discusses recently published research on generating synthetic instruction-tuning data with little to no seed examples or instructions.
LLMs are widely used across a variety of applications due to their generative capabilities. Although they are trained on large amounts of data and have huge model sizes, they do not generalize well across a wide range of scenarios. Instruction tuning was introduced to address this issue: the LLM learns to respond accurately to instructions. It is a specialized form of fine-tuning that uses training data composed of pairs of instructions and responses. The instructions should cover a diverse range of topics in order to improve the generalization ability of the LLM. However, instruction-tuning datasets depend heavily on human-provided instructions to achieve the data quality needed to train LLMs effectively.
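For concreteness, a single training example in such a dataset is simply an instruction paired with a desired response. The pair below is purely illustrative and not taken from any real dataset:

```python
# One illustrative instruction-response pair (hypothetical, for explanation only).
example = {
    "instruction": "Explain the difference between supervised and unsupervised learning.",
    "response": (
        "Supervised learning trains a model on labeled input-output pairs, "
        "while unsupervised learning finds structure, such as clusters, in unlabeled data."
    ),
}
```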
Several methods have been proposed in the literature to automatically generate an instruction-tuning dataset, which can then be used to train an LLM. One such method is Self-Instruct, which generates synthetic instruction-response pairs starting from a pool of seed examples; data generation proceeds by few-shot prompting on the seed examples. A drawback of this method is that the generated instructions are not diverse. To promote diversity in the instruction-tuning data, Evol-Instruct was proposed, which augments the seed instructions by generating data with variability in paraphrasing and writing style.
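To make the Self-Instruct idea concrete, here is a minimal sketch of few-shot prompting over a seed pool. The seed examples, prompt wording, and model choice are illustrative assumptions, not the original Self-Instruct implementation (which starts from roughly 175 human-written seed tasks):

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A tiny, illustrative seed pool of human-written instructions.
seed_instructions = [
    "Summarize the following paragraph in one sentence.",
    "Write a short poem about autumn.",
    "Translate the sentence 'Good morning' into French.",
]

def generate_new_instruction(pool: list[str], k: int = 3) -> str:
    # Few-shot prompt: show k sampled seed instructions and ask for a new task.
    shots = random.sample(pool, k=min(k, len(pool)))
    prompt = (
        "Here are some example tasks:\n"
        + "\n".join(f"- {s}" for s in shots)
        + "\nCome up with one new, different task:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Newly generated instructions are added back to the pool, growing it over time.
seed_instructions.append(generate_new_instruction(seed_instructions))
```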
Recent research from Microsoft focuses on generating a generalizable instruction-tuning dataset. The work is described in the 2024 paper "Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models" (GLAN). It aims to generate diverse synthetic data without the need for seed examples. GLAN is inspired by the generalization ability of humans, which is acquired through an education system that teaches a wide range of subjects from grades 1 through 12. GLAN follows a structured approach to generating synthetic data.
GLAN is implemented using GPT-4 and consists of the following steps: taxonomy generation, subject generation, syllabus generation, homework question generation, and answer generation. First, a hierarchical taxonomy of human knowledge is built, listing the disciplines to cover; this is the step that involves human verification. Next, for each discipline, GPT-4 is prompted to produce a list of subjects a student should learn, using a prompt along the lines of:

"Act as an education expert of discipline Mathematics and design a list of subjects a student should learn."

Each subject is then expanded into a syllabus of classroom sessions, each covering a set of key concepts, which serve as the basis for the homework questions described next.
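Since Microsoft has not released the actual prompts, the following is only a minimal sketch of what the subject-generation step could look like; the exact prompt wording, model name, and output parsing are all assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_subjects(discipline: str) -> list[str]:
    """Ask the model to act as an education expert and list subjects for a discipline.

    The prompt wording is an assumption; GLAN's actual prompts are not public.
    """
    prompt = (
        f"Act as an education expert of discipline {discipline} "
        "and design a list of subjects a student should learn. "
        "Return one subject per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    # One subject per line; strip any bullets or numbering the model adds.
    return [line.lstrip("-*0123456789. ").strip()
            for line in text.splitlines() if line.strip()]

print(generate_subjects("Mathematics"))
```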
Homework questions are generated at different levels of difficulty. For basic questions, a single concept is sampled from the pool of concepts/classroom sessions; for more challenging homework, questions are generated by combining several key concepts. GPT-4 is used to generate the questions. Additional context, such as the lessons or concepts covered so far, can be added to this step to guide GPT-4 while generating homework.
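A rough sketch of this step is shown below. The sampling scheme, prompt wording, and function signature are assumptions made for illustration, since the paper's prompts have not been released:

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_homework(concepts: list[str], difficulty: str = "basic",
                      covered_so_far: list[str] | None = None) -> str:
    """Generate a homework question from one concept (basic) or several (challenging)."""
    k = 1 if difficulty == "basic" else min(3, len(concepts))
    chosen = random.sample(concepts, k)
    prompt = (
        f"Write a {difficulty} homework question that tests the following "
        f"key concept(s): {', '.join(chosen)}."
    )
    if covered_so_far:
        # Optional extra context about lessons already covered in the class.
        prompt += f" Assume the student has already covered: {', '.join(covered_so_far)}."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```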
The answers to the homework questions are generated using GPT-3.5-turbo, chosen for its inference speed. The temperature parameter of this model is set low in order to generate accurate answers. The pipeline currently produces single-turn Q&A data and might be extended to multi-turn conversations in the future.
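The answer-generation step might look roughly like the sketch below; the exact temperature value and prompt format are assumptions, since the paper only states that the temperature is set low:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question: str) -> str:
    """Answer a homework question with GPT-3.5-turbo at low temperature."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0.1,  # assumed low value to favor accurate, deterministic answers
    )
    return response.choices[0].message.content
```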
The generated instruction-tuning data is used to fine-tune a Mistral-7B model for 3 epochs. As reported in the GLAN paper, the resulting fine-tuned model outperforms the base model on a wide range of tasks, including mathematical reasoning, logical reasoning, and academic exams.
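As a minimal sketch of such a fine-tuning run, assuming the Hugging Face TRL library and a hypothetical JSONL file of generated instruction-response pairs (the paper does not specify the training stack or data format, and the GLAN data itself is not released):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical file of generated pairs, one JSON object per line,
# e.g. {"instruction": "...", "response": "..."}.
dataset = load_dataset("json", data_files="glan_instructions.jsonl", split="train")

def to_text(example):
    # Concatenate instruction and response into a single training string.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",
    train_dataset=dataset,
    args=SFTConfig(output_dir="mistral-7b-glan", num_train_epochs=3),
)
trainer.train()
```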
GLAN provides an almost fully automatic way of generating instruction-tuning data. Since data generation is based on a taxonomy modeled on the human education system, the generated data spans a comprehensive set of disciplines and subjects, ensuring diversity in the instruction-tuning data. A model fine-tuned on this dataset therefore generalizes to a wide range of instructions and provides satisfactory responses. The approach is also scalable and customizable: adding new disciplines does not require re-generating the entire dataset, and because the first step of taxonomy generation includes human verification, it is easy to refine the hierarchical tree of knowledge or add new disciplines to it.
Note that Microsoft has not yet released the instruction-tuning data or any of the prompts used in data generation. The approach has the potential to be widely adopted once such data becomes available.