This article discusses recently published research on generating synthetic instruction-tuning data with little to no seed examples or instructions.
LLMs are widely used across a variety of applications due to their generative capabilities. Although they are trained on large amounts of data and have huge model sizes, they do not generalize well across a wide range of scenarios. Instruction tuning was introduced to address this issue: the LLM learns to respond accurately to instructions. It is a specialized form of fine-tuning that uses training data composed of pairs of instructions and responses. The instructions should cover a diverse range of topics in order to improve the generalization ability of the LLM. However, instruction-tuning datasets depend heavily on human-provided instructions to achieve the data quality needed to train LLMs effectively.
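For concreteness, a single training example in such a dataset is simply an instruction paired with a desired response. The pair below is purely illustrative and not taken from any real dataset:

```python
# One illustrative instruction-response pair (hypothetical, for explanation only).
example = {
    "instruction": "Explain the difference between supervised and unsupervised learning.",
    "response": (
        "Supervised learning trains a model on labeled input-output pairs, "
        "while unsupervised learning finds structure, such as clusters, in unlabeled data."
    ),
}
```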
Several methods have been proposed in the literature to automatically generate an instruction-tuning dataset, which can then be used to train an LLM. One such method is Self-Instruct, which generates synthetic instruction-response pairs starting from a pool of seed examples; data generation proceeds by few-shot prompting on the seed examples. A drawback of this method is that the generated instructions are not diverse. To promote diversity in the instruction-tuning data, Evol-Instruct was proposed, which augments the seed instructions by generating data with variability in paraphrasing and writing style.
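To make the Self-Instruct idea concrete, here is a minimal sketch of few-shot prompting over a seed pool. The seed examples, prompt wording, and model choice are illustrative assumptions, not the original Self-Instruct implementation (which starts from roughly 175 human-written seed tasks):

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A tiny, illustrative seed pool of human-written instructions.
seed_instructions = [
    "Summarize the following paragraph in one sentence.",
    "Write a short poem about autumn.",
    "Translate the sentence 'Good morning' into French.",
]

def generate_new_instruction(pool: list[str], k: int = 3) -> str:
    # Few-shot prompt: show k sampled seed instructions and ask for a new task.
    shots = random.sample(pool, k=min(k, len(pool)))
    prompt = (
        "Here are some example tasks:\n"
        + "\n".join(f"- {s}" for s in shots)
        + "\nCome up with one new, different task:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Newly generated instructions are added back to the pool, growing it over time.
seed_instructions.append(generate_new_instruction(seed_instructions))
```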
Recent research from Microsoft focuses on generating a generalizable instruction-tuning dataset. The work is described in the 2024 paper "Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models" (GLAN). It aims to generate diverse synthetic data without the need for seed examples. GLAN is inspired by the generalization ability of humans, which is acquired through an education system that teaches a wide range of subjects from grades 1 through 12. GLAN follows a structured approach to generating synthetic data.
GLAN is implemented using GPT-4 and consists of the following steps: taxonomy generation, subject generation, syllabus generation, homework question generation, and answer generation. First, a hierarchical taxonomy of human knowledge is built, listing the disciplines to cover; this is the step that involves human verification. Next, for each discipline, GPT-4 is prompted to produce a list of subjects a student should learn, using a prompt along the lines of:

"Act as an education expert of discipline Mathematics and design a list of subjects a student should learn."

Each subject is then expanded into a syllabus of classroom sessions, each covering a set of key concepts, which serve as the basis for the homework questions described next.
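Since Microsoft has not released the actual prompts, the following is only a minimal sketch of what the subject-generation step could look like; the exact prompt wording, model name, and output parsing are all assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_subjects(discipline: str) -> list[str]:
    """Ask the model to act as an education expert and list subjects for a discipline.

    The prompt wording is an assumption; GLAN's actual prompts are not public.
    """
    prompt = (
        f"Act as an education expert of discipline {discipline} "
        "and design a list of subjects a student should learn. "
        "Return one subject per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    # One subject per line; strip any bullets or numbering the model adds.
    return [line.lstrip("-*0123456789. ").strip()
            for line in text.splitlines() if line.strip()]

print(generate_subjects("Mathematics"))
```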
Homework questions are generated at different levels of difficulty. For basic questions, a single concept is sampled from the pool of concepts/classroom sessions; for more challenging homework, questions are generated by combining several key concepts. GPT-4 is used to generate the questions. Additional context, such as the lessons or concepts covered so far, can be added to this step to guide GPT-4 while generating homework.
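A rough sketch of this step is shown below. The sampling scheme, prompt wording, and function signature are assumptions made for illustration, since the paper's prompts have not been released:

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_homework(concepts: list[str], difficulty: str = "basic",
                      covered_so_far: list[str] | None = None) -> str:
    """Generate a homework question from one concept (basic) or several (challenging)."""
    k = 1 if difficulty == "basic" else min(3, len(concepts))
    chosen = random.sample(concepts, k)
    prompt = (
        f"Write a {difficulty} homework question that tests the following "
        f"key concept(s): {', '.join(chosen)}."
    )
    if covered_so_far:
        # Optional extra context about lessons already covered in the class.
        prompt += f" Assume the student has already covered: {', '.join(covered_so_far)}."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```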
The answers to the homework questions are generated using GPT-3.5-turbo, chosen for its inference speed. The temperature parameter of this model is set low in order to generate accurate answers. The pipeline currently produces single-turn Q&A data and might be extended to multi-turn conversations in the future.
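The answer-generation step might look roughly like the sketch below; the exact temperature value and prompt format are assumptions, since the paper only states that the temperature is set low:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question: str) -> str:
    """Answer a homework question with GPT-3.5-turbo at low temperature."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        temperature=0.1,  # assumed low value to favor accurate, deterministic answers
    )
    return response.choices[0].message.content
```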
The generated instruction-tuning data is used to fine-tune a Mistral-7B model for 3 epochs. As reported in the GLAN paper, the resulting fine-tuned model outperforms the base model on a wide range of tasks, including mathematical reasoning, logical reasoning, and academic exams.
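As a minimal sketch of such a fine-tuning run, assuming the Hugging Face TRL library and a hypothetical JSONL file of generated instruction-response pairs (the paper does not specify the training stack or data format, and the GLAN data itself is not released):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical file of generated pairs, one JSON object per line,
# e.g. {"instruction": "...", "response": "..."}.
dataset = load_dataset("json", data_files="glan_instructions.jsonl", split="train")

def to_text(example):
    # Concatenate instruction and response into a single training string.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Response:\n{example['response']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",
    train_dataset=dataset,
    args=SFTConfig(output_dir="mistral-7b-glan", num_train_epochs=3),
)
trainer.train()
```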
GLAN provides an almost fully automatic way of generating instruction-tuning data. Since data generation is based on a taxonomy modeled on the human education system, the generated data spans a comprehensive set of disciplines and subjects, ensuring diversity in the instruction-tuning data. A model fine-tuned on this dataset therefore generalizes to a wide range of instructions and provides satisfactory responses. The approach is also scalable and customizable: adding new disciplines does not require re-generating the entire dataset, and because the first step of taxonomy generation includes human verification, it is easy to refine the hierarchical tree of knowledge or add new disciplines to it.
Note that Microsoft has not yet released the instruction-tuning data or any of the prompts used in data generation. The approach has the potential to be widely adopted once such data becomes available.