A Beginner’s Guide to Fine-Tuning Mixtral Instruct Model

Adithya S K
Published in Generative AI
12 min read · Jan 8, 2024


Unleashing the Power of Mixtral: A Comprehensive Guide to Fine-Tuning

Introduction

In this blog, we’ll explore Mixtral’s key features, including its remarkable context handling, multilingual competence, and fine-tuning flexibility. With a sparse architecture and strategic pre-training, Mixtral redefines efficiency, making it a frontrunner in the evolving landscape of open-access language models. Join us as we unravel the innovation behind Mixtral and its transformative impact on AI-driven language processing.

What is Fine-Tuning?

Fine-tuning refers to the process of taking a pre-trained model and adjusting its parameters to make it more adept at solving a particular problem. Instead of training a model from scratch, which can be computationally expensive and time-consuming, fine-tuning leverages the knowledge acquired by a model during its initial training on a large dataset. This method is particularly valuable when working with limited data for a specific task.

Introduction to Mixtral

Mixtral 8x7B, an exceptional sparse mixture-of-experts (SMoE) model with openly accessible weights, operates under the Apache 2.0 license. Notably, Mixtral surpasses Llama 2 70B across a majority of benchmarks, boasting a remarkable sixfold improvement in inference speed. This model stands out as the most robust open-weight model, featuring a permissive license and excelling in cost/performance trade-offs. Particularly noteworthy is Mixtral’s ability to either match or outperform GPT-3.5 on standard benchmarks.

Key Features of Mixtral:

  1. Context Handling Excellence: Demonstrating a graceful handling of a substantial context, Mixtral efficiently manages up to 32k tokens, enhancing its contextual understanding.
  2. Multilingual Competence: With support for English, French, Italian, German, and Spanish, Mixtral accommodates a diverse linguistic landscape.
  3. Code Generation Proficiency: Showcasing robust performance in code generation tasks, Mixtral proves to be a versatile model with applications across various domains.
  4. Fine-Tuning Flexibility: Mixtral’s adaptability shines as it can be fine-tuned into an instruction-following model, achieving an impressive score of 8.3 on the MT-Bench benchmark.

In the realm of open models, Mixtral leverages a sparse mixture-of-experts network, specifically as a decoder-only model. Operating with a total parameter count of 46.7B, it effectively utilizes only 12.9B parameters per token, ensuring efficient processing and output generation at a speed and cost comparable to a 12.9B model. This pioneering approach to sparse architectures places Mixtral at the forefront of innovation.
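
To make the sparse routing idea concrete, here is a minimal, self-contained sketch of a top-2 mixture-of-experts feed-forward layer. This is an illustration only, not Mixtral’s actual implementation: the ToyMoELayer class and its sizes are hypothetical. A small router scores all experts for every token, but only the two highest-scoring experts actually run, which is why far fewer parameters than the full count are active per token.

# Hypothetical sketch of top-2 expert routing (not Mixtral's real code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, hidden_size=64, ffn_size=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                            # score every expert for each token
        weights, indices = torch.topk(logits, self.top_k)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize the kept routing weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 64)          # 4 tokens with hidden size 64
print(ToyMoELayer()(tokens).shape)   # torch.Size([4, 64]); only 2 of 8 experts ran per token

Even though eight expert networks exist, each token only pays the compute cost of two of them, which mirrors why Mixtral’s 46.7B total parameters behave like roughly a 12.9B model per token.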

Importantly, Mixtral’s pre-training on data sourced from the open web involves simultaneous training of experts and routers, establishing a robust foundation for its performance. This strategic training methodology sets the stage for ongoing advancements in open-access language models, solidifying Mixtral’s position as a frontrunner in the evolving landscape of AI-driven natural language processing.

Try it yourself!!

This notebook is designed for the specific task of fine-tuning Mixtral models and later uploading them to the Hub. Let’s walk through a sample code snippet that demonstrates this process. Happy Learning!!

Code Walkthrough

Step 01: Installing the packages

!pip install transformers trl accelerate torch bitsandbytes peft datasets -qU
!pip install flash-attn --no-build-isolation
  • This installs the required Python libraries (transformers, trl, accelerate, torch, bitsandbytes, peft, and datasets) with pip, along with flash-attn, which provides the FlashAttention kernels used later when loading the model with attn_implementation="flash_attention_2".

Step 02: Load dataset for finetuning

from datasets import load_dataset
dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")
dataset
df = dataset.to_pandas()
df.head(10)
  • The code starts by importing the required libraries.
  • The load_dataset function is then used to load a dataset. The dataset is specified by the identifier "TokenBender/code_instructions_122k_alpaca_style," and the split is set to "train," indicating the training split of the dataset.
  • The dataset object is then displayed, which typically includes information about the dataset.
  • The to_pandas method is used to convert the dataset into a Pandas DataFrame. This step is often performed to facilitate easier data manipulation and analysis using Pandas.
  • Finally, the first 10 rows of the DataFrame are displayed using the head method, providing a glimpse of the loaded data.

Step 03: Formatting the Dataset

def generate_prompt(data_point):
    """Generate the input text from the task instruction, optional context, and answer.

    :param data_point: dict: a single data point
    :return: str: formatted prompt
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
                  'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} [/INST]</s> \n <s>{data_point["output"]}</s>"""
    # Without additional context
    else:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} [/INST] </s> \n <s> {data_point["output"]} </s>"""
    return text

# add the "prompt" column to the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
  • This function takes a data_point dictionary as input.
  • The function generates a prompt by combining the task instruction, additional input information (if available), and the expected output. The generated prompt is then returned.
  • The generate_prompt function is applied to each data point in the dataset using a list comprehension. The resulting list of prompts is then added as a new column called "prompt" to the existing dataset:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)
  • In this code snippet, the dataset is shuffled and then tokenized with the model’s tokenizer. Note that the tokenizer is created in Step 04, so that cell needs to have been run before this one.
  • The shuffle method is applied to the dataset, shuffling the order of its elements. The seed parameter is set to 1234 to ensure reproducibility. Using a seed value ensures that if you run the code with the same seed, you will get the same shuffled order.
  • The map method is then used to apply a lambda function to each sample in the dataset. The lambda function tokenizes the "prompt" column using a specified tokenizer.
  • The batched=True parameter indicates that the tokenization should be applied in batches, which can be more efficient than processing one sample at a time.
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
  • The train_test_split method is used to split the dataset into training and testing subsets. The test_size parameter is set to 0.2, indicating that 20% of the data will be used for testing, and the remaining 80% will be used for training.
  • After the split, the training data is obtained by accessing the “train” key in the dataset object, and the testing data is obtained by accessing the "test" key.
After formatting, we should get something like this:

{
    "text": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST] # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>",
    "instruction": "Create a function to calculate the sum of a sequence of integers",
    "input": "[1, 2, 3, 4, 5]",
    "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
    "prompt": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST] # Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>"
}

Step 04: Loading the Base Model and Tokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False,
    attn_implementation="flash_attention_2"
)
  • It first specifies the identifier of the base Mixtral model to load.
  • Then the necessary libraries are imported, including the Transformers library for working with pre-trained models.
  • Following this, it configures the quantization settings using the BitsAndBytesConfig class. It specifies loading the model in 4-bit precision with double quantization and using bfloat16 as the compute data type.
  • Finally, it loads the base model using AutoModelForCausalLM.from_pretrained. It includes the specified quantization configuration, disables the KV cache (use_cache=False), and sets the attention implementation to "flash_attention_2".
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
  • Here, you’re creating a tokenizer using AutoTokenizer.from_pretrained based on the specified model ID.
  • You then set the pad_token to be the eos_token and configure padding to be on the right side.
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')
    generated_ids = model.generate(**model_inputs,
                                   max_new_tokens=512,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.batch_decode(generated_ids)
    return decoded_output[0].replace(prompt, "")
  • Now, let’s break down the generate_response function:
  • encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True): Tokenizes the input prompt, converts it to PyTorch tensors, and adds special tokens.
  • model_inputs = encoded_input.to('cuda'): Moves the input tensors to the GPU.
  • generated_ids = model.generate(...):
  • Uses the loaded model to generate a response based on the input prompt.
  • max_new_tokens=512: Specifies the maximum number of tokens in the generated response.
  • do_sample=True: Enables sampling from the model's output distribution.
  • pad_token_id=tokenizer.eos_token_id: Specifies the token ID to use for padding.
  • decoded_output = tokenizer.batch_decode(generated_ids): Decodes the generated token IDs back to text.
  • return decoded_output[0].replace(prompt, ""): Returns the decoded output, removing the original prompt from the generated response.
prompt = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM. \nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"
generate_response(prompt, model)
print(model)
  • This generates a response based on the given prompt using the loaded base model and returns the decoded output.
  • print(model), will display a textual representation of the model, providing information about its architecture, configurations, and parameters.

Step 05: Setting up the Training

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    task_type="CAUSAL_LM"
)
  • Importing the required modules from the PEFT library, including LoraConfig, get_peft_model, and prepare_model_for_kbit_training.
  • Configuring the PEFT model using the LoraConfig class. Key parameters include:
  • lora_alpha: the scaling factor for the learned LoRA update; the update is scaled by lora_alpha / r before being added to the frozen weights.
  • lora_dropout: dropout rate applied to the LoRA layers during training.
  • r: the rank of the low-rank update matrices; a higher rank means more trainable parameters.
  • bias: which bias terms (if any) are trained alongside the LoRA weights; "none" leaves all biases frozen.
  • target_modules: list of target modules to which LoRA adapters are attached. These modules will undergo efficient fine-tuning.
  • task_type: the task type, in this case "CAUSAL_LM", indicating causal language modeling.
  • This configuration prepares the model for efficient fine-tuning using PEFT with settings tailored for causal language modeling. The sketch below illustrates the LoRA update these parameters control.
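
To make the roles of r and lora_alpha concrete, here is a minimal, hypothetical sketch of the LoRA idea (not PEFT’s internal implementation; the ToyLoRALinear class is an illustration only): the pre-trained weight is frozen, two small matrices A and B learn a low-rank update, and that update is scaled by lora_alpha / r before being added to the frozen layer’s output.

# Hypothetical illustration of the LoRA update; PEFT's real implementation differs in detail.
import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r=64, lora_alpha=16, lora_dropout=0.1):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pre-trained weights
        self.lora_A = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)         # the learned update starts at zero
        self.dropout = nn.Dropout(lora_dropout)
        self.scaling = lora_alpha / r              # how strongly the update contributes

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))

layer = ToyLoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the small A and B matrices (2 * 4096 * 64 parameters) are trained

Returning to the walkthrough, the PEFT configuration is now applied to the actual model:
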
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

print_trainable_parameters(model)
  • First it applies the necessary transformations to the model to prepare it for k-bit training using the PEFT library.
  • Then applies the PEFT configuration (peft_config) to the model, incorporating parameters specific to efficient fine-tuning.
  • print_trainable_parameters function iterates through the named parameters of the model, counting the total number of parameters (all_param) and the number of trainable parameters (trainable_params).
  • It then prints the total number of parameters, the number of trainable parameters, and the percentage of trainable parameters relative to all parameters.
  • Finally, calls the print_trainable_parameters function with the prepared model, printing information about the number and percentage of trainable parameters in the model.

Step 06: Hyperparameters for training

if torch.cuda.device_count() > 1:  # If more than 1 GPU
    print(torch.cuda.device_count())
    model.is_parallelizable = True
    model.model_parallel = True
  • This code snippet checks if there is more than one GPU available. If true, it prints the number of available GPUs and sets the model to be parallelizable across GPUs. This can enhance training speed by distributing computations across multiple devices.

Note: Ensure that your model and training setup support parallelization across multiple GPUs. Not all models or training configurations are compatible with model parallelism.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="Mixtral_Alpace_v2",
    # num_train_epochs=5,
    max_steps=1000,  # comment out this line if you want to train in epochs
    per_device_train_batch_size=32,
    warmup_ratio=0.03,  # warmup_steps expects an integer step count, so a warmup ratio is used instead
    logging_steps=10,
    save_strategy="epoch",
    # evaluation_strategy="epoch",
    evaluation_strategy="steps",
    eval_steps=10,  # comment out this line if you want to evaluate at the end of each epoch
    learning_rate=2.5e-5,
    bf16=True,
    # lr_scheduler_type='constant',
)
  • Here we are configuring the training arguments using the TrainingArguments class from the Transformers library.
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    args=args,
    train_dataset=train_data,
    eval_dataset=test_data,
)
  • We are setting up a trainer using the TRL (Transformer Reinforcement Learning) library with the provided SFTTrainer class.
  • This setup includes the necessary components such as the model, tokenizer, datasets, and training configurations.
trainer.train()
  • We are now starting the training process using the trainer.train() method. This will initiate the training loop and begin fine-tuning your model based on the specified training configuration, datasets, and model architecture. The training will proceed for the specified number of steps or epochs, as determined by the parameters in the args (TrainingArguments).
  • During training, the trainer will print updates to the console, including training loss, evaluation metrics, and any other relevant information specified in the training arguments.
trainer.save_model("Mixtral_Alpace_v2")
  • This will save the trained model, including its weights and configuration, in the specified directory.
  • After training, it’s good practice to save the model so that you can later load it for inference or further fine-tuning, as sketched below.
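
As a hedged sketch of what that later loading could look like (the quantization settings mirror Step 04, and "Mixtral_Alpace_v2" is the output directory used above), the saved LoRA adapter can be attached to a freshly loaded base model with PeftModel.from_pretrained:

# Sketch: reload the saved adapter for inference in a new session.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "mistralai/Mixtral-8x7B-v0.1"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    quantization_config=nf4_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach the fine-tuned LoRA adapter saved by trainer.save_model("Mixtral_Alpace_v2")
model = PeftModel.from_pretrained(base_model, "Mixtral_Alpace_v2")
model.eval()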

Step 07: Save Model and Push to Hub

merged_model = model.merge_and_unload()
  • The merge_and_unload method merges the trained LoRA adapter weights into the base model and removes the PEFT wrappers, leaving a standalone model that can be used for inference or shared without separate adapter files.
def generate_response(prompt, model):
    encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encoded_input.to('cuda')

    generated_ids = model.generate(**model_inputs,
                                   max_new_tokens=150,
                                   do_sample=True,
                                   pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.batch_decode(generated_ids)
    return decoded_output[0]
  • We have defined a function generate_response for generating responses from a given prompt using a language model.
  • First uses the tokenizer to convert the input prompt into tokenized and encoded form. The return_tensors="pt" specifies that PyTorch tensors should be returned.
  • Then moves the tokenized input to the GPU. This assumes that your model is loaded on the GPU.
  • Uses the model’s generate method to generate a response based on the input prompt. The max_new_tokens parameter controls the maximum number of tokens in the generated response. do_sample=True enables sampling from the model's output distribution, and pad_token_id specifies the token ID to use for padding.
  • Decodes the generated token IDs back into human-readable text using the tokenizer’s batch_decode method.
  • Finally, returns the decoded output, which represents the generated response.
prompt = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM.\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"
generate_response(prompt, merged_model)
  • This code will generate a response based on the given prompt using your trained model (merged_model). The generated response will be printed to the console.
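
The code above covers the "save" half of this step. As a sketch of the "push to Hub" half (the repo id below is a placeholder you would replace with your own, and you must be authenticated with a Hugging Face write token), the merged model and tokenizer can be uploaded with push_to_hub:

# Sketch: upload the merged model and tokenizer to the Hugging Face Hub.
from huggingface_hub import notebook_login

notebook_login()  # log in with a Hugging Face write token

repo_id = "your-username/Mixtral_Alpace_v2"  # placeholder: replace with your own Hub repository
merged_model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

Once uploaded, the fine-tuned model can be loaded by anyone (or by you on another machine) directly from the Hub using the same from_pretrained calls shown earlier.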

Conclusion

In conclusion, Mixtral 8x7B stands tall as a game-changer in the world of open-access language models. With its openly accessible weights and a sixfold improvement in inference speed over competitors like Llama 2 70B, Mixtral is poised to revolutionize natural language processing.

Beyond its impressive performance, Mixtral’s sparse mixture-of-experts network, operating as a decoder-only model, presents an innovative approach. The model’s strategic pre-training on open web data, involving simultaneous training of experts and routers, sets the stage for ongoing advancements in the field.

As we navigate the landscape of AI-driven language models, Mixtral’s efficient processing, cost-effectiveness, and commitment to openness make it a frontrunner. Developers and researchers can harness its power to push the boundaries of natural language understanding and generation.

In essence, Mixtral is not just a model; it’s a catalyst for progress in the evolving landscape of AI. The journey we’ve taken through its features and architecture signifies a new era, where innovation and practicality converge. Mixtral 8x7B has indeed left an indelible mark, shaping the future of AI-driven natural language processing.

Happy Learning!!

If you found this post valuable, make sure to follow me for more insightful content. I frequently write about the practical applications of Generative AI, LLMs, Stable Diffusion, and explore the broader impacts of AI on society.

Let’s stay connected on Twitter. I’d love to engage in discussions with you.

If you’re not a Medium member yet and wish to support writers like me, consider signing up through my referral link: Medium Membership. Your support is greatly appreciated!

Resources

Fine-Tune Mixtral 8x7B (Mistral’s Mixture of Experts MoE) Model — Walkthrough Guide
