How I Fine-Tuned Aya 😲🤫

August 28, 2024

The Context

I recently dedicated a lot of time to learning about and reading the papers on the Aya models from Cohere For AI, which are openly released multilingual language models. The papers are well written and easy to understand. Cohere For AI recently released a new suite of Aya models, Aya 23, with 8-billion and 35-billion parameter versions. I decided to fine-tune the 8-billion parameter model because it seemed interesting and fun. For this, I used a dataset of prompts and responses that I've been curating over the past two years.

The first step, which I won't go into detail about in this article, was creating the dataset. This involved writing prompt-response pairs and storing everything in a JSONL file, where each line contains a prompt and the corresponding response that will be used to fine-tune the model.
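
For illustration, each line of the JSONL file is a single JSON object with a prompt and a response. The examples below are made up, not taken from my actual dataset:

{"prompt": "What is the capital of Cameroon?", "response": "The capital of Cameroon is Yaoundé."}
{"prompt": "Explain what a JSONL file is.", "response": "A JSONL file stores one JSON object per line, which makes it easy to stream large datasets."}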

In this article, I'll walk you through the process of fine-tuning the Aya model using the Hugging Face Transformers library. I will cover how to load the model, tokenize the data, and fine-tune the model on the dataset of prompt-response pairs. Let's dive in!

First, for illustrative purposes, let me share the script I used to create the JSONL file. You might need to adapt it to fit your own dataset. This script reads a JSON file containing conversations and extracts the prompt-response pairs exchanged between the user and the assistant. Here's the script:

import json

def extract_prompt_response_pairs_stream(input_file, output_file):
    """Read a JSON export of conversations and write prompt-response pairs to a JSONL file."""
    with open(output_file, 'w') as outfile:
        with open(input_file, 'r') as infile:
            conversations = json.load(infile)

            if isinstance(conversations, list):
                for conversation_data in conversations:
                    process_conversation(conversation_data, outfile)
            else:
                process_conversation(conversations, outfile)

def process_conversation(conversation_data, outfile):
    """Walk a single conversation's message tree and emit user/assistant pairs."""
    mapping = conversation_data.get('mapping', {})

    for node_id, node_data in mapping.items():
        message = node_data.get('message', {})
        if not message:
            continue

        author = message.get('author', {}).get('role', '')

        if author == 'user':
            user_message_content = message.get('content', {}).get('parts', [])
            user_message_text = extract_text_from_parts(user_message_content)

            children_ids = node_data.get('children', [])

            for child_id in children_ids:
                child_node_data = mapping.get(child_id, {})
                child_message = child_node_data.get('message', {})
                if not child_message:
                    continue

                child_author = child_message.get('author', {}).get('role', '')

                if child_author == 'assistant':
                    assistant_message_content = child_message.get('content', {}).get('parts', [])
                    assistant_message_text = extract_text_from_parts(assistant_message_content)

                    pair = {
                        'prompt': user_message_text,
                        'response': assistant_message_text
                    }
                    json.dump(pair, outfile)
                    outfile.write('\n')
                    break

def extract_text_from_parts(parts):
    """Extracts and concatenates text content from a list of parts, ignoring dictionary items."""
    text_parts = []
    for part in parts:
        if isinstance(part, str):
            text_parts.append(part)
        # Ignore dictionaries or other non-text elements.
    return ' '.join(text_parts).strip()


# Define input and output files
input_file = 'conversations.json'
output_file = 'prompt_response_pairs.jsonl'

# Call the function to process the file and save pairs
extract_prompt_response_pairs_stream(input_file, output_file)

print(f"Prompt-response pairs have been saved to {output_file}")

Fine-Tuning the Aya Model

To get started, you first need to create an account on Hugging Face, a platform similar to GitHub but focused on machine learning. On Hugging Face, you can find a wide variety of machine learning datasets, including some very large ones, and virtually every open-source model: large language models, vision models, audio models, and more.

After creating your Hugging Face account, you will need to generate a token with read and write permissions in your namespace. This token is essential because our goal is to fine-tune the Aya-8B model and then upload the fine-tuned version to Hugging Face. This allows others, including machine learning researchers, to access and use the model.

With your account set up and token created, the next step is to authenticate with Hugging Face, as demonstrated in the code snippet below. During this process, I also set an environment variable to manage CUDA memory allocation. Speaking of CUDA, it's worth mentioning that I used a GPU provided by Google Colab Pro, specifically the A100, which is the most powerful GPU option available on the platform. This setup provides access to 40 GB of GPU RAM, over 80 GB of system RAM, and about 200 GB of disk space. These resources are sufficient for many machine learning tasks, but not for training very large models. For a model with around 8 billion parameters, however, this setup can handle fine-tuning, albeit with some memory constraints that you will need to manage.

pip install datasets peft huggingface_hub
pip install accelerate bitsandbytes

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, pipeline
from datasets import load_dataset
from huggingface_hub import login
from peft import get_peft_model, LoraConfig, PeftType
import os
from google.colab import userdata

# Read the Hugging Face token from Colab's Secrets and expose it as an environment variable
os.environ['HF_S_TOKEN'] = userdata.get('HF_S_TOKEN')

# Step 1: Authenticate with Hugging Face
login(token=os.getenv("HF_S_TOKEN"), add_to_git_credential=True)

# Step 2: Set environment variable to manage CUDA memory allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

The next step was to load the original Aya-23-8B model in 8-bit precision with the help of the bitsandbytes library. Using the model's original precision would have led to out-of-memory errors, so I applied quantization to reduce the weights to 8-bit. After loading the model with reduced precision, I made sure it was set to training mode.

# Step 3: Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")

# No need to add special tokens since they are already in the vocabulary

# Step 4: Load the model with 8-bit precision using bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-23-8B",
    device_map="auto",  # Automatically handle device mapping
    load_in_8bit=True,  # Load the model in 8-bit to save memory (newer Transformers versions prefer a BitsAndBytesConfig via quantization_config)
    offload_folder="./offload",  # Folder to store offloaded model parts
    torch_dtype="float16"  # Use 16-bit precision for floating-point operations
)

# Resize token embeddings to account for added special tokens
# If you are sure no special tokens need to be added, you might not need to resize
# model.resize_token_embeddings(len(tokenizer))  # Not needed if no new tokens are added

# Ensure the model is in training mode
model.train()
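
As a quick check that 8-bit loading fits within the 40 GB budget, you can print the model's memory footprint (get_memory_footprint is provided by Transformers; the exact figure will vary):

# Report the approximate memory used by the quantized model, in gigabytes
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")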

Next, I loaded my dataset from the finetuning_data.jsonl file, which contains the prompt-response pairs. Since we are fine-tuning a chat model, it's crucial to use the special tokens already present in the vocabulary to structure the conversation: the BOS token, the start-of-turn token, the user token, the chatbot token, and the end-of-turn token. During tokenization, I place the prompt inside the user turn and the response inside the chatbot turn, which helps the model learn to differentiate between the prompt and the corresponding response. To achieve this, I use a dedicated tokenize function. Additionally, I set the tokenizer's maximum length to 512 tokens to prevent memory issues, especially since we're using a single GPU with 40 GB of GPU RAM.

# Step 5: Load the dataset from a jsonl file
dataset = load_dataset("json", data_files="finetuning_data.jsonl")

# Step 6: Tokenize function using the existing special tokens in the vocabulary
def tokenize_function(examples):
    input_texts = [
        f"<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{prompt}<|END_OF_TURN_TOKEN|>"
        f"<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{response}<|END_OF_TURN_TOKEN|>"
        for prompt, response in zip(examples['prompt'], examples['response'])
    ]
    encoding = tokenizer(input_texts, padding="max_length", truncation=True, max_length=512)  # Adjust max_length as needed
    encoding["labels"] = encoding["input_ids"].copy()
    return encoding

# Tokenize the entire dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["prompt", "response"])
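
To confirm that the special tokens were applied as intended, you can decode one tokenized example back to text (a quick, optional check):

# Decode the first training example to inspect the special tokens
sample_ids = tokenized_dataset["train"][0]["input_ids"]
print(tokenizer.decode(sample_ids[:80]))  # Print only the beginning to keep the output short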

After that, I used LoRA (Low-Rank Adaptation) for fine-tuning. LoRA trains only a small set of low-rank adapter weights on top of the frozen base model, so we can achieve our fine-tuning goals without retraining the entire model, making the process far less computationally intensive. I configured LoRA by setting parameters such as r (the adapter rank), lora_alpha, and lora_dropout; while iterating, I reduced lora_alpha from 32 to 16 and set lora_dropout to 0.1. Once the LoRA configuration was defined, I integrated it with the model using parameter-efficient fine-tuning (PEFT), accomplished with the get_peft_model function, which takes both the model and the LoRA configuration as inputs.

# Step 7: Set up LoRA configuration for PEFT
lora_config = LoraConfig(
    peft_type=PeftType.LORA,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

# Step 8: Integrate the model with LoRA using PEFT
model = get_peft_model(model, lora_config)

# Gradient checkpointing can reduce memory usage, but leave it disabled if it conflicts with this setup
# model.gradient_checkpointing_enable()
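
At this point it is worth confirming how little of the model LoRA actually trains; PEFT models expose print_trainable_parameters for exactly that:

# Print the number of trainable parameters versus the total parameter count
model.print_trainable_parameters()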

The next step was to define the training arguments. I set the maximum number of training steps to 100, as I noticed performance degradation beyond this point in earlier experiments. To minimize the risk of out-of-memory errors, I used a batch size of 1 and accumulated gradients over 16 steps, for an effective batch size of 16. While there are many other parameters available for tuning, these were the key ones for this particular setup.

# Step 9: Define training arguments with max_steps set to 100
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    max_steps=100,  # Limit training to 100 steps
    per_device_train_batch_size=1,  # Set the batch size
    gradient_accumulation_steps=16,  # To handle memory issues
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=50,  # Save checkpoint every 50 steps
    save_total_limit=2,  # Keep only the latest checkpoints
    eval_strategy="steps",
    eval_steps=50,  # Evaluate every 50 steps
    remove_unused_columns=False,
    fp16=False,  # Disable mixed precision to avoid conflicts
)

# Step 10: Initialize the Trainer with PEFT
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["train"],  # No separate validation split here; the training set is reused for evaluation
)

# Step 11: Fine-tune the model using PEFT and LoRA
trainer.train()
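
If you want to see how the loss evolved during those 100 steps, the Trainer keeps its logs in trainer.state.log_history; a minimal way to print them:

# Print the training loss recorded at each logging step
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"step {entry.get('step')}: loss {entry['loss']:.4f}")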

Training the model took about 36 to 40 minutes. After that, I saved the fine-tuned model and uploaded it to the Hugging Face Hub. To make it easy for users, I combined the original model's weights with the LoRA adapter weights, creating a complete model, so users don't need to fetch additional components or apply the adapter themselves. The saving step is shown in the code below.

from transformers import AutoConfig

# Load the base model's configuration and model
base_model_name = "CohereForAI/aya-23-8B"
config = AutoConfig.from_pretrained(base_model_name)

# Save the configuration
config.save_pretrained("./fine_tuned_model")

# Step 12: Save the fine-tuned model locally
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Assuming you have already integrated the adapter using LoRA

# Save the trained weights
# Note: calling save_pretrained on a PEFT-wrapped model stores the LoRA adapter weights and adapter config
model.save_pretrained("./fine_tuned_model")

After that, we check that all the necessary files are present. These include config.json, tokenizer.json, and tokenizer_config.json, plus the weights; depending on your Transformers version, the weights may be saved as pytorch_model.bin, model.safetensors, or adapter_model.safetensors if only the LoRA adapter was stored.

import os
# Verify the presence of essential files in the local directory
print("Files in fine_tuned_model directory:", os.listdir("./fine_tuned_model"))

# Optional: Ensure that essential files are present before uploading
# Note: recent Transformers versions save weights as model.safetensors (or adapter_model.safetensors for
# a LoRA adapter) rather than pytorch_model.bin, so adjust this list to match your output directory
required_files = ["config.json", "pytorch_model.bin", "tokenizer.json", "tokenizer_config.json"]
missing_files = [file for file in required_files if not os.path.exists(f"./fine_tuned_model/{file}")]
if missing_files:
    print(f"Warning: The following required files are missing: {missing_files}")

Once we have all the files, we can push the model to the Hugging Face Hub. This is done through the Hub API, using a Hugging Face token that has read and write permissions.

# Step 13: Push the fine-tuned model to Hugging Face Hub
from huggingface_hub import HfApi
import os

# Set the Hugging Face token from your environment or enter manually
hf_token = os.getenv("HF_TOKEN")  # Ensure your token is set correctly in your environment
api = HfApi()

# Define model directory and name
model_dir = "./fine_tuned_model"
model_name = "aya-finetuned-mura-8B-lora"  # Replace with your model name
repo_id = f"fsndzomga/{model_name}"  # Replace with your Hugging Face username and model name

# Step 14: Push the entire fine-tuned model directory to Hugging Face Hub
# This uploads all files in the specified directory
api.upload_folder(
    folder_path=model_dir,
    path_in_repo="",
    repo_id=repo_id,
    repo_type="model",
    token=hf_token
)

That's it! The fine-tuned Aya-8B model is now available on my Hugging Face account. You can use it with the Transformers library via AutoTokenizer and AutoModelForCausalLM along with the model ID, which follows this format: 'username/aya-finetuned-mura-8B-lora'. You'll also need to pass the conversation as a list of dictionaries, each with 'role' and 'content' keys, and use the 'tokenizer.apply_chat_template' function to add the special tokens, ensuring the model correctly recognizes user messages and chatbot responses.

For example, if you ask the chatbot a question about Emmanuel Macron, it might respond with something like "Emmanuel Macron is a French politician who has been the president of France," or similar. In my tests, the model was effective at generating accurate responses.

Here’s how you can interact with the model using Python:

# pip install transformers==4.41.1
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "fsndzomga/aya-finetuned-mura-8B-lora"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format the message with the model's chat template
messages = [{"role": "user", "content": "who is emmanuel macron ?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

gen_tokens = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.3,
    )

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)

Conclusion

This project was a valuable learning experience, highlighting the challenges of fine-tuning a model on a single GPU with 40 GB of RAM. Out-of-memory issues are common and require careful adjustments. For larger-scale fine-tuning, it would be better to use multiple GPUs with the Hugging Face Accelerate library for distributed training. Despite these challenges, the process was rewarding, and the results were impressive. The fine-tuned Aya-8B model is now available on the Hugging Face Hub, ready for use by researchers and developers. I hope this article has provided valuable insights into fine-tuning large language models and the benefits of using LoRA for efficient adaptation. If you have any questions or feedback, feel free to reach out. Happy fine-tuning!