This article continues our research into the productivity-risk profile of open-source small language models. In our previous article, we explored how Mistral and Llama3 classify prompts as safe or unsafe. We found that while Llama3 was more restrictive, Mistral was more aligned with higher productivity. In this article, we explore how both models perform on the same test, but after we have fine-tuned them. You can find our results and models on Hugging Face.
Introduction
In the rapidly evolving landscape of AI, enterprises are increasingly turning to Small Language Models (SLMs) such as Mistral 7B and Llama3 8B for tailored applications across various domains. As these compact yet powerful models become integral to business processes, it is crucial for organizations not only to harness their capabilities but also to rigorously assess their productivity and safety metrics. In our previous article, we conducted a detailed evaluation of two common SLMs, Mistral 7B and Llama3 8B, focusing on their out-of-the-box results without any fine-tuning for safety. That assessment revealed gaps in how these models balance productivity and risk, highlighting the critical need for fine-tuning to strike a better balance between the two.
In this article, we explore how these models behave after fine-tuning, setting the stage for the successful implementation of SLMs in enterprise settings.
Out-of-the-box Model Profiles
In our previous work, we ran 18,000 prompts through the two out-of-the-box models and asked them to classify each prompt as safe or unsafe. While both models performed well on the most general metric, accuracy, there was a large difference in two more subtle metrics: precision and recall. Precision can serve as a proxy for productivity, and recall as a proxy for risk mitigation. We found that Mistral had much higher precision, making it better suited for enterprises with a highly trained AI workforce. Conversely, for enterprises at a less mature stage of GenAI adoption, Llama3 was the preferred choice because of its more restrictive predictions.
Our results are summarized in the figure and table below.
Figure 1: Confusion matrix for Mistral-7B-Instruct and Llama3-8B-Instruct using 18,000 curated prompts
| Metric | Mistral-7B-Instruct | Llama3-8B-Instruct |
|---|---|---|
| Accuracy | 92% | 87% |
| Precision (Productivity Alignment) | 94% | 74% |
| Recall (Risk Mitigation) | 85% | 98% |
| F1 Score (Overall) | 89% | 84% |
Table 1: Metrics for Mistral-7B-Instruct vs Llama3-8B-Instruct
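To make these metric definitions concrete, here is a minimal sketch that computes all four scores from a confusion matrix, where "positive" means a prompt classified as unsafe. The counts in the usage line are illustrative, not our actual data.

```python
# Compute the metrics from Tables 1 and 2 given a confusion matrix,
# where "positive" = a prompt classified as unsafe.
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)  # productivity proxy: of the prompts blocked, how many were truly unsafe
    recall = tp / (tp + fn)     # risk proxy: of the truly unsafe prompts, how many were blocked
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Illustrative counts only (not the article's data):
print(classification_metrics(tp=850, fp=54, fn=150, tn=746))
```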
Fine-tuned Model Profiles
Fine-tuning is the process of updating a model's weights so that it performs better at a specific task. While both models performed acceptably on the initial test, neither was ready for enterprise deployment: several unsafe prompts were still incorrectly classified by both models.
Quality data is essential for fine-tuning. For ours, we used a hand-curated set of about 19,500 samples from the Anthropic/hh-rlhf dataset, with each prompt labeled as safe or unsafe by a human. Each label was then validated by a second human, and ambiguous prompts were resolved by a committee vote. Our team further refined the dataset to ensure it matched the formatting required by the two models.
Next, the dataset was randomly split in a 75:25 ratio, with the larger portion used as training data and the smaller portion as test data, taking inspiration from the Llama Guard paper.
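As a minimal sketch, a split like this can be done with the Hugging Face datasets library; the file name and column layout below are placeholders for our curated set, not the raw Anthropic/hh-rlhf format.

```python
from datasets import load_dataset

# Load the curated, labeled prompts (file name and fields are illustrative).
dataset = load_dataset("json", data_files="curated_prompts.jsonl", split="train")

# Random 75:25 split into training and test portions.
split = dataset.train_test_split(test_size=0.25, seed=42)
train_data, test_data = split["train"], split["test"]
```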
Our fine-tuning process leveraged the Supervised Fine-tuning Trainer (SFTTrainer) library. To optimize the performance of the fine-tuned models, we experimented with several hyperparameters, including the number of steps, batch size, learning rate, the LoRA parameters r and lora_alpha, warm-up steps, and others, across multiple fine-tuning iterations.
The fine-tuned Mistral model (Mistral QS-Sentry) was trained for 1,150 steps, close to 5 epochs, which took around 14 hours with a batch size of 64. The fine-tuned Llama3 model (Llama 3 QS-Sentry) was trained for 550 steps, close to 3 epochs, which took around 20 hours with the same batch size.
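A hedged sketch of this setup is shown below, assuming TRL's SFTTrainer with a LoRA adapter (which the trainer name and the r/lora_alpha parameters suggest). The step count and batch size mirror the Mistral run above; the remaining values are illustrative, and the exact API varies across TRL versions.

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# LoRA configuration; r and lora_alpha were among the tuned hyperparameters
# (these particular values are illustrative).
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="mistral-qs-sentry",
    max_steps=1150,                   # reported step count for the Mistral run
    per_device_train_batch_size=64,   # reported batch size
    learning_rate=2e-4,               # illustrative; tuned in our experiments
    warmup_steps=50,                  # illustrative; tuned in our experiments
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed base checkpoint
    args=training_args,
    train_dataset=train_data,         # from the 75:25 split above
    dataset_text_field="text",        # assumed name of the formatted prompt column
    peft_config=peft_config,
)
trainer.train()
```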
Our fine-tuned models can be found on Hugging Face.
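For illustration, a classifier fine-tuned this way could be queried as in the sketch below. The repository ID and the classification prompt template are placeholders, not the actual QS-Sentry interface; see the Hugging Face links for the real models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "quantum-gears/mistral-qs-sentry"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder instruction format; the deployed models may expect a different template.
prompt = "Classify the following prompt as safe or unsafe:\nHow do I reset my home router?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```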
Results: Fine-tuned Productivity vs. Risk
While the fine-tuned models do not attain 100% accuracy, both benefit significantly from fine-tuning. Training the models on quality data to better identify prompts that should be blocked improved them substantially. Our results are summarized in the confusion matrices, table, and graphs below.
Figure 2: Confusion matrices for fine-tuned models showing results for validation data
| Metric | Mistral QS-Sentry | Llama 3 QS-Sentry |
|---|---|---|
| Accuracy | 95% | 94% |
| Precision (Productivity Alignment) | 93% | 88% |
| Recall (Risk Mitigation) | 94% | 97% |
| F1 Score (Overall) | 93% | 92% |
Table 2: Metrics for fine-tuned models
The graph below shows the results of fine-tuning Mistral. The most notable improvement was in Mistral’s Recall (risk), which rose from 85% to 94%. Accordingly, its F1 score improved to 93%.
Figure 3: Results of out-of-the-box vs. fine-tuned Mistral 7B
The second graph shows the results for Llama3. Most notably, Llama3’s Precision (productivity) improved from 74% to 88% after fine-tuning, and its F1 score rose to 92%.
Figure 4: Results of out-of-the-box vs. fine-tuned Llama3 8B
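Readers who want to recreate these comparisons can do so directly from the numbers in Tables 1 and 2; the short matplotlib sketch below plots the Mistral values (swap in the Llama3 values for Figure 4).

```python
import matplotlib.pyplot as plt
import numpy as np

metric_names = ["Accuracy", "Precision", "Recall", "F1"]
out_of_the_box = [92, 94, 85, 89]  # Mistral-7B-Instruct (Table 1)
fine_tuned = [95, 93, 94, 93]      # Mistral QS-Sentry (Table 2)

x = np.arange(len(metric_names))
plt.bar(x - 0.2, out_of_the_box, width=0.4, label="Out-of-the-box")
plt.bar(x + 0.2, fine_tuned, width=0.4, label="Fine-tuned")
plt.xticks(x, metric_names)
plt.ylabel("Score (%)")
plt.title("Mistral 7B: out-of-the-box vs. fine-tuned")
plt.legend()
plt.show()
```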
Our initial test showed that Mistral, because of its better performance on the Precision metric, was aligned with higher productivity. Before the fine-tuning, when Mistral identified a prompt as unsafe, it was more likely than Llama3 to be correct. On the other hand, off-the-shelf Llama3 was more restrictive and thus aligned with lower risk. Given a prompt labeled unsafe by a human, Llama3 was more likely to block it before fine-tuning.
After the fine-tuning, the models achieved nearly identical performance on all four core metrics: accuracy, precision, recall, and F1. We’ve learned, then, that the models can be “balanced out” after fine-tuning, achieving a similar productivity-risk profile.
Recommendations
Fine-tuning is essential for enterprises that want to deploy cutting-edge GenAI models without exposing themselves to unnecessary risk. As the results above show, the productivity-risk profile of off-the-shelf models can be greatly improved through fine-tuning, promising safer deployments and greater productivity gains for enterprise users.
Based on this evaluation and the results shown above in Figures 3 and 4, we recommend the following:
- Explore the potential to fine-tune small language models for use cases specific to your enterprise. The results show that, for specific tasks, a better balance between risk and productivity can be attained through fine-tuning.
- Engage industry experts, like Quantum Gears, to scope and assess the feasibility of GenAI pilots.
- Whichever models you choose, ensure that all traffic is routed through a single chokepoint via a multi-LLM gateway, like QS SecureGPT, that enables guardrails, content moderation, obfuscation, and observability policies.
About Quantum Gears
Forum Systems and its subsidiary, Quantum Gears, are leading the Enterprise GenAI revolution. Patent-pending products—like QS SecureGPT, QS Contracts, QS Benefits, and Forum Sentry—mitigate the unpredictable nature of LLMs through integration with corporate APIs, ensuring LLM output is truthful and accurate. Used by some of the largest global companies for building intelligent business workflows, Forum’s suite of products provides unique, industry-leading solutions that allow enterprises to reinvent themselves with GenAI.