
Evaluating the performance of Large Language Models (LLMs) is an important and necessary step in refining them. LLMs are used to solve many different problems, ranging from text classification to information extraction.

Choosing the right metrics to measure an LLM's performance can greatly improve how effectively the model is refined.

In this blog, we will explore one such crucial metric – the F1 score. This blog will guide you through what the F1 score is, why it is crucial for evaluating LLMs, and how it provides a balanced view of model performance, particularly on imbalanced datasets.

 


 

By the end, you will be able to calculate the F1 score and understand its significance, which will be demonstrated with a practical example.

 

Read more about LLM evaluation, its metrics, benchmarks, and leaderboards

 

What is the F1 Score?

The F1 score is a metric used to evaluate the performance of a classification model. It combines both precision and recall.

  • Precision: measures the proportion of true positive predictions out of all positive predictions made by the model
  • Recall: measures the proportion of actual positive cases that the model correctly identifies as positive

The F1 score combines these two metrics into a single value using their harmonic mean:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
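In code, the calculation is straightforward. Here is a minimal sketch of the formula as a plain Python function (not tied to any particular library):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Example: a model with precision 1.00 and recall 0.50
print(round(f1(precision=1.00, recall=0.50), 2))  # 0.67
```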

 

F1 score matrix

 

The F1 score is particularly useful for imbalanced datasets – datasets where the distribution of classes is uneven. In such cases, a metric like accuracy (Accuracy = Correct Predictions / All Predictions) can be misleading, whereas the F1 score takes into account both false positives and false negatives, ensuring a more refined evaluation.

There are many real-world instances where a false positive or false negative can be very costly to the application of the model. For example:

  • In spam detection, a false positive (marking a real email as spam) can lead to losing important emails.
  • In medical diagnosis, a false negative (failing to detect a disease) could have severe consequences.

 

Here’s a list of key LLM evaluation metrics you must know about

 

Why Are F1 Scores Important in LLMs? 

Evaluating NLP tasks requires a metric that effectively captures the subtleties of model performance. The F1 score does this well in tasks such as:

  • Text Classification: evaluating how well an LLM categorizes text into distinct categories – for example, sentiment analysis or spam detection. 
  • Information Extraction: evaluating how accurately an LLM identifies entities or key phrases – for example, personally identifiable information (PII) detection.

The F1 score addresses the trade-off between precision and recall, and given the complexity of LLMs, it is important to ensure a model's performance is evaluated across both of these dimensions.

In fields like healthcare, finance, and law, high precision is very useful, but accounting for false negatives (recall) is equally essential, as even small mistakes can be very costly.

 

Explore a list of key LLM benchmarks for evaluation

 

Real-World Example: Spam Detection

Let’s examine how the F1 score can help in the evaluation of an LLM-based spam detection system. Spam detection is a critical classification task in which both false positives and false negatives carry serious consequences.

  • False Positives: Legitimate emails mistakenly marked as spam can cause missed communication.
  • False Negatives: Spam emails that bypass the filters may expose users to phishing attacks.

Initial Model

Consider a synthetic dataset with a clear class imbalance: most emails are legitimate and only a few are spam (a likely scenario in the real world).
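The original code snippet is not reproduced here, but below is a minimal sketch of what such an evaluation could look like, with hypothetical labels and predictions chosen to mirror the reported accuracy (1 = spam, 0 = legitimate):

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 6 legitimate emails (0) and 4 spam emails (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Predictions from a cautious initial model: it flags only 2 of the 4 spam emails
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))  # 8 correct out of 10 -> 0.80
```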

 

Result – Accuracy: 0.80

 

Despite the high accuracy, it is not safe to assume that we have created an ideal model. We could just as easily have built a model that predicts every email as legitimate, and on an imbalanced dataset it would still appear highly accurate.
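Continuing the sketch above, we can compute precision, recall, and the F1 score on the same hypothetical predictions using scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision:", precision_score(y_true, y_pred))     # 2 TP / (2 TP + 0 FP) = 1.00
print("Recall:   ", recall_score(y_true, y_pred))        # 2 TP / (2 TP + 2 FN) = 0.50
print("F1 Score: ", round(f1_score(y_true, y_pred), 2))  # 0.67
```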

 

Result

Precision: 1.00

Recall: 0.50

F1 Score: 0.67

 

Calculating the precision, recall, and F1 scores confirms our suspicion: there is a clear disparity between the precision and recall scores.

  • High Precision, Low Recall: Minimizes false positives but misses many spam emails
  • Low Precision, High Recall: Correctly filters most spam, but also marks real emails as spam

 

How generative AI and LLMs work

 

In a real-world spam detection system, an LLM needs to keep both false positives and false negatives in check. That is why the F1 score is more representative of how well the model is working, whereas the accuracy score does not capture that nuance.

A balanced assessment of precision and recall is necessary because both false positives and false negatives carry significant risk in a spam detector's classification task. With this in mind, we can fine-tune our LLM to better balance precision and recall – using the F1 score for evaluation.

Improved Model
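Again, the original snippet is not reproduced here; the sketch below simulates the fine-tuned model with hypothetical predictions chosen to match the reported metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Same hypothetical dataset: 6 legitimate emails (0) and 4 spam emails (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# The fine-tuned model now catches 3 of the 4 spam emails,
# at the cost of flagging one legitimate email
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.80
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # 0.75
print("F1 Score: ", f1_score(y_true, y_pred))         # 0.75
```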

 

Result – Improved Accuracy: 0.80

 

Result

Improved Precision: 0.75

Improved Recall: 0.75

Improved F1 Score: 0.75

 

As you can see, after simulating fine-tuning of our model to address the low F1 score, we get the same accuracy but a higher F1 score. Here is why, despite the lower precision score, this is a more refined and reliable model:

  • A recall of 0.5 in the previous iteration meant that half of the actual spam emails went unmarked, undermining a vital part of our spam detector's job
  • The F1 score improves because false positives and false negatives are now better balanced. This point bears repeating: it is central to the evaluation, both for our specific example and for many other classification tasks
    • False Positives: A few legitimate emails will be marked as spam, but this trade-off is acceptable given the vast improvement in spam coverage
    • False Negatives: A classification task needs to be reliable, and reliability here comes from the reduction in missed spam emails. It shows the model genuinely addresses false negatives rather than exploiting the bias (imbalance) in the data.

 

Navigate through the top 5 LLM leaderboards and their impact

 

In the real world, a spam detector that prioritizes only precision would be inadequate at protecting users from actual spam. Conversely, a model with high recall but low precision would flag so many legitimate emails that important messages might never reach the user.

That is why it is fundamental to understand the F1 score and its ability to balance precision and recall – something the accuracy score alone does not reflect.

 

Explore a hands-on curriculum that helps you build custom LLM applications!

 

When building or evaluating your next LLM, remember that accuracy is only part of the picture. The F1 score offers a more complete and insightful metric, particularly for critical and imbalanced tasks like spam detection.

Ready to dive deeper into LLM evaluation metrics? Explore our LLM bootcamp and master the art of creating reliable Gen AI models!


Long short-term memory (LSTM) models are powerful tools primarily used for processing sequential data, such as time series, weather readings, or stock prices. When it comes to LSTM models, a common question is: How Do I Make an LSTM Model with Multiple Inputs?

Before we dig deeper into the technical details, let's explore the multiple-input functionality of an LSTM model through some easy-to-understand examples.

Typically, an LSTM model handles sequential data in the shape of a 3D tensor (samples, time steps, features), where the features are the variables observed at each time step. Since an LSTM model is tasked with making predictions from this sequential data, it is certainly useful for the model to handle multiple input features.

 


 

Think about a meteorologist who wants to forecast the weather. In a simple setting, the input would perhaps be just the temperature. While temperature alone would do a pretty good job of predicting the forecast, adding other features such as humidity or wind speed would do a far better job.

Imagine trying to predict tomorrow’s stock prices. You wouldn’t rely on just yesterday’s closing price; you’d consider trends, volatility, and other influencing factors from the past. That’s exactly what long short-term memory (LSTM) models are designed to do – learn from patterns within sequential data to make predictions about what values follow subsequently.

While these examples explain how multiple inputs enhance the performance of an LSTM model, let’s dig deeper into the technical process of the question: How Do I Make an LSTM Model with Multiple Inputs?

What is a Long Short-Term Memory (LSTM)?

An LSTM is a specialized type of recurrent neural network (RNN) that can “remember” important information from past time steps while ignoring irrelevant information.

It achieves this through a system of gates as shown in the diagram:

 

LSTM model architecture

 

  • The input gate decides what new information to store
  • The forget gate determines what to discard
  • The output gate controls what to send forward

This architecture allows LSTMs to observe relationships between variables in the long term, making them ideal for time-series analysis, natural language processing (NLP), and more.

What makes LSTMs even more impressive is their ability to process multiple inputs. Instead of just relying on one feature, like the closing price of a stock, you can enrich your model with additional inputs like the opening price, trading volume, or even indicators like market sentiment.

Each feature becomes part of a time-step sequence that is fed into the LSTM, allowing it to analyze the combined impact of these multiple factors.

How do I Make an LSTM Model with Multiple Inputs?

To demonstrate one of the approaches to building an LSTM model with multiple inputs, we can use the S&P 500 Dataset found on Kaggle and focus on the IBM stock data.
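The exact loading code is not shown in this post, but a minimal sketch might look like the following (the file name all_stocks_5yr.csv and the column names are assumptions based on the Kaggle dataset; adjust them to your copy):

```python
import pandas as pd

# Load the Kaggle S&P 500 dataset and keep only the IBM rows
df = pd.read_csv("all_stocks_5yr.csv")
ibm = df[df["Name"] == "IBM"][["date", "open", "close"]].copy()

# Sort chronologically so the sequences we build later are in time order
ibm["date"] = pd.to_datetime(ibm["date"])
ibm = ibm.sort_values("date").reset_index(drop=True)

print(ibm.head())
```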

 

IBM stock data - How do I make an LSTM model with multiple inputs

 

Below is a visualization of the stock’s closing price over time.

 

visual IBM stock data - How do I make an LSTM model with multiple inputs

 

The closing price will be the prediction target, so understanding the plot helps us contextualize the challenge of predicting the trend. Which additional inputs to add to an LSTM model is a rather case-specific decision.

For example, in our case, adding the opening price as an additional feature helps the model capture intraday price swings, reveal market volatility, and, most importantly, increases data granularity.

Splitting the Data

Now we can split the data into a training set (the majority of the data) and a testing set for evaluation.
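A minimal sketch of a chronological 80/20 split (the exact ratio is an assumption, chosen to be consistent with the training-set size reported below):

```python
# Use the first 80% of the rows for training and the remaining 20% for testing;
# a chronological split avoids training on data from the "future"
split_idx = int(len(ibm) * 0.8)
train_df = ibm.iloc[:split_idx]
test_df = ibm.iloc[split_idx:]

print(len(train_df), "training rows,", len(test_df), "testing rows")
```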

 

 

Feature Scaling

To further prepare the data for the LSTM model, we will normalize open and close prices to a range of 0 to 1 to handle varying magnitudes of the two inputs.
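A sketch of the scaling step using scikit-learn's MinMaxScaler (fitting on the training split only, to avoid leaking information from the test set):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale the open and close prices to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_df[["open", "close"]])
test_scaled = scaler.transform(test_df[["open", "close"]])
```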

 

 

Preparing Sequential Data

A key part of training an LSTM is preparing sequential data. The function shown below generates sequences of 60 time steps (the offset) to train the model. Here:

  • x (Inputs): Sequences of the past 60 days’ features (open and close prices).
  • y (Target): The closing price of the 61st day.

For example, X_train has a shape of (947, 60, 2):

  • 947: Number of samples.
  • 60: Time steps (days).
  • 2: Features (open and close prices).

LSTMs require input in the form [samples, time steps, features]. For each input sequence, the model predicts one target value—the closing price for the 61st day. This structure enables the LSTM to capture time-dependent patterns in stock price movements.
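A sketch of the sequence-building function described above (it assumes the close price is the second column of the scaled arrays, matching the [open, close] order used during scaling):

```python
import numpy as np

def create_sequences(data: np.ndarray, offset: int = 60):
    """Build windows of `offset` past time steps and use the next
    day's closing price (column index 1) as the prediction target."""
    x, y = [], []
    for i in range(offset, len(data)):
        x.append(data[i - offset:i])  # the past 60 days of [open, close]
        y.append(data[i, 1])          # the 61st day's closing price
    return np.array(x), np.array(y)

x_train, y_train = create_sequences(train_scaled, offset=60)
x_test, y_test = create_sequences(test_scaled, offset=60)

print(x_train.shape, y_train.shape)  # e.g. (947, 60, 2) (947,)
```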

 

 

The output is presented as follows:

 

preparing sequential data - output

 

Learning Attention Weights

The attention mechanism further improves the LSTM by assisting it in focusing on the most critical parts of the sequence. It achieves this by learning attention weights (importance of features at each time step) and biases (fine-tuning scores).

These weights are calculated using a softmax function, highlighting the most relevant information and summarizing it into a “context vector.” This vector enables the LSTM to make more accurate predictions by concentrating on the most significant details within the sequence.
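A sketch of a simple additive attention layer in Keras that matches this description (one weight per LSTM unit plus one bias per time step, normalized with a softmax and summed into a context vector); the original layer may differ in its details:

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer

class Attention(Layer):
    """Scores each time step of the LSTM output and returns a weighted sum."""

    def build(self, input_shape):
        # input_shape = (batch, time_steps, lstm_units)
        self.W = self.add_weight(name="att_weight", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform", trainable=True)
        self.b = self.add_weight(name="att_bias", shape=(input_shape[1], 1),
                                 initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, x):
        e = tf.tanh(tf.matmul(x, self.W) + self.b)  # raw attention scores
        a = tf.nn.softmax(e, axis=1)                # weights across time steps
        return tf.reduce_sum(x * a, axis=1)         # context vector (batch, units)
```

With 64 LSTM units and 60 time steps, a layer like this has 64 + 60 = 124 trainable parameters, consistent with the model summary described later.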

 

 

Integrating the Attention Layer into the LSTM Model

Now that we have our attention layer, the next step is to integrate it into the LSTM model. The function build_attention_lstm combines all the components to create the final architecture.

  1. Input Layer: The model starts with an input layer that takes data shaped as [time steps, features]. In our case, that’s [60, 2]—60 time steps and 2 features (open and close prices).
  2. LSTM Layer: Next is the LSTM layer with 64 units. This layer processes the sequential data and outputs a representation for every time step. We set return_sequences=True so that the attention layer can work with the entire sequence of outputs, not just the final one.
  3. Attention Layer: The attention layer takes the LSTM’s outputs and focuses on the most relevant time steps. It compresses the sequence into a single vector of size 64, which represents the most significant information from the input sequence.
  4. Dense Layer: The dense layer is the final step, producing a single prediction (the stock’s closing price) based on the attention layer’s output.
  5. Compilation: The model is compiled using the Adam optimizer and mean_squared_error loss, making it appropriate for regression tasks like predicting stock prices.

 

The model summary shows the architecture:

  • The LSTM processes sequential data (17,152 parameters to learn).
  • The attention layer dynamically focuses on key time steps (124 parameters).
  • The dense layer maps the attention’s output to a final prediction (65 parameters).

By integrating attention into the LSTM, this model improves its ability to predict trends by emphasizing the most important parts of the data sequence.

Building and Summarizing the Model
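A sketch of build_attention_lstm that wires these pieces together (it assumes the Attention layer sketched earlier; the parameter counts it produces are consistent with the summary described above):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

def build_attention_lstm(time_steps: int = 60, n_features: int = 2):
    inputs = Input(shape=(time_steps, n_features))
    # return_sequences=True so the attention layer sees every time step
    lstm_out = LSTM(64, return_sequences=True)(inputs)  # 17,152 parameters
    context = Attention()(lstm_out)                     # 124 parameters
    outputs = Dense(1)(context)                         # 65 parameters
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

model = build_attention_lstm()
model.summary()
```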

 

 

The output is:

 

model summary - output

 

Training the Model

 

 

Now that the LSTM model is built, we train it using x_train and y_train. The key training parameters include:

  • Epochs: the number of times the model iterates over the full training data (can be adjusted to address overfitting or underfitting)
  • Batch size: the model processes 32 samples at a time before updating the weights (a smaller batch size takes longer but requires less memory)
  • Validation data: the model evaluates its performance against the testing set after each epoch
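A sketch of the training call (the number of epochs is an assumption; the text only specifies the batch size and the use of the test set for validation):

```python
history = model.fit(
    x_train, y_train,
    epochs=20,                         # assumed value; adjust for over/underfitting
    batch_size=32,                     # 32 samples per weight update
    validation_data=(x_test, y_test),  # evaluate on the testing set after each epoch
)
```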

 

loss during training

 

The result of this training process is two metrics:

  • Training loss: how well the model fits the training data; a decreasing training loss shows the model is learning patterns in the training data
  • Validation loss: how well the model generalizes to unseen data; if it starts increasing while the training loss keeps decreasing, it can be a sign of overfitting

Evaluating the Model
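A sketch of the evaluation step and the predictions used for the comparison plot further below:

```python
# Mean squared error on the held-out test sequences (computed on the scaled values)
test_loss = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.6f}")

# Predictions for the actual-vs-predicted closing price comparison
predictions = model.predict(x_test)
```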

 

 

The output:

test loss output

 

As you can see, the test loss is close to 0, indicating that the model performs well and generalizes to the unseen test data.

Finally, we have a visual comparison of the predicted and actual closing prices on the testing set. The predicted values closely follow the actual values, meaning the model captures the patterns in the data effectively. There are spikes in the actual values that are generally hard to predict, which is typical for time-series models.

 

visual representation of the lstm model

 

Now that you’ve seen how to build and train an LSTM model with multiple inputs, why not experiment further? Try using a different dataset, additional features, or tweaking model parameters to improve performance.

If you’re eager to dive into the world of LLMs and their applications, consider joining the Data Science Dojo’s LLM Bootcamp.


 

Written by Abdul Baqi