Have you heard about Microsoft’s latest tech marvel in the AI world? It’s called Phi-2, a nifty little language model that’s stirring up quite the excitement.

Despite its compact size of 2.7 billion parameters, this little dynamo is an upgrade from its predecessor, Phi-1.5. What’s cool is that it’s all set and ready for you to explore in the Azure AI Studio model catalogue.

 


 

Now, Phi-2 isn’t just any small language model. Microsoft CEO Satya Nadella announced it at Ignite 2023, and the team says it’s a real powerhouse, matching or outperforming much larger models such as Llama-2 and Google’s Gemini Nano 2 on several generative AI benchmarks.

This model isn’t just about crunching data; it’s about understanding language, making sense of the world, and reasoning logically. Microsoft even claims it can match or outperform models up to 25 times its size on certain tasks.

 

Read in detail about: Google launches Gemini AI

 

But here’s the kicker: training Phi-2 is far cheaper than training giants like GPT-4. It gets its smarts from a mix of high-quality data, including synthetic sets, everyday knowledge, and more. It’s built on a transformer architecture that predicts the next word in a sequence, and training took just 14 days on 96 A100 GPUs. That’s remarkably efficient when you consider that GPT-4 reportedly took around three months and many thousands of GPUs to train.

Comparative analysis of Phi-2

Comparing Phi 2, Llama 2, and other notable language models can provide insights into their unique strengths and applications.

 


 

  1. Phi 2 (Microsoft):

    • Size & Architecture: Transformer-based architecture optimized for next-word prediction, with a 2048-token context window.
    • Training Data: Trained on 1.4 trillion tokens, including synthetic “textbook-quality” datasets (generated with GPT-3.5/GPT-4) and filtered web data (e.g., Falcon RefinedWeb, SlimPajama). Focuses on STEM, common sense, and theory of mind.
    • Performance: Outperforms larger models like Mistral-7B and Llama-2-13B/70B in coding, math, and reasoning benchmarks.
    • Efficiency: Designed for fast inference on consumer-grade GPUs.

    Applications

    • Research in language model efficiency and reasoning.
    • Natural language understanding (NLU) tasks.
    • Resource-constrained environments (e.g., edge devices).

    Limitations

    • No RLHF-style alignment, so raw outputs may need post-processing for safety.
    • Primarily trained on English, with limited multilingual coverage.
    • Weaker on highly specialized domains such as legal or medical text.

  2. Llama 2 (Meta AI)

    • Size & Variants: Available in 7B, 13B, and 70B parameter versions. Code Llama variants specialize in programming languages.
    • Training Data: Trained on 2 trillion tokens, including text from Common Crawl, Wikipedia, GitHub, and public forums.
    • Performance:
      • Code Llama supports Python, C++, Java, and more.
      • Llama-2-Chat (fine-tuned for dialogue) competes with GPT-3.5 but lags behind GPT-4.
    • Licensing: Free for research and commercial use (for organizations with fewer than 700 million monthly active users), but not fully open-source by OSI standards.

    Applications

    • Code generation and completion (via Code Llama).
    • Chatbots and virtual assistants (Llama-2-Chat).
    • Text summarization and question-answering.

     

  3. Other Notable Language Models

    1. GPT-4 (OpenAI)

    • Overview: A massive multimodal model (exact size undisclosed) optimized for text, image, and code tasks.
    • Key Features:
      • Superior reasoning and contextual understanding.
      • Powers ChatGPT Plus and enterprise applications.
    • Applications: Advanced chatbots, content creation, and complex problem-solving.

    2. BERT (Google AI)

    • Overview: Bidirectional transformer model focused on sentence-level comprehension.
    • Key Features:
      • Trained on Wikipedia and BooksCorpus.
      • Revolutionized search engines (e.g., Google Search).
    • Applications: Sentiment analysis, search query interpretation.

    3. Bloom (BigScience)

    • Overview: Open-source multilingual model with 176B parameters.
    • Key Features:
      • Trained in 46 languages, including underrepresented ones.
      • Transparent development process.
    • Applications: Translation, multilingual text classification.

      Also learn about Claude 3.5 Sonnet

    4. WuDao 2.0 (Beijing Academy of Artificial Intelligence)

    • Overview: A 1.75 trillion parameter model trained on Chinese and English data.
    • Key Features:
      • Handles text, image, and video generation.
      • Optimized for Chinese NLP tasks.
    • Applications: AIGC (AI-generated content), bilingual research.

Phi-2 Features and Capabilities

Phi-2 stands out for several key features and capabilities including:

 

Phi-2: Microsoft’s Efficient 2.7B-Parameter AI Model

 

Key Features

1. Compact yet Powerful Architecture

  • Transformer-Based Design: Built as a decoder-only transformer optimized for next-word prediction, enabling efficient text generation and reasoning.
  • Small Parameter Count: At 2.7 billion parameters, it is lightweight compared to models like Llama 2-70B (70B) or GPT-4 (whose size is undisclosed but reportedly far larger), making it practical to run on a single data-center GPU such as an NVIDIA A100, or even on consumer cards like the RTX 3090 (see the minimal loading sketch after this list).
  • Extended Context Window: Supports sequences of up to 2,048 tokens, allowing it to handle multi-step reasoning tasks and maintain coherence in longer outputs.
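For readers who want to try it, here is a minimal loading sketch, assuming the model is pulled from the Hugging Face Hub under the id microsoft/phi-2 using the transformers library; treat it as an illustrative example rather than official Microsoft code.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the 2.7B-parameter Phi-2 checkpoint (older transformers versions
    # may also require trust_remote_code=True).
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

    # Ask a small common-sense question and generate a short completion.
    prompt = "If it's raining, should I carry an umbrella? Answer briefly."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))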

2. High-Quality, Curated Training Data

  • Synthetic “Textbook-Quality” Data:
    • Generated using GPT-3.5/GPT-4 to create structured lessons in STEM, common-sense reasoning, and theory of mind.
    • Focuses on logical progression (e.g., teaching physics concepts from basics to advanced principles).
  • Filtered Web Data:
    • Combines cleaned datasets like Falcon RefinedWeb and SlimPajama, rigorously scrubbed of low-quality, toxic, or biased content.
  • Curriculum Learning Strategy:
    • Trains the model on simpler concepts first, then gradually introduces complexity, mimicking human educational methods.

3. Exceptional Performance for Its Size

  • Outperforms Larger Models:
    • Surpasses Mistral-7B and Llama-2-13B/70B in reasoning, coding, and math tasks.
    • Example benchmarks:
      • Common-Sense Reasoning: Scores 68.7% on WinoGrande (vs. 70.1% for Llama-2-70B).
      • Coding: Achieves 61.2% on HumanEval (Python), outperforming most 7B–13B models.
      • Math: 64.6% on GSM8K (grade-school math problems).
  • Speed and Efficiency:
    • Faster inference than larger models due to its compact size, with minimal accuracy trade-offs.

 


 

4. Focus on Safety and Reliability

  • Inherently Cleaner Outputs:
    • Reduced toxic or harmful content generation due to rigorously filtered training data.
    • No reliance on post-training alignment (e.g., RLHF), making it a “pure” base model for research.
  • Transparency:
    • Microsoft openly shares details about its training data composition, unlike many proprietary models.

5. Specialized Capabilities

  • Common-Sense Reasoning:
    • Excels at tasks requiring real-world logic (e.g., “If it’s raining, should I carry an umbrella?”).
  • Language Understanding:
    • Strong performance in semantic parsing, summarization, and question-answering.
  • STEM Proficiency:
    • Tackles math, physics, and coding problems with accuracy rivaling models 5x its size.

6. Deployment Flexibility

  • Edge Device Compatibility:
    • Runs efficiently on laptops, IoT devices, or cloud environments with limited compute.
  • Cost-Effective:
    • Lower hardware and energy costs compared to massive models, ideal for startups or academic projects.

Applications

  1. Research:
    • Ideal for studying reasoning in small models and data quality vs. quantity trade-offs.
    • Serves as a base model for fine-tuning experiments.
  2. Natural Language Understanding (NLU):
    • Effective for tasks like text classification, sentiment analysis, and entity recognition.
  3. Resource-Constrained Environments:
    • Deployable on edge devices (e.g., laptops, IoT) due to low hardware requirements.
    • Cost-effective for startups or academic projects with limited compute budgets.
  4. Education:
    • Potential for tutoring systems in STEM, leveraging its synthetic textbook-trained knowledge.

 


 

Limitations

  • No Alignment: Unlike ChatGPT or Llama-2-Chat, Phi-2 lacks reinforcement learning from human feedback (RLHF), so outputs may require post-processing for safety.
  • Niche Generalization: Struggles with highly specialized domains (e.g., legal or medical jargon) due to its focus on common sense and STEM.
  • Multilingual Gaps: Primarily trained on English; limited non-English capabilities.

Why These Features Matter

Phi-2 challenges the “bigger is better” paradigm by proving that data quality and architectural precision can offset scale. Its features make it a groundbreaking tool for:

  • Researchers studying efficient AI training methods.
  • Developers needing lightweight models for real-world applications.
  • Educators exploring AI-driven tutoring systems in STEM.

 

Read in detail about: Multimodality revolution

 

In summary, while Phi-2 and Llama 2 are both advanced language models, they serve different purposes. Phi-2 excels in language understanding and reasoning on a small compute budget, making it well suited to research and resource-constrained deployments, while Llama 2 is a general-purpose family whose Code Llama variants target code generation and software development. Other models, like GPT-4 or BERT, have broader or more specialized applications and are often used for content generation and natural language understanding tasks.

December 21, 2023

In this blog, we delve into Large Language Model Evaluation and Tracing with LangSmith, emphasizing their pivotal role in ensuring application reliability and performance.

You’ll learn to set up LangSmith, connect it with LangChain, and master the process of precise tracing and evaluation, equipping you with the tools to optimize your Large Language Model applications and bring them to production. Discover the key to unlock your model’s full potential.

 

LLM evaluation and tracing with LangSmith

 

Whether you’re an experienced developer or just starting your journey, LangSmith’s private beta provides a valuable tool for your toolkit. 

Understanding the significance of evaluation and tracing is key to improving Large Language Model applications, ensuring the reliability, correctness, and performance of your models. This is a critical step in the development process, particularly if you’re working towards bringing your LLM application to production. 

LangSmith and LangChain in LLM application

In working on Large Language Models (LLMs), LangChain and LangSmith stand as key pillars for developers and AI enthusiasts.

LangChain simplifies the integration of powerful LLMs into applications, streamlining data access, and offering flexibility through concepts like “Chains” and “Agents.” It bridges the gap between these models and external data sources, enabling the creation of robust natural language processing applications.

LangSmith, developed by LangChain, takes LLM application development to the next level. It aids in debugging, monitoring, and evaluating LLM-based applications, with features like logging runs, visualizing components, and facilitating collaboration. It ensures the reliability and efficiency of your LLM applications.

These two tools together form a dynamic duo, unleashing the true potential of large language models in application development. In the upcoming sections, we’ll delve deeper into the mechanics, showcasing how they can elevate your LLM projects to new heights.

 


Quick start to LangSmith

Prerequisites

Please note that LangSmith is currently in a private beta phase, so we’ll show you how to join the waitlist. Once LangSmith releases new invites, you’ll be at the forefront of this innovative platform. 

Sign up for an account here.

 

welcome to LangSmith

 

Configuring LangSmith with LangChain 

Configuring LangSmith alongside LangChain is a straightforward procedure: a few simple steps set it up so you can start using it for tracing and evaluation.

 

Read more about LangChain in detail

 

To initiate your journey, follow the sequential steps provided below: 

  • Begin by creating a LangSmith account, as outlined in the prerequisites 
  • In your working folder, create a .env file containing the essential environment variables. Initial placeholders are provided; they will be replaced in the subsequent steps:
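A minimal .env sketch is shown below; the variable names follow the standard LangSmith/LangChain convention, and the angle-bracket values are the placeholders you will replace in the next steps.

    LANGCHAIN_TRACING_V2=true
    LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
    LANGCHAIN_API_KEY=<your-langchain-api-key>
    LANGCHAIN_PROJECT=<your-project-name>
    OPENAI_API_KEY=<your-openai-api-key>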

 

 

  • Substitute the placeholder <your-openai-api-key> with your OpenAI API key obtained from OpenAI. 
  • For the LangChain API key, navigate to the settings page on LangSmith, generate the key, and replace the placeholder. 

 

LangSmith-Create API key- 1

 

  • Return to the home page and create a project with a suitable name. Subsequently, copy the project name and update the placeholder. 

 

LangSmith - Project 2

  • Install LangChain, the LangSmith SDK, and any other necessary dependencies with the following command:
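A typical install command looks like this (the exact package list is an assumption based on the OpenAI-backed LangChain setup used in this walkthrough):

    pip install langchain langsmith openai python-dotenv duckduckgo-search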

 

 

 

  • Execute the provided example code to initiate the process: 
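The snippet below is a minimal stand-in for that example, assuming the late-2023 LangChain API: it loads the .env file and makes a single call to ChatOpenAI, which is automatically traced because LANGCHAIN_TRACING_V2 is set.

    from dotenv import load_dotenv
    from langchain.chat_models import ChatOpenAI

    # Load OPENAI_API_KEY and the LANGCHAIN_* variables so tracing is enabled.
    load_dotenv()

    llm = ChatOpenAI(temperature=0)

    # Every call made with tracing enabled is logged to your LangSmith project.
    print(llm.predict("What can LangSmith tracing tell me about this call?"))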

 

 

  • After running the code, return to the LangSmith home page, and access the project you just created. 

Getting started with LangSmith 3

  • Within the “Traces” section, you will find the run that was recently executed. Click on it to access detailed trace information. 

Getting started with LangSmith 4

Congratulations, your initial run is now visible and traceable within LangSmith! 

Scenario # 01: LLM Tracing 

What is a trace? 

A ‘Run’ signifies a solitary instance of a task or operation within your LLM application. This could be anything from a single call to an LLM, chain, or agent. 

 

 

A ‘Trace’ is a collection of runs organized in a hierarchical or interconnected manner. The highest-level run in a trace, known as the ‘Root Run,’ is the one directly triggered by the user or application. The root run is assigned an execution order of 1, reflecting its position when the trace is viewed as a sequence of runs.

 


 

Examples of traces 

We’ve already examined a straightforward LLM call trace, where we observed the input provided to the large language model and the resulting output. In that simple case there was only a single run, with no hierarchy of nested runs.

Now, let’s delve further by tracing the LangChain chain and agent to uncover deeper insights into their operations. 

Trace a sequential chain

In this instance, we explore the tracing of a sequential chain within LangChain, a foundational chain of this platform. Sequential chains enable the connection of multiple chains, creating complex pipelines for specific scenarios. Detailed information on this can be found here. 

Let’s run this example of a sequential chain and see what we get in the trace. 
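The exact code isn’t reproduced here, so the following is a plausible reconstruction based on the trace described below (title, then synopsis, then review), assuming LangChain’s SimpleSequentialChain and ChatOpenAI.

    from dotenv import load_dotenv
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import LLMChain, SimpleSequentialChain
    from langchain.prompts import PromptTemplate

    load_dotenv()
    llm = ChatOpenAI(temperature=0.7)

    # First chain: turn a play title into a short synopsis.
    synopsis_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate.from_template(
            "Write a short synopsis for a play titled: {title}"
        ),
    )

    # Second chain: turn that synopsis into a brief review.
    review_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate.from_template(
            "Write a brief review of a play with this synopsis:\n{synopsis}"
        ),
    )

    # Chain them so the first output feeds the second input.
    overall_chain = SimpleSequentialChain(
        chains=[synopsis_chain, review_chain], verbose=True
    )
    overall_chain.run("Tragedy at Sunset on the Beach")  # hypothetical title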

 

Upon executing the code for this sequential chain and returning to our project, a new trace, ‘SimpleSequentialChain,’ becomes visible. 

 

LangSmith - ChatOpenAI 5

Upon examination, this trace reveals a collection of LLM calls, featuring two distinct LLM call runs within its hierarchy. 

 

LangSmith - Sequential Chain 6

 

The execution order now becomes apparent: in our example, the first run takes the title and constructs a synopsis, as displayed in the provided screenshot.

LangSmith - ChatOpenAI 7

 

Subsequently, the second run utilizes the synopsis and the output from the first run to generate a review. 

LangSmith - ChatOpenAI 8

 

This meticulous tracing mechanism grants us the ability to inspect intermediate results, the messages transmitted to the LLM, and the outputs at each step, all while offering insights into token counts and latency measures. Furthermore, the option to filter traces based on various parameters adds an additional layer of customization and control.


 

Trace an agent 

In this segment, we embark on a journey to trace an agent’s inner workings using LangSmith. For those keen to delve deeper into the world of agents, you’ll find comprehensive documentation in LangChain.

To provide a brief overview, we’ve engineered a ZeroShotAgent, equipping it with tools such as DuckDuckGo search and a paraphrasing capability. The agent handles user queries by employing these tools in a ReAct (Reason + Act) manner to generate a response.

Here is the code for the agent: 
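The original listing isn’t included here; the sketch below matches the description above (a zero-shot ReAct agent with a DuckDuckGo search tool and a hypothetical LLM-backed paraphrasing tool), using the LangChain agent API of that period.

    from dotenv import load_dotenv
    from langchain.agents import initialize_agent, AgentType, Tool
    from langchain.chains import LLMChain
    from langchain.chat_models import ChatOpenAI
    from langchain.prompts import PromptTemplate
    from langchain.tools import DuckDuckGoSearchResults

    load_dotenv()
    llm = ChatOpenAI(temperature=0)

    # Hypothetical paraphrasing tool backed by a small LLM chain.
    paraphrase_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate.from_template("Paraphrase the following text:\n{text}"),
    )

    tools = [
        DuckDuckGoSearchResults(),  # web search, returns results as structured text
        Tool(
            name="Paraphraser",
            func=paraphrase_chain.run,
            description="Rewrites a piece of text in different words.",
        ),
    ]

    # Zero-shot ReAct agent: reasons about which tool to use, acts, then answers.
    agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
    agent.run("Suggest three blog topics about LLM evaluation, based on a quick web search.")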

 

 

By tracing the agent’s actions, we gain insights into the sequence and tools utilized by the agent, as well as the intermediate outputs it produces. This tracing capability proves invaluable for agent design and debugging, allowing us to identify and resolve errors efficiently.

 

LangSmith - Agent executor 9

 

The trace reveals that the agent starts with an LLM call, then calls the DuckDuckGo Results JSON search tool, engages the paraphraser, and finally executes two additional LLM calls to generate the responses, which in our case are the suggested blog topics.

These traces underscore the critical role tracing plays in debugging and designing effective LLM applications. It’s important to note that all this information is meticulously logged in LangSmith, offering a treasure trove of insights for various applications, which we’ll briefly explore in subsequent sections.

Sharing your trace 

LangSmith simplifies the process of sharing the logged runs. This feature facilitates easy publishing and replication of your work. For example, if you encounter a bug or unexpected output under specific conditions, you can share it with your team or create an issue on LangChain for collaborative troubleshooting.

By simply clicking the share option located at the top right corner of the page, you can effortlessly distribute your run for analysis and resolution.

 

LangSmith - Agent executor 10

 

LangSmith Run shared 11

Scenario # 02: Testing and evaluation 

Why is testing and evaluation essential for LLMs? 

The development of high-quality, production-grade Large Language Model (LLM) applications is a complex task fraught with challenges, including: 

  • Non-deterministic Outputs: LLM models operate probabilistically, often yielding varying outputs for the same input prompt. This unpredictability persists even when utilizing a temperature setting of 0, as model weights are not static over time. 
  • API Opacity: Models underpinning APIs undergo changes and updates, making it imperative to assess their evolving behavior. 
  • Security Concerns: LLMs are susceptible to prompt injections, posing potential security risks. 
  • Latency Requirements: Many applications demand swift response times. 

These challenges underscore the critical need for rigorous testing and evaluation in the development of LLM applications. 

Step-by-step LLM evaluation process 

1. Define an LLM chain 

Begin by defining an LLM and creating a simple LLM chain aimed at generating concise responses to specific queries. This LLM will serve as the subject of evaluation and testing. 
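The definition isn’t shown above, so here is a minimal sketch, assuming the same ChatOpenAI setup as earlier: a single-prompt LLMChain that answers questions concisely.

    from dotenv import load_dotenv
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import LLMChain
    from langchain.prompts import PromptTemplate

    load_dotenv()

    # The chain under test: answer a question in one short sentence.
    llm = ChatOpenAI(temperature=0)
    chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate.from_template(
            "Answer the following question concisely:\n{question}"
        ),
    )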

 

 

2. Create a dataset 

Generate a compact dataset comprising question-and-answer pairs related to computer science abbreviations and terms. This dataset, containing both questions and their corresponding answers, will be used to evaluate and test the model.
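A sketch of the dataset creation with the LangSmith SDK follows; the dataset name and the question-answer pairs are illustrative stand-ins for the six pairs used in the original walkthrough.

    from langsmith import Client

    client = Client()  # reads LANGCHAIN_API_KEY from the environment

    dataset = client.create_dataset(
        dataset_name="cs-abbreviations",  # hypothetical name
        description="Question-answer pairs about computer science abbreviations.",
    )

    examples = [
        ("What does CPU stand for?", "Central Processing Unit"),
        ("What does RAM stand for?", "Random Access Memory"),
        ("What does HTTP stand for?", "HyperText Transfer Protocol"),
        # ...add the remaining pairs in the same way.
    ]

    for question, answer in examples:
        client.create_example(
            inputs={"question": question},
            outputs={"answer": answer},
            dataset_id=dataset.id,
        )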

 

After executing the code, navigate to LangSmith. Within the “Datasets & Testing” section, you’ll find the dataset you’ve created. By expanding it under “examples,” you’ll encounter the six specific examples you’ve defined for evaluation. 

LangSmith - Datasets and testing 13

3. Evaluation 

For our evaluations, we’ll make use of the LangChain evaluator, specifically focusing on the ‘Correctness: QA evaluation.’ QA evaluators play a vital role in assessing the accuracy of responses to user queries, especially when you have a dataset with reference labels or context documents. Our approach incorporates all three QA evaluators: 

  • “context_qa”: This evaluator directs the LLM chain to utilize reference “context” (supplied through example outputs) to ascertain correctness. 
  • “qa”: It prompts an LLMChain to directly appraise a response as either “correct” or “incorrect,” based on the reference answer. 
  • “cot_qa”: This evaluator closely resembles “context_qa” but introduces a chain of thought “reasoning” before delivering a final verdict. This approach generally leads to responses that align more closely with human judgments, albeit with a slightly increased token and runtime cost. 

Below is the code to kick-start the evaluation of the dataset. 
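The snippet below is an indicative sketch of launching that evaluation with LangChain’s run_on_dataset helper; the dataset name and chain are the hypothetical ones from the earlier sketches.

    from langchain.smith import RunEvalConfig, run_on_dataset
    from langsmith import Client

    # Use all three QA evaluators described above.
    eval_config = RunEvalConfig(evaluators=["qa", "context_qa", "cot_qa"])

    client = Client()
    run_on_dataset(
        client=client,
        dataset_name="cs-abbreviations",     # hypothetical dataset name
        llm_or_chain_factory=lambda: chain,  # the LLM chain defined in step 1
        evaluation=eval_config,
    )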

 

4. Reviewing evaluation outcomes 

Upon completing the evaluation, LangSmith provides a platform to examine the results. Navigate to the “Datasets & Testing” section, select the dataset used for the evaluation, and access “Test Runs.” You’ll find the designated Test Run Name and feedback from the evaluator.

By clicking on the Test Run Name, you can delve deeper, inspect feedback for individual examples, and view side-by-side comparisons. Clicking on any reference example reveals detailed information. 

 

LangSmith traces 14

 

For instance, the first example received a perfect score of 1 from all three evaluators. The generated and expected outputs are presented side by side, accompanied by feedback and comments from the evaluator.

 

LangSmith - Run 15

 

However, in a different example, one evaluator issued a score of 1, while the other two scored it as 0. Upon closer examination, it becomes apparent that there is a disparity between the generated and expected outputs.

LangSmith Run - 16

LLM chain LangSmith - 17

 

The “cot_qa” evaluator assigned a score of 1, and a closer look at its comments reveals that, although the generated output was correct, a discrepancy in the dataset’s reference answer had influenced the evaluation. The “cot_qa” evaluator spotted this, demonstrating its ability to notice context-related subtleties that other evaluators might miss.

Run - LangSmith 18

 

Varied evaluation choices (Delve deeper) 

The evaluator showcased in the previous example is but one of several available within LangSmith. Each option serves specific purposes and holds its unique value. For a detailed understanding of each evaluator’s specific functions and to explore illustrative examples, we encourage you to explore LangChain Evaluators where in-depth coverage of these available options is provided.

Implement the power of tracing and evaluation with LangSmith 

In summary, our journey through LangSmith has underscored the critical importance of evaluating and tracing Large Language Model applications. These processes are the cornerstone of reliability and high performance, ensuring that your models meet rigorous standards. 

With LangSmith, we’ve explored the power of precise tracing and evaluation, empowering you to optimize your models confidently. As you continue your exploration, remember that your LLM applications hold limitless potential, and LangSmith is your guiding light on this path of discovery.

Thank you for joining us on this transformative journey through the world of LLM Evaluation and Tracing with LangSmith. 

October 7, 2023
