> ## Documentation Index
> Fetch the complete documentation index at: https://wb-21fd5541-docs-1778-mysql-updates.mintlify.site/llms.txt
> Use this file to discover all available pages before exploring further.

# Tutorial: Build an Evaluation pipeline

> Learn how to build an evaluation pipeline with Weave Models and Evaluations

To iterate on an application, we need a way to evaluate if it's improving. To do so, a common practice is to test it against the same set of examples when there is a change. Weave has a first-class way to track evaluations with `Model` & `Evaluation` classes. We have built the APIs to make minimal assumptions to allow for the flexibility to support a wide array of use-cases.

<img src="https://mintcdn.com/wb-21fd5541-docs-1778-mysql-updates/Mwcl85-HQWc8KNlg/images/evals-hero.png?fit=max&auto=format&n=Mwcl85-HQWc8KNlg&q=85&s=7938f5ec062d37e7d82fb1cd3a0c2bad" alt="Evals hero" width="4100" height="2160" data-path="images/evals-hero.png" />

## 1. Build a `Model`

`Model`s store and version information about your system, such as prompts, temperatures, and more.
Weave automatically captures when they are used and updates the version when there are changes.

`Model`s are declared by subclassing `Model` and implementing a `predict` function definition, which takes one example and returns the response.

<Important>
  **Known Issue**: If you are using Google Colab, remove `async` from the following examples.
</Important>

<CodeGroup>
  ```python lines theme={null}
  import json
  import openai
  import weave

  class ExtractFruitsModel(weave.Model):
      model_name: str
      prompt_template: str

      @weave.op()
      async def predict(self, sentence: str) -> dict:
          client = openai.AsyncClient()

          response = await client.chat.completions.create(
              model=self.model_name,
              messages=[
                  {"role": "user", "content": self.prompt_template.format(sentence=sentence)}
              ],
          )
          result = response.choices[0].message.content
          if result is None:
              raise ValueError("No response from model")
          parsed = json.loads(result)
          return parsed
  ```

  ```typescript TypeScript theme={null}
  // Note: weave.Model is not supported in TypeScript yet.
  // Instead, wrap your model-like function with weave.op

  const model = weave.op(async function myModel({datasetRow}) {
    const prompt = `Extract fields ("fruit": <str>, "color": <str>, "flavor") from the following text, as json: ${datasetRow.sentence}`;
    const response = await openaiClient.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [{ role: 'user', content: prompt }],
      response_format: { type: 'json_object' }
    });
    return JSON.parse(response.choices[0].message.content);
  });
  ```
</CodeGroup>

You can instantiate `Model` objects as normal like this:

<CodeGroup>
  ```python lines theme={null}
  import asyncio
  import weave

  weave.init('intro-example')

  model = ExtractFruitsModel(
      model_name='gpt-3.5-turbo-1106',
      prompt_template='Extract fields ("fruit": <str>, "color": <str>, "flavor": <str>) from the following text, as json: {sentence}'
  )
  sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy."
  print(asyncio.run(model.predict(sentence)))
  # if you're in a Jupyter Notebook, run:
  # await model.predict(sentence)
  ```

  ```typescript TypeScript theme={null}
  await weave.init('intro-example');

  const sentence = "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.";
  const result = await model({ datasetRow: { sentence } });
  console.log(result);
  ```
</CodeGroup>

<Note>
  Checkout the [Models](/weave/guides/core-types/models) guide to learn more.
</Note>

## 2. Collect some examples

Next, you need a dataset to evaluate your model on. A `Dataset` is just a collection of examples stored as a Weave object. You'll be able to download, browse and run evaluations on datasets in the Weave UI.

Here we build a list of examples in code, but you can also log them one at a time from your running application.

<CodeGroup>
  ```python lines theme={null}
  sentences = ["There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
  "Pounits are a bright green color and are more savory than sweet.",
  "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."]
  labels = [
      {'fruit': 'neoskizzles', 'color': 'purple', 'flavor': 'candy'},
      {'fruit': 'pounits', 'color': 'bright green', 'flavor': 'savory'},
      {'fruit': 'glowls', 'color': 'pale orange', 'flavor': 'sour and bitter'}
  ]
  examples = [
      {'id': '0', 'sentence': sentences[0], 'target': labels[0]},
      {'id': '1', 'sentence': sentences[1], 'target': labels[1]},
      {'id': '2', 'sentence': sentences[2], 'target': labels[2]}
  ]
  ```

  ```typescript TypeScript theme={null}
  const sentences = [
    "There are many fruits that were found on the recently discovered planet Goocrux. There are neoskizzles that grow there, which are purple and taste like candy.",
    "Pounits are a bright green color and are more savory than sweet.",
    "Finally, there are fruits called glowls, which have a very sour and bitter taste which is acidic and caustic, and a pale orange tinge to them."
  ];
  const labels = [
    { fruit: 'neoskizzles', color: 'purple', flavor: 'candy' },
    { fruit: 'pounits', color: 'bright green', flavor: 'savory' },
    { fruit: 'glowls', color: 'pale orange', flavor: 'sour and bitter' }
  ];
  const examples = sentences.map((sentence, i) => ({
    id: i.toString(),
    sentence,
    target: labels[i]
  }));
  ```
</CodeGroup>

Then publish your dataset:

<CodeGroup>
  ```python lines theme={null}
  import weave
  # highlight-next-line
  weave.init('intro-example')
  dataset = weave.Dataset(name='fruits', rows=examples)
  # highlight-next-line
  weave.publish(dataset)
  ```

  ```typescript TypeScript theme={null}
  import * as weave from 'weave';
  // highlight-next-line
  await weave.init('intro-example');
  const dataset = new weave.Dataset({
    name: 'fruits',
    rows: examples
  });
  // highlight-next-line
  await dataset.save();
  ```
</CodeGroup>

<Note>
  Checkout the [Datasets](/weave/guides/core-types/datasets) guide to learn more.
</Note>

## 3. Define scoring functions

`Evaluation`s assess a `Model`s performance on a set of examples using a list of specified scoring functions or `weave.scorer.Scorer` classes.

<CodeGroup>
  ```python lines theme={null}
  # highlight-next-line
  import weave
  from weave.scorers import MultiTaskBinaryClassificationF1

  @weave.op()
  def fruit_name_score(target: dict, output: dict) -> dict:
      return {'correct': target['fruit'] == output['fruit']}
  ```

  ```typescript TypeScript theme={null}
  // highlight-next-line
  import * as weave from 'weave';

  const fruitNameScorer = weave.op(
    function fruitNameScore({target, output}) {
      return { correct: target.fruit === output.fruit };
    }
  );
  ```
</CodeGroup>

<Note>
  To make your own scoring function, learn more in the [Scorers](/weave/guides/evaluation/scorers) guide.

  In some applications we want to create custom `Scorer` classes - where for example a standardized `LLMJudge` class should be created with specific parameters (e.g. chat model, prompt), specific scoring of each row, and specific calculation of an aggregate score. See the tutorial on defining a `Scorer` class in the next chapter on [Model-Based Evaluation of RAG applications](/weave/tutorial-rag#optional-defining-a-scorer-class) for more information.
</Note>

## 4. Run the evaluation

Now, you're ready to run an evaluation of `ExtractFruitsModel` on the `fruits` dataset using your scoring function.

<CodeGroup>
  ```python lines theme={null}
  import asyncio
  import weave
  from weave.scorers import MultiTaskBinaryClassificationF1

  weave.init('intro-example')

  evaluation = weave.Evaluation(
      name='fruit_eval',
      dataset=dataset, 
      scorers=[
          MultiTaskBinaryClassificationF1(class_names=["fruit", "color", "flavor"]), 
          fruit_name_score
      ],
  )
  print(asyncio.run(evaluation.evaluate(model)))
  # if you're in a Jupyter Notebook, run:
  # await evaluation.evaluate(model)
  ```

  ```typescript TypeScript theme={null}
  import * as weave from 'weave';

  await weave.init('intro-example');

  const evaluation = new weave.Evaluation({
    name: 'fruit_eval',
    dataset: dataset,
    scorers: [fruitNameScorer],
  });
  const results = await evaluation.evaluate(model);
  console.log(results);
  ```
</CodeGroup>

<Note>
  If you're running from a python script, you'll need to use `asyncio.run`. However, if you're running from a Jupyter notebook, you can use `await` directly.
</Note>

## 5. View your evaluation results

Weave will automatically capture traces of each prediction and score.

Click on the link printed by the evaluation to view the results in the Weave UI.

<img src="https://mintcdn.com/wb-21fd5541-docs-1778-mysql-updates/Mwcl85-HQWc8KNlg/images/evals-hero.png?fit=max&auto=format&n=Mwcl85-HQWc8KNlg&q=85&s=7938f5ec062d37e7d82fb1cd3a0c2bad" alt="Evaluation results" width="4100" height="2160" data-path="images/evals-hero.png" />

## What's next?

Learn how to:

1. **Compare model performance**: Try different models and compare their results
2. **Explore Built-in Scorers**: Check out Weave's built-in scoring functions in our [Scorers guide](/weave/guides/evaluation/scorers)
3. **Build a RAG app**: Follow our [RAG tutorial](/weave/tutorial-rag) to learn about evaluating retrieval-augmented generation
4. **Advanced evaluation patterns**: Learn about [Model-Based Evaluation](/weave/guides/evaluation/scorers#model-based-evaluation) for using LLMs as judges
