autoevals
AutoEvals is a tool to quickly and easily evaluate AI model outputs.
Quickstart
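AutoEvals is published on npm; a typical setup is `npm install autoevals` plus an `OPENAI_API_KEY` in your environment (or the `OpenAIAuth` options shown below) for the model-graded scorers.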
Example
Use AutoEvals to model-grade an example LLM completion using the factuality prompt.
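A minimal sketch of that flow using the `Factuality` scorer documented below; the question, answer, and expected value are illustrative:

```ts
import { Factuality } from "autoevals";

const result = await Factuality({
  input: "Which country has the highest population?",
  output: "People's Republic of India",
  expected: "People's Republic of China",
});

console.log(result.score);    // a number between 0 and 1
console.log(result.metadata); // grader rationale, when the model provides one
```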
Functions
AnswerCorrectness
▸ AnswerCorrectness(args): Score | Promise<Score>
Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { context? : string | string [] ; input? : string ; model? : string } & { maxTokens? : number ; temperature? : number } & OpenAIAuth & { answerSimilarity? : Scorer <string , {}> ; answerSimilarityWeight? : number ; factualityWeight? : number }> |
Returns
Score | Promise<Score>
Defined in
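A sketch of calling it; the sample strings are made up, and the optional weights correspond to the `factualityWeight`/`answerSimilarityWeight` options in the args table above:

```ts
import { AnswerCorrectness } from "autoevals";

const result = await AnswerCorrectness({
  input: "When was the first Super Bowl played?",
  output: "The first Super Bowl was played in 1967.",
  expected: "The first Super Bowl was held on January 15, 1967.",
  factualityWeight: 0.75,       // optional: weight of the factuality component
  answerSimilarityWeight: 0.25, // optional: weight of the similarity component
});

console.log(result.score);
```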
AnswerRelevancy
▸ AnswerRelevancy(args): Score | Promise<Score>
Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { context? : string | string [] ; input? : string ; model? : string } & { maxTokens? : number ; temperature? : number } & OpenAIAuth & { strictness? : number }> |
Returns
Score | Promise<Score>
Defined in
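A minimal sketch; per the args table, `context` can be a single string or a list of strings, and the question/answer pair here is illustrative:

```ts
import { AnswerRelevancy } from "autoevals";

const result = await AnswerRelevancy({
  input: "What is the boiling point of water at sea level?",
  output: "Water boils at 100 degrees Celsius at sea level.",
  context: "At standard atmospheric pressure, water boils at 100 °C (212 °F).",
});

console.log(result.score);
```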
AnswerSimilarity
▸ AnswerSimilarity(args): Score | Promise<Score>
Scores the semantic similarity between the generated answer and ground truth.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , RagasArgs > |
Returns
Score | Promise<Score>
Defined in
Battle
▸ Battle(args): Score | Promise<Score>
Test whether an output performs the instructions better than the original (expected) value.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ instructions : string }>> |
Returns
Score | Promise<Score>
Defined in
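A sketch of a head-to-head comparison; `instructions` comes from the `LLMClassifierArgs<{ instructions: string }>` type above, and the two responses are illustrative:

```ts
import { Battle } from "autoevals";

const result = await Battle({
  instructions: "Reply to the customer in one friendly sentence.",
  output: "Thanks for reaching out! I'd be happy to help you reset your password.",
  expected: "Your request has been received.",
});

console.log(result.score); // typically 1 if the output beats the expected response, 0 otherwise
```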
ClosedQA
▸ ClosedQA(args): Score | Promise<Score>
Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ criteria : any ; input : string }>> |
Returns
Score | Promise<Score>
Defined in
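A sketch with an illustrative question and grading criterion; the `criteria` field matches the args type above:

```ts
import { ClosedQA } from "autoevals";

const result = await ClosedQA({
  input: "What is 2 + 2?",
  output: "2 + 2 equals 4.",
  criteria: "The answer states the correct sum and nothing else.",
});

console.log(result.score);
```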
ContextEntityRecall
▸ ContextEntityRecall(args): Score | Promise<Score>
Estimates context recall by estimating true positives (TP) and false negatives (FN) using the annotated answer and the retrieved context.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { context? : string | string [] ; input? : string ; model? : string } & { maxTokens? : number ; temperature? : number } & OpenAIAuth & { pairwiseScorer? : Scorer <string , {}> }> |
Returns
Score | Promise<Score>
Defined in
ContextPrecision
▸ ContextPrecision(args): Score | Promise<Score>
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , RagasArgs > |
Returns
Score | Promise<Score>
Defined in
ContextRecall
▸ ContextRecall(args): Score | Promise<Score>
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , RagasArgs > |
Returns
Score | Promise<Score>
Defined in
ContextRelevancy
▸ ContextRelevancy(args): Score | Promise<Score>
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , RagasArgs > |
Returns
Score | Promise<Score>
Defined in
EmbeddingSimilarity
▸ EmbeddingSimilarity(args): Score | Promise<Score>
A scorer that uses cosine similarity to compare two strings.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { expectedMin? : number ; model? : string ; prefix? : string } & OpenAIAuth > |
Returns
Score | Promise<Score>
A score between 0 and 1, where 1 is a perfect match.
Defined in
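A sketch of the embedding comparison; the optional `prefix` and `expectedMin` values are arbitrary and can be omitted:

```ts
import { EmbeddingSimilarity } from "autoevals";

const result = await EmbeddingSimilarity({
  output: "The capital of France is Paris.",
  expected: "Paris is France's capital city.",
  prefix: "Answer: ", // optional prefix added to both strings before embedding
  expectedMin: 0.4,   // optional; see the args table above
});

console.log(result.score); // between 0 and 1, where 1 is a perfect match
```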
Factuality
▸ Factuality(args): Score | Promise<Score>
Test whether an output is factual, compared to an original (expected) value.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ expected? : string ; input : string ; output : string }>> |
Returns
Score | Promise<Score>
Defined in
Faithfulness
▸ Faithfulness(args): Score | Promise<Score>
Measures factual consistency of the generated answer with the given context.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , RagasArgs > |
Returns
Score | Promise<Score>
Defined in
Humor
▸ Humor(args): Score | Promise<Score>
Test whether an output is funny.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{}>> |
Returns
Score | Promise<Score>
Defined in
JSONDiff
▸ JSONDiff(args): Score | Promise<Score>
A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , { numberScorer? : Scorer <number , {}> ; stringScorer? : Scorer <string , {}> }> |
Returns
Score | Promise<Score>
Defined in
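A sketch comparing two structured outputs; the objects are illustrative, and the default string/number scorers (Levenshtein and NumericDiff, per the description above) apply unless overridden:

```ts
import { JSONDiff } from "autoevals";

const result = await JSONDiff({
  output: { name: "Ada Lovelace", age: 36, tags: ["math"] },
  expected: { name: "Ada Lovelace", age: 37, tags: ["math", "computing"] },
});

console.log(result.score); // partial credit for fields that nearly match
```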
LLMClassifierFromSpec
▸ LLMClassifierFromSpec<RenderArgs>(name, spec): Scorer<any, LLMClassifierArgs<RenderArgs>>
Type parameters
Name |
---|
RenderArgs |
Parameters
Name | Type |
---|---|
name | string |
spec | ModelGradedSpec |
Returns
Scorer<any, LLMClassifierArgs<RenderArgs>>
Defined in
LLMClassifierFromSpecFile
▸ LLMClassifierFromSpecFile<RenderArgs>(name, templateName): Scorer<any, LLMClassifierArgs<RenderArgs>>
Type parameters
Name |
---|
RenderArgs |
Parameters
Name | Type |
---|---|
name | string |
templateName | "battle" | "closed_q_a" | "factuality" | "humor" | "possible" | "security" | "sql" | "summary" | "translation" |
Returns
Scorer<any, LLMClassifierArgs<RenderArgs>>
Defined in
LLMClassifierFromTemplate
▸ LLMClassifierFromTemplate<RenderArgs>(«destructured»): Scorer<string, LLMClassifierArgs<RenderArgs>>
Type parameters
Name |
---|
RenderArgs |
Parameters
Name | Type | Default value |
---|---|---|
«destructured» | Object | undefined |
› choiceScores | Record <string , number > | undefined |
› model? | string | "gpt-3.5-turbo" |
› name | string | undefined |
› promptTemplate | string | undefined |
› temperature? | number | undefined |
› useCoT? | boolean | undefined |
Returns
Scorer<string, LLMClassifierArgs<RenderArgs>>
Defined in
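A sketch of building a custom classifier; the "Politeness" name, prompt wording, and Y/N choice scores are all hypothetical, and the template interpolates `{{output}}` into the grading prompt:

```ts
import { LLMClassifierFromTemplate } from "autoevals";

// Hypothetical politeness grader built from a prompt template.
const Politeness = LLMClassifierFromTemplate<{}>({
  name: "Politeness",
  promptTemplate: "Is the following response polite?\n\n{{output}}\n\nAnswer Y or N.",
  choiceScores: { Y: 1, N: 0 },
  useCoT: false,
});

const result = await Politeness({ output: "Sure, happy to help!" });
console.log(result.score);
```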
Levenshtein
▸ Levenshtein(args): Score | Promise<Score>
A simple scorer that uses the Levenshtein distance to compare two strings.
Parameters
Name | Type |
---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
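A sketch of the string-distance scorer with two made-up strings:

```ts
import { Levenshtein } from "autoevals";

const result = await Levenshtein({
  output: "the quick brown fox",
  expected: "the quick brown foxes",
});

console.log(result.score); // 1 for identical strings, lower as edits accumulate
```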
LevenshteinScorer
▸ LevenshteinScorer(args): Score | Promise<Score>
Parameters
Name | Type |
---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
ListContains
▸ ListContains(args): Score | Promise<Score>
A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string [], { allowExtraEntities? : boolean ; pairwiseScorer? : Scorer <string , {}> }> |
Returns
Score | Promise<Score>
Defined in
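A sketch with illustrative lists; the optional flags map to the args table above:

```ts
import { ListContains } from "autoevals";

const result = await ListContains({
  output: ["Paris", "Berlin", "Rome"],
  expected: ["Paris", "Rome", "Madrid"],
  allowExtraEntities: true, // optional; see the args table above
});

console.log(result.score);
```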
Moderation
▸ Moderation(args): Score | Promise<Score>
A scorer that uses OpenAI's moderation API to determine whether an AI response contains any flagged content.
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { threshold? : number } & OpenAIAuth > |
Returns
Score | Promise<Score>
A score between 0 and 1, where 1 means content passed all moderation checks.
Defined in
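A sketch of the moderation check with an innocuous string; `threshold` is optional and shown only to illustrate the knob:

```ts
import { Moderation } from "autoevals";

const result = await Moderation({
  output: "I will send you the report by Friday.",
  threshold: 0.5, // optional; see the args table above
});

console.log(result.score); // 1 means the content passed all moderation checks
```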
NumericDiff
▸ NumericDiff(args): Score | Promise<Score>
A simple scorer that compares numbers by normalizing their difference.
Parameters
Name | Type |
---|---|
args | Object |
Returns
Score | Promise<Score>
Defined in
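A sketch with two made-up numbers:

```ts
import { NumericDiff } from "autoevals";

const result = await NumericDiff({ output: 104, expected: 100 });

console.log(result.score); // approaches 1 as the normalized difference shrinks
```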
OpenAIClassifier
▸ OpenAIClassifier<RenderArgs, Output>(args): Promise<Score>
Type parameters
Name |
---|
RenderArgs |
Output |
Parameters
Name | Type |
---|---|
args | ScorerArgs <Output , OpenAIClassifierArgs <RenderArgs >> |
Returns
Promise<Score>
Defined in
Possible
▸ Possible(args): Score | Promise<Score>
Test whether an output is a possible solution to the challenge posed in the input.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ input : string }>> |
Returns
Score | Promise<Score>
Defined in
Security
▸ Security(args): Score | Promise<Score>
Test whether an output is malicious.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{}>> |
Returns
Score | Promise<Score>
Defined in
Sql
▸ Sql(args): Score | Promise<Score>
Test whether a SQL query is semantically the same as a reference (output) query.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ input : string }>> |
Returns
Score | Promise<Score>
Defined in
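A sketch comparing two illustrative queries written for the same natural-language request:

```ts
import { Sql } from "autoevals";

const result = await Sql({
  input: "List the names of customers in Canada.",
  output: "SELECT name FROM customers WHERE country = 'Canada';",
  expected: "SELECT c.name FROM customers c WHERE c.country = 'Canada';",
});

console.log(result.score);
```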
Summary
▸ Summary(args): Score | Promise<Score>
Test whether an output is a better summary of the input than the original (expected) value.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ input : string }>> |
Returns
Score | Promise<Score>
Defined in
Translation
▸ Translation(args): Score | Promise<Score>
Test whether an output is as good a translation of the input in the specified language as an expert (expected) value.
Parameters
Name | Type |
---|---|
args | ScorerArgs <any , LLMClassifierArgs <{ input : string ; language : string }>> |
Returns
Score | Promise<Score>
Defined in
ValidJSON
▸ ValidJSON(args): Score | Promise<Score>
A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create).
Parameters
Name | Type |
---|---|
args | ScorerArgs <string , { schema? : any }> |
Returns
Score | Promise<Score>
Defined in
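A sketch of the JSON validity check; the schema is illustrative and can be omitted:

```ts
import { ValidJSON } from "autoevals";

const result = await ValidJSON({
  output: '{"name": "Ada", "age": 36}',
  schema: {
    type: "object",
    properties: { name: { type: "string" }, age: { type: "number" } },
    required: ["name"],
  },
});

console.log(result.score); // 1 if the output is valid JSON (and matches the schema), 0 otherwise
```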
buildClassificationFunctions
▸ buildClassificationFunctions(useCoT, choiceStrings): { description: string = "Call this function to select a choice."; name: string = "select_choice"; parameters: { properties: { choice: { description: string = "The choice"; enum: string[] = choiceStrings; title: string = "Choice"; type: string = "string" } } ; required: string[] ; title: string = "FunctionResponse"; type: string = "object" } = enumParams }[]
Parameters
Name | Type |
---|---|
useCoT | boolean |
choiceStrings | string [] |
Returns
{ description: string = "Call this function to select a choice."; name: string = "select_choice"; parameters: { properties: { choice: { description: string = "The choice"; enum: string[] = choiceStrings; title: string = "Choice"; type: string = "string" } } ; required: string[] ; title: string = "FunctionResponse"; type: string = "object" } = enumParams }[]
Defined in
Type Aliases
LLMArgs
Ƭ LLMArgs: { maxTokens?: number ; temperature?: number } & OpenAIAuth
Defined in
LLMClassifierArgs
Ƭ LLMClassifierArgs<RenderArgs>: { model?: string ; useCoT?: boolean } & LLMArgs & RenderArgs
Type parameters
Name |
---|
RenderArgs |
Defined in
OpenAIClassifierArgs
Ƭ OpenAIClassifierArgs<RenderArgs>: { cache?: ChatCache ; choiceScores: Record<string, number> ; classificationFunctions: ChatCompletionCreateParams.Function[] ; messages: ChatCompletionMessageParam[] ; model: string ; name: string } & LLMArgs & RenderArgs
Type parameters
Name |
---|
RenderArgs |
Defined in
Variables
Evaluators
• Const Evaluators: { label: string ; methods: AutoevalMethod[] }[]
Defined in
templates
• Const templates: Object
Type declaration
Name | Type |
---|---|
battle | string |
closed_q_a | string |
factuality | string |
humor | string |
possible | string |
security | string |
sql | string |
summary | string |
translation | string |