# Evaluation
Needlr's agent framework plugs directly into Microsoft.Extensions.AI.Evaluation without adapters or flattening.
## Overview

Microsoft.Extensions.AI.Evaluation (MEAI.Evaluation) evaluates LLM interactions by consuming native MEAI primitives: `ChatMessage`, `ChatResponse`, and `UsageDetails`. Needlr exposes these same shapes on its result surfaces, so evaluators slot in directly — no string flattening, no re-hydration step.
## Live-path result types

The surfaces an agent run returns expose full MEAI types rather than flattened strings, so a run's output can be fed to an evaluator as-is.
| Surface | Type |
|---|---|
| `IterativeLoopResult.FinalResponse` | `ChatResponse?` |
| `IterationRecord.FinalResponse` | `ChatResponse?` |
| `TerminationContext.LastMessage` | `ChatMessage?` |
| `TerminationContext.Usage` | `UsageDetails?` |
| `IAgentStageResult.FinalResponse` | `ChatResponse?` |
| `AgentStageResult` | positional `ChatResponse?` |
| `IPipelineRunResult.FinalResponses` | `IReadOnlyDictionary<string, ChatResponse?>` |
Use `response.Text` when a string projection is needed.
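For example, projecting the final response of a loop result (assuming `loopResult` is an `IterativeLoopResult`):

```csharp
// Plain-text projection; empty string when the loop produced no final response.
string text = loopResult.FinalResponse?.Text ?? string.Empty;
```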
## Wiring an evaluator
```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// loopResult is an IterativeLoopResult returned by IterativeAgentLoop.
ChatResponse? finalResponse = loopResult.FinalResponse;

var chatConfiguration = new ChatConfiguration(judgeChatClient);
var evaluator = new RelevanceEvaluator();
var userPrompt = new ChatMessage(ChatRole.User, "Original user question here.");

EvaluationResult result = await evaluator.EvaluateAsync(
    [userPrompt],
    finalResponse!,
    chatConfiguration);
```
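The returned `EvaluationResult` exposes its metrics by name. A lookup sketch, assuming the `RelevanceMetricName` constant exposed by MEAI's quality evaluators:

```csharp
// Read the relevance score back out (MEAI quality scores are 1-5 numerics).
var relevance = (NumericMetric)result.Metrics[RelevanceEvaluator.RelevanceMetricName];
Console.WriteLine($"Relevance: {relevance.Value}");
```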
## Trajectory adapter

Tool-call trajectories are extracted from `IterationRecord.ToolCalls` via an extension method:
```csharp
using Microsoft.Extensions.AI;
using NexusLabs.Needlr.AgentFramework.Iterative;

IEnumerable<AIContent> trajectory = iterationRecord.ToToolCallTrajectory();
```
The returned sequence alternates `FunctionCallContent` (the call) and `FunctionResultContent` (the result) in call order — the exact shape MEAI.Evaluation tool-call evaluators consume.
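The pairs can be inspected or filtered before they reach an evaluator; a small sketch:

```csharp
foreach (AIContent content in trajectory)
{
    switch (content)
    {
        case FunctionCallContent call:
            Console.WriteLine($"call   {call.Name} ({call.Arguments?.Count ?? 0} args)");
            break;
        case FunctionResultContent result:
            Console.WriteLine($"result for call {result.CallId}");
            break;
    }
}
```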
## Example

See `src/Examples/AgentFramework/IterativeTripPlannerApp.Evaluation` for an end-to-end evaluation demo. It runs a real trip-planner agent via `IterativeTripPlannerApp.Core`, extracts diagnostics, converts them to `EvaluationInputs` via `ToEvaluationInputs()`, and scores the run with both Needlr-native deterministic evaluators and MEAI quality evaluators using `CopilotChatClient` as the judge.
## Post-hoc replay from diagnostics

A serialized `AgentRunDiagnostics` is sufficient for offline replay and evaluation — no need to re-invoke the agent or the underlying model.
| Diagnostic | Property | Contents |
|---|---|---|
| `ChatCompletionDiagnostics` | `RequestMessages : IReadOnlyList<ChatMessage>?` | The exact messages sent to the chat client on that call. |
| `ChatCompletionDiagnostics` | `Response : ChatResponse?` | The full response returned by the chat client (`null` on failure). |
| `ToolCallDiagnostics` | `Arguments : IReadOnlyDictionary<string, object?>?` | Snapshot of the arguments the tool was invoked with. |
| `ToolCallDiagnostics` | `Result : object?` | The value returned by the tool invocation (`null` on failure). |
All four properties are init-only, default to `null`, and are populated automatically by `DiagnosticsChatClientMiddleware` and `DiagnosticsFunctionCallingMiddleware`. Capture is always on.
With these in hand you can rehydrate a MEAI `ChatResponse` plus trajectory from a persisted diagnostics document and feed it into any `IEvaluator` offline.
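A minimal replay sketch, assuming the persisted document has been deserialized back to `AgentRunDiagnostics` and that `ChatCompletions` is ordered by sequence:

```csharp
// Score the final chat completion of a persisted run, fully offline.
ChatCompletionDiagnostics last = diagnostics.ChatCompletions[^1];
IReadOnlyList<ChatMessage> request = last.RequestMessages ?? [];

if (last.Response is { } response)
{
    EvaluationResult offline = await new RelevanceEvaluator().EvaluateAsync(
        request,
        response,
        new ChatConfiguration(judgeChatClient));
}
```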
## Full-fidelity transcripts
Evaluation and agent-assisted debugging both depend on replay-grade transcripts — every chat exchange, not just totals.
### Streaming capture

`DiagnosticsChatClientMiddleware` instruments both paths:

- `GetResponseAsync` — captured on completion.
- `GetStreamingResponseAsync` — streaming updates are teed through to the caller in real time, then buffered via `ToChatResponse()` at stream completion. The synthesized `ChatResponse` is written to `ChatCompletionDiagnostics.Response` with identical shape to the non-streaming path.

Errors mid-stream still populate `ChatCompletionDiagnostics.{Success=false, ErrorMessage, Response}` with the partial response built from updates observed before the failure. No data is silently dropped.
### Streaming agent runs

`DiagnosticsAgentRunMiddleware` instruments both agent-run paths:

- `HandleAsync` — captured on completion.
- `HandleStreamingAsync` — `AgentResponseUpdate`s are teed through to the caller in real time while distinct non-null `MessageId`s accumulate into `AgentRunDiagnostics.TotalOutputMessages`. On stream completion the builder is finalized and written to the configured `IAgentDiagnosticsWriter` with identical shape to the non-streaming path.

Mid-stream failures record the partial output-message count observed so far and call `AgentRunDiagnosticsBuilder.RecordFailure(...)` before rethrowing, so streaming agent runs surface in diagnostics the same way non-streaming runs do.
### Character counts
Tokens are an LLM-reported abstraction; character counts are a direct measure of the payload Needlr actually shipped and received. Both are captured on every completion.
- `ChatCompletionDiagnostics.RequestCharCount` — sum of `TextContent.Text?.Length` across all `RequestMessages`.
- `ChatCompletionDiagnostics.ResponseCharCount` — sum of text length across the aggregated `Response`.
- `ToolCallDiagnostics.ArgumentsCharCount` — length of the `System.Text.Json` serialization of the captured `Arguments` dictionary.
- `ToolCallDiagnostics.ResultCharCount` — length of the `System.Text.Json` serialization of the captured `Result`.

Populated automatically by `DiagnosticsChatClientMiddleware` and `DiagnosticsFunctionCallingMiddleware` on both success and failure paths. `DiagnosticsCharCounter` (in `NexusLabs.Needlr.AgentFramework.Diagnostics`) exposes the same helpers for callers who want to compute counts outside the middlewares. All helpers are null-safe and exception-tolerant — a counter failure never destabilizes the live path; it just yields 0.
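Reading the counts back for a quick payload audit (a sketch; assumes `diagnostics` is the run's `IAgentRunDiagnostics`):

```csharp
foreach (ChatCompletionDiagnostics completion in diagnostics.ChatCompletions)
{
    // Request/response payload sizes as captured by the middleware.
    Console.WriteLine(
        $"completion #{completion.Sequence}: " +
        $"{completion.RequestCharCount} chars in, {completion.ResponseCharCount} chars out");
}
```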
## OpenTelemetry interop

When MEAI's `UseOpenTelemetry()` or MAF's `WithOpenTelemetry()` is also active, both the upstream middleware and Needlr's `DiagnosticsChatClientMiddleware` create `Activity` spans for the same chat completion call. To avoid duplicate spans, set `ChatCompletionActivityMode` to `EnrichParent`:
```csharp
.UsingAgentFramework(af => af
    .ConfigureMetrics(o =>
        o.ChatCompletionActivityMode = ChatCompletionActivityMode.EnrichParent))
```
In `EnrichParent` mode, when a parent `gen_ai.*` activity exists (from MEAI or MAF), Needlr skips creating its own activity and instead adds Needlr-specific tags (sequence number, char counts, agent name) to the existing parent span. When no parent exists, Needlr creates its own activity as normal.

Tool-call activities (`agent.tool`) are not affected — neither MEAI nor MAF produces per-tool-call spans, so Needlr's tool tracing is always the sole source.
| Mode | When to use |
|---|---|
| `Always` (default) | Needlr is the only OTel instrumentation layer |
| `EnrichParent` | Both Needlr and upstream (MEAI/MAF) OTel middleware are active |
Metrics (counters, histograms) and in-process diagnostics recording are unaffected by this setting — only `Activity` span creation is suppressed.
## Ordered timeline

`IAgentRunDiagnostics` exposes `ChatCompletions` and `ToolCalls` as separate collections, each with its own `Sequence`. When you need to see what actually happened in execution order, call the `GetOrderedTimeline()` extension method:
```csharp
using NexusLabs.Needlr.AgentFramework.Diagnostics;

var timeline = diag.GetOrderedTimeline();

foreach (var entry in timeline)
{
    Console.WriteLine($"[{entry.StartedAt:HH:mm:ss.fff}] {entry.Kind} #{entry.Sequence}");
}
```
The returned list merges both collections and sorts them by `StartedAt` (wall-clock). When two entries share the same `StartedAt`, `ChatCompletion` entries sort before `ToolCall` entries (a chat completion is what triggers a tool call, not the reverse); further ties resolve by `Sequence` within kind. Each `DiagnosticsTimelineEntry` carries the original `ChatCompletionDiagnostics` or `ToolCallDiagnostics` reference in the property matching its `Kind`, so no information is lost in the merge — the ordered view is purely additive.
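Reaching the full record behind an entry might then look like this (a sketch; the property names here are assumed to mirror the `Kind` values, so check `DiagnosticsTimelineEntry` for the exact surface):

```csharp
foreach (var entry in timeline)
{
    // Exactly one of the two references is expected to be non-null, matching entry.Kind.
    if (entry.ChatCompletion is { } completion)
        Console.WriteLine($"  chat response: {completion.ResponseCharCount} chars");
    else if (entry.ToolCall is { } toolCall)
        Console.WriteLine($"  tool result: {toolCall.ResultCharCount} chars");
}
```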
## Agent-run boundary capture

Beyond the per-completion and per-tool-call records, `IAgentRunDiagnostics` captures the exact input and output at the run boundary:
- `InputMessages : IReadOnlyList<ChatMessage>` — the full input list handed to the middleware at run start. Empty when no input was supplied.
- `OutputResponse : AgentResponse?` — the full response assembled at run completion. For non-streaming runs this is the underlying `AgentResponse`; for streaming runs the middleware aggregates `AgentResponseUpdate` fragments by `MessageId` (updates without an id become discrete messages keyed by arrival ordinal).
Partial responses are still captured when a streaming run fails mid-stream — `OutputResponse` carries whatever messages were assembled before the fault, alongside `Succeeded = false` and the `ErrorMessage`. This makes a serialized `AgentRunDiagnostics` replay-complete: an evaluator can consume `InputMessages` + `OutputResponse` directly, without reaching back to the caller for the original prompt or the streamed output.
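A boundary-level replay sketch, assuming the aggregated `AgentResponse` exposes its messages (which MEAI's `ChatResponse` constructor can wrap):

```csharp
// Evaluate straight from the run boundary; no caller state required.
IReadOnlyList<ChatMessage> input = diagnostics.InputMessages;

if (diagnostics.OutputResponse is { } output)
{
    var modelResponse = new ChatResponse(output.Messages.ToList());
    EvaluationResult result = await new CoherenceEvaluator().EvaluateAsync(
        input,
        modelResponse,
        new ChatConfiguration(judgeChatClient));
}
```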
## Native agent-run evaluators

`NexusLabs.Needlr.AgentFramework.Evaluation` ships four deterministic evaluators that operate directly on `IAgentRunDiagnostics`, plus the LLM-judged `TaskCompletionEvaluator`. The deterministic four are pure computations over captured diagnostics — no LLM judge is invoked, so they run offline and are cheap enough to assert in unit tests.

All of these evaluators consume the same bridge type:
### AgentRunDiagnosticsContext

`AgentRunDiagnosticsContext : EvaluationContext` wraps an `IAgentRunDiagnostics` instance and exposes it through an `EvaluationContext` so native evaluators (and MEAI-provided evaluators that accept supplemental context) can read diagnostics without a custom adapter.

- `ContextName = "Needlr Agent Run Diagnostics"` — stable constant used as the context identifier.
- `Diagnostics` — the wrapped `IAgentRunDiagnostics`.
- `BuildContents()` — emits a single `TextContent` summary of the run (agent name, execution mode, outcome, chat-completion count, tool-call count, duration) so MEAI judge-based evaluators that round-trip context through a prompt still see a readable summary.

Evaluators downcast the context to `AgentRunDiagnosticsContext` to reach the full `Diagnostics` surface.
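A custom evaluator can use the same pattern; a minimal sketch (namespaces assumed):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.AI.Evaluation;
using NexusLabs.Needlr.AgentFramework.Diagnostics;
using NexusLabs.Needlr.AgentFramework.Evaluation;

// Returns the wrapped diagnostics, or null when the context is absent
// (in which case an evaluator should report "not applicable").
static IAgentRunDiagnostics? TryGetDiagnostics(
    IEnumerable<EvaluationContext>? additionalContext) =>
    additionalContext?
        .OfType<AgentRunDiagnosticsContext>()
        .FirstOrDefault()?
        .Diagnostics;
```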
### EfficiencyEvaluator
Reports on token usage and cost efficiency.
- `Total Tokens` — aggregate token count across all LLM calls.
- `Input Token Ratio` — input tokens / total tokens. High values suggest verbose prompts; low values suggest verbose outputs.
- `Tokens Per Tool Call` — total tokens / tool-call count. Measures the token cost of each tool invocation. Zero when no tool calls occurred.
- `Cache Hit Ratio` — cached input tokens / input tokens. Higher values mean more prompt-cache reuse.
- `Under Budget` — boolean, only emitted when `tokenBudget` is provided to the constructor. True when total tokens is strictly below the budget.

When no `AgentRunDiagnosticsContext` is present, the evaluator returns an empty result. The optional `tokenBudget` constructor parameter controls whether the budget metric is emitted.
### TaskCompletionEvaluator (LLM-judged)

Assesses whether the agent actually accomplished the task it was given. Unlike MEAI's `TaskAdherenceEvaluator` (which checks instruction following), this evaluator checks task success: did the agent produce output that satisfies the original request?

- `Task Completed` — boolean. True when the judge determines the agent accomplished the requested task.
- `Task Completion Score` — numeric (1–5). How completely and correctly the agent fulfilled the request. 5 = fully complete, 1 = not started or completely wrong. The completion threshold is 3.
- `Task Completion Reasoning` — string. The judge's explanation for the score.

This evaluator requires a `ChatConfiguration` with a judge `IChatClient`. When no judge is configured, the evaluator returns an empty result. When `AgentRunDiagnosticsContext` is provided, tool-call counts and success status are included in the judge prompt for richer assessment.
```csharp
var taskCompletion = await new TaskCompletionEvaluator()
    .EvaluateAsync(
        messages: [new ChatMessage(ChatRole.User, "Plan a 7-day trip to Japan")],
        modelResponse: agentResponse,
        chatConfiguration: new ChatConfiguration(judgeChatClient),
        additionalContext: [new AgentRunDiagnosticsContext(diagnostics)]);

var completed = ((BooleanMetric)taskCompletion.Metrics["Task Completed"]).Value;
var score = ((NumericMetric)taskCompletion.Metrics["Task Completion Score"]).Value;
```
### ToolCallTrajectoryEvaluator

Reports on the sequence of tool calls across a run.

- `Tool Calls Total` — total tool-call records observed.
- `Tool Calls Failed` — count of records where `Succeeded` is false.
- `Tool Call Sequence Gaps` — number of positions where consecutive tool-call `SequenceNumber` values are not strictly increasing by one.
- `All Tool Calls Succeeded` — boolean rollup, true when every tool call succeeded.

When no `AgentRunDiagnosticsContext` is present in the evaluation's additional context, the evaluator returns an empty result ("not applicable"). This lets callers include it unconditionally in a pipeline of evaluators.
### IterationCoherenceEvaluator

Reports on iterative-loop structure.

- `Iteration Count` — number of `IterationRecord` entries.
- `Iteration Empty Outputs` — count of iterations whose `FinalResponse` has no text content.
- `Terminated Coherently` — boolean, true when the run reports a terminated-coherently signal consistent with the captured iterations.

Gated on execution mode: the evaluator only emits metrics when `Diagnostics.ExecutionMode == "IterativeLoop"` (available as the `IterativeLoopExecutionMode` constant). Other execution modes produce an empty result.
### TerminationAppropriatenessEvaluator

Reports on whether the run's terminal state is internally consistent.

- `Run Succeeded` — boolean mirror of `Diagnostics.Succeeded`.
- `Termination Consistent` — boolean, true when `Succeeded` agrees with the presence/absence of `ErrorMessage` (success ⇔ no error message).
- `Execution Mode` — string metric carrying the run's execution mode, or the `UnknownExecutionMode = "Unknown"` fallback when null.
### Wiring native evaluators
```csharp
var context = new AgentRunDiagnosticsContext(diagnostics);
var additionalContext = new[] { context };

// Deterministic evaluators never invoke a judge, so chatConfiguration can be
// null here. Construct a ChatConfiguration with a judge client only when
// judge-based evaluators (e.g. TaskCompletionEvaluator) run in the same pass.
var trajectory = await new ToolCallTrajectoryEvaluator()
    .EvaluateAsync(
        messages: Array.Empty<ChatMessage>(),
        modelResponse: new ChatResponse(),
        chatConfiguration: null,
        additionalContext: additionalContext);

var efficiency = await new EfficiencyEvaluator(tokenBudget: 10_000)
    .EvaluateAsync(
        messages: Array.Empty<ChatMessage>(),
        modelResponse: new ChatResponse(),
        chatConfiguration: null,
        additionalContext: additionalContext);

var coherence = await new IterationCoherenceEvaluator()
    .EvaluateAsync(
        messages: Array.Empty<ChatMessage>(),
        modelResponse: new ChatResponse(),
        chatConfiguration: null,
        additionalContext: additionalContext);

var termination = await new TerminationAppropriatenessEvaluator()
    .EvaluateAsync(
        messages: Array.Empty<ChatMessage>(),
        modelResponse: new ChatResponse(),
        chatConfiguration: null,
        additionalContext: additionalContext);
```
Each `EvaluationResult` exposes metrics by name — use the `*MetricName` constants on each evaluator type to look them up without string literals at the call site.
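For example, with the results from the block above:

```csharp
// Metric lookup via the documented *MetricName constants.
var allSucceeded = (BooleanMetric)trajectory.Metrics[ToolCallTrajectoryEvaluator.AllSucceededMetricName];
var totalTokens = (NumericMetric)efficiency.Metrics[EfficiencyEvaluator.TotalTokensMetricName];
Console.WriteLine($"tools ok: {allSucceeded.Value}, total tokens: {totalTokens.Value}");
```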
## Quality gate for CI

`EvaluationQualityGate` defines configurable thresholds and throws `QualityGateFailedException` when evaluation metrics regress. Use it in CI pipelines or xUnit tests to gate deployments on agent quality.
```csharp
var gate = new EvaluationQualityGate()
    .RequireBoolean(ToolCallTrajectoryEvaluator.AllSucceededMetricName, expected: true)
    .RequireBoolean(IterationCoherenceEvaluator.TerminatedCoherentlyMetricName, expected: true)
    .RequireNumericMax(EfficiencyEvaluator.TotalTokensMetricName, max: 50_000)
    .RequireBoolean(EfficiencyEvaluator.UnderBudgetMetricName, expected: true)
    .RequireNumericMin(IterationCoherenceEvaluator.EfficiencyRatioMetricName, min: 0.5);

// Throws QualityGateFailedException listing all violations.
gate.Assert(trajectoryResult, coherenceResult, efficiencyResult);
```
Threshold types:
- `RequireNumericMax(name, max)` — metric value must be ≤ max.
- `RequireNumericMin(name, min)` — metric value must be ≥ min.
- `RequireBoolean(name, expected)` — metric value must equal expected.
Missing metrics are silently skipped — this allows a single gate definition to work with evaluators that conditionally emit metrics.
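In a test harness, a regression surfaces as a thrown exception; a handling sketch (xUnit shown, but any framework works):

```csharp
try
{
    gate.Assert(trajectoryResult, coherenceResult, efficiencyResult);
}
catch (QualityGateFailedException ex)
{
    // The exception message enumerates every violated threshold.
    Assert.Fail(ex.Message);
}
```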
## Transcript markdown

For snapshot tests, review artifacts, and CI log attachments, render an entire agent run as deterministic Markdown with `ToTranscriptMarkdown()`:
```csharp
using NexusLabs.Needlr.AgentFramework.Diagnostics;

string transcript = diag.ToTranscriptMarkdown();
File.WriteAllText("run.md", transcript);
```
The output is byte-stable across locales — it uses `CultureInfo.InvariantCulture` for numeric formatting and `System.Text.Json` with `WriteIndented = true` for embedded tool arguments and results. Structure:
- H1 header — agent name, execution mode, success/failure, total duration (ms), aggregate token usage.
- `## Input messages` — only emitted when `InputMessages` is non-empty.
- `## Timeline` — the ordered view from `GetOrderedTimeline()`, with each entry prefixed by its offset from `StartedAt` in milliseconds. Tool-call entries embed `Arguments` and `Result` as pretty-printed JSON blocks.
- `## Output response` — only emitted when `OutputResponse` is non-null and carries at least one message.
- `## Error` — only emitted when `Succeeded` is false.
The renderer is a read-side projection over `IAgentRunDiagnostics` — calling it has no effect on the live path.
## Capture-chat-client middleware

Evaluation suites and CI harnesses benefit from deterministic, repeatable chat responses: the first run against a real model captures the response; subsequent runs replay it without hitting the network. `EvaluationCaptureChatClient` is a transparent `IChatClient` decorator that implements this pattern.
```csharp
using NexusLabs.Needlr.AgentFramework.Evaluation;

IChatClient cached = realChatClient.WithEvaluationCapture("./cache/evaluation");

// First call: delegates to realChatClient and persists the response.
// Second call with the same request: served from the store, real client untouched.
ChatResponse response = await cached.GetResponseAsync(messages, options, ct);
```
### Cache key

The key is a SHA-256 lowercase hex digest (64 chars) computed over:

- Each `ChatMessage` formatted as `"{role}:{text}\n"`, in order.
- A `---\n` separator.
- The tuple `model`, `temperature`, `top_p`, `max_tokens` from the supplied `ChatOptions` (missing values emit empty strings; floats are formatted with `"R"` + `InvariantCulture`).
The key intentionally excludes tool definitions, response format, and custom options. Two requests that differ only in those fields collide on the same cache entry — if your suite needs them to vary, route them to separate stores.
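If you need to pre-seed a store or debug misses, the documented derivation can be reproduced along these lines (a sketch that assumes the tuple fields are concatenated in the order listed; verify against the shipped implementation before relying on byte-for-byte parity):

```csharp
using System.Globalization;
using System.Security.Cryptography;
using System.Text;
using Microsoft.Extensions.AI;

static string ComputeCaptureKey(IEnumerable<ChatMessage> messages, ChatOptions? options)
{
    var sb = new StringBuilder();

    // Each message as "{role}:{text}\n", in order.
    foreach (ChatMessage message in messages)
        sb.Append(message.Role.Value).Append(':').Append(message.Text).Append('\n');

    sb.Append("---\n");

    // model, temperature, top_p, max_tokens; missing values emit empty strings.
    sb.Append(options?.ModelId)
      .Append(options?.Temperature?.ToString("R", CultureInfo.InvariantCulture))
      .Append(options?.TopP?.ToString("R", CultureInfo.InvariantCulture))
      .Append(options?.MaxOutputTokens?.ToString(CultureInfo.InvariantCulture));

    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString()));
    return Convert.ToHexString(hash).ToLowerInvariant(); // 64-char lowercase hex
}
```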
### Store contract

`IEvaluationCaptureStore` has two methods:

| Method | Semantics |
|---|---|
| `TryGetAsync(key, ct)` | Returns the captured `ChatResponse?` for the key, or `null` on miss. |
| `SaveAsync(key, response, ct)` | Persists the response under the key, overwriting any existing entry. |
Two implementations ship in-box:

- `FileEvaluationCaptureStore` — one JSON file per key under a caller-supplied directory. Writes are atomic via write-then-rename. The directory is created on first save.
- Custom stores — implement `IEvaluationCaptureStore` for Redis, blob storage, in-memory dictionaries, or any other backing; see the sketch below.
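As an illustration, a minimal in-memory store might look like this (method signatures assumed from the contract table above):

```csharp
using System.Collections.Concurrent;
using Microsoft.Extensions.AI;
using NexusLabs.Needlr.AgentFramework.Evaluation;

// Hypothetical test-oriented store; handy for unit tests that must not touch disk.
public sealed class InMemoryEvaluationCaptureStore : IEvaluationCaptureStore
{
    private readonly ConcurrentDictionary<string, ChatResponse> _entries = new();

    public Task<ChatResponse?> TryGetAsync(string key, CancellationToken cancellationToken = default)
        => Task.FromResult<ChatResponse?>(
            _entries.TryGetValue(key, out var response) ? response : null);

    public Task SaveAsync(string key, ChatResponse response, CancellationToken cancellationToken = default)
    {
        _entries[key] = response; // overwrite semantics per the contract
        return Task.CompletedTask;
    }
}
```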
Use the fluent extension to wrap a client with a file store:
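```csharp
// File-backed store rooted at the given directory (same call as the example above).
IChatClient cached = realChatClient.WithEvaluationCapture("./cache/evaluation");
```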
Or construct the capture client over any `IEvaluationCaptureStore` (the sketch below assumes a store-accepting `WithEvaluationCapture` overload; verify against the shipped API):
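```csharp
// Assumed overload accepting a store instance, e.g. the in-memory sketch above.
IEvaluationCaptureStore store = new InMemoryEvaluationCaptureStore();
IChatClient cached = realChatClient.WithEvaluationCapture(store);
```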
### Streaming

`GetStreamingResponseAsync` aggregates updates via `ToChatResponse()`, saves the aggregated response under the same key used by the non-streaming path, then re-emits the captured response as a single `ChatResponseUpdate`. A cache hit yields one update; the caller sees the same `await foreach` surface either way. Mid-stream failures are not cached — only complete responses reach the store.
### When not to use it
Capture-replay is unsuitable for suites that rely on response variance (temperature sampling across runs), suites that depend on live model behavior changes, or suites where the cache key's intentional omissions (tools, response format) would cause false hits. For those, bypass the decorator and use the underlying client directly.