@microsoft/m365-copilot-eval
v1.8.0-preview.1
Zero-config Node.js wrapper for M365 Copilot Agent Evaluations CLI (Python-based Azure AI Evaluation SDK)
M365 Copilot Agent Evaluations
PUBLIC PREVIEW: This tool is currently in public preview; refer to the instructions below to get started.
A CLI for evaluating M365 Copilot agents. Send prompts to your agent, get responses, and automatically score them with Azure AI Evaluation metrics (relevance, coherence, groundedness).
- Send a batch (or interactive set) of prompts to a configured chat API endpoint.
- Collect agent responses and evaluate them locally using Azure AI Evaluation SDK.
- The CLI supports 7 evaluator types. Evaluators marked with ⭐ are enabled by default.
| Evaluator | Type | Scale | Default Threshold | Default |
|-----------|------|-------|-------------------|---------|
| Relevance ⭐ | LLM-based | 1-5 | 3 | Yes |
| Coherence ⭐ | LLM-based | 1-5 | 3 | Yes |
| Groundedness | LLM-based | 1-5 | 3 | No |
| Similarity | LLM-based | 1-5 | 3 | No |
| Citations | Count-based | >= 0 | 1 | No |
| ExactMatch | String match | boolean | N/A | No |
| PartialMatch | String match | 0.0-1.0 | 0.5 | No |
- Multiple input modes: command‑line list, JSON file, interactive.
- Multiple output formats: console (colorized), JSON, CSV, HTML (auto‑opens report).
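As a rough illustration of how the default thresholds in the evaluator table could be applied, here is a minimal Python sketch. The names `DEFAULT_THRESHOLDS` and `passes` are hypothetical, and the rule that a score passes when it is greater than or equal to its threshold is an assumption, not something this README states.

```python
# Default thresholds copied from the evaluator table above.
DEFAULT_THRESHOLDS = {
    "Relevance": 3,
    "Coherence": 3,
    "Groundedness": 3,
    "Similarity": 3,
    "Citations": 1,
    "PartialMatch": 0.5,
}

# Evaluators the table marks as enabled by default.
DEFAULT_EVALUATORS = ("Relevance", "Coherence")


def passes(evaluator: str, score) -> bool:
    """Assumed rule: a score passes when it is >= the default threshold."""
    if evaluator == "ExactMatch":
        # ExactMatch is boolean and has no threshold (N/A in the table).
        return bool(score)
    return score >= DEFAULT_THRESHOLDS[evaluator]
```

Under this assumed rule, a Relevance score of 3 passes and a Coherence score of 2 does not.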
📋 Prerequisites
- M365 Copilot License for your tenant
- M365 Copilot Agent deployed to your tenant (can be created with M365 Agents Toolkit or any other method)
- Node.js 24.12.0+ (check: node --version)
- Python 3.13.x is downloaded automatically. If the download fails (e.g., network restrictions), set PYTHON_PATH to a local Python 3.13.x installation (see Troubleshooting)
- Environment file with your credentials and agent ID (see Environment Setup below)
- Your Tenant ID - get your tenant id using the instructions here
- Admin approval to run WORKIQ Client App for your tenant here
- Azure OpenAI endpoint, and API key (see Getting Variables below)
Platform authentication support:
- Windows — Windows Account Manager (WAM) broker, built-in.
- macOS — Company Portal broker. Install Microsoft Company Portal before running.
- Known limitation (Intel Macs): Sign-in via the broker is currently failing on Intel-based Macs. Apple Silicon (M-series) Macs are not affected. The MSAL team is investigating; progress is tracked in AzureAD/microsoft-authentication-library-for-python#908.
- Linux / WSL — Intune broker. Install the required system libraries first:
sudo apt install libwebkit2gtk-4.1-0 libdbus-1-dev python3-gi gir1.2-secret-1 libubsan1
If the required libraries are missing, the authentication library raises an ImportError instead of falling back to browser-based authentication, so install the packages above before running.
🔧 Environment Setup
Install the Tool
- Make sure you have Node.js
- Run npm install -g @microsoft/m365-copilot-eval
Setup Steps
Now, set up where you'll store your environment variables:
Are you using M365 Agents Toolkit (ATK)?
- Yes → You already have .env.local in your project with M365_TITLE_ID (automatically used as your agent ID). Keep non-secret config there and put secrets like AZURE_AI_API_KEY in .env.local.user (never committed).
- No → Create a new env/.env.dev file in your project directory. You'll add all variables there.
The CLI loads environment variables from multiple sources (in order of precedence):
- .env.local in current directory (auto-detected, ideal for ATK projects)
- .env.local.user in current directory, or env/.env.local.user, auto-loaded as a user-specific override (never commit this file; put secrets here)
- env/.env.{environment} via --env flag (e.g., --env dev loads env/.env.dev)
- System environment variables
Option 1: For M365 Agents Toolkit (ATK) Projects
ATK projects already check in .env.local with agent configuration. Do not put secrets in .env.local — use .env.local.user instead, which is loaded automatically and should be added to your .gitignore.
# .env.local (checked in — no secrets!)
# Already present from ATK:
M365_TITLE_ID="T_your-title-id-here" # Auto-generated by ATK
TEAMS_APP_TENANT_ID="your-tenant-id" # Auto-generated by ATK
# .env.local.user (NOT checked in; secrets go here)
AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
AZURE_AI_API_VERSION="2024-12-01-preview" # default
AZURE_AI_MODEL_NAME="gpt-4o-mini" # recommended
Add .env.local.user to your .gitignore:
# User-specific secrets — never commit
.env.local.user
env/.env.local.user
Option 2: For Non-ATK Projects
Create env/.env.dev in your project directory:
# env/.env.dev (new file you create)
# Your agent ID (Optional):
M365_AGENT_ID="your-agent-id" # e.g., U_0dc4a8a2-b95f-edac-91c8-d802023ec2d4
# You'll add these (see Getting Variables section below):
AZURE_AI_OPENAI_ENDPOINT="<your-azure-openai-endpoint>"
AZURE_AI_API_KEY="<your-api-key-from-azure-portal>"
AZURE_AI_API_VERSION="2024-12-01-preview" # default
AZURE_AI_MODEL_NAME="gpt-4o-mini" # recommended
TENANT_ID="<your-tenant-id>"
You can also override the agent ID at runtime: runevals --m365-agent-id "custom-id"
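Before a first run, it can be handy to check a non-ATK env file for missing values. The sketch below is a minimal Python parser for the KEY="value" # comment layout shown above. The `REQUIRED` list is an assumption based on the variables in this section, and `parse_env` and `missing_variables` are hypothetical helper names, not part of the CLI.

```python
import re

# Assumed set of variables a non-ATK env file needs (taken from the example above).
REQUIRED = ("AZURE_AI_OPENAI_ENDPOINT", "AZURE_AI_API_KEY", "TENANT_ID")

# Matches: KEY="value"  # optional trailing comment (quotes optional)
_LINE = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"?([^"#]*?)"?\s*(?:#.*)?$')


def parse_env(text: str) -> dict[str, str]:
    """Parse simple dotenv text, skipping blank lines and full-line comments."""
    values = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip()
    return values


def missing_variables(values: dict[str, str]) -> list[str]:
    """Return the assumed-required names that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]
```

This sketch does not handle values that themselves contain `#` or `"`; a full dotenv library would be a better fit for those cases.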
🔑 Getting Variables
Now that you know what's needed, here's how to get the required values:
1. Tenant ID
Your Azure Active Directory (AAD) tenant ID.
- If you created your agent using Agents Toolkit, the tool automatically reads TEAMS_APP_TENANT_ID from .env.local and uses it as the tenant ID. No additional configuration is needed.
- For non-ATK projects, set TENANT_ID in your env file.
How to obtain:
- Go to Azure Portal
- Search for "Azure Active Directory" or "Microsoft Entra ID"
- In the Overview section, you'll see Tenant ID
- Copy this value; this is your TENANT_ID
Alternatively, if you have the Azure CLI installed:
az account show --query tenantId
2. Agent ID
- If you created your agent using Agents Toolkit, the tool automatically reads M365_TITLE_ID from .env.local and constructs the agent ID.
- If you don't know your agent ID, the tool offers agent selection when you try to submit a job. The selection list shows each agent's name, description, and agent ID so that you can pick the right one.
3. Azure OpenAI Endpoint and API Key
You need both the endpoint URL and API key from your Azure OpenAI resource for "LLM as a Judge" evaluations. This Azure OpenAI endpoint can be in any tenant or account; you only need to point the Evals tool at it using AZURE_AI_OPENAI_ENDPOINT and AZURE_AI_API_KEY.
How to obtain:
- Open Azure Portal, search for OpenAI in the search bar, and select Azure OpenAI.
- Once you have selected Azure OpenAI, choose Create an AI Foundry Resource.
- On the Create Foundry Resource page, fill in the details and click 'Review + Create'.
- Once the resource is deployed, go to the Foundry portal.
- At this point, you should be able to deploy an LLM model.
- Select Models + Endpoints on the left rail.
- Select Deploy Model -> Deploy base model (we recommend the gpt-4o-mini model).
- Select Confirm, then select Customize.
- In Customize, change the capacity to 50K tokens per minute.
- Hit Deploy and wait a few minutes for the model to deploy.
- Once the deployment finishes, you are redirected to the page showing the API endpoint and API key.
- Copy the endpoint and API key values from that page.
- Add all of these values to your .env.dev file as shown in the Setup Steps above.
Required model: Ensure you have gpt-4o-mini (or similar) deployed in your Azure OpenAI resource.
Security tip: Store keys and endpoints securely and never commit them to source control.
🚀 Quick Start
Now that you have your environment variables set up, you're ready to run evaluations!
Important: Run this tool FROM your M365 agent project directory (where your agent code lives), not from this repository. You don't need to clone or download this repo.
# Navigate to YOUR agent project directory
cd /path/to/your-agent-project
# Run evaluations (auto-discovers .env.local for ATK projects)
runevals
# Or specify an environment file
runevals --env dev
No prompts file? If you don't have a prompts file yet, the tool will offer to create a starter file with example prompts for you.
Environment file lookup:
- Checks .env.local first (ATK projects)
- Then checks env/.env.{name} if --env {name} is specified
- Prompts file auto-discovery works the same for all projects
📝 Eval Document Format
The eval document schema is versioned independently from the CLI, following Semantic Versioning.
- Schema location: schema/v1/eval-document.schema.json
- Schema changelog: schema/CHANGELOG.md
New in Schema v1.2.0: Multi-turn conversation threads — test context persistence across multiple turns within a shared conversation session. Each thread supports 1-20 turns.
New in Schema v1.1.0: Per-prompt evaluator overrides with evaluators_mode (extend/replace), file-level default_evaluators, and ExactMatch/PartialMatch evaluators.
Getting Started
The CLI auto-discovers prompts files in your project. When you run runevals, it searches:
- Current directory: prompts.json, evals.json, tests.json
- ./evals/ subdirectory: prompts.json, evals.json, tests.json
No prompts file? The CLI will offer to create a starter file with example prompts for you.
A minimal eval document:
{
  "schemaVersion": "1.2.0",
  "items": [
    {
      "prompt": "What is Microsoft 365?",
      "expected_response": "Microsoft 365 is a cloud-based productivity suite..."
    }
  ]
}
Evaluator Configuration
Use default_evaluators to set file-level defaults, and per-item evaluators with evaluators_mode to customize:
{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "prompt": "What is Microsoft Graph?",
      "expected_response": "A unified API endpoint for Microsoft services.",
      "evaluators": {
        "Citations": { "citation_format": "mixed" }
      },
      "evaluators_mode": "extend"
    },
    {
      "name": "Expense policy flow",
      "turns": [
        {
          "prompt": "I spent $250 on dinner. Is that okay?",
          "expected_response": "The per-diem meal allowance is $200."
        },
        {
          "prompt": "What should I do about the overage?",
          "expected_response": "Request manager approval.",
          "evaluators": {
            "ExactMatch": { "case_sensitive": false }
          },
          "evaluators_mode": "replace"
        }
      ]
    }
  ]
}
How evaluator modes work in this example:
| Item | evaluators_mode | Active Evaluators | Why |
|------|-------------------|-------------------|-----|
| Single-turn (Graph) | extend | Relevance, Coherence, Citations | Per-prompt Citations merged with defaults |
| Multi-turn turn 1 (dinner) | (none) | Relevance, Coherence | Inherits file-level defaults |
| Multi-turn turn 2 (overage) | replace | ExactMatch | Per-turn ExactMatch replaces defaults entirely |
Evaluator Modes
| Mode | Behavior |
|------|----------|
| "extend" (default) | Per-item evaluators merge with defaults. Both run. |
| "replace" | Per-item evaluators replace defaults entirely. Only per-item evaluators run. |
| (none) | Inherits file-level default_evaluators, or system defaults (Relevance, Coherence) if not set. |
See schema/v1/examples/ in the package for more examples including per-turn evaluator overrides, mixed single/multi-turn files, and output format.
Auto-Upgrade Behavior
When the CLI loads an eval document:
- Legacy documents (missing schemaVersion): Automatically upgraded with a timestamped backup (e.g., file.json.bak.20260205143052)
- Older versions (same major version): schemaVersion field updated without backup
- Invalid documents: CLI exits with an error message and guidance to review the schema changelog
- Future versions: CLI rejects with a message suggesting a CLI update
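The version cases above can be sketched as a small Python classifier. All names here are hypothetical. The supported version (1.2.0) is assumed from the examples in this README, the "unsupported" result for an older major version is not described in the text, and schema validation of invalid documents is left out. The backup name follows the timestamp layout of the example above.

```python
from datetime import datetime

# Assumed schema version supported by the CLI (taken from the examples).
SUPPORTED = (1, 2, 0)


def parse_version(text: str) -> tuple[int, int, int]:
    """Split a MAJOR.MINOR.PATCH string into integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify(document: dict, supported: tuple[int, int, int] = SUPPORTED) -> str:
    """Map a document to one of the cases listed above."""
    version = document.get("schemaVersion")
    if version is None:
        return "legacy"       # upgraded, with a timestamped backup
    parsed = parse_version(version)
    if parsed > supported:
        return "future"       # rejected; a CLI update is suggested
    if parsed == supported:
        return "current"
    if parsed[0] == supported[0]:
        return "older"        # schemaVersion updated, no backup
    return "unsupported"      # hypothetical: older major version


def backup_name(path: str, now: datetime) -> str:
    """Build a name like file.json.bak.20260205143052."""
    return f"{path}.bak.{now:%Y%m%d%H%M%S}"
```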
Version Compatibility
Within a major version (e.g., 1.x.x), we aim to maintain backward compatibility for documents that conform to the published schema for their version. Compatibility does not extend to undeclared or ad-hoc fields outside the schema definition; review the schema changelog when upgrading between minor versions.
🎯 Usage Examples
Remember: All commands below assume you're running them FROM your agent project directory, not from this repository.
What to Expect
When you run an evaluation from your agent project directory, you'll see:
🚀 M365 Copilot Agent Evaluations CLI
📂 Loading environment: dev
🤖 Agent ID: T_my-agent.declarativeAgent
📄 Using prompts file: ./evals/evals.json
📊 Running evaluations...
─────────────────────────────────────────────────────────────
✓ Evals completed successfully!
Results saved to: ./evals/2025-12-03_14-30-45.html
Commands to run from your project root:
# Use .env.local (checked in current dir, then env/ folder)
runevals
# Use env/.env.dev configuration
runevals --env dev
# Use specific prompts file in your project
runevals --prompts-file ./evals/my-tests.json
# Inline prompts (no file needed, useful for quick tests)
runevals --prompts "What is Microsoft Graph?" --expected "Gateway to M365 data"
# Interactive mode (enter prompts interactively)
runevals --interactive
# Canonical logging verbosity
runevals --log-level debug
runevals --log-level info
runevals --log-level warning
runevals --log-level error
# Parallel prompt execution control
runevals --concurrency 5 --prompts-file ./evals/evals.json
runevals --concurrency 1000 --prompts-file ./evals/evals.json # Python CLI clamps to 5
# Custom output location in your project
runevals --output ./reports/results.html
⚠️ Debug log safety notice: The --log-level debug option is opt-in and may include raw API payloads and response data in console output. Redaction is pattern-based (API keys, tokens, passwords, long mixed-case strings) and will not catch arbitrary PII or custom credentials embedded in prompts or responses. Do not share debug-level output publicly without manual review.
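To illustrate why pattern-based redaction cannot catch everything, here is a minimal Python sketch. The patterns and the `redact` function are hypothetical and are not the CLI's actual rules; they only show the general technique of masking values that follow a known key name or that look like long mixed-case tokens.

```python
import re

# Hypothetical pattern: a key name, a separator, then the secret value.
_KEY_VALUE = re.compile(
    r'(api[_-]?key|token|password)("?\s*[:=]\s*"?)([^\s",]+)',
    re.IGNORECASE,
)

# Hypothetical pattern: 32+ alphanumerics mixing lower case, upper case, and digits.
_LONG_MIXED = re.compile(
    r"\b(?=[A-Za-z0-9]*[a-z])(?=[A-Za-z0-9]*[A-Z])(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{32,}\b"
)


def redact(text: str) -> str:
    """Mask values matched by the patterns; anything else passes through."""
    text = _KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return _LONG_MIXED.sub("[REDACTED]", text)
```

A line such as "Call Jane at 555-0100" matches neither pattern and comes back unchanged, which is the kind of gap the notice above warns about.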
Auth and SDK errors: Warnings and errors from the Microsoft sign-in flow (MSAL) and Azure AI Evaluation SDK appear alongside the CLI's own diagnostics, which is useful when a run fails to authenticate or an evaluator can't reach Azure. Routine SDK chatter (token cache hits, HTTP retries) is hidden by default. If you're troubleshooting an auth or evaluator issue and want to see everything those libraries report, add --log-level debug.
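For scripted runs, the commands above can be assembled programmatically. The Python sketch below builds an argument list using only flags that appear in the examples in this section; `build_runevals_command` is a hypothetical helper, not part of the package.

```python
def build_runevals_command(env=None, prompts_file=None, output=None,
                           log_level=None, concurrency=None) -> list[str]:
    """Assemble a runevals argument list from the flags shown above."""
    command = ["runevals"]
    options = [
        ("--env", env),
        ("--prompts-file", prompts_file),
        ("--output", output),
        ("--log-level", log_level),
        ("--concurrency", concurrency),
    ]
    for flag, value in options:
        # Only include flags the caller actually set.
        if value is not None:
            command += [flag, str(value)]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from your agent project directory, assuming `runevals` is on the PATH.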
Optional: Add Shortcuts to package.json
You can add shortcuts (npm scripts) to your agent project's package.json:
{
  "scripts": {
    "eval": "runevals",
    "eval:local": "runevals --env local",
    "eval:dev": "runevals --env dev"
  }
}
Then use shorter commands:
# Uses .env.local (ATK default)
npm run eval
# Uses env/.env.local
npm run eval:local
# Uses env/.env.dev
npm run eval:dev
Production note: For production environments, use CI/CD pipelines instead of local npm run commands. See CICD_CACHE_GUIDE.md for examples.
📊 Output Formats
Results are automatically saved to ./evals/YYYY-MM-DD_HH-MM-SS.html with:
- Per-prompt and per-turn evaluation scores from configured evaluators
- Aggregate statistics across all evaluated items
- Multi-turn thread summaries (turns passed/failed, overall status)
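The default report name follows the YYYY-MM-DD_HH-MM-SS pattern, and the --output option accepts JSON, CSV, or HTML files. A minimal Python sketch of both ideas follows. The function names are hypothetical, and choosing the format from the file extension is an assumption based on the examples in this README.

```python
from datetime import datetime
from pathlib import Path

# Assumed mapping from file extension to output format.
FORMATS = {".json": "json", ".csv": "csv", ".html": "html"}


def default_report_path(now: datetime, directory: str = "./evals") -> str:
    """Build a path like ./evals/2025-12-03_14-30-45.html."""
    return f"{directory}/{now:%Y-%m-%d_%H-%M-%S}.html"


def output_format(path: str) -> str:
    """Pick the output format from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return FORMATS[suffix]
```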
Other formats:
# JSON output
runevals --output results.json
# CSV output
runevals --output results.csv
🔧 Command Reference
Options:
  -V, --version              output version number
  --log-level [level]        log level: debug|info|warning|error (bare flag -> info)
  --prompts <prompts...>     inline prompts to evaluate
  --expected <responses...>  expected responses (with --prompts)
  --prompts-file <file>      JSON file with prompts
  -o, --output <file>        output file (JSON, CSV, or HTML)
  -i, --interactive          interactive prompt entry mode
  --m365-agent-id <id>       override agent ID
  --env <environment>        environment name (default: dev)
  --init-only                just setup, don't run evals
  -h, --help                 display help
Cache Commands:
  cache-info                 show cache statistics
  cache-clear                remove cached Python runtime
  cache-dir                  print cache directory path
❓ Troubleshooting
Pre-cache Python Environment (Optional)
If you want to set up the Python environment ahead of time without running evaluations:
runevals --init-only
This is useful for:
- Pre-warming the cache in CI/CD pipelines
- Testing the setup without running evaluations
- Troubleshooting installation issues
Cache Issues
# View cache info
runevals cache-info
# Clear and rebuild
runevals cache-clear
runevals --init-only --log-level debug
Network/Proxy Issues
# Set proxy
export HTTPS_PROXY=http://proxy:8080
# Retry with verbose output
runevals --init-only --log-level debug
Permission Issues
# Check cache directory
runevals cache-dir
# Fix permissions (Unix/macOS)
chmod -R u+w "$(runevals cache-dir)"
Custom Python Runtime (PYTHON_PATH)
If the automatic Python download fails (e.g., network restrictions, unsupported platform), provide your own Python installation:
# Windows
set PYTHON_PATH=C:\Python313\python.exe
# macOS/Linux
export PYTHON_PATH=/usr/local/bin/python3.13
Python 3.13.x is the tested version. If a different version is found, you'll be prompted to confirm before proceeding. In CI/CD, a version mismatch fails automatically.
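To check an interpreter before pointing PYTHON_PATH at it, a short Python sketch such as the following can be used. Both function names are hypothetical, and this is not how the CLI itself performs the check; it simply runs the interpreter with --version and compares the major and minor numbers against 3.13.

```python
import re
import subprocess


def is_tested_version(version_output: str) -> bool:
    """Return True when the text reports Python 3.13.x."""
    match = re.search(r"Python (\d+)\.(\d+)", version_output)
    return bool(match) and (int(match.group(1)), int(match.group(2))) == (3, 13)


def interpreter_version(python_path: str) -> str:
    """Ask the interpreter at python_path for its version string."""
    result = subprocess.run(
        [python_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Older interpreters printed the version to stderr, so check both streams.
    return (result.stdout or result.stderr).strip()
```

For example, `is_tested_version(interpreter_version("/usr/local/bin/python3.13"))` reports whether that installation matches the tested version.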
📚 Advanced Documentation
- CI/CD Integration - GitHub Actions, Azure DevOps caching
- Testing Guide - Cross-platform testing procedures
- Python CLI Guide - Direct Python usage (without Node.js)
- Local Development Setup - Setting up the repo for local development
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Terms of Use
By using this tool, you agree to the Microsoft Software License Terms.
See LICENSE for the full license text.
