Building an AI-powered IT Ops agent for Onpremise servers using Microsoft Foundry Agent Service , Azure Arc & MCP

Introduction:

Recently, a customer asked me a question that really got me thinking:: "Can we build an AI agent that integrates with our datacenter — one that IT admins talk to in plain English, and it figures out which tools to run, in what order, to get the job done?"

Not a chatbot. Not a script wrapper. An agent that reasons about the problem, picks the right automation tools,runbooks & executes them, and explains what it found — all while respecting Security controls & ITSM policies

So I built one as MVP for Enterpise customer IT admins, Here are details of Solutions

Solution Overview

The solution is an AI-powered IT Operations agent that manages on-premises servers through conversation. You type "Why is my ABC server running slow?" and the agent SSHs into the server, runs diagnostics, analyzes the results, and tells you the root cause — in under few seconds depeneds on nature of questions

Following are high level components of this solution

Azure Arc : It Act as bridge between Azure and Datacenter servers, Your servers stay in your datacenter. Azure Arc installs a lightweight agent that creates a secure outbound private connection to Azure — no VPN, no public IPs, no firewall holes. The AI agent reaches your server securely through this relay using SSH

MCP Server: Building an MCP server for Azure Arc using fastMCP SDK , transform the API into two core AI-accessible components: Tools & Resources while Azure Arc API operations involve complex resource IDs and specific command syntax. An MCP server abstracts this so you can simply ask, "Which of my Arc servers in the datacenter have missing security patches"

AI Agent: Build using Foundry Agent SDK and hosted on Foundry Agent service, The agent receives your request, decides which tools to call and in what order, interprets the results, and follows up if needed. Ask it to investigate high CPU — it runs a full diagnosis first, spots a memory pressure error in the output, then pulls the event logs to confirm the root cause. That multi-step reasoning is what separates an agent from a converational bot

WebApp: Chat UI Interface which interacts with AI Agent running in Foundry Agent service runtime, it serves the chat UI, and it handles the agent conversation loop . When you type a message, the backend sends it to the Foundry Agent, waits for tool call requests, executes them by calling the MCP server's REST API, submits results back to the agent, and streams the final reply to your browser

Durable Functions: Alerts for Arc server configured in Azure monitor, when something goes wrong on Arc enable server — high CPU, a service crashes, disk filling up — Azure Monitor fires a webhook to a URL. That URL needs to be something that's always listening and ready to receive it.Functions gives us that always-on webhook endpoint without running a dedicated server 24/7. It sits idle costing nothing until an alert arrives, spins up in milliseconds, hands the alert to the agent, and goes back to sleep

Four Major componenst of Architecture:

Component	What it does	Where it runs
Web App	Chat interface + handles the AI-to-tool execution loop	Azure Container App
AI Agent	Reasons about problems, decides which tools to call	Microsoft Foundry Agent Service
MCP Server	Executes SSH commands on servers, returns results	Azure Container App
Durable Function	Receives alert webhooks, triggers autonomous investigations	Azure Functions

Architecture Flow

Key Design Decisions

Azure Arc:
Azure Arc Agent enable onpremise servers to become Az ARM resource, which allow to managed Arc enable using Azure supported interface like UI, CLI and Rest APIs security without being exposed to internet

Foundry Agent Service :
Zero infrastructure to manage. No GPU provisioning, no container orchestration for the AI runtime.Built-in conversation state. Foundry manages threads natively — each conversation has persistent memory across messages.Native tool-calling orchestration. Enterprise identity and compliance, agent authenticates via Managed Identity and Azure RBAC — same identity layer as every other Azure resource

SSH over Run Command APIs:
Azure Arc gives you two ways to execute commands on your on-prem server.
Run Command (Azure REST API) — sends a PowerShell script to Azure, which queues it on the Arc agent, which executes it, which reports the result back through Azure. Round trip: 45-90 seconds. Every single time.

SSH via Arc — opens a real-time SSH tunnel through the Arc relay. The MCP server runs az ssh arc, connects to the server's SSH daemon, executes the PowerShell command, gets the output. Round trip: 5-7 seconds. That includes the relay setup, SSH handshake, and command execution.SSH user can be restricted to specific OS-level groups (Event Log Readers, Performance Monitor Users)

Security Layers

All Authentication and Authorization are faciliated via Entra ID using managed identity of Azure, no local creds or Hardcoded credentails

Layer 1 : Identity & Authentication

Managed Identity (All Cloud Components)

Component	Identity Type	Authenticates To
Web App (Container App)	System-Assigned MI	Azure AI Foundry Agent Service
MCP Server (Container App)	System-Assigned MI	ARM API, Key Vault, Arc SSH Relay
Durable Function	System-Assigned MI	Azure AI Foundry Agent Service

Layer 2: Authorization

Layer 3: Network & Transport

Layer 4: OS/App Controls

Pre-Requisites

Azure Subscription
Onpremises Windows/Linux Server onboarded to Azure via Arc Agent
Az AI foundry Project with Reasoning LLM models (GPT models)
Python 3.11

💡 Tip:** If you don't have an Arc server yet, you can use Azure Arc Jumpstart to set one up quickly with ArcBox.

Walkthrough of Solution Demo:

Part 1: Read-Only Operations (no ticket required)

First, we demonstrate read-only operations — health checks, diagnostics, event log queries, service status. These run freely without any change ticket because they don't modify the server.

Health-Check:
Question on health check of server triggers tools call in MCP for server health check which perform several steps on server and responsd with results with Server Perfromance, Last errors and its service review

Part 2: Change-Controlled Operation (ticket required)

Next, an admin requests a risky operation — restarting a critical service. The agent does not proceed. Instead, it informs the admin that this is a change-controlled operation and asks for a valid ServiceNow CHG or INC ticket number.

Part 3: Ticket Validation — Denied

The admin provides a ticket number. The agent validates it against ServiceNow and finds the ticket is pending approval — not yet authorized for execution. The agent politely denies the request and asks the admin to get the ticket approved first.

Part 4: Ticket Approved — Execution

The admin returns after the ticket is approved. The agent re-validates against ServiceNow, confirms the ticket is now approved, and proceeds to execute the service restart. It reports the result back to the admin with the ticket number referenced for audit

For more details on Internal and source code of this solution,check out github repo:
Osshaikh/IT-Ops-Agent

Building an AI-powered IT Ops agent for Onpremise servers using Microsoft Foundry Agent Service , Azure Arc & MCP

Introduction:

Solution Overview

Security Layers

Managed Identity (All Cloud Components)

Comments

Agentic Ops

From Chaos to RCA: How Azure SRE Agent Investigates Production Incidents (Part 1)

More from this blog

From Chaos to RCA: How Azure SRE Agent Investigates Production Incidents (Part 1)

Designing a Unified Log Platform on Azure

Observability: End to End Monitoring (Part 1)

Oracle on Azure

Command Palette

Introduction:

Solution Overview

Security Layers

Managed Identity (All Cloud Components)

Comments

Agentic Ops

From Chaos to RCA: How Azure SRE Agent Investigates Production Incidents (Part 1)

More from this blog