<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Osama's Tech Blog]]></title><description><![CDATA[Osama's Tech Blog]]></description><link>https://blog.osshaikh.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 15:26:12 GMT</lastBuildDate><atom:link href="https://blog.osshaikh.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[From Chaos to RCA: How Azure SRE Agent Investigates Production Incidents (Part 1)]]></title><description><![CDATA[Introduction
It usually starts the same way.An application goes down. Users report errors. A alert gets fired. Someone opens the Azure portal, someone else starts checking logs, and within minutes the]]></description><link>https://blog.osshaikh.com/from-chaos-to-rca-a-practical-look-at-azure-sre-agent-part-1</link><guid isPermaLink="true">https://blog.osshaikh.com/from-chaos-to-rca-a-practical-look-at-azure-sre-agent-part-1</guid><category><![CDATA[Azure]]></category><category><![CDATA[agents]]></category><category><![CDATA[AI]]></category><category><![CDATA[ai agents]]></category><category><![CDATA[SRE]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[GitHub]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Thu, 16 Apr 2026 10:04:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/c798df5f-c514-404b-a095-16940b0f9c94.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2>
<p>It usually starts the same way. <strong>An application goes down.</strong> Users report errors. An alert gets fired. Someone opens the Azure portal, someone else starts checking logs, and within minutes the team is juggling dashboards, alerts, traces, deployment history, configuration settings, and GitHub commits — all while pressure keeps rising.</p>
<p>At first glance, modern cloud-native systems look highly observable. We have Azure Monitor, Application Insights, Datadog, Prometheus, New Relic, distributed tracing, metrics, logs, alerts, and dashboards everywhere. But when a real incident happens, most teams still fall back to the same manual process: <strong>hunt for symptoms, jump across tools, form a hypothesis, test it, repeat</strong>.</p>
<p>That approach breaks down quickly in environments built on microservices, containers, App Services, multiple data sources, and 24/7 uptime expectations. The problem isn’t that we lack telemetry. The problem is that <strong>humans still have to correlate everything fast enough to reduce customer impact</strong>.</p>
<p>This is where the Azure SRE Agent comes in.</p>
<p>Not as a chatbot.<br />Not as “AI for the sake of AI.”<br />But as an operational teammate that can reason through incidents, use prior knowledge, investigate across telemetry and code, and help engineers move from symptoms to root cause much faster.</p>
<p>In this post, I’ll walk through <strong>two practical scenarios</strong> from my lab environment:</p>
<ol>
<li><p><strong>A .NET application outage returning HTTP 503 errors</strong></p>
</li>
<li><p><strong>Autonomous handling of a high response time alert</strong></p>
</li>
</ol>
<p>The interesting part isn’t that the agent can read logs. Plenty of tools can do that.<br />The interesting part is how the agent combines <strong>telemetry, knowledge files, runtime configuration, and engineering workflows</strong> into a single investigation loop.</p>
<h2>Why SRE Agent?</h2>
<p>Over the years, one pattern has shown up again and again: adding more dashboards and more alerts does not automatically improve reliability. In many organizations, it does the opposite. It creates more noise, more fatigue, and more context switching.</p>
<p>Most incidents are not caused by a single obvious failure. They’re caused by a chain of small things:</p>
<ul>
<li><p>a config drift no one noticed,</p>
</li>
<li><p>a deployment mismatch,</p>
</li>
<li><p>an auth change,</p>
</li>
<li><p>a missing secret,</p>
</li>
<li><p>a misleading health signal,</p>
</li>
<li><p>a known issue sitting somewhere in a runbook nobody has time to read during an outage.</p>
</li>
</ul>
<p>This is where an SRE agent becomes interesting. Instead of only surfacing symptoms, it can:</p>
<ul>
<li><p>gather context from prior operational knowledge,</p>
</li>
<li><p>inspect metrics, logs, and platform state,</p>
</li>
<li><p>correlate signals across layers,</p>
</li>
<li><p>check code and repository history,</p>
</li>
<li><p>propose or apply remediation,</p>
</li>
<li><p>and keep the human engineer in control for sensitive actions.</p>
</li>
</ul>
<p>In other words, it doesn’t just help you <strong>observe</strong> a problem. It helps you <strong>investigate</strong> it.</p>
<h2>This is Part 1 of a 2-part series</h2>
<p><strong>Part 1:</strong> What Azure SRE Agent can do — real incident investigation and resolution workflows</p>
<p><strong>Part 2:</strong> How it works — architecture, configuration, tools, memory, permissions, and implementation details</p>
<h3>Real world Incident Scenarios:</h3>
<p><strong>Scenario 1: Application Not Accessible with HTTP 503 error</strong></p>
<p>The first scenario starts with a very common but frustrating production symptom: <strong>the application is down, but the platform looks fine</strong>.</p>
<p>I had a .NET application running on <strong>Azure App Service</strong>, backed by <strong>Azure SQL</strong>. When I opened the site, instead of the application UI, I was greeted with a generic application error page.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/375fbaca-6f83-4d69-8506-ceff8b2947e6.png" alt="" style="display:block;margin:0 auto" />

<p>At this point, nothing about the error was especially helpful. It told me the app was broken, but not <em>why</em>.</p>
<p>So I did what most engineers do first: I checked the Azure portal.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/6fcefe5b-810a-4702-aace-168f112f5f9e.png" alt="" style="display:block;margin:0 auto" />

<p>The initial signal was misleading. App Service was running. The database was running. Basic resource health didn’t immediately scream “critical platform issue.” There were HTTP 5xx errors visible in monitoring, but CPU and memory were not showing the kind of dramatic spike you’d normally expect during a major failure.</p>
<p>This is exactly the kind of situation where manual incident response starts to drag:</p>
<ul>
<li><p>the platform is “up,”</p>
</li>
<li><p>the app is effectively down,</p>
</li>
<li><p>the dashboards are incomplete,</p>
</li>
<li><p>and the root cause is hiding somewhere between runtime, config, and code.</p>
</li>
</ul>
<p>So instead of manually hopping across five different tools, I asked the <strong>Azure SRE Agent</strong> to investigate:</p>
<blockquote>
<p>“My dotnet application is not working, I see application error on webpage, can you investigate and find out why is that?”</p>
</blockquote>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/52127db3-a748-4b1b-a927-f804330ee72a.png" alt="" style="display:block;margin:0 auto" />

<p>What happened next is what made the experience feel different from traditional monitoring.</p>
<p>The agent did not begin with blind trial and error. It first pulled in context from its <strong>knowledge files</strong>, specifically files like <code>debugging.md</code> and <code>deployment.md</code>. That’s important, because real incident response is rarely just about live telemetry. It also depends on <strong>what the team already knows</strong> about the service.</p>
<p>In my setup, the agent’s operational memory is not stored as a messy chronological log. It is organized <strong>semantically by topic</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/f8bbc161-af02-4f11-ada4-9ec63357623a.png" alt="" style="display:block;margin:0 auto" />

<p>That structure matters. Instead of searching through one giant document, the agent can jump directly into focused knowledge areas such as:</p>
<ul>
<li><p>service overview,</p>
</li>
<li><p>team and ownership,</p>
</li>
<li><p>architecture,</p>
</li>
<li><p>logs and queries,</p>
</li>
<li><p>deployment details,</p>
</li>
<li><p>auth behavior,</p>
</li>
<li><p>debugging patterns,</p>
</li>
<li><p>and known issue references.</p>
</li>
</ul>
<p>That gives the agent something most dashboards do not: <strong>historical operational context</strong>.</p>
<p>In this case, the knowledge files already referenced a <strong>known recurring issue around SQL authentication</strong>. So the agent used that as a working hypothesis, but importantly, it didn’t stop there. It still moved on to validate the application’s real-time state through telemetry and platform inspection.</p>
<p>The agent checked:</p>
<ul>
<li><p>App Service health and metrics,</p>
</li>
<li><p>container/runtime configuration,</p>
</li>
<li><p>logs,</p>
</li>
<li><p>startup behavior,</p>
</li>
<li><p>and application exceptions.</p>
</li>
</ul>
<p>It noticed that CPU and memory were not showing meaningful activity, which suggested something deeper than a normal runtime load problem. It also recognized that the container might not actually be starting correctly.</p>
<p>Because I had configured the agent in <strong>review mode</strong>, it did not make changes silently. When it needed to access or modify configuration, it explicitly asked for approval.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/2b72cf7b-cc69-4fdc-9860-712ff15c2cf6.png" alt="" style="display:block;margin-left:auto" />

<p>That review model is a big part of what makes this usable in real enterprise environments. The value is not “uncontrolled automation.” The value is <strong>fast investigation with human guardrails</strong>.</p>
<p>Once approved, the agent started walking through App Service configuration, logs, and runtime settings.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/a4d7c6f3-c3ee-4517-b388-aaf37df88ee2.png" alt="" style="display:block;margin:0 auto" />

<p>The first clear issue it found was not in the application code at all — it was in the hosting configuration.</p>
<p>The container image exposed port <strong>8080</strong>, but the App Service configuration was missing the required <code>WEBSITES_PORT</code> setting. Even if the app code and database auth had been correct, App Service would still not have routed traffic properly to the container.</p>
<p>At the same time, the agent also confirmed the earlier knowledge-based suspicion: the application was using <strong>SQL authentication</strong>, while the SQL server was configured in a way that conflicted with that access pattern.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/2b8314bf-90fc-4d57-8ee9-7dc0a9d54cbb.png" alt="" style="display:block;margin:0 auto" />

<p>This is where the investigation started to become more powerful than a typical dashboard workflow.</p>
<p>A human engineer under pressure might fix the first visible issue and move on. But the agent continued reasoning through the dependency chain:</p>
<ul>
<li><p>one issue in container routing,</p>
</li>
<li><p>another in database authentication,</p>
</li>
<li><p>and likely more hidden misconfigurations behind them.</p>
</li>
</ul>
<p>It applied the first round of fixes and re-validated the service.</p>
<p>But the application was <strong>still returning HTTP 503</strong>.</p>
<p>That’s the point where many incident bridges get stuck: one fix has been applied, the expectation is recovery, and when recovery doesn’t happen, the whole investigation needs to widen again.</p>
<p>The agent did exactly that. It went back into the App Service configuration and noticed something subtle but critical: the <code>DOCKER_REGISTRY_SERVER_PASSWORD</code> setting was effectively empty. That meant the container could not pull its image from Azure Container Registry correctly.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/8fe54bf4-7288-4947-8d59-8250a6bef657.png" alt="" style="display:block;margin:0 auto" />

<p>Now the picture was clear. This outage was not caused by one dramatic failure. It was caused by <strong>multiple compounding misconfigurations</strong>:</p>
<ol>
<li><p>Missing container port configuration (<code>WEBSITES_PORT</code>)</p>
</li>
<li><p>SQL authentication mismatch</p>
</li>
<li><p>Missing container registry credentials</p>
</li>
<li><p>And, as the final investigation summary confirmed, a connectivity issue related to SQL access configuration as well</p>
</li>
</ol>
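<p>To make the configuration side of this incident concrete, here is a hedged, illustrative sketch of the App Service app settings involved, written as simple key/value pairs. Only <code>WEBSITES_PORT</code> and <code>DOCKER_REGISTRY_SERVER_PASSWORD</code> were explicitly called out by the agent; the registry URL and username settings are included just to round out the picture, and all values are placeholders.</p>
<pre><code class="lang-yaml"># Illustrative App Service app settings (values are placeholders, not from the real environment)
WEBSITES_PORT: "8080"                                          # must match the port the container image exposes
DOCKER_REGISTRY_SERVER_URL: "https://&lt;acr-name&gt;.azurecr.io"    # registry endpoint (assumed for illustration)
DOCKER_REGISTRY_SERVER_USERNAME: "&lt;acr-name&gt;"                  # registry credential (assumed for illustration)
DOCKER_REGISTRY_SERVER_PASSWORD: "&lt;registry-password&gt;"         # this setting was effectively empty during the incident
</code></pre>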
<p>After applying the necessary fixes, the agent validated the service again.</p>
<p>This time, the results were exactly what you want to see during an incident: the application recovered, the endpoints returned <strong>HTTP 200</strong>, and the UI became available again.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/56e3d1a2-3a1f-4474-b11d-8dd78cb99d41.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/6e42534c-b804-4a5d-8acc-299d866b19ba.png" alt="" style="display:block;margin:0 auto" />

<p>What I liked most about this scenario is that the agent did not behave like a scripted “runbook bot.” It behaved more like a junior-but-fast SRE partner:</p>
<ul>
<li><p>it started from known issues,</p>
</li>
<li><p>validated live telemetry,</p>
</li>
<li><p>checked config,</p>
</li>
<li><p>correlated runtime behavior,</p>
</li>
<li><p>asked for approval before making changes,</p>
</li>
<li><p>and kept investigating even after the first fix did not fully resolve the incident.</p>
</li>
</ul>
<p>The final recommendation was also the right long-term one:<br />instead of relying on fragile SQL credentials, move the application toward <strong>Managed Identity / Entra-based authentication</strong> to reduce future failures.</p>
<p>That is what a good SRE workflow should do: not only restore service, but also point toward a more durable operating model.</p>
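<p>As a reference, here is a hedged sketch of what that shift can look like at the configuration level: a connection string app setting that uses Entra ID / Managed Identity instead of a SQL username and password. The setting name and server/database values are placeholders; the <code>Authentication</code> keyword follows Microsoft.Data.SqlClient syntax.</p>
<pre><code class="lang-yaml"># Hypothetical App Service connection string setting using Managed Identity (no SQL credentials)
ConnectionStrings__DefaultConnection: "Server=tcp:&lt;sql-server&gt;.database.windows.net,1433;Database=&lt;db-name&gt;;Authentication=Active Directory Managed Identity;Encrypt=True;"
</code></pre>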
<h3><strong>Scenario 2: Autonomous Handling of Incident</strong></h3>
<p>The <strong>first</strong> scenario showed guided investigation in review mode.<br />The <strong>second scenario</strong> shows what happens when the agent is connected to an <strong>incident platform</strong> and can begin working automatically from an alert.</p>
<p>This time, the trigger was not a user-facing outage page. It was a <strong>Sev2 Azure Monitor alert</strong> for <strong>high response time</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/9c214330-26b5-4ba7-a0c1-31f0828d913c.png" alt="" style="display:block;margin:0 auto" />

<p>The alert indicated that requests were averaging over 5 seconds. That kind of alert is common — and often noisy. Sometimes it points to a real production issue. Sometimes it reflects a spike, a bad test path, or a monitoring rule that needs more context.</p>
<p>What matters is how quickly the investigation can distinguish between those possibilities.</p>
<p>As soon as the alert was detected, the SRE Agent acknowledged it and created an investigation plan.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/90c97fa3-686b-41b7-9893-0cdb31216df6.png" alt="" style="display:block;margin:0 auto" />

<p>This part is important.</p>
<p>Instead of immediately jumping to one likely cause, the agent laid out a structured plan:</p>
<ol>
<li><p>gather knowledge from memory files,</p>
</li>
<li><p>search past incidents,</p>
</li>
<li><p>inspect the currently affected app,</p>
</li>
<li><p>correlate recent changes,</p>
</li>
<li><p>and then mitigate.</p>
</li>
</ol>
<p>That sequencing mirrors how a strong human SRE would approach the issue — but the agent can do it in parallel and without fatigue.</p>
<p>During investigation, it pulled from prior knowledge, reviewed logs, and correlated the alert with application behavior. It discovered that the high response time was not simply a generic infrastructure slowdown. A key part of the latency was being driven by <strong>chaos/testing endpoints</strong> being hit and by application behavior that still needed code fixes.</p>
<p>The agent did something particularly valuable here: it connected the incident to the <strong>engineering workflow</strong>, not just to the monitoring workflow.</p>
<p>It created a <strong>GitHub issue</strong> summarizing the incident, identifying the root cause, and documenting the remediation path.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/9fc18887-4451-4f2a-874a-3e3a6186d0e9.png" alt="" style="display:block;margin:0 auto" />

<p>That issue became more than an alert record. It became an engineering artifact:</p>
<ul>
<li><p>what happened,</p>
</li>
<li><p>what caused it,</p>
</li>
<li><p>what had already been investigated,</p>
</li>
<li><p>and what change was needed next.</p>
</li>
</ul>
<p>The agent then moved toward code-level remediation:</p>
<ul>
<li><p>it identified the needed changes,</p>
</li>
<li><p>prepared a fix branch / PR flow,</p>
</li>
<li><p>and connected the operational incident to the software delivery pipeline.</p>
</li>
</ul>
<p>The remediation path included both <strong>monitoring-side correction</strong> and <strong>application-side correction</strong>:</p>
<ul>
<li><p>refining the alert rule logic so chaos endpoints did not distort signal quality,</p>
</li>
<li><p>and updating application behavior to reduce timeout-related failure patterns.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/f7a4b993-42fe-4a2e-afde-8b91c48059aa.png" alt="" style="display:block;margin:0 auto" />

<p>What makes this scenario compelling is that the agent did not stop at “I found the problem.” It helped push the process all the way toward:</p>
<ul>
<li><p>alert understanding,</p>
</li>
<li><p>signal cleanup,</p>
</li>
<li><p>code fix identification,</p>
</li>
<li><p>deployment workflow,</p>
</li>
<li><p>and verification.</p>
</li>
</ul>
<p>That is a very different experience from the traditional pattern where one tool raises the alert, another tool shows logs, a third tool tracks the issue, and a fourth system eventually gets the fix.</p>
<p>Here, the incident workflow becomes much more continuous.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/77733ce7-c793-446a-9daa-97b80755f4c7.png" alt="" style="display:block;margin:0 auto" />

<h3>What Scenario 2 Demonstrates</h3>
<p>If Scenario 1 was about guided troubleshooting, Scenario 2 is about something broader: <strong>operational orchestration</strong>.</p>
<p>The agent is not just a log reader. It becomes a bridge between:</p>
<ul>
<li><p>observability,</p>
</li>
<li><p>incident handling,</p>
</li>
<li><p>historical context,</p>
</li>
<li><p>GitHub engineering workflows,</p>
</li>
<li><p>and remediation execution.</p>
</li>
</ul>
<p>That’s where the SRE model starts to evolve.</p>
<p>Instead of engineers spending most of their time gathering context, the agent does the context gathering first. The human engineer can then focus on:</p>
<ul>
<li><p>approving sensitive changes,</p>
</li>
<li><p>validating business impact,</p>
</li>
<li><p>reviewing code changes,</p>
</li>
<li><p>and making the final operational judgment.</p>
</li>
</ul>
<p>That shift matters because the biggest cost in many incidents is not only the outage itself — it is the <strong>time lost assembling the investigation</strong>.</p>
<h3><strong>Why THIS Matters</strong></h3>
<p>There’s a lot of hype right now around AI in operations. Some of it is useful. A lot of it is vague. That’s why I prefer to evaluate these systems through practical incident scenarios.</p>
<p>For me, the interesting question is not:</p>
<blockquote>
<p>“Can AI read logs?”</p>
</blockquote>
<p>The interesting question is:</p>
<blockquote>
<p>“Can an agent meaningfully reduce investigation time, correlate across layers, and safely help engineers move toward resolution?”</p>
</blockquote>
<p>From these scenarios, the answer looks increasingly like <strong>yes, with the right guardrails</strong>.</p>
<p>What stood out to me most was not raw automation. It was the combination of:</p>
<ul>
<li><p><strong>memory</strong> (knowledge files and past learnings),</p>
</li>
<li><p><strong>reasoning</strong> (moving beyond one-signal diagnosis),</p>
</li>
<li><p><strong>workflow integration</strong> (Azure resources + GitHub),</p>
</li>
<li><p>and <strong>human control</strong> (approval gates for risky actions).</p>
</li>
</ul>
<p>That combination is what takes an SRE agent beyond a fancy assistant and makes it operationally relevant.</p>
<h3><strong>Conclusion</strong></h3>
<p>Traditional monitoring stacks are very good at telling us <strong>something is wrong</strong>.</p>
<p>What they usually do not do well is help us answer, quickly and reliably:</p>
<ul>
<li><p>What exactly failed?</p>
</li>
<li><p>What changed?</p>
</li>
<li><p>Is this a known issue?</p>
</li>
<li><p>Is it platform, config, code, auth, or deployment?</p>
</li>
<li><p>What is the safest next action?</p>
</li>
<li><p>And how do we prevent this from happening again?</p>
</li>
</ul>
<p>That is the gap Azure SRE Agent begins to fill.</p>
<p>In the <strong>first scenario</strong>, it helped trace a 503 outage across telemetry, memory, config, registry credentials, and SQL access issues.<br />In the <strong>second</strong>, it moved from alert detection into structured incident investigation and GitHub-driven remediation.</p>
]]></content:encoded></item><item><title><![CDATA[Building an AI-powered IT Ops agent for on-premises servers using Microsoft Foundry Agent Service, Azure Arc & MCP]]></title><description><![CDATA[Introduction:
Recently, a customer asked me a question that really got me thinking:: "Can we build an AI agent that integrates with our datacenter — one that IT admins talk to in plain English, and it]]></description><link>https://blog.osshaikh.com/building-an-ai-powered-it-ops-agent</link><guid isPermaLink="true">https://blog.osshaikh.com/building-an-ai-powered-it-ops-agent</guid><category><![CDATA[Azure]]></category><category><![CDATA[agents]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[foundry]]></category><category><![CDATA[openai]]></category><category><![CDATA[agentic]]></category><category><![CDATA[mcp]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Tue, 03 Mar 2026 14:25:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/ac9cdccb-cc38-4689-994f-9604c20dc509.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction:</h3>
<p>Recently, a customer asked me a question that really got me thinking: "Can we build an AI agent that integrates with our datacenter — one that IT admins talk to in plain English, and it figures out which tools to run, in what order, to get the job done?"</p>
<p>Not a chatbot. Not a script wrapper. An agent that reasons about the problem, picks the right automation tools and runbooks, executes them, and explains what it found — all while respecting security controls &amp; ITSM policies.</p>
<p>So I built one as an MVP for enterprise customer IT admins. Here are the details of the solution.</p>
<h3>Solution Overview</h3>
<p>The solution is an AI-powered IT Operations agent that manages on-premises servers through conversation. You type "Why is my ABC server running slow?" and the agent SSHs into the server, runs diagnostics, analyzes the results, and tells you the root cause — usually within a few seconds, depending on the nature of the question.</p>
<p><strong>The following are the high-level components of this solution</strong></p>
<p><strong>Azure Arc</strong>: acts as the bridge between Azure and the datacenter servers. Your servers stay in your datacenter; Azure Arc installs a lightweight agent that creates a secure outbound private connection to Azure — no VPN, no public IPs, no firewall holes. The AI agent reaches your server securely through this relay using SSH.</p>
<p><strong>MCP Server</strong>: an MCP server for Azure Arc built with the FastMCP SDK, which exposes the API as two core AI-accessible components: Tools &amp; Resources. Azure Arc API operations involve complex resource IDs and specific command syntax; the MCP server abstracts this so you can simply ask, <em>"Which of my Arc servers in the datacenter have missing security patches?"</em></p>
<p><strong>AI Agent</strong>: built using the Foundry Agent SDK and hosted on the Foundry Agent Service. The agent receives your request, decides which tools to call and in what order, interprets the results, and follows up if needed. Ask it to investigate high CPU — it runs a full diagnosis first, spots a memory-pressure error in the output, then pulls the event logs to confirm the root cause. That multi-step reasoning is what separates an agent from a conversational bot.</p>
<p><strong>WebApp</strong>: the chat UI that interacts with the AI agent running in the Foundry Agent Service runtime. It serves the chat UI and handles the agent conversation loop: when you type a message, the backend sends it to the Foundry Agent, waits for tool-call requests, executes them by calling the MCP server's REST API, submits results back to the agent, and streams the final reply to your browser.</p>
<p><strong>Durable Functions</strong>: alerts for the Arc servers are configured in Azure Monitor. When something goes wrong on an Arc-enabled server — high CPU, a service crash, a disk filling up — Azure Monitor fires a webhook to a URL, and that URL needs to be always listening and ready to receive it. Azure Functions gives us that always-on webhook endpoint without running a dedicated server 24/7: it sits idle costing nothing until an alert arrives, spins up in milliseconds, hands the alert to the agent, and goes back to sleep.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/c2040046-e525-4ecd-a1ef-55843f25070a.png" alt="" style="display:block;margin:0 auto" />

<p>Four major components of the architecture:</p>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>What it does</strong></th>
<th><strong>Where it runs</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Web App</strong></td>
<td>Chat interface + handles the AI-to-tool execution loop</td>
<td>Azure Container App</td>
</tr>
<tr>
<td><strong>AI Agent</strong></td>
<td>Reasons about problems, decides which tools to call</td>
<td>Microsoft Foundry Agent Service</td>
</tr>
<tr>
<td><strong>MCP Server</strong></td>
<td>Executes SSH commands on servers, returns results</td>
<td>Azure Container App</td>
</tr>
<tr>
<td><strong>Durable Function</strong></td>
<td>Receives alert webhooks, triggers autonomous investigations</td>
<td>Azure Functions</td>
</tr>
</tbody></table>
<p><strong>Architecture Flow</strong></p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/96a7b16f-de82-44bc-afdf-7d756150d9f5.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Key Design Decisions</strong></p>
<p><strong>Azure Arc:</strong><br />The Azure Arc agent enables on-premises servers to become Azure ARM resources, which allows Arc-enabled servers to be managed through Azure-supported interfaces such as the portal UI, CLI, and REST APIs, securely and without being exposed to the internet.</p>
<p><strong>Foundry Agent Service:</strong><br />Zero infrastructure to manage: no GPU provisioning, no container orchestration for the AI runtime. Built-in conversation state: Foundry manages threads natively, so each conversation has persistent memory across messages. Native tool-calling orchestration. Enterprise identity and compliance: the agent authenticates via Managed Identity and Azure RBAC — the same identity layer as every other Azure resource.</p>
<p><strong>SSH over Run Command APIs:</strong><br />Azure Arc gives you two ways to execute commands on your on-prem server.<br /><strong>Run Command</strong> (Azure REST API) — sends a PowerShell script to Azure, which queues it on the Arc agent, which executes it, which reports the result back through Azure. Round trip: 45-90 seconds. Every single time.</p>
<p><strong>SSH via Arc</strong> — opens a real-time SSH tunnel through the Arc relay. The MCP server runs <code>az ssh arc</code>, connects to the server's SSH daemon, executes the PowerShell command, and gets the output. Round trip: 5-7 seconds, including the relay setup, SSH handshake, and command execution. The SSH user can be restricted to specific OS-level groups (Event Log Readers, Performance Monitor Users).</p>
<h3>Security Layers</h3>
<p>All authentication and authorization are facilitated via Entra ID using Azure managed identities; there are no local or hardcoded credentials.</p>
<p>Layer 1: Identity &amp; Authentication</p>
<h4><strong>Managed Identity (All Cloud Components)</strong></h4>
<table>
<thead>
<tr>
<th><strong>Component</strong></th>
<th><strong>Identity Type</strong></th>
<th><strong>Authenticates To</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Web App</strong> (Container App)</td>
<td>System-Assigned MI</td>
<td>Azure AI Foundry Agent Service</td>
</tr>
<tr>
<td><strong>MCP Server</strong> (Container App)</td>
<td>System-Assigned MI</td>
<td>ARM API, Key Vault, Arc SSH Relay</td>
</tr>
<tr>
<td><strong>Durable Function</strong></td>
<td>System-Assigned MI</td>
<td>Azure AI Foundry Agent Service</td>
</tr>
</tbody></table>
<p>Layer 2: Authorization</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/e9b6ede2-9025-4843-886b-b2cd3aacfe50.png" alt="" style="display:block;margin:0 auto" />

<p>Layer 3: Network &amp; Transport</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/832018df-7f1d-49cf-adc1-7774f361c4c8.png" alt="" style="display:block;margin:0 auto" />

<p>Layer 4: OS/App Controls</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/30d8c86a-3889-489c-a33a-ec760f3b120c.png" alt="" style="display:block;margin:0 auto" />

<p>Prerequisites</p>
<ul>
<li><p>Azure Subscription</p>
</li>
<li><p>On-premises Windows/Linux server onboarded to Azure via the Arc agent</p>
</li>
<li><p>Azure AI Foundry project with reasoning LLM models (GPT models)</p>
</li>
<li><p>Python 3.11</p>
</li>
</ul>
<p>💡 <strong>Tip:</strong> If you don't have an Arc server yet, you can use <a href="https://azurearcjumpstart.com/">Azure Arc Jumpstart</a> to set one up quickly with ArcBox.</p>
<p>Walkthrough of Solution Demo:</p>
<p>Part 1: Read-Only Operations (no ticket required)</p>
<p>First, we demonstrate read-only operations — health checks, diagnostics, event log queries, service status. These run freely without any change ticket because they don't modify the server.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/a2ed46b2-b96f-41a2-87b7-0d5700b65ff5.png" alt="" style="display:block;margin:0 auto" />

<p>Health check:<br />A question about the health of a server triggers the corresponding tool call in the MCP server, which performs several steps on the server and responds with results covering server performance, recent errors, and a review of its services.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/d22b60a1-9892-4cab-8421-73b108a94c83.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/a79b23a4-3845-4eb4-85ff-a17f05046802.png" alt="" style="display:block;margin:0 auto" />

<p>Part 2: Change-Controlled Operation (ticket required)</p>
<p>Next, an admin requests a risky operation — restarting a critical service. The agent does not proceed. Instead, it informs the admin that this is a change-controlled operation and asks for a valid ServiceNow CHG or INC ticket number.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/5ea1511c-406f-46e8-8e0b-d695487e9767.png" alt="" style="display:block;margin:0 auto" />

<p>Part 3: Ticket Validation — Denied</p>
<p>The admin provides a ticket number. The agent validates it against ServiceNow and finds the ticket is pending approval — not yet authorized for execution. The agent politely denies the request and asks the admin to get the ticket approved first.</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/11c25748-6122-494e-98cc-0034f63367eb.png" alt="" style="display:block;margin:0 auto" />

<p>Part 4: Ticket Approved — Execution</p>
<p>The admin returns after the ticket is approved. The agent re-validates against ServiceNow, confirms the ticket is now approved, and proceeds to execute the service restart. It reports the result back to the admin with the ticket number referenced for audit</p>
<img src="https://cdn.hashnode.com/uploads/covers/65bcb0daad92b2d5776669b3/cf4565fe-5af5-4c13-ad2a-a668fea5ae83.png" alt="" style="display:block;margin:0 auto" />

<p>For more details on the internals and source code of this solution, check out the GitHub repo:<br /><a href="https://github.com/Osshaikh/IT-Ops-Agent">Osshaikh/IT-Ops-Agent</a></p>
]]></content:encoded></item><item><title><![CDATA[Designing a Unified Log Platform on Azure]]></title><description><![CDATA[Introduction:
Recently I have seen a similar requirement across many enterprises, especially in FSI: a centralised log store and analytics solution spanning their IT infrastructure
In the modern enterprise, IT infrastructure is a complex tapestry woven fr...]]></description><link>https://blog.osshaikh.com/unified-log-platform-azure</link><guid isPermaLink="true">https://blog.osshaikh.com/unified-log-platform-azure</guid><category><![CDATA[Azure]]></category><category><![CDATA[Logs]]></category><category><![CDATA[analytics]]></category><category><![CDATA[#AIOps]]></category><category><![CDATA[storage]]></category><category><![CDATA[data]]></category><category><![CDATA[Data-lake]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Fri, 29 Aug 2025 11:52:37 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction:</h3>
<p>Recently I have seen a similar requirement across many enterprises, especially in FSI: a centralised log store and analytics solution spanning their IT infrastructure.</p>
<p>In the modern enterprise, IT infrastructure is a complex tapestry woven from on-premises systems, private clouds, and multiple public cloud environments. For large organizations, particularly those in heavily regulated sectors like financial services, managing the sheer volume and diversity of machine-generated logs is not just an operational hurdle – it's a fundamental business requirement for maintaining security, ensuring compliance, and driving performance.</p>
<p>A financial institution faces stringent regulatory requirements for log retention, security monitoring, and compliance reporting. In this case, the current log volume exceeds 100TB daily, creating significant challenges for storage, retention, and cost optimization.</p>
<h3 id="heading-key-challenges"><strong>Key Challenges:</strong></h3>
<p>The complexity and scale of a large enterprise's infrastructure estate, as observed with our financial-sector customer, present significant challenges for traditional or decentralized log management approaches:</p>
<ol>
<li><p><strong>Massive Data Volume:</strong> Handling the ingestion, processing, and storage of hundreds of terabytes (100TB+) of log data daily requires a platform built for scale and efficiency from the ground up.</p>
</li>
<li><p><strong>Diverse Log Sources:</strong> Logs originate from a heterogeneous mix of operating systems (Windows, Linux, Unix, Solaris, mainframes), network devices, applications, security tools, and cloud services (Azure, AWS, GCP), each generating log data in different formats. Normalizing and analyzing these disparate logs is complex.</p>
</li>
<li><p><strong>Compliance Requirements:</strong> Financial institutions face stringent regulations (e.g., SOX, PCI DSS, GDPR, and local financial laws) mandating specific log retention periods (often multi-year), data immutability, audit trails, and strict access controls. Meeting these across a distributed IT environment is critical.</p>
</li>
<li><p><strong>Cost Management:</strong> Storing and analyzing large volumes of data for extended periods can be prohibitively expensive. Optimizing storage costs while maintaining compliance and analytical capabilities is a key concern.</p>
</li>
<li><p><strong>Security Analytics:</strong> Effectively detecting and responding to increasingly sophisticated cyber threats requires aggregating security-relevant logs from all sources and applying advanced analytics and threat intelligence in near real-time.</p>
</li>
<li><p><strong>Performance Troubleshooting:</strong> Quickly identifying the root cause of performance issues across complex application stacks and infrastructure components relies on having centralized, searchable, and correlated performance logs.</p>
</li>
<li><p><strong>Operational Efficiency:</strong> Managing decentralized log collection, storage silos, and disparate analysis tools leads to high operational overhead and hinders rapid insights.</p>
</li>
</ol>
<p>Expectation:</p>
<h3 id="heading-architecture-overview">Architecture Overview:</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758124627927/b884ff75-2e2d-469b-88a7-4eb8b34c86dc.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-data-collection-options">Data Collection Options:</h3>
<p>Being able to collect logs across all these diverse sources is the first major problem, as the estate includes a large number of physical security/network appliances and even many legacy operating systems.</p>
<p>In this solution, different log collection methods are used depending on the type of source. The following is a list of the log collection options available that can be integrated with ADX.</p>
<h3 id="heading-agent-based"><strong>Agent based</strong></h3>
<ul>
<li><p><strong>Azure Monitor Agent</strong><br />  <strong>When should you use AMA?</strong><br />  For Windows/Linux VMs (on-prem or cloud) where you need native Azure integration. Collect Windows Event Logs, Syslog, Performance Counters, IIS logs. Best for time-series telemetry and routing to Log Analytics or ADX via Data Collection Rules (DCR). Ideal for VM-centric monitoring and hybrid environments (with Azure Arc).</p>
</li>
<li><p><strong>Fluent Bit</strong><br />  <strong>When should you use Fluent Bit?</strong><br />  For containers (AKS/Kubernetes), IoT edge, or custom application/system logs. When you need lightweight, high-performance log forwarding with rich parsing (JSON, regex, multiline). Supports multiple outputs (Event Hubs, ADX, Blob, OTLP). Great for streaming pipelines and edge scenarios.<br />  Fluent Bit started as a <strong>log processor and forwarder</strong>, and that’s still its sweet spot. It can now handle metrics and even output in <strong>OTLP</strong>, but its core strength is <strong>fast, lightweight log pipelines</strong>.<br />  <strong>Flexible</strong>: it supports a huge range of outputs — <strong>Kafka, Loki, ADX, Blob, OTLP/HTTP</strong>, and more.</p>
</li>
<li><p><strong>OTel Collector</strong><br />  It natively handles <strong>traces, metrics, and logs</strong> through receivers, processors, exporters, and connectors, providing <strong>multi-signal observability</strong> in a unified pipeline.</p>
<p>  <strong>When should you use the OTel Collector?</strong><br />  When you need multi-signal observability: logs + metrics + traces in one pipeline. For microservices, distributed apps, or multi-cloud environments. Provides advanced processors (sampling, filtering, redaction) and vendor-neutral OTLP. Scales well for complex observability pipelines and integrates with Azure Monitor, ADX, or Event Hubs.</p>
</li>
</ul>
<h3 id="heading-agents-os-supportiblity-matrix"><strong>Agents OS Supportiblity Matrix</strong></h3>
<div class="hn-table">
<table>
<thead>
<tr>
<th><strong>Collector</strong></th><th><strong>Linux</strong></th><th><strong>Windows</strong></th><th><strong>macOS</strong></th><th><strong>FreeBSD</strong></th><th><strong>Solaris</strong></th><th><strong>AIX</strong></th><th><strong>Remarks</strong></th><th><strong>OS-specific limitations</strong></th></tr>
</thead>
<tbody>
<tr>
<td><strong>OpenTelemetry Collector</strong></td><td>✔ Official DEB/RPM/APK packages and Docker images</td><td>✔ MSI &amp; ZIP builds</td><td>⚠ Build from source; no official binary</td><td>✖ Unsupported – build failures reported</td><td>✖ Not supported</td><td>✖ Feature request open; no builds yet</td><td>First-class support only for Linux (x86-64/ARM64/PPC64-le) and Windows; multi-signal pipeline (logs + metrics + traces).</td><td>- macOS: manual Go tool-chain build, no CI.<br />- FreeBSD: known compilation errors.<br />- Solaris/AIX: no upstream support; cannot build some Go deps.</td></tr>
<tr>
<td><strong>Fluent Bit</strong></td><td>✔ RPM/DEB/Alpine packages &amp; Docker images</td><td>✔ Native installer &amp; Chocolatey</td><td>✔ Homebrew / tar.gz</td><td>✔ FreeBSD ports/pkg</td><td>⚠ Source build only; some plugins absent</td><td>⚠ Source build possible; no upstream packages</td><td>Very small C binary; wide plugin ecosystem for logs; OTLP input/output available.</td><td>- Solaris/AIX: must compile; verify needed plugins compile.<br />- FreeBSD: some inputs/outputs may be missing.</td></tr>
<tr>
<td><strong>Azure Arc Agent</strong> (Connected-Machine + Log Analytics)</td><td>✔ Ubuntu, RHEL, SLES, CentOS, Amazon Linux, Oracle Linux</td><td>✔ Windows Server 2008 R2/2012 R2 and newer (incl. Core)</td><td>✖ Not supported</td><td>✖ Not supported</td><td>✖ Not supported</td><td>✖ Not supported</td><td>Purpose-built for Azure governance/monitoring; ships only for Windows &amp; mainstream Linux.</td><td>- Linux: limited to listed distros; requires systemd &amp; kernel ≥ 3.10.<br />- Windows: only server SKUs; no Windows 10/11.<br />- All other Unix variants unsupported.</td></tr>
</tbody>
</table>
</div><h3 id="heading-log-streaming">Log Streaming</h3>
<ul>
<li><p><strong>Event Hubs (Kafka API)</strong>:<br />  When should you use Event Hubs?<br />  High-throughput streaming of logs, metrics, or events from multiple producers. When you need buffering, replay, and decoupling between data producers and consumers. Ideal for real-time pipelines where logs flow from Fluent Bit or the OTel Collector to downstream analytics (ADX, SIEM, Data Lake). Use the Kafka API when existing tools or agents already speak the Kafka protocol.</p>
<p>  Why use Event Hubs (streaming)?</p>
<p>  <strong>Flexibility</strong>: log sources only need to know about the Kafka stream; backend teams can swap the underlying data/log warehouse on Azure without touching edge agents.</p>
<p>  <strong>Durability</strong>: never lose data – logs are written to disk and replicated, so they survive node failures and can be replayed after outages. A minimal edge-agent output sketch targeting Event Hubs follows below.</p>
</li>
</ul>
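<p>To make the "agents only need to know about the Kafka stream" point concrete, here is a minimal, hedged sketch of a Fluent Bit output (YAML configuration format) pointed at the Event Hubs Kafka endpoint. The namespace, event hub name, and connection string are placeholders; Event Hubs speaks Kafka on port 9093 over SASL_SSL with the literal username <code>$ConnectionString</code>.</p>
<pre><code class="lang-yaml"># Hedged sketch: Fluent Bit shipping logs to Azure Event Hubs via the Kafka API
pipeline:
  outputs:
    - name: kafka
      match: "*"
      brokers: mynamespace.servicebus.windows.net:9093   # Event Hubs namespace FQDN (placeholder)
      topics: app-logs                                    # event hub name used as the Kafka topic (placeholder)
      rdkafka.security.protocol: SASL_SSL
      rdkafka.sasl.mechanism: PLAIN
      rdkafka.sasl.username: "$ConnectionString"          # literal value required by Event Hubs
      rdkafka.sasl.password: "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=..."
</code></pre>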
<h3 id="heading-schedule-ingestion">Schedule Ingestion</h3>
<ul>
<li>Azure Data Factory (batch jobs):<br />  When should you use ADF?<br />  For batch ingestion of log files from on-premises or multi-cloud environments. When logs are stored as CSV, JSON, Parquet, or other file formats and need scheduled or event-driven movement. Best for legacy systems or offline sources where real-time streaming isn’t possible. Provides secure hybrid connectivity and broad connector coverage.</li>
</ul>
<h3 id="heading-log-collection-options"><strong>Log Collection options :</strong></h3>
<p>Azure native solution provide comprehensive capablities in different type log Injestion from diverse log sources</p>
<h3 id="heading-windows-servers">Windows Servers</h3>
<ul>
<li><p>Windows Event logs (System, Application, Security)</p>
</li>
<li><p>Performance counters (CPU, Memory, Disk, Network)</p>
</li>
<li><p>IIS logs for web servers</p>
</li>
<li><p>SQL Server logs and metrics</p>
</li>
<li><p>Custom Windows logs and ETW events</p>
</li>
<li><p>Application-specific logs</p>
</li>
</ul>
<h3 id="heading-linux-servers">Linux Servers</h3>
<ul>
<li><p>Syslog messages (all facilities and priorities)</p>
</li>
<li><p>Performance metrics (CPU, Memory, Disk, Network)</p>
</li>
<li><p>Custom log files with configurable parsing</p>
</li>
<li><p>Journal logs from systemd-based systems</p>
</li>
<li><p>Container logs (Docker, Kubernetes)</p>
</li>
<li><p>Application-specific logs (Apache, Nginx, MySQL, etc.)</p>
</li>
</ul>
<h3 id="heading-aixunixsolaris-systems">AIX/Unix/Solaris Systems</h3>
<ul>
<li><p>Syslog forwarding to collection servers</p>
</li>
<li><p>Custom connectors for specialized Unix variants</p>
</li>
<li><p>Script-based collection for legacy systems</p>
</li>
<li><p>Direct integration with Event Hubs</p>
</li>
<li><p>API-based collection methods</p>
</li>
<li><p>Log file exports via SFTP/SCP</p>
</li>
</ul>
<h3 id="heading-network-appliances">Network Appliances</h3>
<ul>
<li><p>Flow logs for traffic analysis</p>
</li>
<li><p>SNMP traps and metrics</p>
</li>
<li><p>Syslog collection from network devices</p>
</li>
<li><p>Azure Network Watcher integration</p>
</li>
<li><p>NSG flow logs</p>
</li>
<li><p>API-based collection from vendor platforms</p>
</li>
</ul>
<h3 id="heading-firewall-devices">Firewall Devices</h3>
<ul>
<li><p>Security events and alerts</p>
</li>
<li><p>Traffic flow logs</p>
</li>
<li><p>Rule match logs</p>
</li>
<li><p>Administrative audit logs</p>
</li>
<li><p>Threat intelligence feeds</p>
</li>
<li><p>VPN connection logs</p>
</li>
</ul>
<h3 id="heading-cloud-services">Cloud Services</h3>
<ul>
<li><p>Native Azure Services logs</p>
</li>
<li><p>Azure Active Directory logs</p>
</li>
<li><p>Microsoft 365 Defender logs</p>
</li>
<li><p>Azure PaaS service logs</p>
</li>
<li><p>3P Cloud Platform Services logs</p>
</li>
</ul>
<h3 id="heading-saas-platform">SaaS Platform</h3>
<ul>
<li><p>Microsoft M365 Services</p>
</li>
<li><p>Salesforce</p>
</li>
<li><p>Qualys</p>
</li>
<li><p>ZScaler</p>
</li>
</ul>
<h2 id="heading-ingestion-patterns"><strong>Ingestion Patterns</strong></h2>
<h3 id="heading-legacy-systems-unixaixsolariscustomos">Legacy Systems (Unix/AIX/Solaris/CustomOS)</h3>
<p>Configure Unix/Solaris systems to forward syslog messages to a central syslog server:</p>
<ul>
<li><p>Deploy Linux servers with Fluent Bit or the OTel Collector as syslog collectors</p>
</li>
<li><p>Configure syslog forwarding on the Unix/Solaris systems to the Fluent Bit forwarder server</p>
</li>
<li><p>Use Fluent Bit filters to reduce noise in the logs being ingested</p>
</li>
<li><p>Parse the raw log data into the structured format defined for the syslog table in ADX (a minimal collector sketch follows this list)</p>
</li>
</ul>
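<p>A minimal, hedged sketch of the collector side of this pattern, using Fluent Bit's YAML configuration format. The listener port, filter pattern, and the ADX cluster, database, table, and mapping names are assumptions for illustration; the <code>azure_kusto</code> output plugin handles ingestion into ADX.</p>
<pre><code class="lang-yaml"># Hedged sketch: Fluent Bit as the central syslog collector for Unix/Solaris forwarders, writing to ADX
pipeline:
  inputs:
    - name: syslog
      mode: udp                           # or tcp, depending on what the legacy systems can forward
      listen: 0.0.0.0
      port: 5140
      parser: syslog-rfc3164              # parse raw syslog lines into structured fields
  filters:
    - name: grep
      match: "*"
      exclude: message healthcheck        # drop obvious noise at the edge (illustrative pattern)
  outputs:
    - name: azure_kusto
      match: "*"
      tenant_id: &lt;tenant-id&gt;
      client_id: &lt;client-id&gt;
      client_secret: &lt;client-secret&gt;
      ingestion_endpoint: https://ingest-&lt;cluster&gt;.&lt;region&gt;.kusto.windows.net
      database_name: logsdb
      table_name: Syslog
      ingestion_mapping_reference: syslog_json_mapping
</code></pre>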
<h2 id="heading-windowslinuxfreebsd-server">Windows/Linux/FreeBSD Server</h2>
<h3 id="heading-option-a-ama-azure-monitor-agent">Option A — AMA (Azure Monitor Agent)</h3>
<p><strong>Best for:</strong> Windows Events, Syslog, <strong>Performance Counters (time‑series)</strong>, IIS logs, “VM observability first”.</p>
<p><strong>Pattern</strong></p>
<ol>
<li><p>Onboard servers to <strong>Azure Arc</strong> (for hybrid) or use native Azure VMs.</p>
</li>
<li><p>Deploy <strong>AMA</strong> and configure <strong>Data Collection Rules (DCR)</strong>:</p>
<ul>
<li><p>Windows: <strong>WindowsEvent</strong>, <strong>IIS</strong> (custom text), <strong>Perf counters</strong></p>
</li>
<li><p>Linux: <strong>Syslog</strong>, <strong>Perf counters</strong></p>
</li>
</ul>
</li>
<li><p>Route to <strong>Log Analytics</strong>; optionally <strong>export</strong> or <strong>data connect</strong> to <strong>ADX</strong> for advanced KQL and long‑term analytics.</p>
</li>
</ol>
<h3 id="heading-option-b-fluent-bit-lightweight-log-shipping-rich-parsing">Option B — Fluent Bit (lightweight log shipping + rich parsing)</h3>
<p><strong>Best for:</strong> Containers/AKS, edge, <strong>file/journald/syslog</strong> logs with <strong>multiline/regex</strong> parsing; routing to <strong>Event Hubs (Kafka API)</strong> → <strong>ADX</strong>.</p>
<p><strong>Pattern</strong></p>
<ol>
<li><p>Install <strong>Fluent Bit</strong> on Windows/Linux/Unix Servers</p>
</li>
<li><p>Collect logs: <strong>journald/systemd</strong>, <strong>syslog</strong>, <strong>file tail</strong> (IIS, App logs).</p>
</li>
<li><p>Apply <strong>filters</strong> (multiline, <code>parser</code>, <code>modify</code>, <code>lua/wasm</code> if needed).</p>
</li>
<li><p>Output to <strong>Azure Data Explorer or Event Hubs (Kafka API)</strong> for buffering and replay (a minimal collection-side sketch follows this list).</p>
</li>
<li><p><strong>ADX</strong> ingests from the Event Hub (data connection + JSON mapping).</p>
</li>
</ol>
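<p>A hedged sketch of the collection side of this pattern: a Fluent Bit tail input with multiline handling, again in the YAML configuration format. Paths, tags, and the multiline parser are illustrative; the output would be the Event Hubs Kafka output shown earlier, or the <code>azure_kusto</code> output for direct ADX ingestion.</p>
<pre><code class="lang-yaml"># Hedged sketch: tailing application log files with multiline handling and light enrichment
pipeline:
  inputs:
    - name: tail
      path: /var/log/app/*.log            # illustrative path; point at IIS/app log locations as needed
      tag: app.logs
      multiline.parser: java              # built-in multiline parser; pick one matching the app's stack traces
      read_from_head: false
  filters:
    - name: modify
      match: app.logs
      add: environment production         # light enrichment at the edge (illustrative key/value)
</code></pre>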
<h3 id="heading-option-c-opentelemetry-collector-multisignal-pipelines">Option C — OpenTelemetry Collector (multi‑signal pipelines)</h3>
<p><strong>Best for:</strong> <strong>Logs + Metrics + Traces</strong> together (microservices, distributed apps), standard <strong>OTLP</strong>, centralized processing (sampling, filtering, redaction).</p>
<p><strong>Pattern</strong></p>
<ol>
<li><p>Run <strong>OTel Collector</strong> as a service/sidecar/DaemonSet.</p>
</li>
<li><p>Receivers: <code>filelog</code> (app logs), <code>syslog</code>, <code>hostmetrics</code> / <code>windowsperfcounters</code>, <code>otlp</code> (SDKs).</p>
</li>
<li><p>Processors: <code>batch</code>, <code>attributes</code>, <code>memory_limiter</code>, optional <code>transform</code>/<code>redaction</code>.</p>
</li>
<li><p>Export:</p>
<ul>
<li><p><strong>Kafka exporter → Event Hubs (Kafka API)</strong> → <strong>ADX</strong> (logs)</p>
</li>
<li><p><strong>azuremonitor exporter</strong> → Azure Monitor / Application Insights (traces/metrics)</p>
</li>
</ul>
</li>
</ol>
<p>Sample OTel Collector config (YAML):</p>
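<p>A minimal, hedged sketch of such a collector configuration, assembled from the pattern above. Component names follow the upstream OTel Collector (contrib) distribution; paths, the Event Hubs broker, topic, and connection string are placeholders.</p>
<pre><code class="lang-yaml"># Hedged sketch: OTel Collector pipeline for server logs and host metrics, exporting via the Kafka API
receivers:
  filelog:
    include: [ /var/log/app/*.log ]       # illustrative application log path
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}

exporters:
  kafka:
    brokers: [ mynamespace.servicebus.windows.net:9093 ]   # Event Hubs Kafka endpoint (placeholder)
    topic: otel-logs
    auth:
      sasl:
        mechanism: PLAIN
        username: "$ConnectionString"                      # literal value required by Event Hubs
        password: "Endpoint=sb://mynamespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=..."
      tls: {}

service:
  pipelines:
    logs:
      receivers: [ filelog, otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ kafka ]
    metrics:
      receivers: [ hostmetrics, otlp ]
      processors: [ memory_limiter, batch ]
      exporters: [ kafka ]
</code></pre>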
<h2 id="heading-kubernetescontainers-openshiftaroeksgke"><strong>Kubernetes/Containers (OpenShift/ARO/EKS/GKE)</strong></h2>
<ul>
<li><h3 id="heading-fluentd-lightweight-log-shipping-rich-parsing">FluentD (lightweight log shipping + rich parsing)</h3>
<p>  <strong>Best for:</strong> Containers/AKS, edge, <strong>file/journald/syslog</strong> logs with <strong>multiline/regex</strong> parsing; routing to <strong>Event Hubs (Kafka API)</strong> → <strong>ADX</strong>.</p>
<p>  <strong>Pattern</strong></p>
<ol>
<li><p>Deploy FluentD as a DaemonSet to collect logs from Kubernetes/containers (a minimal DaemonSet sketch follows this section)</p>
</li>
<li><p>Collect logs: container stdout/stderr from all pods by tailing CRI log files such as <code>/var/log/containers/*.log</code>; Kubernetes context such as pod name, namespace, labels, and annotations is added to each log record via the kubernetes_metadata_filter plugin</p>
</li>
<li><p>Apply <strong>filters</strong> (multiline, <code>parser</code>, <code>modify</code>, <code>lua/wasm</code> if needed).</p>
</li>
<li><p>Output to <strong>Azure Event Hubs (Kafka API)</strong> for buffering and replay.</p>
</li>
<li><p><strong>ADX</strong> ingests from the Event Hub (data connection + JSON mapping).</p>
</li>
</ol>
</li>
</ul>
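<p>As referenced above, a minimal, hedged sketch of the DaemonSet portion of this pattern. The namespace, image tag, service account, and ConfigMap name are placeholders; the FluentD configuration itself (tail input, kubernetes_metadata_filter, Kafka output) would live in the mounted ConfigMap.</p>
<pre><code class="lang-yaml"># Hedged sketch: minimal FluentD DaemonSet for cluster-wide log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd              # needs RBAC to read pod metadata for kubernetes_metadata_filter
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-kafka2-1   # placeholder image tag
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: fluentd-config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: fluentd-config
          configMap:
            name: fluentd-config               # holds the tail + kubernetes_metadata_filter + kafka output config
</code></pre>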
<h3 id="heading-network-device-and-firewall-collection">Network Device and Firewall Collection</h3>
<h4 id="heading-a-snmp-device-health-amp-performance"><strong>A) SNMP (device health &amp; performance)</strong></h4>
<ul>
<li><p>Azure Monitor does <strong>not</strong> poll SNMP on your behalf. Use one of:</p>
<ul>
<li><p><strong>Prometheus SNMP Exporter</strong> adjacent to devices → <strong>OTel Collector</strong> <code>prometheus</code> receiver → export to <strong>Azure Monitor metrics</strong> or <strong>ADX</strong>.</p>
</li>
<li><p><strong>OTel Collector</strong> <code>snmp</code> receiver (contrib) to poll SNMP <strong>v3</strong> directly → export as metrics.</p>
</li>
</ul>
</li>
<li><p>Always prefer <strong>SNMP v3</strong> (auth+priv), restrict to a <strong>management network</strong>, and standardize on <strong>MIBs</strong> + <strong>targets.yaml</strong>. A hedged scrape-config sketch follows this list.</p>
</li>
</ul>
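<p>As referenced above, a hedged sketch of the first option: the OTel Collector scraping a Prometheus SNMP Exporter. The module name, device IPs, and the exporter address are placeholders; the relabelling follows the standard snmp_exporter scrape pattern.</p>
<pre><code class="lang-yaml"># Hedged sketch: OTel Collector prometheus receiver scraping an SNMP Exporter for network devices
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: snmp-network-devices
          metrics_path: /snmp
          params:
            module: [if_mib]                       # SNMP Exporter module (placeholder)
          static_configs:
            - targets: [ "10.10.0.1", "10.10.0.2" ]    # device IPs to poll (placeholders)
          relabel_configs:
            - source_labels: [__address__]
              target_label: __param_target         # pass the device IP as the ?target= query parameter
            - source_labels: [__param_target]
              target_label: instance
            - target_label: __address__
              replacement: snmp-exporter:9116       # address of the SNMP Exporter itself (placeholder)
</code></pre>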
<h4 id="heading-b-syslog-recommended-primary-path"><strong>B) Syslog (recommended primary path)</strong></h4>
<ul>
<li><p><strong>Configure devices</strong> to send syslog (prefer <strong>TCP/TLS</strong> if supported, else UDP) to a <strong>collector VM</strong>.</p>
</li>
<li><p>Run <strong>rsyslog</strong>/<strong>syslog‑ng</strong> on the VM to <strong>receive</strong> and <strong>write</strong> messages to files (e.g., per‑device/per‑program).</p>
</li>
<li><p>Choose one of:</p>
<ul>
<li><p><strong>Azure Monitor Agent (AMA)</strong>: Create a <strong>DCR (Syslog data source)</strong> to read the local syslog files and route to <strong>Log Analytics</strong> (and/or <strong>ADX via export</strong>).</p>
</li>
<li><p><strong>Fluent Bit</strong>: Use the <code>syslog</code> input to receive logs directly (UDP/TCP), apply parsing/enrichment, and forward to <strong>Event Hubs (Kafka API)</strong>, <strong>ADX</strong>, or <strong>Blob/ADLS</strong>.</p>
</li>
<li><p><strong>OTel Collector</strong>: Use the <code>syslogreceiver</code> (contrib) to ingest syslog, then process and export to <strong>Event Hubs</strong>, <strong>Azure Monitor</strong>, or <strong>ADX</strong>.</p>
</li>
</ul>
</li>
</ul>
<h4 id="heading-c-api-based-collection-for-vendors-that-expose-logstelemetry-over-reststream"><strong>C) API-based collection (for vendors that expose logs/telemetry over REST/stream)</strong></h4>
<ul>
<li><p><strong>Azure Functions or Logic Apps</strong> to <strong>poll or stream</strong> vendor APIs (pagination, throttling, retries).</p>
</li>
<li><p>Normalize to a common schema (time, device, severity, category, message, properties) and emit to <strong>Event Hubs</strong> or <strong>ADX</strong>.</p>
</li>
</ul>
<h3 id="heading-data-transformation"><strong>Data Transformation:</strong></h3>
<p>Log transformation should be done in two places: at the edge layer (agents) and at the log store, i.e. Azure Data Explorer.<br />a) Agents: do coarse-grained noise reduction, basic parsing, minimal sensitive-data removal, and light enrichment at the <strong>edge</strong> to reduce volume/cost and prevent sensitive content from leaving hosts or clusters.</p>
<p>b) ADX: do schema shaping, deep parsing, normalization, advanced masking, joins with reference data, routing, and lifecycle transformations in <strong>ADX</strong> using ingestion-time update policies and KQL-based pipelines for durable, consistent transformations at scale.<br />Examples: normalization, PII data masking, sensitive-data eradication, enrichment, and log labelling.</p>
<ul>
<li><p><strong>Log Filters:</strong><br />  The OTel Collector can parse raw log lines into structured fields (like time and severity) using regex-based parsing so filters and routing can act on real fields instead of brittle string matches. It can then drop unwanted records at the edge (for example, anything containing the word “secret”) using a filter processor so those logs never leave the node or hit storage.</p>
<p>  Fluent Bit tails log files and applies a JSON parser so unstructured lines become structured records, which makes filtering, routing, and indexing far more accurate and efficient. It uses the grep filter to exclude noisy patterns early (for example, health checks or verbose debug lines), cutting costs and noise before data leaves the host.</p>
<p>  AMA lets data collection rules select by severity and XPath so only meaningful event IDs and levels are sent, reducing volume at the source. DCR transformations (KQL) can then add new fields, reshape columns, and redact or remove sensitive content before it reaches the workspace.</p>
</li>
<li><p><strong>Log Parsing</strong></p>
<p>  OTel’s filelog receiver parses unstructured lines into structured fields using operators such as regex_parser.<br />  After parsing at the receiver, transformations can further standardize fields (for example, aligning severity) with OTTL in the transform processor when needed for normalization and routing logic.</p>
<p>  Fluent Bit parses at the input using parsers (JSON or regex) so filters and outputs operate on structured keys like time, level, and message rather than brittle substring matches.</p>
<p>  AMA doesn’t define regex/JSON parsers the way the other agents do; instead, parsing is done in Data Collection Rule “transformations” with KQL functions such as parse_json, extract, split, extend, and project before data lands in Log Analytics.</p>
</li>
<li><p><strong>Data masking &amp; Eradication:</strong><br />  <strong>OTel Collector</strong> masks logs using the transform processor and OTTL: redact substrings with regex, delete sensitive attributes, or do deterministic hashing (e.g., SHA256) so correlation is possible without leaking raw values, all inline in the log context before export.</p>
<p>  e.g.: redact card-like patterns, delete auth tokens, and hash emails at source with OTTL.</p>
<p>  <strong>Fluent Bit</strong> masks by removing keys or entire value fields with the modify and record_modifier filters, and uses the Lua filter for partial redaction within strings when regex replacement is needed, enabling precise field-level or substring masking at high throughput on the edge.<br />  e.g.: delete sensitive keys (e.g., password, token) or allowlist only safe keys, add placeholders, and apply custom redaction logic via Lua to mask substrings like emails, IPs, SSNs, or card numbers across any fields in a record.</p>
<p>  <strong>AMA</strong> (Azure Monitor Agent) applies masking by not projecting sensitive columns, setting sensitive fields to constants like “redacted”, and performing pattern-based rewrites using supported KQL operators; note that not all KQL string functions are available in DCR transforms, so rely on the documented supported set and test in the DCR editor.<br />  e.g.: remove secrets and redact card-like strings with a supported replace form and projection in a DCR transformation query.</p>
<p>  Sample raw logs that include noise, PII data, and secrets/passwords:</p>
<pre><code class="lang-plaintext">  {"time":"2025-08-31T16:40:01Z","level":"DEBUG","service":"payments","log":"healthcheck ok"}
  {"time":"2025-08-31T16:40:02Z","level":"VERBOSE","service":"cart","log":"retrying request"}
  {"time":"2025-08-31T16:40:03Z","level":"INFO","service":"orders","log":"pay 4111-1111-1111-1111 PAN AAAPL1234C Aadhaar 3675 9834 6012"}
  {"time":"2025-08-31T16:40:04Z","level":"WARN","service":"auth","Authorization":"Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjoiYWJjIn0.RxYb2V7o0wTF3aoRkFkh7y8H0Pr1dI3o1gqTQc2y7yY","password":"P@ssw0rd!","log":"login failed"}
  2025-08-31T16:40:01Z DEBUG payments: healthcheck ok
  2025-08-31T16:40:02Z VERBOSE cart: retrying request
  2025-08-31T16:40:03Z INFO orders: pay 4111-1111-1111-1111 PAN AAAPL1234C Aadhaar 3675 9834 6012
  2025-08-31T16:40:04Z WARN auth: login failed password="P@ssw0rd!" Authorization="Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjoiYWJjIn0.RxYb2V7o0wTF3aoRkFkh7y8H0Pr1dI3o1gqTQc2y7yY"
</code></pre>
</li>
<li><p>Sample OTel Collector YAML config based on the raw logs above</p>
<pre><code class="lang-yaml">  <span class="hljs-attr">receivers:</span>
    <span class="hljs-attr">filelog/app:</span>
      <span class="hljs-attr">include:</span> [<span class="hljs-string">/var/log/app/*.log</span>]
      <span class="hljs-attr">start_at:</span> <span class="hljs-string">beginning</span>
      <span class="hljs-attr">operators:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">regex_parser</span>
          <span class="hljs-attr">regex:</span> <span class="hljs-string">'^(?P&lt;ts&gt;[^ ]+) (?P&lt;level&gt;[A-Z]+) (?P&lt;service&gt;[^:]+): (?P&lt;body&gt;.*)$'</span>
          <span class="hljs-attr">timestamp:</span>
            <span class="hljs-attr">parse_from:</span> <span class="hljs-string">attributes.ts</span>
            <span class="hljs-attr">layout:</span> <span class="hljs-string">'%Y-%m-%dT%H:%M:%S%z'</span>
          <span class="hljs-attr">severity:</span>
            <span class="hljs-attr">parse_from:</span> <span class="hljs-string">attributes.level</span>

  <span class="hljs-attr">processors:</span>
    <span class="hljs-attr">filter/logs:</span>
      <span class="hljs-attr">logs:</span>
        <span class="hljs-attr">log_record:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">'attributes["level"] == "DEBUG" or attributes["level"] == "VERBOSE"'</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">'IsMatch(body, "(?i)healthcheck")'</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">'IsMatch(body, "(?i)password=")'</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">'IsMatch(body, "(?i)authorization=")'</span>

    <span class="hljs-attr">transform/logs:</span>
      <span class="hljs-attr">log_statements:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">context:</span> <span class="hljs-string">log</span>
          <span class="hljs-attr">statements:</span>
            <span class="hljs-comment"># Redact JWT (JWS/JWE)</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">set(body,</span> <span class="hljs-string">ReplaceAllMatches(body,</span> <span class="hljs-string">"(?:ey[A-Za-z0-9-_]+\\.){2,4}[A-Za-z0-9-_]+"</span><span class="hljs-string">,</span> <span class="hljs-string">"JWT_REDACTED"</span><span class="hljs-string">))</span>
            <span class="hljs-comment"># Redact credit cards (common 4-4-4-4 or 16-digits with separators)</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">set(body,</span> <span class="hljs-string">ReplaceAllMatches(body,</span> <span class="hljs-string">"((?:\\d{4}[-\\s]*){3}\\d{4})"</span><span class="hljs-string">,</span> <span class="hljs-string">"****-****-****-****"</span><span class="hljs-string">))</span>
            <span class="hljs-comment"># Redact PAN (AAAAA9999A)</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">set(body,</span> <span class="hljs-string">ReplaceAllMatches(body,</span> <span class="hljs-string">"\\b[A-Z]{5}[0-9]{4}[A-Z]\\b"</span><span class="hljs-string">,</span> <span class="hljs-string">"PAN_REDACTED"</span><span class="hljs-string">))</span>
            <span class="hljs-comment"># Redact Aadhaar (spaced and unspaced; starts 2-9)</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">set(body,</span> <span class="hljs-string">ReplaceAllMatches(body,</span> <span class="hljs-string">"\\b[2-9][0-9]{3}\\s[0-9]{4}\\s[0-9]{4}\\b"</span><span class="hljs-string">,</span> <span class="hljs-string">"AADHAAR_REDACTED"</span><span class="hljs-string">))</span>
            <span class="hljs-bullet">-</span> <span class="hljs-string">set(body,</span> <span class="hljs-string">ReplaceAllMatches(body,</span> <span class="hljs-string">"\\b[2-9][0-9]{11}\\b"</span><span class="hljs-string">,</span> <span class="hljs-string">"AADHAAR_REDACTED"</span><span class="hljs-string">))</span>

  <span class="hljs-attr">exporters:</span>
    <span class="hljs-attr">otlphttp:</span>
      <span class="hljs-attr">endpoint:</span> <span class="hljs-string">https://collector.example/v1/logs</span>

  <span class="hljs-attr">service:</span>
    <span class="hljs-attr">pipelines:</span>
      <span class="hljs-attr">logs:</span>
        <span class="hljs-attr">receivers:</span> [<span class="hljs-string">filelog/app</span>]
        <span class="hljs-attr">processors:</span> [<span class="hljs-string">filter/logs</span>, <span class="hljs-string">transform/logs</span>]
        <span class="hljs-attr">exporters:</span> [<span class="hljs-string">otlphttp</span>]
</code></pre>
</li>
<li><p>Sample Fluent Bit YAML config for the raw logs</p>
</li>
</ul>
<pre><code class="lang-yaml">[<span class="hljs-string">PARSER</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">app_json</span>
  <span class="hljs-string">Format</span> <span class="hljs-string">json</span>

[<span class="hljs-string">INPUT</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">tail</span>
  <span class="hljs-string">Path</span>   <span class="hljs-string">/var/log/app/app.jsonl</span>
  <span class="hljs-string">Parser</span> <span class="hljs-string">app_json</span>
  <span class="hljs-string">Tag</span>    <span class="hljs-string">app.logs</span>

<span class="hljs-comment"># Filter out DEBUG/VERBOSE and healthchecks</span>
[<span class="hljs-string">FILTER</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">grep</span>
  <span class="hljs-string">Match</span>  <span class="hljs-string">app.logs</span>
  <span class="hljs-string">Exclude</span> <span class="hljs-string">level</span>   <span class="hljs-string">^(DEBUG|VERBOSE)$</span>
[<span class="hljs-string">FILTER</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">grep</span>
  <span class="hljs-string">Match</span>  <span class="hljs-string">app.logs</span>
  <span class="hljs-string">Exclude</span> <span class="hljs-string">log</span>     <span class="hljs-string">healthcheck</span>

<span class="hljs-comment"># Eradicate secret-bearing keys</span>
[<span class="hljs-string">FILTER</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">modify</span>
  <span class="hljs-string">Match</span>  <span class="hljs-string">app.logs</span>
  <span class="hljs-string">Remove</span> <span class="hljs-string">password</span>
  <span class="hljs-string">Remove</span> <span class="hljs-string">authorization</span>
  <span class="hljs-string">Remove</span> <span class="hljs-string">access_token</span>
  <span class="hljs-string">Remove</span> <span class="hljs-string">refresh_token</span>
  <span class="hljs-string">Add</span>    <span class="hljs-string">pii_masked</span> <span class="hljs-literal">true</span>

<span class="hljs-comment"># Redact JWT, Credit Cards, PAN, Aadhaar in strings</span>
[<span class="hljs-string">FILTER</span>]
  <span class="hljs-string">Name</span>   <span class="hljs-string">lua</span>
  <span class="hljs-string">Match</span>  <span class="hljs-string">app.logs</span>
  <span class="hljs-string">Script</span> <span class="hljs-string">redact.lua</span>
  <span class="hljs-string">Call</span>   <span class="hljs-string">filter</span>

[<span class="hljs-string">OUTPUT</span>]
    <span class="hljs-string">Name</span>                         <span class="hljs-string">azure_kusto</span>
    <span class="hljs-string">Match</span>                        <span class="hljs-string">*</span>
    <span class="hljs-comment"># --- ADX connection ---</span>
    <span class="hljs-string">Ingestion_Endpoint</span>           <span class="hljs-string">https://ingest-adx-log-store-01.eastus2.kusto.windows.net</span>
    <span class="hljs-string">Database_Name</span>                <span class="hljs-string">log_store</span>
    <span class="hljs-string">Table_Name</span>                   <span class="hljs-string">FluentBitLogs</span>
    <span class="hljs-comment"># --- Auth (choose ONE) ---</span>
    <span class="hljs-string">Tenant_Id</span>                    <span class="hljs-string">89b26e60-021d-43d4-8e47-b0bb496a95a4</span>          
    <span class="hljs-string">Client_Id</span>                    <span class="hljs-string">93e81a16-d997-4027-93e0-5fc660ceb022</span>      
    <span class="hljs-string">Client_Secret</span>                <span class="hljs-string">xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</span>      
    <span class="hljs-comment"># --- Reliability / performance ---</span>
    <span class="hljs-string">buffering_enabled</span>            <span class="hljs-string">On</span>
    <span class="hljs-string">buffer_dir</span>          <span class="hljs-string">C:\ProgramData\fluent-bit\buffer</span>
    <span class="hljs-string">upload_timeout</span>               <span class="hljs-string">2m</span>
    <span class="hljs-string">upload_file_size</span>             <span class="hljs-string">150M</span>
    <span class="hljs-string">store_dir_limit_size</span>         <span class="hljs-string">8GB</span>
    <span class="hljs-string">compression_enabled</span>          <span class="hljs-string">On</span>
</code></pre>
<ul>
<li>Sample Azure Monitor Agent DCR transformation to filter, parse, and redact the raw logs above</li>
</ul>
<pre><code class="lang-sql">source
// Filter out noise (DEBUG/VERBOSE/healthcheck)
| where tolower(tostring(severity)) !in ("debug","verbose")
| where tostring(message) !contains "healthcheck"

// Remove secret-bearing columns
| project-away password, authorization, access_token, refresh_token, id_token

// Redact JWT, credit cards, PAN, Aadhaar inside message
| extend m = tostring(message)
| extend m = replace_regex(m, @"(ey[\w\-_]+\.){2,4}[\w\-_]+", "JWT_REDACTED")
| extend m = replace_regex(m, @"((\d{4}[-\s]*){3}\d{4})", "****-****-****-****")
| extend m = replace_regex(m, @"\b[A-Z]{5}[0-9]{4}[A-Z]\b", "PAN_REDACTED")
| extend m = replace_regex(m, @"\b[2-9][0-9]{3}\s[0-9]{4}\s[0-9]{4}\b", "AADHAAR_REDACTED")
| extend m = replace_regex(m, @"\b[2-9][0-9]{11}\b", "AADHAAR_REDACTED")

// Final schema
| project TimeGenerated = todatetime(time), Severity = tostring(severity), Message = m
</code></pre>
<p><strong>Storage &amp; Retention</strong></p>
<p><strong>Hot</strong></p>
<ul>
<li><p><strong>Where it is:</strong> Local <strong>SSD</strong> on ADX engine nodes</p>
</li>
<li><p><strong>Control:</strong> <strong>Caching policy</strong> (pin the last <em>N</em> days in cache)</p>
</li>
<li><p><strong>Purpose:</strong> <strong>Low‑latency</strong> dashboards and investigations on <strong>recent</strong> data</p>
</li>
<li><p><strong>Notes:</strong> Data is still persisted to Azure Storage; hot is a performance cache.</p>
</li>
</ul>
<p><strong>Warm</strong></p>
<ul>
<li><p><strong>Where it is:</strong> ADX <strong>internal extents</strong> stored on <strong>Azure Storage</strong> (not pinned to SSD)</p>
</li>
<li><p><strong>Control:</strong> <strong>Retention policy</strong> (soft‑delete window, plus <strong>recoverability</strong> option for 14‑day post‑delete recovery)</p>
</li>
<li><p><strong>Purpose:</strong> <strong>Historical analytics</strong> that you still want inside ADX for fast access without keeping everything in SSD</p>
</li>
<li><p><strong>Notes:</strong> Great for 30–180‑day investigative horizons, then let retention reclaim space.</p>
</li>
</ul>
<p><strong>Cold</strong></p>
<ul>
<li><p><strong>Where it is:</strong> <strong>External ADLS/Blob</strong> store attached to the ADX cluster</p>
</li>
<li><p><strong>Control:</strong> <strong>Continuous/periodic export</strong> from ADX; define <strong>external tables</strong> for on‑demand queries; keep storage with immutability/lifecycle rules if required</p>
</li>
<li><p><strong>Purpose:</strong> <strong>Long‑term, low‑cost</strong> retention; <strong>query on demand</strong> or <strong>restore a slice</strong> back to ADX for hot performance</p>
</li>
</ul>
<p><strong>Lifecycle Policies</strong><br />Caching &amp; retention policies control data movement across the different tiers, from hot to cold storage.</p>
<p><strong>Ingest</strong>: Agents send sanitized logs to ADX CuratedLogs with ingestion mappings and update policies (already in place from the edge work), so operational queries run fast on hot/warm tiers.</p>
<p><strong>Warm</strong>: Cache policy keeps the freshest 15 days “hot,” while older data stays “warm” and is still queryable from the cluster</p>
<p><strong>Archive</strong>: Continuous export streams the same data to ADLS in Parquet for long‑term retention; after 90 days, ADX retention deletes aged extents, but archived data remains queryable via external tables, or can be rehydrated for high‑performance needs. The policy commands behind this lifecycle are sketched after the timeline below.<br />Example:<br /><strong>Hot → warm → archive timeline</strong></p>
<ul>
<li><p>Days 0–15: Data is “hot” in local SSD due to the caching policy and is immediately queryable at the lowest latency</p>
</li>
<li><p>Days 16–90: Data cools to “warm”; it’s still in the cluster’s reliable storage and fully queryable, just less likely to be in SSD unless a <a target="_blank" href="https://learn.microsoft.com/en-us/kusto/management/cache-policy?view=microsoft-fabric">hot window</a> is defined for targeted historical ranges.</p>
</li>
<li><p>After 90 days: Extents age out per retention policy; by then, the continuous export has already landed Parquet copies in ADLS, so the same data can be queried in <a target="_blank" href="https://learn.microsoft.com/en-us/azure/data-explorer/hot-windows">place</a> via external tables or used for offline analytics pipelines.</p>
</li>
</ul>
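<p>A minimal KQL sketch of these policies follows, assuming the illustrative CuratedLogs table from earlier and an ADLS account/URI invented for the example:</p>
<pre><code class="lang-sql">// Keep the freshest 15 days on local SSD (hot cache)
.alter table CuratedLogs policy caching hot = 15d

// Soft-delete extents after 90 days, with the recoverability window enabled
.alter-merge table CuratedLogs policy retention softdelete = 90d recoverability = enabled

// External table over the Parquet archive in ADLS (storage URI is illustrative)
.create external table ArchivedLogs (TimeGenerated: datetime, Severity: string, Service: string, Message: string)
kind = storage
dataformat = parquet
(
    h@'https://logarchivestore.blob.core.windows.net/curatedlogs;impersonate'
)

// Continuously export curated data to the archive before retention reclaims it
.create-or-alter continuous-export ExportCuratedLogs
over (CuratedLogs)
to table ArchivedLogs
with (intervalBetweenRuns = 1h)
&lt;| CuratedLogs
</code></pre>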
<p><strong>Analytics &amp; AIOps</strong></p>
<p>I have ingested two sample application logs into ADX, named ‘Java’ &amp; ‘dotnet’. From there it is easy to build a dashboard that depicts the status of the application logs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759702507854/5a667e1c-69fb-47f6-995c-82a101ff5cbe.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-sql">//<span class="hljs-keyword">Check</span> distribution <span class="hljs-keyword">of</span> <span class="hljs-keyword">each</span> <span class="hljs-keyword">log</span> <span class="hljs-keyword">type</span>
appLogs
| summarize <span class="hljs-keyword">count</span>() <span class="hljs-keyword">by</span> log_type
| render piechart

//<span class="hljs-keyword">Error</span> <span class="hljs-keyword">logs</span> <span class="hljs-keyword">over</span> <span class="hljs-built_in">time</span>
appLogs
| <span class="hljs-keyword">where</span> <span class="hljs-keyword">level</span> == <span class="hljs-string">"ERROR"</span>
| summarize <span class="hljs-keyword">count</span>() <span class="hljs-keyword">by</span> <span class="hljs-keyword">level</span>, <span class="hljs-keyword">bin</span>(<span class="hljs-built_in">timestamp</span>,<span class="hljs-number">1</span>m)
| render timechart
</code></pre>
<pre><code class="lang-sql">//<span class="hljs-keyword">Check</span> distribution <span class="hljs-keyword">of</span> <span class="hljs-keyword">each</span> <span class="hljs-keyword">log</span> <span class="hljs-keyword">type</span>
appLogs
| summarize <span class="hljs-keyword">count</span>() <span class="hljs-keyword">by</span> log_type, <span class="hljs-keyword">bin</span>(<span class="hljs-built_in">timestamp</span>,<span class="hljs-number">5</span>m)
</code></pre>
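<p>Because the curated logs already sit in ADX, KQL’s native time-series functions can support simple AIOps-style detection on top of the same table. A minimal sketch, assuming the appLogs table shown above:</p>
<pre><code class="lang-sql">// Flag anomalous spikes in ERROR volume per log type
appLogs
| where level == "ERROR"
| make-series ErrorCount = count() default = 0 on timestamp step 5m by log_type
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(ErrorCount, 1.5, -1, 'linefit')
| render anomalychart
</code></pre>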
<p><strong>Azure MCP/Data Agents in ADX:</strong><br />The Azure MCP Server implements a full open-source MCP server integration that supports natural language queries while dynamically discovering schemas and metadata from ADX resources. This capability enables seamless interaction with log lakes and data repositories without requiring prior KQL knowledge, making data exploration accessible to a broader range of users, including developers, analysts, and business stakeholders.</p>
<p>Example Use Case</p>
<p><strong>Natural Language Question:</strong><br /><em>"I want to know about trends observed in ADX from my sample application (Java/DotNet) logs."</em></p>
<p>The MCP server translates this query into appropriate KQL statements, executes them against the specified ADX database, and returns trend analysis results—all without the user writing a single line of KQL code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759703163199/4823e05e-bd88-42c8-9527-d461f3bfc007.png" alt class="image--center mx-auto" /></p>
<p>Example Use Case:</p>
<p><em>Root Cause Analysis:</em><br /><em>Asking the agent to perform RCA of an error observed in the app logs and recommend a possible next set of actions</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1759704293775/4a8edaf7-01d5-48b2-b542-80c5a256a0bf.png" alt class="image--center mx-auto" /></p>
<p><strong>Security &amp; Compliance</strong> (RBAC, Encryption, Connectivity, RLS)</p>
<p><strong>RBAC for ADX:</strong></p>
<p><strong>Permission Hierarchy</strong></p>
<p><strong>Cluster Scope</strong> provides organization-wide control with roles for all databases (Admin, Viewer, Monitor) and cluster management operations.</p>
<p><strong>Database Scope</strong> offers granular access through Admin (full control), User (query and create entities), Viewer (read-only queries), Ingestor (data ingestion), Monitor (metadata viewing), and Management (permission administration) roles.</p>
<p><strong>Table Scope</strong> restricts permissions to specific tables with Admin (full control), Ingestor (data ingestion), and Management (role assignment) capabilities.</p>
<p><strong>External Table Scope</strong> manages permissions for external data sources, requiring Database User or Viewer roles as prerequisites, with Admin and Management roles available.</p>
<p><strong>Materialized View Scope</strong> controls access to specific views, requiring Database User or Table Admin permissions, with Admin (alter/delete/delegate) and Management (role assignment) roles.</p>
<p><strong>Row-Level Security</strong> enables data filtering based on user identity, restricting query results to authorized rows within tables.</p>
<p>All role assignments are managed through Kusto management commands, with higher-level roles (cluster/database) typically granting access to lower-level resources within their scope.</p>
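<p>For example (the group names, GUIDs, and table names below are placeholders for illustration), database and table role assignments plus a row-level security policy look like this in Kusto management commands:</p>
<pre><code class="lang-sql">// Read-only access to the whole database for an analyst group
.add database log_store viewers ('aadgroup=soc-analysts@contoso.com') 'SOC analysts - read only'

// Ingestion-only access to a single table for the collector's app registration
.add table CuratedLogs ingestors ('aadapp=93e81a16-d997-4027-93e0-5fc660ceb022') 'Fluent Bit ingestion app'

// Row-level security: admins see everything, the payments team sees only its own service
.create-or-alter function CuratedLogsRLS() {
    union
      (CuratedLogs | where current_principal_is_member_of('aadgroup=platform-admins@contoso.com')),
      (CuratedLogs | where current_principal_is_member_of('aadgroup=payments-team@contoso.com') and Service == "payments")
}
.alter table CuratedLogs policy row_level_security enable "CuratedLogsRLS()"
</code></pre>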
<p><strong>Network Connectivity:</strong><br />Public Connectivity</p>
<ul>
<li><p>ADX clusters can be accessed over the public internet using a public endpoint (URL), allowing users with valid identities to connect from any location if public access is enabled.</p>
</li>
<li><p>Administrators can restrict public accessibility using IP address filtering, specifying allowed IP addresses, service tags, or CIDR ranges in the Azure portal</p>
</li>
</ul>
<p>Private Connectivity</p>
<ul>
<li><p>ADX offers private endpoint integration using Azure Private Link, which places the cluster within a designated virtual network and assigns a private IP address for secure, internal connectivity.</p>
</li>
<li><p>Clients on the same virtual network connect seamlessly via private endpoints using standard connection strings; DNS resolution routes traffic internally, isolating access from the public internet</p>
</li>
</ul>
<p><strong>Encryption</strong><br />ADX data stored in Azure Storage is <strong>automatically encrypted by default</strong> using <strong>Microsoft-managed keys</strong> at the service level.<br /><strong>Double encryption</strong> can optionally be enabled, adding infrastructure-level encryption on top of service-level encryption using separate algorithms and keys. This dual-layer approach protects against scenarios where one encryption algorithm or key becomes compromised.<br /><strong>Customer-Managed Keys</strong>: Organizations requiring additional control can configure <strong>customer-managed keys</strong> stored in Azure Key Vault instead of Microsoft-managed keys. This option provides full control over key lifecycle operations including creation, rotation, disabling, and revocation, with audit capabilities for compliance requirements. CMK is configured at the cluster level and requires Azure Key Vault integration.</p>
<p><strong>Cost Controls</strong></p>
<p><strong>Understanding what drives ADX cost</strong></p>
<ul>
<li><p><strong>Storage duration</strong>: The longer you store data, the higher the cost. Each table and materialized view has a <strong>retention policy</strong> that defines how long the data is kept.</p>
</li>
<li><p><strong>High CPU usage</strong>: Actions like heavy queries, data processing, or transformations cause high CPU usage (see the query sketch after this list).</p>
</li>
<li><p><strong>Cache settings</strong>: Caching more data boosts performance but can increase costs. The caching policy allows you to choose which data should be cached. You can differentiate between <em>hot data cache</em> and <em>cold data cache</em> by setting a caching policy on hot data. Hot data is kept in local SSD storage for faster query performance, while cold data is stored in object storage.</p>
</li>
<li><p><strong>Cold data usage</strong>: Queries that access cold data trigger read transactions and add to cost.</p>
</li>
<li><p><strong>Data transformation and optimization</strong>: Features like Update Policies, Materialized Views, and Partitioning consume CPU resources and can raise cost.</p>
</li>
<li><p><strong>Ingestion volume</strong>: Clusters operate more cost-effectively at higher ingestion volumes. Clusters with <strong>higher cost per GB are less common</strong> and usually have <strong>lower ingestion volumes</strong>. In general, smaller data volumes mean higher cost per GB.</p>
</li>
<li><p><strong>Schema design</strong>: Wide tables with many columns need more compute and storage resources, which raises costs.</p>
</li>
<li><p><strong>Advanced features</strong>: Options like followers, private endpoints, and Python sandboxes consume more resources and can add to cost.</p>
</li>
</ul>
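<p>To see which of these drivers is actually hurting you, the cluster’s own metadata can be queried directly. A small sketch (this assumes Database Monitor/Admin rights, and the available columns can vary by service version):</p>
<pre><code class="lang-sql">// Longest-running queries over the last day - a rough proxy for CPU-heavy work
.show queries
| where StartedOn &gt; ago(1d)
| top 10 by Duration desc
| project StartedOn, User, Application, Duration, Text
</code></pre>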
<p>Conclusion:<br />A centralized log store is no longer a “nice to have” for large enterprises—especially in financial services. When daily volumes cross 100TB and logs span everything from Windows/Linux to legacy Unix variants, appliances, and multiple clouds, decentralised tooling quickly becomes costly, noisy, and operationally brittle. What enterprises need instead is a single, governed observability plane: one place to <strong>ingest at scale</strong>, <strong>normalize</strong>, <strong>secure</strong>, <strong>retain</strong>, and <strong>analyze</strong> logs consistently across the entire estate.</p>
<p>The architecture outlined in this post addresses that reality by combining flexible collection options (AMA, Fluent Bit, OpenTelemetry, syslog collectors, and batch ingestion) with a scalable analytics engine. Azure Data Explorer is purpose-built for <strong>high-volume ingestion and fast interactive querying across structured, semi-structured, and free-text data</strong>, and it supports both streaming and batch ingestion patterns, making it a strong foundation for enterprise-wide log analytics.</p>
<p>Equally important, the solution treats governance and cost as first-class design goals. By performing <strong>coarse filtering + sensitive-data redaction at the edge</strong>, and reserving <strong>deep parsing/normalization and durable transformations in the log store</strong>, you reduce noise, shrink spend, and ensure sensitive data is controlled before it spreads. On the retention side, tiering logs through hot/warm and into low-cost archive with export-based lifecycle keeps recent investigations fast while meeting long retention requirements.</p>
<p>Finally, once logs are centralized and curated, the platform becomes more than storage; it becomes an enabler for <strong>security investigations, compliance reporting, troubleshooting, and AIOps</strong> (correlation, anomaly detection, and faster RCA). The net outcome is a standardized, scalable, and secure logging strategy that can evolve over time without rewriting every collector whenever backend analytics or storage choices change.</p>
]]></content:encoded></item><item><title><![CDATA[Observability: End to End Monitoring (Part 1)]]></title><description><![CDATA[Common requested feature from Observblity solution from customers:

Open Telemetry based

Verbose logging support

Retention period customisation

Cost Efficient

Alerts/Notification customisation

Absorb High Volume scale

Analytics and Holistic Das...]]></description><link>https://blog.osshaikh.com/observability-end-to-end-monitoring-part-1</link><guid isPermaLink="true">https://blog.osshaikh.com/observability-end-to-end-monitoring-part-1</guid><category><![CDATA[Azure]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[app]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Sat, 09 Aug 2025 12:41:00 GMT</pubDate><content:encoded><![CDATA[<p>Commonly requested features of an observability solution from customers:</p>
<ul>
<li><p>Open Telemetry based</p>
</li>
<li><p>Verbose logging support</p>
</li>
<li><p>Retention period customisation</p>
</li>
<li><p>Cost Efficient</p>
</li>
<li><p>Alerts/Notification customisation</p>
</li>
<li><p>Absorb High Volume scale</p>
</li>
<li><p>Analytics and Holistic Dashboards</p>
</li>
</ul>
<p>The first important question is: which components should one typically monitor, specifically for cloud-native apps, to get a holistic view of the entire estate across multiple moving parts?</p>
<ul>
<li><p>Infra/Platform Monitoring - Log Analytics</p>
</li>
<li><p>App Monitoring - App Insights</p>
</li>
<li><p>UI/UX - App Insights</p>
</li>
<li><p>Dependency Monitoring - App Insights with LA</p>
</li>
<li><p>Database - Az Monitor</p>
</li>
<li><p>Edge Devices - Az Monitor</p>
</li>
<li><p>Dashboard - Az Grafana</p>
</li>
</ul>
<p>In this scenario we are using Azure Monitor’s different features to cover all important components of cloud-native systems. Here is a high-level view of the data collection pillars of the Azure Monitor solution.</p>
<p><img src="https://learn.microsoft.com/en-us/azure/azure-monitor/media/overview/overview-blowout-20230707-opt.svg#lightbox" alt="Diagram that shows an overview of Azure Monitor with data sources on the left sending data to a central data platform and features of Azure Monitor on the right that use the collected data." /></p>
<ul>
<li><p>Monitor the security posture of all Azure subscriptions in a single pane of glass via a Grafana dashboard.</p>
</li>
<li><p>Create Azure Managed Grafana &amp; add the Azure Monitor workspace as a data source.</p>
</li>
<li><p>In the Dashboards section you will find many built-in dashboards for Azure resources, including Defender for Cloud. Choose Defender for Cloud to see all alerts and recommendations from MDC.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740919550344/b7342485-6329-47f5-bc23-f7e480133a02.png" alt class="image--center mx-auto" /></p>
<p><strong>Infrastructure Monitoring capabilities using Azure Monitor</strong></p>
<p><strong>Enable Traffic Analytics:</strong></p>
<ol>
<li><p>Ensure that you have Network Watcher deployed in the same region and have a Log Analytics workspace set up.</p>
</li>
<li><p>Navigate to the <strong>Azure Portal</strong> and go to <strong>Network Watcher</strong>. Then, select either <strong>NSG Flow Logs</strong> or <strong>VNet Flow Logs</strong>.</p>
</li>
<li><p>Choose the <strong>Network Security Group (NSG)</strong> or <strong>Virtual Network</strong> you want to monitor.</p>
</li>
<li><p>Click <strong>Enable</strong> for Flow Logs, then select the storage account and retention period.</p>
</li>
<li><p>In the same flow logs pane, toggle Traffic Analytics to On and select the Log Analytics workspace.</p>
</li>
</ol>
<p>Traffic Analytics provides analysis of your virtual network on Azure, including the following information:</p>
<ul>
<li><p>Who is connecting to the network?</p>
</li>
<li><p>Where are they connecting from?</p>
</li>
<li><p>Which ports are open to the internet?</p>
</li>
<li><p>What's the expected network behavior?</p>
</li>
<li><p>Are there any sudden rises in traffic?</p>
</li>
</ul>
<p>It records the total volume of traffic in your network, including information about ingress and egress IPs.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747983653685/c7bc6a90-26b8-45b3-95e2-307f5eaca0dd.png" alt class="image--center mx-auto" /></p>
<p>The screenshot below reflects the top VMs/servers with the most egress communication to the internet.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749639498524/49754cec-0d65-4644-9d67-034b2b02cf1c.png" alt class="image--center mx-auto" /></p>
<p>Workloads receiving traffic from the internet on different ports</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749640843141/e2e1f513-a0f7-4690-937f-8eb8b4e68b6f.png" alt class="image--center mx-auto" /></p>
<p><strong>Traffic consumption of workloads on Azure based on IP</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749642446542/48a44e79-950c-4f72-8691-3eb8241db93e.png" alt class="image--center mx-auto" /></p>
<p>Map view of malicious traffic blocked around the world via the internet</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749641283920/e3e39995-445e-4853-b259-cccbad3c3db4.png" alt class="image--center mx-auto" /></p>
<p><strong>Virtual Machines Performance Metrics</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749655037871/d12a8001-cdda-4613-9c5b-238ad48c0fc1.png" alt class="image--center mx-auto" /></p>
<p>Container-based workloads can be effectively monitored using Azure Monitor's Container Insights. This tool provides a detailed overview of container health, including the status of underlying nodes and performance utilization of both containers and the supporting infrastructure.</p>
<p>Additionally, it offers logging and tracing capabilities for application pods and their peripheral services, enabling you to track and analyze application performance and troubleshoot issues efficiently.</p>
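<p>Container Insights data lands in the Log Analytics workspace as well, so pod health can be queried directly. A small sketch, assuming the default KubePodInventory table that Container Insights populates:</p>
<pre><code class="lang-sql">// Pods with the most container restarts in the last hour
KubePodInventory
| where TimeGenerated &gt; ago(1h)
| summarize Restarts = max(ContainerRestartCount) by Namespace, Name
| top 10 by Restarts desc
</code></pre>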
<p>High-level dashboard view of the application’s infra components (via native Azure Monitor)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732137837156/2333a581-bd26-4b44-a3d0-9c89c5930b6e.png" alt class="image--center mx-auto" /></p>
<p>Kubernetes Cluster Overview in Grafana via Prometheus Metrics (via Azure Managed Prometheus)</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732137964132/5c72c256-f472-4038-9147-6cdfbbcb7ffc.png" alt class="image--center mx-auto" /></p>
<p>Web, application, and database logs/traces can be configured as custom logs using Log Analytics in Azure Monitor. In this example, logs are parsed to highlight specific details such as URLs, raw data, and HTTP status codes. This setup can be further customized based on the specific requirements of the application or analytics team, allowing for tailored insights and analysis.</p>
<p>Custom logs in data collection rules via Azure Monitor can be used to capture any time series-based text logs. This feature allows you to define specific log formats and structures, enabling the collection and analysis of time-stamped data that is crucial for monitoring and troubleshooting applications. By configuring custom logs, you can tailor the data collection to meet the specific needs of your organization or application, ensuring that relevant information is captured and available for analysis.</p>
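<p>As a hedged example of that kind of parsing (the table and column names below are illustrative, not from the actual setup), a custom text log can be shaped at query time like this:</p>
<pre><code class="lang-sql">// Pull the request URL and HTTP status out of a raw custom log line
WebAppLogs_CL
| extend Url = extract(@"(GET|POST|PUT|DELETE)\s+(\S+)", 2, RawData),
         StatusCode = toint(extract(@"\s(\d{3})\s", 1, RawData))
| where StatusCode &gt;= 500
| summarize Errors = count() by Url, bin(TimeGenerated, 5m)
| render timechart
</code></pre>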
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753027358042/94979bd2-7bc4-43bd-afb2-ea94a6d2ab90.png" alt class="image--center mx-auto" /></p>
<p>I will delve deeper into each component involved in end-to-end monitoring in the next blog post.</p>
]]></content:encoded></item><item><title><![CDATA[Oracle on Azure]]></title><description><![CDATA[This Blogs aims to explain why Azure is a better destination than GCP/OCI/AWS for migrating on-premises Oracle workloads, including Oracle proprietary databases, business applications, custom applications, and ISV applications.
There are multiple opt...]]></description><link>https://blog.osshaikh.com/oracle-on-azure</link><guid isPermaLink="true">https://blog.osshaikh.com/oracle-on-azure</guid><category><![CDATA[Azure]]></category><category><![CDATA[OCI]]></category><category><![CDATA[Oracle]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Sat, 01 Mar 2025 14:16:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1740590141855/702b2731-bb9e-4076-8953-23db4e6c6c34.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>This Blogs aims to explain why Azure is a better destination than GCP/OCI/AWS for migrating on-premises Oracle workloads, including Oracle proprietary databases, business applications, custom applications, and ISV applications.</strong></p>
<p><strong>There are multiple options for hosting applications that use Oracle as the DB. Based on complexity, I have categorised these into different solution categories.</strong></p>
<h2 id="heading-oracle-on-native-azure-vm"><strong>Oracle on Native Azure VM</strong></h2>
<p>In <strong>Rehost</strong> or <strong>‘Lift and Shift’</strong> scenarios, on-premises workloads are migrated to run on <strong>Azure Native Virtual Machines (VMs)</strong>. This approach involves moving your Oracle Database (DB) to standard Azure VMs, which you manage yourself, offering flexibility and cost savings compared to fully managed services.</p>
<p>Recommending <strong>native Azure VMs</strong> for Oracle workloads is appropriate in specific scenarios, balancing trade-offs between performance, management effort, and cost. Here are the key use cases:</p>
<ul>
<li><p><strong>Small to Medium Workloads:</strong> Ideal for databases where high performance isn’t critical, such as smaller Online Transaction Processing (OLTP) or analytical workloads. Standard VM hardware often suffices, reducing costs compared to Exadata-based managed services like Oracle Database@Azure.</p>
</li>
<li><p><strong>Cost-Sensitive Projects:</strong> When budget constraints are a priority, native VMs avoid the premium pricing of managed services and specialized hardware. Using <strong>constrained core vCPUs</strong> can further lower Oracle licensing costs, making this option attractive for budget-conscious projects. (<a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-reference-architecture">Architectures for Oracle Database on Azure Virtual Machines)</a></p>
</li>
<li><p><strong>Custom Configurations:</strong> If you require specific hardware or software setups unavailable with managed options—such as custom OS images or unique storage configurations—native VMs provide the flexibility to meet those needs.</p>
</li>
<li><p><strong>Non-Critical Applications:</strong> Suitable for development, testing, or less critical applications where downtime or performance hiccups are tolerable. This includes environments where experimentation is common, and managed services might be excessive.</p>
</li>
<li><p><strong>Full Control</strong> <strong>and Compliance:</strong> When regulatory or compliance requirements (e.g., data residency laws or security policies) demand complete control over the database environment, native VMs allow you to configure settings like <strong>Network Security Groups</strong> and <strong>Azure Firewall</strong> to align with those needs. (<a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-overview">Overview</a> <a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-overview">of</a> <a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-reference-architecture">Oracle Applications and Solutions on Azure)</a></p>
</li>
</ul>
<h3 id="heading-oracle-licensing-on-native-azure-vms">Oracle Licensing on Native Azure VMs</h3>
<p>There are some constraints around Oracle Database licensing when running on native VMs in cloud environments like Azure or GCP. Microsoft Azure follows this licensing model:</p>
<ul>
<li><p><strong>With Multi-Threading Enabled:</strong> Two vCPUs are counted as equivalent to one Oracle Processor license.</p>
</li>
<li><p><strong>With Multi-Threading Disabled:</strong> One vCPU equals one Oracle Processor license.</p>
</li>
</ul>
<p>For this reason, we recommend using <strong>constrained vCPU VM series</strong>. These VMs offer the same underlying memory, throughput, and network capabilities as standard VMs but with a limited vCPU count. This helps fit within Oracle’s per-core licensing model, reducing costs. For example, a Standard_E16-8s_v5 exposes only 8 vCPUs (4 Oracle Processor licenses with multi-threading enabled) while retaining the memory and I/O of the full E16 size.</p>
<p><strong>How Constrained VMs Work:</strong></p>
<ul>
<li><p>In a hyper-threaded VM, Azure allocates <strong>2 threads per core</strong>, known as vCPUs.</p>
</li>
<li><p>When hyper-threading is disabled, Azure presents <strong>single-threaded cores (physical cores)</strong>, resulting in a 1:1 vCPU-to-CPU ratio.</p>
</li>
</ul>
<p><strong>Recommended Constrained VM SKU</strong><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus"><strong>s for Oracle:</strong> Check the list of available sizes here: List of Available Sizes</a> <a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus">with Constrained vCPUs</a>.</p>
<h3 id="heading-pros">Pros</h3>
<ul>
<li><p><strong>Ideal Migration Scenarios:</strong> Perfect for cases where Oracle is currently running on <strong>Hyper-V</strong> or <strong>physical servers</strong>. Azure’s right-sizing tools, paired with constrained VM SKUs, make it a seamless fit for these environments.</p>
</li>
<li><p><strong>High Availability (HA):</strong> Oracle Databases in a <strong>Real Application Clusters (RAC)</strong> setup are used for HA. While RAC isn’t officially certified by Oracle on Azure, you can achieve HA using Azure’s native capabilities, such as <strong>Availability Zones</strong> and <strong>Oracle Data Guard</strong> replication across zones. This provides redundancy at both rack and datacenter levels.</p>
</li>
<li><p><strong>Cost Reduction:</strong> Using constrained VM SKUs reduces Oracle licensing costs without significantly impacting performance for many workloads.</p>
</li>
<li><p><strong>Licensing Optimization:</strong> Disabling multi-threading or hyper-threading on regular Azure VMs adjusts the CPU ratio to 1:1, further lowering license costs.</p>
</li>
</ul>
<h3 id="heading-conshttpslearnmicrosoftcomen-usazurevirtual-machinesworkloadsoracleoracle-reference-architecture"><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-reference-architecture">Cons</a></h3>
<ul>
<li><p><strong>Licensing Complexity:</strong> Oracle’s licensing rules can be challenging in cloud environments. You’ll often need to navigate <strong>Bring Your Own License (BYOL)</strong> scenarios and ensure VM configurations align with Oracle’s core-counting policies.</p>
</li>
<li><p><strong>RAC Limitations:</strong> Although high availability is achievable with Azure’s native features, <strong>RAC is not officially certified by Oracle on Azure</strong>, which might deter some users relying on official support.</p>
</li>
</ul>
<h2 id="heading-oracle-onhttpslearnmicrosoftcomen-usazurevirtual-machinesworkloadsoracleoracle-overview-avs-azure-vmwarhttpslearnmicrosoftcomen-usazurevirtual-machinesconstrained-vcputabsfamily-elist-of-available-sizes-with-constrained-vcpuse-solutionhttpslearnmicrosoftcomen-usazurevirtual-machinesworkloadsoracleoracle-reference-architecture"><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-overview">Oracle on</a> <a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus">AVS (Azure VMwar</a><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-reference-architecture">e Solution)</a></h2>
<p>In scenarios where customers are running their Oracle databases in a VMware environment, the easiest migration path is moving VMware to the cloud with a lift-and-shift approach using VMware vMotion or the bulk migration method.</p>
<p>There are license constraints here as well, which I will go through one by one.</p>
<p>Oracle does not recognize VMware as a <strong>hard partitioning technology</strong>, which has significant implications for licensing. Hard partitioning allows users to license only a subset of a server’s processors, thus offering more control and potentially lowering costs. However, since Oracle considers VMware a <strong>soft partitioning</strong> tool, customers must license all physical cores accessible by the Oracle software.</p>
<p>Careful management of VMware clusters is essential to minimize licensing costs. Cluster segregation can be an effective strategy to avoid having to license an entire vCenter. Organizations can limit the number of physical cores that need to be licensed by isolating clusters that run Oracle software. This requires rigorous <strong>VMware management practices</strong> to ensure that Oracle workloads are restricted to designated clusters and do not migrate freely across environments that are not fully licensed.</p>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/workloads/oracle/oracle-overview">Key Poin</a><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus">ts</a></p>
<p>Oracle makes a distinction between <strong>soft partitioning</strong> and <strong>hard partitioning</strong> in its licensing model:</p>
<ul>
<li><p><strong>Soft Partitioning</strong>: Technologies that allow for dynamic allocation of resources but do not provide Oracle-recognized methods to limit the number of processors being used. VMware falls under this category, requiring licensing of all cores accessible.</p>
</li>
<li><p><strong>Hard Partitioning</strong>: Oracle-approved methods to limit resource usage to specific cores, such as Oracle VM Server, IBM LPAR, and certain physical hardware partitioning tools. These technologies allow sub-capacity licensing, making them more cost-effective in environments with fewer Oracle workloads.</p>
</li>
</ul>
<p>For example, consider a 3-node AVS cluster using AV36P nodes, where each node has 36 physical cores:</p>
<ul>
<li><p>Total cores = 3 nodes × 36 cores/node = 108 cores. With a core factor of 0.5 (typical for Intel CPUs with hyper-threading), the number of processors to license = 108 × 0.5 = 54 processors.</p>
</li>
<li><p>Oracle Database Enterprise Edition costs approximately $47,500 per processor (per Oracle’s Technology Global Price List). Thus, the licensing cost = 54 × $47,500 = $2,565,000.</p>
</li>
</ul>
<p><strong>Example</strong>: If a company has Oracle deployed on a VMware cluster with <strong>3 Hosts</strong>, all of those hosts’ cores must be licensed, regardless of how many virtual CPUs are allocated to Oracle in the environment. Even if only a portion of the virtual CPUs is actively running Oracle, the company must account for all the physical hardware.</p>
<p>This makes AVS an expensive option for Oracle workloads unless mitigated by specific licensing agreements, such as an Unlimited License Agreement (ULA).</p>
<h3 id="heading-strategies-to-manage-licensing-costs">Strategies to Manage Licensing Costs</h3>
<p>To minimize licensing costs:</p>
<ul>
<li><p><strong>Cluster Segregation</strong>: Isolate Oracle workloads to a dedicated cluster (e.g., a 3-node AV36 cluster). This limits the number of cores requiring licenses but demands strict management to prevent Oracle VMs from migrating to unlicensed clusters via vMotion.</p>
</li>
<li><p><strong>Rigorous Management</strong>: Use VMware tools to enforce boundaries, ensuring Oracle software does not inadvertently span across the entire vCenter environment, which would require licensing all accessible cores.</p>
</li>
</ul>
<h3 id="heading-unlimited-license-agreementsulahttpslearnmicrosoftcomen-usazurevirtual-machinesconstrained-vcputabsfamily-elist-of-available-sizes-with-constrained-vcpus"><strong>Unlimited License Agreements(ULA</strong><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus"><strong>):</strong></a></h3>
<ul>
<li><p>ULAs allow unlimited use of specified Oracle software for a fixed term, avoiding per-core or per-cluster licensing.</p>
</li>
<li><p>For AVS, the ULA must include a license mobility clause, enabling on-premises licenses to be applied to the cloud. Oracle’s strategic partnership with Microsoft supports this mobility (per the Oracle and Microsoft Strategic Partnership FAQ).</p>
</li>
<li><p>This provides flexibility to scale Oracle deployments on AVS without additional licensing costs during the ULA period, making it a cost-effective option for large enterprises.</p>
</li>
</ul>
<p><a target="_blank" href="https://www.oracle.com/a/ocom/docs/corporate/pricing/technology-price-list-070617.pdf">Pros</a></p>
<ul>
<li><p><strong>Familiarity and Ease of Use</strong>: If your team uses VMware on-premises, AVS lets you stick with the same tools, reducing the learning curve and potential errors for managing Oracle databases.</p>
</li>
<li><p><strong>Enhanced Security</strong>: AVS offers a private cloud on dedicated hardware, providing better isolation for sensitive Oracle data, which is crucial for compliance.</p>
</li>
<li><p><strong>Similar Setup</strong>: AVS mirrors your on-premises VMware environment, so you can move your Oracle VM with minimal changes, reducing migration time and complexity.</p>
</li>
<li><p><strong>Familiar Tools</strong>: Using the same management tools means less learning and fewer errors, speeding up the process.</p>
</li>
<li><p><strong>Automated Migration</strong>: AVS provides tools designed for VMware migrations such vMotion,svMotion, automating much of the work, unlike native VM which may need more manual setup.</p>
</li>
</ul>
<p><a target="_blank" href="https://www.oracle.com/a/ocom/docs/corporate/pricing/technology-price-list-070617.pdf">Co</a><a target="_blank" href="https://learn.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?tabs=family-E#list-of-available-sizes-with-constrained-vcpus">ns</a>:</p>
<ul>
<li><p><strong>Lack of Scalability Savings</strong>: Businesses cannot license only the cores they need; instead, they must license all cores that could potentially run Oracle software, regardless of actual use.</p>
</li>
<li><p><strong>Higher Costs</strong>: Organizations often face <strong>exponentially higher costs</strong> because Oracle insists on licensing all the cores within a given cluster or data center rather than the specific hosts or VMs running Oracle.</p>
</li>
<li><p><strong>Compliance Challenges</strong>: The movement of virtual machines between physical hosts through VMware’s <strong>vMotion</strong> further complicates compliance, as the potential for Oracle workloads to migrate triggers the requirement to license more physical cores.</p>
</li>
<li><p>Technically, Oracle RAC is supported in a VMware environment, but per Oracle policy it is not officially certified to run on non-Oracle public clouds.</p>
</li>
</ul>
<h2 id="heading-oracle-databaseazureexadata">Oracle Database@Azure(Exadata)</h2>
<p>Oracle Database@Azure is a service where Oracle databases run on Exadata X9M hardware within Microsoft Azure's data centers. This setup ensures low latency access to Azure resources, making it ideal for customers wanting Oracle’s database performance within Azure’s ecosystem. Enterprise-critical features like Oracle Real Application Clusters (Oracle RAC), Oracle Data Guard, Oracle GoldenGate, managed backups, self-managed Oracle Recovery Manager (RMAN) backups, Oracle Zero Downtime Migration (Oracle ZDM), on-premises connectivity, and seamless integration with other Azure services are supported.</p>
<p>Key features include:</p>
<ul>
<li><p><strong>High Performance with Exadata X9M Hardware:</strong> It currently runs on Oracle Exadata X9M hardware, with configurations starting at a quarter-rack (minimum 2 database servers and 3 storage servers), scalable up to 32 database servers and 64 storage servers.</p>
</li>
<li><p><strong>Private Connectivity</strong> via Azure Virtual Network: customers can deploy Oracle Exadata within their Azure Virtual Network (VNet), using private IP addresses for secure, isolated access without public internet exposure. This helps meet data residency and security requirements, critical for industries like BFSI.</p>
</li>
<li><p><strong>High Availability:</strong> Supports Oracle Real Application Clusters (RAC) and Oracle Data Guard for redundancy and disaster recovery, with built-in redundancy and failover capabilities, and offers an SLA of 99.9%.</p>
</li>
<li><p><strong>Sub-1 Millisecond Network Latency:</strong> When applications and the database are in the same Azure region, network latency can be sub-1 millisecond due to co-location within Azure data centers.</p>
</li>
<li><p><strong>Integration with Azure Services:</strong> Integrates with Azure’s ecosystem, including Microsoft Entra ID for identity management and Azure AI/ML services. Combine Oracle’s database capabilities with Azure’s AI tools for advanced analytics or generative AI applications.</p>
</li>
<li><p><strong>Simplified Migration and Compatibility:</strong> Full compatibility with on-premises Oracle Database and Exadata deployments, supported by tools like Oracle Zero Downtime Migration. Lift and shift on-premises workloads to Azure without refactoring, minimizing migration effort and risk.</p>
</li>
<li><p><strong>Managed Service with Clear Responsibility Matrix</strong>: Oracle manages the Exadata hardware, Oracle Database software, and operating system (Oracle Linux 8.6+), while Microsoft manages the Azure infrastructure, and customers manage data, applications, and VNet.</p>
</li>
<li><p><strong>Cost Management and Licensing:</strong> Available through the Azure Marketplace with options to use existing Oracle licenses (BYOL) or Microsoft Azure Consumption Commitments.</p>
</li>
<li><p><strong>Oracle Database software</strong> includes Enterprise Edition, with supported versions from 12c to 23c, customer-selectable based on compatibility and needs.</p>
</li>
</ul>
<p>Responsibility Matrix for Oracle Exadata</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Owner</strong></td><td><strong>Responsibilities</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Oracle</td><td>Manages Exadata hardware, Oracle Database software, operating system, updates, patches</td></tr>
<tr>
<td>Microsoft</td><td>Manages underlying Azure infrastructure (data centers, networking, power)</td></tr>
<tr>
<td>Customer</td><td>Manages data, applications, and Virtual Network (configuration and connections)</td></tr>
</tbody>
</table>
</div><h3 id="heading-oracle-software-stack-in-exadata">Oracle Software Stack in Exadata</h3>
<ul>
<li><p>Oracle Database Software Versions: software includes Enterprise Edition, with supported versions from 12c to 23c, customer-selectable based on compatibility and needs.</p>
</li>
<li><p>Additional Components: Data Guard for high availability and disaster recovery &amp; GoldenGate for data replication for DR</p>
</li>
<li><p>Operating System: Oracle Linux 8.6 or later</p>
</li>
<li><p><strong>Licensing:</strong> Customers can bring their own Oracle Database licenses (BYOL) or purchase them through the Azure Marketplace.</p>
</li>
</ul>
<h3 id="heading-oracle-exadata-x9m-overview">Oracle Exadata X9M Overview</h3>
<p>Oracle typically offers the X9M in several rack‑scale configurations:</p>
<ul>
<li><p><strong>Quarter Rack:</strong><br />  • Typically includes 2 database servers and 3 storage servers.<br />  • Often used for smaller-scale or departmental workloads.</p>
</li>
<li><p><strong>Half Rack:</strong><br />  • Generally configured with 4 database servers and 6 storage servers.<br />  • Balances performance and capacity for medium‑sized enterprises.</p>
</li>
<li><p><strong>Full Rack:</strong><br />  • Usually features 8 database servers and 12 storage servers.<br />  • Designed for very large, mission‑critical deployments requiring maximum throughput and IOPS.</p>
</li>
</ul>
<p>Hardware Specifications for Oracle DB Exadata (X9M)</p>
<p>Exadata X9M consists of database servers and storage servers, with different configurations for high capacity (HC) and potentially extreme flash (EF).</p>
<p><strong>Database Servers:</strong></p>
<ul>
<li><p>The database servers in Exadata X9M, as used in Oracle Database@Azure, are equipped with advanced compute capabilities. Research suggests the following specifications, based on Exadata Cloud Services X9M documentation:</p>
</li>
<li><p><strong>Processor:</strong> 1 x AMD EPYC 7J13, offering 64 cores and 128 threads. The base clock speed is 2.6 GHz, with a turbo frequency up to 3.5 GHz, providing high performance for database operations (AMD EPYC 7J13 Processor)</p>
</li>
<li><p><strong>Cores:</strong> 64 cores, with 128 threads, supporting high compute workloads.</p>
</li>
<li><p><strong>Memory:</strong> 512 GB, expandable, supporting large in-memory database workloads, as seen in Exadata Cloud Services X9M configurations (Exadata Cloud Services X9M Specifications).</p>
</li>
<li><p><strong>Storage:</strong> 2 x 3.84 TB NVMe Flash SSD for local storage, with the option to expand to 4 x 3.84 TB, ensuring fast access to frequently used data (Oracle Exadata Database Machine Hardware Components).</p>
</li>
</ul>
<p><strong>Storage Server (High Capacity - HC):</strong></p>
<ul>
<li><p><strong>CPU Manufacturer and Family:</strong> Intel, specifically the Intel Xeon Scalable 4th generation (codenamed Sapphire Rapids), with the model being Intel® Xeon® Gold 6448Y</p>
</li>
<li><p><strong>Cores:</strong> 32 cores total, with 2 processors each having 16 cores, operating at 2.6 GHz, based on Exadata X9M HC storage server specs</p>
</li>
<li><p><strong>Memory:</strong> 256 GB DDR4 for general operations and 1.5 TB Persistent Memory, allocated for Exadata RDMA Memory (XRMEM)</p>
</li>
<li><p><strong>Storage:</strong> Includes 12 x 18 TB 7,200 RPM SAS HDDs for <a target="_blank" href="https://docs.oracle.com/en/engineered-systems/exadata-database-machine/dbmso/high-capacity-exadata-storage-server-x9m-2-components.html">bulk storage</a> (total raw capacity 216 TB) and 4 x 6.4 TB NVMe Flash SSDs for caching (total 25.6 TB)</p>
</li>
</ul>
<p><strong>Cons:</strong></p>
<p><strong>Cost:</strong> Oracle Exadata on Azure uses specialized Exadata hardware and is a managed service, likely resulting in higher costs compared to running Oracle on standard Azure VMs. Customers can choose from a variety of VM sizes in Azure VM, potentially finding more cost-effective options, especially with constrained vCPU VMs to reduce Oracle licensing costs</p>
<p><strong>Scalability</strong>: Exadata on Azure offers scalability in predefined rack configurations (quarter-rack to full-rack), which might not be as flexible as scaling individual VM sizes in standard Azure VM. Customers can choose from various VM sizes in Azure VM, including constrained vCPUs for cost savings, offering more granular scaling.</p>
<p>Reference Articles:<br /><a target="_blank" href="https://docs.oracle.com/en-us/iaas/Content/database-at-azure/oaa.htm">https://docs.oracle.com/en-us/iaas/Content/database-at-azure/oaa.htm</a><br />X9M: <a target="_blank" href="https://docs.oracle.com/en-us/iaas/exadatacloud/doc/exa-service-desc.html#GUID-EC1A62C6-DDA1-4F39-B28C-E5091A205DD3">https://docs.oracle.com/en-us/iaas/exadatacloud/doc/exa-service-desc.html#GUID-EC1A62C6-DDA1-4F39-B28C-E5091A205DD3</a><br />License: <a target="_blank" href="https://palisadecompliance.com/counting-licenses-on-azure-avs/">Counting Licenses on Azure AVS - Palisade Compliance</a><br />Implementation: <a target="_blank" href="https://docs.oracle.com/en-us/iaas/Content/database-at-azure-exadata/odexa-exadata-services-azure.html">https://docs.oracle.com/en-us/iaas/Content/database-at-azure-exadata/odexa-exadata-services-azure.html</a></p>
]]></content:encoded></item><item><title><![CDATA[DR Options for Storage on Azure]]></title><description><![CDATA[Key Considerations for Storage DR on Azure
1. Understanding RPO and RTO
Define RPO (Recovery Point Objective):RPO determines the maximum acceptable amount of data loss in the event of a disaster. For critical workloads such as financial systems or he...]]></description><link>https://blog.osshaikh.com/dr-options-for-storage-on-azure</link><guid isPermaLink="true">https://blog.osshaikh.com/dr-options-for-storage-on-azure</guid><category><![CDATA[Azure]]></category><category><![CDATA[NetApp]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[storage]]></category><category><![CDATA[RPO]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Thu, 16 Jan 2025 14:39:22 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-key-considerations-for-storage-dr-on-azure"><strong>Key Considerations for Storage DR on Azure</strong></h2>
<h3 id="heading-1-understanding-rpo-and-rto"><strong>1. Understanding RPO and RTO</strong></h3>
<p><strong>Define RPO (Recovery Point Objective):</strong><br />RPO determines the maximum acceptable amount of data loss in the event of a disaster. For critical workloads such as financial systems or healthcare applications, a <strong>sub-10-minute RPO</strong> is essential for near-continuous data protection.</p>
<p><strong>Define RTO (Recovery Time Objective):</strong><br />RTO specifies how quickly applications and services need to be restored after an outage. For business-critical applications, an <strong>RTO of minutes</strong> is often required to minimize downtime and ensure operational continuity.</p>
<p><strong>Achieving Near Zero RPO:</strong><br />Achieving <strong>near-zero RPO</strong> can be challenging, but it is feasible with technologies like <strong>Azure NetApp Files with cross-region replication</strong>, which employs low-latency replication strategies. These solutions require fine-tuned configurations and a high-performance infrastructure to meet the stringent requirements of critical workloads.</p>
<hr />
<h2 id="heading-2-azure-storage-dr-options"><strong>2. Azure Storage DR Options</strong></h2>
<h3 id="heading-a-overview-of-azure-storage-types"><strong>a. Overview of Azure Storage Types</strong></h3>
<ul>
<li><p><strong>Blob Storage:</strong><br />  Blob storage is an object storage solution optimized for unstructured data, including images, videos, logs, and backups. It is highly scalable and cost-effective, making it ideal for large-scale data storage needs.</p>
</li>
<li><p><strong>Azure Files:</strong><br />  Azure Files offers fully managed CIFS/SMB file shares, suitable for workloads requiring file-level access, such as collaboration platforms or enterprise applications.</p>
</li>
<li><p><strong>NetApp Files:</strong><br />  Azure NetApp Files supports high-performance workloads with SMB, NFS, or dual protocol support. It's designed for mission-critical applications that require high availability and performance, such as databases and SAP applications.</p>
</li>
<li><p><strong>Azure Disk:</strong><br />  Managed Disks provide block-level storage designed for Azure VMs, offering high durability and scalability for mission-critical virtualized workloads.</p>
</li>
</ul>
<h3 id="heading-b-redundancy-options"><strong>b. Redundancy Options</strong></h3>
<ul>
<li><p><strong>Blob &amp; Files:</strong></p>
<ul>
<li><p><strong>Locally Redundant Storage (LRS):</strong> Suitable for non-critical workloads where data loss in a single data center failure is acceptable.</p>
</li>
<li><p><strong>Zone-Redundant Storage (ZRS):</strong> Ensures availability within a region across availability zones, providing protection against zone-level outages.</p>
</li>
<li><p><strong>Geo-Redundant Storage (GRS):</strong> Provides automatic cross-region replication, ideal for applications requiring data durability across regions.</p>
</li>
<li><p><strong>Read-Access Geo-Redundant Storage (RA-GRS):</strong> Offers read access in secondary regions, facilitating faster recovery.</p>
</li>
</ul>
</li>
<li><p><strong>NetApp Files:</strong><br />  NetApp’s <strong>cross-region replication</strong> (via SnapMirror technology) replicates data asynchronously across regions, enabling high availability and disaster recovery for performance-sensitive workloads.</p>
</li>
</ul>
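<p>As a quick illustration of the redundancy options above, the replication level is simply a property of the storage account. A minimal Azure CLI sketch with placeholder names, creating a GRS account and then checking the last geo-sync time (informational only, since there is no RPO SLA):</p>
<pre><code class="lang-bash"># Create a storage account with geo-redundant storage (GRS)
az storage account create --name mydrstorageacct --resource-group my-dr-rg --location uksouth --sku Standard_GRS --kind StorageV2

# Check when the secondary region was last synchronised
az storage account show --name mydrstorageacct --resource-group my-dr-rg --expand geoReplicationStats --query "geoReplicationStats.lastSyncTime"
</code></pre>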
<h3 id="heading-c-when-to-choose-each-option"><strong>c. When to Choose Each Option</strong></h3>
<ul>
<li><p><strong>Blob Storage:</strong></p>
<ul>
<li><p><strong>LRS:</strong> Ideal for non-critical workloads like development environments, where cost is a priority and occasional downtime is acceptable.<br />  RPO: Near zero (synchronous replication)</p>
</li>
<li><p><strong>ZRS:</strong> Best for regional applications needing fault tolerance at the zone level.<br />  RPO: Near zero (synchronous replication)</p>
</li>
<li><p><strong>GRS &amp; RA-GRS:</strong> Suitable for business-critical workloads needing cross-region disaster recovery that can tolerate asynchronous replication lag.<br />  RPO: ~1 hour<br />  SLA RPO: None (can be several hours)  </p>
</li>
</ul>
</li>
<li><p><strong>Azure Files (GRS):</strong><br />  Ideal for use cases like archival storage and backup systems that require cross-region replication but can tolerate a slight delay in data synchronization.<br />  RPO: 15 mins<br />  SLA RPO: None (several hours for Large Datasets)</p>
</li>
<li><p><strong>NetApp Files:</strong><br />  High-performance SMB/NFS file shares: Use NetApp Files for workloads like virtual desktop infrastructures (VDI), large-scale databases, SAP applications, or high-performance computing (HPC). NetApp Files' cross-region replication is ideal for mission-critical applications requiring near-zero RPO and rapid recovery, for example replicating SAP HANA databases or media processing workloads across regions to ensure high availability and disaster recovery. It typically replicates your data in under 10 minutes, depending on the replication schedule configuration.<br />  RPO: 10 minutes<br />  SLA RPO: Less than 20 minutes</p>
</li>
<li><p><strong>Azure Disk:</strong><br />  Managed Disks are excellent for VM workloads, but to enable geo-replication you would need to implement Azure Backup, Azure Site Recovery, or a custom solution using Azure Disk snapshots (see the sketch after this list).<br />  SLA RPO: One hour</p>
</li>
</ul>
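<p>For the custom snapshot-based approach mentioned above, the starting point is scriptable with the CLI. A minimal sketch with placeholder names; incremental snapshots are the variant suited to cross-region copies:</p>
<pre><code class="lang-bash"># Take an incremental snapshot of a managed disk as the basis for a DR copy
az snapshot create --resource-group my-vm-rg --name osdisk-snap-01 --source my-vm-osdisk --incremental true
</code></pre>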
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Azure provides a diverse range of storage solutions that meet various disaster recovery needs, from object storage to high-performance file systems. For <strong>critical workloads</strong> requiring <strong>near-zero RPO</strong> and fast recovery, <strong>Azure NetApp Files</strong> offers the best performance and flexibility. <strong>Blob Storage</strong> and <strong>Azure Files</strong>, with Geo-Redundant Storage, provide cost-effective solutions for less critical workloads but still offer robust data durability. The right choice depends on the specific needs of the workload, including performance, cost, and recovery objectives. By selecting the appropriate Azure storage solution and configuring it correctly, organizations can ensure that their critical workloads remain highly available and resilient to disruptions.</p>
]]></content:encoded></item><item><title><![CDATA[Azure Arc Connectivity Options: Choosing the Right Path for Your On-Premises Servers]]></title><description><![CDATA[In this post, we'll explore the private connectivity options available for integrating on-premises servers with Azure via Azure Arc. Connecting servers to Azure Arc can involve interacting with a significant number of endpoints—ranging from 15 URLs f...]]></description><link>https://blog.osshaikh.com/azure-arc-connectivity-options</link><guid isPermaLink="true">https://blog.osshaikh.com/azure-arc-connectivity-options</guid><category><![CDATA[Azure]]></category><category><![CDATA[Hybrid Cloud]]></category><category><![CDATA[arc]]></category><category><![CDATA[server]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[networking]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Fri, 09 Aug 2024 04:18:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1723176958259/ea1dddea-a5cd-451f-b570-6a7863b76760.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this post, we'll explore the private connectivity options available for integrating on-premises servers with Azure via Azure Arc. Connecting servers to Azure Arc can involve interacting with a significant number of endpoints—ranging from 15 URLs for basic onboarding to over 150 URLs when utilizing all features and extensions, such as SQL Server, Azure Defender, and Kubernetes.</p>
<p>Azure Arc provides a versatile platform that allows you to manage servers running outside of Azure (whether on-premises or in other cloud environments) as if they were native Azure resources. When connecting these servers to Azure, you can choose from several connectivity options:</p>
<ul>
<li><p><strong>Public (Default option)</strong></p>
</li>
<li><p><strong>Private Link (Partially private)</strong></p>
</li>
<li><p><strong>Proxy (On-premises)</strong></p>
</li>
<li><p><strong>Proxy (Azure)</strong></p>
</li>
<li><p><strong>Arc Gateway (Preview)</strong></p>
</li>
</ul>
<h3 id="heading-1-public-connectivity">1. Public Connectivity</h3>
<p><strong>Overview:</strong> Public connectivity involves direct internet connections where servers communicate with Azure services over the public internet. This option is typically the easiest to set up, as it doesn’t require any changes to existing network infrastructure or private networking configurations.</p>
<p><strong>Use Cases:</strong></p>
<ul>
<li><p><strong>Quick Deployment:</strong> Ideal for rapid deployments or testing environments where security is not the primary concern.</p>
</li>
<li><p><strong>Remote Locations:</strong> Suitable for remote sites where private networking options like VPNs or ExpressRoute are not feasible.</p>
</li>
</ul>
<p><strong>Security Considerations:</strong></p>
<ul>
<li><p><strong>Network Security:</strong> Servers must be protected by firewalls and other security measures to mitigate internet-based threats.</p>
</li>
<li><p><strong>Encryption:</strong> Implement TLS 1.2 or higher for data in transit to ensure secure communication.</p>
</li>
</ul>
<p><strong>Advantages:</strong></p>
<ul>
<li><strong>Cost-Effective:</strong> No additional costs associated with setting up private network connections.</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li><p><strong>Security Risks:</strong> Direct internet connections are more vulnerable to security threats.</p>
</li>
<li><p><strong>Performance Variability:</strong> Potential for variable latency and bandwidth depending on internet conditions.</p>
</li>
</ul>
<p><strong>Configuration:</strong> No specific configuration is required other than whitelisting at least 15+ public endpoint URLs.</p>
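<p>For reference, public onboarding is just the standard <code>azcmagent connect</code> call once those endpoint URLs are reachable. A sketch with placeholder values:</p>
<pre><code class="lang-bash"># Onboard the machine to Azure Arc over the public internet
azcmagent connect --resource-group "arc-servers-rg" --tenant-id "your-tenant-id" --location "uksouth" --subscription-id "your-subscription-id"
</code></pre>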
<h3 id="heading-2-private-link-connectivity">2. Private Link Connectivity</h3>
<p><strong>Overview:</strong> Private Link establishes a secure, private connection to Azure services through a private endpoint in your virtual network. This setup ensures that traffic between the server and Azure remains within the Microsoft backbone network, avoiding the public internet.</p>
<p><strong>Use Cases:</strong></p>
<ul>
<li><p><strong>High Security Requirements:</strong> Suitable for environments that require stringent security and compliance, such as healthcare and government sectors.</p>
</li>
<li><p><strong>Regulatory Compliance:</strong> Helps meet regulatory mandates that require private network communication for sensitive data.</p>
</li>
</ul>
<p><strong>Security Considerations:</strong></p>
<ul>
<li><p><strong>Private Network:</strong> Minimizes exposure to internet-based threats by keeping traffic within a private network.</p>
</li>
<li><p><strong>Public Authentication Traffic:</strong> Note that traffic for authentication (Microsoft Entra ID) and Azure Resource Manager still routes through the public internet, which should be secured.</p>
</li>
</ul>
<p><strong>Advantages:</strong></p>
<ul>
<li><p><strong>Enhanced Security:</strong> Provides an additional layer of security by isolating traffic from the public internet.</p>
</li>
<li><p><strong>Reliable Performance:</strong> Offers more consistent and reliable performance by leveraging the Microsoft backbone network.</p>
</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li><p><strong>Complexity:</strong> More complex to set up and manage, requiring network configuration changes and potentially additional infrastructure.</p>
</li>
<li><p><strong>Security:</strong> A few Microsoft Entra ID &amp; Azure Resource Manager URLs still go via the public endpoint.</p>
</li>
<li><p><strong>Cost:</strong> Involves additional expenses for maintaining private endpoint connections.</p>
</li>
</ul>
<p><strong>Configuration:</strong> Leverage an existing circuit connection or IPsec tunnel to route Arc traffic privately. Set up a Private Link scope and configure a private endpoint in a subnet that has line-of-sight to the on-premises gateway VNet. For DNS resolution in a hybrid environment, deploy a DNS forwarder solution on Azure and configure conditional DNS forwarding on the on-premises DNS to forward Azure-related requests to the DNS forwarder VM.</p>
<ul>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723167686252/ec909ac3-2da2-432d-8644-8979d19e56d2.png" alt class="image--center mx-auto" /></p>
<p>  Once Private Link is enabled, you can verify that name resolution for the Arc-related telemetry endpoints (4 URLs) is routed via the private endpoint</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723167761285/549bfe02-63d0-4285-b73d-457bdfc192da.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
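<p>When onboarding a server against a Private Link scope like the one described above, the connect command also takes the scope's resource ID. A sketch with placeholder values (verify the parameter name against your agent version):</p>
<pre><code class="lang-bash"># Onboard the machine and bind it to the Arc Private Link scope
azcmagent connect --resource-group "arc-servers-rg" --tenant-id "your-tenant-id" --location "uksouth" --subscription-id "your-subscription-id" --private-link-scope "/subscriptions/your-subscription-id/resourceGroups/arc-pls-rg/providers/Microsoft.HybridCompute/privateLinkScopes/arc-pls"
</code></pre>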
<h3 id="heading-3-proxy-connectivity">3. Proxy Connectivity</h3>
<p><strong>Overview:</strong> With proxy connectivity, servers connect to Azure services through a proxy server, which acts as an intermediary. This is particularly useful in environments where direct internet access is restricted or where compliance requires that all traffic passes through a proxy.</p>
<p><strong>Use Cases:</strong></p>
<ul>
<li><p><strong>Controlled Environments:</strong> Ideal for environments that restrict direct internet access or require centralized management of outbound traffic.</p>
</li>
<li><p><strong>Compliance:</strong> Ensures that all traffic adheres to corporate security and compliance policies by routing it through a proxy.</p>
</li>
</ul>
<p><strong>Security Considerations:</strong></p>
<ul>
<li><p><strong>Traffic Monitoring:</strong> Proxies allow for inspection, logging, and monitoring of outbound traffic to detect and respond to suspicious activities.</p>
</li>
<li><p><strong>Controlled Access:</strong> Proxy servers can enforce policies on what traffic is permitted to reach Azure, enhancing overall security.</p>
</li>
<li><p><strong>SSL/TLS Inspection:</strong> FW/proxies can inspect SSL/TLS traffic to ensure secure communications and prevent data exfiltration.</p>
</li>
</ul>
<p><strong>Advantages:</strong></p>
<ul>
<li><p><strong>Enhanced Security:</strong> Provides additional security by isolating traffic from the public internet.</p>
</li>
<li><p><strong>Compliance:</strong> Facilitates compliance with industry regulations and standards.</p>
</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li><p><strong>Complexity:</strong> More complex to set up and manage, requiring changes to network configurations and potentially additional infrastructure.</p>
</li>
<li><p><strong>Cost:</strong> May involve additional costs for setting up and maintaining the proxy server infrastructure.</p>
</li>
</ul>
<p><strong>Configuration:</strong> Azure Arc agent creates four services on your system, including an Arc proxy. Configure your on-premises firewall or proxy IP address in the Arc agent proxy settings, so that all traffic destined for Azure Arc passes through the proxy server, which then communicates with Azure endpoints on behalf of the Arc agent.</p>
<pre><code class="lang-bash">azcmagent config <span class="hljs-built_in">set</span> proxy.url <span class="hljs-string">"http://ProxyServerFQDNorIP:port"</span>
azcmagent config get proxy.url
</code></pre>
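<p>After pointing the agent at the proxy, it is worth confirming that the agent can actually reach the required endpoints through it. A quick sanity check using the agent's own tooling:</p>
<pre><code class="lang-bash"># Verify connectivity from this machine to the Azure Arc endpoints
azcmagent check

# Review agent status and the effective configuration (including the proxy URL)
azcmagent show
</code></pre>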
<p>Here is a screenshot of an on-premises server using the proxy to route Arc agent traffic.<br />Since I have not whitelisted the URL for SQL, traffic for the Arc SQL extension is failing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722807745424/b3edaaf6-e65f-4987-b53c-951f72fea791.png" alt class="image--center mx-auto" /></p>
<p>Sample architecture for an Arc server leveraging an on-premises proxy solution:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723174094423/0cbebdb5-baf7-4aad-b2f9-a341dbb3f68e.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-4-proxy-private-connectivity">4. Proxy (Private Connectivity)</h3>
<p><strong>Overview:</strong> In this scenario, the proxy server is hosted on Azure, and traffic from on-premises servers is routed privately to Azure via ExpressRoute or IPsec VPN. This approach ensures that all communication between on-premises servers and Azure remains private, leveraging the secure connection to the Azure-based proxy before reaching Azure services over the Microsoft backbone network.</p>
<p><strong>Use Cases:</strong></p>
<ul>
<li><p><strong>High-Security Deployments:</strong> Suitable for environments that demand keeping all traffic off the internet, ensuring that no data traverses the public internet.</p>
</li>
<li><p><strong>Centralized Security Management:</strong> Ideal for organizations that want to centralize outbound traffic management while ensuring private connectivity.</p>
</li>
</ul>
<p><strong>Advantages:</strong></p>
<ul>
<li><p><strong>Enhanced Security:</strong> Offers a high level of security by ensuring that all traffic from on-premises servers stays off the public internet until it reaches the Arc service.<br />  Traffic can be routed across multiple layers of firewall in the customer DMZ before it connects to the Arc/Azure endpoint via the Azure backbone network.</p>
</li>
<li><p><strong>Compliance:</strong> Meets stringent security and compliance requirements by leveraging private connections.</p>
</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li><p><strong>Complexity:</strong> More complex to set up due to the need for private networking configurations, including ExpressRoute or IPsec VPN.</p>
</li>
<li><p><strong>Cost:</strong> Higher costs associated with maintaining private connectivity and the Azure-based proxy infrastructure.</p>
</li>
</ul>
<p><strong>Configuration:</strong></p>
<ol>
<li><p><strong>Enable Forward Proxy on Azure Firewall (Premium SKU):</strong> Configure the proxy service to expose its service over specific ports (e.g., HTTP/HTTPS).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723018665760/f37b027f-85e2-47bf-b7e6-62408633b13b.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Configure Proxy in Arc Agent:</strong> Use the <code>azcmagent</code> utility on your on-premises servers to configure the proxy settings using the IP and port details of the Azure-based proxy.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723034906101/c6d3ed6e-8667-43c1-820e-10645e84f335.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Verify Traffic:</strong> Monitor traffic on the Azure Firewall logs, ensuring that it passes through the proxy as configured. Based on the whitelisted rules, allow traffic to the necessary Arc endpoints.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723047059064/308d0ea6-5650-4a47-b4bd-e0bd2b9a9348.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>DMZ Considerations:</strong> For added security, route traffic through a DMZ zone firewall before connecting to Arc endpoints via the Microsoft backbone network.<br /> Sample architecture for secure Arc traffic via proxy which traverse through multiple layers of firewall (ingress, egress) &amp; connect Arc via backbone network</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723050318850/41033f29-99d9-4daa-997d-048dbbf31562.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<h3 id="heading-5-arc-gateway-preview">5. Arc Gateway (Preview)</h3>
<p><strong>Overview:</strong></p>
<p>Managing Azure Arc often involves dealing with numerous endpoints—15+ for basic scenarios, and over 150 for full extension use. The Arc Gateway simplifies this by reducing the number of URLs that need to be whitelisted in enterprise proxies or firewalls.</p>
<p><strong>Components:</strong></p>
<ul>
<li><p><strong>Arc Gateway:</strong> A centralized front end for all infrastructure traffic between Azure and on-premises servers.</p>
</li>
<li><p><strong>Azure Arc Proxy:</strong> A component within the Arc agent that routes traffic through the Arc Gateway, reducing the operational burden of managing multiple endpoints.</p>
</li>
</ul>
<p><strong>Use Cases:</strong></p>
<ul>
<li><p><strong>Simplified Management:</strong> Ideal for enterprises that want to reduce the operational complexity associated with managing numerous Arc-related endpoints.</p>
</li>
<li><p><strong>Secure, Centralized Traffic Routing:</strong> Ensures all Arc traffic is routed through a single, secure gateway.</p>
</li>
</ul>
<p><strong>Advantages:</strong></p>
<ul>
<li><p><strong>Reduced Complexity:</strong> Simplifies endpoint management by limiting the number of required URLs to 7 for all scenarios</p>
</li>
<li><p><strong>Improved Security:</strong> Centralizes traffic management, enhancing security and reducing the attack surface.</p>
</li>
</ul>
<p><strong>Disadvantages:</strong></p>
<ul>
<li><p><strong>Limited Availability:</strong> As a preview feature, it may not yet be suitable for production environments without thorough testing.</p>
</li>
<li><p><strong>Security</strong>: All traffic to the Arc Gateway traverses a public endpoint.</p>
</li>
</ul>
<p><strong>Configuration:</strong></p>
<ol>
<li><p><strong>Create Arc Gateway Resource:</strong> Set up the Arc Gateway within your Azure environment.</p>
<pre><code class="lang-bash"> az connectedmachine gateway create --name demo-gateway --resource-group demo-arc-gw --location eastus2 --gateway-type public --allowed-features <span class="hljs-string">'*'</span> --subscription <span class="hljs-string">'subsid'</span>
</code></pre>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723051073008/98d26441-dbfc-496e-b116-5155b9a3b279.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Enable Association:</strong> Associate your Arc-enabled servers with the Arc Gateway.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723051132055/f97fe6e1-9211-4633-96dc-6075dc32c291.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-bash"> az connectedmachine setting update --resource-group demo-arc-gw --subscription subscription-name --base-provider Microsoft.HybridCompute --base-resource-type machines --base-resource-name arcbox-win2k19  --settings-resource-name default --gateway-resource-id <span class="hljs-string">'/subscriptions/subsid/resourceGroups/demo-arc-gw/providers/Microsoft.HybridCompute/gateways/demo-gateway'</span>
</code></pre>
</li>
<li><p><strong>Initiate Arc Proxy Service:</strong> Configure the Arc Proxy to leverage the Arc Gateway, reducing the number of required endpoints.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723168842987/bfce7315-49b9-4033-ac4d-810f98f552dc.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Verify Traffic Routing:</strong> Use the <code>azcmagent</code> utility to confirm that traffic is being routed through the Arc Gateway, reducing the list of required endpoints to just four.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1723169317779/6cd226e2-b839-4ab0-8f0a-523d91a01ac2.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>A visual representation of the traffic flow would look similar to this:</p>
<p> <img src="https://learn.microsoft.com/en-us/azure/azure-arc/servers/media/arc-gateway/arc-gateway-overview.png" alt="Diagram showing the route of traffic flow for Azure Arc gateway." /></p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Using GenAI Gateway capabilities for OpenAI]]></title><description><![CDATA[Introduction
In this blog post, I will demonstrate how to leverage the newly announced GenAI gateway features in API Management (APIM) to enhance the resiliency and capacity of your Azure OpenAI deployments using circuit breaker and load balancing pa...]]></description><link>https://blog.osshaikh.com/genai-gateway-openai</link><guid isPermaLink="true">https://blog.osshaikh.com/genai-gateway-openai</guid><category><![CDATA[Azure]]></category><category><![CDATA[openai]]></category><category><![CDATA[llm]]></category><category><![CDATA[APIs]]></category><category><![CDATA[APIM]]></category><category><![CDATA[genai]]></category><category><![CDATA[Microsoft]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Sat, 03 Aug 2024 03:55:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1722657183010/f2efbe61-8eac-4786-82a9-3ff45d2178fc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction</h3>
<p>In this blog post, I will demonstrate how to leverage the newly announced GenAI gateway features in API Management (APIM) to enhance the resiliency and capacity of your Azure OpenAI deployments using circuit breaker and load balancing patterns.</p>
<h3 id="heading-background">Background</h3>
<p>Previously, I have shared various methods for using different routing algorithms (weightage/priority) to load balance traffic across multiple OpenAI endpoints to improve resilience and performance. However, these custom solutions often failed under extremely heavy traffic scenarios.</p>
<p>Since I have already discussed the challenges associated with OpenAI TPM and APIM patterns in a previous blog post, this post will focus directly on the solution.</p>
<h3 id="heading-load-balancing-features-in-apim-for-openai">Load Balancing Features in APIM for OpenAI</h3>
<p>The solution comprises two main configuration parts: load balancing and the circuit breaker. These configurations can be managed via two methods: the APIM portal (UI) and Bicep (Infrastructure as Code).</p>
<h3 id="heading-ia"> </h3>
<p>Configuration of APIM via Portal (UI)</p>
<ol>
<li><p><strong>Support for Load Balancing Techniques</strong>: Recently announced at the last Microsoft Build event, APIM now supports weighted and priority-based load balancing techniques.</p>
</li>
<li><p><strong>Creating Backend Resources</strong>: Begin by creating backend resources for each OpenAI endpoint in APIM. For instance, you can add two OpenAI endpoints from different regions.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722536100419/41e824e0-ff27-40ed-909e-b5bda19225a4.png" alt class="image--center mx-auto" /></p>
<ul>
<li><strong>Import New API</strong>: Use the OpenAI service template to import a new API. Select an OpenAI endpoint and base URL suffix to form the APIM URI path, e.g., <a target="_blank" href="https://myapim.azure-api.net/(basesuffix)"><code>https://myapim.azure-api.net/(basesuffix)</code></a>.</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722536891352/12864698-5c9b-4c6d-802e-81b554e3eccb.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Add each of your Azure OpenAI endpoints as a backend resource in API Management.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722648270803/15e19737-068b-41a9-a02b-666785669aff.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h3 id="heading-configuration-of-lb-amp-circuit-breaker-via-bicep">Configuration of LB &amp; Circuit Breaker via Bicep</h3>
<p>Due to the current lack of UI support for load balancer or circuit breaker configurations, these must be done via Bicep (Infrastructure as Code).</p>
<p><strong>Load Balancer Configuration</strong>:</p>
<ol>
<li><p><strong>Define Load Balancer Method</strong>: You can configure the load balancer to use either weight or priority, or both. In this demo, we will use weight-based load balancing, where one endpoint will have a higher weight than another.</p>
</li>
<li><p><strong>Code Snippet Adjustments</strong>: Modify the names of APIM and its backend pools, specifying the backend pools created in the previous step. This example uses two pools, but you can add more as needed.  </p>
<pre><code class="lang-yaml"> <span class="hljs-string">resource</span> <span class="hljs-string">symbolicname</span> <span class="hljs-string">'Microsoft.ApiManagement/service/backends@2023-09-01-preview'</span> <span class="hljs-string">=</span> {
   <span class="hljs-attr">name:</span> <span class="hljs-string">'(APIM-Name)/mybackendpool'</span>
   <span class="hljs-attr">properties:</span> {
     <span class="hljs-attr">description:</span> <span class="hljs-string">'Load balancer for multiple backends'</span>
     <span class="hljs-attr">type:</span> <span class="hljs-string">'Pool'</span>
     <span class="hljs-attr">pool:</span> {
       <span class="hljs-attr">services:</span> [
         {
           <span class="hljs-attr">id:</span> <span class="hljs-string">'/subscriptions/(subscripitonid)/resourceGroups/DefaultResourceGroup-SUK/providers/Microsoft.ApiManagement/service/opai-apim/backends/backend1-openai-endpoint'</span>
           <span class="hljs-attr">priority:</span> <span class="hljs-number">1</span>
           <span class="hljs-attr">weight:</span> <span class="hljs-number">1</span>
         }
         {
           <span class="hljs-attr">id:</span> <span class="hljs-string">'/subscriptions/(subscripitonid)/resourceGroups/DefaultResourceGroup-SUK/providers/Microsoft.ApiManagement/service/opai-apim/backends/backend2-openai-endpoint'</span>
           <span class="hljs-attr">priority:</span> <span class="hljs-number">1</span>
           <span class="hljs-attr">weight:</span> <span class="hljs-number">10</span>
         }
       ]
     }
   }
 }
</code></pre>
</li>
<li><p><strong>Deploy the Bicep Template</strong>: Deploy this Bicep template and look for changes in the backend pool to ensure it is deployed successfully.</p>
</li>
<li><p><strong>Update API with New Backend Name</strong>: Once deployed, you will see the new backend name in APIM backends. Copy that name and replace it in the API you created in the initial steps.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722652750957/19509848-b1e2-4162-b506-7730c68284d0.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-bash">  az deployment group create --resource-group Rresource-group-name --template-file apim.bicep
</code></pre>
</li>
<li><p><strong>Circuit Breaker Configuration</strong>:</p>
<ol>
<li><p><strong>Health Probe Setup</strong>: Deploy circuit breaker conditions for each backend pool resource, consisting of a 'failure condition' and a 'retry after' period.</p>
</li>
<li><p><strong>Failure Condition</strong>: Define the number of errors observed within a specific duration (e.g., 5 errors in 1 minute) and specify the HTTP status code ranges that are considered errors.</p>
</li>
<li><p><strong>Retry After</strong>: Specify the duration after which APIM can retry the health probe on a failed backend pool endpoint. In this configuration, error code ranges from 400-599 and a duration of 1 minute are set for each backend pool.  </p>
</li>
</ol>
</li>
</ol>
<pre><code class="lang-yaml">    <span class="hljs-string">resource</span> <span class="hljs-string">openaiprodbackend</span> <span class="hljs-string">'Microsoft.ApiManagement/service/backends@2023-09-01-preview'</span> <span class="hljs-string">=</span> {
      <span class="hljs-attr">name:</span> <span class="hljs-string">'APIM-name/nameofbackendpool'</span>
      <span class="hljs-attr">properties:</span> {
        <span class="hljs-attr">url:</span> <span class="hljs-string">'https://openaiendpointname.openai.azure.com/openai'</span>
        <span class="hljs-attr">protocol:</span> <span class="hljs-string">'http'</span>
        <span class="hljs-attr">circuitBreaker:</span> {
          <span class="hljs-attr">rules:</span> [
            {
              <span class="hljs-attr">failureCondition:</span> {
                <span class="hljs-attr">count:</span> <span class="hljs-number">1</span>
                <span class="hljs-attr">errorReasons:</span> [
                  <span class="hljs-string">'Server errors'</span>
                ]
                <span class="hljs-attr">interval:</span> <span class="hljs-string">'PT1M'</span>
                <span class="hljs-attr">statusCodeRanges:</span> [
                  {
                    <span class="hljs-attr">min:</span> <span class="hljs-number">400</span>
                    <span class="hljs-attr">max:</span> <span class="hljs-number">599</span>
                  }
                ]
              }
              <span class="hljs-attr">name:</span> <span class="hljs-string">'myBreakerRulePTU'</span>
              <span class="hljs-attr">tripDuration:</span> <span class="hljs-string">'PT1M'</span>
              <span class="hljs-attr">acceptRetryAfter:</span> <span class="hljs-literal">true</span>
            }
          ]
        }
        <span class="hljs-attr">description:</span> <span class="hljs-string">'OpenAI PTU'</span>
        <span class="hljs-attr">title:</span> <span class="hljs-string">'openai-PTU'</span>
      }
    }
</code></pre>
<ol start="6">
<li><h4 id="heading-verification-via-api">Verification via API</h4>
<p> Since there is no UI to verify the circuit breaker configuration, use the APIM backend REST API to confirm changes. Use Postman, refer to the APIM documentation for the validation tool, or see the <code>az rest</code> sketch after this list. Specify the backend name in <code>backendID</code> and the APIM name in <code>serviceName</code>.</p>
</li>
<li><p>The Bicep configuration is now complete. Next, test the load balancing &amp; circuit breaker policy in action.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722652277574/05465ca0-1bee-457a-913a-eb4c89f63101.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722652380261/27f6e668-5f7e-4a54-9951-d9e181593814.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
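<p>A hedged alternative to Postman for the verification step above is <code>az rest</code> against the same backend resource. Subscription ID is a placeholder, the resource group, APIM, and backend names follow the Bicep example earlier, and the API version matches the one used there:</p>
<pre><code class="lang-bash"># Read the backend definition, including its circuitBreaker rules, via the ARM REST API
az rest --method get --url "https://management.azure.com/subscriptions/your-subscription-id/resourceGroups/DefaultResourceGroup-SUK/providers/Microsoft.ApiManagement/service/opai-apim/backends/backend1-openai-endpoint?api-version=2023-09-01-preview"
</code></pre>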
<h3 id="heading-validationtest">Validation/Test</h3>
<p>To test the circuit breaker with load balancing:</p>
<ol>
<li><p><strong>Induce Errors</strong>: Ensure the OpenAI endpoint throws errors within the 400-599 range. Perform a stress test on the OpenAI endpoint preferred in weight/priority, which should result in an HTTP 429 error ("Server is Busy").</p>
</li>
<li><p><strong>Test APIM Response</strong>: Once the preferred endpoint is unresponsive, make a call via APIM to OpenAI and verify if the response is received from the second endpoint.</p>
</li>
<li><p><strong>Example</strong>: Perform a stress test on the OpenAI endpoint in EastUS, resulting in a 429 error state. Validate with an ad-hoc call to the EastUS OpenAI endpoint to confirm it fails with 429. At the same time, a POST request to APIM should work fine, as it is handled by the secondary backend, which is the OpenAI endpoint in UKSouth.</p>
</li>
<li><p><strong>Backend Endpoint 2</strong>: Backend endpoint 2 has a higher weight, located in EastUS. Perform a stress test on this endpoint using Postman.</p>
</li>
<li><p><strong>Stress Test Results</strong>: Stress test on the OpenAI endpoint in EastUS should result in a 429 error state.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722654684357/1a754a09-0480-4a8a-8f9c-4f8947c5492e.png" alt class="image--center mx-auto" /></p>
<p><strong>Ad-Hoc Call Validation</strong>: Validate with an ad-hoc call to the EastUS OpenAI endpoint to confirm it is failing with a 429 error.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722654809535/740d31f4-2a93-4f4f-9c2a-c91cec15b096.png" alt class="image--center mx-auto" /></p>
<p>Once the preferred endpoint is unresponsive, make a call via APIM to OpenAI and verify if the response is received from the second endpoint.</p>
<p>This POST request to APIM should work fine, as it is handled by the secondary backend, which is the OpenAI endpoint in UKSouth.  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722655026626/8e028d73-d001-471b-8dfd-e5b3861eac03.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-conclusion">Conclusion:</h3>
<p>In conclusion, using the new GenAI gateway features in API Management can greatly improve the reliability and performance of Azure OpenAI services. By setting up smart traffic routing and safety measures, we ensure that your services stay responsive even under heavy load. This approach helps balance the workload and quickly redirects traffic if something goes wrong, making sure your users have a smooth and reliable experience.  </p>
<p>Appendix:<br /><a target="_blank" href="https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/intelligent-load-balancing-with-apim-for-openai-weight-based/ba-p/4115155">Intelligent Load Balancing with APIM for OpenAI: Weight-Based Routing - Microsoft Community Hub</a></p>
]]></content:encoded></item><item><title><![CDATA[Cross Tenant Authentication for TDE via Workload Federation identity]]></title><description><![CDATA[Introduction:
In my recent engagement with a customer operating under a multi-tenant architecture, I encountered a shared service model where a centralized Hub included a Key Management Service (KMS) using a Dedicated Hardware Security Module (HSM). ...]]></description><link>https://blog.osshaikh.com/cross-tenant-authentication</link><guid isPermaLink="true">https://blog.osshaikh.com/cross-tenant-authentication</guid><category><![CDATA[Azure]]></category><category><![CDATA[crypto]]></category><category><![CDATA[Cryptography]]></category><category><![CDATA[keys]]></category><category><![CDATA[authentication]]></category><category><![CDATA[authorization]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Mon, 15 Jul 2024 07:02:02 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction:</h3>
<p>In my recent engagement with a customer operating under a multi-tenant architecture, I encountered a shared service model where a centralized Hub included a Key Management Service (KMS) using a Dedicated Hardware Security Module (HSM). In this setup, all encryption activities across the organization leveraged the HSM for Bring Your Own Key (BYOK) to secure various data services, including MySQL and MSSQL databases.</p>
<p>Given that the data services and the HSM resided in different Azure tenants, it was crucial for the managed identities associated with these data services to authenticate to the HSM cross-tenant with the necessary Role-Based Access Control (RBAC) privileges. However, managed identities typically do not support cross-tenant scenarios by default.</p>
<p>With the recent General Availability (GA) of the 'Workload Identity Federation' feature in Microsoft Entra, a new approach has emerged to address this challenge. In this article, I will explore how federated credentials can be leveraged to facilitate cross-tenant authentication, enabling secure and efficient data encryption across disparate Azure tenants.  </p>
<p><strong>Prerequisites</strong></p>
<ul>
<li><p>User Access Administrator + Contributor RBAC on the source Azure tenant subscription (data services subscription)</p>
</li>
<li><p>Microsoft Entra Application Administrator RBAC on the source tenant</p>
</li>
<li><p>Entra ID Application Administrator + Contributor RBAC on the destination subscription that hosts the HSM/AKV</p>
</li>
</ul>
<p><strong>Setup Microsoft Entra for App Registration</strong></p>
<ul>
<li><p>Make sure the data service (MSSQL) where encryption is needed is configured with a user-assigned managed identity</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721020053689/c7a1121c-b36d-4237-a789-908939fa8fe1.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Create an App Registration (SP) in Microsoft Entra ID (source tenant) with the multi-tenant option  </p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721020287467/c4c575f6-52bd-426f-b925-5d770607b55f.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>The next step is to connect the App Registration we just created with the user-assigned managed identity associated with the data service (SQL)</p>
</li>
<li><p>Enable the federated credentials feature on the App Registration with the CMK option, and select the user-assigned managed identity of MSSQL</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721020905217/f4d06bbe-14cc-43b3-ac0c-a254bee3e276.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Log in to the destination Azure tenant (where the AKV is hosted) &amp; create an App Registration using the client ID of the App Registration created in the earlier step  </p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721022513793/a129ea8f-fe39-411b-b4d8-2d6b5c77f25b.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>The App Registration is now linked with the existing App Registration in the source tenant</p>
</li>
<li><p>Next, assign the "Key Vault Crypto Officer" RBAC role to the App Registration using its name (a CLI sketch is included at the end of this post)</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721024473156/d9446269-f50c-45d3-b44f-080bb5661935.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Once permissions are assigned, switch back to the source tenant which hosts your data service, i.e., SQL databases.</p>
</li>
<li><p>Under the Identity blade of SQL, configure the federated identity &amp; select the App Registration name that we created initially</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721026196822/5703d713-f501-4e25-97ea-8f95a1677b59.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Now, under Data encryption for SQL, select 'user managed identity' &amp; 'federated identity', then paste the HSM/AKV key identifier URI from the other tenant &amp; save</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721026439672/1ba3e399-7293-4327-8821-9f41459a2a92.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Once the configuration is saved successfully, TDE is enabled for your databases using the HSM/AKV from the other tenant</p>
</li>
<li><p>This cross-tenant scenario wouldn't have been possible without leveraging the federated credential feature of Entra ID</p>
</li>
</ul>
</li>
</ul>
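<p>For teams that prefer scripting the RBAC step mentioned above, the role assignment in the destination tenant can also be made with the Azure CLI. A sketch with placeholder IDs, run while logged in to the destination tenant:</p>
<pre><code class="lang-bash"># Grant the app registration the Key Vault Crypto Officer role on the vault/HSM scope
az role assignment create --assignee "app-client-id" --role "Key Vault Crypto Officer" --scope "/subscriptions/destination-subscription-id/resourceGroups/kms-rg/providers/Microsoft.KeyVault/vaults/central-kms-vault"
</code></pre>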
]]></content:encoded></item><item><title><![CDATA[Karpenter: Run your Workloads upto 80% Off using Spot with AKS]]></title><description><![CDATA[Introduction
At last year's KubeCon North America, Microsoft announced the adoption of Karpenter in Azure Kubernetes Service (AKS) as an alternative to the Cluster Autoscaler (CA), referred to as Node Autoprovisioning (NAP). While Cluster Autoscaler ...]]></description><link>https://blog.osshaikh.com/karpenter-aks-spot-nodes</link><guid isPermaLink="true">https://blog.osshaikh.com/karpenter-aks-spot-nodes</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Azure]]></category><category><![CDATA[k8s]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Tue, 21 May 2024 15:26:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1716374995785/97b4a3ae-7531-438f-b377-f32c894f4900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction"><strong>Introduction</strong></h3>
<p>At last year's KubeCon North America, Microsoft announced the adoption of Karpenter in Azure Kubernetes Service (AKS) as an alternative to the Cluster Autoscaler (CA), referred to as Node Autoprovisioning (NAP). While Cluster Autoscaler has been the default node scaler in AKS/Kubernetes, there have been significant challenges that led to the adoption of Karpenter. This post delves into these challenges and explores how Karpenter addresses them.</p>
<h3 id="heading-challenges-with-cluster-autoscaler">Challenges with Cluster Autoscaler</h3>
<p>Here is Node Autosclaing flow chart for Cluster-Autoscaler</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716374617575/e812c05e-9d7f-4b75-b335-1554e82bdb05.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Limited to VMSS Groups:</strong> Cluster Autoscaler can only operate with Virtual Machine Scale Sets (VMSS) in AKS. Each VMSS consists of a specific group of VM instances with a specific VM SKU, hardware, and CPU:Memory ratio (e.g., Standard D4sv5 with 4 CPUs and 16 GB RAM).</p>
</li>
<li><p><strong>Node Latency:</strong> CA triggers the node pool API, which calls the VMSS instance API. This scaling process has latency, taking over a minute for a node to be ready in AKS.</p>
</li>
<li><p><strong>Node Pool Constraints:</strong> When deploying new pods, if the existing node capacity is exhausted, CA attempts to spin up a new node of the same VMSS SKU type. If that instance is unavailable, pods remain in a pending state.</p>
</li>
<li><p><strong>Scalability Limitations:</strong> CA can only scale up based on specific node pool SKU VMSS availability. It cannot leverage the capacity of other VM SKUs even if they have available resources.</p>
</li>
</ul>
<h3 id="heading-introducing-karpenter-node-autoprovisioning"><strong>Introducing Karpenter (Node Autoprovisioning)</strong></h3>
<p>Karpenter is an efficient node autoscaler for Kubernetes clusters, designed to optimize performance and cost. It can scale up and down worker nodes faster than Cluster Autoscaler and can launch appropriate individual nodes without creating traditional node groups in AKS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1716374666038/b08fe3a7-407f-4374-b2c4-21d4f5aac471.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Key Features of Karpenter:</strong></p>
<ul>
<li><p><strong>Efficiency:</strong> Faster scaling of Kubernetes nodes.</p>
</li>
<li><p><strong>Flexibility:</strong> Launches nodes without needing VMSS.</p>
</li>
<li><p><strong>Cost Optimization:</strong> Reduces overall costs and helps with patching of node images and Kubernetes versions.</p>
</li>
<li><p>NodePool YAML-based config which defines what types of nodes it can provision</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-handling-disruptions">Handling Disruptions</h3>
<p>The Disruption Controller is responsible for terminating/replacing nodes in the Kubernetes cluster.<br />It uses one of three automated methods to decide which nodes to disrupt:</p>
<ul>
<li>Expiration: Karpenter will mark nodes as expired and disrupt them after they have lived a set number of seconds. This parameter acts as a TTL for Kubernetes nodes</li>
</ul>
<pre><code class="lang-bash">spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 300s
</code></pre>
<ul>
<li><p>ConsolidateAfter: used to configure the disruption interval, i.e. the amount of time Karpenter should wait before considering another disruption cycle</p>
</li>
<li><p>Consolidation: actively reduces cluster cost by analyzing nodes.<br />  The consolidation policy has two modes:<br />  a) WhenEmpty: Karpenter will only disrupt nodes with no workload pods<br />  b) WhenUnderutilized: Karpenter will attempt to remove/replace nodes when they are underutilised</p>
</li>
</ul>
<pre><code class="lang-bash">apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
</code></pre>
<h3 id="heading-enable-napkarpenter-on-aks">Enable NAP(Karpenter) on AKS</h3>
<p>There are few pre requisites to enable NAP on AKS</p>
<ul>
<li><p>Install the Azure CLI with the preview extension (version greater than 0.5.17)</p>
</li>
<li><p>Register the NAP feature flag called "NodeAutoProvisioningPreview" (see the commands after this list)</p>
</li>
<li><p>AKS with network configuration as Cilium + Overlay</p>
</li>
</ul>
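<p>The preview extension and feature flag from the list above can be set up with a few CLI calls; the extension name below is the standard aks-preview one, and the subscription context is assumed to be already selected:</p>
<pre><code class="lang-bash"># Install/refresh the AKS preview CLI extension
az extension add --name aks-preview
az extension update --name aks-preview

# Register the Node Autoprovisioning preview feature and re-register the provider
az feature register --namespace "Microsoft.ContainerService" --name "NodeAutoProvisioningPreview"
az feature show --namespace "Microsoft.ContainerService" --name "NodeAutoProvisioningPreview" --query "properties.state"
az provider register --namespace Microsoft.ContainerService
</code></pre>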
<p><strong>Enable NAP on existing AKS cluster</strong><br />Make sure the existing AKS cluster has the 'Azure' network plugin with Cilium as the network policy. The key thing in this command is the flag '--node-provisioning-mode Auto', which sets NAP as the default node autoscaler:</p>
<pre><code class="lang-bash">az aks update --name aksclustername --resource-group rgname --node-provisioning-mode Auto
</code></pre>
<p>Deploy NAP with new AKS cluster</p>
<pre><code class="lang-bash">az aks create --name aksclustername--resource-group rgname--node-provisioning-mode Auto --network-plugin azure --network-plugin-mode overlay --network-dataplane cilium
</code></pre>
<p><strong>Verify Karpenter Enablement:</strong></p>
<pre><code class="lang-bash">kubectl api-resources | grep -e aksnodeclasses -e nodeclaims -e nodepools

aksnodeclasses                     aksnc,aksncs                        karpenter.azure.com/v1alpha2           <span class="hljs-literal">false</span>        AKSNodeClass
nodeclaims                                                             karpenter.sh/v1beta1                   <span class="hljs-literal">false</span>        NodeClaim
nodepools                                                              karpenter.sh/v1beta1                   <span class="hljs-literal">false</span>        NodePool
</code></pre>
<h3 id="heading-disabling-cluster-autoscaler"><strong>Disabling Cluster-Autoscaler</strong></h3>
<p>To switch from Cluster-Autoscaler to Karpenter, disable Cluster-Autoscaler on your AKS cluster:</p>
<pre><code class="lang-bash">az aks update --name aksclustername --resource-group aksrg --disable-cluster-autoscaler
</code></pre>
<h3 id="heading-deploying-a-sample-application"><strong>Deploying a Sample Application</strong></h3>
<p>To see Node-Autoprovisioning in action, deploy a sample application:</p>
<pre><code class="lang-bash">osama [ ~ ]$ kubectl get nodes
NAME                                STATUS   ROLES   AGE     VERSION
aks-default-h2jxh                   Ready    agent   35m     v1.27.9
aks-nodepool1-41633911-vmss000000   Ready    agent   3d19h   v1.27.9
</code></pre>
<p>Scale the replicas of the Vote application to trigger scale-out events:</p>
<pre><code class="lang-bash">osama [ ~ ]$ kubectl scale deployment azure-vote-front --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-front scaled
osama [ ~ ]$ kubectl scale deployment azure-vote-back --replicas=12 -n karpenter-demo-ns
deployment.apps/azure-vote-back scaled
</code></pre>
<p>Verify node autoscaling by reading the Karpenter events using the kubectl command below:</p>
<pre><code class="lang-bash">kubectl get events -A --field-selector <span class="hljs-built_in">source</span>=karpenter --sort-by=<span class="hljs-string">'.lastTimestamp'</span> -n 10
NAMESPACE           LAST SEEN   TYPE     REASON                  OBJECT                                  MESSAGE
default             50m         Normal   Unconsolidatable        nodeclaim/default-95f54                 SpotToSpotConsolidation is disabled, can<span class="hljs-string">'t replace a spot node with a spot node
default             50m         Normal   Unconsolidatable        node/aks-default-95f54                  SpotToSpotConsolidation is disabled, can'</span>t replace a spot node with a spot node
default             38m         Normal   DisruptionBlocked       nodepool/default                        No allowed disruptions due to blocking budget
default             5m33s       Normal   Unconsolidatable        nodeclaim/default-h2jxh                 Can<span class="hljs-string">'t remove without creating 2 candidates
default             5m33s       Normal   Unconsolidatable        node/aks-default-h2jxh                  Can'</span>t remove without creating 2 candidates
default             2m12s       Normal   DisruptionBlocked       nodepool/system-surge                   No allowed disruptions due to blocking budget
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-bnq7p   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-gbwk6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-l2bgj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-nvc56   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-22glj   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-sxdl6   Pod should schedule on: nodeclaim/default-mrh7w
karpenter-demo-ns   63s         Normal   Nominated               pod/azure-vote-front-6855444955-t69w4   Pod should schedule on: nodeclaim/default-mrh7w
</code></pre>
<h3 id="heading-customise-karpenter-config"><strong>Customise Karpenter Config</strong></h3>
<p>Karpenter leverages a new resource type (Kind) in Kubernetes, i.e., NodePools:</p>
<ul>
<li><p>Customise Nodepools: Specific specific VM series or VM family or even Specific CPU or Memory ratio.</p>
</li>
<li><p>Select node based on features sets like GPU enable or Network Acceleration</p>
</li>
<li><p>Define the CPU architecture, either ARM or AMD, based on the capability required by a specific workload.</p>
</li>
<li><p>Architect your nodes for resiliency by configuring zone topology (see the snippet after this list).</p>
</li>
<li><p>Limit the amount of CPU &amp; memory that can be utilised by the nodes at the NodePool level.</p>
</li>
</ul>
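<p>As an example of the zone topology point above, a NodePool requirement can pin nodes to specific availability zones using the standard zone label (a minimal sketch; the zone names shown are illustrative for East US):</p>
<pre><code class="lang-bash">      requirements:
      # Provision nodes only in the listed availability zones
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eastus-1
        - eastus-2
        - eastus-3
</code></pre>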
<p>Here is the default NodePool YAML for Karpenter (NAP), which configures the node SKU types and capacity, limits the NodePool's total CPU and memory, and sets a weight in case of multiple NodePools:</p>
<pre><code class="lang-bash">apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 10s
  template:
    spec:
      nodeClassRef:
        name: default

      <span class="hljs-comment"># Requirements that constrain the parameters of provisioned nodes.</span>
      <span class="hljs-comment"># These requirements are combined with pod.spec.affinity.nodeAffinity rules.</span>
      <span class="hljs-comment"># Operators { In, NotIn, Exists, DoesNotExist, Gt, and Lt } are supported.</span>
      <span class="hljs-comment"># https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#operators</span>
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - ondemand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - E
        - D
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_E2s_v5
        - Standard_D4s_v3
  limits:
    cpu: <span class="hljs-string">"1000"</span>
    memory: 1000Gi
  weight: 100
</code></pre>
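<p>Assuming the manifest above is saved locally as nodepool.yaml, it can be applied and inspected like any other Kubernetes resource:</p>
<pre><code class="lang-bash">kubectl apply -f nodepool.yaml
kubectl get nodepool default
</code></pre>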
<h3 id="heading-using-spot-node-with-karpenter">Using Spot Node with Karpenter</h3>
<ul>
<li><p>Add a toleration in the sample AKS Vote application, i.e. "karpenter.sh/disruption:NoSchedule", which comes by default on Spot nodes provisioned with an AKS cluster</p>
</li>
<li><p>Please refer to my GitHub <a target="_blank" href="https://github.com/Osshaikh/Karpenter-AKS-Spot">repo</a> for the application YAML and a sample NodePool config</p>
<pre><code class="lang-bash">  spec:
        nodeSelector:
          <span class="hljs-string">"kubernetes.io/os"</span>: linux
        tolerations:
        - key: <span class="hljs-string">"kubernetes.azure.com/scalesetpriority"</span>
          operator: <span class="hljs-string">"Equal"</span>
          value: <span class="hljs-string">"spot"</span>
          effect: <span class="hljs-string">"NoSchedule"</span>
        containers:
        - name: azure-vote-front
          image: mcr.microsoft.com/azuredocs/azure-vote-front:v1
</code></pre>
</li>
<li><p>Scale down your application replicas to allow Karpenter to evict existing on-demand nodes and replace them with Spot nodes:</p>
<pre><code class="lang-bash">  osama [ ~/karpenter ]$ kubectl get nodes
  NAME                                STATUS   ROLES   AGE     VERSION
  aks-nodepool1-41633911-vmss000000   Ready    agent   3d21h   v1.27.9
  aks-nodepool1-41633911-vmss00000b   Ready    agent   24m     v1.27.9

  osama [ ~/karpenter ]$ kubectl get pods -n karpenter-demo-ns -o wide
  No resources found <span class="hljs-keyword">in</span> karpenter-demo-ns namespace.

  osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-back --replicas=10 -n karpenter-demo-ns
  deployment.apps/azure-vote-back scaled
  osama [ ~/karpenter ]$ kubectl scale deployment azure-vote-front --replicas=10 -n karpenter-demo-ns
  deployment.apps/azure-vote-front scaled
  osama [ ~/karpenter ]$
</code></pre>
</li>
<li><p>Deploy and scale the vote application replicas so that Karpenter spins up Spot nodes based on the NodePool configuration and schedules the pods on them after the tolerations are validated.</p>
</li>
<li><p>Karpenter spins up new Spot nodes and nominates them for scheduling the sample vote app:</p>
<pre><code class="lang-bash">
  osama [ ~/karpenter ]$ kubectl get events -A --field-selector <span class="hljs-built_in">source</span>=karpenter --sort-by=<span class="hljs-string">'.lastTimestamp'</span>
  NAMESPACE           LAST SEEN   TYPE      REASON                       OBJECT                                    MESSAGE
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-pz8sp      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-ckdcq      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-v9nqj      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-vswvs      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-lnxmp      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-jc2jz      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-hwnbh      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-r7msb      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-96lm9      Pod should schedule on: nodeclaim/default-52gbg
  karpenter-demo-ns   104s        Normal    Nominated                    pod/azure-vote-back-687ddb67bd-5qcvk      Pod should schedule on: nodeclaim/default-52gbg
  default             1s          Normal    DisruptionLaunching          nodeclaim/default-bkz6c                   Launching NodeClaim: Expiration/Replace
  default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-bkz6c                   Waiting on readiness to <span class="hljs-built_in">continue</span> disruption
  default             1s          Normal    DisruptionBlocked            nodepool/system-surge                     No allowed disruptions due to blocking budget
  default             1s          Normal    DisruptionWaitingReadiness   nodeclaim/default-5vp7x                   Waiting on readiness to <span class="hljs-built_in">continue</span> disruption
  default             1s          Normal    DisruptionLaunching          nodeclaim/default-5vp7x                   Launching NodeClaim: Expiration/Replace
</code></pre>
<h3 id="heading-configuring-multiple-nodepools"><strong>Configuring Multiple NodePools</strong></h3>
</li>
<li><p>To configure separate NodePools for Spot and On-Demand capacity:</p>
<p>  Spot nodes configure with E series VM "Standard E2s_v5" and OnDemand with D series VM as "Standard_D4s_v5"</p>
</li>
<li><p>In a multi-NodePool scenario, each NodePool needs to be configured with a 'weight' attribute; the NodePool with the highest weight is prioritized over the others. Here the Spot NodePool has weight 100 and the On-Demand NodePool has weight 60 (a sketch of the On-Demand NodePool is shown after the Spot NodePool output below).</p>
</li>
</ul>
<pre><code class="lang-bash">osama [ ~ ]$ kubectl get nodepool default -o yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - B
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_B2s_v2
  weight: 100
</code></pre>
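<p>As referenced above, a sketch of the companion On-Demand NodePool could look like the following, using the D-series SKU and weight 60 mentioned earlier (the disruption settings are illustrative):</p>
<pre><code class="lang-bash">apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - D
      - key: karpenter.azure.com/sku-name
        operator: In
        values:
        - Standard_D4s_v5
  # Lower weight than the Spot NodePool (100), so Spot capacity is preferred
  weight: 60
</code></pre>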
<ul>
<li><p>If we do not specify an explicit SKU name, Karpenter will consider the entire VM series.</p>
</li>
<li><p>To validate that the sample VoteApp is running on Spot nodes, use the following commands:</p>
</li>
<li><p>The output should indicate that the nodes are of capacity type "spot":</p>
<pre><code class="lang-bash">  osama [ ~ ]$ kubectl get pods -n karpenter-demo-ns -o wide
  NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
  azure-vote-back-687ddb67bd-w7ghm    1/1     Running   0          63m   10.244.3.11    aks-default-5cr5f   &lt;none&gt;           &lt;none&gt;
  azure-vote-front-6855444955-64558   1/1     Running   0          63m   10.244.3.168   aks-default-5cr5f   &lt;none&gt;           &lt;none&gt;
  osama [ ~ ]$ kubectl describe node aks-default-5cr5f | grep karpenter.sh
                      karpenter.sh/capacity-type=spot
                      karpenter.sh/initialized=<span class="hljs-literal">true</span>
                      karpenter.sh/nodepool=default
                      karpenter.sh/registered=<span class="hljs-literal">true</span>
                      karpenter.sh/nodepool-hash: 12393960163388511505
                      karpenter.sh/nodepool-hash-version: v2
</code></pre>
<h3 id="heading-simulating-spot-node-eviction"><strong>Simulating Spot Node Eviction</strong></h3>
<p>  To test the spot eviction scenario, simulate a spot eviction using the Azure CLI:</p>
<pre><code class="lang-bash">  osama [ ~ ]$ az vm simulate-eviction --resource-group MC_aks-lab_aks-karpenter_eastus --name aks-default-5cr5f
  osama [ ~ ]$ date
  Tue May 21 06:20:02 PM IST 2024
</code></pre>
</li>
<li><p>Monitor the availability of your VoteApp using a simple curl command:</p>
<pre><code class="lang-bash">  <span class="hljs-keyword">while</span> <span class="hljs-literal">true</span>; <span class="hljs-keyword">do</span> <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-subst">$(date)</span> <span class="hljs-subst">$(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2&gt;&amp;1 | grep 'HTTP')</span>"</span>; sleep 2; <span class="hljs-keyword">done</span>
</code></pre>
</li>
<li><p>After running the spot simulation, the existing node will be marked for termination, and a new Spot node will be created to schedule the VoteApp pods. Within less than a minute, the VoteApp should start responding with HTTP 200 status codes.</p>
</li>
<li><pre><code class="lang-bash">    root@MININT-8C81HDE:/home/osamaex <span class="hljs-keyword">while</span> <span class="hljs-literal">true</span>; <span class="hljs-keyword">do</span> <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-subst">$(date)</span> <span class="hljs-subst">$(curl -s -v -o /dev/null -w 'HTTP %{http_code}\n' http://voteapp.com 2&gt;&amp;1 | grep 'HTTP')</span>"</span>; sleep 2; <span class="hljs-keyword">done</span>
    Tue May 21 18:20:04 IST 2024 &gt; GET / HTTP/1.1
    &lt; HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:07 IST 2024 &gt; GET / HTTP/1.1
    &lt; HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:09 IST 2024 &gt; GET / HTTP/1.1
    &lt; HTTP/1.1 200 OK
    HTTP 200
    Tue May 21 18:20:12 IST 2024 HTTP 000  <span class="hljs-variable">$Failure</span>-Alert
    Tue May 21 18:21:14 IST 2024 &gt; GET / HTTP/1.1
    &lt; HTTP/1.1 200 OK                      <span class="hljs-variable">$Successful</span>-Response
    HTTP 200
    Tue May 21 18:22:58 IST 2024 &gt; GET / HTTP/1.1
    &lt; HTTP/1.1 200 OK
    HTTP 200
</code></pre>
</li>
<li><p>Check the events logged by Karpenter:</p>
</li>
<li><pre><code class="lang-bash">    kuctl get events -A --field-selector <span class="hljs-built_in">source</span>=karpenter --sort-by=<span class="hljs-string">'.lastTimestamp'</span>
</code></pre>
</li>
<li><p>Events logged by Karpenter while replacing the evicted Spot node:</p>
</li>
<li><pre><code class="lang-bash">    osama [ ~ ]$ 
    NAMESPACE           LAST SEEN   TYPE      REASON           OBJECT                                  MESSAGE
    default             23s         Warning   FailedDraining   node/aks-default-5cr5f                  Failed to drain node, 10 pods are waiting to be evicted
    karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-back-687ddb67bd-w7ghm    Evicted pod
    karpenter-demo-ns   22s         Normal    Evicted          pod/azure-vote-front-6855444955-64558   Evicted pod
    karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-back-687ddb67bd-tb2pv    Pod should schedule on: nodeclaim/default-6zkkl
    karpenter-demo-ns   21s         Normal    Nominated        pod/azure-vote-front-6855444955-7wzss   Pod should schedule on: nodeclaim/default-6zkkl
</code></pre>
</li>
<li><p>Verify that the pods are running on the new Spot node:</p>
<pre><code class="lang-bash">  kubectl get pods -n karpenter-demo-ns -o wide
  NAME                                READY   STATUS    RESTARTS   AGE   IP             NODE                NOMINATED NODE   READINESS GATES
  azure-vote-back-687ddb67bd-tb2pv    1/1     Running   0          18m   10.244.2.103   aks-default-6zkkl   &lt;none&gt;           &lt;none&gt;
  azure-vote-front-6855444955-7wzss   1/1     Running   0          18m   10.244.2.47    aks-default-6zkkl   &lt;none&gt;           &lt;none&gt;
</code></pre>
</li>
</ul>
<h3 id="heading-save-cost-by-utilizing-reserved-instance-vms"><strong>Save Cost by utilizing Reserved Instance VM's</strong></h3>
<ul>
<li><p>NodePool configuration allows you to specify different VM series along with multiple VM SKUs. Create a separate NodePool with the highest weight value and specify all Reserved Instance VM SKU families or explicit SKU names using the <code>karpenter.azure.com/sku-name</code> or <code>karpenter.azure.com/sku-family</code> parameter.</p>
<pre><code class="lang-bash">     spec:
        nodeClassRef:
          name: default
        requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
          - amd64
        - key: kubernetes.io/os
          operator: In
          values:
          - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
        - key: karpenter.azure.com/sku-family
          operator: In
          values:
          - D
        - key: karpenter.azure.com/sku-name
          operator: In
          values:
          - Standard_D2s_v3
          - Standard_D4s_v3
          - Standard_D8s_v3
          - Standard_D16s_v3
          - Standard_D32s_v3
          - Standard_D64s_v3
          - Standard_D96s_v3
    weight: 90
</code></pre>
</li>
</ul>
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>The adoption of Karpenter in AKS signifies a major advancement in node scaling efficiency, flexibility, and cost optimization. By addressing the limitations of the Cluster Autoscaler and introducing dynamic, rapid provisioning of nodes, Karpenter provides a robust solution for managing Kubernetes clusters. Its flexibility in handling different VM types, faster scaling capabilities, and cost optimization make it a valuable addition to Kubernetes cluster management. By leveraging Karpenter, organizations can achieve more responsive and cost-effective Kubernetes deployments.</p>
]]></content:encoded></item><item><title><![CDATA[Intelligent Load Balancing with APIM: Using Weight-Based Routing for Improved OpenAI Performance]]></title><description><![CDATA[Ever Since launch of ChatGPT, demand for OpenAI GPT Models has increased exponentially.Due such vast demand in short span of time, its been challenging for customer to get thier desired capacity in thier respective region
In that case my recommendati...]]></description><link>https://blog.osshaikh.com/intelligent-lb-apim-openai</link><guid isPermaLink="true">https://blog.osshaikh.com/intelligent-lb-apim-openai</guid><category><![CDATA[openai]]></category><category><![CDATA[AI]]></category><category><![CDATA[llm]]></category><category><![CDATA[Azure]]></category><category><![CDATA[APIs]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Mon, 08 Apr 2024 15:10:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712609399661/08a21a9a-d76c-48c0-8d37-43c7584e7c94.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever since the launch of ChatGPT, demand for OpenAI GPT models has increased exponentially.<br />Due to such vast demand in a short span of time, it has been challenging for customers to get their desired capacity in their respective regions.</p>
<p>In that case, my recommendation has been to deploy multiple OpenAI instances with the <strong>S0 plan</strong> (<strong>token-based consumption</strong>) in any region where capacity is available, and then use a load balancer as a facade to distribute traffic across all your Azure OpenAI endpoints.</p>
<p>Each OpenAI model has its own limit, called the <strong>token limit</strong>, which is measured per minute (TPM). In this case I am referring to the <strong>GPT-3.5</strong> model, which comes with a maximum TPM limit of 300K for a given region.</p>
<p>Since there is a surge in demand from customers all over the world for LLM models like GPT, capacity for OpenAI instances is limited in each region.</p>
<p>Many customers, especially D2C, eventually need over 700-800K+ tokens for their use case.<br />In that scenario, customers tend to deploy OpenAI instances with the S0 plan across multiple locations,<br />although the TPM limit is not the same for every instance because of overall capacity constraints.</p>
<p>The following is a real-world example of a customer that uses multiple OpenAI instances with a different TPM limit for each region:</p>
<p>OpenAI-EastUS :300k<br />OpenAI-WestUS: 240K<br />OpenAI-SouthUK: 150K<br />OpenAI-Southindia:100k<br />OpenAI-CentralIndia:50K</p>
<h3 id="heading-why-not-use-existing-load-balancers-amp-how-it-is-different-from-other-existing-lb-technique">Why not use existing Load balancers &amp; how it is different from other existing LB technique?</h3>
<p>There are multiple load balancing option are now available using AppGW/FrontDoor or API management custom policies with algorithm based on roundrobin, random or even prirotiy based. Although there are some cons with each routing algorithm</p>
<p><strong>AppGW/FrontDoor</strong>:Roundrobin pattern to Distribte traffic across OpeAI endpoints.<br />Can be used in scenario when all OpenAI instance has same TPM Limit.But it cannot do health probe of OpenAI endpoints if its TPM exhausted &amp; received HTTP 429(Server is Busy) &amp; it will continue to forward request to throttled openai endpoint</p>
<p><strong>API management:</strong><br />One differentiting capablities with APIM using its policies that it is aware about endpoint HTTP 429 status code and offers Retry mechanism which allows send traffic to another available endpoints from backend pool until current endpoints becomes active/healthy again.</p>
<p><strong>APIM Policies for OpenAI:</strong><br /><strong>Roundrobin</strong>: if you have multiple OpenAI endpoint across location in APIM backendpool, it simply select any of backend based on roundrobin provided its not being throtlled at that moment &amp; it distribute traffic across sequentially doesnt honor OpenAI Endpoints TPM limits</p>
<p><strong>Random</strong>: Its mostly similar with roundrobin method, primary differnence is that it selected backend endpoint randomly from backend pool. &amp; its also doesnt honor openai TPM limit</p>
<p><strong>Priority</strong>: In this case if you have endpoint across many location, you can assigned priotiy order sequence either based on TPM limit or latency from your base region,<br />But even then all traffic would always forwarded to endpoint which has lowest priority &amp; rest of available openai endpoint would simply be waiting on standby unless lowest priority instance is throttled</p>
<p>Now in ths case customer requested for load balancing technique based on its TPM limit assigned against each OpenAI endpoint which should distribute traffic accordingly.</p>
<h3 id="heading-weightage"><strong>Weightage:</strong></h3>
<p>There is no direct feature in APIM for weight-based routing. I have tried to achieve the same result using custom logic with APIM policies.</p>
<p><strong>Selection Process:</strong><br />The backend logic used in this policy is based on a weighted selection method to choose an endpoint route for each attempt. Endpoints with higher weights are more likely to be chosen, but every endpoint route has at least some chance of being selected. This is because the selection is based on a random number that is compared against cumulative weights, which means the selection process inherently favors routes with higher weights due to the way cumulative weights are calculated and utilized. Let's break down the process with a simple example to clarify how routes with higher weights are more likely to be selected.</p>
<p>This image depicts APIM configured with the weight-based policy across the OpenAI endpoints:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712577551542/4cbbdb42-56b7-4471-863d-852a54cc75c2.png" alt class="image--center mx-auto" /></p>
<p>There are three variables used in the backend selection process:<br />1) <strong>Total weight</strong><br />2) <strong>Cumulative weight</strong><br />3) <strong>Random number</strong></p>
<p><strong>Example:</strong> Let's understand this policy using the image above.<br />Assume you have five OpenAI endpoints, as shown in the image, with the following weights:</p>
<p><strong>Endpoint A</strong>: Weight = 50<br /><strong>Endpoint B</strong>: Weight = 100<br /><strong>Endpoint C</strong>: Weight = 150<br /><strong>Endpoint D</strong>: Weight = 300<br /><strong>Endpoint E</strong>: Weight = 600</p>
<p><strong>Step 1: Calculate Total Weight</strong><br />First, you calculate the total weight of all endpoint routes, which in this case is 50+100+150+300+600=1200.</p>
<p><strong>Step 2: Generate Random Weight</strong></p>
<p>Next, a random number (let’s call it <code>randomWeight</code>) is generated between 1 and the total weight (inclusive). So, <code>randomWeight</code> is between 1 and 1200.</p>
<p><strong>Step 3: Calculate Cumulative Weights</strong><br />The cumulative weights are calculated to determine the ranges that correspond to each endpoint route. Here’s how they look based on the weights:</p>
<ul>
<li><p><strong>Cumulative Weight after Endpoint A</strong>: 50 (just the weight of A)</p>
</li>
<li><p><strong>Cumulative Weight after Endpoint B</strong>: 150 (the weight of A + B, 50 + 100)</p>
</li>
<li><p><strong>Cumulative Weight after Endpoint C</strong>: 300 (the weight of A + B + C, 50 + 100 + 150)</p>
</li>
<li><p><strong>Cumulative Weight after Endpoint D</strong>: 600 (the weight of A + B + C + D, 50 + 100 + 150 + 300)</p>
</li>
<li><p><strong>Cumulative Weight after Endpoint E</strong>: 1200 (the weight of A + B + C + D + E, 50 + 100 + 150 + 300 + 600)</p>
<p>  <strong>Weight Distribution Percentage Calculation:</strong></p>
<ul>
<li><p><strong>Route A</strong> 50/1200×100%=4.17%</p>
</li>
<li><p><strong>Route B</strong> 100/1200×100%=8.33%</p>
</li>
<li><p><strong>Route C</strong> 150/1200×100%=12.50%</p>
</li>
<li><p><strong>Route D</strong> 300/1200×100%=25.00%</p>
</li>
<li><p><strong>Route E</strong> 600/1200×100%=50.00%</p>
</li>
</ul>
</li>
</ul>
<p><strong>Step 4: Select the Endpoint Route Based on Random Weight</strong></p>
<p>The <code>randomWeight</code> determines which Endpoint route is selected:</p>
<ul>
<li><p>If <code>randomWeight</code> is between 1 and 50, <strong>Endpoint A</strong> is selected (4.17% chance)</p>
</li>
<li><p>If <code>randomWeight</code> is between 51 and 150, <strong>Endpoint B</strong> is selected (8.33% chance)</p>
</li>
<li><p>If <code>randomWeight</code> is between 151 and 300, <strong>Endpoint C</strong> is selected (12.50% chance)</p>
</li>
<li><p>If <code>randomWeight</code> is between 301 and 600, <strong>Endpoint D</strong> is selected (25% chance)</p>
</li>
<li><p>If <code>randomWeight</code> is between 601 and 1200, <strong>Endpoint E</strong> is selected (50% chance)</p>
</li>
</ul>
<h3 id="heading-opai-tpm-exhausted-scenario">OPAI TPM Exhausted Scenario:</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712578337623/2ddd62f0-34d5-4b46-b7b7-8d06a43168f2.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>In this image, the first endpoint is throttled with HTTP response code 429, and APIM is routing requests to the other available backends based on weight.</p>
</li>
<li><p>When an OpenAI endpoint starts getting traffic beyond its token capacity limit, it returns HTTP status code "<strong>429</strong>", which translates to "<strong>server is busy</strong>".</p>
</li>
<li><p>In that situation, APIM is configured to health-probe the endpoint based on the '429' response using Retry-After logic.</p>
</li>
<li><p>Once APIM gets HTTP 429 from an endpoint, it starts to prioritize the other endpoints in weight order and forwards most requests to the endpoint with the next highest weight.</p>
</li>
<li><p>Meanwhile, it continues to retry the actual throttled endpoint with health-probe requests after waiting for 60 seconds.</p>
</li>
</ul>
<h3 id="heading-openai-ptu-tier-provision-throughput-unit">OpenAI PTU Tier (Provision throughput Unit ):</h3>
<p>To Ensure OpenAI endpoint with PTU plan should be always be preferred. you would need adjust PTU instance endpoint in such way that it out weight other endpoint significantly, effectively making it the most likely choice under normal circumstances.</p>
<p><strong>1. Significantly Increase PTU endpoint Weight: I</strong>ncrease the weight of PTU endpoint so high compared to Endpoint E,D,C that the random selection virtually always lands on PTU based OpenAI endpoint.In Above example highest weight was 600 for Endpoint E, we could set PTU endpoint Weight something like 5000 or even more.This would make PTU endpoint overwhelmingly more likely to be selected.<br /><strong>2</strong>.<strong>Preferential Selection:</strong> Modify the selection logic to check for PTU endpoint availability first and choose it by default, only falling back to other endpoint A/B/C under certain conditions (e.g., Route C is down or throttling). This requires a bit of custom logic in your policy.<br />3.<strong>Use Priority Method: I</strong>f your system allows, Clubbed a Prority based system where routes are tried in order of priority rather than by weight. PTU would be given the highest priority, with endpoint A,B,C &amp; D as fallbacks.</p>
<h3 id="heading-setup-policy-in-apim">Setup Policy in APIM:</h3>
<ul>
<li><p>Create an APIM instance with the desired SKU if one does not already exist, and enable managed identity on APIM.</p>
</li>
<li><p>Make sure to have the same model (GPT-3.5/4) and deployment name across all OpenAI endpoints.</p>
</li>
<li><p>Grant the Azure RBAC role "Cognitive Services OpenAI User" on all OpenAI resources to the APIM managed identity via the 'Access Control (IAM)' blade.</p>
</li>
<li><p>Download the Azure OpenAI API version schema and integrate it with APIM using Import; you can read more on the OpenAI integration instructions <a target="_blank" href="https://techcommunity.microsoft.com/t5/apps-on-azure-blog/build-an-enterprise-ready-azure-openai-solution-with-azure-api/ba-p/3907562">here</a>.</p>
</li>
<li><p>Once the OpenAI API is imported, copy and paste this policy into the APIM policy editor.</p>
</li>
<li><p>Modify or replace lines #15-43 with your existing OpenAI instance endpoint details.</p>
</li>
<li><p>Assign a weight to each endpoint listed under routes as per its TPM limit.</p>
</li>
<li><p>Update your OpenAI model deployment name and routes index array sequence at lines #47-48 and also at line #129.</p>
</li>
<li><p>Perform a test run with APIM traces enabled to see the policy logic in action (a sample request is shown after the policy below).</p>
</li>
</ul>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">policies</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">inbound</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">base</span> /&gt;</span>
        <span class="hljs-comment">&lt;!-- Getting OpenAI clusters configuration --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">authentication-managed-identity</span> <span class="hljs-attr">resource</span>=<span class="hljs-string">"https://cognitiveservices.azure.com"</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">cache-lookup-value</span> <span class="hljs-attr">key</span>=<span class="hljs-string">"@("</span><span class="hljs-attr">oaClusters</span>" + <span class="hljs-attr">context.Deployment.Region</span> + <span class="hljs-attr">context.Api.Revision</span>)" <span class="hljs-attr">variable-name</span>=<span class="hljs-string">"oaClusters"</span> /&gt;</span>
        <span class="hljs-comment">&lt;!-- If we can't find the configuration, it will be loaded --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">choose</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">when</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Variables.ContainsKey("</span><span class="hljs-attr">oaClusters</span>") == <span class="hljs-string">true)</span>"&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"oaClusters"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                    JArray routes = new JArray();
                    JArray clusters = new JArray();
                    if(context.Deployment.Region == "</span><span class="hljs-attr">West</span> <span class="hljs-attr">Europe</span>" || <span class="hljs-attr">true</span>)
                    {
                        <span class="hljs-attr">routes.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">name</span>", "<span class="hljs-attr">openai</span> <span class="hljs-attr">1</span>" },
                            { "<span class="hljs-attr">location</span>", "<span class="hljs-attr">eastus</span>" },
                            { "<span class="hljs-attr">url</span>", "<span class="hljs-attr">https:</span>//<span class="hljs-attr">openaiendpoint-a.openai.azure.com</span>/" },
                            { "<span class="hljs-attr">isThrottling</span>", <span class="hljs-attr">false</span> }, 
                            { "<span class="hljs-attr">weight</span>", "<span class="hljs-attr">100</span>"},
                            { "<span class="hljs-attr">retryAfter</span>", <span class="hljs-attr">DateTime.MinValue</span> } 
                        });

                        <span class="hljs-attr">routes.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">name</span>", "<span class="hljs-attr">openai</span> <span class="hljs-attr">2</span>" },
                            { "<span class="hljs-attr">location</span>", "<span class="hljs-attr">UK</span> <span class="hljs-attr">SOuth</span>" },
                            { "<span class="hljs-attr">url</span>", "<span class="hljs-attr">https:</span>//<span class="hljs-attr">openaiendpoint-b.openai.azure.com</span>/" },
                            { "<span class="hljs-attr">isThrottling</span>", <span class="hljs-attr">false</span> },
                            { "<span class="hljs-attr">weight</span>", "<span class="hljs-attr">150</span>"},
                            { "<span class="hljs-attr">retryAfter</span>", <span class="hljs-attr">DateTime.MinValue</span> }
                        });

                        <span class="hljs-attr">routes.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">name</span>", "<span class="hljs-attr">openai</span> <span class="hljs-attr">3</span>" },
                            { "<span class="hljs-attr">location</span>", "<span class="hljs-attr">Central</span> <span class="hljs-attr">india</span>" },
                            { "<span class="hljs-attr">url</span>", "<span class="hljs-attr">https:</span>//<span class="hljs-attr">openendpointai-c.openai.azure.com</span>/" },
                            { "<span class="hljs-attr">isThrottling</span>", <span class="hljs-attr">false</span> },
                            { "<span class="hljs-attr">weight</span>", "<span class="hljs-attr">300</span>"},
                            { "<span class="hljs-attr">retryAfter</span>", <span class="hljs-attr">DateTime.MinValue</span> }
                        });

                        <span class="hljs-attr">clusters.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">deploymentName</span>", "<span class="hljs-attr">gpt35turbo16k</span>" },
                            { "<span class="hljs-attr">routes</span>", <span class="hljs-attr">new</span> <span class="hljs-attr">JArray</span>(<span class="hljs-attr">routes</span>[<span class="hljs-attr">0</span>], <span class="hljs-attr">routes</span>[<span class="hljs-attr">1</span>], <span class="hljs-attr">routes</span>[<span class="hljs-attr">2</span>]) }
                        });
                    }
                    <span class="hljs-attr">else</span>
                    {
                        //<span class="hljs-attr">Error</span> <span class="hljs-attr">has</span> <span class="hljs-attr">no</span> <span class="hljs-attr">clusters</span> <span class="hljs-attr">for</span> <span class="hljs-attr">the</span> <span class="hljs-attr">region</span>
                    }

                    <span class="hljs-attr">return</span> <span class="hljs-attr">clusters</span>;   
                }" /&gt;</span>
                <span class="hljs-comment">&lt;!-- Add cluster configurations to cache --&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">cache-store-value</span> <span class="hljs-attr">key</span>=<span class="hljs-string">"@("</span><span class="hljs-attr">oaClusters</span>" + <span class="hljs-attr">context.Deployment.Region</span> + <span class="hljs-attr">context.Api.Revision</span>)" <span class="hljs-attr">value</span>=<span class="hljs-string">"@((JArray)context.Variables["</span><span class="hljs-attr">oaClusters</span>"])" <span class="hljs-attr">duration</span>=<span class="hljs-string">"86400"</span> /&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">when</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">choose</span>&gt;</span>
        <span class="hljs-comment">&lt;!-- Getting OpenAI routes configuration based on deployment name, region and api revision --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">cache-lookup-value</span> <span class="hljs-attr">key</span>=<span class="hljs-string">"@(context.Request.MatchedParameters["</span><span class="hljs-attr">deployment-id</span>"] + "<span class="hljs-attr">Routes</span>" + <span class="hljs-attr">context.Deployment.Region</span> + <span class="hljs-attr">context.Api.Revision</span>)" <span class="hljs-attr">variable-name</span>=<span class="hljs-string">"routes"</span> /&gt;</span>
        <span class="hljs-comment">&lt;!-- If we can't find the configuration, it will be loaded --&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">choose</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">when</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Variables.ContainsKey("</span><span class="hljs-attr">routes</span>") == <span class="hljs-string">true)</span>"&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routes"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                    string deploymentName = context.Request.MatchedParameters["</span><span class="hljs-attr">deployment-id</span>"];
                    <span class="hljs-attr">JArray</span> <span class="hljs-attr">clusters</span> = <span class="hljs-string">(JArray)context.Variables[</span>"<span class="hljs-attr">oaClusters</span>"];
                    <span class="hljs-attr">JObject</span> <span class="hljs-attr">cluster</span> = <span class="hljs-string">(JObject)clusters.FirstOrDefault(o</span> =&gt;</span> o["deploymentName"]?.Value<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>() == deploymentName);
                    if(cluster == null)
                    {
                        //Error has no cluster matched the deployment name
                    }
                    JArray routes = (JArray)cluster["routes"];
                    return routes;
                }" /&gt;
                <span class="hljs-comment">&lt;!-- Set total weights for selected routes based on model --&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"totalWeight"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                int totalWeight = 0;
                JArray routes = (JArray)context.Variables["</span><span class="hljs-attr">routes</span>"];
                <span class="hljs-attr">foreach</span> (<span class="hljs-attr">JObject</span> <span class="hljs-attr">route</span> <span class="hljs-attr">in</span> <span class="hljs-attr">routes</span>)
                {
                    <span class="hljs-attr">totalWeight</span> += <span class="hljs-string">int.Parse(route[</span>"<span class="hljs-attr">weight</span>"]<span class="hljs-attr">.ToString</span>());
                }
                <span class="hljs-attr">return</span> <span class="hljs-attr">totalWeight</span>;
                }" /&gt;</span>
                <span class="hljs-comment">&lt;!-- Set cumulative weights for selected routes based on model--&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"cumulativeWeights"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                JArray cumulativeWeights = new JArray();
                int totalWeight = 0;
                JArray routes = (JArray)context.Variables["</span><span class="hljs-attr">routes</span>"];
                <span class="hljs-attr">foreach</span> (<span class="hljs-attr">JObject</span> <span class="hljs-attr">route</span> <span class="hljs-attr">in</span> <span class="hljs-attr">routes</span>)
                {
                    <span class="hljs-attr">totalWeight</span> += <span class="hljs-string">int.Parse(route[</span>"<span class="hljs-attr">weight</span>"]<span class="hljs-attr">.ToString</span>());
                    <span class="hljs-attr">cumulativeWeights.Add</span>(<span class="hljs-attr">totalWeight</span>);
                }
                <span class="hljs-attr">return</span> <span class="hljs-attr">cumulativeWeights</span>;
            }" /&gt;</span>
                <span class="hljs-comment">&lt;!-- Add cluster configurations to cache --&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">cache-store-value</span> <span class="hljs-attr">key</span>=<span class="hljs-string">"@(context.Request.MatchedParameters["</span><span class="hljs-attr">deployment-id</span>"] + "<span class="hljs-attr">Routes</span>" + <span class="hljs-attr">context.Deployment.Region</span> + <span class="hljs-attr">context.Api.Revision</span>)" <span class="hljs-attr">value</span>=<span class="hljs-string">"@((JArray)context.Variables["</span><span class="hljs-attr">routes</span>"])" <span class="hljs-attr">duration</span>=<span class="hljs-string">"86400"</span> /&gt;</span>
            <span class="hljs-tag">&lt;/<span class="hljs-name">when</span>&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">choose</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeIndex"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"-1"</span> /&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"remainingRoutes"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"1"</span> /&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">inbound</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">backend</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">retry</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Response != null &amp;&amp; (context.Response.StatusCode == 429 || context.Response.StatusCode &gt;= 500) &amp;&amp; ((Int32)context.Variables["</span><span class="hljs-attr">remainingRoutes</span>"]) &gt;</span> 0)" count="3" interval="0"&gt;
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeIndex"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
            Random random = new Random();
            int totalWeight = (Int32)context.Variables["</span><span class="hljs-attr">totalWeight</span>"];
            <span class="hljs-attr">JArray</span> <span class="hljs-attr">cumulativeWeights</span> = <span class="hljs-string">(JArray)context.Variables[</span>"<span class="hljs-attr">cumulativeWeights</span>"];
            <span class="hljs-attr">int</span> <span class="hljs-attr">randomWeight</span> = <span class="hljs-string">random.Next(1,</span> <span class="hljs-attr">totalWeight</span> + <span class="hljs-attr">1</span>);
            <span class="hljs-attr">int</span> <span class="hljs-attr">nextRouteIndex</span> = <span class="hljs-string">0;</span>
            <span class="hljs-attr">for</span> (<span class="hljs-attr">int</span> <span class="hljs-attr">i</span> = <span class="hljs-string">0;</span> <span class="hljs-attr">i</span> &lt; <span class="hljs-attr">cumulativeWeights.Count</span>; <span class="hljs-attr">i</span>++)
            {
                <span class="hljs-attr">if</span> (<span class="hljs-attr">randomWeight</span> &lt;= <span class="hljs-string">cumulativeWeights[i].Value</span>&lt;<span class="hljs-attr">int</span>&gt;</span>())
                {
                    nextRouteIndex = i;
                    break;
                }
            }
            return nextRouteIndex;
        }" /&gt;
            <span class="hljs-comment">&lt;!-- This is the main logic to pick the route to be used --&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeUrl"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@(((JObject)((JArray)context.Variables["</span><span class="hljs-attr">routes</span>"])[(<span class="hljs-attr">Int32</span>)<span class="hljs-attr">context.Variables</span>["<span class="hljs-attr">routeIndex</span>"]])<span class="hljs-attr">.Value</span>&lt;<span class="hljs-attr">string</span>&gt;</span>("url") + "/openai")" /&gt;
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeLocation"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@(((JObject)((JArray)context.Variables["</span><span class="hljs-attr">routes</span>"])[(<span class="hljs-attr">Int32</span>)<span class="hljs-attr">context.Variables</span>["<span class="hljs-attr">routeIndex</span>"]])<span class="hljs-attr">.Value</span>&lt;<span class="hljs-attr">string</span>&gt;</span>("location"))" /&gt;
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeName"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@(((JObject)((JArray)context.Variables["</span><span class="hljs-attr">routes</span>"])[(<span class="hljs-attr">Int32</span>)<span class="hljs-attr">context.Variables</span>["<span class="hljs-attr">routeIndex</span>"]])<span class="hljs-attr">.Value</span>&lt;<span class="hljs-attr">string</span>&gt;</span>("name"))" /&gt;
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"deploymentName"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@("</span><span class="hljs-attr">gpt35turbo16k</span>")" /&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">set-backend-service</span> <span class="hljs-attr">base-url</span>=<span class="hljs-string">"@((string)context.Variables["</span><span class="hljs-attr">routeUrl</span>"])" /&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">forward-request</span> <span class="hljs-attr">buffer-request-body</span>=<span class="hljs-string">"true"</span> /&gt;</span>
        <span class="hljs-tag">&lt;/<span class="hljs-name">retry</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">backend</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">outbound</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">base</span> /&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">outbound</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">on-error</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-name">base</span> /&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-name">on-error</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">policies</span>&gt;</span>
</code></pre>
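<p>Once the policy is saved, a quick smoke test against the APIM gateway helps confirm the routing behaviour before inspecting traces. The gateway host, API path suffix and key header below are assumptions based on a typical Azure OpenAI import; adjust them to match your APIM API settings:</p>
<pre><code class="lang-bash"># Replace the placeholders with your APIM gateway URL, API suffix and subscription key
curl -s -X POST \
  "https://&lt;your-apim&gt;.azure-api.net/openai/deployments/gpt35turbo16k/chat/completions?api-version=2023-05-15" \
  -H "api-key: &lt;your-apim-subscription-key&gt;" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}'
</code></pre>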
<h3 id="heading-policy-internals">Policy Internals:</h3>
<p>This policy incorporates several key configuration items that enable it to efficiently manage and route API requests to OpenAI ChatGPT instances, so highlighting a few of these configurations can provide valuable insight for readers.</p>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">authentication-managed-identity</span> <span class="hljs-attr">resource</span>=<span class="hljs-string">"https://cognitiveservices.azure.com"</span> /&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">cache-lookup-value</span> <span class="hljs-attr">key</span>=<span class="hljs-string">"@("</span><span class="hljs-attr">oaClusters</span>" + <span class="hljs-attr">context.Deployment.Region</span> + <span class="hljs-attr">context.Api.Revision</span>)" <span class="hljs-attr">variable-name</span>=<span class="hljs-string">"oaClusters"</span> /&gt;</span>
</code></pre>
<ul>
<li><p>Managed identity is used for authentication in this case instead of keys, so the first line enables the policy to leverage the managed identity.</p>
</li>
<li><p>The next block attempts to fetch pre-configured OpenAI cluster information from the cache, significantly reducing the overhead of reconstructing the configuration for each API call.</p>
</li>
</ul>
<pre><code class="lang-xml"> <span class="hljs-tag">&lt;<span class="hljs-name">choose</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">when</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Variables.ContainsKey("</span><span class="hljs-attr">oaClusters</span>") == <span class="hljs-string">true)</span>"&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"oaClusters"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                    JArray routes = new JArray();
                    JArray clusters = new JArray();
                    if(context.Deployment.Region == "</span><span class="hljs-attr">West</span> <span class="hljs-attr">Europe</span>" || <span class="hljs-attr">true</span>)
                    {
                        <span class="hljs-attr">routes.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">name</span>", "<span class="hljs-attr">openai-a</span>" },
                            { "<span class="hljs-attr">location</span>", "<span class="hljs-attr">india</span>" },
                            { "<span class="hljs-attr">url</span>", "<span class="hljs-attr">https:</span>//<span class="hljs-attr">openai-endpoint-a.openai.azure.com</span>/" },
                            { "<span class="hljs-attr">isThrottling</span>", <span class="hljs-attr">false</span> }, 
                            { "<span class="hljs-attr">weight</span>", "<span class="hljs-attr">600</span>"},
                            { "<span class="hljs-attr">retryAfter</span>", <span class="hljs-attr">DateTime.MinValue</span> } 
                        });
                        <span class="hljs-attr">clusters.Add</span>(<span class="hljs-attr">new</span> <span class="hljs-attr">JObject</span>()
                        {
                            { "<span class="hljs-attr">deploymentName</span>", "<span class="hljs-attr">gpt35turbo16k</span>" },
                            { "<span class="hljs-attr">routes</span>", <span class="hljs-attr">new</span> <span class="hljs-attr">JArray</span>(<span class="hljs-attr">routes</span>[<span class="hljs-attr">0</span>], <span class="hljs-attr">routes</span>[<span class="hljs-attr">1</span>], <span class="hljs-attr">routes</span>[<span class="hljs-attr">2</span>]) }
                        });
                    }
                    <span class="hljs-attr">else</span>
                    {
                        //<span class="hljs-attr">Error</span> <span class="hljs-attr">has</span> <span class="hljs-attr">no</span> <span class="hljs-attr">clusters</span> <span class="hljs-attr">for</span> <span class="hljs-attr">the</span> <span class="hljs-attr">region</span>
                    }

                    <span class="hljs-attr">return</span> <span class="hljs-attr">clusters</span>;</span>
</code></pre>
<ul>
<li><p>In the choose block, custom logic dynamically generates the configuration for the OpenAI clusters if it is not found in the cache.</p>
</li>
<li><p>Depending on the deployment region, the policy constructs clusters with specific attributes (e.g., name, location, URL, weight) reflecting each OpenAI instance's characteristics and intended traffic handling capacity.</p>
</li>
<li><p>Each route is assigned a specific weight, dictating its selection probability for handling incoming requests, thereby facilitating a balanced and efficient distribution of traffic based on the defined capacities or priorities.</p>
</li>
<li><p>Defines the OpenAI model (in this case, <code>gpt35turbo16k</code>) to which API requests should be routed. By clearly defining the <code>deploymentName</code>, this specification allows the policy to support different OpenAI models.</p>
</li>
</ul>
<pre><code class="lang-xml"><span class="hljs-tag">&lt;<span class="hljs-name">choose</span>&gt;</span>
            <span class="hljs-tag">&lt;<span class="hljs-name">when</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Variables.ContainsKey("</span><span class="hljs-attr">routes</span>") == <span class="hljs-string">true)</span>"&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routes"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                    string deploymentName = context.Request.MatchedParameters["</span><span class="hljs-attr">deployment-id</span>"];
                    <span class="hljs-attr">JArray</span> <span class="hljs-attr">clusters</span> = <span class="hljs-string">(JArray)context.Variables[</span>"<span class="hljs-attr">oaClusters</span>"];
                    <span class="hljs-attr">JObject</span> <span class="hljs-attr">cluster</span> = <span class="hljs-string">(JObject)clusters.FirstOrDefault(o</span> =&gt;</span> o["deploymentName"]?.Value<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>() == deploymentName);
                    if(cluster == null)
                    {
                        //Error has no cluster matched the deployment name
                    }
                    JArray routes = (JArray)cluster["routes"];
                    return routes;
                }" /&gt;
</code></pre>
<ul>
<li>In this block it checks if the <code>routes</code> configuration for the current API request's deployment (e.g., a specific OpenAI model indicated by <code>deployment-id</code>) is already available within the context variables. If so, it proceeds to confirm and utilize these endpoint routes for further processing.</li>
</ul>
<pre><code class="lang-xml"><span class="hljs-comment">&lt;!-- Set total weights for selected routes based on model --&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"totalWeight"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                int totalWeight = 0;
                JArray routes = (JArray)context.Variables["</span><span class="hljs-attr">routes</span>"];
                <span class="hljs-attr">foreach</span> (<span class="hljs-attr">JObject</span> <span class="hljs-attr">route</span> <span class="hljs-attr">in</span> <span class="hljs-attr">routes</span>)
                {
                    <span class="hljs-attr">totalWeight</span> += <span class="hljs-string">int.Parse(route[</span>"<span class="hljs-attr">weight</span>"]<span class="hljs-attr">.ToString</span>());
                }
                <span class="hljs-attr">return</span> <span class="hljs-attr">totalWeight</span>;
                }" /&gt;</span>
                <span class="hljs-comment">&lt;!-- Set cumulative weights for selected routes based on model--&gt;</span>
                <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"cumulativeWeights"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
                JArray cumulativeWeights = new JArray();
                int totalWeight = 0;
                JArray routes = (JArray)context.Variables["</span><span class="hljs-attr">routes</span>"];
                <span class="hljs-attr">foreach</span> (<span class="hljs-attr">JObject</span> <span class="hljs-attr">route</span> <span class="hljs-attr">in</span> <span class="hljs-attr">routes</span>)
                {
                    <span class="hljs-attr">totalWeight</span> += <span class="hljs-string">int.Parse(route[</span>"<span class="hljs-attr">weight</span>"]<span class="hljs-attr">.ToString</span>());
                    <span class="hljs-attr">cumulativeWeights.Add</span>(<span class="hljs-attr">totalWeight</span>);
                }
                <span class="hljs-attr">return</span> <span class="hljs-attr">cumulativeWeights</span>;
            }" /&gt;</span>
</code></pre>
<ul>
<li><p>Both of these variables are key elements in the functioning of this policy.</p>
</li>
<li><p><code>totalWeight</code>: by iterating over each route in the endpoint routes collection and summing up their weights, it represents the aggregate capacity of all endpoint routes. This total is crucial because it defines the range within which a random number can be generated to select a route proportionally based on its weight.</p>
</li>
<li><p><code>cumulativeWeights</code>: as the policy iterates through the endpoint routes, it progressively adds each route's weight to a running total, creating a series of increasing values. This cumulative approach allows the policy to determine which route to select by generating a random number within the totalWeight range and finding the segment (route) where this number falls.</p>
</li>
</ul>
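<p>To make the math concrete, here is a small standalone C# sketch. It is not part of the APIM policy itself; the 50/30/20 weights are hypothetical, and it simply reproduces the same selection logic to show that routes are picked roughly in proportion to their weights.</p>
<pre><code class="lang-csharp">// Standalone illustration of the weighted selection used by the policy.
// Weights 50/30/20 are hypothetical; totalWeight = 100, cumulativeWeights = 50, 80, 100.
using System;
using System.Linq;

class WeightedRouteDemo
{
    static void Main()
    {
        int[] weights = { 50, 30, 20 };
        int totalWeight = weights.Sum();

        // Build the cumulative weights exactly as the policy expression does.
        int[] cumulative = new int[weights.Length];
        for (int i = 0, running = 0; i &lt; weights.Length; i++)
        {
            running += weights[i];
            cumulative[i] = running;
        }

        // Simulate many selections to show the distribution follows the weights.
        var random = new Random();
        int[] hits = new int[weights.Length];
        for (int n = 0; n &lt; 100000; n++)
        {
            int randomWeight = random.Next(1, totalWeight + 1);
            for (int i = 0; i &lt; cumulative.Length; i++)
            {
                if (randomWeight &lt;= cumulative[i]) { hits[i]++; break; }
            }
        }

        Console.WriteLine(string.Join(", ", hits.Select(h =&gt; $"{h * 100.0 / 100000:F1}%")));
        // Prints approximately: 50.0%, 30.0%, 20.0%
    }
}
</code></pre>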
<pre><code class="lang-xml"> <span class="hljs-tag">&lt;<span class="hljs-name">retry</span> <span class="hljs-attr">condition</span>=<span class="hljs-string">"@(context.Response != null &amp;&amp; (context.Response.StatusCode == 429 || context.Response.StatusCode &gt;= 500) &amp;&amp; ((Int32)context.Variables["</span><span class="hljs-attr">remainingRoutes</span>"]) &gt;</span> 0)" count="3" interval="30"&gt;
            <span class="hljs-tag">&lt;<span class="hljs-name">set-variable</span> <span class="hljs-attr">name</span>=<span class="hljs-string">"routeIndex"</span> <span class="hljs-attr">value</span>=<span class="hljs-string">"@{
            Random random = new Random();
            int totalWeight = (Int32)context.Variables["</span><span class="hljs-attr">totalWeight</span>"];
            <span class="hljs-attr">JArray</span> <span class="hljs-attr">cumulativeWeights</span> = <span class="hljs-string">(JArray)context.Variables[</span>"<span class="hljs-attr">cumulativeWeights</span>"];
            <span class="hljs-attr">int</span> <span class="hljs-attr">randomWeight</span> = <span class="hljs-string">random.Next(1,</span> <span class="hljs-attr">totalWeight</span> + <span class="hljs-attr">1</span>);
            <span class="hljs-attr">int</span> <span class="hljs-attr">nextRouteIndex</span> = <span class="hljs-string">0;</span>
            <span class="hljs-attr">for</span> (<span class="hljs-attr">int</span> <span class="hljs-attr">i</span> = <span class="hljs-string">0;</span> <span class="hljs-attr">i</span> &lt; <span class="hljs-attr">cumulativeWeights.Count</span>; <span class="hljs-attr">i</span>++)
            {
                <span class="hljs-attr">if</span> (<span class="hljs-attr">randomWeight</span> &lt;= <span class="hljs-string">cumulativeWeights[i].Value</span>&lt;<span class="hljs-attr">int</span>&gt;</span>())
                {
                    nextRouteIndex = i;
                    break;
                }
            }
            return nextRouteIndex;
        }" /&gt;
</code></pre>
<ul>
<li><p>The <code>retry</code> policy segment inside the backend block handles scenarios where an Azure OpenAI instance fails due to throttling (HTTP 429) or server errors (HTTP status codes &gt;= 500). It also takes into account whether alternative endpoint routes are still available for retrying the request.</p>
</li>
<li><p>The maximum number of retry attempts is set to 3, meaning the policy will try up to three times to forward the request to a backend service before marking it as unhealthy. Retry attempts are made at an interval of 30 seconds, one after another.</p>
</li>
<li><p>A random number is generated in the range of 1 to the <code>totalWeight</code> of all available endpoint routes. This random weight determines which route is selected for the next retry attempt: the policy iterates through the cumulative weights of all endpoint routes stored in the <code>cumulativeWeights</code> variable.</p>
</li>
<li><p>The index of the selected route (<code>nextRouteIndex</code>) is determined and stored, telling API Management which route to use for the next retry attempt.</p>
</li>
<li><p>By incorporating a retry mechanism with weighted random route selection, the policy ensures that requests are not simply retried on the same route that might be experiencing issues, but are intelligently rerouted based on predefined weights reflecting each route's capacity.</p>
</li>
<li><p>The system effectively balances load across multiple routes, minimizing the impact of temporary issues on any one endpoint.</p>
</li>
</ul>
<h3 id="heading-appendix">Appendix</h3>
<p>For simple, easy-to-deploy load-balancing options on AppGW/AFD, refer to my <a target="_blank" href="https://github.com/Osshaikh/OpenAI-Demo">repo</a>.</p>
<p>For the overall reference architecture pattern for APIM &amp; OpenAI, refer to this <a target="_blank" href="https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-openai-architecture-patterns-and-implementation-steps/ba-p/3979934">doc</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Azure HSM: Navigating Compliance for FSI]]></title><description><![CDATA[Introduction
As a technology enthusiast I recently had the opportunity to dive deep into the world of Azure Managed Hardware Security Modules (HSMs) for FSI customer. These powerful cryptographic guardians play a pivotal role in helping Non-Banking F...]]></description><link>https://blog.osshaikh.com/azure-managed-hsm</link><guid isPermaLink="true">https://blog.osshaikh.com/azure-managed-hsm</guid><category><![CDATA[#FSI]]></category><category><![CDATA[Security]]></category><category><![CDATA[hsm]]></category><category><![CDATA[Cryptography]]></category><category><![CDATA[keys]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[Azure]]></category><category><![CDATA[keyvault]]></category><category><![CDATA[RBI]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Sun, 17 Mar 2024 03:42:22 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>As a technology enthusiast, I recently had the opportunity to dive deep into the world of <strong>Azure Managed Hardware Security Modules (HSMs)</strong> for an FSI customer. These powerful cryptographic guardians play a pivotal role in helping Non-Banking Financial Companies (NBFCs) meet the stringent compliance requirements, i.e. FIPS 140-2 Level 3, set by the Reserve Bank of India (RBI). In this blog post, I'll cover some best practices for implementing Azure Managed HSM, explore its practical applications, and guide you through its operational aspects.</p>
<h2 id="heading-what-is-managed-hsm-why-its-important-for-nbfc"><strong>What is Managed HSM? Why its important for NBFC</strong></h2>
<hr />
<p>A <strong>managed HSM</strong> is a single-tenant, highly available, and <strong>FIPS 140-2 Level 3 validated</strong> hardware security module. Imagine a secure vault within Azure, purpose-built to protect your most sensitive secrets. Whether you’re safeguarding financial transactions, securing healthcare data, or ensuring the integrity of critical applications, Managed HSM has your back.</p>
<ul>
<li><p><strong>Managed HSM</strong> establishes a <strong>cryptographic boundary</strong> for key material using a unique <strong>security domain</strong>. It ensures that <strong>Microsoft cannot access</strong> your keys within the HSM. Customers have full ownership and control over their cryptographic keys. It isolates your keys within the HSM, preventing unauthorized access.</p>
</li>
<li><p><strong>Security domain</strong> is an <strong>encrypted blob file</strong> unique to each managed HSM instance. It contains critical artifacts such as:<br />  <strong>HSM backup, User credentials, Signing key, Data encryption key.</strong></p>
</li>
</ul>
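<p>As a quick illustration, a Managed HSM instance can be provisioned with the Azure CLI. The names below are placeholders and the retention period is only an example; after creation, the HSM still has to be activated by downloading its security domain, as described above.</p>
<pre><code class="lang-bash"># Placeholder names; replace with your own resource group, HSM name and region.
# The initial administrator is the signed-in user here, but any object ID can be supplied.
oid=$(az ad signed-in-user show --query id -o tsv)

az keyvault create \
  --hsm-name "contoso-mhsm" \
  --resource-group "contoso-rg" \
  --location "centralindia" \
  --administrators "$oid" \
  --retention-days 28
</code></pre>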
<p><strong>Advantages of Managed HSM over standard Azure Key Vault (AKV):</strong></p>
<ol>
<li><p><strong>Granular Access Control</strong>:</p>
<ul>
<li><p><strong>Per-key permissions</strong> enable fine-grained control over access.</p>
</li>
<li><p><strong>Local RBAC model</strong> ensures designated HSM cluster administrators have full control.</p>
</li>
</ul>
</li>
<li><p><strong>Private Endpoints</strong>:</p>
<ul>
<li><p>Securely connect to Managed HSM from your application using <strong>private endpoints</strong>.</p>
</li>
<li><p>Ensures data privacy by avoiding public internet access.</p>
</li>
</ul>
</li>
<li><p><strong>FIPS 140-2 Level 3 Validated HSMs</strong>:</p>
<ul>
<li><p>Managed HSMs use <strong>Marvell Liquid Security HSM adapters</strong>.</p>
</li>
<li><p>Complies with stringent security standards.</p>
</li>
</ul>
</li>
<li><p><strong>Integrated Monitoring and Audit</strong>:</p>
<ul>
<li><p>Fully integrated with <strong>Azure Monitor</strong>.</p>
</li>
<li><p>Provides complete logs of all activity.</p>
</li>
<li><p>Use <strong>Azure Log Analytics</strong> for analytics and alerts (a CLI sketch follows this list).</p>
</li>
</ul>
</li>
<li><p><strong>Data Residency</strong>:</p>
<ul>
<li>Managed HSM ensures data doesn’t leave the region where the HSM instance is deployed.</li>
</ul>
</li>
<li><p><strong>Centralized Key Management</strong>:</p>
<ul>
<li>Manage critical keys across your organization in one place and follow the principle of <strong>least-privileged access</strong>.</li>
</ul>
</li>
</ol>
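<p>As a quick illustration of point 4, diagnostic logs can be routed to a Log Analytics workspace with the Azure CLI. The resource names below are placeholders, and the <code>AuditEvent</code> category is the one I used in my lab for Managed HSM audit logging.</p>
<pre><code class="lang-bash"># Placeholder names; the HSM and Log Analytics workspace must already exist.
hsm_id=$(az keyvault show --hsm-name "contoso-mhsm" --query id -o tsv)
law_id=$(az monitor log-analytics workspace show --resource-group "contoso-rg" --workspace-name "contoso-law" --query id -o tsv)

az monitor diagnostic-settings create \
  --resource "$hsm_id" \
  --name "mhsm-audit-logs" \
  --workspace "$law_id" \
  --logs '[{"category":"AuditEvent","enabled":true}]'
</code></pre>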
<h2 id="heading-operational-excellence-making-hsm-private-for-secure-access"><strong>Operational Excellence: Making HSM Private for secure access.</strong></h2>
<hr />
<p>Access to a managed HSM is controlled through two interfaces:</p>
<ul>
<li><p>Management plane: On the management plane, you manage the HSM itself. Operations in this plane include creating and deleting managed HSMs and retrieving managed HSM properties.</p>
</li>
<li><p>Data plane: On the data plane, you work with the data that's stored in a managed HSM, which is essentially the keys generated on the HSM or imported into it from a different key manager.</p>
</li>
</ul>
<blockquote>
<h3 id="heading-authorization">Authorization</h3>
</blockquote>
<p>There are two levels of permission required to work with an HSM.</p>
<p>Azure RBAC: covers all management plane operations on the HSM. Operations in this plane include create/delete, backup/restore, networking, and managing the security domain.</p>
<p>Local RBAC: role assignments at this level are scoped either to all keys or to a single key.<br />All key operations, i.e. create/retrieve/delete, are granted using local RBAC (a minimal CLI example follows).</p>
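<p>As a hedged sketch (the HSM name, user, and key name below are placeholders), a local RBAC role can be assigned on the data plane with the Azure CLI, scoped either to all keys or to a single key:</p>
<pre><code class="lang-bash"># Grant a principal the "Managed HSM Crypto User" role on all keys ("/" scope).
az keyvault role assignment create \
  --hsm-name "contoso-mhsm" \
  --role "Managed HSM Crypto User" \
  --assignee "user@contoso.com" \
  --scope "/"

# Or scope the same role to a single key.
az keyvault role assignment create \
  --hsm-name "contoso-mhsm" \
  --role "Managed HSM Crypto User" \
  --assignee "user@contoso.com" \
  --scope "/keys/storage-cmk"
</code></pre>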
<blockquote>
<h3 id="heading-networking">Networking</h3>
</blockquote>
<p>In the Networking section for HSM, the options are either 'Allow all networks' or 'Private endpoints with allow trusted services.'</p>
<p>As a general rule of thumb, always prefer private connections over public ones.</p>
<ul>
<li><p>All networks: Exposed over a public endpoint, accessible over the internet by default.</p>
</li>
<li><p>Private endpoint: It exposes the HSM only on your specific VNet; only resources in that VNet have access to the HSM data plane.<br />  The steps to create a private endpoint for HSM are similar to those for any Azure resource: specify the VNet and subnet for the PE where an application/service has line of sight to the HSM PE (a minimal CLI sketch follows this list).</p>
</li>
<li><p>One thing to note with a private endpoint is that <strong>it restricts access to the HSM data plane from the Azure ARM interface</strong>. You would need an <strong>Azure VM in the same VNet or a peered VNet</strong> to be able to access/manage HSM keys from the ARM interface, i.e. the portal/CLI.</p>
<p>  <em>Error observed accessing HSM via portal once PE is enabled for HSM.</em></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710599511438/d25f0076-4712-4dbf-ba18-81ecb125b549.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
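<p>A minimal CLI sketch for creating the private endpoint is below. All names are placeholders, and the <code>--group-id</code> of <code>managedhsm</code> is the private-link sub-resource I used in my lab; the <code>privatelink.managedhsm.azure.net</code> private DNS zone still needs to be linked separately so that names resolve to the PE.</p>
<pre><code class="lang-bash"># Placeholder names; the HSM, VNet and subnet must already exist.
hsm_id=$(az keyvault show --hsm-name "contoso-mhsm" --query id -o tsv)

az network private-endpoint create \
  --resource-group "contoso-rg" \
  --name "contoso-mhsm-pe" \
  --vnet-name "hub-vnet" \
  --subnet "pe-subnet" \
  --private-connection-resource-id "$hsm_id" \
  --group-id "managedhsm" \
  --connection-name "contoso-mhsm-pe-conn"
</code></pre>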
<h2 id="heading-operational-excellence-encryption-with-managed-hsm"><strong>Operational Excellence:</strong> Encryption with managed HSM</h2>
<hr />
<ul>
<li><p>Based on your requirements, use HSM keys to encrypt data at rest on Azure services such as Blob Storage, PostgreSQL, MySQL, etc.</p>
</li>
<li><p>Let's take the example of encrypting an existing Blob storage account with a CMK on HSM; we already have the HSM configured with a private endpoint &amp; the required RBAC role "Managed HSM Crypto User."</p>
</li>
<li><p>Notice that once we enable the PE on the HSM, we can't access the HSM data plane publicly, which means all operations on keys, including local RBAC management, are restricted from the portal/CLI.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710609867440/5a4dc533-4161-4db7-836f-9ad9d7ff647c.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>When trying to encrypt the storage account using the CMK option, after the HSM is selected, an error related to connectivity to the HSM data plane appears on the storage account blade.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710619255891/ccf6636b-576c-4b0a-9ccf-0d39af669350.png" alt class="image--center mx-auto" /></p>
<p>  Note: 'Allow trusted Microsoft services' is also enabled along with the PE on the HSM.</p>
</li>
<li><p>I have tested a couple more services, i.e. MySQL/PostgreSQL, with similar errors, which means this behaviour is common to all Azure data services that support encryption with a CMK on HSM.</p>
</li>
<li><p>The primary reason for this error: when we enable private networking on the HSM, it locks down network access and only allows access to the HSM via the private network, i.e. VNet-connected devices, to maximize security.</p>
</li>
<li><p>Even though the "Allow trusted Microsoft services" option is enabled, setting up encryption on Blob storage fails because the request to the HSM APIs goes via the end user's browser, not via the Storage service IP range.</p>
</li>
<li><p>When we access the HSM interface from the Azure portal, the user's browser interacts with the managed HSM API. Even when configuring encryption for other services via the portal, the user's browser is used as the client to interact with the HSM APIs.</p>
<p>  <em>Here is the error screenshot when setting encryption for a new Blob storage account.</em></p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710631597234/637611fc-ad6d-4105-9aab-801ac1cd754b.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Now, if an admin wants to perform operations on a private managed HSM, they need an Azure VM that has line-of-sight connectivity to the HSM private endpoint (a CLI-based alternative is sketched after the screenshot below).<br />  <em>In this screenshot, users are able to configure encryption on Blob storage via the Azure portal from an Azure VM.</em></p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710633603024/ae5bf6be-3273-4524-b188-ae1567438c54.png" alt="Using Azure VM users is able to configure encryption on Blob storage via Azure portal with Private managed HSM" /></p>
<h2 id="heading-resiliency-disaster-recovery-with-hsm">Resiliency: Disaster Recovery with HSM</h2>
<hr />
<ul>
<li><p>HSM offers a multi-region feature that allows data from the primary instance to replicate to a secondary instance.</p>
</li>
<li><p>Once an HSM replica is enabled in the secondary region, it functions as active-passive behind a backend Traffic Manager endpoint.</p>
</li>
<li><p>However, the replica instance isn't visible to users in the subscription; rather, it functions in the backend as an extension of the primary instance.</p>
</li>
<li><p>Failover of the HSM is managed by Azure in case of a service outage affecting the primary instance.</p>
</li>
<li><p>Since the secondary instance is not visible in the portal/CLI interfaces, many data-related services such as PostgreSQL and MySQL require the keystore/HSM to be available locally in the DR region.</p>
</li>
</ul>
<p>Recommendation: deploy another HSM instance rather than a replica if you have configured the HSM with many native Azure DB services.</p>
<h2 id="heading-disaster-recovery-backup-restore-on-managed-hsm">Disaster Recovery: Backup Restore on Managed HSM</h2>
<hr />
<p>The easiest way is to set up another HSM instance in the DR region and perform a complete backup &amp; restore. The HSM stores its backup on Blob storage, so it needs access to the storage account with sufficient RBAC granted to a security principal, i.e. a managed identity.</p>
<p>Note: you will need the <strong>security domain</strong> of the primary HSM while restoring the backup.</p>
<ol>
<li><p>Create a managed identity and assign the "Storage Blob Data Contributor" RBAC role on the SA.</p>
</li>
<li><p>Associate the managed identity with your primary HSM to enable backup write permission on the SA (storage account).</p>
<pre><code class="lang-json">  az keyvault update-hsm --hsm-name primary-hsm --mi-user-assigned <span class="hljs-string">"/subscriptions/subid/resourcegroups/rgname/providers/Microsoft.ManagedIdentity/userAssignedIdentities/manageidentityname"</span>
</code></pre>
</li>
<li><p>Once the MI is associated with the HSM, create a container on the SA for your HSM backups &amp; trigger the backup using Az CLI (backup/restore options are not available in the Azure portal).</p>
<pre><code class="lang-abap"> az keyvault backup <span class="hljs-keyword">start</span> --use-managed-identity true --hsm-name <span class="hljs-keyword">primary</span>-hsm --storage-account-name hsmbackupsaname --blob-container-name conatiner1  --subscription Subs-guid
</code></pre>
</li>
<li><p>Look for the success message in the response from the backup command. Now create another vanilla HSM instance in the DR region, but don't activate it.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710643719339/b065a537-195e-44c8-ab2f-44310fb42f1b.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Normally, after creating an HSM instance, we initialize and download the new HSM's security domain as mentioned at the start. However, since we're executing a DR procedure, we will enable security domain recovery mode on this HSM.</p>
<pre><code class="lang-abap"> az keyvault security-domain <span class="hljs-keyword">init</span>-recovery --hsm-name secondry-hsm --sd-exchange-<span class="hljs-keyword">key</span> hsmrecoveryfilename
</code></pre>
</li>
<li><p>Collect/download the security domain of the primary HSM along with 2 of its 3 keys, per the quorum configuration (a hedged CLI sketch follows this list).</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710643613283/13daf4c6-759c-44bb-ab38-73ae4bbe5da8.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Before we trigger the restore, make sure the secondary HSM has access to the Blob storage where the backup is stored.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710643897055/25a89c7e-7069-4f29-bf36-f6acaf337e27.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Now initiate the restore of the backup on the secondary HSM. Notice that I received an error, which means that before any backup can be restored onto an HSM, a backup of the target HSM must have been triggered within the previous 30 minutes.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710646430984/723fce29-959c-4e5b-8a7e-78b53a57967e.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>Once that backup completed, we re-initiated the restoration on the secondary HSM, and this time it completed successfully.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710646581635/9b42a3fd-e999-41dd-ae7e-2ccb8c49e423.png" alt class="image--center mx-auto" /></p>
<p> I have tried to cover the important areas around the implementation of HSM that came up during discussions with FSI customers. Feel free to share your thoughts or questions in the comments &amp; for more details on the HSM solution, refer to the <a target="_blank" href="https://learn.microsoft.com/en-us/azure/key-vault/managed-hsm/overview">official documentation</a>.</p>
</li>
</ol>
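<p>For completeness, here is a hedged sketch of the commands behind steps 6 and 8 above. The certificate and file names are placeholders, and flag names can vary between Azure CLI versions, so treat this as an outline rather than a copy-paste recipe.</p>
<pre><code class="lang-bash"># Step 6: download the security domain of the primary HSM, wrapped with 3 keys and a quorum of 2.
az keyvault security-domain download \
  --hsm-name primary-hsm \
  --sd-wrapping-keys cert0.cer cert1.cer cert2.cer \
  --sd-quorum 2 \
  --security-domain-file primary-hsm-SD.json

# Step 8: restore the backup onto the secondary HSM from the container used earlier.
az keyvault restore start \
  --use-managed-identity true \
  --hsm-name secondry-hsm \
  --storage-account-name hsmbackupsaname \
  --blob-container-name conatiner1 \
  --backup-folder mhsm-primary-hsm-&lt;timestamp&gt;
</code></pre>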
]]></content:encoded></item><item><title><![CDATA[Managed DNS solution for Hybrid customers]]></title><description><![CDATA[Azure DNS Private Resolver is a first-party Azure managed network component that facilitates DNS name resolution integration between On-premises to Azure and vice-versa.
Historically, on Azure cloud customers needed to rely on Infrastructure-as-a-Ser...]]></description><link>https://blog.osshaikh.com/managed-dns-solution-for-hybrid-customers</link><guid isPermaLink="true">https://blog.osshaikh.com/managed-dns-solution-for-hybrid-customers</guid><category><![CDATA[Azure]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[networking]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[Microsoft]]></category><category><![CDATA[architecture]]></category><category><![CDATA[dns]]></category><dc:creator><![CDATA[Osama Shaikh]]></dc:creator><pubDate>Sun, 10 Mar 2024 19:54:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1710100396562/ae15d7e8-a797-46dd-8ba3-812a034294d3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Azure DNS Private Resolver is a first-party Azure managed network component that facilitates DNS name resolution integration between On-premises to Azure and vice-versa.</p>
<p>Historically, on Azure cloud customers needed to rely on Infrastructure-as-a-Service (IaaS) solutions, configuring virtual machines either as DNS forwarders or Network Virtual Appliances (NVA) to achieve correct name resolution across their on-premises environments and Azure. These setups were not only complex but also introduced potential single points of failure, diminishing the network's overall resilience and efficiency.</p>
<p>Recently, I encountered scenarios where customers expressed a desire to transition away from these IaaS-based DNS forwarders. Their goal was clear: to eliminate these single points of failure and significantly boost their network's resiliency, especially in disaster recovery (DR) scenarios. They envisioned an active-active DR solution where, even in the event of a DNS service outage in the primary region, their organization's name resolution would continue uninterrupted. Moreover, they expected name resolution to be handled by their regional Azure Private DNS (Az PDNS) service by default, only routing to another region if the primary was compromised.<br />Here is the architecture design for Prod/DR in an active-active scenario.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710094575503/226c3e4b-db05-433f-ab0f-4ceb6e1bc92c.png" alt class="image--center mx-auto" /></p>
<p>To fulfill these requirements, I have devised an architectural design tailored for production and DR in an active-active configuration. This design necessitates several critical adjustments:</p>
<ol>
<li><p><strong>DNS Resolver Service Configuration</strong>: Each DNS resolver service in every region must be capable of resolving the names (URLs) of all workloads.</p>
</li>
<li><p><strong>Private DNS Zone Linking</strong>: As Platform-as-a-Service (PaaS) resources have separate instances running in both regions, their respective private DNS zones should be linked to the Virtual Network (VNet) of the PDNS resolver service.</p>
</li>
<li><p><strong>Conditional Forwarder Zones</strong>: Production and DR Active Directory (AD) DNS servers must be set up with non-AD-integrated conditional forwarder zones for Azure workloads (e.g., Blob Storage, Azure SQL), enabling private access through the on-premises network.</p>
</li>
<li><p><strong>Configuration of Conditional Forwarders</strong>: The conditional forwarder zones in the Prod and DR AD-DNS servers must list the primary and secondary endpoints in a different order (see the sketch after this list). This ensures that DNS traffic from each environment or region is handled by its respective DNS resolver service, addressing the customer's need for regional traffic management and eliminating cross-region dependencies during DR scenarios.</p>
</li>
</ol>
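<p>To illustrate point 4, here is a minimal PowerShell sketch for the on-premises AD DNS servers. The zone name and resolver inbound-endpoint IPs are hypothetical; the key idea is that the production DNS servers list the production-region inbound endpoint first, while the DR DNS servers reverse the order.</p>
<pre><code class="lang-powershell"># On the production AD DNS servers: prefer the production-region resolver inbound endpoint.
Add-DnsServerConditionalForwarderZone -Name "blob.core.windows.net" `
    -MasterServers 10.10.0.4, 10.20.0.4

# On the DR AD DNS servers: same zone, DR-region inbound endpoint listed first.
Add-DnsServerConditionalForwarderZone -Name "blob.core.windows.net" `
    -MasterServers 10.20.0.4, 10.10.0.4
</code></pre>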
<p>Here is a screenshot of the conditional forwarder config &amp; name resolution details.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710027304812/8070aae3-2109-408c-9e54-f0966a055992.png" alt class="image--center mx-auto" /></p>
<p>Name Resolution from on-premises against Blob storage private endpoint.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710100011265/0fb5772e-a859-44ad-8421-846ebcaabb71.png" alt class="image--center mx-auto" /></p>
<p>Private DNS zone details for Blob storage on Azure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710100170524/cb537424-6a36-4f74-a6fb-7b2d67570a6e.png" alt class="image--center mx-auto" /></p>
<p>This architecture pattern not only improves DNS management but also results in a more resilient, efficient, and scalable overall Azure infrastructure, with the two regions complementing each other in active-active DR scenarios.</p>
<p>Thanks for reading!</p>
<p>Do you have any other tips to improve my blog? Let me know in the comments, or just say hi 👋.</p>
<p>Until next time, Happy Learning</p>
]]></content:encoded></item></channel></rss>