
How to Clean Up and Reset Your Azure Environment in 30 Days

Azure environments rarely become messy all at once. They slowly drift.

Most Azure environments start clean and well organized. Landing zones are deployed, monitoring is configured, policies are assigned, and infrastructure is often provisioned through Infrastructure as Code (IaC). At that moment the environment typically reflects the intended architecture and governance model.

Over time, however, operational drift begins to appear.

New workloads are deployed, teams make small adjustments, and urgent fixes are applied directly in the portal. Monitoring alerts are added but rarely reviewed. Policies are sometimes relaxed to allow a deployment to proceed. Even environments that rely heavily on Infrastructure as Code can experience drift when operational changes happen outside the deployment pipelines.

After several months the environment still works, but it is no longer as predictable as it once was.

Common symptoms include:

  • Alerts that generate noise but rarely indicate real incidents
  • Resources that are deployed but not monitored
  • Costs increasing without clear visibility into why
  • Policies existing in audit mode but not enforcing guardrails
  • Security controls depending on manual processes

None of this necessarily means the environment was designed poorly. Azure environments evolve quickly, and operational complexity naturally grows as more workloads are introduced.

Instead of attempting a disruptive redesign, it is often far more effective to perform a structured operational reset. The goal is to review the environment, restore visibility, and strengthen governance using Azure-native capabilities.

This article introduces a practical 30-day operational reset plan for Azure administrators. The focus is on realistic operational improvements that can be implemented gradually without disrupting production workloads.

Throughout the article we will focus on practical techniques using:

  • Azure Monitor and Log Analytics for visibility
  • Azure Policy for continuous governance
  • Operational dashboards and workbooks for ongoing insights
  • PowerShell automation for scalable administration

Where possible, platform capabilities such as Policy and monitoring will be used instead of one-time scripts. Automation will be used to operationalize governance and remediation at scale.

The goal is not only to clean the environment once, but to make it easier to keep it clean in the future.

The 30-Day Azure Operational Reset Framework

Improving a large Azure environment can quickly become overwhelming if everything is addressed at once. Monitoring, cost management, governance, and security are closely connected, and trying to fix all of them simultaneously usually leads to scattered improvements.

A more effective approach is to focus on one operational layer at a time. The reset plan in this article is divided into four focused phases, each addressing a critical aspect of operating Azure at scale.

The framework looks like this:

Week      Focus Area
Week 1    Observability and monitoring
Week 2    Cost visibility and resource hygiene
Week 3    Governance and Azure Policy
Week 4    Security posture and operational resilience

Each phase builds on the previous one.

Week 1 – Observability focuses on ensuring the environment is actually visible. Administrators should be able to clearly see what resources exist, what telemetry they produce, and which alerts represent real operational signals.

Week 2 – Cost visibility focuses on understanding where money is being spent and whether resources are used efficiently. This phase also reviews tagging strategies and resource hygiene.

Week 3 – Governance strengthens guardrails using Azure Policy. Instead of relying on manual checks, governance should enforce standards automatically and consistently across subscriptions.

Week 4 – Security and resilience reviews identity access, credential hygiene, and recovery readiness. Many operational incidents occur not because of missing technology but because basic operational controls were overlooked.

Throughout these phases the focus remains on platform-driven governance and visibility, supported by automation where necessary. Infrastructure as Code remains the preferred way to deploy resources, while Policy, monitoring, and dashboards help ensure the environment stays compliant and observable as it evolves.

In the next section we start with the most important operational foundation: making sure the Azure environment is fully observable and producing reliable operational signals.

Week 1 – Reset Observability and Monitoring

Before improving governance, cost management, or security, administrators must be able to clearly see what is happening inside the environment. Observability is the operational foundation of Azure. If telemetry is missing, alerts are noisy, or monitoring coverage is inconsistent, every other operational improvement becomes more difficult.

In many environments, monitoring was configured early in the cloud journey but slowly drifted over time. New services were deployed without diagnostic settings, alert rules were added during incidents but never reviewed afterward, and dashboards stopped reflecting the actual state of the platform.

The goal of the first week is to restore clear, reliable operational visibility. This means ensuring that resources emit telemetry, that alerts represent meaningful signals, and that administrators have dashboards that allow them to quickly understand the health of the environment.

This does not require rebuilding monitoring from scratch. Instead, the focus should be on identifying gaps and standardizing how telemetry is collected.

Ensuring Resources Emit Telemetry

One of the most common monitoring gaps in Azure environments is resources that exist but do not send logs or metrics anywhere. This usually happens when resources are deployed outside Infrastructure as Code pipelines or when diagnostic settings were simply overlooked.

For example, it is common to find:

  • Storage accounts without diagnostic logs enabled
  • Key Vault access logs not sent to Log Analytics
  • Network Security Groups without flow logs
  • Load balancers with no monitoring enabled

In enterprise environments, the best way to prevent this problem is not through periodic manual reviews but through Azure Policy enforcement.

Instead of checking whether monitoring is enabled, we can require that it always is.

For example, an organization may assign a policy initiative that ensures critical services automatically send telemetry to a central Log Analytics workspace. When a resource is created without diagnostics, the policy can automatically deploy the required configuration.

This type of policy often uses the DeployIfNotExists effect. The policy evaluates newly deployed resources and automatically attaches diagnostic settings if they are missing.

PowerShell can then be used to assign the policy at scale, typically at the management group level.

Example:

$mg = "corp-platform"
$scope = "/providers/Microsoft.Management/managementGroups/$mg"

# Look up the custom definition published at the management group scope
$definition = Get-AzPolicyDefinition -Name "Deploy-Diagnostics-LogAnalytics" -ManagementGroupName $mg

# DeployIfNotExists assignments need a managed identity (and therefore a location)
# so that Azure can deploy the missing diagnostic settings on your behalf
New-AzPolicyAssignment `
  -Name "enforce-diagnostics" `
  -Scope $scope `
  -PolicyDefinition $definition `
  -IdentityType SystemAssigned `
  -Location "westeurope"

Once the policy is assigned, Azure continuously ensures that monitoring remains enabled for new resources.

Detecting Monitoring Blind Spots

Even with policies in place, it is useful to periodically review telemetry coverage.

Log Analytics can quickly reveal resources that exist but rarely or never produce telemetry. This may indicate missing diagnostic settings, misconfigured services, or simply unused resources.

A simple KQL query can show when each resource last generated activity in Log Analytics. Note that AzureActivity captures control-plane operations; data-plane telemetry lives in service-specific diagnostic tables.

Example query:

AzureActivity
| summarize LastActivity = max(TimeGenerated) by ResourceId
| order by LastActivity asc

When visualized in an Azure Monitor Workbook, this query helps administrators identify resources that have not produced activity for a long period of time.

For example, you may discover:

  • a virtual machine that has been running for months but produces no monitoring data
  • a Key Vault that has never logged an access event
  • a load balancer that receives no traffic

These situations often indicate configuration drift or unused infrastructure.

Fixing Alert Noise

Alert fatigue is one of the most common operational problems in cloud environments. When administrators receive too many alerts, they gradually stop reacting to them.

Many environments contain alerts that were created during a previous incident but never reviewed afterward. Over time, this leads to hundreds of alerts that generate little operational value.

Instead of reviewing alerts manually, administrators can use Azure Monitor queries and dashboards to understand alert behavior.

For example, a simple query can show how frequently alerts fired during the past month.

Alert
| summarize FiredCount = count() by AlertRuleName
| order by FiredCount desc

This makes it easy to identify alerts that trigger excessively.

A practical example might look like this:

  • A CPU alert firing every 10 minutes during batch processing
  • Disk queue alerts that trigger during normal backup windows
  • Network alerts triggered by expected traffic spikes

These alerts are technically correct but operationally useless.

Instead of removing them completely, thresholds can often be adjusted or converted into dashboard metrics instead of alerts.
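As a sketch of the dashboard-metric approach, a workbook tile can chart average CPU per computer instead of alerting on every spike. This assumes VM performance counters are being collected into the Perf table:

```kusto
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated >= ago(24h)
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 15m)
| render timechart
```

Administrators still see the CPU pattern at a glance, but nobody is paged for behavior that is expected.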

The goal of the reset process is simple: alerts should represent situations that require action.

Building an Operations Dashboard

Once telemetry and alerts are reviewed, the next step is ensuring that administrators can easily understand the state of the platform.

Many organizations rely on individual portal views, which makes it difficult to gain a complete operational overview.

Azure Monitor Workbooks are an effective way to solve this problem.

A simple operational dashboard may include views such as:

  • recent alerts across all subscriptions
  • top resource groups by activity
  • policy compliance summary
  • cost trends for the current month

This type of dashboard becomes the daily operational view for administrators.

For example, a typical morning review might include:

  • checking if any alerts fired overnight
  • reviewing backup job results
  • verifying that no unusual cost spikes occurred

Instead of manually navigating through different services in the Azure portal, administrators can quickly assess the health of the environment from a single location.

To mix things up, here is an example of a dashboard defined in Bicep:

@description('Location for the workbook resource')
param location string = resourceGroup().location

@description('Workbook display name')
param workbookDisplayName string = 'Azure Operations Dashboard'

@description('Resource ID of the Log Analytics workspace')
param logAnalyticsWorkspaceResourceId string

@description('Optional workbook description')
param workbookDescription string = 'Operational dashboard for Azure administrators. Includes Azure Activity, failures, VM heartbeat health, and resource group activity trends.'

var workbookId = guid(resourceGroup().id, workbookDisplayName)

resource workbook 'Microsoft.Insights/workbooks@2022-04-01' = {
  name: workbookId
  location: location
  kind: 'shared'
  properties: {
    displayName: workbookDisplayName
    description: workbookDescription
    category: 'workbook'
    sourceId: logAnalyticsWorkspaceResourceId
    serializedData: string({
      version: 'Notebook/1.0'
      items: [
        {
          type: 1
          name: 'title'
          content: {
            json: '# Azure Operations Dashboard\nThis workbook provides a practical day-to-day operational view for Azure administrators. It focuses on control-plane activity, failed operations, VM heartbeat coverage, and activity trends.'
          }
        }
        {
          type: 3
          name: 'recent-activity'
          content: {
            version: 'KqlItem/1.0'
            query: '''
AzureActivity
| where TimeGenerated >= ago(24h)
| project TimeGenerated, OperationNameValue, ActivityStatusValue, ResourceGroup, ResourceProviderValue, ResourceId, Caller
| order by TimeGenerated desc
'''
            size: 0
            title: 'Recent Azure Activity (Last 24 Hours)'
            timeContext: {
              durationMs: 86400000
            }
            queryType: 0
            resourceType: 'microsoft.operationalinsights/workspaces'
            crossComponentResources: [
              logAnalyticsWorkspaceResourceId
            ]
            visualization: 'table'
          }
        }
        {
          type: 3
          name: 'failed-operations'
          content: {
            version: 'KqlItem/1.0'
            query: '''
AzureActivity
| where TimeGenerated >= ago(24h)
| where ActivityStatusValue =~ 'Failure'
| summarize Failures = count() by OperationNameValue, ResourceGroup, Caller
| order by Failures desc
'''
            size: 0
            title: 'Failed Operations (Last 24 Hours)'
            timeContext: {
              durationMs: 86400000
            }
            queryType: 0
            resourceType: 'microsoft.operationalinsights/workspaces'
            crossComponentResources: [
              logAnalyticsWorkspaceResourceId
            ]
            visualization: 'table'
          }
        }
        {
          type: 3
          name: 'heartbeat-status'
          content: {
            version: 'KqlItem/1.0'
            query: '''
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer, ResourceId
| extend MinutesSinceLastHeartbeat = datetime_diff('minute', now(), LastHeartbeat)
| extend Status = case(
    MinutesSinceLastHeartbeat <= 15, 'Healthy',
    MinutesSinceLastHeartbeat <= 60, 'Warning',
    'Critical'
)
| project Computer, ResourceId, LastHeartbeat, MinutesSinceLastHeartbeat, Status
| order by MinutesSinceLastHeartbeat desc
'''
            size: 0
            title: 'VM Heartbeat Status'
            timeContext: {
              durationMs: 86400000
            }
            queryType: 0
            resourceType: 'microsoft.operationalinsights/workspaces'
            crossComponentResources: [
              logAnalyticsWorkspaceResourceId
            ]
            visualization: 'table'
          }
        }
        {
          type: 3
          name: 'top-resource-groups'
          content: {
            version: 'KqlItem/1.0'
            query: '''
AzureActivity
| where TimeGenerated >= ago(7d)
| summarize ActivityCount = count() by ResourceGroup
| order by ActivityCount desc
| take 10
'''
            size: 0
            title: 'Top 10 Most Active Resource Groups (Last 7 Days)'
            timeContext: {
              durationMs: 604800000
            }
            queryType: 0
            resourceType: 'microsoft.operationalinsights/workspaces'
            crossComponentResources: [
              logAnalyticsWorkspaceResourceId
            ]
            visualization: 'barchart'
          }
        }
        {
          type: 3
          name: 'activity-trend'
          content: {
            version: 'KqlItem/1.0'
            query: '''
AzureActivity
| where TimeGenerated >= ago(7d)
| summarize ActivityCount = count() by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
'''
            size: 0
            title: 'Azure Activity Trend (Last 7 Days)'
            timeContext: {
              durationMs: 604800000
            }
            queryType: 0
            resourceType: 'microsoft.operationalinsights/workspaces'
            crossComponentResources: [
              logAnalyticsWorkspaceResourceId
            ]
            visualization: 'timechart'
          }
        }
      ]
      isLocked: false
      fallbackResourceIds: [
        logAnalyticsWorkspaceResourceId
      ]
    })
  }
}

output workbookResourceId string = workbook.id
output workbookName string = workbook.name

Example deployment command. Replace the values with your own.

$rgName = "rg-monitoring-prod"
$workspaceId = "/subscriptions/<subscription-id>/resourceGroups/rg-monitoring-prod/providers/Microsoft.OperationalInsights/workspaces/law-prod-monitoring"

az deployment group create `
  --resource-group $rgName `
  --template-file .\azure-operations-dashboard.bicep `
  --parameters logAnalyticsWorkspaceResourceId=$workspaceId

By the end of the first week, the Azure environment should provide clear and reliable visibility. Resources emit telemetry, alert rules represent meaningful signals, and dashboards provide administrators with an operational overview of the platform.

With observability restored, the next step is to focus on cost visibility and resource hygiene, ensuring that the environment is not only observable but also efficient and well organized.

Week 2 – Restore Cost Visibility and Resource Hygiene

Once monitoring and observability are in place, the next step is understanding how resources are actually being used and how money is being spent. Cost issues in Azure rarely appear suddenly. More often, they develop gradually as environments grow and operational discipline weakens over time.

During normal operations, teams deploy new services, scale existing workloads, and test new solutions. Some resources are temporary but remain deployed longer than expected. Others continue running even though the workload they supported no longer exists. Without proper visibility, these small inefficiencies accumulate and eventually become noticeable on the monthly invoice.

The goal of the second week is not simply to reduce costs. Instead, the focus is on restoring cost transparency and improving resource hygiene so administrators can clearly understand where resources are being used and whether they still serve a purpose.

In enterprise environments, this is best achieved through consistent tagging, centralized cost visibility, and platform insights, rather than one-time cleanup scripts.

Enforcing a Reliable Tagging Strategy

A consistent tagging strategy is one of the most important foundations for managing costs at scale. Tags allow organizations to associate resources with applications, environments, teams, or cost centers. Without reliable tagging, cost analysis quickly becomes fragmented and difficult to interpret.

In many environments, tagging starts with good intentions but becomes inconsistent as more resources are deployed. Some resources include full metadata, while others contain only partial information or no tags at all.

For example, administrators might discover:

  • Virtual machines missing the Environment tag
  • Storage accounts without an Application owner
  • Shared services deployed without a CostCenter identifier
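Before enforcing anything, it helps to quantify the gap. An Azure Resource Graph query can count resources missing a given tag; the Environment tag below is an assumption based on the model discussed later:

```kusto
resources
| where isnull(tags['Environment'])
| summarize MissingEnvironmentTag = count() by type
| order by MissingEnvironmentTag desc
```

Running this per subscription (or tenant-wide through Resource Graph Explorer) shows which resource types drift most, which is useful for prioritizing policy work.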

Instead of periodically scanning for missing tags, a better approach is to enforce tagging through Azure Policy. Policies can require specific tags during deployment or automatically append them when possible.

A common enterprise tagging model might include tags such as:

  • Environment (Production, NonProduction, Development)
  • Application
  • Owner
  • CostCenter

If you are reviewing or cleaning up tags in an existing environment, I have previously covered several practical techniques that can help.

These approaches can be particularly helpful when retrofitting tagging standards into environments that have been running for some time.

By assigning tagging policies at the management group level, organizations ensure that newly deployed resources remain consistent across subscriptions. Infrastructure deployed through Infrastructure as Code pipelines also benefits from this approach because policy enforcement acts as an additional safety layer when templates are incomplete.
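As an illustration, the built-in "Require a tag on resources" policy can be assigned at a management group with PowerShell. The management group name and tag name below are placeholders, and the property path used to filter definitions may differ slightly between Az.Resources versions:

```powershell
$mg    = "corp-platform"   # placeholder management group
$scope = "/providers/Microsoft.Management/managementGroups/$mg"

# Look up the built-in policy definition by its display name
$definition = Get-AzPolicyDefinition -Builtin |
    Where-Object { $_.Properties.DisplayName -eq "Require a tag on resources" }

New-AzPolicyAssignment `
    -Name "require-costcenter-tag" `
    -Scope $scope `
    -PolicyDefinition $definition `
    -PolicyParameterObject @{ tagName = "CostCenter" }
```

Repeating the assignment per required tag (or bundling the policies into an initiative) keeps the tagging model enforced for everything deployed under that management group.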

Building Cost Visibility

Once tagging is reliable, administrators can begin analyzing spending patterns more effectively. Azure Cost Management provides a strong foundation for this visibility, but many environments still rely on manual portal exploration to understand spending trends.

A more practical approach is to create a cost visibility dashboard that surfaces important metrics at a glance. Administrators should be able to quickly answer questions such as:

  • Which subscriptions are responsible for the majority of spending this month?
  • Which resource groups have seen the largest cost increase recently?
  • Which environments consume the most compute resources?

For example, a simple operational dashboard might display:

  • Monthly cost trends across subscriptions
  • Top resource groups by spending
  • Cost distribution by environment tag

These views help operational teams identify unusual spending patterns early, before they become budget issues.

Many organizations also configure cost exports that deliver daily cost data to a storage account. This data can then be analyzed using Power BI, Azure Monitor workbooks, or other reporting tools.

Reviewing Resource Hygiene

Cost visibility often reveals another common issue: resources that continue to exist even though they are no longer needed. These are not necessarily mistakes; they are simply remnants of previous workloads or temporary deployments.

Typical examples include:

  • Virtual machines created for testing that were never removed
  • Managed disks left behind after a VM deletion
  • Public IP addresses allocated but never attached to a resource
  • Old snapshots that are no longer required

Instead of relying on custom scripts to detect these situations, Azure provides built-in insights through Azure Advisor. Advisor analyzes resource utilization and configuration patterns to identify potential inefficiencies.

For example, Advisor might highlight:

  • Virtual machines with consistently low CPU utilization
  • Idle public IP addresses
  • Opportunities to resize or consolidate resources

Administrators can review these recommendations periodically and decide whether remediation is appropriate.

PowerShell can also help operationalize this process by exporting Advisor recommendations across subscriptions, allowing teams to track remediation progress over time.
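A sketch of that export, using the Az.Advisor module across all visible subscriptions. Property shapes can vary by module version, so treat this as a starting point rather than a finished report:

```powershell
# Export Azure Advisor cost recommendations from every subscription to one CSV
$results = foreach ($sub in Get-AzSubscription) {
    Set-AzContext -SubscriptionId $sub.Id | Out-Null
    Get-AzAdvisorRecommendation -Category Cost |
        Select-Object @{ n = 'Subscription'; e = { $sub.Name } },
                      ImpactedField, ImpactedValue, Impact
}
$results | Export-Csv -Path .\advisor-cost-recommendations.csv -NoTypeInformation
```

Scheduling this weekly and diffing the output gives teams a simple way to see whether recommendations are actually being worked down.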

By the end of the second week, administrators should have a much clearer understanding of how resources are used within the environment. Tagging is consistently enforced, cost trends are visible through dashboards, and resource hygiene is monitored through built-in platform insights.

With improved visibility into both monitoring and spending, the next step is to strengthen governance. In the following section we focus on reinforcing guardrails with Azure Policy, ensuring that future deployments remain aligned with organizational standards.

Week 3 – Reinforcing Governance with Azure Policy

After observability and cost visibility are restored, the next step is strengthening governance. In many Azure environments, governance frameworks exist but are not consistently enforced. Policies may be assigned in Audit mode, initiatives may have been created early in the cloud journey but never revisited, and exceptions sometimes accumulate without clear tracking.

Over time this leads to an environment where governance technically exists, but it does not actively prevent configuration drift.

The goal of this phase is to move from passive governance to operational guardrails. Instead of relying on manual reviews or documentation, Azure Policy should automatically enforce standards and prevent common mistakes.

When implemented correctly, policies allow teams to continue deploying resources quickly while ensuring that those deployments remain compliant with organizational requirements.

Reviewing Existing Policy Assignments

Before introducing new policies, it is useful to review what is already assigned across the environment. Many organizations discover that they already have a significant number of policies in place, but their effectiveness is limited because they are either misconfigured or rarely monitored.

Azure provides several ways to review policy compliance, including the Policy blade in the Azure portal and Policy Insights queries.

For example, administrators can query policy compliance data with Azure Resource Graph to identify the most common violations across subscriptions.

policyresources
| where type == "microsoft.policyinsights/policystates"
| extend policyDefinitionName = tostring(properties.policyDefinitionName),
         complianceState = tostring(properties.complianceState)
| where complianceState == "NonCompliant"
| summarize NonCompliantResources = count() by policyDefinitionName
| order by NonCompliantResources desc

When visualized in a workbook or dashboard, this query helps administrators quickly identify policies that generate the largest number of violations.

Typical findings might include:

  • Storage accounts without secure transfer enabled
  • Virtual machines missing backup configuration
  • Resources deployed outside approved regions

These insights help determine where governance should be strengthened.

Turning Policies into Guardrails

Many policies are initially deployed with the Audit effect. This is a safe starting point because it allows teams to understand the impact of a policy before enforcing it. However, leaving policies in audit mode indefinitely limits their value.

Once administrators confirm that a policy behaves as expected, it often makes sense to convert it into an enforcement mechanism.

Common effects used in enterprise environments include:

  • Deny – blocks non-compliant deployments
  • DeployIfNotExists – automatically deploys missing configuration
  • Modify – automatically corrects resource properties

For example, a policy can ensure that diagnostic settings are automatically deployed for supported resources. Instead of relying on every deployment pipeline to configure diagnostics correctly, Azure Policy ensures the configuration is applied whenever a new resource appears.

Policies are typically assigned at the management group level so that governance applies consistently across multiple subscriptions.

A PowerShell example assigning a policy initiative might look like this:

$mgName = "corp-platform"
$scope = "/providers/Microsoft.Management/managementGroups/$mgName"

# Initiative published at the management group scope
$initiative = Get-AzPolicySetDefinition -Name "platform-governance" -ManagementGroupName $mgName

# DeployIfNotExists and Modify effects require a managed identity and a location
New-AzPolicyAssignment `
    -Name "Platform-Governance-Initiative" `
    -Scope $scope `
    -PolicySetDefinition $initiative `
    -IdentityType SystemAssigned `
    -Location "eastus"

Once assigned, the initiative continuously evaluates resources within that scope and ensures that deployments remain compliant.

Triggering Policy Remediation

In existing environments, many resources may already be non-compliant when a new policy is introduced. Instead of fixing these resources manually, Azure Policy supports remediation tasks that automatically apply required configurations.

For example, if a policy ensures diagnostic settings are enabled, a remediation task can update all existing resources that are missing those settings.

PowerShell can be used to trigger remediation across large environments.

# Look up the assignment at the scope where it was created
$scope = "/providers/Microsoft.Management/managementGroups/corp-platform"
$assignment = Get-AzPolicyAssignment -Name "Platform-Governance-Initiative" -Scope $scope

Start-AzPolicyRemediation `
    -Name "RemediateDiagnostics" `
    -ManagementGroupName "corp-platform" `
    -PolicyAssignmentId $assignment.Id `
    -ResourceDiscoveryMode ReEvaluateCompliance

This approach allows administrators to apply governance improvements across hundreds or thousands of resources without manually modifying each one.

Managing Policy Exceptions

Even in well-governed environments, some workloads may require exceptions. For example, a legacy application might temporarily require deployment in a region that is normally restricted.

Instead of disabling policies entirely, Azure supports policy exemptions that document and track exceptions.

This approach maintains transparency and prevents governance from becoming fragmented.

For example, administrators can create a policy exemption for a specific resource group or subscription while keeping the policy enforced everywhere else.

# $assignment comes from the earlier Get-AzPolicyAssignment call;
# exemptions should expire rather than live forever (90 days here as an example)
New-AzPolicyExemption `
    -Name "Legacy-App-Region-Exception" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/legacy-rg" `
    -PolicyAssignment $assignment `
    -ExemptionCategory Waiver `
    -ExpiresOn (Get-Date).AddDays(90) `
    -DisplayName "Temporary exemption for legacy workload"

Using exemptions instead of disabling policies ensures that governance remains consistent and that exceptions are clearly documented.

By the end of the third week, governance should move from passive monitoring to active enforcement. Policies prevent common configuration mistakes, remediation tasks bring existing resources into compliance, and dashboards provide visibility into the overall governance posture.

With monitoring, cost visibility, and governance in place, the final step is to review security posture and operational resilience, ensuring the environment can safely handle real-world operational incidents.

Week 4 – Security Posture and Operational Resilience

By the time observability, cost visibility, and governance are improved, the Azure environment is already much easier to operate. The final phase focuses on an area that often receives attention only after an incident occurs: operational security and resilience.

Many production incidents in cloud environments are not caused by sophisticated attacks or platform outages. Instead, they are often triggered by operational issues such as expired credentials, excessive permissions, or recovery procedures that were never tested.

The purpose of this phase is to ensure that the environment can safely handle these situations.

Rather than focusing on one-time reviews, the goal is to establish continuous operational safeguards that reduce the likelihood and impact of these problems.

Controlling Privileged Access

One of the most common risks in long-running Azure environments is the gradual accumulation of privileged access. During migrations, troubleshooting, or project work, users are often granted elevated roles such as Owner or Contributor. Over time, these permissions may remain assigned even when they are no longer required.

Instead of periodically reviewing role assignments manually, organizations should establish a structured access model based on Azure RBAC, role separation, and identity governance.

In mature environments this usually means:

  • Limiting permanent Owner access to a small number of platform administrators
  • Assigning operational roles through Azure AD groups instead of individual users
  • Using Privileged Identity Management (PIM) for temporary elevation of roles
  • Requiring approval workflows for privileged role activation

With this approach, administrators only receive elevated permissions when they actually need them, and those permissions expire automatically after a predefined period.

Operational dashboards can also track privileged assignments so that unusual changes become visible quickly.
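A quick check that can feed such a view is listing permanent Owner assignments in the current subscription with Az PowerShell. This is a minimal sketch; PIM-eligible assignments are managed separately and are not returned by this cmdlet:

```powershell
# List standing Owner role assignments in the current subscription context
Get-AzRoleAssignment |
    Where-Object { $_.RoleDefinitionName -eq "Owner" } |
    Select-Object DisplayName, SignInName, ObjectType, Scope
```

Any entry that is an individual user rather than a small, known platform-admin group is a candidate for migration to PIM.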

Eliminating Long-Lived Credentials

Another frequent source of operational outages is expiring credentials used by automation tools, pipelines, or integrations. Many environments still rely on service principals with long-lived client secrets.

When these credentials expire, deployments suddenly stop working or integrations fail unexpectedly.

A more resilient approach is to gradually replace static credentials with managed identities wherever possible. Managed identities eliminate the need for secret rotation and allow services to authenticate to Azure resources without storing credentials.

Typical improvements in this phase may include:

  • Migrating automation accounts to system-assigned managed identities
  • Updating application services to use managed identity authentication
  • Removing unused service principals from the tenant

For services that must still rely on credentials, teams often implement credential monitoring through Azure Monitor alerts so that expiring secrets are detected well before they cause operational issues.
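
As a starting point for credential monitoring, expiring application secrets can also be inventoried directly with PowerShell. This is a minimal sketch, assuming the Az.Resources module and an authenticated session (`Connect-AzAccount`); the 60-day threshold is an illustrative value:

```powershell
# Sketch: list application credentials expiring within the next 60 days.
# Assumes Az.Resources is installed and you are signed in with Connect-AzAccount.
$threshold = (Get-Date).AddDays(60)

Get-AzADApplication | ForEach-Object {
    $app = $_
    Get-AzADAppCredential -ObjectId $app.Id | Where-Object {
        $_.EndDateTime -lt $threshold
    } | ForEach-Object {
        [PSCustomObject]@{
            Application = $app.DisplayName
            KeyId       = $_.KeyId
            Expires     = $_.EndDateTime
        }
    }
} | Sort-Object Expires
```

The same output can feed a scheduled Automation runbook or an alert rule, so that an expiring secret becomes a ticket rather than an outage.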

Monitoring Backup and Recovery Readiness

Backup configuration is often verified during initial deployment, but it is rarely revisited afterward. As environments grow, it becomes easy to assume that workloads are protected simply because a backup policy exists.

However, operational resilience depends on more than policy configuration. Teams must ensure that backups are actually running successfully and that recovery procedures work when needed.

Enterprise environments often monitor backup posture through centralized dashboards that show:

  • workloads protected by Azure Backup
  • recent backup job results
  • restore point availability
  • backup failures across subscriptions

These dashboards allow administrators to detect problems quickly and confirm that critical systems remain protected.
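
Much of this backup posture data can also be pulled across subscriptions with Azure Resource Graph. The query below is a sketch based on the `recoveryservicesresources` table; the exact property names (such as `lastRecoveryPoint`) can vary by workload type and API version, so verify them against your own results:

```kusto
// Sketch: backup protection state and last recovery point per protected item.
recoveryservicesresources
| where type =~ "microsoft.recoveryservices/vaults/backupfabrics/protectioncontainers/protecteditems"
| extend itemName = tostring(properties.friendlyName)
| extend protectionState = tostring(properties.protectionState)
| extend lastBackupTime = todatetime(properties.lastRecoveryPoint)
| project itemName, protectionState, lastBackupTime, resourceGroup, subscriptionId
| order by lastBackupTime asc
```

Sorting by the oldest recovery point brings the workloads most at risk to the top of the list.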

Some organizations also schedule periodic recovery validation exercises, where selected workloads are restored into isolated environments to verify that recovery procedures work as expected.

Here is an example PowerShell script that collects basic Azure Backup information for a list of VMs. It expects a CSV file with SubscriptionName, VMName, and RGName columns, and requires the Az.Compute and Az.RecoveryServices modules:

param (
    [string]$CSVPath = "VMs.csv",
    [string]$OutputPath
)

# Generate a timestamped output path unless one was supplied explicitly
if (-not $OutputPath) {
    $timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
    $outputDirectory = if ($PSScriptRoot) { $PSScriptRoot } else { (Get-Location).Path }
    $OutputPath = Join-Path $outputDirectory "AzureBackupReport_$timestamp.csv"
}

# Import the CSV file
Write-Host "Reading VM list from CSV..." -ForegroundColor Cyan
$vms = Import-Csv -Path $CSVPath

# Group VMs by subscription
$vmsBySubscription = $vms | Group-Object -Property SubscriptionName

$report = @()

# Process each subscription
foreach ($subscriptionGroup in $vmsBySubscription) {
    $subscriptionName = $subscriptionGroup.Name
    Write-Host "Processing subscription: $subscriptionName" -ForegroundColor Yellow
    
    # Set context to the subscription
    $context = Set-AzContext -SubscriptionName $subscriptionName -ErrorAction SilentlyContinue
    if (-not $context) {
        Write-Warning "Could not set context to subscription: $subscriptionName"
        continue
    }
    
    # Get all recovery services vaults in the subscription
    $vaults = Get-AzRecoveryServicesVault -ErrorAction SilentlyContinue
    
    # Process each VM in the subscription
    foreach ($vm in $subscriptionGroup.Group) {
        $vmName = $vm.VMName
        $rgName = $vm.RGName
        
        Write-Host "  Checking VM: $vmName" -ForegroundColor Cyan
        
        try {
            # Get the VM object
            $vmObject = Get-AzVM -ResourceGroupName $rgName -Name $vmName -ErrorAction SilentlyContinue
            
            if (-not $vmObject) {
                Write-Warning "VM $vmName not found in resource group $rgName"
                $report += [ordered]@{
                    "VM Name" = $vmName
                    "Resource Group" = $rgName
                    "Backup Enrolled" = "N/A"
                    "Backup Policy Assigned" = "N/A"
                    "Last Backup Taken" = "N/A"
                    "Vault Name" = "N/A"
                    "Policy Name" = "N/A"
                }
                continue
            }
            
            $backupEnrolled = "No"
            $policyAssigned = "No"
            $lastBackupDate = "N/A"
            $vaultName = "N/A"
            $policyName = "N/A"
            
            # Check each vault for this VM's backup status
            foreach ($vault in $vaults) {
                Write-Host "    Checking vault: $($vault.Name)" -ForegroundColor DarkCyan
                
                # Set vault context
                $null = Set-AzRecoveryServicesVaultContext -Vault $vault -ErrorAction SilentlyContinue
                
                try {
                    # Get all backup containers and search for this VM
                    $containers = Get-AzRecoveryServicesBackupContainer -ContainerType AzureVM -ErrorAction SilentlyContinue
                    
                    if ($containers) {
                        # Search for container matching this VM (by friendly name or resource group)
                        $vmContainer = $containers | Where-Object { $_.FriendlyName -like "*$vmName*" -or $_.Name -like "*$vmName*" }
                        
                        if ($vmContainer) {
                            Write-Host "      Found backup container: $($vmContainer.FriendlyName)" -ForegroundColor Green
                            
                            # Get all backup items in this container
                            $items = Get-AzRecoveryServicesBackupItem -Container $vmContainer -WorkloadType AzureVM -ErrorAction SilentlyContinue
                            
                            if ($items) {
                                $backupEnrolled = "Yes"
                                $vaultName = $vault.Name
                                
                                # Get details from the first matched item
                                $item = $items | Select-Object -First 1
                                
                                if ($item.ProtectionStatus -ne "ProtectionStopped") {
                                    $policyAssigned = "Yes"
                                    $policyName = $item.ProtectionPolicyName
                                    
                                    Write-Host "      Policy: $policyName" -ForegroundColor Green
                                    
                                    # Get last recovery point (backup)
                                    try {
                                        $recoveryPoints = Get-AzRecoveryServicesBackupRecoveryPoint -Item $item -ErrorAction SilentlyContinue | 
                                            Sort-Object -Property RecoveryPointTime -Descending | Select-Object -First 1
                                        
                                        if ($recoveryPoints) {
                                            $lastBackupDate = $recoveryPoints.RecoveryPointTime.ToString("yyyy-MM-dd HH:mm:ss")
                                            Write-Host "      Last backup: $lastBackupDate" -ForegroundColor Green
                                        }
                                        else {
                                            Write-Host "      No recovery points found" -ForegroundColor Yellow
                                        }
                                    }
                                    catch {
                                        Write-Verbose "Error getting recovery points: $_"
                                    }
                                }
                                else {
                                    $policyAssigned = "No (Protection Stopped)"
                                }
                                
                                break
                            }
                        }
                    }
                }
                catch {
                    Write-Verbose "Error checking vault $($vault.Name): $_"
                }
            }
            
            # Add to report
            $report += [ordered]@{
                "VM Name" = $vmName
                "Resource Group" = $rgName
                "Backup Enrolled" = $backupEnrolled
                "Backup Policy Assigned" = $policyAssigned
                "Last Backup Taken" = $lastBackupDate
                "Vault Name" = $vaultName
                "Policy Name" = $policyName
            }
        }
        catch {
            Write-Error "Error processing VM $vmName : $_"
            $report += [ordered]@{
                "VM Name" = $vmName
                "Resource Group" = $rgName
                "Backup Enrolled" = "Error"
                "Backup Policy Assigned" = "Error"
                "Last Backup Taken" = "Error"
                "Vault Name" = "Error"
                "Policy Name" = "Error"
            }
        }
    }
}

# Convert to PSCustomObjects for CSV export (ordered keys keep column order stable)
$reportObjects = $report | ForEach-Object { [PSCustomObject]$_ }

# Export to CSV
Write-Host "Exporting report to: $OutputPath" -ForegroundColor Green
$reportObjects | Export-Csv -Path $OutputPath -NoTypeInformation -Force

Write-Host "Report generation completed successfully!" -ForegroundColor Green
Write-Host "Total VMs processed: $($reportObjects.Count)"
Write-Host "Output file: $OutputPath"

Preparing for Operational Failure Scenarios

Even well-governed environments occasionally encounter unexpected situations. For example, a policy change may accidentally block deployments, an identity configuration change may restrict access, or an expired certificate may disrupt application connectivity.

The difference between a minor issue and a major outage often depends on whether operational teams have prepared for these situations.

Many organizations maintain simple operational runbooks describing how to respond to common failure scenarios. These runbooks typically cover situations such as:

  • restoring access using break-glass accounts
  • temporarily disabling blocking policies
  • restoring critical services from backup
  • recovering access to subscriptions or management groups

These procedures do not need to be complex, but documenting them ensures that administrators can respond quickly when unexpected situations arise.

By the end of the fourth week, the environment should not only be observable and well governed, but also resilient to operational failures. Identity access is controlled, credentials are managed safely, backups are verified, and recovery procedures are clearly defined.

Conclusion – Keeping Azure Clean

Azure environments naturally evolve over time. New workloads are deployed, teams change, and operational practices adapt to new requirements. Even well-designed environments gradually accumulate complexity as these changes occur.

This is normal.

The key is ensuring that operational practices evolve along with the environment.

The 30-day operational reset plan presented in this article provides a practical way to restore clarity and control without disrupting running workloads. By addressing observability, cost visibility, governance, and operational security in structured phases, administrators can significantly improve the reliability and maintainability of their Azure environments.

A healthy Azure environment typically has several characteristics:

  • monitoring provides clear operational signals
  • cost visibility allows teams to understand spending patterns
  • governance guardrails prevent configuration drift
  • identity access follows least-privilege principles
  • operational dashboards provide continuous insight into platform health

Most importantly, the environment becomes predictable and easier to operate.

And sometimes, all it takes is a little spring cleaning to get things back on track.

Azure Spring Clean 2026

This article is part of the Azure Spring Clean initiative, a community-driven event focused on sharing knowledge and best practices for Azure. Check out Azure Spring Clean for more insightful content from the Azure community.


Thanks for following along and keep clouding around!

Vukasin Terzic

This post is licensed under CC BY 4.0