## Chapter 2 - Diagnosing Deployment Failures on a Small Number of Machines

1. Run the following query to view the machine's basic information.

```powershell
$machineObject = Get-CentralAdminMachine $machine
$machineObject |
    Format-List Name, *Definition, Dag, Nag, DesiredDefinition, DeployRing, AutopilotMode, *Version*, ProvisioningState, ActivityState, Forest, When*, City, ServiceInstanceType
```

This basic information gives a quick picture of the machine's current state. You may also need context about the unit (DAG or capacity unit) the machine belongs to. Run the following commands to get it:

```powershell
# Show every machine in the same DAG, if the machine belongs to one.
if (-not [string]::IsNullOrEmpty($machineObject.Dag)) {
    Write-Host "Downloading DAG context..." -ForegroundColor Yellow
    Write-Host "You are $machine and you are viewing DAG: $($machineObject.Dag)"
    Get-CentralAdminMachine -Filter "Dag -eq '$($machineObject.Dag)'" -ShowAll |
        Sort-Object ProvisioningState |
        Format-Table -AutoSize Name, ActivityState, ProvisioningState, *Definition, ActualVersion, Location, AutopilotMode
}

# Show every machine in the same capacity unit, if the machine belongs to one.
if (-not [string]::IsNullOrEmpty($machineObject.CapacityUnit)) {
    Write-Host "Downloading CapacityUnit context..." -ForegroundColor Yellow
    Write-Host "You are $machine and you are viewing CapacityUnit: $($machineObject.CapacityUnit) ."
    Get-CentralAdminMachine -Filter "CapacityUnit -eq '$($machineObject.CapacityUnit)'" -ShowAll |
        Sort-Object ProvisioningState |
        Format-Table -AutoSize Name, ActivityState, ProvisioningState, *Definition, ActualVersion, Location, AutopilotMode
}
```

2. Review the machine's deployment history.

To see how a specific machine has been deployed in the past, use the following Kusto query:

```kusto
DeploymentCogsEvent_Global
| where machineName == '$machine'
| where timestamp >= ago(3d)
| sort by timestamp desc
| project timestamp, machineName, actionType, requestor, provisioningState, deployMode, activityId, forest, actionStatus, deployVersion, SourceVersion, workflowId
```

You can cross-check the Kusto results in DMS:

```powershell
Get-DeploymentAPSWorkitem -Machines $machine -ShowAll `
    | Sort-Object WorkflowStartTime -Descending `
    | Select-Object -First 20 `
    | Format-Table -AutoSize Status, WorkflowId, TargetIntention, WorkflowStartTime
```

**Always check the machine's deployment history.** It tells you whether the deployment actually failed or was never attempted, whether the error is transient and recoverable by retry or is a fundamental one, and whether the machine should be deploying at all.

3. Understand why a machine does or does not deploy (optional):

Run the following query in Kusto to see how the machine's deployment eligibility was evaluated. This helps diagnose why a machine has not deployed for a long time. (`GV2PEPF0000385A` is the example machine name used in the rest of this chapter; substitute your own.)

```kusto
ApsEvaluatorTraceEvent_Global
| where Message has "GV2PEPF0000385A" and env_time > ago(12h)
| where Message has "Failed with rule:"
| parse Message with * "Failed with rule:" FailRule:string "||" *
| project env_time, FailRule, PolicyIdentifier, MessageId, Message
| sort by env_time desc
| limit 200
```

4. Understand why a machine is forced to deploy a particular version (optional):

Sometimes a machine is forced to deploy an unexpected version (for example, one that is too old). Run the following query in Kusto to see why the deployment was forced:

```kusto
ApsPrioritizerTraceEvent_Global
| where Message has "GV2PEPF0000385A" and env_time > ago(12h)
| where Message has ""
| where PrioritizerIdentifier startswith 'Sweeper:CapacityDeploymentSweeper'
| sort by env_time desc
| project PrioritizerIdentifier,Message,env_time
| limit 200
```

5. Diagnose the machine's deployment error:

The output of step 2 includes a WorkflowId. Use it to inspect the deployment error.

```powershell
Enable-SeeAnything
See-Workflow $workflowId
```

If DMS is not available, use the following Kusto query instead:

```kusto
CentralAdminWorkflows_Global
| where RootWorkflowId == '$guid'
| extend WorkflowId = strcat("\\\\", ManagementUnit, "\\", Id)
| project ClassName, Result, CreateTimeUtc, EndTimeUtc, WorkflowId, Exception, LastGoodKnownState, UserContext, TenantVersion,RootWorkflowId
| sort by CreateTimeUtc asc
```

For ITAR, use [Jarvis](https://portal.microsoftgeneva.com/logs/dgrep?be=DGrep&ep=CA%20Fairfax&ns=O365PassiveITAR&en=CentralAdminWorkflows&time=2025-03-05T07:23:00.000Z&UTC=true&offset=-3&offsetUnit=Days&conditions=[[%22ClassName%22,%22%3D%3D%22,%22PatchPersistenceInspector%22]]&kqlClientQuery=source%0A|%20extend%20WorkflowId%20%3D%20strcat(%22\\\\%22,%20ManagementUnit,%20%22\\%22,%20Id)%0A|%20project%20ClassName,%20Result,%20CreateTimeUtc,%20EndTimeUtc,%20WorkflowId,%20Exception,%20LastGoodKnownState,%20UserContext,%20TenantVersion%0A|%20sort%20by%20CreateTimeUtc%20desc&aggregates=[%22Count%20by%20env_cloud_roleInstance%22]&chartEditorVisible=true&chartType=line&chartLayers=[[%22New%20Layer%22,%22%22],[%22Count%20by%20env_cloud_roleInstance%22,%22groupby%20env_time.roundDown(\%22PT1M\%22)%20as%20X,%20env_cloud_roleInstance\nwhere%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0029\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN5MGT0401CS009\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS013\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0031\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0037\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN1MGT04CS103\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS019\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22CY1MGT04CS110\%22\nlet%20Count%20%3D%20Count()%22]]%20).

At this point you usually know why the deployment failed. If it is still unclear, continue with the steps below.

6. Classify deployment errors by cause:

Looking at only one or two errors is sometimes not convincing evidence of why a machine fails to deploy. Use the following Kusto query to group the errors by cause:

```kusto
APSFailedWorkitemEvent_Global
| where env_time > ago(200h)
| where resourceName == "GV2PEPF0000385A"
| where prioritizer contains "Sweeper:CapacityDeploymentSweeper"
| where failureMessage !has "forcing ReimageMode to WinPE"
| project failureTagId, resourceName, targetResourceUnit, targetIntention, deployRing, failureWorkflowName, workflowId, workflowStartTime, failureTagWords, failureMessage, workflowEndTime, sku
| summarize Count = count(), ExampleErrorMessage = take_any(failureMessage), ExampleWorkflowId = take_any(workflowId), ExampleFailureTagId = take_any(failureTagId) by targetIntention, failureWorkflowName
| where Count > 3
| order by targetIntention asc, Count desc
```

The errors that occur most often are the fundamental ones to focus on.

7. Diagnose ExchangeSetup errors:

If the machine failed with an ExchangeSetup error, use the following DMS command to diagnose it:

```powershell
Get-MachineLog C_ExchangeSetup -Target "GV2PEPF0000385A" -Download
```

8. Apply mitigation:

The machine falls into one of two categories:

* The machine can be deployed, but for some reason it has not been. In this case, manually trigger deployment and repair actions. Go to step 9.
* The issue is widespread, or the version itself is faulty. In this case the deployment will fail even if this particular machine is deployable. Go to step 10.

9. Manually trigger the machine's deployment:

If the machine can be deployed, trigger the deployment manually. The following commands can be used to repair a single machine:

### File operations

1. **List the contents of a folder on the target machine**
   Command: `Get-ChildItem.ps1 -Target "GV2PEPF0000385A" -Path "C:\"`
2. **Download a file from the target machine**
   Command: `Get-TorusFile -Path "C:\program files\microsoft\exchange server\v15\config\AntiMalware.settings.ini" -Machine "GV2PEPF0000385A"`
3. **View the properties of a file on the target machine**
   Command: `Get-ItemProperty.ps1 -Path "C:\program files\microsoft\exchange server\v15\config\AntiMalware.settings.ini" -Target "GV2PEPF0000385A" | Format-List`
4. **Download the Exchange setup logs**
   Command: `Get-MachineLog C_ExchangeSetup -Target "GV2PEPF0000385A" -Download`

### Status diagnostics

1. **Check machine performance**
   Command: `Measure-Performance -Machine GV2PEPF0000385A`
2. **Check disk usage on the machine**
   Command: `Get-WmiObject.ps1 -Target GV2PEPF0000385A -Class Win32_LogicalDisk -NoFormatting | Foreach-Object { 'Disk {0}({4}) has free space: {1:0.0}GB/{2:0.0}GB ({3:0.0}%)' -f $_.Caption, ($_.FreeSpace / 1024MB), ($_.Size / 1024MB), ($_.FreeSpace / $_.Size * 100), $_.VolumeName }`
3. **Test network connectivity**
   Command: `Test-MachineNetworkConnectivity.ps1 -TargetMachine GV2PEPF0000385A`
4. **Diagnose WinPE boot failures**
   Command: `Invoke-WinPEFailureDiagnose.ps1 -TargetMachine "GV2PEPF0000385A"`
5. **Check machine component information**
   Command: `Get-MachineComponentV2 -Filter "MachineName -eq 'GV2PEPF0000385A'" | Format-Table -AutoSize`
6. **Check machine service information**
   Command: `Get-ServiceInfo.ps1.dms -RemoteFQDNs GV2PEPF0000385A`
7. **View the machine's deployability**
   Command: `See-MachineDeployAbility -MachineName GV2PEPF0000385A`

### Mitigation and repair

1. **Enter maintenance mode**
   Command: `Request-MachineBeginMaintenance_V2.ps1 -TargetMachine "GV2PEPF0000385A" -Reason "Bring the machine to maintenance to reimage it."`
2. **Restart the target machine**
   Command: `Request-SetMachinePowerStateV2.ps1 -TargetMachine "GV2PEPF0000385A" -DesiredState "Restart" -Reason "Machine stuck."`
3. **Clear the machine cache**
   Command: `Invoke-ComponentReplicationWorkflow.ps1 -TargetMachine GV2PEPF0000385A -ComponentIds 'WipeCache'`
4. **Prioritize spare machine assignment**
   Command: `New-MachineDeploymentControl.ps1 -MachineName "GV2PEPF0000385A" -MachineDeploymentControlType PrioritizeSpareAssignment -Justification "GV2PEPF0000385A has a bad disk, we need to replace it"`
5. **Prioritize repair of the target machine**
   Command: `Request-RepairByDeployment.ps1 -MachineName GV2PEPF0000385A -ReimageMode "" -Justification "Try to repair this machine."`
6. **Attempt to repair the machine immediately**
   Command: `Invoke-RepairOnDemandWorkflow.ps1 -TargetMachine GV2PEPF0000385A`

Use whichever command fits the situation to trigger the machine's deployment.

> End

10. Widespread issues or issues with the version itself:

Follow the diagnostic steps in Chapter 1 to diagnose widespread issues or issues with the version itself.
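As an offline complement to the classification query in step 6, the same grouping can be sketched in plain PowerShell for failure records that have already been exported (for example, from a Kusto result). This is only a minimal sketch under assumptions: `Group-DeploymentFailure` is a hypothetical helper and not a DMS cmdlet, the sample records are made up, and the property names simply mirror the columns projected by the Kusto query.

```powershell
# Hypothetical helper (not part of DMS): groups failure records by
# targetIntention and failureWorkflowName, like the Kusto query in step 6.
function Group-DeploymentFailure {
    param(
        [Parameter(Mandatory)] [object[]] $Record,
        [int] $MinCount = 3   # keep only groups with Count > MinCount
    )

    $Record |
        Group-Object -Property targetIntention, failureWorkflowName |
        Where-Object { $_.Count -gt $MinCount } |
        ForEach-Object {
            # Take the first record of each group as the example,
            # similar in spirit to take_any() in Kusto.
            [pscustomobject]@{
                TargetIntention     = $_.Group[0].targetIntention
                FailureWorkflowName = $_.Group[0].failureWorkflowName
                Count               = $_.Count
                ExampleErrorMessage = $_.Group[0].failureMessage
                ExampleWorkflowId   = $_.Group[0].workflowId
            }
        } |
        Sort-Object -Property @{ Expression = 'TargetIntention'; Descending = $false },
                              @{ Expression = 'Count'; Descending = $true }
}

# Made-up sample data, for illustration only.
$sample = @(
    1..5 | ForEach-Object {
        [pscustomobject]@{
            targetIntention     = 'Deploy'
            failureWorkflowName = 'WorkflowA'
            failureMessage      = 'Sample error A'
            workflowId          = "wf-a-$_"
        }
    }
    1..2 | ForEach-Object {
        [pscustomobject]@{
            targetIntention     = 'Deploy'
            failureWorkflowName = 'WorkflowB'
            failureMessage      = 'Sample error B'
            workflowId          = "wf-b-$_"
        }
    }
)

$summary = @(Group-DeploymentFailure -Record $sample)
$summary | Format-Table -AutoSize
```

With the sample data, only the `WorkflowA` group passes the threshold, which matches the intent of `where Count > 3` in the query: rare one-off failures are filtered out so the recurring cause stands out.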