anduin revised this gist . Go to revision
2 files changed, 16 insertions, 1 deletion
FixBatch.md
@@ -42,7 +42,6 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object | |||
42 | 42 | ||
43 | 43 | If DMS is not available, you can use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |
44 | 44 | ||
45 | - | ||
46 | 45 | ```kusto | |
47 | 46 | SubstrateMachine | |
48 | 47 | | where DeployRing == "SDFV2" |
FixStruggler.md(file created)
@@ -0,0 +1,16 @@ | |||
1 | + | ## Chapter 3 - Diagnosing why the remaining machines are not deploying | |
2 | + | ||
3 | + | 1. Run the query below to view information about the remaining machines | |
4 | + | ||
5 | + | Use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |
6 | + | ||
7 | + | ```kusto | |
8 | + | SubstrateMachine | |
9 | + | | where ActivityState == "DotBuildUpgrade" and DesiredMachineDefinition == "BE" | |
10 | + | | where ActualExchangeVersion contains "15.20.8534" | |
11 | + | | where DeployRing in ('SIP', 'WW') | |
12 | + | | extend unpatched = strcmp(ActualExchangeVersion, "15.20.8534.031") < 0 | |
13 | + | | summarize TotalCount=count(), unpatchedCount = countif(unpatched) by Forest | |
14 | + | | extend UnPatchedPercentage = round(100.0 * unpatchedCount / TotalCount, 2) | |
15 | + | | order by UnPatchedPercentage desc | |
16 | + | ``` |
anduin revised this gist . Go to revision
1 file changed, 2 insertions
FixMachine.md
@@ -96,6 +96,8 @@ CentralAdminWorkflows_Global | |||
96 | 96 | | sort by CreateTimeUtc asc | |
97 | 97 | ``` | |
98 | 98 | ||
99 | + | For ITAR, use [Jarvis](https://portal.microsoftgeneva.com/logs/dgrep?be=DGrep&ep=CA%20Fairfax&ns=O365PassiveITAR&en=CentralAdminWorkflows&time=2025-03-05T07:23:00.000Z&UTC=true&offset=-3&offsetUnit=Days&conditions=[[%22ClassName%22,%22%3D%3D%22,%22PatchPersistenceInspector%22]]&kqlClientQuery=source%0A|%20extend%20WorkflowId%20%3D%20strcat(%22\\\\%22,%20ManagementUnit,%20%22\\%22,%20Id)%0A|%20project%20ClassName,%20Result,%20CreateTimeUtc,%20EndTimeUtc,%20WorkflowId,%20Exception,%20LastGoodKnownState,%20UserContext,%20TenantVersion%0A|%20sort%20by%20CreateTimeUtc%20desc&aggregates=[%22Count%20by%20env_cloud_roleInstance%22]&chartEditorVisible=true&chartType=line&chartLayers=[[%22New%20Layer%22,%22%22],[%22Count%20by%20env_cloud_roleInstance%22,%22groupby%20env_time.roundDown(\%22PT1M\%22)%20as%20X,%20env_cloud_roleInstance\nwhere%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0029\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN5MGT0401CS009\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS013\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0031\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0037\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN1MGT04CS103\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS019\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22CY1MGT04CS110\%22\nlet%20Count%20%3D%20Count()%22]]%20). | |
100 | + | ||
99 | 101 | By this point, we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below. | |
100 | 102 | ||
101 | 103 | 6. Categorize the deployment errors by cause: |
anduin revised this gist . Go to revision
1 file changed, 3 insertions, 2 deletions
FixBatch.md
@@ -48,10 +48,11 @@ SubstrateMachine | |||
48 | 48 | | where DeployRing == "SDFV2" | |
49 | 49 | | where DesiredMachineDefinition == "BE" | |
50 | 50 | | where DesiredVersion contains "15.20.8495" | |
51 | - | | count | |
51 | + | | where ProvisioningState != "Provisioned" | |
52 | + | | project Name, ActualVersion, DesiredVersion, Dag, Forest, DesiredMachineDefinition, ProvisioningState, ActivityState | |
53 | + | | sort by Dag | |
52 | 54 | ``` | |
53 | 55 | ||
54 | - | ||
55 | 56 | * At this step: determine the Role of the machines that cannot deploy | |
56 | 57 | ||
57 | 58 | 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version. |
anduin revised this gist . Go to revision
1 file changed, 12 insertions
FixBatch.md
@@ -40,6 +40,18 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'FE' } | Group-Object | |||
40 | 40 | $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object ActualVersion | Sort-Object { $_.Name } | |
41 | 41 | ``` | |
42 | 42 | ||
43 | + | If DMS is not available, you can use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |
44 | + | ||
45 | + | ||
46 | + | ```kusto | |
47 | + | SubstrateMachine | |
48 | + | | where DeployRing == "SDFV2" | |
49 | + | | where DesiredMachineDefinition == "BE" | |
50 | + | | where DesiredVersion contains "15.20.8495" | |
51 | + | | count | |
52 | + | ``` | |
53 | + | ||
54 | + | ||
43 | 55 | * At this step: determine the Role of the machines that cannot deploy | |
44 | 56 | ||
45 | 57 | 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version. |
anduin revised this gist . Go to revision
1 file changed, 10 insertions
FixMachine.md
@@ -86,6 +86,16 @@ Enable-SeeAnything | |||
86 | 86 | See-Workflow $workflowId | |
87 | 87 | ``` | |
88 | 88 | ||
89 | + | If DMS is not available, consider using the Kusto query below: | |
90 | + | ||
91 | + | ```kusto | |
92 | + | CentralAdminWorkflows_Global | |
93 | + | | where RootWorkflowId == '$guid' | |
94 | + | | extend WorkflowId = strcat("\\\\", ManagementUnit, "\\", Id) | |
95 | + | | project ClassName, Result, CreateTimeUtc, EndTimeUtc, WorkflowId, Exception, LastGoodKnownState, UserContext, TenantVersion,RootWorkflowId | |
96 | + | | sort by CreateTimeUtc asc | |
97 | + | ``` | |
98 | + | ||
89 | 99 | By this point, we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below. | |
90 | 100 | ||
91 | 101 | 6. Categorize the deployment errors by cause: |
anduin revised this gist . Go to revision
1 file changed, 11 insertions
FixBatch.md
@@ -78,6 +78,17 @@ APSFailedWorkitemEvent_Global | |||
78 | 78 | | order by targetIntention asc, Count desc | |
79 | 79 | ``` | |
80 | 80 | ||
81 | + | If the output contains a large number of DownloadComponent errors, you can use this query to check their distribution: | |
82 | + | ||
83 | + | ```kusto | |
84 | + | ComponentReplicationCogsEvent_Global() | |
85 | + | | where deployRing == "TDF" and env_time > ago(100h) | |
86 | + | | summarize | |
87 | + | Failed = countif(result == 'Failed'), | |
88 | + | Succeeded = countif(result == 'Succeeded') by bin(env_time, 30min) | |
89 | + | | render timechart | |
90 | + | ``` | |
91 | + | ||
81 | 92 | The query above outputs some sample machines. See Chapter 2 to diagnose these machines further. | |
82 | 93 | ||
83 | 94 | 8. Find the error message, check the logs, and identify the correct owner. |
anduin revised this gist . Go to revision
1 file changed, 8 insertions
FixBatch.md
@@ -131,6 +131,14 @@ Get-DeploymentConfigApprovedVersion -ApprovedVersion 15.20.74 | |||
131 | 131 | Get-DeploymentConfigPrerequisiteVersion -EntityName BE -ApprovedVersion 15.20.7472.030 | ft -a | |
132 | 132 | ``` | |
133 | 133 | ||
134 | + | When DMS is not available, use the Kusto query below as a stopgap: | |
135 | + | ||
136 | + | ``` | |
137 | + | SubstrateConfigWorkItem | |
138 | + | | where DeployRing contains "TDF" and ApprovedVersion contains "8374" and ServerRole contains "BE" | |
139 | + | | project HandlerType, HandlerStatus, WhenChanged | |
140 | + | ``` | |
141 | + | ||
134 | 142 | whether it is complete | |
135 | 143 | ||
136 | 144 | 12. Check whether a config version has been created for the preceding Ring |
anduin revised this gist . Go to revision
1 file changed, 2 insertions
FixBatch.md
@@ -1,3 +1,5 @@ | |||
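1 | + | This section describes a general method for diagnosing machine deployment failures in Substrate datacenters. It helps locate the core problem from a high-level view. | |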
2 | + | ||
1 | 3 | 1. Prepare the workspace: immediately open two DMS instances, two OSP instances, and one Kusto Explorer. | |
2 | 4 | ||
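3 | 5 | 2. Identify: determine the scope of the failure, whether it is a version or a Ring. Check the trend chart for this Ring in OSP. Check the Substrate version history and confirm the version type (Dogfood, Daily). |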
anduin revised this gist . Go to revision
1 file changed, 1 insertion
FixMachine.md
@@ -82,6 +82,7 @@ ApsPrioritizerTraceEvent_Global | |||
82 | 82 | The output of step 2 includes a WorkflowId. We can use this WorkflowId to view the machine's deployment errors. | |
83 | 83 | ||
84 | 84 | ```powershell | |
85 | + | Enable-SeeAnything | |
85 | 86 | See-Workflow $workflowId | |
86 | 87 | ``` | |
87 | 88 |
anduin revised this gist . Go to revision
1 file changed, 1 insertion, 1 deletion
FixBatch.md
@@ -10,7 +10,7 @@ | |||
10 | 10 | ||
11 | 11 | **Do not** skip this step! Many problems are caused by Overrides. You may well find that someone has already applied an Override for this problem. | |
12 | 12 | ||
13 | - | On the OSP Overrides page, search for: | |
13 | + | On the [OSP Overrides](https://m365pulse.microsoft.com/DeploymentCore/DeploymentMonitorApp/control%20panel/override) page, search for: | |
14 | 14 | ||
15 | 15 | * Information about this version itself | |
16 | 16 | * Overrides that contain 999 |