Last active 1742359572

anduin revised this gist 1742359572.

2 files changed, 16 insertions, 1 deletion

FixBatch.md

@@ -42,7 +42,6 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object
42 42
43 43 If DMS is unavailable, use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange)
44 44
45 -
46 45 ```kusto
47 46 SubstrateMachine
48 47 | where DeployRing == "SDFV2"

FixStruggler.md (file created)

@@ -0,0 +1,16 @@
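1 + ## Chapter 3 - Diagnosing why the remaining machines are not deploying
2 +
3 + 1. Run the query below to view information about the remaining machines.
4 +
5 + Use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange)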
6 +
7 + ```kusto
8 + SubstrateMachine
9 + | where ActivityState == "DotBuildUpgrade" and DesiredMachineDefinition == "BE"
10 + | where ActualExchangeVersion contains "15.20.8534"
11 + | where DeployRing in ('SIP', 'WW')
12 + | extend unpatched = strcmp(ActualExchangeVersion, "15.20.8534.031") < 0
13 + | summarize TotalCount=count(), unpatchedCount = countif(unpatched) by Forest
14 + | extend UnPatchedPercentage = round(100.0 * unpatchedCount / TotalCount, 2)
15 + | order by UnPatchedPercentage desc
16 + ```
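
The query above marks a machine as unpatched with `strcmp`, which compares the version strings character by character. As a hedged illustration only (it is not part of the gist), the Python sketch below reproduces the unpatched-percentage calculation with a numeric comparison of dotted versions. The helper names and the sample versions are hypothetical. The two comparisons agree only when every version component is zero-padded to the same width, which is an assumption about the data rather than something the gist states.

```python
def version_key(version):
    """Split a dotted version such as '15.20.8534.031' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def unpatched_percentage(versions, threshold):
    """Return the percentage of versions numerically below the threshold, rounded to 2 places."""
    total = len(versions)
    if total == 0:
        return 0.0
    unpatched = sum(1 for v in versions if version_key(v) < version_key(threshold))
    return round(100.0 * unpatched / total, 2)


if __name__ == "__main__":
    # Hypothetical sample data for one forest; not taken from any real cluster.
    sample = ["15.20.8534.030", "15.20.8534.031", "15.20.8534.029", "15.20.8534.031"]
    print(unpatched_percentage(sample, "15.20.8534.031"))

    # A plain string comparison and a numeric comparison can disagree
    # when the last component is not zero-padded.
    print("15.20.8534.9" < "15.20.8534.031")
    print(version_key("15.20.8534.9") < version_key("15.20.8534.031"))
```

If the versions in the table are always padded to a fixed width, the string comparison in the query is sufficient; the numeric form is only a safeguard against mixed formats.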

anduin revised this gist 1741160159.

1 file changed, 2 insertions

FixMachine.md

@@ -96,6 +96,8 @@ CentralAdminWorkflows_Global
96 96 | sort by CreateTimeUtc asc
97 97 ```
98 98
99 + For ITAR, use [Jarvis](https://portal.microsoftgeneva.com/logs/dgrep?be=DGrep&ep=CA%20Fairfax&ns=O365PassiveITAR&en=CentralAdminWorkflows&time=2025-03-05T07:23:00.000Z&UTC=true&offset=-3&offsetUnit=Days&conditions=[[%22ClassName%22,%22%3D%3D%22,%22PatchPersistenceInspector%22]]&kqlClientQuery=source%0A|%20extend%20WorkflowId%20%3D%20strcat(%22\\\\%22,%20ManagementUnit,%20%22\\%22,%20Id)%0A|%20project%20ClassName,%20Result,%20CreateTimeUtc,%20EndTimeUtc,%20WorkflowId,%20Exception,%20LastGoodKnownState,%20UserContext,%20TenantVersion%0A|%20sort%20by%20CreateTimeUtc%20desc&aggregates=[%22Count%20by%20env_cloud_roleInstance%22]&chartEditorVisible=true&chartType=line&chartLayers=[[%22New%20Layer%22,%22%22],[%22Count%20by%20env_cloud_roleInstance%22,%22groupby%20env_time.roundDown(\%22PT1M\%22)%20as%20X,%20env_cloud_roleInstance\nwhere%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0029\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN5MGT0401CS009\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS013\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0031\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0037\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN1MGT04CS103\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS019\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22CY1MGT04CS110\%22\nlet%20Count%20%3D%20Count()%22]]%20).
100 +
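99 101 By this point we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below.
100 102
101 103 6. Classify the deployment errors by cause: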

anduin revised this gist 1739623195.

1 file changed, 3 insertions, 2 deletions

FixBatch.md

@@ -48,10 +48,11 @@ SubstrateMachine
48 48 | where DeployRing == "SDFV2"
49 49 | where DesiredMachineDefinition == "BE"
50 50 | where DesiredVersion contains "15.20.8495"
51 - | count
51 + | where ProvisioningState != "Provisioned"
52 + | project Name, ActualVersion, DesiredVersion, Dag, Forest, DesiredMachineDefinition, ProvisioningState, ActivityState
53 + | sort by Dag
52 54 ```
53 55
54 -
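55 56 * In this step: determine the Role of the machines that cannot deploy.
56 57
57 58 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version.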

anduin revised this gist 1739623062.

1 file changed, 12 insertions

FixBatch.md

@@ -40,6 +40,18 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'FE' } | Group-Object
40 40 $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object ActualVersion | Sort-Object { $_.Name }
41 41 ```
42 42
43 + If DMS is unavailable, use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange)
44 +
45 +
46 + ```kusto
47 + SubstrateMachine
48 + | where DeployRing == "SDFV2"
49 + | where DesiredMachineDefinition == "BE"
50 + | where DesiredVersion contains "15.20.8495"
51 + | count
52 + ```
53 +
54 +
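43 55 * In this step: determine the Role of the machines that cannot deploy.
44 56
45 57 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version.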

anduin revised this gist 1739618839.

1 file changed, 10 insertions

FixMachine.md

@@ -86,6 +86,16 @@ Enable-SeeAnything
86 86 See-Workflow $workflowId
87 87 ```
88 88
89 + If DMS is unavailable, consider using the Kusto query below:
90 +
91 + ```kusto
92 + CentralAdminWorkflows_Global
93 + | where RootWorkflowId == '$guid'
94 + | extend WorkflowId = strcat("\\\\", ManagementUnit, "\\", Id)
95 + | project ClassName, Result, CreateTimeUtc, EndTimeUtc, WorkflowId, Exception, LastGoodKnownState, UserContext, TenantVersion, RootWorkflowId
96 + | sort by CreateTimeUtc asc
97 + ```
98 +
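89 99 By this point we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below.
90 100
91 101 6. Classify the deployment errors by cause: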

anduin revised this gist 1739618502.

1 file changed, 11 insertions

FixBatch.md

@@ -78,6 +78,17 @@ APSFailedWorkitemEvent_Global
78 78 | order by targetIntention asc, Count desc
79 79 ```
80 80
81 + If the output contains a large number of DownloadComponent errors, use this query to see how they are distributed over time:
82 +
83 + ```kusto
84 + ComponentReplicationCogsEvent_Global()
85 + | where deployRing == "TDF" and env_time > ago(100h)
86 + | summarize
87 + Failed = countif(result == 'Failed'),
88 + Succeeded = countif(result == 'Succeeded') by bin(env_time, 30min)
89 + | render timechart
90 + ```
91 +
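81 92 The query above outputs some example machines. See Chapter 2 to diagnose these machines further.
82 93
83 94 8. Find the error message, check the logs, and identify the correct owner.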
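
The DownloadComponent query added in this revision counts failed and succeeded results per 30-minute bucket using `bin(env_time, 30min)`. As a rough sketch of that aggregation (an illustration, not the gist author's code), the Python below rounds each timestamp down to its bucket start and counts both outcomes per bucket. The function names and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bin_start(ts, width=timedelta(minutes=30)):
    """Round a timestamp down to the start of its bucket, similar to Kusto's bin()."""
    return EPOCH + ((ts - EPOCH) // width) * width


def summarize(events, width=timedelta(minutes=30)):
    """Return sorted (bucket_start, failed_count, succeeded_count) rows.

    `events` is an iterable of (timestamp, result) pairs. Results other than
    'Failed' or 'Succeeded' still create a bucket, with both counts left at 0.
    """
    failed = Counter()
    succeeded = Counter()
    buckets = set()
    for ts, result in events:
        bucket = bin_start(ts, width)
        buckets.add(bucket)
        if result == "Failed":
            failed[bucket] += 1
        elif result == "Succeeded":
            succeeded[bucket] += 1
    return [(b, failed[b], succeeded[b]) for b in sorted(buckets)]


if __name__ == "__main__":
    # Hypothetical events; real data would come from the Kusto table.
    sample = [
        (datetime(2025, 2, 15, 10, 5), "Failed"),
        (datetime(2025, 2, 15, 10, 29), "Succeeded"),
        (datetime(2025, 2, 15, 10, 31), "Failed"),
    ]
    for row in summarize(sample):
        print(row)
```

A bucket where failures suddenly outnumber successes is the kind of pattern the time chart in the query is meant to expose.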

anduin revised this gist 1736349883.

1 file changed, 8 insertions

FixBatch.md

@@ -131,6 +131,14 @@ Get-DeploymentConfigApprovedVersion -ApprovedVersion 15.20.74
131 131 Get-DeploymentConfigPrerequisiteVersion -EntityName BE -ApprovedVersion 15.20.7472.030 | ft -a
132 132 ```
133 133
134 + When DMS is unavailable, use the Kusto query below as a fallback:
135 +
136 + ```kusto
137 + SubstrateConfigWorkItem
138 + | where DeployRing contains "TDF" and ApprovedVersion contains "8374" and ServerRole contains "BE"
139 + | project HandlerType, HandlerStatus, WhenChanged
140 + ```
141 +
134 142 whether it is complete
135 143
136 144 12. Check whether a config version has been created for the preceding Ring.

anduin revised this gist 1733755559.

1 file changed, 2 insertions

FixBatch.md

@@ -1,3 +1,5 @@
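1 + This section describes a general method for diagnosing machine deployment failures in Substrate datacenters. It helps locate the core problem from a high-level view.
2 +
1 3 1. Prepare the workspace: immediately open two DMS instances, two OSP instances, and one Kusto Explorer.
2 4
3 5 2. Identify: determine the scope of the failure, whether it is a version or a Ring. Check this Ring's trend chart in OSP. Check the Substrate version history and confirm the version type (Dogfood, Daily).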

anduin revised this gist 1733755030.

1 file changed, 1 insertion

FixMachine.md

@@ -82,6 +82,7 @@ ApsPrioritizerTraceEvent_Global
82 82 The output of step 2 shows the WorkflowId. We can use this WorkflowId to view the machine's deployment errors.
83 83
84 84 ```powershell
85 + Enable-SeeAnything
85 86 See-Workflow $workflowId
86 87 ```
87 88

anduin revised this gist 1733754977.

1 file changed, 1 insertion, 1 deletion

FixBatch.md

@@ -10,7 +10,7 @@
10 10
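11 11 **Do not** skip this step! Many problems are caused by an Override. You may well find that someone is already overriding this issue.
12 12
13 - On the OSP Overrides page, search for:
13 + On the [OSP Overrides](https://m365pulse.microsoft.com/DeploymentCore/DeploymentMonitorApp/control%20panel/override) page, search for:
14 14
15 15 * Information about this version itself
16 16 * Overrides that contain 999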