Intel Skylake/Cascade Lake CPU Performance Issue

Some Swarm clients may be vulnerable to a kernel issue that, under some circumstances, causes the clock source on some CPUs to change spontaneously from TSC to HPET. All kernel operations, such as process scheduling and interrupts, depend on the clock source. TSC is the most accurate clock, and falling back to HPET causes operating system instability. The issue is described in a kernel bug filing here and applies only to some kernel versions and some Intel CPUs. Further details have been tracked in Jira under SWAR-9055.

Swarm versions from 10.0 onward are impacted by this issue. Swarm 14.1.0 and later builds allow a kernel option work-around that mitigates the issue, described below.

Symptoms of the issue include highly degraded performance, abnormally long mount times, and histogram tails over 7 seconds. The dmesg output may contain lines like “clocksource: Switched to clocksource hpet”.
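Where a saved dmesg dump or shell access to a Linux host is available, the dmesg symptom can be checked with a short script. This is a minimal sketch, not a Swarm tool: the function name last_clocksource is hypothetical. It reports only the last clock source switch found, because some kernels may select hpet briefly during early boot before settling on tsc.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swarm): read dmesg text on stdin and print
# the name of the clock source in the LAST "Switched to clocksource" line.
# An output of "hpet" suggests the fallback described above.
last_clocksource() {
    sed -n 's/.*clocksource: Switched to clocksource \([^ ]*\).*/\1/p' | tail -n 1
}

# Usage against a saved dump:
#   last_clocksource < dmesg.txt
# Usage on a live Linux host with shell access:
#   dmesg | last_clocksource
#   cat /sys/devices/system/clocksource/clocksource0/current_clocksource
```

An empty result means the dump contains no clock source switch messages at all, which can happen if the kernel ring buffer has already wrapped.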

To check whether a particular CPU is impacted, find the CPU model name in dmesg or hwinfo dumps, then look up the CPU on the Intel web site. On the product page for that CPU, search for “lake”. If it is present, mitigation is needed. Note that a very large number of CPU types require mitigation.

Example 1: Intel(R) Xeon(R) Silver 4216 links to here. The page mentions “Products formerly Cascade Lake”. This CPU requires mitigation.

Example 2: Intel(R) Xeon(R) CPU E5-2650 v2 links to here. No mention of “lake”. This CPU does not require mitigation.
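The model name needed for the lookup can be pulled out of a dmesg dump with a one-line filter. The sketch below assumes the dump contains an “smpboot: CPU0:” line in the form printed by recent Linux kernels (model name followed by a “(family: ...” suffix); the function name cpu_model_from_dmesg is hypothetical, and the layout of hwinfo dumps is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Swarm): read dmesg text on stdin and print
# the CPU model name from the first "smpboot: CPU0:" line.
# Assumes the line looks like:
#   ... smpboot: CPU0: <model name> (family: 0x6, model: ..., stepping: ...)
cpu_model_from_dmesg() {
    sed -n 's/.*smpboot: CPU0: \(.*\) (family:.*/\1/p' | head -n 1
}

# Usage:
#   cpu_model_from_dmesg < dmesg.txt
```

The printed name is only the input to the manual lookup on the Intel web site; the script does not decide whether mitigation is needed.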

With Swarm 14.1.0 and future releases where this issue is a concern, the build attempts to discover vulnerable CPUs automatically. If a vulnerable CPU is found and the work-around (described next) has not been applied, a CRITICAL message is issued soon after boot: “This chassis needs a clocksource boot parameter to address the Skylake performance vulnerability (Intel Skylake/Lewisburg/Purley). Contact support.” Swarm 15.2 improves this detection.

Mitigation via CSN, SCS, or PXE Boot

The kernel boot parameter “clocksource.max_cswd_read_retries=50” must be applied in the cluster boot environment (the CSN, SCS, or other PXE boot server) to work around this issue. The option can go after the bonding mode parameter. Once the kernel option is applied, the CRITICAL message above no longer appears, and the CPUs should behave close to normally.
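After a node has rebooted with the option in place, one way to confirm that it reached the kernel is to inspect the kernel command line. The sketch below assumes the command line is available as a string, for example from /proc/cmdline on a Linux host with shell access or from the “Kernel command line:” line of a dmesg dump. The function name cmdline_has_param is hypothetical and not part of Swarm.

```shell
#!/bin/sh
# The parameter named in this article.
PARAM='clocksource.max_cswd_read_retries=50'

# Hypothetical helper: succeed (exit status 0) if the kernel command line
# passed as $1 contains PARAM as a complete, space-delimited word.
cmdline_has_param() {
    case " $1 " in
        *" $PARAM "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Usage on a live Linux host:
#   cmdline_has_param "$(cat /proc/cmdline)" && echo "work-around applied"
# Usage against a saved dmesg dump:
#   cmdline_has_param "$(sed -n 's/.*Kernel command line: //p' dmesg.txt)"
```

Matching on a whole word avoids treating a different value, such as a retry count of 500, as the expected setting.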

For SCS:

Run the following first to display any kernel arguments already in place, so that they are not accidentally removed when the setting is changed:
scsctl network_boot config show kernel.extraArgs -d
The following SCS command changes the kernel boot options setting:
scsctl network_boot config set kernel.extraArgs=clocksource.max_cswd_read_retries=50 -d
A restart of the swarm-platform service is required in order for the changes to take effect:
systemctl restart swarm-platform

For CSN:

Edit /etc/caringo/netboot/netboot.cfg on the CSN and add the above boot parameter to the kernelOptions setting. The example below shows how the config file looks with the new argument added after an argument that was already present:

kernelOptions = castor_net=active-backup: clocksource.max_cswd_read_retries=50

Restart netboot on the CSN with: service netboot restart. Lastly, restart the Swarm storage nodes one at a time. Because this is a change to a kernel boot parameter, it takes effect only early in the boot sequence, so each node picks it up only when it reboots.
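The kernelOptions edit can also be scripted. The following is a minimal sketch that assumes netboot.cfg holds the setting on a single line of the form “kernelOptions = ...”, as in the example above; the function name add_clocksource_param is hypothetical. It writes the modified config to stdout rather than editing the file in place, so the result can be reviewed before it replaces the original.

```shell
#!/bin/sh
# The parameter named in this article.
PARAM='clocksource.max_cswd_read_retries=50'

# Hypothetical helper: copy a config from stdin to stdout, appending PARAM to
# the kernelOptions line unless that line already contains it.
# Note: the "already contains" check is a plain substring match.
add_clocksource_param() {
    awk -v p="$PARAM" '
        /^[ \t]*kernelOptions[ \t]*=/ && index($0, p) == 0 {
            sub(/[ \t]+$/, "")      # drop trailing whitespace
            $0 = $0 " " p           # append the new argument
        }
        { print }
    '
}

# Usage (review the diff before installing the new file):
#   add_clocksource_param < /etc/caringo/netboot/netboot.cfg > /tmp/netboot.cfg.new
#   diff /etc/caringo/netboot/netboot.cfg /tmp/netboot.cfg.new
```

Running the helper twice leaves the line unchanged the second time, so it is safe to rerun on a config that has already been updated.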

USB Boot

Modify syslinux.cfg if the USB boot method is used. Example before and after: