Debugging CAstor node using Linux tools

Runtime Environments

DataCore Swarm runs in three different modes: production, debug, and probe. Debugging issues on these builds is not the same.

  • The production build does not have SSH (Secure Shell) access; hence, debug issues through the cluster interface (GUI).

  • The debug build is interactive and has SSH enabled. Log in to the cluster nodes and run various commands, or go through the logs, to understand and debug issues.

  • The probe build runs a minimal stack and does not have clustering or storage capabilities. It is used to check the system capabilities, configuration, network connectivity, and overall performance of the system.

Issue Types

The issues are divided into four categories: CPU, memory, storage, and networking.

  • CPU - Always check the stats, logs, and settings of these components while troubleshooting. CPU-related issues often stem from CPU power settings, so check the system CPU power policy and ensure that the power setting is running in performance mode. If the system is overloaded with multiple processes, check the CPU load average. If the load average is consistently higher than the number of CPUs (for example, double the CPU count), add more CPUs or reduce the workload.

  • Memory - System memory is limited and shared among various components. System and user processes consume the largest chunk of memory. Frequent allocation and deallocation fragments memory, which can cause memory allocations to slow down or fail; the affected process then suffers and runs very slowly. Always keep enough free memory for the process; otherwise, increase the RAM.

  • Storage - Several types of storage media are available: slow, fast, and extremely fast. Applications frequently access these storage media to store and read data. Application performance is hampered if the storage media is faulty, overloaded, or not configured properly. Confirm this by looking at the kernel logs and disk configuration, or by running a benchmark test such as Fio.

  • Networking - Communication is an important component of a system. A bad configuration or link can trigger a false alarm; for example, bond0 can switch to the secondary slave if no activity is detected on the primary slave. A poor network connection causes frequent retransmission of packets and leads to slow packet transmission over the network. A bad NIC MTU size also causes fragmentation issues. For such scenarios, check the path MTU (the MTU of the sender NIC, the switch/router port, and the destination NIC must be the same). To debug network issues, check the kernel logs, NIC stats, and socket stats.

System Component

Problem

How to Troubleshoot


CPU

  • Slow system response; SSH login takes time

  • Processes and commands are running slow

  1. Use the htop command to check the CPU load average and the process/task thread count. For example, htop

  2. Identify which process is consuming the most CPU and check the CPU% consumed.
    The load average of the last 1, 5, and 15 minutes appears.

  3. If the load average is high, increase the CPU count on the system or reduce the application task/thread count.

  4. To learn about CPU usage, run the mpstat command. Stats for each CPU appear, showing whether the CPU spends most of its time in kernel or user space, and which CPUs are busy or idle.
    For example, mpstat -A
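The load-average check in the steps above can also be done without htop; a minimal sketch reading the kernel's counters directly (no assumptions beyond a standard Linux /proc):

```shell
# Load average over the last 1, 5, and 15 minutes
cat /proc/loadavg
# Number of online CPUs, to compare the load average against
nproc
```

As a rule of thumb, a load average consistently above the CPU count reported by nproc indicates an overloaded system.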

CPU

System response is quick, but applications are running slow

  1. Use the htop command to check the load average.

  2. Check the current CPU governor and ensure that it is set to performance.
    For example, cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  3. To learn more about CPU frequency and its type, refer to https://wiki.archlinux.org/title/CPU_frequency_scaling.

  4. Check the kernel logs to see if any driver is missing or not compatible with the current CPU. The kernel logs warning/info messages in dmesg.
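The governor check in step 2 can be sketched as follows. Setting the governor requires root, and the cpufreq sysfs interface may be absent on some virtual machines, so the sketch falls back gracefully:

```shell
# Show the current governor for each CPU (falls back if cpufreq is absent)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 2>/dev/null \
    || echo "cpufreq interface not available"

# Set every CPU to the performance governor (run as root)
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    [ -w "$gov" ] && echo performance > "$gov"
done
```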

Memory

The system/process is running slow OR a process fails to run

Under high memory pressure, the system starts behaving erratically and processes run slow. When memory is fragmented, the system tries to defragment (compact) it, which is slow. To identify such issues, run the following commands to check memory availability.

  1. Run the htop command to check memory stats in visual form, or run the free -thw command to check current memory usage.

  2. Check the system memory usage pattern of the past few days with the help of the sar command. For example, sar -2 -rh
    Here, -2 selects the data from two days ago.
    This gives an idea about the past memory usage pattern. Compare it with the system logs to identify which operations were executing during that period.
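A quick way to combine the checks above; free summarizes usage, and /proc/meminfo exposes the raw counters behind it:

```shell
# Current memory usage: totals (-t), human-readable (-h), wide layout (-w)
free -thw
# The key counters: MemAvailable is the better "free memory" estimate
grep -E '^(MemTotal|MemFree|MemAvailable)' /proc/meminfo
```

MemAvailable, unlike MemFree, accounts for reclaimable caches, so it is the value to watch when deciding whether a process has enough memory.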

Memory

The process is killed due to Out of Memory (OOM).

  1. Check the memory stats using the above commands.

  2. Under continuous memory pressure, the system tries to kill some process, though the process selection is complex.

  3. Check which process is likely to be killed next.
    Each process has a score, and the system chooses the process to kill based on it. The higher the score, the higher the chance of being killed by the OOM killer.

For example, cat /proc/<process id>/oom_score
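As a convenience, the per-process scores can be ranked in one pass; a minimal sketch (the output depends on whatever happens to be running):

```shell
# List the five processes most likely to be chosen by the OOM killer:
# score, PID, and command name, highest score first
for pid in /proc/[0-9]*; do
    score=$(cat "$pid/oom_score" 2>/dev/null) || continue  # skip exited PIDs
    comm=$(cat "$pid/comm" 2>/dev/null)
    printf '%s %s %s\n' "$score" "${pid#/proc/}" "$comm"
done | sort -rn | head -5
```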

Storage

The application is running slow due to slow reads & writes

Read and write performance is impacted by a bad disk, overloaded disks, multiple input-outputs (IOs) issued on a disk, or a full queue.

  1. Run the iostat -tkx 2 command to check the disk stats. It shows how many IOs are queued and the IO latency for reads and writes.
    The output shows stats for all disks; look for unusual latency or queue sizes.

  2. It is possible that a disk has gone bad. To check the disk health, run the smartctl -a /dev/<device name> command and check for errors.
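The two checks above can be sketched as follows. iostat comes from the sysstat package and smartctl from smartmontools, so either may need installing first; /dev/sda is an example device name:

```shell
# Extended disk stats: timestamps (-t), KB/s (-k), extended metrics (-x),
# two reports 2 seconds apart; watch the latency (await) and queue columns
command -v iostat >/dev/null && iostat -tkx 2 2 \
    || head -5 /proc/diskstats   # raw counters if sysstat is not installed

# SMART health summary for one disk (run as root; sda is an example name)
command -v smartctl >/dev/null && smartctl -a /dev/sda \
    || echo "smartctl not installed or no access to /dev/sda"
```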

Storage

Disk benchmarking with and without SWARM software stack

Customers often feel that SWARM reads and writes are slow on expensive hardware. To isolate the issue, or to prove that SWARM is working as expected, DataCore Swarm runs the Fio benchmark test with and without SWARM to determine whether the disks themselves are slow.

Or

Before installing a SWARM cluster, customers want to know the compatibility and performance of their existing or new hardware (storage). For this, DataCore Swarm runs the Fio benchmark test.

Note

Do not run write tests on a production setup or a production (debug build) setup; it may corrupt the data. Instead, run the read test.

SWARM 15.0 provides the ability to run the read test through the SWARM GUI. To run the read performance tests on individual disks, follow the steps below:

  1. Go to <hostname:90/nodestatus>. For example, http://abc.caringo.com:90/nodestatus

  2. Click on node info.

  3. Select the fio read test from the drop-down.

  4. Select the disk name. The test starts and takes a few minutes to complete.

On a debug (non-production) setup, run the below commands:

  1. Log in to the setup using ssh.

  2. Run the below command in the terminal.

    fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=readtest --filename=/dev/sda --iodepth=1 --size=128M --time_based --runtime=30 --readwrite=randread

Storage

Benchmarking file system operations

To benchmark filesystem operations such as read, write, flush, mkdir, rmdir, and so on, use the "dbench" benchmarking tool.
See also https://linux.die.net/man/1/dbench.
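A minimal dbench run might look like this; the client count, duration, and scratch directory below are arbitrary examples, and dbench must be installed:

```shell
# Run dbench with 4 simulated clients for 60 seconds in a scratch directory
mkdir -p /tmp/dbench-test
if command -v dbench >/dev/null; then
    dbench -D /tmp/dbench-test -t 60 4
else
    echo "dbench is not installed"
fi
```

dbench reports an aggregate throughput figure and per-operation latencies, which can be compared across filesystems or mount options.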

Network

The file upload is slow or internode object replication is slow

Multiple reasons for slow reads and writes include:

  • If the storage and memory subsystems are not a bottleneck, then it is most likely the network that is causing the issue.

  • A major problem in networks is retransmission, which occurs due to a bad link, a jittery connection, or overflowing sender-side or receiver-side network buffers.

Network latency

  1. Check network latency using ping command.

    ping <destination ip address>
  2. If the ping latency exceeds 5-6 milliseconds, it indicates a bad link. Check the physical link/switch/router.
    It is also possible that the MTU set in SWARM and the MTU of the link (switch) are not the same.

  3. To verify, run the below ping command. It sends ping packets that require a 9000-byte MTU (8972 bytes of ICMP payload plus 28 bytes of headers) with fragmentation disabled.

    ping -M do -s 8972 <destination IP address>
  4. If the above ping command fails, check the MTU size of the host NIC, switch port, and router port. The size must be 9000 bytes.
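The MTU currently configured on each local NIC can be listed with iproute2 (the switch and router ports must be checked on those devices themselves):

```shell
# Interface name and MTU for every NIC
# ("ip -o" prints one line per interface; field 2 is the name, field 5 the MTU)
ip -o link show | awk '{print $2, $5}'
```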

Network card and socket buffers
An overflowing buffer causes packet retransmission.

  1. Check the NIC RX and TX queue size.

  2. Set the size to the maximum if not already set.

  3. Check the protocol stats; if retransmission is above 5%, increase the socket buffer size.

  4. Check the current settings and set them to the maximum, if not already set.
    For example, the maximum hardware ring size is 4096 on a typical virtual machine NIC. It can differ, so check the size before setting it.
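Steps 1 and 2 are typically done with ethtool; a sketch assuming the NIC is named eno1 (setting the ring size requires root, and the commands are guarded here because the tool or interface may be absent):

```shell
# Show current and maximum RX/TX ring (queue) sizes for the NIC
ethtool -g eno1 2>/dev/null || echo "ethtool or interface eno1 not available"
# Raise the ring sizes to an example maximum of 4096 entries (run as root)
ethtool -G eno1 rx 4096 tx 4096 2>/dev/null || true
```

Compare the "Pre-set maximums" section of the ethtool -g output with the "Current hardware settings" section; if current is below maximum, raising it gives the NIC more room before dropping packets.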

Protocol based stats

  1. Run the below commands to check the stats:

    1. Per-NIC stats

    2. Protocol-based stats

  2. Change the network buffer size based on the stats, if required. The current buffer sizes are available at the following locations:

    • /proc/sys/net/ipv4/udp_mem

    • /proc/sys/net/core/rmem_max

    • /proc/sys/net/ipv4/tcp_rmem

    • /proc/sys/net/ipv4/tcp_wmem

      The commands to change the network buffer size are:

    • echo 'net.core.wmem_max=<max size>' >> /etc/sysctl.conf

    • echo 'net.core.rmem_max=<max size>' >> /etc/sysctl.conf

    • echo 'net.ipv4.tcp_rmem=<minimum> <default> <max>' >> /etc/sysctl.conf

    • echo 'net.ipv4.tcp_wmem=<minimum> <default> <max>' >> /etc/sysctl.conf

    • echo 'net.ipv4.udp_mem=<minimum> <pressure> <max>' >> /etc/sysctl.conf

      Run sysctl -p to apply the changes.
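The per-NIC and protocol-based checks can be sketched with standard tools (ss and ip are part of iproute2; the retransmit counters also appear in netstat -s from net-tools where it is installed):

```shell
# Per-NIC packet, error, and drop counters
ip -s link show
# Socket summary per protocol (TCP, UDP, RAW, ...)
ss -s
# Current core buffer limits, for comparison before tuning
cat /proc/sys/net/core/rmem_max /proc/sys/net/core/wmem_max
# After appending new values to /etc/sysctl.conf, apply them (run as root)
sysctl -p 2>/dev/null || true
```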

Network

Network monitoring

  • To monitor network load on each NIC, use the nload tool. For example,

    • Monitor all NICs - nload -m

    • Monitor single NIC - nload eno1

  • To monitor network packets, use tcpdump. For example,

    • Display available / known interfaces - tcpdump -D

    • Capture all packets on eno1 / display output in ASCII format - tcpdump -A -i eno1

    • Display captured packets in HEX and ASCII - tcpdump -XX -i eno1

    • Show numeric addresses (do not resolve hostnames) - tcpdump -n -i eno1

    • Capture only tcp packets - tcpdump -i eno1 tcp

    • Capture packets on a specific port - tcpdump -i eno1 port 90

    • Capture packets from a specific sender IP address - tcpdump -i eth0 src 192.168.2.20
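The individual filters above can be combined into one capture expression. A sketch, where port 90, the interface name eno1, and the source address are examples; capturing requires root, and the run is bounded so it cannot hang:

```shell
# TCP traffic on port 90 from one specific sender, numeric addresses only;
# stop after 5 packets or 5 seconds, whichever comes first
timeout 5 tcpdump -n -i eno1 -c 5 'tcp and port 90 and src 192.168.2.20' \
    || echo "capture unavailable (needs root, tcpdump, and interface eno1)"
```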

Network

Debugging network throughput / latency issues on multicast, TCP, or UDP configurations.

  • High network latency and low throughput are very common issues. To identify them, use the iperf tool. Based on the protocol configuration, run iperf with different parameters:

    • Debugging multicast

    • Debugging TCP

    • Debugging UDP

  • Debugging multicast: Run the iperf server on each cluster node (make sure you do not select SWARM ports, i.e., ports already in use), then run the iperf client on one of the cluster nodes. Repeat for the rest of the nodes in the cluster.

  • Debugging TCP: Run the iperf server on each cluster node, then run the iperf client from the other nodes.

  • Debugging UDP: Run the iperf server on each cluster node, then run the iperf client one by one from the rest of the nodes.

Note: Refer to the following link for SWARM services and their port numbers; this helps when selecting non-SWARM (unused) ports.

https://perifery.atlassian.net/wiki/spaces/public/pages/2443808571
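The iperf runs described above could look like the following with iperf2. The port number is an arbitrary example of one not used by SWARM, 127.0.0.1 stands in for a real peer node's address, and the UDP/multicast variants are shown as commented alternatives:

```shell
PORT=5301   # example port; pick one not used by SWARM services
if command -v iperf >/dev/null; then
    # TCP: server on one node (daemonized), client from another
    iperf -s -p "$PORT" -D
    sleep 1   # give the server a moment to bind
    iperf -c 127.0.0.1 -p "$PORT" -t 5 || echo "client run failed"
    # UDP: add -u on both sides; -b sets the client's target bandwidth
    # iperf -s -u -p "$PORT" -D
    # iperf -c <peer ip> -u -p "$PORT" -b 1G
    # Multicast: bind the server to a group address; -T sets the client TTL
    # iperf -s -u -B 239.0.0.1 -p "$PORT" -D
    # iperf -c 239.0.0.1 -u -p "$PORT" -T 3
else
    echo "iperf is not installed"
fi
```

Low bandwidth or high jitter in the client's report points at the link between the two nodes rather than at SWARM itself.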

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.