- How to show dropped packets per interface on Linux
- Displaying dropped packets per interface on Linux using netstat
- To display summary statistics for each protocol, run:
- Show TCP stats
- Display UDP stats
- Showing dropped packet statistics per network interface on Linux using the ip command
- Querying NIC- and driver-specific statistics with ethtool
- Finding out why a Linux server is dropping packets
- Building dropwatch
- Conclusion
- Searching for the cause of dropped packets on Linux
How to show dropped packets per interface on Linux
How do I display dropped packets per interface on Linux operating systems from the command line? How can I determine why a Linux server is dropping packets?
We can use the ip command, the netstat command, or the ethtool command to show dropped-packet statistics per network interface on Linux. Let us see how to use these commands to list dropped packets per interface.
| Tutorial details | |
| --- | --- |
| Difficulty level | Advanced |
| Root privileges | Yes |
| Requirements | Linux with GCC compilers |
| Est. reading time | 6 minutes (depends upon your skills & situation) |
Displaying dropped packets per interface on Linux using netstat
The netstat command is mostly obsolete; its replacements are the ss and ip commands. However, netstat is still available on older Linux distros that are in production. Hence, I will start with netstat, but if possible, use the ip/ss tools. The syntax is:
netstat -i
netstat --interfaces
Displaying network stats per network interface on Linux
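To watch the RX-DRP and TX-DRP columns update in real time, you can wrap the command in watch; a minimal example with a one-second interval:
watch -n 1 netstat -i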
To display summary statistics for each protocol, run:
netstat -s
netstat --statistics
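The full output is long, so a rough filter for drop-related counters can help:
netstat -s | grep -i drop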
Show TCP stats
netstat --statistics --tcp
netstat -s -t
Display UDP stats
netstat --statistics --udp
netstat -s -u
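If net-tools is not installed, the iproute2 suite provides roughly equivalent views; nstat prints the kernel's SNMP/netstat counters and ss -s prints a short socket summary:
nstat
ss -s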
Showing dropped packet statistics per network interface on Linux using the ip command
Let us see how to view link device stats using the ip command. The syntax is:
ip -s link
ip -s link show
ip -s link show eth0
In this example, we display link stats for wg0:
ip -s link show wg0
It is clear that TX means transmit and RX means receive. WireGuard creates the wg0 interface, so either WireGuard or the firewall is dropping packets as per policy.
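Passing -s twice makes ip print an extended per-interface error breakdown, which can help distinguish different drop causes; shown here for wg0 as in the example above:
ip -s -s link show wg0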
Querying NIC- and driver-specific statistics with ethtool
Pass the -S or --statistics option to display stats. Again, the syntax is straightforward:
ethtool -S
ethtool -S eth0
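Driver statistics can run to hundreds of lines, so filtering for drop-related counters is handy (eth0 is just a placeholder for your interface name):
ethtool -S eth0 | grep -i drop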
Another option is to query the /proc/net/dev file directly, using either the cat command or the column command:
cat /proc/net/dev
column -t /proc/net/dev
The file lists per-interface receive and transmit counters, including errors and drops.
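If you only want the drop columns, a small awk one-liner over /proc/net/dev does the job. This is a quick sketch that prints the interface name, RX drops and TX drops, assuming the usual column layout of the file:
awk 'NR > 2 { gsub(":", " "); print $1, $5, $13 }' /proc/net/dev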
Finding out why a Linux server is dropping packets
Try the dropwatch tool. As its author describes it, it is a project started in an effort to improve the ability of developers and system administrators to diagnose problems in the Linux networking stack, specifically the ability to diagnose where packets are getting dropped.
Building dropwatch
Install the required tools, libraries, and the GCC compiler collection on Ubuntu or Debian Linux:
sudo apt-get install libpcap-dev libnl-3-dev libnl-genl-3-dev binutils-dev libreadline6-dev autoconf libtool pkg-config build-essential
Next, clone the repo and then compile it:
git clone https://github.com/nhorman/dropwatch
cd dropwatch
./autogen.sh
./configure
make
make install
Run it as follows:
# dropwatch -l kas
Functions such as nf_hook_slow or icmpv6_rcv in the output are Linux kernel functions, which means I need to search the Linux kernel tree to see what is going on. Naturally, you must understand C programming and have a good understanding of Linux kernel internals, including the TCP/IP stack. Now you know how to see whether packets are dropped at the IP layer, the link layer, the UDP/TCP layer, or the application layer. If packets are dropped inside the TCP/IP stack, we need to look into the Linux kernel source code and its documentation. See the man page and source code of dropwatch for further information:
man dropwatch
I would also suggest trying out tcpdump to dump traffic on a network interface. It often gives hints about the dropped packets, and captures are easily analyzed with the Wireshark tool:
man tcpdump
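For example, to capture traffic on an interface and save it for later inspection in Wireshark (eth0 and the file name are placeholders):
tcpdump -i eth0 -w drops.pcap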
Conclusion
You learned about various Linux commands to see per-interface packet loss on Linux, including excellent tools such as dropwatch. We can also use perf, the Linux profiling utility based on performance counters. Check out perf examples for further information.
Searching for the cause of dropped packets on Linux
November 17, 2020
Hi all, just wanted to share a recent experience I had investigating a packet drop issue on a Linux system. Hope it’s useful to anyone facing similar issues.
Recently I was alerted by our monitoring system that it had detected network packet loss. The loss was not excessive, only a couple of hundred packets, nothing compared to the billions of packets flowing through our systems. There was no observable degradation of the functionality of the system. The number of lost packets was low and they were TCP packets, so TCP retransmission would overcome the loss. However, packet loss is often an early sign that something is wrong with a system, e.g. load building up beyond capacity. So I took the alert seriously and started searching for the cause.
There can be various reasons for packet loss: the network transport may be unreliable and packet loss natural, the network link could be congested, or the applications may not be able to handle the offered load. In this case the network link was reliable fibre and far from congested. The monitoring system indicated that the packet loss was on a Linux box running several server applications that process data sent by our trading machines for analysis by our traders. The packet drops were on incoming traffic. The drop count is shown by the ifconfig command.
It seemed most likely that the applications received more traffic than they could handle. Given that one application on this box was running close to its maximum capacity, I expected that this application could not handle the load during some periods of the day, something we had seen before. So I analyzed the performance metrics of this application. CPU load was not very high, and the incoming traffic was not out of the ordinary. All metrics indicated that the application was running normally. Nothing indicated that it was overloaded.
This is a single threaded application handling approximately 7000 TCP connections. The CPU load showed some spikes, but nothing excessive. Some of the packet loss periods coincided with those spikes, but not all. Also some periods of high load did not show any packet loss. The same held for the network traffic measured at the NIC level. The periods with dropped packets did not coincide with the periods having the highest packet rate. I also checked the other applications on the box and as expected their load was way less and did not show any signs of overload.
This looked odd. The network seemed fine, the applications did not seem overloaded, yet packets were dropped. Could the monitoring system be wrong? We captured network traffic with tcpdump. If the monitoring system showed packet loss, but the TCP streams in the capture did not, then the monitoring system could be wrong. It could still be that the monitoring system was right, as the packets were dropped on the machine, possibly after capturing. The capture did show gaps in the TCP sequence numbers. Tcpdump may not be able to capture all traffic and may create such gaps by itself. In this case, the gaps were real, as later in time the retransmissions for those missing packets could be seen, i.e. the clients sending those packets did not receive acknowledgements and retransmitted them.
The capture confirmed the alerts from the monitoring system. We learned something useful from the capture though. The packet drops were seen on TCP streams from all applications running on the box, not only for the one heavy application! That means application performance could not be the cause. Still I was not 100% convinced that it was not something in the application. What else could it be? I decided to do an experiment:
- Run a server that accepts incoming connections, but does not read any data from the connected socket
- Connect a client to the server
- Keep sending data from that client to the server
At some point the server should drop the incoming data. How would this packet drop show in the metrics?
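A minimal sketch of such an experiment in Python is shown below; the port number, host name, and send size are arbitrary placeholders, and the actual test setup may have looked different.
# server.py - accept one connection but never read from it,
# so the kernel receive queue for the socket fills up
import socket
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 9000))
srv.listen(1)
conn, addr = srv.accept()
print("accepted connection from", addr)
while True:
    time.sleep(60)  # keep the connection open, never call conn.recv()

# client.py - keep sending data to the server
import socket

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("server-host", 9000))  # replace with the server's address
while True:
    cli.sendall(b"x" * 1024)  # blocks once the send/receive buffers are full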
During the experiment I watched the output of the netstat -atn command on the machine running the server. As expected, the size of the receive queue (Recv-Q) kept growing. It grew to roughly 5 MB and then the queue was full; no more packets could be received. Tcpdump showed retransmissions, coming in slower and slower: the TCP exponential backoff algorithm did its work on the client side.
To my surprise, the ‘dropped’ count in ifconfig did not increase! I searched for other metrics to see where those dropped packets showed up, and found counters displayed by netstat -s that incremented with each incoming retransmission.
Now I was convinced that the packet drops we had in our production system were not due to performance issues in the application itself. But then what was it? The packets arrived at the machine and the application could handle the load, yet still the packets were dropped. What dropped them? As a software developer I was aware of network issues and application performance, but I never really thought about what sits between the NIC and my application. In between are the network driver and the kernel, something I had never really looked into. Now I had to.
The NIC, a Solarflare card, has a ring buffer of 1024 slots by default. Every incoming packet is put into this buffer through DMA. Each slot is 2 KB and can hold at most one packet, i.e. a small packet of only a few bytes still occupies a full slot. This ring buffer can be increased to 4096 slots. We did, and then there were no more packet drops!
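For reference, the receive ring size can be inspected and changed with ethtool; eth0 is a placeholder, and the maximum size depends on the NIC and driver:
ethtool -g eth0
ethtool -G eth0 rx 4096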
Apparently the buffer overflowed, and increasing the buffer stopped the packets from being dropped. Still, I felt unsatisfied, as I did not really understand why the buffer overflowed. After all, the NIC was a 10 Gbit/s interface and the incoming traffic was well below that rate. When a packet arrives at the NIC the following happens:
- An IRQ is raised
- This IRQ eventually triggers a task net_rx_action
- The net_rx_action task takes packets from the NIC ring buffer and puts them in socket buffer (skb) structures
- The skbs are passed up to the protocol layers (IP, TCP)
The net_rx_action task has a limited budget per run. The budget is determined by two constraints:
- A packet budget defined by net.core.netdev_budget – this is the maximum number of packets that can be handled in a single run.
- A time constraint of 2 jiffies.
If either of these constraints is reached, the task is stopped. Remaining packets are left in the ring buffer until the task runs again. This is to prevent net_rx_action from hogging the kernel. When this happens, the time_squeeze counter in the /proc/net/softnet_stat statistics gets incremented. These statistics show one line per CPU core, and each counter is a 32-bit value printed in hexadecimal. The third counter is the time_squeeze counter. Looking at the core handling the IRQs from the NIC, we saw that the time_squeeze counter got incremented once in a while:
6cbc9e8e 00000000 0005e467 00000000 00000000 00000000 00000000 00000000 00000000 00000000
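A crude way to monitor this is to watch the file and let watch highlight the values that change; the third column of each CPU line is the one of interest here:
watch -d -n 1 cat /proc/net/softnet_stat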
Monitoring this count showed a pattern similar to the packet drops that we had at first. The packet drops had stopped due to the increased buffer size, but at the times where we previously saw drops we now observed time squeezes. By default the packet budget net.core.netdev_budget is 300. This seems quite a low number, and apparently we received bursts of packets exceeding this number, probably also exceeding the default ring buffer size of 1024. Searching the internet showed that the default budget is too low for handling 10 Gbit/s. Hogging the CPU core did not seem to be an issue, as we had dedicated one CPU core to handling the network IRQs. So far this core had never exceeded a load of 15%.
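The current budget can be checked and raised with sysctl; 600 is just an illustrative value, and newer kernels also have a companion time limit, net.core.netdev_budget_usecs:
sysctl net.core.netdev_budget
sysctl -w net.core.netdev_budget=600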
Increasing the budget indeed reduced the number of time squeezes. Still, I was puzzled why this happened in the first place, as the box did not receive anything close to 10 Gbit/s. Or did it? All traffic rate metrics have a sampling rate of 1 second. What if there are some exceptionally large bursts in sub-second intervals?
With the packet drops gone, we captured network traffic once more. This time I made an I/O graph for incoming packets with a 1ms interval using Wireshark to get an insight into traffic rates. This is what I got:
Three sharp peaks stand out, and in between lower peaks can be seen. The three peaks show that the packet rate goes up to roughly 700 packets/ms. This is quite high, and perhaps the rate is even higher on some sub-millisecond intervals. The peaks are caused by hundreds of clients sending a small packet to the server at the same time.
Although the average amount of traffic was moderate, there were micro bursts of traffic due to synchronized behavior of the clients. The micro bursts were the cause of the packet drops we saw. They are somewhat like microscopic denial of service attacks.
Conclusion
When this issue started I thought it would be a case of application performance issues. I did not expect to end up in the depths of the kernel. As with all complex issues, there are multiple causes for the packet loss:
- Multiple clients sending data in a synchronized manner cause micro traffic bursts at the server
- The default settings of the net_rx_action kernel task are not sufficient to process these bursts in one go
- The default size of the ring buffer is too small to buffer all packets from these bursts
- The packets making up the bursts are small, so the data rate measured in bytes stays well below the maximum
The art of software development requires one to think in layers of abstraction. Running an application requires layers upon layers upon layers, with the hardware at the bottom. Most of the time we never really think about all those layers, as they just do their job, making themselves invisible to us. When problems arise we search in the top layer, our application. This is where we find most of our problems. But once in a while the problems are in the layers below. To find those you need to be aware of what is beneath your application, and you need expertise from others. I got help from colleagues with expertise in networks and the Linux kernel.
In this case there is also the effect from multiple clients behaving in a synchronized way causing those bursts. Every single client worked as expected, but together they did not.
Without our monitoring systems we would not have known about this problem. Most likely the amount of traffic would have grown over time, possibly making the bursts larger or more frequent up to the point where functionality of the system starts to degrade, i.e. unresponsive applications on the trader’s desktop. Doing this investigation under such conditions would have been much more stressful.
No matter how well you design and build your systems, unforeseen issues will happen. You need monitoring to find these issues before they become harmful.