TCP in the Linux kernel

Networking¶

Lab objectives¶

  • Understanding the Linux kernel networking architecture
  • Acquiring practical IP packet management skills using a packet filter or firewall
  • Getting familiar with how to use sockets at the Linux kernel level

Overview¶

The development of the Internet has led to an exponential increase in network applications and, as a consequence, to ever higher speed and throughput requirements for an operating system's networking subsystem. The networking subsystem is not an essential component of an operating system kernel (the Linux kernel can be compiled without networking support). It is, however, quite unlikely for a computing system (or even an embedded device) to lack networking support, given the need for connectivity. Modern operating systems use the TCP/IP stack: the kernel implements protocols up to the transport layer, while application layer protocols (HTTP, FTP, SSH, etc.) are typically implemented in user space.

Networking in user space¶

In user space, the abstraction of network communication is the socket. The socket abstracts a communication channel and is the interface for interacting with the kernel's TCP/IP stack. An IP socket is associated with an IP address, the transport layer protocol used (TCP, UDP, etc.) and a port. Common function calls that use sockets are: creation (socket), initialization (bind), connecting (connect), waiting for a connection (listen, accept), and closing a socket (close).

Network communication is accomplished via read/write or recv/send calls for TCP sockets and recvfrom/sendto calls for UDP sockets. Transmission and reception are transparent to the application, leaving encapsulation and transmission over the network at the kernel's discretion. However, it is also possible to implement the TCP/IP stack in user space using raw sockets (the PF_PACKET family when creating a socket), or to implement an application layer protocol in the kernel (e.g. the TUX web server).

For more details about user space programming using sockets, see Beej’s Guide to Network Programming Using Internet Sockets.

Linux networking¶

The Linux kernel provides three basic structures for working with network packets: struct socket , struct sock and struct sk_buff .

The first two are abstractions of a socket:

  • struct socket is an abstraction very close to user space, i.e. the BSD sockets used to program network applications;
  • struct sock, or the INET socket in Linux terminology, is the network-layer representation of a socket.

The two structures are related: struct socket contains an INET socket field (sk), and struct sock holds a pointer back to the BSD socket that owns it (sk_socket).

The struct sk_buff structure is the representation of a network packet and its status. The structure is created when the kernel receives a packet, either from user space or from a network interface.

The struct socket structure¶

The struct socket structure is the kernel representation of a BSD socket; the operations that can be executed on it are similar to those offered by the kernel through system calls. Common operations with sockets (creation, initialization/bind, closing, etc.) result in specific system calls; they all work with the struct socket structure.

The struct socket operations are defined in net/socket.c and are independent of the protocol type. The struct socket structure is thus a generic interface over the implementations of particular network operations. Typically, the names of these operations begin with the sock_ prefix.

Operations on the socket structure¶

Socket operations are:

Creation¶

Creation is similar to calling the socket() function in user space, but the struct socket created will be stored in the res parameter:

  • int sock_create(int family, int type, int protocol, struct socket **res) creates a socket after the socket() system call;
  • int sock_create_kern(struct net *net, int family, int type, int protocol, struct socket **res) creates a kernel socket;
  • int sock_create_lite(int family, int type, int protocol, struct socket **res) creates a kernel socket without parameter sanity checks.

The parameters of these calls are as follows:

  • net, where present, is a reference to the network namespace to be used; we will usually initialize it with init_net;
  • family represents the family of protocols used in the transfer of information; their names usually begin with the PF_ (Protocol Family) prefix; the constants for the protocol families are found in linux/socket.h, of which the most commonly used is PF_INET, for the TCP/IP protocols;
  • type is the type of socket; the constants used for this parameter are found in linux/net.h, of which the most used are SOCK_STREAM for connection-based source-to-destination communication and SOCK_DGRAM for connectionless communication;
  • protocol represents the protocol used and is closely related to the type parameter; the constants used for this parameter are found in linux/in.h, of which the most used are IPPROTO_TCP for TCP and IPPROTO_UDP for UDP.

To create a TCP socket in kernel space, you must call:
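(A minimal sketch; in recent kernels sock_create_kern() takes the network namespace as its first parameter, and sock is declared as a struct socket pointer.)

struct socket *sock;
int err;

err = sock_create_kern(&init_net, PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
if (err < 0)
        pr_err("sock_create_kern failed: %d\n", err);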

and for creating UDP sockets:
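(The same call, only with the datagram socket type and the UDP protocol.)

err = sock_create_kern(&init_net, PF_INET, SOCK_DGRAM, IPPROTO_UDP, &sock);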

A usage sample is part of the sys_socket() system call handler:
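In recent kernels the handler body lives in __sys_socket(); a simplified sketch (flag handling and error paths trimmed, not the verbatim source) looks roughly like this:

int __sys_socket(int family, int type, int protocol)
{
        struct socket *sock;
        int retval;

        /* create the struct socket for the requested family/type/protocol */
        retval = sock_create(family, type, protocol, &sock);
        if (retval < 0)
                return retval;

        /* wrap it in a file and return a file descriptor to user space */
        return sock_map_fd(sock, type & (SOCK_CLOEXEC | SOCK_NONBLOCK));
}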

Closing¶

Closing the connection (for connection-oriented sockets) and releasing the associated resources:

  • void sock_release(struct socket *sock) calls the release function in the ops field of the socket structure:
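An abridged sketch of that dispatch (based on net/socket.c; sanity checks and the rest of the cleanup are omitted):

void sock_release(struct socket *sock)
{
        if (sock->ops) {
                struct module *owner = sock->ops->owner;

                sock->ops->release(sock);       /* protocol-specific release */
                sock->ops = NULL;
                module_put(owner);
        }
        /* ... the rest of the cleanup ... */
}
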
Sending/receiving messages¶

The messages are sent/received using the following functions:

  • int sock_recvmsg(struct socket *sock, struct msghdr *msg, int flags);
  • int kernel_recvmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t size, int flags);
  • int sock_sendmsg(struct socket *sock, struct msghdr *msg);
  • int kernel_sendmsg(struct socket *sock, struct msghdr *msg, struct kvec *vec, size_t num, size_t size);

The message sending/receiving functions will in turn call the sendmsg/recvmsg function from the ops field of the socket. The functions prefixed with kernel_ are used when the socket is used from within the kernel.

The parameters are:

  • msg , a struct msghdr structure, containing the message to be sent/received. Among the important components of this structure are msg_name and msg_namelen , which, for UDP sockets, must be filled in with the address to which the message is sent ( struct sockaddr_in );
  • vec, a struct kvec structure, containing a pointer to the data buffer and its size; as can be seen, it is similar to the struct iovec structure (struct iovec corresponds to user space data, while struct kvec corresponds to kernel space data).
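For example, sending a UDP datagram from kernel space could look like the following minimal sketch (the destination address, port and buffer names are illustrative):

struct socket *sock;    /* previously created with sock_create_kern() */
struct sockaddr_in raddr = {
        .sin_family = AF_INET,
        .sin_port   = htons(60001),                     /* example port */
        .sin_addr   = { .s_addr = htonl(INADDR_LOOPBACK) },
};
char buf[] = "hello";
struct kvec vec = { .iov_base = buf, .iov_len = sizeof(buf) };
struct msghdr msg = {
        .msg_name    = &raddr,
        .msg_namelen = sizeof(raddr),
};
int len;

len = kernel_sendmsg(sock, &msg, &vec, 1, vec.iov_len);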

A usage example can be seen in the sys_sendto() system call handler:
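An abridged sketch of that handler (internal helpers shown without error handling or file descriptor reference counting):

struct socket *sock;
struct sockaddr_storage address;
struct msghdr msg = { };
struct iovec iov;
int err;

sock = sockfd_lookup(fd, &err);                 /* fd -> struct socket */
import_single_range(WRITE, buff, len, &iov, &msg.msg_iter);

if (addr) {
        move_addr_to_kernel(addr, addr_len, &address);  /* copy from user space */
        msg.msg_name = (struct sockaddr *)&address;
        msg.msg_namelen = addr_len;
}
msg.msg_flags = flags;

err = sock_sendmsg(sock, &msg);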

The struct socket fields¶

The noteworthy fields are:

  • ops — the structure that stores pointers to protocol-specific functions;
  • sk — The INET socket associated with it.
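For reference, an abridged view of the structure (only the fields relevant here; see include/linux/net.h for the full definition):

struct socket {
        socket_state            state;  /* connection state (SS_CONNECTED, ...) */
        short                   type;   /* SOCK_STREAM, SOCK_DGRAM, ... */
        unsigned long           flags;
        struct file             *file;  /* the file associated with the socket */
        struct sock             *sk;    /* the INET socket */
        const struct proto_ops  *ops;   /* protocol-specific operations */
};
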
The struct proto_ops structure¶

The struct proto_ops structure contains the implementations of the protocol-specific operations (for TCP, UDP, etc.); these functions will be called from the generic functions that operate on struct socket (sock_release(), sock_sendmsg(), etc.).

The struct proto_ops structure therefore contains a number of function pointers for specific protocol implementations:
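An abridged view (prototypes vary slightly between kernel versions; see include/linux/net.h for the full definition):

struct proto_ops {
        int             family;
        struct module   *owner;
        int             (*release)(struct socket *sock);
        int             (*bind)(struct socket *sock, struct sockaddr *myaddr,
                                int sockaddr_len);
        int             (*connect)(struct socket *sock, struct sockaddr *vaddr,
                                   int sockaddr_len, int flags);
        int             (*accept)(struct socket *sock, struct socket *newsock,
                                  int flags, bool kern);
        int             (*listen)(struct socket *sock, int len);
        int             (*sendmsg)(struct socket *sock, struct msghdr *m,
                                   size_t total_len);
        int             (*recvmsg)(struct socket *sock, struct msghdr *m,
                                   size_t total_len, int flags);
        /* ... */
};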

The initialization of the ops field of struct socket is done in the __sock_create() function, by calling the create() function specific to each protocol family; the relevant part of __sock_create() is roughly equivalent to the following:
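(A simplified excerpt: the handler registered for the protocol family is looked up in the net_families array and then asked to initialize the socket, including its ops field.)

const struct net_proto_family *pf;

pf = rcu_dereference(net_families[family]);
err = pf->create(net, sock, protocol, kern);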

This will instantiate the function pointers with the calls specific to the protocol type associated with the socket. The sock_register() and sock_unregister() calls are used to add entries to and remove entries from the net_families vector.

For the rest of the socket operations (other than creation, closing, and sending/receiving a message, described above in the Operations on the socket structure section), the functions exposed through pointers in this structure will be called. For example, for bind, which associates a socket with an address and port on the local machine, we will have the following code sequence:
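(A minimal sketch; the port and address are illustrative. kernel_bind() simply dispatches to sock->ops->bind().)

struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port   = htons(60000),                     /* example port */
        .sin_addr   = { .s_addr = htonl(INADDR_LOOPBACK) },
};
int err;

err = kernel_bind(sock, (struct sockaddr *)&addr, sizeof(addr));
if (err < 0)
        pr_err("kernel_bind failed: %d\n", err);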

As you can see, the address and port information to be associated with the socket is passed by filling in a struct sockaddr_in.

The struct sock structure¶

The struct sock structure describes an INET socket. Such a structure is associated with a user space socket and, implicitly, with a struct socket structure. The structure is used to store information about the state of a connection. Its fields and the associated operations usually begin with the sk_ prefix. Some of the fields are listed below:

  • sk_protocol is the type of protocol used by the socket;
  • sk_type is the socket type ( SOCK_STREAM , SOCK_DGRAM , etc.);
  • sk_socket is the BSD socket that holds it;
  • sk_send_head is the list of struct sk_buff structures for transmission;
  • the function pointers at the end are callbacks for different situations.

Initializing the struct sock and attaching it to a BSD socket is done using the callback created from net_families (called __sock_create() ). Here’s how to initialize the struct sock structure for the IP protocol, in the inet_create() function:
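An abridged sketch (answer_prot is the struct proto chosen from inet_create()'s protocol table; error handling is omitted):

struct sock *sk;

sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);
if (!sk)
        return -ENOBUFS;

sock_init_data(sock, sk);       /* links sk to the BSD socket and initializes
                                   the receive/send queues and callbacks */
sk->sk_protocol = protocol;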


The struct sk_buff structure¶

The struct sk_buff (socket buffer) structure describes a network packet. Its fields contain information about the header and packet contents, the protocols used, the network device used, and pointers to the other struct sk_buff structures in the list. A summary description of the structure's content is presented below:

  • next and prev are pointers to the next and previous elements in the buffer list;
  • dev is the device which sends or receives the buffer;
  • sk is the socket associated with the buffer;
  • destructor is the callback that deallocates the buffer;
  • transport_header , network_header , and mac_header are offsets between the beginning of the packet and the beginning of the various headers in the packets. They are internally maintained by the various processing layers through which the packet passes. To get pointers to the headers, use one of the following functions: tcp_hdr() , udp_hdr() , ip_hdr() , etc. In principle, each protocol provides a function to get a reference to the header of that protocol within a received packet. Keep in mind that the network_header field is not set until the packet reaches the network layer and the transport_header field is not set until the packet reaches the transport layer.

The structure of an IP header ( struct iphdr ) has the following fields:

  • protocol is the transport layer protocol used;
  • saddr is the source IP address;
  • daddr is the destination IP address.

The structure of a TCP header ( struct tcphdr ) has the following fields:

  • source is the source port;
  • dest is the destination port;
  • syn, ack, fin are the TCP flags used; for a more detailed view, see a TCP header diagram.

The structure of a UDP header ( struct udphdr ) has the following fields:

  • source is the source port;
  • dest is the destination port.

An example of accessing the information present in the headers of a network packet is as follows:
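(A minimal sketch; skb is assumed to be a packet that has already reached the network layer, so ip_hdr() and tcp_hdr() return valid pointers.)

struct iphdr *iph = ip_hdr(skb);

if (iph->protocol == IPPROTO_TCP) {
        struct tcphdr *tcph = tcp_hdr(skb);

        pr_info("TCP %pI4:%u -> %pI4:%u\n",
                &iph->saddr, ntohs(tcph->source),
                &iph->daddr, ntohs(tcph->dest));
}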

Conversions¶

In different systems, there are several ways of ordering bytes in a word (Endianness), including: Big Endian (the most significant byte first) and Little Endian (the least significant byte first). Since a network interconnects systems with different platforms, the Internet has imposed a standard sequence for the storage of numerical data, called network byte-order. In contrast, the byte sequence for the representation of numerical data on the host computer is called host byte-order. Data received/sent from/to the network is in the network byte-order format and should be converted between this format and the host byte-order.

For converting we use the following macros:

  • u16 htons(u16 x) converts a 16 bit integer from host byte-order to network byte-order (host to network short);
  • u32 htonl(u32 x) converts a 32 bit integer from host byte-order to network byte-order (host to network long);
  • u16 ntohs(u16 x) converts a 16 bit integer from network byte-order to host byte-order (network to host short);
  • u32 ntohl(u32 x) converts a 32 bit integer from network byte-order to host byte-order (network to host long).

netfilter¶

Netfilter is the name of the kernel interface for capturing network packets in order to modify or analyze them (for filtering, NAT, etc.). The netfilter interface is used in user space by iptables.

In the Linux kernel, packet capture using netfilter is done by attaching hooks. Hooks can be placed at various points along the path a packet follows through the kernel stack, as needed. A diagram of the route followed by a packet, showing the possible hook points, can be found in the netfilter documentation.

The header included when using netfilter is linux/netfilter.h .

A hook is defined through the struct nf_hook_ops structure:
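A minimal sketch of defining and registering a hook (assuming a recent kernel, where the hook function receives a struct nf_hook_state; the names my_hook_fn and my_hook_ops are illustrative):

static unsigned int my_hook_fn(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
        /* inspect or modify the packet here */
        return NF_ACCEPT;       /* let the packet continue through the stack */
}

static struct nf_hook_ops my_hook_ops = {
        .hook           = my_hook_fn,
        .pf             = PF_INET,
        .hooknum        = NF_INET_LOCAL_IN,     /* one of the hook points */
        .priority       = NF_IP_PRI_FIRST,
};

/* in the module init/exit functions: */
nf_register_net_hook(&init_net, &my_hook_ops);
nf_unregister_net_hook(&init_net, &my_hook_ops);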



Tuning the Linux Kernel and TCP Parameters with Sysctl


There are many guides online about Linux kernel and TCP tuning. Below I have summarized the most useful and detailed tips, drawn from the best guides on the subject, to help a Linux server scale and handle more concurrent connections.

This is a more advanced post about Linux TCP and Kernel optimization.

Sysctl.conf Optimization

This is the /etc/sysctl.conf file I use on my servers (Debian 8.7.1):

I included references and personal comments.

# Increase number of max open files
fs.file-max = 150000

# Increase max number of PIDs
kernel.pid_max = 4194303

# Increase range of ports that can be used
net.ipv4.ip_local_port_range = 1024 65535

# https://tweaked.io/guide/kernel/
# Forking servers, like PostgreSQL or Apache, scale to much higher levels of
# concurrent connections if this is made larger
kernel.sched_migration_cost_ns = 5000000

# https://tweaked.io/guide/kernel/
# Various PostgreSQL users have reported (on the postgresql performance mailing list)
# gains of up to 30% on highly concurrent workloads on multi-core systems
kernel.sched_autogroup_enabled = 0

# https://github.com/ton31337/tools/wiki/tcp_slow_start_after_idle---tcp_no_metrics_save-performance
# Avoid falling back to slow start after a connection goes idle
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 0

# https://github.com/ton31337/tools/wiki/Is-net.ipv4.tcp_abort_on_overflow-good-or-not%3F
net.ipv4.tcp_abort_on_overflow = 0

# Enable TCP window scaling (enabled by default)
# https://en.wikipedia.org/wiki/TCP_window_scale_option
net.ipv4.tcp_window_scaling = 1

# Enable fast recycling of TIME_WAIT sockets
# (use with caution according to the kernel documentation!)
net.ipv4.tcp_tw_recycle = 1

# Allow reuse of sockets in TIME_WAIT state for new connections
# only when it is safe from the network stack's perspective
net.ipv4.tcp_tw_reuse = 1

# Turn on SYN-flood protections
net.ipv4.tcp_syncookies = 1

# Only retry creating TCP connections twice
# Minimize the time it takes for a connection attempt to fail
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_orphan_retries = 2

# How many retries TCP makes on data segments (default 15)
# Some guides suggest reducing this value
net.ipv4.tcp_retries2 = 8

# Optimize connection queues
# https://www.linode.com/docs/web-servers/nginx/configure-nginx-for-optimized-performance
# Increase the number of packets that can be queued
net.core.netdev_max_backlog = 3240000

# Max number of "backlogged sockets" (connection requests that can be
# queued for any given listening socket)
net.core.somaxconn = 50000

# Increase max number of sockets allowed in TIME_WAIT
net.ipv4.tcp_max_tw_buckets = 1440000

# Number of SYN requests to keep in the backlog before the kernel starts dropping them
# A sane value is net.ipv4.tcp_max_syn_backlog = 3240000
net.ipv4.tcp_max_syn_backlog = 3240000

# TCP memory tuning
# View the memory TCP actually uses with: cat /proc/net/sockstat
# *** These values are auto-created based on your server specs ***
# *** Edit these parameters with caution because they will use more RAM ***
# Changes suggested by IBM on
# https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations

# Increase the default socket buffer read size (rmem_default) and write size (wmem_default)
# *** Maybe recommended only for high-RAM servers? ***
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216

# Increase the max socket buffer size (optmem_max), max socket buffer read size (rmem_max),
# and max socket buffer write size (wmem_max)
# 16 MB per socket, which sounds like a lot, but will virtually never be fully consumed
# rmem_max overrides tcp_rmem, wmem_max overrides tcp_wmem, optmem_max overrides tcp_mem
net.core.optmem_max = 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Configure the min, pressure, max values (units are pages)
# Useful mostly for very high-traffic websites that have a lot of RAM
# Consider that we already set the *_max values to 16777216,
# so you may eventually comment out these three lines
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216

# Keepalive optimizations
# By default, the keepalive routines wait for two hours (7200 secs) before sending the
# first keepalive probe, and then resend it every 75 seconds. If no ACK response is
# received 9 consecutive times, the connection is marked as broken.
# The default values are: tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75,
# tcp_keepalive_probes = 9
# We decrease the default values for the tcp_keepalive_* parameters as follows:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9

# The TCP FIN timeout specifies the amount of time a port must be inactive before it
# can be reused for another connection. The default is often 60 seconds, but it can
# normally be safely reduced to 30 or even 15 seconds.
# https://www.linode.com/docs/web-servers/nginx/configure-nginx-for-optimized-performance
net.ipv4.tcp_fin_timeout = 7

Читайте также:  Вредные обновления для windows

The following modifications caused many 500 errors, so I removed them:

# Disable TCP SACK (TCP Selective Acknowledgement), DSACK (duplicate TCP SACK),
# and FACK (Forward Acknowledgement)
# SACK requires enabling tcp_timestamps and adds some packet overhead
# Only advised in cases of packet loss on the network
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fack = 0

# Disable TCP timestamps
# Can have a performance overhead and is only advised in cases where sack is
# needed (see tcp_sack)
net.ipv4.tcp_timestamps = 0

Type “sysctl -p” to apply the sysctl changes (I also reboot the server).

Reduce Disk I/O Requests

Another optimization I have made on my servers is to mount the /webserver partition with "noatime" to disable file access time updates and reduce disk I/O. Just edit /etc/fstab and add "noatime" to the partition holding the web server data (vhosts, database, etc.):

UUID=[...] /webserver ext4 defaults,noexec,nodev,nosuid,noatime 0 2

For the changes to take effect reboot the server or remount the partition:

mount -o remount /webserver

Use "mount" to verify that /webserver has been remounted with the "noatime" attribute.

You may also disable access time on the / partition and on other partitions.

Disable Nginx Access Log

Reduce disk I/O by disabling the web server access logs:
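For nginx, for example, the access log can typically be switched off with the access_log directive (a minimal sketch; place it in the http, server or location block of your configuration):

# inside the http, server or location block of the nginx configuration
access_log off;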

Concurrent Connections Test

[Screenshot: concurrent connections handled with the above changes]
