Linux Kernel Networking (network_overview) by Rami Rosen
Contents
Introduction
Hierarchy of networking layers
Networking Data Structures
SK_BUFF
net_device
Routing Subsystem
Routing Tables
Routing Cache
Creating a Routing Cache Entry
Policy Routing (multiple tables)
Policy Routing: add/delete a rule example
Routing Table lookup algorithm
Receiving a packet
Forwarding
Sending a Packet
Multipath routing
Netfilter rule example
ICMP redirect message
Neighboring Subsystem
Bridging Subsystem
IPSec
Example: Host to Host VPN (using openswan)
Managing multiple queues: affinity and other issues
Fragmentation:


This wiki page gives a broad overview of Linux kernel networking, going deep into design and implementation details.

It is based on my practical experience with Linux kernel networking and on a series of lectures I gave at the Technion.
See:
Rami Rosen lectures

Please feel free to send any feedback or questions to Rami Rosen by email: ramirose@gmail.com

I will try hard to answer each and every question (though sometimes it takes time).

Introduction

● Understanding a packet walkthrough in the kernel is a key to understanding kernel networking. Understanding it is a must if we want to understand Netfilter or IPSec internals, and more.

● This doc concentrates on this walkthrough (design and implementation details).

● The Linux networking kernel code (including network device drivers) is a large part of the Linux kernel code.

Hierarchy of networking layers

● The layers that we will deal with (based on the 7-layer model) are:

– Link Layer (L2) (ethernet)

– Network Layer (L3) (ipv4, ipv6)

– Transport Layer (L4) (udp, tcp, ...)

Networking Data Structures

● The two most important structures of the Linux kernel network layer are:

– sk_buff (defined in include/linux/skbuff.h)

– net_device (defined in include/linux/netdevice.h)

● It is better to know a bit about them before delving into the walkthrough code.

SK_BUFF

All network-related queues and buffers in the kernel use a common data structure, struct sk_buff. This is a large struct containing all the control information required for the packet (datagram, cell, whatever). The sk_buff elements are organized as a doubly linked list, in such a way that it is very efficient to move an sk_buff element from the beginning/end of a list to the beginning/end of another list. A queue is defined by struct sk_buff_head, which includes a head and a tail pointer to sk_buff elements.

All the queuing structures include an sk_buff_head representing the queue. For instance, struct sock includes a receive and send queue. Functions to manage the queues (skb_queue_head(), skb_queue_tail(), skb_dequeue(), skb_dequeue_tail()) operate on an sk_buff_head. In reality, however, the sk_buff_head is included in the doubly linked list of sk_buffs (so it actually forms a ring).
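The ring layout described above can be sketched in user space. This is a minimal illustration, not the kernel code: `struct node`, `struct buf`, `struct buf_head` and the `demo_*` functions are hypothetical stand-ins for `struct sk_buff`, `struct sk_buff_head` and the `skb_queue_*()` helpers, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Link fields shared by queue heads and buffers (hypothetical names). */
struct node {
    struct node *next;
    struct node *prev;
};

/* Stand-in for struct sk_buff: the link must be the first member. */
struct buf {
    struct node link;
    int id;                      /* placeholder for packet state */
};

/* Stand-in for struct sk_buff_head: the head is itself part of the ring. */
struct buf_head {
    struct node link;
    unsigned int qlen;
};

void demo_queue_init(struct buf_head *q)
{
    assert(q != NULL);
    q->link.next = q->link.prev = &q->link;   /* empty ring points at itself */
    q->qlen = 0;
}

/* Insert n between two neighbouring ring nodes. */
static void demo_insert(struct node *n, struct node *prev, struct node *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
}

void demo_queue_head(struct buf_head *q, struct buf *b)
{
    demo_insert(&b->link, &q->link, q->link.next);
    q->qlen++;
}

void demo_queue_tail(struct buf_head *q, struct buf *b)
{
    demo_insert(&b->link, q->link.prev, &q->link);
    q->qlen++;
}

/* Remove and return the first buffer, or NULL if the ring is empty. */
struct buf *demo_dequeue(struct buf_head *q)
{
    struct node *n = q->link.next;

    if (n == &q->link)
        return NULL;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL;
    q->qlen--;
    return (struct buf *)n;      /* valid: link is the first member */
}
```

Because the head is a node in the ring, inserting at either end is the same constant-time pointer update, with no special case for an empty queue.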

When an sk_buff is allocated, its data space is also allocated from kernel memory. sk_buff allocation is done with alloc_skb() or dev_alloc_skb(); drivers use dev_alloc_skb(). (An sk_buff is freed with kfree_skb() or dev_kfree_skb().) However, sk_buff provides an additional management layer. The data space is divided into a head area and a data area. This allows kernel functions to reserve space for the header, so that the data doesn’t need to be copied around. Typically, therefore, after allocating an sk_buff, header space is reserved using skb_reserve(). skb_pull(skb, len) removes data from the start of a buffer (skipping over an existing header) by advancing data to data+len and decreasing len by the same amount.

We also handle alignment when allocating an sk_buff:

– When allocating an sk_buff with netdev_alloc_skb(), we eventually call __alloc_skb(), and in fact there are two allocations here:

– The first is the sk_buff itself (struct sk_buff *skb). This is done by:
...
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
...
See __alloc_skb() in net/core/skbuff.c.

– The second is the data:
...
size = SKB_DATA_ALIGN(size);
data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
gfp_mask, node);
...
See also __alloc_skb() in net/core/skbuff.c.
The data is for the packet headers (layer 2, layer 3, layer 4) and the packet payload.

Now, the data pointer is not fixed; we advance or decrease it as we move from
layer to layer. The head pointer is fixed.

The allocation of data above forces alignment.

Now, when the network driver calls netdev_alloc_skb(),
data points to the Ethernet header. The IP header follows immediately
after the Ethernet header. Since the Ethernet header is 14 bytes long, this means that,
assuming kmalloc_node_track_caller() returned a 16-byte aligned
address, as mentioned above, the IP header will **not** be 16-byte aligned
(it starts at data+14). In order
to align it, we should advance data by 2 bytes before putting the
Ethernet header there. This is done by skb_reserve(skb, NET_IP_ALIGN);
NET_IP_ALIGN is 2, and what skb_reserve() does is increment data by 2 bytes
(let’s ignore the increment of tail; it is not important to this discussion).
The IP header then starts at offset 2+14 = 16 from the start of the buffer, so it is 16-byte aligned.
See netdev_alloc_skb_ip_align() in include/linux/skbuff.h.
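The pointer arithmetic behind skb_reserve(), skb_put() and skb_pull(), and the effect of reserving NET_IP_ALIGN bytes, can be sketched in user space. This is only an illustration under simplifying assumptions: `struct demo_skb` and all `demo_*` names are hypothetical and model just the head/data/tail/end pointers and len, not the real struct sk_buff.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NET_IP_ALIGN 2    /* mirrors the NET_IP_ALIGN value given above */
#define DEMO_ETH_HLEN     14   /* size of the Ethernet header */

/* Hypothetical, simplified view of the sk_buff buffer pointers. */
struct demo_skb {
    unsigned char *head;   /* start of the allocated buffer (fixed) */
    unsigned char *data;   /* start of the current layer's data */
    unsigned char *tail;   /* end of the data */
    unsigned char *end;    /* end of the allocated buffer */
    unsigned int len;      /* bytes between data and tail */
};

void demo_skb_init(struct demo_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Reserve headroom; only meaningful while the buffer is still empty. */
void demo_skb_reserve(struct demo_skb *skb, unsigned int len)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Extend the data area at the tail; returns where new bytes should go. */
unsigned char *demo_skb_put(struct demo_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= len);
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Skip len bytes at the front, e.g. a header that was already parsed. */
unsigned char *demo_skb_pull(struct demo_skb *skb, unsigned int len)
{
    assert(len <= skb->len);
    skb->len -= len;
    skb->data += len;
    return skb->data;
}

/* Distance of data from the (aligned) start of the buffer. */
size_t demo_skb_headroom(const struct demo_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}
```

With a buffer whose start is 16-byte aligned, reserving 2 bytes, adding a 14-byte Ethernet header plus an IP header, and then pulling the Ethernet header leaves data at offset 16, which is the alignment argument made in the text.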

struct sk_buff has fields to point to the specific network layer headers:

transport_header (previously called h) – for layer 4, the transport layer (can include tcp header or udp header or icmp header, and more)
network_header – (previously called nh) for layer 3, the network layer (can include ip header or ipv6 header or arp header).
mac_header – (previously called mac) for layer 2, the link layer.
skb_network_header(skb), skb_transport_header(skb) and skb_mac_header(skb) return pointer to the header.
The rxhash of the skb is calculated in the receive path, in get_rps_cpu(),
which is invoked from both netif_receive_skb() and netif_rx().
The hash is calculated from the source and destination addresses of the
IP header and the ports from the transport header.

The struct sk_buff objects themselves are private for every network layer. When a packet is passed from one layer to another, the struct sk_buff is cloned. However, the data itself is not copied in that case. Note that struct sk_buff is quite large, but most of its members are unused in most situations. The copy overhead when cloning is therefore limited.

Almost always sk_buff instances appear as “skb” in the kernel code.
struct dst_entry *dst – the route for this sk_buff; this route is determined by the routing subsystem.
- It has 2 important function pointers:
  - int (*input)(struct sk_buff*);
  - int (*output)(struct sk_buff*);
- input() can be assigned to one of the following : ip_local_deliver, ip_forward, ip_mr_input, ip_error or dst_discard_in.
- output() can be assigned to one of the following :ip_output, ip_mc_output, ip_rt_bug, or dst_discard_out.
- we will deal more with dst when talking about routing.
- In the usual case, there is only one dst_entry for every skb.
- When using IPsec, there is a linked list of dst_entries and only the last one is for routing; all the other dst_entries are for IPsec transformers. These other dst_entries have the DST_NOHASH flag set. Entries with the DST_NOHASH flag set are not kept in the routing cache; they are kept in the flow cache instead.
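The role of the input() and output() function pointers can be illustrated with a small user-space sketch. All names here (`struct demo_dst`, `struct demo_skb`, the `demo_*` handlers) are hypothetical; the handlers only record that they ran, whereas the real ones (ip_local_deliver, ip_forward, ip_output) process the packet.

```c
#include <assert.h>
#include <stddef.h>

struct demo_skb;

/* Hypothetical miniature of struct dst_entry: only the two hooks. */
struct demo_dst {
    int (*input)(struct demo_skb *skb);
    int (*output)(struct demo_skb *skb);
};

struct demo_skb {
    struct demo_dst *dst;   /* the route chosen for this packet */
    int trace;              /* records which handler ran (demo only) */
};

enum { DEMO_RAN_LOCAL_DELIVER = 1, DEMO_RAN_FORWARD = 2, DEMO_RAN_OUTPUT = 3 };

int demo_local_deliver(struct demo_skb *skb)
{
    skb->trace = DEMO_RAN_LOCAL_DELIVER;
    return 0;
}

int demo_forward(struct demo_skb *skb)
{
    skb->trace = DEMO_RAN_FORWARD;
    return 0;
}

int demo_output(struct demo_skb *skb)
{
    skb->trace = DEMO_RAN_OUTPUT;
    return 0;
}

/* Wrappers that dispatch through whatever the routing decision installed. */
int demo_dst_input(struct demo_skb *skb)
{
    assert(skb->dst != NULL && skb->dst->input != NULL);
    return skb->dst->input(skb);
}

int demo_dst_output(struct demo_skb *skb)
{
    assert(skb->dst != NULL && skb->dst->output != NULL);
    return skb->dst->output(skb);
}
```

The point of the indirection is that the code receiving the packet does not need to know whether it will be delivered locally or forwarded; the routing lookup decides once, by choosing which function to store in the route.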
tstamp (of type ktime_t): the time stamp of receiving the packet.
- net_enable_timestamp() must be called in order to get values.
users: a reference count. Initialized to 1. Increased in the RX path for each protocol handler in deliver_skb().
The method skb_shared() returns true if users > 1.

priority

skb->priority, in the TX path, is set from the socket priority (sk->sk_priority);

See, for example, the ip_queue_xmit() method in net/ipv4/ip_output.c.

You can set the sk_priority of a socket with setsockopt(); for example:

setsockopt(s, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio))

When we are forwarding the packet, there is no socket attached to the skb.

Therefore, in ip_forward(), we set skb->priority according to a special table
called ip_tos2prio; this table has 16 entries; see include/net/route.h.

And we have:
int ip_forward(struct sk_buff *skb)
{
...
}

pkt_type:
The packet type is determined in the eth_type_trans() method.
eth_type_trans() gets an skb and a net_device as parameters (see net/ethernet/eth.c).
The packet type depends on the destination MAC address in the Ethernet header:
It is PACKET_BROADCAST for broadcast.
It is PACKET_MULTICAST for multicast.
It is PACKET_HOST if the destination MAC address is the MAC address of the device which was passed as a parameter.
It is PACKET_OTHERHOST if none of these conditions is met.
(There is another type for outgoing packets, PACKET_OUTGOING; see dev_queue_xmit_nit().)
Notice that eth_type_trans() is unique to Ethernet; for FDDI, for example, we have fddi_type_trans() (see net/802/fddi.c).
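The classification rules listed above can be written as a short user-space function. This is a sketch, not the code of eth_type_trans(): the enum and function names are hypothetical, and the only facts relied on are that the broadcast address is all 0xff bytes and that a multicast address has the lowest bit of its first byte set.

```c
#include <assert.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

enum demo_pkt_type {
    DEMO_PACKET_HOST,
    DEMO_PACKET_BROADCAST,
    DEMO_PACKET_MULTICAST,
    DEMO_PACKET_OTHERHOST
};

/* Classify a frame by its destination MAC, relative to the receiving device. */
enum demo_pkt_type demo_classify(const unsigned char *dest,
                                 const unsigned char *dev_addr)
{
    static const unsigned char bcast[DEMO_ETH_ALEN] =
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

    assert(dest != NULL && dev_addr != NULL);

    if (dest[0] & 0x01) {                 /* group bit: multicast or broadcast */
        if (memcmp(dest, bcast, DEMO_ETH_ALEN) == 0)
            return DEMO_PACKET_BROADCAST;
        return DEMO_PACKET_MULTICAST;
    }
    if (memcmp(dest, dev_addr, DEMO_ETH_ALEN) == 0)
        return DEMO_PACKET_HOST;          /* addressed to this device */
    return DEMO_PACKET_OTHERHOST;         /* seen only in promiscuous mode */
}
```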

net_device

● struct net_device represents a network interface card.

Important members of struct net_device:

● unsigned int mtu – Maximum Transmission Unit: the maximum size of frame the device can handle.

● Each protocol has mtu of its own; the default is 1500 for Ethernet.

● You can change the mtu with ifconfig, with ip, or via sysfs; for example:

– ifconfig eth0 mtu 1400

– ip link set eth0 mtu 1400

– echo 1400 > /sys/class/net/eth0/mtu

● You can show the mtu of interface eth0 by:

– ifconfig eth0

– ip link show

– cat /sys/class/net/eth0/mtu

● You cannot, of course, change it to values higher than 1500 on a 10Mb/s network:

– ifconfig eth0 mtu 1501 will give:

– SIOCSIFMTU: Invalid argument

● unsigned int flags (which you see or set from user space using ifconfig utility):

for example, RUNNING or NOARP.

unsigned int priv_flags
- You cannot see these flags from user space with ifconfig or other utilities.
- For example, IFF_EBRIDGE for a bridge interface.
  - This flag is set in br_dev_setup() in net/bridge/br_device.c
- or IFF_BONDING
  - This flag is set in the bond_setup() method.
  - This flag is also set in the bond_enslave() method.
  - Both methods are in drivers/net/bonding/bond_main.c.
- or IFF_802_1Q_VLAN
  - This flag is set in vlan_setup() in net/8021q/vlan_dev.c
- or IFF_TX_SKB_SHARING
  - In ieee80211_if_setup(), net/mac80211/iface.c, we have:
    - dev->priv_flags &=
      ● unsigned char dev_addr[MAX_ADDR_LEN] : the MAC address of the device (6 bytes).
      - type is the hw type of the device.
        
        For ethernet it is ARPHRD_ETHER
        
        In ethernet, the device type ARPHRD_ETHER is assigned in ether_setup(). see: net/ethernet/eth.c
        
        For ppp, the device type ARPHRD_PPP is assigned in ppp_setup() in drivers/net/ppp/ppp_generic.c.
      By default, the MAC address is permanent (NET_ADDR_PERM). In case the MAC address was generated with a helper method called eth_hw_addr_random(), the type of the MAC address is NET_ADDR_RANDOM. There is also a type called NET_ADDR_STOLEN, which is not used. The type of the MAC address is stored in the addr_assign_type member of the net_device.
      
      ● int promiscuity: a counter of the times a NIC is told to work in promiscuous mode; it is used to enable more than one sniffing client; it is also used in the bridging subsystem, when adding a bridge interface; see the call to dev_set_promiscuity() in br_add_if().
      
      ● nd_net: The network namespace this network device is inside
      
      ● struct net_device *master.
      
      This is used in bonding driver for example.
      
      see more info in Documentation/networking/netdev-features.txt
      - hw_features should be set only in the ndo_init callback and not changed later.
      For the loopback device and the ppp device, we set NETIF_F_NETNS_LOCAL in features.
      
      ● You are likely to encounter macros starting with IN_DEV, like IN_DEV_FORWARD() or IN_DEV_RX_REDIRECTS(). How are they related to net_device? How are these macros implemented?
      
      ● void *ip_ptr: IPv4 specific data. This pointer is assigned to a pointer to in_device in inetdev_init() (net/ipv4/devinet.c)
      
      ● struct in_device has a member named cnf (an instance of ipv4_devconf). Setting /proc/sys/net/ipv4/conf/all/forwarding eventually sets the forwarding member of in_device to 1. The same is true for accept_redirects and send_redirects; both are also members of cnf (ipv4_devconf).
      
      ● In most distros, /proc/sys/net/ipv4/conf/all/forwarding=0
      
      ● But probably this is not so on your ADSL router.
      
      ● There are cases when we work with virtual devices.
      
      – For example, bonding (setting the same IP for two or more NICs, for load balancing and for high availability.)
      
      – Many times this is implemented using the private data of the device (the void *priv member of net_device);
      
      – In OpenSolaris there is a special pseudo driver called “vnic” which enables bandwidth allocation (project CrossBow).
      - struct net_device_ops has methods for network device management:
        
        ndo_set_rx_mode() is used to initialize multicast addresses (in the past this was done by the set_multicast_list() method, which is now deprecated).
        
        ndo_change_mtu() is for setting the mtu.
        
        Recently, three methods were added to support bridge operations (by John Fastabend):
        
        ndo_fdb_add()
        
        ndo_fdb_del()
        
        ndo_fdb_dump()
        
        The Intel ixgbe driver uses these methods.
        
        See drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
        
        Also, a new command which uses these methods is to be added to the iproute2 package; this command is called “bridge”.
        
        see http://patchwork.ozlabs.org/patch/117664/
      Network interface drivers
      
      ● Most of the nics are PCI devices; there are also some USB network devices.
      
      ● The drivers for network PCI devices use the generic PCI calls, like pci_register_driver() and pci_enable_device().
      
      ● For more info on NIC drivers, see the article “Writing Network Device Driver for Linux” (link no. 9 in links) and chap17 in ldd3.
      
      ● There are two modes in which a NIC can receive a packet.
      
      – The traditional way is interrupt driven
      
      each received packet is an asynchronous event which causes an interrupt.
      
      ● NAPI (new API).
      
      – The NIC works in polling mode.
      
      – In order for the NIC to work in polling mode, the driver should be built with a proper flag.
      
      – Most of the new drivers support this feature.
      
      – When working with NAPI under a very high load, packets are lost, but this occurs before they are fed into the network stack (in a non-NAPI driver they pass into the stack).
      
      – In OpenSolaris, polling is built into the kernel (no need to build drivers in any special way).
      
      User Space Tools
      
      ● iputils (including ping, arping, tracepath, tracepath6, ifenslave and more)
      
      ● net tools (ifconfig, netstat, route, arp and more)
      
      ● IPROUTE2 (ip command with many options)
      
      – Uses rtnetlink API.
      
      – Has much wider functionality than the net tools; for example, you can create tunnels with the “ip” command.
      
      – Note: no need for the “-n” flag when using IPROUTE2 (because it does not work with DNS).
      
      Routing Subsystem
      
      ● The routing table and the routing cache enable us to find the net device and the address of the host to which a packet will be sent.
      
      ● Reading entries in the routing table is done by calling fib_lookup(const struct flowi *flp, struct fib_result *res)
      
      ● FIB is the “Forwarding Information Base”.
      
      ● There are two routing tables by default (non Policy Routing case):
      
      – local FIB table (ip_fib_local_table; ID 255).
      
      – main FIB table (ip_fib_main_table; ID 254).
      
      – See: include/net/ip_fib.h.
      
      ● Routes can be added into the main routing table in one of 3 ways:
      
      – By sys admin command (route add/ip route).
      
      – By routing daemons.
      
      – As a result of ICMP (REDIRECT).
      
      ● A routing table is implemented by struct fib_table.
      
      Routing Tables
      
      ● fib_lookup() first searches the local FIB table (ip_fib_local_table).
      
      ● In case it does not find an entry, it looks in the main FIB table (ip_fib_main_table).
      
      ● Why is it in this order ?
      
      ● There is one routing cache, regardless of how many routing tables there are.
      
      ● You can see the routing cache by running “route -C”.
      
      ● Alternatively, you can see it with “cat /proc/net/rt_cache”. – con: this way, the addresses are in hex format.
      
      Routing Cache
      
      ● The routing cache is built of rtable elements:
      
      ● The dst_entry is the protocol independent part. – Thus, for example, we have a dst_entry member (also called dst) in rt6_info in ipv6. ( include/net/ip6_fib.h)
      
      ● The key for a lookup operation in the routing cache is an IP address (whereas in the routing table the key is a subnet).
      
      ● Inserting elements into the routing cache by : rt_intern_hash()
      
      ● There is an alternate mechanism for route cache lookup, called fib_trie, which is inside the kernel tree (net/ipv4/fib_trie.c)
      
      ● It is based on extending the lookup key.
      
      ● You should set: CONFIG_IP_FIB_TRIE (=y) – (instead of CONFIG_IP_FIB_HASH)
      
      ● By Robert Olsson et al (see links).
      
      – TRASH (trie + hash)
      
      – Active Garbage Collection
      
      ● You can flush the routing cache by: ip route flush cache
      
      depends on your machine.
      
      ● You can show the routing cache by: ip route show cache
      
      Creating a Routing Cache Entry
      
      ● Allocation of rtable instance (rth) is done by: dst_alloc(). – dst_alloc() in fact creates and returns a pointer to dst_entry and we cast it to rtable (net/core/dst.c).
      
      ● Setting input and output methods of dst: – (rth->u.dst.input and rth->u.dst.output )
      
      ● Setting the flowi member of dst (rth->fl) – The next time there is a lookup in the cache, for example in ip_route_input(), we will compare against rth->fl.
      
      ● A garbage collection call deletes eligible entries from the routing cache.
      
      ● Which entries are not eligible ?
      
      Policy Routing (multiple tables)
      
      ● Generic routing uses destination address based decisions.
      
      ● There are cases when the destination address is not the sole parameter to decide which route to give; Policy Routing comes to
      
      ● Adding a routing table: by adding a line to /etc/iproute2/rt_tables.
      
      – For example: add the line “252 my_rt_table”.
      
      – There can be up to 255 routing tables.
      
      ● Policy routing should be enabled when building the kernel (CONFIG_IP_MULTIPLE_TABLES should be set.)
      
      ● Example of adding a route in this table:
      
      ● > ip route add default via 192.168.0.1 table my_rt_table
      
      ● Show the table by: – ip route show table my_rt_table
      
      ● You can add a rule to the routing policy database (RPDB) by “ip rule add ...” – The rule can be based on input interface, TOS, fwmark (from netfilter).
      
      ● ip rule list – show all rules.
      
      Policy Routing: add/delete a rule example
      
      ● ip rule add tos 0x04 table 252
      
      – This will cause packets with tos=0x04 (in the iphdr) to be routed by looking into the table we added (252).
      
      – So the default gw for this type of packet will be 192.168.0.1.
      
      – ip rule show will give:
      
      – 32765: from all tos reliability lookup my_rt_table
      
      – ...
      
      ● Delete a rule : ip rule del tos 0x04 table 252
      
      ● Breaking the fib_table into multiple data structures gives flexibility and enables a fine-grained, high level of sharing. – Suppose that we have 10 routes to 10 different networks which all have the same next hop gw. – We can have one fib_info which will be shared by 10 fib_aliases. – fz_divisor is the number of buckets.
      
      ● Each fib_node element represents a unique subnet. – The fn_key member of fib_node is the subnet (32 bit).
      
      ● In the usual case there is one fib_nh (Next Hop). – If the route was configured by using a multipath route, there can be more than one fib_nh.
      
      ● Suppose that a device goes down or is enabled.
      
      ● We need to disable/enable all routes which use this device.
      
      ● But how can we know which routes use this device ?
      
      ● In order to know it efficiently, there is the fib_info_devhash table.
      
      ● This table is indexed by the device identifier.
      
      ● See fib_sync_down() and fib_sync_up() in net/ipv4/fib_semantics.c
      
      Routing Table lookup algorithm
      
      ● LPM (Longest Prefix Match) is the lookup algorithm.
      
      ● The route with the longest netmask is the one chosen.
      
      ● Netmask 0, which is the shortest netmask, is for the default gateway.
      
      – What happens when there are multiple entries with netmask=0?
      
      – fib_lookup() returns the first entry it finds in the fib table where the netmask length is 0.
      
      ● It may be that this is not the best choice of default gateway.
      
      ● So in case the netmask is 0 (the prefixlen of the fib_result returned from fib_lookup() is 0), we call fib_select_default().
      
      ● fib_select_default() will select the route with the lowest priority (metric) (by comparing to fib_priority values of all default gateways).
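The lookup rule described in this section, longest prefix first and lowest metric among equally long matches such as several default gateways, can be sketched as a linear scan. This is not how the kernel stores or searches routes (it uses hash tables or a trie); `struct demo_route`, `DEMO_IP`, `demo_netmask` and `demo_lpm` are hypothetical names, and applying the metric tie-break to every prefix length, not only to length 0, is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat route entry; addresses are in host byte order. */
struct demo_route {
    uint32_t prefix;     /* network address */
    int prefixlen;       /* 0..32; 0 is a default route */
    int metric;          /* lower is preferred */
};

#define DEMO_IP(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

uint32_t demo_netmask(int prefixlen)
{
    assert(prefixlen >= 0 && prefixlen <= 32);
    /* Shifting a 32-bit value by 32 is undefined, so handle 0 separately. */
    return prefixlen == 0 ? 0 : ~(uint32_t)0 << (32 - prefixlen);
}

/* Return the best matching route for dst, or NULL if nothing matches. */
const struct demo_route *demo_lpm(const struct demo_route *tbl, size_t n,
                                  uint32_t dst)
{
    const struct demo_route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct demo_route *r = &tbl[i];

        if ((dst & demo_netmask(r->prefixlen)) != r->prefix)
            continue;                       /* dst is not inside this subnet */
        if (best == NULL ||
            r->prefixlen > best->prefixlen ||
            (r->prefixlen == best->prefixlen && r->metric < best->metric))
            best = r;
    }
    return best;
}
```

With two default routes of metric 10 and 5 in the table, a destination that matches no specific subnet resolves to the metric 5 entry, whichever order the two entries appear in.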
      
      Receiving a packet
      
      ● When working in interrupt driven model, the nic registers an interrupt handler with the IRQ with which the device works by calling request_irq().
      
      ● This interrupt handler will be called when a frame is received
      
      ● The same interrupt handler will be called when transmission of a frame is finished and under other conditions. (depends on the NIC; sometimes, the interrupt handler will be called when there is some error).
      
      ● Typically in the handler, we allocate sk_buff by calling dev_alloc_skb() ; also eth_type_trans() is called; among other things it advances the data pointer of the sk_buff to point to the IP header ; this is done by calling skb_pull(skb, ETH_HLEN).
      
      ● See : net/ethernet/eth.c – ETH_HLEN is 14, the size of ethernet header.
      
      ● The handler for receiving an IPV4 packet is ip_rcv(). (net/ipv4/ip_input.c)
      
      ● The handler for receiving an IPV6 packet is ipv6_rcv() (net/ipv6/ip6_input.c)
      
      ● Handler for the protocols are registered at init phase.
      
      – Likewise, arp_rcv() is the handler for ARP packets.
      
      ● First, ip_rcv() performs some sanity checks. For example: if (iph->ihl < 5 || iph->version != 4) goto inhdr_error; – iph is the ip header; iph->ihl is the ip header length in 32-bit words (4 bits). – The ip header must be at least 20 bytes. – It can be up to 60 bytes (when we use ip options).
      
      ● Then it calls ip_rcv_finish(), by: NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
      
      ● This division of methods into two stages (where the second has the same name with the suffix finish or slow) is typical for networking kernel code.
      
      ● In many cases the second method has a “slow” suffix instead of “finish”; this usually happens when the first method looks in some cache and the second method performs a lookup in a table, which is slower.
      
      ● ip_rcv_finish() implementation: if (skb->dst == NULL) { int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, skb->dev); ... } ... return dst_input(skb);
      
      ● ip_route_input(): First performs a lookup in the routing cache to see if there is a match. If there is no match (cache miss), calls ip_route_input_slow() to perform a lookup in the routing table. (This lookup is done by calling fib_lookup()).
      
      ● fib_lookup(const struct flowi *flp, struct fib_result *res) The results are kept in fib_result.
      
      ● ip_route_input() returns 0 upon successful lookup. (also when there is a cache miss but a successful lookup in the routing table.)
      
      According to the results of fib_lookup(), we know if the frame is for local delivery or for forwarding or to be dropped.
      
      ● If the frame is for local delivery, we will set the input() function pointer of the route to ip_local_deliver(): rth->u.dst.input = ip_local_deliver;
      
      ● If the frame is to be forwarded, we will set the input() function pointer to ip_forward(): rth->u.dst.input = ip_forward;
      
      Local Delivery
      
      ● Prototype: ip_local_deliver(struct sk_buff *skb) (net/ipv4/ip_input.c). Calls NF_HOOK(PF_INET, NF_IP_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish);
      
      ● Delivers the packet to the higher protocol layers according to its type.
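The header sanity checks that ip_rcv() performs at the start of the receive path can be illustrated in user space on a raw byte buffer. This is a sketch under stated assumptions: the function names are hypothetical, the version and ihl tests follow the check quoted in the text, and the length and checksum tests are added here as the standard IPv4 header rules rather than taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement checksum over len bytes (len is expected to be even). */
uint16_t demo_ip_checksum(const unsigned char *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                       /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Return 1 if the buffer starts with a plausible IPv4 header, else 0. */
int demo_ip_header_ok(const unsigned char *pkt, size_t pkt_len)
{
    unsigned int version, ihl;

    assert(pkt != NULL);
    if (pkt_len < 20)                       /* minimum IPv4 header size */
        return 0;

    version = pkt[0] >> 4;
    ihl = pkt[0] & 0x0f;                    /* header length in 32-bit words */
    if (ihl < 5 || version != 4)
        return 0;
    if (pkt_len < (size_t)ihl * 4)          /* options must fit in the buffer */
        return 0;

    /* Summing a valid header together with its checksum field yields 0. */
    return demo_ip_checksum(pkt, (size_t)ihl * 4) == 0;
}
```

Since ihl is a 4-bit field counting 32-bit words, its valid range of 5 to 15 corresponds to the 20 to 60 byte header sizes mentioned above.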
      
      Forwarding
      
      ● Prototype: – int ip_forward(struct sk_buff *skb) (net/ipv4/ip_forward.c)
      
      – Decreases the ttl in the ip header. If the ttl is <= 1, it sends an ICMP message (ICMP_TIME_EXCEEDED) and drops the packet.
      
      – Calls NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev, ip_forward_finish);
      
      ● ip_forward_finish(): sends the packet out by calling dst_output(skb).
      
      ● dst_output(skb) is just a wrapper, which calls skb->dst->output(skb). (see include/net/dst.h)
      
      You can see the number of forwarded packets by “netstat -s | grep forwarded”
      
      or by cat /proc/net/snmp (IPv4) and cat /proc/net/snmp6 (IPv6), looking at the ForwDatagrams column (IPv4) or Ip6OutForwDatagrams (IPv6).
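Decreasing the TTL changes the IP header, so the header checksum has to change with it. The sketch below shows one way to do both on a raw header in user space, using the incremental update from RFC 1141 rather than a full recomputation. It is an illustration only: `demo_forward_ttl` and `demo_csum16` are hypothetical names, and returning -1 stands in for the drop and ICMP_TIME_EXCEEDED path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's complement checksum over len bytes (len is expected to be even). */
uint16_t demo_csum16(const unsigned char *hdr, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Decrement the TTL (byte 8) and patch the checksum (bytes 10-11).
 * Returns the new TTL, or -1 if the packet must not be forwarded.
 */
int demo_forward_ttl(unsigned char *hdr)
{
    uint32_t check;

    assert(hdr != NULL);
    if (hdr[8] <= 1)
        return -1;              /* would trigger ICMP_TIME_EXCEEDED and a drop */

    /*
     * The TTL is the high byte of the 16-bit word at offset 8, so that word
     * shrinks by 0x0100 and the checksum must grow by 0x0100, with the
     * carry wrapped around (one's complement arithmetic).
     */
    check = ((uint32_t)hdr[10] << 8) | hdr[11];
    check += 0x0100;
    check += (check >= 0xffff);
    hdr[10] = (unsigned char)((check >> 8) & 0xff);
    hdr[11] = (unsigned char)(check & 0xff);

    return --hdr[8];
}
```

The incremental form touches three bytes instead of re-reading the whole header, which matters on a path that runs once per forwarded packet.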
      
      Sending a Packet
      
      ● Handling of sending a packet is done by ip_route_output_key().
      
      ● We need to perform routing lookup also in the case of transmission.
      - There are cases when we perform two lookups, like in ipip tunnels.
      ● In case of a cache miss, we call ip_route_output_slow(), which looks in the routing table (by calling fib_lookup(), as is also done in ip_route_input_slow()).
      
      ● If the packet is for a remote host, we set dst->output to ip_output().
      
      ● ip_output() will call ip_finish_output() – This is the NF_IP_POST_ROUTING point.
      
      ● ip_finish_output() will eventually send the packet via a neighbour by: – dst->neighbour->output(skb) – arp_bind_neighbour() sees to it that the L2 address of the next hop will be known. (net/ipv4/arp.c)
      
      ● If the packet is for the local machine:
      
      – dst->output = ip_output
      
      – dst->input = ip_local_deliver
      
      – ip_output() will send the packet on the loopback device.
      
      – Then we will go into ip_rcv() and ip_rcv_finish(), but this time dst is NOT null, so we will end up in ip_local_deliver().
      
      Multipath routing
      
      ● This feature enables the administrator to set multiple next hops for a destination.
      
      ● To enable multipath routing, CONFIG_IP_ROUTE_MULTIPATH should be set when building the kernel.
      
      ● There was also an option for multipath caching: (by setting CONFIG_IP_ROUTE_MULTIPATH_CACHED).
      
      ● It was experimental and removed in 2.6.23 See links (6).
      
      Multicast routing
      
      The code which handles multicast routing is net/ipv4/ipmr.c for IPv4, and net/ipv6/ip6mr.c for IPv6.
      
      In order to work with Multicast routing, the kernel should be build with
      
      You also need to work with multicast routing user space daemons, like pimd or xorp.
      (In the past there was a daemon called mrouted.) Notice that the
      /proc/sys/net/ipv4/conf/all/mc_forwarding entry is read-only;
      
      ls -al /proc/sys/net/ipv4/conf/all/mc_forwarding
      
      shows:
      -r--r--r-- 1 root root
      
      However, starting a daemon like pimd changes its value to 1
      (stopping the daemon changes it back to 0).
      
      ● Netfilter is the kernel layer to support applying iptables rules. – It enables:
      
      ● Changing packets (masquerading)
      
      ● Writing Netfilter modules, by Jan Engelhardt and Nicolas Bouliane:
      
      http://jengelh.medozas.de/documents/Netfilter_Modules.pdf
      
      Netfilter rule example
      
      ● Applying the following iptables rule: – iptables -A INPUT -p udp --dport 9999 -j DROP
      
      ● This is an NF_IP_LOCAL_IN rule.
      
      ● The packet will go to ip_rcv(),
      
      ● and then to ip_rcv_finish(),
      
      ● and then to ip_local_deliver(),
      
      ● but it will NOT proceed to ip_local_deliver_finish() as in the usual case, without this rule.
      
      ● As a result of applying this rule, it reaches nf_hook_slow() with verdict == NF_DROP (which calls kfree_skb() to free the packet).
      
      ● iptables -t mangle -A PREROUTING -p udp --dport 9999 -j MARK --set-mark 5
      
      – Applying this rule will set skb->mark to 0x05 in ip_rcv_finish().
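The behaviour of the two rules above (one drops the packet before the "finish" function runs, the other only marks it) can be modelled with a small hook chain in user space. Everything here is hypothetical and heavily simplified: `struct demo_pkt`, `demo_nf_hook` and the two hook functions only mimic the idea that the continuation passed to NF_HOOK runs only if no hook returns a drop verdict.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_NF_DROP = 0, DEMO_NF_ACCEPT = 1 };
enum { DEMO_PROTO_UDP = 17 };

/* Hypothetical, flattened view of the packet fields the rules look at. */
struct demo_pkt {
    int proto;
    unsigned short dport;
    unsigned int mark;
    int delivered;          /* set by the continuation, for the demo */
};

typedef int (*demo_hook_fn)(struct demo_pkt *pkt);

/* Run the hooks in order; call okfn only if none of them drops the packet. */
int demo_nf_hook(const demo_hook_fn *hooks, size_t n, struct demo_pkt *pkt,
                 int (*okfn)(struct demo_pkt *pkt))
{
    size_t i;

    assert(pkt != NULL && okfn != NULL);
    for (i = 0; i < n; i++) {
        if (hooks[i](pkt) == DEMO_NF_DROP)
            return -1;      /* the kernel would free the skb at this point */
    }
    return okfn(pkt);
}

/* Models: -p udp --dport 9999 -j DROP */
int demo_drop_udp_9999(struct demo_pkt *pkt)
{
    if (pkt->proto == DEMO_PROTO_UDP && pkt->dport == 9999)
        return DEMO_NF_DROP;
    return DEMO_NF_ACCEPT;
}

/* Models: -p udp --dport 9999 -j MARK --set-mark 5 */
int demo_mark_udp_9999(struct demo_pkt *pkt)
{
    if (pkt->proto == DEMO_PROTO_UDP && pkt->dport == 9999)
        pkt->mark = 5;
    return DEMO_NF_ACCEPT;
}

/* Stand-in for a "finish" function such as ip_local_deliver_finish(). */
int demo_deliver_finish(struct demo_pkt *pkt)
{
    pkt->delivered = 1;
    return 0;
}
```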
      
      ICMP redirect message
      
      ● ICMP protocol is used to notify about problems.
      
      ● A REDIRECT message is sent in case the route is suboptimal (inefficient).
      
      ● There are in fact 4 types of REDIRECT
      
      ● Only one is used :
      
      – Redirect Host (ICMP_REDIR_HOST)
      
      ● See RFC 1812 (Requirements for IP Version 4 Routers).
      
      ● To support sending ICMP redirects, the machine should be configured to send redirect messages. – /proc/sys/net/ipv4/conf/all/send_redirects should be 1.
      
      ● In order that the other side will receive redirects, we should set /proc/sys/net/ipv4/conf/all/accept_redirects to 1.
      
      ● Add a suboptimal route on 192.168.0.31:
      
      ● route add -net 192.168.0.10 netmask 255.255.255.255 gw 192.168.0.121
      
      ● Running “route” now on 192.168.0.31 will show a new entry:
      
      Destination   Gateway        Genmask          Flags Metric Ref Use Iface
      192.168.0.10  192.168.0.121  255.255.255.255  UGH   0      0   0   eth0
      
      ● Send packets from 192.168.0.31 to 192.168.0.10:
      
      ● ping 192.168.0.10 (from 192.168.0.31)
      
      ● We will see (on 192.168.0.31): – From 192.168.0.121: icmp_seq=2 Redirect Host(New nexthop: 192.168.0.10)
      
      ● Now, running on 192.168.0.121: – route -Cn | grep .10
      
      ● shows that there is a new entry in the routing cache:
      
      ● 192.168.0.31 192.168.0.10 192.168.0.10 ri 0 0 34 eth0
      
      ● The “r” in the flags column means: RTCF_DOREDIRECT.
      
      ● The 192.168.0.121 machine had sent a redirect by calling ip_rt_send_redirect() from ip_forward(). (net/ipv4/ip_forward.c)
      
      ● And on 192.168.0.31, running “route -Cn | grep .10” now shows a new entry in the routing cache (in case accept_redirects=1):
      
      ● 192.168.0.31 192.168.0.10 192.168.0.10 0 0 1 eth0
      
      ● In case accept_redirects=0 (on 192.168.0.31), we will see:
      
      ● 192.168.0.31 192.168.0.10 192.168.0.121 0 0 0 eth0
      
      ● which means that the gw is still 192.168.0.121 (which is the route that we added in the beginning).
      
      ● Adding an entry to the routing cache as a result of getting an ICMP REDIRECT is done in ip_rt_redirect(), net/ipv4/route.c.
      
      ● The entry in the routing table is not deleted.
      
      Neighboring Subsystem
      
      ● Best-known protocol: ARP (in IPv6: ND, Neighbour Discovery).

      ● The Ethernet header is 14 bytes long; in the order the fields appear on the wire:
        – Destination MAC address (6 bytes).
        – Source MAC address (6 bytes).
        – Type (2 bytes).

      ● 0x0800 is the type for an IP packet (ETH_P_IP)

      ● 0x0806 is the type for an ARP packet (ETH_P_ARP)

      ● 0x8100 is the type for a VLAN packet (ETH_P_8021Q)

      ● When there is no entry in the ARP cache for the destination IP address of a packet, a broadcast is sent (ARP request, ARPOP_REQUEST: who has IP address x.y.z.w). This is done by a method called arp_solicit() (net/ipv4/arp.c).

      ● You can see the contents of the ARP table by running "cat /proc/net/arp" or by running "arp" from the command line.

      ● You can delete and add entries to the ARP table; see man arp.
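The header layout and type values above can be sketched in userspace C. This is a minimal sketch, not kernel code: frame_ethertype() and frame_kind() are hypothetical helpers, and the TYPE_* constants simply repeat the values listed above (the kernel names them ETH_P_IP, ETH_P_ARP and ETH_P_8021Q).

```c
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as ETH_P_IP, ETH_P_ARP and ETH_P_8021Q. */
#define TYPE_IP    0x0800
#define TYPE_ARP   0x0806
#define TYPE_8021Q 0x8100

/* Hypothetical helper: return the type field of a raw Ethernet frame,
 * or 0 if the buffer is shorter than the 14-byte header. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    if (len < 14)
        return 0;
    /* Bytes 0-5: destination MAC, 6-11: source MAC, 12-13: type (big endian). */
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Hypothetical helper: name the payload carried by the frame. */
static const char *frame_kind(const uint8_t *frame, size_t len)
{
    switch (frame_ethertype(frame, len)) {
    case TYPE_IP:    return "IPv4";
    case TYPE_ARP:   return "ARP";
    case TYPE_8021Q: return "VLAN";
    default:         return "other";
    }
}
```

For an ARP request frame (broadcast destination, type bytes 08 06), frame_kind() returns "ARP".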
      
      Bridging Subsystem
      
      ● You can define a bridge and add NICs to it ("enslaving ports") using brctl (from bridge-utils).

      ● You can have up to 1024 ports for every bridge device (BR_MAX_PORTS).

      ● brctl addbr mybr (create a bridge named "mybr")

      ● brctl addif mybr eth0 (add a port to the bridge)

      ● brctl show

      ● brctl delbr mybr (delete the bridge named "mybr")

      ● When a NIC is configured as a bridge port, the br_port member of net_device is initialized.
        – (br_port is an instance of struct net_bridge_port).

      ● When a port is added to a bridge, we call netdev_rx_handler_register() to register a method for handling incoming frames. This method is br_handle_frame(). See the br_add_if() method in net/bridge/br_if.c.

      ● Besides the bridging interface, the macvlan and bonding interfaces also call netdev_rx_handler_register(). What this method does is assign a method to the rx_handler member of net_device, and assign rx_handler_data to the rx_handler_data member of net_device. You cannot call netdev_rx_handler_register() twice on the same network device; the second call returns an error ("Device or resource busy", EBUSY). See drivers/net/macvlan.c and drivers/net/bonding/bond_main.c.

      ● In the past, when we received a frame, netif_receive_skb() called handle_bridge(). Now we call br_handle_frame() by invoking rx_handler() (see __netif_receive_skb()).
      
      ● The bridging forwarding database is searched for the destination MAC address.
      
      ● In case of a hit, the frame is sent to the bridge port with br_forward() (net/bridge/br_forward.c).
      
      ● If there is a miss, the frame is flooded on all bridge ports using br_flood() (net/bridge/br_forward.c).
      
      ● Note: this is not a broadcast!

      ● The ebtables mechanism is the L2 parallel of L3 Netfilter.

      ● Ebtables enables us to filter and mangle packets at the link layer (L2).
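The forwarding decision described above (search the forwarding database, forward on a hit, flood on a miss) can be illustrated with a toy table. This is only a sketch under stated assumptions: the real bridge uses a hash table with aging, and the names fdb_learn(), fdb_lookup() and PORT_FLOOD are made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define FDB_SIZE   8      /* toy table size, not the kernel's hash table */
#define PORT_FLOOD (-1)   /* hypothetical return value meaning "flood" */

struct fdb_entry {
    uint8_t mac[6];
    int     port;
    int     used;
};

static struct fdb_entry fdb[FDB_SIZE];

/* Remember that `mac` was seen on `port`; an existing entry is updated.
 * Returns 0 on success, -1 if the table is full. */
static int fdb_learn(const uint8_t mac[6], int port)
{
    int free_slot = -1;

    for (int i = 0; i < FDB_SIZE; i++) {
        if (fdb[i].used && memcmp(fdb[i].mac, mac, 6) == 0) {
            fdb[i].port = port;
            return 0;
        }
        if (!fdb[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    memcpy(fdb[free_slot].mac, mac, 6);
    fdb[free_slot].port = port;
    fdb[free_slot].used = 1;
    return 0;
}

/* A hit returns the single output port; a miss asks the caller to
 * flood the frame on all ports. */
static int fdb_lookup(const uint8_t dst[6])
{
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].used && memcmp(fdb[i].mac, dst, 6) == 0)
            return fdb[i].port;
    return PORT_FLOOD;
}
```

A frame to an unknown destination gets PORT_FLOOD; once the address has been learned on a port, later frames to it go to that port only.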
      
      Network namespaces
      
      A network namespace is logically another copy of the network stack,
      with its own routes, firewall rules, and network devices.

      When running:

      ip netns add netns_one

      a file called netns_one is created under /var/run/netns/.

      See man ip netns.

      To show all of the named network namespaces, run:

      ip netns list

      Next, create a veth pair and move one end of it into the namespace:

      ip link add name if_one type veth peer name if_one_peer

      ip link set dev if_one_peer netns netns_one
      
      Example for network namespaces usage:
      =====================================
      Create two namespaces, called "myns1" and "myns2":
      ip netns add myns1
      ip netns add myns2

      Assign the p2p1 interface to the myns1 network namespace:
      ip link set p2p1 netns myns1

      Now, running:
      ip netns exec myns1 bash
      will transfer me to the myns1 network namespace; so if I run there:
      ifconfig -a
      I will see p2p1.

      On the other hand, running:
      ip netns exec myns2 bash
      will transfer me to the myns2 network namespace; but if I run there:
      ifconfig -a
      I will not see p2p1.
      
      Under the hood, calling ip netns exec invokes two system calls from user space:
      the setns system call with CLONE_NEWNET (kernel/nsproxy.c)
      the unshare system call with CLONE_NEWNS (kernel/fork.c)

      See netns_exec() in ip/ipnetns.c (iproute package).
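The setns() step can be sketched as follows. This is a simplified illustration and not the code of netns_exec(): the helpers netns_path() and enter_netns() are hypothetical, and the mount namespace handling that ip netns exec also performs is left out. Calling setns() with CLONE_NEWNET requires CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Build the path of a named namespace, e.g. /var/run/netns/myns1.
 * Returns 0 on success, -1 if the buffer is too small. */
static int netns_path(const char *name, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/var/run/netns/%s", name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Move the calling process into the named network namespace.
 * Returns 0 on success, -1 on error (errno is set by open() or setns()). */
static int enter_netns(const char *name)
{
    char path[256];
    int fd, ret;

    if (netns_path(name, path, sizeof(path)) < 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = setns(fd, CLONE_NEWNET);   /* join the namespace the fd refers to */
    close(fd);
    return ret;
}
```

After a successful enter_netns("myns1"), network calls made by the process see only the devices and routes of myns1.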
      
      Currently there is an issue ("Device or resource busy" error) when trying to delete a namespace.

      Three LWN articles about namespaces:

      "Network namespaces": http://lwn.net/Articles/219794/

      "PID namespaces in the 2.6.24 kernel": http://lwn.net/Articles/259217/

      "Notes from a container": http://lwn.net/Articles/256389/
      
      IPSec
      
      ● Works at the network (IP) layer (L3).

      ● Used in many forms of secured networks, like VPNs.

      ● Mandatory in IPv6 (not in IPv4).

      ● Implemented in many operating systems: Linux, Solaris, Windows, and more.

      ● In the 2.6 kernel: implemented by Dave Miller and Alexey Kuznetsov.

      ● IPSec subsystem maintainers: Herbert Xu and David Miller. Steffen Klassert was added as a maintainer in October 2012.
      
      IPSec git kernel repositories:
      
      There are two git trees at kernel.org, an ‘ipsec’ tree that tracks the
      net tree and an ‘ipsec-next’ tree that tracks the net-next tree.
      
      They are located at
      
      Two data structures are important for IPSec configuration: struct xfrm_state and struct xfrm_policy. Both are defined in include/net/xfrm.h.

      IPSec rules management (add/del/update actions, etc.) is handled from user space through the methods in net/xfrm/xfrm_user.c.

      For example, adding a policy is done by xfrm_add_policy(), in response to getting an XFRM_MSG_NEWPOLICY message from userspace.

      Deleting a policy is done by xfrm_get_policy() when receiving XFRM_MSG_DELPOLICY; xfrm_get_policy() also handles XFRM_MSG_GETPOLICY messages (which perform a lookup).
      
      ● Chain of dst entries; only the last one is for routing.
      
      Also strongSwan: http://www.strongswan.org/
      
      ● There are also non IPSec solutions for VPN
      
      ● struct xfrm_policy has the following member:
      
      – struct dst_entry *bundles.
      
      – __xfrm4_bundle_create() creates dst_entries (with the DST_NOHASH flag) see: net/ipv4/xfrm4_policy.c
      
      ● Transport Mode and Tunnel Mode.
      
      ● Show the security policies:

      – ip xfrm policy show

      ● Show xfrm states:

      – ip xfrm state show

      ● Create RSA keys:

      – ipsec rsasigkey --verbose 2048 > keys.txt

      – ipsec showhostkey --left > left.publickey

      – ipsec showhostkey --right > right.publickey
      
      Example: Host to Host VPN (using openswan)
      
      conn linuxtolinux
          left=192.168.0.189
          leftnexthop=%direct
          leftrsasigkey=0sAQPPQ.
          right=192.168.0.45
          rightnexthop=%direct
          rightrsasigkey=0sAQNwb.
          type=tunnel
          auto=start
      
      ● service ipsec start (to start the service)

      ● ipsec verify: checks your system to see if IPsec got installed and started correctly.

      ● ipsec auto --status: if you see "IPsec SA established", this implies success.

      ● Look for errors in /var/log/secure (Fedora Core) or in the kernel syslog.

      Tips for hacking
      
      ● Documentation/networking/ip-sysctl.txt: networking kernel tunables.

      ● Example of reading a hex address:

      ● iph->daddr == 0x0A00A8C0 means checking if the address is 192.168.0.10 (C0=192, A8=168, 00=0, 0A=10). The bytes appear reversed because daddr is kept in network byte order and is read here on a little-endian machine.

      ● Ignore all pings: echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_all

      ● Disable arp: ip link set eth0 arp off (the NOARP flag will be set)

      ● Also ifconfig eth0 -arp has the same effect.

      ● How can you get the Path MTU to a destination (PMTU)?
        – Use tracepath (see man tracepath).
        – Tracepath is from iputils.

      ● Keep the iphdr struct handy (printout) (from linux/ip.h).

      ● NIPQUAD(): macro for printing IP addresses in dotted-quad form.

      ● CONFIG_NET_DMA is for TCP/IP offload.

      ● When you encounter xfrm / CONFIG_XFRM, this has to do with IPSec (xfrm stands for "transform").

      New and future trends
      
      ● NetChannels (Van Jacobson and Evgeniy Polyakov).

      ● RDMA: Remote Direct Memory Access. The kernel maintainer of the INFINIBAND SUBSYSTEM is Roland Dreier.

      ● Multiqueues: some new NICs, like e1000 and IPW2200, allow two or more hardware Tx queues.

      In case you want to override the kernel selection of the Tx queue, you should implement the ndo_select_queue() member of the net_device_ops struct in your driver. For example, this is done in the ieee80211_dataif_ops struct in net/mac80211/iface.c.

      See Documentation/networking/multiqueue.txt and Documentation/networking/scaling.txt.
      
      Managing multiple queues: affinity and other issues
      
      Ben Hutchings, netconf 2011: vger.kernel.org/netconf2011_slides/bwh_netconf2011.pdf

      In some drivers, the number of queues is passed as a module parameter; see, for example, drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c, where num_queues is a module parameter (number of queues).

      You should also use alloc_etherdev_mq() in your network driver instead of alloc_etherdev().

      ● See: "Enabling Linux Network Support of Hardware Multiqueue Devices", OLS 2007.

      ● Some more info is in Documentation/networking/multiqueue.txt in recent Linux kernels.

      ● See also the multiqueue networking presentation Dave Miller gave at the 5th Netfilter Workshop, September 11th-14th, 2007, Karlsruhe, Germany.

      ● Devices with multiple TX/RX queues will have the NETIF_F_MULTI_QUEUE feature (include/linux/netdevice.h).

      ● Multiqueue NIC drivers will call alloc_etherdev_mq() or alloc_netdev_mq() instead of alloc_etherdev() or alloc_netdev().

      ● We pass the setup method as a parameter to these methods. So, for example, with ethernet devices we pass ether_setup(); with wifi devices, we pass ieee80211_if_setup() (see ieee80211_if_add() in net/mac80211/iface.c).
      
      lnstat tool
      
      lnstat is a powerful tool, part of the iproute2 package.
      
      Examples of usage:
      
      lnstat -f rt_cache -k entries
      shows number of routing cache entries
      
      lnstat -f rt_cache -k in_hit
      shows number of routing cache hits
      
      Misc:
      
      In this section there are some topics on which I intend to add more info
      
      Fragmentation:
      
      Fragmentation of outgoing packets:
      
      When the length of the skb is larger than the MTU of the device from which the packet is transmitted, we perform fragmentation; this is done in the ip_fragment() method (net/ipv4/ip_output.c); in IPv6, it is done in ip6_fragment() in net/ipv6/ip6_output.c.
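As a rough illustration of the arithmetic behind fragmentation (not the code of ip_fragment()): every fragment except the last must carry a multiple of 8 bytes of payload, since fragment offsets are expressed in 8-byte units. The helper names below are hypothetical.

```c
#include <stddef.h>

/* Payload bytes that fit into one fragment: MTU minus the IP header,
 * rounded down to a multiple of 8. */
static size_t frag_payload_max(size_t mtu, size_t ip_hdr_len)
{
    if (mtu <= ip_hdr_len)
        return 0;
    return (mtu - ip_hdr_len) & ~(size_t)7;
}

/* Number of packets needed for `payload_len` bytes of IP payload.
 * 1 means the packet fits and goes out unfragmented; 0 means the MTU
 * is too small to carry any payload at all. */
static size_t frag_count(size_t payload_len, size_t mtu, size_t ip_hdr_len)
{
    size_t per_frag = frag_payload_max(mtu, ip_hdr_len);

    if (payload_len + ip_hdr_len <= mtu)
        return 1;
    if (per_frag == 0)
        return 0;
    return (payload_len + per_frag - 1) / per_frag;   /* round up */
}
```

With an MTU of 1500 and a 20-byte IP header, each fragment carries up to 1480 bytes, so 3980 bytes of payload need 3 fragments.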
      
      Fragmentation can be done in two ways:

      – via a page array (called skb_shinfo(skb)->frags[]). There can be up to MAX_SKB_FRAGS entries; MAX_SKB_FRAGS is 16 when the page size is 4K.

      – via a list of SKBs (called skb_shinfo(skb)->frag_list).

      The method skb_has_frag_list() tests the second (this method was called skb_has_frags() in the past).
      
      When creating a socket in user space, we can tell it not to support fragmentation.

      This is done, for example, in the tracepath util (part of iputils), which finds the path MTU, with setsockopt():

      ...
      int on = IP_PMTUDISC_DO;

      setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on));
      
      In the kernel, ip_dont_fragment() checks the value of the pmtudisc field of the socket (struct inet_sock, which embeds the sock structure). In case pmtudisc equals IP_PMTUDISC_DO, we set the IP_DF (Don't Fragment) flag in the IP header by

      iph->frag_off = htons(IP_DF). See, for example, ip_build_and_send_pkt() in ip_output.c.
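A self-contained userspace sketch of the same setsockopt() call, with the setting read back through getsockopt(). The function names are made up for the example; IPPROTO_IP is used as the socket level, which has the same value as SOL_IP on Linux.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket with path MTU discovery forced on, so that the
 * kernel sets the DF bit on outgoing datagrams instead of fragmenting.
 * Returns the socket fd, or -1 on error. */
static int udp_socket_dont_fragment(void)
{
    int on = IP_PMTUDISC_DO;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read the setting back; returns the IP_PMTUDISC_* value or -1 on error. */
static int get_pmtudisc(int fd)
{
    int val = -1;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, &len) < 0)
        return -1;
    return val;
}
```

On such a socket, sending a datagram larger than the path MTU fails with EMSGSIZE rather than being fragmented.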
      
      raw_sendmsg() and udp_sendmsg() use ip_append_data(), which
      uses the generic IP fragmentation method, ip_generic_getfrag().
      The exception is UDP-Lite sockets, which use udplite_getfrag() for
      fragmentation.
      
      Extracting the fragment offset and the fragment flags from the IP header:
      The "frag_off" field (which is 16 bits long) in the IP header represents the offset and the flags of the fragment. After converting it to host byte order:
      – the 3 leftmost (most significant) bits are the flags.
      – the 13 rightmost (least significant) bits are the offset (the offset unit is 8 bytes).

      Both can be extracted from the IP header with the IP_OFFSET mask.

      IP_OFFSET is 0x1FFF: a mask for getting the 13 rightmost bits

      (see #define IP_OFFSET 0x1FFF in ip.h).
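The same decoding can be tried in userspace. A minimal sketch: the FRAG_* macros repeat the kernel's IP_DF, IP_MF and IP_OFFSET values under different names to avoid clashing with system headers, and struct frag_info and decode_frag_off() are hypothetical.

```c
#include <arpa/inet.h>   /* ntohs(), htons() */
#include <stdint.h>

#define FRAG_DF     0x4000   /* "don't fragment" flag (kernel: IP_DF) */
#define FRAG_MF     0x2000   /* "more fragments" flag (kernel: IP_MF) */
#define FRAG_OFFSET 0x1FFF   /* low 13 bits, in 8-byte units (kernel: IP_OFFSET) */

struct frag_info {
    unsigned int offset_bytes;   /* fragment offset converted to bytes */
    int dont_fragment;
    int more_fragments;
};

/* Decode the 16-bit frag_off field as it appears in the IP header
 * (network byte order). */
static struct frag_info decode_frag_off(uint16_t frag_off_be)
{
    uint16_t v = ntohs(frag_off_be);
    struct frag_info fi;

    fi.offset_bytes   = (unsigned int)(v & FRAG_OFFSET) << 3;
    fi.dont_fragment  = (v & FRAG_DF) != 0;
    fi.more_fragments = (v & FRAG_MF) != 0;
    return fi;
}
```

A frag_off value of 0x2000 | 185 decodes to more_fragments set and an offset of 185 * 8 = 1480 bytes.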
      
      The kernel code does it like this:

      int offset, flags;
      offset = ntohs(ip_hdr(skb)->frag_off);
      flags = offset & ~IP_OFFSET;
      offset &= IP_OFFSET;
      offset <<= 3;    /* offset is in 8-byte units */
      ...
      ...->tx_queue_len = 0;
      ...
      and
      vlan_setup() in net/8021q/vlan_dev.c:
      ...
      dev->tx_queue_len = 0;
      ...

      and
      bond_setup() in drivers/net/bonding/bond_main.c:
      ...
      bond_dev->tx_queue_len = 0;
      ...

      and macvlan_setup() in drivers/net/macvlan.c.
      
      Sometimes you see in the wireshark sniffer
      that the amount of "Bytes on wire" is larger than the MTU
      of the network card.
      This is probably due to using jumbo packets or offloading.
      
      Tunnels
      
      What is the difference between an ipip tunnel and a gre tunnel?

      A gre tunnel supports multicasting, whereas an ipip tunnel supports only unicast.

      MTU

      MTU stands for Maximum Transmission Unit (or sometimes also Maximum Transfer Unit).

      MTU is symmetrical and applies both to receive and transmit.

      Layer 3 should not pass an skb which has a payload bigger than the MTU.

      GSO and TSO are exceptions; in such cases, the packet is split into smaller packets, which fit within the MTU, by the device (TSO) or by the stack just before it is handed to the driver (GSO).
      
      Multicasting
      
      struct net_device holds two lists of addresses (instances of struct netdev_hw_addr_list):
      - uc is the unicast MAC addresses list
      - mc is the multicast MAC addresses list
      You add multicast addresses to the multicast MAC addresses list (mc), both in IPv4 and IPv6, by calling

      dev_mc_add() (in net/core/dev_addr_lists.c).

      In IPv4, a device adds the 224.0.0.1 multicast address (IGMP_ALL_HOSTS, see include/linux/igmp.h) in ip_mc_up() (see net/ipv4/igmp.c).
      
      GSO
      
      For implementing GSO, a method called gso_segment was added to the net_protocol
      struct in IPv4 (see include/net/protocol.h).
      For TCP, this method is tcp_tso_segment() (see tcp_protocol in net/ipv4/af_inet.c).
      There are drivers that implement TSO; for example, Intel's e1000e.

      A member called gso_size was added to skb_shared_info.
      Also, a method called skb_is_gso() was added; it checks whether the
      gso_size of skb_shared_info is 0 or not (it returns true when gso_size is not 0).
      
      Grouping net devices
      
      An interesting patch from Vlad Dogaru (January 2011) added support for network device groups.
      This was done by adding a member called "group" to struct net_device, and
      an API to set this group from the kernel (dev_set_group()) and from user space.
      By default, all network devices are assigned to the default group, group 0
      (INIT_NETDEV_GROUP); see alloc_netdev_mqs() in net/core/dev.c.
      
      ethtool
      
      EEE (Energy Efficient Ethernet) support was recently added to struct ethtool_ops,
      in the form of a new struct called ethtool_eee (added in include/linux/ethtool.h)
      and two methods, get_eee() and set_eee().
      
      IP address
      
      In IPv4, when you set an IP address, you in fact assign it to ifa->ifa_local

      (ifa is a pointer to struct in_ifaddr).

      When running "ifconfig", you in fact issue an SIOCGIFADDR ioctl ("ip addr show" fetches the addresses over netlink instead),
      
      for getting interface address, which is handled by
      
      struct in_device (from inetdevice.h) has a list, ifa_list, which is the chain of IP addresses (struct in_ifaddr) of the device.

      ifa_local is the member of struct in_ifaddr which holds the IPv4 address.
      
      IPV6
      
      In IPv6, the neighboring subsystem uses ICMPv6 for
      neighbour discovery messages (instead of ARP in IPv4).

      ● There are 5 ICMPv6 message types for neighbour discovery
      messages:

      NEIGHBOUR SOLICITATION (135), parallel to ARP request in IPv4
      NEIGHBOUR ADVERTISEMENT (136), parallel to ARP reply in IPv4

      ROUTER SOLICITATION (133)

      ROUTER ADVERTISEMENT (134)
      REDIRECT (137)
      
      Special Addresses:

      – All nodes (or: all hosts) address: FF02::1
      – ipv6_addr_all_nodes() sets address to FF02::1
      – All routers address: FF02::2
      – ipv6_addr_all_routers() sets address to FF02::2
      Both are in include/net/addrconf.h.
      – In IPv6, all addresses starting with FF are multicast addresses.
      – In IPv4, addresses in the range 224.0.0.0 – 239.255.255.255 are multicast addresses (class D).
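Both multicast checks reduce to looking at the leading bits of the address. A small userspace sketch, with hypothetical function names (the kernel has its own helpers for this):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

/* IPv4: class D, i.e. the top four bits of the address are 1110
 * (224.0.0.0 to 239.255.255.255). Returns 0 for unparsable input. */
static int is_ipv4_multicast(const char *text)
{
    struct in_addr a;

    if (inet_pton(AF_INET, text, &a) != 1)
        return 0;
    return (ntohl(a.s_addr) & 0xf0000000u) == 0xe0000000u;
}

/* IPv6: the first byte of the address is 0xFF.
 * Returns 0 for unparsable input. */
static int is_ipv6_multicast(const char *text)
{
    struct in6_addr a;

    if (inet_pton(AF_INET6, text, &a) != 1)
        return 0;
    return a.s6_addr[0] == 0xff;
}
```

Both 224.0.0.1 and ff02::1 (the all-hosts and all-nodes addresses) are reported as multicast, while 192.168.0.10 and fe80::1 are not.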
      
      Privacy Extensions
      ● Since the address is built using a prefix and the MAC address,
      the identity of the machine can be found.
      ● To avoid this, you can use Privacy Extensions.
      – This adds randomness to the IPv6 address creation process (calling get_random_bytes(), for example).
      ● RFC 3041: Privacy Extensions for Stateless Address Autoconfiguration in IPv6.
      ● You need CONFIG_IPV6_PRIVACY to be set when building the kernel.
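To see why the MAC address can be recovered from such an address, here is a sketch of how an interface identifier is commonly derived from a MAC address (the modified EUI-64 scheme), combined here with the link-local prefix FE80::/64. The function name is hypothetical and this is not the kernel's code.

```c
#include <stdint.h>
#include <string.h>

/* Build FE80::/64 + modified EUI-64 from a 48-bit MAC address.
 * `addr` receives the 16 bytes of the IPv6 address. */
static void link_local_from_mac(const uint8_t mac[6], uint8_t addr[16])
{
    memset(addr, 0, 16);
    addr[0]  = 0xfe;            /* link-local prefix FE80::/64 */
    addr[1]  = 0x80;
    addr[8]  = mac[0] ^ 0x02;   /* flip the universal/local bit */
    addr[9]  = mac[1];
    addr[10] = mac[2];
    addr[11] = 0xff;            /* FF:FE is inserted in the middle */
    addr[12] = 0xfe;
    addr[13] = mac[3];
    addr[14] = mac[4];
    addr[15] = mac[5];
}
```

For the MAC address 00:aa:bb:cc:dd:ee this gives fe80::2aa:bbff:fecc:ddee; the original MAC bytes are plainly visible in the last 64 bits.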
      
      Hosts can disable receiving Router Advertisements by setting

      Autoconfiguration
      ● When a host boots (and its cable is connected), it first
      creates a Link Local address.
      – A Link Local address starts with FE80.
      – This address is tentative (only works with ND messages).
      ● The host sends a Neighbour Solicitation message.
      – The target is its tentative address, the source is all zeros.
      – This is DAD (Duplicate Address Detection).
      ● If there is no answer in due time, the state is changed to
      permanent (IFA_F_PERMANENT).

      ● Then the host sends a Router Solicitation.
      – The destination address of the Router Solicitation
      message is the All Routers multicast address,
      FF02::2.
      – All the routers reply with a Router Advertisement
      message.
      – The host sets its address/addresses according to
      the prefix/prefixes received and starts the DAD
      process as before.
      ● At the end of the process, the host will have two (or more)
      IPv6 addresses:
      – The Link Local IPv6 address.
      – The IPv6 address/addresses which were built
      using the prefix (in case there are one or more
      routers sending RAs).
      ● There are three trials by default for sending a Router
      Solicitation.
      – It can be configured by:
      ● /proc/sys/net/ipv6/conf/eth0/router_solicitations
      
      VLAN (802.1q)
      
      VLAN support in Linux is under net/8021q.
      There is also the macvlan driver (drivers/net/macvlan.c).

      The header file for vlan is include/linux/if_vlan.h.
      The header file for macvlan is include/linux/if_macvlan.h.

      The maintainer of vlan is Patrick McHardy.

      VLAN supports almost everything a regular ethernet interface does, including
      firewalling, bridging, and of course IP traffic.

      You will need the 'vconfig' tool from the VLAN project in order to effectively use VLANs.

      You can also set up vlan/macvlan with the "ip" utility:
      ip link add link p2p1 name p2p1.100 type vlan id 5
      ip link add link p2p1 name p2p1#101 address 00:aa:bb:cc:dd:ee type macvlan

      VLAN traffic has type 0x8100 (ETH_P_8021Q).

      A VLAN interface is a virtual device (its net_device tx_queue_len is set to 0).
      
      SKB RECYCLE
      
      skb_recycle is a Linux kernel network stack feature.
      When we no longer need an skb, we free its memory by calling (for example)
      __kfree_skb().
      The skb_recycle patch is based mainly on adding code in __kfree_skb(),
      so that the skb will not be freed. Instead, the members of the skb are
      reinitialized, so the result is the same as a new skb which was
      just created.

      See: "generic skb recycling", a patch by Lennert Buytenhek
      http://lwn.net/Articles/332037/

      According to a patch that proposed removing it, the skb recycling feature got little interest and
      had many bugs.

      skb_recycle is used in only 5 ethernet drivers:

      calxeda/xgmac.c, freescale/gianfar.c, freescale/ucc_geth.c,
      marvell/mv643xx_eth.c and stmicro/stmmac/stmmac_main.c
      
      Bluetooth
      
      Scanning for bluetooth devices is done by:
      hcitool scan

      Resetting a bluetooth device can be done by:
      hciconfig hci0 reset

      Dumping Bluetooth HCI traffic can be done by:
      hcidump -Xt

      (hcidump comes from the bluez-hcidump package in Fedora)

      Two types of controllers are defined by the Bluetooth version 3 core specification:
      - a Basic Rate / Enhanced Data Rate controller (HCI_BREDR)
      - an Alternate MAC/PHY (AMP) controller (HCI_AMP)
      The BNEP layer is for the transmission of IP packets in the Personal Area Networking profile and is implemented in net/bluetooth/bnep.

      Site for Linux Bluetooth:
      http://www.bluez.org/

      The BlueZ project was started in 2001 by Qualcomm.

      obexd is the Object Exchange (OBEX) daemon and is part of BlueZ.
      The Linux BLUETOOTH subsystem and drivers are maintained by
      Marcel Holtmann, Gustavo Padovan and Johan Hedberg.
      
      A BD (bluetooth device) address is 48 bits long, and is made of three parts:

      Lower Address Part (LAP): 24 bits

      Upper Address Part (UAP): 8 bits

      Non-significant Address Part (NAP): 16 bits
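Assuming the address is held as a 48-bit number whose most significant byte is the first byte of the usual XX:XX:XX:XX:XX:XX notation, the three parts can be split with shifts and masks. struct bd_parts and bd_split() are hypothetical names for this sketch.

```c
#include <stdint.h>

struct bd_parts {
    uint16_t nap;   /* Non-significant Address Part: upper 16 bits */
    uint8_t  uap;   /* Upper Address Part: next 8 bits */
    uint32_t lap;   /* Lower Address Part: lower 24 bits */
};

/* Split a 48-bit BD address into NAP, UAP and LAP. */
static struct bd_parts bd_split(uint64_t bdaddr)
{
    struct bd_parts p;

    p.nap = (uint16_t)((bdaddr >> 32) & 0xffff);
    p.uap = (uint8_t)((bdaddr >> 24) & 0xff);
    p.lap = (uint32_t)(bdaddr & 0xffffff);
    return p;
}
```

For 00:11:22:33:44:55 this gives NAP = 0x0011, UAP = 0x22 and LAP = 0x334455.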
      
      Linux kernel bluetooth mailing list archive:
      
      Some Bluetooth acronyms:
      
      BNEP: The Bluetooth Network Encapsulation Protocol
      
      BD: Bluetooth device.
      
      L2CAP: The Logical Link Control and Adaptation Protocol
      RFCOMM: The Radio Frequency Communications protocol
      ACL: The Asynchronous Connection-oriented Logical transport protocol
      SCO: Synchronous Connection-Oriented logical transport.
      
      Bluetooth git tree for developers (for submitting patches):
      git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git
      
      VXLAN
      
      VXLAN is a standard protocol for transferring layer 2 Ethernet packets over UDP.

      The Linux implementation was written by Stephen Hemminger.

      The first patches were sent in September 2012.

      GRE over IPv6

      Dmitry Kozlov added support for GRE over IPv6.

      These patches were applied in August 2012.
      
      See:
      
      Links and more info
      
      2) Understanding the Linux Kernel, Second Edition, by Daniel P. Bovet and Marco Cesati, December 2002. Chapter 18: Networking.

      3) Linux Device Drivers, Third Edition, by Jonathan Corbet, Alessandro Rubini and Greg Kroah-Hartman, February 2005.

      – Chapter 17, Network Drivers
      
      4) Linux networking: (a lot of docs about specific networking topics) – http://linuxnet.osdl.org/index.php/Main_Page
      
      7) Linux Advanced Routing & Traffic Control : http://lartc.org/
      
      8) ebtables – a filtering tool for a bridging: http://ebtables.sourceforge.net/
      
      10) Netconf – a yearly networking conference; first was in 2004.
      
      – Linux Conf Australia, January 2008,Melbourne
      
      12) TRASH: A dynamic LC-trie and hash data structure:

      Robert Olsson, Stefan Nilsson, August 2006
      
      14) Openswan: Building and Integrating Virtual Private Networks , by Paul Wouters, Ken Bantoft
      
      16) For a very basic description of the network stack, see [1].
      
      18) http://www.makelinux.net/reference is a general reference for Linux kernel internals.
      
      19) This Linux Journal article by Alan Cox is an overall introduction to the networking kernel.
      
      Receive packet steering (RPS)
      
      21) application for zero copy:
      
      (trafgen; uses PF_PACKET RAW sockets and sendto() sys call)
      
      22) splice tools:
      
      network splice receive:
      
      23) Network namespaces — by Jonathan Corbet:
      