NIC will verify the destination MAC address (unless in promiscuous mode) and the FCS (Frame Check Sequence) and decide whether to drop the frame or continue
NIC will DMA the packet into RAM, into a region previously prepared (mapped) by the driver
The packet is thus copied (via DMA) to a ring buffer in kernel memory
# Size of the NIC device's ring buffer
Check command: ethtool -g ethX
Change command: ethtool -G ethX rx <value> tx <value>
NIC will enqueue references to the packets on the receive ring buffer queue rx until the rx-usecs timeout (the hard IRQ fires only after this many microseconds, avoiding an interrupt per packet) or until rx-frames packets have arrived (the hard IRQ fires once the NIC has received that many frames)
# ethtool -c eth1                # show the NIC's current coalescing settings
# ethtool -C eth1 rx-usecs 450   # set the NIC's rx-usecs parameter
# Valid range for rx-usecs: 0, 1, 3, 10-8191   # value range for this NIC
1. 0 = off: interrupt in real time, one packet per interrupt, the lowest latency
2. 1 = dynamic: the interrupt rate ranges between 4000 and 70000 per second
3. 3 = dynamic conservative (default): the interrupt rate ranges between 4000 and 20000 per second
4. 10-8191: how many microseconds pass between interrupts. For example, to keep the number of interrupts per second below 1000, set rx-usecs to 1000000/1000 = 1000 microseconds; to keep it below 2000, set it to 1000000/2000 = 500
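The arithmetic in the last item can be wrapped in a tiny helper (a sketch; the function name and the NIC name eth1 are illustrative):

```shell
# rx_usecs_for_rate: convert a target interrupt rate (interrupts/second)
# into an rx-usecs value: 1,000,000 microseconds / target rate.
rx_usecs_for_rate() {
    echo $((1000000 / $1))
}

rx_usecs_for_rate 1000   # at most ~1000 interrupts/s -> rx-usecs 1000
rx_usecs_for_rate 2000   # at most ~2000 interrupts/s -> rx-usecs 500

# The result could then be applied with (requires root):
#   ethtool -C eth1 rx-usecs "$(rx_usecs_for_rate 1000)"
```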
NIC will raise a hard IRQ
CPU will run the IRQ handler that runs the driver's code
Driver will schedule a NAPI poll, clear the hard IRQ and return
NAPI will poll data from the receive ring buffer until the netdev_budget_usecs timeout (the poll deadline), or until netdev_budget (the maximum total packets processed in one poll cycle) or dev_weight (the maximum packets one device may process per poll) packets have been handled
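These NAPI knobs live under net.core and can be inspected with sysctl (a sketch, assuming a Linux host; whether raising them helps depends on /proc/net/softnet_stat):

```shell
# Current NAPI polling limits (defaults vary by kernel version):
sysctl net.core.netdev_budget        # max packets per softirq poll cycle
sysctl net.core.netdev_budget_usecs  # max time per softirq poll cycle
sysctl net.core.dev_weight           # max packets per device per poll

# The 3rd column of softnet_stat counts "time squeezes": polls that hit
# the budget/timeout with packets still pending. If it keeps growing,
# consider e.g.:
#   sysctl -w net.core.netdev_budget=600
head -2 /proc/net/softnet_stat
```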
The kernel looks at the routing table to decide whether the packet is to be forwarded or delivered locally
If it's local, it calls netfilter (LOCAL_IN)
It calls the L4 protocol (for instance tcp_v4_rcv)
It finds the right socket
It goes to the tcp finite state machine
The packet is enqueued on the socket receive buffer, which is sized according to the tcp_rmem rules
Check command: sysctl net.ipv4.tcp_rmem
Change command: sysctl -w net.ipv4.tcp_rmem="min default max"; when changing the default value,
remember to restart your user-space app (e.g. your web server, nginx, etc.)
If tcp_moderate_rcvbuf is enabled, the kernel will auto-tune the receive buffer
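The auto-tuning switch itself is a sysctl (a sketch, assuming a Linux host):

```shell
# 1 (the default) lets the kernel auto-tune each connection's receive
# buffer within the bounds given by net.ipv4.tcp_rmem; 0 disables it.
sysctl net.ipv4.tcp_moderate_rcvbuf
# To disable auto-tuning (rarely a good idea):
#   sysctl -w net.ipv4.tcp_moderate_rcvbuf=0
```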
On the send side, the kernel enqueues the skb on the socket write buffer, which is sized according to tcp_wmem
Check command: sysctl net.ipv4.tcp_wmem
Change command: sysctl -w net.ipv4.tcp_wmem="min default max"; when changing the default value,
remember to restart your user-space app (e.g. your web server, nginx, etc.)
Builds the TCP header (src and dst port, checksum)
Calls L3 handler (in this case ipv4 on tcp_write_xmit and tcp_transmit_skb)
L3 (ip_queue_xmit) does its work: build ip header and call netfilter (LOCAL_OUT)
Calls output route action
Calls netfilter (POST_ROUTING)
Fragments the packet, if needed (ip_output)
Calls L2 send function (dev_queue_xmit)
Feeds the device's output (QDisc) queue, of txqueuelen length, using the queueing algorithm selected by default_qdisc
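The queueing discipline and queue length can be inspected and changed like this (a sketch; eth0 is an example device name, and the right qdisc depends on the workload):

```shell
sysctl net.core.default_qdisc   # qdisc used for newly created interfaces
tc qdisc show dev eth0          # qdisc actually attached to eth0
ip link show eth0               # "qlen N" in the output is the txqueuelen

# Change them (requires root):
#   sysctl -w net.core.default_qdisc=fq
#   ip link set dev eth0 txqueuelen 10000
```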
* netstat -atn | awk '/tcp/ {print $6}' | sort | uniq -c or ss -s
* ss -neopt state time-wait | wc -l
counts sockets in a specific state: established, syn-sent, syn-recv, fin-wait-1, fin-wait-2, time-wait, closed, close-wait, last-ack, listening, closing
* netstat -st or nstat -a
* cat /proc/net/tcp
detailed stats; see the meaning of each field in the kernel docs
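The state column of /proc/net/tcp (field "st", in hex: 01 = ESTABLISHED, 06 = TIME_WAIT, 0A = LISTEN, ...) can be tallied with a short awk script (a sketch; count_tcp_states is just an illustrative name):

```shell
# Count sockets per TCP state in a /proc/net/tcp-formatted file.
# Field 4 ("st") holds the state as a hex code; line 1 is the header.
count_tcp_states() {
    awk 'NR > 1 { states[$4]++ }
         END { for (s in states) print s, states[s] }' "$1"
}

if [ -r /proc/net/tcp ]; then
    count_tcp_states /proc/net/tcp
fi
```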
* cat /proc/net/netstat
ListenOverflows and ListenDrops are important fields to keep an eye on
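In /proc/net/netstat the first "TcpExt:" line carries field names and the second the values, so extracting the listen counters takes a little awk (a sketch; listen_counters is an illustrative name):

```shell
# Print the Listen* counters (ListenOverflows, ListenDrops) from a
# /proc/net/netstat-formatted file: pair the TcpExt: names line with
# the TcpExt: values line that follows it.
listen_counters() {
    awk '/^TcpExt:/ {
             if (!seen) { split($0, names); seen = 1 }
             else for (i = 2; i <= NF; i++)
                      if (names[i] ~ /^Listen/) print names[i], $i
         }' "$1"
}

if [ -r /proc/net/netstat ]; then
    listen_counters /proc/net/netstat
fi
# Roughly equivalent with iproute2: nstat -az | grep Listen
```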
* /sys/class/net/eth0/statistics/ or ethtool -S eth0 to monitor the NIC's statistics counters
* /proc/net/dev for high-level per-NIC statistics
Tuning
RSS (Receive Side Scaling) / Receive Packet Steering (RPS) / multiqueue
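A minimal sketch of how these are configured (eth0 and the CPU mask are examples; the sysfs paths are the standard RPS locations, root required):

```shell
# RSS: how many hardware queues the NIC exposes / uses
ethtool -l eth0                    # show channel (queue) counts
#   ethtool -L eth0 combined 8     # use 8 combined queues

# RPS: software steering, configured per rx queue. The value is a hex
# CPU bitmask; "f" steers queue rx-0's packets across CPUs 0-3.
ls /sys/class/net/eth0/queues/     # one rx-N / tx-N dir per queue
#   echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
```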