Netconf discussions, part 2

By Nathan Willis
February 18, 2016

On February 8 and 9, the Monday and Tuesday before the Netdev 1.1 conference in Seville, Spain, kernel networking developers gathered for Netconf, an informal roundtable to discuss recent issues face to face and to debate upcoming work. Last week, we covered the discussions that took place on the event's first day; what follows is a recap of how those discussions progressed on the second day.

SR-IOV

First, Alex Duyck raised the issue of supporting single-root I/O virtualization (SR-IOV), which is often employed to share a network interface device between virtual machines (VMs). The kernel has SR-IOV support for several devices, but there is no formal specification that the in-kernel implementations adhere to, which leads to increasing complexity. Intel and Mellanox devices operate differently, he said; in particular, some are capable of learning about the virtual network, while others need the hypervisor to explicitly pass down most configuration information.

There are also newer SR-IOV devices that include an embedded switch, which raises the question of whether all SR-IOV devices should be supported through the switchdev driver model. There is a case to be made that SR-IOV hardware, in general, mediates access between physical and virtual network devices, which is "switch-like," but not everyone is persuaded—Jesse Brandeburg said that Intel, for one, was not sure if it should take the switchdev approach. Networking subsystem maintainer David Miller, however, was strongly in favor of the approach. He noted that switchdev was intended to support a number of abstract models that amount to "packets flowing through non-net_device devices." He also pointed out that using switchdev would provide better netfilter support for SR-IOV.

eBPF

Alexei Starovoitov then provided an update on the extended Berkeley Packet Filter (eBPF). He highlighted, among other things, the BPF Compiler Collection (bcc) toolkit from IO Visor, the ability to attach eBPF programs to tracepoints, and the perf integration. These recent additions to eBPF have opened the door to significantly better tracing from user space, including better profiling of network performance. There is still more to be done in that area, however; he wants to add three or four new tracepoints in strategic places and to add some metadata to struct sock to eliminate the need for another lookup.

He also discussed upcoming changes and ideas for eBPF, starting with the ability to map eBPF maps into user-space memory with mmap(). That should make it possible to avoid locks and write zero-copy eBPF programs. He is also developing a kernel sampling counter that can be set from user space and used to collect periodic statistics. Further out, he would like to add eBPF support for generic segmentation offload (GSO) and generic receive offload (GRO) for UDP-based protocols like QUIC, which he hoped would discourage people from bypassing the kernel to implement new protocols in user space. He is also interested in adding support for bounded loops and vector instructions, he said, but he lacked time to explain the full rationale. He noted, however, that he has heard several use cases for those features, including vector instructions "which I first thought was crazy."

Starovoitov also proposed a different kind of "offload"—offloading the execution of eBPF programs to hardware. He knows of an implementation of eBPF that runs on a field-programmable gate array (FPGA), he said, which is reported to be released soon as open-source hardware. eBPF could also run on dedicated chips; the important thing is that the API remain stable and that the tool chain provide what developers need.

That change would make eBPF programs more like firmware, which several in the group seemed to regard as an unwise move. Miller, in particular, was not a fan of the concept, particularly since the eBPF virtual machine can run arbitrary code. That puts it in a different class than (for example) WiFi adapters, which load firmware designed to implement a well-known API. Jeff Kirsher noted that the debate was similar to the one over how much code should be offloaded from the kernel to switch hardware, and proposed tabling further discussion for now, to take up both topics together further down the line.

Replacing ethtool

Next, Brandeburg proposed writing a modernized replacement for the increasingly outdated ethtool network-interface control utility, most likely written on top of netlink. Because ethtool has numerous limitations, there is general agreement that a replacement is needed, although there are differing ideas about what constitute the critical features. Thomas Graf suggested that every operation should be split into a validation step and an execution step, thus making transactional operations possible. Shrijeet Mukherjee said that operations need to be asynchronous, in order to avoid locking when hundreds of commands are processed in short order—even on "read" commands, which are important to collecting high-quality statistics.
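
The validate-then-execute split that Graf described can be pictured with a small sketch. Everything in it is hypothetical: the names Op and apply_transaction, the settings dictionary, and the allowed values are illustrative assumptions, not part of ethtool, netlink, or any proposal made at the meeting. The point is only that no state changes until every operation in the batch has passed validation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Op:
    """One hypothetical configuration change: set `key` to `value`."""
    key: str
    value: Any
    validate: Callable[[Any], bool]


def apply_transaction(settings: Dict[str, Any], ops: List[Op]) -> Dict[str, Any]:
    """Validate every op first; execute only if all of them pass.

    Returns a new settings dict and leaves the input untouched, so a
    failed validation cannot leave the device half-configured.
    """
    # Phase 1: validation. Nothing is modified here.
    for op in ops:
        if op.key not in settings:
            raise KeyError(f"unknown setting: {op.key}")
        if not op.validate(op.value):
            raise ValueError(f"invalid value for {op.key}: {op.value!r}")

    # Phase 2: execution. Reached only when every op validated.
    updated = dict(settings)
    for op in ops:
        updated[op.key] = op.value
    return updated


# Illustrative device state and a two-operation transaction.
nic = {"speed": 1000, "autoneg": True}
ok = apply_transaction(nic, [
    Op("speed", 10000, lambda v: v in (10, 100, 1000, 10000)),
    Op("autoneg", False, lambda v: isinstance(v, bool)),
])
```

If any operation in the list fails its check, the call raises before the execution phase starts and the original settings are left as they were.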

Mukherjee also suggested splitting such a replacement tool into a separate daemon and front-end, although Brandeburg and Graf both felt that such a split might be unwarranted. Miller observed that many people seem to want to query multiple hardware devices at the same time, so multi-device support should probably be made a design goal as well. Graf asked whether that included "multi-SET" commands to update several parameters at once; Miller replied that developers' lives would be much easier if they did not try to support that model.

There was consensus that all of ethtool's existing functions map onto netlink functions, and that ethtool will likely never disappear completely. So a migration to the successor utility seems to be the path forward. Mukherjee noted that Cumulus has written a generic netlink front-end tool, and volunteered to release it as an RFC. Other than that, it may be a while before the plan takes solid form, although Brandeburg did suggest the name "nettool."

Devlink

Next, Jiří Pírko discussed devlink, a tool he has written to simplify the administration of physical hardware devices that provide more than one network port. Such devices include certain network interface cards (NICs), splitter cables for some newer NICs, and various application-specific integrated circuit (ASIC) switch devices. In each case, the hardware has device-wide capabilities a level above what the net_device interface provides. There is currently no generic solution to managing such devices; many of them also do not map easily to existing configuration tools.

Pírko had already posted an RFC about devlink to the mailing list, so he demonstrated it and fielded questions from around the room. Mukherjee worried that the tool was adding yet another command-line utility to what is already a lengthy list; Miller countered that a similar set of hurdles is already facing users who want to take advantage of the newer features in many WiFi chipsets.

In the end, however, Miller concurred that there is probably a need for some higher-level interface to these devices, and noted that it may overlap with user-space tools desired for switchdev. He also pointed out that the network developers have recognized the need for some higher-level object for quite some time. Known colloquially as "the thing," what this abstraction will eventually become is far from clear. But it is clear that the kernel is having to cope with configuring and managing devices above the "NIC level," so new tools will undoubtedly follow.

Lightweight tunnels and MPLS

Roopa Prabhu then spoke about lightweight tunnel (lwt) support. She noted that there are currently two distinct user classes for tunneling: those that make use of a net_device (such as VXLAN and GENEVE tunnels) and those that do not, instead redirecting packets via Multiprotocol Label Switching (MPLS) or Identifier Locator Addressing (ILA). The latter class constitutes the use case for lwt.

However, the redirection of lwt packets needs optimization. For instance, outgoing MPLS packets are redirected too early, before IP fragmentation is done. Incoming ILA packets, though, are redirected too late, after they are demultiplexed. In both cases, the timing of the redirection hurts throughput. Work is underway to fix the redirection examples cited, but Prabhu also suggested that it might be worth adding additional redirection hooks at other strategically placed points in the lwt-processing pipeline.

In addition, she reported on some other patches still in the works for MPLS. Included are additional statistics reporting, support for MPLS-based VPNs, ping and traceroute support, and hardware offload.

Netlink API

Prabhu then raised some potential improvements that could be made to the netlink API. A chief concern is how to extend the API as new functionality (such as switchdev) is merged into the kernel. Adding new attributes and extending existing attributes is not complicated, but the kernel does not return errors when it encounters an unknown attribute; it ignores them. Thus, users who copy and paste routing examples from kernel documentation end up with silent failures on later kernel releases, when the API has changed.
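
The silent-failure problem is easier to see with a toy parser. Netlink attributes are type-length-value records: a four-byte header holding a 16-bit length (which counts the header but not the padding) and a 16-bit type, followed by the payload, padded to a four-byte boundary. The sketch below is a simplified user-space illustration in that format, not kernel code; the attribute type numbers are made up, and flag bits in the type field are ignored. A lenient parser skips types it does not recognize, while a hypothetical strict mode reports them.

```python
import struct

NLA_HDRLEN = 4    # u16 length + u16 type
NLA_ALIGNTO = 4   # attributes are padded to 4-byte boundaries


def nla_align(n: int) -> int:
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute; the length field excludes trailing padding."""
    length = NLA_HDRLEN + len(payload)
    padding = b"\0" * (nla_align(length) - length)
    return struct.pack("=HH", length, attr_type) + payload + padding


def parse_attrs(buf: bytes, known: set, strict: bool = False) -> dict:
    """Return {type: payload} for known types.

    Unknown types are skipped silently unless strict is True, in which
    case they raise ValueError.
    """
    attrs = {}
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError("malformed attribute")
        if attr_type in known:
            attrs[attr_type] = buf[offset + NLA_HDRLEN:offset + length]
        elif strict:
            raise ValueError(f"unknown attribute type {attr_type}")
        # Lenient mode: fall through and skip the unknown attribute.
        offset += nla_align(length)
    return attrs


# Type 1 is "known" to this parser; type 99 stands in for an attribute
# that the receiving side does not understand.
message = pack_attr(1, b"\x01\x02\x03\x04") + pack_attr(99, b"ab")
lenient = parse_attrs(message, known={1})
```

In lenient mode the caller gets back only the attribute of type 1 and has no indication that part of the request was dropped; calling parse_attrs(message, known={1}, strict=True) raises instead.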

Several potential fixes to this problem were discussed, from providing a "features" bitmask that software could retrieve to examine the attributes available to exporting the entire network hierarchy (perhaps filtered by protocol). Prabhu also suggested writing an official set of guidelines documenting the use of the netlink API. Finally, she reported on some ongoing work; patches should be expected soon to provide IGMP and per–virtual-LAN statistics for bridges, and to add full support for netlink's bridge and bond attributes in iproute2.

Miller then asked whether it would ever be possible to get rid of the netlink mmap() functions, which are widely regarded as a failed experiment. The functionality is rarely, if ever, used, likely because mmap()ing outgoing traffic makes it difficult to verify, although it does make dumps easier. The consensus seemed to be that the feature could be removed, since all user-space code would fall back onto the non-mmap() paths anyway.

APD and control protocols

Mukherjee then discussed Cumulus's work on adding support for ACPI Platform Description (APD) to the kernel. APD is not networking-specific, but it enables a "self-describing" infrastructure that the kernel could use to describe the available network hardware. The most important examples, he said, are the lane maps (that is, which lines are for transmit and which for receive).

He also reported on some work in progress to protect control protocols. Frequently, he said, when a control protocol misses its heartbeat, it will freeze up, which can trigger a cascade effect and "melt down the rest of the network." The team is currently exploring using deadline scheduling, so that if a heartbeat is missed, the kernel will treat it as a "hard miss" and take the network device offline.

Specifications

Miller asked the attendees whether the kernel community should write "NIC specifications" to tell hardware vendors what the kernel wants. Too often, it seems, network-device vendors expect to have easily accessible documents that describe the interfaces and behavior the operating system needs—largely because Microsoft was in the habit of publishing such specifications for Windows. The kernel community, however, has never had such documents, so even as Linux has supplanted Windows as the networking OS of choice, vendors have continued to design products around the Microsoft specifications.

The trick is deciding how the kernel community would write and publish such documents, as they do not fit into the kernel's established development process. But there was sufficient interest in the idea that it may be kicked around further. Herbert, for example, noted that during the recent effort to implement checksum offloading, it became apparent that none of the NIC vendors had heard of the kernel community's strong preference for a "generic" checksum feature rather than protocol-specific checksums. Had that information been better disseminated, perhaps more vendors would have implemented generic checksum offloading.

Netfilter

Pablo Neira Ayuso closed out the second day with a report on Netfilter. He highlighted a long list of enhancements made over the previous year, such as cleanups to bridge-netfilter, the addition of per-namespace netfilter hooks, and support for the new unified control group hierarchy. There have also been many improvements to nftables over the same time period, including the addition of garbage collection, generic packet mangling, a new tracing infrastructure, and the ability to set timeouts.

He also reported on the addition of switchdev support in the nf_tables kernel module, which allows the user to offload network access-control lists (ACLs) to switch hardware. The idea is to provide a netlink front-end for creating rulesets, which are mapped to an intermediate representation (IR); the IR will then be pushed to the switch device driver, which is responsible for converting the IR into the necessary internal representation. The goal is to maintain compatibility between rulesets written for software nf_tables and for devices with hardware offload. Daniel Borkmann asked if the IR could be used to generate eBPF programs; Neira replied that it has been discussed, but that such an effort is not yet underway.
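
The layering described here, one ruleset lowered to a shared IR that each backend then converts for itself, can be sketched in a few lines. This is purely illustrative: the IRRule type, the field names, and the two backend functions are invented for the example and do not reflect the actual nf_tables or switchdev data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IRRule:
    """Hypothetical intermediate representation of one ACL rule."""
    field: str     # header field to match, e.g. "tcp.dport"
    value: str     # value to compare against
    verdict: str   # "accept" or "drop"


def ruleset_to_ir(ruleset: List[Tuple[str, str, str]]) -> List[IRRule]:
    """Front-end step: lower user-supplied rules to the shared IR."""
    return [IRRule(field, value, verdict) for field, value, verdict in ruleset]


def software_backend(ir: List[IRRule]) -> List[str]:
    """Stand-in for the software path: render each rule as text."""
    return [f"{r.field} {r.value} {r.verdict}" for r in ir]


def toy_switch_backend(ir: List[IRRule]) -> List[Dict]:
    """Stand-in for a driver: convert the IR to its own table format."""
    action_codes = {"accept": 0, "drop": 1}   # made-up hardware encoding
    return [
        {"match": {r.field: r.value}, "action": action_codes[r.verdict]}
        for r in ir
    ]


# The same ruleset feeds both backends through the one IR.
rules = [("tcp.dport", "22", "accept"), ("ip.saddr", "10.0.0.1", "drop")]
ir = ruleset_to_ir(rules)
sw = software_backend(ir)
hw = toy_switch_backend(ir)
```

Because both backends consume the same IR, a ruleset written once can be handed to either; only the final conversion step is device-specific.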

Finally, he discussed some ongoing work. Command autocompletion is being added to the nftables user-space tool, along with support for the new tracing infrastructure. Connection tracking will be added for bridge filtering, and several new features are in the works for the ingress hook (including connection tracking, logging, and queueing). There is also a high-level library under development, although it is still experimental at this stage; Neira said that the project was in no rush to publish it as of yet.

After the last presentation, Miller wrapped up the day by expressing thanks to the organizers and to each of the presenters. The attendees gathered for a group photo, then dispersed to prepare for the coming three days of the Netdev conference.

[The author would like to thank the Netconf and Netdev organizers for travel assistance to Seville.]
