BPF and tcpdump

Introduction

I sometimes used to find myself in situations where tcpdump's filters seemingly did not work the way that I expected them to. In those situations I often simply ran tcpdump in line buffered mode (-l) and piped the output into grep to find what I was looking for. Particularly the VLAN filter used to give me some headaches with older versions of tcpdump and libpcap. In this article, I will try to shed some light at how tcpdump (via libpcap) generates bytecode for filtering packets. Understanding how the bytecode is generated and how it works should help you understand why a filter expression might not match a specific packet that you are looking for.

This article extends upon the introduction into tcpdump's BPF compiler provided in the blog post BPF - the forgotten bytecode. Before reading on, read through this resource, then come back here. The blog post also links to the paper The BSD Packet Filter: A New Architecture for User-level Packet Capture if you want to further deepen your understanding. The Linux Socket Filtering aka Berkeley Packet Filter (BPF) kernel documentation might come in handy when following the examples below. Please also note that when I refer to BPF in this article, I refer to classic BPF and not to eBPF.

tcpdump expressions and BPF

Compiling a basic BPF expression

After reading through the aforementioned material, you should already have an understanding of tcpdump's BPF compiler. Any expression that you give to tcpdump will be compiled into bytecode which in turn will be given to the kernel's JIT compiler. Let's use an easy example and look at the icmp filter.

First, we will capture on the loopback interface and ping 127.0.0.1 at the same time:

1
2
3
4
5
6
[root@host ~]# timeout 1 tcpdump -i lo -w lo.pcap
dropped privs to tcpdump
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
0 packets captured
6 packets received by filter
0 packets dropped by kernel

Let's look at one of the captured packets. Each visual block in the hexdump is a halfword, or 16 bits. Offsets in tcpdump are specified in Bytes. The IP ethertype is 0x800 and can be found in Bytes 12 and 13, the ICMP protocol number for ICMP is 0x01 and can be found in Byte 23:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
[root@host ~]# tcpdump -r lo.pcap -xx
reading from file lo.pcap, link-type EN10MB (Ethernet)
dropped privs to tcpdump
11:50:13.451990 IP localhost > localhost: ICMP echo request, id 1, seq 1, length 64

                                            | -- ethertype for IP (0800)
                                            |
                                            v
    0x0000:  0000 0000 0000 0000 0000 0000 0800 4500
    0x0010:  0054 17ee 4000 4001 24b9 7f00 0001 7f00
                               ^
                               | --- IP protocol number for ICMP (01)

    0x0020:  0001 0800 947c 0001 0001 4566 7663 0000
    0x0030:  0000 e2e4 0600 0000 0000 1011 1213 1415
    0x0040:  1617 1819 1a1b 1c1d 1e1f 2021 2223 2425
    0x0050:  2627 2829 2a2b 2c2d 2e2f 3031 3233 3435
    0x0060:  3637

And here is how tcpdump compiles the icmp filter expression for an interface for a live capture:

1
2
3
4
5
6
7
[root@host ~]# tcpdump -i lo -d icmp
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 5             # <---- ethertype for IP
(002) ldb      [23]
(003) jeq      #0x1             jt 4    jf 5             # <---- IP protocol number for ICMP
(004) ret      #262144
(005) ret      #0

We'd also expect this to be the same when we look at our capture file:

1
2
3
4
5
6
7
8
[root@host ~]# tcpdump -r lo.pcap -d icmp
reading from file lo.pcap, link-type EN10MB (Ethernet)
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 5             # <---- ethertype for IP
(002) ldb      [23]
(003) jeq      #0x1             jt 4    jf 5             # <---- IP protocol number for ICMP
(004) ret      #262144
(005) ret      #0

So far, so good and straight forward. We look at Byte 23 and check if the IP protocol number is 1. Before that, we check the ethernet header in Byte 12 and make sure that we are using IP.

Differences when filtering on the wire versus reading from a packet capture file

Depending on the tcpdump version in use, your provided expression might be compiled slightly differently. What's more, there is a difference between the bytecode that is compiled for live packet captures on interfaces and for the bytecode that is compiled for filtering packet capture files. Earlier, we looked at the icmp filter. We will now see what happens when we filter for a VLAN number with the vlan filter.

The Wikipedia article about EtherTypes lists the EtherTypes for VLANS: for 802.1q the value is 0x8100, for Q-in-Q we would expect to see 0x88A8 and for double tagging 0x9100. We expect to find one of these values in Bytes 12 and 13 of the Ethernet header. And when we apply the filter to our previously captured file, this is precisely what the expression is compiled to. Tcpdump first loads the halfword at Bytes 12 and 13. Then it sequentially checks if the value matches either of the known EtherTypes for VLAN:

1
2
3
4
5
6
7
8
[root@host ~]# tcpdump -r lo.pcap -d vlan
reading from file lo.pcap, link-type EN10MB (Ethernet)
(000) ldh      [12]
(001) jeq      #0x8100          jt 4    jf 2    # <---- ethertype for 802.1q
(002) jeq      #0x88a8          jt 4    jf 3    # <---- ethertype for Q-in-Q
(003) jeq      #0x9100          jt 4    jf 5    # <---- ethertype for double tagging
(004) ret      #262144
(005) ret      #0

Now, let's look at a live capture from the loopback interface. We might expect the same bytecode, but instead we see the following:

1
2
3
4
5
6
7
8
9
[root@host ~]# tcpdump -i lo -d vlan
(000) ldb      [-4048]
(001) jeq      #0x1             jt 6    jf 2
(002) ldh      [12]
(003) jeq      #0x8100          jt 6    jf 4
(004) jeq      #0x88a8          jt 6    jf 5
(005) jeq      #0x9100          jt 6    jf 7
(006) ret      #262144
(007) ret      #0

The expression is different now! Why is that? Well, the kernel has an internal optimization and will actually store VLAN and other ancillary information in a location with a negative offset.

The reason for this is that the kernel no longer passes specific information as-is to libpcap. See https://bugs.launchpad.net/ubuntu/+source/tcpdump/+bug/1641429 for further details:

1
2
3
4
5
6
The kernel now longer passes vlan tag information as-is to libpcap, instead BPF needs to access ancillary data.

That is the reason "vlan 114" works (because it does the right thing), and a manual filter doesn't, because libpcap never actually sees this.

Offsets are negative because this is the way to access this ancillary data (like vlan tags) in the Linux kernel:
https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/include/uapi/linux/filter.h#L60

And see the kernel documentation as well: https://github.com/torvalds/linux/blob/7acac4b3196caee5e21fb5ea53f8bc124e6a16fc/include/uapi/linux/filter.h#L60:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
/* RATIONALE. Negative offsets are invalid in BPF.
   We use them to reference ancillary data.
   Unlike introduction new instructions, it does not break
   existing compilers/optimizers.
 */
#define SKF_AD_OFF    (-0x1000)
 (...)
#define SKF_AD_VLAN_TAG 44
#define SKF_AD_VLAN_TAG_PRESENT 48
 (...)

According to the kernel code, we expect to find an indication that a VLAN tag is present at the following location:

  • -0x1000 + 48 = - 4096 + 48 = −4048

The VLAN number is stored at location:

  • -0x1000 + 44 = - 4096 + 44 = −4052

For example, let's look for VLAN 32 (0x20) in our packet capture file. We first verify if we find a VLAN ethertype in Bytes 12 and 13. If so, we load Bytes 14 and 15 and extract the last 12 bits (the VID). We then check if those match 0x20:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
[root@host ~]# tcpdump -r lo.pcap -d vlan 32
reading from file lo.pcap, link-type EN10MB (Ethernet)
(000) ldh      [12]                                  # <---- look for dot1q, etc. at Byte 12
(001) jeq      #0x8100          jt 4    jf 2         # <---- ethertype for 802.1q 
(002) jeq      #0x88a8          jt 4    jf 3         # <---- ethertype for Q-in-Q
(003) jeq      #0x9100          jt 4    jf 8         # <---- ethertype for double tagging
(004) ldh      [14]                                  # <---- load Bytes 14 and 15
(005) and      #0xfff                                # <---- extract the last 12 bits from the field (VID)
(006) jeq      #0x20            jt 7    jf 8         # <---- check for VLAN 32
(007) ret      #262144
(008) ret      #0

That, too, makes sense. But if we do a live capture, the logic becomes more complex. We first check at location SKF_AD_VLAN_TAG_PRESENT (-4048) to find out if a VLAN is present. If true, we load location SKF_AD_VLAN_TAG (-4052) and extract the last 12 bits from there and then check if the value equals 0x20. Otherwise, we check Bytes 12 and 14 like above:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
[root@host ~]# tcpdump -i lo vlan 32 -d
(000) ldb      [-4048]                             # <---- load value at location SKF_AD_VLAN_TAG_PRESENT
(001) jeq      #0x1             jt 6    jf 2       # <---- if the value is true, check SKF_AD_VLAN_TAG, otherwise read in Byte 12
(002) ldh      [12]                                # <---- load the content at Byte 12
(003) jeq      #0x8100          jt 6    jf 4
(004) jeq      #0x88a8          jt 6    jf 5
(005) jeq      #0x9100          jt 6    jf 14
(006) ldb      [-4048]                             # <---- does the kernel special field report this as a VLAN?
(007) jeq      #0x1             jt 8    jf 10
(008) ldb      [-4052]                             # <---- load value at location SKF_AD_VLAN_TAG
(009) ja       11                                  # <---- jump to instruction 11
(010) ldh      [14]
(011) and      #0xfff                              # <---- extract the last 12 bits from the field (VID)
(012) jeq      #0x20            jt 13   jf 14      # <---- look for VLAN 32 as earlier
(013) ret      #262144
(014) ret      #0

As a conclusion:

  • Data from a .pcap file is treated as if everything came directly from the wire, the offsets are the same as we'd see them on the physical layer.
  • When we capture data from an interface, libpcap will use kernel ancillary data but it will also add a fallback expression in newer versions.

How does tcpdump compile user provided expressions - Source code analysis

Let's look at how tcpdump (by means of libpcap) compiles its expressions into the appropriate BPF bytecode.

tcpdump will compile different types of BPF depending on if the optimization flag is set but also particularly depending on if we open a file for reading or if we run a live capture.

The -r indicates to read from a file and the parameter is read here:

1
2
3
        case 'r':
            RFileName = optarg;
            break;

Opening a file for reading happens here:

If -r is not specified, then libpcap will listen on the interface instead:

How the BPF will be compiled depends on if a file was used as an input or if tcpdump reads from a socket on a network interface. The actual compilation of the user provided filter expression to BPF happens here:

If the -d parameter (and thus the dflag) is set, tcpdump will call libpcap's bpf_dump function to print the expression in one of three formats:

The different dump formats are explained in the man page:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
man tcpdump
(...)
       -d     Dump the compiled packet-matching code in a human readable form to standard output and stop.

              Please  mind  that  although code compilation is always DLT-specific, typically it is impossible (and
              unnecessary) to specify which DLT to use for the dump because tcpdump uses either the DLT of the  in              put  pcap  file  specified with -r, or the default DLT of the network interface specified with -i, or
              the particular DLT of the network interface specified with -y and -i respectively. In these cases the
              dump shows the same exact code that would filter the input file or the network interface without -d.

              However, when neither -r nor -i is specified, specifying -d prevents tcpdump from guessing a suitable
              network interface (see -i).  In this case the DLT defaults to EN10MB and can be set to another  valid
              value manually with -y.

       -dd    Dump packet-matching code as a C program fragment.

       -ddd   Dump packet-matching code as decimal numbers (preceded with a count).
(...)

Building a filter expression compiler

Documentation

In order to build a filter expression compiler, we look at the source code:

And at this documentation here:

Installing dependencies

Install build dependencies. For example, on RHEL 8:

1
2
subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms 
yum install libpcap-devel -y

Creating the code

Now, create the following code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
cat <<'EOF' > libpcap_expression_compiler.c
// From:
// https://github.com/the-tcpdump-group/tcpdump/blob/master/tcpdump.c#L2238
// https://www.tcpdump.org/pcap.html
// https://git.netfilter.org/iptables/tree/utils/nfbpf_compile.c

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <pcap.h>

/*
 * Concatenate all input arguments into a long string
 */
char * concat_args(int argc, char **argv) {
    char * arg_string;
    int len;

    for(int i=1; i<argc; i++) {
        len += strlen(argv[i]);
    }
    arg_string = (char *)malloc(len+argc-1);

    for (int i=1; i < argc; i++) {
        if(i != 1) {
            strcat(arg_string, " ");
        }
        strcat(arg_string, argv[i]);
    }
    return arg_string;
}

int main(int argc, char **argv) {
    pcap_t *handle;
    char dev[] = "lo";
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program fcode;
    bpf_u_int32 localnet = 0, netmask = 0;
    int Oflag = 1;

    char * filter_exp = concat_args(argc, argv);
    if(argc <= 1) {
        fprintf(stderr, "Empty filter expression\n");
        exit(1);
    }
    printf("Compiling expression '%s'\n\n", filter_exp);

    if (pcap_lookupnet(dev, &localnet, &netmask, errbuf) == -1) {
        fprintf(stderr, "Can't get netmask for device %s\n", dev);
        localnet = 0;
        netmask = 0;
    }
    handle = pcap_open_live(dev, BUFSIZ, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "Couldn't open device %s: %s\n", dev, errbuf);
        return(2);
    }
    if (pcap_compile(handle, &fcode, filter_exp, Oflag, netmask) < 0) {
        fprintf(stderr, "Couldn't parse filter %s: %s\n", filter_exp, pcap_geterr(handle));
         return(2);
    }

    printf("*** Dump compiled packet-matching code (-d) ***\n\n");
    bpf_dump(&fcode, 1);
    printf("\n*** Dump packet-matching code as a C program fragment (-dd) ***\n\n");
    bpf_dump(&fcode, 2);
    printf("\n*** Dump packet-matching code as bytecode (-ddd) ***\n\n");
    bpf_dump(&fcode, 3);

    // https://git.netfilter.org/iptables/tree/utils/nfbpf_compile.c
    printf("\n*** Dump packet-matching code as bytecode (on one line) ***\n\n");
    struct bpf_insn *ins;
    ins = fcode.bf_insns;

    printf("%d,", fcode.bf_len);
    for (int i = 0; i < fcode.bf_len-1; ++ins, ++i) {
    printf("%u %u %u %u,", ins->code, ins->jt, ins->jf, ins->k);
    }
    printf("%u %u %u %u\n", ins->code, ins->jt, ins->jf, ins->k);

    pcap_close(handle);
    free(filter_exp);
    pcap_freecode(&fcode);

    exit(0);
}
EOF

Compiling the code

And compile it:

1
gcc libpcap_expression_compiler.c -o libpcap_expression_compiler -lpcap

Using the tool

The tool will now produce the same output as tcpdump with the -d, -dd or -ddd flag set. In order to do so, it will attach to the loopback interface, so the binary must be run with sufficient privileges.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
# ./libpcap_expression_compiler vlan 32
Compiling expression 'vlan 32'

*** Dump compiled packet-matching code (-d) ***

(000) ldb      [-4048]
(001) jeq      #0x1             jt 6    jf 2
(002) ldh      [12]
(003) jeq      #0x8100          jt 6    jf 4
(004) jeq      #0x88a8          jt 6    jf 5
(005) jeq      #0x9100          jt 6    jf 14
(006) ldb      [-4048]
(007) jeq      #0x1             jt 8    jf 10
(008) ldb      [-4052]
(009) ja       11
(010) ldh      [14]
(011) and      #0xfff
(012) jeq      #0x20            jt 13   jf 14
(013) ret      #8192
(014) ret      #0

*** Dump packet-matching code as a C program fragment (-dd) ***

{ 0x30, 0, 0, 0xfffff030 },
{ 0x15, 4, 0, 0x00000001 },
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 2, 0, 0x00008100 },
{ 0x15, 1, 0, 0x000088a8 },
{ 0x15, 0, 8, 0x00009100 },
{ 0x30, 0, 0, 0xfffff030 },
{ 0x15, 0, 2, 0x00000001 },
{ 0x30, 0, 0, 0xfffff02c },
{ 0x5, 0, 0, 0x00000001 },
{ 0x28, 0, 0, 0x0000000e },
{ 0x54, 0, 0, 0x00000fff },
{ 0x15, 0, 1, 0x00000020 },
{ 0x6, 0, 0, 0x00002000 },
{ 0x6, 0, 0, 0x00000000 },

*** Dump packet-matching code as bytecode (-ddd) ***

15
48 0 0 4294963248
21 4 0 1
40 0 0 12
21 2 0 33024
21 1 0 34984
21 0 8 37120
48 0 0 4294963248
21 0 2 1
48 0 0 4294963244
5 0 0 1
40 0 0 14
84 0 0 4095
21 0 1 32
6 0 0 8192
6 0 0 0

*** Dump packet-matching code as bytecode (on one line) ***

15,48 0 0 4294963248,21 4 0 1,40 0 0 12,21 2 0 33024,21 1 0 34984,21 0 8 37120,48 0 0 4294963248,21 0 2 1,48 0 0 4294963244,5 0 0 1,40 0 0 14,84 0 0 4095,21 0 1 32,6 0 0 8192,6 0 0 0

Or let's create another filter expression, just as another example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
# ./libpcap_expression_compiler host 127.0.0.254 and tcp and port 5353
Compiling expression 'host 127.0.0.254 and tcp and port 5353'

*** Dump compiled packet-matching code (-d) ***

(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 16
(002) ld       [26]
(003) jeq      #0x7f0000fe      jt 6    jf 4
(004) ld       [30]
(005) jeq      #0x7f0000fe      jt 6    jf 16
(006) ldb      [23]
(007) jeq      #0x6             jt 8    jf 16
(008) ldh      [20]
(009) jset     #0x1fff          jt 16   jf 10
(010) ldxb     4*([14]&0xf)
(011) ldh      [x + 14]
(012) jeq      #0x14e9          jt 15   jf 13
(013) ldh      [x + 16]
(014) jeq      #0x14e9          jt 15   jf 16
(015) ret      #8192
(016) ret      #0

*** Dump packet-matching code as a C program fragment (-dd) ***

{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 14, 0x00000800 },
{ 0x20, 0, 0, 0x0000001a },
{ 0x15, 2, 0, 0x7f0000fe },
{ 0x20, 0, 0, 0x0000001e },
{ 0x15, 0, 10, 0x7f0000fe },
{ 0x30, 0, 0, 0x00000017 },
{ 0x15, 0, 8, 0x00000006 },
{ 0x28, 0, 0, 0x00000014 },
{ 0x45, 6, 0, 0x00001fff },
{ 0xb1, 0, 0, 0x0000000e },
{ 0x48, 0, 0, 0x0000000e },
{ 0x15, 2, 0, 0x000014e9 },
{ 0x48, 0, 0, 0x00000010 },
{ 0x15, 0, 1, 0x000014e9 },
{ 0x6, 0, 0, 0x00002000 },
{ 0x6, 0, 0, 0x00000000 },

*** Dump packet-matching code as bytecode (-ddd) ***

17
40 0 0 12
21 0 14 2048
32 0 0 26
21 2 0 2130706686
32 0 0 30
21 0 10 2130706686
48 0 0 23
21 0 8 6
40 0 0 20
69 6 0 8191
177 0 0 14
72 0 0 14
21 2 0 5353
72 0 0 16
21 0 1 5353
6 0 0 8192
6 0 0 0

*** Dump packet-matching code as bytecode (on one line) ***

17,40 0 0 12,21 0 14 2048,32 0 0 26,21 2 0 2130706686,32 0 0 30,21 0 10 2130706686,48 0 0 23,21 0 8 6,40 0 0 20,69 6 0 8191,177 0 0 14,72 0 0 14,21 2 0 5353,72 0 0 16,21 0 1 5353,6 0 0 8192,6 0 0 0

Generating and using BPF bytecode with iptables

The aforementioned article BPF - the forgotten bytecode shows that BPF bytecode can also be used outside of tcpdump. Let's see how we can use BPF bytecode together with iptables to filter traffic.

Dependencies

For RHEL/CentOS 8, during my testing I installed the following build dependencies for the examples below:

1
2
3
4
5
6
7
yum install '@Development Tools' -y
yum install gcc-toolset -y
yum install binutils-devel -y
yum install readline-devel -y
yum install elfutils-libelf-devel -y
yum install clang -y
yum install llvm -y

Generating BPF bytecode with tcpdump

There are several different options to generate BPF bytecode. The easiest way is to use tcpdump to generate it from a provided filter expression. The iptables bpf module requires input that is generated with data link type DLT_RAW.

Prerequisites for tcpdump

In my case, DLT_RAW was not supported by my distribution's build (RHEL 8). If you find yourself in the same situation, you can build a recent version of tcpdump directly from source.

Example build instructions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
git clone https://github.com/the-tcpdump-group/tcpdump
git clone https://github.com/the-tcpdump-group/libpcap.git
cd libpcap/
git checkout origin/libpcap-1.10
./configure
make install
cd ../tcpdump/ 
git checkout origin/tcpdump-4.99
./configure
make install

Compare the output of the following commands - the first one, EN10MB, accounts for ethernet headers whereas RAW starts at the IP headers:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# # implicit EN10MB DLT
# tcpdump -d tcp dst port 8080
Warning: assuming Ethernet
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x1f90          jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x1f90          jt 14   jf 15
(014) ret      #262144
(015) ret      #0
# RAW DLT
# tcpdump -y RAW -d tcp dst port 8080
(000) ldb      [0]
(001) and      #0xf0
(002) jeq      #0x60            jt 3    jf 7
(003) ldb      [6]
(004) jeq      #0x6             jt 5    jf 18
(005) ldh      [42]
(006) jeq      #0x1f90          jt 17   jf 18
(007) ldb      [0]
(008) and      #0xf0
(009) jeq      #0x40            jt 10   jf 18
(010) ldb      [9]
(011) jeq      #0x6             jt 12   jf 18
(012) ldh      [6]
(013) jset     #0x1fff          jt 18   jf 14
(014) ldxb     4*([0]&0xf)
(015) ldh      [x + 2]
(016) jeq      #0x1f90          jt 17   jf 18
(017) ret      #262144
(018) ret      #0

As stated earlier, for iptables -m bpf, we need the output that's generated by DLT_RAW.

For further information about DLTs, see LINK-LAYER HEADER TYPES which states the following for DLT_RAW:

1
Raw IP; the packet begins with an IPv4 or IPv6 header, with the version field of the header indicating whether it's an IPv4 or IPv6 heade

Using a simple filter expression with tcpdump

You can generate the bytecode for iptables with:

1
2
# tcpdump -y RAW -ddd tcp dst port 8080 | tr '\n' ',' | sed 's/,$//'
19,48 0 0 0,84 0 0 240,21 0 4 96,48 0 0 6,21 0 13 6,40 0 0 42,21 10 11 8080,48 0 0 0,84 0 0 240,21 0 8 64,48 0 0 9,21 0 6 6,40 0 0 6,69 4 0 8191,177 0 0 0,72 0 0 2,21 0 1 8080,6 0 0 262144,6 0 0 0

And now, you can use this bytecode with iptables:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# python3 -m http.server 8080 >/dev/null 2>&1 &
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
200
# iptables -I INPUT -m bpf --bytecode "$(tcpdump -y RAW -ddd tcp dst port 8080 | tr '\n' ',' | sed 's/,$//')" -j REJECT
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
000
# iptables -L INPUT -nv --line-numbers
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        2   120 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           match bpf 48 0 0 0,84 0 0 240,21 0 4 96,48 0 0 6,21 0 13 6,40 0 0 42,21 10 11 8080,48 0 0 0,84 0 0 240,21 0 8 64,48 0 0 9,21 0 6 6,40 0 0 6,69 4 0 8191,177 0 0 0,72 0 0 2,21 0 1 8080,6 0 0 262144,6 0 0 0 reject-with icmp-port-unreachable

tcpdump with an explicit offset expression

Let's get a bit more fancy and let's filter the same packets with a custom offset expression. In test.pcap, I captured a TCP request to port 8080. The hexdump looks as follows. I am purposefully ignoring the ethernet header by only providing -x as DLT_RAW begins with the IP header:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# tcpdump -nn -r test.pcap -x | tail -n 10
reading from file test.pcap, link-type EN10MB (Ethernet), snapshot length 262144
08:58:05.409490 IP 127.0.0.1.8080 > 127.0.0.1.55346: Flags [F.], seq 1, ack 80, win 342, options [nop,nop,TS val 1370696821 ecr 1370694317], length 0
    0x0000:  4500 0034 896d 4000 4006 b354 7f00 0001
    0x0010:  7f00 0001 1f90 d832 2b16 be50 ec11 89b4
    0x0020:  8011 0156 fe28 0000 0101 080a 51b3 2c75
    0x0030:  51b3 22ad
08:58:05.409525 IP 127.0.0.1.55346 > 127.0.0.1.8080: Flags [.], ack 2, win 342, options [nop,nop,TS val 1370696821 ecr 1370696821], length 0
    0x0000:  4500 0034 0000 4000 4006 3cc2 7f00 0001
    0x0010:  7f00 0001 d832 1f90 ec11 89b4 2b16 be51
                             ^
                             |-------- dst port 0x1f90 (Bytes 22 + 23)
    0x0020:  8010 0156 2423 0000 0101 080a 51b3 2c75
    0x0030:  51b3 2c75

We want to identify all packets with a dst port of 8080 (or 0x1f90 in hex notation). We are looking for halfword 11, or Bytes 22 and 23. The tcpdump expression would hence be:

1
ether[22:2] == 0x1f90
Don't be mislead by the ether keyword. We are filtering with DLT_RAW and hence have to ignore the etherheader.

And the compiled expression:

1
2
3
4
5
6
7
#  tcpdump -y RAW -d ether[22:2] == 0x1f90
(000) ldh      [22]
(001) jeq      #0x1f90          jt 2    jf 3
(002) ret      #262144
(003) ret      #0
# tcpdump -y RAW -ddd ether[22:2] == 0x1f90 | tr '\n' ',' | sed 's/,$//'
4,40 0 0 22,21 0 1 8080,6 0 0 262144,6 0 0 0

And we can test this again:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# python3 -m http.server 8080 >/dev/null 2>&1 &
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
200
# iptables -I INPUT -m bpf --bytecode "$(tcpdump -y RAW -ddd ether[22:2] == 0x1f90 | tr '\n' ',' | sed 's/,$//')" -j REJECT
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
000
# iptables -L INPUT -nv --line-numbers
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        2   120 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           match bpf 40 0 0 22,21 0 1 8080,6 0 0 262144,6 0 0 0 reject-with icmp-port-unreachable

Generating BPF bytecode with nfbpf_compile

While searching for more resources, I found BPF: A Bytecode for filtering. The article links to nfbpf_compile.c which can generate iptables -m bpf compatible bytecode by using DLT_RAW.

Here's the full script from the above link (just in case the original resource changes):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
cat <<'EOF' > nbpf_compile.c
/*
 * BPF program compilation tool
 *
 * Generates decimal output, similar to `tcpdump -ddd ...`.
 * Unlike tcpdump, will generate for any given link layer type.
 *
 * Written by Willem de Bruijn (willemb@google.com)
 * Copyright Google, Inc. 2013
 * Licensed under the GNU General Public License version 2 (GPLv2)
*/

#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct bpf_program program;
    struct bpf_insn *ins;
    int i, dlt = DLT_RAW;

    if (argc < 2 || argc > 3) {
        fprintf(stderr, "Usage:    %s [link] '<program>'\n\n"
                "          link is a pcap linklayer type:\n"
                "          one of EN10MB, RAW, SLIP, ...\n\n"
                "Examples: %s RAW 'tcp and greater 100'\n"
                "          %s EN10MB 'ip proto 47'\n'",
                argv[0], argv[0], argv[0]);
        return 1;
    }

    if (argc == 3) {
        dlt = pcap_datalink_name_to_val(argv[1]);
        if (dlt == -1) {
            fprintf(stderr, "Unknown datalinktype: %s\n", argv[1]);
            return 1;
        }
    }

    if (pcap_compile_nopcap(65535, dlt, &program, argv[argc - 1], 1,
                PCAP_NETMASK_UNKNOWN)) {
        fprintf(stderr, "Compilation error\n");
        return 1;
    }

    printf("%d,", program.bf_len);
    ins = program.bf_insns;
    for (i = 0; i < program.bf_len-1; ++ins, ++i)
        printf("%u %u %u %u,", ins->code, ins->jt, ins->jf, ins->k);

    printf("%u %u %u %u\n", ins->code, ins->jt, ins->jf, ins->k);

    pcap_freecode(&program);
    return 0;
}
EOF

Compile this with:

1
gcc nbpf_compile.c -o nbpf_compile -lpcap

And generate the bytecode:

1
2
# ./nbpf_compile RAW 'tcp port 8080'
23,48 0 0 0,84 0 0 240,21 0 6 96,48 0 0 6,21 0 17 6,40 0 0 40,21 14 0 8080,40 0 0 42,21 12 13 8080,48 0 0 0,84 0 0 240,21 0 10 64,48 0 0 9,21 0 8 6,40 0 0 6,69 6 0 8191,177 0 0 0,72 0 0 0,21 2 0 8080,72 0 0 2,21 0 1 8080,6 0 0 65535,6 0 0 0

As earlier, we can test this with:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# python3 -m http.server 8080 >/dev/null 2>&1 &
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
200
# iptables -I INPUT -m bpf --bytecode "$(./nbpf_compile RAW 'tcp port 8080')" -j REJECT
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
000
# iptables -L INPUT -nv --line-numbers
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        2   120 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           match bpf 48 0 0 0,84 0 0 240,21 0 6 96,48 0 0 6,21 0 17 6,40 0 0 40,21 14 0 8080,40 0 0 42,21 12 13 8080,48 0 0 0,84 0 0 240,21 0 10 64,48 0 0 9,21 0 8 6,40 0 0 6,69 6 0 8191,177 0 0 0,72 0 0 0,21 2 0 8080,72 0 0 2,21 0 1 8080,6 0 0 65535,6 0 0 0 reject-with icmp-port-unreachable

Generating BPF bytecode from BPF asm instructions

You can also choose the low level route and write your own BPF asm instructions and convert them into bytecode.

Compiling helper binaries

First, you will need the bpf_asm binary from the kernel source code.

Get the kernel source code on RHEL / CentOS:

1
2
3
4
5
mkdir -p kernel/src
cd kernel/
yumdownloader --source kernel
cp kernel-*.src.rpm src/ ; cd src ; rpm2cpio kernel-*.src.rpm | cpio -idmv
tar -xf linux-*.tar.xz

Compile the BPF toolset:

1
2
3
4
5
6
cd $(find . -path '*tools/bpf')
make all
cp bpf_asm /usr/local/bin/.
cp bpf_dbg /usr/local/bin/.
cp bpf_jit_disasm /usr/local/bin/.
cp bpftool/bpftool /usr/local/bin/.

Building bytecode

Now, create the following bpf asm instructions:

1
2
3
4
5
6
7
8
# cat <<'EOF' > test.bpf 
     ldh [22]
     jneq #0x1f90, lb_1
     ret #1

lb_1:
     ret #0
EOF

And compile it to bytecode with:

1
2
# bpf_asm test.bpf 
4,40 0 0 36,21 0 1 8080,6 0 0 1,6 0 0 0,

As earlier, we can test this with:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# python3 -m http.server 8080 >/dev/null 2>&1 &
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
200
# iptables -I INPUT -m bpf --bytecode "$(bpf_asm test.bpf)" -j REJECT
# curl -s -o /dev/null -w "%{http_code}" 127.0.0.1:8080
000
# iptables -L INPUT -nv --line-numbers
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1        2   120 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           match bpf 40 0 0 22,21 0 1 8080,6 0 0 1,6 0 0 0 reject-with icmp-port-unreachable

For further details about the assembly language, see the kernel documentation.