RXXR: Regular expression denial of service (REDoS) static analysis

Introduction to regex denial of service attacks

Backtracking regular expression matchers can take an exponential amount of time for attempting to match certain regular expressions. The exponential runtime means that for certain (perhaps malicious) inputs the regex matching fails to terminate for all intents and purposes, so there is a denial of service, or lack of availability of the process running the matcher.

The redos problem is sometimes explained in terms of NFAs (nondeterministic finite automata). It is true that an NFA is simulated in backtracking matchers, but that does not really explain the exponential blowup. The matcher is not even finite-state, so the applicability of finite automata is limited. By contrast, our abstract machines capture how the search tree branches and grows.

Static analysis for redos: RXXR

RXXR is a tool for statically detecting ReDoS vulnerabilities that cause regular expression matching to take an exponential amount of time, leading to the risk of a DoS attack.

RXXR stands for regular expression exponential runtime.

Given that the reg exp matchers in Java and .NET are vulnerable to redos, the intended users for RXXR include developers for these platforms who wish to write secure applications. More generally, there is a risk of redos whenever a naive backtracking matcher is exposed to untrusted input.

This paper explains the design of the static analysis by way of abstract machines:
Static Analysis for Regular Expression Denial-of-Service Attacks

If you intend to use RXXR, or would like to contribute or share feedback, please feel free to drop us an email at

[email protected],  [email protected]

Guide to installing and using RXXR

Building the source

The OCaml code implementing the PWFpi and the HFpi machines is available for download (see the link at the bottom of that page below the legal text.)

The linked archive contains the three directories: common, pwf and hf. The common directory contains the sources for the regular expression parser which is shared by both the PWFpi and HFpi machine implementations. The regular expression syntax supported by the parser is described in the syntax section below. Each of the pwf and hf directories contain a bash script build.sh which when executed should produce the executable binaries for the machines (pwf.bin and hf.bin). These sources were developed using the OCaml programming language version 3.11.2.

Results and Data sets

The results of running the HFpi machine on the RegExLib data set is located here (tool output log). The same for the Snort data set can be found here. The data set for the RegExLib archive comes in two versions; the first one is the un-processed raw data set, the second one (used in the results) incorporates the manual adjustments described below:

White-space: Some white-space can be part of the regular expression syntax, for an example, in an expression like [a-z ], the space character is part of the regular expression and should not be removed. Extra whitespace is usually added to make a complex expression easy to understand; the global modifier /x indicates if such white-space should be ignored by the parser. In most of the expressions the authors have ignored the /x modifier since the formatting of the expression itself indicates that the white-space is to be ignored (by the viewer); we had to manually look for such cases and remove the additional white-space before feeding the expressions into the analysis. Usually these cases are easy to spot since such expressions almost always expand into multiple lines (next adaptation).
Line breaks: Most regular expressions are defined on a single line. However, for extremely complex regular expressions the authors have opted to insert white-space, new-lines and comments to explain the behaviour of the expression. A manual scan through the data-set allowed the removal of such constructs so that those regular expressions can be fit into a single line (as required by our parser).
Comments: Comments are marked by a # sign and extends to the end of the line. These were removed during the multi-line removal phase described earlier.
Spam: Since RegExLib is a public web site, there were several entries in the regular expression data-set that can be considered as spam (advertising, trivial entries, etc.). Such entries were completely removed.

The data-set for snort on the other hand did not require any manual processing, this data set was collected from the Snort rule-set version 2.9.31.

The Pthon scripts used for extracting the regular expressions are located here. These scripts were developed using the Python scripting language version 2.7.3. The script used for scraping the RegExLib website has an external dependency on the library BeautifulSoup; this library provides the necessary sub-routines for parsing and extracting information from web sources. BeautifulSoup library should be installed on Python before executing the first script; this installation can be accomplished by issueing the command python setup.py install from the BeautifulSoup directory. The first script does not require any input parameters while the second script (for Snort) should be given one argument which points to the Snort rules directory. Both the scripts produce the results of the extraction on the standard output (terminal) which can then be re-directed to an output file as desired. The data-sets linked in the previous section were produced this way.

Testing

HFpi

The HFpi implementation (hf.bin) expects three arguments. The first argument should be an integer value indicating the maximum number of HFpi transitions to be performed on each of the input expressions. The second integer argument limits the suffix generation algorithm in a similar way. The third argument should be a file path pointing to the input data set. Format of the input data file can be observed in the RegExLib and Snort data sets linked above; each line of the input file contains a single regular expression while empty lines and lines beginning with the # symbol are ignored. The HFpi machine produces its results on the standard output (terminal) which can be redirected to an output file as desired. For an example, consider a file f1.txt containing the following two lines:

# A regular expression for validating file paths (paths to zip archives)
^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$

With this file we can invoke the HFpi machine by issueing the following command:

./hf.bin 1000 10 f1.txt

The result of this command should appear as below (the execution time may vary depending on the machine configuration):

=[1]=
- Regex: ^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$
- Parse: Ok
- Kleene count: 7
+ Analysis: Completed
+ Vulnerable: Yes
  + Kleene: (\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+
    - Prefix: z:\ 
    - Pumpable: \zzz\
    - Suffix: (The empty string)
Max search depth: 1000, Time: 0.001678 (s), Total: 1, Parsable: 1, Analysable: 1, 
Vulnerable: 1, Suffixed: 1, Not Vulnerable: 0

PWFpi

The PWFpi implementation (pwf.bin) can be used to validate the vulnerabilities reported by the HFpi machine. When executed, pwf.bin queries for a regular expression and an input string. It then executes the PWFpi machine and produces the result of the maching operation along with the corresponding step count. For the example above, the following session may be observed:

./pwf.bin 
Input expression: ^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$
Subject string: z:\ \zzz\
No match.
Step count: 74
./pwf.bin 
Input expression: ^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$
Subject string: z:\ \zzz\\zzz\
No match.
Step count: 182
./pwf.bin 
Input expression: ^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$
Subject string: z:\ \zzz\\zzz\\zzz\
No match.
Step count: 398

Note that the growth of the step count (with each pumping) is exponential.

RegexTestHarness

A modified version of the Java RegexTestHarness can be downloaded from here. The said modification allows the harness to accept input strings which contain hexa-decimal escape sequences (ASCII); this is because some of the prefixes, pumpable strings and suffixes generated by the HFpi machine contain hexa-decimal characters which cannot be directly fed into the original harness. The harness operates in a similar fashion to pwf.bin, it queries for a regular expression and an input string and then executes the Java regular expression matching routines to determine if the given input string can be matched by the regular expression. While the harness does not offer any means of observing an exponential runtime behavior, repeating the pumpable string for 25 times or so should get the harness stuck.

Regex syntax

At the moment, only a subset of the usual regular expression syntax is supported by RXXR, which includes all the pure contructs such as Kleene star and alternation. See the rxxr input syntax page for details.