3 August 2022

How not to use Regular Expressions

by Julien Vehent

Earlier this week I shared some strong thoughts on regular expressions with my team. We were discussing how a regex-based deny-list failed, and I may have lamented my lack of trust in anything based on regexes, which, judging by the eye-rolling, probably sounded irritating. So I thought I would share a story from when I started as a junior security engineer back in the mid-2000s. Gathered around the bonfire, grandpa Julien is going to ramble for a minute.

At the time, I was working for a French bank on the security of their web portal. Those of you who were already in the industry most likely remember how everyone was raving about web application firewalls back then. WAFs were the new shiny technology that would solve all our security problems, never mind how basic they were. They essentially applied Perl-compatible regular expressions (PCRE) to URL query parameters. That’s it. They lacked any sort of learning capabilities and had very limited understanding of the web stack. Most didn’t even support TLS. Many increased latency. Some made services more vulnerable just by being on the critical path.
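To make "regular expressions applied to query parameters" concrete, here is a minimal sketch of that kind of filter. It is an illustration only, not the product we ran: the rules, the URLs and the is_blocked helper are all made up, and it uses Python's re module rather than a real PCRE engine.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical deny rules, invented for this sketch.
DENY_RULES = [
    re.compile(r"\.\./"),                                   # directory traversal
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),    # crude SQL injection check
]


def is_blocked(url: str) -> bool:
    """Return True if any query parameter value matches any deny rule."""
    query = urlsplit(url).query
    for _name, value in parse_qsl(query, keep_blank_values=True):
        if any(rule.search(value) for rule in DENY_RULES):
            return True
    return False


if __name__ == "__main__":
    for url in (
        "/portal?account=12345",
        "/portal?file=../../etc/passwd",
        "/portal?q=1+UNION+SELECT+password",
    ):
        print(url, "blocked" if is_blocked(url) else "allowed")
```

Every request pays for every rule, and the filter only knows what the patterns say, which is roughly the level of sophistication we are talking about.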

The team I worked in maintained the bank’s WAFs, called DenyAll if I recall (it no longer exists). For every change the developers made to the banking portals, we had to change the rules at the same time, or the requests wouldn’t go through (that actually happened often, and infuriated many sysadmins who had to do rollbacks). Those rules were incredibly verbose, complex, and unreadable. It was usual to have one regular expression per URL covering each query parameter that the web endpoint would support. And, of course, the developers did not standardize the format of their URLs or query parameters (to be fair, they had to deal with a web portal written in pure C… they had their own battles to fight), so we often needed dozens upon dozens of regular expressions to cover every permutation of the parameters. The result was large files containing dozens, sometimes hundreds, of regular expressions that the WAFs would eat up and apply to every single incoming request, and at times to the outgoing response.

You can probably imagine where this is going. At some point, the development team in charge of the trading application made a change to their endpoint, and the quality assurance team reviewed those changes before they went out to production. A junior engineer in QA decided to test out a few security cases, probably just for fun since we didn’t have security testing in QA back then, and to their surprise discovered that they could use any flavor of directory traversal, SQL injection or any other attack they could think of without being blocked. The WAF wasn’t blocking anything at all.

They raised the alarm and my team started digging through the configuration files. It took us, if I recall, a couple of days to go through these enormous files, some of which had over 600 regular expressions in them. We had to copy and paste individual regexes into separate test files to break them down into their components and run example queries against them. PCREs are complex and much more powerful than what modern regexp engines, like Go’s, can do. They allow things like positive and negative lookbehind, which are powerful, but their backtracking matching can also make compute time explode exponentially. In many cases we couldn’t understand the expressions just by reading them, and had to run them against the URLs and query parameters they were meant to process: we would build test cases and try to find out where a regular expression failed to filter a given parameter. A full audit of thousands of massively complex regular expressions is about as much fun as it sounds. Eventually we discovered that one of the regular expressions, buried somewhere around line 415 of the trading endpoint’s configuration file, had a typo: a simple .+ sequence was improperly escaped and effectively allowed any query parameter on any URL to pass through the entire filter without any limitation.
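I no longer have the offending rule, so what follows is a made-up reconstruction of this class of bug, not the real expression. Assume a hypothetical rule that only lets through a ticker-style value such as symbol=BNP.PA, and a second copy where the backslash in front of .+ has been lost. The sketch uses Python's re module instead of PCRE, and the names INTENDED, BROKEN and is_allowed are invented for the example.

```python
import re

# Hypothetical rule: letters, one or more literal dots, two letters.
INTENDED = r"^symbol=[A-Z]{1,5}\.+[A-Z]{2}$"

# Same rule with the backslash dropped: ".+" now means "one or more of anything".
BROKEN = r"^symbol=[A-Z]{1,5}.+[A-Z]{2}$"


def is_allowed(pattern: str, query: str) -> bool:
    """Return True if the query string passes the given allow rule."""
    return re.match(pattern, query) is not None


if __name__ == "__main__":
    samples = [
        "symbol=BNP.PA",                  # legitimate value
        "symbol=BNP' OR 1=1 -- PA",       # SQL injection attempt
        "symbol=A/../../etc/passwdXX",    # directory traversal attempt
    ]
    for query in samples:
        print(
            f"{query!r}: intended={is_allowed(INTENDED, query)} "
            f"broken={is_allowed(BROKEN, query)}"
        )
```

Both rules accept the legitimate value, so a quick functional test passes either way. Only the negative test cases show that the broken rule waves the attacks through, which is why we ended up building test cases by hand for each expression.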

Six hundred lines of regular expressions, and the thousands of euros of compute needed to apply those filters to every incoming request and outbound response, were rendered entirely useless by a simple pair of characters buried in the middle of a 300-character obscure regular expression, in the middle of a configuration file that no one ever bothered auditing. Best of all: we couldn’t tell how long this had been in place, because the config files weren’t checked into source control.

And so, this is why I’m skeptical that any sort of security system, or really any sort of reliable functionality, built around regular expressions has much chance of surviving long-term. And it is why I asked my team to move away from using them for our own deny lists.
