Have you ever asked yourself: how do regexes works? Or, even better, have you ever thought that you could totally implement a regex engine that would be faster than what powers std::regex? If not, than you’re probably took a wrong turn somewhere on the interwebs and you should probably move along.
For those of you performance freaks that kept on reading, let me share some experiments I did with bringing a regex engine on the GPU using C++AMP.
The Aho-Corasick string matching algorithm
At the heart of our regex engine will be the Aho-Corasick (AC) string matching algorithm. Briefly, this algorithm builds a state machine resembling
a trie which is then used to match patterns on a given string. In other words, each state in the AC state machine corresponds to a character that can be matched given the pattern(s) fed to the algorithm. Between each pair of these states there can be an edge that either signifies that:
The two states linked together represents a valid sequence of characters in a matched string.
There was a problem fully matching a sequence so we’re backing up to the state which is the biggest prefix of what we succeeded to match.
As an example (shamelessly taken from Wikipedia), say we have the following pattern : “a|ab|bab|bc|bca|c|caa”. This will be used to construct this state machine:
If you prefer code to pictures, here’s how an AC-using regex would be set up:
In order, we have:
mTransitionMatrix: The state machine represented as an adjacency matrix where rows correspond to states an columns correspond to possible state transition given an ASCII character.
mOutputTable: A simple table to identify if a state is a complete match.
mDictionary: A list of all possible patterns that we’re currently trying to match.
Once we have the representation down, we need some helper functions.
The first one will be there to help us add a blank state to our state machine.
The second helper function will be responsible of adding new patterns to search for in our regex engine. This will be done by adding states when necessary and adding transitions between states based on input characters.
With these helper functions, we can now define our regex’s constructor.
One thing you may notice is that I didn’t introduce back edges in the AC state machine like I’m supposed to. This was intentional since we won’t be needing it for what we’re about to do.
Optionally, readers are encouraged to implement the AC failure function as an exercise. Another cool thing to do is to implement other special character like “+” that I left out of the present implementation.
(I won’t lie, this is quite fun to finally be the one leaving holes in his article for “educational” purpose and definitely not because I’m lazy).
Adding a match
Before continuing any further, I want to introduce the data type which will contain the result of a successful pattern match. There’s not really much to mention about it but it’s important to introduce it for the coming string matching algorithms.
Now back to the main event!
Right about now, you should be wondering: “How can we parallelize the computations done by the AC state machine?”. The answer is: let’s give up and fail!
When it's not simply failure
In all seriousness tough, failing, not giving a damn about it and moving on is actually the very premise behind the Parallel Failureless Aho-Corasick (PFAC) algorihtm (I encourage you to check out  and  where I first learned about this algorithm).
The way it works is as follows:
Have each thread start at a given index in the string to match.
If there’s a match, record it.
Whether there’s a match or not, move on to the next index for the thread
See, failure is not really failure when we can keep going forward (there seems to be a life lesson here but it’s totally lost on me).
An implementation of this idea would look like this:
Alright, last stop before we implement our regex engine on the GPU.
C++ Accelerated Massive Parallelism (C++AMP) is an open standard, meaning anybody can access it and implement it, (and some aready did) by Microsoft that specifies language extensions and a small library to enable (mostly) painless computations on the GPU. To give you an overview of this technology, here’s an annotated version (by yours truly) of the “Hello world” example on the C++AMP team’s blog.
There’s of course more to it that what I just showed but for our current needs it will be plenty enough.
Also, this technology seems to have been abandoned since 2014 (last time the C++AMP team’s blog has been updated), which is a damn shame if you ask me. Still, I had wanted to have a go at it for a long time and this project was the perfect excuse.
How to fail faster
Now that we finally have everything we need to implement PFAC on the GPU, let’s get to work!
First, we’ll create array_views of the structures (state machine and all) need for the regex search. Most of the time, this will be straightforward.
But we’ll need to give our transition matrix a special treatment. The fact is that C++AMP only deals with contiguous data structures. That means that a std::vector of std::array, basically a pointer to an array of pointers, can’t be used as is to initialize an array_view. However, fixing this problem is quite easy as we only need to flatten our transition matrix before constructing our view.
Then there’s the problem of transferring the string on which will perform the search to the GPU. You see, C++AMP data structures can only be typed with something whose size is a multiple of the size of an int. And clearly, char doesn’t fit that requirement. So we have two options here: wasting space by using ints to hold our chars or packing our chars into ints. It goes without saying to we’ll go for the second option.
Last but not least, we’ll define a structure to contain our pattern matches. This will be a matrix where the rows will represent a pattern to match and the column will represent a character in the input string. When a match will be made, we’ll write the length of the match in the cell corresponding to the pattern’s row and the beginning character’s column.
This may look like we’re doing do much redundant work here. Why not just build most of these views when constructing our regex engine? Well, the thing is that a lambda marked with restrict(amp) can’t capture this so member variables are a no go. This means that there’s pretty much no way that we could pre-initialize our structures on the GPU with C++AMP.
With that being said, let’s finally move on to the implementation of PFAC on the GPU.
For the experimental setup, I’ve decided to take a somewhat long text and create a pattern out of some of its characters. Can you guess which book I’ve used?
As our baseline, we’ll use std::regex_search. This will then be compared to both the multithreaded PFAC on the CPU and the C++AMP implementation of PFAC.
Now how well did we do against std::regex?
This did not go as planned
Unfortunately, not so great. Even when I add a timer to get the actual computation performed on the GPU (see below) what I record still doesn’t match std::regex’s performance.
Still a long way to go
What else could have I done?
Even tough the end result is a bit disappointing it is not entirely unexpected and there are two big reasons for that.
First, and I knew this when I started this little project, the algorithm that I’m using, the AC algorithm, is not the most efficient for implementing a regex engine. So the fact that I’m not starting from the most effective regex engine might’ve played against me.
The second explanation for the results we’ve seen that I can think of is that C++AMP wasn’t as easy to use as I thought and for all the syntatic sugar it offers, it doesn’t offer much when it comes to irregular computations on the GPU. It did make me think that eschewing syntactic sugar in favor of using lower level tools allowing us to have finer-grained control would be the way to go.
But not always
In the end, I still believe that the idea to parallelize regex matching based on the input string is a great idea and it kinda made me want to try to test it in a standard std::regex implementation.