When you’re working with regular languages specified in regular expression form, there’s a really cool idea that you can use for building regular expression matchers, and for describing how to convert from a regular expression to an NFA. It’s called the Brzozowski derivative of a regular expression – or just simply the derivative of a regexp.
The basic idea of the derivative is that given a regular expression, $r$, you can derive a new regular expression called the derivative with respect to symbol $c$, $D_c(r)$. $D_c(r)$ is a regular expression describing the rest of the strings matched by $r$ after it’s matched a $c$.
To define the derivative, we first need a helper, which we’ll call $\delta$. What $\delta$ does is tell us if a given regular expression can match the empty string. We’ll use it a few ways – both as a part of the derivative, and as part of the process of turning a regular expression into a finite state machine. For convenience, we’ll define it so that $\delta(r)$ returns $\varepsilon$ (a pattern matching the empty string) if $r$ can match the empty string, or $\emptyset$ (the null pattern, a regular expression which never matches anything) if it can’t.
Given a regular expression $r$, we define $\delta(r)$ as follows:
- if $r$ is $\emptyset$, then $\delta(r) = \emptyset$; since the void pattern can’t ever match anything, it doesn’t match the empty string.
- if $r$ is $\varepsilon$, then $\delta(r) = \varepsilon$; obviously, the pattern that (by definition) matches only the empty string does match the empty string.
- if $r$ is a single-character pattern $c$, then $\delta(r) = \emptyset$. A pattern which matches a specific single character can’t match anything but that single character – so it can’t match the empty string.
- if $r$ is a sequence $s\,t$, then $\delta(r) = \delta(s)\,\delta(t)$. A sequence matches the empty string if all of the elements of the sequence match the empty string.
- if $r$ is a choice $s \mid t$, then $\delta(r) = \delta(s) \mid \delta(t)$. A choice matches empty if any of its elements match empty.
- if $r$ is a starred regular expression, $s^*$, then $\delta(r) = \varepsilon$. Starred regular expressions are sequences of zero or more repetitions of some other pattern. Zero repetitions is the same as the empty string – so all starred patterns match empty.
To make that useful, we also need to define how empty and void patterns combine with other patterns:
- For any regular expression $r$, $\varepsilon\,r = r\,\varepsilon = r$; in a sequence, concatenating an empty pattern with any regular expression is equivalent to the regular expression without the $\varepsilon$ pattern.
- For any regular expression $r$, $\emptyset\,r = r\,\emptyset = \emptyset$. If you’ve got a sequence of void with a pattern, it’s equivalent to void.
- For any regular expression $r$, $\emptyset \mid r = r \mid \emptyset = r$. A choice between void and $r$ is equivalent to just $r$.
- .
- .
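
One way to keep these rules from being forgotten is to bake them into "smart constructors" that are used instead of the raw sequence and choice constructors. The following is only a sketch: the type `Re` and the names `seqRe`, `altRe`, and `delta` are made up for illustration, it uses binary sequence and choice, and it adds one extra assumption not listed above (a choice between two identical patterns collapses to one copy, so that a choice of two empty patterns is the empty pattern).

```haskell
-- Illustrative regexp type with binary sequence and choice (hypothetical names).
data Re = Void          -- matches nothing
        | Empty         -- matches only the empty string
        | Chr Char      -- matches one specific character
        | Seq Re Re
        | Alt Re Re
        | Star Re
        deriving (Eq, Show)

-- Sequence, with the empty and void rules applied.
seqRe :: Re -> Re -> Re
seqRe Void _  = Void      -- void followed by r is void
seqRe _ Void  = Void      -- r followed by void is void
seqRe Empty r = r         -- empty followed by r is r
seqRe r Empty = r         -- r followed by empty is r
seqRe r s     = Seq r s

-- Choice, with the void rule applied.
altRe :: Re -> Re -> Re
altRe Void r = r
altRe r Void = r
altRe r s
  | r == s    = r         -- extra assumption: r | r is just r
  | otherwise = Alt r s

-- delta as described in the text: returns Empty if the pattern can
-- match the empty string, and Void if it cannot.
delta :: Re -> Re
delta Void      = Void
delta Empty     = Empty
delta (Chr _)   = Void
delta (Seq r s) = seqRe (delta r) (delta s)
delta (Alt r s) = altRe (delta r) (delta s)
delta (Star _)  = Empty
```

Because `delta` builds its result through `seqRe` and `altRe`, its answer is always exactly `Empty` or `Void`, never a larger expression that merely happens to be equivalent to one of them.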
Now, it’s really easy to define the derivative:
- If r is the void pattern, then any derivative of it is void.
- If r is the empty pattern, then any derivative of it is void.
- If $r$ is a character pattern matching character $c$, then $D_c(r) = \varepsilon$, and the derivative of $r$ with respect to any other character is void.
- If $r$ is a choice pattern between $s$ and $t$, then for all characters $c$, $D_c(r) = D_c(s) \mid D_c(t)$.
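- If $r$ is a sequence pattern consisting of $s$ followed by $t$, then for all characters $c$, $D_c(r) = D_c(s)\,t \mid \delta(s)\,D_c(t)$. This one might need a bit of explanation. What that means is that for two patterns put together sequentially, you’ve got a choice. You could match the first pattern in the sequence – producing $D_c(s)$ followed by $t$. Or, if $s$ could match the empty string, then you can drop it, and match $D_c(t)$. With the rules for combining empty and void patterns with other patterns, the statement above using $\delta$ is the same thing as this explanation.
- If $r$ is a starred pattern $s^*$, then for all characters $c$, $D_c(r) = D_c(s)\,s^*$: you match one repetition of $s$ beginning with $c$, followed by zero or more further repetitions. (This is the rule that the StarRE case of the Haskell code below implements.)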
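
As a worked example of the sequence rule, here is the derivative of $(ab)^*\,ac$ with respect to $a$, writing $D_c$ for the derivative, $\delta$ for the empty-string helper, $\varepsilon$ for the empty pattern and $\emptyset$ for void. It treats the pattern as the sequence of $(ab)^*$ followed by $ac$, and it assumes the usual star rule $D_c(s^*) = D_c(s)\,s^*$, which is what the StarRE case of the Haskell code in this post computes.

```latex
\begin{align*}
D_a(ab) &= D_a(a)\,b \;\mid\; \delta(a)\,D_a(b)
         = \varepsilon\,b \;\mid\; \emptyset\,\emptyset
         = b \\
D_a(ac) &= D_a(a)\,c \;\mid\; \delta(a)\,D_a(c)
         = \varepsilon\,c \;\mid\; \emptyset\,\emptyset
         = c \\
D_a\big((ab)^*\,ac\big)
        &= D_a\big((ab)^*\big)\,ac \;\mid\; \delta\big((ab)^*\big)\,D_a(ac) \\
        &= D_a(ab)\,(ab)^*\,ac \;\mid\; \varepsilon\,D_a(ac) \\
        &= b\,(ab)^*\,ac \;\mid\; c
\end{align*}
```

The second alternative, $c$, only appears because $(ab)^*$ can match the empty string; dropping the $\delta$ term would lose every string that skips the starred part, such as "ac".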
The beauty of this is that it is really simple. A lot of the earlier mechanisms for decomposing regular expressions were rather complicated. This simple construct makes it very easy. For example, to convert a regular expression to a finite state machine, you do the following:
- Create an initial state, labeled with the complete regular expression $r$.
- While there are states $r_i$ in the machine which haven’t been processed yet:
  - For each character $c$ in the alphabet, compute the derivative $r_i' = D_c(r_i)$.
  - If there is a state $r_i'$ already in the machine, then add a transition from $r_i$ to $r_i'$ labeled with symbol $c$.
  - If there is no state $r_i'$, then add it, and add a transition from $r_i$ to $r_i'$ labeled with the character $c$.
- For each state in the machine labeled with a regular expression $r$, it is a final state if and only if $\delta(r) = \varepsilon$.
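
The construction above can be sketched as a worklist loop. Everything named here (`Re`, `deriv`, `Dfa`, `buildDfa`, `runDfa`) is hypothetical and chosen for illustration; the regexp type uses binary sequence and choice, and states are compared by plain structural equality after applying the empty/void simplifications. That is an assumption worth stating loudly: with only these simplifications the loop is not guaranteed to terminate for every pattern, since two equivalent derivatives can be written differently and be treated as different states.

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

data Re = Void | Empty | Chr Char | Seq Re Re | Alt Re Re | Star Re
  deriving (Eq, Ord, Show)

-- True if the pattern can match the empty string.
nullable :: Re -> Bool
nullable Void      = False
nullable Empty     = True
nullable (Chr _)   = False
nullable (Seq r s) = nullable r && nullable s
nullable (Alt r s) = nullable r || nullable s
nullable (Star _)  = True

-- Constructors that apply the empty/void simplification rules.
seqRe, altRe :: Re -> Re -> Re
seqRe Void _  = Void
seqRe _ Void  = Void
seqRe Empty r = r
seqRe r Empty = r
seqRe r s     = Seq r s
altRe Void r  = r
altRe r Void  = r
altRe r s | r == s    = r
          | otherwise = Alt r s

-- Derivative of a pattern with respect to one character.
deriv :: Char -> Re -> Re
deriv _ Void      = Void
deriv _ Empty     = Void
deriv c (Chr d)   = if c == d then Empty else Void
deriv c (Alt r s) = altRe (deriv c r) (deriv c s)
deriv c (Seq r s)
  | nullable r    = altRe (seqRe (deriv c r) s) (deriv c s)
  | otherwise     = seqRe (deriv c r) s
deriv c (Star r)  = seqRe (deriv c r) (Star r)

-- A machine whose states are labeled with regular expressions.
data Dfa = Dfa
  { start       :: Re
  , transitions :: Map.Map (Re, Char) Re
  , finals      :: Set.Set Re
  } deriving Show

-- Process unvisited states until none are left.
buildDfa :: [Char] -> Re -> Dfa
buildDfa alphabet r0 = go [r0] (Set.singleton r0) Map.empty
  where
    -- No states left to process: final states are the nullable ones.
    go [] seen trans = Dfa r0 trans (Set.filter nullable seen)
    go (r:todo) seen trans =
      let succs  = [ (c, deriv c r) | c <- alphabet ]
          trans' = foldr (\(c, r') -> Map.insert (r, c) r') trans succs
          -- Derivatives that are not yet states, without duplicates.
          new    = Set.toList (Set.fromList
                     [ r' | (_, r') <- succs, not (Set.member r' seen) ])
          seen'  = foldr Set.insert seen new
      in go (new ++ todo) seen' trans'

-- Run the machine on an input string.
runDfa :: Dfa -> String -> Bool
runDfa dfa = go (start dfa)
  where
    go r []     = Set.member r (finals dfa)
    go r (c:cs) = case Map.lookup (r, c) (transitions dfa) of
      Just r' -> go r' cs
      Nothing -> False   -- character outside the alphabet
```

For the pattern (ab)*ac over the alphabet "abc", this produces four states: the original pattern, its derivative with respect to a, the empty pattern (the only final state), and void, which acts as the dead state.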
For your amusement, I threw together a really quick implementation of the regular expression derivative in Haskell:

    data Regexp = CharRE Char
                | ChoiceRE [Regexp]
                | SeqRE [Regexp]
                | StarRE Regexp
                | VoidRE
                | EmptyRE
                deriving (Eq, Show)

    -- delta: can this pattern match the empty string?
    delta :: Regexp -> Bool
    delta (CharRE _) = False
    delta (ChoiceRE []) = False
    delta (ChoiceRE (re:res)) = delta re || delta (ChoiceRE res)
    delta (SeqRE []) = True
    delta (SeqRE (re:res)) = delta re && delta (SeqRE res)
    delta VoidRE = False
    delta EmptyRE = True
    delta (StarRE _) = True

    derivative :: Regexp -> Char -> Regexp
    derivative VoidRE _ = VoidRE
    derivative EmptyRE _ = VoidRE
    derivative (CharRE c) d = if c == d then EmptyRE else VoidRE
    derivative (SeqRE []) _ = VoidRE
    derivative (SeqRE (re:res)) c =
      -- viaHead: c is matched by the head of the sequence.
      -- viaTail: the head can match empty, so c starts the tail instead.
      let viaHead = case derivative re c of
                      VoidRE  -> VoidRE
                      EmptyRE -> SeqRE res
                      re'     -> SeqRE (re':res)
          viaTail = if delta re then derivative (SeqRE res) c else VoidRE
      in case (viaHead, viaTail) of
           (VoidRE, t) -> t
           (h, VoidRE) -> h
           (h, t)      -> ChoiceRE [h, t]
    derivative (ChoiceRE res) c =
      -- drop alternatives that can no longer match
      let derivs = filter (\x -> x /= VoidRE) (map (\r -> derivative r c) res)
      in case derivs of
           []   -> VoidRE
           [re] -> re
           _    -> ChoiceRE derivs
    derivative (StarRE r) c =
      case derivative r c of
        EmptyRE -> StarRE r
        VoidRE  -> VoidRE
        r'      -> SeqRE [r', StarRE r]

This can easily be used to implement a regular expression matcher. In fact, it can be used to build an RE matcher that’s nearly (not quite, but nearly) as efficient as a traditional table-based FSM implementation – only without the extra step of generating the table, and building code that can interpret it. Basically, for each input symbol, you just take the derivative of the expression. If it’s not the void expression, then go on to the next character, using the derivative to process the rest of the string. When you get to the end of your input, if $\delta$ of the final RE is $\varepsilon$, then you accept the string.
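
As a rough sketch of that matching loop: the names `Re`, `deriv`, `nullable`, and `matches` below are illustrative, not taken from the code above, and the type uses binary sequence and choice so the block stands alone. It assumes the empty/void simplifications are applied while deriving, so that a pattern which can no longer match anything shows up as exactly `Void` and matching can stop early.

```haskell
data Re = Void | Empty | Chr Char | Seq Re Re | Alt Re Re | Star Re
  deriving (Eq, Show)

-- True if the pattern can match the empty string.
nullable :: Re -> Bool
nullable Void      = False
nullable Empty     = True
nullable (Chr _)   = False
nullable (Seq r s) = nullable r && nullable s
nullable (Alt r s) = nullable r || nullable s
nullable (Star _)  = True

-- Constructors that apply the empty/void simplification rules.
seqRe, altRe :: Re -> Re -> Re
seqRe Void _  = Void
seqRe _ Void  = Void
seqRe Empty r = r
seqRe r Empty = r
seqRe r s     = Seq r s
altRe Void r  = r
altRe r Void  = r
altRe r s     = Alt r s

-- Derivative of a pattern with respect to one character.
deriv :: Char -> Re -> Re
deriv _ Void      = Void
deriv _ Empty     = Void
deriv c (Chr d)   = if c == d then Empty else Void
deriv c (Alt r s) = altRe (deriv c r) (deriv c s)
deriv c (Seq r s)
  | nullable r    = altRe (seqRe (deriv c r) s) (deriv c s)
  | otherwise     = seqRe (deriv c r) s
deriv c (Star r)  = seqRe (deriv c r) (Star r)

-- Take one derivative per input character; accept at the end of the
-- input if what is left can match the empty string.
matches :: Re -> String -> Bool
matches r []     = nullable r
matches r (c:cs) = case deriv c r of
  Void -> False            -- nothing can match from here, stop early
  r'   -> matches r' cs
```

With `abStarAc = Seq (Star (Seq (Chr 'a') (Chr 'b'))) (Seq (Chr 'a') (Chr 'c'))`, that is, (ab)*ac, `matches abStarAc "ababac"` is `True` and `matches abStarAc "ab"` is `False`.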
If you do this intelligently – i.e., you do something like memoize the derivative function, so that you’re not constantly recomputing derivatives – this ends up being a reasonably efficient way of processing regular expressions. (Memoization is a technique where you save the results of every invocation of a function, so that if you call it repeatedly with the same input, it doesn’t redo the computation, but just returns the same result as the last time it was called with that input.)
The obvious question about all of this is: why is this called the derivative?
If you think about differential calculus in continuous number mathematics, a simple explanation of the derivative of a function $f$ is a function $f'$, which tells you how the value of $f$ will change. The derivative of a regular expression is sort-of similar, if you squint at it just right: what it does is take a regular expression $r$, and show you how it changes.
Looks like a typo on the rule expressing that the sequence of void and r evaluates to epsilon, when it should express that void and r evaluates to void.
This relates to your post:
http://sebfisch.github.com/haskell-regexp/regexp-play.pdf
An absolutely beautiful paper on a time- and space-efficient FSM-based implementation of generalized weighted regexes in Haskell.
I know that in general, one can resolve a regex on an input to an algebraic equation in a semiring. I wonder if your derivative in fact relates to the conventional derivative of this equation….
It looks like there is a bug in the SeqRE handling. It would fail with regular expressions like (ab)*ac. For symbol ‘a’ it would produce derivative ‘b(ab)*ac’ instead of ‘b(ab)*ac|c’.
Without simplifications for VoidRE and EmptyRE the function should look more like this:

    derivative (SeqRE (re:res)) c =
      if delta re
        then ChoiceRE [SeqRE (derivative re c : res), derivative (SeqRE res) c]
        else SeqRE (derivative re c : res)