The simplest family of languages – and correspondingly, the simplest kind of computing machine – deal with regular languages. A regular language is very limited: there’s no counting, no deep patterns. They really don’t have much in the way of computational power. But they’re still very useful – the ubiquitous regular expression libraries in most programming languages are just easy ways of using regular languages to decompose simple patterned inputs.
The regular languages can be described in three ways: as regular grammars; as regular expressions; or as a kind of computing machine called a finite state machine or finite state automaton.
We’ll start with the machines. For each regular language, you can define a simple finite state machine that will determine whether or not a particular string is in the language. In a finite state machine, the machine has a small piece of state – a single atomic value, called the state, and it’s allowed to look at exactly one character of input. It’s not allowed to sneak a peak ahead; it’s not allowed to look at any previous characters to see how it got into its current state.
Let’s get precise. A FSM can be described by a tuple: , where:
- is a set of symbols, called an alphabet;
- is a set of states;
- is a special distinguished state called the start-state or initial state;
- is a subset of the states called the final states; and
- is a transition relation which maps pairs of machine state and input symbol to target states. This relation defines how the machine actually works: if there is a relation , that means that when the machine is in state , and it sees an input symbol , it will transition to state .
The machine starts to look at an input string in state . For each symbol in the input string in sequence, it performs a single transition, consuming that input symbol. When it’s consumed every symbol in the input string, if it’s in a state that’s part of the set , then it accepts the string.
For example, we can create a machine which accepts strings that consist of any string containing at least one a, followed by at least one b. In tuple form, that would be (, , , , t = { , , , , , , , }). In graphic form:
In the picture, the initial state is marked with a big blue arrow, and the final states are outlined in green.
Let’s step through a couple of examples.
- Suppose the input is “aaabb”. The machine starts in state . It consumes the first “a”, and follows the transition labelled “a” into state . The remaining input is “aabb”. It follows the transition from state labeled “a”, which again leads to state , leaving the remaining input a “abb”. It consumes another “a”, doing the same thing, leaving the remaining input “bb”. It consumes a “b”, following the transition labelled “b” to state , and leaving the remaining input as just “b”. Finally, it does the transition labeled from state , and again goes to state . The input is gone, and it’s in state . Since is one of the final states, the machine accepts the string.
- Suppose the input is “baab”. The machine starts in . It consumes the first , which moves it to state . The rest of the characters get consumed one at a time, but all of the transitions from in this machine return to : it’s a dead end. Once it’s consumed all of the characters, it will still be in , so it won’t accept the string.
- Suppose the input is the empty string. The machine starts in , and never runs any transitions at all, so it ends in state . Since isn’t a final state, it doesn’t accept the empty string.
When you look at this, it’s really a trivial machine. It seems like there’s very little that it can do. And yet, at least in theory, any computation that can be performed with a fixed, finite amount of state can be implemented with a machine this simple. Everything that I can do on the computer that I’m typing this on is doable using a finite state machine. It would have a huge number of states, and an insanely complex state transition function, but it would be doable.
Let’s move on to grammars. As I said in the last post, each of the Chomsky grammar levels can be characterized by some way of constraining the grammar. There are two versions of regular grammars, called right regular and left regular grammars. The idea of both is similar; we’ll use right-regular as an example.
- there is exactly one non-terminal on the left-hand side of the rule;
- There can at most one non-terminal symbol on the right-hand side of the rule;
- There can be at most one terminal symbol on the right-hand side of the rule
- If there’s both a non-terminal and a terminal symbol on the right-hand side of the rule, then the terminal will come before the non-terminal.
So, for example, the following are all valid right-regular grammar rules.(I’m writing terminals as boldface, and non-terminals as italics.)
But the following are not right regular:
- . (The non-terminal is in the middle of two terminals, not at the end.)
- . (The non-terminal is not at the end.)
There’s a clear connection between the structure of a regular grammar, and a finite state machine. If you think of each non-terminal symbol as a state, each terminal symbol as an the label on a transition between states, and final states as states reached by a rule without a non-terminal on the right-hand side – you’ve pretty much got the a finite state machine. They’re really similar, and you can see how the regular grammar restrictions constrain the grammar to exactly the same languages that you can recognize with a finite state machine.
FSMs are great, but if you want to describe a regular language, you probably don’t want to write one down. And grammars aren’t the easiest way to write down a regular language: for something so simple, they’re very verbose and not particularly easy to read. That’s why we have a third way of describing regular grammars: regular expressions. Regular expressions are compact, easy to read, and there’s a pretty clear correspondence between a regular expression and the corresponding regular grammar.
A regular expression is:
- A single character, e.g., “a”
- This is a literal match: the language matched by a single character regular expression is exactly that character.
- Concatenation
- If and are regular expressions, then is a regular expression. The language matched by is the concatenation of a string matched by followed by a string matched by .
- Alternation
- Alternation describes a choice. If and are regular expressions, then is a regular expression, which matches anything matched by either or .
- Repetition (aka Kleene closure)
- If is a regular expression, then is a regular expression. matches any sequence of zero or more strings matched by concatenated together.
You can also use grouping, written using parens, so that you can have alternation between concatenated groups.
A few examples of regular expressions:
- any string of any length made from the characters “a”, “b”, “c”, and “d”: “abcd”, “aaaa”, “ab”, “dabcbad”, etc.
- a string of any length made of “a”s and “b”s, followed by a string of any length of “c”s and “d”s. “ababbbacdcdcccc”, “ab”, “b”, “c”, “cdcdddcc”, “ddcc”.
- a string of any number of repetitions of “ab”, followed by any number of “c”s and “d”s. “ababababcccd”, “abcc”, “ab”, etc.
- strings consisting of at least one A; followed any number of “b”s (including zero); followed by any number of repetitions of “cd”; followed by either a single “e” or a single “f”.
In your SM example, if S_a is a final state I think the SM will accept strings with no ‘b’.
Yeah, if the input ist just “aa” the SM stops at S_a and accepts the string. I think it works if S_a is a non-final state.
Similarly, shouldn’t the machine accept “baab” if it’s supposed to accept “any string containing at least one a, followed by at least one b”? The current machine seems to accept only strings that start with a sequence of at least one a and then a sequence of b: aa*b*, I believe?
For the original goal you’d somehow need to make it recognize when it’s getting into a b-State after having been in a-State before – maybe two separate b-States? But I guess that would make the machine somewhat more complicated.
Sorry that my first comment here is so much nagging. I love this series!
The English description of your example finite state machine doesn’t match the machine very well. You said containing at least one A followed by at least one B. Instead of containing, which implies there can be other characters in there. I’d normally see it written as consisting of. Anyway, just a minor nit.
You’re getting to Brzozowski derivatives, right?
Unless I’m missing some subtle pun, we sneak a “peek”, not a “peak”…(and my mind immediately started a descent into musing about sneaking a “poke” as well, sorry….)
Fine exposition though.
Michael Sipser, in the book Introduction to the Theory of Computation, defines a regular language such as:
1. a symbol
2. the empty symbol
3. the empty set
4. union of regular languages
5. concatenation of regular languages
6. star of a regular language.
I believe that the omission of the empty symbol not mischaracterize the regular language (he could be generated by the star operation). However, rule 3 (the empty set) is definitely needed, since without it there is no way to define the empty language using regular expressions.
After all, great post. Many computer scientists are unaware of the importance of regular languages for computation.
This is another way (if you add “and nothing else is a regular language”) of saying that regular expressions as described above are a way to describe the regular languages (i.e. all of them).
It doesn’t really matter, but if I recall the standard notation is to write a list of tuples, not a single tuple with the transition relation defined separately. The standard notation has the advantage of conciseness.
Not that it takes away from your overall point, but a computer is a finite state machine (designed as a collection of smaller FSMs working together). Memory is a FSM. Processors are FSMs. Keyboard controllers are an FSMs. And so on.
Another thing worth noting is that the regular languages are those that can be recognized with a fixed amount of working memory. Or with no working memory, except what is required to read each input symbol.
You say,
A regular language is very limited: there’s no counting, no deep patterns.
(Sorry about my just-previous “comment”; I hit tab and enter by mistake.)
(Also, thanks for the material; my own education didn’t include much of anything about regular languages and their connection with FSMs.)
But, I have question(s) about counting. You say,
A regular language is very limited: there’s no counting, no deep patterns.
and then you go on to say that regular expression libraries can be described as regular languages.
But, I think the “ubiquitous” ones include counting: you can specify that you want to match a pattern zero or more, one, one or more, any number of occurrences within a range, or exactly some number of occurrences.
Don’t these all involve counting? Or is this not what you mean when you say there’s no counting? Or what?
I’m eager to see your next installment on these topics, because, despite my questions above, I do always come away understanding more than I did going in.
It’s a matter of what you mean by counting.
In a regular expression, you can specify any specific number, or any specific range of numbers of repetitions of a given sub-pattern. That’s well within the capability of a regular expression.But the number of repetitions is always statically specificed, and hard-coded into the regular expression.
What we mean by counting is something much stronger: the ability of the machine to actually count things requires a lot more capability. For a simple example of what I mean, you can imagine which consists of a number, followed by that number of x’s: “6xxxxxx” would be in the language; “12xxxxxx” would be in the language; but “6xx” would not. It’s not just a matter of parsing the number: you can use unary numbers, so that, for example, “iixx” and “iiixxx” are members of the language, but “iiixxxx” is not.
A finite state machine can’t process that kind of a language, and a regular expression can’t specify it. But the next class of machines, the pushdown automata, can.
Nitpick: the languages most regular expression libraries accept are not regular languages. I’m too lazy to do the thinking to come up with a good example, but probably something with backreferences.
I do believe posix regexes are mostly regular, but they’re not nearly as useful as perl style regexes.
S_a can’t be a final state, right? Then “a” would be accepted, even if there was no “b”. What am I missing?
Oh, I see this was already pointed out. Sorry.