39 Regular Expressions and Finite Automata

40 Regular Expressions and Finite Automata

Week 9, Session 1. CSS 343.

40.1 Warmup

Longest Common Prefix (#14). Compute the longest prefix shared by an array of strings.

class Solution {
 public:
  string longestCommonPrefix(vector<string>& strs) {
    if (strs.empty()) return "";
    for (int i = 0; i < (int)strs[0].size(); ++i) {
      char c = strs[0][i];
      for (int j = 1; j < (int)strs.size(); ++j) {
        if (i >= (int)strs[j].size() || strs[j][i] != c) {
          return strs[0].substr(0, i);
        }
      }
    }
    return strs[0];
  }
};

A vertical scan: column-major over the strings.

40.2 Learning objectives

Write a regex to match common patterns.
Use std::regex in C++ to validate input and extract subgroups.
Construct a DFA with labeled states and transitions from an English description.
Trace a DFA on an input string.
Explain why regex cannot match balanced parentheses.

40.3 Regex syntax — the essentials

Pattern	Matches
`a`	the literal `a`
`.`	any single character
`a*`	zero or more `a`s
`a+`	one or more `a`s
`a?`	zero or one `a`
`[abc]`	any one of `a`, `b`, `c`
`[^abc]`	any char NOT `a`, `b`, `c`
`[a-z]`	range
`\d`	digit
`\w`	word char
`\s`	whitespace
`^`	start of string
`$`	end of string
`a\\|b`	`a` or `b`
`(abc)`	group
`{n,m}`	between n and m repetitions

A simple email-like pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. Not a complete email regex — those are notoriously hairy.

40.4 `std::regex` in C++

#include <regex>

string s = "contact me at alice@uw.edu or bob@x.org";
regex re("(\\w+)@(\\w+\\.\\w+)");
smatch m;
auto it = s.cbegin();
while (regex_search(it, s.cend(), m, re)) {
  cout << "match: " << m[0] << ", user: " << m[1] << ", domain: " << m[2] << endl;
  it = m.suffix().first;
}

Backslashes are doubled because \ in a C++ string is the escape character. \\d in source is \d to the regex engine.

regex_match requires the entire string to match. regex_search finds the first occurrence. regex_replace substitutes.

std::regex in C++ is slow compared to other implementations (PCRE, Rust’s regex). Fine for one-off use, awful for hot loops. If you are processing millions of lines, link against PCRE or do string scanning by hand.

40.5 Deterministic Finite Automata (DFA)

A DFA has:

A finite set of states Q.
An alphabet Σ.
A start state q₀.
A set of accept states F ⊆ Q.
A transition function δ : Q × Σ → Q.

Walk the input character by character. End in an accept state → string accepted.

40.5.1 Example: strings over {a, b} ending in “ab”

States: q₀ (start), q₁ (just saw a), q₂ (saw ab, accept).

State	a	b
q₀	q₁	q₀
q₁	q₁	q₂
q₂	q₁	q₀

Accept: {q₂}.

Trace on “aab”: q₀ → q₁ → q₁ → q₂. Accept.

40.5.2 Example: binary strings divisible by 3

States are the remainders 0, 1, 2. Reading bit b from remainder r gives 2r + b mod 3.

State	0	1
0	0	1
1	2	0
2	1	2

Accept: {0}.

Trace on “110” (= 6): 0 → 1 → 0 → 0. Accept (6 is divisible by 3).

40.6 Implementing a DFA

struct DFA {
  map<pair<int, char>, int> delta;
  int start;
  set<int> accept;

  bool simulate(const string& input) const {
    int state = start;
    for (char c : input) {
      auto it = delta.find({state, c});
      if (it == delta.end()) return false;
      state = it->second;
    }
    return accept.count(state) == 1;
  }
};

40.7 NFA — nondeterministic

An NFA can be in multiple states at once. Transitions can be on the empty string (ε-transitions). For every NFA there is an equivalent DFA (subset construction). NFAs are usually easier to write; DFAs are easier to run.

40.8 Regular languages and the pumping lemma

A language is regular if some DFA recognizes it. Equivalently, some regex matches it. The set of regular languages is closed under union, intersection, complement, and concatenation.

Some languages are not regular. The classic non-regular language: {aⁿbⁿ : n ≥ 0} — strings of n a’s followed by n b’s. Intuitively, a regex would need to count, and a DFA has only finitely many states.

The pumping lemma is the formal tool: if L is regular, then long-enough strings in L can be “pumped” — repeating a middle section keeps them in L. For aⁿbⁿ, you can show that pumping breaks the count, so the language is not regular.

You need a more powerful machine: a pushdown automaton (PDA), which is a DFA plus a stack. PDAs recognize context-free languages — which includes balanced parentheses, arithmetic expressions, and most programming language syntax. We will hint at this in the next chapter.

40.9 Try it

Build a DFA for “strings over {0, 1} with an even number of 0s.” Two states: even (start, accept), odd. On 0: flip state. On 1: stay.

Implement and test on “0100100” (three 0s = odd, reject), “00110010” (four 0s = even, accept).

40.10 Self-check

1. Why can't a regular expression match balanced parentheses?

a. Regex is too slow.
b. Parentheses are special characters.
c. Regular languages cannot count arbitrary nesting; you need a stack (PDA).
d. Balanced parentheses are not context-free.

c. A DFA has finitely many states; matching (((((((...))))))) for unbounded depth requires unbounded state, which regular languages cannot supply.

2. Tracing the DFA "ends in ab" on the input "bba":

a. Accepts.
b. Rejects (ends in q₁).
c. Rejects (no transition defined).
d. Loops forever.

b. q₀ → q₀ (b) → q₀ (b) → q₁ (a). Ends in q₁, which is not accepting.

40.11 Challenges

Build a DFA for “strings ending in 01.” Implement and test.
Use std::regex to extract all phone numbers from a paragraph.
Prove (sketch) that {aⁿbⁿ} is not regular using the pumping lemma.
Implement an NFA-to-DFA subset construction. (Hard.)

40.12 Where this fits

Regular languages are the bottom rung of the Chomsky hierarchy. Next chapter climbs up: Turing machines and the limits of computation itself.

You are here	Coming up
Regex, DFA, NFA, regular languages, pumping lemma	Chapter 38: Turing machines and computability