Chapter 1: Regular Languages
The theory of computation begins with a question: What is a computer?
Instead of the complicated real computers, we use an idealized computer called a computational model.
We begin with the simplest model, called the finite state machine or finite automaton.
Contents
1.1 Finite Automata
(此处特指 DFA (Deterministic Finite Automaton)
Definition
A finite automaton is a 5-tuple \( (Q,\Sigma,\delta,q_0,F)\), where
- Q is a finite set called the states,
-
\( \Sigma\) is a finite set called the alphabet,
-
\( \delta: Q \times \Sigma \longrightarrow Q\) is the transition function ,
-
\( q_0 \in Q\) is the start state, and
-
\( F \subseteq Q\) is the set of accept states.
Computation
Let \(M = (Q, \Sigma, \delta, q_0, F)\) be a finite automaton and Let \(w = w_1w_2 \cdots w_n\) be a string where each \(w_i\) is a member of the alphabet \(\Sigma\). Then M accepts w if a sequence of states \(r_0,r_1,\dots ,r_n\) in Q exists with three conditions:
1. \( r_0 = q_0\)
2. \( r_{i+1} = \delta (r_i, w_{i+1})\), for i = 0,…,m – 1, and
3. \( r_n \in F\).
if \(A = \{ w | \text{M accepts w} \}\)
We say that M recognizes language A.
Regular languages
A language is called a regular language if some finite automaton recognizes it.
Regular Operations
Definition:
Let A and B be languages. We define the regular operations union, Intersection, concatenation, and star as follows:
- Union: \( A \cup B = \{x | x \in A \text{ or } x \in B \}\).
-
Intersection: \( A \cap B = \{x | x \in A \text{ and } x \in B \}\).
-
Concatenation: \( A \circ B = \{xy | x \in A \text{ and } y \in B \}\).
-
Star: \( A^* = \{x_1x_2 \dots x_k | k \ge 0 \text{ and each } x_i \in A \}\).
(the empty string \( \varepsilon\) is always a member of \( A^*\), no matter what A is.)
Theorem 1.26:
The class of regular languages is closed under regular operations.
如果是证明在并下封闭,则要构造 M 识别 \(A_1 \cup A_2\), 可以通过笛卡儿积模拟同时跑 \(A_1\) 和 \(A_2\) 的 DFA,若其中一个 DFA accepts 则 M accepts。
但如果是证明在连接下封闭,则要构造 M 识别 \(A_1 \circ A_2\),则要猜测在哪里断开 w,使得前一部分属于 \(A_1\),后一部分属于 \(A_2\)。
这时候就显示出 DFA 的死板了
于是提出了 NFA (Nondeterministic Finite Automaton)
1.2 Nondeterminism
Definition
A nondeterministic finite automaton is a 5-tuple \( (Q,\Sigma,\delta,q_0,F)\), where
- Q is a finite set of states,
-
\( \Sigma\) is a finite alphabet,
-
\( \delta: Q \times \Sigma _\varepsilon \longrightarrow P(Q)\) is the transition function ,
-
\( q_0 \in Q\) is the start state, and
-
\( F \subseteq Q\) is the set of accept states.
Computation
Let \( N = (Q,\Sigma ,\delta ,q_0,F)\) be an NFA and w a string over the alphabet \( \Sigma\). Then we say that N accepts w if we can write w as \( w = y_1y_2\dots y_m\), where each \( y_i\) is a member of \( \Sigma _\varepsilon\) and a sequence of states \( r_0,r_1,\dots ,r_m\) exists in Q with three conditins:
1. \( r_0 = q_0\)
2. \( r_{i+1} \in \delta (r_i,y_{i+1})\), for i = 0,…,m – 1, and
3. \( r_m \in F\).
Difference between NFA and DFA
- 当前有多个状态
如果把它们组合成一个集合,那么这个集合里的每个状态都会转化成一个集合,再并起来,得到另外一个集合 -
多了 \( \varepsilon\) 箭头
可以理解为“捆绑销售”,即你“买”了一个状态的同时也必须买被 \( \varepsilon\) 箭头“捆绑”的状态
也可以理解为 不依赖输入地额外地做出猜测(NFA 理解为可以做出多种猜测并全部验证)
也可以理解为 fork()函数(NFA 理解为多进程)
Relationship of NFA and DFA
DFA \( \iff \) NFA,即两种机器是等价,可以相互转化的
我们的目的是 DFA,但是 DFA 太死板了,设计起来很麻烦
于是以 NFA 为手段,NFA 设计起来简单很多,设计出 NFA 后,就隐式地设计出 DFA 了
这一点在定理 1.26 的证明里得到了充分的体现
咋看之下可能会觉得很神奇,非确定性自动机与确定性自动机的能力范围竟然是一样的
但真正理解 DFA 的话
就会知道,如果把 NFA 里的当前的多个状态组合成一个集合,即令“当前状态”是一个状态集合
那么状态转移就是从一个集合转移成另一个集合,一对一,这不就是 DFA 了吗?
QED.
1.3 Regular Expressions
Definition
Say that R is a regular expression if R is
- a, for some a in the alphabet \( \Sigma\),
-
\( \varepsilon\),
-
\( \emptyset\),
-
\( (R_1 \cup R_2)\), where \( R_1\) and \( R_2\) are regular expressions,
-
\( (R_1 \circ R_2)\), where \( R_1\) and \( R_2\) are regular expressions, or
-
\( (R_1^*)\), where \( R_1\) is a regular expression.
Difference between \( \emptyset\) and \( \varepsilon\):
expression \( \varepsilon\) represents the language containing a single string–namely, the empty string
\( \emptyset\) represents the language that doesn’t contain any strings.
\( 1^* \emptyset = \emptyset \quad \emptyset ^* = { \varepsilon }\)
Concatenating the empty set to any set yields the empty set.
Theorem 1.54
是正则语言 \( \iff \) 能被正则表达式描述
再结合正则语言的定义
即能被 DFA/NFA 识别 \( \iff \) 能被正则表达式描述
证明:
两个方向:
能被正则表达式描述 \( \longrightarrow \) 是正则语言:
因为正则表达式是递归定义的
首先易证三个 base case 是正则语言
然后因为正则语言在正则操作下封闭
QED.
能被 DFA/NFA 识别 \( \longrightarrow\) 能被正则表达式描述:*
为了方便把 DFA 转化成正则表达式,首先引入 **GNFA (Generalized Nondeterministic Finite Automaton):
GNFA 在 NFA 的基础上额外多了三点要求:
1. 开始状态不能为接受状态,且入度为 0,作为“起点”
2. 只有一个接受状态,且其出度为 0,作为“终点”
3. 除了以上两个状态,任意一个状态都有箭头到达任意一个状态(包括自身)
于是可以把 DFA 转化成 GNFA:
添加新的开始状态、接受状态,使其满足 GNFA 的性质
添加额外箭头(其上的 label 为 空集)
接着把 GNFA 转化成正则表达式:
即要把任意一条从开始状态(起点)到接受状态(终点)之间的路径上的串(即被接受的串)并成一个正则表达式
递归地不断删除一个状态(合并串),当最终只剩下起点和终点时,就只剩下了一个箭头,其上的 label 就是与其等价的正则表达式
CONVERT(G):
1. Let k be the number of states of G.
2. If k = 2, then G must consist of a start state, an accept state, and a single arrow connecting them and labeled with a regular expression R. Return the expression R
3. If k > 2, we select any state \( q_{rip} \in Q\) different from \( q_{start} \text{ and } q_{accept}\),
and let G’ be the \( \text{GNFA}(Q’,\Sigma,\delta ‘,q_{start},q_{accept}) \) , where \( Q’ = Q – \{ q_{rip} \} \),
and for any \( q_i \in Q’ – \{ q_{accept} \}\) and any \( q_j \in Q’ – \{ q_{start} \}\), let
\( \varepsilon ‘(q_i,q_j) = (R_1)(R_2)^*(R_3)(R_4)\),
for \( R_1 = \delta (q_i,q_{rip}) \), \( R_2 = \delta (q_{rip},q_{rip}) \), \( R_3 = \delta (q_{rip},q_j)\), and \( R_4 = \delta (q_i,q_j) \).
4. Compute CONVERT(G’) and return this value.
1.4 Nonregular Languages
Let’s take the language \( B = \{ 0^n 1^n | n \ge 0 \}\)
Because the number of 0s isn’t limited, the machine will have to keep track of an unlimited number of possibilities.
But is cannot do so with any finite number of states.
But other languages seem to require an unlimited number of possibilities, yet actually they are regular.
For example, \( D = \{ w | \text{ w has an equal number of occurrences of 01 and 10 as substrings}\}\).
Pumping Lemma 告诉我们每个正则语言都会满足一个特殊的性质(必要条件)
于是要证明一个语言不是正则语言时,我们就可以通过反证法:
首先假设该语言是正则语言,然后说明该语言不满足 Pumping Lemma,QED
Pumping Lemma
If A is a regular language, then there is a number p (the pumping length) where if s is any string in A of length at least p, then s may be divided into three pieces, s = xyz, satisfying the following conditions:
1. for each \( i \ge 0, xy^i z \in A\),
2. \( |y| \ge 0\), and
3. \( |xy| \le p\).
Proof:
我们可以令 p 为 DFA 的状态数
当 DFA 接受一个长度至少为 p 的串时,根据鸽巢原理,至少存在一个状态重复出现,即出现环
于是我们可以绕这个环走任意圈,再到达接受状态
令环之前的串为 x,环上的串为 y,环后的串为 z,则 \(xy^iz\) 同样会被该 DFA 接受
Example:
Let B be the language \(\{ 0^n1^n | n \ge 0 \}\).
Assume to contrary that B is regular.
Let p be the pumping length given by the pumping lemma.
Choose s to be the string \(0^p1^p\).
Because of \( |xy| \le p\), y must be 0, then y can not be pumped, which is a contradiction.