Context-Free Grammars - Models of Computation

Previously, we have seen how to generate a language using regular expressions. A context-free grammar (CFG) is a different way of generating languages, and it is strictly more powerful: every language that can be generated by a regular expression can also be generated by a CFG, but some languages, such as $\{0^n1^n \mid n \geq 0\}$ , can be generated by a CFG and not by any regular expression.

CFGs are widely used in Computer Science. For example, they are used to define the syntax of programming languages, markup languages and data formats (e.g. XML and JSON).

In this subject, we have actually used CFGs to define regular expressions. Well-formed formulas can also be defined using CFGs.

5.1.1 Definition of CFG

Think of production rules as substitution rules. In particular, a production rule $A \rightarrow w$ tells us that we can substitute $A$ with $w$ . We remark that CFGs allow recursion since the right-hand side of a rule can contain the variable that is on the left-hand side of the rule. In particular, the variable $A$ can also occur in the string $w$ . This is evident in the upcoming example.

Example 1 (CFG for arithmetic expressions)

Consider arithmetic expressions over the numbers 0, …, 9 with addition, subtraction, multiplication, division and parentheses. For example,

$$(1 + 3) \times 5 + 8 \tag{1}$$

The syntax of arithmetic expressions can be expressed as a CFG:

$\Sigma$ consists of:
- the numbers 0, 1, …, 9
- the operators $+$ , $-$ , $\times$ and $/$
- parentheses ( and )
$V$ consists of $\textsf{Op}$ and $\textsf{Expr}$ , for operators and expressions, respectively
The production rules are:

\begin{align*}
\textsf{Expr} & \rightarrow 0 \\
\textsf{Expr} & \rightarrow 1 \\
& \vdots \\
\textsf{Expr} & \rightarrow 9 \\
\textsf{Expr} & \rightarrow \textsf{Expr Op Expr} \\
\textsf{Expr} & \rightarrow (\textsf{Expr}) \\
\textsf{Op} & \rightarrow + \\
\textsf{Op} & \rightarrow - \\
\textsf{Op} & \rightarrow \times \\
\textsf{Op} & \rightarrow / \tag{2}
\end{align*}

The start variable is $\textsf{Expr}$ .

Notational shorthand

Instead of listing each rule separately, we will group together rules with the same left-hand side in a single line separated by vertical bars.

We will also sometimes specify a grammar by specifying only its production rules.
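As an illustration of the vertical-bar shorthand, the rules of the arithmetic-expression grammar from Example 1 could be grouped as follows. This is a sketch of the convention applied to that grammar, not necessarily the exact form used elsewhere in these notes.

\begin{align*}
\textsf{Expr} & \rightarrow 0 \mid 1 \mid \cdots \mid 9 \mid \textsf{Expr Op Expr} \mid (\textsf{Expr}) \\
\textsf{Op} & \rightarrow + \mid - \mid \times \mid /
\end{align*}

Each line lists every right-hand side for one variable, so the two lines above stand for the same rules as the longer listing in Example 1.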

5.1.2 Generating strings and languages

Given a grammar $G$ , we can generate strings by starting with the start variable and iteratively replacing variables using the production rules (indeed, production rules are sometimes called substitution rules) until we end up with a string that consists only of terminals:

1. Initialize the output string $x = S$ , i.e. the output string $x$ consists of the start variable $S$ .
2. While the string $x$ contains at least one variable, apply one of the rules in $R$ to replace a single occurrence of some variable. In particular, applying the rule $A \rightarrow w$ to an occurrence of $A$ replaces it with the string $w$ .
3. The string at the end is the generated string.

By making different choices of which production rule to apply and which occurrence of the variable to apply it to, we can generate different strings.
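The procedure above can be sketched in Python. The encoding is an assumption made for illustration: a grammar is a dictionary mapping each variable to a list of right-hand sides, each right-hand side is a list of symbols, and the names `GRAMMAR` and `generate` are hypothetical. The choices of rule and occurrence are made at random here, which is just one way of making them.

```python
import random

# Hypothetical encoding of the grammar from Example 1.
# "*" is an ASCII stand-in for the multiplication sign.
GRAMMAR = {
    "Expr": [[d] for d in "0123456789"]
            + [["Expr", "Op", "Expr"], ["(", "Expr", ")"]],
    "Op": [["+"], ["-"], ["*"], ["/"]],
}

def generate(grammar, start, rng):
    """Return the generated string and every intermediate string.

    Strings are lists of symbols; a symbol is a variable exactly when it is
    a key of `grammar`.
    """
    x = [start]                      # step 1: x consists of the start variable
    steps = [list(x)]
    while True:
        positions = [i for i, s in enumerate(x) if s in grammar]
        if not positions:            # no variables left: x is the result
            return x, steps
        i = rng.choice(positions)    # which occurrence to replace
        rhs = rng.choice(grammar[x[i]])  # which rule A -> w to apply
        x = x[:i] + rhs + x[i + 1:]  # replace that occurrence of A by w
        steps.append(list(x))

result, steps = generate(GRAMMAR, "Expr", random.Random(0))
print(" ".join(result))
```

With uniformly random choices, most applicable rules for `Expr` produce a single digit, so the loop terminates quickly in practice. A different seed generally yields a different string.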

Example 3 (Generating a string with the CFG for arithmetic expressions)

Step 1: Start with the start variable

$$\textsf{Expr} \tag{4}$$

Step 2: Apply the rule $\textsf{Expr} \rightarrow \textsf{Expr Op Expr}$ to replace $\textsf{Expr}$ with $\textsf{Expr Op Expr}$

$$\textsf{Expr Op Expr} \tag{5}$$

Step 3: Apply the rule $\textsf{Expr} \rightarrow 1$ to the first occurrence of $\textsf{Expr}$

$$1 \ \textsf{Op Expr} \tag{6}$$

Step 4: Apply the rule $\textsf{Op} \rightarrow +$

$$1 + \textsf{Expr} \tag{7}$$

Step 5: Apply the rule $\textsf{Expr} \rightarrow 4$

$$1 + 4 \tag{8}$$

Since the string $1 + 4$ no longer has any variables, it is the generated string.

Note that if we had chosen to apply the rule $\textsf{Expr} \rightarrow 3$ in Step 3 instead, then we would have obtained the string $3 + 4$ .

Terminology and notation

Let $x,y,z$ be strings in $(\Sigma \cup V)^*$ , i.e. strings of terminals and variables, and $A$ be a variable in $V$ . We say that applying the rule $A \rightarrow y$ to the string $xAz$ yields $xyz$ , and use the notation $xAz \Rightarrow xyz$ .

Definition 2 (Derivations)

Let $x,z$ be strings in $(\Sigma \cup V)^*$ . We say that $x$ derives $z$ (or $z$ derives from $x$ ) and use the notation $x \Rightarrow^* z$ iff there is a finite sequence of production rules that can be applied to obtain $z$ from $x$ . Formally, $x \Rightarrow^* z$ iff:

$x = z$ , or
there exist $k \geq 0$ and strings $y_1, \ldots, y_k \in (\Sigma \cup V)^*$ such that $x \Rightarrow y_1 \Rightarrow \ldots \Rightarrow y_k \Rightarrow z$ (when $k = 0$ , this is simply $x \Rightarrow z$ ).

The sequence $x \Rightarrow y_1 \Rightarrow \ldots \Rightarrow y_k \Rightarrow z$ is called a derivation. When $x$ is the start variable, we call the sequence a derivation of $z$ .
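For instance, the steps of Example 3 can be written in this notation as the following derivation of $1 + 4$ :

\begin{align*}
\textsf{Expr} & \Rightarrow \textsf{Expr Op Expr} \\
& \Rightarrow 1 \ \textsf{Op Expr} \\
& \Rightarrow 1 + \textsf{Expr} \\
& \Rightarrow 1 + 4
\end{align*}

Here $k = 3$ intermediate strings are used, and we conclude that $\textsf{Expr} \Rightarrow^* 1 + 4$ .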

We refer to strings in $(\Sigma \cup V)^*$ as sentential forms and strings in $\Sigma^*$ as sentences. Note that sentences are sentential forms without any variables.
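The yields relation and the notion of a derivation can also be checked mechanically. The following is a minimal sketch under assumed conventions: sentential forms are tuples of symbols, the grammar is a dictionary from variables to lists of right-hand sides, and the names `RULES`, `one_step` and `is_derivation` are hypothetical.

```python
# Hypothetical encoding of the grammar from Example 1.
# "*" is an ASCII stand-in for the multiplication sign.
RULES = {
    "Expr": [[d] for d in "0123456789"]
            + [["Expr", "Op", "Expr"], ["(", "Expr", ")"]],
    "Op": [["+"], ["-"], ["*"], ["/"]],
}

def one_step(x, rules):
    """Return the set of all sentential forms y such that x => y."""
    result = set()
    for i, symbol in enumerate(x):
        # Terminals have no rules, so only variables contribute.
        for rhs in rules.get(symbol, []):
            result.add(x[:i] + tuple(rhs) + x[i + 1:])
    return result

def is_derivation(forms, rules):
    """Check that each form yields the next one in a single step."""
    return all(b in one_step(a, rules) for a, b in zip(forms, forms[1:]))

# The sentential forms from Example 3.
example_3 = [
    ("Expr",),
    ("Expr", "Op", "Expr"),
    ("1", "Op", "Expr"),
    ("1", "+", "Expr"),
    ("1", "+", "4"),
]
print(is_derivation(example_3, RULES))
```

A sequence containing a single sentential form passes the check trivially, which mirrors the case $x = z$ in Definition 2.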

5.1.3 Parse trees

Parse trees let us visualize derivations of strings, and more importantly, represent the syntactic structure of a string. For example, parse trees tell us the order of operations of arithmetic expressions.

Ambiguity

A grammar is ambiguous when some string it generates has more than one parse tree, i.e. there is more than one way to parse the string.[4]
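To make this concrete, the sketch below builds two different parse trees for the same string under the grammar of Example 1. The tree encoding is an assumption made for illustration: an internal node is a pair of a variable and a list of children, a leaf is a terminal given as a string, and the helper names are hypothetical.

```python
# "*" is an ASCII stand-in for the multiplication sign.

def tree_yield(tree):
    """Concatenate the leaves of a parse tree from left to right."""
    if isinstance(tree, str):        # a leaf is a terminal symbol
        return tree
    _variable, children = tree
    return "".join(tree_yield(child) for child in children)

def digit(d):
    """Parse tree for an application of the rule Expr -> d."""
    return ("Expr", [d])

def binary(left, op, right):
    """Parse tree for Expr -> Expr Op Expr with the given subtrees."""
    return ("Expr", [left, ("Op", [op]), right])

# The addition is nested deeper, so the tree groups as (1 + 2) * 3.
tree_a = binary(binary(digit("1"), "+", digit("2")), "*", digit("3"))
# The multiplication is nested deeper, so the tree groups as 1 + (2 * 3).
tree_b = binary(digit("1"), "+", binary(digit("2"), "*", digit("3")))

print(tree_yield(tree_a), tree_yield(tree_b), tree_a != tree_b)
```

Both trees have the same leaves read left to right, yet they suggest different orders of operations, which is exactly the kind of ambiguity described above.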

5.1.4 Closure Properties

Just as for regular languages, we are interested in whether CFLs are closed under the operations on languages that we have seen before, as these operations let us define new languages. The closure properties will be useful later on for proving that regular languages are context-free (Theorem 2).

Previously, we proved closure of regular languages under these operations by considering finite automata. While there is an automata model equivalent to CFGs, called pushdown automata, it is easier to prove these closure properties directly using CFGs.

Proof 1 (Proof ideas for union and Kleene closure)

We begin by proving closure under union.

Let $A$ and $B$ be two context-free languages over the same alphabet $\Sigma$ , generated by grammars $G_A$ and $G_B$ , respectively. Let $V_A, R_A, S_A$ and $V_B, R_B, S_B$ denote the variables, production rules and start variables of $G_A$ and $G_B$ , respectively.

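We first construct a new grammar $H_1$ that generates $A \cup B$ . We may assume that $V_A$ and $V_B$ are disjoint, renaming variables if necessary, so that the rules of one grammar cannot be applied to the variables of the other. The new grammar has the variables and production rules from both grammars, plus its own start variable $S_1$ and an additional rule. More formally, its variables are $V_A \cup V_B \cup \{S_1\}$ , where $S_1$ is a new variable, and its production rules consist of $R_A \cup R_B$ plus the rule $S_1 \rightarrow S_A \mid S_B$ .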

Next, we construct a new grammar $H_2$ that generates $A^*$ . The new grammar has the variables and production rules from $G_A$ , plus its own start variable $S_2$ and an additional rule. More formally, its variables are $V_A \cup \{S_2\}$ , where $S_2$ is a new variable, and its production rules consist of $R_A$ plus the rule $S_2 \rightarrow \epsilon \mid S_AS_2$ .
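The two constructions can be sketched in Python. The encoding is an assumption made for illustration: a grammar is represented only by a dictionary of production rules (variable to list of right-hand sides) together with its start variable, the empty list stands for $\epsilon$ , and the function names are hypothetical. The sketch rejects inputs whose variable names clash rather than renaming them.

```python
def union_grammar(rules_a, start_a, rules_b, start_b, new_start="S1"):
    """Build rules for H_1 from G_A and G_B by adding S_1 -> S_A | S_B."""
    if set(rules_a) & set(rules_b):
        raise ValueError("variable sets must be disjoint; rename first")
    if new_start in rules_a or new_start in rules_b:
        raise ValueError("the new start variable must be fresh")
    rules = {**rules_a, **rules_b}             # R_A together with R_B
    rules[new_start] = [[start_a], [start_b]]  # S_1 -> S_A | S_B
    return rules, new_start

def star_grammar(rules_a, start_a, new_start="S2"):
    """Build rules for H_2 from G_A by adding S_2 -> epsilon | S_A S_2."""
    if new_start in rules_a:
        raise ValueError("the new start variable must be fresh")
    rules = dict(rules_a)                      # R_A
    rules[new_start] = [[], [start_a, new_start]]  # [] encodes epsilon
    return rules, new_start

# Example inputs: G_A generates {0^n 1^n}, G_B generates {1^n 0^n}.
G_A = {"S_A": [["0", "S_A", "1"], []]}
G_B = {"S_B": [["1", "S_B", "0"], []]}

H1, S1 = union_grammar(G_A, "S_A", G_B, "S_B")
H2, S2 = star_grammar(G_A, "S_A")
print(H1[S1], H2[S2])
```

Neither function modifies its inputs, so the same `G_A` can be reused in both constructions.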

5.1.5 Regular Languages are Context-Free

Proof 2 (Proof Ideas)

Let $A$ be a regular language and $R$ be a regular expression that generates $A$ . There are several cases based on the recursive definition of regular expressions:

1. $R$ is an alphabet symbol $a$ or $\epsilon$ .
2. $R = R_1 \cup R_2$ where $R_1$ and $R_2$ are smaller regular expressions.
3. $R = R_1R_2$ where $R_1$ and $R_2$ are smaller regular expressions.
4. $R = R_1^*$ where $R_1$ is a smaller regular expression.

Case 1 is easy: If $R = a$ , the equivalent CFG is $S \rightarrow a$ , and if $R = \epsilon$ , the equivalent CFG is $S \rightarrow \epsilon$ .

Cases 2 to 4 can be handled by induction on the structure of $R$ , with case 1 as the base case. Assuming that we have CFGs for $R_1$ and $R_2$ , we can obtain a CFG for $R$ using the fact that CFLs are closed under union, concatenation and Kleene closure (Theorem 1).
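The induction can be read as a recursive translation from regular expressions to grammars. The sketch below is one possible rendering, under assumptions made for illustration: a regular expression is a nested tuple such as `("union", r1, r2)`, variables are named `S0`, `S1`, and so on (so terminals are assumed not to use those names), and the empty list stands for $\epsilon$ . The rule used for concatenation, $S \rightarrow S_1S_2$ , is the standard construction and is not spelled out in the proof ideas above.

```python
from itertools import count

def regex_to_cfg(regex):
    """Return (rules, start) for a regular expression given as a nested tuple.

    Supported shapes: ("eps",), ("sym", a), ("union", r1, r2),
    ("concat", r1, r2) and ("star", r1).
    """
    rules = {}
    fresh = count()

    def build(r):
        s = f"S{next(fresh)}"        # fresh start variable for this subexpression
        kind = r[0]
        if kind == "eps":            # case 1: S -> epsilon
            rules[s] = [[]]
        elif kind == "sym":          # case 1: S -> a
            rules[s] = [[r[1]]]
        elif kind == "union":        # case 2: S -> S_1 | S_2
            rules[s] = [[build(r[1])], [build(r[2])]]
        elif kind == "concat":       # case 3: S -> S_1 S_2
            rules[s] = [[build(r[1]), build(r[2])]]
        elif kind == "star":         # case 4: S -> epsilon | S_1 S
            rules[s] = [[], [build(r[1]), s]]
        else:
            raise ValueError(f"unknown regular expression node: {kind!r}")
        return s

    start = build(regex)
    return rules, start

# The regular expression a*, as a nested tuple.
rules, start = regex_to_cfg(("star", ("sym", "a")))
print(start, rules)
```

Because every subexpression receives its own fresh variable, the grammars for $R_1$ and $R_2$ automatically have disjoint variables when they are combined.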

Footnotes

1. This is the alphabet of the CFG.
2. Similar to the state transition function of a finite automaton.
3. Similar to the initial state of a finite automaton.
4. See this Wikipedia article for a classic, humorous example of linguistic ambiguity.
