The job of the lexical analyzer is to read input characters of the source program, group them into lexemes, and then as output produce a sequence of tokens that the parser can use for syntax analysis. Both the parser and lexical analyzer interact with the symbol table.

Since the lexical analyzer is the part that deals with the actual source of the program, it may do more than just identify lexemes. This includes error messages (think of line numbers, lexical analyzer keeps track of these) and stripping out comments/whitespace. The lexical analyzer can also perform macro expansion if the program has macros.

Lexical Analysis vs Parsing

Why have the parser and lexer as separate phases? Simply put:

Tokens, Patterns, and Lexemes

What’s the difference?

printf("Total = %d\n", score);

Strings and Languages

Operations on Languages

Regular Expressions

Regular expressions are a notation for describing languages that can be built from operations on symbols of some alphabet. For example, we can use the following to describe the language of C identifiers:

letter_ ( letter_ | digit ) *

Where:

letter_ – Represents any letter or the underscore digit – Stands for any digit | – This stands for the union * – 0 or more occurrences

The juxtaposition of letter_ with the other parts of the expression implicitly represents concatenation

Any regular expression r represents a language L(r). The rules of regular expressions can be derived using some fancy induction which we don’t need to go over.

Some precedence rules:

This means that (a)|((b)*(c)) is the same as a|b*c.

Regular Definitions

Sometimes we want to give regular expressions names. Like this:

\[ \begin{aligned} & d_{1} \rightarrow r_{1} \\ & d_{2} \rightarrow r_{2} \\ & ... \\ & d_{n} \rightarrow r_{n} \end{aligned} \]

Where \(d_{n}\) is some symbol not in the relevant alphabet, and \(r_{n}\) is some regular expression concerning the relevant alphabet.

Example: C identifiers

\[ \begin{aligned} & letter\_ \rightarrow A\ |\ B\ |\ \cdot \cdot \cdot \ |\ Z\ |\ a\ |\ b\ |\ \cdot \cdot \cdot\ |\ z\ |\ \_\ \\ & digit \rightarrow 0\ |\ 1\ |\ \cdot \cdot \cdot \ |\ 9\ \\ & id \rightarrow letter\_\ (\ letter\_\ |\ digit\ )* \end{aligned} \]

Regular Expression Extensions

After Kleene introduced regular expressions with basic operators such as union, concatenation, and Kleene closure. These extensions have been incorporated into software such as Unix utilities (lex) that are useful for lexical analyzers.

Finite Automata

Two types:

NFANondeterministic finite automata can have states which have several transitions for several input symbols, and the empty string \(\epsilon\) is a possible transition. DFADeterministic finite automata has states that only have one transition per input symbol, and the empty string \(\epsilon\) is not a possible transition. Set of DFAs is a subset of all NFAs.

References

Dragon Book