A programming language has a collection of words and symbols that are called lexemes.
For example, C has symbols (, ), ->, etc. Reserved words include if and while. A variable or function name is also considered a lexeme, as are numeric and string constants.
A lexical analyzer, also called a lexer, is responsible for breaking a program up into a sequence of lexemes.
Not all of the characters belong to lexemes. Spaces and comments, for example, serve only to separate lexemes.
The usual rule that a lexer follows is:

At each point, choose the longest possible lexeme.
For example, given C code
ifx = 12*54;
the first lexeme is ifx. Notice that it will not
be if, since that is not the longest possible lexeme.
The next few lexemes are =, 12, *, 54
and ;.
Notice that spaces are only needed in places where otherwise
a longer lexeme would be chosen. For example, the first lexeme of
if x = 12*54;
is if. (Of course, that is not allowed in a C program, but
that has no bearing on lexical analysis.)
The parser wants to work at a higher level than individual characters. Instead, it works with tokens. Tokens have no internal structure. Token TOK_IF, for example, might be the token that corresponds to lexeme if.
After getting a lexeme, the lexer converts it to a token and passes the token on to the parser.
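The lexeme-to-token step might look like the following sketch. The name TOK_IF comes from the text; the other token names and the function `token_of` are illustrative assumptions.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token set; only TOK_IF is named in the text. */
enum token {
    TOK_IF, TOK_WHILE, TOK_IDENT, TOK_NUMBER,
    TOK_ASSIGN, TOK_STAR, TOK_SEMI, TOK_UNKNOWN
};

/* Convert one lexeme (as a string) to its token. */
enum token token_of(const char *lexeme) {
    if (strcmp(lexeme, "if") == 0)    return TOK_IF;     /* reserved word */
    if (strcmp(lexeme, "while") == 0) return TOK_WHILE;  /* reserved word */
    if (strcmp(lexeme, "=") == 0)     return TOK_ASSIGN;
    if (strcmp(lexeme, "*") == 0)     return TOK_STAR;
    if (strcmp(lexeme, ";") == 0)     return TOK_SEMI;
    if (isdigit((unsigned char)lexeme[0]))
        return TOK_NUMBER;                               /* numeric constant */
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
        return TOK_IDENT;                                /* any other name */
    return TOK_UNKNOWN;
}
```

Note that `token_of("ifx")` yields TOK_IDENT, not TOK_IF: the lexer first chooses the longest lexeme, and only then classifies it.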