This excerpt is from flex & bison.
If you need to parse or process text data in Linux or Unix, this useful book explains how to use flex and bison to solve your problems quickly. flex & bison is the long-awaited sequel to the classic O'Reilly book, lex & yacc. In the nearly two decades since the original book was published, the flex and bison utilities have proven to be more reliable and more powerful than the original Unix tools. flex & bison covers the same core functionality vital to Linux and Unix program development, along with several important new topics. You'll find revised tutorials for novices and references for advanced users, as well as an explanation of each utility's basic usage and simple, standalone applications you can create with them. With flex & bison, you'll discover the wide range of uses these flexible tools offer.
A set of distinct symbols. For example, the ASCII character set is a collection of 128 different symbols. In a flex specification, the alphabet is the native character set of the computer. In a bison grammar, the alphabet is the set of tokens and nonterminals used in the grammar.
An ambiguous grammar is one with more
than one rule or set of rules that match the same input. In a bison
grammar, ambiguous rules cause shift/reduce or reduce/reduce conflicts.
The parsing mechanism that bison normally uses cannot handle ambiguous
grammars. The programmer can use
and bison’s own internal rules to resolve conflicts when creating a
parser, or the programmer can use a GLR parser, which can handle
ambiguous grammars directly.
American Standard Code for Information Interchange; a collection of 128 symbols representing the common symbols found in the American alphabet: lowercase and uppercase letters, digits, and punctuation, plus additional characters for formatting and control of data communication links. Most systems that run flex and bison use ASCII or extended 8-bit codes in the ISO-8859 series that include ASCII as a subset.
A program that translates a set of instructions (a program) in one language into some other representation; typically, the output of a compiler is in the native binary language that can be run directly on a computer. Compare to interpreter.
An error within the bison grammar in which two (or more) parsing actions are possible when parsing the same input token. There are two types of conflicts: shift/reduce and reduce/reduce. (See also ambiguity.)
A grammar in which each rule has a single symbol on the LHS; hence, one in which the RHS can match input regardless of what might precede or follow the material it matches. Also called a phrase structure grammar. Context-sensitive grammars, containing rules with several symbols on the LHS, are not practical for parsing computer languages.
An abstract machine that consists of a finite number of instructions (or transitions). Finite automata are useful in modeling many commonly occurring computer processes and have useful mathematical properties. Flex and bison create scanners and parsers based on finite automata.
Generalized Left to Right; a powerful parsing technique that bison can optionally use. Unlike LALR(1), it can parse grammars that are ambiguous or need indefinite lookahead by maintaining all possible parses in parallel of the input read so far.
The left-hand side or LHS of a bison rule is the symbol that precedes the colon. During a parse, when the input matches the sequence of symbols on the RHS of the rule, that sequence is reduced to the LHS symbol.
A program that converts a character stream into a token
stream. Flex takes a description of individual tokens as regular
expressions, divides the character stream into tokens, and determines
the types and values of the tokens. For example, it might turn the
a = 17; into a token stream
consisting of the name
a, the operator
=, the number
17, and the single
;. Also called a
lexer or scanner.
The order in which some particular operation is performed;
for example, when interpreting mathematical statements, multiplication
and division are assigned higher precedence than addition and
subtraction. Thus, the statement
3+4*5 is 23 as
opposed to 35.
In a bison parser, when the input matches the list of symbols on the RHS of a rule, the parser reduces the rule by removing the RHS symbols from the parser stack and replacing them with the LHS symbol.
A language for specifying patterns that match a sequence of characters. Regular expressions consist of normal characters, which match the same character in the input; character classes, which match any single character in the class; and other characters, which specify the way that parts of the expression are to be matched against the input.
The right-hand side or RHS of a bison rule is the list of symbols that follow the colon. During a parse, when the input matches the sequence of symbols on the RHS of the rule, that sequence is reduced to the LHS symbol.
In bison, rules are the abstract description of the grammar. Bison rules are also called productions. A rule is a single nonterminal called the LHS, a colon, and a possibly empty set of symbols called the RHS. Whenever the input matches the RHS of a rule, the parser reduces the rule.
In a bison grammar, the situation where a symbol completes the RHS of one rule, which the parser needs to reduce, and is an intermediate symbol in the RHS of other rules, for which the parser needs to shift the symbol. Shift/reduce conflicts occur either because the grammar is ambiguous or because the parser would need to look more than one token ahead to decide whether to reduce the rule that the symbol completes. Bison resolves the conflict by doing the shift.
In a flex specification, patterns can be tagged with start states. At any point one start state is active, and patterns tagged with that start state can match the input. In an exclusive start state, only tagged patterns can match, while in an inclusive state, untagged patterns can also match the input.
In bison terminology, tokens or terminals are the symbols provided to the parser by the scanner. Compare to a nonterminal, which is defined within the parser.
Each token in a bison grammar has
both a syntactic and a
semantic value; its semantic value is the actual
data contents of the token. For instance, the syntactic type of a
certain operation may be
INTEGER, but its semantic
value might be 3.
If you enjoyed this excerpt, buy a copy of flex & bison.
Copyright © 2009 O'Reilly Media, Inc.