Intro. to Parsing
New 2.01 Features
Expression evaluator (freeware)
XIDEK interpreter kit (freeware)
Standard YACC is a minimal implementation of an LALR-1 parser generator. It is the Model T of parsing. Though in widespread use, it lacks the amenities that software developers have come to expect in their programming tools. AnaGram provides all the functionality of YACC with numerous extensions that make it a powerful, versatile software development tool.
The purpose of this document is to provide a brief overview of the capabilities AnaGram provides that are not available from standard YACC.
The features exclusive to AnaGram fall into roughly three classes. First, AnaGram has a much richer notation than YACC, usually eliminating the need for separate lexical scanners; second, AnaGram has an interactive working environment to help in understanding and debugging grammars; and third, AnaGram has numerous facilities to customize a parser to particular needs.
YACC's notation offers users little more than the bare minimum for stating their problems. It appears to have been designed for the convenience of the implementor rather than the user. At the time YACC was developed, this was an entirely understandable approach.
AnaGram, however, is written in AnaGram. During development this fact focused attention on the need to make AnaGram grammars convenient to use and easy to modify. The result is that programs written using AnaGram are easier to write and to maintain than YACC programs.
Some of the notational differences between AnaGram and YACC simply provide an easier way to represent functionality available in YACC. Other differences provide new functionality not available in YACC. The following paragraphs summarize the differences in notation.
It should be noted that every effort has been made to avoid peculiarities in notation. Wherever there are pre-existing standards for a construct, AnaGram abides by those standards. Thus, for example, the rules for comments, character constants, string constants and integer constants are identical to those of C and C++
In theoretical discussions of parsing and context free grammars, the ultimate, atomic units which make up a "language" are often referred to as "letters". These "letters" are a mathematical abstraction, too easily confused with the letters of the alphabet or the elements of the ascii character set. In most programming languages most letters are syntactically indistinguishable, that is, as far as the grammar is concerned "i", "j", and "k", for example, are all treated the same. However, treating all individual letters as the terminal tokens of a grammar immediately leads to extremely large, redundant parse tables. One solution to this problem has been recourse to a lexical scanner which reduces the number of input tokens to a manageable number.
Lexical scanners, unfortunately, introduce a number of their own problems, so that they are not an entirely satisfactory way to deal with the problem. Not the least of the problems associated with using YACC is the murky interface between a YACC parser and its associated lexical scanner. In most cases, AnaGram parsers do not need separate lexical scanners.
AnaGram allows character sets to be specified as ranges of characters: 'a-z' or 32..127, for example. It also provides notation for the unions, intersections, and differences of character sets. Thus, grammars can be written in terms of meaningful sets of characters. If the characters sets used in a grammar are not mutually disjoint, and they rarely are, AnaGram automatically calculates a disjoint covering of the character universe and adds productions to the grammar as necessary so that the true terminal tokens of the grammar are mutually exclusive.
AnaGram provides for the use of keywords in the parsers it builds. In your grammar you may specify a keyword as a quoted string. AnaGram incorporates special string matching logic in your parser to recognize keywords in the parser input. When a keyword is recognized it is treated as an individual token, completely independent of the characters which make up the keyword.
AnaGram has powerful facilities for specifying and controlling automatic white space suppression. You can, if necessary, exercise much finer control than you can with a lexical scanner. In most cases, you need only specify what qualifies as white space: blanks, tabs, and comments, for instance, and the circumstances where it should not be ignored: inside names, numbers, literal strings and so on. AnaGram takes it from there.
AnaGram allows a notation, used in programming manuals for many years, to denote optional or repetitive input. For instance:
Constructs using this notation are called virtual productions. To implement them, AnaGram adds appropriate real productions to the grammar.
This powerful extension to standard parsing techniques allows you to use semantic information to control syntactic analysis. You may specify multiple tokens on the left side of a production. At execution time, your reduction procedure may then determine, using semantic information, which is the correct left side. Thus, problems such as distinguishing variable names from typedef names in a C parser become trivial.
From a practical point of view, this is one of the most important differences between AnaGram and YACC. In semantic actions, or reduction procedures, it is often desirable to access the semantic value of some of the tokens in the rule being reduced. In YACC, you refer to these tokens by their position in the parser stack, relative to the beginning of the rule. Thus the value of the sixth token in the rule must be referred to as $6. In AnaGram, you instead attach a C variable name to the token in the rule. Then, in the reduction procedure, you simply refer to the token by name.
Although this approach has the obvious benefit of greater clarity, the greatest benefit becomes apparent only when you have to modify a production. If token 6 becomes token 7, the YACC programmer has to change all occurrences of $6 to $7. The AnaGram programmer avoids this tedious clerical task and the bugs it engenders. Largely because of this feature, it is much easier to manage an evolving system using AnaGram.
To support the linkage between rules and actions discussed in the previous paragraph, AnaGram supports C and C++ type declarations for grammar tokens. Using these declarations, it can automatically set up the parser stack.
The YACC programmer must declare the parser stack himself, making it a union if necessary, and must use the correct union element whenever he refers to the parser stack. As with the linkage between rules and actions, the benefits of this feature are greatest when modifying a program.
If your parser does not work the way you expect it to, standard YACC can provide you with printouts of parsing tables. For complex grammars, these printouts can be hundreds of pages long. Tracking a problem through such printouts is tedious and time-consuming.
AnaGram provides a large selection of interactive debugging facilities and help messages to help you find your problems quickly. The debugging facilities consist of interactive trace functions and a large number of cross-linked tables which you may inspect. Many of the tables are automatically synchronized with your syntax file window.
There are two principal kinds of problem you may have with a grammar. The first is that it fails to parse what you believe to be legitimate input. The second is that it has conflicts, or ambiguities. AnaGram provides help for both problems.
With the File Trace, you can see how a parser will interpret test data as soon as you have written a grammar. Select a test file and see how your parser will deal with it. You can parse character by character, fast forward, back up, and try again. At every step, File Trace displays the contents of the parser stack. If you get an unexpected error with a test file, you can quickly find the discrepancy between the test file and your grammar. If you wish to compare the differences in response to different test files you can even have several distinct File Traces running simultaneously.
The Grammar Trace is similar to the File Trace except that it is completely interactive. Instead of taking input from a file, it pops up a window of acceptable input tokens so you may select the one you wish. In many circumstances, you may ask AnaGram to initialize a Grammar Trace to help you understand a particular circumstance. AnaGram will find a sequence of input tokens that illustrates the circumstance. For instance, if you have a conflict, AnaGram can provide an input sequence that will trigger the conflict. For any state of the parser, AnaGram can create a sequence of tokens that will get to the specified state. As with File Traces, you can have multiple Grammar Traces active simultaneously.
Every time AnaGram analyzes a grammar it creates a number of summary tables, including the Symbol Table, the Token Table, and the State Table among others. These tables are listed in a Window Menu, from which you may choose to inspect any one that interests you. For each table there is a special menu of auxiliary windows that provide more detailed information. For instance, the State Table displays the rules which define each state in the grammar. The Auxiliary Windows menu allows you to see, among other tables, a full rule expansion for any state, or a Grammar Trace that leads to a specific state. All of the summary tables are available at any time, even when you are running a File Trace or Grammar Trace.
If your grammar has conflicts, they will be summarized in a Conflicts Window. The Auxiliary Windows menu for the Conflicts Window provides a large number of windows to help you find the reason for the conflict. Most important of these are a pair of "derivation" windows which show how the conflict derives from the rules of your grammar.
AnaGram includes two separate coverage analysis capabilities to help you gauge the adequacy of your testing. The Trace Coverage table counts the number of times each rule in your grammar is reduced while running File Trace. Rule Coverage is an optional facility that allows you to keep track of the number of times each rule is reduced by your actual running parser.
Just as you could get a Model T in any color you wanted, as long as it was black, YACC lets you name your parser anything you want as long as it is yyparse(). Generally, YACC does not provide you with alternatives.
AnaGram builds complete parsers from your grammar files. It uses the settings of a large complement of configuration parameters and switches to customize your parser to your particular needs. You can even name your parser yyparse(). There are defaults for all the parameters and switches, so you need not pay attention to them unless you actually need to exercise the degrees of freedom they control. AnaGram also has various attribute statements you can use to specify the properties of tokens and character sets in your parser. The following paragraphs summarize a few of the more important aspects of your parser you can control using the configuration parameters, switches and attribute statements.
With AnaGram, the entire status of a parse is encapsulated in a structure called a parser control block, abbreviated pcb. The practical result is that you can have multiple instances of a pcb, or different varieties of pcbs, operating within a single program, or even recursive parsers. Starting with AnaGram 2.01, the extend pcb statement lets you add your own declarations to the pcb. This greatly facilitates construction of thread-safe parsers, but extend pcb is also useful in other ways. For example, if you are deriving your own C++ class from the parser control block, you can use extend pcb to provide virtual function definitions in the base class.
In the default case, your parser is arranged so that every time it needs an input character it invokes a macro, GET_INPUT, to provide it with the next character. GET_INPUT defaults to reading stdin, but you can easily change it to read characters from a file, a keyboard, or even a comm port.
On the other hand, if you have all the input data in memory already you can set the pointer input switch, and simply provide your parser with a pointer to the input stream.
A third, more interesting way, to configure your parser for input uses the event driven switch. In this case, your parser will be built "inside out". You call it whenever you have an input character. When it has done all it can with that character, it will simply return and wait for you to call it again with the next input character. Using event driven parsers, you can run several different parsers on the same data, or you can have several instances of the same parser accepting data from different sources in parallel. Note that it is extremely easy to chain event driven parsers, so that output from one becomes input to the next.
YACC creates parsers which are not thread-safe. With AnaGram, beginning with version 2.01, you can easily make your parsers thread-safe if you wish to do so. The reentrant parser switch makes the AnaGram parse engine reentrant by passing the parser control block as an argument to all parse engine function calls. The extend pcb statement allows you to add your own declarations to the parser control block, so that you can avoid references to global or static variables in your reduction procedures. Further info.
Every parser has to be ready to deal with erroneous input. AnaGram parsers come ready-made with default syntax error diagnostics. Of course, if you don't like the ones AnaGram provides, you can build your own. In either case, you can use AnaGram's optional line and column tracking so that you can point exactly to the location of the error.
After diagnosing an error, you may want your parser to continue. In addition to providing a special "error" token which works very much like the error token in YACC, AnaGram provides the option of "automatic resynchronization". Automatic resynchronization uses a heuristic based on your grammar to get your parser back in synch with its input. All you have to do to use it is to set the switch.
For some problems you may not wish to use character input. You may have an existing lexical scanner, or some other program that produces input tokens. AnaGram provides numerous capabilities for setting up the interface between your scanner and your parser. In no case do you have to declare tokens in a specific order or rely on any such implicit interface. All interfaces are explicit.
Like YACC, AnaGram allows you to use operator precedence declarations to resolve conflicts. Unlike YACC, however, AnaGram normally condenses its parse tables in such a way that writing a deliberately ambiguous expression grammar, resolved with precedence declarations, does not provide any particular advantage.
Conflicts may also be resolved with the sticky statement. All conflicts resolved by using operator precedence or sticky are recorded in the Resolved Conflicts window so you can inspect the results. The tools for inspecting conflicts may also be used for inspecting resolved conflicts.
Links to: Home page | Trial Copy | Syntax Directed Parsing | Glossary