The Arithmetic Expression Module

[ Reference ]

[AEXP Module Source File: aexp.c. ]
[AEXP Module Source File File: aexp_token.c ]
[AEXP Module Public Header File: aexp.h. ]
[AEXP Module Private Header File: aexpp.h. ]

The Arithmetic Expression (AEXP) Module allows for encapsulation and evaluation of arithmetic expressions in infix notation, such as "2 * (3 + 5)." It incorporates constants, such as PI, and common mathematical functions, such as cosine and natural logarithm. The module implementation is based loosely on the "Interactive Expression Evaluator" described in your textbook, Data structures and Program Design in C, Second Edition, by Robert Kruse et al. (Prentice Hall, New Jersey, 1997).

The overall flow of the logic is as follows:

  1. The user instantiates an AEXP object, specifying an arithmetic expression (as a string) and the length of the arithmetic expression. Requiring the length of the expression string gives the user the option of specifying a non-null-terminated string. This is particularly useful in the case where the expression to be evaluated is embedded in a longer string, such as "The value of the expression '3 + (2 * 24)' is."
  2. We make a copy of the expression string and store it internally; the internal copy will be null-terminated.
  3. We parse the expression string into an array of tokens.
  4. The array of tokens is translated into a new array, representing the expression in postfix notation (i.e. Reverse Polish Notation, or RPN).
  5. The tokenized expression in postfix form is evaluated.

Following is a more detailed description of the process and its implementation.

Tokenization

Identifying a token in a string consists of two steps: parsing the string into words and mapping the words to tokens.

Parsing the String

A word is parsed from the string. A word is a sequence of one or more ASCII characters that fall into one of the following patterns:
  1. Number
    A number is a string of digits optionally including a single period (decimal point). Each of the following is a number:

    3       4.1       .77

  2. Word
    Any sequence of alphanumeric characters beginning with an alphabetic character. (I know, this is confusing; this means that we can have a word word as opposed to, say, a number word.) A word of this type must correspond to a recognizable constant (such as pi) or function (such as sin).
  3. Symbol
    A sequence of one or two non-alphanumeric, non-whitespace characters.

In this module a word is described by the AEXP__WORD_t struct declared in aexpp.h:

typedef struct aexp__word_s
{
    const char      *str;
    size_t          len;
    AEXP__TOKEN_p_t token;
    double          value;
} AEXP__WORD_t, *AEXP__WORD_p_t;

Where:

str
is a pointer to the first character of the word.
len
is the length of the word (note that a word is not null-terminated).
token
is a pointer to a struct that describes the token corresponding to a word (assuming the word corresponds to a valid token; see below).
value
is the value of the word. The value of a number word or of a word corresponding to a constant (such as pi) is the value of the number or constant. The value of a symbol word or a word corresponding to a function, is the value obtained by evaluating that portion of the expression. A special case is made for left- and right- parenthesis symbol words; these words are discarded in the process that converts the prefix token array into a postfix token array, and are never assigned a value. For example, the expression:
3 + cos( 0 )
consists of 6 words:
  • 3 and 0 have the values 3 and 0;
  • cos has the value of the cosine of 0, which is 1;
  • + has the value of 3 + 1, or 4;
  • ( (left-parenthesis) and ) (right-parenthesis) have no value.

Token Mapping

Words are mapped to tokens via a hash table. The word is the key and the value is a pointer to a struct of type AEXP__TOKEN_t, declared in aexpp.h:

typedef struct aexp__token_s
{
    const char*          name;
    AEXP__TOKEN_TYPE_e_t type;
    int                  priority;
    double               value;
    AEXP__EVALUATE_p_t   evaluate;
} AEXP__TOKEN_t, *AEXP__TOKEN_p_t;

Where:

name
is the name of the token. Note that name may consist of a symbol, such as "*" (for the multiplication operator token) as well as more recognizable names such as "pi" and "radians." The special name "x" is used for all numeric constants (e.g. 10 and 3.14). The unary plus and unary minus operators have special internal names denoted by AEXP__UP and AEXP__UM, respectively.
type
specifies the type of the token. Possible values (as declared in aexpp.h) are:
AEXP__NONE
This value is only used to label invalid tokens.
AEXP__END_EXPR
This value is presently not used by the module.
AEXP__LEFT_PAREN
Denotes a left parenthesis.
AEXP__RIGHT_PAREN
Denotes a right parenthesis.
AEXP__UNARY_OP
Denotes a unary operator. Unary plus, unary minus and, presently, all functions have this type.
AEXP__BINARY_OP
Denotes a binary operator.
AEXP__OPERAND
This value applies to the internal constants e and pi, and to all numeric constants.
priority
the priority of the token. Priority is only relevant to tokens of type AEXP__UNARY_OP and AEXP__BINARY_OP. All other tokens are arbitrarily assigned a priority of 0. The priority of the operators follows the usual conventions:
  • The highest priority is assigned to unary operators (note that, in this context, a function such as sin is considered a unary operator). All unary operators have the same priority.
  • Exponentiation has the second highest priority.
  • The third highest priority is assigned to the multiplication and division operators.
  • The lowest priority is given to the addition and subtraction operators.
value
is the value of the token, if the token has an intrinsic value. For example, the token pi has a value beginning with 3.14; the symbols for addition and multiplication, + and *, on the other hand, have no intrinsic value.
evaluate
is the address of a function that is used to evaluate the subexpression consisting of an operator and its operand(s). Tokens that are not operators have an evaluate value of NULL.

Following is a portion of the table of token values used by the module; the full table can be found in aexp_token.c:

static AEXP__TOKEN_t tokens_[] =
{
    { "(",       AEXP__LEFT_PAREN,  0, 0,        null_eval   },
    { ")",       AEXP__RIGHT_PAREN, 0, 0,        null_eval   },
    { AEXP__UM,  AEXP__UNARY_OP,    7, 0,        negate_eval },
    { AEXP__UP,  AEXP__UNARY_OP,    7, 0,        pos_eval    },
    { "abs",     AEXP__UNARY_OP,    7, 0,        abs_eval    },
    { "sqrt",    AEXP__UNARY_OP,    7, 0,        sqrt_eval   },
    . . .
    { "**",      AEXP__BINARY_OP,   6, 0,        raise_eval  },
    { "x",       AEXP__OPERAND,     0, 0.0,      null_eval   },
    { "pi",      AEXP__OPERAND,     0, AEXP__PI, null_eval   },
    { "e",       AEXP__OPERAND,     0, AEXP__E,  null_eval   }
};

Conversion to Postfix Form

After the string is fully parsed the expression consists of an array of tokens arranged in infix order. The next step is to translate this array into a new array in postfix order. A complete discussion of the postfix expression form (also known as reverse polish notation) can be found elsewhere, such as Wikipedia. . For this discussion let's just look at a couple of examples:

Infix to Postfix Conversion

Infix expression:   3  +  4  *  2
Postfix expression: 3  4  2  *  +
(Priority of * forces 4 * 2 to evaluate first.)

Infix expression:   (3  +  4)  *  2
Postfix expression: 3  4  +  2  *
(Parentheses force 3 + 4 to evaluate first.)

Infix expression:   (3  +  4)  *  (2  -  1)
Postfix expression: 3  4  +  2  1  -  *

Infix expression:   3  *  log(  10  )
Postfix expression: 3  10  log  *
(Priority of unary operator log forces log( 10 ) to evaluate first.)

Infix expression:   log(  10  )  *  3
Postfix expression: 10  log  3  *

To perform the conversion we will use a stack and a simplified form of Edger Dijkstra's "Shunting Yard" algorithm which is expressed (absent error checking) as follows:

while infix array is not traversed
    next token
    if token is an operand
        add token to postfix array
    else if token is a left parenthesis
        push token onto stack
    else if token is a right parenthesis
        pop token from stack
        while token is not a left paren
            add token to postfix array
            pop token from stack
        (note: left and right parens are discarded)
    else 
        (token must be an operator)
        set termination flag to false
        while !flag AND stack is not empty
            peek at top of stack
            if top-token is a left parenthesis
                set flag to TRUE
            else if top-token priority < token priority
                set flag to TRUE
            else
                pop top-token from stack
                add top-token to postfix array
        (end while !flag AND stack is not empty)
        push token onto stack
(end while infix array is not traversed)
while stack is not empty
    pop token from stack
    add token to postfix array

The complete implementation of this algorithm is found in aexp.c

Evaluation of the Postfix Expression

Once the expression has been successfully converted to postfix form it may be evaluated. The algorithm for doing so is relatively straightforward; once again, a stack will be employed; we will also need a temporary array, (denoted ops) suitable for holding two tokens to store operands pending subexpression evaluation. The algorithm (with error checking omitted) is as follows:
while the postfix array is not traversed
    next token
    if token is an operand
        push the token's value onto stack
    else
        (token must be an operator)
        if token is a binary operator
            pop a token from the stack to ops[1]
        pop a token from the stack to ops[0]
        invoke the token's evaluator passing ops
        push the resulting value onto the stack
(end while postfix array is not traversed)
(Note: there should be exactly one value left on the stack.)
pop the last value off the stack
flag this value as the value of the expression

Expression Representation

The overall expression is described by the AEXP__EXPR_t struct which is declared in aexpp.h:

typedef struct aexp__expr_s
{
    char        *expr_str;
    size_t      len;
    CDA_BOOL_t  valid;
    VEC_ID_t    infix;
    VEC_ID_t    postfix;
    double      value;
    const char  *bad_str;
    size_t      bad_str_len;
    int         errnum;
} AEXP__EXPR_t, *AEXP__EXPR_p_t;

Where:

expr_str
is a pointer to a buffer that holds the expression string. This buffer is allocated at the time the expression struct is created; the string provided by the caller is copied into the buffer.
len
is the length of the expression string.
valid
is initially true and is set to false any time the expression is determined to be invalid.
infix
is the array (expressed as a vector) to hold the tokens parsed from the expression string.
postifx
is the array (expressed as a vector) to hold the tokens in the postfix form of the expression string (see Conversion to Postfix Form ).
value
holds the value of the expression.
bad_str
if parsing fails, points to the first character that can't be parsed in expr_str.
bad_str_len
is the length of bad_str.
errnum
if an evaluation error occurs, holds the value of errno at the time the evaluation fails.

Notes on Error Checking

Extensive error checking takes place while parsing and evaluating an expression string. Below are a few notes regarding this process.

  1. Floating Point Numbers
    When checking floating point numbers (type double) it is necessary to make sure that the number is valid before using it in an operation, such as comparing it to zero. This is done with the isnan facility (for is not a number) declared in math.h. In Microsoft, this facility is named _isnan, so
    aexpp.h declares the macro ISNAN which is equivalent to _isnan in Microsoft environments, but equivalent to isnan elsewhere:
    // isnan in Microsoftland is _isnan
    #if defined( _WIN32 ) 
    #define ISNAN _isnan
    #else
    #define ISNAN isnan
    #endif
  2. Once parsing is successfully completed, and before postfix conversion takes place, the infix expression must undergo the following two validations:

    1. Leading Position
      A leading position is a position within an expression at which a subexpression may begin. Here are some examples:

      Leading Positions
      Expression Leading Positions
      3 + 4 * 6 3, 4, 6
      (3 + 4) * 6 left-paren, 3, 4, 6
      -3 + 7 unary minus, 3, 7

      This leads to two questions: how do we identify a leading position, and what tokens may legitimately occupy a leading position?

      Leading positions occur at:

      • The beginning of an expression;
      • Immediately after a right parenthesis;
      • Immediately after a unary operator; and
      • Immediately after a binary operator.

      Tokens that may appear in a leading position are:

      • An operand (type AEXP__OPERAND);
      • A unary operator (type AEXP__UNARY_OP); and
      • A left parenthesis (type AEXP__LEFT_PAREN).

      So, after successfully parsing an expression, we can identify all the leading positions and verify that they are occupied by valid tokens:

      if leading-position
          if token-type != OPERAND
          and token-type != UNARY-OP
          and token-type != left-parenthesis
              expression is invalid
    2. Balanced Parentheses
      We make sure that an expression contains the same number of right parentheses as left parentheses; we also make sure that the number of right parentheses never exceeds the number of left parentheses at any point in the expression:

Copyright © 2007 by Jack Straub
jstraub@centurytel.net