Inventing grammar for GNU Make

To make zograscope parse more than just one language, I needed to pick a language that is substantially different from C (the first one supported). The GNU Make language looked like an interesting option right from the start, and it’s a tool that I use regularly. However, finding a grammar to implement turned out to be a problem. Since no grammar was found, one was created.

About this post

Writing it is motivated by a couple of reasons:

to document what was done
to provide a starting point for anyone else willing to develop a grammar for GNU Make (or maybe other derivatives of make)

The description is not terribly detailed, but should provide enough information to understand the implementation of the grammar.

Also note that this was written a year after the events, so there might be some inaccuracies or omissions, which should be resolved in favour of the code (links are at the bottom).

What was found

The initial search didn’t turn up much, but there are some things worth mentioning.

While GNU Make’s documentation does explain the syntax and provides examples, it doesn’t define the syntax in any rigorous way. To see how far it goes, take a look at section 7.2, Syntax of Conditionals:

conditional-directive
text-if-true
else
text-if-false
endif

It turns out that some people mistakenly call a syntax highlighting description of GNU Make’s language its “grammar”, which makes it harder to look for a real grammar.

The only actual attempt to come up with a grammar is this thing for something called grammarjs, which is now abandoned. It looked like it could serve as a basis for a grammar, but that’s not the case. It’s unfinished and overly simplistic: it can handle a couple of example makefiles, but is inadequate otherwise.

People actually tried to find a grammar before, but the answer was that:

It would be tricky to write a grammar for make, since the grammar is extremely context-dependent.

Which is a quote from GNU Make’s maintainer.

What was done

The result turned out to be better than anticipated. Why? Because it’s an LR grammar with (effectively) no conflicts! The “(effectively)” part means that there are no reduce/reduce conflicts (those would require a GLR parser to handle) and that shift/reduce conflicts are resolved by assigning precedence to some tokens. The trick for achieving this is to do some extra work in the lexer. This is actually a requirement: highly context-dependent tokens can’t be distinguished in the grammar alone, because there is simply no context information for that.

The grammar was built for a use case where comments can’t be thrown away in the lexing phase, so it includes them. The version without comments would be simpler, but not by much.

The grammar includes a number of similar-looking sets of productions, which is another consequence of its dependency on context. There are subtle differences as to what can be used where, but the building blocks are the same, so you get several similar productions composed of different subsets of those building blocks. Without this duplication, the grammar would be even smaller (yes, it’s relatively compact).

There is no list of hard-coded builtin functions, which might be of interest to someone. And some things could probably be parsed on a more fine-grained level. Things between define and endef are parsed as words instead of as rules, which can be fixed at the cost of switching to a GLR parser. Other than that, it’s a reasonable grammar which was successfully tested on some rather complicated makefiles. However, it doesn’t cover, at the very least, weird variable names like:

ifeq(bla) = 10
ifeq$(bla) = 10

and conditionals that only partially lie inside recipes. It probably doesn’t parse some constructs perfectly well (i.e., they are parsed, but the grammar takes them for something else).

All in all, it fits the purpose for me, but probably can’t be used to build a valid AST for a makefile.

Structure of the grammar

It’s not hard to guess that there will be assignments, rules and recipes, but the more basic elements underlying all of them are:

whitespace
non-whitespace
expansions (variables or functions)
expressions: sequences of expansions and non-whitespace

It is these basic elements which are hard to get right, because they behave differently in different contexts. For example:

adjacent non-whitespace can sometimes include reserved keywords
leading tabulation has special meaning only in recipes
whitespace is needed before the first argument of a function call unless the argument is empty

This leads to the already mentioned duplication, which takes the form of five kinds of expressions (listed from most to least constrained):

expressions – expressions in include, conditionals, first-level arguments
exprs_nested – expressions nested within parentheses of expressions such that they can include a comma (,) without breaking the list of arguments
exprs_in_def – expressions inside define statements
exprs_in_assign – expressions on the right-hand side of assignments
exprs_in_recipe – expressions in recipes

The most complicated part of the grammar is probably the specification of expressions as interleaved verbatim text and expansions without unnecessary conflicts. If you want to understand how expressions are done, look at exprs_in_recipe first, as it’s the most straightforward one.

Whitespace is necessarily part of the grammar:

new lines have to be explicitly encoded
leading tabs are a separate token, which includes the new line that precedes the tab
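The second point can be sketched in a few lines of Python. This is only an illustration, not zograscope's lexer: the function name, the TEXT pseudo-token and the regular expression are made up for the example, and it ignores backslash continuations as well as a tab on the very first line of a file.

```python
import re

# Assumption for this sketch: a line break immediately followed by a tab forms
# one LEADING_TAB token; any other line break is a plain NL.
LINE_BREAK = re.compile(r"(\r\n|\n|\r)(\t?)")

def split_line_breaks(text):
    """Return (kind, value) pairs; everything between line breaks is TEXT."""
    tokens = []
    pos = 0
    for m in LINE_BREAK.finditer(text):
        if m.start() > pos:
            tokens.append(("TEXT", text[pos:m.start()]))
        # The tab, when present, swallows the line break in front of it.
        tokens.append(("LEADING_TAB" if m.group(2) else "NL", m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("TEXT", text[pos:]))
    return tokens

for token in split_line_breaks("all: a\n\tcc a.c\nx = 1\n"):
    print(token)
```

With such a token, a production like `recipe: LEADING_TAB exprs_in_recipe` needs no separate NL in front of it, which is presumably why br accepts either NL or LEADING_TAB.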

The grammar

/* Assign lower precedence to NL. */
%precedence NL
%precedence COMMENT "ifdef" "ifndef" "ifeq" "ifneq"

makefile: statements "end of file"
        | "end of file"

statements: br
          | statement
          | statements br
          | statements statement

conditional: if_eq_kw condition statements_opt "endif" comment_opt br
           | if_eq_kw condition statements_opt "else" statements_opt "endif" comment_opt br
           | if_eq_kw condition statements_opt "else" conditional
           | if_def_kw identifier statements_opt "endif" comment_opt br
           | if_def_kw identifier statements_opt "else" statements_opt "endif" comment_opt br
           | if_def_kw identifier statements_opt "else" conditional

conditional_in_recipe: if_eq_kw condition recipes_opt "endif" comment_opt
                     | if_eq_kw condition recipes_opt "else" recipes_opt "endif" comment_opt
                     | if_eq_kw condition recipes_opt "else" conditional_in_recipe
                     | if_def_kw identifier recipes_opt "endif" comment_opt
                     | if_def_kw identifier recipes_opt "else" recipes_opt "endif" comment_opt
                     | if_def_kw identifier recipes_opt "else" conditional_in_recipe

condition: '(' expressions_opt ',' expressions_opt ')'
         | SLIT SLIT

define: "define" pattern definition "endef" br
      | specifiers "define" pattern definition "endef" br
      | "define" pattern ASSIGN_OP definition "endef" br
      | specifiers "define" pattern ASSIGN_OP definition "endef" br

definition: comment_opt br
          | comment_opt br exprs_in_def br

include: "include" expressions br

statements_opt: comment_opt br
              | comment_opt br statements

if_def_kw: "ifdef"
         | "ifndef"

if_eq_kw: "ifeq"
        | "ifneq"

statement: COMMENT
         | assignment br
         | function br
         | rule
         | conditional
         | define
         | include
         | export br

export: "export"
      | "unexport"
      | assignment_prefix
      | assignment_prefix WS targets

assignment: pattern ASSIGN_OP comment_opt
          | pattern ASSIGN_OP exprs_in_assign comment_opt
          | assignment_prefix ASSIGN_OP comment_opt
          | assignment_prefix ASSIGN_OP exprs_in_assign comment_opt

assignment_prefix: specifiers pattern

specifiers: "override"
          | "export"
          | "unexport"
          | "override" "export"
          | "export" "override"
          | "undefine"
          | "override" "undefine"
          | "undefine" "override"

expressions_opt: %empty
               | expressions

expressions: expression
           | expressions WS expression

exprs_nested: expr_nested
            | exprs_nested WS expr_nested

exprs_in_assign: expr_in_assign
               | exprs_in_assign WS expr_in_assign

exprs_in_def: first_expr_in_def
            | br
            | br first_expr_in_def
            | exprs_in_def br
            | exprs_in_def WS expr_in_recipe
            | exprs_in_def br first_expr_in_def

first_expr_in_def: char_in_def expr_in_recipe
                 | function expr_in_recipe
                 | char_in_def
                 | function

exprs_in_recipe: expr_in_recipe
               | exprs_in_recipe WS expr_in_recipe

expression: expression_text
          | expression_function

expr_nested: expr_text_nested
           | expr_func_nested

expr_in_assign: expr_text_in_assign
              | expr_func_in_assign

expr_in_recipe: expr_text_in_recipe
              | expr_func_in_recipe

expression_text: text
               | expression_function text

expr_text_nested: text_nested
                | expr_func_nested text_nested

expr_text_in_assign: text_in_assign
                   | expr_func_in_assign text_in_assign

expr_text_in_recipe: text_in_recipe
                   | expr_func_in_recipe text_in_recipe

expression_function: function
                   | '(' exprs_nested ')'
                   | expression_text function
                   | expression_function function

expr_func_nested: function
                | '(' exprs_nested ')'
                | expr_func_nested function
                | expr_text_nested function

expr_func_in_assign: function
                   | expr_func_in_assign function
                   | expr_text_in_assign function

expr_func_in_recipe: function
                   | expr_func_in_recipe function
                   | expr_text_in_recipe function

function: VAR
        | "$(" function_name ")"
        | "$(" function_name WS arguments ")"
        | "$(" function_name ',' arguments ")"
        | "$(" function_name ':' expressions ")"
        | "$(" function_name ASSIGN_OP expressions ")"

function_name: function_name_text
             | function_name_function

function_name_text: function_name_piece
                  | function_name_function function_name_piece

function_name_piece: CHARS
                   | function_name_piece CHARS

function_name_function: function
                      | function_name_text function

arguments: %empty
         | argument
         | arguments ','
         | arguments ',' argument

argument: expressions

rule: targets colon prerequisites NL
    | targets colon prerequisites recipes NL
    | targets colon assignment NL

target: pattern

pattern: pattern_text
       | pattern_function

pattern_text: identifier
            | pattern_function identifier

pattern_function: function
                | pattern_text function
                | pattern_function function

prerequisites: %empty
             | targets

targets: target
       | targets WS target

recipes: recipe
       | recipes recipe

recipes_opt: comment_opt NL
           | comment_opt recipes NL

recipe: LEADING_TAB exprs_in_recipe
      | NL conditional_in_recipe
      | NL COMMENT

identifier: CHARS
          | ','
          | '('
          | ')'
          | identifier CHARS
          | identifier keywords
          | identifier ','
          | identifier '('
          | identifier ')'

text: char
    | text char

text_nested: char_nested
           | text_nested char_nested

text_in_assign: char_in_assign
              | text_in_assign char_in_assign

text_in_recipe: char_in_recipe
              | text_in_recipe char_in_recipe

char: CHARS
    | SLIT
    | ASSIGN_OP
    | ':'

char_nested: char
           | ','

char_in_assign: char_nested
              | '('
              | ')'
              | keywords

char_in_def: char
           | '('
           | ')'
           | ','
           | COMMENT
           | "include"
           | "override"
           | "export"
           | "unexport"
           | "ifdef"
           | "ifndef"
           | "ifeq"
           | "ifneq"
           | "else"
           | "endif"
           | "define"
           | "undefine"

char_in_recipe: char_in_assign
              | COMMENT

keywords: "include"
        | "override"
        | "export"
        | "unexport"
        | "ifdef"
        | "ifndef"
        | "ifeq"
        | "ifneq"
        | "else"
        | "endif"
        | "define"
        | "endef"
        | "undefine"

br: NL
  | LEADING_TAB

colon: ':'
     | ':' ':'

comment_opt: %empty
           | COMMENT

The lexer

Symbols:

Everything in ' quotes and almost everything in " quotes is a literal.
')' ::= <unpaired )>
'}' ::= <unpaired }>
"$(" ::= "$(" | "${" – beginning of an expansion
")" ::= ")" | "}" – ending of an expansion
"end of file" ::= <end of file>
COMMENT ::= <# comment (can be multiline)>
ASSIGN_OP ::= "=" | "?=" | ":=" | "::=" | "+=" | "!="
CHARS ::= <sequence of non-whitespace>
WS ::= <sequence of whitespace>
NL ::= "\n" | "\r" | "\r\n"
VAR ::= /\$./
SLIT ::= <single- or double-quote literal>
LEADING_TAB ::= <tabulation at the first position in a line (eats NL)>

Whitespace handling is tricky: the lexer actually ignores whitespace unless the previously returned token needs that whitespace after it on the parser’s side.
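A minimal Python sketch of that idea follows. It is not the actual lexer: the token set is heavily reduced, and the WS_MATTERS_AFTER set is an assumption chosen for illustration. The real decision also depends on context that this sketch doesn't model.

```python
import re

# Hypothetical, heavily simplified token set for a single line.
TOKEN = re.compile(r"""
    (?P<WS>[ \t]+)
  | (?P<ASSIGN_OP>::=|:=|\?=|\+=|!=|=)   # longer operators are tried first
  | (?P<COLON>:)
  | (?P<CHARS>[^\s:=?+!]+|[?+!])
""", re.VERBOSE)

# Assumption: only after these token kinds does the parser care about WS.
WS_MATTERS_AFTER = {"CHARS"}

def tokenize(line):
    """Return (kind, value) pairs, dropping whitespace nobody asked for."""
    tokens = []
    prev = None
    for m in TOKEN.finditer(line):
        kind = m.lastgroup
        if kind == "WS" and prev not in WS_MATTERS_AFTER:
            continue  # the previous token doesn't need it, so ignore it
        tokens.append((kind, m.group()))
        prev = kind
    return tokens

print(tokenize("all: a.o  b.o"))
```

Here the space after the colon is dropped, while the whitespace between a.o and b.o survives as a WS token, which is what a production such as `targets: targets WS target` relies on.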

Unpaired closing parts of expansions (')' or '}') are just themselves, while the contents of an expansion are never treated as keywords.
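Both rules boil down to tracking which expansions are currently open. The Python sketch below shows one way to do that with a stack. The token names (EXPANSION_BEGIN, EXPANSION_END, LITERAL, KEYWORD) and the function are invented for the example, and treating any keyword-looking word outside an expansion as a keyword is a simplification, since real keywords also depend on their position in a line.

```python
import re

KEYWORDS = {"include", "override", "export", "unexport", "ifdef", "ifndef",
            "ifeq", "ifneq", "else", "endif", "define", "endef", "undefine"}
CLOSER = {"(": ")", "{": "}"}
PIECE = re.compile(r"\$[({]|\$.|[(){}]|[^\s(){}$]+")

def classify(text):
    """Return (kind, value) pairs for the pieces of a single line."""
    stack = []  # entries: (expected closer, was it opened by "$(" or "${"?)
    out = []
    for piece in PIECE.findall(text):
        if piece in ("$(", "${"):
            stack.append((CLOSER[piece[1]], True))
            out.append(("EXPANSION_BEGIN", piece))
        elif piece.startswith("$"):
            out.append(("VAR", piece))
        elif piece in CLOSER:
            # A plain opener nests only if it is the same kind of bracket
            # as the innermost open expansion.
            if stack and stack[-1][0] == CLOSER[piece]:
                stack.append((CLOSER[piece], False))
            out.append(("LITERAL", piece))
        elif piece in (")", "}"):
            if stack and stack[-1][0] == piece:
                _, is_expansion = stack.pop()
                kind = "EXPANSION_END" if is_expansion else "LITERAL"
                out.append((kind, piece))
            else:
                out.append(("LITERAL", piece))  # unpaired: just itself
        elif not stack and piece in KEYWORDS:
            out.append(("KEYWORD", piece))
        else:
            out.append(("CHARS", piece))  # never a keyword inside an expansion
    return out

for token in classify("ifeq $(ifeq (a)) )"):
    print(token)
```

In this input the first ifeq comes out as a keyword and the second one as plain CHARS, and the three closing parentheses are classified as a paired literal, the end of the expansion and an unpaired literal, in that order.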

The implementation

As mentioned above, the grammar was done for zograscope, where the implementation can be found: lexer and parser.

Be warned that coming up with this thing wasn’t easy, and the code reflects that by not being particularly pretty.
