reversed(top()) code tags rss about

Inventing grammar for GNU Make

February 28, 2019
[make] [programming]

For the purpose of making zograscope support parsing of more than just one language, there was a need to pick a language that is substantially different from C (the first one supported). GNU Make language certainly looked like an interesting option right from the start and it’s a tool that I use regularly. However, finding a grammar to implement it turned out to be a problem. Since grammar wasn’t found, it was created.

On this post

Writing it is motivated by a couple of reasons:

  • to document what was done
  • to provide a starting point for anyone else willing to develop a grammar for GNU Make (or maybe other derivatives of make)

The description is not terribly detailed, but should provide enough information to be able to understand implementation of the grammar.

Also note that this is written a year past the events, so there might be some inaccuracies or omissions, which should be resolved in favour of code (links are at the bottom).

What was found

Initial search didn’t really provide much, but there are some things worth mentioning.

While GNU Make’s documentation does explain syntax and provide examples, it doesn’t define syntax in any rigorous way. For example of how far it goes one might take a look at 7.2 Syntax of Conditionals section:

conditional-directive
text-if-true
else
text-if-false
endif

Turns out that some people mistakenly call syntax highlighting description of GNU Make’s language its “grammar”, making it harder to look for a real grammar.

The only actual attempt to come up with the grammar is this thing for something called grammarjs, which is now abandoned. It looked like it could serve as a basis for grammar, but that’s not the case. It’s unfinished and is overly simplistic. It can handle only a couple of example makefiles, but is inadequate otherwise.

People actually tried to find a grammar before, but the answer was that:

It would be tricky to write a grammar for make, since the grammar is extremely context-dependent.

Which is a quote from GNU Make’s maintainer.

What was done

The result turned out to be better than anticipated. Why? Because it’s an LR grammar with (effectively) no conflicts! The “(effectively)” part means that there are no reduce/reduce conflicts (this would require GLR parser to handle) and shift/reduce conflicts are resolved by assigning precedence to some tokens. The trick for achieving this is to do some extra stuff in the lexer, which is actually a requirement as highly context-dependent tokens can’t be distinguished in the grammar alone, there is simply no context information for that.

The grammar was built for a use case where comments can’t be thrown away in the lexing phase, so it includes them. The version without comments would be simpler, but not by much.

The grammar includes a number of similarly looking sets of productions, which is another consequence of its dependency on context. There are subtle differences as to what can be used where, but building blocks are the same, so you get several similar productions which are comprised of different subsets of those building blocks. Without this duplication, the grammar would be even smaller (yes, it’s relatively compact).

There is no list of hard-coded builtin functions, which might be of interest to someone. And some things could probably be parsed on a more fine-grained level. Things between define and endef are parsed as words instead of as rules, which can be fixed at a cost of switching to a GLR parser. Other than that, it’s a reasonable grammar which was successfully tested on some rather complicated makefiles. However, it doesn’t cover at least weird variable names like:

ifeq(bla) = 10
ifeq$(bla) = 10

and conditionals that only partially lay inside recipes. It probably doesn’t parse some constructs perfectly well (i.e., they are parsed but it takes them for something else).

All in all, it fits the purpose for me, but probably can’t be used to build a valid AST for a makefile.

Structure of the grammar

It’s not hard to guess that there will be assignments, rules and recipes, but more basal elements of all of them are:

  • whitespace
  • non-whitespace
  • expansions (variables or functions)
  • expressions: sequences of expansions and non-whitespace

It is these basic elements which are hard to get right, because they behave differently in different contexts. For example:

  • adjacent non-whitespace can sometimes include reserved keywords
  • leading tabulation has special meaning only in recipes
  • whitespace is needed before first argument of a function call unless the argument is empty

This leads to already mentioned duplication which takes form of five kinds of expressions (listed from most to least constrained):

  1. expressions – expressions in include, conditionals, first-level arguments
  2. exprs_nested – expressions nested within parenthesis of expressions such that they can include comma (,) without breaking list of arguments
  3. exprs_in_def – expression inside define statements
  4. exprs_in_assign – expressions in right-hand side of assignments
  5. exprs_in_recipe – expressions in recipes

The most complicated part of the grammar is probably specification of expressions as interleaving verbatim text and expansions without unnecessary conflicts. If you want to understand how expressions are done, look at exprs_in_recipe first, as it’s the most straight forward one.

Whitespace is necessarily part of the grammar:

  • new lines have to be explicitly encoded
  • leading tabs are a separate token, which includes new line that it follows

The grammar

/* Assign lower precedence to NL. */
%precedence NL
%precedence COMMENT "ifdef" "ifndef" "ifeq" "ifneq"

  1  makefile: statements "end of file"
  2          | "end of file"

  3  statements: br
  4            | statement
  5            | statements br
  6            | statements statement

  7  conditional: if_eq_kw condition statements_opt "endif" comment_opt br
  8             | if_eq_kw condition statements_opt "else" statements_opt "endif" comment_opt br
  9             | if_eq_kw condition statements_opt "else" conditional
 10             | if_def_kw identifier statements_opt "endif" comment_opt br
 11             | if_def_kw identifier statements_opt "else" statements_opt "endif" comment_opt br
 12             | if_def_kw identifier statements_opt "else" conditional

 13  conditional_in_recipe: if_eq_kw condition recipes_opt "endif" comment_opt
 14                       | if_eq_kw condition recipes_opt "else" recipes_opt "endif" comment_opt
 15                       | if_eq_kw condition recipes_opt "else" conditional_in_recipe
 16                       | if_def_kw identifier recipes_opt "endif" comment_opt
 17                       | if_def_kw identifier recipes_opt "else" recipes_opt "endif" comment_opt
 18                       | if_def_kw identifier recipes_opt "else" conditional_in_recipe

 19  condition: '(' expressions_opt ',' expressions_opt ')'
 20           | SLIT SLIT

 21  define: "define" pattern definition "endef" br
 22        | specifiers "define" pattern definition "endef" br
 23        | "define" pattern ASSIGN_OP definition "endef" br
 24        | specifiers "define" pattern ASSIGN_OP definition "endef" br

 25  definition: comment_opt br
 26            | comment_opt br exprs_in_def br

 27  include: "include" expressions br

 28  statements_opt: comment_opt br
 29                | comment_opt br statements

 30  if_def_kw: "ifdef"
 31           | "ifndef"

 32  if_eq_kw: "ifeq"
 33          | "ifneq"

 34  statement: COMMENT
 35           | assignment br
 36           | function br
 37           | rule
 38           | conditional
 39           | define
 40           | include
 41           | export br

 42  export: "export"
 43        | "unexport"
 44        | assignment_prefix
 45        | assignment_prefix WS targets

 46  assignment: pattern ASSIGN_OP comment_opt
 47            | pattern ASSIGN_OP exprs_in_assign comment_opt
 48            | assignment_prefix ASSIGN_OP comment_opt
 49            | assignment_prefix ASSIGN_OP exprs_in_assign comment_opt

 50  assignment_prefix: specifiers pattern

 51  specifiers: "override"
 52            | "export"
 53            | "unexport"
 54            | "override" "export"
 55            | "export" "override"
 56            | "undefine"
 57            | "override" "undefine"
 58            | "undefine" "override"

 59  expressions_opt: %empty
 60                 | expressions

 61  expressions: expression
 62             | expressions WS expression

 63  exprs_nested: expr_nested
 64              | exprs_nested WS expr_nested

 65  exprs_in_assign: expr_in_assign
 66                 | exprs_in_assign WS expr_in_assign

 67  exprs_in_def: first_expr_in_def
 68              | br
 69              | br first_expr_in_def
 70              | exprs_in_def br
 71              | exprs_in_def WS expr_in_recipe
 72              | exprs_in_def br first_expr_in_def

 73  first_expr_in_def: char_in_def expr_in_recipe
 74                   | function expr_in_recipe
 75                   | char_in_def
 76                   | function

 77  exprs_in_recipe: expr_in_recipe
 78                 | exprs_in_recipe WS expr_in_recipe

 79  expression: expression_text
 80            | expression_function

 81  expr_nested: expr_text_nested
 82             | expr_func_nested

 83  expr_in_assign: expr_text_in_assign
 84                | expr_func_in_assign

 85  expr_in_recipe: expr_text_in_recipe
 86                | expr_func_in_recipe

 87  expression_text: text
 88                 | expression_function text

 89  expr_text_nested: text_nested
 90                  | expr_func_nested text_nested

 91  expr_text_in_assign: text_in_assign
 92                     | expr_func_in_assign text_in_assign

 93  expr_text_in_recipe: text_in_recipe
 94                     | expr_func_in_recipe text_in_recipe

 95  expression_function: function
 96                     | '(' exprs_nested ')'
 97                     | expression_text function
 98                     | expression_function function

 99  expr_func_nested: function
100                  | '(' exprs_nested ')'
101                  | expr_func_nested function
102                  | expr_text_nested function

103  expr_func_in_assign: function
104                     | expr_func_in_assign function
105                     | expr_text_in_assign function

106  expr_func_in_recipe: function
107                     | expr_func_in_recipe function
108                     | expr_text_in_recipe function

109  function: VAR
110          | "$(" function_name ")"
111          | "$(" function_name WS arguments ")"
112          | "$(" function_name ',' arguments ")"
113          | "$(" function_name ':' expressions ")"
114          | "$(" function_name ASSIGN_OP expressions ")"

115  function_name: function_name_text
116               | function_name_function

117  function_name_text: function_name_piece
118                    | function_name_function function_name_piece

119  function_name_piece: CHARS
120                     | function_name_piece CHARS

121  function_name_function: function
122                        | function_name_text function

123  arguments: %empty
124           | argument
125           | arguments ','
126           | arguments ',' argument

127  argument: expressions

128  rule: targets colon prerequisites NL
129      | targets colon prerequisites recipes NL
130      | targets colon assignment NL

131  target: pattern

132  pattern: pattern_text
133         | pattern_function

134  pattern_text: identifier
135              | pattern_function identifier

136  pattern_function: function
137                  | pattern_text function
138                  | pattern_function function

139  prerequisites: %empty
140               | targets

141  targets: target
142         | targets WS target

143  recipes: recipe
144         | recipes recipe

145  recipes_opt: comment_opt NL
146             | comment_opt recipes NL

147  recipe: LEADING_TAB exprs_in_recipe
148        | NL conditional_in_recipe
149        | NL COMMENT

150  identifier: CHARS
151            | ','
152            | '('
153            | ')'
154            | identifier CHARS
155            | identifier keywords
156            | identifier ','
157            | identifier '('
158            | identifier ')'

159  text: char
160      | text char

161  text_nested: char_nested
162             | text_nested char_nested

163  text_in_assign: char_in_assign
164                | text_in_assign char_in_assign

165  text_in_recipe: char_in_recipe
166                | text_in_recipe char_in_recipe

167  char: CHARS
168      | SLIT
169      | ASSIGN_OP
170      | ':'

171  char_nested: char
172             | ','

173  char_in_assign: char_nested
174                | '('
175                | ')'
176                | keywords

177  char_in_def: char
178             | '('
179             | ')'
180             | ','
181             | COMMENT
182             | "include"
183             | "override"
184             | "export"
185             | "unexport"
186             | "ifdef"
187             | "ifndef"
188             | "ifeq"
189             | "ifneq"
190             | "else"
191             | "endif"
192             | "define"
193             | "undefine"

194  char_in_recipe: char_in_assign
195                | COMMENT

196  keywords: "include"
197          | "override"
198          | "export"
199          | "unexport"
200          | "ifdef"
201          | "ifndef"
202          | "ifeq"
203          | "ifneq"
204          | "else"
205          | "endif"
206          | "define"
207          | "endef"
208          | "undefine"

209  br: NL
210    | LEADING_TAB

211  colon: ':'
212       | ':' ':'

213  comment_opt: %empty
214             | COMMENT

The lexer

Symbols:

  • All in ' and almost all in " quotes are literals.
  • ')' ::= <unpaired )>
  • '}' ::= <unpaired }>
  • "$(" ::= "$(" | "${" – beginning of an expansion
  • ")" ::= ")" | "}" – ending of an expansion
  • "end of file" ::= <end of file>
  • COMMENT ::= <# comment (can be multiline)>
  • ASSIGN_OP ::= "=" | "?=" | ":=" | "::=" | "+=" | "!="
  • CHARS ::= <sequence of non-whitespace>
  • WS ::= <sequence of whitespace>
  • NL ::= "\n" | "\r" | "\r\n"
  • VAR ::= /\$./
  • SLIT ::= <single- or double-quote literal>
  • LEADING_TAB ::= <tabulation at the first position in a line (eats NL)>

Whitespace handling is tricky: lexer actually ignores it unless previously returned token needs that whitespace after it on the parser’s side.

Unpaired closing parts of expansions () or }) are just themselves. While contents of expansions is never a keyword.

The implementation

As mentioned above, the grammar was done for zograscope, where the implementation can be found: lexer and parser.

You should be warned that coming up with this thing wasn’t easy and code reflects it by not being particularly pretty.