For the purpose of making zograscope support parsing of more than just one language, there was a need to pick a language that is substantially different from C (the first one supported). GNU Make language certainly looked like an interesting option right from the start and it’s a tool that I use regularly. However, finding a grammar to implement it turned out to be a problem. Since grammar wasn’t found, it was created.
On this post
Writing it is motivated by a couple of reasons:
- to document what was done
- to provide a starting point for anyone else willing to develop a grammar for GNU Make (or maybe other derivatives of make)
The description is not terribly detailed, but should provide enough information to be able to understand implementation of the grammar.
Also note that this is written a year past the events, so there might be some inaccuracies or omissions, which should be resolved in favour of code (links are at the bottom).
What was found
Initial search didn’t really provide much, but there are some things worth mentioning.
While GNU Make’s documentation does explain syntax and provide examples, it doesn’t define syntax in any rigorous way. For example of how far it goes one might take a look at 7.2 Syntax of Conditionals section:
conditional-directive text-if-true else text-if-false endif
Turns out that some people mistakenly call syntax highlighting description of GNU Make’s language its “grammar”, making it harder to look for a real grammar.
The only actual attempt to come up with the grammar is this thing for something called grammarjs, which is now abandoned. It looked like it could serve as a basis for grammar, but that’s not the case. It’s unfinished and is overly simplistic. It can handle only a couple of example makefiles, but is inadequate otherwise.
People actually tried to find a grammar before, but the answer was that:
It would be tricky to write a grammar for make, since the grammar is extremely context-dependent.
Which is a quote from GNU Make’s maintainer.
What was done
The result turned out to be better than anticipated. Why? Because it’s an LR grammar with (effectively) no conflicts! The “(effectively)” part means that there are no reduce/reduce conflicts (this would require GLR parser to handle) and shift/reduce conflicts are resolved by assigning precedence to some tokens. The trick for achieving this is to do some extra stuff in the lexer, which is actually a requirement as highly context-dependent tokens can’t be distinguished in the grammar alone, there is simply no context information for that.
The grammar was built for a use case where comments can’t be thrown away in the lexing phase, so it includes them. The version without comments would be simpler, but not by much.
The grammar includes a number of similarly looking sets of productions, which is another consequence of its dependency on context. There are subtle differences as to what can be used where, but building blocks are the same, so you get several similar productions which are comprised of different subsets of those building blocks. Without this duplication, the grammar would be even smaller (yes, it’s relatively compact).
There is no list of hard-coded builtin functions, which might be of interest to
someone. And some things could probably be parsed on a more fine-grained level.
Things between define and endef are parsed as words instead of as rules,
which can be fixed at a cost of switching to a GLR parser. Other than that,
it’s a reasonable grammar which was successfully tested on some rather
complicated makefiles. However, it doesn’t cover at least weird variable
names like:
ifeq(bla) = 10 ifeq$(bla) = 10
and conditionals that only partially lay inside recipes. It probably doesn’t parse some constructs perfectly well (i.e., they are parsed but it takes them for something else).
All in all, it fits the purpose for me, but probably can’t be used to build a valid AST for a makefile.
Structure of the grammar
It’s not hard to guess that there will be assignments, rules and recipes, but more basal elements of all of them are:
- whitespace
- non-whitespace
- expansions (variables or functions)
- expressions: sequences of expansions and non-whitespace
It is these basic elements which are hard to get right, because they behave differently in different contexts. For example:
- adjacent non-whitespace can sometimes include reserved keywords
- leading tabulation has special meaning only in recipes
- whitespace is needed before first argument of a function call unless the argument is empty
This leads to already mentioned duplication which takes form of five kinds of expressions (listed from most to least constrained):
expressions– expressions ininclude, conditionals, first-level argumentsexprs_nested– expressions nested within parenthesis ofexpressionssuch that they can include comma (,) without breaking list of argumentsexprs_in_def– expression insidedefinestatementsexprs_in_assign– expressions in right-hand side of assignmentsexprs_in_recipe– expressions in recipes
The most complicated part of the grammar is probably specification of
expressions as interleaving verbatim text and expansions without unnecessary
conflicts. If you want to understand how expressions are done, look at
exprs_in_recipe first, as it’s the most straight forward one.
Whitespace is necessarily part of the grammar:
- new lines have to be explicitly encoded
- leading tabs are a separate token, which includes new line that it follows
The grammar
/* Assign lower precedence to NL. */
%precedence NL
%precedence COMMENT "ifdef" "ifndef" "ifeq" "ifneq"
1 makefile: statements "end of file"
2 | "end of file"
3 statements: br
4 | statement
5 | statements br
6 | statements statement
7 conditional: if_eq_kw condition statements_opt "endif" comment_opt br
8 | if_eq_kw condition statements_opt "else" statements_opt "endif" comment_opt br
9 | if_eq_kw condition statements_opt "else" conditional
10 | if_def_kw identifier statements_opt "endif" comment_opt br
11 | if_def_kw identifier statements_opt "else" statements_opt "endif" comment_opt br
12 | if_def_kw identifier statements_opt "else" conditional
13 conditional_in_recipe: if_eq_kw condition recipes_opt "endif" comment_opt
14 | if_eq_kw condition recipes_opt "else" recipes_opt "endif" comment_opt
15 | if_eq_kw condition recipes_opt "else" conditional_in_recipe
16 | if_def_kw identifier recipes_opt "endif" comment_opt
17 | if_def_kw identifier recipes_opt "else" recipes_opt "endif" comment_opt
18 | if_def_kw identifier recipes_opt "else" conditional_in_recipe
19 condition: '(' expressions_opt ',' expressions_opt ')'
20 | SLIT SLIT
21 define: "define" pattern definition "endef" br
22 | specifiers "define" pattern definition "endef" br
23 | "define" pattern ASSIGN_OP definition "endef" br
24 | specifiers "define" pattern ASSIGN_OP definition "endef" br
25 definition: comment_opt br
26 | comment_opt br exprs_in_def br
27 include: "include" expressions br
28 statements_opt: comment_opt br
29 | comment_opt br statements
30 if_def_kw: "ifdef"
31 | "ifndef"
32 if_eq_kw: "ifeq"
33 | "ifneq"
34 statement: COMMENT
35 | assignment br
36 | function br
37 | rule
38 | conditional
39 | define
40 | include
41 | export br
42 export: "export"
43 | "unexport"
44 | assignment_prefix
45 | assignment_prefix WS targets
46 assignment: pattern ASSIGN_OP comment_opt
47 | pattern ASSIGN_OP exprs_in_assign comment_opt
48 | assignment_prefix ASSIGN_OP comment_opt
49 | assignment_prefix ASSIGN_OP exprs_in_assign comment_opt
50 assignment_prefix: specifiers pattern
51 specifiers: "override"
52 | "export"
53 | "unexport"
54 | "override" "export"
55 | "export" "override"
56 | "undefine"
57 | "override" "undefine"
58 | "undefine" "override"
59 expressions_opt: %empty
60 | expressions
61 expressions: expression
62 | expressions WS expression
63 exprs_nested: expr_nested
64 | exprs_nested WS expr_nested
65 exprs_in_assign: expr_in_assign
66 | exprs_in_assign WS expr_in_assign
67 exprs_in_def: first_expr_in_def
68 | br
69 | br first_expr_in_def
70 | exprs_in_def br
71 | exprs_in_def WS expr_in_recipe
72 | exprs_in_def br first_expr_in_def
73 first_expr_in_def: char_in_def expr_in_recipe
74 | function expr_in_recipe
75 | char_in_def
76 | function
77 exprs_in_recipe: expr_in_recipe
78 | exprs_in_recipe WS expr_in_recipe
79 expression: expression_text
80 | expression_function
81 expr_nested: expr_text_nested
82 | expr_func_nested
83 expr_in_assign: expr_text_in_assign
84 | expr_func_in_assign
85 expr_in_recipe: expr_text_in_recipe
86 | expr_func_in_recipe
87 expression_text: text
88 | expression_function text
89 expr_text_nested: text_nested
90 | expr_func_nested text_nested
91 expr_text_in_assign: text_in_assign
92 | expr_func_in_assign text_in_assign
93 expr_text_in_recipe: text_in_recipe
94 | expr_func_in_recipe text_in_recipe
95 expression_function: function
96 | '(' exprs_nested ')'
97 | expression_text function
98 | expression_function function
99 expr_func_nested: function
100 | '(' exprs_nested ')'
101 | expr_func_nested function
102 | expr_text_nested function
103 expr_func_in_assign: function
104 | expr_func_in_assign function
105 | expr_text_in_assign function
106 expr_func_in_recipe: function
107 | expr_func_in_recipe function
108 | expr_text_in_recipe function
109 function: VAR
110 | "$(" function_name ")"
111 | "$(" function_name WS arguments ")"
112 | "$(" function_name ',' arguments ")"
113 | "$(" function_name ':' expressions ")"
114 | "$(" function_name ASSIGN_OP expressions ")"
115 function_name: function_name_text
116 | function_name_function
117 function_name_text: function_name_piece
118 | function_name_function function_name_piece
119 function_name_piece: CHARS
120 | function_name_piece CHARS
121 function_name_function: function
122 | function_name_text function
123 arguments: %empty
124 | argument
125 | arguments ','
126 | arguments ',' argument
127 argument: expressions
128 rule: targets colon prerequisites NL
129 | targets colon prerequisites recipes NL
130 | targets colon assignment NL
131 target: pattern
132 pattern: pattern_text
133 | pattern_function
134 pattern_text: identifier
135 | pattern_function identifier
136 pattern_function: function
137 | pattern_text function
138 | pattern_function function
139 prerequisites: %empty
140 | targets
141 targets: target
142 | targets WS target
143 recipes: recipe
144 | recipes recipe
145 recipes_opt: comment_opt NL
146 | comment_opt recipes NL
147 recipe: LEADING_TAB exprs_in_recipe
148 | NL conditional_in_recipe
149 | NL COMMENT
150 identifier: CHARS
151 | ','
152 | '('
153 | ')'
154 | identifier CHARS
155 | identifier keywords
156 | identifier ','
157 | identifier '('
158 | identifier ')'
159 text: char
160 | text char
161 text_nested: char_nested
162 | text_nested char_nested
163 text_in_assign: char_in_assign
164 | text_in_assign char_in_assign
165 text_in_recipe: char_in_recipe
166 | text_in_recipe char_in_recipe
167 char: CHARS
168 | SLIT
169 | ASSIGN_OP
170 | ':'
171 char_nested: char
172 | ','
173 char_in_assign: char_nested
174 | '('
175 | ')'
176 | keywords
177 char_in_def: char
178 | '('
179 | ')'
180 | ','
181 | COMMENT
182 | "include"
183 | "override"
184 | "export"
185 | "unexport"
186 | "ifdef"
187 | "ifndef"
188 | "ifeq"
189 | "ifneq"
190 | "else"
191 | "endif"
192 | "define"
193 | "undefine"
194 char_in_recipe: char_in_assign
195 | COMMENT
196 keywords: "include"
197 | "override"
198 | "export"
199 | "unexport"
200 | "ifdef"
201 | "ifndef"
202 | "ifeq"
203 | "ifneq"
204 | "else"
205 | "endif"
206 | "define"
207 | "endef"
208 | "undefine"
209 br: NL
210 | LEADING_TAB
211 colon: ':'
212 | ':' ':'
213 comment_opt: %empty
214 | COMMENT
The lexer
Symbols:
- All in
'and almost all in"quotes are literals. ')' ::= <unpaired )>'}' ::= <unpaired }>"$(" ::= "$(" | "${"– beginning of an expansion")" ::= ")" | "}"– ending of an expansion"end of file" ::= <end of file>COMMENT ::= <# comment (can be multiline)>ASSIGN_OP ::= "=" | "?=" | ":=" | "::=" | "+=" | "!="CHARS ::= <sequence of non-whitespace>WS ::= <sequence of whitespace>NL ::= "\n" | "\r" | "\r\n"VAR ::= /\$./SLIT ::= <single- or double-quote literal>LEADING_TAB ::= <tabulation at the first position in a line (eats NL)>
Whitespace handling is tricky: lexer actually ignores it unless previously returned token needs that whitespace after it on the parser’s side.
Unpaired closing parts of expansions () or }) are just themselves.
While contents of expansions is never a keyword.
The implementation
As mentioned above, the grammar was done for zograscope, where the implementation can be found: lexer and parser.
You should be warned that coming up with this thing wasn’t easy and code reflects it by not being particularly pretty.