HTML parsing in Elixir with leex and yecc

If you had the novel idea to try to implement a text-based “toy” web browser, where would you start? I asked myself this question in late 2016, and have often regretted it since.

A responsible programmer finding themselves in this position might first ask which technologies would be a good fit for the task. They might look for a language with a great HTTP package, or a proven library for working with TUIs, for instance. I, however, picked Elixir.¹

Having made this decision, the next question is perhaps how to go from a server’s HTML response to rendering something in your terminal. This is a pretty big question. After breaking down further you might decide that step 1 is to parse HTML into a data structure you can more easily work with.

This is the topic of this post.²

HTML parsers

HTML parsing is hopefully a solved problem, but as this is supposed to be an exercise in using Elixir, it seems like it would be more fun to explore writing the parser ourselves, instead of using a pre-made solution.

The kind folks at the WHATWG provide an authoritative spec for how HTML parsers should function, which seems like a good first stop. They even include this handy diagram, giving an overview of what the spec does and does not cover:

[Diagram from the WHATWG spec giving an overview of the HTML parsing model]

For a toy browser it might not be essential to follow the spec to the letter, but it seems worth a look. Delving in a little, under the tokenization section we find something that seems useful:

The output of the tokenization step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. DOCTYPE tokens have a name, a public identifier, a system identifier, and a force-quirks flag…

Only six tokens and we’ll be on our way to a fully-fledged HTML parser. How hard can it be?! Time to see how this plays out.

Parsing with Elixir

Elixir is a fairly new language, but as it leverages the Erlang VM we benefit from being able to hook into all of the tools the Erlang ecosystem provides. The good news here is that Erlang provides some great tooling for writing tokenizers (aka lexers) and parsers: leex, and yecc, respectively.
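
As a quick aside, calling into Erlang from Elixir is painless, as Erlang modules are just atoms on the Elixir side. A trivial (and purely illustrative) example using Erlang’s :math module:

iex> :math.pow(2, 10)
1024.0

The :html_lexer and :html_parser modules we generate below get called in exactly the same way.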

I’m going to give a quick overview of how I approached parsing HTML with these tools, but for a deeper run-through of their use you might try one of the posts referenced in the footnotes.³

Tokenizing with leex

Rather than inflict a great deal of frustration upon myself by attempting to tokenize any old HTML found out in the wild, I decided to keep things simple. As a starting point, I decided to look at parsing a relatively simple subset of HTML:

<html>
  <head>
    <title>Hello, world</title>
  </head>
  <body>
    <h1>Hello world</h1>
    <p>Pls <em>parse</em> me.</p>
  </body>
</html>

This input doesn’t include anything which ought to be tokenized as DOCTYPE or comment, so for now we’ll ignore those. I don’t think I care much about end-of-file either, so I’ll focus on start tag, end tag, and character.
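
To make the target concrete, here’s a rough sketch of the token shapes we’ll be emitting; the module and typespec are hypothetical, but the {name, line, matched_characters} tuple convention is leex’s, which we’ll rely on below:

defmodule BrowsEx.Tokens do
  # Hypothetical typespec for our reduced token set; leex emits each
  # token as a {name, line_number, matched_characters} tuple.
  @type t ::
          {:start_tag, pos_integer, charlist}
          | {:end_tag, pos_integer, charlist}
          | {:char, pos_integer, charlist}
end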

Leex allows us to tokenize an input by specifying Definitions and Rules, which state how to identify tokens, and what to spit them out as. We use regular expressions for the identification step (eek), then specify how each token should be emitted with an Erlang expression.

Here’s my initial attempt:

Definitions.

START_TAG = \<[A-Za-z0-9\w\s]+\>
END_TAG   = \<\/[A-Za-z0-9\w\s]+\>
CHARACTER = [A-Za-z0-9\w\s\.\,\']
NEW_LINE  = [\n]

Rules.

{START_TAG} : {token, {start_tag, TokenLine, TokenChars}}.
{END_TAG}   : {token, {end_tag, TokenLine, TokenChars}}.
{CHARACTER} : {token, {char, TokenLine, TokenChars}}.
{NEW_LINE}  : skip_token.

Erlang code.

The Definitions section sets up regular expressions which identify tokens in the input. For instance, a start tag consists of a < followed by one or more letters, numbers, or whitespace characters, followed by a >. An end tag is the same, but has a / immediately following the opening <.

Next, the Rules section specifies what to output when matching our definitions. For START_TAG, END_TAG and CHARACTER we return a tuple containing the name of the token, the line number, and the characters we matched. We skip NEW_LINE tokens, as we have no further need for them.

The Erlang code section is also required, although we don’t use it here. We’ll explore it further when looking at yecc.
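
Incidentally, we don’t strictly need build tooling to try this out; we can generate and compile the scanner by hand. A hedged sketch from iex, assuming the definition above is saved as src/html_lexer.xrl:

iex> :leex.file('src/html_lexer.xrl')
{:ok, 'src/html_lexer.erl'}
iex> c("src/html_lexer.erl")
[:html_lexer]

That said, mix can automate this for us, as we’ll see next.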

Calling leex from Elixir

Elixir’s build tool, mix, can handle compilation of our leex and yecc definitions, as long as we save them into a src/ dir in our project root. If we do this, then create a basic Parser module in lib, we can call out to our lexer from Elixir:⁴

defmodule BrowsEx.Parser do
  @doc """
  Attempts to tokenize an input string to start_tag, end_tag, and char
  """
  @spec parse(binary) :: list
  def parse(input) do
    {:ok, tokens, _} = input |> to_char_list |> :html_lexer.string
    tokens
  end
end

Now we can call this from iex:

iex(1)> html="""
    <html>
      <head>
        <title>Hello, world</title>
      </head>
      <body>
        <h1>Hello world</h1>
        <p>Pls <em>parse</em> me.</p>
      </body>
    </html>
...(2)>"""
"    <html>\n      <head>\n        <title>Hello, world</title>\n      </head>\n      <body>\n        <h1>Hello world</h1>\n        <p>Pls <em>parse</em> me.</p>\n      </body>\n    </html>\n"
iex(3)> BrowsEx.Parser.parse(html)
[{:start_tag, 1, '<html>'}, {:start_tag, 2, '<head>'},
 {:start_tag, 3, '<title>'}, {:char, 3, 'H'}, {:char, 3, 'e'}, {:char, 3, 'l'},
 {:char, 3, 'l'}, {:char, 3, 'o'}, {:char, 3, ','}, {:char, 3, ' '},
 {:char, 3, 'w'}, {:char, 3, 'o'}, {:char, 3, 'r'}, {:char, 3, 'l'},
 {:char, 3, 'd'}, {:end_tag, 3, '</title>'}, {:end_tag, 4, '</head>'},
 {:start_tag, 5, '<body>'}, {:start_tag, 6, '<h1>'}, {:char, 6, 'H'},
 {:char, 6, 'e'}, {:char, 6, 'l'}, {:char, 6, 'l'}, {:char, 6, 'o'},
 ...]

This is not pretty, but it has worked. We can see that we’ve successfully identified the start and end tag tokens, and categorized everything else as a char.
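
If we wanted the raw token stream to be easier on the eyes before reaching for a real parser, we could merge runs of :char tokens. A hedged sketch (the :chars token name is invented purely for illustration):

tokens = BrowsEx.Parser.parse(html)

tokens
|> Enum.chunk_by(fn {type, _line, _chars} -> type end)
|> Enum.flat_map(fn
  # Merge a run of single characters into one string token,
  # keeping the line number of the first character
  [{:char, line, _} | _] = run ->
    [{:chars, line, run |> Enum.map(&elem(&1, 2)) |> List.to_string()}]

  run ->
    run
end)
#=> [{:start_tag, 1, '<html>'}, {:start_tag, 2, '<head>'},
#    {:start_tag, 3, '<title>'}, {:chars, 3, "Hello, world"}, ...]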

Parsing with yecc

Looking back at the WHATWG spec, we can see that the next step after tokenization is the tree construction phase. Now that we have a basic tokenized form of our input, we can look to parse it into a tree structure, making use of yecc to do so.⁵

Yecc takes in our list of tokens, processing them based upon rules we define. The first step is to specify Terminals and Nonterminals. In this case Terminals are the tokens we have just produced, which do not expand out to other categories. Nonterminals, on the other hand, do expand to other categories (until they have been reduced down to their base Terminals). We also need to specify a Rootsymbol, representing the root of our grammar.

So in our case, we could have:

  • Terminals for the previously emitted start_tag, end_tag and char tokens
  • Nonterminals being things constructed of our Terminals (or further Nonterminals!), which define the structure of our parsed output. I picture these as being either a tag, some tag_contents, or chars.
  • Rootsymbol which is going to be a tag. Perhaps the <html> tag?
Terminals start_tag end_tag char.
Nonterminals tag tag_contents chars.
Rootsymbol tag.

The next step is to define rules for what makes up each Nonterminal, and specify what the parser should do when encountering them:

Terminals start_tag end_tag char.
Nonterminals tag tag_contents chars.
Rootsymbol tag.

tag -> start_tag end_tag : {'$1', []}.
tag -> start_tag tag_contents end_tag : {'$1', '$2'}.

tag_contents -> tag : ['$1'].
tag_contents -> tag tag_contents : ['$1'|'$2'].
tag_contents -> chars tag_contents : ['$1'|'$2'].
tag_contents -> chars : ['$1'].

chars -> char chars : ['$1'|'$2'].
chars -> char : '$1'.

Line by line then, this states:

tags are either made up of:

  • a start_tag followed by an end_tag, or
  • a start_tag followed by some tag_contents followed by an end_tag

So what’s tag_contents? Well, it could be:

  • a tag
  • a tag followed by some more tag_contents
  • some chars followed by some more tag_contents
  • some chars only

Finally, we need to define what chars is:

  • a single char
  • a char followed by chars

Each rule also specifies what to output, through the Erlang expression following the :. For instance, we specify that tags should be output as tuples in the format {start_tag, []} when the tag is empty, or {start_tag, tag_contents} otherwise. Tag contents, on the other hand, are output as a list.
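
To see how this plays out, here’s a worked trace of how the tokens for <p>hi</p> would reduce under these rules. Note the improper list produced by the chars rules, which we’ll meet again shortly:

# Tokens in:
#   [{:start_tag, 1, '<p>'}, {:char, 1, 'h'}, {:char, 1, 'i'}, {:end_tag, 1, '</p>'}]
#
# chars        -> char       : {:char, 1, 'i'}
# chars        -> char chars : [{:char, 1, 'h'} | {:char, 1, 'i'}]   (improper list!)
# tag_contents -> chars      : [[{:char, 1, 'h'} | {:char, 1, 'i'}]]
# tag          -> start_tag tag_contents end_tag
#                            : {{:start_tag, 1, '<p>'}, [[{:char, 1, 'h'} | {:char, 1, 'i'}]]}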

We can call this from Elixir with a tweak to our Parser module:

defmodule BrowsEx.Parser do
  @spec parse(binary) :: list
  def parse(input) do
    {:ok, tokens, _} = input |> to_char_list |> :html_lexer.string
-   tokens
+   {:ok, list} = tokens |> :html_parser.parse
+   list
  end
end

And calling this now:

iex(2)> BrowsEx.Parser.parse(html)
{{:start_tag, 1, '<html>'},
 [{{:start_tag, 2, '<head>'},
   [{{:start_tag, 3, '<title>'},
     [[{:char, 3, 'H'}, {:char, 3, 'e'}, {:char, 3, 'l'}, {:char, 3, 'l'},
       {:char, 3, 'o'}, {:char, 3, ','}, {:char, 3, ' '}, {:char, 3, 'w'},
       {:char, 3, 'o'}, {:char, 3, 'r'}, {:char, 3, 'l'} | {:char, 3, 'd'}]]}]},
  {{:start_tag, 5, '<body>'},
   [{{:start_tag, 6, '<h1>'},
     [[{:char, 6, 'H'}, {:char, 6, 'e'}, {:char, 6, 'l'}, {:char, 6, 'l'},
       {:char, 6, 'o'}, {:char, 6, ' '}, {:char, 6, 'w'}, {:char, 6, 'o'},
       {:char, 6, 'r'}, {:char, 6, 'l'} | {:char, 6, 'd'}]]},
    {{:start_tag, 7, '<p>'},
     [[{:char, 7, 'P'}, {:char, 7, 'l'}, {:char, 7, 's'} | {:char, 7, ' '}],
      {{:start_tag, 7, '<em>'},
       [[{:char, 7, 'p'}, {:char, 7, 'a'}, {:char, 7, 'r'}, {:char, 7, 's'} |
         {:char, 7, 'e'}]]},
      [{:char, 7, ' '}, {:char, 7, 'm'}, {:char, 7, 'e'} |
       {:char, 7, '.'}]]}]}]}

So that’s almost impossible to read, but it seems to have done something. Nodes are represented as {{token, line_number, name}, [children]}, where children can be made up of more nodes, chars, or a mix of the two.

We can clean this up further:

  • The token types are redundant at this stage; they are merely an artefact of the tree-building process.
  • It would be nice if we could concat the chars into strings.

To do this we can make use of the Erlang code section of our parser generator, where we define a function to strip the data we no longer want. We can also use Erlang’s unicode module to transform lists of chars into a binary:

Terminals start_tag end_tag char.
Nonterminals tag tag_contents chars.
Rootsymbol tag.

- tag -> start_tag end_tag : {'$1', []}.
+ tag -> start_tag end_tag : {get_value('$1'), []}.
- tag -> start_tag tag_contents end_tag : {'$1', '$2'}.
+ tag -> start_tag tag_contents end_tag : {get_value('$1'), '$2'}.

tag_contents -> tag : ['$1'].
tag_contents -> tag tag_contents : ['$1'|'$2'].
tag_contents -> chars tag_contents : ['$1'|'$2'].
tag_contents -> chars : ['$1'].

- chars -> char chars : ['$1'|'$2'].
+ chars -> char chars : unicode:characters_to_binary([get_value('$1')|'$2']).
- chars -> char : '$1'.
+ chars -> char : get_value('$1').
+ 
+ Erlang code.
+ 
+ get_value({_,_,Value}) -> Value.
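
As a quick sanity check, :unicode.characters_to_binary/1 is happy to take the mixed chardata these rules build up, including improper lists with a binary tail:

iex> :unicode.characters_to_binary('Hello')
"Hello"
iex> :unicode.characters_to_binary(['Hel' | "lo"])
"Hello"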

And what do we get now?

iex(2)> BrowsEx.Parser.parse(html)
{'<html>',
 [{'<head>', [{'<title>', ["Hello, world"]}]},
  {'<body>',
   [{'<h1>', ["Hello world"]},
    {'<p>', ["Pls ", {'<em>', ["parse"]}, " me."]}]}]}

It works! We are successfully parsing HTML into a tree data structure.
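
With a tree in hand we can already do useful things with it. Here’s a hedged sketch (module and function names are hypothetical) of a helper which walks the {tag, children} tuples and extracts just the text, a small first step towards rendering:

defmodule BrowsEx.Text do
  # A node is either a {tag, children} tuple or a binary of text
  def extract({_tag, children}), do: children |> Enum.map(&extract/1) |> Enum.join()
  def extract(text) when is_binary(text), do: text
end

iex> html |> BrowsEx.Parser.parse() |> BrowsEx.Text.extract()
"Hello, worldHello worldPls parse me."

(Whitespace handling clearly needs more thought, but it’s a start.)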

Feeling confident, we update the test html to include a link:

iex(1)> html="""
    <html>
      <head>
        <title>Hello, world</title>
      </head>
      <body>
        <h1>Hello world</h1>
        <p>Pls <em>parse</em> me <a src="/">link</a>.</p>
      </body>
    </html>
...(2)>"""
iex(3)> BrowsEx.Parser.parse(html)
** (MatchError) no match of right hand side value: {:error, {7, :html_lexer, {:illegal, '<a src='}}, 7}

Somewhat unsurprisingly, it breaks. We can see that our lexer raises an :illegal error when it encounters '<a src='. In fact, it is falling over precisely because of the equals sign.
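
As an aside, the MatchError here comes from the {:ok, tokens, _} pattern in our Parser module; a sketch of a more forgiving version, assuming the same generated :html_lexer and :html_parser modules:

defmodule BrowsEx.Parser do
  @spec parse(binary) :: {:ok, term} | {:error, term}
  def parse(input) do
    case input |> to_char_list |> :html_lexer.string() do
      {:ok, tokens, _end_line} ->
        # yecc-generated parsers return {:ok, result} or {:error, reason}
        :html_parser.parse(tokens)

      {:error, error_info, _end_line} ->
        {:error, {:lex_error, error_info}}
    end
  end
end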

At this stage, it’s tempting to update our start_tag regex to allow handling of the = sign, but this seems short-sighted. Our expressions will have to handle a long list of characters if we’re going to parse real-world HTML. Here’s a quick look at how that might go:

Definitions.

% Helper expressions (non-token)
ValidInTagName = [A-Za-z0-9]
ValidTagAttributeName = [A-Za-z\-]
ValidTagAttributeValue = [A-Za-z0-9\s="\-/:;,\.?\(\)@]
ValidTagAttribute = {ValidTagAttributeName}+=("|')?{ValidTagAttributeValue}+("|')?
ValidTags = (\s{ValidTagAttribute})*
ValidInsideTag = [A-Za-z0-9\w\s="\-/@#:;,\.\'{}\(\)\[\]&\|\*]

% Core HTML tokens
START_TAG = \<{ValidInTagName}+{ValidTags}\>
END_TAG   = \<\/{ValidInTagName}+\>
CHARACTER = {ValidInsideTag}
DOCTYPE   = \<!(DOCTYPE|doctype)\s+[A-Za-z]+\>\s*

% Other tokens
NEW_LINE  = [\r\n]\s*
SELF_CLOSING_TAG = \<{ValidInTagName}+{ValidTags}\s\/\>

As expected, this is quickly turning ugly, so it’s time to consider another approach.

A more advanced parser

As it turns out, parsing HTML is non-trivial, and not necessarily fun. This was a useful exercise in getting to grips with leex and yecc, but having done so I’m not keen to push this much further. In truth, if I want something reliable I’m going to need something much closer to the spec, and the idea of writing and then maintaining that when other solutions are available does not thrill me.

Fortunately the open source community has us covered, so we can use an existing parser and move on to more interesting aspects of building a browser.

I chose Floki, which makes use of leex for its search functionality, although it uses Mochiweb for the general HTML parsing. Floki has an open issue to get rid of this dependency, if this sort of thing interests you!
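
For a quick taste of Floki (hedged against API drift, but this matches recent versions):

iex> {:ok, tree} = Floki.parse_document(html)
iex> Floki.find(tree, "a")
[{"a", [{"src", "/"}], ["link"]}]
iex> Floki.attribute(tree, "a", "src")
["/"]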

The source code for the full browser, BrowsEx, can be found on GitHub.


  1. My justification for this is not the topic of this post, but in short it is because I enjoy using the language, and wanted to try it out on something both relatively complex and new to me. Elixir does also have a few options for working with TUIs. I found cecho worked best for my purposes. 

  2. Full disclosure: Although I go through the motions, and I feel this is a good exercise in testing out leex and yecc, ultimately I end up using a library. If you want to learn something about hand-rolling an HTML parser you might end up disappointed. 

  3. I highly recommend both Andrea Leopardi’s and Cameron Price’s posts on this topic. 

  4. Attentive readers might point out that by using leex and yecc we’re not writing much Elixir at all, and they’d be right. In fact the only Elixir in this post is this module, sorry. 

  5. As before, I don’t intend to stick exactly to the guidelines here. Just having a rough notion of what steps the specification suggests is OK for me right now.