An Interesting HTML Parser Conundrum

This content originally appeared on dbushell.com and was authored by dbushell.com

Against my better judgement, I decided to code a basic HTML parser. It doesn’t cover the full HTML spec, but it does enough to create a tree of nodes and attributes. I’ve already written a streamable XML parser that has been working for my podcast web app.

Parsing (most) HTML isn’t as complicated as it sounds. Look for a less-than sign < and see if a valid tag like <div> follows. If that node is a void element or self-closing element it gets appended to the current parent. If it’s an opening tag it becomes the current parent until a matching close tag is found.
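
A minimal sketch of that loop, for illustration only. The names `parseBasic` and `voidTags` and the node shape are assumptions made up for this example, not code from the parser discussed here. It skips attributes, comments and opaque elements entirely, and it would trip over a `>` inside an attribute value.

```javascript
// Hypothetical sketch: parseBasic, voidTags and the node shape are
// illustrative names, not taken from the parser discussed in this post.
const voidTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

function parseBasic(html) {
  const root = {tag: '#root', children: []};
  const stack = [root]; // last item is the current parent
  // Opening, closing, or self-closing tag; attributes are skipped, not parsed.
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(\/?)>/g;
  let lastIndex = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const [, closing, rawName, selfClosing] = match;
    const tag = rawName.toLowerCase();
    const parent = stack[stack.length - 1];
    // Text between the previous tag and this one becomes a text node.
    if (match.index > lastIndex) {
      parent.children.push({text: html.slice(lastIndex, match.index)});
    }
    lastIndex = tagPattern.lastIndex;
    if (closing) {
      // Close the nearest matching open element; ignore stray close tags.
      const index = stack.map((node) => node.tag).lastIndexOf(tag);
      if (index > 0) stack.length = index;
      continue;
    }
    const node = {tag, children: []};
    parent.children.push(node);
    // Void and self-closing elements never become the current parent.
    if (!voidTags.has(tag) && !selfClosing) stack.push(node);
  }
  // Any text after the final tag belongs to whatever parent is still open.
  if (lastIndex < html.length) {
    stack[stack.length - 1].children.push({text: html.slice(lastIndex)});
  }
  return root;
}
```

Calling `parseBasic('<div><p>Hi<br></p></div>tail')` yields a root with two children: the `div` node, which holds the `p` and its `br`, and a trailing text node.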

There are several HTML elements that I consider opaque and will skip parsing inside.

So far that list includes:

export const opaqueTags = new Set([
  'code', 'iframe', 'math', 'noscript',
  'object', 'pre', 'script', 'style',
  'svg', 'template', 'textarea'
]);

For these elements I just want to gather the raw text and avoid creating a node tree. This is where I became confused.

The Conundrum

I started thinking about inline <script> and <style> tags. The contents of said elements are not HTML but could look like HTML.

What happens when I parse this inline script:

<script>
  console.log('</script>');
</script>

Or similar:

<script>
  /* </script> */
</script>

In these two examples the JavaScript text includes </script> inside a string literal and inside a comment, respectively. How do HTML parsers know that is not real HTML? They’re not JavaScript parsers; they’re not aware of the string or comment context.

Investigation

I tested two popular Node.js libraries: htmlparser2 and parse5. Both libraries failed, or so I thought, by ending the <script> node early.

Nodes are created something like this:

<script> (opening tag)
console.log(' (child text node)
</script> (closing tag)
'); (adjacent text node)

The final </script> has no open element left to close, so it is treated as a stray end tag and thrown away.

I wasn’t satisfied! At this point I remembered that the best HTML parsers are web browsers, not Node packages. Surely a web browser can parse this correctly? Nope. Well… actually yes, once I realised my assumptions were wrong. Web browsers behave exactly the same.

See this CodePen for proof.

The same behaviour happens with an inline <style>:

<style>
  /* </style> */
  html {
    background: red;
  }
</style>

Everything from */ html { onwards is rendered as a text node and the “real” closing </style> tag is thrown away.

I did not expect this behaviour, but oh boy am I relieved! Can you imagine how difficult it would be to parse HTML otherwise?
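
The behaviour follows from the parser knowing nothing about JavaScript or CSS: inside these elements it simply ends the raw text at the first matching end tag it meets. Here is a minimal sketch of that rule. The `readRawText` helper and its return shape are hypothetical, invented for this example. It assumes `tagName` is a plain name such as those in `opaqueTags`, and it deliberately ignores the extra `<!--` escape handling that the HTML spec defines for `<script>`.

```javascript
// Hypothetical helper: returns the raw text of an opaque element and the
// index just past its end tag. `start` is the index right after the
// opening tag. Assumes tagName contains only letters and digits.
function readRawText(html, tagName, start) {
  // An end tag is "</name" followed by whitespace, "/" or ">",
  // matched case-insensitively. No string or comment context is tracked.
  const endTag = new RegExp(`</${tagName}(?=[\\s/>])`, 'gi');
  endTag.lastIndex = start;
  const match = endTag.exec(html);
  if (match === null) {
    // No end tag found: the rest of the input is raw text.
    return {text: html.slice(start), end: html.length};
  }
  const close = html.indexOf('>', match.index);
  return {
    text: html.slice(start, match.index),
    end: close === -1 ? html.length : close + 1
  };
}

const example = "<script>\n  console.log('</script>');\n</script>";
const result = readRawText(example, 'script', '<script>'.length);
console.log(JSON.stringify(result.text));
console.log(JSON.stringify(example.slice(result.end)));
```

For the first script example the raw text stops at `console.log('`, and the remaining `');` plus the final `</script>` are left for the normal HTML parsing path, which matches the node list shown earlier.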

My HTML parsing efforts currently reside in my Hyperless GitHub repo, an assortment of experimental JavaScript + HTML utilities. I’m not sure of my final plans; I’m just coding for fun right now. Originally I was planning to make a reference implementation in JavaScript and then reimplement it in Rust or Zig. I just need more free time!
