An Interesting HTML Parser Conundrum


This content originally appeared on dbushell.com and was authored by dbushell.com

Against my better judgement I decided to code a basic HTML parser. It doesn't cover the full HTML spec, but it does enough to create a tree of nodes and attributes. I’ve already written a streamable XML parser that has been working for my podcast web app.

Parsing (most) HTML isn’t as complicated as it sounds. Look for a less-than sign < and check whether it begins a valid tag like <div>. If that node is a void element or self-closing element it gets appended to the current parent. If it’s an opening tag it gets appended too, and then becomes the current parent until a matching close tag is found.
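As a rough illustration of that loop, here is a minimal sketch. It is not the parser from this post: the names parseBasic, parseAttributes and voidTags are hypothetical, and the regular expressions deliberately ignore comments, doctypes and attribute values that contain a > character.

```javascript
// Hypothetical sketch: names and node shape are assumptions, not the post's code.
const voidTags = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'
]);

// Turn the raw text after a tag name into a plain object of attributes.
function parseAttributes(raw) {
  const attributes = {};
  const pattern = /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+)))?/g;
  for (const match of raw.matchAll(pattern)) {
    // Double-quoted, single-quoted, unquoted, or no value at all.
    attributes[match[1]] = match[2] ?? match[3] ?? match[4] ?? '';
  }
  return attributes;
}

function parseBasic(html) {
  const root = {tag: '#root', attributes: {}, children: []};
  const stack = [root]; // the last item is the current parent
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;
  let lastIndex = 0;
  for (const match of html.matchAll(tagPattern)) {
    const [raw, slash, name, rest] = match;
    const parent = stack[stack.length - 1];
    // Any text between the previous tag and this one belongs to the parent.
    if (match.index > lastIndex) {
      parent.children.push({text: html.slice(lastIndex, match.index)});
    }
    lastIndex = match.index + raw.length;
    const tag = name.toLowerCase();
    if (slash) {
      // Closing tag: look down the stack for the nearest matching open element.
      let depth = stack.length - 1;
      while (depth > 0 && stack[depth].tag !== tag) depth--;
      // Reaching the root means no match, so the stray close tag is ignored.
      if (depth > 0) stack.length = depth;
      continue;
    }
    const node = {tag, attributes: parseAttributes(rest), children: []};
    parent.children.push(node);
    // Void and self-closing elements never become the current parent.
    if (!voidTags.has(tag) && !rest.endsWith('/')) stack.push(node);
  }
  // Trailing text after the final tag.
  if (lastIndex < html.length) {
    stack[stack.length - 1].children.push({text: html.slice(lastIndex)});
  }
  return root;
}
```

Calling parseBasic('<div id="a"><br><p>Hi</p></div>') gives a root whose only child is the div, which in turn holds the br and p nodes.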

I consider several HTML elements opaque and will skip parsing their contents.

So far that list includes:

export const opaqueTags = new Set([
  'code', 'iframe', 'math', 'noscript',
  'object', 'pre', 'script', 'style',
  'svg', 'template', 'textarea'
]);

For these elements I just want to gather the raw text and avoid creating a node tree. This is where I became confused.
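One simple way to gather that raw text is to scan forward for the first matching closing tag and slice out everything before it. The sketch below assumes exactly that approach; readOpaque and its return shape are hypothetical names for illustration, and the tag argument is expected to be one of the plain names from the list above.

```javascript
// Hypothetical helper: returns the raw text of an opaque element.
// `from` is the index just after the element's opening tag.
function readOpaque(html, tag, from) {
  // Match "</tag" followed by whitespace, "/" or ">", ignoring case.
  const closePattern = new RegExp(`</${tag}(?=[\\s/>])[^>]*>`, 'gi');
  closePattern.lastIndex = from;
  const match = closePattern.exec(html);
  if (match === null) {
    // No closing tag found: treat the rest of the input as raw text.
    return {text: html.slice(from), end: html.length};
  }
  return {
    text: html.slice(from, match.index),
    end: match.index + match[0].length // where normal parsing resumes
  };
}

const html = '<pre><b>bold?</b></pre><p>after</p>';
const result = readOpaque(html, 'pre', '<pre>'.length);
console.log(result.text); // <b>bold?</b>
```

The inner <b> never becomes a node; it stays part of the raw text, and parsing would carry on from result.end.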

The Conundrum

I started thinking about inline <script> and <style> tags. The contents of said elements are not HTML but could look like HTML.

What happens when I parse this inline script:

<script>
  console.log('</script>');
</script>

Or similar:

<script>
  /* </script> */
</script>

In these two examples the JavaScript text includes </script> inside a string literal and a comment respectively. How do HTML parsers know that it is not a real closing tag? They’re not JavaScript parsers; they’re not aware of the string or comment context.

Investigation

I tested two popular Node.js libraries: htmlparser2 and parse5. Both libraries failed, or so I thought, by ending the <script> node early.

Nodes are created something like this:

  1. <script> opening tag
  2. console.log(' child text node
  3. </script> closing tag
  4. '); adjacent text node

The final </script> is then treated as a stray closing tag and thrown away.
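That sequence can be reproduced with a tiny two-mode tokenizer. This is a simplified sketch rather than how htmlparser2 or parse5 are implemented: tokenize and rawTextTags are hypothetical names, and it skips details such as the HTML Standard's special handling of <!-- inside scripts. The point is that after <script> or <style> the tokenizer only looks for the matching closing tag and knows nothing about JavaScript or CSS syntax.

```javascript
// Hypothetical sketch of a two-mode tokenizer (normal data vs. raw text).
const rawTextTags = new Set(['script', 'style']);

function tokenize(html) {
  const tokens = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let index = 0;
  let rawTag = null; // set while inside <script> or <style>
  while (index < html.length) {
    // In raw text mode only the matching closing tag is recognised.
    const pattern = rawTag
      ? new RegExp(`<(/)(${rawTag})(?=[\\s/>])[^>]*>`, 'gi')
      : tagPattern;
    pattern.lastIndex = index;
    const match = pattern.exec(html);
    const end = match ? match.index : html.length;
    if (end > index) {
      tokens.push({type: 'text', value: html.slice(index, end)});
    }
    if (match === null) break;
    const name = match[2].toLowerCase();
    tokens.push({type: match[1] ? 'close' : 'open', name});
    // Enter raw text mode after an opening raw text tag, leave it otherwise.
    rawTag = !match[1] && rawTextTags.has(name) ? name : null;
    index = match.index + match[0].length;
  }
  return tokens;
}

const tokens = tokenize("<script>\n  console.log('</script>');\n</script>");
console.log(tokens.map((token) => token.type).join(', '));
// open, text, close, text, close
```

The first four tokens match the list above, and the fifth is the stray closing tag that a tree builder would discard.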

I wasn’t satisfied! At this point I remembered that the best HTML parsers are web browsers, not Node packages. Surely a web browser can parse this correctly? Nope. Well… actually yes, once I realised my assumptions were wrong. Web browsers behave exactly the same.

See this CodePen for proof.

The same behaviour happens with an inline <style>:

<style>
  /* </style> */
  html {
    background: red;
  }
</style>

Everything from */ html { onwards is rendered as a text node and the “real” closing </style> tag is thrown away.

I did not expect this behaviour, but oh boy am I relieved! Can you imagine how difficult it would be to parse HTML otherwise?

My HTML parsing efforts currently reside in my Hyperless GitHub repo, an assortment of experimental JavaScript and HTML utilities. I’m not sure of my final plans; I’m just coding for fun right now. Originally I was planning to make a reference implementation in JavaScript and then reimplement it in Rust or Zig. I just need more free time!

