HTML5, character encodings and DOMDocument loadHTML and loadHTMLFile


Whilst working on a script for my GetProThemes app recently, I came across a problem with PHP’s loadHTML and loadHTMLFile methods.

The problem

I noticed that when using loadHTMLFile to parse an HTML document, the document's character encoding (UTF-8 in this case) was not being taken into account. As a result, the content I extracted from the document came out as mojibake. Here is an example that reproduces the problem with loadHTML, which behaves the same way:

$i18n_str = 'Iñtërnâtiônàlizætiøn';

$html = <<<EOS
<!doctype html>
<html>
<head>
  <meta charset="UTF-8">
  <title>html 5 document</title>
</head>
<body>
  <h1 id="title">$i18n_str</h1>
</body>
</html>
EOS;

$dom = new DOMDocument();
$dom->loadHTML( $html );
echo $dom->getElementById( 'title' )->textContent;

// output: IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n

After some digging into the PHP source code, I discovered that loadHTMLFile, like loadHTML, relies on Libxml to determine the character set of the HTML document automatically. Libxml uses a function named htmlCheckEncoding for this purpose, which looks for a meta tag declaring the character set. Unfortunately, it only recognises the HTML4-style declaration:


<META http-equiv="Content-Type" content="text/html; charset=UTF-8">

This means that if your source document is HTML5, Libxml will not pick up the newer meta tag declaration, which has this form:


<meta charset="utf-8">

This glitch appears to have been fixed in version 2.8.0 of Libxml. If you are stuck with an older version, I have created a workaround.
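
To see where the mojibake comes from, it helps to reproduce it without DOMDocument at all. The sketch below assumes that Libxml's HTML parser falls back to ISO-8859-1 when it finds no declaration it recognises (this post does not confirm which fallback is used, so treat that as an assumption). The helper name latin1_bytes_to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper: treat every input byte as one ISO-8859-1 character
// and re-encode it as UTF-8. This mimics a parser that wrongly assumes
// ISO-8859-1 for a document that is really UTF-8.
function latin1_bytes_to_utf8(string $bytes): string
{
    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($bytes[$i]);
        if ($code < 0x80) {
            // ASCII bytes are identical in both encodings.
            $out .= $bytes[$i];
        } else {
            // U+0080..U+00FF become two-byte UTF-8 sequences.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

// Assumes this source file is saved as UTF-8.
$i18n_str = 'Iñtërnâtiônàlizætiøn';
echo latin1_bytes_to_utf8($i18n_str), PHP_EOL;
```

On a UTF-8 terminal, each accented letter is printed as two characters beginning with Ã, because each two-byte UTF-8 sequence has been read as two separate characters. Where the mbstring extension is available, mb_convert_encoding($i18n_str, 'UTF-8', 'ISO-8859-1') performs the same conversion.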

The solution

I have created drop-in replacements for the loadHTML and loadHTMLFile methods. They automatically convert an HTML5 character set declaration, if one exists, into an HTML4 character set declaration, which allows Libxml to parse the document correctly.

Fixing the above example is trivial:

require_once 'DOMDocumentCharset.php';

$i18n_str = 'Iñtërnâtiônàlizætiøn';

$html = <<<EOS
<!doctype html>
<html>
<head>
  <meta charset="UTF-8">
  <title>html 5 test</title>
</head>
<body>
  <h1 id="title">$i18n_str</h1>
</body>
</html>
EOS;

$dom = new DOMDocumentCharset();
$dom->loadHTMLCharset( $html );
echo $dom->getElementById( 'title' )->textContent;

// output: Iñtërnâtiônàlizætiøn

So, the fix involves:

1. Including the DOMDocumentCharset class
2. Instantiating DOMDocumentCharset rather than DOMDocument
3. Calling the new loadHTMLCharset method
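
For readers curious about what converting the declaration might involve, here is a minimal sketch. It is not the actual DOMDocumentCharset code: the function name html5_charset_to_html4 and the regular expression are assumptions made for illustration, and the pattern only handles the simple case where charset is the first attribute of the meta tag.

```php
<?php
// Hypothetical helper, not the real DOMDocumentCharset implementation:
// rewrite the first HTML5-style <meta charset="..."> into the HTML4-style
// http-equiv declaration that older Libxml versions recognise.
function html5_charset_to_html4(string $html): string
{
    $pattern = '/<meta\s+charset\s*=\s*["\']?([a-z0-9_:.-]+)["\']?\s*\/?>/i';
    $replacement = '<meta http-equiv="Content-Type" content="text/html; charset=$1">';

    // Limit to one replacement; a document should declare its charset once.
    $result = preg_replace($pattern, $replacement, $html, 1);

    // preg_replace returns null on error; fall back to the input unchanged.
    return $result ?? $html;
}

$html = '<!doctype html><html><head><meta charset="UTF-8"><title>t</title></head><body></body></html>';
echo html5_charset_to_html4($html), PHP_EOL;
```

A document that already carries the HTML4-style declaration passes through untouched, since the pattern requires charset to follow the tag name directly.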

The class will only activate the workaround if the installed Libxml version is less than 2.8.0, so upgrading Libxml will not break this code.
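
A version gate like this can be sketched with PHP's LIBXML_VERSION constant, which holds the Libxml version as an integer (for example 20708 for 2.7.8). The function name needs_charset_workaround is hypothetical, and the real class may perform its check differently.

```php
<?php
// Hypothetical helper: decide whether the HTML5 charset workaround is needed.
// 20800 is the integer form of Libxml 2.8.0.
function needs_charset_workaround(int $libxmlVersion): bool
{
    return $libxmlVersion < 20800;
}

if (needs_charset_workaround(LIBXML_VERSION)) {
    echo 'Libxml ', LIBXML_DOTTED_VERSION, ": workaround needed\n";
} else {
    echo 'Libxml ', LIBXML_DOTTED_VERSION, ": no workaround needed\n";
}
```

Taking the version as a parameter, rather than reading the constant inside the function, makes the check easy to exercise for versions other than the one installed.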

The source code can be found on GitHub: dom-document-charset

Glen Scott

I’m a freelance software developer with 18 years’ professional experience in web development. I specialise in creating tailor-made, web-based systems that can help your business run like clockwork. I am the Managing Director of Yellow Square Development.
