Handling and Validating (Not So) Well-Formed XML in PHP

Tuesday, January 5, 2010, posted by Johnny Fuery at 4:21 PM

Originally Published 2007-09-05 00:27:33

This post is decidedly geeky, but it took me a while to figure out, and I found no definitive resource on the web describing this case, so I feel that sharing is necessary.

Problem Description

PHP 5 has some built-in functions for handling and parsing XML files. In typical PHP style, these are simple and straightforward to use. The simplexml_load_file() function is commonly used to load an XML file and ready it for parsing: elements are read with object-property syntax and attributes with associative array (that's a hash, for you Perl coders) syntax.

There's a catch, though. With simplicity come limitations. The error checking built into the simplexml_load_file() function is difficult to use. In point of fact, I couldn't get it to work.

I even found [incorrect] documentation that implied that this would do the trick:

$xml = simplexml_load_file($url);
if($xml) {
    // parse it
}


That may work if the file doesn't exist at all, but it still threw plenty of runtime errors with my malformed XML.
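There is a second reason to distrust a bare if($xml) test: a SimpleXMLElement built from an empty element with no attributes converts to false in a boolean context, even though the parse succeeded. The snippet below is a minimal sketch of that pitfall; the <feed/> document is a made-up example, and simplexml_load_string() stands in for simplexml_load_file() so that it runs without a file.

```php
<?php
// An empty root element is perfectly well-formed XML...
$xml = simplexml_load_string('<feed/>');

// ...yet the resulting object is "falsy", so if ($xml) would skip it.
var_dump((bool) $xml);     // bool(false)

// A strict comparison separates a real failure from an empty document.
var_dump($xml === false);  // bool(false): the parse itself succeeded
```

Comparing the return value against false with === avoids treating an empty but valid document as an error.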

Digging further, I discovered that, based on the API, I should be able to supply a libxml constant as the third argument and thus gain a high level of control, but after toying with the constant LIBXML_DTDVALID without achieving the expected (or desired) results, I opted for my own methods.
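For reference, passing libxml constants as the third argument looks roughly like the sketch below. It is an illustration under assumptions, not the fix used in this post: the two sample strings are invented, and simplexml_load_string() is used because it takes the class name and options in the same positions as simplexml_load_file() and needs no file. LIBXML_NOERROR and LIBXML_NOWARNING ask libxml to suppress its error and warning reports; whether that silences every message in a given setup is worth verifying rather than assuming.

```php
<?php
// Hypothetical sample documents, for illustration only.
$good = '<?xml version="1.0"?><feed><item>ok</item></feed>';
$bad  = '<html><body>java.lang.RuntimeException<br></body>';

// The third argument is a bitmask of libxml option constants.
$options = LIBXML_NOERROR | LIBXML_NOWARNING;

// The second argument names the class of the returned object.
$parsedGood = simplexml_load_string($good, 'SimpleXMLElement', $options);
$parsedBad  = simplexml_load_string($bad, 'SimpleXMLElement', $options);

// Failure is signalled by a strict false, success by an object.
var_dump($parsedGood !== false);
var_dump($parsedBad === false);
```

Either way, the caller still has to check the return value; the options only affect how noisy the parser is.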

The problem in my particular case was that my script was retrieving a dynamically generated XML document that contained Java runtime error output instead of the expected well-formed XML. Sure, it's the data provider's problem to fix this sort of thing, but in the meantime, I'm the one displaying PHP errors to my customers. And, in all probability, losing them. Better to tell them the data is currently unavailable and give them a phone number to call, no?

Here's the exact error message:

Warning: simplexml_load_file() [function.simplexml-load-file]: I/O warning : failed to load external entity "../my/server/directory/structure/myfile.xml" in /my/server/directory/structure/xml-parsing-script.php on line 34

And yes, of course the file is there. I told you, it contains an HTML-formatted page full of Java errors. Can't you Java coders parse some simple data into some clean, well-formed XML? ;-)

The Solution

The problem arises because simplexml_load_file() tries to both retrieve and parse the document in one step. The solution is to read the file with file_get_contents() first, test it for well-formedness, then parse it as XML with simplexml_load_string(). If you're still reading, you're probably done with the explanation and want to see the code.

The function below is based on a code snippet I found out in the wild somewhere. I'd credit the author, but I can't seem to find it again. Feel free to comment if you are the proud originator and I'll update this post accordingly.

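// Returns true if $xmlString is well-formed XML, otherwise an error message.
function isWellFormed($xmlString)
{
    // Nothing to parse (for example, a failed file_get_contents() call).
    if (!is_string($xmlString) || $xmlString === '')
    {
        return 'Empty document';
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->loadXML($xmlString);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Warnings and recoverable errors are tolerated; only a fatal error fails.
    foreach ($errors as $error)
    {
        if ($error->level < LIBXML_ERR_FATAL)
        {
            continue;
        }

        // libxml line numbers are 1-based and count "\n" line breaks.
        $lines = explode("\n", $xmlString);
        $line = isset($lines[$error->line - 1]) ? $lines[$error->line - 1] : '';

        return trim($error->message).' at line '.$error->line.":\n".htmlentities($line);
    }

    return true;
}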


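Once you have wrapped the validation step in a function, just load, test, and parse. On failure isWellFormed() returns an error message, and a non-empty string counts as true in PHP, so compare its result strictly against true:

$xmlString = file_get_contents($url);
if ($xmlString !== false && isWellFormed($xmlString) === true)
{
    $xml = simplexml_load_string($xmlString);
}
else
{
    // Show a "data currently unavailable" message instead of a PHP error.
}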
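If parsing the document twice (once with DOMDocument to check it, once with SimpleXML to use it) feels wasteful, the retrieve-then-parse idea can also be sketched as a single helper. This is one possible variant under assumptions, not the method above: loadXmlOrNull() is a hypothetical name, the temporary file stands in for the remote feed, and the fallback wording is a placeholder.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null when the source
// cannot be read or does not contain well-formed XML.
function loadXmlOrNull($source)
{
    // Step 1: retrieve. The @ hides the warning; failure shows up as false.
    $xmlString = @file_get_contents($source);
    if ($xmlString === false || trim($xmlString) === '')
    {
        return null;
    }

    // Step 2: parse, collecting libxml errors instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Strict check: an empty but valid document must not count as a failure.
    return ($xml === false) ? null : $xml;
}

// Demo: a temporary file holding an HTML error page instead of XML.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<html><body>java.lang.NullPointerException<br></body>');

$xml = loadXmlOrNull($path);
if ($xml === null)
{
    // Placeholder wording; substitute your own message and contact details.
    $output = 'The data is currently unavailable. Please call us.';
}
else
{
    $output = 'Loaded '.count($xml->children()).' top-level elements.';
}
echo $output, "\n";
unlink($path);
```

The trade-off is that this version reports only success or failure; the isWellFormed() approach is the better fit when you want the offending line in a log.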

Comments

On 2007-09-14 01:40:42 fred said:
Another way to handle parse error in SimpleXML:

try {
    libxml_use_internal_errors(true);
    $doc = new SimpleXMLElement($result);
} catch (Exception $e) {
    $errors = libxml_get_errors();
    $error = $errors[0];
    throw new XMLParseException("SimpleXMLElement error: {$e->getMessage()}: {$error->message}");
}

On 2007-09-26 09:25:49 Information security said:
Have a look through your wp-config.php, plugins and/or your theme's function.php files.

In all cases, there should be no spaces or blank lines before the opening as well.

On 2008-03-11 02:01:37 gbokolo said:
thanks...this put me on the way for solving my problem.
