
Catastrophic Backtracking Issue With HTML

I'm trying to scrape a series of webpages using PHP, grabbing all of the content between the tag and the earliest tag. This is the regex that I'm using: |(?<=div id='body'>

Solution 1:

Regular expressions are known to cause catastrophic backtracking on large HTML documents. In this case, the problem is most likely the combination of the look-behind and the lazy dot matching: each time the regex engine advances one character to the right, it must check whether that position is preceded by the specified substring, and then whether it has consumed enough characters to yield a match. On a long page, those checks add up to a very large number of steps.
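A minimal sketch of how this kind of failure can be detected in PHP: `preg_match()` returns `false` (not `0`) when the engine gives up, and `preg_last_error()` reports why. The pattern and the helper name `extractBody` are assumptions for illustration only, since the regex in the question is cut off.

```php
<?php
// Hypothetical pattern of the same shape as the one in the question
// (look-behind + lazy dot). The original regex is truncated, so this is a guess.
const BODY_PATTERN = "~(?<=<div id='body'>).*?(?=</div>)~s";

/**
 * Returns the matched text, null when nothing matches,
 * or throws when PCRE aborted the match.
 */
function extractBody(string $html): ?string
{
    $result = preg_match(BODY_PATTERN, $html, $m);

    if ($result === false) {
        // false means the engine stopped early; ask PCRE for the reason.
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            $reason = 'backtrack limit reached';
        } elseif ($code === PREG_JIT_STACKLIMIT_ERROR) {
            $reason = 'JIT stack limit reached';
        } else {
            $reason = "PCRE error code $code";
        }
        throw new RuntimeException($reason);
    }

    return $result === 1 ? $m[0] : null;
}

echo extractBody("<div id='body'><p>Hello</p></div>"), "\n";
```

Checking for `false` explicitly matters because a loose comparison treats an aborted match and "no match" the same way, which hides the backtracking problem.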

A good way to see how this regex works is to step through it in the regex debugger section of regex101.

As to how to parse your HTML, PHP DOMDocument and DOMXPath are your best friends:

$html = "<<YOUR_HTML_STRING_HERE>>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Above is the DOM initialization from a string; below is the parsing
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@id="body"]'); // Get all DIV tags with id="body"
foreach ($divs as $div) {
  echo $dom->saveHTML($div); // Echo the HTML; can be added to an array instead
}
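
The loop above prints each `div` together with its own opening and closing tags. If only the content between the tags is wanted, as in the question, the child nodes can be serialized one by one. This is a minimal sketch, assuming the page contains a `div` with `id="body"`; the helper name `innerHtml` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: concatenates the serialized children of a node,
// i.e. the markup between its opening and closing tags.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$html = '<div id="body"><p>Hello</p><p>World</p></div>'; // sample input

// Scraped pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="body"]')->item(0); // first match, or null

if ($div !== null) {
    echo innerHtml($div), "\n";
}
```

Because the parser tracks nesting, a nested `div` inside the body does not end the match early, which is a common failure mode of the "up to the earliest closing tag" regex approach.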

See IDEONE demo
