Catastrophic Backtracking Issue With HTML

May 30, 2024

I'm trying to scrape a series of webpages using PHP, grabbing all of the content between the tag and the earliest tag. This is the regex that I'm using: |(?<=div id='body'>

Solution 1:

Regular expressions are prone to catastrophic backtracking on large HTML inputs. In this case, the problem most likely lies in the combination of the look-behind and lazy dot matching: each time the regex engine advances one character to the right, it must check whether that position is preceded by the specified substring, and then, as the lazy dot expands one character at a time, whether it has consumed enough characters to yield a match.
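To see whether a pattern really fails because of backtracking rather than simply not matching, it helps to check the return value of preg_match() and the last PCRE error. The following is only a minimal sketch: the pattern, the sample input, and the helper name describe_preg_outcome are assumptions made for illustration (the regex in the question is truncated), and the outcome depends on your PHP version, on whether the PCRE JIT is enabled, and on the configured limits.

```php
<?php
// preg_match() returns 1 on a match, 0 on no match, and false when the
// engine gives up (for example after exceeding pcre.backtrack_limit).
function describe_preg_outcome(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // preg_last_error_msg() requires PHP 8.0 or later
        return 'error: ' . preg_last_error_msg();
    }
    return $result === 1 ? 'match' : 'no match';
}

// Hypothetical pattern and input, for illustration only:
// a look-behind, a lazy dot, and a closing marker that never appears.
$pattern = "~(?<=<div id='body'>).*?(?=<div id='footer'>)~s";
$html = "<div id='body'>" . str_repeat('lorem ipsum ', 50000);

echo describe_preg_outcome($pattern, $html), PHP_EOL;
echo 'pcre.backtrack_limit = ', ini_get('pcre.backtrack_limit'), PHP_EOL;
```

If the helper reports an error, raising pcre.backtrack_limit only postpones the problem on larger pages, which is one more reason to prefer a real HTML parser.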

The regex debugger on regex101 gives a good picture of how this regex is evaluated step by step.

As to how to parse your HTML, PHP DOMDocument and DOMXPath are your best friends:

$html = "<<YOUR_HTML_STRING_HERE>>";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Above is the DOM initialization from string example, below is parsing
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[@id="body"]'); // Get all DIV tags with id=body
foreach ($divs as $div) {
  echo $dom->saveHTML($div); // Echo the HTML, can be added to array
}
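Note that saveHTML($div) serializes the div element itself along with its contents. If only the content inside the div is wanted, as the question suggests, one option is to serialize the child nodes instead. This is a rough sketch under assumptions: the function name body_inner_html and the sample markup are hypothetical, and libxml_use_internal_errors() is used here only to keep parser warnings about unfamiliar tags out of the output.

```php
<?php
// Return the inner HTML of every <div id="body"> found in $html.
function body_inner_html(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query('//div[@id="body"]') as $div) {
        $inner = '';
        foreach ($div->childNodes as $child) {
            // Serialize each child so the wrapping div is left out
            $inner .= $dom->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

// Hypothetical sample markup with a single root element
$sample = '<div id="page"><div id="body">Hello <b>world</b></div><div id="footer">bye</div></div>';
print_r(body_inner_html($sample));
```

Collecting the strings into an array rather than echoing them makes it easier to post-process each page's content separately.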

See IDEONE demo
