php – Regex / DOMDocument – match and replace text not in a link

I need to find and replace all matches of a piece of text, case-insensitively, unless the text is inside an anchor tag. For example:

<p>Match this text and replace it</p>
<p>Don't <a href="/">match this text</a></p>
<p>We still need to match this text and replace it</p>

Searching for "match this text" should replace only the first and the last instance.

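Following Gordon's comment, DOMDocument may be the right tool in this case. I am completely unfamiliar with the DOMDocument extension and would really appreciate some basic examples of this functionality.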

Here is a UTF-8-safe solution that works not only with properly formatted documents but also with document fragments.

mb_convert_encoding is needed because loadHTML() seems to have a bug with UTF-8 encoded input (see the references below).
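The encoding step can be looked at in isolation. The following is a minimal sketch, not part of the original answer: it assumes the mbstring extension is available and uses mb_encode_numericentity() instead of the 'HTML-ENTITIES' conversion target, which is deprecated as of PHP 8.2. The sample string and the $map variable are illustrative only.

```php
<?php
// Sample input (illustrative): UTF-8 markup containing non-ASCII letters.
$html = '<p>replace itŐŰ</p>';

// Conversion map: turn every code point from U+0080 upward into a numeric entity.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $map, 'UTF-8');

// $encoded is now pure ASCII, so loadHTML() cannot misread its encoding.
$dom = new DOMDocument();
$dom->loadHTML($encoded);

// The parser decodes the entities, so the DOM holds the original UTF-8 text again.
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Because only non-ASCII characters are rewritten, the markup itself (tags, attributes, quotes) passes through untouched.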

mb_substr trims the body tag from the output, so you get back your original content without any extra markup.
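The offsets used for the trimming are simply the lengths of the opening and closing body tags. As a small sketch (the helper name stripBodyWrapper is hypothetical and not part of the original answer), the same idea can be written without hard-coded numbers:

```php
<?php
// Hypothetical helper: removes the literal <body>...</body> wrapper
// from a serialized body element.
function stripBodyWrapper(string $serialized): string
{
    $open  = '<body>';   // 6 characters
    $close = '</body>';  // 7 characters

    return mb_substr(
        $serialized,
        mb_strlen($open, 'UTF-8'),
        -mb_strlen($close, 'UTF-8'),
        'UTF-8'
    );
}

$dom = new DOMDocument();
$dom->loadHTML('<p>one</p><p>two</p>');

// loadHTML() wraps a fragment in <html><body>...</body></html>,
// so serialize only the body element and strip its own tags.
$body = $dom->getElementsByTagName('body')->item(0);
echo stripBodyWrapper($dom->saveXML($body)), PHP_EOL;
```

This relies on the body element being serialized without attributes, which holds when the input fragment does not supply a body tag of its own.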

<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t <a href="/">match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is <a href="#">a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';

$dom = new DOMDocument();
// loadXML() needs well-formed documents, so loadHTML() is the better choice,
// but it needs a hack to handle UTF-8 properly: non-ASCII characters are
// converted to HTML entities first ('HTML-ENTITIES' is deprecated as of PHP 8.2)
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$xpath = new DOMXPath($dom);

// select every text node that has no <a> ancestor
foreach ($xpath->query('//text()[not(ancestor::a)]') as $node)
{
    // appendXML() parses its argument as XML, so escape &, < and > in the text first
    $text     = htmlspecialchars($node->wholeText, ENT_NOQUOTES | ENT_XML1, 'UTF-8');
    $replaced = str_ireplace('match this text', 'MATCH', $text);
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

// serialize only the body element, then trim the <body> and </body> tags
// themselves (6 and 7 characters) to get back the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, 'UTF-8');
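
If the replacement is plain text rather than markup, one possible variation is to build the new node with createTextNode(), so that nothing is parsed as XML and characters such as & need no special handling. The sketch below wraps the whole approach in a function; the name replaceOutsideLinks and its parameters are hypothetical, and it assumes mbstring is available and that the input fragment carries no body tag of its own.

```php
<?php
// Hypothetical helper: case-insensitive replacement in all text that is
// not inside an <a> element. $replace is treated as plain text, not markup.
function replaceOutsideLinks(string $html, string $search, string $replace): string
{
    // Encode non-ASCII characters as numeric entities so loadHTML()
    // does not misinterpret UTF-8 input.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($encoded);
    $xpath = new DOMXPath($dom);

    // The XPath result is a snapshot, so replacing nodes inside the loop is safe.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        $newText = str_ireplace($search, $replace, $node->nodeValue);
        // createTextNode() stores the string literally; the serializer
        // escapes &, < and > on output.
        $node->parentNode->replaceChild($dom->createTextNode($newText), $node);
    }

    $body = $xpath->query('//body')->item(0);
    if ($body === null) {
        return '';
    }

    // Trim the <body> and </body> tags from the serialized element.
    return mb_substr($dom->saveXML($body), 6, -7, 'UTF-8');
}

echo replaceOutsideLinks(
    '<p>Tom &amp; Jerry: Match This Text</p><p><a href="/">match this text</a></p>',
    'match this text',
    'MATCH'
), PHP_EOL;
```

The trade-off is that this version cannot insert markup such as a link in place of the match; for that, the document fragment approach above is still needed.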

References:
1. find and replace keywords by hyperlinks in an html fragment,via php dom
2. Regex / DOMDocument – match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?

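I read dozens of answers on this subject, so I apologize if I have forgotten someone (please leave a comment and I will add you in that case).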

Thanks to Gordon for commenting on my other answer.
