How do I select all child text but exclude a tag using Scrapy's XPath?

I have this HTML:

Box">


I want to get all the text inside <div id="content"> using XPath in Scrapy, but without the content of <div class="infoBox">, so the expected result looks like this:

Title 1


Sub-Title 1


Description 1.

Description 2.


Sub-Title 2


Description 1.
Description 2.

But I have not even reached the exclusion part yet; I am still struggling to get the text out of <div id="content">.
I tried this:

response.xpath('//*[@id="content"]/text()').extract()

But it only returns Description 1. and Description 2. from under the sub-titles.
Then I tried:

response.xpath('//*[@id="content"]//*/text()').extract()

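It returns only Title 1, Sub-Title 1, Sub-Title 2, Info Title and Long Info Text.
So there are two questions here:
1. How do I get all the child text from the content div?
2. How do I exclude the infoBox div from the selection?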


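Best answer
Use the descendant:: axis to find descendant text nodes, and state explicitly that the parent of those text nodes must not be a div[@class='infoBox'] element.
Turning the above into an XPath expression: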

//div[@id = 'content']/descendant::text()[not(parent::div/@class='infoBox')]

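The result then looks similar to the following (I tested with an online XPath tool). As you can see, the text content of div[@class='infoBox'] no longer appears in the result.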

-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------
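
To try the expression without an online tester, here is a minimal sketch using lxml. The question's original markup is not preserved here, so the SAMPLE string below is only an assumed stand-in. In Scrapy, the same XPath string would be passed to response.xpath(...).

```python
from lxml import html

# Hypothetical markup: the question's real HTML is not available,
# so this sample is an assumption made purely for illustration.
SAMPLE = """
<div id="content">
  <h1>Title 1</h1>
  <div class="infoBox">Long Info Text</div>
  <h2>Sub-Title 1</h2>
  Description 1.
  <h2>Sub-Title 2</h2>
  Description 2.
</div>
"""

tree = html.fromstring(SAMPLE)

# All descendant text nodes of the content div, except those whose
# parent element is <div class="infoBox">.
nodes = tree.xpath(
    "//div[@id = 'content']/descendant::text()"
    "[not(parent::div/@class='infoBox')]"
)

# Whitespace-only text nodes (indentation, line breaks) are also matched,
# so strip each node and drop the empty ones.
texts = [n.strip() for n in nodes if n.strip()]
print(texts)
# ['Title 1', 'Sub-Title 1', 'Description 1.', 'Sub-Title 2', 'Description 2.']
```

The whitespace filtering is done in Python here; the XPath itself returns every text node, including the ones that contain only indentation.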

What is wrong with your approaches?
Your first attempt:

//*[@id="content"]/text()

means, in plain English:


Look for any element (not necessarily a div) anywhere in the document that has an attribute @id whose value is "content". For this element, return all its immediate child text nodes.
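Problem: you lose the text nodes that are not immediate children of the outer div, because they sit inside child elements of that div.
Your second attempt: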

//*[@id="content"]//*/text()

translates to:


Look for any element (not necessarily a div) anywhere in the document that has an attribute @id whose value is "content", find any descendant element node of it, and return all text nodes that are children of that descendant element.
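Problem: you lose the immediate child text nodes of the div, because you only look at text nodes that are children of elements descending from the div.
Edit:
In response to your comment: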
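
The difference between the two attempts can be reproduced with a small lxml sketch. The markup is again an assumed sample, not the question's real HTML, and clean() is a helper introduced only for this illustration.

```python
from lxml import html

# Hypothetical markup, assumed for illustration only.
SAMPLE = """
<div id="content">
  <h1>Title 1</h1>
  <div class="infoBox">Long Info Text</div>
  <h2>Sub-Title 1</h2>
  Description 1.
  <h2>Sub-Title 2</h2>
  Description 2.
</div>
"""


def clean(nodes):
    """Strip each text node and drop the whitespace-only ones."""
    return [n.strip() for n in nodes if n.strip()]


tree = html.fromstring(SAMPLE)

# First attempt: only text nodes that are direct children of the element.
first = clean(tree.xpath('//*[@id="content"]/text()'))

# Second attempt: only text nodes that are children of descendant elements.
second = clean(tree.xpath('//*[@id="content"]//*/text()'))

# Text nodes at any depth below the element: covers both cases at once.
both = clean(tree.xpath('//*[@id="content"]//text()'))

print(first)   # ['Description 1.', 'Description 2.']
print(second)  # ['Title 1', 'Long Info Text', 'Sub-Title 1', 'Sub-Title 2']
print(both)
```

With this sample, each attempt returns exactly the half that the other one misses, while //text() (text nodes at any depth) returns everything in document order.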

//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infoBox')]
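
Why ancestor:: can matter instead of parent:: is easiest to see when the infoBox text is wrapped in further elements. The sketch below assumes such nested markup (hypothetical, since the real page is not shown) and compares both predicates with lxml.

```python
from lxml import html

# Hypothetical markup: here the infoBox text is nested one level deeper,
# which is an assumption about what the real page might look like.
SAMPLE = """
<div id="content">
  <h1>Title 1</h1>
  <div class="infoBox"><h4>Info Title</h4><p>Long Info Text</p></div>
  <h2>Sub-Title 1</h2>
  Description 1.
</div>
"""


def clean(nodes):
    """Strip each text node and drop the whitespace-only ones."""
    return [n.strip() for n in nodes if n.strip()]


tree = html.fromstring(SAMPLE)

# parent:: only checks the element directly containing the text node,
# so text inside <h4> and <p> slips through.
by_parent = clean(tree.xpath(
    "//div[@id = 'content']/descendant::text()"
    "[not(parent::div/@class='infoBox')]"
))

# ancestor:: checks every enclosing div, however deeply the text is nested.
by_ancestor = clean(tree.xpath(
    "//div[@id = 'content']/descendant::text()"
    "[not(ancestor::div/@class='infoBox')]"
))

print(by_parent)    # ['Title 1', 'Info Title', 'Long Info Text', 'Sub-Title 1', 'Description 1.']
print(by_ancestor)  # ['Title 1', 'Sub-Title 1', 'Description 1.']
```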

For your future questions, please make sure that the HTML you show is representative of your actual problem.
