I have a bunch of div tags nested inside other div tags:
<div class="foo">
    <div class="bar">I want this</div>
    <div class="unwanted">Not this</div>
</div>
<div class="bar">Don't want this either</div>
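So I'm using Python and Beautiful Soup to separate things out. I need all the "bar" class divs, but only when they are wrapped inside a "foo" class div. Here is my code: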
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
tag = soup.div
for each_div in soup.findAll('div',{'class':'foo'}):
    print(tag["bar"]).encode("utf-8")
Alternatively, I tried:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(r'C:\test.htm'))
for each_div in soup.findAll('div',{'class':'foo'}):
    print(each_div.findAll('div',{'class':'bar'})).encode("utf-8")
What am I doing wrong? I would be just as happy with a simple print(each_div) if I could remove the div with class "unwanted" from the selection.
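Solution
You can use find_all() to search for every <div> element whose class attribute is foo, and then, for each of those, use find() to get the <div> element whose class attribute is bar, like: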
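from bs4 import BeautifulSoup
import sys

# Parse the HTML file given as the first command-line argument
soup = BeautifulSoup(open(sys.argv[1],'r'),'html')
for foo in soup.find_all('div',attrs={'class': 'foo'}):
    # find() returns only the first matching <div> inside this foo
    bar = foo.find('div',attrs={'class': 'bar'})
    print(bar.text)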
Run it like:
python3 script.py htmlfile
That yields:
I want this
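One caveat with the find() approach: find() returns None when a foo div contains no bar div, so bar.text would raise an AttributeError on such input. A minimal sketch of a guarded version follows. The function name first_bar_texts and the inline sample string are made up for illustration, and 'html.parser' is used only so the example does not depend on an extra parser being installed.

```python
from bs4 import BeautifulSoup


def first_bar_texts(html):
    """Return the text of the first 'bar' div inside each 'foo' div."""
    soup = BeautifulSoup(html, 'html.parser')
    texts = []
    for foo in soup.find_all('div', attrs={'class': 'foo'}):
        bar = foo.find('div', attrs={'class': 'bar'})
        # find() gives None when this foo has no bar inside it
        if bar is not None:
            texts.append(bar.text)
    return texts


# Hypothetical sample: the second foo has no bar at all
sample = (
    '<div class="foo"><div class="bar">I want this</div>'
    '<div class="unwanted">Not this</div></div>'
    '<div class="foo"><div class="unwanted">No bar here</div></div>'
    '<div class="bar">Don\'t want this either</div>'
)
print(first_bar_texts(sample))
```

Returning a list instead of printing inside the loop also makes it easier to check the result or reuse it elsewhere.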
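Update: If there can be more than one <div> element with the bar class inside a foo, the previous script will not work, because it only finds the first one. But you can get the descendants of each foo and iterate over them, like: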
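from bs4 import BeautifulSoup
import sys

soup = BeautifulSoup(open(sys.argv[1],'r'),'html')
for foo in soup.find_all('div',attrs={'class': 'foo'}):
    # Walk every node below this foo, at any depth
    foo_descendants = foo.descendants
    for d in foo_descendants:
        # Keep only <div> tags whose class list is exactly ['bar']
        if d.name == 'div' and d.get('class','') == ['bar']:
            print(d.text)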
With input like the following:
<div class="foo">
    <div class="bar">I want this</div>
    <div class="unwanted">Not this</div>
    <div class="bar">Also want this</div>
</div>
It will produce:
I want this
Also want this
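As for what went wrong in the question's second attempt: under Python 3, print() returns None, so calling .encode() on its result fails, and find_all() returns a list of tags rather than a single string. Below is a rough sketch of three alternatives, not the only way to do it: a nested find_all(), a CSS selector via select(), and decompose() to drop the "unwanted" divs before reading each foo. The function names and the HTML constant are invented for this example, and 'html.parser' is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample combining the inputs shown above
HTML = """
<div class="foo">
  <div class="bar">I want this</div>
  <div class="unwanted">Not this</div>
  <div class="bar">Also want this</div>
</div>
<div class="bar">Don't want this either</div>
"""


def bars_nested(html):
    """Collect every 'bar' div found inside a 'foo' div using find_all() twice."""
    soup = BeautifulSoup(html, 'html.parser')
    texts = []
    for foo in soup.find_all('div', class_='foo'):
        for bar in foo.find_all('div', class_='bar'):
            texts.append(bar.get_text(strip=True))
    return texts


def bars_select(html):
    """Same idea with a CSS descendant selector."""
    soup = BeautifulSoup(html, 'html.parser')
    return [bar.get_text(strip=True) for bar in soup.select('div.foo div.bar')]


def foo_without_unwanted(html):
    """Remove 'unwanted' divs from each 'foo', then return what is left as text."""
    soup = BeautifulSoup(html, 'html.parser')
    foos = soup.find_all('div', class_='foo')
    for foo in foos:
        for unwanted in foo.find_all('div', class_='unwanted'):
            unwanted.decompose()  # deletes the tag and its contents from the tree
    return [foo.get_text(' ', strip=True) for foo in foos]


print(bars_nested(HTML))
print(bars_select(HTML))
print(foo_without_unwanted(HTML))
```

The selector version is the shortest, while the decompose() version is closest to the "just print each_div without the unwanted part" idea from the question.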