I don't know whether the following is possible.
Let's say I have the following text:
<ul class="yes">
  <li><img src="whatever1"></li>
  <li><img src="whatever2"></li>
  <li><img src="whatever3"></li>
  <li><img src="whatever4"></li>
</ul>
<ul class="no">
  <li><img src="whatever5"></li>
  <li><img src="whatever6"></li>
  <li><img src="whatever7"></li>
  <li><img src="whatever8"></li>
</ul>
I want to match the src of every img inside the ul with class yes.
I would like a regular expression that returns:
whatever1 whatever2 whatever3 whatever4
How can I combine two regular expressions like these into a single one?
<ul class="yes">(.+?)<\/ul>
<img src="(whatever.+?)">
Solution
Regular expressions are notoriously ill-suited to parsing XML-like markup. It is better to drop that idea and use a proper HTML parser, for example BeautifulSoup4:
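import bs4

html = """
<ul class="yes">
<li><img src="whatever1"></li>
<li><img src="whatever2"></li>
<li><img src="whatever3"></li>
<li><img src="whatever4"></li>
</ul>
<ul class="no">
<li><img src="whatever5"></li>
<li><img src="whatever6"></li>
<li><img src="whatever7"></li>
<li><img src="whatever8"></li>
</ul>
"""

# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = bs4.BeautifulSoup(html, "html.parser")

def match_imgs(tag):
    # Keep only <img> tags whose grandparent is a <ul> with exactly class="yes".
    # .get() avoids a KeyError when the <ul> has no class attribute at all.
    return (tag.name == 'img'
            and tag.parent.parent.name == 'ul'
            and tag.parent.parent.get('class') == ['yes'])

imgs = soup.find_all(match_imgs)
print(imgs)

# Pull the src attribute out of each matched tag.
whatevers = [i['src'] for i in imgs]
print(whatevers)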
Output (from Python 2; Python 3 prints the strings without the u prefix):
[<img src="whatever1"/>, <img src="whatever2"/>, <img src="whatever3"/>, <img src="whatever4"/>]
[u'whatever1', u'whatever2', u'whatever3', u'whatever4']
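If installing BeautifulSoup is not an option, the same "use a parser, not a regex" idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, not part of the original answer: the class name YesImgCollector and its attributes ul_stack and srcs are made up for illustration. It also matches slightly more loosely than the code above, accepting any img whose nearest enclosing ul has yes among its classes.

```python
from html.parser import HTMLParser


class YesImgCollector(HTMLParser):
    """Hypothetical helper: collect the src of every <img> whose
    nearest enclosing <ul> has "yes" among its classes."""

    def __init__(self):
        super().__init__()
        self.ul_stack = []  # one bool per currently open <ul>: is it a "yes" list?
        self.srcs = []      # collected src values, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            # A valueless class attribute is reported as None, hence the "or".
            classes = (attrs.get("class") or "").split()
            self.ul_stack.append("yes" in classes)
        elif tag == "img" and self.ul_stack and self.ul_stack[-1]:
            src = attrs.get("src")
            if src:
                self.srcs.append(src)

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_stack:
            self.ul_stack.pop()


html = """
<ul class="yes">
<li><img src="whatever1"></li> <li><img src="whatever2"></li>
<li><img src="whatever3"></li> <li><img src="whatever4"></li>
</ul>
<ul class="no">
<li><img src="whatever5"></li> <li><img src="whatever6"></li>
<li><img src="whatever7"></li> <li><img src="whatever8"></li>
</ul>
"""

collector = YesImgCollector()
collector.feed(html)
print(collector.srcs)  # ['whatever1', 'whatever2', 'whatever3', 'whatever4']
```

The stack of booleans is what lets the parser "remember" which list it is currently inside, which is exactly the context a single regular expression struggles to track.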