If I have a nested HTML (unordered) list like this:
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a>
        <ul>
          <li><a href="Page1_Level3.html">Page1_Level3</a></li>
        </ul>
        <ul>
          <li><a href="Page2_Level3.html">Page2_Level3</a></li>
        </ul>
        <ul>
          <li><a href="Page3_Level3.html">Page3_Level3</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul>
      <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
  </li>
</ul>
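How can I build a nested list from it in Python? For example:
["Page1_Level1.html", ["Page1_Level2.html", ["Page1_Level3.html", "Page2_Level3.html", "Page3_Level3.html"]], "Page2_Level1.html", ["Page2_Level2.html"]]
I think libraries like Beautiful Soup and HTML Parser can do this, but I haven't figured out how yet. Thanks for any help or pointers!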
Solution
You can take a recursive approach:
from pprint import pprint

from bs4 import BeautifulSoup

text = """your html goes here"""


def find_li(element):
    # Map each direct <li> to its link's href, then recurse into the <li>
    # to collect the lists nested directly inside it.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]


soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)
Prints:
[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},{u'Page2_Level3.html': []},{u'Page3_Level3.html': []}]}]},{u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
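The output above is a list of dicts, not the exact shape asked for in the question. If the plain nested-list form is needed, a variation of the same recursion might look like the sketch below. The name `to_nested_list` is hypothetical, and the markup is the question's HTML with whitespace trimmed.

```python
from bs4 import BeautifulSoup

# The markup from the question, condensed.
html = """
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a>
        <ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>
        <ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>
        <ul><li><a href="Page3_Level3.html">Page3_Level3</a></li></ul>
      </li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul><li><a href="Page2_Level2.html">Page2_Level2</a></li></ul>
  </li>
</ul>
"""


def to_nested_list(element):
    """Hypothetical helper: hrefs as strings, each followed by a sublist of its children."""
    result = []
    for ul in element.find_all('ul', recursive=False):
        for li in ul.find_all('li', recursive=False):
            result.append(li.a['href'])
            children = to_nested_list(li)
            if children:  # leaf pages get no empty sublist
                result.append(children)
    return result


soup = BeautifulSoup(html, 'html.parser')
nested = to_nested_list(soup)
print(nested)
```

Sibling `<ul>` elements under the same `<li>` are merged into one sublist here, which is what the question's example output appears to expect.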
FYI, this is why I had to use html.parser here:
> Don’t put html, head and body tags automatically, beautifulsoup
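A small check of why the parser choice matters for the recursion: the top-level search uses `recursive=False`, so the outermost `<ul>` has to be a direct child of the soup object. The sketch below uses a made-up one-item fragment to show that `html.parser` leaves a fragment unwrapped. Parsers such as lxml or html5lib wrap fragments in `<html>`/`<body>`, so with those the search would probably need to start from `soup.body` instead.

```python
from bs4 import BeautifulSoup

# Illustrative fragment (an assumption, not the question's full markup).
fragment = '<ul><li><a href="Page1_Level1.html">Page1_Level1</a></li></ul>'
soup = BeautifulSoup(fragment, 'html.parser')

# html.parser adds no <html>/<body> wrapper, so the <ul> sits directly
# under the soup object and a non-recursive search finds it.
top_level_uls = soup.find_all('ul', recursive=False)
has_body = soup.find('body') is not None

print(len(top_level_uls), has_body)
```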