Scraping Maoyan with requests + regular expressions

This article shows how to scrape the Maoyan movie board using requests and regular expressions.
import json
import re
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None on any failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie entry found in the page HTML."""
    # Raw strings keep \d from being read as a string escape.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in re.findall(pattern, html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label prefix
            'time': item[4].strip()[5:],   # drop the 5-character label prefix
            'score': item[5] + item[6],    # integer part + fractional part
        }


def write_to_file(content):
    """Append one item to result.txt as a JSON line."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        # ensure_ascii=False writes Chinese text as-is instead of \u escapes.
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    # 10 pages at offsets 0, 10, ..., 90, pausing 1 second between requests.
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)
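
To see what the regular expression captures without making a network request, the pattern can be run against a small HTML fragment. The sketch below is only an illustration: SAMPLE_HTML is a hypothetical fragment written by hand to fit the pattern, and the real Maoyan markup may differ or change over time. It assumes the two labels being sliced off are "主演:" (3 characters) and "上映时间:" (5 characters), which is what the [3:] and [5:] slices suggest.

```python
import re

# Same pattern as in parse_one_page, written with raw strings.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)

# Hypothetical fragment (an assumption, not copied from the real site).
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img src="http://example.com/poster.jpg" alt="Example Movie">
  <p class="name"><a href="/films/0" title="Example Movie">Example Movie</a></p>
  <p class="star">
    主演:Actor A,Actor B
  </p>
  <p class="releasetime">上映时间:1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_one_page(html):
    """Yield one dict per <dd> entry matched by PATTERN."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # strip whitespace, then the label
            'time': item[4].strip()[5:],   # strip whitespace, then the label
            'score': item[5] + item[6],    # e.g. "9." + "5"
        }


movies = list(parse_one_page(SAMPLE_HTML))
for movie in movies:
    print(movie)
```

The re.S flag matters here: it lets "." match newlines, so a single pattern can span the several lines that make up one <dd> entry. Each ".*?" is non-greedy, which keeps a match from running past the first closing </dd> into the next entry.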

Source: http://cuiqingcai.com/
