I have the following HTML. How can I extract the JSON from the variable window.__INITIAL_STATE__?
<!DOCTYPE doctype html>
<html lang="en">
<script>
window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
/* </sl:translate_json> */
</script>
</html>
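Best answer
You can use the following Python code to pull the JavaScript out of the <script> tag and wrap it so that Node prints the target object as JSON.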
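from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script')
# Stub out `window`, run the page script, then print the target object as JSON.
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js', 'w', encoding='utf-8') as f:
    f.write(js)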
The JavaScript is written to the file temp.js. You can then run that file with node.
from subprocess import check_output
window_init_state = check_output(['node','temp.js'])
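The Python variable window_init_state now holds the JSON string (as bytes) of the JS object window.__INITIAL_STATE__, which you can parse in Python with the json module (json.loads or json.JSONDecoder).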
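The steps described so far (wrap the script, run it in node, parse the result) could also be combined without a temporary file. The following is only a sketch: the helper names build_node_script and run_in_node are made up for illustration, and it assumes node is on the PATH and is invoked as `node -` so that it reads the script from standard input.

```python
import json
import subprocess


def build_node_script(script_text):
    """Wrap the page's script so that Node prints __INITIAL_STATE__ as JSON."""
    return (
        "window = {};\n"
        + script_text.strip()
        + ";\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));"
    )


def run_in_node(script_text):
    """Feed the wrapped script to `node -` via stdin and parse what it prints.

    Hypothetical helper: assumes a `node` executable is available on the PATH.
    """
    result = subprocess.run(
        ["node", "-"],
        input=build_node_script(script_text).encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raises CalledProcessError if the page script throws
    )
    return json.loads(result.stdout)
```

Calling run_in_node(soup.find('script').text) would then return the parsed object directly, provided the script runs cleanly under Node.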
Example
from subprocess import check_output
import json,bs4
html='''<!DOCTYPE doctype html>
<html lang="en">
<script> window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = { 'Hello':'World'};
/* </sl:translate_json> */
</script>
</html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
with open('temp.js', 'w', encoding='utf-8') as f:
    f.write('window = {};\n' +
            soup.find('script').text.strip() +
            ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));')
window_init_state = check_output(['node','temp.js'])
print(json.loads(window_init_state))
Output:
{'Hello': 'World'}
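If the value assigned to window.__INITIAL_STATE__ happens to be strict JSON (double-quoted keys and strings, no comments, no trailing commas), Node may not be needed at all. The sketch below rests on that assumption, which the question does not confirm; extract_initial_state is a hypothetical helper name. It uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows it. A JavaScript-style literal such as { 'Hello':'World'} would raise json.JSONDecodeError here, and the Node approach above would still be required.

```python
import json


def extract_initial_state(script_text, marker="window.__INITIAL_STATE__"):
    """Parse the object literal assigned to `marker`, assuming it is strict JSON."""
    start = script_text.index(marker)       # position of the assignment
    brace = script_text.index("{", start)   # first '{' after the variable name
    obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return obj


# Made-up sample script text, written with JSON-style double quotes.
sample = (
    'window.sessConf = "-2912474957111138742";\n'
    'window.__INITIAL_STATE__ = {"Hello": "World", "n": [1, 2]};\n'
)
state = extract_initial_state(sample)
print(state["Hello"])
```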