Converting XML-illegal &char references to UTF-8

There is a list of XML and HTML character entity references at: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

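However, some references are not defined in that list at all, yet they are used in older HTML scripts. When I process the Senseval-2 format (with fixes) data sets from http://www.d.umn.edu/~tpederse/data.html, I run into the following words, which break my script. The script tries to parse the data with xml.etree.ElementTree.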

What are the Unicode equivalents of these words?

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

It gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113

Best answer

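The "words" look like malformed entity references. A valid entity reference ends with a semicolon. I looked at test-fix.xml (in Sval1to2.fix.tar.gz), and it seems quite likely that &dash (or &dash.) stands for some kind of dash or hyphen. The file has an .xml extension, and it would be very close to well-formed XML if the malformed entity references were fixed.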
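
One way to see which names need attention is to scan the raw text for ampersand-names that lack the closing semicolon. The following is a minimal sketch; the pattern and the function name `count_malformed_refs` are illustrative assumptions, not part of the data set's tooling.

```python
import re
from collections import Counter

# "&name" NOT followed by ";" (hypothetical pattern for illustration).
# The lookahead also rejects a following letter, so the regex cannot
# backtrack and report "am" for a well-formed "&amp;".
MALFORMED_REF = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])")


def count_malformed_refs(text):
    """Count entity-like names that are missing the closing semicolon."""
    return Counter(MALFORMED_REF.findall(text))


sample = "rock &and. roll &dash. 20 &degree.C &ellip.two &amp; done"
for name, count in sorted(count_malformed_refs(sample).items()):
    print(name, count)
```

Running this over the whole file should give the complete set of names to decide on before attempting any repair.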

On the page you linked to (http://www.d.umn.edu/~tpederse/data.html), it says:

Please note that our converted data will not “parse” as true xml text. This is due to the fact that in the original sense-tagged text, characters that require special handling in xml are not escaped, and so forth. We are considering ways to make this data “true” xml, and would be most grateful for any feedback on how to best do this.

So although the document looks very much like XML, it is not XML, and the people who published it are well aware of that.
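
Since the file is not XML, one possible workaround is to repair the text before handing it to the parser. The sketch below rests on several assumptions: the name-to-character table `GUESSED_CHARS` is a guess (the data set does not document these names), the trailing period is treated as the terminator of the reference, and the helper names are hypothetical. Names with no guess, such as `dashq`, are kept as literal text with the ampersand escaped. Other unescaped characters mentioned in the quoted note may still break parsing.

```python
import re
import xml.etree.ElementTree as et

# Guessed replacements: every entry is an assumption, not an official mapping.
GUESSED_CHARS = {
    "and": u"&",
    "backquote": u"`",
    "dash": u"\u2014",    # guessed to be an em dash
    "degree": u"\u00b0",  # degree sign
    "ellip": u"\u2026",   # horizontal ellipsis
    "times": u"\u00d7",   # multiplication sign
}

# "&name" with no closing ";", plus an optional "." assumed to be a terminator.
MALFORMED_REF = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])(\.?)")


def _replace(match):
    char = GUESSED_CHARS.get(match.group(1))
    if char is None:
        # Unknown name: keep the original text, but escape the ampersand.
        return "&amp;" + match.group(0)[1:]
    # A numeric character reference is valid XML for any of these characters.
    return "&#%d;" % ord(char)


def repair_entities(text):
    """Rewrite malformed references so the text can be parsed as XML."""
    return MALFORMED_REF.sub(_replace, text)


def parse_repaired(xml_text):
    """Repair the references, then parse with ElementTree."""
    return et.fromstring(repair_entities(xml_text))


sample = "<context>20 &degree.C &and. rising &ellip.two &dashq. &amp; ok</context>"
root = parse_repaired(sample)
print(root.text)
```

For the real data, the file contents would be read into a string first and passed to `parse_repaired` in place of `sample`.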
