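I'm currently writing a script that converts a bunch of XML files from various encodings to a unified UTF-8.
I first try to determine the encoding using lxml: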

    def get_source_encoding(self):
        tree = etree.parse(self.inputfile)
        encoding = tree.docinfo.encoding
        # rewind so later steps can read the file from the start
        self.inputfile.seek(0)
        return (encoding or '').lower()

If that comes back blank, I try to get it from chardet:
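
    def guess_source_encoding(self):
        chunk = self.inputfile.read(1024 * 10)
        self.inputfile.seek(0)
        # chardet.detect() returns a dict; the guess is under 'encoding' and may be None
        return (chardet.detect(chunk)['encoding'] or '').lower()
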
I then use codecs to convert the file's encoding:

    def convert_encoding(self, source_encoding, input_filename, output_filename):
        chunk_size = 16 * 1024
        with codecs.open(input_filename, "rb", source_encoding) as source:
            with codecs.open(output_filename, "wb", "utf-8") as destination:
                while True:
                    chunk = source.read(chunk_size)
                    if not chunk:
                        break
                    destination.write(chunk)

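
To make the chunked re-encoding step concrete, here is a minimal standalone sketch of the same loop together with a small round-trip check. The names `transcode_file` and `demo_transcode` and the sample content are hypothetical, chosen only for illustration. Note that this step only re-encodes the bytes; any encoding named in the XML declaration inside the file is left untouched.

```python
import codecs
import os
import tempfile


def transcode_file(source_encoding, input_filename, output_filename,
                   chunk_size=16 * 1024):
    """Copy input_filename to output_filename, re-encoding the text as UTF-8."""
    # codecs.open decodes on read and encodes on write, so each chunk is text.
    with codecs.open(input_filename, "r", source_encoding) as source:
        with codecs.open(output_filename, "w", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                destination.write(chunk)


def demo_transcode():
    """Write a small ISO-8859-1 file, transcode it, and return the new bytes."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "in.xml")
    dst = os.path.join(workdir, "out.xml")
    with open(src, "wb") as handle:
        handle.write(u"<name>caf\u00e9</name>".encode("iso-8859-1"))
    transcode_file("iso-8859-1", src, dst)
    with open(dst, "rb") as handle:
        return handle.read()
```

Calling `demo_transcode()` returns the UTF-8 bytes of the sample element, with the accented character now stored as two bytes instead of one.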
Finally, I'm trying to rewrite the XML header. If the original XML header is

    <?xml version="1.0"?>

or

    <?xml version="1.0" encoding="windows-1255"?>

I want to turn it into

    <?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

    def edit_header(self, input_filename):
        output_filename = tempfile.mktemp(suffix=".xml")
        with open(input_filename, "rb") as source:
            parser = etree.XMLParser(encoding="UTF-8")
            tree = etree.parse(source, parser)
            with open(output_filename, "wb") as destination:
                tree.write(destination, encoding="UTF-8")

Solution
Try:

    tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:
xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).
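
The three settings described in that passage can be compared side by side with a short sketch. The helper `serialize` and the variable names are hypothetical. The `try`/`except` import is also an assumption made so the snippet runs without lxml installed: the standard library's `xml.etree.ElementTree` accepts the same `encoding` and `xml_declaration` arguments on `write()`.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree


def serialize(xml_declaration, encoding):
    """Serialize a tiny document and return the bytes that were written."""
    tree = etree.ElementTree(etree.fromstring("<hi/>"))
    buffer = io.BytesIO()
    tree.write(buffer, encoding=encoding, xml_declaration=xml_declaration)
    return buffer.getvalue()


always = serialize(True, "UTF-8")               # declaration forced on
never = serialize(False, "ISO-8859-1")          # declaration forced off
default_utf8 = serialize(None, "UTF-8")         # default: omitted for UTF-8
default_latin1 = serialize(None, "ISO-8859-1")  # default: written for other encodings
```

Only `always` and `default_latin1` start with `<?xml`, which is why a UTF-8 output file gets no header unless `xml_declaration=True` is passed explicitly.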
An example from ipython:

    In [15]: etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
    <?xml version='1.0' encoding='UTF-8'?>
    <hi/>

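On reflection, I think you're trying too hard. lxml automatically detects the encoding and parses the file correctly based on that encoding.
So all you really have to do (at least in Python 2.7) is: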

    def convert_encoding(self, input_filename, output_filename):
        tree = etree.parse(input_filename)
        # binary mode: the tree is written out as already-encoded bytes
        with open(output_filename, 'wb') as destination:
            tree.write(destination, encoding='utf-8', xml_declaration=True)

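
As a quick way to check the parse-then-write approach end to end, here is a minimal sketch that creates a small ISO-8859-1 file, converts it, and returns the result. The names `convert_to_utf8` and `demo_convert` and the sample document are hypothetical. The fallback to the standard library's `xml.etree.ElementTree` is an assumption so the snippet also runs without lxml; ISO-8859-1 is used for the sample because both parsers are known to handle it.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree


def convert_to_utf8(input_filename, output_filename):
    """Parse using the encoding the file declares, then rewrite it as UTF-8."""
    tree = etree.parse(input_filename)
    # Binary mode, because write() emits encoded bytes when an encoding is given.
    with open(output_filename, "wb") as destination:
        tree.write(destination, encoding="UTF-8", xml_declaration=True)


def demo_convert():
    """Create an ISO-8859-1 document, convert it, and return the output bytes."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "latin1.xml")
    dst = os.path.join(workdir, "utf8.xml")
    original = u'<?xml version="1.0" encoding="ISO-8859-1"?>\n<name>caf\u00e9</name>'
    with open(src, "wb") as handle:
        handle.write(original.encode("iso-8859-1"))
    convert_to_utf8(src, dst)
    with open(dst, "rb") as handle:
        return handle.read()
```

The returned bytes begin with a declaration naming UTF-8, and the accented character is stored in its UTF-8 form, so the header and the actual encoding agree after a single pass.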