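I can parse the document and generate output, but because of a p tag the output cannot be parsed into an XElement. Everything else in the string is parsed correctly.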
My input:
var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature,but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";
My code:
public static XElement CleanupHtml(string input)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");
            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");
}
My output:
<body> <p> Not sure why is is null for some wierd reason chappy! <br/> <br/>I have implemented the auto save feature,but does it really work after 100s? <br/> </p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p> **<p>** <hr/> <p> <br/> </p> </body>
Solution
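What you are trying to do is basically convert HTML input into XML output.
Html Agility Pack can do this when you set the OptionOutputAsXml option, but in that case you should not use the InnerHtml property. Instead, let Html Agility Pack do the groundwork for you by using one of HtmlDocument's Save methods.
Here is a generic function that converts HTML text into an XElement instance: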
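// Namespaces needed by this function:
// using System;
// using System.IO;
// using System.Xml.Linq;
// using HtmlAgilityPack;
public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);

    // Let Html Agility Pack serialize the document as XML, then load that XML
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}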
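As you can see, you don't have to do much work yourself! Note that because your original input text has no root element, Html Agility Pack will automatically add an enclosing SPAN to ensure the output is valid XML.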
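If you would rather end up with a body root, as in the question's code, than with the enclosing SPAN, one option is to re-root the result afterwards with LINQ to XML. The following is only a minimal sketch: RootRewrapper and Rewrap are hypothetical names introduced for illustration, and it assumes the element you pass in is the wrapper whose children you want to keep.

```csharp
using System;
using System.Xml.Linq;

public static class RootRewrapper
{
    // Hypothetical helper (not part of Html Agility Pack or LINQ to XML):
    // builds a new root element with the given name and copies the child
    // nodes of the wrapper element (for example the enclosing SPAN) into it.
    public static XElement Rewrap(XElement wrapper, string rootName)
    {
        if (wrapper == null) throw new ArgumentNullException("wrapper");
        if (rootName == null) throw new ArgumentNullException("rootName");

        // Nodes that already have a parent are cloned when they are added
        // to the new element, so the original wrapper is left unchanged.
        return new XElement(rootName, wrapper.Nodes());
    }
}
```

For example, RootRewrapper.Rewrap(HtmlToXElement(html), "body") would return a body element holding the same paragraphs, while attributes on the wrapper itself, if any, are not copied.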
In your case, you also want to handle some tags specially, so here is how to process your example:
public static XElement CleanupHtml(string input)
{
    if (input == null)
        throw new ArgumentNullException("input");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(input);

    // extra processing, remove some attributes using DOM
    HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
    if (coll != null)
    {
        foreach (HtmlNode node in coll)
        {
            node.Attributes.Remove("class");
        }
    }

    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}
As you can see, you should not use raw string functions. Use the Html Agility Pack DOM functions instead (SelectNodes, Add, Remove, and so on).
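As a further illustration of the DOM-based approach, here is a minimal sketch that drops empty p elements (such as the `<p></p>` in the sample input) before the document is saved. Whether you want to remove them at all is an assumption on my part, and EmptyParagraphRemover and RemoveEmptyParagraphs are hypothetical names, not part of Html Agility Pack. The sketch requires the HtmlAgilityPack package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class EmptyParagraphRemover
{
    // Hypothetical helper: removes every <p> element that has no child
    // nodes at all (no text, no elements, no comments).
    public static void RemoveEmptyParagraphs(HtmlDocument doc)
    {
        if (doc == null) throw new ArgumentNullException("doc");

        // Copy the matches into a list first, so the tree is not modified
        // while Descendants() is still enumerating it.
        List<HtmlNode> empties = doc.DocumentNode
            .Descendants("p")
            .Where(p => !p.HasChildNodes)
            .ToList();

        foreach (HtmlNode p in empties)
        {
            p.Remove(); // detach the node from its parent
        }
    }
}
```

Calling EmptyParagraphRemover.RemoveEmptyParagraphs(doc) right after doc.LoadHtml(input) in CleanupHtml would apply it before the XML is written. A paragraph that contains only a br element or whitespace text still has child nodes, so it is left in place.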