I'm looking for a library or method for parsing HTML files that offers more HTML-specific features than generic XML parsing libraries.

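Answer #1

HTML Agility Pack


This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml proposes, but for HTML documents (or streams).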
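
A minimal sketch of that DOM-plus-XPath workflow, assuming the HtmlAgilityPack NuGet package is referenced. The names LinkExtractor and ExtractHrefs are made up for illustration; HtmlDocument, LoadHtml, SelectNodes and GetAttributeValue come from the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // tolerant of unclosed tags and similar real-world markup

        var hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return hrefs;

        foreach (HtmlNode a in anchors)
            hrefs.Add(a.GetAttributeValue("href", ""));

        return hrefs;
    }
}
```

Calling LinkExtractor.ExtractHrefs on a downloaded page string should give you the link targets in document order; the null check is the part that is easy to forget.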


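Comments


It's worth noting that it doesn't handle self-closing tags like <br> well (it interprets them as empty), and it deals poorly with optional end tags like <li> (it interprets a missing end tag as nesting consecutive li tags).

– Eamon Nerbonne
May 14, 2011 at 16:48

Answer #2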

You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.

Another alternative is to use the built-in engine mshtml:

    using mshtml;
    ...
    object[] oPageText = { html };
    HTMLDocument doc = new HTMLDocumentClass();
    IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
    doc2.write(oPageText);
    


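This allows you to use JavaScript-like functions such as getElementById().

Comments


Call me crazy, but I'm having trouble figuring out how to use mshtml. Do you have any good links?

– Alex Baranosky
Jan 9, 2009 at 17:52

@Alex, you need to include Microsoft.mshtml; you can find a little more information here: msdn.microsoft.com/zh-cn/library/aa290341(VS.71).aspx

– Wilfred Knievel
Jan 12, 2010 at 23:17

I have a blog post about Tidy.Net and ManagedTidy; both are capable of parsing and validating (x)html files. If you don't need to validate the content, I would go with htmlagilitypack. jphellemons.nl/post/…

– JP Hellemons
Oct 25, 2011 at 7:03

Answer #3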

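I found a project called Fizzler that takes a jQuery/Sizzler approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but using CSS selectors instead of nasty XPath is pretty cool and refreshing.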

    http://code.google.com/p/fizzler/
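
For a rough idea of what that looks like in practice, here is a small sketch. It assumes the Fizzler.Systems.HtmlAgilityPack package (which supplies the QuerySelectorAll extension method on HtmlNode) together with HtmlAgilityPack; CssQuery and TextOf are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

// Hypothetical helper, not part of Fizzler.
static class CssQuery
{
    // Returns the trimmed inner text of every element matching a CSS selector.
    public static List<string> TextOf(string html, string selector)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .QuerySelectorAll(selector)       // extension method from Fizzler
                  .Select(node => node.InnerText.Trim())
                  .ToList();
    }
}
```

Something like CssQuery.TextOf(html, "li.hit") then reads much like the equivalent jQuery call, which is the appeal over spelling out the XPath.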

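Comments


Thanks, this looks interesting! I'm surprised that, with jQuery being so popular, it has been so hard to find a C# project inspired by it. Now if only I could find something where document manipulation and more advanced traversal were also part of the package... :)

– Funka
May 14, 2010 at 13:33

I just used this today and I have to say, it's very easy to use if you know jQuery.

–陈志
Oct 14, 2010 at 20:56

Answer #4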

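You can do a lot without going nuts over third-party products and mshtml (i.e. interop). Use System.Windows.Forms.WebBrowser. From there, you can do things such as "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (in my opinion a lesser evil than Interop) to do it: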

    var wb = new WebBrowser();
    


... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this.

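    // wb is the WebBrowser created earlier; elementId is the id of the element to click
    var doc = wb.Document;
    var elem = doc.GetElementById(elementId);
    // DomElement exposes the underlying COM object; call its click() method via reflection
    object obj = elem.DomElement;
    System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
    mi.Invoke(obj, new object[0]);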
    


You can do similar reflection tricks to submit forms, etc.

Enjoy.

Answer #5

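I've written some code that provides "LINQ to HTML" functionality and thought I would share it here. It is based on Majestic-12: it takes the Majestic-12 results and produces LINQ to XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example: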

            IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
    
            foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
    
                if (anchorTag.Attribute("href") == null)
                    continue;
    
                Console.WriteLine(anchorTag.Attribute("href").Value);
            }
    


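I wanted to use Majestic-12 because I know it has a lot of built-in knowledge about HTML as found in the wild. What I've found, though, is that mapping the Majestic-12 results to something LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleanup, but as you use it you will find pages that are rejected, and you'll need to fix up the code to handle them. When an exception is thrown, check exception.Data["source"], as it is likely set to the HTML tag that caused the exception. Handling HTML in a nice manner is at times not trivial...

So now that expectations are realistically low, here's the code :)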

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using Majestic12;
    using System.IO;
    using System.Xml.Linq;
    using System.Diagnostics;
    using System.Text.RegularExpressions;
    
    namespace Majestic12ToXml {
    public class Majestic12ToXml {
    
        static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
    
            HTMLparser parser = OpenParser();
            parser.Init(htmlAsBytes);
    
            XElement currentNode = new XElement("document");
    
            HTMLchunk m12chunk = null;
    
            int xmlnsAttributeIndex = 0;
            string originalHtml = "";
    
            while ((m12chunk = parser.ParseNext()) != null) {
    
                try {
    
                    Debug.Assert(!m12chunk.bHashMode);  // popular default for Majestic-12 setting
    
                    XNode newNode = null;
                    XElement newNodesParent = null;
    
                    switch (m12chunk.oType) {
                        case HTMLchunkType.OpenTag:
    
                            // Tags are added as a child to the current tag, 
                            // except when the new tag implies the closure of 
                            // some number of ancestor tags.
    
                            newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
    
                            if (newNode != null) {
                                currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
    
                                newNodesParent = currentNode;
    
                                newNodesParent.Add(newNode);
    
                                currentNode = newNode as XElement;
                            }
    
                            break;
    
                        case HTMLchunkType.CloseTag:
    
                            if (m12chunk.bEndClosure) {
    
                                newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
    
                                if (newNode != null) {
                                    currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
    
                                    newNodesParent = currentNode;
                                    newNodesParent.Add(newNode);
                                }
                            }
                            else {
                                XElement nodeToClose = currentNode;
    
                                string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
    
                                while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
                                    nodeToClose = nodeToClose.Parent;
    
                                if (nodeToClose != null)
                                    currentNode = nodeToClose.Parent;
    
                                Debug.Assert(currentNode != null);
                            }
    
                            break;
    
                        case HTMLchunkType.Script:
    
                            newNode = new XElement("script", "REMOVED");
                            newNodesParent = currentNode;
                            newNodesParent.Add(newNode);
                            break;
    
                        case HTMLchunkType.Comment:
    
                            newNodesParent = currentNode;
    
                            if (m12chunk.sTag == "!--")
                                newNode = new XComment(m12chunk.oHTML);
                            else if (m12chunk.sTag == "![CDATA[")
                                newNode = new XCData(m12chunk.oHTML);
                            else
                                throw new Exception("Unrecognized comment sTag");
    
                            newNodesParent.Add(newNode);
    
                            break;
    
                        case HTMLchunkType.Text:
    
                            currentNode.Add(m12chunk.oHTML);
                            break;
    
                        default:
                            break;
                    }
                }
                catch (Exception e) {
                    var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
    
                    // the original html is copied for tracing/debugging purposes
                    originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
                        .Take(m12chunk.iChunkLength)
                        .Select(B => (char)B).ToArray()); 
    
                    wrappedE.Data.Add("source", originalHtml);
    
                    throw wrappedE;
                }
            }
    
            while (currentNode.Parent != null)
                currentNode = currentNode.Parent;
    
            return currentNode.Nodes();
        }
    
        static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
    
            string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
    
            XElement discoveredParent = null;
    
            // Get a list of all ancestors
            List<XElement> ancestors = new List<XElement>();
            XElement ancestor = nextPotentialParent;
            while (ancestor != null) {
                ancestors.Add(ancestor);
                ancestor = ancestor.Parent;
            }
    
            // Check if the new tag implies a previous tag was closed.
            if ("form" == m12chunkCleanedTag) {
    
                discoveredParent = ancestors
                    .Where(XE => m12chunkCleanedTag == XE.Name)
                    .Take(1)
                    .Select(XE => XE.Parent)
                    .FirstOrDefault();
            }
            else if ("td" == m12chunkCleanedTag) {
    
                discoveredParent = ancestors
                    .TakeWhile(XE => "tr" != XE.Name)
                    .Where(XE => m12chunkCleanedTag == XE.Name)
                    .Take(1)
                    .Select(XE => XE.Parent)
                    .FirstOrDefault();
            }
            else if ("tr" == m12chunkCleanedTag) {
    
                discoveredParent = ancestors
                    .TakeWhile(XE => !("table" == XE.Name
                                        || "thead" == XE.Name
                                        || "tbody" == XE.Name
                                        || "tfoot" == XE.Name))
                    .Where(XE => m12chunkCleanedTag == XE.Name)
                    .Take(1)
                    .Select(XE => XE.Parent)
                    .FirstOrDefault();
            }
            else if ("thead" == m12chunkCleanedTag
                      || "tbody" == m12chunkCleanedTag
                      || "tfoot" == m12chunkCleanedTag) {
    
    
                discoveredParent = ancestors
                    .TakeWhile(XE => "table" != XE.Name)
                    .Where(XE => m12chunkCleanedTag == XE.Name)
                    .Take(1)
                    .Select(XE => XE.Parent)
                    .FirstOrDefault();
            }
    
            return discoveredParent ?? nextPotentialParent;
        }
    
        static string CleanupTagName(string originalName, string originalHtml) {
    
            string tagName = originalName;
    
            tagName = tagName.TrimStart(new char[] { '?' });  // for nodes <?xml >
    
            if (tagName.Contains(':'))
                tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
    
            return tagName;
        }
    
        static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
    
        static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
    
            result = null;
            string attributeName = originalName;
    
            if (string.IsNullOrEmpty(originalName))
                return false;
    
            if (_startsAsNumeric.IsMatch(originalName))
                return false;
    
            //
            // transform xmlns attributes so they don't actually create any XML namespaces
            //
            if (attributeName.ToLower().Equals("xmlns")) {
    
                attributeName = "xmlns_" + xmlnsIndex.ToString();
                xmlnsIndex++;
            }
            else {
                if (attributeName.ToLower().StartsWith("xmlns:")) {
                    attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
                }   
    
                //
                // trim trailing \"
                //
                attributeName = attributeName.TrimEnd(new char[] { '\"' });
    
                attributeName = attributeName.Replace(":", "_");
            }
    
            result = attributeName;
    
            return true;
        }
    
        static Regex _weirdTag = new Regex(@"^<!\[.*\]>$");       // matches "<![if !supportEmptyParas]>"
        static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%@ ... %>"
        static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
    
        static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
    
            if (string.IsNullOrEmpty(m12chunk.sTag)) {
    
                if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
                    return new XElement("doctype");
    
                if (_weirdTag.IsMatch(originalHtml))
                    return new XElement("REMOVED_weirdBlockParenthesisTag");
    
                if (_aspnetPrecompiled.IsMatch(originalHtml))
                    return new XElement("REMOVED_ASPNET_PrecompiledDirective");
    
                if (_shortHtmlComment.IsMatch(originalHtml))
                    return new XElement("REMOVED_ShortHtmlComment");
    
                // Nodes like "<br <br>" will end up with a m12chunk.sTag==""...  We discard these nodes.
                return null;
            }
    
            string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
    
            XElement result = new XElement(tagName);
    
            List<XAttribute> attributes = new List<XAttribute>();
    
            for (int i = 0; i < m12chunk.iParams; i++) {
    
                if (m12chunk.sParams[i] == "<!--") {
    
                    // an HTML comment was embedded within a tag.  This comment and its contents
                    // will be interpreted as attributes by Majestic-12... skip these attributes
                    // by advancing i until the parameter that closes the comment
                    for (; i < m12chunk.iParams; i++) {
    
                        if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
                            break;
                    }
    
                    continue;
                }
    
                if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
                    continue;
    
                string attributeName = m12chunk.sParams[i];
    
                if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
                    continue;
    
                attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
            }
    
            // If attributes are duplicated with different values, we complain.
            // If attributes are duplicated with the same value, we remove all but 1.
            var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
    
            foreach (var duplicatedAttribute in duplicatedAttributes) {
    
                if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
                    throw new Exception("Attribute value was given different values");
    
                attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
                attributes.Add(duplicatedAttribute.First());
            }
    
            result.Add(attributes);
    
            return result;
        }
    
        static HTMLparser OpenParser() {
            HTMLparser oP = new HTMLparser();
    
            // The code+comments in this function are from the Majestic-12 sample documentation.
    
            // ...
    
            // This is optional, but if you want high performance then you may
            // want to set chunk hash mode to FALSE. This would result in tag params
            // being added to string arrays in HTMLchunk object called sParams and sValues, with number
            // of actual params being in iParams. See code below for details.
            //
            // When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
            oP.SetChunkHashMode(false);
    
            // if you set this to true then original parsed HTML for given chunk will be kept - 
            // this will reduce performance somewhat, but may be desirable in some cases where
            // reconstruction of HTML may be necessary
            oP.bKeepRawHTML = false;
    
            // if set to true (it is false by default), then entities will be decoded: this is essential
            // if you want to get strings that contain final representation of the data in HTML, however
            // you should be aware that if you want to use such strings into output HTML string then you will
            // need to do Entity encoding or same string may fail later
            oP.bDecodeEntities = true;
    
            // we have option to keep most entities as is - only replace stuff like &nbsp; 
            // this is called Mini Entities mode - it is handy when HTML will need
            // to be re-created after it was parsed, though in this case really
            // entities should not be parsed at all
            oP.bDecodeMiniEntities = true;
    
            if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
                oP.InitMiniEntities();
    
            // if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
            // extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
            // this only works if auto extraction is enabled
            oP.bAutoExtractBetweenTagsOnly = true;
    
            // if true then comments will be extracted automatically
            oP.bAutoKeepComments = true;
    
            // if true then scripts will be extracted automatically: 
            oP.bAutoKeepScripts = true;
    
            // if this option is true then whitespace before start of tag will be compressed to single
            // space character in string: " ", if false then full whitespace before tag will be returned (slower)
            // you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
            // a waste of CPU cycles
            oP.bCompressWhiteSpaceBeforeTag = true;
    
            // if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
            // forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
            // compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
            // or open
            oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
    
            return oP;
        }
    }
    }  
    


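Comments


By the way, HtmlAgilityPack has worked well for me in the past; I just prefer LINQ.

– Frank Schwieterman
Mar 8, 2009 at 22:21

What is the performance like when you add the LINQ conversion? Any idea how it compares with HtmlAgilityPack?

– user29439
Aug 3, 2011 at 22:42

I never did a performance comparison. These days I use HtmlAgilityPack, which is much less hassle. Unfortunately the code above has a lot of special cases that I didn't bother to write tests for, so I can't really maintain it.

– Frank Schwieterman
Aug 4, 2011 at 0:40

Answer #6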

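The Html Agility Pack has been mentioned before. If you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.

Answer #7

I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library: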

    SgmlReader
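
A rough sketch of how SgmlReader is commonly wired up, assuming the SgmlReader library (namespace Sgml) is referenced. SgmlToXml and Load are hypothetical names for this example; the reader is simply handed to XDocument.Load like any other XmlReader.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using Sgml;

// Hypothetical helper, not part of SgmlReader.
static class SgmlToXml
{
    // Reads loosely formed HTML and returns it as a LINQ to XML document.
    public static XDocument Load(string html)
    {
        using (var sgml = new SgmlReader())
        {
            sgml.DocType = "HTML";                             // use the built-in HTML DTD
            sgml.CaseFolding = CaseFolding.ToLower;            // normalize tag names to lower case
            sgml.WhitespaceHandling = WhitespaceHandling.All;
            sgml.InputStream = new StringReader(html);

            return XDocument.Load(sgml);
        }
    }
}
```

Once the document is loaded, ordinary LINQ to XML queries such as Descendants("a") apply.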

Answer #8

No third-party library: a WebBrowser class solution that can run in a console application as well as in ASP.NET

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Windows.Forms;
    using System.Threading;
    
    class ParseHTML
    {
        public ParseHTML() { }
        private string ReturnString;
    
        public string doParsing(string html)
        {
            // WebBrowser wraps an ActiveX control, so it has to run on an STA thread
            Thread t = new Thread(TParseMain);
            t.SetApartmentState(ApartmentState.STA);
            t.Start((object)html);
            t.Join();
            return ReturnString;
        }
    
        private void TParseMain(object html)
        {
            WebBrowser wbc = new WebBrowser();
            wbc.DocumentText = "feces of a dummy";        //;magic words        
            HtmlDocument doc = wbc.Document.OpenNew(true);
            doc.Write((string)html);
            this.ReturnString = doc.Body.InnerHtml + " do here something";
            return;
        }
    }
    


Usage:

    string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
    Console.WriteLine("before:" + myhtml);
    myhtml = (new ParseHTML()).doParsing(myhtml);
    Console.WriteLine("after:" + myhtml);
    


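Answer #9

The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.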
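
To make the difference concrete, here is a small sketch using only the .NET base class library. XmlStrictness and ParsesAsXml are hypothetical names for this example; the point is just that XDocument.Parse accepts XHTML-style markup and throws an XmlException on ordinary HTML with unclosed tags.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
static class XmlStrictness
{
    // Returns true when the markup is well-formed XML, false when the XML parser rejects it.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Typical causes: unclosed <br> or <p>, unquoted attributes, undefined entities.
            return false;
        }
    }
}
```

Markup such as "<p>one<br/>two</p>" passes this check, while the equally common "<p>one<br>two" does not, which is why a generic XML parser alone rarely survives real pages.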

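Comments


Isn't parsing well-formed HTML, as specified by the W3C, as exact a science as parsing XHTML?

– pupeno
Dec 8, 2009 at 12:56

It should be, but people don't do it.

– Dominic K
Feb 16, 2010 at 3:54

@J. Pablo Not nearly as easy (and hence the need for a library :p)... for instance, <p> tags need not be explicitly closed under HTML4/5. Yikes!

–user166390
Dec 22, 2010 at 4:13



Answer #10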

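I've used ZetaHtmlTidy in the past to load random websites and then query various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.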
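
The XPath step can be sketched with the base class library alone, assuming the page has already been tidied into well-formed XHTML and carries no default xmlns declaration (with one, the unprefixed XPath would need a namespace manager). The tidy step itself is not shown, and TextBlocks and Extract are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper for illustration only.
static class TextBlocks
{
    // Returns the text of every <p class="textblock"> anywhere under /html/body.
    public static List<string> Extract(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);

        return doc.XPathSelectElements("/html/body//p[@class='textblock']")
                  .Select(p => p.Value)
                  .ToList();
    }
}
```

The same XPath expression can be reused unchanged with other parsers that support XPath queries.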

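Answer #11

You could use an HTML DTD and the generic XML parsing libraries.

Comments


Very few real-world HTML pages will survive an XML parsing library.

– Frank Krueger
Sep 11, 2008 at 11:07

Answer #12

Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]

Answer #13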

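Depending on your needs, you might go for one of the more feature-rich libraries. I tried most or all of the solutions suggested here, but the one that stood out was Html Agility Pack. It is a very forgiving and flexible parser.

Answer #14

Try this script.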

    http://www.biterscripting.com/SS_URLs.html

When I use it with this URL,

    script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
    


it shows me all the links on the page for this thread.

    http://sstatic.net/so/all.css
    http://sstatic.net/so/favicon.ico
    http://sstatic.net/so/apple-touch-icon.png
    .
    .
    .
    


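You can modify that script to check for images, variables, whatever.

Answer #15

I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.

You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.

There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.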