SwiftSoup: Pure Swift HTML Parser, with the best of DOM, CSS, and jQuery (Supports Linux, iOS, macOS, tvOS, watchOS)

Overview

SwiftSoup

SwiftSoup is a pure Swift library, cross-platform (macOS, iOS, tvOS, watchOS and Linux!), for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. SwiftSoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

  • Scrape and parse HTML from a URL, file, or string
  • Find and extract data, using DOM traversal or CSS selectors
  • Manipulate the HTML elements, attributes, and text
  • Clean user-submitted content against a safe white-list, to prevent XSS attacks
  • Output tidy HTML

SwiftSoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; SwiftSoup will create a sensible parse tree.

Swift

  Swift version   SwiftSoup version
  Swift 5         >= 2.0.0
  Swift 4.2       1.7.4

Installation

CocoaPods

SwiftSoup is available through CocoaPods. To install it, simply add the following line to your Podfile:

pod 'SwiftSoup'

Carthage

SwiftSoup is also available through Carthage. To install it, simply add the following line to your Cartfile:

github "scinfu/SwiftSoup"

Swift Package Manager

SwiftSoup is also available through Swift Package Manager. To install it, simply add the dependency to your Package.swift file:

...
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "1.7.4"),
],
targets: [
    .target( name: "YourTarget", dependencies: ["SwiftSoup"]),
]
...

Try

Try out the simple online CSS selectors site:

SwiftSoup Test Site

Try out the example project by opening Terminal and typing:

pod try SwiftSoup

To parse an HTML document:

do {
    let html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>"
    let doc: Document = try SwiftSoup.parse(html)
    print(try doc.text())
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

The parser copes with malformed input, including:

  • Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
  • Implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
  • Reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
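
As a rough illustration of the clean-up described in the list above, the following sketch parses a fragment with unclosed <p> tags and inspects the result. It uses only calls that appear elsewhere in this README and is written as top-level script code, so an uncaught error would simply abort the run.

```swift
import SwiftSoup

// Two <p> tags, neither closed; no <html>, <head>, or <body> in the input.
let soupDoc = try SwiftSoup.parse("<p>Lorem <p>Ipsum")

// The parser closes the first <p> when it meets the second one.
let paragraphs = try soupDoc.select("p").array()
let paragraphTexts = try paragraphs.map { try $0.text() }
print(paragraphTexts)

// The document structure is created even though the input had none.
print(soupDoc.head() != nil, soupDoc.body() != nil)
```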

The object model of a document

  • Documents consist of Elements and TextNodes
  • The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
  • An Element contains a list of child Nodes, and has one parent Element. Elements also provide a filtered list of child Elements only.
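
The difference between child Nodes and child Elements can be sketched as follows. Note that getChildNodes() is not mentioned in this README; it is assumed here to be SwiftSoup's counterpart of the jsoup childNodes() accessor, so treat the example as illustrative.

```swift
import SwiftSoup

let doc = try SwiftSoup.parse("<div><p>One</p>loose text<p>Two</p></div>")
guard let div = try doc.select("div").first() else { fatalError("no <div> found") }

// All child nodes: <p>, a TextNode, and another <p>.
// (getChildNodes() is an assumed accessor, see the note above.)
let allNodes = div.getChildNodes()

// Only the child Elements: the two <p> tags.
let childElements = div.children().array()

// Every element has one parent Element; for this <div> it is <body>.
let parentTag = div.parent()?.tagName()

print(allNodes.count, childElements.count, parentTag ?? "none")
```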

Extract attributes, text, and HTML from elements

Problem

After parsing a document, and finding some elements, you'll want to get at the data inside those elements.

Solution

  • To get the value of an attribute, use the Node.attr(_ key: String) method
  • For the text on an element (and its combined children), use Element.text()
  • For HTML, use Element.html(), or Node.outerHtml() as appropriate
do {
    let html: String = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>"
    let doc: Document = try SwiftSoup.parse(html)
    let link: Element = try doc.select("a").first()!

    let text: String = try doc.body()!.text() // "An example link."
    let linkHref: String = try link.attr("href") // "http://example.com/"
    let linkText: String = try link.text() // "example"

    let linkOuterH: String = try link.outerHtml() // "<a href="http://example.com/"><b>example</b></a>"
    let linkInnerH: String = try link.html() // "<b>example</b>"
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

The methods above are the core of the element data access methods. There are others:

  • Element.id()
  • Element.tagName()
  • Element.className() and Element.hasClass(_ className: String)

All of these accessor methods have corresponding setter methods to change the data.
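
A small sketch of these accessors, together with two of the setters described later in this README (attr with a value, and addClass). The markup and the new attribute values are made up for the example.

```swift
import SwiftSoup

let doc = try SwiftSoup.parse("<div id='main' class='box round'>Hi</div>")
guard let div = try doc.select("div").first() else { fatalError("no <div> found") }

// Accessors
let originalId = div.id()
let tag = div.tagName()
let classes = try div.className()
let isRound = div.hasClass("round")
print(originalId, tag, classes, isRound)

// Setters: change the id attribute and add a class.
try div.attr("id", "content")
try div.addClass("wide")
print(div.id(), div.hasClass("wide"))
```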

Parse a document from a String

Problem

You have HTML in a Swift String, and you want to parse that HTML to get at its contents, or to make sure it's well formed, or to modify it. The String may have come from user input, a file, or from the web.

Solution

Use the static SwiftSoup.parse(_ html: String) method, or SwiftSoup.parse(_ html: String, _ baseUri: String).

do {
    let html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>"
    let doc: Document = try SwiftSoup.parse(html)
    print(try doc.text())
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

Description

The parse(_ html: String, _ baseUri: String) method parses the input HTML into a new Document. The base URI argument is used to resolve relative URLs into absolute URLs, and should be set to the URL where the document was fetched from. If that's not applicable, or if you know the HTML has a base element, you can use the parse(_ html: String) method.

As long as you pass in a non-null string, you're guaranteed to have a successful, sensible parse, with a Document containing (at least) a head and a body element.

Once you have a Document, you can get at the data using the appropriate methods in Document and its superclasses Element and Node.
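
To show what the base URI argument is for, here is a minimal sketch that resolves a relative link. The absUrl(_:) call is an assumption: it does not appear in this README and is taken from the jsoup API that SwiftSoup ports, so check it against the version you use. The URL is a placeholder.

```swift
import SwiftSoup

let html = "<a href='/docs/page.html'>Docs</a>"
// The second argument is the URL the document was (hypothetically) fetched from.
let doc = try SwiftSoup.parse(html, "http://example.com/")
guard let link = try doc.select("a").first() else { fatalError("no <a> found") }

// The attribute exactly as written in the markup.
let rawHref = try link.attr("href")
// The same attribute resolved against the base URI (assumed API, see above).
let absoluteHref = try link.absUrl("href")
print(rawHref, absoluteHref)
```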

Parsing a body fragment

Problem

You have a fragment of body HTML (e.g. div containing a couple of p tags; as opposed to a full HTML document) that you want to parse. Perhaps it was provided by a user submitting a comment, or editing the body of a page in a CMS.

Solution

Use the SwiftSoup.parseBodyFragment(_ html: String) method.

do {
    let html: String = "<div><p>Lorem ipsum.</p>"
    let doc: Document = try SwiftSoup.parseBodyFragment(html)
    let body: Element? = doc.body()
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

The parseBodyFragment method creates an empty shell document, and inserts the parsed HTML into the body element. If you used the normal SwiftSoup.parse(_ html: String) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any malformed HTML provided by the user is parsed into the body element.

The Document.body() method returns the document's body element (or nil if there is none); it corresponds to the first element matched by doc.getElementsByTag("body").
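
As a quick check of this behaviour, the sketch below parses the same fragment as in the Solution and looks at what ended up inside the body. Only calls already used in this README are involved.

```swift
import SwiftSoup

// The <div> is never closed; the parser is expected to repair that.
let fragment = "<div><p>Lorem ipsum.</p>"
let doc = try SwiftSoup.parseBodyFragment(fragment)
guard let body = doc.body() else { fatalError("document has no body") }

// The fragment's top-level elements become children of <body>.
let topLevelTags = body.children().array().map { $0.tagName() }
let bodyText = try body.text()
print(topLevelTags, bodyText)
```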

Stay safe

If you are going to accept HTML input from a user, you need to be careful to avoid cross-site scripting attacks. See the documentation for the Whitelist based cleaner, and clean the input with SwiftSoup.clean(_ bodyHtml: String, _ whitelist: Whitelist).

Sanitize untrusted HTML (to prevent XSS)

Problem

You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.

Solution

Use the SwiftSoup HTML Cleaner with a configuration specified by a Whitelist.

do {
    let unsafe: String = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
    let safe: String = try SwiftSoup.clean(unsafe, Whitelist.basic())!
    // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Discussion

A cross-site scripting attack against your site can really ruin your day, not to mention your users'. Many sites avoid XSS attacks by not allowing HTML in user submitted content: they enforce plain text only, or use an alternative markup syntax like wiki-text or Markdown. These are seldom optimal solutions for the user, as they lower expressiveness, and force the user to learn a new syntax.

A better solution may be to use a rich text WYSIWYG editor (like CKEditor or TinyMCE). These output HTML, and allow the user to work visually. However, their validation is done on the client side: you need to apply a server-side validation to clean up the input and ensure the HTML is safe to place on your site. Otherwise, an attacker can avoid the client-side JavaScript validation and inject unsafe HTML directly into your site.

The SwiftSoup whitelist sanitizer works by parsing the input HTML (in a safe, sand-boxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.

It does not use regular expressions, which are inappropriate for this task.

SwiftSoup provides a range of Whitelist configurations to suit most requirements; they can be modified if necessary, but take care.

The cleaner is useful not only for avoiding XSS, but also in limiting the range of elements the user can provide: you may be OK with textual a, strong elements, but not structural div or table elements.
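
The following sketch shows both effects at once, using the same Whitelist.basic() configuration as the Solution above: unsafe pieces are dropped, and a structural div is removed while its inline content is kept. The input string is invented for the example, and the exact whitespace of the cleaned output may differ between versions, so the code only checks what is and is not present.

```swift
import SwiftSoup

let submitted = "<div><strong>Bold</strong> "
    + "<script>alert('x')</script>"
    + "<a href='http://example.com/' onclick='steal()'>link</a></div>"

// clean returns an optional String in the README's example, hence the fallback.
let cleaned = try SwiftSoup.clean(submitted, Whitelist.basic()) ?? ""
print(cleaned)

let keepsStrong = cleaned.contains("<strong>Bold</strong>")
let dropsDiv = !cleaned.contains("<div")
let dropsScript = !cleaned.contains("<script")
let dropsHandler = !cleaned.contains("onclick")
print(keepsStrong, dropsDiv, dropsScript, dropsHandler)
```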

See also

  • See the XSS cheat sheet and filter evasion guide, as an example of how regular-expression filters don't work, and why a safe whitelist parser-based sanitizer is the correct approach.
  • See the Cleaner reference if you want to get a Document instead of a String return
  • See the Whitelist reference for the different canned options, and to create a custom whitelist
  • The nofollow link attribute

Set attribute values

Problem

You have a parsed document that you would like to update attribute values on, before saving it out to disk, or sending it on as an HTTP response.

Solution

Use the attribute setter methods Element.attr(_ key: String, _ value: String), and Elements.attr(_ key: String, _ value: String).

If you need to modify the class attribute of an element, use the Element.addClass(_ className: String) and Element.removeClass(_ className: String) methods.

The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div:

do {
    try doc.select("div.comments a").attr("rel", "nofollow")
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

Like the other methods in Element, the attr methods return the current Element (or Elements when working on a collection from a select). This allows convenient method chaining:

do {
    try doc.select("div.masthead").attr("title", "swiftsoup").addClass("round-box");
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Set the HTML of an element

Problem

You need to modify the HTML of an element.

Solution

Use the HTML setter methods in Element:

do {
    let doc: Document = try SwiftSoup.parse("<div>One</div><span>One</span>")
    let div: Element = try doc.select("div").first()! // <div>One</div>
    try div.html("<p>lorem ipsum</p>") // <div><p>lorem ipsum</p></div>
    try div.prepend("<p>First</p>")
    try div.append("<p>Last</p>")
    print(div)
    // now div is: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>
    
    let span: Element = try doc.select("span").first()! // <span>One</span>
    try span.wrap("<li><a href='http://example.com/'></a></li>")
    print(doc)
    // now: <li><a href="http://example.com/"><span>One</span></a></li>
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Discussion

  • Element.html(_ html: String) clears any existing inner HTML in an element, and replaces it with parsed HTML.
  • Element.prepend(_ first: String) and Element.append(_ last: String) add HTML to the start or end of an element's inner HTML, respectively
  • Element.wrap(_ around: String) wraps HTML around the outer HTML of an element.

See also

You can also use the Element.prependElement(_ tag: String) and Element.appendElement(_ tag: String) methods to create new elements and insert them into the document flow as a child element.
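
A minimal sketch of these two methods, assuming (as in jsoup) that each call returns the newly created child so that its text can be set in the same statement:

```swift
import SwiftSoup

let doc = try SwiftSoup.parse("<ul></ul>")
guard let list = try doc.select("ul").first() else { fatalError("no <ul> found") }

// appendElement adds the new child at the end, prependElement at the start.
try list.appendElement("li").text("Second")
try list.prependElement("li").text("First")

let items = try list.children().array().map { try $0.text() }
print(items)
```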

Setting the text content of elements

Problem

You need to modify the text content of an HTML document.

Solution

Use the text setter methods of Element:

do {
    let doc: Document = try SwiftSoup.parse("<div></div>")
    let div: Element = try doc.select("div").first()! // <div></div>
    try div.text("five > four") // <div>five &gt; four</div>
    try div.prepend("First ")
    try div.append(" Last")
    // now: <div>First five &gt; four Last</div>
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Discussion

The text setter methods mirror the HTML setter methods (see "Set the HTML of an element"):

  • Element.text(_ text: String) clears any existing inner HTML in an element, and replaces it with the supplied text.
  • Element.prepend(_ first: String) and Element.append(_ last: String) add text nodes to the start or end of an element's inner HTML, respectively

The text should be supplied unencoded: characters like <, > etc. will be treated as literals, not HTML.
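
To illustrate the point about unencoded text, this sketch passes a markup-looking string to text(_:) and then reads it back both as text and as HTML. It is only a demonstration of the escaping behaviour, using calls shown in this README.

```swift
import SwiftSoup

let doc = try SwiftSoup.parse("<div></div>")
guard let div = try doc.select("div").first() else { fatalError("no <div> found") }

// Supplied as text, so the angle brackets are stored literally.
try div.text("<b>five > four</b>")

let asText = try div.text()   // the string exactly as supplied
let asHtml = try div.html()   // the same content, entity-escaped
let childCount = div.children().array().count // no <b> element was created
print(asText, asHtml, childCount)
```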

Use DOM methods to navigate a document

Problem

You have an HTML document that you want to extract data from. You know generally the structure of the HTML document.

Solution

Use the DOM-like methods available after parsing HTML into a Document.

do {
    let html: String = "<a id=1 href='?foo=bar&mid&lt=true'>One</a> <a id=2 href='?foo=bar&lt;qux&lg=1'>Two</a>"
    let els: Elements = try SwiftSoup.parse(html).select("a")
    for link: Element in els.array() {
        let linkHref: String = try link.attr("href")
        let linkText: String = try link.text()
    }
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

Elements provide a range of DOM-like methods to find elements, and extract and manipulate their data. The DOM getters are contextual: called on a parent Document they find matching elements under the document; called on a child element they find elements under that child. In this way you can window in on the data you want.

Finding elements

  • getElementById(_ id: String)
  • getElementsByTag(_ tag: String)
  • getElementsByClass(_ className: String)
  • getElementsByAttribute(_ key: String) (and related methods)
  • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
  • Graph: parent(), children(), child(_ index: Int)

Element data

  • attr(_ key: String) to get and attr(_ key: String, _ value: String) to set attributes
  • attributes() to get all attributes
  • id(), className() and classNames()
  • text() to get and text(_ value: String) to set the text content
  • html() to get and html(_ value: String) to set the inner HTML content
  • outerHtml() to get the outer HTML value
  • data() to get data content (e.g. of script and style tags)
  • tag() and tagName()

Manipulating HTML and text

  • append(_ html: String), prepend(_ html: String)
  • appendText(_ text: String), prependText(_ text: String)
  • appendElement(_ tagName: String), prependElement(_ tagName: String)
  • html(_ value: String)
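
A short sketch combining several of the methods listed above. The markup is invented; getElementById and nextElementSibling are assumed to be throwing calls that return optionals, which is how they are used here.

```swift
import SwiftSoup

let html = "<div id='menu'>"
    + "<a class='nav' href='/'>Home</a>"
    + "<a class='nav' href='/about'>About</a>"
    + "</div>"
let doc = try SwiftSoup.parse(html)
guard let menu = try doc.getElementById("menu") else { fatalError("no #menu found") }

// The getters are contextual: these only search inside #menu.
let links = try menu.getElementsByTag("a").array()
let navCount = try menu.getElementsByClass("nav").array().count

// Sibling and parent navigation from the first link.
let secondText = try links.first?.nextElementSibling()?.text()
let parentId = links.first?.parent()?.id()
print(navCount, secondText ?? "none", parentId ?? "none")
```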

Use selector syntax to find elements

Problem

You want to find or manipulate elements using a CSS or jQuery-like selector syntax.

Solution

Use the Element.select(_ selector: String) and Elements.select(_ selector: String) methods:

do {
    let doc: Document = try SwiftSoup.parse("...")
    let links: Elements = try doc.select("a[href]") // a with href
    let pngs: Elements = try doc.select("img[src$=.png]")
    // img with src ending .png
    let masthead: Element? = try doc.select("div.masthead").first()
    // div with class=masthead
    let resultLinks: Elements = try doc.select("h3.r > a") // a elements that are direct children of h3.r
} catch Exception.Error(let type, let message) {
    print(message)
} catch {
    print("error")
}

Description

SwiftSoup elements support a CSS (or jQuery) like selector syntax to find matching elements, that allows very powerful and robust queries.

The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.

Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
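
The contextual behaviour can be sketched by chaining two select calls: the second one only searches inside the results of the first. The markup is made up for the example.

```swift
import SwiftSoup

let html = "<div class='a'><p>In A</p></div>"
    + "<div class='b'><p>In B</p><p>Also in B</p></div>"
let doc = try SwiftSoup.parse(html)

// Selecting from the document finds every <p>.
let allParagraphs = try doc.select("p").array().count

// Selecting from the matched div.b finds only the <p> elements inside it.
let inB = try doc.select("div.b").select("p").array().map { try $0.text() }
print(allParagraphs, inB)
```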

Selector overview

  • tagname: find elements by tag, e.g. a
  • ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
  • #id: find elements by ID, e.g. #logo
  • .class: find elements by class name, e.g. .masthead
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
  • [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
  • [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
  • *: all elements, e.g. *
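
A few of the attribute selectors above, tried against a small invented document. This is a sketch rather than a reference; the regular-expression selector in particular is copied from the list and assumed to behave as in jsoup.

```swift
import SwiftSoup

let html = """
<a href="http://example.com/path/one">One</a>
<a href="/local">Two</a>
<img src="logo.png"><img src="photo.JPG">
<span data-id="7">Three</span>
"""
let doc = try SwiftSoup.parse(html)

let absolute = try doc.select("a[href^=http]").array().count    // value starts with
let withPath = try doc.select("[href*=/path/]").array().count   // value contains
let pngs = try doc.select("img[src$=.png]").array().count       // value ends with
let images = try doc.select("img[src~=(?i)\\.(png|jpe?g)]").array().count // regex match
let dataset = try doc.select("[^data-]").array().count          // attribute name prefix
print(absolute, withPath, pngs, images, dataset)
```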

Selector combinations

  • el#id: elements with ID, e.g. div#logo
  • el.class: elements with class, e.g. div.masthead
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • Ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
  • siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
  • siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
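
The difference between the combinators is easiest to see side by side. The sketch below runs several of them over one small, invented document and counts the matches.

```swift
import SwiftSoup

let html = "<div class='content'>"
    + "<h1>Title</h1><p>Intro</p>"
    + "<section><p>Nested</p></section>"
    + "<p>Outro</p></div>"
let doc = try SwiftSoup.parse(html)

let descendants = try doc.select("div.content p").array().count   // any depth
let direct = try doc.select("div.content > p").array().count      // direct children only
let adjacent = try doc.select("h1 + p").array().map { try $0.text() } // immediately after h1
let following = try doc.select("h1 ~ p").array().count            // any later sibling of h1
let grouped = try doc.select("h1, section").array().count         // either selector
print(descendants, direct, adjacent, following, grouped)
```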

Pseudo selectors

  • :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
  • :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
  • :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
  • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
  • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(swiftsoup)
  • :containsOwn(text): find elements that directly contain the given text
  • :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
  • :matchesOwn(regex): find elements whose own text matches the specified regular expression
  • Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
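
A sketch of some of these pseudo selectors on an invented document, including the 0-based indexing noted above:

```swift
import SwiftSoup

let html = """
<table><tr><td>a</td><td>b</td><td>c</td></tr></table>
<div><p>I like SwiftSoup</p></div>
<div class="logo"><span>Logo</span></div>
"""
let doc = try SwiftSoup.parse(html)

// Indexed selectors are 0-based: :lt(2) matches sibling indexes 0 and 1.
let firstTwo = try doc.select("td:lt(2)").array().map { try $0.text() }
let second = try doc.select("td:eq(1)").array().map { try $0.text() }

let divsWithP = try doc.select("div:has(p)").array().count
let divsNotLogo = try doc.select("div:not(.logo)").array().count
// :contains is case-insensitive, so "swiftsoup" matches "SwiftSoup".
let mentions = try doc.select("p:contains(swiftsoup)").array().count
print(firstTwo, second, divsWithP, divsNotLogo, mentions)
```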

Examples

To parse an HTML document from String:

let html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>"
guard let doc: Document = try? SwiftSoup.parse(html) else { return }

Get all text nodes:

guard let elements = try? doc.getAllElements() else { return html }
for element in elements {
    for textNode in element.textNodes() {
        [...]
    }
}

Set CSS using SwiftSoup:

try doc.head()?.append("<style>html {font-size: 2em}</style>")

Get HTML value

let html = "<div class=\"container-fluid\">"
    + "<div class=\"panel panel-default \">"
    + "<div class=\"panel-body\">"
    + "<form id=\"coupon_checkout\" action=\"http://uat.all.com.my/checkout/couponcode\" method=\"post\">"
    + "<input type=\"hidden\" name=\"transaction_id\" value=\"4245\">"
    + "<input type=\"hidden\" name=\"lang\" value=\"EN\">"
    + "<input type=\"hidden\" name=\"devicetype\" value=\"\">"
    + "<div class=\"input-group\">"
    + "<input type=\"text\" class=\"form-control\" id=\"coupon_code\" name=\"coupon\" placeholder=\"Coupon Code\">"
    + "<span class=\"input-group-btn\">"
    + "<button class=\"btn btn-primary\" type=\"submit\">Enter Code</button>"
    + "</span>"
    + "</div>"
    + "</form>"
    + "</div>"
    + "</div>"
guard let doc: Document = try? SwiftSoup.parse(html) else { return } // parse html
let elements = try doc.select("[name=transaction_id]") // query
let transaction_id = try elements.get(0) // select first element
let value = try transaction_id.val() // get value
print(value) // 4245

How to remove all the html from a string

guard let doc: Document = try? SwiftSoup.parse(html) else { return } // parse html
guard let txt = try? doc.text() else { return }
print(txt)

How to get and update XML values

let xml = "<?xml version='1' encoding='UTF-8' something='else'?><val>One</val>"
guard let doc = try? SwiftSoup.parse(xml, "", Parser.xmlParser()) else { return }
guard let element = try? doc.getElementsByTag("val").first() else { return } // find the first element
try element.text("NewValue") // edit the value
let valueString = try element.text() // "NewValue"

How to get all <img src>

do {
    let doc: Document = try SwiftSoup.parse(html)
    let srcs: Elements = try doc.select("img[src]")
    let srcsStringArray: [String?] = srcs.array().map { try? $0.attr("src").description }
    // do something with srcsStringArray
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

Get all href of <a>

let html = "<a id=1 href='?foo=bar&mid&lt=true'>One</a> <a id=2 href='?foo=bar&lt;qux&lg=1'>Two</a>"
guard let els: Elements = try? SwiftSoup.parse(html).select("a") else { return }
for element: Element in els.array() {
    print(try? element.attr("href"))
}

Output:

"?foo=bar&mid&lt=true"
"?foo=bar<qux&lg=1"

Escape and Unescape

let text = "Hello &<> Å å π 新 there ¾ © »"

print(Entities.escape(text))
print(Entities.unescape(text))


print(Entities.escape(text, OutputSettings().encoder(String.Encoding.ascii).escapeMode(Entities.EscapeMode.base)))
print(Entities.escape(text, OutputSettings().charset(String.Encoding.ascii).escapeMode(Entities.EscapeMode.extended)))
print(Entities.escape(text, OutputSettings().charset(String.Encoding.ascii).escapeMode(Entities.EscapeMode.xhtml)))
print(Entities.escape(text, OutputSettings().charset(String.Encoding.utf8).escapeMode(Entities.EscapeMode.extended)))
print(Entities.escape(text, OutputSettings().charset(String.Encoding.utf8).escapeMode(Entities.EscapeMode.xhtml)))

Output:

"Hello &amp;&lt;&gt; Å å π 新 there ¾ © »"
"Hello &<> Å å π 新 there ¾ © »"


"Hello &amp;&lt;&gt; &Aring; &aring; &#x3c0; &#x65b0; there &frac34; &copy; &raquo;"
"Hello &amp;&lt;&gt; &angst; &aring; &pi; &#x65b0; there &frac34; &copy; &raquo;"
"Hello &amp;&lt;&gt; &#xc5; &#xe5; &#x3c0; &#x65b0; there &#xbe; &#xa9; &#xbb;"
"Hello &amp;&lt;&gt; Å å π 新 there ¾ © »"
"Hello &amp;&lt;&gt; Å å π 新 there ¾ © »"

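As a small round-trip sketch of the default escape and unescape calls: the string is invented, and the try before unescape is a precaution in case the call throws in your version (an unneeded try only produces a compiler warning).

```swift
import SwiftSoup

let original = "Fish & <chips>"

// Default settings escape the characters that are special in HTML.
let escaped = Entities.escape(original)

// Unescaping the result should give back the original string.
let restored = try Entities.unescape(escaped)
print(escaped, restored)
```
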
Author

Nabil Chatbi

Note

SwiftSoup was ported to Swift from the Java library jsoup.

License

SwiftSoup is available under the MIT license. See the LICENSE file for more info.

Comments
  • Cannot archive with Xcode 11 GM 2

    Hello,

    I'm unable to archive SwiftSoup with Xcode 11 GM 2. it hangs during the Xcode archive process (seemingly on a different framework, but removing SwiftSoup fixes it)

    I've seen issues #116, but it was closed without a resolution for me.

    I've tried @Dschee's and @kkla320's forks/PRs, but neither work for me.

    How could I get some info that would help you debug this?

    opened by WillBishop 62
  • Archiving is stucked on Xcode 11 beta 3

    In the new Xcode 11 beta 3, Carthage gets stuck while building SwiftSoup. The same happens when archiving directly from Xcode.

    It looks like an issue during optimization. After disabling optimizations for release builds, everything works:

    SWIFT_OPTIMIZATION_LEVEL = -Onone
    
    opened by Igor-Palaguta 29
  • CPU is high

    Thank you for this HTML parser; it's a very smart tool, like jsoup. But I found that CPU usage is very high when I use it.

    I compared it with other tools based on libxml2 (like Fuzi and Kanna). In my project, Fuzi and Kanna use 10%-15% CPU when parsing one website, but SwiftSoup uses 60% or more.

    But I like this project because I like jsoup too.

    enhancement 
    opened by sankxuan 18
  • Dont work with Xcode 11 beta 5 and Carthage

    The carthage update --platform ios process in Carthage 0.33.0 gets stuck with SwiftSoup 2.2.0. I don't know whether this issue is on the SwiftSoup side or the Carthage side. It would be nice if you could provide some feedback about this issue.

    opened by catluc 11
  • Linux Swift 3.1 Build Failing

    SwiftSoup does not build on Linux with Swift 3.1 due to Apple helpfully renaming RegularExpression to NSRegularExpression on Linux in 3.1... I have no idea why either! I'll get a PR together which sorts this out and updates the CI scripts to run on 3.1 across everything

    opened by 0xTim 11
  • What is the iOS minimum version to support and what architecture?

    I receive an error while making release saying the below message: Undefined symbols for architecture armv7: "_FE9SwiftSoupScG9AmpersandSc", referenced from: function signature specialization <Arg[0] = Owned To Guaranteed> of SwiftSoup.TokeniserState.read (SwiftSoup.Tokeniser, SwiftSoup.CharacterReader) throws -> () in TokeniserState.o "_FE9SwiftSoupScG8LessThanSc", referenced from: function signature specialization <Arg[0] = Owned To Guaranteed> of SwiftSoup.TokeniserState.read (SwiftSoup.Tokeniser, SwiftSoup.CharacterReader) throws -> () in TokeniserState.o ld: symbol(s) not found for architecture armv7 clang: error: linker command failed with exit code 1 (use -v to see invocation)

    (debug works fine but only getting release generates the error)

    opened by mDadashi 10
  • Performance tweaks

    Thanks for SwiftSoup!

    I've been using it to parse HTML fragments from podcast feeds and reduce them to a defined set of tags. Then I render the trees back out according to my own style sheet. SwiftSoup has been working like a champ - it's super reliable and the API fits well with the things I'm using it for.

    My only potential reservation is that parsing can be a bit sluggish. Although, I'd also note that I'm particularly sensitive to performance since part of my goal is to parse and render fragments on the fly while they're being scrolled around on screen.

    I spent some time looking at SwiftSoup traces to see if there was any low-hanging fruit that might allow for faster parsing. I didn't find anything really glaring (and I did see the many signs of someone having gone down this road before :-), but I did get some solid improvement through a combination of small changes. Overall, I got the benchmark time down from 6.345 seconds to 1.335, not quite 5X faster.

    The main issue seems to be that Swift is really slow to convert non-strings (e.g., arrays of characters) to strings. Evidently, it does all the Unicode decoding and indexing up front. It's about 10X faster to promote a substring into a string versus extracting the codepoints and reassembling them into a new string.

    The big place this shows up is in CharacterReader, which currently uses a [UnicodeScalar] input buffer indexed by Ints. Just converting that to a String.UnicodeScalarView and using string-native indexing speeds up parsing by about 2X. I'm usually wary of String's clunky indexing system, but it's actually not bad at all for this application. Pretty much everything CharacterReader was doing fits right in without too much of a mismatch.

    The other place where this makes a big difference is in StringBuilder, which is currently backed by a Character array. Just switching this to strings makes a big difference. Actually, there seems to be a particular optimization for [String].reduce("", +), so I made the backing a string array. Nothing actually gets assembled until the composite string is extracted. That speeded up parsing by about 25%.

    Oddly enough, the other large improvement was from avoiding calls to String.trimmingCharactersIn. I'm not sure if this is because of bridging (I think this is actually an NSString method) or because CharacterSets in general are slow. I just added some code to peek at the first and last characters and make sure there was some reason to do a full trim.

    Other things:

    • Attributes was based on an OrderedDictionary that wasn't used anywhere else. But the average number of attributes on a tag is so low that it doesn't even seem to be worthwhile to put the attributes in a dictionary. Just removing OrderedDictionary and putting the attributes in an array that's searched linearly seems to give a significant performance gain. (Although of course, the worst case gets significantly worse.)

    • It looks like a binary search was contemplated for looking up HTML entities, but never actually made it into the code. I added it. It helps a bit, although it's not a huge difference.

    • Finally, it seems to help if all tag sets are spelled out as static arrays and searched with the native Swift contains() method. I'm actually really surprised by this and can't really explain why the difference is so significant.

    I also did a few experiments on consumeToAny, which is a major workhorse: Sets of characters, giant switch statements, making sure everything was constant-ified (as I did for tags). None of that panned out.

    Anyway, you're welcome to take whatever you like from this. If there's a part you'd like segmented out, just let me know and I can reshuffle the commits.

    I should caution that it wasn't entirely clear to me how serious I should be about preserving internal APIs. Most everything seems to be marked open or public, even things that seem very plumbing-ish, e.g., StringBuilder and CharacterReader. The only API that has changed in a potentially problematic way is CharacterReader - with all the indexes changing, I didn't try too hard to maintain exact compatibility. So if that's a goal, this code might need some further adjustment.

    I didn't update the example project or versioning. I added a PerformanceTest scheme that runs the benchmark test (six pages from Amazon, Wirecutter, Google, Reuters, Wikipedia, and GitHub) on macOS. But I always find the target/scheme/configuration/project system a bit confusing; I hope I didn't screw anything else up there. The benchmark files are static so that performance numbers can be comparable over a longer period.

    opened by GarthSnyder 9
  • Parser don't work when string has 0 notes

    public static func parseBodyFragment(_ bodyHtml: String, _ baseUri: String) throws -> Document {
        let doc: Document = Document.createShell(baseUri)
        if let body: Element = doc.body() {
            let nodeList: Array<Node> = try parseFragment(bodyHtml, body, baseUri)
            //var nodes: [Node] = nodeList.toArray(Node[nodeList.size()]) // the node list gets modified when re-parented
            if nodeList.count > 0 {
                for i in 1..<nodeList.count {
                    try nodeList[i].remove()
                }
                for node: Node in nodeList {
                    try body.appendChild(node)
                }
            }
        }
        return doc
    }
    

    I have added nodeList.count for that.

    opened by chirayu25 9
  • Does not compile on Linux with Swift 3.2

    Works fine on macOS with 3.2, but naturally fails on Linux 😞

    The log is here but looks like an issue with Matcher. I'll try and dig into it when I get a chance

    opened by 0xTim 8
  • How can I decode a URL's contents using a different character encoding?

    When I try to parse a URL, I sometimes get an error like:

    couldn’t be opened because the text encoding of its contents can’t be determined.

    How can I decode the contents as EUC-KR or another character encoding when decoding as UTF-8 fails?

    func getStringFromHtml(urlString: String) -> String {
        let url = URL(string: urlString)!
        var result = ""

        do {
            let html = try String(contentsOf: url)
            let doc: Document = try SwiftSoup.parse(html)

            let meta: Element = try doc.select("meta[property=og:title]").first()!
            let text: String = try meta.attr("content")
            result = text
        } catch {
            print("error")
        }
        return result
    }
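
    One way to approach the encoding question is to fetch the raw bytes first and then try a list of encodings in order. The sketch below uses only Foundation; the function name `decodeText` and the default candidate list are assumptions made for illustration. EUC-KR is not a built-in `String.Encoding` case, so the example bridges it from Core Foundation, which is assumed to be available only on Apple platforms.

    ```swift
    import Foundation

    /// Tries each candidate encoding in order and returns the first successful decoding.
    /// Note: `.isoLatin1` accepts any byte sequence, so it only makes sense as a last resort.
    func decodeText(_ data: Data, candidates: [String.Encoding] = [.utf8, .isoLatin1]) -> String? {
        for encoding in candidates {
            if let text = String(data: data, encoding: encoding) {
                return text
            }
        }
        return nil
    }

    #if canImport(Darwin)
    // Assumption: Apple platforms only. Bridges the Core Foundation EUC-KR constant
    // to a `String.Encoding` value.
    let eucKR = String.Encoding(
        rawValue: CFStringConvertEncodingToNSStringEncoding(
            CFStringEncoding(CFStringEncodings.EUC_KR.rawValue)))
    #endif
    ```

    With this in place, the bytes could be loaded with `Data(contentsOf: url)` and passed to `decodeText(data, candidates: [.utf8, eucKR])` before handing the resulting string to `SwiftSoup.parse`.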
    

    I have one more question.

    How can I select an element not by an exact attribute value, but by an attribute value that contains a specific string?

    For example:

    In the HTML I'm trying to parse, there is a <meta property="og:title" content="STRING I WANT" />. But sometimes the page has no such 'property' and instead has <meta name="twitter:title" content="STRING I WANT" />.

    So what I want to do is:

    Find the meta element whose property (or name) contains ":title", and read its content string.

    Thank you.
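
    For the second question, SwiftSoup's selector syntax includes attribute-suffix matching (`[attr$=value]`) and selector groups separated by commas. A minimal sketch, assuming the page carries either an `og:title` or a `twitter:title` meta tag; the function name `titleFromMeta` is hypothetical:

    ```swift
    import SwiftSoup

    /// Returns the `content` of the first <meta> whose `property` or `name`
    /// attribute ends with ":title", or nil when no such tag exists.
    func titleFromMeta(_ html: String) throws -> String? {
        let doc = try SwiftSoup.parse(html)
        // `$=` matches attribute values ending with the given suffix;
        // the comma means "match either selector".
        let matches = try doc.select("meta[property$=:title], meta[name$=:title]")
        guard let meta = matches.first() else {
            return nil
        }
        return try meta.attr("content")
    }
    ```

    Using `[attr*=value]` instead of `[attr$=value]` would match ":title" anywhere in the attribute value rather than only at the end.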

    opened by alphonse1234 7
  • Parsing div html classes with SwiftSoup?

    I'm trying to use Alamofire and SwiftSoup to display some body text from a website.

    The HTML that I need is in a div with a class, and for some reason SwiftSoup won't read it.

    The HTML div is <div class="translation-row"> with another <div class="t-english colorblue"> inside, and when I try to parse it with SwiftSoup as shown below, I get no text. Is there a special way to parse IDs with SwiftSoup? I am able to parse div classes.

    My view controller code is:

    import UIKit
    import Alamofire
    import SwiftSoup
    
    class ViewController: UIViewController {
    
        override func viewDidLoad() {
            super.viewDidLoad()
    
            let pageURL = "http://www.sikhnet.com/hukam"
    
            Alamofire.request(pageURL, method: .post, parameters: nil, encoding: URLEncoding.default).validate(contentType: ["application/x-www-form-urlencoded"]).response { (response) in
    
                if let data = response.data, let utf8Text = String(data: data, encoding: .utf8) {
                    do {
                        let html: String = utf8Text
                        let doc: Document = try SwiftSoup.parse(html)
    
                        // Print the English line(s) inside every translation row.
                        for lineRow in try doc.select("div.translation-row") {
                            print("------------------")
                            for englishLine in try lineRow.select("div.t-english.colorblue") {
                                print(try englishLine.text())
                            }
                        }
                    } catch let error {
                        print(error.localizedDescription)
                    }
                }
            }
        }
    
        override func didReceiveMemoryWarning() {
            super.didReceiveMemoryWarning()
            // Dispose of any resources that can be recreated.
        }
    }

    There is another issue that I'm not sure how to solve: part of the text is in a non-Latin script. Do I need to import a web font for that text, or will SwiftSoup parse it into the characters shown? I haven't successfully parsed that div yet, so I don't know whether it will show up at all, and I wanted to ask about the correct way to parse HTML text that contains non-Latin characters.
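
    As a point of comparison, the nested lookup can also be written as a single descendant selector, and since Swift strings are Unicode, non-Latin text comes back as ordinary characters once the bytes have been decoded correctly; fonts only matter when the text is displayed. The sketch below runs against an inline sample rather than the real page. The `t-gurmukhi` class and the helper name `texts(in:matching:)` are assumptions made for illustration.

    ```swift
    import SwiftSoup

    /// Returns the text of every element matching `selector`, in document order.
    func texts(in html: String, matching selector: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        return try doc.select(selector).array().map { try $0.text() }
    }

    // Inline sample that mimics the structure described above.
    let sample = """
    <div class="translation-row">
      <div class="t-gurmukhi">ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ</div>
      <div class="t-english colorblue">Hello</div>
    </div>
    """

    // A space between two selectors means "descendant of".
    let english = try texts(in: sample, matching: "div.translation-row div.t-english.colorblue")
    let gurmukhi = try texts(in: sample, matching: "div.t-gurmukhi")
    print(english, gurmukhi)
    ```

    If the selectors work on a sample like this but not on the downloaded page, the response body itself is worth printing, since the markup received by the request may differ from what the browser shows.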

    opened by cluelessoodles 7
  • XML parse blocks main thread

    Dear all,

    I use SwiftSoup to extract XML data for my SwiftUI app. It works perfectly, but I recently ran into a problem when working with very large XML files (10,000 lines). Even though I use SwiftUI's async and task functionality to parse the data in the background, it still blocks the main thread.

    The sample code below illustrates the situation.

    @State var data: [Element] = []

    List { 
         ForEach(data) { element in
           ElementRow(element: element)
         }
    }
    .task {
         let (data, _) = try await session.data(from: some url)
         let xml = String(data: data, encoding: .utf8)
         let document = try SwiftSoup.parse(xml, "", Parser.xmlParser()) // <- This is where the hang occurs
         <...>
         self.data = result
    }
    

    Any ideas on how to avoid the app hanging while SwiftSoup parses a large XML file?

    Thanks!
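
    A likely explanation is that the closure passed to `.task` inherits the view's main-actor isolation, so the synchronous parse between the `await` points still runs on the main thread. A minimal sketch of one workaround, using only Swift concurrency: run the synchronous work in a detached task and return plain `Sendable` values. The helper name `runOffMainActor` is hypothetical.

    ```swift
    import Foundation

    /// Runs a synchronous, CPU-heavy closure in a detached task, so it does not
    /// execute on the caller's actor (for example the main actor).
    func runOffMainActor<T: Sendable>(
        priority: TaskPriority = .userInitiated,
        _ work: @escaping @Sendable () throws -> T
    ) async throws -> T {
        return try await Task.detached(priority: priority) {
            try work()
        }.value
    }

    // Hypothetical usage inside the `.task` modifier ("item > title" is a made-up selector):
    //
    // let titles: [String] = try await runOffMainActor {
    //     let document = try SwiftSoup.parse(xml, "", Parser.xmlParser())
    //     return try document.select("item > title").array().map { try $0.text() }
    // }
    // self.titles = titles
    ```

    Returning strings or other value types, rather than SwiftSoup `Element` objects, keeps the result `Sendable` and avoids sharing a mutable document tree between threads.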

    opened by tomasbek 0
  • getAllElements doesn't return elements in the order of the original HTML

    Hello,

    I'm experiencing a problem when calling the getAllElements function. It seems the tags are not always in the right order.

    For example, if I have:

    <div>
      <p>Hello</p>
      <p>Hi</p>
    Bye
    </div>
    
    

    The getAllElements function returns them in this order: "Bye, Hello, Hi". So when we try to build the HTML in a text view on tvOS, the elements are unfortunately not in the right order.

    Is this normal behavior or a bug?

    Thanks in advance!
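
    One possible explanation, not confirmed in this thread, is that "Bye" is a bare text node rather than an element, so it only shows up through the parent div's text and not as an entry of its own. A minimal sketch that keeps mixed content in document order by walking the child nodes instead; the helper name `orderedChildTexts` is hypothetical:

    ```swift
    import Foundation
    import SwiftSoup

    /// Collects the direct children of `element` in document order,
    /// keeping both child elements and bare text nodes.
    func orderedChildTexts(of element: Element) throws -> [String] {
        var pieces: [String] = []
        for node in element.getChildNodes() {
            if let textNode = node as? TextNode {
                // Skip whitespace-only nodes produced by indentation and newlines.
                let text = textNode.text().trimmingCharacters(in: .whitespacesAndNewlines)
                if !text.isEmpty {
                    pieces.append(text)
                }
            } else if let child = node as? Element {
                pieces.append(try child.text())
            }
        }
        return pieces
    }
    ```

    This only looks at direct children; deeper structures would need the same walk applied recursively.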

    opened by AlexandreAad 0
  • Sanitizer does not support CSS properties

    The Whitelist class only supports tags and attributes, not CSS styles, unlike other HTML sanitizer libraries for the web such as DOMPurify and HtmlSanitizer. Would it be possible to add this functionality?

    opened by lingzlu 0
  • Heap Corruption

    This issue comes up at random times.

    Getting signal SIGABRT in the CharacterReader class.

    public func consumeToAny(_ chars: [UnicodeScalar]) -> String {
        let start = pos
        while pos < input.endIndex {
            if chars.contains(input[pos]) { // signal pops up here.
                break
            }
            pos = input.index(after: pos)
        }
        return cacheString(start, pos)
    }

    APP NAME(11718,0x70000133e000) malloc: Heap corruption detected, free list is damaged at 0x6000009b8570
    *** Incorrect guard value: 140528044080704
    APP NAME(11718,0x700001132000) malloc: *** error for object 0x6000009b8870: pointer being freed was not allocated
    APP NAME(11718,0x700001132000) malloc: *** set a breakpoint in malloc_error_break to debug
    APP NAME(11718,0x700001132000) malloc: *** error for object 0x6000009b8870: pointer being freed was not allocated

    opened by edutim 0
  • Large file lookups are slow

    I have a large HTML file, about 13 MB, and it takes far too long to look up the elements I need to modify. Is there a way to make these lookups faster?

    let html = try String(contentsOf: url, encoding: .utf8)
    let document = try SwiftSoup.parse(html)
    let fragmentIds: [String] = [......] // there are one thousand
    for fragmentID in fragmentIds {
        let links = try document.select("[id=\(fragmentID)]")
        if links.count > 0 {
            let link = try document.createElement("a")
            try link.attr("href", fragmentID)
            try link.appendText(fragmentID)
            try links.get(0).before(link)
        }
    }
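
    Each `select` call in the loop above has to walk the document again, so a thousand lookups mean a thousand traversals of a very large tree. A minimal sketch of one alternative, assuming the IDs are unique: collect every element that has an `id` attribute in a single pass, index them in a dictionary, and then look each fragment ID up in constant time. The function name `insertAnchors` is hypothetical.

    ```swift
    import SwiftSoup

    /// Inserts an <a> element before each element whose id appears in `fragmentIDs`.
    func insertAnchors(in document: Document, for fragmentIDs: [String]) throws {
        // One traversal: index every element that carries an id attribute.
        var elementsByID: [String: Element] = [:]
        for element in try document.select("[id]").array() {
            let id = element.id()
            if elementsByID[id] == nil {
                elementsByID[id] = element // keep the first occurrence
            }
        }

        // Dictionary lookups replace the per-ID document scans.
        for fragmentID in fragmentIDs {
            guard let target = elementsByID[fragmentID] else { continue }
            let link = try document.createElement("a")
            try link.attr("href", fragmentID)
            try link.appendText(fragmentID)
            try target.before(link)
        }
    }
    ```

    Parsing a 13 MB file may still take a noticeable amount of time on its own; this sketch only addresses the repeated lookups.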
    
    opened by forkdog 0
Kanna(鉋) is an XML/HTML parser for Swift.

Kanna(鉋) Kanna(鉋) is an XML/HTML parser for cross-platform(macOS, iOS, tvOS, watchOS and Linux!). It was inspired by Nokogiri(鋸). ℹ️ Documentation Fea

Atsushi Kiwaki 2.3k Dec 31, 2022
Ji (戟) is an XML/HTML parser for Swift

Ji 戟 Ji (戟) is a Swift wrapper on libxml2 for parsing XML/HTML. Features Build XML/HTML Tree and Navigate. XPath Query Supported. Comprehensive Unit T

HongHao Zhang 824 Dec 15, 2022
Lightweight Networking and Parsing framework made for iOS, Mac, WatchOS and tvOS.

NetworkKit A lightweight iOS, Mac and Watch OS framework that makes networking and parsing super simple. Uses the open-sourced JSONHelper with functio

Alex Telek 30 Nov 19, 2022
Super lightweight async HTTP server library in pure Swift runs in iOS / MacOS / Linux

Embassy Super lightweight async HTTP server in pure Swift. Please read: Embedded web server for iOS UI testing. See also: Our lightweight web framewor

Envoy 540 Dec 15, 2022
🌏 A zero-dependency networking solution for building modern and secure iOS, watchOS, macOS and tvOS applications.

A zero-dependency networking solution for building modern and secure iOS, watchOS, macOS and tvOS applications. TermiNetwork was tested in a produc

Bill Panagiotopoulos 90 Dec 17, 2022
QwikHttp is a robust, yet lightweight and simple to use HTTP networking library for iOS, tvOS and watchOS

QwikHttp is a robust, yet lightweight and simple to use HTTP networking library. It allows you to customize every aspect of your http requests within a single line of code, using a Builder style syntax to keep your code super clean.

Logan Sease 2 Mar 20, 2022
ZeroMQ Swift Bindings for iOS, macOS, tvOS and watchOS

SwiftyZeroMQ - ZeroMQ Swift Bindings for iOS, macOS, tvOS and watchOS This library provides easy-to-use iOS, macOS, tvOS and watchOS Swift bindings fo

Ahmad M. Zawawi 60 Sep 15, 2022
A delightful networking framework for iOS, macOS, watchOS, and tvOS.

AFNetworking is a delightful networking library for iOS, macOS, watchOS, and tvOS. It's built on top of the Foundation URL Loading System, extending t

AFNetworking 33.3k Jan 5, 2023
A networking library for iOS, macOS, watchOS and tvOS

Thunder Request Thunder Request is a Framework used to simplify making http and https web requests. Installation Setting up your app to use ThunderBas

3 SIDED CUBE 16 Nov 19, 2022
An awesome Swift HTML DSL library using result builders.

SwiftHtml An awesome Swift HTML DSL library using result builders. let doc = Document(.html5) { Html { Head { Meta()

Binary Birds 204 Dec 25, 2022
Socket framework for Swift using the Swift Package Manager. Works on iOS, macOS, and Linux.

BlueSocket Socket framework for Swift using the Swift Package Manager. Works on iOS, macOS, and Linux. Prerequisites Swift Swift Open Source swift-5.1

Kitura 1.3k Dec 26, 2022
A Ruby on Rails inspired Web Framework for Swift that runs on Linux and OS X

IMPORTANT! We don't see any way how to make web development as great as Ruby on Rails or Django with a very static nature of current Swift. We hope th

Saulius Grigaitis 2k Dec 5, 2022
Lightweight library for web server applications in Swift on macOS and Linux powered by coroutines.

Why Zewo? • Support • Community • Contributing Zewo Zewo is a lightweight library for web applications in Swift. What sets Zewo apart? Zewo is not a w

Zewo 1.9k Dec 22, 2022
A tool to build projects on MacOS and a remote linux server with one command

DualBuild DualBuild is a command line tool for building projects on MacOS and a remote Linux server. ##Setup Install the repository git clone https://

Operator Foundation 0 Dec 26, 2021
StatusBarOverlay will automatically show a "No Internet Connection" bar when your app loses connection, and hide it again. It supports apps which hide the status bar and The Notch

StatusBarOverlay StatusBarOverlay will automatically show a "No Internet Connection" bar when your app loses connection, and hide it again. It support

Idle Hands Apps 160 Nov 2, 2022
A light weight network library with automated model parser for rapid development

Gem A light weight network library with automated model parser for rapid development. Managing all http request with automated model parser calls in a

Albin CR 10 Nov 19, 2022
SwiftCANLib is a library used to process Controller Area Network (CAN) frames utilizing the Linux kernel open source library SOCKETCAN.

SwiftCANLib SwiftCANLib is a library used to process Controller Area Network (CAN) frames utilizing the Linux kernel open source library SOCKETCAN. Th

Tim Wise 4 Oct 25, 2021
A simple GCD based HTTP client and server, written in 'pure' Swift

SwiftyHTTP Note: I'm probably not going to update this any further - If you need a Swift networking toolset for the server side, consider: Macro.swift

Always Right Institute 116 Aug 6, 2022
The Oakland Post iOS app, written in pure Swift

Now available on the App Store! Check it out! The Oakland Post App The mobile companion for The Oakland Post's website, written entirely in Swift! Scr

Andrew Clissold 280 Dec 13, 2022