🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Overview

Mustard 🌭

GitHub license Carthage compatible Swift Package Manager compatible

Quick start using character sets

Foundation includes the String method components(separatedBy:), which returns the substrings that lie between the given separator characters:

let sentence = "hello 2017 year"
let words = sentence.components(separatedBy: .whitespaces)
// words.count -> 3
// words = ["hello", "2017", "year"]

Mustard provides a similar feature but takes the opposite approach: instead of matching separators, you match one or more character sets. This is useful when separators simply don't exist:

import Mustard

let sentence = "hello2017year"
let words = sentence.components(matchedWith: .letters, .decimalDigits)
// words.count -> 3
// words = ["hello", "2017", "year"]
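To make the idea of matching by character sets concrete, here is a minimal sketch that uses only Foundation. It is not Mustard's implementation; the helper name runs(in:matching:) is hypothetical and exists only for this illustration. It walks the string's unicode scalars and closes the current run whenever the matching set changes or no set matches.

```swift
import Foundation

// Hypothetical helper (not part of Mustard): collects maximal runs of
// unicode scalars that all belong to the same one of the given sets.
func runs(in string: String, matching sets: [CharacterSet]) -> [String] {
    var results: [String] = []
    var current = ""
    var currentSetIndex: Int? = nil

    // Closes the run in progress, if there is one.
    func flush() {
        if !current.isEmpty {
            results.append(current)
            current = ""
        }
    }

    for scalar in string.unicodeScalars {
        // Index of the first set containing this scalar, or nil if none does.
        let setIndex = sets.firstIndex { $0.contains(scalar) }
        if setIndex != currentSetIndex {
            flush()
            currentSetIndex = setIndex
        }
        if setIndex != nil {
            current.unicodeScalars.append(scalar)
        }
    }
    flush()
    return results
}

let sketchWords = runs(in: "hello2017year", matching: [.letters, .decimalDigits])
print(sketchWords) // ["hello", "2017", "year"]
```

Scalars that belong to none of the sets simply end the current run, which is why no separator is needed between "hello" and "2017".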

If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method, which returns an array of TokenType.

As a minimum, TokenType requires properties for text (the substring matched) and range (the range of the substring in the original string). When CharacterSets are used as tokenizers, the more specific type CharacterSetToken is returned. It adds the property set, which holds the CharacterSet instance that was used to create the match.

import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens: [CharacterSet.Token]
// tokens.count -> 5 (the space and the characters '&', '^', and '.' are ignored)
//
// second token..
// tokens[1].text -> "Hello"
// tokens[1].range -> Range<String.Index>(3..<8)
// tokens[1].set -> CharacterSet.letters
//
// last token..
// tokens[4].text -> "67"
// tokens[4].range -> Range<String.Index>(19..<21)
// tokens[4].set -> CharacterSet.decimalDigits
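As a rough illustration of what a token carrying text, range, and set might look like, the following sketch uses only Foundation. SketchToken and sketchTokens(in:matching:) are hypothetical names invented for this example; Mustard's own types and matching rules may differ.

```swift
import Foundation

// Hypothetical token type for illustration; Mustard's own token types may differ.
struct SketchToken {
    let text: String
    let range: Range<String.Index>
    let set: CharacterSet
}

// Hypothetical helper: scans the string once and emits a token for each run of
// scalars accepted by one of the given sets.
func sketchTokens(in string: String, matching sets: [CharacterSet]) -> [SketchToken] {
    var tokens: [SketchToken] = []
    let scalars = string.unicodeScalars
    var index = scalars.startIndex

    while index < scalars.endIndex {
        // Find the first set that accepts the scalar at the current position.
        guard let set = sets.first(where: { $0.contains(scalars[index]) }) else {
            index = scalars.index(after: index) // no set matches: skip this scalar
            continue
        }
        // Extend the run for as long as the same set keeps matching.
        let start = index
        while index < scalars.endIndex, set.contains(scalars[index]) {
            index = scalars.index(after: index)
        }
        let range = start..<index
        tokens.append(SketchToken(text: String(string[range]), range: range, set: set))
    }
    return tokens
}

let sketchInput = "123Hello world&^45.67"
let sketchResult = sketchTokens(in: sketchInput, matching: [.decimalDigits, .letters])
for token in sketchResult {
    // Convert String.Index bounds to integer offsets for display.
    let lower = sketchInput.distance(from: sketchInput.startIndex, to: token.range.lowerBound)
    let upper = sketchInput.distance(from: sketchInput.startIndex, to: token.range.upperBound)
    print(token.text, "\(lower)..<\(upper)")
}
```

Keeping the range alongside the text lets a caller map each token back to its position in the original string, for example to highlight it.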

Advanced matching with custom tokenizers

Mustard can do more than match from character sets. You can create your own tokenizers with more sophisticated matching behavior by implementing the TokenizerType and TokenType protocols.

Here's an example using DateTokenizer (see example for implementation), which finds substrings that match an MM/dd/yy format.

DateTokenizer returns tokens with the type DateToken. Along with the substring text and range, DateToken includes a Date object corresponding to the date in the substring:

import Mustard

let text = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"

let tokens = text.tokens(matchedWith: DateTokenizer())
// tokens: [DateTokenizer.Token]
// tokens.count -> 2
// ('99/99/99' is *not* matched by `DateTokenizer` because it's not a valid date)
//
// first date
// tokens[0].text -> "12/01/17"
// tokens[0].date -> Date(2017-12-01 05:00:00 +0000)
//
// last date
// tokens[1].text -> "12/03/17"
// tokens[1].date -> Date(2017-12-03 05:00:00 +0000)
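The reason '99/99/99' is rejected can be shown without Mustard. The sketch below is not the DateTokenizer implementation; it is a Foundation-only approximation under the assumption that a candidate counts as a date only if DateFormatter can parse it with the MM/dd/yy format. The helper name sketchDates(in:) and the fixed UTC time zone are choices made for this illustration.

```swift
import Foundation

// Hypothetical helper (not Mustard's DateTokenizer): finds substrings shaped like
// two digits/two digits/two digits, then keeps only those that parse as a date.
func sketchDates(in text: String) -> [(text: String, date: Date)] {
    let formatter = DateFormatter()
    formatter.dateFormat = "MM/dd/yy"
    formatter.locale = Locale(identifier: "en_US_POSIX") // fixed locale for predictable parsing
    formatter.timeZone = TimeZone(secondsFromGMT: 0)     // assumption: interpret dates as UTC

    let pattern = try! NSRegularExpression(pattern: "\\d{2}/\\d{2}/\\d{2}")
    let fullRange = NSRange(text.startIndex..., in: text)

    var results: [(text: String, date: Date)] = []
    for match in pattern.matches(in: text, range: fullRange) {
        guard let range = Range(match.range, in: text) else { continue }
        let candidate = String(text[range])
        // Month 99 and day 99 do not exist, so parsing "99/99/99" returns nil.
        if let date = formatter.date(from: candidate) {
            results.append((text: candidate, date: date))
        }
    }
    return results
}

let sketchText = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"
let sketchFound = sketchDates(in: sketchText)
print(sketchFound.map { $0.text }) // ["12/01/17", "12/03/17"]
```

The pattern alone would accept all three candidates; the parse step is what separates text that merely looks like a date from a valid one.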

Documentation & Examples

Roadmap

  • Include detailed examples and documentation
  • Ability to skip/ignore characters within match
  • Include more advanced pattern matching for matching tokens
  • Make project logo 🌭
  • Performance testing / benchmarking against Scanner
  • Include interface for working with Character tokenizers

Requirements

  • Swift 4.1

Author

Made with ❤️ by @permakittens

Contributing

Feedback and contributions for bug fixes or improvements are welcome. Feel free to submit a pull request or open an issue.

License

MIT

Comments
  • Mustard VS RegEx

    I'm trying to understand Mustard, so I tried to do your example: https://github.com/mathewsanders/Mustard/blob/master/Documentation/Template%20tokenizer.md

    In Swift RegEx

    let str = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"
    let usDatePattern:String = "(\\d\\d)[-\\/](\\d\\d)[-\\/](\\d\\d(?:\\d\\d)?)"//Must be in the format: 12-30-1968 (mm-dd-yyyy) works with: "12-30-1968" and "12/30/1968" syntax
    let matches = str.matches(usDatePattern)//RegExpMatch.datePattern
    matches.forEach {
        let month:String = $0.value(str, 1)
        let day:String = $0.value(str, 2)
        let year:String = $0.value(str, 3)
        let dateStr:String = year + "/" + month + "/" + day
        let dateFormatter = DateFormatter()
        dateFormatter.dateFormat = "yy/MM/dd"//"MM" is month; "mm" would be minutes
        let date:Date? = dateFormatter.date(from: dateStr)
        if date != nil {Swift.print(dateStr)}//Output: 17/12/01, 17/12/03
    }
    

    So Mustard is more human-readable than RegEx, I guess. Q1: Do you see a use case for Mustard in chatbots, AI bot writers, etc.?

    Q2: Seems like you know a bit about string scanners, having built Mustard. I've written a CSS parser in RegEx: here. Doing it in RegEx is probably not the ultimate solution. What should one use: Scanner, Mustard, or something else?

    opened by eonist 3
  • Switch from tuples to TokenType's

    Updated tokens methods to return types of TokenType instead of returning information in a tuple.

    This brings some extra complexity into Mustard because TokenizerType now has an associatedtype, and for different tokenizers to be passed together into the tokens method they need to use type erasure.

    It also adds some complexity to creating custom tokenizers (where now there is an option to create an associated custom TokenType).

    Performance has been negatively impacted, but Mustard was never the fastest option, and now with custom types, Mustard becomes more expressive to use.

    (Note: TokenizerType.swift had to be combined into Mustard.swift to prevent segmentation fault 11 errors)

    opened by mathewsanders 0
  • Add default tokenizer type protocol

    The DefaultTokenizerType separates out the requirement for a default initializer from the core TokenizerType protocol.

    This means that tokenizers like the LiteralTokenizer don't require an initializer when it doesn't really make sense.

    Also included an alternate tokens method that can be used when all the tokenizers are the same type: tokens<T: TokenizerType>(matchedWith tokenizers: T...). The tuple of matches that is returned is strongly typed to the single tokenizer that was used, which provides more safety when using this method.

    opened by mathewsanders 0
Releases(0.5.0)
Owner
Mathew Sanders