🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Mathew Sanders

Last update: Nov 11, 2022

Mustard 🌭

Quick start using character sets

Foundation includes the String method components(separatedBy:) that allows us to get substrings divided up by certain characters:

import Foundation

let sentence = "hello 2017 year"
let words = sentence.components(separatedBy: .whitespaces)
// words.count -> 3
// words = ["hello", "2017", "year"]

Mustard provides a similar feature but takes the opposite approach: instead of matching separators, you match one or more character sets. This is useful when the string has no separators at all:

import Mustard

let sentence = "hello2017year"
let words = sentence.components(matchedWith: .letters, .decimalDigits)
// words.count -> 3
// words = ["hello", "2017", "year"]
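To make the idea of matching by character sets concrete, here is a rough sketch that uses only Foundation. The function `runs(in:matching:)` is a hypothetical helper written for this illustration. It is not part of Mustard and is not how Mustard is implemented.

```swift
import Foundation

// Hypothetical helper (not part of Mustard): collects maximal runs of
// consecutive Unicode scalars that all belong to the same character set.
func runs(in text: String, matching sets: [CharacterSet]) -> [String] {
    var results: [String] = []
    var current = ""
    var currentIndex: Int? = nil  // position in `sets` of the run being built

    for scalar in text.unicodeScalars {
        // First set that contains this scalar, or nil if none does.
        let index = sets.firstIndex(where: { $0.contains(scalar) })
        // A change of set (or an unmatched scalar) ends the current run.
        if index != currentIndex && !current.isEmpty {
            results.append(current)
            current = ""
        }
        if index != nil {
            current.unicodeScalars.append(scalar)
        }
        currentIndex = index
    }
    if !current.isEmpty {
        results.append(current)
    }
    return results
}

let sample = runs(in: "hello2017year", matching: [.letters, .decimalDigits])
print(sample)
```

Scalars that belong to none of the sets simply end the current run, so no separators are needed in the input.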

If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method, which returns an array of TokenType.

At a minimum, TokenType requires two properties: text (the substring matched) and range (the range of the substring in the original string). When using CharacterSets as a tokenizer, the more specific type CharacterSetToken is returned. It adds the property set, which holds the CharacterSet instance that was used to create the match.

import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens: [CharacterSet.Token]
// tokens.count -> 5 (the space and the characters '&', '^', and '.' are ignored)
//
// second token..
// tokens[1].text -> "Hello"
// tokens[1].range -> Range<String.Index>(3..<8)
// tokens[1].set -> CharacterSet.letters
//
// last token..
// tokens[4].text -> "67"
// tokens[4].range -> Range<String.Index>(19..<21)
// tokens[4].set -> CharacterSet.decimalDigits
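The relationship between text, range, and set described above can be pictured with a small sketch. `SketchToken` and `SketchCharacterSetToken` are hypothetical stand-ins invented for this example; Mustard's real declarations may differ.

```swift
import Foundation

// Hypothetical stand-in for the minimum a token provides: the matched text
// and where it sits in the original string.
protocol SketchToken {
    var text: String { get }
    var range: Range<String.Index> { get }
}

// Hypothetical, more specific token that also remembers which set matched.
struct SketchCharacterSetToken: SketchToken {
    let text: String
    let range: Range<String.Index>
    let set: CharacterSet
}

let source = "123Hello"
let start = source.index(source.startIndex, offsetBy: 3)
let helloRange = start..<source.endIndex
let token = SketchCharacterSetToken(
    text: String(source[helloRange]),
    range: helloRange,
    set: .letters
)

// The range indexes back into the original string.
print(token.text, String(source[token.range]) == token.text)
```

Keeping the range alongside the text lets a caller map a token back to its position in the source string, for example to highlight or replace it.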

Advanced matching with custom tokenizers

Mustard can do more than match from character sets. You can create your own tokenizers with more sophisticated matching behavior by implementing the TokenizerType and TokenType protocols.

Here's an example of using DateTokenizer (see example for implementation), which finds substrings that match an MM/dd/yy format.

DateTokenizer returns tokens with the type DateToken. Along with the substring text and range, DateToken includes a Date object corresponding to the date in the substring:

import Mustard

let text = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"

let tokens = text.tokens(matchedWith: DateTokenizer())
// tokens: [DateTokenizer.Token]
// tokens.count -> 2
// ('99/99/99' is *not* matched by `DateTokenizer` because it's not a valid date)
//
// first date
// tokens[0].text -> "12/01/17"
// tokens[0].date -> Date(2017-12-01 05:00:00 +0000)
//
// last date
// tokens[1].text -> "12/03/17"
// tokens[1].date -> Date(2017-12-03 05:00:00 +0000)
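For comparison, roughly the same filtering can be sketched with Foundation alone. `dateMatches(in:)` is a hypothetical helper written for this illustration, not Mustard's DateTokenizer. It assumes a fixed en_US_POSIX locale and the UTC time zone, so the parsed dates will not show the 05:00 offset from the example above.

```swift
import Foundation

// Hypothetical helper (not Mustard's DateTokenizer): finds substrings shaped
// like NN/NN/NN and keeps only those that parse as a real MM/dd/yy date.
func dateMatches(in text: String) -> [(text: String, date: Date)] {
    let formatter = DateFormatter()
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = TimeZone(identifier: "UTC")
    formatter.dateFormat = "MM/dd/yy"
    formatter.isLenient = false

    // The pattern is a fixed literal, so a failure here is a programming error.
    let regex = try! NSRegularExpression(pattern: "\\d{2}/\\d{2}/\\d{2}", options: [])
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)

    var results: [(text: String, date: Date)] = []
    for match in regex.matches(in: text, options: [], range: whole) {
        guard let range = Range(match.range, in: text) else { continue }
        let candidate = String(text[range])
        // Candidates such as "99/99/99" have the right shape but are not dates.
        if let date = formatter.date(from: candidate) {
            results.append((text: candidate, date: date))
        }
    }
    return results
}

let log = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"
let found = dateMatches(in: log)
print(found.map { $0.text })
```

The regular expression only checks the shape of the candidate; the validity check is delegated to DateFormatter, which should return nil for out-of-range months and days when it is not lenient.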

Documentation & Examples

Roadmap

Include detailed examples and documentation
Ability to skip/ignore characters within match
Include more advanced pattern matching for matching tokens
Make project logo 🌭
Performance testing / benchmarking against Scanner
Include interface for working with Character tokenizers

Requirements

Swift 4.1

Author

Made with ❤️ by @permakittens

Contributing

Feedback and contributions for bug fixes or improvements are welcome. Feel free to submit a pull request or open an issue.

License

MIT

SwiftVerbalExpressions is a Swift library that helps to construct difficult regular expressions

SwiftVerbalExpressions Swift Regular Expressions made easy SwiftVerbalExpressions is a Swift library that helps to construct difficult regular express

582 Jun 29, 2022

Swift markdown library

Markdown. Swift version: 2.1 | 2.2. License: LGPL v3.

79 Oct 9, 2022

Lightweight library to set an Image as text background. Written in swift.

Simple and lightweight UIView that animates text with an image. Compatible with Swift 2.

552 Sep 9, 2022

A Cross-Platform String and Regular Expression Library written in Swift.

Guitar 🎸 A Cross-Platform String and Regular Expression Library written in Swift. About This library seeks to add common string manipulation function

659 Dec 27, 2022

Swift emoji string parsing library

Croc is a library for parsing emojis on iOS. It provides a simple and lightweight interface for detecting, generating, categorizing and managing emoji

125 Sep 27, 2021

SZMentionsSwift is a lightweight mentions library for iOS.

SZMentionsSwift is a lightweight mentions library for iOS. This library was built to assist with the adding, removing and editing of a mention within a textview.

122 Dec 12, 2022

A simple library that provides standard Unicode emoji support across all platforms

Twitter Emoji (Twemoji) A simple library that provides standard Unicode emoji support across all platforms. Twemoji v13.1 adheres to the Unicode 13.0

15k Jan 8, 2023

iOS port from libphonenumber (Google's phone number handling library)

libPhoneNumber for iOS NBPhoneNumberUtil NBAsYouTypeFormatter ARC only Update Log https://github.com/iziz/libPhoneNumber-iOS/wiki/Update-Log Issue You

2.3k Jan 3, 2023

User input masking library repo.

Migration Guide: v.6 This update brings breaking changes. Namely, the autocomplete flag is now a part of the CaretGravity enum, thus the Mask::apply c

548 Dec 20, 2022

Comments

Mustard VS RegEx

I'm trying to understand Mustard, so I tried to reproduce your example: https://github.com/mathewsanders/Mustard/blob/master/Documentation/Template%20tokenizer.md

In Swift RegEx

let str = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"
// Matches mm-dd-yy or mm-dd-yyyy with "-" or "/" separators, e.g. "12-30-1968" and "12/30/1968"
let usDatePattern: String = "(\\d\\d)[-\\/](\\d\\d)[-\\/](\\d\\d(?:\\d\\d)?)"
// Note: matches(_:) and value(_:_:) are not Foundation API; presumably they are helper extensions from the commenter's own code
let matches = str.matches(usDatePattern) // RegExpMatch.datePattern
matches.forEach {
    let month: String = $0.value(str, 1)
    let day: String = $0.value(str, 2)
    let year: String = $0.value(str, 3)
    let dateStr: String = year + "/" + month + "/" + day
    let dateFormatter = DateFormatter()
    dateFormatter.dateFormat = "yy/MM/dd" // "MM" is month; lowercase "mm" would mean minutes
    let date: Date? = dateFormatter.date(from: dateStr)
    if date != nil { Swift.print(dateStr) } // Output: 17/12/01, 17/12/03
}

So Mustard is more human-readable than RegEx, I guess. Q1: Do you see a use case for Mustard in chatbots, AI bot writers, etc.?

Q2: It seems like you know a bit about string scanners, having built Mustard. I've written a CSS parser in RegEx (here), but doing it in RegEx is probably not the ultimate solution. What should one use: Scanner, Mustard, or something else?

opened by eonist 3

Switch from tuples to TokenType's

Updated tokens methods to return types of TokenType instead of returning information in a tuple.

This brings some extra complexity into Mustard because TokenizerType now has an associatedtype, and for different tokenizers to work together they need type erasure so that they can be passed together into the tokens method.
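As a rough illustration of why an associatedtype forces type erasure, consider the sketch below. Every name in it (`SketchTokenizer`, `DigitsTokenizer`, `UppercaseTokenizer`, `AnySketchTokenizer`) is hypothetical and invented for this example; it is not Mustard's source.

```swift
// Hypothetical protocol with an associated type, standing in for TokenizerType.
protocol SketchTokenizer {
    associatedtype Token
    func token(for text: String) -> Token?
}

struct DigitsTokenizer: SketchTokenizer {
    func token(for text: String) -> Int? {
        return Int(text)
    }
}

struct UppercaseTokenizer: SketchTokenizer {
    func token(for text: String) -> String? {
        // Accept text that has cased letters and is entirely uppercase.
        let isUpper = (text == text.uppercased()) && (text != text.lowercased())
        return isUpper ? text : nil
    }
}

// Type-erased wrapper: hides the concrete Token behind `Any` so tokenizers
// with different associated types can share one array.
struct AnySketchTokenizer {
    private let makeToken: (String) -> Any?

    init<T: SketchTokenizer>(_ base: T) {
        makeToken = { text in base.token(for: text).map { $0 as Any } }
    }

    func token(for text: String) -> Any? {
        return makeToken(text)
    }
}

let tokenizers = [
    AnySketchTokenizer(DigitsTokenizer()),
    AnySketchTokenizer(UppercaseTokenizer())
]
for input in ["42", "SWIFT", "swift"] {
    let hits = tokenizers.compactMap { $0.token(for: input) }
    print(input, hits)
}
```

The cost of the wrapper is that callers get `Any` back and must cast to recover the concrete token type.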

It also adds some complexity to creating custom tokenizers (where now there is an option to create an associated custom TokenType).

Performance has been negatively impacted, but Mustard was never the fastest option, and now with custom types, Mustard becomes more expressive to use.

(Note: TokenizerType.swift had to be combined into Mustard.swift to prevent segmentation fault 11 errors)

opened by mathewsanders 0

Add default tokenizer type protocol

The DefaultTokenizerType separates out the requirement for a default initializer from the core TokenizerType protocol.

This means that tokenizers like the LiteralTokenizer don't require a default initializer when one doesn't really make sense.

Also included is an alternate tokens method that can be used when all the tokenizers are the same type: tokens<T: TokenizerType>(matchedWith tokenizers: T...). The tuple of matches it returns is strongly typed to the single tokenizer that was used, which provides more safety when using this method.
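The two ideas in this change can be sketched as follows. All names here (`MiniTokenizer`, `MiniDefaultTokenizer`, `MiniLiteralTokenizer`, `MiniNumberTokenizer`, `miniTokens(matchedWith:)`) are hypothetical and chosen for illustration. This sketch returns a plain array of tokens rather than tuples, so it shows the shape of the design rather than Mustard's actual declarations.

```swift
// Hypothetical core protocol: no initializer requirement.
protocol MiniTokenizer {
    associatedtype Token
    func tokens(in text: String) -> [Token]
}

// Hypothetical separate protocol for tokenizers that can be built with no arguments.
protocol MiniDefaultTokenizer: MiniTokenizer {
    init()
}

// Needs a target string, so a no-argument initializer would be meaningless.
struct MiniLiteralTokenizer: MiniTokenizer {
    let target: String
    func tokens(in text: String) -> [String] {
        return text.split(separator: " ").map { String($0) }.filter { $0 == target }
    }
}

struct MiniNumberTokenizer: MiniDefaultTokenizer {
    init() {}
    func tokens(in text: String) -> [Int] {
        return text.split(separator: " ").compactMap { Int($0) }
    }
}

extension String {
    // Every tokenizer has the same type T, so the result is [T.Token], not [Any].
    func miniTokens<T: MiniTokenizer>(matchedWith tokenizers: T...) -> [T.Token] {
        return tokenizers.flatMap { $0.tokens(in: self) }
    }
}

let numbers = "call 555 or 911".miniTokens(matchedWith: MiniNumberTokenizer())
let sum = numbers.reduce(0, +)  // compiles because `numbers` is [Int]
let literals = "a b a".miniTokens(
    matchedWith: MiniLiteralTokenizer(target: "a"), MiniLiteralTokenizer(target: "b")
)
print(numbers, sum, literals)
```

Because the generic parameter fixes a single tokenizer type, the compiler knows the element type of the result and no casting is needed.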

opened by mathewsanders 0