Swift Apprentice: Fundamentals

First Edition · iOS 16 · Swift 5.7 · Xcode 14.2

Section III: Building Your Own Types

10. Regex
Written by Ehab Amer

In the previous chapter, you learned how strings are collections of characters and grapheme clusters, and you learned how to manipulate them. You also learned how to find a character inside a string. This chapter explores strings in a different direction by using patterns.

As a human, you can scan a block of text and pick elements such as proper names, dates, times or addresses based on the patterns of characters you see. You don’t need to know the exact values in advance to find elements.

What Is a Regular Expression?

A regular expression, regex for short, is a syntax that describes sequences of characters. Many modern programming languages, including Swift as of version 5.7, can interpret regular expressions. In its simplest form, a regular expression looks much like a string. In Swift, you tell the compiler you’re using a regular expression by delimiting it with forward slashes (/) instead of double quotes (") when you create it.

let searchString = "john"
let searchExpression = /john/

The code above defines a String and a regular expression, either of which you can use to search some text for “john”. Swift’s contains() method accepts either a String or a regular expression, which allows more flexible matching.

let stringToSearch = "Johnny Appleseed wants to change his name to John."

stringToSearch.contains(searchString) // false
stringToSearch.contains(searchExpression) // false

It might surprise you that both calls to contains() return false when you can see two instances of “John” in the search string. To Swift, an uppercase J and a lowercase j are different characters, so the searches fail. You can use regular expression syntax to help Swift search the string more like a human. Try this expression:

let flexibleExpression = /[Jj]ohn/
stringToSearch.contains(flexibleExpression) // true

The regular expression above defines a search that begins with either an uppercase or a lowercase J followed by the string “ohn”. Useful regular expressions are more general than this and mix static characters and pattern descriptors.
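
As a minimal sketch of how the character class behaves, the snippet below lists every match of the flexible expression and then tries a case-insensitive variant. The names text, found and caseInsensitive are illustrative, and ignoresCase() is a method on Regex that this chapter hasn’t introduced. The literals use the extended #/.../# delimiters, which mean the same as /.../ but also compile where bare-slash literals aren’t enabled.

```swift
// #/.../# is the extended-delimiter spelling of the /.../ literal.
let text = "Johnny Appleseed wants to change his name to John."
let flexible = #/[Jj]ohn/#

// Collect every match as a String; "Johnny" and "John." contribute one each.
let found = text.matches(of: flexible).map { String($0.output) }
print(found) // ["John", "John"]

// Alternative: keep the pattern lowercase and ask the regex to ignore case.
let caseInsensitive = #/john/#.ignoresCase()
print(text.contains(caseInsensitive)) // true
```

The character class spells out the accepted variations, whereas ignoresCase() relaxes every letter in the pattern, so it would also accept “JOHN”.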

A date is often written as three groups of numbers separated by forward slashes. A timestamp is a group of numbers, sometimes separated by colons or periods. Web addresses are a series of letters and numbers that begin with “http” and are separated by “/”, “.” and “?”. Regular expressions can describe all of these patterns, although the expressions are sometimes complicated. You’ll start with something simple and work your way up.

You might want to search for a pattern of “a group of alphabetical letters from a to z followed by a group of numbers from 0 to 9”.

One way to represent this in regular expression syntax is /[a-z]+[0-9]+/.

This expression matches text such as abcd12345 or swiftapprentice2023 but doesn’t match XYZ567, and it matches only part of Pennsylvania65000 because it skips the uppercase P. The character class [a-z] describes only lowercase ASCII letters, not uppercase ones.
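
The following sketch checks the four sample strings against the expression. It assumes you want the whole string to fit the pattern, so it uses wholeMatch(of:) rather than contains(); the names lowercaseThenDigits, samples, fullyMatching and partial are illustrative.

```swift
// #/.../# is the extended-delimiter spelling of the /.../ literal.
let lowercaseThenDigits = #/[a-z]+[0-9]+/#
let samples = ["abcd12345", "swiftapprentice2023", "XYZ567", "Pennsylvania65000"]

// wholeMatch(of:) succeeds only when the entire string fits the pattern.
let fullyMatching = samples.filter { $0.wholeMatch(of: lowercaseThenDigits) != nil }
print(fullyMatching) // ["abcd12345", "swiftapprentice2023"]

// A partial search still finds the lowercase tail of a mixed-case word.
let partial = "Pennsylvania65000".firstMatch(of: lowercaseThenDigits)
print(partial.map { String($0.output) } ?? "no match") // ennsylvania65000
```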

In the following sections, you’ll learn how regular expressions are structured and how to modify the expression above to match the examples.

Note: In addition to contains(), String has other operations such as trimmingPrefix() and replacing(_:with:). As well as taking a String to match, these operations can also take a regular expression.
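
Here is a short sketch of two of those operations taking a regular expression. The input string and the names raw, withoutLabel and masked are made up for illustration.

```swift
// #/.../# is the extended-delimiter spelling of the /.../ literal.
let raw = "order4711: shipped 3 of 12 items"

// trimmingPrefix(_:) removes a leading match and returns the remaining Substring.
let withoutLabel = raw.trimmingPrefix(#/[a-z]+[0-9]+:\s*/#)
print(withoutLabel) // shipped 3 of 12 items

// replacing(_:with:) substitutes every match of the pattern.
let masked = raw.replacing(#/[0-9]+/#, with: "#")
print(masked) // order#: shipped # of # items
```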

Regular Expression Structure

A regular expression mainly consists of two parts:

Character Representations

Remember that regular expressions are a syntax for defining patterns. Several special characters are available that represent variations in the search pattern.

Repetitions

You have already used the repetition descriptor + for one or more. Multiple types of repetition descriptors exist. They follow the character pattern you want to repeat:
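
As a small sketch of three repetition descriptors, +, * and {5,}, the snippet below tests whole strings against each one. The helper fits(_:_:) and the sample inputs are hypothetical and exist only for this illustration.

```swift
// Hypothetical helper: true when the whole input fits the pattern.
func fits(_ input: String, _ pattern: Regex<Substring>) -> Bool {
  input.wholeMatch(of: pattern) != nil
}

// #/.../# is the extended-delimiter spelling of the /.../ literal.
let oneOrMore = #/[a-z]+/#      // + : at least one lowercase letter
let zeroOrMore = #/[a-z]*/#     // * : any number of letters, including none
let fiveOrMore = #/[0-9]{5,}/#  // {5,} : five or more digits

print(fits("swift", oneOrMore))   // true
print(fits("", oneOrMore))        // false
print(fits("", zeroOrMore))       // true
print(fits("65000", fiveOrMore))  // true
print(fits("2023", fiveOrMore))   // false
```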

Mini-Exercise

Now that you’ve learned to construct regular expressions with different capabilities, how would you adapt /[a-z]+[0-9]+/ from earlier to match all of the example texts abcd12345, swiftapprentice2023, XYZ567, Pennsylvania65000?

Compile Time Checking

What separates Swift regex from other languages (and earlier versions of Swift before 5.7) is its ability to check for correctness at compile time. Consider the following:

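// Compile-time error: the character class is missing its closing bracket.
// let lowercaseLetters = /[a-z*/

// Compiles: the character class is closed before the * repetition.
let lowercaseLetters = /[a-z]*/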
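
For contrast, here is a sketch of what happens when the same patterns arrive as strings. It assumes you build the regex with the throwing Regex(_:) initializer, which parses the pattern at run time; the helper parseError(in:) is hypothetical.

```swift
// Hypothetical helper: returns nil when the pattern parses,
// or the error thrown while parsing it.
func parseError(in pattern: String) -> Error? {
  do {
    let _: Regex<AnyRegexOutput> = try Regex(pattern)
    return nil
  } catch {
    return error
  }
}

print(parseError(in: "[a-z]*") == nil) // true: the pattern is well formed
print(parseError(in: "[a-z*") == nil)  // false: the missing ] surfaces only at run time
```

With a regex literal, the compiler reports the second mistake before the program ever runs.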

Regular Expression Matches

Regular expression matches can sometimes be surprising. To explore kinds of matching, start with this example:

let lettersAndNumbers = /[a-z]+[0-9]+/
let testingString1 = "abcdef ABCDEF 12345 abc123 ABC 123 123ABC 123abc abcABC"
for match in testingString1.matches(of: lettersAndNumbers) {
  print(String(match.output)) // abc123
}

// Every part of this expression is optional, so it can match nothing at all.
let possibleLettersAndPossibleNumbers = /[a-z]*[0-9]*/
for match in testingString1.matches(of: possibleLettersAndPossibleNumbers) {
  print(String(match.output)) // 32 times
}

// Even an empty string produces one zero-length match.
let emptyString = ""
let matchCount = emptyString.matches(of:
                    possibleLettersAndPossibleNumbers).count // 1
Avoiding Zero-Length Matches

The regular expression engine starts at a position in the search string and advances as far as it can while still matching the expression. If the expression matches, the match is added to the result set, even when it’s zero-length, and the engine moves on from where that match ended. This repeats until the search string is consumed.
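
To make the zero-length matches visible, this sketch counts all matches in a short sample string and then filters out the empty ones. The sample string and the names possibleLettersAndNumbers, matches and nonEmpty are illustrative.

```swift
// #/.../# is the extended-delimiter spelling of the /.../ literal.
let possibleLettersAndNumbers = #/[a-z]*[0-9]*/#
let sample = "a1 B"

// One real match ("a1") plus empty matches at the space, at "B" and at the end.
let matches = sample.matches(of: possibleLettersAndNumbers)
let nonEmpty = matches.filter { !$0.output.isEmpty }.map { String($0.output) }

print(matches.count) // 4
print(nonEmpty)      // ["a1"]
```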

let fixedPossibleLettersAndPossibleNumbers = /[a-z]+[0-9]*|[a-z]*[0-9]+/
for match in testingString1.matches(of: fixedPossibleLettersAndPossibleNumbers) {
  print(String(match.output))
}

abcdef
12345
abc123
123
123
123
abc
abc

Result Boundaries

One way to keep fragments of longer words, such as the 123 in 123ABC, out of the results is to specify boundaries that should contain each result. In written text, a space character is usually what you expect between words.

let fixedWithBoundaries = /\b[a-z]+[0-9]*\b|\b[a-z]*[0-9]+\b/
for match in testingString1.matches(of: fixedWithBoundaries) {
  print(String(match.output))
}

abcdef
12345
abc123
123
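
The effect of \b is easier to see in isolation. In this sketch, which uses a made-up sample string and the illustrative names unbounded and bounded, the same digit pattern runs with and without word boundaries.

```swift
// #/.../# is the extended-delimiter spelling of the /.../ literal.
let sample = "123 123ABC abc123"
let unbounded = #/[0-9]+/#
let bounded = #/\b[0-9]+\b/#

// Without boundaries, digits glued to letters still count.
print(sample.matches(of: unbounded).map { String($0.output) })
// ["123", "123", "123"]

// With boundaries, only the free-standing number survives.
print(sample.matches(of: bounded).map { String($0.output) })
// ["123"]
```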

Challenge 1

Create a regular expression that matches any word that contains a sequence of two or more uppercase characters. Examples: 123ABC - ABC123 - ABC - abcABC - ABCabc - abcABC123 - a1b2ABCDEc3d4. It should reject abcA12a3 - abc123.

A Better Way to Write Expressions

So far, you’ve been writing regular expressions using the standard syntax. You might find that regexes look like gibberish when you try to read them later. Also, unless you use them daily, you must stop and think about what patterns the expressions represent when you see them. Don’t worry — it’s a common problem. :]

import RegexBuilder

let newlettersAndNumbers = Regex {
  OneOrMore { "a"..."z" }
  OneOrMore { .digit }
}

let newFixedRegex = Regex {
  Anchor.wordBoundary
  ChoiceOf {
    Regex {
      OneOrMore {
        "a"..."z"
      }
      ZeroOrMore {
        .digit
      }
    }
    Regex {
      ZeroOrMore {
        "a"..."z"
      }
      OneOrMore {
        .digit
      }
    }
  }
  Anchor.wordBoundary
}
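
As a quick sanity check, the sketch below runs a literal and a builder version of the letters-then-digits expression over the same test string and compares the results. The names literalForm, builderForm and sample are illustrative. Note that .digit can match digits beyond 0 to 9, so the two forms are only guaranteed to agree on ASCII input like this.

```swift
import RegexBuilder

// #/.../# is the extended-delimiter spelling of the /.../ literal.
let literalForm = #/[a-z]+[0-9]+/#
let builderForm = Regex {
  OneOrMore { "a"..."z" }
  OneOrMore { CharacterClass.digit }
}

let sample = "abcdef ABCDEF 12345 abc123 ABC 123 123ABC 123abc abcABC"
let fromLiteral = sample.matches(of: literalForm).map { String($0.output) }
let fromBuilder = sample.matches(of: builderForm).map { String($0.output) }

print(fromLiteral)                // ["abc123"]
print(fromLiteral == fromBuilder) // true
```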

Challenge 2

Update the expression you created in Challenge 1 to use the new RegexBuilder structure and match expressions that have multiple sequences of uppercase characters. Example: a1b2ABCDEc3d4FGHe5f6g7.

Refactoring to RegexBuilder

As you complete Challenge 2, you might wonder how this new way is better. It requires more typing to arrive at the same result, though it’s easier to read and reason about in the future.

Capturing Results

So far, you’ve used regular expressions to match a pattern in a larger string. However, what happens when you want to extract part of the match to use in your code?

let regexWithCapture = Regex {
  OneOrMore {
    "a"..."z"
  }
  Capture {
    OneOrMore {
      CharacterClass.digit
    }
  }
  OneOrMore {
    "a"..."z"
  }
}

let testingString2 = "Welc0me to chap7er 10 in sw1ft appren71ce. " +
  "Th1s chap7er c0vers regu1ar express1ons and regexbu1lder"

for match in testingString2.matches(of: regexWithCapture) {
  print(match.output)
}

("elc0me", "0")
("chap7er", "7")
("sw1ft", "1")
("appren71ce", "71")
("h1s", "1")
("chap7er", "7")
("c0vers", "0")
("regu1ar", "1")
("express1ons", "1")
("regexbu1lder", "1")

for match in testingString2.matches(of: regexWithCapture) {
  let (wordMatch, extractedDigit) = match.output
  print("Full Match: \(wordMatch) | Captured value: \(extractedDigit)")
}

Full Match: elc0me | Captured value: 0
Full Match: chap7er | Captured value: 7
Full Match: sw1ft | Captured value: 1
Full Match: appren71ce | Captured value: 71
Full Match: h1s | Captured value: 1
Full Match: chap7er | Captured value: 7
Full Match: c0vers | Captured value: 0
Full Match: regu1ar | Captured value: 1
Full Match: express1ons | Captured value: 1
Full Match: regexbu1lder | Captured value: 1

let regexWithStrongType = Regex {
  OneOrMore {
    "a"..."z"
  }
  TryCapture {
    OneOrMore {
      CharacterClass.digit
    }
  } transform: { foundDigits in
    Int(foundDigits)
  }
  OneOrMore {
    "a"..."z"
  }
}

for match in testingString2.matches(of: regexWithStrongType) {
  print(match.output)
}

("elc0me", 0)
("chap7er", 7)
("sw1ft", 1)
("appren71ce", 71)
("h1s", 1)
("chap7er", 7)
("c0vers", 0)
("regu1ar", 1)
("express1ons", 1)
("regexbu1lder", 1)
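
Because the transformed capture is already an Int, you can do arithmetic on it without further conversion. The following is a minimal sketch with a shortened sample string; the names digitsInsideWord, sample and total are illustrative.

```swift
import RegexBuilder

let digitsInsideWord = Regex {
  OneOrMore { "a"..."z" }
  TryCapture {
    OneOrMore { CharacterClass.digit }
  } transform: { foundDigits in
    Int(foundDigits)
  }
  OneOrMore { "a"..."z" }
}

let sample = "sw1ft appren71ce chap7er"

// match.output is (Substring, Int), so .1 needs no conversion before adding.
let total = sample.matches(of: digitsInsideWord).reduce(0) { $0 + $1.output.1 }
print(total) // 79
```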

let repetition = "123abc456def789ghi"
let repeatedCaptures = Regex {
  OneOrMore {
    Capture {
      OneOrMore {
        CharacterClass.digit
      }
    }
    OneOrMore {
      "a"..."z"
    }
  }
}
for match in repetition.matches(of: repeatedCaptures) {
  print(match.output)
}
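
The sketch below inspects what a capture nested inside a repetition holds. It rests on the assumption, which matches how most regex engines behave, that the capture slot is overwritten on every pass through the repetition, so only the last value survives. The names repeatedCapture, input, found, whole and lastDigits are illustrative.

```swift
import RegexBuilder

let repeatedCapture = Regex {
  OneOrMore {
    Capture { OneOrMore { CharacterClass.digit } }
    OneOrMore { "a"..."z" }
  }
}

let input = "123abc456def789ghi"
let found = input.firstMatch(of: repeatedCapture)

// The whole string is one match; the capture keeps only the final digit group.
let whole = found.map { String($0.output.0) }
let lastDigits = found.map { String($0.output.1) }
print(whole ?? "no match")      // 123abc456def789ghi
print(lastDigits ?? "no match") // 789
```

If you need every digit group, matching the inner pattern repeatedly with matches(of:) is one alternative to nesting the capture in a repetition.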

Challenge 3

Change the expression used in the last challenge to capture the text in uppercase. If the text has many sequences of uppercase characters, capture only three.

Key Points

  • Regular expressions give you incredible flexibility for matching patterns over simple substring matching.
  • Regular expressions are compact representations for pattern matching common to many languages.
  • Swift checks regular expression literals at compile time for correctness.
  • You can use standard pattern descriptors such as \d for digits or write character classes out yourself, as in [0-9], to match specific characters.
  • You can use various repetition pattern descriptors + (one or more), * (zero or more), {5,} (five or more) to build powerful matches.
  • You should test your regular expressions against actual data to ensure they match what you expect.
  • Boundary anchors like ^ (beginning of a line), $ (end of a line) and \b (word) can narrow down the results to words or lines and avoid zero-length matches.
  • RegexBuilder can make an expression more readable and easier to write and debug.
  • You can capture one or more parts of a matched expression.
  • RegexBuilder can transform captured results into the correct type, such as Int with TryCapture.
© 2024 Kodeco Inc.
