1D, 2D, and 3D variations of Fast Fourier Transforms for a Metal S4TF backend

Related tags

Image MetalFFT
Overview

MetalFFT

MetalFFT is an experiment in adding GPU acceleration for 1D, 2D, and 3D variations of Fast Fourier Transforms.

This framework's original purpose was to become a raw operator for a Swift for TensorFlow Metal backend. Work on it paused due to unfavorable performance. For many transforms, Apple's CPU alternative from Accelerate runs faster. The A13 and later have AMX accelerators capable of ~2 TFLOPS, and the CPU implementation may harness that processing power.

MetalFFT was also created to show Apple's Metal Performance Shaders (MPS) team how to support encoding MPS kernels into indirect command buffers. Due to the lack of a direct connection to machine learning, this will not be merged into a larger repository when constructing an S4TF Metal backend. For more context on the backend, see the Differentiation iOS Demo.

How to use

MetalFFT's API is similar to Metal Performance Shaders. It only performs FFTs on tensors whose dimensions are powers of two. One may run a batch of several 1D or 2D FFTs simultaneously, as long as the tensors are arranged in contiguous blocks of memory. The batch size does not need to be a power of two, although it has an upper bound: a singular transform's dimensions multiplied by the batch's size must not exceed 2^31 (about 2 billion).

Executing a 1D Fast Fourier Transform:

import MetalFFT

// Allocate GPU-accessible buffers
let input = FFTRealBuffer(device: device, capacity: 128 * batchSize)
let output = FFTComplexBuffer(device: device, capacity: 128 * batchSize)

// Fill the time-domain data
let timeDomainData = input.buffer.contents()
let audioSamples: UnsafeMutablePointer<Float> = ...
memcpy(timeDomainData, audioSamples, audioSampleSizeInBytes)

let commandBuffer = commandQueue.makeCommandBuffer()!
#if os(macOS)
if input.buffer.storageMode == .managed { // `.shared` on M1
    // Synchronize `input.buffer` between CPU and GPU
}
#endif

let fft = FastFourierTransform1D(device: device,
                                 transformWidth: 128,
                                 inputType: .real,
                                 batchSize: batchSize)
fft.encode(commandBuffer: commandBuffer, input: input, output: output)

#if os(macOS)
if output.buffer.storageMode == .managed { // `.shared` on M1
    // Synchronize `output.buffer` between CPU and GPU
}
#endif
commandBuffer.commit()

// Extract frequency-domain data
commandBuffer.waitUntilCompleted()
let frequencyDomainData = output.buffer.contents()

All numbers processed by FFT kernels are single-precision floats. You may use either real, interleaved complex, or split complex numbers as input. The output is interleaved complex to optimize memory accesses during the kernel's execution. Use a ComplexNumberConversion to convert the output from complex to real or split complex.

Converting complex numbers into split form:

let interleavedComplexData: FFTComplexBuffer = ...
let splitComplexData: FFTSplitComplexBuffer = ...

let conversion = ComplexNumberConversion(device: device,
                                         inputType: .interleavedComplex,
                                         outputType: .splitComplex)

let commandBuffer = commandQueue.makeCommandBuffer()!
conversion.encode(commandBuffer: commandBuffer,
                  input: interleavedComplexData,
                  output: splitComplexData)
commandBuffer.commit()

// Extract real and imaginary GPU buffers
let realData = splitComplexData.realBuffer
let imaginaryData = splitComplexData.imaginaryBuffer

How it works

FFT kernels operate in-place on their output buffer, using bit reversal pre-processing and decimation in time. The first stage is out-of-place and eliminates one copying pass. It reads from the input using bit reversal, performs 2-wide Discrete Fourier Transforms (butterflies), then writes to the output. The other stages proceed according to the iterative-fft pseudocode on the Cooley-Tukey algorithm wikipedia article. Twiddle factors are cached into a constant buffer so that the shader can avoid recalculating them.

The 2D and 3D variations repeat the steps above, once for each dimension. Between repetitions, the shader pages to a scratch buffer since bit reversal indexing cannot be done in-place. MetalFFT allocates and manages the scratch buffer internally.

The iterative-fft approach was selected because it can be easily parallelized, unlike the recursive approach often used on CPUs. This could be improved with a mixed-radix algorithm, executing 4-wide or 8-wide butterflies inside the shader. That approach reduces the number of memory accesses while retaining the same number of floating-point operations. However, it slightly complicates command encoding on the CPU.

Future work

Hopefully, someone in the open source community or Apple's Metal Performance Shaders (MPS) team will pick up where I left off. MetalFFT still requires several optimizations, including mixed radices and automatically splitting large batches into smaller ones (to improve cache coherence). It also lacks DocC documentation.

I will move on to resurrecting Swift for TensorFlow now. Pull requests to this repository are welcome, and I can test them on a wide range of GPUs to profile performance. If anyone contributes DocC documentation, I will host it on a GitHub Pages website under my account.

Note: Although I can test on GPUs from Intel and AMD, that has not been done yet. I ran the Swift package tests several times on an M1 Max, and ran one test on an A15. There may be bugs on older Apple GPUs and Intel Macs.

This project is open-sourced under the MIT license, with one exception. Before this code or anything derived from it is used in MPS, MPS Graph or ML Compute, the Apple MPS team must contact me about it. They may use email, developer forums, or any other communication channel.

In addition, I ask the MPS team to experiment with encoding some MPS kernels into indirect command buffers, using this project as a reference. There is ample time to add FFTs and ICB support to MPS, then announce the features during WWDC 2022.

Acknowledgments

Special thanks to @CaptainHarlockSSX for contributing CPU performance benchmarks and assisting me throughout this project.

You might also like...
A python library to run metal compute kernels on MacOS 12

metalcompute for Python A python library to run metal compute kernels on MacOS Usage Example execution from M1-based Mac running MacOS 12.0: ./build

GPU-based media processing library using Metal written in Swift
GPU-based media processing library using Metal written in Swift

GPU-based media processing library using Metal written in Swift. Overview MetalAcc is a GPU-Based media processing library that lets you apply GPU-acc

Boids written in the Metal Shader language + Swift

MetalBoid Boids written in the Metal Shader language + Swift This is an example of a Boid simulating running in iOS using Swift+Metal. The parameters

An example app showing how to use AVCaptureSession with Metal in Swift.
An example app showing how to use AVCaptureSession with Metal in Swift.

#iOSSwiftMetalCamera Click here to see video demo. This app is a basic example showing how to use Swift to setup an AVCaptureSession session to access

Extract Metal functions from .metallib files.
Extract Metal functions from .metallib files.

Metal Library Archive MetalLibraryArchive is a product of reverse-engineering Apple's metallib file format. You can use MetalLibraryArchive to get the

new home for the non-Metal framework shims!

Moraea non-Metal Frameworks The core of the non-Metal patches: wrappers for downgraded frameworks, consisting of a mixture of autogenerated stubs and

A Metal application that mimics SAO
A Metal application that mimics SAO "Link Start" scene.

SAO Link Start Effect This is a Metal application that mimics SAO "Link Start" scene. Building The project requires Xcode 13.3 or later version. The a

A Metal re-implementation of GLKit's GLKBaseEffect
A Metal re-implementation of GLKit's GLKBaseEffect

MBEBaseEffect: Antique Fixed-Function Features for Modern Apps When GLKit was introduced with iOS 5 (2011), it included the GLKBaseEffect class to all

🍁🥓 Lightweight and fast Swift library for image downloading, caching and transformations
🍁🥓 Lightweight and fast Swift library for image downloading, caching and transformations

MapleBacon Introduction MapleBacon is a lightweight and fast Swift library for downloading and caching images. Example The folder Example contains a s

Owner
Philip Turner
Spreading news that anyone can use $5 Google Cardboard to replicate $3,500 Microsoft Hololens
Philip Turner
Swift Package Manager plug-in to compile Metal files that can be debugged in Xcode Metal Debugger.

MetalCompilerPlugin Swift Package Manager plug-in to compile Metal files that can be debugged in Xcode Metal Debugger. Description Swift Package Manag

Jonathan Wight 10 Oct 30, 2022
The program calculates the approximate function to the given graph using the Fourier series

Ряды фурье Программа вычисляет приближенную функцию к заданной графиком с помощь

Zed Null 0 Sep 25, 2022
A simple mesh viewer for MacOS based on Swift and Metal and using Assimp for loading meshes

Metal Mesh Viewer A simple triangle mesh viewer for MacOS This application is a simple (triangle) mesh viewer that should be capable of rendering even

J. Andreas Bærentzen 0 Dec 13, 2021
📷 A composable image editor using Core Image and Metal.

Brightroom - Composable image editor - building your own UI Classic Image Editor PhotosCrop Face detection Masking component ?? v2.0.0-alpha now open!

Muukii 2.8k Jan 3, 2023
GPUImage 3 is a BSD-licensed Swift framework for GPU-accelerated video and image processing using Metal.

GPUImage 3 Janie Clayton http://redqueengraphics.com @RedQueenCoder Brad Larson http://www.sunsetlakesoftware.com @bradlarson contact@sunsetlakesoftwa

Brad Larson 2.4k Jan 3, 2023
📷 A composable image editor using Core Image and Metal.

Brightroom - Composable image editor - building your own UI Classic Image Editor PhotosCrop Face detection Masking component ?? v2.0.0-alpha now open!

Muukii 2.8k Jan 2, 2023
A GPU accelerated image and video processing framework built on Metal.

MetalPetal An image processing framework based on Metal. Design Overview Goals Core Components MTIContext MTIImage MTIFilter MTIKernel Optimizations C

null 1.5k Jan 4, 2023
VRTracerSample - Learning project in Metal Ray Tracing and Swift

VRTracer This is a personal project for learning Metal's Ray Tracing API with sw

null 1 Feb 12, 2022
Visualiser written in Swift, SwiftUI and Metal API

ModularMTL About Visualisation of modular multiplication on a circle. Written in Swift using Metal API and SwiftUI. Images Features Keyboard controls

Gracien 12 Dec 20, 2022
Playing with Core Image and Metal Shader Language for fun.

Playing with Core Image and Metal Shader Language for fun.

Makeeyaf 6 Jan 5, 2023