Mobile (iOS) text-to-image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)

Overview

Mobile Text-to-Image Search (MoTIS)

MoTIS is a minimal demo of semantic multimodal text-to-image search using pretrained vision-language models. Semantic search represents each sample (text or image) as a vector in a shared semantic embedding space; relevance is then measured as similarity (e.g., cosine similarity or distance) between vectors.
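As a rough illustration of this idea (toy NumPy vectors stand in for real CLIP text/image embeddings, which this repo computes with the encoders below), ranking by cosine similarity looks like:

```python
# Toy sketch of embedding-space retrieval (NumPy vectors stand in for real
# CLIP text/image embeddings): images are ranked by cosine similarity to
# the query's text vector.
import numpy as np

def cosine_sim(query, matrix):
    # Cosine similarity between one query vector and each row of a matrix.
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

text_vec = np.array([1.0, 0.0, 0.0, 0.0])          # "query" embedding
image_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],                          # closest to the query
    [0.0, 1.0, 0.0, 0.0],
    [0.1, 0.0, 1.0, 0.0],
])
scores = cosine_sim(text_vec, image_vecs)
ranking = np.argsort(-scores)                      # best match first
print(ranking[0])                                  # -> 0
```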

Distilled Image Encoder Checkpoints

| Model | Google Drive | Hit@10 on MSCOCO 2017 |
| --- | --- | --- |
| Original CLIP (336MB) | https://drive.google.com/file/d/1K2wIyTuSWLTKBXzUlyTEsa4xXLNDuI7P/view?usp=sharing | 58.6 |
| DeiT-small-distilled-patch16-224 (84MB) | https://drive.google.com/file/d/1Fg3ckUUqBs5n4jvNWZUcwwk7db0QBRri/view?usp=sharing | 62.1 |
| ViT-small-patch16-224 (85MB) | https://drive.google.com/file/d/1s_oX0-HIELpjjrBXsjlofIbTGZ_Wllo0/view?usp=sharing | 63.8 |
| ViT-small-patch16-224 (trained with larger batch size) | https://drive.google.com/file/d/1h_w9msJMB4F-dR6uNwp-BHeguS5QIrnE/view?usp=sharing | 64.7 |

Note that these checkpoints are not state_dict() checkpoints; they are the result of a torch.jit.script export. The same original CLIP text encoder is used with all of the image encoders.
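Because the checkpoints are TorchScript archives, they are loaded with torch.jit.load rather than load_state_dict(). A minimal sketch, with a toy module standing in for the real image encoder:

```python
# Hedged sketch: the checkpoints above are TorchScript archives, so they are
# loaded with torch.jit.load rather than load_state_dict(). A toy module
# stands in for the real image encoder here.
import torch

class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return self.proj(x)

model = TinyEncoder().eval()
scripted = torch.jit.script(model)        # the same operation used for the ckpts
scripted.save("encoder_scripted.pt")

# Loading needs no Python class definition -- the graph is in the archive.
loaded = torch.jit.load("encoder_scripted.pt")
x = torch.randn(1, 4)
assert torch.allclose(model(x), loaded(x))
```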

Recent Updates:

  • We use pretrained ViT-Small (85MB) as the initialization for the student model. Using the same distillation pipeline, it achieves even better results (2 points higher Hit@1) than the previous DeiT-small-distilled model. The link to the JIT script checkpoint is here.
  • A more effective distilled image encoder (84MB, compared to the original 350MB ViT-B/32 in CLIP) is available here. This image encoder is initialized with DeiT-base-distilled's pretrained weights, which leads to more robust image representations and hence better retrieval performance (higher Hit@1/5/10 than the original CLIP on the MSCOCO validation set). It is further trained through supervised learning and knowledge distillation.
  • Transplanted Spotify's Annoy approximate nearest neighbor search into this project (annoylib.h).
  • Relatively low-quality images are displayed by default; retrieved images are displayed in high quality. This is designed to reduce runtime memory.
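The random-projection idea behind Annoy can be sketched as follows; this is a toy illustration of the hashing step only, not Annoy's actual tree-building code:

```python
# Toy sketch of the random-projection hashing behind Annoy-style indexing:
# each random hyperplane contributes one sign bit; vectors with the same bit
# pattern fall into the same bucket, so only that bucket needs a linear scan.
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 8, 4
planes = rng.standard_normal((n_planes, dim))   # random hyperplane normals

def bucket(v):
    # The sign of the dot product with each hyperplane gives one bit.
    return tuple(((planes @ v) > 0).tolist())

v = rng.standard_normal(dim)
assert bucket(v) == bucket(v)        # hashing is deterministic
assert bucket(v) != bucket(-v)       # the opposite vector flips every bit
```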

Features

  1. Text-to-image retrieval using semantic similarity search.
  2. Support for different vector indexing strategies (linear scan, KMeans, and random projection).
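A toy sketch of the KMeans indexing strategy; the centroids and cluster count here are made up for illustration (the app persists its own index to file):

```python
# Toy sketch of KMeans-based coarse indexing: every vector is assigned to its
# nearest centroid; a query first picks the closest centroid, then linearly
# scans only that cluster's members. The centroids here are just the first
# four vectors, standing in for a real KMeans run.
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.standard_normal((100, 8))
centroids = vecs[:4].copy()

# Cluster assignment: index of the nearest centroid for each vector.
assign = np.argmin(((vecs[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def search(query):
    c = np.argmin(((centroids - query) ** 2).sum(-1))        # nearest cluster
    members = np.where(assign == c)[0]                       # vectors in it
    return members[np.argmin(((vecs[members] - query) ** 2).sum(-1))]

print(search(vecs[10]))   # an exact duplicate retrieves itself -> 10
```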

Screenshot

  • All images in the gallery (screenshot: all)
  • Search with the query "Three cats" (screenshot: search)

Installation

  1. Download the two TorchScript model files (text encoder and image encoder) into the models folder and add them to the Xcode project.
  2. Required dependencies are defined in the Podfile. We use CocoaPods to manage them. Simply run pod install, then open the generated .xcworkspace project file in Xcode.

     pod install

  3. By default, this demo loads all images from the local photo gallery on your real phone or simulator. You can restrict it to a specific album by setting the albumName variable in the getPhotos method and replacing assetResults in line 117 of GalleryInteractor.swift with photoAssets.

Usage

Just type any keyword to search for the relevant images. Type "reset" to return to the default view.

Todos

  • Basic features
    • Access to a specified album or all photos
    • Asynchronous model loading and vector computation
    • Export pretrained CLIP into TorchScript format using torch.jit.script and optimize_for_mobile provided by PyTorch
    • Transplant the original PIL-based image preprocessing into an OpenCV-based procedure (about 1% retrieval performance degradation observed)
    • Transplant the CLIP tokenizer from Python into Swift (Tokenizer.swift)
  • Indexing strategies
    • Linear indexing (persisted to file via the built-in Data type)
    • KMeans indexing (persisted to file via NSMutableDictionary; the number of clusters is hard-coded, but you can change it to whatever you want)
    • Spotify's Annoy library with random projection indexing (the index file is 41MB for 2,200 images)
  • Choices of semantic representation models
    • OpenAI's CLIP model
    • Integration of other multimodal retrieval models
  • Efficiency
    • Reducing memory consumption of the models: runtime memory reduced from 1GB to 490MB via a smaller yet effective distilled ViT model
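The distillation behind the smaller ViT can be illustrated with an embedding-level objective. This is a hedged sketch only; the repo's actual training recipe also includes supervised learning and may combine more terms:

```python
# Hedged sketch of embedding-level knowledge distillation: the student image
# encoder is trained so that its embeddings match the CLIP teacher's, e.g.
# with a cosine loss. (The repo's actual recipe may combine more terms.)
import numpy as np

def cosine_distill_loss(student, teacher):
    # Mean (1 - cosine similarity) over a batch of embedding pairs.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_distill_loss(teacher, teacher))    # perfect match -> 0.0
print(cosine_distill_loss(-teacher, teacher))   # opposite embeddings -> 2.0
```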

About Us

This project is actively maintained by the ADAPT lab at Shanghai Jiao Tong University. We expect it to continually integrate more advanced features and a better cross-modal search experience. If you run into any problems, feel free to file an issue.

Comments
  • Cannot match up model encodings

    Hi! Thanks for publishing this work, it's a great reference.

    I'm trying to integrate a couple of different systems, and I need the model encodings to match. So far, I haven't been able to make that work:

    Given this Python:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    
    image = preprocess(Image.open("image_1.png")).unsqueeze(0).float().to(device)
    text = clip.tokenize(["a face", "a dog", "a cat"]).to(device)
    
    with torch.no_grad():
        image_features = model.encode_image(image)
        print(image_features.tolist()[0])
    

    I'm trying to get the same array of floats out using Clip.mm's - (NSArray<NSNumber*>*)test_uiimagetomat:(UIImage*)image function. Try as I might, they always differ, and I'm not sure what the difference is. I can see that the cvt methods do the same as the Python image preprocessing, followed by normalisation with the values from CLIP.

    Here's some of the initial values from the python code above:

    [0.3502497971057892, 0.0028706961311399937, -0.46749746799468994, -0.14868411421775818, -0.03139263391494751, -0.4536064863204956
    

    And from the Swift:

    [0.3193549513816833496, 0.0140316337347030640, -0.4410626888275146484, -0.0908056870102882385, -0.0415024310350418091, -0.4141347408294677734
    

    I used the Quick Look preview while debugging the iOS code to save the image from the UIImage and ensure the same image is being used. In both cases, I'm using the original ViT-B/32 CLIP image encoder. Strangely, the numbers above are roughly similar, but I'm not sure whether that's coincidental.

    Any advice?

    opened by wabzqem 1