Mobile Text-to-Image Search (MoTIS)

MoTIS is a minimal demo of semantic multimodal text-to-image search using pretrained vision-language models. Semantic search represents each sample (text or image) as a vector in a shared semantic embedding space. Relevance is then scored by the similarity between vectors (cosine similarity or distance).
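As a rough illustration of the scoring step described above, the following sketch ranks a toy gallery by cosine similarity to a query vector. The file names and 3-dimensional vectors are made up for the example (real CLIP ViT-B/32 embeddings have 512 dimensions), and this is not the app's Swift implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_vec, image_vecs):
    """Return image ids sorted from most to least similar to the query."""
    scores = {img_id: cosine_similarity(text_vec, vec)
              for img_id, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy embeddings, chosen only to make the ranking easy to follow.
gallery = {
    "cats.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a text query

ranking = rank_images(query, gallery)
print(ranking)
```

Scanning every image this way is the linear-scan strategy; it is exact but its cost grows with the size of the gallery.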

Distilled Image Encoder Checkpoints

Model	Google Drive	Hit@10 on MSCOCO 2017
original CLIP(336MB)	https://drive.google.com/file/d/1K2wIyTuSWLTKBXzUlyTEsa4xXLNDuI7P/view?usp=sharing	58.6
Deit-small-distilled-patch16-224(84MB)	https://drive.google.com/file/d/1Fg3ckUUqBs5n4jvNWZUcwwk7db0QBRri/view?usp=sharing	62.1
ViT-small-patch16-224(85MB)	https://drive.google.com/file/d/1s_oX0-HIELpjjrBXsjlofIbTGZ_Wllo0/view?usp=sharing	63.8
ViT-small-patch16-224(trained with a larger batch size)	https://drive.google.com/file/d/1h_w9msJMB4F-dR6uNwp-BHeguS5QIrnE/view?usp=sharing	64.7

Note that these checkpoints are not saved from state_dict(); they are the modules produced by torch.jit.script. The same original CLIP text encoder is used with all of the image encoders.

Recent Updates:

We use a pretrained ViT-Small (85MB) as the initialization for the student model. Using the same distillation pipeline, it achieves even better results (2 points higher Hit@1) than the previous Deit-small-distilled model. The link to the JIT-scripted checkpoint is here.
A more effective distilled image encoder (84MB, compared to the original 350MB ViT-B/32 in CLIP) is available here. This image encoder is initialized with DeiT-base-distilled's pre-trained weights, which leads to a more robust image representation and hence better retrieval performance (higher Hit@1/5/10 than the original CLIP on the MSCOCO validation set). It is further trained through supervised learning and knowledge distillation.
Ported Spotify's Annoy approximate nearest neighbor search into this project (annoylib.h).
Relatively low-quality images are displayed by default; retrieved images are displayed in high quality. This is designed to reduce runtime memory.

Features

Text-to-image retrieval using semantic similarity search.
Support for different vector indexing strategies (linear scan, KMeans, and random projection).
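To make the KMeans strategy concrete, here is a minimal sketch of a cluster-based index: vectors are grouped around centroids, and a query is compared only against the members of its nearest cluster. The function names, the toy 2-dimensional data, and the use of squared Euclidean distance are assumptions for illustration; the app's own Swift code may differ.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist2(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_index(vectors, centroids):
    """Map each centroid id to the ids of the vectors assigned to it."""
    index = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(index, key=lambda c: dist2(v, centroids[c]))
        index[nearest].append(i)
    return index

def search(query, vectors, centroids, index, top_k=3):
    """Scan only the cluster whose centroid is closest to the query."""
    nearest = min(index, key=lambda c: dist2(query, centroids[c]))
    return sorted(index[nearest], key=lambda i: dist2(query, vectors[i]))[:top_k]

# Two well-separated toy groups of 2-dimensional vectors.
vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centroids = kmeans(vectors, k=2)
index = build_index(vectors, centroids)
hits = search([5.0, 5.05], vectors, centroids, index)
print(hits)
```

The trade-off is speed against recall: a relevant image that was assigned to a neighbouring cluster is missed unless more than one cluster is scanned.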

Screenshot

All images in the gallery
Search with query Three cats

Installation

Download the two TorchScript model files (text encoder, image encoder) into the models folder and add them to the Xcode project.
Required dependencies are defined in the Podfile. We use CocoaPods to manage these dependencies. Run 'pod install' and then open the generated .xcworkspace file in Xcode.

pod install

By default, this demo loads all images in the local photo gallery on your real phone or simulator. To use a specific album instead, set the albumName variable in the getPhotos method and replace assetResults on line 117 of GalleryInteractor.swift with photoAssets.

Usage

Type any keyword to search for the relevant images. Type "reset" to return to the default view.

Todos

Basic features

Access to specified album or all photos
Asynchronous model loading and vectors computation
Export pretrained CLIP into TorchScript format using torch.jit.script and optimize_for_mobile provided by PyTorch
Port the original PIL-based image preprocessing to an OpenCV-based procedure (about 1% retrieval performance degradation observed)
Port the CLIP tokenizer from Python to Swift (Tokenizer.swift)
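For the preprocessing port mentioned in the list above, one detail worth checking is the per-channel normalization and the channel order. The sketch below uses the mean and standard deviation from CLIP's published preprocessing; the pixel value and helper names are hypothetical. The source does not say which step causes the observed degradation, so this shows only one possible source of mismatch (resize interpolation may also differ between libraries).

```python
# Per-channel statistics used by CLIP's published preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

def bgr_to_rgb(bgr):
    """OpenCV decodes images as BGR, while PIL yields RGB."""
    b, g, r = bgr
    return (r, g, b)

pixel_bgr = (30, 120, 200)  # hypothetical pixel in OpenCV's channel order

correct = normalize_pixel(bgr_to_rgb(pixel_bgr))
swapped = normalize_pixel(pixel_bgr)  # what happens if the swap is forgotten

print(correct)
print(swapped)
```

Comparing a handful of normalized pixel values between the two pipelines, before the tensors reach the encoder, is a cheap way to tell whether a difference comes from preprocessing or from the model itself.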

Indexing strategies

Linear indexing (persisted to file via the built-in Data type)
KMeans indexing (persisted to file via NSMutableDictionary; the number of clusters is hard-coded and can be changed as needed)
Spotify's Annoy library with random projection indexing; the index file is 41MB for 2200 images.
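The random projection idea behind the last item can be sketched with a single table of hyperplane signatures. This is a deliberate simplification and not Annoy's actual implementation (Annoy builds a forest of trees from hyperplane splits); all names and the toy vectors below are assumptions for illustration.

```python
import random

def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)

def build_buckets(vectors, planes):
    """Group vector ids by their signature."""
    buckets = {}
    for i, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, planes), []).append(i)
    return buckets

def query_bucket(query, planes, buckets):
    """Return candidate ids that share the query's signature."""
    return buckets.get(signature(query, planes), [])

vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
planes = make_hyperplanes(dim=3, n_planes=8)
buckets = build_buckets(vectors, planes)

# A query pointing in the same direction as vector 0 lands in its bucket.
candidates = query_bucket([2.0, 0.0, 0.0], planes, buckets)
print(candidates)
```

Vectors separated by a small angle tend to share a signature, so only the candidates in the matching bucket need an exact similarity check. Using several independent tables or trees reduces the chance of missing a true neighbour.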

Choices of semantic representation models

OpenAI's CLIP model
Integration of other multimodal retrieval models

Efficiency

Reducing memory consumption of models: runtime memory 1GB -> 490MB via a smaller yet effective distilled ViT model.

About Us

This project is actively maintained by the ADAPT lab at Shanghai Jiao Tong University. We expect to keep integrating more advanced features and a better cross-modal search experience. If you run into any problems, feel free to file an issue.

Hi! Thanks for publishing this work, it's a great reference.

I'm trying to integrate a couple of different systems, and I need the model encodings to match. So far, I haven't been able to make that work:

Given this Python:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("image_1.png")).unsqueeze(0).float().to(device)
text = clip.tokenize(["a face", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    print(image_features.tolist()[0])

I'm trying to get the same array of floats out using Clip.mm's - (NSArray<NSNumber*>*)test_uiimagetomat:(UIImage*)image function. Try as I might, they always differ - and I'm not sure what the difference is. I can see that the cvt methods do the same as the image preprocess, then the normalise with the values from clip.

Here's some of the initial values from the python code above:

[0.3502497971057892, 0.0028706961311399937, -0.46749746799468994, -0.14868411421775818, -0.03139263391494751, -0.4536064863204956

And from the Swift:

[0.3193549513816833496, 0.0140316337347030640, -0.4410626888275146484, -0.0908056870102882385, -0.0415024310350418091, -0.4141347408294677734

I used the preview of the quicklook on debugging the iOS code to save the image from the UIImage to ensure the same image is being used. In both cases, I'm using the original vit-b-32 CLIP image encoding. Strangely, the numbers above are kind of similar - but not sure if that's coincidental.

Any advice?
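One way to put a number on "kind of similar" is to compare the two outputs directly. The sketch below uses only the six values quoted above from each side, so the result is merely suggestive; a real check would use the full embedding vectors.

```python
import math

# First six values quoted above (Python reference vs. iOS output).
python_vals = [0.3502497971057892, 0.0028706961311399937, -0.46749746799468994,
               -0.14868411421775818, -0.03139263391494751, -0.4536064863204956]
ios_vals = [0.3193549513816833496, 0.0140316337347030640, -0.4410626888275146484,
            -0.0908056870102882385, -0.0415024310350418091, -0.4141347408294677734]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

cos = cosine_similarity(python_vals, ios_vals)
max_abs_diff = max(abs(x - y) for x, y in zip(python_vals, ios_vals))

print(f"cosine similarity: {cos:.4f}")
print(f"max abs difference: {max_abs_diff:.4f}")
```

A cosine similarity close to 1 alongside visible per-element differences would be consistent with the same model receiving slightly different input pixels, which points toward preprocessing rather than the encoder weights. That is a hypothesis to test, not a confirmed diagnosis.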

Cannot match up model encodings
opened by wabzqem

Mobile (iOS) text-to-image search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)

Owner: Roy
