DeepInfant® is a neural network system designed to predict whether and why your baby is crying.




DeepInfant uses artificial intelligence and machine learning algorithms to determine which acoustic features are associated with each of a baby's needs.

For example, babies who are in pain demonstrate cries with high energy, while a fussy cry may have more periods of silence. We are preparing our findings, together with a single well-trained model based on academic datasets, for academic review and publication.


DeepInfant is a machine learning model that uses artificial intelligence to predict your baby's needs based on sound classification of cries.


DeepInfant was used as part of a final project in the Speech Technology course at KTH (Royal Institute of Technology, Sweden).

iOS Models

This repo is published with pre-trained CoreML Models.

  • DeepInfant_VGGish: trained with a feature extractor that uses extra layers for classification, with a fixed window duration of 975 ms.
  • DeepInfant_AFP: trained with a feature extractor that is optimized for speed and supports a flexible window duration.

iOS and iPadOS App

This repo contains an example of using the DeepInfant_VGGish model to build an iOS app that analyzes a baby's cry and shows the prediction result along with a tip on how to respond to each predicted need.

Building a model

The audio clips have a sample rate of 16000 Hz and a duration of about 7 seconds. This means each clip is represented by about 16000*7 = 112000 numbers. We take a fast Fourier transform (FFT) of a 2048-sample window, slide it by 512 samples, and repeat the process across the 7-second clip. The resulting representation can be shown as a 2D image and is called a Short-Time Fourier Transform (STFT). Since humans perceive sound on a logarithmic scale, we’ll convert the STFT to the mel scale. The librosa library lets us load an audio file and convert it to a mel spectrogram:

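import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fname = 'test-1-audio001.wav'
# sr=16000 keeps the sample rate described above (librosa resamples to 22050 Hz by default)
samples, sample_rate = librosa.load(fname, sr=16000)
fig = plt.figure(figsize=[4,4])
ax = fig.add_subplot(111)
# Compute the mel spectrogram (power) and convert it to decibels for display
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
plt.show()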
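The windowing arithmetic described above can be checked with a small NumPy-only sketch. This is not librosa's implementation: it skips padding (librosa pads the signal by default, so its frame count will differ slightly), and the 440 Hz test tone and the helper name `stft_magnitude` are assumptions made purely for illustration.

```python
import numpy as np

sample_rate = 16000   # Hz, as in the clips described above
duration_s = 7        # seconds
n_fft = 2048          # samples per FFT window
hop = 512             # samples between consecutive windows

n_samples = sample_rate * duration_s  # 112000 samples per clip

# Assumption: a synthetic 440 Hz tone stands in for a real recording.
t = np.arange(n_samples) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT without padding (hypothetical helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

spec = stft_magnitude(signal, n_fft, hop)
print(spec.shape)  # (frequency bins, frames)

# The strongest bin should sit near 440 Hz; each bin spans sample_rate / n_fft Hz.
peak_bin = int(np.argmax(spec.mean(axis=1)))
print(peak_bin * sample_rate / n_fft)
```

Each column of `spec` is one 2048-sample window, and each row is one frequency bin; the mel conversion then groups these bins into a smaller number of perceptually spaced bands.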

The melspectrogram of a baby crying looks like the image below

In order to build the spectrograms of the audio samples needed for training the model, we’ll be using the fantastic audio loader module for fastai v1 built by Jason Hartquist.

# get_frequency_batch_transforms and AudioItemList come from the audio loader module;
# CRYING_PATH is the folder that holds the training clips
n_fft = 2048 # FFT window size; each frame yields n_fft/2 + 1 = 1025 frequency bins
n_hop = 512 # hop of 512 samples, i.e. 75% overlap between 2048-sample frames
n_mels = 128 # compress the 1025 frequency bins to 128 mel bands
sample_rate = 16000
tfms = get_frequency_batch_transforms(n_fft=n_fft, n_hop=n_hop,
              n_mels=n_mels, sample_rate=sample_rate)
batch_size = 64
data = (AudioItemList.from_folder(CRYING_PATH)
     .databunch(bs=batch_size, tfms=tfms, equal_lengths=False))
learn = create_cnn(data, models.resnet34, metrics=accuracy)
learn.lr_find(start_lr=0.001, end_lr=1)

Fastai’s cyclical learning rate finder runs the model against a small batch of training samples to find a good learning rate.

As the learning rate increases to 10e-2, the model loss decreases. For higher learning rates, however, the loss begins to increase, so we pick 10e-2 as the learning rate for training the model. After training the model for a few epochs, we see an accuracy of 95% on the validation set.

Predicting over realtime audio samples

Now that we have a really good model, in order to use it in a real application, we need to be able to run predictions over an audio stream in real time.

We use the pyaudio library to read audio samples from the device microphone and then convert the audio data into numpy arrays and feed it to the model.

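import numpy as np
import pyaudio

RATE, CHUNKSIZE, RECORD_SECONDS = 16000, 1024, 7  # CHUNKSIZE is an assumed value
# Assumed setup: mono float32 input stream on the default microphone
stream = pyaudio.PyAudio().open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                                input=True, frames_per_buffer=CHUNKSIZE)
while True:
    frames = []  # a Python list of chunks (numpy.ndarray)
    for _ in range(int(RATE / CHUNKSIZE * RECORD_SECONDS)):
        data = stream.read(CHUNKSIZE, exception_on_overflow=False)
        frames.append(np.frombuffer(data, dtype=np.float32))
    npdata = np.hstack(frames)  # join the chunks into one clip
    audio_clip = AudioClip.from_np(npdata, RATE)  # AudioClip comes from the audio loader module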

The above code reads a 7-second audio clip from the microphone into memory and converts it to a numpy array, which is then passed to the model to get a prediction. This simple piece of code can now be deployed to a service or an embedded device and used in real applications!
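The chunk arithmetic in the recording loop can be checked without a microphone. The following is a minimal sketch under stated assumptions: `FakeStream` is a hypothetical stand-in for a PyAudio input stream that returns silence, `record_clip` is an illustrative helper name, and the `CHUNKSIZE` of 1024 is an assumed value.

```python
import numpy as np

RATE = 16000          # samples per second
CHUNKSIZE = 1024      # assumed number of samples read per call
RECORD_SECONDS = 7    # target clip length

class FakeStream:
    """Hypothetical stand-in for a PyAudio input stream; returns silence."""

    def read(self, num_frames, exception_on_overflow=False):
        # A real stream returns raw bytes; mimic that with float32 zeros.
        return np.zeros(num_frames, dtype=np.float32).tobytes()

def record_clip(stream):
    """Read enough whole chunks to cover RECORD_SECONDS and join them."""
    n_chunks = int(RATE / CHUNKSIZE * RECORD_SECONDS)
    frames = [
        np.frombuffer(
            stream.read(CHUNKSIZE, exception_on_overflow=False),
            dtype=np.float32,
        )
        for _ in range(n_chunks)
    ]
    return np.hstack(frames)

clip = record_clip(FakeStream())
print(len(clip), len(clip) / RATE)  # sample count and duration in seconds
```

Because only whole chunks are read, the clip comes out slightly shorter than 7 seconds; a smaller chunk size narrows that gap at the cost of more read calls.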


The audio files should contain baby cry samples, with the corresponding tagging information encoded in the filenames. The samples were tagged by the contributors themselves. So here's how to parse the filenames.



The filename fields appear in this order:

app instance uuid (36 chars)-unix epoch timestamp-app version-gender-age-reason

For the example sample, this translates to:

  • the sample was recorded with the app instance having the unique id 0D1AD73E-4C5E-45F3-85C4-9A3CB71E8856. These ids are generated upon installation, so they identify an installed instance, not a device or a user
  • the recording was made at 1430742197 (Unix epoch time), which translates to Mon, 04 May 2015 12:23:17 GMT
  • version 1.0 of the mobile app was used
  • the user tagged the recording to be of a boy
  • the baby is 0-4 weeks old according to the user
  • the suspected reason of the cry is hunger



The structure is the same with the exception that the unix epoch timestamp is in milliseconds



Gender

  • m - male
  • f - female


Age

  • 04 - 0 to 4 weeks old
  • 48 - 4 to 8 weeks old
  • 26 - 2 to 6 months old
  • 72 - 7 months to 2 years old
  • 22 - more than 2 years old


Reason

  • hu - hungry
  • bu - needs burping
  • bp - belly pain
  • dc - discomfort
  • ti - tired
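The filename layout and the code tables above can be turned into a small parser. This is a sketch rather than part of the project: the function name `parse_cry_filename` is hypothetical, the example stem is assembled from the field values described above (the file extension is not given, so the function expects the name without it), and treating a 13-digit timestamp as milliseconds is an assumption.

```python
from datetime import datetime, timezone

GENDERS = {"m": "male", "f": "female"}
AGES = {
    "04": "0 to 4 weeks",
    "48": "4 to 8 weeks",
    "26": "2 to 6 months",
    "72": "7 months to 2 years",
    "22": "more than 2 years",
}
REASONS = {
    "hu": "hungry",
    "bu": "needs burping",
    "bp": "belly pain",
    "dc": "discomfort",
    "ti": "tired",
}

def parse_cry_filename(stem):
    """Split a filename stem (no extension) into its tagged fields."""
    # The UUID is 36 characters long and contains hyphens itself,
    # so slice it off before splitting the remaining fields.
    uuid, rest = stem[:36], stem[37:]
    timestamp, version, gender, age, reason = rest.split("-")
    seconds = int(timestamp)
    # Assumption: a timestamp of 13 or more digits is in milliseconds.
    if len(timestamp) >= 13:
        seconds = seconds / 1000
    return {
        "uuid": uuid,
        "recorded_at": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "app_version": version,
        "gender": GENDERS.get(gender, gender),
        "age": AGES.get(age, age),
        "reason": REASONS.get(reason, reason),
    }

info = parse_cry_filename(
    "0D1AD73E-4C5E-45F3-85C4-9A3CB71E8856-1430742197-1.0-m-04-hu"
)
print(info["recorded_at"].isoformat(), info["gender"], info["age"], info["reason"])
# prints: 2015-05-04T12:23:17+00:00 male 0 to 4 weeks hungry
```

Unknown codes are passed through unchanged rather than raising an error, which keeps the parser usable if the dataset contains tags not listed here.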


Please feel free to contact us if you need any further assistance.


Apache License Version 2.0, January 2004

Copyright 2022 Skytells AI Research

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
