Describe Video with Neural Network

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. Recent advances are starting to enable machines to describe images with sentences. This experiment uses neural networks to automatically describe the content of videos.


This line of work has been the subject of multiple academic papers from the research community over the last year. Some of the proposed approaches have been implemented and are available as open-source:

NeuralTalk: implements the models proposed by Vinyals et al. from Google and by Karpathy and Fei-Fei from Stanford.

Arctic-Captions: implements the models proposed collaboratively by Université de Montréal and University of Toronto.

Visual-concepts: implements the model proposed by Hao Fang et al.

One approach analyzes videos semantically: it searches, filters, and describes videos based on the objects, places, and other things that appear in them. It uses a convolutional neural network to create an "index" of what is contained in every second of the input by repeatedly performing image classification on a frame-by-frame basis. Once an index for a video file has been created, you can search and filter it.
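
The indexing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names build_index and search are hypothetical, and the classifier is a toy stand-in that splits a string into words instead of running a convolutional neural network on pixel data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple


def build_index(
    frames: Iterable[Tuple[int, object]],
    classify: Callable[[object], List[str]],
) -> Dict[str, Set[int]]:
    """Map each predicted label to the set of seconds in which it was seen.

    `frames` yields (second, frame) pairs; `classify` is a hypothetical
    stand-in for an image classifier that returns labels for one frame.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for second, frame in frames:
        for label in classify(frame):
            index[label].add(second)
    return dict(index)


def search(index: Dict[str, Set[int]], labels: List[str]) -> List[int]:
    """Return the sorted seconds in which every queried label was predicted."""
    if not labels:
        return []
    matches = [index.get(label, set()) for label in labels]
    return sorted(set.intersection(*matches))


# Toy data: each "frame" is just a string, and the fake classifier splits it
# into words. A real system would run a neural network on pixel data here.
toy_frames = [(0, "dog grass"), (1, "dog ball"), (2, "sky")]
toy_index = build_index(toy_frames, lambda frame: frame.split())
print(search(toy_index, ["dog"]))          # seconds in which "dog" was predicted
print(search(toy_index, ["dog", "ball"]))  # seconds in which both were predicted
```

Storing seconds per label, instead of labels per second, keeps a search down to a set intersection once the expensive classification pass is finished.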

Another approach uses a 3-D convolutional neural network designed to capture local, fine-grained motion information from consecutive frames. To capture global temporal structure, it adds a temporal attention mechanism that learns to focus on subsets of frames. The two components fit naturally together into an encoder-decoder neural video caption generator.
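
To make "focusing on subsets of frames" more concrete, here is a minimal sketch of soft attention over per-frame feature vectors. It rests on assumptions: in a real model the relevance scores would be produced by a learned network conditioned on the decoder state, whereas here they are simply passed in, and the names softmax and attend are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    frame_features: Sequence[Sequence[float]],
    relevance_scores: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (weights, context): a weighted average of the frame features.

    `relevance_scores` holds one score per frame. Here they are given
    directly; a trained model would compute them (assumption).
    """
    if len(frame_features) != len(relevance_scores):
        raise ValueError("need exactly one score per frame")
    weights = softmax(relevance_scores)
    dim = len(frame_features[0])
    context = [
        sum(w * feature[d] for w, feature in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Three frames with 2-dimensional features; the third frame scores highest,
# so the context vector ends up close to its features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(features, [0.0, 0.0, 5.0])
print(weights)
print(context)
```

Because the weights are recomputed for every generated word, the decoder can look at different parts of the video while producing different parts of the sentence.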


All experimental results were generated with NeuralTalk. It takes an image and predicts its sentence description with a Recurrent Neural Network. The NeuralTalkAnimator was used to process video files.

NeuralTalk is fascinating overall. With the right selection of inputs, it works with astounding accuracy and generates informative sentences. When it fails... Inputs and outputs are cherry-picked, balancing accuracy against comedy.


NeuralTalk's model generates natural language descriptions of images. It leverages large datasets of images and their sentence descriptions to learn about the correspondences between language and visual data.

The model is based on a combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities. For more insights, read this great blog post: Image captioning for mortals.
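
One simplified way to picture how an objective can align the two modalities is an image-sentence score in which every word is matched with its best-fitting image region. The sketch below is an illustration and not code from NeuralTalk: the vectors are hand-written stand-ins for the region and word embeddings that the CNN and the bidirectional RNN would produce, and the function names are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product, used here as the region-word similarity."""
    return sum(a * b for a, b in zip(u, v))


def image_sentence_score(regions: Sequence[Vector], words: Sequence[Vector]) -> float:
    """Sum, over all words, of each word's similarity to its best region."""
    return sum(max(dot(region, word) for region in regions) for word in words)


# Toy embeddings (assumption): two image regions and two candidate sentences.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_sentence = [[0.9, 0.1], [0.2, 0.8]]     # words that point at the regions
unrelated_sentence = [[-1.0, 0.0], [0.0, -1.0]]  # words that point away from them

print(image_sentence_score(regions, matching_sentence))
print(image_sentence_score(regions, unrelated_sentence))
```

A training objective built on such a score would push matching image-sentence pairs to score higher than mismatched ones, for example with a margin-based ranking loss.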


The NeuralTalkAnimator is a Python helper that creates captioned videos. It takes a folder of videos and returns a folder of processed videos. It's open source on GitHub.
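
The folder-in, folder-out pattern can be sketched as follows. This is not the NeuralTalkAnimator code: process_folder, VIDEO_SUFFIXES, and the process_video callback are hypothetical names, and the actual captioning step is left to whatever function the caller supplies.

```python
from pathlib import Path
from typing import Callable, List, Union

# Assumed set of video file extensions; adjust as needed.
VIDEO_SUFFIXES = {".mp4", ".mov", ".avi"}


def process_folder(
    input_dir: Union[str, Path],
    output_dir: Union[str, Path],
    process_video: Callable[[Path, Path], object],
) -> List[Path]:
    """Run `process_video(source, destination)` for every video in a folder.

    `process_video` is a hypothetical callback that would caption one video
    and write the result to `destination`. Non-video files are skipped.
    Returns the list of destination paths, in sorted order.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for source in sorted(input_dir.iterdir()):
        if source.is_file() and source.suffix.lower() in VIDEO_SUFFIXES:
            destination = output_dir / source.name
            process_video(source, destination)
            written.append(destination)
    return written
```

For a quick dry run, shutil.copyfile can be passed as the callback: it copies each video unchanged, which confirms the folder handling before a real captioning function is plugged in.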

Final Thoughts

The rate of innovation in machine image captioning is astounding. While results might still be inaccurate at times, they are certainly entertaining. The next generation of networks, trained on even bigger datasets, will likely operate faster and more precisely.

Emerging approaches such as "Describing Videos by Exploiting Temporal Structure", "Action-Conditional Video Prediction using Deep Networks in Atari Games", and "Searchable Video" are highly fascinating.
