Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. Recent advances are starting to enable machines to describe images with sentences. This experiment uses neural networks to automatically describe the content of videos.
This line of work has been the subject of multiple academic papers over the last year. Some of the proposed approaches have been implemented and are available as open source:
One tool analyzes videos semantically: it searches, filters, and describes videos based on the objects, places, and other things that appear in them. It uses a convolutional neural network to create an "index" of what is contained in every second of the input by repeatedly performing image classification on a frame-by-frame basis. Once an index for a video file has been created, you can search and filter it.
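The indexing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual code: `build_index`, `search`, `fake_classify`, and the `min_confidence` threshold are hypothetical names, and the classifier is a stand-in that returns precomputed `(label, confidence)` pairs instead of running a real convolutional network on a frame.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical classifier type: takes one frame, returns (label, confidence) pairs.
Classifier = Callable[[object], Sequence[Tuple[str, float]]]


def build_index(
    frames_per_second: Sequence[object],
    classify: Classifier,
    min_confidence: float = 0.5,
) -> Dict[str, List[int]]:
    """Map each detected label to the seconds in which it appears."""
    index: Dict[str, List[int]] = defaultdict(list)
    for second, frame in enumerate(frames_per_second):
        for label, confidence in classify(frame):
            if confidence < min_confidence:
                continue  # ignore weak detections
            # Record each second at most once per label.
            if not index[label] or index[label][-1] != second:
                index[label].append(second)
    return dict(index)


def search(index: Dict[str, List[int]], label: str) -> List[int]:
    """Return the seconds in which `label` was detected (empty if never)."""
    return index.get(label, [])


def fake_classify(frame: object) -> Sequence[Tuple[str, float]]:
    # Stand-in for a real image classifier: in this sketch each "frame"
    # is already a list of (label, confidence) pairs.
    return frame  # type: ignore[return-value]


# One sampled frame per second of a three-second clip (made-up data).
frames = [
    [("dog", 0.9), ("grass", 0.7)],
    [("dog", 0.8)],
    [("car", 0.95), ("dog", 0.2)],
]

index = build_index(frames, fake_classify)
print(search(index, "dog"))  # [0, 1]
```

Because the index is built once per file, later searches and filters are plain dictionary lookups and never need to rerun the classifier.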
Another approach uses a 3-D convolutional neural network designed to capture local, fine-grained motion information from consecutive frames. To capture global temporal structure, its authors propose a temporal attention mechanism that learns to focus on subsets of frames. The two components fit naturally together into an encoder-decoder neural video caption generator.
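To make "focusing on subsets of frames" concrete, here is a minimal sketch of soft attention over per-frame feature vectors. It rests on simplifying assumptions: the learned scoring function is replaced by a plain dot product between each frame's features and a query vector (standing in for the decoder state), and the names `softmax`, `attend`, `features`, and `query` are hypothetical. It is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    frame_features: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Weight every frame by its relevance to `query` and blend the features.

    Assumption: relevance is a dot product; a trained model would use a
    learned scoring function instead.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Context vector: attention-weighted average of the frame features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Three frames with 2-dimensional features (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(features, query=[2.0, 0.0])
print(weights)  # frames 0 and 2 get equal, larger weights than frame 1
print(context)
```

In a caption generator, a context vector like this would be recomputed for every output word, so different words can draw on different parts of the video.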
Overall, NeuralTalk is fascinating. With the right selection of inputs, it works with astounding accuracy and generates informative sentences. When it fails... Inputs and outputs are cherry-picked, balancing accuracy vs. comedy.
NeuralTalk's model generates natural language descriptions of images. It leverages large datasets of images and their sentence descriptions to learn about the correspondences between language and visual data.
The rate of innovation in machine image captioning is astounding. While results might still be inaccurate at times, they are certainly entertaining. The next generation of networks, trained on even bigger datasets, will likely operate faster and more precisely.
Emerging approaches like "Describing Videos by Exploiting Temporal Structure", "Action-Conditional Video Prediction using Deep Networks in Atari Games", and "Searchable Video" are highly fascinating.