NVIDIA hosted GPUniversity, a day of talks and a hands-on workshop on Deep Learning. It was held in the Husky Union Building (HUB) at the University of Washington, Seattle on 14 April, 2017. The workshop was organized to discuss the future of Artificial Intelligence computing and discover how Graphics Processing Units (GPUs) are powering this revolution.
The day had a solid lineup of speakers (Stan Birchfield, NVIDIA, and Prof. Ali Farhadi, UW-Seattle) and a workshop on Signal Processing using NVIDIA DIGITS.
The talks started at 10:30 am, with Dr. Stan Birchfield presenting on ‘Deep Learning for Autonomous Drone Flying through Forest Trails’. He is a Principal Research Scientist at NVIDIA, Seattle. Dr. Birchfield gave us a brief overview of three major projects happening at NVIDIA. The first project described how NVIDIA is looking at replacing the Image Signal Processor (ISP), a collection of modules such as auto exposure, denoising and demosaicing, with a deep learning network. Here is a blog post from NVIDIA that could provide some information on the advances in deep learning.
The second project was about their efforts to reduce driver distraction. Using data from inside the car, the system estimates the head pose and gaze of the driver. A different research team at NVIDIA is also researching the use of hand gestures for automotive interfaces. Having worked on gesture recognition using a standard camera and computer vision algorithms, I find this research exciting. Their most recent paper can be found in CVPR 2016.
He finally addressed the topic of image-to-image translation before speaking about his research. Image-to-image translation would allow one to shift images from a day view to night, from a sunny image to rainy, or from RGB to IR. The possibilities are endless. The system takes a raw image as input and provides a final image as output. Here is a publication by NVIDIA I found on the topic.
This was followed by information about Dr. Birchfield’s research on autonomous flight of drones in forests. Most drone enthusiasts have found it hard to navigate their autonomous aerial vehicles in the forest. Trees create multipath effects and attenuate or block the signal, which makes GPS unreliable. However, if this problem could be solved, drones could serve multiple functions: search and rescue, environmental mapping, personal videography, and of course, drone racing!
NVIDIA’s approach to the problem eliminates the use of GPS (at this stage) and uses deep learning for computer vision instead. Their research is done using micro aerial vehicles (MAVs). For this purpose, they use the 3DR Iris+ with an NVIDIA Jetson TX1 processor. Through imitation learning (also used in NVIDIA self-driving cars), the drone is taught to fly along a trail and stop at a safe distance if a human is detected. The dataset builds on prior research from the University of Zurich (Giusti et al. 2016) and on data collected from Pacific Northwest trails. The system also makes use of the DSO (Direct Sparse Odometry) and YOLO (You Only Look Once) algorithms. The distribution mismatch was fixed by using three cameras instead of just one. A detailed talk about this research will be presented at the GPU Technology Conference in May. You can follow the research here.
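To make the trail-following idea a little more concrete, here is a minimal sketch of how a classifier's output could be turned into a steering command. It assumes the network scores three classes (turn left, go straight, turn right), in the spirit of Giusti et al. 2016. The function names, the class order and the 30-degree gain are my own assumptions for illustration, not NVIDIA's actual controller.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steering_command(logits, max_turn_deg=30.0):
    """Map hypothetical (turn-left, go-straight, turn-right) scores to a turn angle.

    Returns degrees; positive means turn right, negative means turn left.
    The class order and max_turn_deg are assumptions made for this sketch.
    """
    p_left, _p_straight, p_right = softmax(logits)
    # The more confident the network is about one side, the harder the turn.
    return max_turn_deg * (p_right - p_left)


if __name__ == "__main__":
    # Made-up scores where the network leans towards "turn right".
    print(steering_command([0.2, 1.0, 2.5]))
```

When the network is equally unsure about left and right, the two probabilities cancel out and the drone keeps flying straight.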
Professor Ali Farhadi had an interactive session on Visual Intelligence. He started his presentation by showcasing the performance of YOLO in real-time.
An additional demo that followed showed the design of a $5 computer for detecting people. This was built using a Raspberry Pi Zero.
Prof. Farhadi took us through a number of projects in his 45-minute talk. The man never fails to impress (I have been in his class and he is an inspiring teacher!). I am going to provide a brief description of these projects and add links to publications/research websites below.
Visual recognition involves visual knowledge, data, parsing and visual reasoning. The action-centric view of visual recognition involves three parts: recognizing actions, predicting expected outcomes and devising a plan. The projects discussed include all these factors.
- imsitu.org: It is used for situation recognition, as opposed to treating all the components of an image as objects. This enables the system to predict not just the objects or locations, but also the activity being performed and the roles of the participants performing it. The demo provided on the website implements a compositional Conditional Random Field, pre-trained using semantic data augmentation on 5 million web images.
Go ahead and try it here.
- Learn EVerything about ANything (LEVAN): Single camera systems pose a problem when size is a determining factor for visual intelligence. However, if we are able to understand the average sizes of objects, we could make better predictions by imposing a distribution. LEVAN acts as a visual encyclopedia for you, helping you explore and understand in detail any topic that you are curious about.
Try the demo here. If it does not have a concept you are looking for, click and add it to the database! 🙂
- Visual Knowledge Extraction Engine (VisKE): In brief, VisKE does visual fact checking. It provides the most probable explanation based on images from the internet. It generates a factor graph that assigns scores based on how much it visually trusts the information.
Try the demo here.
- Visual Newtonian Dynamics (VIND): VIND predicts the dynamics of query objects in static images. The dataset compiled includes videos aligned with Newtonian scenarios represented using game engines, and still images with their ground truth dynamics. A Newtonian neural network performs the correlation.
- What Happens if?: By making use of the Forces in Scenes (ForScene) dataset from the University of Washington, and using a combination of Recurrent Neural Nets with Convolutional Neural Nets, this project aims to understand the effect of external forces on objects. The system makes sequential predictions based on the force vector applied to a specific location.
- AI2 THOR Framework: THOR is a framework of visually realistic rendered scenes for studying actions based on visual input.
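The point about object sizes in the LEVAN item can be illustrated with a little arithmetic. With a single camera, an object's height in pixels alone does not tell you how far away it is, but a prior on its real-world size does. Below is a minimal sketch using the standard pinhole camera relation (distance = focal length in pixels × real height / height in pixels). The function name and the numbers in the usage example are hypothetical and not taken from any of the projects above.

```python
def distance_from_size(focal_px, pixel_height, size_mean_m, size_std_m):
    """Estimate the distance to an object from a prior on its real height.

    Uses the pinhole camera model. Returns (estimate, low, high) in metres,
    where low/high come from the size prior's mean minus/plus one standard
    deviation. All parameter names are assumptions made for this sketch.
    """
    if pixel_height <= 0:
        raise ValueError("pixel_height must be positive")
    scale = focal_px / pixel_height
    estimate = scale * size_mean_m
    low = scale * max(size_mean_m - size_std_m, 0.0)
    high = scale * (size_mean_m + size_std_m)
    return estimate, low, high


if __name__ == "__main__":
    # Hypothetical numbers: 800 px focal length, a person 200 px tall in the
    # image, and a size prior of 1.7 m with a spread of 0.1 m.
    print(distance_from_size(800.0, 200.0, 1.7, 0.1))
```

With these made-up numbers the estimate comes out to roughly 6.8 m, with a plausible range of about 6.4 m to 7.2 m. A tighter size prior narrows that range, which is why knowing typical object sizes helps.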
Hope these projects shed more light on the possibilities in Computer Vision and Deep Learning.