Intelligent Cameras That Can Learn and Understand What They Are Seeing

Combining sensing and learning can lead to the development of innovative cameras for AI systems.

Intelligent cameras could be one step closer thanks to a research collaboration between the Universities of Bristol and Manchester, which has developed cameras that can learn and understand what they are seeing.

Roboticists and artificial intelligence (AI) researchers know there is a problem in how current systems sense and process the world. Currently, they are still combining sensors, like digital cameras that are designed for recording images, with computing devices like graphics processing units (GPUs) designed to accelerate graphics for video games.

This means AI systems perceive the world only after recording and transmitting visual information between sensors and processors. But many things that can be seen are irrelevant to the task at hand, such as the detail of leaves on roadside trees as an autonomous car passes by. At the moment, however, all this information is captured by sensors in meticulous detail and sent on, clogging the system with irrelevant data, consuming power, and taking processing time. A different approach is necessary to enable efficient vision for intelligent machines.

Two papers from the Bristol and Manchester collaboration have shown how sensing and learning can be combined to create novel cameras for AI systems.

A Convolutional Neural Network (CNN) on the SCAMP-5d vision system classifying hand gestures at 8,200 frames per second. Credit: University of Bristol, 2020

Walterio Mayol-Cuevas, Professor in Robotics, Computer Vision and Mobile Systems at the University of Bristol and principal investigator (PI), commented: “To create efficient perceptual systems we need to push the boundaries beyond the ways we have been following so far.

“We can borrow inspiration from the way natural systems process the visual world — we do not perceive everything — our eyes and our brains work together to make sense of the world and in some cases, the eyes themselves do processing to help the brain reduce what is not relevant.”

SCAMP-5d vision system. Credit: The University of Manchester, 2020

This is demonstrated by the way the frog’s eye has detectors that spot fly-like objects, directly at the point where the images are sensed.

The papers, one led by Dr. Laurie Bose and the other by Yanan Liu at Bristol, have revealed two refinements towards this goal by implementing Convolutional Neural Networks (CNNs), a form of AI algorithm for enabling visual understanding, directly on the image plane. The CNNs the team has developed can classify frames thousands of times per second, without ever having to record these images or send them down the processing pipeline. The researchers' demonstrations included classifying handwritten numbers, hand gestures, and even plankton.

The research suggests a future with intelligent dedicated AI cameras — visual systems that can simply send high-level information to the rest of the system, such as the type of object or event taking place in front of the camera. This approach would make systems far more efficient and secure as no images need be recorded.
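To give a rough sense of what sending only high-level information could save, the sketch below compares the data rate of streaming every raw frame with that of streaming one class label per frame. The array size and frame rate are the figures quoted in this article for SCAMP-5d; the one-byte-per-pixel and one-byte-per-label sizes are assumptions made purely for illustration, not figures from the papers.

```python
# Figures quoted in the article
WIDTH, HEIGHT = 256, 256       # SCAMP-5d pixel-processor array size
FRAMES_PER_SECOND = 8_200      # reported hand-gesture classification rate

# Assumptions for this illustration only (not taken from the papers)
BYTES_PER_PIXEL = 1            # assumed 8-bit greyscale readout
BYTES_PER_LABEL = 1            # assumed one class index per frame


def bytes_per_second(bytes_per_frame, fps=FRAMES_PER_SECOND):
    """Data rate needed to ship `bytes_per_frame` off the sensor every frame."""
    return bytes_per_frame * fps


raw_rate = bytes_per_second(WIDTH * HEIGHT * BYTES_PER_PIXEL)
label_rate = bytes_per_second(BYTES_PER_LABEL)

print(f"raw frames : {raw_rate / 1e6:.1f} MB/s")
print(f"labels only: {label_rate / 1e3:.1f} kB/s")
print(f"reduction  : {raw_rate // label_rate}x")
```

Under these assumptions, streaming raw frames would need roughly 537 MB per second, while streaming labels would need about 8 kB per second.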

The work has been made possible thanks to the SCAMP architecture developed by Piotr Dudek, Professor of Circuits and Systems and PI from the University of Manchester, and his team. The SCAMP is a camera-processor chip that the team describes as a Pixel Processor Array (PPA). A PPA has a processor embedded in every pixel, and these processors can communicate with one another to process data in truly parallel form. This is ideal for CNNs and vision algorithms.
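The following is a minimal software sketch of why per-pixel processors that exchange data with their neighbours suit convolutions. It is not SCAMP's instruction set or the authors' implementation: the functions `shift` and `ppa_convolve3x3` are hypothetical names, and the array is simulated sequentially in plain Python. Each step is something every pixel could do at the same moment: read a neighbour's value, multiply it by a weight, and add it to a local accumulator.

```python
def shift(plane, dy, dx):
    """Every pixel reads the value held by its neighbour at offset (dy, dx).

    Reads that fall outside the array return 0. Real hardware would likely
    build a diagonal read from single-step moves; one call is used here
    for brevity (an assumption of this sketch).
    """
    h, w = len(plane), len(plane[0])
    return [
        [plane[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
         for x in range(w)]
        for y in range(h)
    ]


def ppa_convolve3x3(image, kernel):
    """3x3 CNN-style convolution (cross-correlation) using only neighbour
    reads and per-pixel multiply-accumulate steps."""
    h, w = len(image), len(image[0])
    acc = [[0] * w for _ in range(h)]          # one accumulator per pixel
    for ky in range(3):
        for kx in range(3):
            weight = kernel[ky][kx]
            neighbour = shift(image, ky - 1, kx - 1)
            # All pixels update their own accumulator in the same step.
            acc = [[acc[y][x] + weight * neighbour[y][x] for x in range(w)]
                   for y in range(h)]
    return acc


# A tiny image with a vertical edge, and a horizontal-gradient kernel.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[0, 0, 0],
          [-1, 0, 1],
          [0, 0, 0]]

edges = ppa_convolve3x3(image, kernel)
for row in edges:
    print(row)
```

The outer loop runs nine times regardless of image size, because the work inside each iteration is done by all pixels at once on a real array. In this simulation that inner work is an ordinary Python loop.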

SCAMP-5d’s hardware architecture. It incorporates a 256 x 256 array of pixel-processors, each containing a light sensor, local memory registers and other functional components. Credit: The University of Manchester, 2020

Professor Dudek said: “Integration of sensing, processing, and memory at the pixel level is not only enabling high-performance, low-latency systems, but also promises low-power, highly efficient hardware.

“SCAMP devices can be implemented with footprints similar to current camera sensors, but with the ability to have a general-purpose massively parallel processor right at the point of image capture.”

Dr. Tom Richardson, Senior Lecturer in Flight Mechanics at the University of Bristol and a member of the project, has been integrating the SCAMP architecture with lightweight drones.

He explained: “What is so exciting about these cameras is not only the newly emerging machine learning capability, but the speed at which they run and the lightweight configuration. They are absolutely ideal for high speed, highly agile aerial platforms that can literally learn on the fly!”

The research, funded by the Engineering and Physical Sciences Research Council (EPSRC), has shown that it is important to question the assumptions made when AI systems are designed. Things that are often taken for granted, such as cameras, can and should be improved towards the goal of more efficient intelligent machines.

References:

“Fully embedding fast convolutional networks on pixel processor arrays” by Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek and Walterio Mayol-Cuevas, 23 August 2020, European Conference on Computer Vision (ECCV) 2020.
DOI: 10.1007/978-3-030-58526-6_29

“High-speed Light-weight CNN Inference via strided convolutions on a pixel processor array” by Yanan Liu, Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek and Walterio Mayol-Cuevas, 7 September 2020, British Machine Vision Conference (BMVC) 2020.