NUS, Facebook AI and other world-class universities collaborate to teach AI to understand the world through our eyes

There is a marked difference between viewing and interacting with the world as a third-party spectator, and experiencing the action intimately from a first-person point of view.

This difference is similar to watching others ride a roller coaster from the ground, as opposed to riding the roller coaster yourself: the experience and understanding of the ride that each produces are entirely different.

This is the obstacle currently facing Artificial Intelligence (AI) and its applications.

To unlock the next wave of AI technology that will power future assistants and innovations for Augmented Reality (AR) and robotics, AI needs to evolve to an entirely new paradigm of egocentric (i.e. first-person) perception. This means teaching AI to understand the world through human eyes in the context of real-time motion, interaction, and multi-sensory observations.

To do so, a consortium has been formed by the National University of Singapore (NUS) and 12 other universities around the world to undertake an ambitious, long-term project called Egocentric 4D Live Perception (Ego4D).

A team from the NUS Department of Electrical and Computer Engineering has been actively working to collect first-person data, specifically in Singapore. This data was collected over the course of 2021, via head-mounted cameras and AR glasses distributed to a total of 40 participants locally. This allowed the NUS team to capture their eye-level, first-person, unscripted experiences of day-to-day scenarios: routine activities such as getting a haircut, dining at a hawker centre, or going to the gym. Based on the collected video data, the NUS researchers were then able to train AI models to understand people and their interactions by leveraging both audio and visual cues.

Assistant Professor Mike Shou, who leads the NUS research team, said, “Over the past 10 years, we have witnessed the revolution of AI for image understanding, which is built on the foundations laid down by datasets like ImageNet. Similarly, I believe our Ego4D dataset will lay down the necessary groundwork for egocentric video research, and spur remarkable progress in building AI models for AR and robot applications.”

Improving AI technology for smart and useful assistants

The current paradigm of computer vision (CV) has, so far, excelled at understanding what is in an image or video by learning from vast amounts of online photos and videos from a third-person view, where the camera is a spectator from afar.

Advancements in first-person, or egocentric, perception will provide the fundamental building blocks necessary to develop smarter cognitive capabilities for AI assistants that understand the context of the person interacting with them. Such AI assistants will prove more useful in our day-to-day lives and work. Imagine trying to cook a new recipe: instead of referring repeatedly to a cookbook while attempting to multi-task, a pair of AR smart glasses could guide you through each specific step as you perform it.

Asst Prof Shou explained, “In particular for Singapore and her aging population, such AI assistants can be an exceptional aid for the elderly, especially those with health conditions like dementia or Alzheimer’s. A pair of AR glasses could help elderly patients remember ‘what happened and when’, and answer questions like where they left their keys, or whether they remembered to lock the door. AI assistants applied to healthcare robotics can also understand whether a person is speaking to, or looking at, the robot itself, and thus take care of multiple patients in a single ward concurrently.”

This technology can be applied across a spectrum of devices, and fuel a world where physical, augmented, and virtual reality co-exist in a single space.

World’s largest first-person video dataset

Today, the consortium is announcing the world’s largest first-person video dataset captured “in the wild”, featuring people going about their normal daily lives. The NUS team, together with its partners, has collectively gathered more than 3,000 hours of first-person video data from more than 700 research participants across nine countries, and the dataset will be publicly available in November 2021.

Progress in the nascent field of egocentric perception depends on large volumes of egocentric data of daily life activities, considering most AI systems learn from thousands, if not millions, of examples. Existing datasets do not yet have the scale, diversity, and complexity necessary to be useful in the real world.

This first-of-its-kind video dataset captures what the camera wearer chooses to gaze at in a specific environment; what the camera wearer is doing with their hands and the objects in front of them; and how the camera wearer interacts with other people from the egocentric perspective. So far, the collection features camera wearers performing hundreds of activities and interacting with hundreds of different objects. All visible faces and audible speech in the Ego4D dataset’s video footage were captured with the participants’ consent for public release.

Global representation is crucial for egocentric research, since the egocentric visual experience will differ significantly across cultural and geographic contexts. In particular, the NUS team’s collected data can be extrapolated to represent the Southeast Asian demographic, so that the AI systems built on it can recognise the nuances and needs that may look different from region to region.

Unpacking the real-world dataset

Equally important as data collection is defining the right research benchmarks, or tasks, that can be used for testing and measurement.

To give all the researchers involved a common objective for building fundamental research on real-world perception of visual and social contexts, the consortium has collectively developed five new, challenging benchmarks, which required rigorous annotation of the shared egocentric dataset.

“Our NUS research team has been focusing on two of these key benchmarks: audio-visual diarization, that is, to leverage cues of sound and sight to help AI machines identify ‘who said what when’; and the second is to train AI models to better understand the nature of social interactions. A socially intelligent AI should understand who is speaking to whom, and who is paying attention to whom at any given point of time,” illustrated Professor Li Haizhou, a co-Principal Investigator also from the NUS Department of Electrical and Computer Engineering.
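To make “who said what when” concrete, the sketch below shows the kind of timeline an audio-visual diarizer produces. It is a purely illustrative Python example: the class and field names are assumptions made for this article, not the Ego4D annotation schema or the NUS team’s actual output format.

```python
from dataclasses import dataclass

@dataclass
class DiarizationSegment:
    """One 'who said what when' record from an audio-visual diarizer (illustrative only)."""
    speaker_id: str   # identity inferred by linking a voice to a visible face track
    start_sec: float  # segment start time within the egocentric video
    end_sec: float    # segment end time
    transcript: str   # what was said, if speech recognition is also applied

# Toy output for a short clip in which the camera wearer chats with one friend.
segments = [
    DiarizationSegment("wearer",   0.0, 2.4, "Where did I put my keys?"),
    DiarizationSegment("friend_1", 2.6, 4.1, "You left them on the kitchen table."),
    DiarizationSegment("wearer",   4.3, 5.0, "Thanks!"),
]

# The timeline can then be queried directly, e.g. printed as "who said what when".
for seg in segments:
    print(f"[{seg.start_sec:.1f}s-{seg.end_sec:.1f}s] {seg.speaker_id}: {seg.transcript}")
```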

To date, the NUS team has created multi-modal deep learning models for:

1) Active speaker detection (detecting who is speaking, using both audio and visual signals; see the sketch after this list)

2) Detecting who is speaking to whom

3) Detecting who is paying attention to whom in a given social interaction
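
As a rough illustration of how audio and visual signals can be combined for the first of these tasks, the sketch below fuses a face-track embedding with an audio embedding and scores whether that face belongs to the current speaker. It is a minimal, generic fusion architecture written for this article; it is not the NUS team’s model, and the feature dimensions and layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ActiveSpeakerScorer(nn.Module):
    """Toy audio-visual fusion: score how likely a tracked face is the current speaker."""

    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # e.g. features from a face-crop encoder
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # e.g. log-mel spectrogram embedding
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, 1),                     # fused features -> "speaking" logit
        )

    def forward(self, visual_feat, audio_feat):
        v = self.visual_proj(visual_feat)
        a = self.audio_proj(audio_feat)
        fused = torch.cat([v, a], dim=-1)                     # simple concatenation fusion
        return self.classifier(fused).squeeze(-1)             # one logit per (face, audio window) pair

# Example: score three candidate faces against the same one-second audio window.
model = ActiveSpeakerScorer()
visual_feats = torch.randn(3, 512)              # one embedding per tracked face
audio_feat = torch.randn(1, 128).expand(3, -1)  # shared audio embedding for the window
print(torch.sigmoid(model(visual_feats, audio_feat)))  # probability that each face is speaking
```

Concatenation is only the simplest fusion choice; cross-modal attention or recurrent fusion over time are common alternatives in audio-visual models.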

Currently, these models have been trained on only up to 100 hours of video, a small fraction of the entire Ego4D dataset. Next steps for the NUS team include conducting large-scale pre-training to create one generic, strong model that is trained on the whole dataset and can learn to perform multiple tasks jointly.
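
A minimal sketch of what such joint, multi-task training could look like is shown below, assuming a shared backbone with one lightweight head per task. The architecture, task names, and hyperparameters are illustrative assumptions for this article, not the team’s actual pre-training setup.

```python
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    """Illustrative multi-task setup: one shared video encoder, one head per benchmark task."""

    def __init__(self, feat_dim=768):
        super().__init__()
        # Stand-in for a large pre-trained video encoder (the real backbone would be far bigger).
        self.backbone = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "active_speaker": nn.Linear(feat_dim, 2),    # speaking / not speaking
            "speaking_to_whom": nn.Linear(feat_dim, 8),  # toy: index of the addressed person
            "attention_target": nn.Linear(feat_dim, 8),  # toy: index of the person being looked at
        })

    def forward(self, clip_feat, task):
        return self.heads[task](self.backbone(clip_feat))

model = SharedBackboneMultiTask()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One joint training step over a mixed mini-batch covering all three tasks.
batch = {
    "active_speaker":   (torch.randn(4, 768), torch.randint(0, 2, (4,))),
    "speaking_to_whom": (torch.randn(4, 768), torch.randint(0, 8, (4,))),
    "attention_target": (torch.randn(4, 768), torch.randint(0, 8, (4,))),
}
loss = sum(loss_fn(model(x, task), y) for task, (x, y) in batch.items())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```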