Opinion Summary: as remote working has become the norm, the limitations of remote videoconferencing and collaboration technology have become apparent. The perfect remote collaboration platform will include AI that will blur the line between what is real and what is augmented. In the future, our webcam will actually be a digital avatar that will be a photorealistic real-time representation of ourselves – made from merging our webcam input with digital enhancements designed to increase social interaction. In order to make videoconferencing less tiring and more productive, videoconferencing software needs to help surface the information that is most relevant and suppress what is distracting (or embarrassing). I call this Filtered Augmented Reality.
As remote working and webcams have become the norm, so has the complaint of “I can’t hear you, let’s turn off our webcams”, and the fear of the “up-the-nostril webcam picture” and “the family member in the background ruining your big presentation”.
Whilst webcam usage has become more informal, there is still a long way to go before it is completely hassle- and worry-free.
Recently, technology has been released that aims to solve many of the major videoconferencing problems by blurring the line between augmented reality and videoconferencing. The most exciting and novel of these technologies have come from Nvidia’s AI research team. The research team built on Nvidia’s hyperrealistic face generation technology.
Repurposing deepfake technology
Nvidia has been able to reduce the bandwidth required for video transmission by a factor of ten. By limiting the scope (the “imagination”) of generative adversarial networks, the framework behind its hyperrealistic face generation, the research team was able to repurpose the AI to generate one specific face. The neural network on the sender’s side creates a model of the user’s face and transmits only the information the receiving computer needs to “regenerate the face”. This is more than just video compression; it is a form of augmented reality, because the regenerated face can be adjusted. The revolutionary improvement is that the face doesn’t have to be a perfect reconstruction, so subtle changes can be introduced. For example, the image can always be shown “straight on” rather than “up the nose” or from the side as the person looks at their second screen, and the eyes can be repositioned to maintain eye contact with the audience.
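To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not Nvidia’s implementation: the class name, the landmark count, the byte sizes and the size of a conventionally compressed frame are all assumptions picked for illustration. It only shows why sending one reference image followed by a small set of face keypoints per frame is so much cheaper than sending every frame.

```python
from dataclasses import dataclass


@dataclass
class KeypointStream:
    """Hypothetical sender: one reference frame up front, then keypoints only.

    Every default below is an illustrative assumption, not a measured value.
    """
    num_keypoints: int = 68               # assumed number of facial landmarks
    bytes_per_coord: int = 2              # assumed 16-bit value for each x and y
    reference_frame_bytes: int = 50_000   # assumed size of the single reference image

    def bytes_per_frame(self) -> int:
        # Each keypoint carries an x and a y coordinate.
        return self.num_keypoints * 2 * self.bytes_per_coord

    def total_bytes(self, frames: int) -> int:
        # The reference image is sent once; keypoints are sent for every frame.
        return self.reference_frame_bytes + frames * self.bytes_per_frame()


def conventional_bytes(frames: int, bytes_per_frame: int = 3_000) -> int:
    """Total bytes if every frame is sent as (assumed) conventionally compressed video."""
    return frames * bytes_per_frame


def savings_ratio(frames: int, stream: KeypointStream) -> float:
    """How many times smaller the keypoint stream is than the conventional one."""
    return conventional_bytes(frames) / stream.total_bytes(frames)


if __name__ == "__main__":
    one_minute = 30 * 60  # frames in one minute at an assumed 30 frames per second
    stream = KeypointStream()
    print("keypoint bytes per frame:", stream.bytes_per_frame())
    print("savings ratio over one minute:", round(savings_ratio(one_minute, stream), 1))
```

With these made-up values the ratio lands near ten, but only because the assumed numbers were chosen to echo the claim above; real savings depend on the codec, the resolution and how much information the receiver needs to regenerate the face.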
Some might argue that this type of adjustment makes the interaction less genuine as the reconstructed image isn’t a true likeness of the transmitter (and doesn’t accurately convey what they were actually doing), but I think most people will not mind as the interaction will better imitate in-person interactions.
Improved accessibility and real-time translation
A huge benefit of the reduced bandwidth is lower latency, which means frames don’t need to be dropped for the video to keep up with the audio (dropped frames are the reason videoconferencing sometimes judders). By using upscaling AI more usually found in video games (deep learning super sampling), the reconstructed image can also be of higher resolution than any cheap laptop webcam could produce. Both the improved resolution and the improved frame rate could benefit the hard of hearing by allowing them to lip-read more easily, something that is currently difficult unless video reception is perfect and expensive webcam equipment is used. Nvidia and Google both have real-time language translation (and captioning) that can make international business meetings more seamless. Currently, Nvidia’s technology is limited to audio-to-text translation, but there is no reason why the person’s accent or voice could not be replicated, and their mouth and face adjusted to match. Start-up Descript already has voice manipulation technology that can be used to create synthetic podcasts.
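The link between smaller frames and fewer dropped frames can be sketched with simple arithmetic. The Python below is a simplified model, not how any particular videoconferencing client works: the function names, the link capacity and the frame sizes are hypothetical, and it ignores audio, protocol overhead and buffering.

```python
def deliverable_fps(link_bytes_per_second: int, bytes_per_frame: int,
                    target_fps: int = 30) -> int:
    """Frames per second that fit in the link budget, capped at the target rate."""
    return min(target_fps, link_bytes_per_second // bytes_per_frame)


def dropped_per_second(link_bytes_per_second: int, bytes_per_frame: int,
                       target_fps: int = 30) -> int:
    """Frames per second that must be dropped to stay within the link budget."""
    return target_fps - deliverable_fps(link_bytes_per_second, bytes_per_frame, target_fps)


if __name__ == "__main__":
    slow_link = 60_000  # assumed capacity available for video, in bytes per second
    # Assumed 3,000-byte frames: only 20 of the 30 target frames fit each second.
    print(dropped_per_second(slow_link, bytes_per_frame=3_000))
    # Assumed 272-byte frames: all 30 frames fit, so nothing is dropped.
    print(dropped_per_second(slow_link, bytes_per_frame=272))
```

The point of the sketch is only that, on the same constrained link, shrinking the per-frame payload is what lets the video keep pace without discarding frames.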
Reducing meeting fatigue
Microsoft Teams has released Together mode. The feature is based on research showing that people are used to interacting with others with reference to their location in a room. Our brains expect the people we are talking with to be in the same surroundings and in a fixed position. Microsoft claims that brain scans indicate that removing people from their surroundings, making them the same size, and putting them into a common surrounding can decrease the mental effort of virtual meetings.
Together mode is meant to help meeting participants share non-verbal social cues, such as being able to lean over and tap someone on the shoulder, or to virtually make eye contact (for example when a colleague waffles on during a presentation). Social cues are important: there is nothing worse than giving a virtual presentation and receiving no real-time feedback.
You can do better than a screen share of a PowerPoint presentation
Prezi is an interesting start-up that is trying to make videoconference presentations more engaging by mixing the presenter and presentation. Usually, as soon as the presenter shares their screen/presentation, their face becomes a tiny box that no one continues to look at – the exact opposite of what should happen in an engaging presentation.
Prezi allows for the creation of presentations that incorporate the presenter as part of the presentation, therefore keeping them interacting with the audience.
A fractured ecosystem is ready for consolidation
Remote working has accelerated the uptake of start-ups like Zoom and Miro. Both of these start-ups have excelled at making virtual meetings and seamless collaboration available to those who were previously not heavy videoconference users (mostly non-corporates).
The explosion of videoconferencing software has led to a fractured ecosystem and “installing new meeting software fatigue”. It’s annoying having to install a new meeting client and learn how it works: “which button is it to share the screen again?!” Datanyze tracks the market share of 96 different web-conferencing platforms/tools.
The fact that Nvidia is releasing its advanced AI technologies as an SDK (and not creating its own conferencing platform) will allow smaller software companies, which cannot afford large AI research departments, to compete with the larger players. Currently, Microsoft is in a good position to gain a large market share in the videoconferencing space. Whilst its collaboration tools aren’t as good as Miro’s, and its augmented reality is not as good as Nvidia’s, it is good at nearly everything. The integration with Outlook, and the fact that the Microsoft Office ecosystem is so ubiquitous, gives Teams a great market advantage. In March 2020, Microsoft saw a 775% increase in the use of Teams, and didn’t face the same privacy scandal that Zoom faced. The issue is that Microsoft does have a history of dropping the ball; it is only now killing the disaster that was Skype.
The traditional definition of augmented reality is technology that superimposes a computer-generated image on a user’s view of the real world, creating a composite view. I think this definition needs to be updated; the generated augmented reality needs to look real but it doesn’t have to be completely true to life. In order to make videoconferencing less tiring and more productive, videoconferencing software needs to help surface the information that is most relevant and suppress what is distracting. Filtered Augmented Reality will become the norm.
QY Research estimates that the videoconferencing market will grow from $12 billion to nearly $20 billion in 2020. There is a big commercial opportunity for the company that can create the perfect videoconference and collaboration tool.