Combining Vision and Language Could Be the Key to More Capable AI

Depending on whose theory of intelligence you subscribe to, reaching "human-level" AI would require a system that can reason about the environment using several modalities, such as voice, vision, and text. For example, when presented with a picture of an overturned truck and a police vehicle on a snowy motorway, a human-level AI would deduce that unsafe road conditions caused an accident.

Or, running on a robot, it would maneuver around people, furniture, and pets to retrieve a can of Coke from the refrigerator and bring it within reach of the requester. Today's AI falls woefully short of that. New research, however, shows signs of progress, ranging from robots that can figure out how to fulfill basic requests (such as "fetch a water bottle") to text-generating systems that learn from explanations.

In this revived edition of Deep Science, our weekly series about the latest developments in AI and the broader scientific field, we're covering work from DeepMind, Google, and OpenAI that makes strides toward systems that can, if not perfectly understand the world, at least solve narrow tasks like image generation with impressive robustness. The enhanced DALL-E, DALL-E 2, from OpenAI is without a doubt the most outstanding project to come out of an AI research lab this week.

While the first DALL-E displayed a surprising knack for making pictures to match practically any prompt (for example, "a dog wearing a beret"), DALL-E 2 takes this a step further, as my colleague Devin Coldewey explains. DALL-E 2's images are far more detailed, and it can intelligently replace a specific region of an image, for instance adding a table to a shot of a marbled floor, complete with suitable reflections.
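For readers who want a concrete sense of that inpainting workflow, here is a minimal sketch, not from the article, using OpenAI's image edit endpoint via the `openai` Python SDK. The file names and prompt are placeholders, and the sketch assumes an OPENAI_API_KEY is set in the environment; the transparent region of the mask marks the area the model is asked to repaint.

```python
# Minimal sketch of DALL-E 2-style inpainting via OpenAI's image edit endpoint.
# Assumes the `openai` Python SDK (v1.x); file names and the prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("marbled_floor.png", "rb"),   # original photo
    mask=open("mask.png", "rb"),             # transparent where the table should go
    prompt="a wooden table on a marbled floor, with reflections in the marble",
    n=1,
    size="1024x1024",
)

print(result.data[0].url)  # URL of the edited image
```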

DALL-E 2 drew the most interest this week, but in a post published on Google's AI blog on Thursday, researchers described an equally impressive visual understanding system dubbed Visually-Driven Prosody for Text-to-Speech (VDTTS). Given only text and video frames of the person speaking, VDTTS can produce realistic-sounding, lip-synced speech. While not a perfect match for recorded conversation, VDTTS's generated speech is nonetheless quite good, with impressively human-like expressiveness and timing. Google suggests it might one day be used in a studio to replace original audio that was captured in noisy conditions.

Visual understanding is, of course, just one step on the road to more capable AI. Another is language comprehension, which lags behind in many areas, even setting aside AI's well-documented toxicity and bias problems. According to one report, a cutting-edge Google system known as Pathways Language Model (PaLM) memorized 40% of the data used to "train" it, with the result that PaLM plagiarizes text down to copyright notices in code snippets. Fortunately, DeepMind, Alphabet's AI lab, is among those exploring ways to address this. In a new study, DeepMind researchers look at whether AI language systems, which learn to generate text from a large number of samples of existing text (think books and social media), might benefit from being given explanations of those texts.

After annotating dozens of language tasks (e.g., "Answer these questions by determining whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence") with explanations (e.g., "David's eyes were not literally daggers; it is a metaphor used to imply that David was glaring fiercely at Paul.") and evaluating different systems' performance on them, the DeepMind team found that explanations do improve the systems' performance. If DeepMind's technique passes academic scrutiny, it could one day be applied in robotics, forming the building blocks of a robot that can interpret broad requests (such as "take out the garbage") without step-by-step instructions. Google's new "Do As I Can, Not As I Say" project offers a glimpse of that future, albeit with considerable caveats.
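To make the setup concrete, here is a minimal, hypothetical sketch (not DeepMind's code) of how few-shot examples annotated with explanations might be assembled into a single prompt for a language model. The task wording, answers, and explanation text are illustrative placeholders drawn loosely from the example above.

```python
# Sketch of few-shot prompting where each worked example carries an explanation,
# in the spirit of the DeepMind study described above. All content is illustrative.

EXAMPLES = [
    {
        "question": 'Is "David\'s eyes were daggers" appropriately paraphrased by '
                    '"David was glaring fiercely at Paul"?',
        "answer": "Yes",
        "explanation": "David's eyes were not literally daggers; the metaphor "
                       "implies that David was glaring fiercely at Paul.",
    },
    # ... more annotated examples would go here ...
]

def build_prompt(examples, new_question):
    """Concatenate explanation-annotated examples, then the unanswered question."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['question']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append(f"Explanation: {ex['explanation']}")
        parts.append("")  # blank line between examples
    parts.append(f"Question: {new_question}")
    parts.append("Answer:")
    return "\n".join(parts)

print(build_prompt(EXAMPLES, 'Is "time is a thief" appropriately paraphrased by '
                             '"time steals moments from our lives"?'))
```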

Do As I Can, Not As I Say is a collaboration between Google Robotics and Alphabet's Everyday Robots team that aims to train an AI language system to suggest behaviors that are "possible" and "contextually suitable" for a robot given an arbitrary task. The robot serves as the language system's "hands and eyes," while the language system supplies high-level semantic understanding of the task; the idea is that the language system encodes a wealth of knowledge the robot can exploit.
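As a rough illustration of that division of labor, the sketch below scores each candidate robot skill by combining a language-model usefulness score with an affordance score reflecting what the robot can actually do from its current state, then executes the highest-scoring skill. The skill names and scoring functions are toy stand-ins, not the project's actual models.

```python
# Sketch of the skill-selection idea behind "Do As I Can, Not As I Say":
# the language model rates how useful each skill is for the instruction,
# an affordance/value function rates how likely it is to succeed right now,
# and the robot picks the skill with the highest combined score.
from typing import Callable, Sequence

def select_skill(
    instruction: str,
    skills: Sequence[str],
    lm_score: Callable[[str, str], float],     # usefulness of skill for instruction
    affordance_score: Callable[[str], float],  # likelihood skill succeeds from current state
) -> str:
    best_skill, best_score = skills[0], float("-inf")
    for skill in skills:
        combined = lm_score(instruction, skill) * affordance_score(skill)
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill

# Illustrative usage with toy scores (not real model outputs):
skills = ["find a sponge", "pick up the sponge", "go to the trash can", "done"]
choice = select_skill(
    "wipe up the spilled drink",
    skills,
    lm_score=lambda instr, s: {"find a sponge": 0.6, "pick up the sponge": 0.3,
                               "go to the trash can": 0.05, "done": 0.05}[s],
    affordance_score=lambda s: {"find a sponge": 0.9, "pick up the sponge": 0.2,
                                "go to the trash can": 0.8, "done": 1.0}[s],
)
print(choice)  # -> "find a sponge"
```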