Scene graphs allow humans and machines to categorize and query images based on complex relationships among objects in a scene.
If you google “Biden + White House” and click the Images tab, you’ll see lots of photos of the President speaking or hosting guests around the building. But what if you wanted to see only images of Biden in the presence of a top cabinet official while speaking in the Rose Garden? That’s far more difficult to achieve.
Or what if supply chain analysts wanted to rapidly identify images only of certain ships in a major port? Or if an automated military system needed to accurately analyze and sort satellite images that suggested covert troop movements?
That’s the realm of scene graphs, an emerging form of artificial intelligence for understanding complex relationships among objects in an image or video “scene.”
Military and government organizations are finding promising real-world applications for this computer-vision technology. As a result, the approach could transform the way people and software interact with images and benefit from image analysis.
From mind’s eye to eye’s mind
A scene graph is a structured representation of a “scene” in a photo or video. It describes the objects in the scene, the attributes of those objects, and relationships and actions among those objects. It goes beyond simply identifying items, in order to provide a higher level of understanding and even reasoning about the scene.
Object detection alone reveals nothing about how objects interact. For example, a dock worker sitting next to a shipping container, a man opening a shipping container, and a man operating machinery to move the container all simply look like “man + shipping container” to object detection. Understanding the relationships among objects, and the actions they perform, can provide crucial insights.
A scene graph might recognize objects, attributes, and relationships such as:
- Person (man)
- Place (shipping port)
- Thing (shipping container)
- Parts of objects (arms)
- Shape (ship)
- Color (tan clothing)
- Pose (lifting)
- Action (opening a container)
- Position (next to ship)
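The objects, attributes, and relationships above can be sketched as a simple data structure: nodes for objects (with their attributes) and edges for relationships between them. This is a minimal illustration using only the Python standard library; the class and field names are our own, not any standard scene-graph API.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """A node in the scene graph: one detected object and its attributes."""
    label: str                                       # e.g. "man", "ship"
    attributes: dict = field(default_factory=dict)   # e.g. {"clothing": "tan"}

@dataclass
class Relationship:
    """An edge in the scene graph: subject --predicate--> object."""
    subject: SceneObject
    predicate: str        # e.g. "opening", "next to"
    obj: SceneObject

# Build the shipping-port scene from the example above.
man = SceneObject("man", {"clothing": "tan", "pose": "lifting"})
container = SceneObject("shipping container")
ship = SceneObject("ship")

scene = [
    Relationship(man, "opening", container),
    Relationship(container, "next to", ship),
]

for rel in scene:
    print(rel.subject.label, rel.predicate, rel.obj.label)
```

Reading the edges back out yields statements like “man opening shipping container,” which is exactly the kind of relational fact that plain object detection cannot provide.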
By identifying these objects and attributes, a scene graph can achieve several types of understanding, including:
- Visual relationship detection—the relationship among objects in an image.
- Human-object interaction—how people are interacting with objects in an image.
- Visual question answering—answering questions about the contents of a scene or the actions of an object.
- Image editing and retrieval—removal of an object from an image and discovery of similar images.
- Image captioning—a detailed and accurate text description of image content.
For many tasks, image retrieval for example, scene graphs are used to create a knowledge base, similar to how virtual assistants like Siri and Alexa function. If you asked a virtual assistant when Biden was inaugurated, the technology would answer by accessing a knowledge base that associates an entity—Biden—with a year—2021. Scene graphs can answer similar types of questions for images and video.
But scene graphs don’t simply label images with tags such as “Biden” and “podium.” By generating a knowledge base, they can answer queries such as, “Where is the president?” with, “The president is at the podium.”
Scene-graph algorithms must initially be trained on human-annotated data. Once people have annotated a set of images, the model can begin producing what it estimates to be an accurate scene graph. Data scientists then verify and fine-tune the model. The algorithm can then use a dataset to build a knowledge base that users or software can query, either in natural language or through structured queries.
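One common way to think about such a knowledge base is as a set of (subject, predicate, object) triples, with structured queries expressed as patterns over them. The sketch below is illustrative only, assuming triples extracted from a single annotated image; the `query` helper and the facts it searches are our own invention, not a specific product’s interface.

```python
# Triples a scene-graph model might extract from one image of a press event.
knowledge_base = [
    ("president", "standing at", "podium"),
    ("podium", "in", "Rose Garden"),
    ("official", "next to", "president"),
]

def query(kb, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in kb
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# "Where is the president?" -> match triples whose subject is the president.
print(query(knowledge_base, subject="president"))
# -> [('president', 'standing at', 'podium')]
```

In a real system, a natural-language front end would translate the question into this kind of structured pattern before matching it against the knowledge base.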
Seeing the way to real-world applications
Several use cases for scene graphs are already being explored or deployed for military or other government needs. One example is situational awareness. Let’s say an agency uses video cameras to monitor a facility. The agency could write a query to ignore most images of people but send alerts if a person is carrying a box or touches a door.
Or let’s say the military uses satellite imagery to monitor a warzone. It could query the system to send alerts based on how many aircraft are parked in an airfield or when aircraft move from a hangar to a runway. Or it could ask the system to ignore most cars driving on a road but flag a caravan of military vehicles. That way, humans don’t waste time monitoring activities that aren’t a concern but don’t miss situations that could indicate a threat.
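The alerting pattern described above can be sketched as a rule filter over per-frame scene-graph output: most activity is ignored, and only triples matching an alert rule are surfaced. The rule format and example detections below are hypothetical, chosen to mirror the facility-monitoring scenario.

```python
# Alert rules as (predicate, object) patterns; any subject matches.
ALERT_RULES = {("carrying", "box"), ("touching", "door")}

def alerts(frame_triples):
    """Return the triples in one video frame that match an alert rule."""
    return [(s, p, o) for (s, p, o) in frame_triples if (p, o) in ALERT_RULES]

frame = [
    ("person", "walking on", "sidewalk"),   # routine activity: ignored
    ("person", "carrying", "box"),          # matches a rule: raise alert
]
print(alerts(frame))
# -> [('person', 'carrying', 'box')]
```

The same structure extends to the satellite scenario by swapping in rules such as (“moving from”, “hangar”) for aircraft or a count threshold on parked vehicles.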
Scene graphs can also be used to create and scale simulations for real-world training. For example, the military uses simulations for training exercises that typically accommodate only a small number of users. Scaling those scenarios while keeping them highly realistic has been a challenge. Scene graphs can facilitate large numbers of users interacting with one another in virtual environments.
Another promising application is cross-modal search. Cross-modal search is already being used with text and audio, where text can retrieve audio, and audio can retrieve text. A similar approach can be applied to images. With images, unstructured text isn’t always good for describing complex relationships among objects. Scene graphs combined with natural language processing can transform unstructured text into structured queries, making image retrieval more effective.
Going forward, scene graphs will enable organizations to leverage visual data for better analytics and deeper insights. By identifying, categorizing, and contextualizing objects in images, the technology will make the visual data organizations already capture that much more valuable.
Sean McPherson, Ph.D., is research scientist and manager of AI and ML for Intel. He focuses on deep learning models and algorithms, as well as applied research with federal government organizations. He holds a doctoral degree in electrical engineering from the University of Southern California.