Predicting a single label (or a distribution over labels as shown here to indicate our confidence) for a given image.
Detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos.
Partitioning an image into semantically meaningful parts and classifying each part into one of a set of predetermined classes.
Also known as Human Pose Estimation: detecting human figures in images and video to determine, for example, where someone’s elbow appears in an image.
Detecting the faces of participants using object detection and checking whether each face is present.
Detecting facial landmarks such as the eyes, nose, and mouth, which can be used for a web-based try-on simulator in an online store.
Transferring the makeup style of a sample makeup image to a facial image to check how the selected makeup looks.
Generating higher-resolution images or video frames to prevent degradation of the perceived image or video quality.
Providing automatic image captioning, which predicts explanatory words for presentation slides to improve accessibility.
Translating all text into a different language.
Mapping human facial features to different emotion classes using face detection.
Generating a short version of a recorded video to reduce the amount of video data that must be stored.
Recognizing a set of voice commands from an audio file or the device's microphone input.
Enabling the recognition and translation of spoken language into text.
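
To illustrate the "distribution over labels" mentioned for image classification, here is a minimal sketch of how raw per-label scores could be turned into confidences with a softmax. The labels and scores are hypothetical and not taken from any particular model; a real classifier would produce the scores from an image.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and raw scores (assumed for illustration only).
labels = ["cat", "dog", "car"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)

# The single predicted label is the one with the highest probability.
best = max(range(len(labels)), key=lambda i: probs[i])

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
print("predicted:", labels[best])
```

Reporting the full distribution rather than only the top label lets a caller see how confident the prediction is.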