Working With the Video Intelligence API: Detect Labels in a Video

You can use the Video Intelligence API to detect labels in a video. Labels describe the video based on its visual content.

Copy the following code into your IPython session:

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def detect_labels(video_uri, mode, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.LABEL_DETECTION]
    config = types.LabelDetectionConfig(label_detection_mode=mode)
    context = types.VideoContext(
        segments=segments,
        label_detection_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

Take a moment to study the code and see how it uses the annotate_video client library method with the LABEL_DETECTION feature to analyze a video and detect labels.
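
The annotate_video call returns a long-running operation, and result() blocks until processing completes. As a purely illustrative variation (assuming the same video_client, features, and context defined inside detect_labels), you could bound the wait with an explicit timeout:

# Illustrative variation: wait at most 2 minutes for the operation to finish,
# raising an exception if the video hasn't been processed by then.
operation = video_client.annotate_video(
    input_uri=video_uri,
    features=features,
    video_context=context,
)
result = operation.result(timeout=120)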

Call the function to analyze the first 37 seconds of the video:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
mode = enums.LabelDetectionMode.SHOT_MODE
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(0)
segment.end_time_offset.FromSeconds(37)

response = detect_labels(video_uri, mode, [segment])
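
If you would rather analyze the whole video instead of just the first 37 seconds, you can omit the segment list, since the segments parameter of detect_labels defaults to None (processing the full video takes longer):

# Variation: no segment restriction, so the entire video is analyzed.
full_response = detect_labels(video_uri, mode)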

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out the labels at the video level:

def print_video_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].segment_label_annotations
    sort_by_first_segment_confidence(labels)

    print(f' Video labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{label.entity.description}{categories}',
                  sep=' | ')


def sort_by_first_segment_confidence(labels):
    labels.sort(key=lambda label: label.segments[0].confidence, reverse=True)


def category_entities_to_str(category_entities):
    if not category_entities:
        return ''
    entities = ', '.join([e.description for e in category_entities])
    return f' ({entities})'

Call the function:

print_video_labels(response)

You should see something like this:

------------------------------- Video labels: 10 -------------------------------
 96% |       0 |  36,960 | nature
 74% |       0 |  36,960 | vegetation
 59% |       0 |  36,960 | tree (plant)
 56% |       0 |  36,960 | forest (geographical feature)
 49% |       0 |  36,960 | leaf (plant)
 43% |       0 |  36,960 | flora (plant)
 38% |       0 |  36,960 | nature reserve (geographical feature)
 38% |       0 |  36,960 | woodland (forest)
 35% |       0 |  36,960 | water resources (water)
 32% |       0 |  36,960 | sunlight (light)
 
Thanks to these video-level labels, you can understand that the beginning of the video is mostly about nature and vegetation.
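
If you only want the most salient topics, one possible helper (hypothetical, not part of this codelab) is to filter the video-level labels by a confidence threshold:

def top_video_labels(response, min_confidence=0.5):
    # Hypothetical helper: keep only video-level labels whose first segment
    # confidence is at least min_confidence, sorted by decreasing confidence.
    labels = response.annotation_results[0].segment_label_annotations
    confident = [l for l in labels if l.segments[0].confidence >= min_confidence]
    return sorted(confident, key=lambda l: l.segments[0].confidence, reverse=True)


for label in top_video_labels(response):
    print(f'{label.segments[0].confidence:4.0%} | {label.entity.description}')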

Add this function to print out the labels at the shot level:


def print_shot_labels(response):
    # First result only, as a single video is processed
    labels = response.annotation_results[0].shot_label_annotations
    sort_by_first_segment_start_and_reversed_confidence(labels)

    print(f' Shot labels: {len(labels)} '.center(80, '-'))
    for label in labels:
        categories = category_entities_to_str(label.category_entities)
        print(f'{label.entity.description}{categories}')
        for segment in label.segments:
            confidence = segment.confidence
            start_ms = segment.segment.start_time_offset.ToMilliseconds()
            end_ms = segment.segment.end_time_offset.ToMilliseconds()
            print(f'  {confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  sep=' | ')


def sort_by_first_segment_start_and_reversed_confidence(labels):
    def first_segment_start_and_reversed_confidence(label):
        first_segment = label.segments[0]
        return (+first_segment.segment.start_time_offset.ToMilliseconds(),
                -first_segment.confidence)
    labels.sort(key=first_segment_start_and_reversed_confidence)

Call the function:

print_shot_labels(response)

You should see something like this:

------------------------------- Shot labels: 29 --------------------------------
planet (astronomical object)
   83% |       0 |  12,880
earth (planet)
   53% |       0 |  12,880
water resources (water)
   43% |       0 |  12,880
aerial photography (photography)
   43% |       0 |  12,880
vegetation
   32% |       0 |  12,880
   92% |  12,920 |  21,680
   83% |  21,720 |  27,880
   77% |  27,920 |  31,800
   76% |  31,840 |  34,720
...
butterfly (insect, animal)
   84% |  34,760 |  36,960
...

Thanks to these shot-level labels, you can understand that the video starts with a shot of a planet (likely Earth), that there's a butterfly in the shot spanning 34,760 to 36,960 ms, and so on.
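
Building on the same structure, a hypothetical helper could list every shot in which a given entity was detected, for example to find the butterfly shot programmatically:

def find_entity_shots(response, description):
    # Hypothetical helper: return (start_ms, end_ms, confidence) tuples for
    # every shot whose label matches the given entity description.
    labels = response.annotation_results[0].shot_label_annotations
    shots = []
    for label in labels:
        if label.entity.description == description:
            for segment in label.segments:
                shots.append((
                    segment.segment.start_time_offset.ToMilliseconds(),
                    segment.segment.end_time_offset.ToMilliseconds(),
                    segment.confidence,
                ))
    return shots


# For this video, the output should include something like (34760, 36960, 0.84).
print(find_entity_shots(response, 'butterfly'))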

Note: You can also request label detection at the frame level with FRAME_MODE (or SHOT_AND_FRAME_MODE for both shot-level and frame-level labels).
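
For instance, the detect_labels function defined earlier can be reused for frame-level analysis; the frame results then appear under frame_label_annotations (a sketch, assuming the same video_uri and segment as above):

# Sketch: request frame-level labels for the same segment.
frame_mode = enums.LabelDetectionMode.FRAME_MODE
frame_response = detect_labels(video_uri, frame_mode, [segment])
frame_labels = frame_response.annotation_results[0].frame_label_annotations
print(f'{len(frame_labels)} frame-level labels detected')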


