Working With the Video Intelligence API: Transcribing Speech in Python


You can use the Video Intelligence API to transcribe speech in a video.

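Copy the following code into your IPython session. The enums and types imports assume version 1.x of the google-cloud-videointelligence library; both submodules were removed in 2.x.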

from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()

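Take a moment to study the code and see how it calls the annotate_video client library method with the SPEECH_TRANSCRIPTION feature to analyze a video and transcribe speech. The call returns a long-running operation, and operation.result() blocks until processing completes.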

Call the function to analyze the video from seconds 55 to 80:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
language_code = 'en-GB'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])

Note: Because the narrator is British, language_code is set to en-GB. See the language support documentation for the currently supported language codes.

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out transcribed speech:

def print_video_speech(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence
    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Speech Transcriptions: {len(transcriptions)} '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f' {confidence:4.0%} | {transcript.strip()}')

Call the function:

print_video_speech(response)

You should see something like this:

--------------------------- Speech Transcriptions: 2 ---------------------------
  95% | I was keenly aware of secret movements in the trees.
  94% | I looked into his large and lustrous eyes. They seemed somehow to express his entire personality.
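
The confidence filter used by print_video_speech can be tried out without calling the API. The following is a minimal sketch, not part of the client library: make_transcription and filter_by_confidence are hypothetical helpers, and the SimpleNamespace objects are stand-ins that mimic only the fields the function reads (alternatives, transcript, confidence). The second sample entry is invented for illustration. As a defensive assumption, the sketch also skips entries that have no alternatives.

```python
from types import SimpleNamespace


def make_transcription(transcript, confidence):
    # Hypothetical stand-in for an API transcription: it exposes only
    # .alternatives[0].transcript and .alternatives[0].confidence.
    alternative = SimpleNamespace(transcript=transcript, confidence=confidence)
    return SimpleNamespace(alternatives=[alternative])


def filter_by_confidence(transcriptions, min_confidence=0.8):
    # Same rule as keep_transcription: compare the first (best) alternative's
    # confidence with the threshold. Entries without alternatives are dropped
    # (an assumption added here, not taken from the tutorial code).
    return [
        t for t in transcriptions
        if t.alternatives and min_confidence <= t.alternatives[0].confidence
    ]


sample = [
    make_transcription("I was keenly aware of secret movements in the trees.", 0.95),
    make_transcription("(illustrative low-confidence fragment)", 0.42),
]

kept = filter_by_confidence(sample)
for t in kept:
    best = t.alternatives[0]
    print(f" {best.confidence:4.0%} | {best.transcript.strip()}")
```

Raising or lowering min_confidence trades completeness for accuracy: a lower threshold keeps more of the transcript but admits more recognition errors.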
  
Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence
    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(' Word Timestamps '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        for word_info in best_alternative.words:
            # Offsets are protobuf Durations; convert them to milliseconds
            start_ms = word_info.start_time.ToMilliseconds()
            end_ms = word_info.end_time.ToMilliseconds()
            word = word_info.word
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{word}',
                  sep=' | ')
  
Call the function:

print_word_timestamps(response)

You should see something like this:

------------------------------- Word Timestamps --------------------------------
 95% |  55,000 |  55,700 | I
 95% |  55,700 |  55,900 | was
 95% |  55,900 |  56,300 | keenly
 95% |  56,300 |  56,700 | aware
 95% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality
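
The millisecond offsets in the output above can be easier to read as clock times, for example when lining words up with a video player. The following is a minimal sketch; format_ms is a hypothetical helper, not part of the client library, and the sample values are copied from the output above.

```python
def format_ms(total_ms):
    """Format a millisecond offset as M:SS.mmm."""
    minutes, rest_ms = divmod(total_ms, 60_000)
    seconds, millis = divmod(rest_ms, 1_000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"


# (start_ms, end_ms, word) tuples taken from the sample output
words = [(55_000, 55_700, "I"), (78_200, 78_800, "personality")]
for start_ms, end_ms, word in words:
    print(f"{format_ms(start_ms)} - {format_ms(end_ms)} | {word}")
```

In print_word_timestamps, the same helper could be applied to start_ms and end_ms before printing.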
 
 
