Working With Video Intelligence API Transcribing Speech In Python

You can use the Video Intelligence API to transcribe speech in a video.

Copy the following code into your IPython session:

from google.cloud import videointelligence
# enums and types belong to the 1.x client library; 2.x releases removed them
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    # Block until the long-running operation completes
    return operation.result()

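Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION feature to analyze a video and transcribe its speech. The call returns a long-running operation, and operation.result() blocks until processing has finished.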

Call the function to analyze the video from seconds 55 to 80:

video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
language_code = 'en-GB'
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])

Note: As the narrator is British, language_code is set to en-GB. See language support for a list of the currently supported language codes.
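
Segment bounds are given in seconds, while video players usually display positions as M:SS. Here is a minimal sketch for converting such a position into the whole number of seconds that FromSeconds expects. parse_timecode is a hypothetical helper introduced for illustration; it is not part of the client library.

```python
def parse_timecode(timecode):
    """Convert 'SS', 'M:SS', or 'H:MM:SS' into a whole number of seconds."""
    parts = [int(part) for part in timecode.split(":")]
    if not 1 <= len(parts) <= 3 or any(part < 0 for part in parts):
        raise ValueError(f"invalid timecode: {timecode!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right
        seconds = seconds * 60 + part
    return seconds


start_seconds = parse_timecode("0:55")
end_seconds = parse_timecode("1:20")
print(start_seconds, end_seconds)  # 55 80
```

The resulting values can be passed to segment.start_time_offset.FromSeconds() and segment.end_time_offset.FromSeconds() exactly as in the call above.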

Wait a moment for the video to be processed:

Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...

Add this function to print out transcribed speech:

def print_video_speech(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Speech Transcriptions: {len(transcriptions)} '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f' {confidence:4.0%} | {transcript.strip()}')

Call the function:

print_video_speech(response)

You should see something like this:

--------------------------- Speech Transcriptions: 2 ---------------------------
  95% | I was keenly aware of secret movements in the trees.
  94% | I looked into his large and lustrous eyes. They seemed somehow to express his entire personality.
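
The effect of min_confidence can be explored without calling the API. The following is a minimal sketch that uses SimpleNamespace stand-ins shaped like the response fields read above (alternatives, confidence, transcript). The sample values and the helper names filter_transcriptions and fake_transcription are made up for illustration and are not part of the client library.

```python
from types import SimpleNamespace


def filter_transcriptions(transcriptions, min_confidence=0.8):
    """Keep transcriptions whose best alternative meets the threshold."""
    return [
        t for t in transcriptions
        if t.alternatives and min_confidence <= t.alternatives[0].confidence
    ]


def fake_transcription(confidence, transcript):
    # Stand-in object mimicking only the fields read by print_video_speech
    alternative = SimpleNamespace(confidence=confidence, transcript=transcript)
    return SimpleNamespace(alternatives=[alternative])


# Illustrative values, not real API output
samples = [
    fake_transcription(0.95, "first sentence"),
    fake_transcription(0.42, "mumbled words"),
    fake_transcription(0.80, "borderline sentence"),
]

kept = filter_transcriptions(samples)
for t in kept:
    best = t.alternatives[0]
    print(f"{best.confidence:4.0%} | {best.transcript}")
```

With the default threshold, the 42% entry is dropped and the other two are kept. Unlike the function above, this sketch also skips entries with an empty alternatives list, which is a defensive assumption rather than documented API behavior.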

Add this function to print out the list of detected words and their timestamps:

def print_word_timestamps(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(' Word Timestamps '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        # word_info avoids shadowing the loop variable with the word text
        for word_info in best_alternative.words:
            start_ms = word_info.start_time.ToMilliseconds()
            end_ms = word_info.end_time.ToMilliseconds()
            word = word_info.word
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{word}',
                  sep=' | ')

Call the function:

print_word_timestamps(response)

You should see something like this:

------------------------------- Word Timestamps --------------------------------
 95% |  55,000 |  55,700 | I
 95% |  55,700 |  55,900 | was
 95% |  55,900 |  56,300 | keenly
 95% |  56,300 |  56,700 | aware
 95% |  56,700 |  56,900 | of
...
 94% |  76,900 |  77,400 | express
 94% |  77,400 |  77,600 | his
 94% |  77,600 |  78,200 | entire
 94% |  78,200 |  78,800 | personality
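
Millisecond offsets are easy to post-process. As an illustration, here is a small sketch that turns start and end values like the ones printed above into MM:SS.mmm strings. format_ms is a hypothetical helper, not part of the client library.

```python
def format_ms(ms):
    """Format a non-negative millisecond offset as MM:SS.mmm."""
    if ms < 0:
        raise ValueError("offset must be non-negative")
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


# Offsets taken from the first and last rows of the sample output
rows = [(55_000, 55_700, "I"), (78_200, 78_800, "personality")]
for start_ms, end_ms, word in rows:
    print(f"{format_ms(start_ms)} - {format_ms(end_ms)} | {word}")
```

For these two rows the sketch prints 00:55.000 - 00:55.700 and 01:18.200 - 01:18.800, which match the positions in the video more directly than raw milliseconds.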
