Working With the Video Intelligence API: Transcribing Speech in Python
You can use the Video Intelligence API to transcribe speech in a video.
Copy the following code into your IPython session:
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )
    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()
Take a moment to study the code and see how it uses the annotate_video client library method with the SPEECH_TRANSCRIPTION feature to analyze a video and transcribe its speech.
Call the function to analyze the video from seconds 55 to 80:
video_uri = 'gs://cloudmleap/video/next/JaneGoodall.mp4'
language_code = 'en-GB'

segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)

response = transcribe_speech(video_uri, language_code, [segment])
Note: As the narrator is British, language_code is set to en-GB. See language support for a list of the currently supported language codes.
Wait a moment for the video to be processed:
Processing video "gs://cloudmleap/video/next/JaneGoodall.mp4"...
Add this function to print out transcribed speech:
def print_video_speech(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Speech Transcriptions: {len(transcriptions)} '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        transcript = best_alternative.transcript
        print(f' {confidence:4.0%} | {transcript.strip()}')
Call the function:
print_video_speech(response)
You should see something like this:
--------------------------- Speech Transcriptions: 2 ---------------------------
95% | I was keenly aware of secret movements in the trees.
94% | I looked into his large and lustrous eyes. They seemed somehow to express his entire personality.
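The confidence filter used by print_video_speech can be tried in isolation. The sketch below is a standalone illustration (the helper name and the sample data are made up for this example, not part of the client library): it applies the same min_confidence cutoff to plain (confidence, transcript) pairs and prints them with the same formatting.

```python
def filter_transcriptions(results, min_confidence=0.8):
    """Keep only (confidence, transcript) pairs at or above the cutoff,
    mirroring the keep_transcription logic above."""
    return [(c, t) for c, t in results if min_confidence <= c]


# Made-up sample data standing in for the API response
results = [
    (0.95, 'I was keenly aware of secret movements in the trees.'),
    (0.62, 'a low-confidence alternative'),  # below the cutoff, dropped
    (0.94, 'I looked into his large and lustrous eyes.'),
]

for confidence, transcript in filter_transcriptions(results):
    print(f' {confidence:4.0%} | {transcript}')
```

Only the two high-confidence lines survive; anything below 80% is silently discarded, which is why noisy passages may simply be missing from the printed transcript.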
Add this function to print out the list of detected words and their timestamps:
def print_word_timestamps(response, min_confidence=.8):
    def keep_transcription(transcription):
        return min_confidence <= transcription.alternatives[0].confidence

    # First result only, as a single video is processed
    transcriptions = response.annotation_results[0].speech_transcriptions
    transcriptions = [t for t in transcriptions if keep_transcription(t)]

    print(f' Word Timestamps '.center(80, '-'))
    for transcription in transcriptions:
        best_alternative = transcription.alternatives[0]
        confidence = best_alternative.confidence
        for word in best_alternative.words:
            start_ms = word.start_time.ToMilliseconds()
            end_ms = word.end_time.ToMilliseconds()
            print(f'{confidence:4.0%}',
                  f'{start_ms:>7,}',
                  f'{end_ms:>7,}',
                  f'{word.word}',
                  sep=' | ')
Call the function:
print_word_timestamps(response)
You should see something like this:
------------------------------- Word Timestamps --------------------------------
95% | 55,000 | 55,700 | I
95% | 55,700 | 55,900 | was
95% | 55,900 | 56,300 | keenly
95% | 56,300 | 56,700 | aware
95% | 56,700 | 56,900 | of
...
94% | 76,900 | 77,400 | express
94% | 77,400 | 77,600 | his
94% | 77,600 | 78,200 | entire
94% | 78,200 | 78,800 | personality
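Because each word carries millisecond offsets from the start of the video, output like the table above is enough to drive subtitle generation. As a sketch (a hypothetical helper, not part of the tutorial code), a millisecond offset can be converted into an SRT-style HH:MM:SS,mmm timestamp with integer division:

```python
def ms_to_srt_time(ms):
    """Convert a millisecond offset into an SRT timestamp (HH:MM:SS,mmm)."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f'{hours:02}:{minutes:02}:{seconds:02},{ms:03}'


print(ms_to_srt_time(55_000))  # 00:00:55,000
print(ms_to_srt_time(78_800))  # 00:01:18,800
```

Pairing this with each word's start_ms and end_ms would give the timing lines of an .srt file; grouping words back into caption-sized chunks is left to you.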