Feature/voice analysis #30
Open
Nathanlauga wants to merge 25 commits into main from feature/voice_analysis
Changes from 24 commits
Commits
5ec016a Adapt artpech's imports for ina speech segmenter to python and poetry… (Lokhia)
b0d00fd Gendered audio segmentation based on inaSpeechSegmenter and some expe… (Lokhia)
db5f243 Speaking time according to gender implemented (Lokhia)
eac60f9 Speech to text implementation (Lokhia)
1429e72 Extract to csv now available (Lokhia)
f9ffc8e Doing full movie pipeline (Lokhia)
bfb61a1 Add Whisper automatic speak recognition (DnzzL)
fe3f487 Extract from notebook (DnzzL)
29a6e08 [FIX] Output length + types (DnzzL)
ff0f003 Small renaming and adding docstrings (Lokhia)
63d235d Poetry update and some audio tests (Lokhia)
35043f4 Refactoring audio processing package - transcribers gestion (Lokhia)
3b95203 Refactoring audio processing - gender segmenter gestion (Lokhia)
e6f0b5f Refactoring audio processing - dialogue tagger gestion (Lokhia)
9da1078 Refactoring audio processing - Audio Processor, main and poetry depen… (Lokhia)
7dee943 Merge remote-tracking branch 'origin/main' into feature/voice_analysis (Lokhia)
4ca2cd4 Archive previous audio notebook work (Lokhia)
907cdfe Update an properly merge pyproject from main (Lokhia)
493e531 Minor changes in speech to text to make it work with default API key :) (Lokhia)
70d32bb Updating poetry lock and toml (Lokhia)
1cfab8f Transform all audio code to functionable library with tutorial (Lokhia)
c620f9a Include Us English profile and tutorial (Lokhia)
af4dfd0 Added whisper API in the pipeline (TheoLvs)
73e5269 Updated demo (TheoLvs)
2bda1d5 Remove deprecated code (DnzzL)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| # Inspired by https://huggingface.co/spaces/vumichien/whisper-speaker-diarization/blob/main/app.py | ||
|
|
||
| import gradio as gr | ||
| import re | ||
| import time | ||
| import os | ||
|
|
||
| import pandas as pd | ||
| import numpy as np | ||
| import sys | ||
| sys.path.append("../../") | ||
|
|
||
| from pytube import YouTube | ||
|
|
||
| # Custom code | ||
| from bechdelai.data.youtube import download_youtube_video | ||
| from bechdelai.audio.utils import extract_audio_from_video | ||
| from bechdelai.audio.gender_segmenter import InaSpeechSegmentor | ||
| from bechdelai.audio.transcriber import WhisperAPI | ||
| from bechdelai.nlp.gpt import GPT3 | ||
|
|
||
| # Constants | ||
| # whisper_models = ["tiny.en","base.en","tiny","base", "small", "medium", "large"] | ||
| # device = 0 if torch.cuda.is_available() else "cpu" | ||
| # os.makedirs('output', exist_ok=True) | ||
|
|
||
| def get_youtube(video_url): | ||
| yt = YouTube(video_url) | ||
| abs_video_path = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first().download() | ||
| print("Video downloaded successfully") | ||
| print(abs_video_path) | ||
| return abs_video_path | ||
|
|
||
|
|
||
| def speech_to_text(video_filepath, selected_source_lang = "en", whisper_model = "tiny.en"): | ||
| """ | ||
| # Transcribe a video with OpenAI Whisper and tag each segment with the speaker's gender | ||
| 1. Use Whisper (through WhisperAPI) to separate the audio into segments and generate transcripts. | ||
| 2. Predict the speaker's gender for each segment with InaSpeechSegmentor. | ||
| 3. Merge the gender labels into the Whisper segments and number consecutive speech segments as dialogues. | ||
|
|
||
| Speech recognition is based on models from OpenAI Whisper: https://github.com/openai/whisper | ||
| Gender segmentation is based on inaSpeechSegmenter: https://github.com/ina-foss/inaSpeechSegmenter | ||
| """ | ||
|
|
||
| whisper_api = WhisperAPI() | ||
| gender = InaSpeechSegmentor(only_gender=True,batch_size = 32) | ||
|
|
||
| # Convert video to audio | ||
| audio_filepath = extract_audio_from_video(video_filepath,"mp3") | ||
| result,segments,text = whisper_api.speech_to_text(audio_filepath) | ||
| segments_ina = gender._convert_whisper_output(segments) | ||
| segments_ina = gender.predict_gender_on_segments(audio_filepath,segments_ina) | ||
| segments_with_gender = segments.merge(segments_ina[["gender","start"]],on = "start",how = "left") | ||
|
|
||
| dialogue_id = segments_with_gender["speech"].astype(int).diff(1).fillna(0) | ||
| dialogue_id.loc[dialogue_id < 0] = 0 | ||
| segments_with_gender["dialogue_id"] = dialogue_id.cumsum() | ||
| segments_with_gender.loc[segments_with_gender["speech"] == False,"dialogue_id"] = np.nan | ||
|
|
||
| return [segments_with_gender,text] | ||
|
|
||
| source_language_list = ["en","fr"] | ||
|
|
||
| # ---- Gradio Layout ----- | ||
| # Inspiration from https://huggingface.co/spaces/RASMUS/Whisper-youtube-crosslingual-subtitles | ||
| video_in = gr.Video(label="Video file", mirror_webcam=False) | ||
| youtube_url_in = gr.Textbox(label="Youtube url", lines=1, interactive=True) | ||
| # selected_source_lang = gr.Dropdown(choices=source_language_list, type="value", value="en", label="Spoken language in video", interactive=True) | ||
| # selected_whisper_model = gr.Dropdown(choices=whisper_models, type="value", value="tiny.en", label="Selected Whisper model", interactive=True) | ||
| df_init = pd.DataFrame(columns=['start', 'end', 'text', 'speech', 'gender', 'duration', 'dialogue_id']) | ||
| transcription_df = gr.DataFrame(value = df_init,label="Répartition du temps de parole", row_count=(0, "dynamic"), max_rows = 25, wrap=True, overflow_row_behaviour='paginate') | ||
| output_text = gr.Textbox(label = "Transcribed text",lines = 10) | ||
|
|
||
| title = "BechdelAI - demo" | ||
| demo = gr.Blocks(title=title,live = True) | ||
| demo.encrypt = False | ||
|
|
||
|
|
||
| with demo: | ||
| with gr.Tab("BechdelAI - dialogue demo"): | ||
| gr.Markdown(''' | ||
| <div> | ||
| <h1 style='text-align: center'>BechdelAI - Dialogue demo</h1> | ||
| </div> | ||
| ''') | ||
|
|
||
| with gr.Row(): | ||
| gr.Markdown('''# 🎥 Download Youtube video''') | ||
|
|
||
|
|
||
| with gr.Row(): | ||
|
|
||
| with gr.Column(): | ||
| # gr.Markdown('''### You can test by following examples:''') | ||
| examples = gr.Examples(examples= | ||
| [ | ||
| "https://www.youtube.com/watch?v=FDFdroN7d0w", | ||
| "https://www.youtube.com/watch?v=b2f2Kqt_KcE", | ||
| "https://www.youtube.com/watch?v=ba5F8G778C0", | ||
| ], | ||
| label="Examples", inputs=[youtube_url_in]) | ||
| youtube_url_in.render() | ||
| download_youtube_btn = gr.Button("Download Youtube video") | ||
| download_youtube_btn.click(get_youtube, [youtube_url_in], [ | ||
| video_in]) | ||
| print(video_in) | ||
|
|
||
| with gr.Column(): | ||
| video_in.render() | ||
|
|
||
| with gr.Row(): | ||
| gr.Markdown('''# 🎙 Extract text from video''') | ||
|
|
||
| with gr.Row(): | ||
| with gr.Column(): | ||
| transcribe_btn = gr.Button("Transcribe audio and diarization") | ||
| # transcribe_btn.click(speech_to_text, [video_in, selected_source_lang, selected_whisper_model], [transcription_df,output_text]) | ||
| transcribe_btn.click(speech_to_text, [video_in], [transcription_df,output_text]) | ||
| with gr.Column(): | ||
| output_text.render() | ||
| with gr.Row(): | ||
| transcription_df.render() | ||
|
|
||
| demo.launch(debug=True) |
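
The `dialogue_id` computation in `speech_to_text` above (diff, clip negatives to zero, cumulative sum, then blank out non-speech rows) amounts to starting a new dialogue at every transition from non-speech to speech. The sketch below restates that rule in plain Python so it can be checked by hand. The function name `number_dialogues` is hypothetical and is not part of the PR; `None` stands in for the NaN written on non-speech rows.

```python
def number_dialogues(speech_flags):
    """Return one dialogue id per segment, or None for non-speech segments."""
    ids = []
    current = 0
    previous = None  # the first segment has no predecessor, like diff().fillna(0)
    for is_speech in speech_flags:
        # A new dialogue starts on every non-speech -> speech transition.
        if previous is False and is_speech:
            current += 1
        ids.append(current if is_speech else None)
        previous = is_speech
    return ids


flags = [True, True, False, True, False, False, True]
print(number_dialogues(flags))
```

Consecutive speech segments share an id, and each silence or music gap closes the current dialogue.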
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| import pandas as pd | ||
| import speech_recognition as sr | ||
| from inaSpeechSegmenter import Segmenter | ||
|
|
||
|
|
||
| class GenderAudioProcessor: | ||
| """Computes gendered speech segments, their transcripts and the speaking time per gender for a media file.""" | ||
| def __init__(self, path_to_file, path_to_audio): | ||
| self.title = path_to_file.replace('\\', '/').rsplit('/', 1)[-1].split('.')[0] # works with both Windows and POSIX separators | ||
| self.media = path_to_file | ||
| self.audio = path_to_audio | ||
| self.gendered_audio_seg = self.segment() # Dataframe | ||
| self.dialogues = self.run_speech_to_text() | ||
| self.speaking_time = self.compute_speaking_time_allocation() | ||
|
|
||
| def __str__(self): | ||
| return "Film : {}".format(self.title) | ||
|
|
||
| def __repr__(self): | ||
| return self.title | ||
|
|
||
| def segment(self): | ||
| """Extract time intervals from self.media, according to the speaker's gender. | ||
|
|
||
| Returns: | ||
| pd.DataFrame: Pandas' DataFrame with 3 columns (gender, start, end) and as many lines as needed. | ||
| """ | ||
| seg = Segmenter(vad_engine='sm', energy_ratio=0.05) | ||
| # The higher the energy ratio, the more selective it is ; vad_engine works better with sm than smn | ||
| segment = seg(self.media) | ||
| return pd.DataFrame(list(filter(lambda x: x[0] == 'male' or x[0] == 'female', segment)), | ||
| columns=['gender', 'start', 'end']) | ||
|
|
||
| def search_gender_tag(self, time: int): | ||
| """Retrieves the gender associated with the time given in parameter (in seconds) for a film. | ||
|
|
||
| Requires access to the dataframe generated by the segmentor. | ||
|
|
||
| Parameters: | ||
| time (int): The time of interest, given in seconds. | ||
|
|
||
| Returns: | ||
| gender (str OR None): The gender of the speaker corresponding to the given time. None if out of range. | ||
| """ | ||
| gender = None | ||
| if time > self.gendered_audio_seg['end'].tail(1).item(): | ||
| return None | ||
| for i in self.gendered_audio_seg.index: | ||
| if time > self.gendered_audio_seg['start'][i]: | ||
| if time < self.gendered_audio_seg['end'][i]: | ||
| gender = self.gendered_audio_seg['gender'][i] | ||
| if time > self.gendered_audio_seg['end'][i]: | ||
| pass | ||
| return gender | ||
|
|
||
| def compute_speaking_time_allocation(self): | ||
| speaking_time = {'male': 0, 'female': 0} | ||
| dif = pd.Series(self.gendered_audio_seg['end'] - self.gendered_audio_seg['start'], name='time_frame') | ||
| totaldf = pd.concat([self.gendered_audio_seg['gender'], dif], axis=1) | ||
| for i in totaldf.index: | ||
| if totaldf['gender'][i] == 'male': | ||
| speaking_time['male'] += float(totaldf['time_frame'][i]) | ||
| if totaldf['gender'][i] == 'female': | ||
| speaking_time['female'] += float(totaldf['time_frame'][i]) | ||
| return speaking_time | ||
|
|
||
| def decode_speech(self, start_time=None, end_time=None, language="en-US"): | ||
| r = sr.Recognizer() | ||
| # r.pause_threshold = 3 | ||
| # r.dynamic_energy_adjustment_damping = 0.5 | ||
| # language can be "fr-FR" | ||
|
|
||
| with sr.AudioFile(self.audio) as source: | ||
| if start_time is None and end_time is None: | ||
| audio_text = r.record(source) | ||
| else: | ||
| audio_text = r.record(source, duration=end_time - start_time, offset=start_time) | ||
|
|
||
| # recognize_() method will throw a request error if the API is unreachable, hence using exception handling | ||
| try: | ||
| # using google speech recognition | ||
| text = r.recognize_google(audio_text, language=language) | ||
| print('Converting audio transcripts into text ...') | ||
| return text | ||
|
|
||
| except (sr.UnknownValueError, sr.RequestError) as err: | ||
| print('Speech recognition failed: {}'.format(err)) | ||
| return None | ||
|
|
||
| def run_speech_to_text(self): | ||
| transcript = [] | ||
| for i in self.gendered_audio_seg.index: | ||
| transcript.append(self.decode_speech(start_time=self.gendered_audio_seg['start'][i], | ||
| end_time=self.gendered_audio_seg['end'][i], | ||
| language='fr-FR')) | ||
| transcription = pd.concat([self.gendered_audio_seg['gender'], pd.Series(transcript, name="transcription")], | ||
| axis=1) | ||
| return transcription | ||
|
|
||
| def export_to_csv(self, file_path: str): | ||
| result = pd.concat([self.gendered_audio_seg, self.dialogues['transcription']], axis=1) | ||
| result.to_csv(path_or_buf=file_path, sep=";", header=True, index=False) | ||
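
As a minimal sketch of what `compute_speaking_time_allocation` and `search_gender_tag` do above, the following works on plain `(gender, start, end)` tuples instead of the DataFrame returned by `segment()`. The function names `speaking_time` and `gender_at` are hypothetical, and the segment values are made up for illustration.

```python
def speaking_time(segments):
    """Sum the duration of each segment per gender, in seconds."""
    totals = {"male": 0.0, "female": 0.0}
    for gender, start, end in segments:
        if gender in totals:
            totals[gender] += end - start
    return totals


def gender_at(segments, time):
    """Return the gender speaking at `time` (seconds), or None if nobody is."""
    for gender, start, end in segments:
        # Same strict bounds as search_gender_tag in the diff.
        if start < time < end:
            return gender
    return None


# Illustrative data only: (gender, start, end) in seconds.
segments = [("male", 0.0, 2.5), ("female", 2.5, 6.0), ("male", 7.0, 8.0)]
print(speaking_time(segments))
print(gender_at(segments, 3.0))
```

Returning as soon as a segment matches avoids scanning the rest of the list, which is equivalent to the loop in the diff as long as segments do not overlap.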
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| from transformers import pipeline | ||
|
|
||
|
|
||
|
|
||
| class SpeechRecognition: | ||
| """Speech recognition model for audio files.""" | ||
|
|
||
| def __init__(self, model_name="openai/whisper-small"): | ||
| """Initialize speech recognition model. | ||
|
|
||
| Args: | ||
| model_name (str): Whisper model name. Defaults to "openai/whisper-small". | ||
| """ | ||
| self.pipe = pipeline( | ||
| task="automatic-speech-recognition", | ||
| model=model_name, | ||
| chunk_length_s=30, | ||
| stride_length_s=(5, 5), | ||
| return_timestamps=True, | ||
| generate_kwargs={"max_length": 1000}, | ||
| ) | ||
|
|
||
| def transcribe(self, audio_path, language, task="transcribe"): | ||
| """Transcribe audio file. | ||
|
|
||
| Args: | ||
| audio_path (str): Path to audio file | ||
| language (str): target language | ||
| task (str): transcribe for same language or translate to another language | ||
|
|
||
| Returns: | ||
| Dict: Pipeline output containing the transcribed text and its timestamped chunks | ||
| """ | ||
| self.pipe.model.config.forced_decoder_ids = ( | ||
| self.pipe.tokenizer.get_decoder_prompt_ids(language=language, task=task) | ||
| ) | ||
| return self.pipe(audio_path) | ||
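
The pipeline above is configured with `chunk_length_s=30` and `stride_length_s=(5, 5)`. As a rough illustration of what those numbers imply, the sketch below lists the overlapping windows such a setting would produce, assuming consecutive chunks advance by the chunk length minus both strides. The function name `chunk_windows` is hypothetical, and this is not the transformers implementation, which works on audio samples and treats the edges in more detail.

```python
def chunk_windows(duration_s, chunk_s=30.0, stride_s=(5.0, 5.0)):
    """List (start, end) windows in seconds covering an audio of `duration_s`."""
    # Assumption: each chunk starts `step` seconds after the previous one.
    step = chunk_s - stride_s[0] - stride_s[1]
    if step <= 0:
        raise ValueError("chunk length must exceed the sum of both strides")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last chunk reached the end of the audio
        start += step
    return windows


for window in chunk_windows(70.0):
    print(window)
```

With these defaults, neighbouring windows overlap by 10 seconds, which is what lets the model discard the uncertain edges of each chunk.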
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| import pandas as pd | ||
|
|
||
|
|
||
| class AudioProcessor: | ||
| """Computes complete pipeline from audio to text format dialogues | ||
| gendered_audio_seg represents gender identification + voice activity detection | ||
| feminine_dialogues aims to keep only dialogues where women are speaking | ||
| result transforms selected audios into text | ||
| """ | ||
| def __init__(self, config, audio_file=""): | ||
| self.audio = audio_file | ||
| self.config = config | ||
| self.gendered_audio_seg = self.gender_segmentor() | ||
| self.feminine_dialogues = self.dialogue_tagger() | ||
| self.result = self.run_speech_to_text() | ||
|
|
||
| def gender_segmentor(self): | ||
| return self.config.get_profile().get_gender_segmentor().segment(self.audio) | ||
|
|
||
| def dialogue_tagger(self): | ||
| return self.config.get_profile().get_dialogue_tagger().extract_dialogues_subsets(self.gendered_audio_seg) | ||
|
|
||
| def run_speech_to_text(self): | ||
| transcript = [] | ||
| for i in self.feminine_dialogues.index: | ||
| duration = self.feminine_dialogues['end'][i] - self.feminine_dialogues['start'][i] | ||
| transcript.append(self.config.get_profile().get_transcriber().speech_to_text(self.audio, | ||
| self.feminine_dialogues['start'][i], | ||
| duration)) | ||
| # Index the transcripts like the tagged dialogues so that concat aligns each text with its own segment | ||
| transcription = pd.concat([self.gendered_audio_seg['gender'], | ||
| pd.Series(transcript, index=self.feminine_dialogues.index, name="transcription")], | ||
| axis=1) | ||
| return transcription | ||
|
|
||
| def full_dataframe(self): | ||
| return pd.concat([self.gendered_audio_seg, self.result['transcription']], axis=1) | ||
|
|
||
| def export_to_csv(self, file_path: str): | ||
| result = self.full_dataframe() | ||
| result.to_csv(path_or_buf=file_path, sep=";", header=True, index=False, encoding="utf-8") |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| import os | ||
|
|
||
| def separate_voice_and_music(path_to_mixed_audio: str) -> None: | ||
| """Splits an audio file into its individual parts using spleeter | ||
|
|
||
| Does not work above 700 seconds or about 11 minutes. | ||
|
|
||
| Stores the results in separate folders, upstream of the project root. | ||
|
|
||
| Parameters: | ||
| path_to_mixed_audio (str): Path to an audio file (.wav) | ||
|
|
||
| Returns: | ||
| None | ||
| """ | ||
| os.system('spleeter separate -d 700.0 -o ../../../ -f "{instrument}/{filename}.{codec}" ' + path_to_mixed_audio) |
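
Because the command above is built by string concatenation, a path containing spaces or shell metacharacters would be split or interpreted by the shell. One possible alternative, sketched here under the assumption that the same flags are kept as in the diff, is to build an argument list and hand it to `subprocess.run`. The helper name `build_spleeter_command` and its keyword parameters are hypothetical.

```python
import shlex


def build_spleeter_command(path_to_mixed_audio, max_duration_s=700.0, output_dir="../../../"):
    """Build the spleeter call from the diff as an argument list (nothing is executed here)."""
    return [
        "spleeter", "separate",
        "-d", str(max_duration_s),
        "-o", output_dir,
        "-f", "{instrument}/{filename}.{codec}",
        path_to_mixed_audio,  # passed as a single argument, even if it contains spaces
    ]


command = build_spleeter_command("my movie.wav")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list means no shell is involved, and `check=True` would raise if spleeter exits with an error instead of failing silently.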
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| from . import profiles | ||
|
|
||
|
|
||
| class Config: | ||
| def __init__(self): | ||
| self.selected_profile: profiles.Profiles | ||
|
|
||
| def select_profile(self, option): | ||
| if option == "FR": | ||
| self.selected_profile = profiles.French() | ||
| elif option == "US": | ||
| self.selected_profile = profiles.USEnglish() | ||
| else: | ||
| raise ValueError("Unknown profile option: {!r}".format(option)) | ||
|
|
||
| def get_profile(self): | ||
| return self.selected_profile |
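
The `Config` class above selects a profile from an option string. A possible variation, shown as a sketch only, keeps the options in a dictionary and fails loudly when no profile has been selected yet. The `Profile` dataclass and the `PROFILES` mapping are hypothetical stand-ins for the real classes in `profiles`, which are not shown in this diff; only a language code is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for profiles.French / profiles.USEnglish."""
    language: str


# Hypothetical registry mapping the option strings used in the diff to profiles.
PROFILES = {"FR": Profile("fr-FR"), "US": Profile("en-US")}


class Config:
    def __init__(self):
        self.selected_profile = None

    def select_profile(self, option):
        if option not in PROFILES:
            raise ValueError(f"Unknown profile option: {option!r}")
        self.selected_profile = PROFILES[option]

    def get_profile(self):
        if self.selected_profile is None:
            raise RuntimeError("select_profile() must be called before get_profile()")
        return self.selected_profile


config = Config()
config.select_profile("FR")
print(config.get_profile().language)
```

Adding a language then only requires a new entry in the mapping rather than another `elif` branch.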
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| from abc import ABC, abstractmethod | ||
|
|
||
|
|
||
|
|
||
| class DialogueTagger(ABC): | ||
| """Abstract class common to every dialogue tagger. | ||
| Extracts from a dataframe of gendered time slots the subsets that correspond to dialogues. | ||
| """ | ||
| @abstractmethod | ||
| def extract_dialogues_subsets(self, segments_dataframe): | ||
| pass | ||
|
|
||
|
|
||
| class RuleBasedTagger(DialogueTagger): | ||
| def __init__(self): | ||
| pass | ||
|
|
||
| def extract_dialogues_subsets(self, segments_dataframe): | ||
| return segments_dataframe | ||
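
`RuleBasedTagger` above currently returns its input unchanged. To illustrate how a concrete tagger could plug into the abstract interface, here is a minimal sketch of a subclass that keeps only segments labelled as female speech. The class `FemaleSpeechTagger` is hypothetical and not part of the PR, and segments are plain dicts standing in for DataFrame rows to keep the example dependency-free.

```python
from abc import ABC, abstractmethod


class DialogueTagger(ABC):
    """Same interface as in the diff."""

    @abstractmethod
    def extract_dialogues_subsets(self, segments):
        pass


class FemaleSpeechTagger(DialogueTagger):
    """Hypothetical tagger: keep only the segments whose gender is 'female'."""

    def extract_dialogues_subsets(self, segments):
        return [segment for segment in segments if segment["gender"] == "female"]


# Illustrative data only.
segments = [
    {"gender": "male", "start": 0.0, "end": 2.5},
    {"gender": "female", "start": 2.5, "end": 6.0},
]
print(FemaleSpeechTagger().extract_dialogues_subsets(segments))
```

Because `extract_dialogues_subsets` is abstract, instantiating `DialogueTagger` directly raises a `TypeError`, so every tagger has to provide its own selection rule.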