Recently I have been playing with various GUIs for the Whisper transcription software, and Buzz has definitely won the showdown. It is almost completely keyboard accessible (give or take the toolbar, which needs exploring through NVDA's object navigation or the equivalent in your screen reader of choice); it handles downloading models, FFmpeg conversion and everything that would otherwise have required the command line; it works with Whisper.CPP as far as I can tell; and it can be localized to other languages.
Now I can finally listen to podcasts in all the languages I can't speak. I love it when technology enhances my access to knowledge and helps me do my work even better for those who benefit from it.
https://github.com/chidiwilliams/buzz
#Accessibility #Audio #Languages #OpenSource


in reply to Paweł Masarczyk

Hi Pawel: I tried Buzz under Windows, but I find it not really accessible with NVDA. I can import a file, but I cannot export it and can't find the transcribed text. Could you give me a hint? Best greetings from Marburg, and thanks in advance.
in reply to Jens Bertrams

@Radiojens The GUI currently requires some struggling with NVDA's object navigation and mouse click simulation, but at least I was able to get some results from an audio book. Try simulating a double-click on one of the entries in the results data table; another window should then open with an export button in it.
in reply to Jens Bertrams

@Radiojens Apparently this thing is made with Python and Qt6, so the technical chances are good for pushing its accessibility way forward from the current state.
in reply to Steffen

@radiorobbe @Radiojens Yes, I wanted to suggest NVDA's object navigation as well. I usually navigate to the toolbar, which is one object above and to the left of the table with the loaded file, and find the "Open Transcript" button there. I also hope that either the software will receive the needed improvements or that somebody writes an NVDA add-on around it. Apart from the toolbar, the edit box with the transcript is the other inaccessible part, but I just export the result to a txt file and work with a regular text editor from there.
in reply to Paweł Masarczyk

@radiorobbe I tried to navigate to the toolbar and to "Open Transcript File", then I clicked with NVDA+numpad-enter, but nothing happened. I also simulated a left mouse click on a completed transcription. Both had no result; no window opened. So how do you export exactly, Pawel? I found the file "tasks" in AppData/Local/Buzz/Buzz/cache and it seems to be a raw file of the transcript, but it is almost unreadable, with lots of unreadable characters; I don't know its real format.
in reply to Paweł Masarczyk

@radiorobbe Ah, and do you use Whisper or something else? The log file says "error loading Whisper.dll", although it is there in the program folder. And which model do you use with imported files? I tried large for best results.
in reply to Jens Bertrams

@Radiojens @radiorobbe I use the regular Whisper (I think it's actually the Whisper.CPP implementation) with the large model. Here are the steps:
1. I import the file using ctrl+o
2. I set up the options for the transcription job as I like them: the mechanism is Whisper, the model is large, the language is set to automatic detection, and all the rest is left at defaults;
3. I click Run and wait. I will eventually be moved to the table where the progress of the task is reported.
4. I wait for it to finish, i.e. for it to say "Completed" in the second column.
5. I navigate to the toolbar. I use the laptop layout of NVDA so I'll try to explain it using that keymap:
A. I call the navigator focus to my system focus by pressing NVDA+Backspace;
B. I navigate out of the table object - NVDA+Shift+Up arrow;
C. I navigate then two objects to the left - NVDA+shift+left arrow twice, so that I find the toolbar;
D. I expand that object with NVDA+shift+down;
E. I navigate to the right using NVDA+Shift+right arrow until I find the "Open Transcript" button;
F. I call the focus to my navigator object - that's NVDA+Shift+M;
G. I activate the button by pressing NVDA+Enter;
6. A new window opens where the text of the transcript is presented in the inaccessible edit field that you can't handle with a keyboard. The "Export" button can be reached by pressing Tab. You can pick the format you need from the context menu that pops up and save the file anywhere you choose.

I hope this helped. If not, and if you think it's a good idea, we could get in touch somewhere else and coordinate a remote session so that I could try to see what the problem might be on your end.
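For anyone who would rather skip the GUI export entirely: Whisper's Python API returns a result with a list of segments (dicts with "start", "end" and "text" keys), which you can save yourself. Here is a minimal sketch that turns such segments into an SRT subtitle file; the sample segments below are made up for illustration, not real transcription output:

```python
# Sketch: convert Whisper-style segments into SRT subtitles.
# The segment shape (start/end in seconds, plus text) matches what
# the openai-whisper Python API returns; the data here is invented.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3.5 -> 00:00:03,500."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {'start', 'end', 'text'} dicts as SRT text."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Invented sample data standing in for a real transcription result.
segments = [
    {"start": 0.0, "end": 3.5, "text": " Hello and welcome."},
    {"start": 3.5, "end": 7.2, "text": " Today we talk about accessibility."},
]
print(segments_to_srt(segments))
```

The same segments can just as easily be joined into a plain txt file, which is what the "Export" button does in the end.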

in reply to Paweł Masarczyk

@radiorobbe Hey you two, I have one more question: does anyone know what "word_level_timing" and "initial prompt" mean? There seems to be no readme for Buzz. And: is it possible for the program to recognize when a new person is speaking? That would be great for a podcast with three people talking, so I could offer a text version for people with hearing difficulties.
in reply to Jens Bertrams

@Radiojens @radiorobbe Hello! Word-level timing lets you generate a timestamp for each word, so that you get per-word subtitles, which apparently looks cool in some social media contexts. Initial prompt, if I'm correct, lets you give Whisper some context about the recording so that it can better adapt the recognition. As far as I know, Whisper by itself can't do diarization, i.e. identification of individual speakers. I'm afraid the more mundane consequence of this is that everything is output as one huge block of text, regardless of the number of voices.
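To make the word-level timing idea concrete, here is a small sketch. It assumes word entries of the form {"word", "start", "end"}, which is roughly the shape that tools exposing word timestamps report; the data below is invented for illustration:

```python
# Sketch: per-word subtitle cues from word-level timestamps.
# The word dicts are invented sample data; word-timestamp-capable
# Whisper frontends expose something of a similar shape.

def words_to_cues(words):
    """One subtitle cue per word: (start_seconds, end_seconds, word)."""
    return [(w["start"], w["end"], w["word"].strip()) for w in words]

# Invented sample data standing in for real word-level output.
words = [
    {"word": " So", "start": 0.0, "end": 0.4},
    {"word": " word", "start": 0.4, "end": 0.9},
    {"word": " timing", "start": 0.9, "end": 1.5},
]
for start, end, word in words_to_cues(words):
    print(f"{start:5.2f}-{end:5.2f}  {word}")
```

Without diarization, those cues still carry no speaker labels, so a three-person podcast would come out as one undifferentiated stream of words.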