audio to text issue

Ytbnd · 18 Sep 2024

aljoman gave me the idea in the thread

OKvirji pa kar po starem (www.alter.si)

where he posted a link to a talk show from Val 202.
I can't be bothered to listen to all of it, so I quickly looked up a few mp3-to-text services and posted the result from one of them on alter: https://www.alter.si/tema/okvirji-pa-kar-po-starem.1309964/page-256#post-3590117
I wasn't exactly thrilled with the result, so I opened https://lmarena.ai/ -> direct chat -> claude-3-5-sonnet-20240620 and had it write python3 code that I would then run inside PyCharm Community Edition [PCE]; python3 is already installed on the system.

One of the problems: Slovenian. Google Speech Recognition [GSR] supports it, but it doesn't do diarization, i.e. telling different speakers apart, so it dumps everything into one long string of text. Also, at least in this free version without sign-in, the audio file must not be longer than 8 minutes (8 min 1 s == no go). Past that it reports a very "clear" error (I experimented with the size a bit here to find the upper limit it accepts):

Could not request results from speech recognition service; recognition request failed: Bad Request
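
Since the error itself says nothing about length, it may help to check the duration before sending anything. A minimal sketch using only the standard library; the 8-minute limit is simply what was observed above, not a documented quota, and the names `MAX_SECONDS`, `wav_duration_seconds` and `fits_free_limit` are made up for illustration:

```python
import wave

# Assumption: 8 minutes is the cutoff observed in this thread, not an official limit.
MAX_SECONDS = 8 * 60


def wav_duration_seconds(path):
    """Return the length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_free_limit(path, max_seconds=MAX_SECONDS):
    """True if the file is short enough to be sent in a single request."""
    return wav_duration_seconds(path) <= max_seconds
```

Anything that fails this check would have to be split into parts first.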

First I got the mp3 file from the RTV site (Chrome/Firefox, developer tools, Network, Type column: media; there is only one):

https://progressive.rtvslo.si/ava_archive11/2024/09/13/SandiHoRA_SLO_LJT_3332819_13314201.mp3?exp=......

Size: 34.5 MB.

I named it "sandi.mp3".

claude-3-5-sonnet says it's best if the input file is in WAV format, PCM 16 kHz 16-bit mono, so I converted the mp3 into that format right inside PCE.

from pydub import AudioSegment def convert_audio(input_file, output_file): - Pastebin.com


Code:

from pydub import AudioSegment
def convert_audio(input_file, output_file):
    audio = AudioSegment.from_mp3(input_file)  # Load the MP3 (pydub needs ffmpeg for this)
    audio = audio.set_channels(1)  # Convert to mono
    audio = audio.set_frame_rate(16000)  # Set sample rate to 16kHz
    audio = audio.set_sample_width(2)  # Set sample width to 2 bytes (16-bit)
    audio.export(output_file, format="wav")

# Usage
input_mp3 = "r:\\sandi.mp3"
output_wav = "r:\\sandi.wav"
convert_audio(input_mp3, output_wav)

// lines 10 and 11 set the input and output
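
To confirm the conversion really produced 16 kHz, 16-bit, mono PCM, the WAV header can be read back with the standard library. This is only a sketch; `check_wav_format` is a hypothetical helper, not part of the code above:

```python
import wave


def check_wav_format(path, channels=1, frame_rate=16000, sample_width=2):
    """Return the names of the header fields that differ from the target format.

    An empty list means the file matches (sample_width is in bytes, 2 = 16-bit).
    """
    mismatches = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != channels:
            mismatches.append("channels")
        if wav.getframerate() != frame_rate:
            mismatches.append("frame_rate")
        if wav.getsampwidth() != sample_width:
            mismatches.append("sample_width")
    return mismatches
```

Calling `check_wav_format("r:\\sandi.wav")` after the conversion should return an empty list if everything went as intended.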

The main part
## process wav, split to 8 min parts and produce output.txt ##
takes sandi.wav, splits it in memory into 8-minute segments and runs each one through Google Speech Recognition (GSR) separately, one after another. All the work is done on Google's side; one part takes about 2 minutes.
The code is a bit longer:

## process wav, split to 8 min parts and produce output.txt import speech_recog - Pastebin.com


// lines 47 and 51 set the input and output (where it reads from and where it writes)
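
For readers who don't want to open the paste, the chunked loop could look roughly like the sketch below. It is not the pastebin code: it assumes the third-party SpeechRecognition package (`pip install SpeechRecognition`), assumes `sl-SI` as the language code for Slovenian, and the helper names are invented for illustration.

```python
import datetime
import wave

CHUNK_SECONDS = 8 * 60  # the limit observed in this thread


def chunk_offsets(total_seconds, chunk_seconds=CHUNK_SECONDS):
    """Start offsets (in seconds) of consecutive chunks covering the recording."""
    offsets = []
    start = 0
    while start < total_seconds:
        offsets.append(start)
        start += chunk_seconds
    return offsets


def timestamp(seconds):
    """Format an offset as [H:MM:SS]."""
    return f"[{datetime.timedelta(seconds=int(seconds))}]"


def transcribe_in_chunks(wav_path, out_path, language="sl-SI"):
    """Send each chunk to Google one after another and write one line per chunk."""
    import speech_recognition as sr  # imported here so the helpers work without it

    with wave.open(wav_path, "rb") as wav:
        total = wav.getnframes() / wav.getframerate()

    recognizer = sr.Recognizer()
    with open(out_path, "w", encoding="utf-8") as out:
        for offset in chunk_offsets(total):
            # Re-open the file and read only this chunk into memory.
            with sr.AudioFile(wav_path) as source:
                audio = recognizer.record(source, offset=offset, duration=CHUNK_SECONDS)
            try:
                text = recognizer.recognize_google(audio, language=language)
            except sr.UnknownValueError:
                text = "(speech not understood)"
            except sr.RequestError as err:
                text = f"(request failed: {err})"
            out.write(f"{timestamp(offset)} {text}\n")
```

Catching `RequestError` per chunk means one bad part does not abort the whole run; the failed part is just marked in the output.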

Because part 4 was causing trouble, here is code that splits the WAV into 8-minute segments and saves them to disk.
Lines 34 and 35 set sandi.wav and where to save the parts; on Windows the path needs a double \.

## split a WAV file into 8-minute parts ## from pydub import AudioSegment impo - Pastebin.com


In this case 5 parts appear in the folder "r:\\output": chunk_1.wav to chunk_5.wav
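
The pastebin version uses pydub; the same split can also be sketched with only the standard `wave` module, which is what this variant does. `split_wav` is a hypothetical name and the chunk naming just mirrors the files listed above:

```python
import os
import wave


def split_wav(src_path, out_dir, chunk_seconds=8 * 60):
    """Write consecutive parts as chunk_1.wav, chunk_2.wav, ... and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 1
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            out_path = os.path.join(out_dir, f"chunk_{index}.wav")
            with wave.open(out_path, "wb") as dst:
                # Copy the audio format of the source file.
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

The last part is simply whatever is left over, so it will usually be shorter than 8 minutes.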

Then I asked Claude which "audio to text" services/programs exist that recognize Slovenian. Hint: not many are free. Of those listed, only https://www.speechmatics.com/ had a free trial without any hassle (you enter some email/password and you're in; it doesn't even verify the email).
The Speechmatics free tier https://www.speechmatics.com/pricing offers 4 hours of conversion of pre-recorded audio and 4 hours of live conversion, and it also recognizes different speakers nicely.

The list it gave:

When it comes to MP3 to text conversion (speech recognition) for the Slovenian l - Pastebin.com


Claude is making things up a bit; I checked https://alphacephei.com/vosk/models

8. ALPHACEPHEI:
- Open-source speech recognition toolkit
- Has models for various languages including Slovenian

4. Wit.ai:

I don't know how to use it, and it doesn't actually support Slovenian either

Wit.ai (wit.ai): What languages do you support speech recognition for?

A speech-to-text query to Claude also turns up Mozilla DeepSpeech and Whisper by OpenAI: https://github.com/mozilla/DeepSpeech and https://github.com/openai/whisper . Of the two, Whisper looks easier to use,
and Slovenian is among the supported languages: https://github.com/openai/whisper/blob/main/whisper/tokenizer.py (line 57). I'll test it. EDIT: no I won't, I don't have a dedicated graphics card, only the integrated one on the CPU, which currently uses 0.5 GB.

Speechmatics then successfully converted part 4 to text.

The final result is here:

[0:00:00] sandi horvat lepo pozdravljeni v studio vale 2022 dober dan in lep poz - Pastebin.com


brane_new · 19 Sep 2024

Ytbnd wrote:
aljoman gave me the idea

The right pair, those two

Anyway: WOW! :aplauz:

Yes, I'm jealous!

Ytbnd · 19 Sep 2024

Whisper + Pyannote, a local solution. You will need an account at https://huggingface.co (it's free), where you then generate your token and enter it on

line 21
use_auth_token="YOUR_TOKEN_HERE"

The token must sit between the opening and closing quotation marks, i.e. "token contents".
The token is 37 characters long, upper/lower-case letters. Once you generate it on the Hugging Face site, save it somewhere safe, because afterwards it is no longer shown in full.
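
If you'd rather not paste the token into the script (easy to leak when sharing code on Pastebin), one option is to read it from an environment variable. A small sketch; the variable name `HF_TOKEN` and the helper `load_hf_token` are assumptions chosen for this example:

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment instead of the source file."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your token first")
    return token
```

The result can then be passed wherever the script expects the quoted token string.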

python3 code (Python 3.12.5)
Whisper + Pyannote, audio-to-text transcription with speaker diarization

## Whisper + Pyannote, audio-to-text transcription with speaker diarization ## - Pastebin.com


// lines 14 and 48: input and output
// line 17: the model, small
// line 2: number of speakers
https://huggingface.co/pyannote/speaker-diarization-3.1 explains how to use the GPU instead of the CPU, how to set the number of speakers, etc.
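
Roughly, combining the two tools means: Whisper produces timed text segments, Pyannote produces timed speaker turns, and each segment gets the speaker whose turn overlaps it the most. The sketch below shows that idea only; it is not the pastebin script. It assumes `openai-whisper` and `pyannote.audio` are installed, the keyword `use_auth_token` follows the line quoted above (it may be named differently in other pyannote versions), and the helper names are invented.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """segments: [(start, end, text)], turns: [(start, end, speaker)].

    Returns [(speaker, text)]; speaker is None when no turn overlaps the segment.
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((best_speaker, text.strip()))
    return labelled


def transcribe_with_speakers(wav_path, hf_token, model_name="small", num_speakers=2):
    """Run Whisper and Pyannote locally and merge their outputs."""
    import whisper  # third-party: openai-whisper
    from pyannote.audio import Pipeline  # third-party: pyannote.audio

    model = whisper.load_model(model_name)
    result = model.transcribe(wav_path, language="sl")
    segments = [(s["start"], s["end"], s["text"]) for s in result["segments"]]

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    diarization = pipeline(wav_path, num_speakers=num_speakers)
    turns = [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
    return assign_speakers(segments, turns)
```

With this kind of matching, a segment that falls entirely outside every detected turn ends up with no speaker at all, which is one plausible way for unlabelled lines to show up in the output.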

I, at least, had to click through the links on lines 4 to 7 (I noted them down); otherwise the IDE output at the bottom tells you what it is currently missing.
https://github.com/pyannote/pyannote-audio/issues/1474#issuecomment-1746998271 the last post, for a start

pyannote/speaker-diarization-3.0 · Hugging Face
pyannote/speaker-diarization-3.1 · Hugging Face
pyannote/speaker-diarization · Hugging Face
pyannote/segmentation · Hugging Face

You need to agree to share your contact information to access this model

Some fake company name and website address is enough.

// For the test I took a short WAV file, 1 minute long, because there were a few different things to click through. Once it gets through that without trouble, we can move on to serious work.

Part 4 (an 8-minute excerpt, from minute 24 to minute 32), which Google Speech Recognition couldn't process at all yesterday.
Copy/paste to Pastebin; I didn't touch, change or edit the content in any way.

Speaker None: tradici, to da je v resmice grez na vadno trgovino z ljudmina? - Pastebin.com


The chosen model was "small". I don't have a GPU, so the program ran inside PyCharm Community Edition for about 6 minutes on the CPU (so roughly 28 minutes for the whole recording) at 50% CPU usage (AMD Ryzen 5 7600X). The program uses most of the available cores, and the whole thing was OFFLINE the entire time.
It only has to download the models to disk; tiny, I think, comes included by default.
With something like an Nvidia 4070 or better I reckon it runs a good deal faster.

small.pt is 461 MB, base.pt 139 MB, tiny.pt 72 MB, and medium is a ~1.4 GB file (they are stored in X:\Users\<username>\.cache\whisper\).
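
To see which models are already downloaded and how big they are, the cache folder can simply be listed. A sketch under the assumption that the models sit in `.cache/whisper` under the user's home directory, as described above; `list_cached_models` is a hypothetical helper:

```python
from pathlib import Path


def list_cached_models(cache_dir=None):
    """Return [(file name, size in MiB)] for every .pt file in the Whisper cache."""
    folder = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "whisper"
    if not folder.is_dir():
        return []  # nothing downloaded yet
    return sorted(
        (model.name, round(model.stat().st_size / 2**20))
        for model in folder.glob("*.pt")
    )
```

Sizes are reported in MiB, so they may differ slightly from what a file manager shows.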

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision


Screenshot from the page

Ytbnd · 19 Sep 2024

How fast it can run on a GPU

pyannote/speaker-diarization-3.0 · Hugging Face


Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one hour conversation.
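
The arithmetic behind that quote, and the same calculation for the CPU run mentioned earlier in the thread (about 6 minutes for an 8-minute part), can be written out as below. Keep in mind the two numbers are not directly comparable: the quoted 2.5% covers only Pyannote diarization on a V100, while the 6 minutes included Whisper as well. The function names are made up for this example.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    return processing_minutes / audio_minutes


def estimated_processing_minutes(audio_minutes, rtf):
    """How long a recording of the given length would take at a given real-time factor."""
    return audio_minutes * rtf


gpu_hour = estimated_processing_minutes(60, 0.025)  # one hour of audio at 2.5%
cpu_rtf = real_time_factor(6, 8)  # the CPU run: 6 minutes for an 8-minute part
print(f"GPU, 1 h of audio: {gpu_hour:.1f} min; CPU real-time factor: {cpu_rtf:.2f}")
```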

Pricing & Plans | Deepgram

deepgram.com

I haven't tried this one yet; a free account gets 200 USD of credit, after that you pay by usage.
// no Slovenian

Ytbnd · 19 Sep 2024

Edit

## Whisper + Pyannote, audio-to-text transcription with speaker diarization ## - Pastebin.com


// line 22 is the number of speakers, i.e. people talking, so nobody thinks it means the number of loudspeakers.
