Update: Thanks to @pitermach for a great demo showing that it's actually Mist World upsampling to 48 kHz in this demo, not NVDA downsampling to 16 kHz!
I stitched together an audio file showing you how badly it ignores an output setting of -1. Instead, #NVDASR tries to be too smart: it enumerates the device list, works out which device you have set as your sound mapper output, and explicitly calls that sound device when passing audio to the TTS outputs.
I updated this to add a little more at the end, showing how Mist World handles audio output switching. I thought that was the proper way, but I now know it is not.
Good night, Mastodon. This really ruined my weekend at first, until that amazing demo in my mentions by @pitermach clarified things. :)
Update: People are asking, "how can I tell?" Listen for the sharpness of S's and other consonants. If you have the ear you'll notice.
in reply to Tamas G

What you're hearing isn't actually downsampling to 16 kHz, it's aliasing artifacts introduced by whatever resampling algorithm Mist World's audio library is using. Vocalizer actually runs at a native 22 kHz as far as I know. I recorded a quick demo of what it sounds like when you bring a 22 kHz file to 48 kHz with a low quality resampling algorithm versus a file that's actually at 16 kHz.
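
To put a number on that, here is a minimal sketch in plain Python. It assumes the crudest possible resampler, point sampling, which may or may not be what Mist World's audio library actually does, and the 8 kHz test tone is just a stand-in for a bright consonant. It upsamples the tone from 22050 Hz to 48000 Hz and measures how much energy lands at 14050 Hz, a frequency the 22050 Hz source cannot contain.

```python
import math

SRC_RATE = 22050   # native rate mentioned above for the voice
DST_RATE = 48000   # rate the audio ends up at

def tone(freq_hz, rate_hz, n_samples):
    """Sine wave of the given frequency sampled at rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n_samples)]

def upsample_point(samples, src_rate, dst_rate):
    """Point sampling: each output sample copies the most recent input sample."""
    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    return [samples[min(i * src_rate // dst_rate, last)] for i in range(n_out)]

def amplitude_at(samples, freq_hz, rate_hz):
    """Amplitude of a single frequency component (one DFT bin)."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / rate_hz) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / rate_hz) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

source = tone(8000, SRC_RATE, 2205)                     # 0.1 s of an 8 kHz tone
upsampled = upsample_point(source, SRC_RATE, DST_RATE)  # 4800 samples at 48 kHz

wanted = amplitude_at(upsampled, 8000, DST_RATE)
image = amplitude_at(upsampled, SRC_RATE - 8000, DST_RATE)  # mirror copy at 14050 Hz
print(f"8000 Hz: {wanted:.2f}   14050 Hz: {image:.2f}")
```

The component at 14050 Hz is a mirror image of the original tone reflected around the source's 11025 Hz limit. A better resampler filters those images out, which is why the extra "sharpness" goes away.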

in reply to JamminJerry

@JamminJerry There isn't much of a difference between the old default 64-point sinc interpolation and r8brain, but there's a more pronounced difference with point sampling and linear interpolation, where there's a bit more high end. Something else I just remembered is that back in the day Klango also used to do this, so any voice you used with it would get resampled up like this with noticeable aliasing.
in reply to Pitermach

Super informative, wow! I wonder if having the default output in the Windows sound panel set to 48K causes the upsampling to 48 instead of keeping it at 44100, how odd. It could be that with the sound mapper Windows always forces its own sampling rate rather than sticking with the one set as playback in the program. But I'm not 100% sure. (Nope, not this, I did a test run by force-changing a Bluetooth A2DP driver to 44.) It's definitely odd that when you choose an audio device directly it's correctly sampling in the game though, and maybe Klango did the same thing.
in reply to Tamas G

Unless you're using ASIO and/or exclusive mode, some degree of resampling is unavoidable. Windows opens the device with a particular configuration of sample rate, number of channels, etc. and audio from applications is adapted as needed to match so you get a mix. If three applications are simultaneously sending audio output to the same device, the device is only opened once and receives the sum total of those sources.
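
As a toy illustration of that "sum total" idea (this is not how the Windows audio engine is actually implemented, just a sketch), assume all three applications have already been converted to the same sample rate and to float samples between -1.0 and 1.0. The names `mix`, `app_a`, `app_b` and `app_c` are made up for the example.

```python
def mix(streams):
    """Sum equally long streams sample by sample, clamping to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, sum(frame))) for frame in zip(*streams)]

# Three hypothetical applications playing at the same time.
app_a = [0.25, 0.5, -0.25]
app_b = [0.25, 0.25, -0.5]
app_c = [0.0, 0.5, -0.5]

# The device receives one stream: the clamped sum of all sources.
device_input = mix([app_a, app_b, app_c])
print(device_input)
```

The summing only works if every source is at the same rate, which is why anything that doesn't match the device's configuration gets resampled first.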
in reply to Tamas G

It's also odd that there's a difference because with WASAPI, which even legacy WinMM now uses behind the scenes, there isn't really a separate sound mapper device. You ask the system for the default endpoint, it tells you which that is (the real endpoint, not some sound mapper thing) and then you open that endpoint directly. Thus, there should be no difference in how the device is actually opened. I guess it's possible the app makes different decisions about resampling based on whether the user wants to use the default device or not, but why would it do that?
in reply to Zvonimir Stanecic

Yeah, I do think that the new R8Brain algo mentioned there does a much better job at still making the voice have that higher crisp quality but not so much the sharpness on the actual consonants, which to me feels like the best of both worlds. You get a lot higher quality to the ear but also don't get those weird artifacts the older ways of upsampling to 44 or 48 introduce.
I don't think people who don't like it are wrong. For some minds especially, sharper noises like that in the audio can really stand out and become annoying or a headache.
in reply to Zvonimir Stanecic

@asael I wonder if we'll ever get a true TTS that's not 22050 but a true 44.1K sampling rate. Now I think I'm on the hunt for that. My guess is the newer AI voices might be the first of their kind this way. It's interesting because a 22050 Hz sample rate in actuality only captures frequencies up to 11025 Hz. The Nyquist frequency (or Nyquist limit) refers to the highest frequency that can be accurately represented by a given sample rate. It is half of the sample rate. The reason for this is how digital sampling works: you need at least two samples per cycle of the waveform to capture it. This is known as the Nyquist-Shannon sampling theorem. So really, any TTS claiming to be 22050 Hz only reproduces audio up to 11025 Hz, and any TTS claiming to be 11025 only reaches about 5.5 kHz, youch.
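
A quick sketch of that in plain Python (the function name `nyquist_hz` and the 15 kHz test tone are just picked for illustration): the limit is half the sample rate, and a tone above it produces exactly the same samples as a lower "folded" tone, so the two cannot be told apart once sampled.

```python
import math

def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

RATE = 22050
print(nyquist_hz(RATE), nyquist_hz(11025))

# A 15000 Hz tone is above the 11025 Hz limit for a 22050 Hz sample rate.
# Its samples match a 7050 Hz tone (22050 - 15000) with the sign flipped.
above = [math.sin(2 * math.pi * 15000 * n / RATE) for n in range(200)]
folded = [-math.sin(2 * math.pi * (RATE - 15000) * n / RATE) for n in range(200)]
same = all(abs(a - b) < 1e-9 for a, b in zip(above, folded))
print("15000 Hz and 7050 Hz give the same samples:", same)
```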
in reply to Zvonimir Stanecic

I've thought of making a voice of myself in RHVoice but I'm unsure about the intonation. I often talk with a higher inflection and want that to be reflected, and I haven't quite figured out the ridiculously complicated pipeline. Plus I've only got WSL to train it so it will be sloooooow! @valiant8086 @BorrisInABox @pitermach @Tamasg Oh, has that replaced the one they'll ship for the NVDA add-on? I only see one version of SLT on rhvoice.org
in reply to Patrick Perdue

@BorrisInABox @asael OMG. This was a thing? What year did that all get created? Feel like I've missed like, a major milestone in speech history. LOL. It may be that actual 44K data is too large in size, so that's why even the human voices by companies such as Nuance stuck with 22050, as it's tolerable enough to not sound like an AM radio but not as high quality as an FM signal could be.
in reply to Zvonimir Stanecic

@asael @BorrisInABox Believe it or not I still have it here, yay for a Windows install in its 9th year now. Fun fact, that John voice was created from the voice of Jon St. John, aka the guy who voiced Duke Nukem, sadly not doing the Duke voice in this case. But that single reason is why we used it for a very long time, either for reading chats or for games while streaming with @talon. Here's a quick demo. Not a super responsive voice, but yeah, there's a lot of highs.