This is not going to be very popular, but using data to train on a corpus is not stealing. It wasn't stealing when people copied music either. IP maximalism is not helpful; it wasn't then, and it isn't now. Using data for these purposes is explicitly allowed under EU law (the text and data mining exceptions in the 2019 copyright directive), and it probably falls under fair use in the US.
It's also nothing new. All sorts of vital accessibility tools (voice recognition, voice synthesis), as well as everyday features such as spell checking, rely on corpora.