I'm just spitballing here, but maybe the phone speaker is designed primarily for the frequency range of the human voice, and various harmonics in the instrumental either fall outside that range or push into a range it doesn't reproduce well.
Phone speakers are generally geared towards voice over music because the primary use of the speaker (at least as the manufacturers see it) is speakerphone, Skype, etc. So they're not nearly as good at "blended" sounds. Additionally, they're very small, which is fine for vocal patterns but less so for music.
A good place to start fixing this, though, is to find your phone's EQ (on iPhone it's Settings -> Music -> EQ) and choose the "Small Speakers" preset. It will lower your maximum volume slightly, but it does clarify the sound a good amount.
Vocals sit in a narrower band than something like a synth instrument, which could be overdriving frequencies in ranges the phone speakers have trouble with, like the lower bass frequencies.
My synth drum machine always cranks out ABSURD levels of sub-bass on its kick if I'm not careful. Could be there's something too low or too high to hear, but it's clipping anyway, maybe?
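If you want to check that hunch on an exported mix, here's a rough Python sketch (NumPy only). It measures how much of the signal's energy sits below a cutoff and whether any samples hit full scale. Everything specific in it is my own assumption, not something from this thread: the 150 Hz cutoff as a stand-in for "where a small phone speaker gives up", the 44.1 kHz sample rate, the toy 40 Hz "kick" plus 440 Hz "lead", and the function names. The brick-wall FFT high-pass is just the simplest thing that demonstrates the idea; a real mix would use a proper filter.

```python
import numpy as np

SAMPLE_RATE = 44_100   # Hz; assumed for this sketch
CUTOFF_HZ = 150.0      # assumed rough low-end limit of a small speaker

def low_band_energy_fraction(signal, sample_rate=SAMPLE_RATE, cutoff_hz=CUTOFF_HZ):
    """Return the fraction of spectral energy below cutoff_hz."""
    energy = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = energy.sum()
    if total == 0:
        return 0.0
    return float(energy[freqs < cutoff_hz].sum() / total)

def clipped_fraction(signal, ceiling=1.0):
    """Return the fraction of samples at or above full scale."""
    return float(np.mean(np.abs(signal) >= ceiling))

def brickwall_highpass(signal, sample_rate=SAMPLE_RATE, cutoff_hz=CUTOFF_HZ):
    """Crude high-pass: zero every FFT bin below cutoff_hz (illustrative only)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0
    return np.fft.irfft(spectrum, n=len(signal))

# Toy one-second mix: a loud 40 Hz "kick" under a quieter 440 Hz "lead".
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
sub_kick = 0.9 * np.sin(2 * np.pi * 40 * t)
lead = 0.3 * np.sin(2 * np.pi * 440 * t)
mix = sub_kick + lead

# What a converter would do with the overshoot: hard-clip at full scale.
clipped_mix = np.clip(mix, -1.0, 1.0)

# The same mix with the sub-bass removed.
filtered = brickwall_highpass(mix)

print("energy below cutoff:", low_band_energy_fraction(mix))
print("peak before filtering:", np.max(np.abs(mix)))
print("clipped sample fraction:", clipped_fraction(clipped_mix))
print("peak after filtering:", np.max(np.abs(filtered)))
```

In this toy case most of the energy is in a band the speaker presumably can't reproduce, and that band is also what pushes the peaks past full scale. Rolling it off frees up headroom for the part you can actually hear on a phone.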
And if it's being streamed, you’re likely also subject to lossy compression, whose artifacts are meant to be hidden by psychoacoustic masking,
so consider optimizing for those realities as well.