Default configuration for SIP used to be 20ms, the rationale behind it was actually sourced in the fact that most SIP was done on LANs and inter-campus WAN which had generally high bitrate connectivity and low latency. The lower the packet time window the sooner the recipient could "hear" your voice, and if there were to be packet loss, there would be less of an impact if that packet were dropped - you'd only lose 20ms of audio vs 100ms. The same applies for high bitrate but high latency (3g for example) connectivity - you want to take advantage of the bandwidth to mitigate some of the network level latency that would impact the audio delay - being "wasteful" to ensure lower latency and higher packet loss tolerance.
Pointedly - if you had a 75ms of one-way latency (150ms RTT) between two parties, and you used a 150ms audio segment length (ptime) you'd be getting close to the 250ms generally accepted max audio delay for smooth two-way communication. the recipient is hearing your first millisecond of audio 226ms later at best. If any packet does get lost, the recipient would lose 150ms of your message vs 20ms.
Modern voice apps and voip use dynamic ptime (usually via "maxptime" which specifies the highest/worst case) in their protocol for this reason - it allows the clients to optimize for all combinations of high/low bandwidth, high/low latency and high/low packet loss in realtime - as network conditions can often change during the course of a call especially while driving around or roaming between wifi and cellular.
> the rationale behind it was actually sourced in the fact that most SIP was done on LANs and inter-campus WAN which had generally high bitrate connectivity and low latency
In addition to that, early VoIP applications mostly used uncompressed G.711 audio, both for interoperability with circuit switched networks and because efficient voice compression codecs weren't yet available royalty-free.
G.711 is 64 kbps, so 12 kbps of overhead are less than 25% – not much point in cutting that down to, say, 10% at the expense of doubling effective latency in a LAN use case.
Pointedly - if you had a 75ms of one-way latency (150ms RTT) between two parties, and you used a 150ms audio segment length (ptime) you'd be getting close to the 250ms generally accepted max audio delay for smooth two-way communication. the recipient is hearing your first millisecond of audio 226ms later at best. If any packet does get lost, the recipient would lose 150ms of your message vs 20ms.
Modern voice apps and voip use dynamic ptime (usually via "maxptime" which specifies the highest/worst case) in their protocol for this reason - it allows the clients to optimize for all combinations of high/low bandwidth, high/low latency and high/low packet loss in realtime - as network conditions can often change during the course of a call especially while driving around or roaming between wifi and cellular.