I’m honestly a little surprised bc my impression was that speed improvements from within-chain parallelization were nonmonotonic in the number of cores due to the fixed costs of each additional core, and 64 is way beyond the
point where I thought performance declined. But maybe that’s outdated?
Fixed cost but with very big data gradient computation per core increases as well. This model still takes a while but it goes from forever to a few days.
Comments
point where I thought performance declined. But maybe that’s outdated?