PaLM is a really interesting decoder-style language model that I initially overlooked when it was published last year. There are 7 interesting architecture improvements over GPT models.

1) Multi-query attention: different from multi-head attention, the key/value projections are shared across all heads (same training time, but faster autoregressive decoding at inference).

2) Parallelized transformer blocks: replaces the serial formulation with a parallel one to improve training time by 15% without sacrificing modeling performance (at large parameter sizes).

3) SwiGLU activation: combines the Swish and GLU activations, Swish(xW) ⊗ xV (see the sketch after this list).
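To make the three ideas above concrete, here is a minimal, self-contained PyTorch sketch. This is not PaLM's actual implementation: the class and parameter names (`MultiQueryAttention`, `ParallelBlock`, `SwiGLU`, `d_ff`, etc.) are my own, and details such as the shared pre-norm and the omission of biases are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    # 3) SwiGLU feed-forward: Swish(xW) * (xV), followed by a projection back to d_model.
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W = nn.Linear(d_model, d_ff, bias=False)
        self.V = nn.Linear(d_model, d_ff, bias=False)
        self.out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # F.silu is Swish with beta = 1
        return self.out(F.silu(self.W(x)) * self.V(x))


class MultiQueryAttention(nn.Module):
    # 1) Multi-query attention: one query projection per head, but a single
    # shared key/value projection. Training cost stays the same, while the
    # KV cache at decoding time shrinks by a factor of n_heads.
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)           # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.d_head, bias=False)  # shared K and V
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv_proj(x).chunk(2, dim=-1)          # (b, t, d) each, one shared head
        k = k.unsqueeze(1)                               # (b, 1, t, d): broadcast over heads
        v = v.unsqueeze(1)
        scores = (q @ k.transpose(-2, -1)) * self.d_head ** -0.5        # (b, h, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        y = scores.softmax(dim=-1) @ v                   # (b, h, t, d)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))


class ParallelBlock(nn.Module):
    # 2) Parallel formulation: y = x + MLP(LN(x)) + Attention(LN(x))
    # instead of the serial y = x + MLP(LN(x + Attention(LN(x)))).
    # A single pre-norm feeds both branches here for simplicity.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = MultiQueryAttention(d_model, n_heads)
        self.mlp = SwiGLU(d_model, d_ff)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.mlp(h)
```

The practical payoff of 1) shows up at inference: with a single shared key/value head, the KV cache is n_heads times smaller, which is what speeds up autoregressive decoding without changing training time.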