LLaMA: Concepts Explained (Summary)

Anshu Kumar
4 min read · Mar 2, 2023


Pre-normalization, SwiGLU, Rotary Embeddings

In this article we will look at why LLaMA is able to achieve comparable performance with a much smaller model size.

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than 10× smaller. This shows that it is possible to train state-of-the-art models using only publicly available datasets.

The best performances are not achieved by the largest models, but by smaller models trained on more data. https://arxiv.org/abs/2203.15556

The pre-training data is a mixture of many open datasets from diverse domains, which helps LLaMA achieve strong few-shot capabilities.

https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

T5's success can also be credited to the C4 (Colossal Clean Crawled Corpus) dataset. Its cleaning process includes de-duplication, discarding incomplete sentences, and removing offensive or noisy content.

Now let's jump into the most important concepts that LLaMA includes.

1. Pre-normalization Using RMSNorm

RMSNorm: Root Mean Square Layer Normalization [1]

LLaMA normalizes the input of each transformer sub-layer, instead of normalizing the output.

The inspiration for pre-normalization is taken from GPT-3.

RMSNorm is an extension of Layer Normalization (LayerNorm). The reason for using RMSNorm is the computational overhead of LayerNorm, which makes improvements slow and expensive. RMSNorm achieves comparable performance to LayerNorm while reducing running time by 7%∼64%.

Let's first understand LayerNorm. It has two properties:

a. Re-centering: it makes the model insensitive to shift noise on both inputs and weights.

b. Re-scaling: it keeps the output representations intact when both inputs and weights are randomly scaled.

The RMSNorm paper claims that most of the benefit comes from re-scaling.

RMSNorm keeps only re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic:

ā_i = (a_i / RMS(a)) · g_i,   where RMS(a) = sqrt((1/n) · Σ_i a_i²)

Here a_i is the activation of the i-th neuron and g ∈ Rⁿ is the gain parameter used to re-scale the standardized summed inputs.

Intuitively, RMSNorm simplifies LayerNorm by removing the mean (re-centering) statistic entirely.
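To make this concrete, here is a minimal NumPy sketch of the idea (the epsilon term for numerical stability is my addition; see the link below for the reference implementation):

import numpy as np

def rms_norm(a, g, eps=1e-8):
    # a: activations of shape (..., n); g: learnable gain of shape (n,)
    rms = np.sqrt(np.mean(a ** 2, axis=-1, keepdims=True) + eps)
    return (a / rms) * g

x = np.random.randn(2, 8)        # a batch of activations
gain = np.ones(8)                # the gain is typically initialized to ones
print(rms_norm(x, gain).shape)   # (2, 8)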

Feel free to take a look at the implementation of RMSNorm: https://github.com/bzhangGo/rmsnorm/blob/master/rmsnorm_torch.py

2. SwiGLU

To understand the SwiGLU activation function, we first need to understand the Swish activation function: swish(x) = x · sigmoid(x).

The inspiration for using SwiGLU in LLaMA is taken from PaLM.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x):
    # Swish: the input multiplied by its sigmoid
    return x * sigmoid(x)
https://arxiv.org/pdf/2002.05202v1.pdf

SwiGLU combines Swish with a Gated Linear Unit (GLU): the input is split into two halves, the gate half is passed through Swish, and the result multiplies the other half. Python implementation of SwiGLU [2]:

import tensorflow as tf

class SwiGLU(tf.keras.layers.Layer):
    # Expects the `dim` axis to already contain the value and gate halves
    # concatenated, e.g. the output of a single Dense layer of width 2*d.
    def __init__(self, dim=-1, **kwargs):
        super(SwiGLU, self).__init__(**kwargs)
        self.dim = dim

    def call(self, x):
        out, gate = tf.split(x, num_or_size_splits=2, axis=self.dim)
        gate = tf.keras.activations.swish(gate)   # Swish-gated linear unit
        return tf.multiply(out, gate)
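For example (a hypothetical usage sketch; the Dense width and input shapes are illustrative assumptions), a SwiGLU feed-forward block projects to twice the hidden width and then gates it back down:

# Hypothetical feed-forward sub-block: project to 2*d_ff, then gate to d_ff.
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(2 * 128),  # produces the value and gate halves
    SwiGLU(dim=-1),
])
y = ffn(tf.random.normal((4, 16, 64)))   # (batch, seq_len, d_model)
print(y.shape)                           # (4, 16, 128)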

3. Rotary Embeddings (RoPE)

RoPE is a type of position embedding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency into the self-attention formulation.

Advantages of RoPE:

  • It can be expanded to any sequence length.
  • Inter-token dependency decays with increasing relative distance.
  • It can equip linear self-attention with relative position encoding.

The key idea is to encode relative position by multiplying the context representations with a rotation matrix.

The dependency modeled by RoPE decays as the relative distance increases, which is desirable for natural language encoding.

https://arxiv.org/pdf/2104.09864v4.pdf
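As a rough illustration, here is a minimal NumPy sketch of the rotation idea (not LLaMA's actual implementation; the base of 10000 follows the RoPE paper, everything else is simplified):

import numpy as np

def rotary_embed(x, base=10000.0):
    # x: (seq_len, dim) with dim even. Each consecutive feature pair at
    # position m is rotated by an angle m * theta_i.
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)            # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * theta[None]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # split into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rotary_embed(np.random.randn(16, 8))   # rotated queries; dot products of
k = rotary_embed(np.random.randn(16, 8))   # rotated q and k depend only on
                                           # relative positions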

The inspiration for using RoPE in LLaMA is taken from GPTNeo.

Other important techniques used in the paper:

Optimizer

LLaMA uses the AdamW optimizer (β1 = 0.9, β2 = 0.95) with a cosine learning rate schedule, a weight decay of 0.1, gradient clipping at 1.0, and 2,000 warmup steps.
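A rough PyTorch sketch of this configuration (the learning rate, total step count, and the warmup-via-SequentialLR scheduling are illustrative assumptions, not the paper's exact recipe):

import torch

model = torch.nn.Linear(512, 512)   # stand-in for the transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=2000)   # 2000 warmup steps
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100_000)                         # cosine decay afterwards
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[2000])

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping
optimizer.step()
scheduler.step()
optimizer.zero_grad()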

Efficient Implementations

LLaMA uses an efficient implementation of the causal multi-head attention operator, available in the xformers library [5]. This reduces memory usage and runtime by not storing the attention weights and not computing the key/query scores that are masked.

The backward function for the transformer layers was also implemented manually, to save activations that are expensive to recompute during the backward pass.
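For instance, a minimal sketch using the xformers memory-efficient attention operator (the shapes are illustrative; check the library for the current interface):

import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask

# q, k, v: (batch, seq_len, num_heads, head_dim)
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)

# Causal multi-head attention without materializing the full attention matrix.
out = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())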

Summary:

LLaMA builds upon diverse open-source public datasets and makes the effective and efficient architecture choices discussed above.

LLaMA performs really well on question answering and code generation benchmarks.

Despite being smaller in size, LLaMA performs better than much larger models.

(Figure in the original post: model performance for code generation.)

Hope you like it…… Keep learning!!

Sources:

[1] https://dl.acm.org/doi/pdf/10.5555/3454287.3455397

[2] https://github.com/Rishit-dagli/GLU/blob/main/glu_tf/glu.py

[3] https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

[4] https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

[5] https://github.com/facebookresearch/xformers
