🚧🚧🚧 The experiment files are now found in https://github.com/softmax1/nanoGPT_softmax1_reddit/tree/softmax1-final 🚧🚧🚧
The goal is to run an experiment to see whether there is evidence that softmax1, also known as "Quiet Attention", can perform better than a regular transformer.
This is a humble experiment on a fairly small model, so I am not expecting any big conclusions. It would be surprising if I found a solid result, to be honest. However, this experiment (and people's feedback) may convince me or someone else to run bigger experiments.
It is good to define the experiment, and how I will evaluate the result, before I run it. This avoids the temptation to change the experiment as I go in order to "fix" things.
I will use a model that trains fairly quickly and cheaply: the nanoGPT model. I will use this specific commit so you can reproduce this at home: eba36e84649f3c6d840a93092cb779a260544d08
I will run the data preparation here: prepare.py, but instead of Shakespeare I will use text from Reddit. The reason is that I still want to look at what is produced and judge it by eye, and that will be easier in modern everyday English.
This preparation script uses the tiktoken encoding from OpenAI, with the gpt2 setting.
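For context, the preparation step does roughly the following (a sketch of what nanoGPT's prepare.py does, not a verbatim copy; the input file name is a placeholder for the Reddit text):

```python
import numpy as np
import tiktoken

# Tokenize the raw text with OpenAI's GPT-2 BPE, as nanoGPT's prepare.py does.
enc = tiktoken.get_encoding("gpt2")

with open("input.txt", "r", encoding="utf-8") as f:  # file name is a placeholder
    data = f.read()

n = len(data)
train_ids = enc.encode_ordinary(data[: int(n * 0.9)])  # 90% train split
val_ids = enc.encode_ordinary(data[int(n * 0.9):])     # 10% val split

# Save as uint16 arrays, the format nanoGPT's train.py memory-maps at training time.
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```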
The script used to obtain the text, obtaintext.py, and the text itself can be found in this repo.
You can download the text from here (sha256 b5d4fe2899431bed3b9f16c1a4773b5088094c5e7c03e77431affffd869e2d3c), at the moment anyway!
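If you want to verify the download, something like this will do it (the file name is a placeholder):

```python
import hashlib

# Compute the SHA-256 of the downloaded text and compare with the value above.
with open("reddit.txt", "rb") as f:  # file name is a placeholder
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "b5d4fe2899431bed3b9f16c1a4773b5088094c5e7c03e77431affffd869e2d3c"
print("OK" if digest == expected else f"Mismatch: {digest}")
```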
I will use train_shakespeare_char.py as the basis for the parameters, which defines a baby GPT model. I will double the size of n_embd to 768 since we are now using bigger tokens and not just characters. I will increase max_iters to 100000 to get more training done on both models, and increase lr_decay_iters to be consistent.
In summary:

Parameter | Value | Meaning |
---|---|---|
gradient_accumulation_steps | 1 | Number of forward/backward passes to accumulate per optimizer step |
batch_size | 64 | Number of examples used to simultaneously train the network |
block_size | 256 | Number of tokens entering the model |
n_layer | 6 | Number of layers of transformer blocks |
n_head | 6 | Number of attention heads |
n_embd | 768 | Size of the trained vector representing each token |
dropout | 0.2 | Proportion of randomly selected activations to zero out on each forward pass |
learning_rate | 1e-3 | Controls how big the updates are |
max_iters | 1000000 | Number of training iterations |
decay_lr | True | Whether to decay the learning rate |
warmup_iters | 100 | During warm-up iterations the learning rate ramps up linearly from 0 to learning_rate |
lr_decay_iters | 1000000 | After this number of iterations, stop cosine decay and stick to min_lr |
min_lr | 1e-4 | Minimum learning rate |
beta1 | 0.9 | beta1 for AdamW |
beta2 | 0.99 | beta2 for AdamW |
bias | False | Whether to use bias inside LayerNorm and Linear layers |
grad_clip | 1.0 | Clip gradients at this value |
backend | nccl | Torch distributed backend; nccl is best for GPU |
device | cuda | Device to run calculations on |
dtype | bfloat16 | Datatype to use in tensors |
compile | True | Torch performance optimization (torch.compile) |
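For reference, these overrides can be expressed as a nanoGPT-style config file. This is only a sketch; the file name, out_dir, and dataset values are illustrative and may differ from what is actually in the repo:

```python
# Sketch of a nanoGPT config override file for this experiment (names illustrative).
out_dir = 'out-reddit-softmax1'
dataset = 'reddit'  # assumes prepare.py wrote train.bin/val.bin under data/reddit

gradient_accumulation_steps = 1
batch_size = 64
block_size = 256

n_layer = 6
n_head = 6
n_embd = 768
dropout = 0.2
bias = False

learning_rate = 1e-3
max_iters = 1000000
decay_lr = True
warmup_iters = 100
lr_decay_iters = 1000000
min_lr = 1e-4
beta1 = 0.9
beta2 = 0.99
grad_clip = 1.0

backend = 'nccl'
device = 'cuda'
dtype = 'bfloat16'
compile = True
```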
I have reviewed the following code, taken from https://github.com/softmax1/EsperBERTo/blob/7d2d5ed8695b95ade6bcbe21b7ce981b3c9394d7/src/functional.py#L7, and I will probably use this (subject to a license being published).
```python
from torch import Tensor, add, divide, exp, multiply, subtract


def softmax_n_shifted_zeros(input: Tensor, n: int) -> Tensor:
    """
    $\text{softmax}_n(x_i) = \exp(x_i) / (n + \sum_j \exp(x_j))$

    Note: softmax_n, with fixed input, is _not_ shift-symmetric when n != 0, and we must account for this.
    Normally when computing a softmax, the maxes are subtracted from the inputs for numeric stability.
    """
    # compute the maxes along the last dimension
    input_maxes = input.max(dim=-1, keepdim=True).values
    # shift the input to prevent overflow (and underflow in the denominator)
    shifted_inputs = subtract(input, input_maxes)
    # compute the numerator and softmax_0 denominator using the shifted input
    numerator = exp(shifted_inputs)
    original_denominator = numerator.sum(dim=-1, keepdim=True)
    # we need to shift the zeros in the same way we shifted the inputs
    shifted_zeros = multiply(input_maxes, -1)
    # and then add this contribution to the denominator
    denominator = add(original_denominator, multiply(exp(shifted_zeros), n))
    return divide(numerator, denominator)
```
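As a quick sanity check (my own, not from the linked repo): with n=0 the function should match torch's softmax exactly, and with n=1 each row should sum to slightly less than one.

```python
import torch

x = torch.randn(2, 4, 8)  # arbitrary shape; the softmax is over the last dim
assert torch.allclose(softmax_n_shifted_zeros(x, 0), torch.softmax(x, dim=-1), atol=1e-6)
assert (softmax_n_shifted_zeros(x, 1).sum(dim=-1) < 1).all()
```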
Todo: It might be worth figuring out the most efficient backprop for this.
Todo: Describe the changes to model.py.
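Until that section is written, here is a rough sketch of the kind of change I expect, mirroring nanoGPT's non-flash (manual) attention path but written as a standalone helper; the function name and tensor layout are my own, and the actual diff in the repo may differ.

```python
import math
import torch


def quiet_causal_attention(q, k, v, n=1):
    """Scaled dot-product attention with softmax_n replacing the usual softmax.

    q, k, v have shape (batch, heads, seq, head_dim). A causal mask is applied
    before the (shifted) softmax, as in nanoGPT's manual attention path.
    Setting n=1 gives "quiet attention" (softmax1); n=0 recovers the original.
    """
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~causal_mask, float('-inf'))
    att = softmax_n_shifted_zeros(att, n)  # the only change vs. F.softmax(att, dim=-1)
    return att @ v
```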
I will be using modal.com to train and evaluate the model.
The script experiment.py uses Modal to perform the init and training steps on a cloud GPU.
Usage:

```sh
modal token new
modal run experiment.py:init
modal run experiment.py:train
```
I won't use perplexity to compare the models, as we don't expect it to change. However, it will still be recorded and might be interesting.
I will measure kurtosis using the method from https://github.com/softmax1/softmax1:
```python
import torch

for name, param in model.named_parameters():
    mean = param.data.mean()
    diffs = param.data - mean
    var = torch.mean(torch.pow(diffs, 2.0))
    std = torch.pow(var, 0.5)
    zscores = diffs / std
    skews = torch.mean(torch.pow(zscores, 3.0))
    kurtosis = torch.mean(torch.pow(zscores, 4.0)) - 3.0  # excess kurtosis
    print(f"{name}: skew={skews.item():.3f}, kurtosis={kurtosis.item():.3f}")
```
I will then, time permitting, see what compressibility looks like, i.e. what happens if we quantize the model?
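One cheap way to probe that (a sketch of my own, not an established metric): round-trip each weight tensor through symmetric int8 quantization and look at the relative reconstruction error; a model with fewer extreme outlier weights should come through this with smaller errors.

```python
import torch


def int8_roundtrip_error(model):
    """Per-tensor symmetric int8 quantization of each weight, reporting the
    relative reconstruction error. Large errors suggest outlier-heavy weights."""
    errors = {}
    for name, param in model.named_parameters():
        w = param.data.float()
        scale = w.abs().max() / 127.0
        if scale == 0:
            continue  # skip all-zero tensors
        w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
        errors[name] = ((w - w_q).norm() / w.norm()).item()
    return errors
```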