Pythia-12b, supervised fine-tuned on the Anthropic-hh-rlhf dataset for 1 epoch (sft-model), then trained with DPO (paper) on the same dataset for 1 epoch.
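For readers unfamiliar with DPO, the per-example objective can be sketched as below. This is a minimal illustration of the loss from the DPO paper, not the training code used for this model: the function name `dpo_loss` and the default `beta=0.1` are assumptions, since this card does not state the hyperparameters. Inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under the frozen reference model (here, the SFT model).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    beta=0.1 is an illustrative default, not a value taken from this model card.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected response's likelihood.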

wandb log

Benchmark evaluations, included in the repo, were run with lm-evaluation-harness.

See Pythia-12b for original model details (paper).
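A minimal sketch of loading the model for inference with Hugging Face `transformers`, using the repository id `lomahony/eleuther-pythia12b-hh-dpo`. It assumes the repo ships standard `transformers` weights and a tokenizer, that `torch` and `accelerate` are installed (needed for `device_map="auto"`), and that prompts follow the `Human:` / `Assistant:` turn format of the Anthropic-hh-rlhf data. The helper names `format_hh_prompt` and `generate_reply` are hypothetical, and half precision is a memory-saving choice rather than a documented requirement.

```python
MODEL_ID = "lomahony/eleuther-pythia12b-hh-dpo"


def format_hh_prompt(user_message: str) -> str:
    """Wrap a single user turn in the HH-style dialogue format (assumed)."""
    return f"\n\nHuman: {user_message.strip()}\n\nAssistant:"


def generate_reply(user_message: str, max_new_tokens: int = 64) -> str:
    """Load the model and greedily decode one assistant turn."""
    # Imported lazily so the prompt helper works without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # assumption: fp16 to reduce memory use
        device_map="auto",          # requires the accelerate package
    )
    inputs = tokenizer(format_hh_prompt(user_message), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("How do I bake bread?")` downloads the full 12B-parameter checkpoint on first use, so expect a large download and substantial GPU memory.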

