dctanner 
posted an update Jan 15, 2024
As the number of datasets for fine-tuning chat models has grown, a plethora of dataset formats has emerged. The most popular of these include the formats used by the Alpaca, ShareGPT and Open Assistant datasets. The datasets and their formats have also evolved from single-turn to multi-turn conversations. Many of these formats share similarities (and they all have the same goal), but handling the variations in format across datasets is often a hassle and a source of potential bugs.

Luckily, the community seems to be converging on a simple and elegant chat dataset format: each record is a list of conversation turns, and each turn is an object with a role (system, assistant or user) and content. Hugging Face uses this input format in the [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates) docs:

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]


Popular datasets like HuggingFaceH4/no_robots follow this format.

To encourage usage of this format, I propose we give it a name: the Hugging Face MessagesList format.

The format is defined as:

- Having at least one messages column of type list.
- Each messages record is an array containing one or more message turn objects.
- A message turn must have role and content keys.
- role should be one of system, assistant or user.
- content is the text content of the message.
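
The rules above are simple enough to check mechanically. Here is a minimal sketch of a validator for a single messages record; the function name `validate_messages` and the error wording are my own for illustration and not part of any library:

```python
from typing import Any, List

# Roles allowed by the format as described in this post.
ALLOWED_ROLES = ("system", "assistant", "user")


def validate_messages(messages: Any) -> List[str]:
    """Return a list of problems found in one messages record (empty if valid)."""
    if not isinstance(messages, list) or not messages:
        return ["record must be a non-empty list of message turns"]
    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: expected an object, got {type(turn).__name__}")
            continue
        # Every turn needs both keys.
        for key in ("role", "content"):
            if key not in turn:
                problems.append(f"turn {i}: missing '{key}' key")
        if "role" in turn and turn["role"] not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn['role']!r}")
        if "content" in turn and not isinstance(turn["content"], str):
            problems.append(f"turn {i}: content must be text")
    return problems
```

Running it over every row of a dataset's messages column before training would be one way to catch format drift early.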

This may be a small thing, but having a common dataset format will reduce the time wasted on data wrangling and help everyone.
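
To give a feel for the wrangling involved, here is a rough sketch of converting ShareGPT-style and Alpaca-style records into this format. The field names (`conversations`, `from`, `value`, `instruction`, `input`, `output`) are the ones I've commonly seen in those datasets, so treat them as assumptions and check your actual data; the helper names are hypothetical, and joining the Alpaca instruction and input with a blank line is just one possible choice:

```python
# Assumed ShareGPT speaker labels mapped to the roles used in the messages format.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def from_sharegpt(record):
    """Convert one ShareGPT-style record into a list of role/content turns."""
    return [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]


def from_alpaca(record):
    """Convert one Alpaca-style record into a single user/assistant exchange."""
    prompt = record["instruction"]
    # Alpaca records may carry an optional input that accompanies the instruction.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]
```

Datasets that use other speaker labels would raise a `KeyError` in `from_sharegpt`, which is probably what you want: a loud failure rather than a silently mislabelled turn.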

Is this the same as the OpenAI chat format, or is there any difference?

It's the same. But I feel we would benefit from a different name than ChatML, which I sometimes see used to refer to both this format and the actual chat template format.

Do you know how much this format is currently being used? i.e. what % of datasets adopted this format? Could be a nice community effort to convert some existing datasets with permissive licences into a standard format?

All the HF H4 datasets follow it :)
We could definitely do some like https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs, orca etc.
I'm doing hh-rlhf right now so it can be used with alignment-handbook.

For pretrained chat models, is the format dictated by the model itself? Or is the user free to choose any format they want?