ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change
Abstract
This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens. For the first model, the 4.2B domain-specific tokens were included during pre-training and the second was adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens. Each model is instruction fine-tuned on a high-quality and human-generated domain-specific dataset that has been created in close cooperation with climate scientists. To reduce the number of hallucinations, we optimize the model for retrieval augmentation and propose a hierarchical retrieval strategy. To increase the accessibility of our model to non-English speakers, we propose to make use of cascaded machine translation and show that this approach can perform comparably to natively multilingual models while being easier to scale to a large number of languages. Further, to address the intrinsic interdisciplinary aspect of climate change we consider different research perspectives. Therefore, the model can produce in-depth answers focusing on different perspectives in addition to an overall answer. We propose a suite of automatic climate-specific benchmarks to evaluate LLMs. On these benchmarks, ClimateGPT-7B performs on par with the ten times larger Llama-2-70B Chat model while not degrading results on general domain benchmarks. Our human evaluation confirms the trends we saw in our benchmarks. All models were trained and evaluated using renewable energy and are released publicly.
Community
Hi, I'm wondering how exactly the Clima500 dataset was made. It seems like you got some of it from expert and non-expert demonstrations, but I'm not sure where the remaining conversation pairs came from. I'm confused how StackExchange, AppTek General, Dolly, FLAN, CoT, etc. datasets were used since they aren't necessarily related to climate change.
Hi! To clarify, the Clima500 dataset (https://huggingface.co/datasets/mbzuai-oryx/Clima500) was created by MBZUAI for their Arabic Mini-ClimateGPT model (https://aclanthology.org/2023.findings-emnlp.941/). Even though we both use the name ClimateGPT (I know, kinda suboptimal), there is no relation between our work and MBZUAI's.
Now, about our instruction fine-tuning data: the subsets we used for training are listed in Table 3 of the paper (the first column gives the size of each subset, the second how it was up- or downsampled for training). The climate-specific data consists of the data we generated from scratch (~10k demonstrations) plus QA pairs from StackExchange, taken only from climate-related communities (~3k).
In addition to the climate-specific data, we also included general-domain data (e.g. AppTek General, OASST-1, Dolly, FLAN, CoT) with the goal of improving the model's general instruction-following capabilities. As described in the paper, we used a separate system prompt for each subset during training to further increase the relevance of the high-quality climate expert data at inference time.
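To make the mixing idea concrete, here is a minimal sketch of how subsets could be up- or downsampled and tagged with a per-subset system prompt. It is only an illustration, not our training code: the function name `build_mixture`, the subset names, the toy examples, the sampling factors and the system prompt strings are all hypothetical placeholders and are not the values from Table 3.

```python
import random


def build_mixture(subsets, seed=0):
    """Combine instruction subsets into one shuffled training list.

    `subsets` maps a subset name to a dict with:
      - "examples": list of {"prompt": ..., "completion": ...} dicts
      - "factor": sampling factor (2.0 = seen twice, 0.5 = half is kept)
      - "system_prompt": system prompt attached to every example of the subset
    All names and values here are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    mixture = []
    for name, cfg in subsets.items():
        examples = cfg["examples"]
        whole, frac = divmod(cfg["factor"], 1.0)
        # Integer part of the factor: repeat the full subset that many times.
        sampled = examples * int(whole)
        # Fractional part: draw that share of the subset without replacement.
        sampled += rng.sample(examples, round(frac * len(examples)))
        for ex in sampled:
            mixture.append({"system": cfg["system_prompt"], "subset": name, **ex})
    rng.shuffle(mixture)
    return mixture


# Toy data: 4 "climate expert" examples upsampled 2x, 10 general examples downsampled to half.
subsets = {
    "climate_expert": {
        "examples": [{"prompt": f"climate q{i}", "completion": f"a{i}"} for i in range(4)],
        "factor": 2.0,
        "system_prompt": "You answer as a climate expert.",  # placeholder prompt
    },
    "general": {
        "examples": [{"prompt": f"general q{i}", "completion": f"a{i}"} for i in range(10)],
        "factor": 0.5,
        "system_prompt": "You are a helpful assistant.",  # placeholder prompt
    },
}
mixture = build_mixture(subsets)
```

With a setup like this, using the climate-expert system prompt at inference time is what steers the model toward the style of the expert-written subset.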
In contrast to our work, Clima500 was generated by using ChatGPT to turn existing climate-specific QA data into instruction-completion pairs. This way, they were able to create a much larger dataset of ~500k pairs. Our focus was more on collecting a smaller but high-quality dataset.
FYI, we're currently working on a few more ablation experiments to analyse these design choices more systematically (such as the importance of adding the general-domain data, and possibly a comparison to Clima500) for an updated version of the paper.
Great idea. How are ClimateGPT and the Hedera blockchain related? I didn't see any mention of it in the PDF.