Prompting to reproduce MBPP test results
Hi, I'm trying to reproduce the SantaCoder test results on MBPP from the paper, and I'm wondering what the recommended way to prompt the model is.
MBPP provides text instructions, e.g. "Write a function to reverse words in a given string.", which the SantaCoder model card explicitly advises against using. Nevertheless, I tried prompting the model in one of two ways (in Python):
- Function signature, followed by docstring:
def reverse_words(s):
    """Write a function to reverse words in a given string."""
- Comment, followed by function signature:
# Write a function to reverse words in a given string.
def reverse_words(s):
In both cases I get reasonable output, except that after defining the function, generation does not terminate. Instead, it keeps repeating the function under new names until max_length is reached, like this:
def reverse_words(s):
    """Write a function to reverse words in a given string."""
    return ' '.join(s.split()[::-1])
def reverse_words_2(s):
    """Write a function to reverse words in a given string."""
    return ' '.join(s.split()[::-1])
def reverse_words_3(s):
    """Write a function to reverse words in a given string."""
    return ' '.join(s.split()[::-1])
Should I change the prompting method, or is this output acceptable, in which case I should just truncate it manually? I am trying to reproduce the eval results from the paper as closely as possible. Thanks for your help.
Hi, we evaluated using the MultiPL-E version of MBPP, which already provides function signatures, so the evaluation is very similar to HumanEval.
Thank you! And regarding the other part of my question: generation with greedy search or with sampling at temperature=0.2 fails to terminate, as shown above. Should I manually truncate the output?
How are you doing the generations? If you use model.generate(), it should stop at the EOS token if one comes up. If it doesn't come up often, you can add a stopping criterion, as is done here. Note that you then need to post-process the output to keep only the first function, as is done here. You can also find more examples in our evaluation harness.
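For readers without the linked code at hand, here is a minimal sketch of the post-processing step described above: cut the generation at the first point where a new top-level block begins, so that only the first function survives. The names `STOP_SEQUENCES`, `truncate_completion`, and `keep_first_function` are hypothetical, and the stop-sequence list is an assumption modeled on common HumanEval-style setups, not necessarily what the evaluation harness uses.

```python
# Assumed stop sequences: each marks the start of a new top-level block.
# This list is illustrative; adjust it for the benchmark you are running.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\n#", "\nif ", "\nprint("]


def truncate_completion(completion, stop_sequences=STOP_SEQUENCES):
    """Cut the completion at the earliest stop sequence, if one occurs."""
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


def keep_first_function(prompt, generation):
    """Return the prompt plus only the first function body the model produced."""
    # Decoded output usually starts with the prompt; strip it so that stop
    # sequences are searched for only in the newly generated text.
    if generation.startswith(prompt):
        completion = generation[len(prompt):]
    else:
        completion = generation
    return prompt + truncate_completion(completion)


# Example using the repeated output from the question.
prompt = (
    "def reverse_words(s):\n"
    '    """Write a function to reverse words in a given string."""\n'
)
generation = (
    prompt
    + "    return ' '.join(s.split()[::-1])\n"
    + "\ndef reverse_words_2(s):\n"
    + '    """Write a function to reverse words in a given string."""\n'
    + "    return ' '.join(s.split()[::-1])\n"
)
result = keep_first_function(prompt, generation)
print(result)
```

Truncating after generation is independent of how generation stopped, so it works whether the model hit EOS, a stopping criterion, or max_length. A stopping criterion on top of this mainly saves compute.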
This answers my question, thank you for the great and prompt responses!