Model Card for SamSJackson/paraphrase-dipper-no-ctx

The paper "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense" proposed a strong discourse paraphraser known as DIPPER.

DIPPER is a large model, built from google/t5-efficient-xxl and finetuned on 6.3M datapoints. I am proposing a lightweight, non-context equivalent for lower-cost usage.

This model is built from google/t5-efficient-large-nl32 and finetuned on 100,000 datapoints. Notably, all of the datapoints are non-context. Refer to the original paper for further background on this topic.

The dataset used to finetune this model is available here: kpar3-no-ctx

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

  • Developed by: Sam Jackson
  • Model type: Sequence-to-Sequence Model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model [optional]: google/t5-efficient-large-nl32

Model Sources [optional]

Uses

The model is intended for controllable paraphrasing. The training data exposes two control parameters, lexical (word choice) and order (paragraph structure), which set how strongly the text is paraphrased.

See the example usage code for further detail.

Direct Use

The model is usable as uploaded. No further finetuning is required, although it is possible.

Downstream Use [optional]

This model was finetuned from a T5 checkpoint. It is possible to finetune it further, if desired. If you plan to do transfer learning, I would recommend starting from the initial checkpoint instead: google/t5-efficient-large-nl32.

Recommendations

If you have the capacity, I recommend using the more powerful model: DIPPER.

Otherwise, this model is sufficiently strong. It outperforms the sentence-based ChatGPT Paraphraser on perplexity when the outputs of both models are scored with facebook/opt-2.7b.
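
For readers unfamiliar with the metric: perplexity is the exponential of the mean negative log-likelihood per token under a scoring model (here, facebook/opt-2.7b). The sketch below shows only that final arithmetic step and assumes the per-token log-probabilities have already been obtained from the scoring model. The card does not include the evaluation script, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean(log p(token))) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Illustrative values only, not measurements from either model:
# if every token has probability 0.25, the perplexity is close to 4.
uniform_over_four = [math.log(0.25)] * 10
print(perplexity(uniform_over_four))
```

Lower perplexity means the scoring model finds the paraphrased text more predictable, which is commonly read as a proxy for fluency.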

How to Get Started with the Model

See the Example Usage section below for code to get started with the model.

Training Details

Training Data

As mentioned, the training data is here: kpar3-no-ctx. Pre-processing consists only of tokenisation with the google/t5-efficient-large-nl32 tokenizer.

The data consists of classic paraphrase pairs. However, the first element of each pair is prefixed with the terms "lexical = x" and "order = y". The values x and y are in the set {0, 20, 40, 60, 80, 100} and denote the strength with which the model should paraphrase.

In particular, 0 is the strongest setting: a sentence with "lexical = 0" should change as many words as possible while maintaining the original meaning. Similarly, a sentence with "order = 0" should restructure the paragraph as much as the model can.

The dataset only contains parameter values in increments of 20.
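
The control-code format described above can be sketched as a small helper. This is a minimal sketch, not the author's preprocessing code: `build_prompt`, `snap_to_level`, and `ALLOWED_LEVELS` are hypothetical names, and the prefix layout is copied from the prompt in the Example Usage section of this card.

```python
# Hypothetical helpers; the prefix layout mirrors the prompt in Example Usage.
ALLOWED_LEVELS = (0, 20, 40, 60, 80, 100)  # the only values seen in training


def build_prompt(text: str, lexical: int, order: int) -> str:
    """Prefix `text` with the lexical and order control codes."""
    for name, value in (("lexical", lexical), ("order", order)):
        if value not in ALLOWED_LEVELS:
            raise ValueError(f"{name} must be one of {ALLOWED_LEVELS}, got {value}")
    return f"lexical = {lexical}, order = {order} {text}"


def snap_to_level(value: float) -> int:
    """Clip a value to 0-100 and return the nearest trained level."""
    clipped = min(max(value, 0), 100)
    return min(ALLOWED_LEVELS, key=lambda level: abs(level - clipped))


print(build_prompt("I walk my dog in the park.", lexical=20, order=40))
# prints: lexical = 20, order = 40 I walk my dog in the park.
```

Because the model never saw values outside this grid during finetuning, snapping a requested strength to the nearest trained level is likely safer than passing arbitrary integers, although the card does not report how the model behaves on unseen values.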

Training Hyperparameters

  • Training regime:
learning_rate = 1e-4
bf16 = True
num_train_epochs = 2
auto_find_batch_size = True
generation_num_beams = 2
generation_max_length = 200
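
These names match arguments of `Seq2SeqTrainingArguments` in the transformers library. The sketch below shows one way they could be wired up. It is an assumption that the author trained with `Seq2SeqTrainer`; `build_training_args` and the default `output_dir` are hypothetical.

```python
# Hyperparameters exactly as listed in this card.
TRAINING_KWARGS = {
    "learning_rate": 1e-4,
    "bf16": True,                  # needs hardware with bfloat16 support
    "num_train_epochs": 2,
    "auto_find_batch_size": True,  # relies on the `accelerate` package
    "generation_num_beams": 2,
    "generation_max_length": 200,
}


def build_training_args(output_dir="paraphraser-checkpoints"):
    """Build Seq2SeqTrainingArguments from the settings above."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The two `generation_*` values only take effect during evaluation when `predict_with_generate` is enabled, a setting the card does not list.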

Speeds, Sizes, Times [optional]

Finetuning on 100,000 datapoints took around 14 GPU hours on an RTX 3090.

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("google/t5-efficient-large-nl32")

model = AutoModelForSeq2SeqLM.from_pretrained("SamSJackson/paraphrase-dipper-no-ctx")
model = model.to(device)

text = "Each Wednesday, I take my dog for a walk in Central Park."

lexical = 20
order = 40

prompt = f"lexical = {lexical}, order = {order} {text}"

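# Tokenize the prompt; anything beyond 1000 tokens is truncated.
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    padding="longest",
    max_length=1000,
    truncation=True,
).to(device)

# Sample a paraphrase with nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    top_p=0.75,
    do_sample=True,
    max_new_tokens=300,
)

# Decode the generated token IDs back into text.
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
response = " ".join(response)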

print(response)

Citation [optional]

BibTeX:

@misc{krishna2023paraphrasing,
      title={Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense}, 
      author={Kalpesh Krishna and Yixiao Song and Marzena Karpinska and John Wieting and Mohit Iyyer},
      year={2023},
      eprint={2303.13408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Model Card Contact

Contact me through Hugging Face if you have any questions.

Model Size

973M params, stored as Safetensors with tensor type F32.