stereoplegic's Collections
Dataset generation
Ensemble-Instruct: Generating Instruction-Tuning Data with a
Heterogeneous Mixture of LMs
Paper
•
2310.13961
•
Published
•
5
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
Paper
•
2202.07922
•
Published
•
1
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models
Paper
•
2310.13671
•
Published
•
19
Fabricator: An Open Source Toolkit for Generating Labeled Training Data
with Teacher LLMs
Paper
•
2309.09582
•
Published
•
4
Auto-Instruct: Automatic Instruction Generation and Ranking for
Black-Box Language Models
Paper
•
2310.13127
•
Published
•
12
TeGit: Generating High-Quality Instruction-Tuning Data with
Text-Grounded Task Design
Paper
•
2309.05447
•
Published
•
1
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Paper
•
2310.04484
•
Published
•
5
Diversity of Thought Improves Reasoning Abilities of Large Language
Models
Paper
•
2310.07088
•
Published
•
5
Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large
Language Models
Paper
•
2310.01119
•
Published
•
1
Training Generative Question-Answering on Synthetic Data Obtained from
an Instruct-tuned Model
Paper
•
2310.08072
•
Published
•
1
Synthetic Data Generation with Large Language Models for Text
Classification: Potential and Limitations
Paper
•
2310.07849
•
Published
•
2
Generative Data Augmentation using LLMs improves Distributional
Robustness in Question Answering
Paper
•
2309.06358
•
Published
•
1
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI
Feedback
Paper
•
2309.00267
•
Published
•
48
Adapting Large Language Models via Reading Comprehension
Paper
•
2309.09530
•
Published
•
77
Self-Alignment with Instruction Backtranslation
Paper
•
2308.06259
•
Published
•
42
Unnatural Instructions: Tuning Language Models with (Almost) No Human
Labor
Paper
•
2212.09689
•
Published
•
1
Democratizing Reasoning Ability: Tailored Learning from Large Language
Model
Paper
•
2310.13332
•
Published
•
15
Teaching Language Models to Self-Improve through Interactive
Demonstrations
Paper
•
2310.13522
•
Published
•
12
Self-Convinced Prompting: Few-Shot Question Answering with Repeated
Introspection
Paper
•
2310.05035
•
Published
•
1
Tuna: Instruction Tuning using Feedback from Large Language Models
Paper
•
2310.13385
•
Published
•
11
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
Paper
•
2310.11716
•
Published
•
5
CITING: Large Language Models Create Curriculum for Instruction Tuning
Paper
•
2310.02527
•
Published
•
2
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper
•
2307.08701
•
Published
•
23
Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning
Paper
•
2310.04474
•
Published
•
2
UltraFeedback: Boosting Language Models with High-quality Feedback
Paper
•
2310.01377
•
Published
•
5
Promptor: A Conversational and Autonomous Prompt Generation Agent for
Intelligent Text Entry Techniques
Paper
•
2310.08101
•
Published
•
2
FreshLLMs: Refreshing Large Language Models with Search Engine
Augmentation
Paper
•
2310.03214
•
Published
•
18
WizardMath: Empowering Mathematical Reasoning for Large Language Models
via Reinforced Evol-Instruct
Paper
•
2308.09583
•
Published
•
7
Retrieval-Generation Synergy Augmented Large Language Models
Paper
•
2310.05149
•
Published
•
1
Prompting Large Language Models with Chain-of-Thought for Few-Shot
Knowledge Base Question Generation
Paper
•
2310.08395
•
Published
•
1
Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models
Paper
•
2310.08491
•
Published
•
54
LMDX: Language Model-based Document Information Extraction and
Localization
Paper
•
2309.10952
•
Published
•
65
Quality-Diversity through AI Feedback
Paper
•
2310.13032
•
Published
•
1
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images
Paper
•
2310.16825
•
Published
•
33
A Picture is Worth a Thousand Words: Principled Recaptioning Improves
Image Generation
Paper
•
2310.16656
•
Published
•
42
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Paper
•
2310.10638
•
Published
•
29
Large Language Models Are Also Good Prototypical Commonsense Reasoners
Paper
•
2309.13165
•
Published
•
1
DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller
Language Models
Paper
•
2310.05074
•
Published
•
1
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language
Models
Paper
•
2309.12284
•
Published
•
19
Commonsense Knowledge Transfer for Pre-trained Language Models
Paper
•
2306.02388
•
Published
•
1
Snowman: A Million-scale Chinese Commonsense Knowledge Graph Distilled
from Foundation Model
Paper
•
2306.10241
•
Published
•
1
VIGC: Visual Instruction Generation and Correction
Paper
•
2308.12714
•
Published
•
1
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Paper
•
2309.11998
•
Published
•
25
In-Context Alignment: Chat with Vanilla Language Models Before
Fine-Tuning
Paper
•
2308.04275
•
Published
•
1
Large Language Model as a User Simulator
Paper
•
2308.11534
•
Published
•
2
Enable Language Models to Implicitly Learn Self-Improvement From Data
Paper
•
2310.00898
•
Published
•
23
Textbooks Are All You Need II: phi-1.5 technical report
Paper
•
2309.05463
•
Published
•
87
Aligning Large Language Models through Synthetic Feedback
Paper
•
2305.13735
•
Published
•
1
Reinforced Self-Training (ReST) for Language Modeling
Paper
•
2308.08998
•
Published
•
2
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources
Paper
•
2306.04751
•
Published
•
5
Query2doc: Query Expansion with Large Language Models
Paper
•
2303.07678
•
Published
•
1
Query Expansion by Prompting Large Language Models
Paper
•
2305.03653
•
Published
•
1
Generative Relevance Feedback with Large Language Models
Paper
•
2304.13157
•
Published
•
1
InPars-v2: Large Language Models as Efficient Dataset Generators for
Information Retrieval
Paper
•
2301.01820
•
Published
•
1
Exploring the Viability of Synthetic Query Generation for Relevance
Prediction
Paper
•
2305.11944
•
Published
•
1
LM-CPPF: Paraphrasing-Guided Data Augmentation for Contrastive
Prompt-Based Few-Shot Fine-Tuning
Paper
•
2305.18169
•
Published
•
1
Automated Annotation with Generative AI Requires Validation
Paper
•
2306.00176
•
Published
•
1
Augmented Large Language Models with Parametric Knowledge Guiding
Paper
•
2305.04757
•
Published
•
2
Pre-training with Large Language Model-based Document Expansion for
Dense Passage Retrieval
Paper
•
2308.08285
•
Published
•
1
Learning to Retrieve In-Context Examples for Large Language Models
Paper
•
2307.07164
•
Published
•
21
Tuning Language Models as Training Data Generators for
Augmentation-Enhanced Few-Shot Learning
Paper
•
2211.03044
•
Published
•
1
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large
Language Models
Paper
•
2309.10707
•
Published
•
2
PromptMix: A Class Boundary Augmentation Method for Large Language Model
Distillation
Paper
•
2310.14192
•
Published
•
1
The Program Testing Ability of Large Language Models for Code
Paper
•
2310.05727
•
Published
•
1
Assessing the potential of AI-assisted pragmatic annotation: The case of
apologies
Paper
•
2305.08339
•
Published
•
1
Effectiveness of Data Augmentation for Parameter Efficient Tuning with
Limited Data
Paper
•
2303.02577
•
Published
•
1
Rethink the Effectiveness of Text Data Augmentation: An Empirical
Analysis
Paper
•
2306.07664
•
Published
•
1
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language
Modeling Likewise
Paper
•
2310.19019
•
Published
•
10
Textbooks Are All You Need
Paper
•
2306.11644
•
Published
•
143
Connecting Large Language Models with Evolutionary Algorithms Yields
Powerful Prompt Optimizers
Paper
•
2309.08532
•
Published
•
53
SAIL: Search-Augmented Instruction Learning
Paper
•
2305.15225
•
Published
•
2
Reproducing Whisper-Style Training Using an Open-Source Toolkit and
Publicly Available Data
Paper
•
2309.13876
•
Published
•
1
Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning
Paper
•
2305.18170
•
Published
•
2
Constructing Multilingual Code Search Dataset Using Neural Machine
Translation
Paper
•
2306.15604
•
Published
•
1
TRACED: Execution-aware Pre-training for Source Code
Paper
•
2306.07487
•
Published
•
1
Too Few Bug Reports? Exploring Data Augmentation for Improved
Changeset-based Bug Localization
Paper
•
2305.16430
•
Published
•
1
Learning to Reason and Memorize with Self-Notes
Paper
•
2305.00833
•
Published
•
5
Generating Efficient Training Data via LLM-based Attribute Manipulation
Paper
•
2307.07099
•
Published
•
1
End-to-end Knowledge Retrieval with Multi-modal Queries
Paper
•
2306.00424
•
Published
•
1
EchoPrompt: Instructing the Model to Rephrase Queries for Improved
In-context Learning
Paper
•
2309.10687
•
Published
•
1
AugGPT: Leveraging ChatGPT for Text Data Augmentation
Paper
•
2302.13007
•
Published
•
1
Large Language Models as Annotators: Enhancing Generalization of NLP
Models at Minimal Cost
Paper
•
2306.15766
•
Published
•
1
Quick Starting Dialog Systems with Paraphrase Generation
Paper
•
2204.02546
•
Published
•
1
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient
Framework
Paper
•
2111.04130
•
Published
•
1
Scaling Relationship on Learning Mathematical Reasoning with Large
Language Models
Paper
•
2308.01825
•
Published
•
21
Harnessing the Power of David against Goliath: Exploring Instruction
Data Generation without Using Closed-Source Models
Paper
•
2308.12711
•
Published
•
1
AnnoLLM: Making Large Language Models to Be Better Crowdsourced
Annotators
Paper
•
2303.16854
•
Published
•
1
Training Language Models with Language Feedback at Scale
Paper
•
2303.16755
•
Published
•
1
Magicoder: Source Code Is All You Need
Paper
•
2312.02120
•
Published
•
80
Asking Questions the Human Way: Scalable Question-Answer Generation from
Text Corpus
Paper
•
2002.00748
•
Published
•
1
Beyond Human Data: Scaling Self-Training for Problem-Solving with
Language Models
Paper
•
2312.06585
•
Published
•
29
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with
Refined Data Generation
Paper
•
2312.14187
•
Published
•
50
Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper
•
2212.10560
•
Published
•
9
WizardLM: Empowering Large Language Models to Follow Complex
Instructions
Paper
•
2304.12244
•
Published
•
14
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Paper
•
2306.08568
•
Published
•
28
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
Paper
•
2401.06201
•
Published
•
2
AceCoder: Utilizing Existing Code to Enhance Code Generation
Paper
•
2303.17780
•
Published
•
1
SPADE: Synthesizing Assertions for Large Language Model Pipelines
Paper
•
2401.03038
•
Published
•
2
Mixture of Soft Prompts for Controllable Data Generation
Paper
•
2303.01580
•
Published
•
1
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language
Modeling
Paper
•
2401.16380
•
Published
•
48
Improving Text Embeddings with Large Language Models
Paper
•
2401.00368
•
Published
•
80
CooK: Empowering General-Purpose Language Models with Modular and
Collaborative Knowledge
Paper
•
2305.09955
•
Published
•
1
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM
Workflows
Paper
•
2402.10379
•
Published
•
31
Pre-trained Language Models as Re-Annotators
Paper
•
2205.05368
•
Published
•
1
A Morphologically-Aware Dictionary-based Data Augmentation Technique for
Machine Translation of Under-Represented Languages
Paper
•
2402.01939
•
Published
•
1
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of
Large Vision-Language Models
Paper
•
2403.00231
•
Published
•
1
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Paper
•
2309.11346
•
Published
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper
•
2403.15042
•
Published
•
26
MathGenie: Generating Synthetic Data with Question Back-translation for
Enhancing Mathematical Reasoning of LLMs
Paper
•
2402.16352
•
Published
•
1
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
•
2402.10176
•
Published
•
37
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper
•
2404.05875
•
Published
•
17
NExT: Teaching Large Language Models to Reason about Code Execution
Paper
•
2404.14662
•
Published
•
4
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task
Adaptation
Paper
•
2402.18334
•
Published
•
12
GeMQuAD: Generating Multilingual Question Answering Datasets from Large
Language Models using Few Shot Learning
Paper
•
2404.09163
•
Published
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
•
2404.14361
•
Published
•
2
DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by
Diversifying Synthetic Query Generation
Paper
•
2404.02489
•
Published
Prompting-based Synthetic Data Generation for Few-Shot Question
Answering
Paper
•
2405.09335
•
Published
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Paper
•
2405.10040
•
Published
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale
Synthetic Data
Paper
•
2405.14333
•
Published
•
37
Grounding Data Science Code Generation with Input-Output Specifications
Paper
•
2402.08073
•
Published
AgentTuning: Enabling Generalized Agent Abilities for LLMs
Paper
•
2310.12823
•
Published
•
35
SemCoder: Training Code Language Models with Comprehensive Semantics
Paper
•
2406.01006
•
Published
•
1
CrossTune: Black-Box Few-Shot Classification with Label Enhancement
Paper
•
2403.12468
•
Published
TarGEN: Targeted Data Generation with Large Language Models
Paper
•
2310.17876
•
Published
Enhancing Conversational Search: Large Language Model-Aided Informative
Query Rewriting
Paper
•
2310.09716
•
Published
Automatically Generating Numerous Context-Driven SFT Data for LLMs
across Diverse Granularity
Paper
•
2405.16579
•
Published