Rainforest Connection

non-profit
Verified

AI & ML interests

None defined yet.

Recent Activity

lhoestq authored a paper about 1 month ago
Croissant: A Metadata Format for ML-Ready Datasets
sasha updated a dataset about 1 month ago
rfcx/frugalai
antonyharfield updated a Space 2 months ago
rfcx/README

rfcx's activity

lhoestq
posted an update about 2 months ago
Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✍️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ Parquet datasets on HF :)
antonyharfield
updated a Space 2 months ago
lhoestq
posted an update 6 months ago
Hey! I'm working on a 100% synthetic Dataset Hub (you can search for any kind of dataset and the app invents it). The link is here: infinite-dataset-hub/infinite-dataset-hub

Question for the Community:

Which models should I use to generate images and audio samples for those datasets? 🤗
lhoestq
posted an update 10 months ago
✨ Easy Synthetic Dataset File Generation using LLM DataGen! Link: https://huggingface.co/spaces/lhoestq/LLM_DataGen

Features + how it works:

✍️ Generate the dataset content you want just by entering a file name
💡 Optionally specify the column names you need
💨 The dataset is streamed and generated on-the-fly in JSON Lines format
✅ Generation is constrained to always output valid JSON

How does this work ?
1/ Enter a file name
2/ The model generates column names for such a file. Using structured generation, it generates 2 to 5 column names made of lowercase characters and underscores. I use a prompt that asks for column names for a realistic dataset, with a low temperature.
3/ The columns are used to update the Finite State Machine that drives the structured generation of the dataset content, so that it generates JSON objects with those columns.
4/ The model generates JSON objects using structured generation again, using the updated Finite State Machine. I use a prompt that asks for realistic data and a temperature of 1.
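
As a rough illustration of the data flow in steps 1 to 4, here is a minimal sketch in which the model calls are replaced by hypothetical stubs (`fake_generate_columns`, `fake_generate_rows`). The column pool, the all-string value types and the JSON schema layout are assumptions made for the example, not details taken from the Space.

```python
import itertools
import json
import random
import re

# Constraint from step 2: lowercase characters and underscores only.
COLUMN_PATTERN = re.compile(r"[a-z_]+")


def fake_generate_columns(file_name, rng):
    """Stub for step 2. The real app asks an LLM; here we pick from a fixed pool."""
    pool = ["species", "recorded_at", "location", "duration_s", "confidence", "recorder_id"]
    return rng.sample(pool, rng.randint(2, 5))


def build_schema(columns):
    """Step 3: turn the column names into a JSON schema for the row objects."""
    if not 2 <= len(columns) <= 5:
        raise ValueError("expected 2 to 5 columns")
    for column in columns:
        if not COLUMN_PATTERN.fullmatch(column):
            raise ValueError(f"invalid column name: {column!r}")
    return {
        "type": "object",
        "properties": {column: {"type": "string"} for column in columns},  # assumed types
        "required": list(columns),
        "additionalProperties": False,
    }


def fake_generate_rows(schema, rng):
    """Stub for step 4. Streams rows one by one as JSON Lines."""
    while True:
        row = {column: f"value_{rng.randint(0, 999)}" for column in schema["required"]}
        yield json.dumps(row)


if __name__ == "__main__":
    rng = random.Random(0)
    columns = fake_generate_columns("rainforest_recordings.jsonl", rng)
    schema = build_schema(columns)
    for line in itertools.islice(fake_generate_rows(schema, rng), 3):
        print(line)
```

Because the rows come from a generator, a caller can start consuming them before the whole file exists, which is the point of streaming in JSON Lines format.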

> Why update a Finite State Machine instead of re-creating one ?

Creating one can take up to 30 s, while updating one takes 0.1 s (though it requires manipulating a graph, which is not easy to implement).
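
To give an idea of what "updating" means here, the toy below works on a character-level graph for a single-column object of the form `{"<column>":<digits>}`. It is only an assumed simplification: the real machine works on token ids and is much larger, and the state names and helpers (`build_fsm`, `update_fsm`, `accepts`) are hypothetical. The update rewrites only the states that spell out the key and leaves the value sub-graph untouched.

```python
DIGITS = "0123456789"


def add_key_chain(fsm, column):
    """Add the states that spell out "<column>" between 'key_0' and 'colon'."""
    text = f'"{column}"'
    for i, ch in enumerate(text):
        next_state = f"key_{i + 1}" if i + 1 < len(text) else "colon"
        fsm[f"key_{i}"] = {ch: next_state}


def build_fsm(column):
    """Build the whole graph from scratch (the slow path in the real app)."""
    fsm = {
        "start": {"{": "key_0"},
        "colon": {":": "value_0"},
        "value_0": {d: "value_1" for d in DIGITS},
        "value_1": {**{d: "value_1" for d in DIGITS}, "}": "end"},
        "end": {},
    }
    add_key_chain(fsm, column)
    return fsm


def update_fsm(fsm, column):
    """Swap the key chain in place and keep every other state as it is."""
    for state in [s for s in fsm if s.startswith("key_")]:
        del fsm[state]
    add_key_chain(fsm, column)


def accepts(fsm, text):
    """Return True if the FSM reaches its final state after reading text."""
    state = "start"
    for ch in text:
        state = fsm[state].get(ch)
        if state is None:
            return False
    return state == "end"


if __name__ == "__main__":
    fsm = build_fsm("species")
    print(accepts(fsm, '{"species":42}'))
    update_fsm(fsm, "location")
    print(accepts(fsm, '{"location":7}'), accepts(fsm, '{"species":42}'))
```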

> Batched generation is faster, why not use it ?

Generating in batches is faster, but it tends to produce duplicates in this demo.
Further work could provide a different prompt for each sequence in the batch, so that each one samples from a different distribution. Another option is a custom sampler that forbids generating the same data in sequences of the same batch.
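
One possible shape for such a sampler is sketched below. It is not implemented in the demo; the function name `sample_batch_step` and the fallback rule are assumptions. At each step, sequences that share the same prefix avoid picking a token a sibling already picked, and fall back to the normal candidates when the constraint leaves no other choice.

```python
import math
import random
from collections import defaultdict


def sample_batch_step(prefixes, logits_per_seq, rng):
    """Pick one next token id per sequence.

    Sequences with an identical prefix try not to reuse a token chosen by
    another sequence in the same group during this step.
    """
    taken = defaultdict(set)  # prefix -> token ids already chosen this step
    choices = []
    for prefix, logits in zip(prefixes, logits_per_seq):
        key = tuple(prefix)
        # Tokens masked out by structured generation have a logit of -inf.
        candidates = [t for t, logit in enumerate(logits) if logit != -math.inf]
        fresh = [t for t in candidates if t not in taken[key]]
        pool = fresh or candidates  # fall back when every candidate is taken
        weights = [math.exp(logits[t]) for t in pool]
        token = rng.choices(pool, weights=weights, k=1)[0]
        taken[key].add(token)
        choices.append(token)
    return choices


if __name__ == "__main__":
    rng = random.Random(0)
    # Three sequences with the same prefix and three equally likely tokens.
    print(sample_batch_step([[1], [1], [1]], [[0.0, 0.0, 0.0]] * 3, rng))
```

This only discourages duplicates: if a group has more sequences than allowed tokens at every step, some sequences can still end up identical.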

> How does structured generation work ?

I used the outlines library with transformers to define a JSON schema that the generation has to follow. It uses a Finite State Machine with token ids as transitions.
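
The idea can be shown without any library. The sketch below is a toy, not the outlines implementation: the vocabulary, the FSM and the random "logits" standing in for a model are all made up. At each step, only the token ids that have a transition from the current state keep their logit, the rest are masked to -inf, and the sampled token id moves the machine to its next state.

```python
import math
import random

# Hypothetical toy vocabulary: token id -> token text.
VOCAB = {0: "{", 1: '"name"', 2: ":", 3: '"bird"', 4: '"frog"', 5: "}", 6: "hello"}

# FSM: state -> {token_id: next_state}. Token 6 is never allowed.
FSM = {
    0: {0: 1},
    1: {1: 2},
    2: {2: 3},
    3: {3: 4, 4: 4},
    4: {5: 5},
}
FINAL_STATE = 5


def mask_logits(logits, state):
    """Keep the logits of allowed token ids and set all others to -inf."""
    allowed = FSM[state]
    return [logit if i in allowed else -math.inf for i, logit in enumerate(logits)]


def sample(logits, rng, temperature=1.0):
    """Sample a token id from temperature-scaled logits, skipping masked ones."""
    weights = [0.0 if l == -math.inf else math.exp(l / temperature) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(rng):
    """Walk the FSM until the final state, sampling one token per step."""
    state, pieces = 0, []
    while state != FINAL_STATE:
        logits = [rng.gauss(0, 1) for _ in VOCAB]  # stand-in for a model forward pass
        token_id = sample(mask_logits(logits, state), rng)
        pieces.append(VOCAB[token_id])
        state = FSM[state][token_id]
    return "".join(pieces)


if __name__ == "__main__":
    print(generate(random.Random(0)))
```

Whatever the stand-in model prefers, the output is always one of the two JSON objects the FSM accepts, which is why constrained generation always yields valid JSON.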

Let me know what you think ! And feel free to duplicate/modify it to try other models/prompts or sampling methods :)