Datasets documentation

Load tabular data

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Load tabular data

A tabular dataset is a generic dataset used to describe any data stored in rows and columns, where the rows represent an example and the columns represent a feature (can be continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and in database tables. This guide will show you how to load and create a tabular dataset from:

  • CSV files
  • Pandas DataFrames
  • Databases

CSV files

🤗 Datasets can read CSV files by specifying the generic csv dataset builder name in the load_dataset() method. To load more than one CSV file, pass them as a list to the data_files parameter:

>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])

You can also map specific CSV files to the train and test splits:

>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})

To load remote CSV files, pass the URLs instead:

>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset('csv', data_files={"train": base_url + "train.csv", "test": base_url + "test.csv"})

To load zipped CSV files:

>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)

Pandas DataFrames

🤗 Datasets also supports loading datasets from Pandas DataFrames with the from_pandas() method:

>>> from datasets import Dataset
>>> import pandas as pd

# create a Pandas DataFrame
>>> df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")
>>> df = pd.DataFrame(df)
# load Dataset from Pandas DataFrame
>>> dataset = Dataset.from_pandas(df)

Use the splits parameter to specify the name of the dataset split:

>>> train_ds = Dataset.from_pandas(train_df, split="train")
>>> test_ds = Dataset.from_pandas(test_df, split="test")

If the dataset doesn’t look as expected, you should explicitly specify your dataset features. A pandas.Series may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length 0 or if the Series only contains None/NaN objects, the type is set to null.

Databases

Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you’d like, or follow along and start from scratch.

Start by creating a quick SQLite database with this Covid-19 data from the New York Times:

>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")

This creates a states table in the us_covid_data.db database which you can now load into a dataset.

To connect to the database, you’ll need the URI string that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the Database URLs for whichever database you’re using.

For SQLite, it is:

>>> uri = "sqlite:///us_covid_data.db"

Load the table by passing the table name and URI to from_sql():

>>> from datasets import Dataset

>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 54382
})

Then you can use all of 🤗 Datasets process features like filter() for example:

>>> ds.filter(lambda x: x["state"] == "California")

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to from_sql():

>>> from datasets import Dataset

>>> ds = Dataset.from_sql('SELECT * FROM states WHERE state="California";', uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 1019
})

Then you can use all of 🤗 Datasets process features like filter() for example:

>>> ds.filter(lambda x: x["cases"] > 10000)

PostgreSQL

You can also connect and load a dataset from a PostgreSQL database, however we won’t directly demonstrate how in the documentation because the example is only meant to be run in a notebook. Instead, take a look at how to install and setup a PostgreSQL server in this notebook!

After you’ve setup your PostgreSQL database, you can use the from_sql() method to load a dataset from a table or query.

< > Update on GitHub