Pandas
Pandas is a popular DataFrame library for data analysis.
To read a single Parquet file, pass its URL to the read_parquet
function, which loads it into a DataFrame:
import pandas as pd

df = (
    pd.read_parquet("https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet")
    .groupby('sign')['text']
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
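The chained expression above computes the mean text length per sign and keeps the five largest values. As a minimal sketch of the same logic, the helper below wraps it in a function and runs it on a tiny in-memory DataFrame so it can be tried without downloading anything. The function name `mean_text_length` and the sample rows are hypothetical; only the `sign` and `text` column names come from the example above.

```python
import pandas as pd


def mean_text_length(
    df: pd.DataFrame,
    group_col: str = "sign",
    text_col: str = "text",
    top: int = 5,
) -> pd.Series:
    """Return the `top` groups with the longest mean text length."""
    # Character count of every text entry.
    lengths = df[text_col].str.len()
    return (
        lengths.groupby(df[group_col])  # group the lengths by the key column
        .mean()
        .sort_values(ascending=False)
        .head(top)
    )


# Hypothetical stand-in for the real dataset (made-up rows).
sample = pd.DataFrame(
    {
        "sign": ["Aries", "Aries", "Leo"],
        "text": ["hello", "hi", "good morning"],
    }
)

result = mean_text_length(sample)
print(result)
```

When only a few columns are needed, passing them through the `columns` parameter of `read_parquet` (for instance `columns=["sign", "text"]`) may reduce the amount of data loaded.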
To read multiple Parquet files - for example, if the dataset is sharded - read each file with read_parquet and use the concat
function to combine the results into a single DataFrame:
urls = [
    "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
    "https://huggingface.co/datasets/tasksource/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet",
]

df = (
    pd.concat([pd.read_parquet(url) for url in urls])
    .groupby('sign')['text']
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
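One detail worth knowing about concat: by default it keeps each input's own row index, so the combined DataFrame can contain repeated index labels. This does not affect the groupby above, but it can matter for later label-based lookups. The sketch below illustrates the difference using two small in-memory frames as hypothetical stand-ins for shards (the rows are made up).

```python
import pandas as pd

# Hypothetical stand-ins for two shards of the same dataset.
shard_0 = pd.DataFrame({"sign": ["Aries", "Leo"], "text": ["hello", "good morning"]})
shard_1 = pd.DataFrame({"sign": ["Aries"], "text": ["hi"]})

# Default behaviour: each shard keeps its original index labels.
combined = pd.concat([shard_0, shard_1])

# ignore_index=True assigns a fresh 0..n-1 index to the result.
reindexed = pd.concat([shard_0, shard_1], ignore_index=True)

print(combined.index.tolist())
print(reindexed.index.tolist())
```

Whether to pass `ignore_index=True` depends on whether the per-shard row positions carry any meaning for the analysis.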