mlcroissant
mlcroissant is a library to load datasets from Croissant metadata.
💡 Learn more about how to get the metadata from the dataset viewer API in the Get Croissant metadata guide.
Let’s start by parsing the Croissant metadata for the tasksource/blog_authorship_corpus dataset. Be sure to first install mlcroissant[parquet] and GitPython to be able to load Parquet files over the git+https protocol.
from mlcroissant import Dataset
ds = Dataset(jsonld="https://huggingface.co/api/datasets/tasksource/blog_authorship_corpus/croissant")
To read from the first subset (called a RecordSet in Croissant’s vocabulary), use the records method, which returns an iterator of dicts.
records = ds.records("default")
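Because records is an iterator, you can peek at a few rows without consuming the whole subset. The sketch below shows the pattern with itertools.islice. It uses a hypothetical stand-in generator, fake_records, instead of a real ds.records call so that it runs offline; the field values are made up for illustration.

```python
import itertools


def fake_records():
    """Hypothetical stand-in for ds.records("default"): yields dicts lazily."""
    signs = [b"Leo", b"Aries", b"Gemini"]
    for i in itertools.count():
        yield {
            "default/sign": signs[i % len(signs)],
            "default/text": b"post %d" % i,
        }


records = fake_records()

# Take only the first two records; nothing beyond them is generated.
first_two = list(itertools.islice(records, 2))
for record in first_two:
    print(record)
```

Note that an iterator is consumed as you read it: a later islice on the same records object continues from the third row rather than starting over.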
Finally, use pandas to compute your query on the first 100 rows:
import itertools
import pandas as pd
df = (
    pd.DataFrame(list(itertools.islice(records, 100)))
    .groupby("default/sign")["default/text"]
    .apply(lambda x: x.str.len().mean())
    .sort_values(ascending=False)
    .head(5)
)
print(df)
default/sign
b'Leo' 6463.500000
b'Capricorn' 2374.500000
b'Aquarius' 2303.757143
b'Gemini' 1420.333333
b'Aries' 918.666667
Name: default/text, dtype: float64
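The b'...' prefixes in the index show that the string fields came back as bytes. If you prefer plain strings, one option is to decode each record before building the DataFrame. This is a minimal sketch, assuming the values are UTF-8 encoded; decode_record is a hypothetical helper and not part of mlcroissant, and the default/age field in the sample is invented for illustration.

```python
def decode_record(record, encoding="utf-8"):
    """Return a copy of the record with bytes values decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in record.items()
    }


# Made-up sample record; non-bytes values are passed through unchanged.
sample = {"default/sign": b"Leo", "default/text": b"caf\xc3\xa9", "default/age": 25}
decoded = decode_record(sample)
print(decoded)
```

To apply it, pass [decode_record(r) for r in itertools.islice(records, 100)] to pd.DataFrame. Keep in mind that after decoding, str.len() counts characters rather than bytes, so the averages can differ slightly for non-ASCII text.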