ElixirDatasets
ElixirDatasets is a comprehensive library for accessing and managing datasets from Hugging Face Hub in Elixir. Inspired by the Python datasets library, it brings powerful dataset management capabilities to the Elixir ecosystem with seamless integration with Explorer DataFrames.
Features
- Easy Access to Hugging Face Hub - Load thousands of datasets with a single function call
- Explorer Integration - Automatic conversion to Explorer DataFrames for data manipulation
- Smart Caching - Intelligent local caching to avoid redundant downloads
- Streaming Support - Process large datasets without loading everything into memory
- Upload Datasets - Publish your own datasets to Hugging Face Hub
- Private Repositories - Full support for authentication and private datasets
- Multiple Formats - Support for CSV, Parquet, and JSONL files
Installation
Add elixir_datasets to your list of dependencies in mix.exs:
def deps do
  [
    {:elixir_datasets, "~> 0.1.0"}
  ]
end

Quick Start
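# Load the "train" split of a dataset from Hugging Face Hub
{:ok, [train_df]} = ElixirDatasets.load_dataset(
  {:hf, "cornell-movie-review-data/rotten_tomatoes"},
  split: "train"
)

# Load datasets from a local directory
{:ok, datasets} = ElixirDatasets.load_dataset({:local, "./data"})

# Stream a large dataset instead of loading it all into memory
{:ok, stream} = ElixirDatasets.load_dataset(
  {:hf, "stanfordnlp/imdb", subdir: "plain_text"},
  split: "train",
  streaming: true
)

# Consume only the first 100 elements of the stream
stream |> Enum.take(100) |> IO.inspect()

Examples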
All examples can be found in the examples directory.
- examples/usage_examples.livemd - Comprehensive usage examples of the elixir_datasets API
- examples/integration_examples.livemd - Examples demonstrating integration with other Elixir libraries like Nx, Axon, and Bumblebee
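
The Quick Start calls return Explorer DataFrames, so the usual Explorer functions apply to the result. The following is a minimal sketch of inspecting and filtering such a frame. It uses a small hand-built DataFrame as a stand-in for a downloaded split, and the column names `text` and `label` are assumptions for illustration, not the confirmed schema of any dataset.

```elixir
# Standalone script; only Explorer is needed for this sketch.
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF
alias Explorer.Series

# Stand-in for a loaded split. The "text" and "label" columns are
# assumed for illustration; a real dataset may use different names.
df =
  DF.new(
    text: ["a gripping film", "dull and overlong", "a quiet triumph"],
    label: [1, 0, 1]
  )

# Basic inspection: row count and column names.
IO.inspect(DF.n_rows(df), label: "rows")
IO.inspect(DF.names(df), label: "columns")

# Keep only the rows whose label equals 1.
positive = DF.filter_with(df, fn row -> Series.equal(row["label"], 1) end)
IO.inspect(DF.n_rows(positive), label: "positive rows")
```

With a real dataset, `df` would instead be one of the frames returned by `ElixirDatasets.load_dataset/2`, as in the Quick Start.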
Configuration
Environment Variables
- ELIXIR_DATASETS_CACHE_DIR - Custom cache directory
- ELIXIR_DATASETS_OFFLINE - Enable offline mode ("1" or "true")
- HF_TOKEN - Authentication token for private datasets
- [In-progress] HF_DEBUG - Enable debug logging ("1" or "true")
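
These variables are normally exported in the shell, but they can also be set from Elixir before any dataset is loaded. The sketch below assumes the library reads them at call time rather than at compile time, which this README does not state. `EnvFlags` is a hypothetical helper, not part of ElixirDatasets, that mirrors the "1" or "true" convention described above.

```elixir
# Hypothetical helper (not part of ElixirDatasets): treats "1" or "true"
# as enabled, matching the convention listed for the boolean variables.
defmodule EnvFlags do
  def enabled?(name) do
    System.get_env(name) in ["1", "true"]
  end
end

# Assumption: the library picks these up when load_dataset is called.
cache_dir = Path.join(System.tmp_dir!(), "elixir_datasets_cache")
System.put_env("ELIXIR_DATASETS_CACHE_DIR", cache_dir)
System.put_env("ELIXIR_DATASETS_OFFLINE", "1")

IO.inspect(EnvFlags.enabled?("ELIXIR_DATASETS_OFFLINE"), label: "offline mode")
IO.inspect(System.get_env("ELIXIR_DATASETS_CACHE_DIR"), label: "cache dir")
```

HF_TOKEN is deliberately left out of the script: keeping the token in the shell environment or a secrets manager avoids committing it to source control.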
Documentation
Full documentation is available on HexDocs. A version hosted on GitHub Pages tracks the current status of features still under development. To generate the documentation locally, run:
mix docs

Testing
MIX_ENV=test mix test

License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Radosław Rolka, Weronika Wojtas