Releases: huggingface/datasets

2.16.1

30 Dec 16:46
7b2bcd7

Bug fixes

  • Fix dl_manager.extract returning FileNotFoundError by @lhoestq in #6543
    • Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
  • Fix custom configs from script by @lhoestq in #6544
    • Fix bug where loading a dataset from a loading script with custom arguments would fail
    • e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")
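A minimal sketch of a call that exercises both fixes above, assuming datasets 2.16.1 is installed and the Hugging Face Hub is reachable. The helper name load_ted_talks and the relative directory hf_cache are illustrative assumptions, not part of the library; the dataset name and custom arguments are the ones quoted in the note.

```python
# Custom config arguments taken from the release note example.
TED_TALKS_KWARGS = {"language_pair": ("ja", "en"), "year": "2015"}


def load_ted_talks(cache_dir="hf_cache"):
    """Load ted_talks_iwslt with custom arguments and a relative cache_dir.

    "hf_cache" is a hypothetical relative path chosen for illustration.
    """
    # Imported lazily so this sketch can be read and imported
    # without the datasets package installed.
    from datasets import load_dataset

    # This dataset relies on a loading script, so datasets 2.16.x may
    # show the custom-code warning described in the 2.16.0 notes.
    return load_dataset("ted_talks_iwslt", cache_dir=cache_dir, **TED_TALKS_KWARGS)
```

Calling load_ted_talks() downloads data, so it is left uncalled here.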

Full Changelog: 2.16.0...2.16.1

2.16.0

22 Dec 14:21
a85fb52

Security features

  • Add trust_remote_code argument by @lhoestq in #6429
    • Some Hugging Face datasets contain custom code that must be executed to load the dataset correctly. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning tells the user about the custom code; passing the argument trust_remote_code=True avoids this message in the future.
    • Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
    • Setting the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 already disables custom code by default, without waiting for the next release of datasets
  • Use parquet export if possible by @lhoestq in #6448
    • This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
    • You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
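The settings above can be sketched as follows, assuming datasets 2.16.x. The helper names parquet_export_url and load_trusted are hypothetical, and setting the environment variable before importing datasets rests on the assumption that the library reads it at import time.

```python
import os
from urllib.parse import quote

# Opt out of custom code by default. Set before importing datasets
# (assumption: the variable is read when the library is imported).
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"


def parquet_export_url(repo_id):
    """Build the Parquet export URL pattern given in the release notes."""
    # safe="" makes quote() encode "/" as %2F, matching refs%2Fconvert%2Fparquet.
    ref = quote("refs/convert/parquet", safe="")
    return f"https://hf.co/datasets/{repo_id}/tree/{ref}"


def load_trusted(repo_id, **kwargs):
    """Load a dataset whose custom code has already been reviewed."""
    # Lazy import keeps this sketch importable without datasets installed.
    from datasets import load_dataset

    # Explicit per-call opt-in to running the repository's custom code.
    return load_dataset(repo_id, trust_remote_code=True, **kwargs)
```

Passing trust_remote_code=True per call keeps the opt-in visible next to the dataset it applies to, while the environment variable sets the default for everything else.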

Features

  • Webdataset dataset builder by @lhoestq in #6391
  • Implement get dataset default config name by @albertvillanova in #6511
  • Lazy data files resolution and offline cache reload by @lhoestq in #6493
    • This speeds up the load_dataset step that lists the data files of big repositories (up to 100x), but requires huggingface_hub 0.20 or newer
    • Fix load_dataset reloading data from the cache even after the dataset was updated on Hugging Face
    • Reload a dataset from your cache even if you don't have an internet connection
    • New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
    • Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
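To make the new cache directory scheme concrete, here is a small sketch that assembles a path from its parts. The function no_script_cache_dir is a hypothetical helper written for illustration, not the library's internal implementation, and the example values are made up.

```python
from pathlib import PurePosixPath


def no_script_cache_dir(
    username,
    dataset_name,
    config_name,
    version,
    commit_sha,
    cache_root="~/.cache/huggingface/datasets",
):
    """Assemble a cache path following the scheme in the release notes."""
    # The user name and dataset name are joined with three underscores.
    repo_dir = f"{username}___{dataset_name}"
    # "~" is left unexpanded so the result mirrors the documented pattern.
    return PurePosixPath(cache_root) / repo_dir / config_name / version / commit_sha


# Hypothetical values, for illustration only.
example = no_script_cache_dir("user", "name", "default", "0.0.0", "abc123")
print(example)
```

Including the commit SHA in the path is what lets an updated dataset land in a fresh directory instead of being served from a stale one.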

General improvements and bug fixes

New Contributors

Full Changelog: 2.15.0...2.16.0

2.15.0

16 Nov 08:06
0caf912

What's Changed

New Contributors

Full Changelog: 2.14.7...2.15.0

2.14.7

15 Nov 08:19
bf02cff

Bug Fixes

New Contributors

Full Changelog: 2.14.6...2.14.7

2.14.6

24 Oct 08:15
06c3ffb

What's Changed

New Contributors

Full Changelog: 2.14.5...2.14.6

2.14.5

24 Oct 08:15
1a598a0

Bug fixes

Other improvements

New Contributors

Full Changelog: 2.14.4...2.14.5

2.13.2

06 Sep 08:29
98b1bdd

Bug fixes

Full Changelog: 2.13.1...2.13.2

2.14.4

08 Aug 15:52
53d55f3

Bug fixes

Full Changelog: 2.14.3...2.14.4

2.14.3

03 Aug 10:31
33f736e

Bug fixes

Full Changelog: 2.14.2...2.14.3

2.14.2

31 Jul 06:39

Bug fixes

Full Changelog: 2.14.1...2.14.2