Data mixer Integration #2240

Draft · wants to merge 7 commits into base: main
46 changes: 46 additions & 0 deletions docs/source/customization.mdx
@@ -161,3 +161,49 @@ When training large models, you should better handle the CUDA cache by iterative
```python
training_args = DPOConfig(..., optimize_cuda_cache=True)
```

## Mixing Datasets with the CLI

You can mix multiple datasets for training by providing a JSON config that specifies both the training and testing sets.

### Config Structure
The JSON config should have two keys: `"train"` and `"test"`, each containing a list of datasets to mix. For each dataset, you must provide the following five fields in order:
1. **Path**: The path to the dataset (e.g., `"lighteval/mmlu"`).
2. **Name**: The subset of the dataset (if applicable). If there is no subset, pass `null`.
3. **Split**: The data split you want to use (e.g., `"train"`, `"test"`, `"validation"`).
4. **Column**: The column from the dataset to use (e.g., `"question"`, `"input"`).
5. **Proportion**: The proportion of the dataset to use (e.g., `0.5` for 50%).

Here’s an example configuration:

```json
{
  "train": [
    ["lighteval/mmlu", "anatomy", "test", "question", 0.5],
    ["lukaemon/bbh", "web_of_lies", "test", "input", 0.3],
    ["nakamoto-yama/us-colleges-universities", null, "train", "alias", 0.1]
  ],
  "test": [
    ["lighteval/mmlu", "anatomy", "validation", "question", 1],
    ["lukaemon/bbh", "web_of_lies", "test", "input", 0.5],
    ["nakamoto-yama/us-colleges-universities", null, "train", "alias", 0.4]
  ]
}
```

In this example:
- For the training set, 50% of the `"question"` column from the `"anatomy"` subset of `lighteval/mmlu` is used.
- For the testing set, 100% of the `"question"` column from the `"validation"` split of the same dataset is used.
- The dataset `"nakamoto-yama/us-colleges-universities"` has no subset, so `null` is passed as the second value. (The sketch after this list shows how a single entry is resolved.)
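
To make the proportions concrete, here is roughly how the mixer resolves a single config entry, mirroring the `data_mixer_from_json` helper this PR adds in `trl/utils.py` (the dataset names come from the example config above):

```python
from datasets import load_dataset

# Resolve one config entry: ["lighteval/mmlu", "anatomy", "test", "question", 0.5]
dataset = load_dataset(path="lighteval/mmlu", name="anatomy", split="test")
num_samples = int(len(dataset) * 0.5)        # keep 50% of the rows
subset = dataset.select(range(num_samples))  # the first num_samples examples
texts = subset["question"]                   # only the requested column is kept
```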

### Using the Mixer Config in CLI

To use the mixer config with the CLI, provide the path to the JSON config using the `--mixer_config` flag. For example:

```bash
trl sft --model_name_or_path MODEL_NAME \
    --mixer_config PATH_TO_MIXER_CONFIG \
    --output_dir OUTPUT_DIR
```

Replace `MODEL_NAME`, `PATH_TO_MIXER_CONFIG`, and `OUTPUT_DIR` with your desired model, the path to the JSON config file, and the directory for your output, respectively.
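
The same mixing is also available from Python, since this PR exposes `data_mixer_from_json` in `trl.utils`. A minimal sketch, assuming the example config above is saved as `mixer_config.json` (a placeholder path):

```python
from trl.utils import data_mixer_from_json

# Returns a DatasetDict with "train" and "test" splits; every example is
# stored under a single "text" column ("mixer_config.json" is a placeholder path).
dataset = data_mixer_from_json("mixer_config.json")

print(dataset)
print(dataset["train"][0]["text"])
```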
9 changes: 8 additions & 1 deletion examples/scripts/sft.py
@@ -49,6 +49,8 @@
from datasets import load_dataset
from transformers import AutoTokenizer

from trl.utils import data_mixer_from_json

from trl import (
    ModelConfig,
    ScriptArguments,
@@ -87,7 +89,12 @@
    ################
    # Dataset
    ################
    dataset = load_dataset(script_args.dataset_name)
    # Use the data mixer if a mixer config is provided
    if script_args.mixer_config:
        dataset = data_mixer_from_json(script_args.mixer_config)
        training_args.dataset_text_field = "text"
    else:
        dataset = load_dataset(script_args.dataset_name)

    ################
    # Training
2 changes: 1 addition & 1 deletion trl/commands/cli.py
@@ -99,7 +99,7 @@ def train(command_name):
encoding="utf-8",
cwd=os.getcwd(),
env=os.environ.copy(),
capture_output=True,
capture_output=False,
)
    except (CalledProcessError, ChildProcessError) as exc:
        console.log(f"TRL - {command_name.upper()} failed! See the logs above for further details.")
39 changes: 38 additions & 1 deletion trl/utils.py
@@ -15,6 +15,8 @@
from dataclasses import dataclass
from typing import Optional

import json

from datasets import Dataset, DatasetDict, load_dataset

@dataclass
class ScriptArguments:
@@ -29,16 +31,51 @@ class ScriptArguments:
            Dataset split to use for evaluation.
        config (`str` or `None`, *optional*, defaults to `None`):
            Path to the optional config file.
        mixer_config (`str` or `None`, *optional*, defaults to `None`):
            Path to the optional data mixer config file.
        gradient_checkpointing_use_reentrant (`bool`, *optional*, defaults to `False`):
            Whether to apply `use_reentrant` for gradient_checkpointing.
        ignore_bias_buffers (`bool`, *optional*, defaults to `False`):
            Debug argument for distributed training. Fix for DDP issues with LM bias/mask buffers - invalid scalar type,
            inplace operation. See https://github.com/huggingface/transformers/issues/22482#issuecomment-1595790992.
    """

    dataset_name: str
    dataset_name: Optional[str] = None
    dataset_train_split: str = "train"
    dataset_test_split: str = "test"
    config: Optional[str] = None
    mixer_config: Optional[str] = None
    gradient_checkpointing_use_reentrant: bool = False
    ignore_bias_buffers: bool = False


# Use a JSON mixer config file to create a mixed dataset
def data_mixer_from_json(json_path: str) -> DatasetDict:
    # Load the JSON config file
    with open(json_path) as f:
        config = json.load(f)

    # Hold the sampled datasets for each split
    sampled_datasets_dict = {}

    # Iterate over each split in the config
    for split, dataset_configs in config.items():
        sampled_datasets = []
        # Iterate over each dataset config in the split
        for dataset_config in dataset_configs:
            path, name, split_name, column, proportion = dataset_config
            # Load the dataset and sample the required proportion of examples
            dataset = load_dataset(path=path, name=name, split=split_name)
            num_samples = int(len(dataset) * proportion)
            dataset_slice = dataset.select(range(num_samples))
            column_data = dataset_slice[column]
            sampled_datasets.extend(column_data)

        # Combine the sampled examples into a single shuffled dataset
        combined_dataset = Dataset.from_dict({"text": sampled_datasets}).shuffle(seed=42)
        sampled_datasets_dict[split] = combined_dataset

    # Wrap the datasets in a DatasetDict with the appropriate splits
    return DatasetDict(sampled_datasets_dict)