Fixed recursion error in SentenceTransformer #1428

yafshar · 2024-10-16T12:18:04Z

What does this PR do?

In SentenceTransformer, the loss object stores a reference to the base model. To stay compatible with updated models (e.g., distributed or compiled models), we override the original model in the loss object. However, in distributed mode using DeepSpeed with a PeftModel, this override caused a recursion error. This commit fixes the issue by handling the model override properly so that it no longer recurses.

This PR addresses the root cause of the issue with PeftModel, which was previously attempted in #1400. It does not affect the other cases (non-PeftModel, with or without DeepSpeed).
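
For readers unfamiliar with this class of bug, here is a minimal, framework-free sketch of how a recursive "override the model inside the loss" walk can be guarded against unbounded recursion. It is not the code from this PR: `Node`, `override_model`, and the cyclic `parent` reference are hypothetical stand-ins, and the assumption that the recursion comes from a reference cycle is made only for illustration.

```python
class Node:
    """Hypothetical stand-in for a module that holds named children."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def override_model(loss, old_model, new_model, _seen=None):
    """Replace every reference to old_model found under loss with new_model.

    Visited objects are tracked by id, so a reference cycle cannot
    trigger unbounded recursion. The replacement itself is never
    descended into, even though it wraps old_model.
    """
    if _seen is None:
        _seen = set()
    if id(loss) in _seen:
        return loss
    _seen.add(id(loss))
    # Iterate over a snapshot because values are reassigned in the loop.
    for key, child in list(loss.children.items()):
        if child is old_model:
            loss.children[key] = new_model
        else:
            override_model(child, old_model, new_model, _seen)
    return loss


# Assumed setup: a wrapper that contains the base model, and a loss
# whose sub-object points back at the loss (a reference cycle).
base = Node("base")
wrapped = Node("wrapped")
wrapped.children["module"] = base
head = Node("head")
loss = Node("loss")
loss.children["model"] = base
loss.children["head"] = head
head.children["parent"] = loss  # cycle: loss -> head -> loss

override_model(loss, base, wrapped)
```

Without the `_seen` set, the walk above would bounce between `loss` and `head` until Python raises `RecursionError`. With it, the loss ends up holding the wrapper while the wrapper still holds the untouched base model.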

>>> cd ~/optimum-habana/examples/sentence-transformers-training/sts
>>> python ../../gaudi_spawn.py --use_deepspeed --world_size 2 training_stsbenchmark.py --peft
{'train_runtime': 31.3826, 'train_samples_per_second': 183.191, 'train_steps_per_second': 5.704, 'train_loss': 0.10181788492469149, 'epoch': 1.0, 'memory_allocated (GB)': 0.57, 'max_memory_allocated (GB)': 0.58, 'total_memory_available (GB)': 94.62}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:31<00:00,  5.70it/s]
2024-10-16 12:37:39 - EmbeddingSimilarityEvaluator: Evaluating the model on the sts-test dataset:
2024-10-16 12:37:39 - EmbeddingSimilarityEvaluator: Evaluating the model on the sts-test dataset:
2024-10-16 12:37:42 - Cosine-Similarity :       Pearson: 0.7432 Spearman: 0.7160
2024-10-16 12:37:42 - Manhattan-Distance:       Pearson: 0.7277 Spearman: 0.7021
2024-10-16 12:37:42 - Euclidean-Distance:       Pearson: 0.7268 Spearman: 0.7011
2024-10-16 12:37:42 - Dot-Product-Similarity:   Pearson: 0.5973 Spearman: 0.5729
2024-10-16 12:37:42 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/final
2024-10-16 12:37:42 - Cosine-Similarity :       Pearson: 0.7432 Spearman: 0.7160
2024-10-16 12:37:42 - Manhattan-Distance:       Pearson: 0.7277 Spearman: 0.7021
2024-10-16 12:37:42 - Euclidean-Distance:       Pearson: 0.7268 Spearman: 0.7011
2024-10-16 12:37:42 - Dot-Product-Similarity:   Pearson: 0.5973 Spearman: 0.5729
2024-10-16 12:37:42 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/final
2024-10-16 12:37:43 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/merged
2024-10-16 12:37:43 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/merged
[2024-10-16 12:37:51,162] [INFO] [launch.py:351:main] Process 1630 exits successfully.
[2024-10-16 12:37:51,162] [INFO] [launch.py:351:main] Process 1629 exits successfully.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Fixed recursion error with PeftModel in distributed mode with DeepSpeed

kplau1128

Looks good to me.

yafshar · 2024-10-16T20:07:40Z

>>> python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py
2 passed, 12 warnings in 104.82s (0:01:44)

>>> python -m pytest tests/sentence_transformers/test_training_nli.py
2 passed, 7 warnings in 94.69s (0:01:34)

Fixed recursion error in SentenceTransformer

e27eb51

Fixed recursion error with PeftModel in distributed mode with DeepSpeed

yafshar mentioned this pull request Oct 16, 2024

Fixed recursion error when uses both wrapped PEFT and DeepSpped #1400

Closed

libinta added synapse1.18 and removed synapse1.18 labels Oct 16, 2024

yafshar marked this pull request as ready for review October 16, 2024 17:05

yafshar requested a review from regisss as a code owner October 16, 2024 17:05

kplau1128 approved these changes Oct 16, 2024

yafshar added 3 commits October 17, 2024 08:52

Merge branch 'main' into fix_peft_ds_recursion

f23c195

Merge branch 'main' into fix_peft_ds_recursion

1d7ae3a

Merge branch 'main' into fix_peft_ds_recursion

2dfebda
