Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed recursion error in SentenceTransformer #1428

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

yafshar
Copy link
Contributor

@yafshar yafshar commented Oct 16, 2024

What does this PR do?

In SentenceTransformer, the loss object stores the base model. To ensure compatibility with updated models (e.g., distributed or compiled models), we override the original model in the loss object. However, in distributed mode using DeepSpeed with PeftModel, this process caused a recursion error. This commit addresses the issue by properly handling the model override to prevent recursion.

This PR addresses the root causes of the issue using PeftModel, which has been tried in #1400 and it does not affect other cases non PeftModel with or without DeepSpeed

>>> cd ~/optimum-habana/examples/sentence-transformers-training/sts
>>> python ../../gaudi_spawn.py --use_deepspeed --world_size 2 training_stsbenchmark.py --peft
{'train_runtime': 31.3826, 'train_samples_per_second': 183.191, 'train_steps_per_second': 5.704, 'train_loss': 0.10181788492469149, 'epoch': 1.0, 'memory_allocated (GB)': 0.57, 'max_memory_allocated (GB)': 0.58, 'total_memory_available (GB)': 94.62}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:31<00:00,  5.70it/s]
2024-10-16 12:37:39 - EmbeddingSimilarityEvaluator: Evaluating the model on the sts-test dataset:
2024-10-16 12:37:39 - EmbeddingSimilarityEvaluator: Evaluating the model on the sts-test dataset:
2024-10-16 12:37:42 - Cosine-Similarity :       Pearson: 0.7432 Spearman: 0.7160
2024-10-16 12:37:42 - Manhattan-Distance:       Pearson: 0.7277 Spearman: 0.7021
2024-10-16 12:37:42 - Euclidean-Distance:       Pearson: 0.7268 Spearman: 0.7011
2024-10-16 12:37:42 - Dot-Product-Similarity:   Pearson: 0.5973 Spearman: 0.5729
2024-10-16 12:37:42 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/final
2024-10-16 12:37:42 - Cosine-Similarity :       Pearson: 0.7432 Spearman: 0.7160
2024-10-16 12:37:42 - Manhattan-Distance:       Pearson: 0.7277 Spearman: 0.7021
2024-10-16 12:37:42 - Euclidean-Distance:       Pearson: 0.7268 Spearman: 0.7011
2024-10-16 12:37:42 - Dot-Product-Similarity:   Pearson: 0.5973 Spearman: 0.5729
2024-10-16 12:37:42 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/final
2024-10-16 12:37:43 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/merged
2024-10-16 12:37:43 - Save model to output/training_stsbenchmark_distilbert-base-uncased-2024-10-16_12-36-54/merged
[2024-10-16 12:37:51,162] [INFO] [launch.py:351:main] Process 1630 exits successfully.
[2024-10-16 12:37:51,162] [INFO] [launch.py:351:main] Process 1629 exits successfully.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Fixed recursion error with PeftModel in distributed mode with DeepSpeed
Copy link
Contributor

@kplau1128 kplau1128 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@yafshar
Copy link
Contributor Author

yafshar commented Oct 16, 2024

>>> python -m pytest tests/sentence_transformers/test_training_stsbenchmark.py
2 passed, 12 warnings in 104.82s (0:01:44)

>>> python -m pytest tests/sentence_transformers/test_training_nli.py
2 passed, 7 warnings in 94.69s (0:01:34)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants