
Problems about resuming from checkpoint for finetune_with_lora #330

Open

ypwang61 opened this issue Sep 4, 2024 · 1 comment

ypwang61 commented Sep 4, 2024

Hi, thanks for your great work. I ran into some errors when trying to resume my LoRA finetuning.

09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading states from output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600
09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading DeepSpeed Model and Optimizer
[rank0]: Traceback (most recent call last):
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 682, in <module>
[rank0]:     main()
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 573, in main
[rank0]:     accelerator.load_state(checkpoint_path)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/accelerate/accelerator.py", line 3064, in load_state
[rank0]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2759, in load_checkpoint
[rank0]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2809, in _load_checkpoint
[rank0]:     sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
[rank0]:     return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
[rank0]:     super().__init__(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
[rank0]:     self.check_ckpt_list()
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
[rank0]:     assert len(self.ckpt_list) > 0
[rank0]: AssertionError

This is what I have in the step_15600 directory:

adapter_config.json  adapter_model.safetensors  README.md

It looks like only the PEFT adapter files were saved, with none of the DeepSpeed engine state files (e.g. mp_rank_00_model_states.pt) that accelerator.load_state expects, so DeepSpeed's check_ckpt_list ends up with an empty checkpoint list and the assertion fails.
hamishivi (Collaborator) commented

Hi! The code doesn't currently support resuming training when doing LoRA training. We should add support for this (internally we usually just full-finetune). Feel free to help add it; otherwise it might take a little while for us to get to this, due to some upcoming deadlines. I'll leave the issue open to track this.
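
In the meantime, a rough sketch of a partial workaround: reload the adapter weights with PEFT and restart training from there. Note this does not restore the optimizer or LR-scheduler state (they were never saved), and the base-model name below is just a placeholder for whichever model you finetuned.

# Sketch of a manual resume path for an adapter-only checkpoint.
# Assumptions: peft + transformers are installed; the base model name
# is a placeholder.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# step_15600 contains adapter_config.json / adapter_model.safetensors,
# which is all PEFT needs to rebuild the LoRA modules and their weights.
model = PeftModel.from_pretrained(
    base_model,
    "output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600",
    is_trainable=True,  # keep the LoRA parameters trainable
)

# Rebuild the optimizer/dataloader and skip already-seen batches yourself;
# accelerator.load_state would normally restore those, but they were never
# saved here, so the optimizer effectively restarts from scratch.

This restores the adapter weights only, so it is an approximation of a true resume rather than an exact continuation.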
