
Problems about resuming from checkpoint for finetune_with_lora #330

Open

ypwang61 opened this issue Sep 4, 2024 · 1 comment

ypwang61 commented Sep 4, 2024

Hi, thanks for your great work. I ran into some errors when trying to resume my LoRA finetuning.

09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading states from output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600
09/04/2024 15:54:23 - INFO - accelerate.accelerator - Loading DeepSpeed Model and Optimizer
[rank0]: Traceback (most recent call last):
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 682, in <module>
[rank0]:     main()
[rank0]:   File "/homes/gws/ypwang61/Research/LLM/open-instruct/open_instruct/finetune.py", line 573, in main
[rank0]:     accelerator.load_state(checkpoint_path)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/accelerate/accelerator.py", line 3064, in load_state
[rank0]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2759, in load_checkpoint
[rank0]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2809, in _load_checkpoint
[rank0]:     sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list, checkpoint_engine=self.checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 43, in get_sd_loader
[rank0]:     return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 193, in __init__
[rank0]:     super().__init__(ckpt_list, version, checkpoint_engine)
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 55, in __init__
[rank0]:     self.check_ckpt_list()
[rank0]:   File "/homes/gws/ypwang61/miniconda3/envs/oi/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
[rank0]:     assert len(self.ckpt_list) > 0
[rank0]: AssertionError

This is what I have in the step_15600 directory:

adapter_config.json  adapter_model.safetensors  README.md

It looks like only the PEFT adapter files were saved, with none of the DeepSpeed engine state files (e.g. mp_rank_00_model_states.pt) that accelerator.load_state expects, so DeepSpeed's check_ckpt_list ends up with an empty checkpoint list and the assertion fails.
hamishivi (Collaborator) commented

Hi! The code doesn't currently support resuming training when doing LoRA training. We should add support for this (internally we usually just full-finetune). Feel free to help add it; otherwise it might take a little while for us to get to this, due to some upcoming deadlines. I'll leave the issue open to track this.
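
In the meantime, a rough sketch of a partial workaround: reload the adapter weights with PEFT and restart training from there. Note this does not restore the optimizer or LR-scheduler state (they were never saved), and the base-model name below is just a placeholder for whichever model you finetuned.

# Sketch of a manual resume path for an adapter-only checkpoint.
# Assumptions: peft + transformers are installed; the base model name
# is a placeholder.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# step_15600 contains adapter_config.json / adapter_model.safetensors,
# which is all PEFT needs to rebuild the LoRA modules and their weights.
model = PeftModel.from_pretrained(
    base_model,
    "output/tulu_v2_dolly_openorca_1M_v2_64_7B_lora/step_15600",
    is_trainable=True,  # keep the LoRA parameters trainable
)

# Rebuild the optimizer/dataloader and skip already-seen batches yourself;
# accelerator.load_state would normally restore those, but they were never
# saved here, so the optimizer effectively restarts from scratch.

This restores the adapter weights only, so it is an approximation of a true resume rather than an exact continuation.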
