adding fixes so transducer can work again #247

Open · bonham79 wants to merge 5 commits into master

Conversation

bonham79 (Collaborator):

Merges: #233 #197

Fixes: #192 #191

Dependent on: CUNY-CL/maxwell#17

Summary: I fixed maxwell so that TQDM is no longer a property of the SED parameters. This allows the expert module to be pickled again, which in turn allows multi-GPU training of the transducer.
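For context, the failure mode and the fix, as a minimal sketch (not the actual maxwell code):

```python
import pickle

import tqdm


class BadParams:
    """Sketch of the old failure: the bar is part of the object's state."""

    def __init__(self):
        # tqdm holds a handle to stderr (and a lock), which pickle cannot
        # serialize, so any object storing a bar becomes unpicklable.
        self.progress = tqdm.tqdm(total=10)


class GoodParams:
    """Sketch of the fix: create the bar locally, never store it."""

    def train(self):
        for _ in tqdm.tqdm(range(10)):
            pass


pickle.dumps(GoodParams())  # Fine.
pickle.dumps(BadParams())  # Raises TypeError: cannot pickle the stream.
```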

In the process, I changed how the expert module is initialized so that it just copies the index vocabulary from the dataloader passed to it. Now you only need to pass an index to the action vocabulary, and everything is managed in the backend. This allows free initialization of expert modules from checkpoints and thus skips epochs of EM when resuming from a checkpoint. (I do the same thing we do with indexes: write the SED parameters to the experiment directory and load them when initializing the model.)
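The initialization pattern now looks roughly like this sketch; it mirrors snippets quoted later in this conversation, and anything beyond those is an assumption:

```python
class ActionVocabulary:
    """Builds the action vocabulary straight from an index.

    No dataset is needed: the index's target vocabulary is copied over, so
    an expert can be reconstructed freely when resuming from a checkpoint.
    """

    def __init__(self, index):
        self.target_characters = set()
        self.encode_actions([index.unk_idx])  # Unknown character decoding.
        # Use the index from the dataset to create the action vocabulary.
        self.encode_actions([index(t) for t in index.target_vocabulary])

    def encode_actions(self, symbols) -> None:
        # Placeholder: the real method registers edit actions per symbol.
        self.target_characters.update(symbols)
```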

I also added new flags so that you can skip EM training for the expert entirely. This is for an upcoming change in which the transducer no longer depends on having an oracle function. (I've found through training that the SED actually doesn't add that much.) It also allows the creation of dummy experts that just hold the action vocabulary. I added error checks to prevent unsafe use; feel free to point out more.
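For illustration, the kind of guard this implies (the flag name appears later in this conversation; the default and the error message are made up):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# 0 means: skip EM and reuse previously written SED parameters.
parser.add_argument("--oracle_em_epochs", type=int, default=10)
args = parser.parse_args()

sed_params_path = "experiment/sed_params.pkl"  # Hypothetical location.
if args.oracle_em_epochs == 0 and not os.path.exists(sed_params_path):
    # Error check against unsafe use: no saved parameters, no EM requested.
    raise FileNotFoundError(
        f"No SED parameters at {sed_params_path}; "
        "train first with --oracle_em_epochs > 0."
    )
```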

I also moved Adam's changes from #233 into the trainer so that there's no weird attribute managing going on in the init anymore. (It turns out that checkpointing pickles the kwargs dict, so simply adding the action vocabulary was creating too large an embedding space.)

I ran experiments over the Polish data and was able to write predictions fairly easily. The only major issue is that we're wasting some parameters on creating target vocabulary embeddings that will never be used, but that's low on the efficiency stack.

kylebgorman (Contributor) left a comment:

Looks good to me, but I have some questions and style notes. Thanks, Travis.

actions.encode_actions(source)

actions = ActionVocabulary(unk_idx=train_data.index.unk_idx)
assert data.has_target, """Passed dataset with no target to expert
kylebgorman (Contributor):

These are exceptions elsewhere; why is this an assertion?

bonham79 (Collaborator, author):

This is more of a sanity check to me. It makes more sense to write a single assert line than to define an error class, do a conditional check, and then raise. Just an economy thing, I guess.
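For contrast, the two styles under discussion as a runnable sketch (`Error` stands in for whatever exception class the codebase defines):

```python
class Error(Exception):
    """Stand-in for the project's exception class."""


class Dataset:
    has_target = True  # Flip to False to trigger either check.


data = Dataset()

# Assert style, as in this PR (stripped when running `python -O`, so it
# suits internal sanity checks better than user-input validation):
assert data.has_target, "Passed dataset with no target to expert"

# Exception style, as used elsewhere (always enforced):
if not data.has_target:
    raise Error("Passed dataset with no target to expert")
```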

@@ -557,11 +560,19 @@ def predict_step(self, batch: Tuple[torch.tensor], batch_idx: int) -> Dict:

def convert_prediction(self, prediction: List[List[int]]) -> torch.Tensor:
"""Converts prediction values to tensor for evaluator compatibility."""
# FIXME: the two steps below may be partially redundant.
kylebgorman (Contributor):

What are the "two steps" referred to here?

bonham79 (Collaborator, author):

That's from Adam's PR, no comment.

kylebgorman (Contributor):

Ah, let's remove it then since we don't know what it means.

Adamits (Collaborator):

I think this is copied from #233

IIRC, I meant that looping and stacking predictions and then calling util.pad_tensor_after_eos may do some redundant things that could be cleaned up at some point.

bonham79 (Collaborator, author):

So should we leave the comment to clean up later or just remove now?
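For reference, a rough sketch of the two steps in question (the padding logic is a guess at what util.pad_tensor_after_eos does, not its actual implementation):

```python
from typing import List

import torch

PAD_IDX, EOS_IDX = 0, 1  # Illustrative indices.


def convert_prediction(prediction: List[List[int]]) -> torch.Tensor:
    # Step 1: loop over the predicted sequences, right-pad each one to a
    # common length, and stack them into a single tensor.
    max_len = max(len(seq) for seq in prediction)
    rows = [
        torch.tensor(seq + [PAD_IDX] * (max_len - len(seq)))
        for seq in prediction
    ]
    stacked = torch.stack(rows)
    # Step 2: mask everything after the first EOS with padding; note this
    # re-walks each row, which is the suspected redundancy with step 1.
    for row in stacked:
        eos = (row == EOS_IDX).nonzero()
        if len(eos):
            row[int(eos[0]) + 1:] = PAD_IDX
    return stacked


print(convert_prediction([[5, 4, 1, 9], [7, 1]]))
```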

@@ -32,19 +37,24 @@ class ActionVocabulary:
start_vocab_idx: int
target_characters: Set[Any]
Adamits (Collaborator):

Why Any and not str?

bonham79 (Collaborator, author):

Technically, the edit actions can accept any hashable symbol: strings, ints, even tuples. So Any is a better representation of their coverage, given that maxwell is also symbol-agnostic.
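Illustratively (arguably typing.Hashable would be an even tighter annotation than Any):

```python
from typing import Any, Set

# All of these are valid symbol inventories for the edit actions;
# Set[str] would only describe the first one.
chars: Set[Any] = {"a", "b", "c"}
codepoints: Set[Any] = {97, 98, 99}
tagged: Set[Any] = {("a", "HIGH"), ("b", "LOW")}
```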

self.target_characters = set()
self.encode_actions([unk_idx]) # Sets unknown character decoding.
# Use index from dataset to create action vocabulary.
self.encode_actions([index(t) for t in index.target_vocabulary])
Adamits (Collaborator):

Can we start an issue to document exactly what's going on here? encode_actions converts the vocab into Actions and stores them in a separate vocabulary, right?

bonham79 (Collaborator, author):

Yeah, it needs to track all potential edit actions for a given symbol.
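Schematically, something like the following sketch; these action classes are illustrative stand-ins, not maxwell's actual API:

```python
from dataclasses import dataclass
from typing import Any, Set


@dataclass(frozen=True)
class Ins:
    """Insert `symbol` into the output."""

    symbol: Any


@dataclass(frozen=True)
class Sub:
    """Rewrite the current source symbol as `symbol`."""

    symbol: Any


def encode_actions(vocabulary: Set[Any], symbols) -> None:
    # Register every edit action that could produce each target symbol.
    for symbol in symbols:
        vocabulary.add(Ins(symbol))
        vocabulary.add(Sub(symbol))


vocabulary: Set[Any] = set()
encode_actions(vocabulary, ["a", "b"])  # Ins/Sub actions for each symbol.
```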

SED training over the default data sampling is expensive.
Training is quicker if tensors are converted to lists.
For efficiency, we encode action vocabulary simultaneously.
We want just the encodings without BOS or EOS tokens. This
Adamits (Collaborator):

Can we add a more general comment here before this for the unfamiliar user: this basically converts the dataset into a format that is usable by the Expert -- is that right?

bonham79 (Collaborator, author):

We just want the raw symbol strings; we don't want to worry about BOS and EOS being encoded by maxwell.

What comment would make sense to you? (I'm too familiar with it to write a good user-friendly docstring.)


@@ -273,8 +277,16 @@ def get_model_from_argparse_args(
source_attention_heads=args.source_attention_heads,
source_encoder_cls=source_encoder_cls,
start_idx=datamodule.index.start_idx,
target_vocab_size=datamodule.index.target_vocab_size,
vocab_size=datamodule.index.vocab_size,
target_vocab_size=(
Adamits (Collaborator):

So now we orchestrate the vocab sizes once in the trainer, which already has a handle on the initialized expert? This is much nicer than whatever I was trying to do before.

It still feels somewhat clunky, I think, but that is an effect of the single-embedding-matrix updates not aligning easily with the expert.

bonham79 (Collaborator, author):

Yeah, it's still a bit clunky, but at least the clunkiness is occurring in the train script (where it's localized). A good next issue is probably to break the train script up some more so it's less of a monolith.
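Schematically, the orchestration now amounts to something like this sketch (the attribute names are assumptions based on the diff above):

```python
def compute_target_vocab_size(expert, index) -> int:
    """Chooses the output vocabulary size once, in the trainer.

    For the transducer, the output space is the expert's action
    vocabulary; for other models, it is the index's target vocabulary.
    """
    if expert is not None:
        return len(expert.actions)
    return index.target_vocab_size
```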

)
sed_aligner.params.write_params(sed_params_path)
else:
sed_params = sed.ParamDict.read_params(sed_params_path)
Adamits (Collaborator):

Am I understanding correctly that, in order to use an existing SED model, the user would specify 0 or None for oracle_em_epochs? Avoiding rerunning EM every time is a great feature, but I wonder if we can think through a more intuitive user interface for it; this feels a bit buried. Maybe you've already put some thought into this, though, so please let me know if you think this is already a good interface.

Adamits (Collaborator):

Oh, sorry, I see the comment in the method header. I missed it before.

bonham79 (Collaborator, author):

I just added a bool to the train script that's triggered if an SED file already exists. I think this is a bit cleaner. Thoughts?
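i.e., roughly this sketch (train_sed_with_em is a hypothetical helper; the read/write calls match the diff above):

```python
import os

from maxwell import sed  # Assumed import path.

sed_params_path = "experiment/sed_params.pkl"  # Hypothetical location.

# Reuse saved SED parameters when they exist; otherwise run EM and write
# them out for the next run.
if os.path.exists(sed_params_path):
    sed_params = sed.ParamDict.read_params(sed_params_path)
else:
    sed_aligner = train_sed_with_em(train_data, epochs=args.oracle_em_epochs)
    sed_aligner.params.write_params(sed_params_path)
```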

sed_params_path (str): path to read/write location of sed parameters.
If epochs > 0, this is a write path.
If epochs == 0, this is a read path.
If empty string then creates 'dummy' expert.
Adamits (Collaborator):

Maybe add a comment here or in the README about what behavior a 'dummy' expert entails.

bonham79 (Collaborator, author):

Removed the dummy expert. Not important for this go-around.

Adamits (Collaborator) commented on Sep 16, 2024:

Thanks for these changes + cleaning this stuff up!!

> Summary: I fixed maxwell so that TQDM is no longer a property of the SED parameters. This allows the expert module to be pickled again, which in turn allows multi-GPU training of the transducer.

Awesome!

> In the process, I changed how the expert module is initialized so that it just copies the index vocabulary from the dataloader passed to it. Now you only need to pass an index to the action vocabulary, and everything is managed in the backend. This allows free initialization of expert modules from checkpoints and thus skips epochs of EM when resuming from a checkpoint. (I do the same thing we do with indexes: write the SED parameters to the experiment directory and load them when initializing the model.)

  1. "is managed in the backend" means in the expert.py code? 2. Are the expert modules deterministic given a dataset? For the index, this is the case and it will be created and written every time you run an experiment I think right? I believe we only load it when it is explicitly asked for in the args (e.g. for inference), though correct me if i'm wrong. So here, I guess if we default to epochs > 0, then we just need to put something in the README that you can re-use the expert by setting epochs = 0 (and then maybe we raise an error if there is not expert where we expect it to be in this case.).

> I also moved Adam's changes from #233 into the trainer so that there's no weird attribute managing going on in the init anymore. (It turns out that checkpointing pickles the kwargs dict, so simply adding the action vocabulary was creating too large an embedding space.)

Nice, thanks. Yeah, I was worried about that, but I was in a big rush trying to make a fix when I made the PR and then promptly disappeared. It looks like you cleaned up some naming, etc. as well, so is it safe to assume that you have validated that those are the correct changes, or should I do more testing?

> I ran experiments over the Polish data and was able to write predictions fairly easily. The only major issue is that we're wasting some parameters on creating target vocabulary embeddings that will never be used, but that's low on the efficiency stack.

How was the accuracy w/ and w/out features?

Adamits mentioned this pull request on Sep 24, 2024.
Signed-off-by: Bonham79 <[email protected]>
bonham79 (Collaborator, author):

1. "is managed in the backend" means in the expert.py code? 2. Are the expert modules deterministic given a dataset? For the index, this is the case and it will be created and written every time you run an experiment I think right? I believe we only load it when it is explicitly asked for in the args (e.g. for inference), though correct me if i'm wrong. So here, I guess if we default to epochs > 0,  then we just need to put something in the README that you can re-use the expert by setting epochs = 0 (and then maybe we raise an error if there is not expert where we expect it to be in this case.).

Yep, expert.py manages it. No dataset is passed to the expert anymore; it just reads from the index file.

Indexes are pickled between runs, no? So it should be deterministic with new initialization. Regardless, the expert itself is pickled now, so the weights remain the same between runs.

> Nice, thanks. Yeah, I was worried about that, but I was in a big rush trying to make a fix when I made the PR and then promptly disappeared. It looks like you cleaned up some naming, etc. as well, so is it safe to assume that you have validated that those are the correct changes, or should I do more testing?

I've validated, but I'm always a fan of triple-checking.

> How was the accuracy w/ and w/out features?

I forget offhand, but they were on par with my usual runs on Polish.

Signed-off-by: Bonham79 <[email protected]>

Successfully merging this pull request may close these issues.

TQDM Error with multi GPU Transducer
3 participants