
Feature request: engine.preload() #529

Open
flatsiedatsie opened this issue Aug 6, 2024 · 5 comments

@flatsiedatsie commented Aug 6, 2024

Perhaps related to this PR, but opposite:

I'd like to be able to easily ask WebLLM to download a second (or third, etc.) model into the cache while continuing to use the existing, already loaded model, and then get a callback when the second model has finished downloading so that I can inform the user they can now switch to the other model if they prefer.

Or is there already a recommended way to do this?

My current idea is to create a separate function that loads the new shards into the cache manually, outside of WebLLM. But I'd prefer to use WebLLM for this if such a feature already exists (I searched the repo but couldn't find one).

@Neet-Nestor (Contributor)

I think you can achieve this simply by creating a second instance of MLCEngine, calling engine.reload() on the new engine instance, and switching engines once it has finished loading.
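
A minimal sketch of this two-engine pattern (the import path and model IDs are placeholders; the promise returned by reload() serves as the "second model is ready" signal asked for above):

import { MLCEngine } from "@mlc-ai/web-llm";

// Keep serving requests from engineA while engineB downloads and loads a second model.
const engineA = new MLCEngine();
await engineA.reload("model_A"); // placeholder model ID

const engineB = new MLCEngine();
engineB.reload("model_B").then(() => {
  // Second model is ready; the UI can now offer switching to engineB.
  console.log("Second model loaded");
});

// Completions keep running on the first engine in the meantime.
const reply = await engineA.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
});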

@flatsiedatsie (Author)

Interesting idea, thanks.

The thing is, I don't always need to also start the model. For example, a user might want to go on a long airplane trip and pre-download some models from a list (kind of like pre-loading the map of Spain into OSMAND, or your map app of choice, before going on a holiday).

But maybe I can just forego switching to the new engine instance? Then the files will still be downloaded anyway, right?

For comparison, this is how Wllama does it. It's just a helper function that loads the chunks into the cache and then stops there.

@Neet-Nestor (Contributor)

@CharlieFRuan Following up on this: if I do something like the following to create and load an additional engine instance but never actually run completions on it, would that achieve the result of downloading additional models without causing GPU memory issues?

const engine1 = new MLCEngine();
const engine2 = new MLCEngine();

// reload() downloads and initializes each model; it is async.
await engine1.reload('model_1');
await engine2.reload('model_2');

// Only the first engine is ever used for completions.
engine1.chat.completions.create({ messages });

@CharlieFRuan (Contributor) commented Aug 15, 2024

Thanks for the thoughts and discussion @Neet-Nestor @flatsiedatsie! The code above will work: engine2 will not do completion and engine1 is not affected by engine2. However, engine2 will still load model_2 onto the WebGPU device, which puts more burden on the hardware than just "downloading a model". So the code above may fail if model_1 and model_2 together exceed the VRAM the device has.

Therefore, one way to "only download a model" without touching WebGPU would be:

On the tvmjs side:

On the webllm side:

  • Add an API similar to reload that only does the download part, without requiring a WebGPU device (a caller-side sketch follows below)
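
From the caller's side, the requested API might look like the sketch below; the method name preload is hypothetical and does not exist in web-llm at the time of this thread.

// Hypothetical download-only API (name and signature are not real web-llm API):
// resolves once all shards of the model are stored in the browser cache,
// without initializing a WebGPU device or uploading weights to it.
const engine = new MLCEngine();
await engine.preload("model_2"); // hypothetical: download into cache only
// A later reload("model_2") would then read from the cache instead of the network.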

@flatsiedatsie (Author)

I ended up coding a custom function that manually loads the files into the cache. I didn't expect splitting the downloading from the inference to have such a big effect, but it has helped simplify my code. It also means users can load and use models they have already downloaded while they wait for a new one to finish downloading.
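
For reference, a rough sketch of such a manual prefetch is below, using the browser Cache API. The manifest name (ndarray-cache.json), its record fields, and the cache name are assumptions about the MLC artifact layout, not the exact code used here.

// Rough sketch: fetch a model's weight shards into the Cache API so they are
// available offline for a later reload(). Manifest name, record fields, and
// cache name are assumptions about how the model artifacts are laid out.
async function prefetchModel(baseUrl, onProgress) {
  // baseUrl should be an absolute URL ending in "/" so relative paths resolve.
  const cache = await caches.open("webllm/model"); // assumed cache name
  const manifestUrl = new URL("ndarray-cache.json", baseUrl).href;
  await cache.add(manifestUrl);
  const manifest = await (await cache.match(manifestUrl)).json();

  const shardPaths = manifest.records.map((record) => record.dataPath);
  let done = 0;
  for (const path of shardPaths) {
    const url = new URL(path, baseUrl).href;
    if (!(await cache.match(url))) {
      await cache.add(url); // download and store the shard
    }
    onProgress?.(++done, shardPaths.length);
  }
}

Calling it with a model's base URL and a progress callback can drive a download indicator while the currently loaded model keeps serving completions.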
