Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a more scalable full text search via SQLite #63

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

phiresky
Copy link

@phiresky phiresky commented Apr 10, 2022

I implemented a POC for SQLite-based full text search.

You can try a demo here:

https://phiresky.github.io/replayweb.page/?source=https://phiresky.github.io/replayweb.page/examples/netpreserve-twitter-4k-minimal.wacz#view=pages&query=фізичної+юридичної

It's implemented in the service worker. The api is a GET to /w/api/c/<collectionid>/sqliteFtsSearch?matchString=foo&limit=10 which responds with a ndjson stream of rows and progress events:

image

This means you can start showing results as more are found, before the search is complete.

If there's an error, a {"type": "error", "message": "..."} object is returned in the stream.

Here's the changes made to the frontend (outside of this PR): phiresky/replayweb.page@202673f

Notes:

  • the script to convert a pages.jsonl to a pages-fts.sqlite3 file is in [src/sqlite-fts/create-sqlite-fts.ts].

  • i added the full text index for "dako-gov-ua.wacz" to the netpreserve-twitter wacz instead so I could push it to github as an example - the search works but the warc files are obviously missing.

  • i did not implement detection of presence of the sqlite index. right now it just assumes /pages/extraPages-fts.sqlite3 exists.

  • right now everything stays cached in ram, there needs to be some eviction (right now you need to evict by unloading the service worker manually)

  • right now I directly pass the query to the sqlite MATCH syntax. that means the following mean different things:

    • "foo bar" vs foo bar
    • there's more specific syntax for prefixes etc, see https://www.sqlite.org/fts5.html
    • probably for actual use we want to always just search for a set of single words (not phrases)
  • I created two variants: a "minimal" FTS index and a "full" index. the minimal index only allows for searching for any number of words in the page content but can't filter by closeness of words, by occurence count, sort by BM25 relevance, or even display the page contents.

  • We'll need a larger pages.jsonl to really evaluate this. The ones you sent are too small. With the pages.jsonl from dako-gov-ua.wacz these are the stats:

    • original extraPages.jsonl: 49MB

    • extraPages.jsonl(zipped): 3.5MB

    • 4kB pages, minimal index (search in page content but can only show id, title and url, not page content itself):

      • size of extraPages-fts.sqlite3: 2.2MB
      • search for це запити фізичної юридичної: 16 requests, 127kB
      • search for ВИДИ ЗАПИТІВ: 13 requests, 111 kB
      • search for Соціально правові after searching for ВИДИ ЗАПИТІВ: 3 requests, 20kB total. this shows that after the initial query a lot of needed data is already fetched, even when the text is completely different. for more similar queries it has to fetch even less (or nothing).
    • 4kB pages, full index:

    • 32kB pages, full index:

      • size of extraPages-fts.sqlite3: 30.5MB
      • search for це запити фізичної юридичної: 19 requests, 635kB
      • search for ВИДИ ЗАПИТІВ: 16 requests, 537kB
      • search for "Соціально-правові" after searching for ВИДИ ЗАПИТІВ: 4 requests, 130kB.
  • I had to modify the wa-sqlite library to add -DSQLITE_ENABLE_FTS5 to WASQLITE_DEFINES, that's why I included a wasm binary of that library.

@phiresky
Copy link
Author

By the way (just thought of this), if the minimal index is enough then another option for "medium-sized" archives is to simply compress the sqlite3 file, download it completely, and load it as an in-memory database. That would remove most of the complexity of this code.

The compressed sqlite3 file for dako.gov.ua is 400kB compared to 3.5MB for your extraPages.jsonl file. It can't show snippets, but it can still search with "ok" quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant