Pubmed Tutorial

- Pubmed tutorial, containing a README and 4 configuration files.
- Added a link in the main README.
- Docs have a README with correct links.
IntelLabs · Aug 13, 2024 · 6e1064d
1 parent 1ac2d8f
commit 6e1064d

Showing 10 changed files with 511 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -21,6 +21,10 @@ Clone locally and run:
 pip install -r requirements.txt
 ```
 
+### Quick Start
+
+For a simple, end-to-end example, see the [PubmedQA Tutorial](./docs/pubmed.md).
+
 ## Overview
 
 The RAG Foundry framework facilitates fast prototyping and experimentation with various RAG settings and configurations,

diff --git a/configs/paper/evaluation-pubmed.yaml b/configs/paper/evaluation-pubmed.yaml
@@ -0,0 +1,25 @@
+answer_processor:
+    _target_: ragfoundry.processing.answer_processors.regex.RegexAnswer
+    capture_pattern:
+    stopping_pattern:
+
+metrics:
+    - _target_: ragfoundry.evaluation.metrics.Classification
+      mapping:
+        "yes": 1
+        "no": 0
+        "maybe": 2
+      else_value: 2
+
+key_names:
+    generated: text
+    label: answers
+    query: query
+
+results_file: evaluation-pubmed-rag.yaml
+generated_file: pubmed-rag-test-generated.jsonl
+data_file: pubmed-rag-test.jsonl
+limit:
+use_wandb:
+experiment:
+wandb_entity:
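
The `Classification` metric above maps answer strings to class ids and sends anything unrecognized to `else_value`. The snippet below is a minimal sketch of that idea, not the library's implementation: the helper name `to_class_id` and the lowercase/strip normalization are assumptions made for illustration.

```python
from typing import Mapping


def to_class_id(text: str, mapping: Mapping[str, int], else_value: int) -> int:
    """Map a raw answer string to a class id; unknown answers get else_value.

    The strip/lower normalization is an assumption of this sketch.
    """
    return mapping.get(text.strip().lower(), else_value)


# Values copied from evaluation-pubmed.yaml.
mapping = {"yes": 1, "no": 0, "maybe": 2}
else_value = 2

# Hypothetical generated answers and gold labels.
generated = ["Yes", "no", "It is unclear"]
labels = ["yes", "yes", "maybe"]

pred_ids = [to_class_id(t, mapping, else_value) for t in generated]
label_ids = [to_class_id(t, mapping, else_value) for t in labels]

# Plain accuracy over the mapped class ids.
acc = sum(p == l for p, l in zip(pred_ids, label_ids)) / len(label_ids)
print(pred_ids, label_ids, acc)
```

With `else_value: 2`, any free-form answer that is not exactly one of the three labels is counted as "maybe".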
diff --git a/configs/paper/inference-pubmed.yaml b/configs/paper/inference-pubmed.yaml
@@ -0,0 +1,27 @@
+model:
+    _target_: ragfoundry.models.hf.HFInference
+    model_name_or_path: microsoft/Phi-3-mini-128k-instruct
+    load_in_4bit: false
+    load_in_8bit: true
+    device_map: auto
+    torch_dtype:
+    trust_remote_code: true
+    instruction: ragfoundry/processing/prompts/prompt_instructions/qa-yes-no.txt
+    instruct_in_prompt: false
+    lora_path: ./trained_model/checkpoint
+    generation:
+        do_sample: false
+        max_new_tokens: 50
+        max_length:
+        temperature:
+        top_k:
+        top_p:
+        return_full_text: false
+
+data_file: pubmed-rag-test.jsonl
+generated_file: pubmed-rag-test-generated.jsonl
+input_key: prompt
+generation_key: output
+target_key: answers
+limit:
+
diff --git a/configs/paper/processing-pubmed-context.yaml b/configs/paper/processing-pubmed-context.yaml
@@ -0,0 +1,41 @@
+name: pubmed_rag
+cache: true
+output_path: .
+steps:
+    - _target_: ragfoundry.processing.dataset_loaders.loaders.HFLoader
+      inputs: train
+      dataset_config:
+            path: bigbio/pubmed_qa
+            split: train
+
+    - _target_: ragfoundry.processing.dataset_loaders.loaders.HFLoader
+      inputs: test
+      dataset_config:
+            path: bigbio/pubmed_qa
+            name: pubmed_qa_labeled_fold0_source
+            split: test
+
+    - _target_: ragfoundry.processing.global_steps.sampling.ShuffleSelect
+      inputs: train
+      limit: 50000
+
+    - _target_: ragfoundry.processing.local_steps.common_datasets.PubMed
+      inputs: [train, test]
+
+    - _target_: ragfoundry.processing.local_steps.context.DocumentsJoiner
+      inputs: [train, test]
+      docs_key: positive_passages
+      k: 5
+
+    - _target_: ragfoundry.processing.local_steps.prompter.TextPrompter
+      inputs: [train, test]
+      prompt_file: ragfoundry/processing/prompts/qa.txt
+      output_key: prompt
+      mapping:
+            question: query
+            context: positive_passages
+
+    - _target_: ragfoundry.processing.global_steps.output.OutputData
+      inputs: [train, test]
+      prefix: pubmed-rag
+
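
The `DocumentsJoiner` and `TextPrompter` steps above first collapse the top `k` passages into one string and then fill a prompt template through the `mapping` of placeholder to dataset field. The following is a rough sketch of that flow under stated assumptions: the function names, the newline separator, and the template text are hypothetical, and the real contents of `qa.txt` are not shown in this diff.

```python
from typing import Any, Dict, Mapping


def join_documents(example: Mapping[str, Any], docs_key: str, k: int,
                   sep: str = "\n") -> Dict[str, Any]:
    """Return a copy of the example with the first k documents joined into one string."""
    out = dict(example)
    out[docs_key] = sep.join(example[docs_key][:k])
    return out


def build_prompt(example: Mapping[str, Any], template: str,
                 mapping: Mapping[str, str], output_key: str) -> Dict[str, Any]:
    """Fill template placeholders from example fields and store the result under output_key."""
    out = dict(example)
    values = {placeholder: example[field] for placeholder, field in mapping.items()}
    out[output_key] = template.format(**values)
    return out


# Hypothetical template; the actual prompt file may look different.
TEMPLATE = "Question: {question}\nContext: {context}"

example = {"query": "Does X help?", "positive_passages": ["a", "b", "c"]}
joined = join_documents(example, docs_key="positive_passages", k=2)
prompted = build_prompt(
    joined,
    TEMPLATE,
    mapping={"question": "query", "context": "positive_passages"},
    output_key="prompt",
)
print(prompted["prompt"])
```

Each step returns a new dictionary, so the original example is left untouched and the steps can be chained in any order that the keys allow.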
diff --git a/configs/paper/training-pubmed.yaml b/configs/paper/training-pubmed.yaml
@@ -0,0 +1,58 @@
+model:
+    _target_: ragfoundry.models.hf.HFTrain
+    model_name_or_path: microsoft/Phi-3-mini-128k-instruct
+    load_in_4bit: false
+    load_in_8bit: true
+    torch_dtype:
+    device_map:
+    trust_remote_code: true
+    lora:
+        bias: none
+        fan_in_fan_out: false
+        layers_pattern:
+        layers_to_transform:
+        lora_alpha: 16
+        lora_dropout: 0.1
+        peft_type: LORA
+        r: 16
+        target_modules:
+            - qkv_proj
+        task_type: CAUSAL_LM
+        use_rslora: true
+    completion_start: <|assistant|>
+    instruction_in_prompt:
+    max_sequence_len: 2000
+
+train:
+    output_dir: ./trained_model/
+    bf16: false
+    fp16: false
+    gradient_accumulation_steps: 2
+    group_by_length:
+    learning_rate: 1e-4
+    logging_steps: 10
+    lr_scheduler_type: cosine
+    max_steps: -1
+    num_train_epochs: 1
+    per_device_train_batch_size: 1
+    optim: paged_adamw_8bit
+    remove_unused_columns: true
+    save_steps: 20000
+    save_total_limit: 1
+    warmup_ratio: 0.03
+    weight_decay: 0.001
+    report_to:
+
+instruction: ragfoundry/processing/prompts/prompt_instructions/qa-yes-no.txt
+template:
+data_file: pubmed-rag-train.jsonl
+input_key: prompt
+output_key: answers
+resume_checkpoint:
+limit:
+shuffle:
+hfhub_tag:
+use_wandb:
+experiment:
+wandb_entity:
+
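
Two settings in the training configuration above interact numerically: `use_rslora` changes how the LoRA update is scaled, and `gradient_accumulation_steps` changes the effective batch size. The sketch below works through the arithmetic, assuming the usual definitions (standard LoRA scales by `lora_alpha / r`, rank-stabilized LoRA by `lora_alpha / sqrt(r)`); the helper name `lora_scaling` is hypothetical.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r  # standard LoRA


# Values copied from training-pubmed.yaml.
rslora_scale = lora_scaling(lora_alpha=16, r=16, use_rslora=True)
standard_scale = lora_scaling(lora_alpha=16, r=16, use_rslora=False)

# Examples contributing to each optimizer step, per device.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
effective_batch_per_device = per_device_train_batch_size * gradient_accumulation_steps

print(rslora_scale, standard_scale, effective_batch_per_device)
```

With `r: 16` and `lora_alpha: 16`, enabling rsLoRA gives a scaling factor of 4.0 instead of 1.0, which is worth keeping in mind when comparing learning rates across runs.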
diff --git a/docs/blog/index.md b/docs/blog/index.md
diff --git a/docs/index.md b/docs/index.md
@@ -1 +1,101 @@
---8<-- "README.md"
+<div align="center">
+    <img src="assets/rag_foundry.png" width="500"/>
+</div>
+
+----------
+
+[RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation](https://arxiv.org/abs/2408.02545)
+
+**RAG Foundry** is a library designed to improve LLMs' ability to use external information by fine-tuning models on
+specially created RAG-augmented datasets. The library helps create the data for training, given a RAG technique, helps
+easily train models using parameter-efficient fine-tuning (PEFT), and finally can help users measure the improved
+performance using various RAG-specific metrics. The library is modular, and workflows are customizable using
+configuration files.
+
+Comments, suggestions, issues and pull-requests are welcomed! ❤️
+
+### Installation
+Clone locally and run:
+
+```sh
+pip install -r requirements.txt
+```
+
+### Quick Start
+
+For a simple, end-to-end example, see the [PubmedQA Tutorial](pubmed.md).
+
+## Overview
+
+The RAG Foundry framework facilitates fast prototyping and experimentation with various RAG settings and configurations,
+including data selection and filtering, processing, retrieval, ranking, query manipulation, prompt generation, training,
+inference, output processing and evaluation. The library comprises 4 modules: dataset creation, training,
+inference and evaluation.
+
+* **Dataset Creation**: The processing module creates datasets, persisting RAG interactions, to be used for RAG training
+and inference. RAG interactions include dataset loading, column normalization, data aggregation (few-shot creation),
+information retrieval using external tools and frameworks, API integration, template-based prompt creation and any other
+form of pre-processing. The data is saved in a consistent, model-independent, input-output format, along with all other
+fields and metadata. See [Processing](processing.md).
+
+* **Training**: using PEFT for efficient training and TRL (e.g. supervised fine-tuning), users can train any model on the augmented
+datasets. Training is done on the completions. Models can be pushed to HF Hub. See [Training](training.md).
+
+* **Inference**: generating predictions using the augmented datasets with trained or untrained LLMs. See [Inference](inference.md).
+
+* **Evaluation**: running evaluation on the generated output from the inference module. Users can provide a list of
+metrics to run; custom metrics can be implemented easily. Current metrics include EM, F1, ROUGE, BERTScore, Deepeval,
+RAGAS, HF `evaluate` and classification. Metrics can be *local*—run on each example, or *global*—run on the entire
+dataset, e.g. recall. Metrics can utilize any feature in the dataset, like retrieval results, reasoning,
+citations and attributions, not just the input and output texts. See [Evaluation](evaluation.md).
+
+
+## Running
+The 4 modules are represented as scripts: `processing.py`, `training.py`, `inference.py` and `evaluation.py` at the top
+level. Every call has the form `python SCRIPT options...`.
+
+The library utilizes the [Hydra](https://hydra.cc/docs/intro/) configuration tool; it enables the use of hierarchical
+configurations, easy overriding of values from the CLI, and the ability to run multiple jobs remotely (e.g. integrations with
+SLURM and Ray). It represents a *configuration-as-code* approach, as it can instantiate python classes according to
+configuration (the `_target_` keyword indicates the python class to use in a given context).
+
+There are default configurations for each module in the [configs](./configs/) folder. A different configuration file
+can be selected with Hydra's `-cp` (config path) and `-cn` (config name) options:
+
+```sh
+python processing.py -cp configs/paper -cn processing-asqa-retrieval
+```
+
+Individual keywords can be overridden as well:
+```sh
+python processing.py -cp configs/paper -cn processing-asqa-retrieval   \
+       output_path=/store/data/here                                    \
+       cache=true
+```
+
+For a complete set of configurations, **reproducing the experimentation in the paper with the ASQA dataset**, see the
+configurations in the [Paper](./configs/paper) folder.
+
+## Citation
+
+Please cite our paper if it helps your research:
+
+```BibTex
+@article{fleischerRAGFoundryFramework2024,
+  title =        {{RAG} {Foundry}: {A} {Framework} for {Enhancing} {LLMs} for {Retrieval} {Augmented} {Generation}},
+  author =       {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe and Izsak, Peter},
+  year =         2024,
+  note =         {arXiv:2408.02545 [cs]},
+  annote =       {Comment: 10 pages},
+  url =          {http://arxiv.org/abs/2408.02545},
+  publisher =    {arXiv},
+}
+```
+
+## License
+
+The code is licensed under the [Apache 2.0 License](LICENSE).
+
+## Disclaimer
+
+This is not an official Intel product.
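
The "Running" section in `docs/index.md` describes the `_target_` keyword as naming the Python class to instantiate from a configuration block. The snippet below is a simplified sketch of that configuration-as-code idea using only the standard library. It is not how Hydra implements it (Hydra provides `hydra.utils.instantiate`, which handles much more, such as nested configs), and the `instantiate` helper and the `datetime.timedelta` target here are stand-ins chosen for illustration.

```python
import importlib
from typing import Any, Mapping


def instantiate(config: Mapping[str, Any]) -> Any:
    """Build an object from a flat config: `_target_` is a dotted class path,
    and every other key is passed to the class as a keyword argument."""
    params = dict(config)
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**params)


# A standard-library class stands in for a library class such as a loader or metric.
config = {"_target_": "datetime.timedelta", "hours": 1, "minutes": 30}
obj = instantiate(config)
print(type(obj).__name__, obj.total_seconds())
```

Because the class path lives in the configuration, swapping one processing step or metric for another only requires editing YAML, not the calling code.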