Compare commits: v3.0.0-rc5...main (405 commits)
@@ -1,13 +1,17 @@
version: 2.1
setup: true
orbs:
  path-filtering: circleci/path-filtering@0.0.1
  path-filtering: circleci/path-filtering@1.3.0

workflows:
  version: 2.1
  generate-config:
    jobs:
      - path-filtering/filter:
          filters:
            tags:
              only:
                - /.*/
          base-revision: main
          config-path: .circleci/continue_config.yml
          mapping: |
@@ -16,4 +20,3 @@ workflows:
            gpt4all-bindings/python/.* run-python-workflow true
            gpt4all-bindings/typescript/.* run-ts-workflow true
            gpt4all-chat/.* run-chat-workflow true
            .* run-default-workflow true
@@ -1,3 +1,3 @@
[codespell]
ignore-words-list = blong, afterall, som, assistent, crasher
skip = .git,*.pdf,*.svg,*.lock
ignore-words-list = blong, afterall, assistent, crasher, requestor
skip = ./.git,./gpt4all-chat/translations,*.pdf,*.svg,*.lock
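The two `ignore-words-list`/`skip` pairs above are the before and after of the [codespell] configuration update. A minimal sketch for exercising it locally, assuming codespell is installed via pip and run from the repository root (so the config file, e.g. .codespellrc, is picked up automatically):

```bash
# Install the spell checker and run it against the working tree.
# codespell reads its [codespell] config section (e.g. from .codespellrc)
# in the current directory, so no extra flags are needed here.
pip install codespell
codespell
```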
.gitignore (vendored): 2 changes

@@ -181,6 +181,8 @@ CMakeLists.txt.user
gpt4all-chat/models/*
build_*
build-*
cmake-build-*
/gpt4all-chat/tests/python/config.py

# IntelliJ
.idea/
.gitmodules (vendored): 22 changes

@@ -1,7 +1,25 @@
[submodule "llama.cpp-mainline"]
    path = gpt4all-backend/llama.cpp-mainline
    path = gpt4all-backend/deps/llama.cpp-mainline
    url = https://github.com/nomic-ai/llama.cpp.git
    branch = master
[submodule "gpt4all-chat/usearch"]
    path = gpt4all-chat/usearch
    path = gpt4all-chat/deps/usearch
    url = https://github.com/nomic-ai/usearch.git
[submodule "gpt4all-chat/deps/SingleApplication"]
    path = gpt4all-chat/deps/SingleApplication
    url = https://github.com/nomic-ai/SingleApplication.git
[submodule "gpt4all-chat/deps/fmt"]
    path = gpt4all-chat/deps/fmt
    url = https://github.com/fmtlib/fmt.git
[submodule "gpt4all-chat/deps/DuckX"]
    path = gpt4all-chat/deps/DuckX
    url = https://github.com/nomic-ai/DuckX.git
[submodule "gpt4all-chat/deps/QXlsx"]
    path = gpt4all-chat/deps/QXlsx
    url = https://github.com/nomic-ai/QXlsx.git
[submodule "gpt4all-chat/deps/minja"]
    path = gpt4all-chat/deps/minja
    url = https://github.com/nomic-ai/minja.git
[submodule "gpt4all-chat/deps/json"]
    path = gpt4all-chat/deps/json
    url = https://github.com/nlohmann/json.git
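Because this compare relocates the existing submodules under deps/ and adds several new ones, a checkout that crosses these changes needs its submodules re-synced. A minimal sketch, assuming a fresh HTTPS clone of the repository:

```bash
# Clone the repository and fetch every submodule listed in .gitmodules.
git clone https://github.com/nomic-ai/gpt4all.git
cd gpt4all
# After switching between branches that move submodules (e.g. into deps/),
# re-sync the recorded URLs/paths and update recursively.
git submodule sync --recursive
git submodule update --init --recursive
```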
MAINTAINERS.md (new file): 77 additions

@@ -0,0 +1,77 @@
# MAINTAINERS

## Rules

* All content inside GPT4All shall have a documented maintainer
* If a maintainer decides to retire or resign a call for volunteers will go
  out
* If no further maintainer can be found in a reasonable time frame, then the
  content will be marked deprecated and removed in time

## Job

Maintainers will be...

1. Responsible for overseeing content under their stewardship
2. Responsible for triaging new issues, reviewing PRs, assigning priority
   to tasks
3. Responsible for keeping content in sufficient quality in a timely fashion

## List

Adam Treat ([@manyoso](https://github.com/manyoso))<br/>
E-mail: adam@nomic.ai<br/>
Discord: `@gonzochess75`
- Overall project maintainer
- Chat UI

Jared Van Bortel ([@cebtenzzre](https://github.com/cebtenzzre))<br/>
E-mail: jared@nomic.ai<br/>
Discord: `@cebtenzzre`
- gpt4all-backend
- Python binding
- Python CLI app

Jacob Nguyen ([@jacoobes](https://github.com/jacoobes))<br/>
Discord: `@jacoobes`<br/>
E-mail: `jacoobes@sern.dev`
- TypeScript binding

Dominik ([@cosmic-snow](https://github.com/cosmic-snow))<br/>
E-mail: cosmic-snow@mailfence.com<br/>
Discord: `@cosmic__snow`
- Community documentation (GitHub Wiki)

Max Cembalest ([@mcembalest](https://github.com/mcembalest))<br/>
E-mail: max@nomic.ai<br/>
Discord: `@maxcembalest.`
- Official documentation (gpt4all-bindings/python/docs -> https://docs.gpt4all.io/)

Thiago Ramos ([@thiagojramos](https://github.com/thiagojramos))<br/>
E-mail: thiagojramos@outlook.com<br/>
- pt\_BR translation

不知火 Shiranui ([@supersonictw](https://github.com/supersonictw))<br/>
E-mail: supersonic@livemail.tw<br/>
Discord: `@supersonictw`
- zh\_TW translation

Jeremy Tayco ([@jstayco](https://github.com/jstayco))<br/>
E-mail: jstayco@protonmail.ch<br/>
Discord: `@vertana`
- es\_MX translation

Riccardo Giovanetti ([@Harvester62](https://github.com/Harvester62))<br/>
E-mail: riccardo.giovanetti@gmail.com<br/>
Discord: `@harvester62`
- it\_IT translation

Tim ([@Tim453](https://github.com/Tim453))<br/>
E-mail: tim453@mailbox.org<br/>
Discord: `@Tim453`
- Flatpak

Jack ([@wuodoo](https://github.com/wuodoo))<br/>
E-mail: 2296103047@qq.com<br/>
Discord: `@mikage`
- zh\_CN translation
README.md: 193 changes

@@ -1,48 +1,110 @@
<h1 align="center">GPT4All</h1>
<p align="center">Privacy-oriented software for chatting with large language models that run on your own computer.</p>

<p align="center">
<a href="https://gpt4all.io">Official Website</a> • <a href="https://docs.gpt4all.io">Documentation</a> • <a href="https://discord.gg/mGZE39AS3e">Discord</a>
Now with support for DeepSeek R1 Distillations
</p>

<p align="center">
<a href="https://www.nomic.ai/gpt4all">Website</a> • <a href="https://docs.gpt4all.io">Documentation</a> • <a href="https://discord.gg/mGZE39AS3e">Discord</a> • <a href="https://www.youtube.com/watch?v=gQcZDXRVJok">YouTube Tutorial</a>
</p>

<p align="center">
GPT4All runs large language models (LLMs) privately on everyday desktops & laptops.
</p>
<p align="center">
Official Download Links: <a href="https://gpt4all.io/installers/gpt4all-installer-win64.exe">Windows</a> — <a href="https://gpt4all.io/installers/gpt4all-installer-darwin.dmg">macOS</a> — <a href="https://gpt4all.io/installers/gpt4all-installer-linux.run">Ubuntu</a>
No API calls or GPUs required - you can just download the application and <a href="https://docs.gpt4all.io/gpt4all_desktop/quickstart.html#quickstart">get started</a>.
</p>

<p align="center">
Read about what's new in <a href="https://www.nomic.ai/blog/tag/gpt4all">our blog</a>.
</p>
<p align="center">
<b>NEW:</b> <a href="https://forms.nomic.ai/gpt4all-release-notes-signup">Subscribe to our mailing list</a> for updates and news!
<a href="https://nomic.ai/gpt4all/#newsletter-form">Subscribe to the newsletter</a>
</p>

https://github.com/nomic-ai/gpt4all/assets/70534565/513a0f15-4964-4109-89e4-4f9a9011f311

<p align="center">
GPT4All is made possible by our compute partner <a href="https://www.paperspace.com/">Paperspace</a>.
</p>
<p align="center">
<a href="https://www.phorm.ai/query?projectId=755eecd3-24ad-49cc-abf4-0ab84caacf63"><img src="https://img.shields.io/badge/Phorm-Ask_AI-%23F2777A.svg" alt="phorm.ai"></a>

## Download Links

<p>
— <a href="https://gpt4all.io/installers/gpt4all-installer-win64.exe">
<img src="gpt4all-bindings/python/docs/assets/windows.png" style="height: 1em; width: auto" /> Windows Installer
</a> —
</p>
<p>
— <a href="https://gpt4all.io/installers/gpt4all-installer-win64-arm.exe">
<img src="gpt4all-bindings/python/docs/assets/windows.png" style="height: 1em; width: auto" /> Windows ARM Installer
</a> —
</p>
<p>
— <a href="https://gpt4all.io/installers/gpt4all-installer-darwin.dmg">
<img src="gpt4all-bindings/python/docs/assets/mac.png" style="height: 1em; width: auto" /> macOS Installer
</a> —
</p>
<p>
— <a href="https://gpt4all.io/installers/gpt4all-installer-linux.run">
<img src="gpt4all-bindings/python/docs/assets/ubuntu.svg" style="height: 1em; width: auto" /> Ubuntu Installer
</a> —
</p>
<p>
The Windows and Linux builds require Intel Core i3 2nd Gen / AMD Bulldozer, or better.
</p>
<p>
The Windows ARM build supports Qualcomm Snapdragon and Microsoft SQ1/SQ2 processors.
</p>
<p>
The Linux build is x86-64 only (no ARM).
</p>
<p>
The macOS build requires Monterey 12.6 or newer. Best results with Apple Silicon M-series processors.
</p>

<p align="center">
<img width="auto" height="400" src="https://github.com/nomic-ai/gpt4all/assets/14168726/495fce3e-769b-4e5a-a394-99f072ac4d29">
</p>
<p align="center">
Run on an M2 MacBook Pro (not sped up!)
See the full [System Requirements](gpt4all-chat/system_requirements.md) for more details.

<br/>
<br/>
<p>
<a href='https://flathub.org/apps/io.gpt4all.gpt4all'>
<img style="height: 2em; width: auto" alt='Get it on Flathub' src='https://flathub.org/api/badge'><br/>
Flathub (community maintained)
</a>
</p>

## Install GPT4All Python

## About GPT4All
`gpt4all` gives you access to LLMs with our Python client around [`llama.cpp`](https://github.com/ggerganov/llama.cpp) implementations.

GPT4All is an ecosystem to run **powerful** and **customized** large language models that work locally on consumer grade CPUs and NVIDIA and AMD GPUs. Note that your CPU needs to support [AVX instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
Nomic contributes to open source software like [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to make LLMs accessible and efficient **for all**.

Learn more in the [documentation](https://docs.gpt4all.io).
```bash
pip install gpt4all
```

A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All software. **Nomic AI** supports and maintains this software ecosystem to enforce quality and security alongside spearheading the effort to allow any person or enterprise to easily deploy their own on-edge large language models.
```python
from gpt4all import GPT4All
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM
with model.chat_session():
    print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
```


### Installation
## Integrations

The recommended way to install GPT4All is to use one of the online installers linked above in this README, which are also available at the [GPT4All website](https://gpt4all.io/). These require an internet connection at install time, are slightly easier to use on macOS due to code signing, and provide a version of GPT4All that can check for updates.
:parrot::link: [Langchain](https://python.langchain.com/v0.2/docs/integrations/providers/gpt4all/)
:card_file_box: [Weaviate Vector Database](https://github.com/weaviate/weaviate) - [module docs](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-gpt4all)
:telescope: [OpenLIT (OTel-native Monitoring)](https://github.com/openlit/openlit) - [Docs](https://docs.openlit.io/latest/integrations/gpt4all)

An alternative way to install GPT4All is to use one of the offline installers available on the [Releases page](https://github.com/nomic-ai/gpt4all/releases). These do not require an internet connection at install time, and can be used to install an older version of GPT4All if so desired. But using these requires acknowledging a security warning on macOS, and they provide a version of GPT4All that is unable to notify you of updates, so you should enable notifications for Releases on this repository (Watch > Custom > Releases) or sign up for announcements in our [Discord server](https://discord.gg/mGZE39AS3e).


### What's New
## Release History
- **July 2nd, 2024**: V3.0.0 Release
  - Fresh redesign of the chat application UI
  - Improved user workflow for LocalDocs
  - Expanded access to more model architectures
- **October 19th, 2023**: GGUF Support Launches with Support for:
  - Mistral 7b base model, an updated model gallery on [gpt4all.io](https://gpt4all.io), several new local code models including Rift Coder v1.5
  - Mistral 7b base model, an updated model gallery on our website, several new local code models including Rift Coder v1.5
  - [Nomic Vulkan](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan) support for Q4\_0 and Q4\_1 quantizations in GGUF.
  - Offline build support for running old versions of the GPT4All Local LLM Chat Client.
- **September 18th, 2023**: [Nomic Vulkan](https://blog.nomic.ai/posts/gpt4all-gpu-inference-with-vulkan) launches supporting local LLM inference on NVIDIA and AMD GPUs.
@@ -51,25 +113,6 @@ An alternative way to install GPT4All is to use one of the offline installers av

[Docker-based API server]: https://github.com/nomic-ai/gpt4all/tree/cef74c2be20f5b697055d5b8b506861c7b997fab/gpt4all-api


### Building From Source

* Follow the instructions [here](gpt4all-chat/build_and_run.md) to build the GPT4All Chat UI from source.


### Bindings

* :snake: <a href="https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python">Official Python Bindings</a> [](https://pepy.tech/project/gpt4all)
* :computer: <a href="https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/typescript">Typescript Bindings</a>


### Integrations

* :parrot::link: [Langchain](https://python.langchain.com/en/latest/modules/models/llms/integrations/gpt4all.html)
* :card_file_box: [Weaviate Vector Database](https://github.com/weaviate/weaviate) - [module docs](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-gpt4all)
* :telescope: [OpenLIT (OTel-native Monitoring)](https://github.com/openlit/openlit) - [Docs](https://docs.openlit.io/latest/integrations/gpt4all)


## Contributing
GPT4All welcomes contributions, involvement, and discussion from the open source community!
Please see CONTRIBUTING.md and follow the issues, bug reports, and PR markdown templates.
@@ -78,74 +121,6 @@ Check project discord, with project owners, or through existing issues/PRs to av
Please make sure to tag all of the above with relevant project identifiers or your contribution could potentially get lost.
Example tags: `backend`, `bindings`, `python-bindings`, `documentation`, etc.


## GPT4All 2024 Roadmap
To contribute to the development of any of the below roadmap items, make or find the corresponding issue and cross-reference the [in-progress task](https://github.com/orgs/nomic-ai/projects/2/views/1).

Each item should have an issue link below.

- Chat UI Language Localization (localize UI into the native languages of users)
  - [ ] Chinese
  - [ ] German
  - [ ] French
  - [ ] Portuguese
  - [ ] Your native language here.
- UI Redesign: an internal effort at Nomic to improve the UI/UX of gpt4all for all users.
  - [ ] Design new user interface and gather community feedback
  - [ ] Implement the new user interface and experience.
- Installer and Update Improvements
  - [ ] Seamless native installation and update process on OSX
  - [ ] Seamless native installation and update process on Windows
  - [ ] Seamless native installation and update process on Linux
- Model discoverability improvements:
  - [x] Support huggingface model discoverability
  - [ ] Support Nomic hosted model discoverability
- LocalDocs (towards a local perplexity)
  - Multilingual LocalDocs Support
    - [ ] Create a multilingual experience
    - [ ] Incorporate a multilingual embedding model
    - [ ] Specify a preferred multilingual LLM for localdocs
  - Improved RAG techniques
    - [ ] Query augmentation and re-writing
    - [ ] Improved chunking and text extraction from arbitrary modalities
    - [ ] Custom PDF extractor past the QT default (charts, tables, text)
    - [ ] Faster indexing and local exact search with v1.5 hamming embeddings and reranking (skip ANN index construction!)
  - Support queries like 'summarize X document'
  - Multimodal LocalDocs support with Nomic Embed
  - Nomic Dataset Integration with real-time LocalDocs
    - [ ] Include an option to allow the export of private LocalDocs collections to Nomic Atlas for debugging data/chat quality
    - [ ] Allow optional sharing of LocalDocs collections between users.
    - [ ] Allow the import of a LocalDocs collection from an Atlas Datasets
  - Chat with live version of Wikipedia, Chat with Pubmed, chat with the latest snapshot of world news.
- First class Multilingual LLM Support
  - [ ] Recommend and set a default LLM for German
  - [ ] Recommend and set a default LLM for English
  - [ ] Recommend and set a default LLM for Chinese
  - [ ] Recommend and set a default LLM for Spanish

- Server Mode improvements
  - Improved UI and new requested features:
    - [ ] Fix outstanding bugs and feature requests around networking configurations.
    - [ ] Support Nomic Embed inferencing
    - [ ] First class documentation
    - [ ] Improving developer use and quality of server mode (e.g. support larger batches)


## Technical Reports

<p align="center">
<a href="https://gpt4all.io/reports/GPT4All_Technical_Report_3.pdf">:green_book: Technical Report 3: GPT4All Snoozy and Groovy </a>
</p>

<p align="center">
<a href="https://static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf">:green_book: Technical Report 2: GPT4All-J </a>
</p>

<p align="center">
<a href="https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf">:green_book: Technical Report 1: GPT4All</a>
</p>


## Citation

If you utilize this repository, models or data in a downstream project, please consider citing it with:
common/common.cmake (new file): 41 additions

@@ -0,0 +1,41 @@
function(gpt4all_add_warning_options target)
    if (MSVC)
        return()
    endif()
    target_compile_options("${target}" PRIVATE
        # base options
        -Wall
        -Wextra
        # extra options
        -Wcast-align
        -Wextra-semi
        -Wformat=2
        -Wmissing-include-dirs
        -Wsuggest-override
        -Wvla
        # errors
        -Werror=format-security
        -Werror=init-self
        -Werror=pointer-arith
        -Werror=undef
        # disabled warnings
        -Wno-sign-compare
        -Wno-unused-parameter
    )
    if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
        target_compile_options("${target}" PRIVATE
            -Wduplicated-branches
            -Wduplicated-cond
            -Wlogical-op
            -Wno-reorder
            -Wno-null-dereference
        )
    elseif (CMAKE_CXX_COMPILER_ID MATCHES "^(Apple)?Clang$")
        target_compile_options("${target}" PRIVATE
            -Wunreachable-code-break
            -Wunreachable-code-return
            -Werror=pointer-integer-compare
            -Wno-reorder-ctor
        )
    endif()
endfunction()
@@ -1,4 +1,7 @@
cmake_minimum_required(VERSION 3.21) # for PROJECT_IS_TOP_LEVEL
cmake_minimum_required(VERSION 3.23) # for FILE_SET

include(../common/common.cmake)

set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ON)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

@@ -33,7 +36,7 @@ set(LLMODEL_VERSION_PATCH 0)
set(LLMODEL_VERSION "${LLMODEL_VERSION_MAJOR}.${LLMODEL_VERSION_MINOR}.${LLMODEL_VERSION_PATCH}")
project(llmodel VERSION ${LLMODEL_VERSION} LANGUAGES CXX C)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})
set(BUILD_SHARED_LIBS ON)

@@ -47,17 +50,15 @@ else()
    message(STATUS "Interprocedural optimization support detected")
endif()

set(DIRECTORY llama.cpp-mainline)
set(DIRECTORY deps/llama.cpp-mainline)
include(llama.cpp.cmake)

set(BUILD_VARIANTS)
set(GPTJ_BUILD_VARIANT cpu)
if (APPLE)
    list(APPEND BUILD_VARIANTS metal)
endif()
if (LLMODEL_KOMPUTE)
    list(APPEND BUILD_VARIANTS kompute kompute-avxonly)
    set(GPTJ_BUILD_VARIANT kompute)
else()
    list(PREPEND BUILD_VARIANTS cpu cpu-avxonly)
endif()

@@ -65,9 +66,23 @@ if (LLMODEL_VULKAN)
    list(APPEND BUILD_VARIANTS vulkan vulkan-avxonly)
endif()
if (LLMODEL_CUDA)
    if (DEFINED CMAKE_CUDA_ARCHITECTURES)
        set(GGML_CUDA_ARCHITECTURES "${CMAKE_CUDA_ARCHITECTURES}")
    cmake_minimum_required(VERSION 3.18) # for CMAKE_CUDA_ARCHITECTURES

    # Defaults must be set before enable_language(CUDA).
    # Keep this in sync with the arch list in ggml/src/CMakeLists.txt (plus 5.0 for non-F16 branch).
    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        # 52 == lowest CUDA 12 standard
        # 60 == f16 CUDA intrinsics
        # 61 == integer CUDA intrinsics
        # 70 == compute capability at which unrolling a loop in mul_mat_q kernels is faster
        if (GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
            set(CMAKE_CUDA_ARCHITECTURES "60;61;70;75") # needed for f16 CUDA intrinsics
        else()
            set(CMAKE_CUDA_ARCHITECTURES "50;52;61;70;75") # lowest CUDA 12 standard + lowest for integer intrinsics
            #set(CMAKE_CUDA_ARCHITECTURES "OFF") # use this to compile much faster, but only F16 models work
        endif()
    endif()
    message(STATUS "Using CUDA architectures: ${CMAKE_CUDA_ARCHITECTURES}")

    include(CheckLanguage)
    check_language(CUDA)
@@ -82,8 +97,6 @@ if (LLMODEL_ROCM)
    list(APPEND BUILD_VARIANTS rocm rocm-avxonly)
endif()

set(CMAKE_VERBOSE_MAKEFILE ON)

# Go through each build variant
foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)
    # Determine flags
@@ -92,30 +105,34 @@ foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)
    else()
        set(GPT4ALL_ALLOW_NON_AVX ON)
    endif()
    set(LLAMA_AVX2 ${GPT4ALL_ALLOW_NON_AVX})
    set(LLAMA_F16C ${GPT4ALL_ALLOW_NON_AVX})
    set(LLAMA_FMA ${GPT4ALL_ALLOW_NON_AVX})
    set(GGML_AVX2 ${GPT4ALL_ALLOW_NON_AVX})
    set(GGML_F16C ${GPT4ALL_ALLOW_NON_AVX})
    set(GGML_FMA ${GPT4ALL_ALLOW_NON_AVX})

    set(LLAMA_METAL OFF)
    set(LLAMA_KOMPUTE OFF)
    set(LLAMA_VULKAN OFF)
    set(LLAMA_CUDA OFF)
    set(LLAMA_ROCM OFF)
    set(GGML_METAL OFF)
    set(GGML_KOMPUTE OFF)
    set(GGML_VULKAN OFF)
    set(GGML_CUDA OFF)
    set(GGML_ROCM OFF)
    if (BUILD_VARIANT MATCHES metal)
        set(LLAMA_METAL ON)
        set(GGML_METAL ON)
    elseif (BUILD_VARIANT MATCHES kompute)
        set(LLAMA_KOMPUTE ON)
        set(GGML_KOMPUTE ON)
    elseif (BUILD_VARIANT MATCHES vulkan)
        set(LLAMA_VULKAN ON)
        set(GGML_VULKAN ON)
    elseif (BUILD_VARIANT MATCHES cuda)
        set(LLAMA_CUDA ON)
        set(GGML_CUDA ON)
    elseif (BUILD_VARIANT MATCHES rocm)
        set(LLAMA_HIPBLAS ON)
        set(GGML_HIPBLAS ON)
    endif()

    # Include GGML
    include_ggml(-mainline-${BUILD_VARIANT})

    if (BUILD_VARIANT MATCHES metal)
        set(GGML_METALLIB "${GGML_METALLIB}" PARENT_SCOPE)
    endif()

    # Function for preparing individual implementations
    function(prepare_target TARGET_NAME BASE_LIB)
        set(TARGET_NAME ${TARGET_NAME}-${BUILD_VARIANT})
@@ -134,28 +151,35 @@ foreach(BUILD_VARIANT IN LISTS BUILD_VARIANTS)

    # Add each individual implementations
    add_library(llamamodel-mainline-${BUILD_VARIANT} SHARED
        llamamodel.cpp llmodel_shared.cpp)
        src/llamamodel.cpp src/llmodel_shared.cpp)
    gpt4all_add_warning_options(llamamodel-mainline-${BUILD_VARIANT})
    target_compile_definitions(llamamodel-mainline-${BUILD_VARIANT} PRIVATE
        LLAMA_VERSIONS=>=3 LLAMA_DATE=999999)
    target_include_directories(llamamodel-mainline-${BUILD_VARIANT} PRIVATE
        src include/gpt4all-backend
    )
    prepare_target(llamamodel-mainline llama-mainline)

    if (BUILD_VARIANT MATCHES ${GPTJ_BUILD_VARIANT})
        add_library(gptj-${BUILD_VARIANT} SHARED
            gptj.cpp utils.h utils.cpp llmodel_shared.cpp llmodel_shared.h)
        prepare_target(gptj llama-mainline)
    endif()

    if (NOT PROJECT_IS_TOP_LEVEL AND BUILD_VARIANT STREQUAL cuda)
        set(CUDAToolkit_BIN_DIR ${CUDAToolkit_BIN_DIR} PARENT_SCOPE)
    endif()
endforeach()

add_library(llmodel
    llmodel.h llmodel.cpp llmodel_shared.cpp
    llmodel_c.h llmodel_c.cpp
    dlhandle.cpp
    src/dlhandle.cpp
    src/llmodel.cpp
    src/llmodel_c.cpp
    src/llmodel_shared.cpp
)
gpt4all_add_warning_options(llmodel)
target_sources(llmodel PUBLIC
    FILE_SET public_headers TYPE HEADERS BASE_DIRS include
    FILES include/gpt4all-backend/llmodel.h
        include/gpt4all-backend/llmodel_c.h
        include/gpt4all-backend/sysinfo.h
)
target_compile_definitions(llmodel PRIVATE LIB_FILE_EXT="${CMAKE_SHARED_LIBRARY_SUFFIX}")
target_include_directories(llmodel PRIVATE src include/gpt4all-backend)

set_target_properties(llmodel PROPERTIES
    VERSION ${PROJECT_VERSION}
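For orientation, the hunks above gate the CUDA/Vulkan/Kompute variants behind LLMODEL_* options and only pick a default architecture list when CMAKE_CUDA_ARCHITECTURES is not already defined. A minimal configure/build sketch; the source/build paths and the architecture list are illustrative, not taken from the repository's documented build steps:

```bash
# Configure the backend out of source with the CUDA variant enabled.
# Passing CMAKE_CUDA_ARCHITECTURES explicitly bypasses the default list
# chosen by the CMakeLists.txt logic shown above.
cmake -S gpt4all-backend -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLMODEL_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="60;61;70;75"
cmake --build build -j
```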
@@ -27,7 +27,7 @@ Unfortunately, no for three reasons:

# What is being done to make them more compatible?

A few things. Number one, we are maintaining compatibility with our current model zoo by way of the submodule pinning. However, we are also exploring how we can update to newer versions of llama.cpp without breaking our current models. This might involve an additional magic header check or it could possibly involve keeping the currently pinned submodule and also adding a new submodule with later changes and differienting them with namespaces or some other manner. Investigations continue.
A few things. Number one, we are maintaining compatibility with our current model zoo by way of the submodule pinning. However, we are also exploring how we can update to newer versions of llama.cpp without breaking our current models. This might involve an additional magic header check or it could possibly involve keeping the currently pinned submodule and also adding a new submodule with later changes and differentiating them with namespaces or some other manner. Investigations continue.

# What about GPU inference?
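The FAQ answer above hinges on the llama.cpp submodule pin, which this compare also relocates to deps/llama.cpp-mainline (see the submodule entry below). A quick sketch for checking which upstream commit a checkout is pinned to, assuming the new path from this branch:

```bash
# Show the exact llama.cpp commit the backend is pinned to.
git submodule status gpt4all-backend/deps/llama.cpp-mainline
# Or read the pin recorded in the tree without initializing the submodule.
git ls-tree HEAD gpt4all-backend/deps/llama.cpp-mainline
```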
gpt4all-backend/deps/llama.cpp-mainline (new submodule): 1 addition

@@ -0,0 +1 @@
Subproject commit 11f734c3b0334dbae4823b4a7467764e447fc6d6
@@ -1,853 +0,0 @@
|
||||
#define GPTJ_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
|
||||
#include "gptj_impl.h"
|
||||
|
||||
#include "llmodel.h"
|
||||
#include "llmodel_shared.h"
|
||||
#include "utils.h"
|
||||
|
||||
#include <ggml.h>
|
||||
|
||||
#include <algorithm>
|
||||
#include <cassert>
|
||||
#include <cinttypes>
|
||||
#include <cmath>
|
||||
#include <cstdio>
|
||||
#include <cstring>
|
||||
#include <ctime>
|
||||
#include <iostream>
|
||||
#include <map>
|
||||
#include <memory>
|
||||
#include <random>
|
||||
#include <sstream>
|
||||
#include <stdexcept>
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
|
||||
namespace {
|
||||
const char *modelType_ = "GPT-J";
|
||||
}
|
||||
|
||||
// default hparams (GPT-J 6B)
|
||||
struct gptj_hparams {
|
||||
int32_t n_vocab = 50400;
|
||||
int32_t n_ctx = 2048;
|
||||
int32_t n_embd = 4096;
|
||||
int32_t n_head = 16;
|
||||
int32_t n_layer = 28;
|
||||
int32_t n_rot = 64;
|
||||
float norm_eps = 1e-5;
|
||||
};
|
||||
|
||||
struct gptj_layer {
|
||||
// normalization
|
||||
struct ggml_tensor * ln_1_g;
|
||||
struct ggml_tensor * ln_1_b;
|
||||
|
||||
// attention
|
||||
struct ggml_tensor * c_attn_q_proj_w;
|
||||
struct ggml_tensor * c_attn_k_proj_w;
|
||||
struct ggml_tensor * c_attn_v_proj_w;
|
||||
|
||||
struct ggml_tensor * c_attn_proj_w;
|
||||
|
||||
// ff
|
||||
struct ggml_tensor * c_mlp_fc_w;
|
||||
struct ggml_tensor * c_mlp_fc_b;
|
||||
|
||||
struct ggml_tensor * c_mlp_proj_w;
|
||||
struct ggml_tensor * c_mlp_proj_b;
|
||||
};
|
||||
|
||||
struct gptj_model {
|
||||
gptj_hparams hparams;
|
||||
|
||||
// normalization
|
||||
struct ggml_tensor * ln_f_g;
|
||||
struct ggml_tensor * ln_f_b;
|
||||
|
||||
struct ggml_tensor * wte; // position embedding
|
||||
|
||||
struct ggml_tensor * lmh_g; // language model head
|
||||
struct ggml_tensor * lmh_b; // language model bias
|
||||
|
||||
std::vector<gptj_layer> layers;
|
||||
|
||||
// key + value memory
|
||||
struct llm_kv_cache kv_self;
|
||||
|
||||
//
|
||||
struct ggml_context * ctx;
|
||||
std::map<std::string, struct ggml_tensor *> tensors;
|
||||
|
||||
llm_buffer eval_buf;
|
||||
llm_buffer scr0_buf;
|
||||
llm_buffer scr1_buf;
|
||||
|
||||
~gptj_model() {
|
||||
if (ctx) {
|
||||
ggml_free(ctx);
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
static bool kv_cache_init(
|
||||
const struct gptj_hparams & hparams,
|
||||
struct llm_kv_cache & cache,
|
||||
ggml_type wtype,
|
||||
int n_ctx) {
|
||||
const int n_embd = hparams.n_embd;
|
||||
const int n_layer = hparams.n_layer;
|
||||
|
||||
const int64_t n_mem = (int64_t)n_layer*n_ctx;
|
||||
const int64_t n_elements = n_embd*n_mem;
|
||||
|
||||
cache.buf.resize(2u*n_elements*ggml_type_size(wtype) + 2_MiB);
|
||||
|
||||
struct ggml_init_params params;
|
||||
params.mem_size = cache.buf.size;
|
||||
params.mem_buffer = cache.buf.addr;
|
||||
params.no_alloc = false;
|
||||
|
||||
cache.ctx = ggml_init(params);
|
||||
|
||||
if (!cache.ctx) {
|
||||
fprintf(stderr, "%s: failed to allocate memory for kv cache\n", __func__);
|
||||
return false;
|
||||
}
|
||||
|
||||
cache.k = ggml_new_tensor_1d(cache.ctx, wtype, n_elements);
|
||||
cache.v = ggml_new_tensor_1d(cache.ctx, wtype, n_elements);
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
// load the model's weights from a file path
|
||||
bool gptj_model_load(const std::string &fname, gptj_model & model, gpt_vocab & vocab, size_t * mem_req = nullptr)
|
||||
{
|
||||
printf("%s: loading model from '%s' - please wait ...\n", __func__, fname.c_str());
|
||||
if(mem_req != nullptr) {
|
||||
*mem_req = 0;
|
||||
}
|
||||
|
||||
// create the ggml context
|
||||
struct gguf_init_params params = {
|
||||
/*.no_alloc = */ false,
|
||||
/*.ctx = */ &model.ctx,
|
||||
};
|
||||
|
||||
gguf_context *ggufctx = gguf_init_from_file(fname.c_str(), params);
|
||||
if (!ggufctx) {
|
||||
fprintf(stderr, "%s: gguf_init_from_file() failed\n", __func__);
|
||||
return false;
|
||||
}
|
||||
|
||||
// load hparams
|
||||
{
|
||||
auto & hparams = model.hparams;
|
||||
|
||||
bool ok = false;
|
||||
int keyidx;
|
||||
|
||||
do {
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.context_length");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.n_ctx = gguf_get_val_u32(ggufctx, keyidx);
|
||||
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.embedding_length");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.n_embd = gguf_get_val_u32(ggufctx, keyidx);
|
||||
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.attention.head_count");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.n_head = gguf_get_val_u32(ggufctx, keyidx);
|
||||
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.block_count");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.n_layer = gguf_get_val_u32(ggufctx, keyidx);
|
||||
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.rope.dimension_count");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.n_rot = gguf_get_val_u32(ggufctx, keyidx);
|
||||
|
||||
keyidx = gguf_find_key(ggufctx, "gptj.attention.layer_norm_epsilon");
|
||||
if (keyidx == -1) { break; }
|
||||
hparams.norm_eps = gguf_get_val_f32(ggufctx, keyidx);
|
||||
|
||||
ok = true;
|
||||
} while (false);
|
||||
|
||||
if (!ok) {
|
||||
fprintf(stderr, "%s: required hparam missing!\n", __func__);
|
||||
return false;
|
||||
}
|
||||
|
||||
printf("%s: n_ctx = %d\n", __func__, hparams.n_ctx);
|
||||
printf("%s: n_embd = %d\n", __func__, hparams.n_embd);
|
||||
printf("%s: n_head = %d\n", __func__, hparams.n_head);
|
||||
printf("%s: n_layer = %d\n", __func__, hparams.n_layer);
|
||||
printf("%s: n_rot = %d\n", __func__, hparams.n_rot);
|
||||
}
|
||||
|
||||
// load vocab
|
||||
{
|
||||
auto & hparams = model.hparams;
|
||||
|
||||
int keyidx = gguf_find_key(ggufctx, "tokenizer.ggml.model");
|
||||
if (keyidx == -1) {
|
||||
fprintf(stderr, "%s: tokenizer model not found!\n", __func__);
|
||||
return false;
|
||||
}
|
||||
if (strcmp(gguf_get_val_str(ggufctx, keyidx), "gpt2") != 0) {
|
||||
fprintf(stderr, "%s: tokenizer model not supported!\n", __func__);
|
||||
return false;
|
||||
}
|
||||
|
||||
int tokens_keyidx = gguf_find_key(ggufctx, "tokenizer.ggml.tokens");
|
||||
if (tokens_keyidx == -1) {
|
||||
fprintf(stderr, "%s: gpt2 tokenizer vocab not found!\n", __func__);
|
||||
return false;
|
||||
}
|
||||
|
||||
hparams.n_vocab = gguf_get_arr_n(ggufctx, tokens_keyidx);
|
||||
printf("%s: gpt2 tokenizer vocab = %d\n", __func__, int(hparams.n_vocab));
|
||||
|
||||
for (int i = 0; i < hparams.n_vocab; i++) {
|
||||
std::string word = gguf_get_arr_str(ggufctx, tokens_keyidx, i);
|
||||
vocab.token_to_id[word] = i;
|
||||
vocab.id_to_token[i] = word;
|
||||
}
|
||||
}
|
||||
|
||||
auto & ctx = model.ctx;
|
||||
|
||||
size_t ctx_size = ggml_get_mem_size(ctx);
|
||||
printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size / (1024.0 * 1024.0));
|
||||
|
||||
if (mem_req != nullptr) {
|
||||
*mem_req = ctx_size;
|
||||
gguf_free(ggufctx);
|
||||
return false;
|
||||
}
|
||||
|
||||
// prepare memory for the weights
|
||||
{
|
||||
const auto & hparams = model.hparams;
|
||||
model.layers.resize(hparams.n_layer);
|
||||
|
||||
model.wte = ggml_get_tensor(ctx, "token_embd.weight");
|
||||
|
||||
model.ln_f_g = ggml_get_tensor(ctx, "output_norm.weight");
|
||||
model.ln_f_b = ggml_get_tensor(ctx, "output_norm.bias");
|
||||
|
||||
model.lmh_g = ggml_get_tensor(ctx, "output.weight");
|
||||
model.lmh_b = ggml_get_tensor(ctx, "output.bias");
|
||||
|
||||
auto name = [](int i, std::string n) {
|
||||
static std::string key;
|
||||
key = "blk." + std::to_string(i) + "." + n;
|
||||
return key.c_str();
|
||||
};
|
||||
|
||||
for (int i = 0; i < hparams.n_layer; ++i) {
|
||||
auto & layer = model.layers[i];
|
||||
|
||||
layer.ln_1_g = ggml_get_tensor(ctx, name(i, "attn_norm.weight"));
|
||||
layer.ln_1_b = ggml_get_tensor(ctx, name(i, "attn_norm.bias"));
|
||||
|
||||
layer.c_attn_q_proj_w = ggml_get_tensor(ctx, name(i, "attn_q.weight"));
|
||||
layer.c_attn_k_proj_w = ggml_get_tensor(ctx, name(i, "attn_k.weight"));
|
||||
layer.c_attn_v_proj_w = ggml_get_tensor(ctx, name(i, "attn_v.weight"));
|
||||
|
||||
layer.c_attn_proj_w = ggml_get_tensor(ctx, name(i, "attn_output.weight"));
|
||||
|
||||
layer.c_mlp_fc_w = ggml_get_tensor(ctx, name(i, "ffn_up.weight"));
|
||||
layer.c_mlp_fc_b = ggml_get_tensor(ctx, name(i, "ffn_up.bias"));
|
||||
|
||||
layer.c_mlp_proj_w = ggml_get_tensor(ctx, name(i, "ffn_down.weight"));
|
||||
layer.c_mlp_proj_b = ggml_get_tensor(ctx, name(i, "ffn_down.bias"));
|
||||
}
|
||||
}
|
||||
|
||||
// key + value memory
|
||||
{
|
||||
const auto & hparams = model.hparams;
|
||||
if (!kv_cache_init(hparams, model.kv_self, GGML_TYPE_F16, model.hparams.n_ctx)) {
|
||||
fprintf(stderr, "%s: kv_cache_init() failed for self-attention cache\n", __func__);
|
||||
ggml_free(ctx);
|
||||
return false;
|
||||
}
|
||||
|
||||
const size_t memory_size = ggml_nbytes(model.kv_self.k) + ggml_nbytes(model.kv_self.v);
|
||||
printf("%s: kv self size = %7.2f MB\n", __func__, memory_size / 1024.0 / 1024.0);
|
||||
}
|
||||
|
||||
model.scr0_buf.resize(256u * 1024 * 1024);
|
||||
model.scr1_buf.resize(256u * 1024 * 1024);
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
// evaluate the transformer
|
||||
//
|
||||
// - model: the model
|
||||
// - n_threads: number of threads to use
|
||||
// - n_past: the context size so far
|
||||
// - embd_inp: the embeddings of the tokens in the context
|
||||
// - embd_w: the predicted logits for the next token
|
||||
//
|
||||
// The GPT-J model requires about 16MB of memory per input token.
|
||||
//
|
||||
bool gptj_eval(
|
||||
gptj_model & model,
|
||||
const int n_threads,
|
||||
const int n_past,
|
||||
const std::vector<gpt_vocab::id> & embd_inp,
|
||||
std::vector<float> & embd_w,
|
||||
size_t & mem_per_token) {
|
||||
const int N = embd_inp.size();
|
||||
|
||||
const auto & hparams = model.hparams;
|
||||
|
||||
const int n_embd = hparams.n_embd;
|
||||
const int n_layer = hparams.n_layer;
|
||||
const int n_ctx = hparams.n_ctx;
|
||||
const int n_head = hparams.n_head;
|
||||
const int n_vocab = hparams.n_vocab;
|
||||
const int n_rot = hparams.n_rot;
|
||||
|
||||
const size_t init_buf_size = 1024_MiB;
|
||||
if (!model.eval_buf.addr || model.eval_buf.size < init_buf_size)
|
||||
model.eval_buf.resize(init_buf_size);
|
||||
|
||||
if (mem_per_token > 0 && mem_per_token*N > model.eval_buf.size) {
|
||||
const size_t buf_size_new = 1.1*(mem_per_token*N); // add 10% to account for ggml object overhead
|
||||
printf("\n%s: reallocating buffer from %zu to %zu bytes\n", __func__, model.eval_buf.size, buf_size_new);
|
||||
|
||||
// reallocate
|
||||
model.eval_buf.resize(buf_size_new);
|
||||
if (model.eval_buf.addr == nullptr) {
|
||||
fprintf(stderr, "%s: failed to allocate %zu bytes\n", __func__, model.eval_buf.size);
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
struct ggml_init_params params = {
|
||||
.mem_size = model.eval_buf.size,
|
||||
.mem_buffer = model.eval_buf.addr,
|
||||
.no_alloc = false
|
||||
};
|
||||
|
||||
struct ggml_context * ctx0 = ggml_init(params);
|
||||
struct ggml_cgraph * gf = ggml_new_graph(ctx0);
|
||||
|
||||
// KQ_pos - contains the positions
|
||||
struct ggml_tensor * KQ_pos = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, N);
|
||||
int * data = (int *) KQ_pos->data;
|
||||
for (int i = 0; i < N; ++i) {
|
||||
data[i] = n_past + i;
|
||||
}
|
||||
|
||||
struct ggml_tensor * embd = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, N);
|
||||
memcpy(embd->data, embd_inp.data(), N*ggml_element_size(embd));
|
||||
|
||||
// wte
|
||||
struct ggml_tensor * inpL = ggml_get_rows(ctx0, model.wte, embd);
|
||||
|
||||
for (int il = 0; il < n_layer; ++il) {
|
||||
struct ggml_tensor * cur;
|
||||
ggml_set_scratch(ctx0, {0, model.scr0_buf.size, model.scr0_buf.addr, });
|
||||
// norm
|
||||
{
|
||||
cur = ggml_norm(ctx0, inpL, model.hparams.norm_eps);
|
||||
|
||||
// cur = ln_1_g*cur + ln_1_b
|
||||
cur = ggml_add(ctx0,
|
||||
ggml_mul(ctx0,
|
||||
ggml_repeat(ctx0, model.layers[il].ln_1_g, cur),
|
||||
cur),
|
||||
ggml_repeat(ctx0, model.layers[il].ln_1_b, cur));
|
||||
}
|
||||
|
||||
struct ggml_tensor * inpSA = cur;
|
||||
|
||||
// self-attention
|
||||
{
|
||||
struct ggml_tensor * Qcur = ggml_rope(
|
||||
ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model.layers[il].c_attn_q_proj_w, cur), n_embd/n_head, n_head, N),
|
||||
KQ_pos, n_rot, 0, 0
|
||||
);
|
||||
struct ggml_tensor * Kcur = ggml_rope(
|
||||
ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model.layers[il].c_attn_k_proj_w, cur), n_embd/n_head, n_head, N),
|
||||
KQ_pos, n_rot, 0, 0
|
||||
);
|
||||
|
||||
// store key and value to memory
|
||||
{
|
||||
struct ggml_tensor * Vcur = ggml_transpose(ctx0, ggml_mul_mat(ctx0, model.layers[il].c_attn_v_proj_w, cur));
|
||||
|
||||
struct ggml_tensor * k = ggml_view_1d(ctx0, model.kv_self.k, N*n_embd, (ggml_element_size(model.kv_self.k)*n_embd)*(il*n_ctx + n_past));
|
||||
struct ggml_tensor * v = ggml_view_2d(ctx0, model.kv_self.v, N, n_embd,
|
||||
( n_ctx)*ggml_element_size(model.kv_self.v),
|
||||
(il*n_ctx)*ggml_element_size(model.kv_self.v)*n_embd + n_past*ggml_element_size(model.kv_self.v));
|
||||
|
||||
ggml_build_forward_expand(gf, ggml_cpy(ctx0, Kcur, k));
|
||||
ggml_build_forward_expand(gf, ggml_cpy(ctx0, Vcur, v));
|
||||
}
|
||||
|
||||
// Q = Qcur.contiguous().view(n_embd/n_head, n_head, N).permute(0, 2, 1, 3)
|
||||
struct ggml_tensor * Q = ggml_permute(ctx0, Qcur, 0, 2, 1, 3);
|
||||
|
||||
// K = Kmem.view(n_embd/n_head, n_head, n_past + N).permute(0, 2, 1, 3)
|
||||
struct ggml_tensor * K =
|
||||
ggml_permute(ctx0,
|
||||
ggml_reshape_3d(ctx0,
|
||||
ggml_view_1d(ctx0, model.kv_self.k, (n_past + N)*n_embd, il*n_ctx*ggml_element_size(model.kv_self.k)*n_embd),
|
||||
n_embd/n_head, n_head, n_past + N),
|
||||
0, 2, 1, 3);
|
||||
|
||||
// K * Q
|
||||
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
|
||||
|
||||
// KQ_scaled = KQ / sqrt(n_embd/n_head)
|
||||
struct ggml_tensor * KQ_scaled = ggml_scale(ctx0, KQ, 1.0f/sqrt(float(n_embd)/n_head));
|
||||
|
||||
// KQ_masked = mask_past(KQ_scaled)
|
||||
struct ggml_tensor * KQ_masked = ggml_diag_mask_inf(ctx0, KQ_scaled, n_past);
|
||||
|
||||
// KQ = soft_max(KQ_masked)
|
||||
struct ggml_tensor * KQ_soft_max = ggml_soft_max(ctx0, KQ_masked);
|
||||
|
||||
// V_trans = Vmem.view(n_embd/n_head, n_head, n_past + N).permute(1, 2, 0, 3).contiguous()
|
||||
struct ggml_tensor * V =
|
||||
ggml_view_3d(ctx0, model.kv_self.v,
|
||||
n_past + N, n_embd/n_head, n_head,
|
||||
n_ctx*ggml_element_size(model.kv_self.v),
|
||||
n_ctx*ggml_element_size(model.kv_self.v)*n_embd/n_head,
|
||||
il*n_ctx*ggml_element_size(model.kv_self.v)*n_embd);
|
||||
|
||||
// KQV = transpose(V) * KQ_soft_max
|
||||
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);
|
||||
|
||||
// KQV_merged = KQV.permute(0, 2, 1, 3)
|
||||
struct ggml_tensor * KQV_merged = ggml_permute(ctx0, KQV, 0, 2, 1, 3);
|
||||
|
||||
// cur = KQV_merged.contiguous().view(n_embd, N)
|
||||
cur = ggml_cpy(ctx0,
|
||||
KQV_merged,
|
||||
ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, N));
|
||||
|
||||
// projection (no bias)
|
||||
cur = ggml_mul_mat(ctx0,
|
||||
model.layers[il].c_attn_proj_w,
|
||||
cur);
|
||||
}
|
||||
|
||||
struct ggml_tensor * inpFF = cur;
|
||||
|
||||
ggml_set_scratch(ctx0, {0, model.scr1_buf.size, model.scr1_buf.addr, });
|
||||
// feed-forward network
|
||||
// this is independent of the self-attention result, so it could be done in parallel to the self-attention
|
||||
{
|
||||
// note here we pass inpSA instead of cur
|
||||
cur = ggml_mul_mat(ctx0,
|
||||
model.layers[il].c_mlp_fc_w,
|
||||
inpSA);
|
||||
|
||||
cur = ggml_add(ctx0,
|
||||
ggml_repeat(ctx0, model.layers[il].c_mlp_fc_b, cur),
|
||||
cur);
|
||||
|
||||
// GELU activation
|
||||
cur = ggml_gelu(ctx0, cur);
|
||||
|
||||
// projection
|
||||
// cur = proj_w*cur + proj_b
|
||||
cur = ggml_mul_mat(ctx0,
|
||||
model.layers[il].c_mlp_proj_w,
|
||||
cur);
|
||||
|
||||
cur = ggml_add(ctx0,
|
||||
ggml_repeat(ctx0, model.layers[il].c_mlp_proj_b, cur),
|
||||
cur);
|
||||
}
|
||||
|
||||
// self-attention + FF
|
||||
cur = ggml_add(ctx0, cur, inpFF);
|
||||
|
||||
// input for next layer
|
||||
inpL = ggml_add(ctx0, cur, inpL);
|
||||
}
|
||||
|
||||
ggml_set_scratch(ctx0, {0, model.scr0_buf.size, model.scr0_buf.addr, });
|
||||
|
||||
// norm
|
||||
{
|
||||
inpL = ggml_norm(ctx0, inpL, model.hparams.norm_eps);
|
||||
|
||||
// inpL = ln_f_g*inpL + ln_f_b
|
||||
inpL = ggml_add(ctx0,
|
||||
ggml_mul(ctx0,
|
||||
ggml_repeat(ctx0, model.ln_f_g, inpL),
|
||||
inpL),
|
||||
ggml_repeat(ctx0, model.ln_f_b, inpL));
|
||||
}
|
||||
|
||||
ggml_set_scratch(ctx0, { 0, 0, nullptr, });
|
||||
|
||||
// lm_head
|
||||
{
|
||||
inpL = ggml_mul_mat(ctx0, model.lmh_g, inpL);
|
||||
|
||||
inpL = ggml_add(ctx0,
|
||||
ggml_repeat(ctx0, model.lmh_b, inpL),
|
||||
inpL);
|
||||
}
|
||||
|
||||
// logits -> probs
|
||||
//inpL = ggml_soft_max(ctx0, inpL);
|
||||
|
||||
ggml_build_forward_expand(gf, inpL);
|
||||
|
||||
// run the computation
|
||||
{
|
||||
std::unique_ptr<uint8_t []> data;
|
||||
auto plan = ggml_graph_plan(gf, n_threads);
|
||||
if (plan.work_size > 0) {
|
||||
data.reset(new uint8_t[plan.work_size]);
|
||||
plan.work_data = data.get();
|
||||
}
|
||||
ggml_graph_compute(gf, &plan);
|
||||
}
|
||||
|
||||
//if (n_past%100 == 0) {
|
||||
// ggml_graph_print (gf);
|
||||
// ggml_graph_dump_dot(gf, NULL, "gpt-2.dot");
|
||||
//}
|
||||
|
||||
//embd_w.resize(n_vocab*N);
|
||||
//memcpy(embd_w.data(), ggml_get_data(inpL), sizeof(float)*n_vocab*N);
|
||||
|
||||
// return result for just the last token
|
||||
embd_w.resize(n_vocab);
|
||||
memcpy(embd_w.data(), (float *) ggml_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
|
||||
|
||||
if (mem_per_token == 0) {
|
||||
mem_per_token = ggml_used_mem(ctx0)/N;
|
||||
}
|
||||
//printf("used_mem = %zu\n", ggml_used_mem(ctx0));
|
||||
|
||||
ggml_free(ctx0);
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
#define GPTJ_MAX_RNG_STATE 64*1024
|
||||
|
||||
size_t gptj_get_state_size(const gptj_model &model)
|
||||
{
|
||||
// we don't know size of rng until we actually serialize it. so reserve more than enough memory for its serialized state.
|
||||
// for reference, std::mt19937(1337) serializes to 6701 bytes.
|
||||
const size_t s_rng_size = sizeof(size_t);
|
||||
const size_t s_rng = GPTJ_MAX_RNG_STATE;
|
||||
const size_t s_kv_size = sizeof(size_t);
|
||||
const size_t s_kv_ntok = sizeof(int);
|
||||
const size_t s_kv = model.kv_self.buf.size;
|
||||
const size_t s_total = (
|
||||
+ s_rng_size
|
||||
+ s_rng
|
||||
+ s_kv_size
|
||||
+ s_kv_ntok
|
||||
+ s_kv
|
||||
);
|
||||
fflush(stdout);
|
||||
return s_total;
|
||||
}
|
||||
|
||||
size_t gptj_copy_state_data(const gptj_model &model, const std::mt19937 &rng, uint8_t *dest)
|
||||
{
|
||||
uint8_t * out = dest;
|
||||
fflush(stdout);
|
||||
// copy rng
|
||||
{
|
||||
std::stringstream rng_ss;
|
||||
rng_ss << rng;
|
||||
|
||||
const size_t rng_size = rng_ss.str().size();
|
||||
char rng_buf[GPTJ_MAX_RNG_STATE];
|
||||
|
||||
memset(&rng_buf[0], 0, GPTJ_MAX_RNG_STATE);
|
||||
memcpy(&rng_buf[0], rng_ss.str().data(), rng_ss.str().size());
|
||||
|
||||
memcpy(out, &rng_size, sizeof(rng_size)); out += sizeof(rng_size);
|
||||
memcpy(out, &rng_buf[0], GPTJ_MAX_RNG_STATE); out += GPTJ_MAX_RNG_STATE;
|
||||
}
|
||||
|
||||
// copy kv cache
|
||||
{
|
||||
const size_t kv_size = model.kv_self.buf.size;
|
||||
const int kv_ntok = model.kv_self.n;
|
||||
|
||||
memcpy(out, &kv_size, sizeof(kv_size)); out += sizeof(kv_size);
|
||||
memcpy(out, &kv_ntok, sizeof(kv_ntok)); out += sizeof(kv_ntok);
|
||||
|
||||
if (kv_size) {
|
||||
memcpy(out, model.kv_self.buf.addr, kv_size); out += kv_size;
|
||||
}
|
||||
}
|
||||
|
||||
const size_t written = out - dest;
|
||||
assert(written == gptj_get_state_size(model));
|
||||
fflush(stdout);
|
||||
return written;
|
||||
}
|
||||
|
||||
size_t gptj_set_state_data(gptj_model *model, std::mt19937 *rng, const uint8_t *src)
|
||||
{
|
||||
const uint8_t * in = src;
|
||||
|
||||
// set rng
|
||||
{
|
||||
size_t rng_size;
|
||||
char rng_buf[GPTJ_MAX_RNG_STATE];
|
||||
|
||||
memcpy(&rng_size, in, sizeof(rng_size)); in += sizeof(rng_size);
|
||||
memcpy(&rng_buf[0], in, GPTJ_MAX_RNG_STATE); in += GPTJ_MAX_RNG_STATE;
|
||||
|
||||
std::stringstream rng_ss;
|
||||
rng_ss.str(std::string(&rng_buf[0], rng_size));
|
||||
rng_ss >> *rng;
|
||||
|
||||
assert(rng_ss.fail() == false);
|
||||
}
|
||||
|
||||
// set kv cache
|
||||
{
|
||||
size_t kv_size;
|
||||
int kv_ntok;
|
||||
|
||||
memcpy(&kv_size, in, sizeof(kv_size)); in += sizeof(kv_size);
|
||||
memcpy(&kv_ntok, in, sizeof(kv_ntok)); in += sizeof(kv_ntok);
|
||||
|
||||
if (kv_size) {
|
||||
assert(model->kv_self.buf.size == kv_size);
|
||||
|
||||
void * k_data = model->kv_self.k->data; // remember data pointers
|
||||
void * v_data = model->kv_self.v->data; // because their value is stored in buf and overwritten by memcpy
|
||||
|
||||
memcpy(model->kv_self.buf.addr, in, kv_size); in += kv_size;
|
||||
|
||||
model->kv_self.k->data = k_data; // restore correct data pointers
|
||||
model->kv_self.v->data = v_data;
|
||||
|
||||
}
|
||||
|
||||
model->kv_self.n = kv_ntok;
|
||||
}
|
||||
|
||||
const size_t nread = in - src;
|
||||
assert(nread == gptj_get_state_size(*model));
|
||||
fflush(stdout);
|
||||
return nread;
|
||||
}

struct GPTJPrivate {
    const std::string modelPath;
    bool modelLoaded;
    gpt_vocab vocab;
    gptj_model *model = nullptr;
    int64_t n_threads = 0;
    size_t mem_per_token = 0;
    std::mt19937 rng;
};

GPTJ::GPTJ()
    : d_ptr(new GPTJPrivate) {
    d_ptr->model = new gptj_model;
    d_ptr->model->ctx = nullptr;
    d_ptr->modelLoaded = false;
}

size_t GPTJ::requiredMem(const std::string &modelPath, int n_ctx, int ngl)
{
    (void)n_ctx;
    (void)ngl;
    gptj_model dummy_model;
    gpt_vocab dummy_vocab;
    size_t mem_req;
    gptj_model_load(modelPath, dummy_model, dummy_vocab, &mem_req);
    return mem_req;
}

bool GPTJ::loadModel(const std::string &modelPath, int n_ctx, int ngl)
{
    (void)n_ctx;
    (void)ngl;
    d_ptr->modelLoaded = false;

    std::mt19937 rng(time(NULL));
    d_ptr->rng = rng;

    // load the model
    bool ok = gptj_model_load(modelPath, *d_ptr->model, d_ptr->vocab);
    fflush(stdout);
    if (!ok) {
        std::cerr << "GPT-J ERROR: failed to load model from " << modelPath;
        return false;
    }

    d_ptr->n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
    d_ptr->modelLoaded = true;
    return true;
}

void GPTJ::setThreadCount(int32_t n_threads)
{
    d_ptr->n_threads = n_threads;
}

int32_t GPTJ::threadCount() const
{
    return d_ptr->n_threads;
}

GPTJ::~GPTJ()
{
    delete d_ptr->model;
}

bool GPTJ::isModelLoaded() const
{
    return d_ptr->modelLoaded;
}

size_t GPTJ::stateSize() const
{
    return gptj_get_state_size(*d_ptr->model);
}

size_t GPTJ::saveState(uint8_t *dest) const
{
    return gptj_copy_state_data(*d_ptr->model, d_ptr->rng, dest);
}

size_t GPTJ::restoreState(const uint8_t *src)
{
    return gptj_set_state_data(d_ptr->model, &d_ptr->rng, src);
}

std::vector<LLModel::Token> GPTJ::tokenize(PromptContext &ctx, const std::string &str, bool special) const
{
    (void)ctx;
    (void)special;
    return ::gpt_tokenize(d_ptr->vocab, str);
}

LLModel::Token GPTJ::sampleToken(PromptContext &promptCtx) const
{
    const size_t n_prev_toks = std::min((size_t) promptCtx.repeat_last_n, promptCtx.tokens.size());
    return gpt_sample_top_k_top_p(d_ptr->model->hparams.n_vocab,
        promptCtx.tokens.data() + promptCtx.tokens.size() - n_prev_toks,
        n_prev_toks,
        promptCtx.logits,
        promptCtx.top_k, promptCtx.top_p, promptCtx.temp,
        promptCtx.repeat_penalty,
        d_ptr->rng);
}

std::string GPTJ::tokenToString(Token id) const
{
    return d_ptr->vocab.id_to_token[id];
}

bool GPTJ::evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const
{
    // determine the required inference memory per token:
    static bool initialized = false;
    if (!initialized) {
        gptj_eval(*d_ptr->model, d_ptr->n_threads, 0, { 0, 1, 2, 3 }, ctx.logits,
            d_ptr->mem_per_token);
        initialized = true;
    }

    return gptj_eval(*d_ptr->model, d_ptr->n_threads, ctx.n_past, tokens, ctx.logits, d_ptr->mem_per_token);
}

int32_t GPTJ::contextLength() const
{
    return d_ptr->model->hparams.n_ctx;
}

const std::vector<LLModel::Token> &GPTJ::endTokens() const
{
    static const std::vector<LLModel::Token> fres = {50256};
    return fres;
}

const char *get_arch_name(gguf_context *ctx_gguf)
{
    const int kid = gguf_find_key(ctx_gguf, "general.architecture");
    if (kid == -1)
        throw std::runtime_error("key not found in model: general.architecture");

    enum gguf_type ktype = gguf_get_kv_type(ctx_gguf, kid);
    if (ktype != GGUF_TYPE_STRING)
        throw std::runtime_error("key general.architecture has wrong type");

    return gguf_get_val_str(ctx_gguf, kid);
}

#if defined(_WIN32)
#define DLL_EXPORT __declspec(dllexport)
#else
#define DLL_EXPORT __attribute__ ((visibility ("default")))
#endif

extern "C" {
|
||||
DLL_EXPORT bool is_g4a_backend_model_implementation()
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
DLL_EXPORT const char *get_model_type()
|
||||
{
|
||||
return modelType_;
|
||||
}
|
||||
|
||||
DLL_EXPORT const char *get_build_variant()
|
||||
{
|
||||
return GGML_BUILD_VARIANT;
|
||||
}
|
||||
|
||||
DLL_EXPORT char *get_file_arch(const char *fname)
|
||||
{
|
||||
struct ggml_context * ctx_meta = NULL;
|
||||
struct gguf_init_params params = {
|
||||
/*.no_alloc = */ true,
|
||||
/*.ctx = */ &ctx_meta,
|
||||
};
|
||||
gguf_context *ctx_gguf = gguf_init_from_file(fname, params);
|
||||
|
||||
char *arch = nullptr;
|
||||
if (ctx_gguf && gguf_get_version(ctx_gguf) <= 3) {
|
||||
try {
|
||||
arch = strdup(get_arch_name(ctx_gguf));
|
||||
} catch (const std::runtime_error &) {
|
||||
// cannot read key -> return null
|
||||
}
|
||||
}
|
||||
|
||||
gguf_free(ctx_gguf);
|
||||
return arch;
|
||||
}
|
||||
|
||||
DLL_EXPORT bool is_arch_supported(const char *arch)
|
||||
{
|
||||
return !strcmp(arch, "gptj");
|
||||
}
|
||||
|
||||
DLL_EXPORT LLModel *construct()
|
||||
{
|
||||
return new GPTJ;
|
||||
}
|
||||
}
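
The extern "C" block above is the plugin interface a host application loads at runtime. The sketch
below is not from the original source: it shows one way a loader might consume these exported
symbols, assuming a POSIX dlopen()-based host with llmodel.h already included; the wrapper function
load_gptj_backend and its arguments are hypothetical names introduced only for this example.

// Illustrative loader sketch: resolve the exported entry points and construct a GPTJ instance.
#include <dlfcn.h>   // dlopen/dlsym (POSIX assumption)
#include <cstdlib>   // free

static LLModel *load_gptj_backend(const char *libpath, const char *modelFile)
{
    void *handle = dlopen(libpath, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return nullptr;

    auto isImpl   = reinterpret_cast<bool (*)()>(dlsym(handle, "is_g4a_backend_model_implementation"));
    auto fileArch = reinterpret_cast<char *(*)(const char *)>(dlsym(handle, "get_file_arch"));
    auto archOk   = reinterpret_cast<bool (*)(const char *)>(dlsym(handle, "is_arch_supported"));
    auto make     = reinterpret_cast<LLModel *(*)()>(dlsym(handle, "construct"));
    if (!isImpl || !fileArch || !archOk || !make || !isImpl())
        return nullptr;

    // get_file_arch() reads general.architecture from the GGUF header and returns strdup()'d memory
    char *arch = fileArch(modelFile);
    const bool supported = arch && archOk(arch);
    free(arch);
    if (!supported)
        return nullptr;

    return make(); // caller owns the returned LLModel and should call loadModel() on it next
}
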
@@ -1,43 +0,0 @@
#ifndef GPTJ_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#error This file is NOT meant to be included outside of gptj.cpp. Doing so is DANGEROUS. Be sure to know what you are doing before proceeding to #define GPTJ_H_I_KNOW_WHAT_I_AM_DOING_WHEN_INCLUDING_THIS_FILE
#endif
#ifndef GPTJ_H
#define GPTJ_H

#include "llmodel.h"

#include <functional>
#include <string>
#include <vector>

struct GPTJPrivate;
class GPTJ : public LLModel {
public:
    GPTJ();
    ~GPTJ();

    bool supportsEmbedding() const override { return false; }
    bool supportsCompletion() const override { return true; }
    bool loadModel(const std::string &modelPath, int n_ctx, int ngl) override;
    bool isModelLoaded() const override;
    size_t requiredMem(const std::string &modelPath, int n_ctx, int ngl) override;
    size_t stateSize() const override;
    size_t saveState(uint8_t *dest) const override;
    size_t restoreState(const uint8_t *src) override;
    void setThreadCount(int32_t n_threads) override;
    int32_t threadCount() const override;

private:
    GPTJPrivate *d_ptr;

protected:
    std::vector<Token> tokenize(PromptContext &ctx, const std::string &str, bool special) const override;
    Token sampleToken(PromptContext &ctx) const override;
    std::string tokenToString(Token id) const override;
    bool evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const override;
    int32_t contextLength() const override;
    const std::vector<Token> &endTokens() const override;
    bool shouldAddBOS() const override { return false; }
};

#endif // GPTJ_H
@ -5,8 +5,10 @@
|
||||
#include <cassert>
|
||||
#include <cstddef>
|
||||
#include <cstdint>
|
||||
#include <expected>
|
||||
#include <functional>
|
||||
#include <optional>
|
||||
#include <span>
|
||||
#include <stdexcept>
|
||||
#include <string>
|
||||
#include <string_view>
|
||||
@ -14,14 +16,19 @@
|
||||
#include <utility>
|
||||
#include <vector>
|
||||
|
||||
class Dlhandle;
|
||||
|
||||
using namespace std::string_literals;
|
||||
|
||||
#define LLMODEL_MAX_PROMPT_BATCH 128
|
||||
|
||||
class Dlhandle;
|
||||
class LLModel {
|
||||
public:
|
||||
using Token = int32_t;
|
||||
using PromptCallback = std::function<bool(std::span<const Token> batch, bool cached)>;
|
||||
using ResponseCallback = std::function<bool(Token token, std::string_view piece)>;
|
||||
using EmbedCancelCallback = bool(unsigned *batchSizes, unsigned nBatch, const char *backend);
|
||||
using ProgressCallback = std::function<bool(float progress)>;
|
||||
|
||||
class BadArchError: public std::runtime_error {
|
||||
public:
|
||||
@ -99,6 +106,7 @@ public:
|
||||
static int32_t maxContextLength(const std::string &modelPath);
|
||||
static int32_t layerCount(const std::string &modelPath);
|
||||
static bool isEmbeddingModel(const std::string &modelPath);
|
||||
static auto chatTemplate(const char *modelPath) -> std::expected<std::string, std::string>;
|
||||
static void setImplementationsSearchPath(const std::string &path);
|
||||
static const std::string &implementationsSearchPath();
|
||||
static bool hasSupportedCPU();
|
||||
@ -122,10 +130,6 @@ public:
|
||||
};
|
||||
|
||||
struct PromptContext {
|
||||
std::vector<float> logits; // logits of current context
|
||||
std::vector<int32_t> tokens; // current tokens in the context window
|
||||
int32_t n_past = 0; // number of tokens in past conversation
|
||||
int32_t n_ctx = 0; // number of tokens possible in context window
|
||||
int32_t n_predict = 200;
|
||||
int32_t top_k = 40;
|
||||
float top_p = 0.9f;
|
||||
@ -134,38 +138,31 @@ public:
|
||||
int32_t n_batch = 9;
|
||||
float repeat_penalty = 1.10f;
|
||||
int32_t repeat_last_n = 64; // last n tokens to penalize
|
||||
float contextErase = 0.75f; // percent of context to erase if we exceed the context window
|
||||
int32_t n_last_batch_tokens = 0;
|
||||
float contextErase = 0.5f; // percent of context to erase if we exceed the context window
|
||||
};
|
||||
|
||||
using ProgressCallback = std::function<bool(float progress)>;
|
||||
|
||||
explicit LLModel() {}
|
||||
virtual ~LLModel() {}
|
||||
|
||||
virtual bool supportsEmbedding() const = 0;
|
||||
virtual bool supportsCompletion() const = 0;
|
||||
virtual bool loadModel(const std::string &modelPath, int n_ctx, int ngl) = 0;
|
||||
virtual bool isModelBlacklisted(const std::string &modelPath) const { (void)modelPath; return false; };
|
||||
virtual bool isModelBlacklisted(const std::string &modelPath) const { (void)modelPath; return false; }
|
||||
virtual bool isEmbeddingModel(const std::string &modelPath) const { (void)modelPath; return false; }
|
||||
virtual bool isModelLoaded() const = 0;
|
||||
virtual size_t requiredMem(const std::string &modelPath, int n_ctx, int ngl) = 0;
|
||||
virtual size_t stateSize() const { return 0; }
|
||||
virtual size_t saveState(uint8_t *dest) const { (void)dest; return 0; }
|
||||
virtual size_t restoreState(const uint8_t *src) { (void)src; return 0; }
|
||||
virtual size_t stateSize() const = 0;
|
||||
virtual size_t saveState(std::span<uint8_t> stateOut, std::vector<Token> &inputTokensOut) const = 0;
|
||||
virtual size_t restoreState(std::span<const uint8_t> state, std::span<const Token> inputTokens) = 0;
|
||||
|
||||
// This method requires the model to return true from supportsCompletion otherwise it will throw
|
||||
// an error
|
||||
virtual void prompt(const std::string &prompt,
|
||||
const std::string &promptTemplate,
|
||||
std::function<bool(int32_t)> promptCallback,
|
||||
std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &ctx,
|
||||
bool special = false,
|
||||
std::string *fakeReply = nullptr);
|
||||
virtual void prompt(std::string_view prompt,
|
||||
const PromptCallback &promptCallback,
|
||||
const ResponseCallback &responseCallback,
|
||||
const PromptContext &ctx);
|
||||
|
||||
using EmbedCancelCallback = bool(unsigned *batchSizes, unsigned nBatch, const char *backend);
|
||||
virtual int32_t countPromptTokens(std::string_view prompt) const;
|
||||
|
||||
virtual size_t embeddingSize() const {
|
||||
throw std::logic_error(std::string(implementation().modelType()) + " does not support embeddings");
|
||||
@ -210,14 +207,24 @@ public:
|
||||
|
||||
void setProgressCallback(ProgressCallback callback) { m_progressCallback = callback; }
|
||||
|
||||
virtual int32_t contextLength() const = 0;
|
||||
virtual auto specialTokens() -> std::unordered_map<std::string, std::string> const = 0;
|
||||
|
||||
protected:
|
||||
// These are pure virtual because subclasses need to implement as the default implementation of
|
||||
// 'prompt' above calls these functions
|
||||
virtual std::vector<Token> tokenize(PromptContext &ctx, const std::string &str, bool special = false) const = 0;
|
||||
virtual std::vector<Token> tokenize(std::string_view str) const = 0;
|
||||
virtual bool isSpecialToken(Token id) const = 0;
|
||||
virtual std::string tokenToString(Token id) const = 0;
|
||||
virtual Token sampleToken(PromptContext &ctx) const = 0;
|
||||
virtual bool evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const = 0;
|
||||
virtual int32_t contextLength() const = 0;
|
||||
virtual void initSampler(const PromptContext &ctx) = 0;
|
||||
virtual Token sampleToken() const = 0;
|
||||
virtual bool evalTokens(int32_t nPast, std::span<const Token> tokens) const = 0;
|
||||
virtual void shiftContext(const PromptContext &promptCtx, int32_t *nPast) = 0;
|
||||
virtual int32_t inputLength() const = 0;
|
||||
virtual int32_t computeModelInputPosition(std::span<const Token> input) const = 0;
|
||||
virtual void setModelInputPosition(int32_t pos) = 0;
|
||||
virtual void appendInputToken(Token tok) = 0;
|
||||
virtual std::span<const Token> inputTokens() const = 0;
|
||||
virtual const std::vector<Token> &endTokens() const = 0;
|
||||
virtual bool shouldAddBOS() const = 0;
|
||||
|
||||
@ -233,9 +240,11 @@ protected:
|
||||
return -1;
|
||||
}
|
||||
|
||||
// This is a helper function called from the default implementation of 'prompt' but it can be
|
||||
// shared by all base classes so it isn't virtual
|
||||
void recalculateContext(PromptContext &promptCtx, std::function<bool(bool)> recalculate);
|
||||
virtual auto chatTemplate(const char *modelPath) const -> std::expected<std::string, std::string>
|
||||
{
|
||||
(void)modelPath;
|
||||
return std::unexpected("not implemented");
|
||||
}
|
||||
|
||||
const Implementation *m_implementation = nullptr;
|
||||
|
||||
@ -248,16 +257,16 @@ protected:
|
||||
return true;
|
||||
}
|
||||
|
||||
void decodePrompt(std::function<bool(int32_t)> promptCallback,
|
||||
std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &promptCtx,
|
||||
std::vector<Token> embd_inp);
|
||||
void generateResponse(std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &promptCtx);
|
||||
// prefill context with prompt
|
||||
auto decodePrompt(const PromptCallback &promptCallback,
|
||||
const PromptContext &promptCtx,
|
||||
std::vector<Token> embd_inp)
|
||||
-> std::optional<int32_t>;
|
||||
// generate a response
|
||||
void generateResponse(const ResponseCallback &responseCallback,
|
||||
const PromptContext &promptCtx,
|
||||
int32_t nPast);
|
||||
|
||||
private:
|
||||
friend class LLMImplementation;
|
||||
};
|
||||
|
@ -23,6 +23,11 @@ extern "C" {
|
||||
*/
|
||||
typedef void *llmodel_model;
|
||||
|
||||
/**
|
||||
* A token.
|
||||
*/
|
||||
typedef int32_t token_t;
|
||||
|
||||
/**
|
||||
* llmodel_prompt_context structure for holding the prompt context.
|
||||
* NOTE: The implementation takes care of all the memory handling of the raw logits pointer and the
|
||||
@ -30,21 +35,15 @@ typedef void *llmodel_model;
|
||||
* behavior.
|
||||
*/
|
||||
struct llmodel_prompt_context {
|
||||
float *logits; // logits of current context
|
||||
size_t logits_size; // the size of the raw logits vector
|
||||
int32_t *tokens; // current tokens in the context window
|
||||
size_t tokens_size; // the size of the raw tokens vector
|
||||
int32_t n_past; // number of tokens in past conversation
|
||||
int32_t n_ctx; // number of tokens possible in context window
|
||||
int32_t n_predict; // number of tokens to predict
|
||||
int32_t top_k; // top k logits to sample from
|
||||
float top_p; // nucleus sampling probability threshold
|
||||
float min_p; // Min P sampling
|
||||
float temp; // temperature to adjust model's output distribution
|
||||
float top_p; // nucleus sampling probability threshold
|
||||
float min_p; // Min P sampling
|
||||
float temp; // temperature to adjust model's output distribution
|
||||
int32_t n_batch; // number of predictions to generate in parallel
|
||||
float repeat_penalty; // penalty factor for repeated tokens
|
||||
float repeat_penalty; // penalty factor for repeated tokens
|
||||
int32_t repeat_last_n; // last n tokens to penalize
|
||||
float context_erase; // percent of context to erase if we exceed the context window
|
||||
float context_erase; // percent of context to erase if we exceed the context window
|
||||
};
|
||||
|
||||
struct llmodel_gpu_device {
|
||||
@ -63,10 +62,12 @@ typedef struct llmodel_gpu_device llmodel_gpu_device;
|
||||
|
||||
/**
|
||||
* Callback type for prompt processing.
|
||||
* @param token_id The token id of the prompt.
|
||||
* @param token_ids An array of token ids of the prompt.
|
||||
* @param n_token_ids The number of tokens in the array.
|
||||
* @param cached Whether the tokens were already in cache.
|
||||
* @return a bool indicating whether the model should keep processing.
|
||||
*/
|
||||
typedef bool (*llmodel_prompt_callback)(int32_t token_id);
|
||||
typedef bool (*llmodel_prompt_callback)(const token_t *token_ids, size_t n_token_ids, bool cached);
|
||||
|
||||
/**
|
||||
* Callback type for response.
|
||||
@ -74,14 +75,7 @@ typedef bool (*llmodel_prompt_callback)(int32_t token_id);
|
||||
* @param response The response string. NOTE: a token_id of -1 indicates the string is an error string.
|
||||
* @return a bool indicating whether the model should keep generating.
|
||||
*/
|
||||
typedef bool (*llmodel_response_callback)(int32_t token_id, const char *response);
|
||||
|
||||
/**
|
||||
* Callback type for recalculation of context.
|
||||
* @param whether the model is recalculating the context.
|
||||
* @return a bool indicating whether the model should keep generating.
|
||||
*/
|
||||
typedef bool (*llmodel_recalculate_callback)(bool is_recalculating);
|
||||
typedef bool (*llmodel_response_callback)(token_t token_id, const char *response);
|
||||
|
||||
/**
|
||||
* Embedding cancellation callback for use with llmodel_embed.
|
||||
@ -92,6 +86,8 @@ typedef bool (*llmodel_recalculate_callback)(bool is_recalculating);
|
||||
*/
|
||||
typedef bool (*llmodel_emb_cancel_callback)(unsigned *batch_sizes, unsigned n_batch, const char *backend);
|
||||
|
||||
typedef void (*llmodel_special_token_callback)(const char *name, const char *token);
|
||||
|
||||
/**
|
||||
* Create a llmodel instance.
|
||||
* Recognises correct model type from file at model_path
|
||||
@@ -150,46 +146,57 @@ bool llmodel_isModelLoaded(llmodel_model model);
 * @param model A pointer to the llmodel_model instance.
 * @return the size in bytes of the internal state of the model
 */
uint64_t llmodel_get_state_size(llmodel_model model);
uint64_t llmodel_state_get_size(llmodel_model model);

/**
 * Saves the internal state of the model to the specified destination address.
 * Saves the internal state of the model.
 * NOTE: This state data is specific to the type of model you have created.
 * @param model A pointer to the llmodel_model instance.
 * @param dest A pointer to the destination.
 * @return the number of bytes copied
 * @param state Where to store the state. This must be a buffer of at least llmodel_state_get_size() bytes.
 * @param state_size The size of the destination for the state.
 * @param input_tokens_out Where to store the address of the token cache state. This is dynamically allocated and must
 *        be freed with llmodel_state_free_input_tokens.
 * @param n_input_tokens Where to store the size of the token cache state.
 * @return The number of bytes copied. On error, zero is returned, the token cache is set to NULL, and the token cache
 *         size is set to zero.
 */
uint64_t llmodel_save_state_data(llmodel_model model, uint8_t *dest);
uint64_t llmodel_state_get_data(llmodel_model model, uint8_t *state_out, uint64_t state_size,
                                token_t **input_tokens_out, uint64_t *n_input_tokens);

/**
 * Frees the temporary token cache buffer created by a call to llmodel_state_get_data().
 * @param input_tokens The token cache buffer.
 */
void llmodel_state_free_input_tokens(token_t *input_tokens);

/**
 * Restores the internal state of the model using data from the specified address.
 * NOTE: This state data is specific to the type of model you have created.
 * @param model A pointer to the llmodel_model instance.
 * @param src A pointer to the src.
 * @return the number of bytes read
 * @param state A pointer to the state data.
 * @param state_size The size of the state data.
 * @param input_tokens The token cache associated with the saved state.
 * @param n_input_tokens The number of tokens in input_tokens.
 * @return The number of bytes read, or zero on error.
 */
uint64_t llmodel_restore_state_data(llmodel_model model, const uint8_t *src);
uint64_t llmodel_state_set_data(llmodel_model model, const uint8_t *state, uint64_t state_size,
                                const token_t *input_tokens, uint64_t n_input_tokens);

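The declarations above replace the old save/restore pair and add an explicit token-cache buffer.
The following is a minimal usage sketch, not part of the header, written in C++ against this C API;
it assumes `model` is a handle that was already created and loaded through the library's usual
entry points, and the function name roundtrip_state is invented for illustration.

// Sketch: snapshot the model state plus its token cache, then restore both later.
#include <cstdint>
#include <vector>

static bool roundtrip_state(llmodel_model model)
{
    std::vector<uint8_t> state(llmodel_state_get_size(model));

    token_t *input_tokens = nullptr;
    uint64_t n_input_tokens = 0;
    const uint64_t written = llmodel_state_get_data(model, state.data(), state.size(),
                                                    &input_tokens, &n_input_tokens);
    if (written == 0)
        return false; // zero bytes, a NULL token cache and a zero count signal an error

    // ... run more prompts here, then roll back to the snapshot ...
    const uint64_t nread = llmodel_state_set_data(model, state.data(), written,
                                                  input_tokens, n_input_tokens);

    llmodel_state_free_input_tokens(input_tokens); // the cache buffer is allocated by the library
    return nread != 0;
}
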
/**
 * Generate a response using the model.
 * @param model A pointer to the llmodel_model instance.
 * @param prompt A string representing the input prompt.
 * @param prompt_template A string representing the input prompt template.
 * @param prompt_callback A callback function for handling the processing of prompt.
 * @param response_callback A callback function for handling the generated response.
 * @param recalculate_callback A callback function for handling recalculation requests.
 * @param special True if special tokens in the prompt should be processed, false otherwise.
 * @param fake_reply A string to insert into context as the model's reply, or NULL to generate one.
 * @param ctx A pointer to the llmodel_prompt_context structure.
 * @param error A pointer to a string; will only be set on error.
 */
void llmodel_prompt(llmodel_model model, const char *prompt,
                    const char *prompt_template,
                    llmodel_prompt_callback prompt_callback,
                    llmodel_response_callback response_callback,
                    llmodel_recalculate_callback recalculate_callback,
                    llmodel_prompt_context *ctx,
                    bool special,
                    const char *fake_reply);
bool llmodel_prompt(llmodel_model model,
                    const char *prompt,
                    llmodel_prompt_callback prompt_callback,
                    llmodel_response_callback response_callback,
                    llmodel_prompt_context *ctx,
                    const char **error);

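The second of the two llmodel_prompt() declarations above is the new signature: the prompt
template, recalculate callback, special flag and fake reply are gone, and errors are reported
through a bool return plus an error string. The sketch below is not part of the header; the
callback bodies, the run_prompt wrapper and the sampling values placed in the context are
illustrative choices only, not values prescribed by the API.

// Usage sketch for the new llmodel_prompt() signature (C API driven from C++).
#include <cstddef>
#include <cstdio>

static bool on_prompt(const token_t *token_ids, size_t n_token_ids, bool cached)
{
    (void)token_ids; (void)n_token_ids; (void)cached;
    return true; // keep processing the prompt
}

static bool on_response(token_t token_id, const char *response)
{
    (void)token_id;
    std::fputs(response, stdout); // stream each generated piece as it arrives
    return true;                  // keep generating
}

static bool run_prompt(llmodel_model model, const char *text)
{
    llmodel_prompt_context ctx {};
    ctx.n_predict      = 128;
    ctx.top_k          = 40;
    ctx.top_p          = 0.9f;
    ctx.temp           = 0.7f;
    ctx.n_batch        = 8;
    ctx.repeat_penalty = 1.1f;
    ctx.repeat_last_n  = 64;
    ctx.context_erase  = 0.5f;

    const char *error = nullptr;
    if (!llmodel_prompt(model, text, on_prompt, on_response, &ctx, &error)) {
        std::fprintf(stderr, "prompt failed: %s\n", error ? error : "unknown error");
        return false;
    }
    return true;
}
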
/**
|
||||
* Generate an embedding using the model.
|
||||
@ -301,6 +308,10 @@ const char *llmodel_model_backend_name(llmodel_model model);
|
||||
*/
|
||||
const char *llmodel_model_gpu_device_name(llmodel_model model);
|
||||
|
||||
int32_t llmodel_count_prompt_tokens(llmodel_model model, const char *prompt, const char **error);
|
||||
|
||||
void llmodel_model_foreach_special_token(llmodel_model model, llmodel_special_token_callback callback);
|
||||
|
||||
#ifdef __cplusplus
|
||||
}
|
||||
#endif
|
@@ -1 +0,0 @@
Subproject commit b2db03acf299111885af2921a4230de07623eaf8
@ -7,7 +7,7 @@ set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
|
||||
#
|
||||
# some of the options here are commented out so they can be set "dynamically" before calling include_ggml()
|
||||
|
||||
set(LLAMA_LLAMAFILE_DEFAULT ON)
|
||||
set(GGML_LLAMAFILE_DEFAULT ON)
|
||||
|
||||
# general
|
||||
option(LLAMA_STATIC "llama: static link libraries" OFF)
|
||||
@ -22,15 +22,15 @@ option(LLAMA_GPROF "llama: enable gprof"
|
||||
option(LLAMA_FATAL_WARNINGS "llama: enable -Werror flag" OFF)
|
||||
|
||||
# instruction set specific
|
||||
#option(LLAMA_AVX "llama: enable AVX" ON)
|
||||
#option(LLAMA_AVX2 "llama: enable AVX2" ON)
|
||||
#option(LLAMA_AVX512 "llama: enable AVX512" OFF)
|
||||
#option(LLAMA_AVX512_VBMI "llama: enable AVX512-VBMI" OFF)
|
||||
#option(LLAMA_AVX512_VNNI "llama: enable AVX512-VNNI" OFF)
|
||||
#option(LLAMA_FMA "llama: enable FMA" ON)
|
||||
#option(GGML_AVX "ggml: enable AVX" ON)
|
||||
#option(GGML_AVX2 "ggml: enable AVX2" ON)
|
||||
#option(GGML_AVX512 "ggml: enable AVX512" OFF)
|
||||
#option(GGML_AVX512_VBMI "ggml: enable AVX512-VBMI" OFF)
|
||||
#option(GGML_AVX512_VNNI "ggml: enable AVX512-VNNI" OFF)
|
||||
#option(GGML_FMA "ggml: enable FMA" ON)
|
||||
# in MSVC F16C is implied with AVX2/AVX512
|
||||
#if (NOT MSVC)
|
||||
# option(LLAMA_F16C "llama: enable F16C" ON)
|
||||
# option(GGML_F16C "ggml: enable F16C" ON)
|
||||
#endif()
|
||||
|
||||
if (WIN32)
|
||||
@ -38,40 +38,46 @@ if (WIN32)
|
||||
endif()
|
||||
|
||||
# 3rd party libs
|
||||
option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
|
||||
option(LLAMA_BLAS "llama: use BLAS" OFF)
|
||||
option(LLAMA_LLAMAFILE "llama: use llamafile SGEMM" ${LLAMA_LLAMAFILE_DEFAULT})
|
||||
set(LLAMA_BLAS_VENDOR "Generic" CACHE STRING "llama: BLAS library vendor")
|
||||
#option(LLAMA_CUDA "llama: use CUDA" OFF)
|
||||
option(LLAMA_CUDA_FORCE_DMMV "llama: use dmmv instead of mmvq CUDA kernels" OFF)
|
||||
option(LLAMA_CUDA_FORCE_MMQ "llama: use mmq kernels instead of cuBLAS" OFF)
|
||||
set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kernels")
|
||||
set(LLAMA_CUDA_MMV_Y "1" CACHE STRING "llama: y block size for mmv CUDA kernels")
|
||||
option(LLAMA_CUDA_F16 "llama: use 16 bit floats for some calculations" OFF)
|
||||
set(LLAMA_CUDA_KQUANTS_ITER "2" CACHE STRING "llama: iters./thread per block for Q2_K/Q6_K")
|
||||
set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
|
||||
"llama: max. batch size for using peer access")
|
||||
option(LLAMA_CUDA_NO_PEER_COPY "llama: do not use peer to peer copies" OFF)
|
||||
#option(LLAMA_HIPBLAS "llama: use hipBLAS" OFF)
|
||||
option(LLAMA_HIP_UMA "llama: use HIP unified memory architecture" OFF)
|
||||
#option(LLAMA_CLBLAST "llama: use CLBlast" OFF)
|
||||
#option(LLAMA_VULKAN "llama: use Vulkan" OFF)
|
||||
option(LLAMA_VULKAN_CHECK_RESULTS "llama: run Vulkan op checks" OFF)
|
||||
option(LLAMA_VULKAN_DEBUG "llama: enable Vulkan debug output" OFF)
|
||||
option(LLAMA_VULKAN_VALIDATE "llama: enable Vulkan validation" OFF)
|
||||
option(LLAMA_VULKAN_RUN_TESTS "llama: run Vulkan tests" OFF)
|
||||
#option(LLAMA_METAL "llama: use Metal" ${LLAMA_METAL_DEFAULT})
|
||||
option(LLAMA_METAL_NDEBUG "llama: disable Metal debugging" OFF)
|
||||
option(LLAMA_METAL_SHADER_DEBUG "llama: compile Metal with -fno-fast-math" OFF)
|
||||
set(LLAMA_METAL_MACOSX_VERSION_MIN "" CACHE STRING
|
||||
"llama: metal minimum macOS version")
|
||||
set(LLAMA_METAL_STD "" CACHE STRING "llama: metal standard version (-std flag)")
|
||||
#option(LLAMA_KOMPUTE "llama: use Kompute" OFF)
|
||||
option(LLAMA_QKK_64 "llama: use super-block size of 64 for k-quants" OFF)
|
||||
set(LLAMA_SCHED_MAX_COPIES "4" CACHE STRING "llama: max input copies for pipeline parallelism")
|
||||
option(GGML_ACCELERATE "ggml: enable Accelerate framework" ON)
|
||||
option(GGML_BLAS "ggml: use BLAS" OFF)
|
||||
option(GGML_LLAMAFILE "ggml: use llamafile SGEMM" ${GGML_LLAMAFILE_DEFAULT})
|
||||
set(GGML_BLAS_VENDOR "Generic" CACHE STRING "ggml: BLAS library vendor")
|
||||
|
||||
#option(GGML_CUDA "ggml: use CUDA" OFF)
|
||||
option(GGML_CUDA_FORCE_DMMV "ggml: use dmmv instead of mmvq CUDA kernels" OFF)
|
||||
option(GGML_CUDA_FORCE_MMQ "ggml: use mmq kernels instead of cuBLAS" OFF)
|
||||
option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq kernels" OFF)
|
||||
set (GGML_CUDA_DMMV_X "32" CACHE STRING "ggml: x stride for dmmv CUDA kernels")
|
||||
set (GGML_CUDA_MMV_Y "1" CACHE STRING "ggml: y block size for mmv CUDA kernels")
|
||||
option(GGML_CUDA_F16 "ggml: use 16 bit floats for some calculations" OFF)
|
||||
set (GGML_CUDA_KQUANTS_ITER "2" CACHE STRING
|
||||
"ggml: iters./thread per block for Q2_K/Q6_K")
|
||||
set (GGML_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
|
||||
"ggml: max. batch size for using peer access")
|
||||
option(GGML_CUDA_NO_PEER_COPY "ggml: do not use peer to peer copies" OFF)
|
||||
option(GGML_CUDA_NO_VMM "ggml: do not try to use CUDA VMM" OFF)
|
||||
option(GGML_CUDA_FA_ALL_QUANTS "ggml: compile all quants for FlashAttention" OFF)
|
||||
option(GGML_CUDA_USE_GRAPHS "ggml: use CUDA graphs (llama.cpp only)" OFF)
|
||||
|
||||
#option(GGML_HIPBLAS "ggml: use hipBLAS" OFF)
|
||||
option(GGML_HIP_UMA "ggml: use HIP unified memory architecture" OFF)
|
||||
#option(GGML_VULKAN "ggml: use Vulkan" OFF)
|
||||
option(GGML_VULKAN_CHECK_RESULTS "ggml: run Vulkan op checks" OFF)
|
||||
option(GGML_VULKAN_DEBUG "ggml: enable Vulkan debug output" OFF)
|
||||
option(GGML_VULKAN_VALIDATE "ggml: enable Vulkan validation" OFF)
|
||||
option(GGML_VULKAN_RUN_TESTS "ggml: run Vulkan tests" OFF)
|
||||
#option(GGML_METAL "ggml: use Metal" ${GGML_METAL_DEFAULT})
|
||||
option(GGML_METAL_NDEBUG "ggml: disable Metal debugging" OFF)
|
||||
option(GGML_METAL_SHADER_DEBUG "ggml: compile Metal with -fno-fast-math" OFF)
|
||||
set(GGML_METAL_MACOSX_VERSION_MIN "" CACHE STRING
|
||||
"ggml: metal minimum macOS version")
|
||||
set(GGML_METAL_STD "" CACHE STRING "ggml: metal standard version (-std flag)")
|
||||
#option(GGML_KOMPUTE "ggml: use Kompute" OFF)
|
||||
option(GGML_QKK_64 "ggml: use super-block size of 64 for k-quants" OFF)
|
||||
set(GGML_SCHED_MAX_COPIES "4" CACHE STRING "ggml: max input copies for pipeline parallelism")
|
||||
|
||||
# add perf arguments
|
||||
option(LLAMA_PERF "llama: enable perf" OFF)
|
||||
option(LLAMA_PERF "llama: enable perf" OFF)
|
||||
|
||||
#
|
||||
# Compile flags
|
||||
@ -80,14 +86,14 @@ option(LLAMA_PERF "llama: enable perf"
|
||||
set(THREADS_PREFER_PTHREAD_FLAG ON)
|
||||
find_package(Threads REQUIRED)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_SCHED_MAX_COPIES=${LLAMA_SCHED_MAX_COPIES})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_SCHED_MAX_COPIES=${GGML_SCHED_MAX_COPIES})
|
||||
|
||||
# enable libstdc++ assertions for debug builds
|
||||
if (CMAKE_SYSTEM_NAME MATCHES "Linux")
|
||||
list(APPEND GGML_COMPILE_DEFS $<$<CONFIG:Debug>:_GLIBCXX_ASSERTIONS>)
|
||||
endif()
|
||||
|
||||
if (APPLE AND LLAMA_ACCELERATE)
|
||||
if (APPLE AND GGML_ACCELERATE)
|
||||
find_library(ACCELERATE_FRAMEWORK Accelerate)
|
||||
if (ACCELERATE_FRAMEWORK)
|
||||
message(STATUS "Accelerate framework found")
|
||||
@ -101,7 +107,7 @@ if (APPLE AND LLAMA_ACCELERATE)
|
||||
endif()
|
||||
endif()
|
||||
|
||||
if (LLAMA_BLAS)
|
||||
if (GGML_BLAS)
|
||||
if (LLAMA_STATIC)
|
||||
set(BLA_STATIC ON)
|
||||
endif()
|
||||
@ -109,7 +115,7 @@ if (LLAMA_BLAS)
|
||||
set(BLA_SIZEOF_INTEGER 8)
|
||||
endif()
|
||||
|
||||
set(BLA_VENDOR ${LLAMA_BLAS_VENDOR})
|
||||
set(BLA_VENDOR ${GGML_BLAS_VENDOR})
|
||||
find_package(BLAS)
|
||||
|
||||
if (BLAS_FOUND)
|
||||
@ -119,24 +125,24 @@ if (LLAMA_BLAS)
|
||||
# BLAS_INCLUDE_DIRS is missing in FindBLAS.cmake.
|
||||
# see https://gitlab.kitware.com/cmake/cmake/-/issues/20268
|
||||
find_package(PkgConfig REQUIRED)
|
||||
if (${LLAMA_BLAS_VENDOR} MATCHES "Generic")
|
||||
if (${GGML_BLAS_VENDOR} MATCHES "Generic")
|
||||
pkg_check_modules(DepBLAS REQUIRED blas)
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "OpenBLAS")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "OpenBLAS")
|
||||
# As of openblas v0.3.22, the 64-bit is named openblas64.pc
|
||||
pkg_check_modules(DepBLAS openblas64)
|
||||
if (NOT DepBLAS_FOUND)
|
||||
pkg_check_modules(DepBLAS REQUIRED openblas)
|
||||
endif()
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "FLAME")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "FLAME")
|
||||
pkg_check_modules(DepBLAS REQUIRED blis)
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "ATLAS")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "ATLAS")
|
||||
pkg_check_modules(DepBLAS REQUIRED blas-atlas)
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "FlexiBLAS")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "FlexiBLAS")
|
||||
pkg_check_modules(DepBLAS REQUIRED flexiblas_api)
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "Intel")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "Intel")
|
||||
# all Intel* libraries share the same include path
|
||||
pkg_check_modules(DepBLAS REQUIRED mkl-sdl)
|
||||
elseif (${LLAMA_BLAS_VENDOR} MATCHES "NVHPC")
|
||||
elseif (${GGML_BLAS_VENDOR} MATCHES "NVHPC")
|
||||
# this doesn't provide pkg-config
|
||||
# suggest to assign BLAS_INCLUDE_DIRS on your own
|
||||
if ("${NVHPC_VERSION}" STREQUAL "")
|
||||
@ -170,7 +176,7 @@ if (LLAMA_BLAS)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_USE_OPENBLAS)
|
||||
|
||||
if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${LLAMA_BLAS_VENDOR} MATCHES "Generic" OR ${LLAMA_BLAS_VENDOR} MATCHES "Intel"))
|
||||
if (${BLAS_INCLUDE_DIRS} MATCHES "mkl" AND (${GGML_BLAS_VENDOR} MATCHES "Generic" OR ${GGML_BLAS_VENDOR} MATCHES "Intel"))
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_BLAS_USE_MKL)
|
||||
endif()
|
||||
|
||||
@ -179,18 +185,18 @@ if (LLAMA_BLAS)
|
||||
else()
|
||||
message(WARNING "BLAS not found, please refer to "
|
||||
"https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors"
|
||||
" to set correct LLAMA_BLAS_VENDOR")
|
||||
" to set correct GGML_BLAS_VENDOR")
|
||||
endif()
|
||||
endif()
|
||||
|
||||
if (LLAMA_LLAMAFILE)
|
||||
if (GGML_LLAMAFILE)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_USE_LLAMAFILE)
|
||||
|
||||
set(GGML_HEADERS_LLAMAFILE ${DIRECTORY}/sgemm.h)
|
||||
set(GGML_SOURCES_LLAMAFILE ${DIRECTORY}/sgemm.cpp)
|
||||
set(GGML_HEADERS_LLAMAFILE ${DIRECTORY}/ggml/src/llamafile/sgemm.h)
|
||||
set(GGML_SOURCES_LLAMAFILE ${DIRECTORY}/ggml/src/llamafile/sgemm.cpp)
|
||||
endif()
|
||||
|
||||
if (LLAMA_QKK_64)
|
||||
if (GGML_QKK_64)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_QKK_64)
|
||||
endif()
|
||||
|
||||
@ -361,8 +367,9 @@ function(include_ggml SUFFIX)
|
||||
# libraries
|
||||
#
|
||||
|
||||
if (LLAMA_CUDA)
|
||||
cmake_minimum_required(VERSION 3.17)
|
||||
if (GGML_CUDA)
|
||||
cmake_minimum_required(VERSION 3.18) # for CMAKE_CUDA_ARCHITECTURES
|
||||
|
||||
get_property(LANGS GLOBAL PROPERTY ENABLED_LANGUAGES)
|
||||
if (NOT CUDA IN_LIST LANGS)
|
||||
message(FATAL_ERROR "The CUDA language must be enabled.")
|
||||
@ -371,40 +378,64 @@ function(include_ggml SUFFIX)
|
||||
find_package(CUDAToolkit REQUIRED)
|
||||
set(CUDAToolkit_BIN_DIR ${CUDAToolkit_BIN_DIR} PARENT_SCOPE)
|
||||
|
||||
if (NOT DEFINED GGML_CUDA_ARCHITECTURES)
|
||||
# 52 == lowest CUDA 12 standard
|
||||
# 60 == f16 CUDA intrinsics
|
||||
# 61 == integer CUDA intrinsics
|
||||
# 70 == compute capability at which unrolling a loop in mul_mat_q kernels is faster
|
||||
if (LLAMA_CUDA_F16 OR LLAMA_CUDA_DMMV_F16)
|
||||
set(GGML_CUDA_ARCHITECTURES "60;61;70") # needed for f16 CUDA intrinsics
|
||||
else()
|
||||
set(GGML_CUDA_ARCHITECTURES "52;61;70") # lowest CUDA 12 standard + lowest for integer intrinsics
|
||||
#set(GGML_CUDA_ARCHITECTURES "OFF") # use this to compile much faster, but only F16 models work
|
||||
endif()
|
||||
# architectures are set in gpt4all-backend/CMakeLists.txt
|
||||
|
||||
set(GGML_HEADERS_CUDA ${DIRECTORY}/ggml/include/ggml-cuda.h)
|
||||
file(GLOB GGML_HEADERS_CUDA "${DIRECTORY}/ggml/src/ggml-cuda/*.cuh")
|
||||
list(APPEND GGML_HEADERS_CUDA "${DIRECTORY}/ggml/include/ggml-cuda.h")
|
||||
|
||||
file(GLOB GGML_SOURCES_CUDA "${DIRECTORY}/ggml/src/ggml-cuda/*.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA "${DIRECTORY}/ggml/src/ggml-cuda.cu")
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/fattn-wmma*.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/mmq*.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
|
||||
if (GGML_CUDA_FA_ALL_QUANTS)
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/fattn-vec*.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
add_compile_definitions(GGML_CUDA_FA_ALL_QUANTS)
|
||||
else()
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/fattn-vec*q4_0-q4_0.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/fattn-vec*q8_0-q8_0.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
file(GLOB SRCS "${DIRECTORY}/ggml/src/ggml-cuda/template-instances/fattn-vec*f16-f16.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA ${SRCS})
|
||||
endif()
|
||||
message(STATUS "Using CUDA architectures: ${GGML_CUDA_ARCHITECTURES}")
|
||||
|
||||
set(GGML_HEADERS_CUDA ${DIRECTORY}/ggml-cuda.h)
|
||||
|
||||
file(GLOB GGML_SOURCES_CUDA "${DIRECTORY}/ggml-cuda/*.cu")
|
||||
list(APPEND GGML_SOURCES_CUDA "${DIRECTORY}/ggml-cuda.cu")
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_CUDA)
|
||||
if (LLAMA_CUDA_FORCE_DMMV)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_DMMV_X=${GGML_CUDA_DMMV_X})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_MMV_Y=${GGML_CUDA_MMV_Y})
|
||||
list(APPEND GGML_COMPILE_DEFS K_QUANTS_PER_ITERATION=${GGML_CUDA_KQUANTS_ITER})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_PEER_MAX_BATCH_SIZE=${GGML_CUDA_PEER_MAX_BATCH_SIZE})
|
||||
|
||||
if (GGML_CUDA_USE_GRAPHS)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_USE_GRAPHS)
|
||||
endif()
|
||||
|
||||
if (GGML_CUDA_FORCE_DMMV)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_FORCE_DMMV)
|
||||
endif()
|
||||
if (LLAMA_CUDA_FORCE_MMQ)
|
||||
|
||||
if (GGML_CUDA_FORCE_MMQ)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_FORCE_MMQ)
|
||||
endif()
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_MMV_Y=${LLAMA_CUDA_MMV_Y})
|
||||
if (LLAMA_CUDA_F16)
|
||||
|
||||
if (GGML_CUDA_FORCE_CUBLAS)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_FORCE_CUBLAS)
|
||||
endif()
|
||||
|
||||
if (GGML_CUDA_NO_VMM)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_NO_VMM)
|
||||
endif()
|
||||
|
||||
if (GGML_CUDA_F16)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_F16)
|
||||
endif()
|
||||
list(APPEND GGML_COMPILE_DEFS K_QUANTS_PER_ITERATION=${LLAMA_CUDA_KQUANTS_ITER})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_PEER_MAX_BATCH_SIZE=${LLAMA_CUDA_PEER_MAX_BATCH_SIZE})
|
||||
if (LLAMA_CUDA_NO_PEER_COPY)
|
||||
|
||||
if (GGML_CUDA_NO_PEER_COPY)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_NO_PEER_COPY)
|
||||
endif()
|
||||
|
||||
@ -422,45 +453,34 @@ function(include_ggml SUFFIX)
|
||||
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} CUDA::cuda_driver)
|
||||
endif()
|
||||
|
||||
if (LLAMA_CLBLAST)
|
||||
find_package(CLBlast REQUIRED)
|
||||
|
||||
set(GGML_HEADERS_OPENCL ${DIRECTORY}/ggml-opencl.h)
|
||||
set(GGML_SOURCES_OPENCL ${DIRECTORY}/ggml-opencl.cpp)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_CLBLAST)
|
||||
|
||||
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} clblast)
|
||||
endif()
|
||||
|
||||
if (LLAMA_VULKAN)
|
||||
if (GGML_VULKAN)
|
||||
find_package(Vulkan REQUIRED)
|
||||
|
||||
set(GGML_HEADERS_VULKAN ${DIRECTORY}/ggml-vulkan.h)
|
||||
set(GGML_SOURCES_VULKAN ${DIRECTORY}/ggml-vulkan.cpp)
|
||||
set(GGML_HEADERS_VULKAN ${DIRECTORY}/ggml/include/ggml-vulkan.h)
|
||||
set(GGML_SOURCES_VULKAN ${DIRECTORY}/ggml/src/ggml-vulkan.cpp)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_VULKAN)
|
||||
|
||||
if (LLAMA_VULKAN_CHECK_RESULTS)
|
||||
if (GGML_VULKAN_CHECK_RESULTS)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_VULKAN_CHECK_RESULTS)
|
||||
endif()
|
||||
|
||||
if (LLAMA_VULKAN_DEBUG)
|
||||
if (GGML_VULKAN_DEBUG)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_VULKAN_DEBUG)
|
||||
endif()
|
||||
|
||||
if (LLAMA_VULKAN_VALIDATE)
|
||||
if (GGML_VULKAN_VALIDATE)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_VULKAN_VALIDATE)
|
||||
endif()
|
||||
|
||||
if (LLAMA_VULKAN_RUN_TESTS)
|
||||
if (GGML_VULKAN_RUN_TESTS)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_VULKAN_RUN_TESTS)
|
||||
endif()
|
||||
|
||||
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} Vulkan::Vulkan)
|
||||
endif()
|
||||
|
||||
if (LLAMA_HIPBLAS)
|
||||
if (GGML_HIPBLAS)
|
||||
if ($ENV{ROCM_PATH})
|
||||
set(ROCM_PATH $ENV{ROCM_PATH})
|
||||
else()
|
||||
@ -490,32 +510,32 @@ function(include_ggml SUFFIX)
|
||||
|
||||
message(STATUS "HIP and hipBLAS found")
|
||||
|
||||
set(GGML_HEADERS_ROCM ${DIRECTORY}/ggml-cuda.h)
|
||||
set(GGML_HEADERS_ROCM ${DIRECTORY}/ggml/include/ggml-cuda.h)
|
||||
|
||||
file(GLOB GGML_SOURCES_ROCM "${DIRECTORY}/ggml-rocm/*.cu")
|
||||
list(APPEND GGML_SOURCES_ROCM "${DIRECTORY}/ggml-rocm.cu")
|
||||
file(GLOB GGML_SOURCES_ROCM "${DIRECTORY}/ggml/src/ggml-rocm/*.cu")
|
||||
list(APPEND GGML_SOURCES_ROCM "${DIRECTORY}/ggml/src/ggml-rocm.cu")
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_HIPBLAS GGML_USE_CUDA)
|
||||
|
||||
if (LLAMA_HIP_UMA)
|
||||
if (GGML_HIP_UMA)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_HIP_UMA)
|
||||
endif()
|
||||
|
||||
if (LLAMA_CUDA_FORCE_DMMV)
|
||||
if (GGML_CUDA_FORCE_DMMV)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_FORCE_DMMV)
|
||||
endif()
|
||||
|
||||
if (LLAMA_CUDA_FORCE_MMQ)
|
||||
if (GGML_CUDA_FORCE_MMQ)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_FORCE_MMQ)
|
||||
endif()
|
||||
|
||||
if (LLAMA_CUDA_NO_PEER_COPY)
|
||||
if (GGML_CUDA_NO_PEER_COPY)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_NO_PEER_COPY)
|
||||
endif()
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_MMV_Y=${LLAMA_CUDA_MMV_Y})
|
||||
list(APPEND GGML_COMPILE_DEFS K_QUANTS_PER_ITERATION=${LLAMA_CUDA_KQUANTS_ITER})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_DMMV_X=${GGML_CUDA_DMMV_X})
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_CUDA_MMV_Y=${GGML_CUDA_MMV_Y})
|
||||
list(APPEND GGML_COMPILE_DEFS K_QUANTS_PER_ITERATION=${GGML_CUDA_KQUANTS_ITER})
|
||||
|
||||
if (CXX_IS_HIPCC)
|
||||
set_source_files_properties(${GGML_SOURCES_ROCM} PROPERTIES LANGUAGE CXX)
|
||||
@ -533,9 +553,9 @@ function(include_ggml SUFFIX)
|
||||
|
||||
set(LLAMA_DIR ${CMAKE_CURRENT_SOURCE_DIR}/${DIRECTORY})
|
||||
|
||||
if (LLAMA_KOMPUTE AND NOT GGML_KOMPUTE_ONCE)
|
||||
if (GGML_KOMPUTE AND NOT GGML_KOMPUTE_ONCE)
|
||||
set(GGML_KOMPUTE_ONCE ON PARENT_SCOPE)
|
||||
if (NOT EXISTS "${LLAMA_DIR}/kompute/CMakeLists.txt")
|
||||
if (NOT EXISTS "${LLAMA_DIR}/ggml/src/kompute/CMakeLists.txt")
|
||||
message(FATAL_ERROR "Kompute not found")
|
||||
endif()
|
||||
message(STATUS "Kompute found")
|
||||
@ -559,12 +579,12 @@ function(include_ggml SUFFIX)
|
||||
set(spv_file ${CMAKE_CURRENT_BINARY_DIR}/${OP_FILE}.spv)
|
||||
add_custom_command(
|
||||
OUTPUT ${spv_file}
|
||||
DEPENDS ${LLAMA_DIR}/${source}
|
||||
${LLAMA_DIR}/kompute-shaders/common.comp
|
||||
${LLAMA_DIR}/kompute-shaders/op_getrows.comp
|
||||
${LLAMA_DIR}/kompute-shaders/op_mul_mv_q_n_pre.comp
|
||||
${LLAMA_DIR}/kompute-shaders/op_mul_mv_q_n.comp
|
||||
COMMAND ${glslc_executable} --target-env=vulkan1.2 -o ${spv_file} ${LLAMA_DIR}/${source}
|
||||
DEPENDS ${LLAMA_DIR}/ggml/src/kompute-shaders/${source}
|
||||
${LLAMA_DIR}/ggml/src/kompute-shaders/common.comp
|
||||
${LLAMA_DIR}/ggml/src/kompute-shaders/op_getrows.comp
|
||||
${LLAMA_DIR}/ggml/src/kompute-shaders/op_mul_mv_q_n_pre.comp
|
||||
${LLAMA_DIR}/ggml/src/kompute-shaders/op_mul_mv_q_n.comp
|
||||
COMMAND ${glslc_executable} --target-env=vulkan1.2 -o ${spv_file} ${LLAMA_DIR}/ggml/src/kompute-shaders/${source}
|
||||
COMMENT "Compiling ${source} to ${source}.spv"
|
||||
)
|
||||
|
||||
@ -610,39 +630,39 @@ function(include_ggml SUFFIX)
|
||||
set(KOMPUTE_OPT_BUILT_IN_VULKAN_HEADER_TAG "v1.3.239" CACHE STRING "Kompute Vulkan headers tag")
|
||||
set(KOMPUTE_OPT_LOG_LEVEL Critical CACHE STRING "Kompute log level")
|
||||
set(FMT_INSTALL OFF)
|
||||
add_subdirectory(${LLAMA_DIR}/kompute)
|
||||
add_subdirectory(${LLAMA_DIR}/ggml/src/kompute)
|
||||
|
||||
# Compile our shaders
|
||||
compile_shader(SOURCES
|
||||
kompute-shaders/op_scale.comp
|
||||
kompute-shaders/op_scale_8.comp
|
||||
kompute-shaders/op_add.comp
|
||||
kompute-shaders/op_addrow.comp
|
||||
kompute-shaders/op_mul.comp
|
||||
kompute-shaders/op_silu.comp
|
||||
kompute-shaders/op_relu.comp
|
||||
kompute-shaders/op_gelu.comp
|
||||
kompute-shaders/op_softmax.comp
|
||||
kompute-shaders/op_norm.comp
|
||||
kompute-shaders/op_rmsnorm.comp
|
||||
kompute-shaders/op_diagmask.comp
|
||||
kompute-shaders/op_mul_mat_mat_f32.comp
|
||||
kompute-shaders/op_mul_mat_f16.comp
|
||||
kompute-shaders/op_mul_mat_q8_0.comp
|
||||
kompute-shaders/op_mul_mat_q4_0.comp
|
||||
kompute-shaders/op_mul_mat_q4_1.comp
|
||||
kompute-shaders/op_mul_mat_q6_k.comp
|
||||
kompute-shaders/op_getrows_f32.comp
|
||||
kompute-shaders/op_getrows_f16.comp
|
||||
kompute-shaders/op_getrows_q4_0.comp
|
||||
kompute-shaders/op_getrows_q4_1.comp
|
||||
kompute-shaders/op_getrows_q6_k.comp
|
||||
kompute-shaders/op_rope_f16.comp
|
||||
kompute-shaders/op_rope_f32.comp
|
||||
kompute-shaders/op_cpy_f16_f16.comp
|
||||
kompute-shaders/op_cpy_f16_f32.comp
|
||||
kompute-shaders/op_cpy_f32_f16.comp
|
||||
kompute-shaders/op_cpy_f32_f32.comp
|
||||
op_scale.comp
|
||||
op_scale_8.comp
|
||||
op_add.comp
|
||||
op_addrow.comp
|
||||
op_mul.comp
|
||||
op_silu.comp
|
||||
op_relu.comp
|
||||
op_gelu.comp
|
||||
op_softmax.comp
|
||||
op_norm.comp
|
||||
op_rmsnorm.comp
|
||||
op_diagmask.comp
|
||||
op_mul_mat_mat_f32.comp
|
||||
op_mul_mat_f16.comp
|
||||
op_mul_mat_q8_0.comp
|
||||
op_mul_mat_q4_0.comp
|
||||
op_mul_mat_q4_1.comp
|
||||
op_mul_mat_q6_k.comp
|
||||
op_getrows_f32.comp
|
||||
op_getrows_f16.comp
|
||||
op_getrows_q4_0.comp
|
||||
op_getrows_q4_1.comp
|
||||
op_getrows_q6_k.comp
|
||||
op_rope_f16.comp
|
||||
op_rope_f32.comp
|
||||
op_cpy_f16_f16.comp
|
||||
op_cpy_f16_f32.comp
|
||||
op_cpy_f32_f16.comp
|
||||
op_cpy_f32_f32.comp
|
||||
)
|
||||
|
||||
# Create a custom target for our generated shaders
|
||||
@ -687,12 +707,12 @@ function(include_ggml SUFFIX)
|
||||
)
|
||||
endif()
|
||||
|
||||
if (LLAMA_KOMPUTE)
|
||||
if (GGML_KOMPUTE)
|
||||
list(APPEND GGML_COMPILE_DEFS VULKAN_HPP_DISPATCH_LOADER_DYNAMIC=1)
|
||||
|
||||
# Add the stamp to the main sources to ensure dependency tracking
|
||||
set(GGML_SOURCES_KOMPUTE ${LLAMA_DIR}/ggml-kompute.cpp ${CMAKE_CURRENT_BINARY_DIR}/ggml-kompute.stamp)
|
||||
set(GGML_HEADERS_KOMPUTE ${LLAMA_DIR}/ggml-kompute.h)
|
||||
set(GGML_SOURCES_KOMPUTE ${LLAMA_DIR}/ggml/src/ggml-kompute.cpp ${CMAKE_CURRENT_BINARY_DIR}/ggml-kompute.stamp)
|
||||
set(GGML_HEADERS_KOMPUTE ${LLAMA_DIR}/ggml/include/ggml-kompute.h)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_KOMPUTE)
|
||||
|
||||
@ -701,7 +721,7 @@ function(include_ggml SUFFIX)
|
||||
|
||||
set(CUDA_CXX_FLAGS "")
|
||||
|
||||
if (LLAMA_CUDA)
|
||||
if (GGML_CUDA)
|
||||
set(CUDA_FLAGS -use_fast_math)
|
||||
|
||||
if (LLAMA_FATAL_WARNINGS)
|
||||
@ -748,25 +768,25 @@ function(include_ggml SUFFIX)
|
||||
endif()
|
||||
endif()
|
||||
|
||||
if (LLAMA_METAL)
|
||||
if (GGML_METAL)
|
||||
find_library(FOUNDATION_LIBRARY Foundation REQUIRED)
|
||||
find_library(METAL_FRAMEWORK Metal REQUIRED)
|
||||
find_library(METALKIT_FRAMEWORK MetalKit REQUIRED)
|
||||
|
||||
message(STATUS "Metal framework found")
|
||||
set(GGML_HEADERS_METAL ${DIRECTORY}/ggml-metal.h)
|
||||
set(GGML_SOURCES_METAL ${DIRECTORY}/ggml-metal.m)
|
||||
set(GGML_HEADERS_METAL ${DIRECTORY}/ggml/include/ggml-metal.h)
|
||||
set(GGML_SOURCES_METAL ${DIRECTORY}/ggml/src/ggml-metal.m)
|
||||
|
||||
list(APPEND GGML_COMPILE_DEFS_PUBLIC GGML_USE_METAL)
|
||||
if (LLAMA_METAL_NDEBUG)
|
||||
if (GGML_METAL_NDEBUG)
|
||||
list(APPEND GGML_COMPILE_DEFS GGML_METAL_NDEBUG)
|
||||
endif()
|
||||
|
||||
# copy ggml-common.h and ggml-metal.metal to bin directory
|
||||
configure_file(${DIRECTORY}/ggml-common.h ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-common.h COPYONLY)
|
||||
configure_file(${DIRECTORY}/ggml-metal.metal ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.metal COPYONLY)
|
||||
configure_file(${DIRECTORY}/ggml/src/ggml-common.h ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-common.h COPYONLY)
|
||||
configure_file(${DIRECTORY}/ggml/src/ggml-metal.metal ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.metal COPYONLY)
|
||||
|
||||
if (LLAMA_METAL_SHADER_DEBUG)
|
||||
if (GGML_METAL_SHADER_DEBUG)
|
||||
# custom command to do the following:
|
||||
# xcrun -sdk macosx metal -fno-fast-math -c ggml-metal.metal -o ggml-metal.air
|
||||
# xcrun -sdk macosx metallib ggml-metal.air -o default.metallib
|
||||
@ -782,16 +802,17 @@ function(include_ggml SUFFIX)
|
||||
endif()
|
||||
|
||||
# Append macOS metal versioning flags
|
||||
if (LLAMA_METAL_MACOSX_VERSION_MIN)
|
||||
message(STATUS "Adding -mmacosx-version-min=${LLAMA_METAL_MACOSX_VERSION_MIN} flag to metal compilation")
|
||||
list(APPEND XC_FLAGS -mmacosx-version-min=${LLAMA_METAL_MACOSX_VERSION_MIN})
|
||||
if (GGML_METAL_MACOSX_VERSION_MIN)
|
||||
message(STATUS "Adding -mmacosx-version-min=${GGML_METAL_MACOSX_VERSION_MIN} flag to metal compilation")
|
||||
list(APPEND XC_FLAGS -mmacosx-version-min=${GGML_METAL_MACOSX_VERSION_MIN})
|
||||
endif()
|
||||
if (LLAMA_METAL_STD)
|
||||
message(STATUS "Adding -std=${LLAMA_METAL_STD} flag to metal compilation")
|
||||
list(APPEND XC_FLAGS -std=${LLAMA_METAL_STD})
|
||||
if (GGML_METAL_STD)
|
||||
message(STATUS "Adding -std=${GGML_METAL_STD} flag to metal compilation")
|
||||
list(APPEND XC_FLAGS -std=${GGML_METAL_STD})
|
||||
endif()
|
||||
|
||||
set(GGML_METALLIB ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/default.metallib)
|
||||
set(GGML_METALLIB "${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/default.metallib")
|
||||
set(GGML_METALLIB "${GGML_METALLIB}" PARENT_SCOPE)
|
||||
add_custom_command(
|
||||
OUTPUT ${GGML_METALLIB}
|
||||
COMMAND xcrun -sdk macosx metal ${XC_FLAGS} -c ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.metal -o ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.air
|
||||
@ -799,10 +820,9 @@ function(include_ggml SUFFIX)
|
||||
COMMAND rm -f ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.air
|
||||
COMMAND rm -f ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-common.h
|
||||
COMMAND rm -f ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.metal
|
||||
DEPENDS ${DIRECTORY}/ggml-metal.metal ${DIRECTORY}/ggml-common.h
|
||||
DEPENDS ${DIRECTORY}/ggml/src/ggml-metal.metal ${DIRECTORY}/ggml/src/ggml-common.h
|
||||
COMMENT "Compiling Metal kernels"
|
||||
)
|
||||
set_source_files_properties(${GGML_METALLIB} DIRECTORY ${CMAKE_SOURCE_DIR} PROPERTIES GENERATED ON)
|
||||
|
||||
add_custom_target(
|
||||
ggml-metal ALL
|
||||
@ -853,49 +873,49 @@ function(include_ggml SUFFIX)
|
||||
CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|i686|AMD64)$"))
|
||||
message(STATUS "x86 detected")
|
||||
if (MSVC)
|
||||
if (LLAMA_AVX512)
|
||||
if (GGML_AVX512)
|
||||
list(APPEND ARCH_FLAGS /arch:AVX512)
|
||||
# MSVC has no compile-time flags enabling specific
|
||||
# AVX512 extensions, neither it defines the
|
||||
# macros corresponding to the extensions.
|
||||
# Do it manually.
|
||||
if (LLAMA_AVX512_VBMI)
|
||||
if (GGML_AVX512_VBMI)
|
||||
list(APPEND GGML_COMPILE_DEFS $<$<COMPILE_LANGUAGE:C>:__AVX512VBMI__>)
|
||||
list(APPEND GGML_COMPILE_DEFS $<$<COMPILE_LANGUAGE:CXX>:__AVX512VBMI__>)
|
||||
endif()
|
||||
if (LLAMA_AVX512_VNNI)
|
||||
if (GGML_AVX512_VNNI)
|
||||
list(APPEND GGML_COMPILE_DEFS $<$<COMPILE_LANGUAGE:C>:__AVX512VNNI__>)
|
||||
list(APPEND GGML_COMPILE_DEFS $<$<COMPILE_LANGUAGE:CXX>:__AVX512VNNI__>)
|
||||
endif()
|
||||
elseif (LLAMA_AVX2)
|
||||
elseif (GGML_AVX2)
|
||||
list(APPEND ARCH_FLAGS /arch:AVX2)
|
||||
elseif (LLAMA_AVX)
|
||||
elseif (GGML_AVX)
|
||||
list(APPEND ARCH_FLAGS /arch:AVX)
|
||||
endif()
|
||||
else()
|
||||
if (LLAMA_NATIVE)
|
||||
if (GGML_NATIVE)
|
||||
list(APPEND ARCH_FLAGS -march=native)
|
||||
endif()
|
||||
if (LLAMA_F16C)
|
||||
if (GGML_F16C)
|
||||
list(APPEND ARCH_FLAGS -mf16c)
|
||||
endif()
|
||||
if (LLAMA_FMA)
|
||||
if (GGML_FMA)
|
||||
list(APPEND ARCH_FLAGS -mfma)
|
||||
endif()
|
||||
if (LLAMA_AVX)
|
||||
if (GGML_AVX)
|
||||
list(APPEND ARCH_FLAGS -mavx)
|
||||
endif()
|
||||
if (LLAMA_AVX2)
|
||||
if (GGML_AVX2)
|
||||
list(APPEND ARCH_FLAGS -mavx2)
|
||||
endif()
|
||||
if (LLAMA_AVX512)
|
||||
if (GGML_AVX512)
|
||||
list(APPEND ARCH_FLAGS -mavx512f)
|
||||
list(APPEND ARCH_FLAGS -mavx512bw)
|
||||
endif()
|
||||
if (LLAMA_AVX512_VBMI)
|
||||
if (GGML_AVX512_VBMI)
|
||||
list(APPEND ARCH_FLAGS -mavx512vbmi)
|
||||
endif()
|
||||
if (LLAMA_AVX512_VNNI)
|
||||
if (GGML_AVX512_VNNI)
|
||||
list(APPEND ARCH_FLAGS -mavx512vnni)
|
||||
endif()
|
||||
endif()
|
||||
@ -914,7 +934,7 @@ function(include_ggml SUFFIX)
|
||||
list(APPEND GGML_COMPILE_OPTS "$<$<COMPILE_LANGUAGE:CXX>:${ARCH_FLAGS}>")
|
||||
list(APPEND GGML_COMPILE_OPTS "$<$<COMPILE_LANGUAGE:C>:${ARCH_FLAGS}>")
|
||||
|
||||
if (LLAMA_CUDA)
|
||||
if (GGML_CUDA)
|
||||
list(APPEND CUDA_CXX_FLAGS ${ARCH_FLAGS})
|
||||
list(JOIN CUDA_CXX_FLAGS " " CUDA_CXX_FLAGS_JOINED) # pass host compiler flags as a single argument
|
||||
if (NOT CUDA_CXX_FLAGS_JOINED STREQUAL "")
|
||||
@ -926,24 +946,26 @@ function(include_ggml SUFFIX)
|
||||
# ggml
|
||||
|
||||
add_library(ggml${SUFFIX} OBJECT
|
||||
${DIRECTORY}/ggml.c
|
||||
${DIRECTORY}/ggml.h
|
||||
${DIRECTORY}/ggml-alloc.c
|
||||
${DIRECTORY}/ggml-alloc.h
|
||||
${DIRECTORY}/ggml-backend.c
|
||||
${DIRECTORY}/ggml-backend.h
|
||||
${DIRECTORY}/ggml-quants.c
|
||||
${DIRECTORY}/ggml-quants.h
|
||||
${DIRECTORY}/ggml/include/ggml.h
|
||||
${DIRECTORY}/ggml/include/ggml-alloc.h
|
||||
${DIRECTORY}/ggml/include/ggml-backend.h
|
||||
${DIRECTORY}/ggml/src/ggml.c
|
||||
${DIRECTORY}/ggml/src/ggml-alloc.c
|
||||
${DIRECTORY}/ggml/src/ggml-backend.c
|
||||
${DIRECTORY}/ggml/src/ggml-quants.c
|
||||
${DIRECTORY}/ggml/src/ggml-quants.h
|
||||
${GGML_SOURCES_CUDA} ${GGML_HEADERS_CUDA}
|
||||
${GGML_SOURCES_OPENCL} ${GGML_HEADERS_OPENCL}
|
||||
${GGML_SOURCES_METAL} ${GGML_HEADERS_METAL}
|
||||
${GGML_SOURCES_KOMPUTE} ${GGML_HEADERS_KOMPUTE}
|
||||
${GGML_SOURCES_VULKAN} ${GGML_HEADERS_VULKAN}
|
||||
${GGML_SOURCES_ROCM} ${GGML_HEADERS_ROCM}
|
||||
${GGML_SOURCES_LLAMAFILE} ${GGML_HEADERS_LLAMAFILE}
|
||||
${DIRECTORY}/ggml/src/ggml-aarch64.c
|
||||
${DIRECTORY}/ggml/src/ggml-aarch64.h
|
||||
)
|
||||
|
||||
target_include_directories(ggml${SUFFIX} PUBLIC ${DIRECTORY} ${LLAMA_EXTRA_INCLUDES})
|
||||
target_include_directories(ggml${SUFFIX} PUBLIC ${DIRECTORY}/ggml/include ${LLAMA_EXTRA_INCLUDES})
|
||||
target_include_directories(ggml${SUFFIX} PRIVATE ${DIRECTORY}/ggml/src)
|
||||
target_compile_features(ggml${SUFFIX} PUBLIC c_std_11) # don't bump
|
||||
|
||||
target_link_libraries(ggml${SUFFIX} PUBLIC Threads::Threads ${LLAMA_EXTRA_LIBS})
|
||||
@ -955,14 +977,18 @@ function(include_ggml SUFFIX)
|
||||
# llama
|
||||
|
||||
add_library(llama${SUFFIX} STATIC
|
||||
${DIRECTORY}/llama.cpp
|
||||
${DIRECTORY}/llama.h
|
||||
${DIRECTORY}/unicode.h
|
||||
${DIRECTORY}/unicode.cpp
|
||||
${DIRECTORY}/unicode-data.cpp
|
||||
${DIRECTORY}/include/llama.h
|
||||
${DIRECTORY}/src/llama-grammar.cpp
|
||||
${DIRECTORY}/src/llama-sampling.cpp
|
||||
${DIRECTORY}/src/llama-vocab.cpp
|
||||
${DIRECTORY}/src/llama.cpp
|
||||
${DIRECTORY}/src/unicode-data.cpp
|
||||
${DIRECTORY}/src/unicode.cpp
|
||||
${DIRECTORY}/src/unicode.h
|
||||
)
|
||||
|
||||
target_include_directories(llama${SUFFIX} PUBLIC ${DIRECTORY})
|
||||
target_include_directories(llama${SUFFIX} PUBLIC ${DIRECTORY}/include ${DIRECTORY}/ggml/include)
|
||||
target_include_directories(llama${SUFFIX} PRIVATE ${DIRECTORY}/src)
|
||||
target_compile_features (llama${SUFFIX} PUBLIC cxx_std_11) # don't bump
|
||||
|
||||
target_link_libraries(llama${SUFFIX} PRIVATE
|
||||
@ -983,9 +1009,6 @@ function(include_ggml SUFFIX)
|
||||
C_STANDARD 11
|
||||
C_STANDARD_REQUIRED true
|
||||
)
|
||||
if (GGML_CUDA_ARCHITECTURES)
|
||||
set_property(TARGET ggml${SUFFIX} llama${SUFFIX} PROPERTY CUDA_ARCHITECTURES "${GGML_CUDA_ARCHITECTURES}")
|
||||
endif()
|
||||
|
||||
target_compile_options(ggml${SUFFIX} PRIVATE "${GGML_COMPILE_OPTS}")
|
||||
target_compile_options(llama${SUFFIX} PRIVATE "${GGML_COMPILE_OPTS}")
|
||||
|
@ -1,307 +0,0 @@
|
||||
#include "llmodel.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <cassert>
|
||||
#include <cstddef>
|
||||
#include <cstdint>
|
||||
#include <functional>
|
||||
#include <iostream>
|
||||
#include <optional>
|
||||
#include <regex>
|
||||
#include <stdexcept>
|
||||
#include <string>
|
||||
#include <unordered_set>
|
||||
#include <vector>
|
||||
|
||||
// TODO(cebtenzzre): replace this with llama_kv_cache_seq_shift for llamamodel (GPT-J needs this as-is)
|
||||
void LLModel::recalculateContext(PromptContext &promptCtx, std::function<bool(bool)> recalculate)
|
||||
{
|
||||
int n_keep = shouldAddBOS();
|
||||
const int32_t n_discard = (promptCtx.n_ctx - n_keep) * promptCtx.contextErase;
|
||||
|
||||
// Erase the first percentage of context from the tokens
|
||||
std::cerr << implementation().modelType() << ": reached the end of the context window so resizing\n";
|
||||
promptCtx.tokens.erase(promptCtx.tokens.begin() + n_keep, promptCtx.tokens.begin() + n_keep + n_discard);
|
||||
|
||||
size_t i = n_keep;
|
||||
promptCtx.n_past = n_keep;
|
||||
while (i < promptCtx.tokens.size()) {
|
||||
size_t batch_end = std::min(i + promptCtx.n_batch, promptCtx.tokens.size());
|
||||
std::vector<int32_t> batch(promptCtx.tokens.begin() + i, promptCtx.tokens.begin() + batch_end);
|
||||
assert(promptCtx.n_past + int32_t(batch.size()) <= promptCtx.n_ctx);
|
||||
if (!evalTokens(promptCtx, batch)) {
|
||||
std::cerr << "LLModel ERROR: Failed to process prompt\n";
|
||||
goto stop_generating;
|
||||
}
|
||||
promptCtx.n_past += batch.size();
|
||||
if (!recalculate(true))
|
||||
goto stop_generating;
|
||||
i = batch_end;
|
||||
}
|
||||
assert(promptCtx.n_past == int32_t(promptCtx.tokens.size()));
|
||||
|
||||
stop_generating:
|
||||
recalculate(false);
|
||||
}
|
||||
|
||||
static bool parsePromptTemplate(const std::string &tmpl, std::vector<std::smatch> &placeholders, std::string &err)
|
||||
{
|
||||
static const std::regex placeholderRegex(R"(%[1-2](?![0-9]))");
|
||||
|
||||
auto it = std::sregex_iterator(tmpl.begin(), tmpl.end(), placeholderRegex);
|
||||
placeholders.clear();
|
||||
placeholders.insert(placeholders.end(), it, std::sregex_iterator());
|
||||
|
||||
if (placeholders.size() > 2) {
|
||||
err = "ERROR: expected at most two placeholders, got " + std::to_string(placeholders.size());
|
||||
return false;
|
||||
}
|
||||
if (placeholders.size() >= 1 && placeholders[0].str() != "%1") {
|
||||
err = "ERROR: first placeholder must be %1, got " + placeholders[0].str();
|
||||
return false;
|
||||
}
|
||||
if (placeholders.size() >= 2 && placeholders[1].str() != "%2") {
|
||||
err = "ERROR: second placeholder must be %2, got " + placeholders[1].str();
|
||||
return false;
|
||||
}
|
||||
return true;
|
||||
}
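To make the contract concrete, here is a minimal usage sketch of this (now removed) helper; the template string is my own example, not one shipped with the project, and the code relies only on the includes already at the top of this file:

const std::string tmpl = "### Human:\n%1\n### Assistant:\n%2"; // example template, not from the source
std::vector<std::smatch> placeholders;
std::string err;
if (!parsePromptTemplate(tmpl, placeholders, err)) {
    std::cerr << err << "\n"; // e.g. "ERROR: first placeholder must be %1, got %2"
} else {
    // placeholders[0] marks the user prompt (%1), placeholders[1] the assistant reply (%2);
    // prompt() below slices the template around these two matches.
    assert(placeholders.size() == 2);
}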
|
||||
|
||||
void LLModel::prompt(const std::string &prompt,
|
||||
const std::string &promptTemplate,
|
||||
std::function<bool(int32_t)> promptCallback,
|
||||
std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &promptCtx,
|
||||
bool special,
|
||||
std::string *fakeReply)
|
||||
{
|
||||
if (!isModelLoaded()) {
|
||||
std::cerr << implementation().modelType() << " ERROR: prompt won't work with an unloaded model!\n";
|
||||
return;
|
||||
}
|
||||
|
||||
if (!supportsCompletion()) {
|
||||
std::string errorMessage = "ERROR: this model does not support text completion or chat!";
|
||||
responseCallback(-1, errorMessage);
|
||||
std::cerr << implementation().modelType() << " " << errorMessage << "\n";
|
||||
return;
|
||||
}
|
||||
|
||||
// parse the prompt template
|
||||
std::vector<std::smatch> placeholders;
|
||||
{
|
||||
std::string err;
|
||||
if (!parsePromptTemplate(promptTemplate, placeholders, err)) {
|
||||
responseCallback(-1, err);
|
||||
std::cerr << err << "\n";
|
||||
return;
|
||||
}
|
||||
}
|
||||
|
||||
auto old_n_past = promptCtx.n_past; // prepare to fake n_past for tokenize
|
||||
|
||||
// tokenize the user prompt
|
||||
std::vector<Token> embd_inp;
|
||||
if (placeholders.empty()) {
|
||||
// this is unusual, but well-defined
|
||||
std::cerr << __func__ << ": prompt template has no placeholder\n";
|
||||
embd_inp = tokenize(promptCtx, promptTemplate, true);
|
||||
} else {
|
||||
// template: beginning of user prompt
|
||||
const auto &phUser = placeholders[0];
|
||||
std::string userPrefix(phUser.prefix());
|
||||
if (!userPrefix.empty()) {
|
||||
embd_inp = tokenize(promptCtx, userPrefix, true);
|
||||
promptCtx.n_past += embd_inp.size();
|
||||
}
|
||||
|
||||
// user input (shouldn't have special token processing)
|
||||
auto tokens = tokenize(promptCtx, prompt, special);
|
||||
embd_inp.insert(embd_inp.end(), tokens.begin(), tokens.end());
|
||||
promptCtx.n_past += tokens.size();
|
||||
|
||||
// template: end of user prompt + start of assistant prompt
|
||||
size_t start = phUser.position() + phUser.length();
|
||||
size_t end = placeholders.size() >= 2 ? placeholders[1].position() : promptTemplate.length();
|
||||
auto userToAsst = promptTemplate.substr(start, end - start);
|
||||
if (!userToAsst.empty()) {
|
||||
tokens = tokenize(promptCtx, userToAsst, true);
|
||||
embd_inp.insert(embd_inp.end(), tokens.begin(), tokens.end());
|
||||
promptCtx.n_past += tokens.size();
|
||||
}
|
||||
}
|
||||
|
||||
promptCtx.n_past = old_n_past; // restore n_past so decodePrompt can increment it
|
||||
|
||||
// decode the user prompt
|
||||
decodePrompt(promptCallback, responseCallback, recalculateCallback, promptCtx, embd_inp);
|
||||
|
||||
// decode the assistant's reply, either generated or spoofed
|
||||
if (fakeReply == nullptr) {
|
||||
generateResponse(responseCallback, recalculateCallback, promptCtx);
|
||||
} else {
|
||||
embd_inp = tokenize(promptCtx, *fakeReply, false);
|
||||
decodePrompt(promptCallback, responseCallback, recalculateCallback, promptCtx, embd_inp);
|
||||
}
|
||||
|
||||
// decode the rest of the prompt template
|
||||
// template: end of assistant prompt
|
||||
std::string asstSuffix;
|
||||
if (placeholders.size() >= 2) {
|
||||
size_t start = placeholders[1].position() + placeholders[1].length();
|
||||
asstSuffix = promptTemplate.substr(start);
|
||||
} else {
|
||||
asstSuffix = "\n\n"; // default to a blank line, good for e.g. Alpaca
|
||||
}
|
||||
if (!asstSuffix.empty()) {
|
||||
embd_inp = tokenize(promptCtx, asstSuffix, true);
|
||||
decodePrompt(promptCallback, responseCallback, recalculateCallback, promptCtx, embd_inp);
|
||||
}
|
||||
}
|
||||
|
||||
void LLModel::decodePrompt(std::function<bool(int32_t)> promptCallback,
|
||||
std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &promptCtx,
|
||||
std::vector<Token> embd_inp) {
|
||||
// save the context size
|
||||
promptCtx.n_ctx = contextLength();
|
||||
|
||||
if ((int) embd_inp.size() > promptCtx.n_ctx - 4) {
|
||||
responseCallback(-1, "ERROR: The prompt size exceeds the context window size and cannot be processed.");
|
||||
std::cerr << implementation().modelType() << " ERROR: The prompt is " << embd_inp.size() <<
|
||||
" tokens and the context window is " << promptCtx.n_ctx << "!\n";
|
||||
return;
|
||||
}
|
||||
|
||||
promptCtx.n_predict = std::min(promptCtx.n_predict, promptCtx.n_ctx - (int) embd_inp.size());
|
||||
promptCtx.n_past = std::min(promptCtx.n_past, promptCtx.n_ctx);
|
||||
promptCtx.n_batch = std::min(promptCtx.n_batch, LLMODEL_MAX_PROMPT_BATCH);
|
||||
|
||||
// process the prompt in batches
|
||||
size_t i = 0;
|
||||
while (i < embd_inp.size()) {
|
||||
size_t batch_end = std::min(i + promptCtx.n_batch, embd_inp.size());
|
||||
std::vector<Token> batch(embd_inp.begin() + i, embd_inp.begin() + batch_end);
|
||||
|
||||
// Check if the context has run out...
|
||||
if (promptCtx.n_past + int32_t(batch.size()) > promptCtx.n_ctx) {
|
||||
recalculateContext(promptCtx, recalculateCallback);
|
||||
assert(promptCtx.n_past + int32_t(batch.size()) <= promptCtx.n_ctx);
|
||||
}
|
||||
|
||||
if (!evalTokens(promptCtx, batch)) {
|
||||
std::cerr << implementation().modelType() << " ERROR: Failed to process prompt\n";
|
||||
return;
|
||||
}
|
||||
|
||||
size_t tokens = batch_end - i;
|
||||
for (size_t t = 0; t < tokens; ++t) {
|
||||
if (int32_t(promptCtx.tokens.size()) == promptCtx.n_ctx)
|
||||
promptCtx.tokens.erase(promptCtx.tokens.begin());
|
||||
promptCtx.tokens.push_back(batch.at(t));
|
||||
promptCtx.n_past += 1;
|
||||
if (!promptCallback(batch.at(t)))
|
||||
return;
|
||||
}
|
||||
i = batch_end;
|
||||
}
|
||||
}
|
||||
|
||||
void LLModel::generateResponse(std::function<bool(int32_t, const std::string&)> responseCallback,
|
||||
std::function<bool(bool)> recalculateCallback,
|
||||
PromptContext &promptCtx) {
|
||||
std::string cachedResponse;
|
||||
std::vector<Token> cachedTokens;
|
||||
std::unordered_set<std::string> reversePrompts
|
||||
= { "### Instruction", "### Prompt", "### Response", "### Human", "### Assistant", "### Context" };
|
||||
|
||||
// predict next tokens
|
||||
for (int i = 0; i < promptCtx.n_predict; i++) {
|
||||
|
||||
// sample next token
|
||||
auto id = sampleToken(promptCtx);
|
||||
|
||||
// Check if the context has run out...
|
||||
if (promptCtx.n_past + 1 > promptCtx.n_ctx) {
|
||||
recalculateContext(promptCtx, recalculateCallback);
|
||||
assert(promptCtx.n_past + 1 <= promptCtx.n_ctx);
|
||||
}
|
||||
|
||||
if (!evalTokens(promptCtx, { id })) {
|
||||
std::cerr << implementation().modelType() << " ERROR: Failed to predict next token\n";
|
||||
return;
|
||||
}
|
||||
|
||||
// display text
|
||||
for (const auto token : endTokens()) {
|
||||
if (id == token) return;
|
||||
}
|
||||
|
||||
const std::string str = tokenToString(id);
|
||||
|
||||
// Check if the provided str is part of our reverse prompts
|
||||
bool foundPartialReversePrompt = false;
|
||||
const std::string completed = cachedResponse + std::string(str);
|
||||
if (reversePrompts.find(completed) != reversePrompts.end())
|
||||
return;
|
||||
|
||||
// Check if it partially matches our reverse prompts and if so, cache
|
||||
for (const auto& s : reversePrompts) {
|
||||
if (s.compare(0, completed.size(), completed) == 0) {
|
||||
foundPartialReversePrompt = true;
|
||||
cachedResponse = completed;
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Regardless the token gets added to our cache
|
||||
cachedTokens.push_back(id);
|
||||
|
||||
// Continue if we have found a partial match
|
||||
if (foundPartialReversePrompt)
|
||||
continue;
|
||||
|
||||
// Empty the cache
|
||||
for (auto t : cachedTokens) {
|
||||
if (int32_t(promptCtx.tokens.size()) == promptCtx.n_ctx)
|
||||
promptCtx.tokens.erase(promptCtx.tokens.begin());
|
||||
promptCtx.tokens.push_back(t);
|
||||
promptCtx.n_past += 1;
|
||||
//TODO: Conversion to std::string can be avoided here...
|
||||
if (!responseCallback(t, std::string(tokenToString(t))))
|
||||
return;
|
||||
}
|
||||
cachedTokens.clear();
|
||||
}
|
||||
}
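The partial-match logic hinges on std::string::compare(0, completed.size(), completed) returning 0 exactly while the accumulated text is still a prefix of some reverse prompt. A small standalone illustration with invented token pieces:

#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    const std::unordered_set<std::string> reversePrompts = { "### Human", "### Assistant" };
    std::string cached; // mirrors cachedResponse above
    for (std::string piece : { "##", "# Hu", "man" }) { // pretend these arrive one token at a time
        const std::string completed = cached + piece;
        if (reversePrompts.count(completed)) {          // full reverse prompt -> stop generating
            std::cout << "matched \"" << completed << "\", stopping\n";
            break;
        }
        bool partial = false;
        for (const auto &s : reversePrompts)
            if (s.compare(0, completed.size(), completed) == 0) { partial = true; break; }
        cached = partial ? completed : std::string();   // hold the text back only while it might still match
        std::cout << completed << (partial ? " -> held back\n" : " -> emitted\n");
    }
}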
|
||||
|
||||
void LLModel::embed(
|
||||
const std::vector<std::string> &texts, float *embeddings, std::optional<std::string> prefix, int dimensionality,
|
||||
size_t *tokenCount, bool doMean, bool atlas, EmbedCancelCallback *cancelCb
|
||||
) {
|
||||
(void)texts;
|
||||
(void)embeddings;
|
||||
(void)prefix;
|
||||
(void)dimensionality;
|
||||
(void)tokenCount;
|
||||
(void)doMean;
|
||||
(void)atlas;
|
||||
(void)cancelCb;
|
||||
throw std::logic_error(std::string(implementation().modelType()) + " does not support embeddings");
|
||||
}
|
||||
|
||||
void LLModel::embed(
|
||||
const std::vector<std::string> &texts, float *embeddings, bool isRetrieval, int dimensionality, size_t *tokenCount,
|
||||
bool doMean, bool atlas
|
||||
) {
|
||||
(void)texts;
|
||||
(void)embeddings;
|
||||
(void)isRetrieval;
|
||||
(void)dimensionality;
|
||||
(void)tokenCount;
|
||||
(void)doMean;
|
||||
(void)atlas;
|
||||
throw std::logic_error(std::string(implementation().modelType()) + " does not support embeddings");
|
||||
}
|
@@ -1,49 +0,0 @@
|
||||
#pragma once
|
||||
|
||||
#include <ggml.h>
|
||||
|
||||
#include <cstddef>
|
||||
#include <cstdint>
|
||||
#include <vector>
|
||||
|
||||
struct llm_buffer {
|
||||
uint8_t * addr = NULL;
|
||||
size_t size = 0;
|
||||
|
||||
void resize(size_t size) {
|
||||
delete[] addr;
|
||||
addr = new uint8_t[size];
|
||||
this->size = size;
|
||||
}
|
||||
|
||||
~llm_buffer() {
|
||||
delete[] addr;
|
||||
}
|
||||
};
|
||||
|
||||
struct llm_kv_cache {
|
||||
struct ggml_tensor * k;
|
||||
struct ggml_tensor * v;
|
||||
|
||||
struct ggml_context * ctx = NULL;
|
||||
|
||||
llm_buffer buf;
|
||||
|
||||
int n; // number of tokens currently in the cache
|
||||
|
||||
~llm_kv_cache() {
|
||||
if (ctx) {
|
||||
ggml_free(ctx);
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
inline void ggml_graph_compute_g4a(llm_buffer& buf, ggml_cgraph * graph, int n_threads)
|
||||
{
|
||||
struct ggml_cplan plan = ggml_graph_plan(graph, n_threads);
|
||||
if (plan.work_size > 0) {
|
||||
buf.resize(plan.work_size);
|
||||
plan.work_data = buf.addr;
|
||||
}
|
||||
ggml_graph_compute(graph, &plan);
|
||||
}
|
@@ -1,140 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import struct
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import gguf
|
||||
import numpy as np
|
||||
from transformers import AutoConfig, AutoModel, AutoTokenizer
|
||||
|
||||
|
||||
if not 2 <= len(sys.argv) < 4:
|
||||
print("Usage: {} dir-model [ftype]\n".format(Path(__file__).name))
|
||||
print(" ftype == 0 -> float32")
|
||||
print(" ftype == 1 -> float16")
|
||||
sys.exit(1)
|
||||
|
||||
# output in the same directory as the model
|
||||
dir_model = Path(sys.argv[1])
|
||||
|
||||
with open(dir_model / "vocab.txt", encoding="utf-8") as f:
|
||||
vocab = f.readlines()
|
||||
|
||||
# possible data types
|
||||
# ftype == 0 -> float32
|
||||
# ftype == 1 -> float16
|
||||
#
|
||||
# map from ftype to string
|
||||
ftype_str = ["f32", "f16"]
|
||||
|
||||
ftype = 1
|
||||
if len(sys.argv) > 2:
|
||||
ftype = int(sys.argv[2])
|
||||
if ftype < 0 or ftype > 1:
|
||||
print("Invalid ftype: " + str(ftype))
|
||||
sys.exit(1)
|
||||
|
||||
fname_out = dir_model / ("ggml-model-" + ftype_str[ftype] + ".gguf")
|
||||
|
||||
|
||||
ARCH = gguf.MODEL_ARCH.BERT
|
||||
gguf_writer = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[ARCH])
|
||||
|
||||
print("gguf: get model metadata")
|
||||
|
||||
config = AutoConfig.from_pretrained(dir_model)
|
||||
|
||||
block_count = config.num_hidden_layers
|
||||
gguf_writer.add_name("BERT")
|
||||
gguf_writer.add_context_length(config.max_position_embeddings)
|
||||
gguf_writer.add_embedding_length(config.hidden_size)
|
||||
gguf_writer.add_feed_forward_length(config.intermediate_size)
|
||||
gguf_writer.add_block_count(block_count)
|
||||
gguf_writer.add_head_count(config.num_attention_heads)
|
||||
gguf_writer.add_file_type(ftype)
|
||||
|
||||
print("gguf: get tokenizer metadata")
|
||||
|
||||
try:
|
||||
with open(dir_model / "tokenizer.json", encoding="utf-8") as f:
|
||||
tokenizer_json = json.load(f)
|
||||
except FileNotFoundError as e:
|
||||
print(f'Error: Missing {e.filename!r}', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
print("gguf: get wordpiece tokenizer vocab")
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(dir_model)
|
||||
print(tokenizer.encode('I believe the meaning of life is'))
|
||||
|
||||
tokens: list[bytearray] = []
|
||||
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
|
||||
|
||||
# The number of tokens in tokenizer.json can differ from the expected vocab size.
|
||||
# This causes downstream issues with mismatched tensor sizes when running the inference
|
||||
for i in range(config.vocab_size):
|
||||
try:
|
||||
text = reverse_vocab[i]
|
||||
except KeyError:
|
||||
print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
|
||||
pad_token = f"[PAD{i}]".encode("utf8")
|
||||
text = bytearray(pad_token)
|
||||
|
||||
tokens.append(text)
|
||||
|
||||
gguf_writer.add_tokenizer_model("bert") # wordpiece
|
||||
gguf_writer.add_token_list(tokens)
|
||||
|
||||
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
|
||||
special_vocab.add_to_gguf(gguf_writer)
|
||||
|
||||
print("gguf: get tensor metadata")
|
||||
|
||||
model = AutoModel.from_pretrained(dir_model, config=config, low_cpu_mem_usage=True)
|
||||
print(model)
|
||||
|
||||
tensor_map = gguf.get_tensor_name_map(ARCH, block_count)
|
||||
|
||||
list_vars = model.state_dict()
|
||||
for name in list_vars.keys():
|
||||
print(name, list_vars[name].shape, list_vars[name].dtype)
|
||||
|
||||
for name in list_vars.keys():
|
||||
data = list_vars[name].squeeze().numpy()
|
||||
if name in ['embeddings.position_ids', 'pooler.dense.weight', 'pooler.dense.bias']:
|
||||
continue
|
||||
print("Processing variable:", name, "with shape:", data.shape)
|
||||
|
||||
n_dims = len(data.shape)
|
||||
|
||||
# ftype == 0 -> float32, ftype == 1 -> float16
|
||||
if ftype == 1 and name[-7:] == ".weight" and n_dims == 2:
|
||||
print(" Converting to float16")
|
||||
data = data.astype(np.float16)
|
||||
l_type = 1
|
||||
else:
|
||||
l_type = 0
|
||||
|
||||
# map tensor names
|
||||
new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
|
||||
if new_name is None:
|
||||
print("Can not map tensor '" + name + "'")
|
||||
sys.exit()
|
||||
|
||||
gguf_writer.add_tensor(new_name, data)
|
||||
|
||||
|
||||
print("gguf: write header")
|
||||
gguf_writer.write_header_to_file()
|
||||
print("gguf: write metadata")
|
||||
gguf_writer.write_kv_data_to_file()
|
||||
print("gguf: write tensors")
|
||||
gguf_writer.write_tensors_to_file()
|
||||
|
||||
gguf_writer.close()
|
||||
|
||||
print(f"gguf: model successfully exported to '{fname_out}'")
|
||||
print()
|
@@ -1,165 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
# Convert GPT-J-6B h5 transformer model to ggml format
|
||||
#
|
||||
# Load the model using GPTJForCausalLM.
|
||||
# Iterate over all variables and write them to a binary file.
|
||||
#
|
||||
# For each variable, write the following:
|
||||
# - Number of dimensions (int)
|
||||
# - Name length (int)
|
||||
# - Dimensions (int[n_dims])
|
||||
# - Name (char[name_length])
|
||||
# - Data (float[n_dims])
|
||||
#
|
||||
# By default, the bigger matrices are converted to 16-bit floats.
|
||||
# This can be disabled by adding the "ftype" CLI argument.
|
||||
#
|
||||
# At the start of the ggml file we write the model parameters
|
||||
# and vocabulary.
|
||||
#
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
import struct
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import gguf
|
||||
import numpy as np
|
||||
from transformers import AutoConfig, AutoTokenizer, GPTJForCausalLM
|
||||
from transformers.models.gpt2 import tokenization_gpt2
|
||||
|
||||
|
||||
if not 2 <= len(sys.argv) < 4:
|
||||
print("Usage: python {} dir-model [ftype]\n".format(Path(__file__).name))
|
||||
print(" ftype == 0 -> float32")
|
||||
print(" ftype == 1 -> float16")
|
||||
sys.exit(1)
|
||||
|
||||
# output in the same directory as the model
|
||||
dir_model = Path(sys.argv[1])
|
||||
fname_out = dir_model / "ggml-model.gguf"
|
||||
|
||||
# possible data types
|
||||
# ftype == 0 -> float32
|
||||
# ftype == 1 -> float16
|
||||
#
|
||||
# map from ftype to string
|
||||
ftype_str = ["f32", "f16"]
|
||||
|
||||
ftype = 1
|
||||
if len(sys.argv) > 2:
|
||||
ftype = int(sys.argv[2])
|
||||
if ftype < 0 or ftype > 1:
|
||||
print("Invalid ftype: " + str(ftype))
|
||||
sys.exit(1)
|
||||
|
||||
fname_out = dir_model / ("ggml-model-" + ftype_str[ftype] + ".gguf")
|
||||
|
||||
|
||||
ARCH = gguf.MODEL_ARCH.GPTJ
|
||||
gguf_writer = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[ARCH])
|
||||
|
||||
print("gguf: get model metadata")
|
||||
|
||||
config = AutoConfig.from_pretrained(dir_model)
|
||||
|
||||
block_count = config.n_layer
|
||||
gguf_writer.add_name("GPT-J")
|
||||
gguf_writer.add_context_length(config.n_positions)
|
||||
gguf_writer.add_embedding_length(config.n_embd)
|
||||
gguf_writer.add_block_count(block_count)
|
||||
gguf_writer.add_feed_forward_length(4 * config.n_embd)
|
||||
gguf_writer.add_head_count(config.n_head)
|
||||
gguf_writer.add_rope_dimension_count(config.rotary_dim)
|
||||
gguf_writer.add_layer_norm_eps(config.layer_norm_epsilon)
|
||||
gguf_writer.add_file_type(ftype)
|
||||
|
||||
print("gguf: get gpt2 tokenizer vocab")
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(dir_model)
|
||||
|
||||
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}
|
||||
byte_encoder = tokenization_gpt2.bytes_to_unicode()
|
||||
byte_decoder = {v: k for k, v in byte_encoder.items()}
|
||||
|
||||
tokens: list[bytearray] = []
|
||||
|
||||
for i in range(config.vocab_size):
|
||||
if i in reverse_vocab:
|
||||
try:
|
||||
text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
|
||||
except KeyError:
|
||||
text = bytearray()
|
||||
for c in reverse_vocab[i]:
|
||||
if ord(c) < 256: # single byte character
|
||||
text.append(byte_decoder[c])
|
||||
else: # multibyte special token character
|
||||
text.extend(c.encode('utf-8'))
|
||||
else:
|
||||
print(f"Key {i} not in tokenizer vocabulary. Padding with an arbitrary token.")
|
||||
pad_token = f"[PAD{i}]".encode("utf8")
|
||||
text = bytearray(pad_token)
|
||||
|
||||
tokens.append(text)
|
||||
|
||||
|
||||
gguf_writer.add_tokenizer_model("gpt2")
|
||||
gguf_writer.add_token_list(tokens)
|
||||
|
||||
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
|
||||
special_vocab.add_to_gguf(gguf_writer)
|
||||
|
||||
print("gguf: get tensor metadata")
|
||||
|
||||
model = GPTJForCausalLM.from_pretrained(dir_model, config=config, low_cpu_mem_usage=True)
|
||||
#print (model)
|
||||
|
||||
tensor_map = gguf.get_tensor_name_map(ARCH, block_count)
|
||||
|
||||
list_vars = model.state_dict()
|
||||
#print (list_vars)
|
||||
|
||||
for name in list_vars.keys():
|
||||
data = list_vars[name].squeeze().numpy()
|
||||
print("Processing variable:", name, "with shape:", data.shape)
|
||||
|
||||
# we don't need these
|
||||
if name.endswith("attn.masked_bias") or name.endswith(".attn.bias"):
|
||||
print(" Skipping variable:", name)
|
||||
continue
|
||||
|
||||
n_dims = len(data.shape)
|
||||
|
||||
# ftype == 0 -> float32, ftype == 1 -> float16
|
||||
ftype_cur = 0
|
||||
if ftype == 1 and name[-7:] == ".weight" and n_dims == 2:
|
||||
print(" Converting to float16")
|
||||
data = data.astype(np.float16)
|
||||
ftype_cur = 1
|
||||
elif ftype == 1 or data.dtype != np.float32:
|
||||
print(" Converting to float32")
|
||||
data = data.astype(np.float32)
|
||||
ftype_cur = 0
|
||||
|
||||
# map tensor names
|
||||
new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
|
||||
if new_name is None:
|
||||
print("Can not map tensor '" + name + "'")
|
||||
sys.exit()
|
||||
|
||||
gguf_writer.add_tensor(new_name, data)
|
||||
|
||||
|
||||
print("gguf: write header")
|
||||
gguf_writer.write_header_to_file()
|
||||
print("gguf: write metadata")
|
||||
gguf_writer.write_kv_data_to_file()
|
||||
print("gguf: write tensors")
|
||||
gguf_writer.write_tensors_to_file()
|
||||
|
||||
gguf_writer.close()
|
||||
|
||||
print(f"gguf: model successfully exported to '{fname_out}'")
|
||||
print()
|
@@ -2,6 +2,7 @@
|
||||
#include "llamamodel_impl.h"
|
||||
|
||||
#include "llmodel.h"
|
||||
#include "utils.h"
|
||||
|
||||
#include <ggml.h>
|
||||
#include <llama.h>
|
||||
@@ -30,9 +31,9 @@
|
||||
|
||||
#ifdef GGML_USE_KOMPUTE
|
||||
# include <ggml-kompute.h>
|
||||
#elif GGML_USE_VULKAN
|
||||
#elif defined(GGML_USE_VULKAN)
|
||||
# include <ggml-vulkan.h>
|
||||
#elif GGML_USE_CUDA
|
||||
#elif defined(GGML_USE_CUDA)
|
||||
# include <ggml-cuda.h>
|
||||
#endif
|
||||
|
||||
@@ -51,14 +52,16 @@ static const std::vector<const char *> KNOWN_ARCHES {
|
||||
// "grok", -- 314B parameters
|
||||
"gpt2",
|
||||
// "gptj", -- no inference code
|
||||
// "gptneox", -- no inference code
|
||||
"gptneox",
|
||||
"granite",
|
||||
"granitemoe",
|
||||
"mpt",
|
||||
"baichuan",
|
||||
"starcoder",
|
||||
// "persimmon", -- CUDA generates garbage
|
||||
"refact",
|
||||
"bert",
|
||||
"nomic-bert",
|
||||
// "jina-bert-v2", -- Assertion `i01 >= 0 && i01 < ne01' failed.
|
||||
"bloom",
|
||||
"stablelm",
|
||||
"qwen",
|
||||
@@ -72,12 +75,21 @@ static const std::vector<const char *> KNOWN_ARCHES {
|
||||
"internlm2",
|
||||
// "minicpm", -- CUDA generates garbage
|
||||
"gemma",
|
||||
"gemma2",
|
||||
"starcoder2",
|
||||
// "mamba", -- CUDA missing SSM_CONV
|
||||
"xverse",
|
||||
"command-r",
|
||||
// "dbrx", -- 16x12B parameters
|
||||
"olmo",
|
||||
"olmoe",
|
||||
"openelm",
|
||||
// "arctic", -- 10B+128x3.66B parameters
|
||||
"deepseek2",
|
||||
"chatglm",
|
||||
// "bitnet", -- tensor not within file bounds?
|
||||
// "t5", -- seq2seq model
|
||||
"jais",
|
||||
};
|
||||
|
||||
static const std::vector<const char *> EMBEDDING_ARCHES {
|
||||
@@ -95,16 +107,34 @@ static bool llama_verbose()
|
||||
return var && *var;
|
||||
}
|
||||
|
||||
static void llama_log_callback(enum ggml_log_level level, const char *text, void *userdata)
|
||||
static void llama_log_callback(ggml_log_level level, const char *text, void *userdata, bool warn)
|
||||
{
|
||||
(void)userdata;
|
||||
if (llama_verbose() || level <= GGML_LOG_LEVEL_ERROR) {
|
||||
fputs(text, stderr);
|
||||
|
||||
static ggml_log_level lastlevel = GGML_LOG_LEVEL_NONE;
|
||||
if (!llama_verbose()) {
|
||||
auto efflevel = level == GGML_LOG_LEVEL_CONT ? lastlevel : level;
|
||||
lastlevel = efflevel;
|
||||
switch (efflevel) {
|
||||
case GGML_LOG_LEVEL_CONT:
|
||||
UNREACHABLE();
|
||||
break;
|
||||
case GGML_LOG_LEVEL_WARN:
|
||||
if (warn) break;
|
||||
[[fallthrough]];
|
||||
case GGML_LOG_LEVEL_NONE: // not used?
|
||||
case GGML_LOG_LEVEL_INFO:
|
||||
case GGML_LOG_LEVEL_DEBUG:
|
||||
return; // suppress
|
||||
case GGML_LOG_LEVEL_ERROR:
|
||||
;
|
||||
}
|
||||
}
|
||||
|
||||
fputs(text, stderr);
|
||||
}
|
||||
|
||||
struct gpt_params {
|
||||
int32_t seed = -1; // RNG seed
|
||||
int32_t n_keep = 0; // number of tokens to keep from initial prompt
|
||||
|
||||
// sampling parameters
|
||||
@@ -119,37 +149,6 @@ struct gpt_params {
|
||||
bool use_mlock = false; // use mlock to keep model in memory
|
||||
};
|
||||
|
||||
static int llama_sample_top_p_top_k(
|
||||
llama_context *ctx,
|
||||
const llama_token *last_n_tokens_data,
|
||||
int last_n_tokens_size,
|
||||
int top_k,
|
||||
float top_p,
|
||||
float min_p,
|
||||
float temp,
|
||||
float repeat_penalty,
|
||||
int32_t pos) {
|
||||
auto logits = llama_get_logits_ith(ctx, pos);
|
||||
auto n_vocab = llama_n_vocab(llama_get_model(ctx));
|
||||
// Populate initial list of all candidates
|
||||
std::vector<llama_token_data> candidates;
|
||||
candidates.reserve(n_vocab);
|
||||
for (int token_id = 0; token_id < n_vocab; token_id++) {
|
||||
candidates.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
|
||||
}
|
||||
llama_token_data_array candidates_p = {candidates.data(), candidates.size(), false};
|
||||
// Sample repeat penalty
|
||||
llama_sample_repetition_penalties(nullptr, &candidates_p, last_n_tokens_data, last_n_tokens_size, repeat_penalty, 0.0f, 0.0f);
|
||||
// Temperature sampling
|
||||
llama_sample_top_k(ctx, &candidates_p, top_k, 1);
|
||||
llama_sample_tail_free(ctx, &candidates_p, 1.0f, 1);
|
||||
llama_sample_typical(ctx, &candidates_p, 1.0f, 1);
|
||||
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
|
||||
llama_sample_min_p(ctx, &candidates_p, min_p, 1);
|
||||
llama_sample_temp(ctx, &candidates_p, temp);
|
||||
return llama_sample_token(ctx, &candidates_p);
|
||||
}
|
||||
|
||||
const char *get_arch_name(gguf_context *ctx_gguf)
|
||||
{
|
||||
const int kid = gguf_find_key(ctx_gguf, "general.architecture");
|
||||
@@ -206,7 +205,7 @@ static int32_t get_arch_key_u32(std::string const &modelPath, std::string const
|
||||
if (keyidx != -1) {
|
||||
value = gguf_get_val_u32(ctx, keyidx);
|
||||
} else {
|
||||
std::cerr << __func__ << ": " << key << "not found in " << modelPath << "\n";
|
||||
std::cerr << __func__ << ": " << key << " not found in " << modelPath << "\n";
|
||||
}
|
||||
}
|
||||
|
||||
@@ -216,21 +215,27 @@ cleanup:
|
||||
}
|
||||
|
||||
struct LLamaPrivate {
|
||||
const std::string modelPath;
|
||||
bool modelLoaded = false;
|
||||
int device = -1;
|
||||
std::string deviceName;
|
||||
llama_model *model = nullptr;
|
||||
llama_context *ctx = nullptr;
|
||||
llama_model_params model_params;
|
||||
llama_context_params ctx_params;
|
||||
int64_t n_threads = 0;
|
||||
std::vector<LLModel::Token> end_tokens;
|
||||
const char *backend_name = nullptr;
|
||||
bool modelLoaded = false;
|
||||
int device = -1;
|
||||
std::string deviceName;
|
||||
int64_t n_threads = 0;
|
||||
std::vector<LLModel::Token> end_tokens;
|
||||
const char *backend_name = nullptr;
|
||||
std::vector<LLModel::Token> inputTokens;
|
||||
|
||||
llama_model *model = nullptr;
|
||||
llama_context *ctx = nullptr;
|
||||
llama_model_params model_params;
|
||||
llama_context_params ctx_params;
|
||||
llama_sampler *sampler_chain;
|
||||
};
|
||||
|
||||
LLamaModel::LLamaModel()
|
||||
: d_ptr(new LLamaPrivate) {}
|
||||
: d_ptr(std::make_unique<LLamaPrivate>())
|
||||
{
|
||||
auto sparams = llama_sampler_chain_default_params();
|
||||
d_ptr->sampler_chain = llama_sampler_chain_init(sparams);
|
||||
}
|
||||
|
||||
// default hparams (LLaMA 7B)
|
||||
struct llama_file_hparams {
|
||||
@@ -419,10 +424,9 @@ bool LLamaModel::loadModel(const std::string &modelPath, int n_ctx, int ngl)
|
||||
}
|
||||
}
|
||||
|
||||
d_ptr->ctx_params.n_ctx = n_ctx;
|
||||
d_ptr->ctx_params.seed = params.seed;
|
||||
d_ptr->ctx_params.type_k = params.kv_type;
|
||||
d_ptr->ctx_params.type_v = params.kv_type;
|
||||
d_ptr->ctx_params.n_ctx = n_ctx;
|
||||
d_ptr->ctx_params.type_k = params.kv_type;
|
||||
d_ptr->ctx_params.type_v = params.kv_type;
|
||||
|
||||
// The new batch API provides space for n_vocab*n_tokens logits. Tell llama.cpp early
|
||||
// that we want this many logits so the state serializes consistently.
|
||||
@@ -488,6 +492,7 @@ LLamaModel::~LLamaModel()
|
||||
llama_free(d_ptr->ctx);
|
||||
}
|
||||
llama_free_model(d_ptr->model);
|
||||
llama_sampler_free(d_ptr->sampler_chain);
|
||||
}
|
||||
|
||||
bool LLamaModel::isModelLoaded() const
|
||||
@@ -497,38 +502,48 @@ bool LLamaModel::isModelLoaded() const
|
||||
|
||||
size_t LLamaModel::stateSize() const
|
||||
{
|
||||
return llama_get_state_size(d_ptr->ctx);
|
||||
return llama_state_get_size(d_ptr->ctx);
|
||||
}
|
||||
|
||||
size_t LLamaModel::saveState(uint8_t *dest) const
|
||||
size_t LLamaModel::saveState(std::span<uint8_t> stateOut, std::vector<Token> &inputTokensOut) const
|
||||
{
|
||||
return llama_copy_state_data(d_ptr->ctx, dest);
|
||||
size_t bytesWritten = llama_state_get_data(d_ptr->ctx, stateOut.data(), stateOut.size());
|
||||
if (bytesWritten)
|
||||
inputTokensOut.assign(d_ptr->inputTokens.begin(), d_ptr->inputTokens.end());
|
||||
return bytesWritten;
|
||||
}
|
||||
|
||||
size_t LLamaModel::restoreState(const uint8_t *src)
|
||||
size_t LLamaModel::restoreState(std::span<const uint8_t> state, std::span<const Token> inputTokens)
|
||||
{
|
||||
// const_cast is required, see: https://github.com/ggerganov/llama.cpp/pull/1540
|
||||
return llama_set_state_data(d_ptr->ctx, const_cast<uint8_t*>(src));
|
||||
size_t bytesRead = llama_state_set_data(d_ptr->ctx, state.data(), state.size());
|
||||
if (bytesRead)
|
||||
d_ptr->inputTokens.assign(inputTokens.begin(), inputTokens.end());
|
||||
return bytesRead;
|
||||
}
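A hedged round-trip sketch of the new span-based state API; the helper function and variable names are mine, and it assumes "llamamodel_impl.h" is included and that the model has already loaded a file and processed a prompt:

#include <cassert>
#include <cstdint>
#include <vector>

void roundTripState(LLamaModel &model) // hypothetical helper, not part of the backend
{
    std::vector<uint8_t> state(model.stateSize());
    std::vector<LLModel::Token> history;

    // saveState() now also hands back the input-token cache so both can be persisted together.
    size_t written = model.saveState(state, history);
    assert(written != 0 && written <= state.size());

    // ... run other prompts here ...

    // restoreState() re-installs the KV-cache bytes together with the matching token history.
    size_t read = model.restoreState({state.data(), written}, history);
    assert(read == written);
}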
|
||||
|
||||
std::vector<LLModel::Token> LLamaModel::tokenize(PromptContext &ctx, const std::string &str, bool special) const
|
||||
std::vector<LLModel::Token> LLamaModel::tokenize(std::string_view str) const
|
||||
{
|
||||
const bool wantBOS = ctx.n_past == 0 && ctx.tokens.empty();
|
||||
const bool useBOS = wantBOS && shouldAddBOS();
|
||||
auto strCat = wantBOS && !special ? " " + str : str; // insert leading space ourselves, llama.cpp fork doesn't anymore
|
||||
std::vector<LLModel::Token> fres(strCat.size()+4);
|
||||
auto fres_len = llama_tokenize(d_ptr->model, strCat.c_str(), strCat.length(), fres.data(), fres.size(), useBOS, special);
|
||||
std::vector<LLModel::Token> fres(str.length() + 4);
|
||||
int32_t fres_len = llama_tokenize(
|
||||
d_ptr->model, str.data(), str.length(), fres.data(), fres.size(), /*add_special*/ true, /*parse_special*/ true
|
||||
);
|
||||
fres.resize(fres_len);
|
||||
return fres;
|
||||
}
|
||||
|
||||
bool LLamaModel::isSpecialToken(Token id) const
|
||||
{
|
||||
return llama_token_get_attr(d_ptr->model, id)
|
||||
& (LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED | LLAMA_TOKEN_ATTR_UNKNOWN);
|
||||
}
|
||||
|
||||
std::string LLamaModel::tokenToString(Token id) const
|
||||
{
|
||||
std::vector<char> result(8, 0);
|
||||
const int n_tokens = llama_token_to_piece(d_ptr->model, id, result.data(), result.size(), false);
|
||||
const int n_tokens = llama_token_to_piece(d_ptr->model, id, result.data(), result.size(), 0, true);
|
||||
if (n_tokens < 0) {
|
||||
result.resize(-n_tokens);
|
||||
int check = llama_token_to_piece(d_ptr->model, id, result.data(), result.size(), false);
|
||||
int check = llama_token_to_piece(d_ptr->model, id, result.data(), result.size(), 0, true);
|
||||
GGML_ASSERT(check == -n_tokens);
|
||||
}
|
||||
else {
|
||||
@@ -538,27 +553,66 @@ std::string LLamaModel::tokenToString(Token id) const
|
||||
return std::string(result.data(), result.size());
|
||||
}
|
||||
|
||||
LLModel::Token LLamaModel::sampleToken(PromptContext &promptCtx) const
|
||||
void LLamaModel::initSampler(const PromptContext &promptCtx)
|
||||
{
|
||||
const size_t n_prev_toks = std::min((size_t) promptCtx.repeat_last_n, promptCtx.tokens.size());
|
||||
return llama_sample_top_p_top_k(d_ptr->ctx,
|
||||
promptCtx.tokens.data() + promptCtx.tokens.size() - n_prev_toks,
|
||||
n_prev_toks, promptCtx.top_k, promptCtx.top_p, promptCtx.min_p, promptCtx.temp,
|
||||
promptCtx.repeat_penalty, promptCtx.n_last_batch_tokens - 1);
|
||||
auto *model = d_ptr->model;
|
||||
auto *chain = d_ptr->sampler_chain;
|
||||
|
||||
// clear sampler chain
|
||||
for (int i = llama_sampler_chain_n(chain) - 1; i >= 0; i--) {
|
||||
auto *smpl = llama_sampler_chain_remove(chain, i);
|
||||
llama_sampler_free(smpl);
|
||||
}
|
||||
|
||||
// build new chain
|
||||
llama_sampler_chain_add(chain,
|
||||
llama_sampler_init_penalties(
|
||||
llama_n_vocab(model),
|
||||
llama_token_eos(model),
|
||||
llama_token_nl(model),
|
||||
promptCtx.repeat_last_n,
|
||||
promptCtx.repeat_penalty,
|
||||
// TODO(jared): consider making the below configurable
|
||||
/*penalty_freq*/ 0.0f,
|
||||
/*penalty_present*/ 0.0f,
|
||||
/*penalize_nl*/ true,
|
||||
/*ignore_eos*/ false
|
||||
)
|
||||
);
|
||||
if (promptCtx.temp == 0.0f) {
|
||||
llama_sampler_chain_add(chain, llama_sampler_init_greedy());
|
||||
} else {
|
||||
struct llama_sampler *samplers[] = {
|
||||
llama_sampler_init_top_k(promptCtx.top_k),
|
||||
llama_sampler_init_top_p(promptCtx.top_p, 1),
|
||||
llama_sampler_init_min_p(promptCtx.min_p, 1),
|
||||
llama_sampler_init_temp(promptCtx.temp),
|
||||
llama_sampler_init_softmax(),
|
||||
llama_sampler_init_dist(LLAMA_DEFAULT_SEED),
|
||||
};
|
||||
for (auto *smpl : samplers)
|
||||
llama_sampler_chain_add(chain, smpl);
|
||||
}
|
||||
}
|
||||
|
||||
bool LLamaModel::evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const
|
||||
LLModel::Token LLamaModel::sampleToken() const
|
||||
{
|
||||
llama_kv_cache_seq_rm(d_ptr->ctx, 0, ctx.n_past, -1);
|
||||
return llama_sampler_sample(d_ptr->sampler_chain, d_ptr->ctx, -1);
|
||||
}
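Together, initSampler() and sampleToken() replace the removed llama_sample_top_p_top_k(): the sampler chain is configured once per prompt and then queried per token. The loop below only illustrates the intended call order; it is a hypothetical member helper (these hooks are protected and are normally driven by LLModel::prompt()), not code from this PR:

#include <algorithm> // std::find

void LLamaModel::generateSketch(const PromptContext &promptCtx, int32_t nPast) // hypothetical
{
    initSampler(promptCtx);                          // rebuild the llama_sampler chain for these settings
    for (int32_t i = 0; i < promptCtx.n_predict; ++i) {
        Token id = sampleToken();                    // greedy when temp == 0, otherwise top-k/top-p/min-p/temp/dist
        const auto &ends = endTokens();
        if (std::find(ends.begin(), ends.end(), id) != ends.end())
            break;                                   // hit EOS or another end token
        Token next[1] = { id };
        if (!evalTokens(nPast, next))                // feed the sampled token back through the model
            break;
        appendInputToken(id);                        // keep the token cache in sync for later prefix reuse
        ++nPast;
    }
}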
|
||||
|
||||
bool LLamaModel::evalTokens(int32_t nPast, std::span<const Token> tokens) const
|
||||
{
|
||||
assert(!tokens.empty());
|
||||
|
||||
llama_kv_cache_seq_rm(d_ptr->ctx, 0, nPast, -1);
|
||||
|
||||
llama_batch batch = llama_batch_init(tokens.size(), 0, 1);
|
||||
|
||||
batch.n_tokens = tokens.size();
|
||||
ctx.n_last_batch_tokens = tokens.size();
|
||||
|
||||
for (int32_t i = 0; i < batch.n_tokens; i++) {
|
||||
batch.token [i] = tokens[i];
|
||||
batch.pos [i] = ctx.n_past + i;
|
||||
batch.pos [i] = nPast + i;
|
||||
batch.n_seq_id[i] = 1;
|
||||
batch.seq_id [i][0] = 0;
|
||||
batch.logits [i] = false;
|
||||
@@ -572,11 +626,86 @@ bool LLamaModel::evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const
|
||||
return res == 0;
|
||||
}
|
||||
|
||||
void LLamaModel::shiftContext(const PromptContext &promptCtx, int32_t *nPast)
|
||||
{
|
||||
// infinite text generation via context shifting
|
||||
|
||||
// erase up to n_ctx*contextErase tokens
|
||||
int n_keep = shouldAddBOS();
|
||||
int n_past = *nPast;
|
||||
int n_discard = std::min(n_past - n_keep, int(contextLength() * promptCtx.contextErase));
|
||||
|
||||
assert(n_discard > 0);
|
||||
if (n_discard <= 0)
|
||||
return;
|
||||
|
||||
std::cerr << "Llama: context full, swapping: n_past = " << n_past << ", n_keep = " << n_keep
|
||||
<< ", n_discard = " << n_discard << "\n";
|
||||
|
||||
// erase the first n_discard tokens from the context
|
||||
llama_kv_cache_seq_rm (d_ptr->ctx, 0, n_keep, n_keep + n_discard);
|
||||
llama_kv_cache_seq_add(d_ptr->ctx, 0, n_keep + n_discard, n_past, -n_discard);
|
||||
|
||||
auto &inp = d_ptr->inputTokens;
|
||||
inp.erase(inp.begin() + n_keep, inp.begin() + n_keep + n_discard);
|
||||
*nPast = inp.size();
|
||||
}
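A quick worked example of the discard arithmetic, with numbers chosen purely for illustration:

// Illustrative values only.
int   n_ctx     = 2048;                                          // contextLength()
float erase     = 0.25f;                                         // promptCtx.contextErase
int   n_keep    = 1;                                             // BOS kept when shouldAddBOS() is true
int   n_past    = 2048;                                          // cache is full
int   n_discard = std::min(n_past - n_keep, int(n_ctx * erase)); // == min(2047, 512) == 512
// llama_kv_cache_seq_rm drops cache positions [1, 513);
// llama_kv_cache_seq_add then shifts positions [513, 2048) left by 512.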
|
||||
|
||||
int32_t LLamaModel::contextLength() const
|
||||
{
|
||||
return llama_n_ctx(d_ptr->ctx);
|
||||
}
|
||||
|
||||
auto LLamaModel::specialTokens() -> std::unordered_map<std::string, std::string> const
|
||||
{
|
||||
if (!d_ptr->model)
|
||||
throw std::logic_error("model not loaded");
|
||||
|
||||
std::unordered_map<std::string, std::string> tokens;
|
||||
if (auto id = llama_token_bos(d_ptr->model); id != LLAMA_TOKEN_NULL)
|
||||
tokens.emplace("bos_token", tokenToString(id));
|
||||
if (auto id = llama_token_eos(d_ptr->model); id != LLAMA_TOKEN_NULL)
|
||||
tokens.emplace("eos_token", tokenToString(id));
|
||||
return tokens;
|
||||
}
|
||||
|
||||
int32_t LLamaModel::inputLength() const
|
||||
{
|
||||
return d_ptr->inputTokens.size();
|
||||
}
|
||||
|
||||
int32_t LLamaModel::computeModelInputPosition(std::span<const Token> input) const
|
||||
{
|
||||
// find common prefix
|
||||
auto cacheIt = d_ptr->inputTokens.begin();
|
||||
auto inputIt = input.begin();
|
||||
while (cacheIt < d_ptr->inputTokens.end() && inputIt < input.end() && *cacheIt == *inputIt) {
|
||||
++cacheIt; ++inputIt;
|
||||
}
|
||||
// tell the caller to ignore the tokens between [begin, inputIt)
|
||||
return inputIt - input.begin();
|
||||
}
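The prefix scan is equivalent to std::mismatch over the cached tokens and the new input; a tiny standalone example with invented token ids:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int32_t> cache = { 101, 7, 42, 9 };       // previous inputTokens
    std::vector<int32_t> input = { 101, 7, 42, 13, 5 };   // new prompt sharing the first three tokens

    auto [cacheIt, inputIt] = std::mismatch(cache.begin(), cache.end(), input.begin(), input.end());
    std::cout << "reusable prefix: " << (inputIt - input.begin()) << " tokens\n"; // prints 3
    // decodePrompt() can then start evaluation at position 3 instead of re-decoding the shared prefix.
}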
|
||||
|
||||
void LLamaModel::setModelInputPosition(int32_t pos)
|
||||
{
|
||||
auto &inp = d_ptr->inputTokens;
|
||||
assert(pos >= 0);
|
||||
assert(pos <= inp.size());
|
||||
// truncate token cache to end at the new n_past
|
||||
if (pos < inp.size())
|
||||
inp.resize(pos);
|
||||
}
|
||||
|
||||
void LLamaModel::appendInputToken(Token tok)
|
||||
{
|
||||
d_ptr->inputTokens.push_back(tok);
|
||||
}
|
||||
|
||||
auto LLamaModel::inputTokens() const -> std::span<const Token>
|
||||
{
|
||||
return d_ptr->inputTokens;
|
||||
}
|
||||
|
||||
const std::vector<LLModel::Token> &LLamaModel::endTokens() const
|
||||
{
|
||||
return d_ptr->end_tokens;
|
||||
@@ -584,10 +713,7 @@ const std::vector<LLModel::Token> &LLamaModel::endTokens() const
|
||||
|
||||
bool LLamaModel::shouldAddBOS() const
|
||||
{
|
||||
int add_bos = llama_add_bos_token(d_ptr->model);
|
||||
if (add_bos != -1) { return add_bos; }
|
||||
auto vocab_type = llama_vocab_type(d_ptr->model);
|
||||
return vocab_type == LLAMA_VOCAB_TYPE_SPM || vocab_type == LLAMA_VOCAB_TYPE_WPM;
|
||||
return llama_add_bos_token(d_ptr->model);
|
||||
}
|
||||
|
||||
int32_t LLamaModel::maxContextLength(std::string const &modelPath) const
|
||||
@@ -600,6 +726,37 @@ int32_t LLamaModel::layerCount(std::string const &modelPath) const
|
||||
return get_arch_key_u32(modelPath, "block_count");
|
||||
}
|
||||
|
||||
// TODO(jared): reduce redundant code and operations by combining all metadata getters for unloaded
|
||||
// models into a class that keeps the model file open
|
||||
auto LLamaModel::chatTemplate(const char *modelPath) const -> std::expected<std::string, std::string>
|
||||
{
|
||||
auto *ctx = load_gguf(modelPath);
|
||||
if (!ctx)
|
||||
return std::unexpected("failed to open model file");
|
||||
|
||||
std::expected<std::string, std::string> result;
|
||||
enum gguf_type ktype;
|
||||
const int kid = gguf_find_key(ctx, "tokenizer.chat_template");
|
||||
if (kid == -1) {
|
||||
result = std::unexpected("key not found");
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
ktype = gguf_get_kv_type(ctx, kid);
|
||||
if (ktype != GGUF_TYPE_STRING) {
|
||||
result = std::unexpected(
|
||||
"expected key type STRING (" + std::to_string(GGUF_TYPE_STRING) + "), got " + std::to_string(ktype)
|
||||
);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
result = gguf_get_val_str(ctx, kid);
|
||||
|
||||
cleanup:
|
||||
gguf_free(ctx);
|
||||
return result;
|
||||
}
|
||||
|
||||
#ifdef GGML_USE_VULKAN
|
||||
static const char *getVulkanVendorName(uint32_t vendorID)
|
||||
{
|
||||
@@ -929,7 +1086,7 @@ void LLamaModel::embedInternal(
|
||||
const llama_token bos_token = llama_token_bos(d_ptr->model);
|
||||
const llama_token eos_token = llama_token_eos(d_ptr->model);
|
||||
|
||||
bool useBOS = shouldAddBOS();
|
||||
bool useBOS = llama_add_bos_token(d_ptr->model);
|
||||
bool useEOS = llama_vocab_type(d_ptr->model) == LLAMA_VOCAB_TYPE_WPM;
|
||||
|
||||
// no EOS, optional BOS
|
||||
@@ -937,13 +1094,16 @@ void LLamaModel::embedInternal(
|
||||
if (!text.empty() && text[0] != ' ') {
|
||||
text = ' ' + text; // normalize for SPM - our fork of llama.cpp doesn't add a space prefix
|
||||
}
|
||||
wantBOS &= useBOS;
|
||||
|
||||
tokens.resize(text.length()+4);
|
||||
int32_t n_tokens = llama_tokenize(d_ptr->model, text.c_str(), text.length(), tokens.data(), tokens.size(), wantBOS, false);
|
||||
int32_t n_tokens = llama_tokenize_gpt4all(
|
||||
d_ptr->model, text.c_str(), text.length(), tokens.data(), tokens.size(), /*add_special*/ wantBOS,
|
||||
/*parse_special*/ false, /*insert_space*/ false
|
||||
);
|
||||
if (n_tokens) {
|
||||
(void)eos_token;
|
||||
assert((useEOS && wantBOS) == (eos_token != -1 && tokens[n_tokens - 1] == eos_token));
|
||||
(void)useBOS;
|
||||
assert((useEOS && wantBOS && useBOS) == (eos_token != -1 && tokens[n_tokens - 1] == eos_token));
|
||||
if (useEOS && wantBOS)
|
||||
n_tokens--; // erase EOS/SEP
|
||||
}
|
||||
@@ -1169,7 +1329,10 @@ DLL_EXPORT bool is_arch_supported(const char *arch)
|
||||
|
||||
DLL_EXPORT LLModel *construct()
|
||||
{
|
||||
llama_log_set(llama_log_callback, nullptr);
|
||||
llama_log_set([](auto l, auto t, auto u) { llama_log_callback(l, t, u, false); }, nullptr);
|
||||
#ifdef GGML_USE_CUDA
|
||||
ggml_backend_cuda_log_set_callback([](auto l, auto t, auto u) { llama_log_callback(l, t, u, true); }, nullptr);
|
||||
#endif
|
||||
return new LLamaModel;
|
||||
}
|
||||
}
|
@@ -6,10 +6,12 @@
|
||||
|
||||
#include "llmodel.h"
|
||||
|
||||
#include <functional>
|
||||
#include <memory>
|
||||
#include <span>
|
||||
#include <string>
|
||||
#include <string_view>
|
||||
#include <vector>
|
||||
#include <unordered_map>
|
||||
|
||||
struct LLamaPrivate;
|
||||
struct EmbModelSpec;
|
||||
@@ -27,8 +29,8 @@ public:
|
||||
bool isModelLoaded() const override;
|
||||
size_t requiredMem(const std::string &modelPath, int n_ctx, int ngl) override;
|
||||
size_t stateSize() const override;
|
||||
size_t saveState(uint8_t *dest) const override;
|
||||
size_t restoreState(const uint8_t *src) override;
|
||||
size_t saveState(std::span<uint8_t> stateOut, std::vector<Token> &inputTokensOut) const override;
|
||||
size_t restoreState(std::span<const uint8_t> state, std::span<const Token> inputTokens) override;
|
||||
void setThreadCount(int32_t n_threads) override;
|
||||
int32_t threadCount() const override;
|
||||
std::vector<GPUDevice> availableGPUDevices(size_t memoryRequired = 0) const override;
|
||||
@@ -47,25 +49,36 @@ public:
|
||||
void embed(const std::vector<std::string> &texts, float *embeddings, bool isRetrieval, int dimensionality = -1,
|
||||
size_t *tokenCount = nullptr, bool doMean = true, bool atlas = false) override;
|
||||
|
||||
private:
|
||||
std::unique_ptr<LLamaPrivate> d_ptr;
|
||||
bool m_supportsEmbedding = false;
|
||||
bool m_supportsCompletion = false;
|
||||
int32_t contextLength() const override;
|
||||
auto specialTokens() -> std::unordered_map<std::string, std::string> const override;
|
||||
|
||||
protected:
|
||||
std::vector<Token> tokenize(PromptContext &ctx, const std::string &str, bool special) const override;
|
||||
std::vector<Token> tokenize(std::string_view str) const override;
|
||||
bool isSpecialToken(Token id) const override;
|
||||
std::string tokenToString(Token id) const override;
|
||||
Token sampleToken(PromptContext &ctx) const override;
|
||||
bool evalTokens(PromptContext &ctx, const std::vector<int32_t> &tokens) const override;
|
||||
int32_t contextLength() const override;
|
||||
void initSampler(const PromptContext &ctx) override;
|
||||
Token sampleToken() const override;
|
||||
bool evalTokens(int32_t nPast, std::span<const Token> tokens) const override;
|
||||
void shiftContext(const PromptContext &promptCtx, int32_t *nPast) override;
|
||||
int32_t inputLength() const override;
|
||||
int32_t computeModelInputPosition(std::span<const Token> input) const override;
|
||||
void setModelInputPosition(int32_t pos) override;
|
||||
void appendInputToken(Token tok) override;
|
||||
std::span<const Token> inputTokens() const override;
|
||||
const std::vector<Token> &endTokens() const override;
|
||||
bool shouldAddBOS() const override;
|
||||
int32_t maxContextLength(std::string const &modelPath) const override;
|
||||
int32_t layerCount(std::string const &modelPath) const override;
|
||||
auto chatTemplate(const char *modelPath) const -> std::expected<std::string, std::string> override;
|
||||
|
||||
void embedInternal(const std::vector<std::string> &texts, float *embeddings, std::string prefix, int dimensionality,
|
||||
size_t *tokenCount, bool doMean, bool atlas, EmbedCancelCallback *cancelCb,
|
||||
const EmbModelSpec *spec);
|
||||
|
||||
private:
|
||||
std::unique_ptr<LLamaPrivate> d_ptr;
|
||||
bool m_supportsEmbedding = false;
|
||||
bool m_supportsCompletion = false;
|
||||
};
|
||||
|
||||
#endif // LLAMAMODEL_H
|
@@ -130,7 +130,7 @@ const std::vector<LLModel::Implementation> &LLModel::Implementation::implementat
|
||||
|
||||
addCudaSearchPath();
|
||||
|
||||
std::string impl_name_re = "(gptj|llamamodel-mainline)-(cpu|metal|kompute|vulkan|cuda)";
|
||||
std::string impl_name_re = "llamamodel-mainline-(cpu|metal|kompute|vulkan|cuda)";
|
||||
if (cpu_supports_avx2() == 0) {
|
||||
impl_name_re += "-avxonly";
|
||||
}
|
||||
@@ -140,9 +140,14 @@ const std::vector<LLModel::Implementation> &LLModel::Implementation::implementat
|
||||
std::string path;
|
||||
// Split the paths string by the delimiter and process each path.
|
||||
while (std::getline(ss, path, ';')) {
|
||||
std::u8string u8_path(path.begin(), path.end());
|
||||
fs::directory_iterator iter;
|
||||
try {
|
||||
iter = fs::directory_iterator(std::u8string(path.begin(), path.end()));
|
||||
} catch (const fs::filesystem_error &) {
|
||||
continue; // skip nonexistent path
|
||||
}
|
||||
// Iterate over all libraries
|
||||
for (const auto &f : fs::directory_iterator(u8_path)) {
|
||||
for (const auto &f : iter) {
|
||||
const fs::path &p = f.path();
|
||||
|
||||
if (p.extension() != LIB_FILE_EXT) continue;
|
||||
@@ -326,6 +331,12 @@ bool LLModel::Implementation::isEmbeddingModel(const std::string &modelPath)
|
||||
return llama && llama->isEmbeddingModel(modelPath);
|
||||
}
|
||||
|
||||
auto LLModel::Implementation::chatTemplate(const char *modelPath) -> std::expected<std::string, std::string>
|
||||
{
|
||||
auto *llama = constructGlobalLlama();
|
||||
return llama ? llama->chatTemplate(modelPath) : std::unexpected("backend not available");
|
||||
}
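A hedged caller sketch for the new accessor, assuming it is static like its sibling helpers; the model path is a placeholder and the output statements are mine ("llmodel.h" is assumed to be included):

#include <iostream>

void printChatTemplate()
{
    auto tmpl = LLModel::Implementation::chatTemplate("/path/to/model.gguf"); // placeholder path
    if (tmpl) {
        std::cout << "chat template:\n" << *tmpl << '\n';
    } else {
        // e.g. "key not found" when tokenizer.chat_template is absent, or "backend not available"
        std::cerr << "no usable template: " << tmpl.error() << '\n';
    }
}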
|
||||
|
||||
void LLModel::Implementation::setImplementationsSearchPath(const std::string& path)
|
||||
{
|
||||
s_implementations_search_path = path;
|
@@ -7,16 +7,20 @@
|
||||
#include <cstdlib>
|
||||
#include <cstring>
|
||||
#include <exception>
|
||||
#include <functional>
|
||||
#include <iostream>
|
||||
#include <memory>
|
||||
#include <optional>
|
||||
#include <string>
|
||||
#include <string_view>
|
||||
#include <vector>
|
||||
#include <span>
|
||||
|
||||
namespace ranges = std::ranges;
|
||||
|
||||
static_assert(sizeof(token_t) == sizeof(LLModel::Token));
|
||||
|
||||
struct LLModelWrapper {
|
||||
LLModel *llModel = nullptr;
|
||||
LLModel::PromptContext promptContext;
|
||||
~LLModelWrapper() { delete llModel; }
|
||||
};
|
||||
|
||||
@@ -84,82 +88,80 @@ bool llmodel_isModelLoaded(llmodel_model model)
|
||||
return wrapper->llModel->isModelLoaded();
|
||||
}
|
||||
|
||||
uint64_t llmodel_get_state_size(llmodel_model model)
|
||||
uint64_t llmodel_state_get_size(llmodel_model model)
|
||||
{
|
||||
auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
return wrapper->llModel->stateSize();
|
||||
}
|
||||
|
||||
uint64_t llmodel_save_state_data(llmodel_model model, uint8_t *dest)
|
||||
uint64_t llmodel_state_get_data(llmodel_model model, uint8_t *state_out, uint64_t state_size,
|
||||
token_t **input_tokens_out, uint64_t *n_input_tokens)
|
||||
{
|
||||
auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
return wrapper->llModel->saveState(dest);
|
||||
std::vector<LLModel::Token> inputTokens;
|
||||
auto bytesWritten = wrapper->llModel->saveState({state_out, size_t(state_size)}, inputTokens);
|
||||
if (bytesWritten) {
|
||||
auto *buf = new LLModel::Token[inputTokens.size()];
|
||||
ranges::copy(inputTokens, buf);
|
||||
*input_tokens_out = buf;
|
||||
*n_input_tokens = uint64_t(inputTokens.size());
|
||||
} else {
|
||||
*input_tokens_out = nullptr;
|
||||
*n_input_tokens = 0;
|
||||
}
|
||||
return bytesWritten;
|
||||
}
|
||||
|
||||
uint64_t llmodel_restore_state_data(llmodel_model model, const uint8_t *src)
|
||||
void llmodel_state_free_input_tokens(LLModel::Token *input_tokens)
|
||||
{
|
||||
auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
return wrapper->llModel->restoreState(src);
|
||||
delete[] input_tokens;
|
||||
}
|
||||
|
||||
void llmodel_prompt(llmodel_model model, const char *prompt,
|
||||
const char *prompt_template,
|
||||
llmodel_prompt_callback prompt_callback,
|
||||
llmodel_response_callback response_callback,
|
||||
llmodel_recalculate_callback recalculate_callback,
|
||||
llmodel_prompt_context *ctx,
|
||||
bool special,
|
||||
const char *fake_reply)
|
||||
uint64_t llmodel_state_set_data(llmodel_model model, const uint8_t *state, uint64_t state_size,
|
||||
const token_t *input_tokens, uint64_t n_input_tokens)
|
||||
{
|
||||
auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
return wrapper->llModel->restoreState({state, size_t(state_size)}, {input_tokens, size_t(n_input_tokens)});
|
||||
}
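From the C API the renamed state functions are used in three steps: size the buffer, capture the state plus the token history the backend now returns, and later hand both back. A hedged sketch against the names declared in this hunk (error handling omitted; `model` is an already-loaded llmodel_model and "llmodel_c.h" is assumed to be included):

#include <cstdint>
#include <cstdlib>

void round_trip(llmodel_model model) // sketch only
{
    uint64_t size  = llmodel_state_get_size(model);
    uint8_t *state = static_cast<uint8_t *>(std::malloc(size));

    token_t *tokens   = nullptr;
    uint64_t n_tokens = 0;
    uint64_t written  = llmodel_state_get_data(model, state, size, &tokens, &n_tokens);

    /* ... later ... */
    llmodel_state_set_data(model, state, written, tokens, n_tokens);

    llmodel_state_free_input_tokens(tokens); // the token array is allocated by the backend
    std::free(state);
}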
|
||||
|
||||
auto response_func = [response_callback](int32_t token_id, const std::string &response) {
|
||||
return response_callback(token_id, response.c_str());
|
||||
};
|
||||
|
||||
if (size_t(ctx->n_past) < wrapper->promptContext.tokens.size())
|
||||
wrapper->promptContext.tokens.resize(ctx->n_past);
|
||||
bool llmodel_prompt(llmodel_model model,
|
||||
const char *prompt,
|
||||
llmodel_prompt_callback prompt_callback,
|
||||
llmodel_response_callback response_callback,
|
||||
llmodel_prompt_context *ctx,
|
||||
const char **error)
|
||||
{
|
||||
auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
|
||||
// Copy the C prompt context
|
||||
wrapper->promptContext.n_past = ctx->n_past;
|
||||
wrapper->promptContext.n_ctx = ctx->n_ctx;
|
||||
wrapper->promptContext.n_predict = ctx->n_predict;
|
||||
wrapper->promptContext.top_k = ctx->top_k;
|
||||
wrapper->promptContext.top_p = ctx->top_p;
|
||||
wrapper->promptContext.min_p = ctx->min_p;
|
||||
wrapper->promptContext.temp = ctx->temp;
|
||||
wrapper->promptContext.n_batch = ctx->n_batch;
|
||||
wrapper->promptContext.repeat_penalty = ctx->repeat_penalty;
|
||||
wrapper->promptContext.repeat_last_n = ctx->repeat_last_n;
|
||||
wrapper->promptContext.contextErase = ctx->context_erase;
|
||||
LLModel::PromptContext promptContext {
|
||||
.n_predict = ctx->n_predict,
|
||||
.top_k = ctx->top_k,
|
||||
.top_p = ctx->top_p,
|
||||
.min_p = ctx->min_p,
|
||||
.temp = ctx->temp,
|
||||
.n_batch = ctx->n_batch,
|
||||
.repeat_penalty = ctx->repeat_penalty,
|
||||
.repeat_last_n = ctx->repeat_last_n,
|
||||
.contextErase = ctx->context_erase,
|
||||
};
|
||||
|
||||
std::string fake_reply_str;
|
||||
if (fake_reply) { fake_reply_str = fake_reply; }
|
||||
auto *fake_reply_p = fake_reply ? &fake_reply_str : nullptr;
|
||||
auto prompt_func = [prompt_callback](std::span<const LLModel::Token> token_ids, bool cached) {
|
||||
return prompt_callback(token_ids.data(), token_ids.size(), cached);
|
||||
};
|
||||
auto response_func = [response_callback](LLModel::Token token_id, std::string_view piece) {
|
||||
return response_callback(token_id, piece.data());
|
||||
};
|
||||
|
||||
// Call the C++ prompt method
|
||||
wrapper->llModel->prompt(prompt, prompt_template, prompt_callback, response_func, recalculate_callback,
|
||||
wrapper->promptContext, special, fake_reply_p);
|
||||
try {
|
||||
wrapper->llModel->prompt(prompt, prompt_func, response_func, promptContext);
|
||||
} catch (std::exception const &e) {
|
||||
llmodel_set_error(error, e.what());
|
||||
return false;
|
||||
}
|
||||
|
||||
// Update the C context by giving access to the wrappers raw pointers to std::vector data
|
||||
// which involves no copies
|
||||
ctx->logits = wrapper->promptContext.logits.data();
|
||||
ctx->logits_size = wrapper->promptContext.logits.size();
|
||||
ctx->tokens = wrapper->promptContext.tokens.data();
|
||||
ctx->tokens_size = wrapper->promptContext.tokens.size();
|
||||
|
||||
// Update the rest of the C prompt context
|
||||
ctx->n_past = wrapper->promptContext.n_past;
|
||||
ctx->n_ctx = wrapper->promptContext.n_ctx;
|
||||
ctx->n_predict = wrapper->promptContext.n_predict;
|
||||
ctx->top_k = wrapper->promptContext.top_k;
|
||||
ctx->top_p = wrapper->promptContext.top_p;
|
||||
ctx->min_p = wrapper->promptContext.min_p;
|
||||
ctx->temp = wrapper->promptContext.temp;
|
||||
ctx->n_batch = wrapper->promptContext.n_batch;
|
||||
ctx->repeat_penalty = wrapper->promptContext.repeat_penalty;
|
||||
ctx->repeat_last_n = wrapper->promptContext.repeat_last_n;
|
||||
ctx->context_erase = wrapper->promptContext.contextErase;
|
||||
return true;
|
||||
}
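A caller-side sketch of the reworked entry point. The callback parameter types here are inferred from the wrapper lambdas above (token ids plus a cached flag for the prompt callback, a token id plus a UTF-8 piece for the response callback), so treat the typedefs published in llmodel_c.h as authoritative:

#include <cstddef>
#include <cstdio>

static bool on_prompt(const token_t *tokens, size_t n, bool cached) // signature approximated
{
    (void)tokens; (void)n; (void)cached;
    return true; // keep processing
}

static bool on_response(token_t id, const char *piece)
{
    (void)id;
    std::fputs(piece, stdout);
    return true;
}

void ask(llmodel_model model) // sketch only
{
    llmodel_prompt_context ctx = {}; // set n_predict, n_batch, temp, ... before calling
    const char *error = nullptr;
    if (!llmodel_prompt(model, "What is a GGUF file?", on_prompt, on_response, &ctx, &error))
        std::fprintf(stderr, "prompt failed: %s\n", error);
}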
|
||||
|
||||
float *llmodel_embed(
|
||||
@@ -298,3 +300,21 @@ const char *llmodel_model_gpu_device_name(llmodel_model model)
|
||||
const auto *wrapper = static_cast<LLModelWrapper *>(model);
|
||||
return wrapper->llModel->gpuDeviceName();
|
||||
}
|
||||
|
||||
int32_t llmodel_count_prompt_tokens(llmodel_model model, const char *prompt, const char **error)
|
||||
{
|
||||
auto *wrapper = static_cast<const LLModelWrapper *>(model);
|
||||
try {
|
||||
return wrapper->llModel->countPromptTokens(prompt);
|
||||
} catch (const std::exception& e) {
|
||||
llmodel_set_error(error, e.what());
|
||||
return -1;
|
||||
}
|
||||
}
|
||||
|
||||
void llmodel_model_foreach_special_token(llmodel_model model, llmodel_special_token_callback callback)
|
||||
{
|
||||
auto *wrapper = static_cast<const LLModelWrapper *>(model);
|
||||
for (auto &[name, token] : wrapper->llModel->specialTokens())
|
||||
callback(name.c_str(), token.c_str());
|
||||
}
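For example, a client can dump the BOS/EOS strings that specialTokens() exposes; the callback parameters are inferred from the call above:

#include <cstdio>

static void print_special(const char *name, const char *token)
{
    std::printf("%s = %s\n", name, token); // e.g. "bos_token" / "eos_token" and their text
}

void dump_special_tokens(llmodel_model model) // sketch only
{
    llmodel_model_foreach_special_token(model, print_special);
}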
|
gpt4all-backend/src/llmodel_shared.cpp (new file, 298 lines)
@@ -0,0 +1,298 @@
|
||||
#include "llmodel.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <cassert>
|
||||
#include <cstddef>
|
||||
#include <cstdint>
|
||||
#include <iostream>
|
||||
#include <iterator>
|
||||
#include <optional>
|
||||
#include <ranges>
|
||||
#include <stdexcept>
|
||||
#include <string>
|
||||
#include <string_view>
|
||||
#include <vector>
|
||||
|
||||
namespace ranges = std::ranges;
|
||||
namespace views = std::ranges::views;
|
||||
|
||||
void LLModel::prompt(
|
||||
std::string_view prompt,
|
||||
const PromptCallback &promptCallback,
|
||||
const ResponseCallback &responseCallback,
|
||||
const PromptContext &promptCtx
|
||||
) {
|
||||
if (!isModelLoaded())
|
||||
throw std::invalid_argument("Attempted to prompt an unloaded model.");
|
||||
if (!supportsCompletion())
|
||||
throw std::invalid_argument("Not a text completion model.");
|
||||
if (!promptCtx.n_batch)
|
||||
throw std::invalid_argument("Batch size cannot be zero.");
|
||||
if (!promptCtx.n_predict)
|
||||
return; // nothing requested
|
||||
|
||||
auto embd_inp = tokenize(prompt);
|
||||
if (embd_inp.empty())
|
||||
throw std::invalid_argument("Prompt tokenized to zero tokens.");
|
||||
|
||||
if (auto res = decodePrompt(promptCallback, promptCtx, std::move(embd_inp)))
|
||||
generateResponse(responseCallback, promptCtx, /*n_past*/ *res);
|
||||
}
|
||||
|
||||
int32_t LLModel::countPromptTokens(std::string_view prompt) const
|
||||
{
|
||||
if (!isModelLoaded())
|
||||
throw std::invalid_argument("Attempted to tokenize with an unloaded model.");
|
||||
return int32_t(tokenize(prompt).size());
|
||||
}
|
||||
|
||||
auto LLModel::decodePrompt(
|
||||
const PromptCallback &promptCallback,
|
||||
const PromptContext &promptCtx,
|
||||
std::vector<Token> embd_inp
|
||||
) -> std::optional<int32_t>
|
||||
{
|
||||
assert(!embd_inp.empty());
|
||||
|
||||
int32_t nCtx = contextLength();
|
||||
int32_t n_batch = std::min(promptCtx.n_batch, LLMODEL_MAX_PROMPT_BATCH);
|
||||
|
||||
// Find the greatest n_past where the beginning of embd_inp matches the end of the token cache, starting at the
|
||||
// requested n_past.
|
||||
// This is used to skip unnecessary work when the prompt shares a common prefix with the previous result.
|
||||
int32_t nPast = computeModelInputPosition(embd_inp);
|
||||
|
||||
// always decode up to a full batch before generating, even if cached
|
||||
nPast -= std::min(n_batch, nPast);
|
||||
|
||||
// TODO(jared): generalize this to find the smallest new_embd_inp.size() - nPast given the cache
|
||||
if (!nPast && int32_t(embd_inp.size()) > nCtx) {
|
||||
// no cache hit -> shift the input before even processing
|
||||
|
||||
int32_t nKeep = shouldAddBOS();
|
||||
auto newLength = int32_t(nCtx * (1.f - promptCtx.contextErase));
|
||||
int32_t nDiscard = int32_t(embd_inp.size()) - std::max(1, std::min(nCtx, newLength));
|
||||
|
||||
// execute the callback even for skipped tokens. this misrepresents the position of BOS but we don't care
|
||||
auto discardedTokens = embd_inp | views::drop(nKeep) | views::take(nDiscard);
|
||||
if (!promptCallback(discardedTokens, true))
|
||||
return std::nullopt;
|
||||
|
||||
// erase nDiscard tokens
|
||||
embd_inp.erase(discardedTokens.begin(), discardedTokens.end());
|
||||
assert(int32_t(embd_inp.size()) <= nCtx);
|
||||
|
||||
// check the cache again, just in case
|
||||
nPast = computeModelInputPosition(embd_inp);
|
||||
nPast -= std::min(n_batch, nPast);
|
||||
}
|
||||
|
||||
setModelInputPosition(nPast);
|
||||
|
||||
// execute the callback even for skipped tokens
|
||||
if (!promptCallback(embd_inp | views::take(nPast), true))
|
||||
return std::nullopt;
|
||||
|
||||
// process the prompt in batches
|
||||
for (int32_t i = nPast; i < embd_inp.size();) {
|
||||
auto batch_end = std::min(i + n_batch, int32_t(embd_inp.size()));
|
||||
std::span batch(embd_inp.begin() + i, embd_inp.begin() + batch_end);
|
||||
|
||||
// Check if the context has run out...
|
||||
if (nPast + int32_t(batch.size()) > nCtx) {
|
||||
shiftContext(promptCtx, &nPast);
|
||||
assert(nPast + int32_t(batch.size()) <= nCtx);
|
||||
}
|
||||
|
||||
// FIXME(Adam): We should find a way to bubble these strings to the UI level to allow for translation
|
||||
if (!evalTokens(nPast, batch))
|
||||
throw std::runtime_error("An internal error was encountered during prompt processing.");
|
||||
|
||||
for (auto &tok : batch) {
|
||||
appendInputToken(tok);
|
||||
nPast++;
|
||||
if (!promptCallback({ &tok, 1 }, false))
|
||||
return std::nullopt;
|
||||
}
|
||||
i = batch_end;
|
||||
}
|
||||
|
||||
return nPast;
|
||||
}
|
||||
|
||||
/*
|
||||
* If string s overlaps with the string key such that some prefix of the key is at the end
|
||||
* of the string, return the position in s where the first match starts. Otherwise, return
|
||||
* std::string::npos. Examples:
|
||||
* s = "bfo", key = "foo" -> 1
|
||||
* s = "fooa", key = "foo" -> npos
|
||||
*/
|
||||
static std::string::size_type stringsOverlap(const std::string &s, const std::string &key)
|
||||
{
|
||||
if (s.empty() || key.empty())
|
||||
throw std::invalid_argument("arguments to stringsOverlap must not be empty");
|
||||
|
||||
for (int start = std::max(0, int(s.size()) - int(key.size())); start < s.size(); start++) {
|
||||
if (s.compare(start, s.size(), key, 0, s.size() - start) == 0)
|
||||
return start;
|
||||
}
|
||||
return std::string::npos;
|
||||
}
|
||||
|
||||
void LLModel::generateResponse(
|
||||
const ResponseCallback &responseCallback,
|
||||
const PromptContext &promptCtx,
|
||||
int32_t nPast
|
||||
) {
|
||||
static const char *stopSequences[] {
|
||||
"### System", "### Instruction", "### Human", "### User", "### Response", "### Assistant", "### Context",
|
||||
"<|im_start|>", "<|im_end|>", "<|endoftext|>",
|
||||
};
|
||||
|
||||
initSampler(promptCtx);
|
||||
|
||||
std::string cachedResponse;
|
||||
std::vector<Token> cachedTokens;
|
||||
int n_predicted = 0;
|
||||
|
||||
// Predict next tokens
|
||||
for (bool stop = false; !stop;) {
|
||||
// Sample next token
|
||||
std::optional<Token> new_tok = sampleToken();
|
||||
std::string new_piece = tokenToString(new_tok.value());
|
||||
cachedTokens.push_back(new_tok.value());
|
||||
cachedResponse += new_piece;
|
||||
|
||||
auto accept = [this, &promptCtx, &new_tok, &nPast] {
|
||||
// Shift context if out of space
|
||||
if (nPast >= contextLength()) {
|
||||
shiftContext(promptCtx, &nPast);
|
||||
assert(nPast < contextLength());
|
||||
}
|
||||
|
||||
// Accept the token
|
||||
Token tok = std::exchange(new_tok, std::nullopt).value();
|
||||
if (!evalTokens(nPast, { &tok, 1 }))
|
||||
throw std::runtime_error("An internal error was encountered during response generation.");
|
||||
|
||||
appendInputToken(tok);
|
||||
nPast++;
|
||||
};
|
||||
|
||||
// Check for EOS
|
||||
auto lengthLimit = std::string::npos;
|
||||
for (const auto token : endTokens()) {
|
||||
if (new_tok == token) {
|
||||
stop = true;
|
||||
lengthLimit = cachedResponse.size() - new_piece.size();
|
||||
}
|
||||
}
|
||||
|
||||
if (lengthLimit != std::string::npos) {
|
||||
// EOS matched
|
||||
} else if (!isSpecialToken(new_tok.value())) {
|
||||
// Check if the response contains a stop sequence
|
||||
for (const auto &p : stopSequences) {
|
||||
auto match = cachedResponse.find(p);
|
||||
if (match != std::string::npos) stop = true;
|
||||
lengthLimit = std::min(lengthLimit, match);
|
||||
if (match == 0) break;
|
||||
}
|
||||
|
||||
// Check if the response matches the start of a stop sequence
|
||||
if (lengthLimit == std::string::npos) {
|
||||
for (const auto &p : stopSequences) {
|
||||
auto match = stringsOverlap(cachedResponse, p);
|
||||
lengthLimit = std::min(lengthLimit, match);
|
||||
if (match == 0) break;
|
||||
}
|
||||
}
|
||||
} else if (ranges::find(stopSequences, new_piece) < std::end(stopSequences)) {
|
||||
// Special tokens must exactly match a stop sequence
|
||||
stop = true;
|
||||
lengthLimit = cachedResponse.size() - new_piece.size();
|
||||
}
|
||||
|
||||
// Empty the cache, up to the length limit
|
||||
std::string::size_type responseLength = 0;
|
||||
while (!cachedTokens.empty()) {
|
||||
Token tok = cachedTokens.front();
|
||||
std::string piece = tokenToString(tok);
|
||||
|
||||
// Stop if the piece (or part of it) does not fit within the length limit
|
||||
if (responseLength + (stop ? 1 : piece.size()) > lengthLimit)
|
||||
break;
|
||||
|
||||
// Remove token from cache
|
||||
assert(cachedResponse.starts_with(piece));
|
||||
cachedTokens.erase(cachedTokens.begin(), cachedTokens.begin() + 1);
|
||||
cachedResponse.erase(cachedResponse.begin(), cachedResponse.begin() + piece.size());
|
||||
|
||||
// Accept the token, if needed (not cached)
|
||||
if (cachedTokens.empty() && new_tok)
|
||||
accept();
|
||||
|
||||
// Send the token
|
||||
if (!responseCallback(tok, piece) || ++n_predicted >= promptCtx.n_predict) {
|
||||
stop = true;
|
||||
break;
|
||||
}
|
||||
|
||||
// FIXME(jared): we could avoid printing partial stop sequences if we didn't have to
|
||||
// output token IDs and could cache a partial token for the next prompt call
|
||||
responseLength += piece.size();
|
||||
}
|
||||
assert(cachedTokens.empty() == cachedResponse.empty());
|
||||
|
||||
// Accept the token, if needed (in cache)
|
||||
if (new_tok) {
|
||||
assert(!cachedTokens.empty() && cachedTokens.back() == new_tok);
|
||||
if (stop) {
|
||||
cachedTokens.pop_back();
|
||||
} else {
|
||||
accept();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (inputLength() < cachedTokens.size()) {
|
||||
/* This is theoretically possible if the longest stop sequence is greater than
|
||||
* n_ctx * contextErase tokens. */
|
||||
throw std::runtime_error("shifted too much context, can't go back");
|
||||
}
|
||||
|
||||
#ifndef NDEBUG
|
||||
auto inp = inputTokens();
|
||||
auto discard_start = inp.end() - cachedTokens.size();
|
||||
assert(std::equal(discard_start, inp.end(), cachedTokens.begin()));
|
||||
#endif
|
||||
}
|
||||
|
||||
void LLModel::embed(
|
||||
const std::vector<std::string> &texts, float *embeddings, std::optional<std::string> prefix, int dimensionality,
|
||||
size_t *tokenCount, bool doMean, bool atlas, EmbedCancelCallback *cancelCb
|
||||
) {
|
||||
(void)texts;
|
||||
(void)embeddings;
|
||||
(void)prefix;
|
||||
(void)dimensionality;
|
||||
(void)tokenCount;
|
||||
(void)doMean;
|
||||
(void)atlas;
|
||||
(void)cancelCb;
|
||||
throw std::logic_error(std::string(implementation().modelType()) + " does not support embeddings");
|
||||
}
|
||||
|
||||
void LLModel::embed(
|
||||
const std::vector<std::string> &texts, float *embeddings, bool isRetrieval, int dimensionality, size_t *tokenCount,
|
||||
bool doMean, bool atlas
|
||||
) {
|
||||
(void)texts;
|
||||
(void)embeddings;
|
||||
(void)isRetrieval;
|
||||
(void)dimensionality;
|
||||
(void)tokenCount;
|
||||
(void)doMean;
|
||||
(void)atlas;
|
||||
throw std::logic_error(std::string(implementation().modelType()) + " does not support embeddings");
|
||||
}
|
17
gpt4all-backend/src/utils.h
Normal file
@ -0,0 +1,17 @@
|
||||
#pragma once
|
||||
|
||||
#include <cassert>
|
||||
|
||||
#ifdef NDEBUG
|
||||
# ifdef __has_builtin
|
||||
# if __has_builtin(__builtin_unreachable)
|
||||
# define UNREACHABLE() __builtin_unreachable()
|
||||
# else
|
||||
# define UNREACHABLE() do {} while (0)
|
||||
# endif
|
||||
# else
|
||||
# define UNREACHABLE() do {} while (0)
|
||||
# endif
|
||||
#else
|
||||
# define UNREACHABLE() assert(!"Unreachable statement was reached")
|
||||
#endif
|
@ -1,339 +0,0 @@
|
||||
#include "utils.h"
|
||||
|
||||
#include <cmath>
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <fstream>
|
||||
#include <iterator>
|
||||
#include <regex>
|
||||
#include <utility>
|
||||
|
||||
void replace(std::string & str, const std::string & needle, const std::string & replacement)
|
||||
{
|
||||
size_t pos = 0;
|
||||
while ((pos = str.find(needle, pos)) != std::string::npos) {
|
||||
str.replace(pos, needle.length(), replacement);
|
||||
pos += replacement.length();
|
||||
}
|
||||
}
|
||||
|
||||
std::map<std::string, int32_t> json_parse(const std::string & fname)
|
||||
{
|
||||
std::map<std::string, int32_t> result;
|
||||
|
||||
// read file into string
|
||||
std::string json;
|
||||
{
|
||||
std::ifstream ifs(fname);
|
||||
if (!ifs) {
|
||||
fprintf(stderr, "Failed to open %s\n", fname.c_str());
|
||||
exit(1);
|
||||
}
|
||||
|
||||
json = std::string((std::istreambuf_iterator<char>(ifs)),
|
||||
(std::istreambuf_iterator<char>()));
|
||||
}
|
||||
|
||||
if (json[0] != '{') {
|
||||
return result;
|
||||
}
|
||||
|
||||
// parse json
|
||||
{
|
||||
bool has_key = false;
|
||||
bool in_token = false;
|
||||
|
||||
std::string str_key = "";
|
||||
std::string str_val = "";
|
||||
|
||||
int n = json.size();
|
||||
for (int i = 1; i < n; ++i) {
|
||||
if (!in_token) {
|
||||
if (json[i] == ' ') continue;
|
||||
if (json[i] == '"') {
|
||||
in_token = true;
|
||||
continue;
|
||||
}
|
||||
} else {
|
||||
if (json[i] == '\\' && i+1 < n) {
|
||||
if (has_key == false) {
|
||||
str_key += json[i];
|
||||
} else {
|
||||
str_val += json[i];
|
||||
}
|
||||
++i;
|
||||
} else if (json[i] == '"') {
|
||||
if (has_key == false) {
|
||||
has_key = true;
|
||||
++i;
|
||||
while (json[i] == ' ') ++i;
|
||||
++i; // :
|
||||
while (json[i] == ' ') ++i;
|
||||
if (json[i] != '\"') {
|
||||
while (json[i] != ',' && json[i] != '}') {
|
||||
str_val += json[i++];
|
||||
}
|
||||
has_key = false;
|
||||
} else {
|
||||
in_token = true;
|
||||
continue;
|
||||
}
|
||||
} else {
|
||||
has_key = false;
|
||||
}
|
||||
|
||||
::replace(str_key, "\\u0120", " " ); // \u0120 -> space
|
||||
::replace(str_key, "\\u010a", "\n"); // \u010a -> new line
|
||||
::replace(str_key, "\\\"", "\""); // \\\" -> "
|
||||
|
||||
try {
|
||||
result[str_key] = std::stoi(str_val);
|
||||
} catch (...) {
|
||||
//fprintf(stderr, "%s: ignoring key '%s' with value '%s'\n", fname.c_str(), str_key.c_str(), str_val.c_str());
|
||||
|
||||
}
|
||||
str_key = "";
|
||||
str_val = "";
|
||||
in_token = false;
|
||||
continue;
|
||||
}
|
||||
if (has_key == false) {
|
||||
str_key += json[i];
|
||||
} else {
|
||||
str_val += json[i];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
std::vector<gpt_vocab::id> gpt_tokenize_inner(const gpt_vocab & vocab, const std::string & text)
|
||||
{
|
||||
std::vector<std::string> words;
|
||||
|
||||
// first split the text into words
|
||||
{
|
||||
std::string str = text;
|
||||
std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
|
||||
|
||||
std::regex re(pat);
|
||||
std::smatch m;
|
||||
|
||||
while (std::regex_search(str, m, re)) {
|
||||
for (auto x : m) {
|
||||
words.push_back(x);
|
||||
}
|
||||
str = m.suffix();
|
||||
}
|
||||
}
|
||||
|
||||
// find the longest tokens that form the words:
|
||||
std::vector<gpt_vocab::id> tokens;
|
||||
for (const auto & word : words) {
|
||||
if (word.size() == 0) continue;
|
||||
|
||||
int i = 0;
|
||||
int n = word.size();
|
||||
while (i < n) {
|
||||
int j = n;
|
||||
while (j > i) {
|
||||
auto it = vocab.token_to_id.find(word.substr(i, j-i));
|
||||
if (it != vocab.token_to_id.end()) {
|
||||
tokens.push_back(it->second);
|
||||
i = j;
|
||||
break;
|
||||
}
|
||||
--j;
|
||||
}
|
||||
if (i == n) {
|
||||
break;
|
||||
}
|
||||
if (j == i) {
|
||||
auto sub = word.substr(i, 1);
|
||||
if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
|
||||
tokens.push_back(vocab.token_to_id.at(sub));
|
||||
} else {
|
||||
fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
|
||||
}
|
||||
++i;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return tokens;
|
||||
}
|
||||
|
||||
std::string regex_escape(const std::string &s)
|
||||
{
|
||||
static const std::regex metacharacters(R"([\.\^\$\-\+\(\)\[\]\{\}\|\?\*])");
|
||||
return std::regex_replace(s, metacharacters, "\\$&");
|
||||
}
|
||||
|
||||
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text)
|
||||
{
|
||||
// Generate the subpattern from the special_tokens vector if it's not empty
|
||||
if (!vocab.special_tokens.empty()) {
|
||||
std::vector<gpt_vocab::id> out;
|
||||
std::vector<std::string> chunks;
|
||||
std::string str = text;
|
||||
std::string special_tokens_subpattern;
|
||||
for (const auto &token : vocab.special_tokens) {
|
||||
if (!special_tokens_subpattern.empty()) {
|
||||
special_tokens_subpattern += "|";
|
||||
}
|
||||
special_tokens_subpattern += regex_escape(token);
|
||||
}
|
||||
std::regex re(special_tokens_subpattern);
|
||||
std::smatch m;
|
||||
while (std::regex_search(str, m, re)) {
|
||||
auto tok = vocab.token_to_id.find(m.str());
|
||||
if (tok != vocab.token_to_id.end()) {
|
||||
auto tokid = tok->second;
|
||||
auto pfxtoks = gpt_tokenize_inner(vocab, m.prefix());
|
||||
out.insert(out.end(), pfxtoks.begin(), pfxtoks.end());
|
||||
out.push_back(tokid);
|
||||
str = m.suffix();
|
||||
}
|
||||
}
|
||||
if (!str.empty()) {
|
||||
auto tokrest = gpt_tokenize_inner(vocab, str);
|
||||
out.insert(out.end(), tokrest.begin(), tokrest.end());
|
||||
}
|
||||
return out;
|
||||
} else {
|
||||
return gpt_tokenize_inner(vocab, text);
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab)
|
||||
{
|
||||
printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
|
||||
|
||||
vocab.token_to_id = ::json_parse(fname);
|
||||
|
||||
for (const auto & kv : vocab.token_to_id) {
|
||||
vocab.id_to_token[kv.second] = kv.first;
|
||||
}
|
||||
|
||||
printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
|
||||
|
||||
// print the vocabulary
|
||||
//for (auto kv : vocab.token_to_id) {
|
||||
// printf("'%s' -> %d\n", kv.first.data(), kv.second);
|
||||
//}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
gpt_vocab::id gpt_sample_top_k_top_p(
|
||||
const size_t actualVocabSize,
|
||||
const int32_t * last_n_tokens_data,
|
||||
int last_n_tokens_size,
|
||||
const std::vector<float> logits,
|
||||
int top_k,
|
||||
double top_p,
|
||||
double temp,
|
||||
float repeat_penalty,
|
||||
std::mt19937 & rng) {
|
||||
int n_logits = actualVocabSize;
|
||||
|
||||
const auto last_n_tokens = std::vector<int32_t>(last_n_tokens_data, last_n_tokens_data + last_n_tokens_size);
|
||||
const auto * plogits = logits.data();
|
||||
|
||||
if (temp <= 0) {
|
||||
// select the token with the highest logit directly
|
||||
float max_logit = plogits[0];
|
||||
gpt_vocab::id max_id = 0;
|
||||
|
||||
for (int i = 1; i < n_logits; ++i) {
|
||||
if (plogits[i] > max_logit) {
|
||||
max_logit = plogits[i];
|
||||
max_id = i;
|
||||
}
|
||||
}
|
||||
return max_id;
|
||||
}
|
||||
std::vector<std::pair<double, gpt_vocab::id>> logits_id;
|
||||
logits_id.reserve(n_logits);
|
||||
|
||||
{
|
||||
const float scale = 1.0f/temp;
|
||||
for (int i = 0; i < n_logits; ++i) {
|
||||
// repetition penalty from ctrl paper (https://arxiv.org/abs/1909.05858)
|
||||
// credit https://github.com/facebookresearch/llama/compare/main...shawwn:llama:main
|
||||
if (std::find(last_n_tokens.begin(), last_n_tokens.end(), i) != last_n_tokens.end()) {
|
||||
// if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
|
||||
if (plogits[i] < 0.0f) {
|
||||
logits_id.push_back(std::make_pair(plogits[i]*scale*repeat_penalty, i));
|
||||
} else {
|
||||
logits_id.push_back(std::make_pair(plogits[i]*scale/repeat_penalty, i));
|
||||
}
|
||||
} else {
|
||||
logits_id.push_back(std::make_pair(plogits[i]*scale, i));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// find the top K tokens
|
||||
std::partial_sort(
|
||||
logits_id.begin(),
|
||||
logits_id.begin() + top_k, logits_id.end(),
|
||||
[](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
|
||||
return a.first > b.first;
|
||||
});
|
||||
|
||||
logits_id.resize(top_k);
|
||||
|
||||
double maxl = -INFINITY;
|
||||
for (const auto & kv : logits_id) {
|
||||
maxl = std::max(maxl, kv.first);
|
||||
}
|
||||
|
||||
// compute probs for the top K tokens
|
||||
std::vector<double> probs;
|
||||
probs.reserve(logits_id.size());
|
||||
|
||||
double sum = 0.0;
|
||||
for (const auto & kv : logits_id) {
|
||||
double p = exp(kv.first - maxl);
|
||||
probs.push_back(p);
|
||||
sum += p;
|
||||
}
|
||||
|
||||
// normalize the probs
|
||||
for (auto & p : probs) {
|
||||
p /= sum;
|
||||
}
|
||||
|
||||
if (top_p < 1.0f) {
|
||||
double cumsum = 0.0f;
|
||||
for (int i = 0; i < top_k; i++) {
|
||||
cumsum += probs[i];
|
||||
if (cumsum >= top_p) {
|
||||
top_k = i + 1;
|
||||
probs.resize(top_k);
|
||||
logits_id.resize(top_k);
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
cumsum = 1.0/cumsum;
|
||||
for (int i = 0; i < (int) probs.size(); i++) {
|
||||
probs[i] *= cumsum;
|
||||
}
|
||||
}
|
||||
|
||||
//printf("\n");
|
||||
//for (int i = 0; i < (int) probs.size(); i++) {
|
||||
// printf("%d: '%s' %f\n", i, vocab.id_to_token.at(logits_id[i].second).c_str(), probs[i]);
|
||||
//}
|
||||
//exit(0);
|
||||
|
||||
std::discrete_distribution<> dist(probs.begin(), probs.end());
|
||||
int idx = dist(rng);
|
||||
|
||||
return logits_id[idx].second;
|
||||
}
|
@ -1,101 +0,0 @@
|
||||
// Various helper functions and utilities
|
||||
|
||||
#pragma once
|
||||
|
||||
#include <algorithm>
|
||||
#include <cstddef>
|
||||
#include <cstdint>
|
||||
#include <map>
|
||||
#include <random>
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
|
||||
//
|
||||
// General purpose inline functions
|
||||
//
|
||||
constexpr inline unsigned long long operator ""_MiB(unsigned long long bytes)
|
||||
{
|
||||
return bytes*1024*1024;
|
||||
}
|
||||
|
||||
//
|
||||
// CLI argument parsing
|
||||
//
|
||||
|
||||
struct gpt_params {
|
||||
int32_t seed = -1; // RNG seed
|
||||
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
|
||||
int32_t n_predict = 200; // new tokens to predict
|
||||
|
||||
// sampling parameters
|
||||
int32_t top_k = 40;
|
||||
float top_p = 0.9f;
|
||||
float temp = 0.9f;
|
||||
|
||||
int32_t n_batch = 8; // batch size for prompt processing
|
||||
|
||||
std::string model = "models/gpt-2-117M/ggml-model.bin"; // model path
|
||||
std::string prompt;
|
||||
};
|
||||
|
||||
bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
|
||||
|
||||
void gpt_print_usage(int argc, char ** argv, const gpt_params & params);
|
||||
|
||||
std::string gpt_random_prompt(std::mt19937 & rng);
|
||||
|
||||
//
|
||||
// Vocab utils
|
||||
//
|
||||
|
||||
struct gpt_vocab {
|
||||
using id = int32_t;
|
||||
using token = std::string;
|
||||
|
||||
std::map<token, id> token_to_id;
|
||||
std::map<id, token> id_to_token;
|
||||
std::vector<std::string> special_tokens;
|
||||
|
||||
void add_special_token(const std::string &token) {
|
||||
special_tokens.push_back(token);
|
||||
}
|
||||
};
|
||||
|
||||
void replace(std::string & str, const std::string & needle, const std::string & replacement);
|
||||
|
||||
// poor-man's JSON parsing
|
||||
std::map<std::string, int32_t> json_parse(const std::string & fname);
|
||||
|
||||
// split text into tokens
|
||||
//
|
||||
// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
|
||||
//
|
||||
// Regex (Python):
|
||||
// r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
|
||||
//
|
||||
// Regex (C++):
|
||||
// R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
|
||||
//
|
||||
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
|
||||
|
||||
// load the tokens from encoder.json
|
||||
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
|
||||
|
||||
// sample next token given probabilities for each embedding
|
||||
//
|
||||
// - consider only the top K tokens
|
||||
// - from them, consider only the top tokens with cumulative probability > P
|
||||
//
|
||||
// TODO: not sure if this implementation is correct
|
||||
//
|
||||
gpt_vocab::id gpt_sample_top_k_top_p(
|
||||
const size_t actualVocabSize,
|
||||
const int32_t * last_n_tokens_data,
|
||||
int last_n_tokens_size,
|
||||
const std::vector<float> logits,
|
||||
int top_k,
|
||||
double top_p,
|
||||
double temp,
|
||||
float repeat_penalty,
|
||||
std::mt19937 & rng);
|
@ -2,8 +2,7 @@
|
||||
|
||||
GPT4All on the command-line.
|
||||
|
||||
## Documentation
|
||||
<https://docs.gpt4all.io/gpt4all_cli.html>
|
||||
More details on the [wiki](https://github.com/nomic-ai/gpt4all/wiki/Python-CLI).
|
||||
|
||||
## Quickstart
|
||||
|
||||
@ -34,11 +33,11 @@ python -m pip install --user --upgrade gpt4all typer
|
||||
# run the CLI
|
||||
python app.py repl
|
||||
```
|
||||
By default, it will automatically download the `groovy` model to `.cache/gpt4all/` in your user
|
||||
directory, if necessary.
|
||||
By default, it will automatically download the `Mistral Instruct` model to `.cache/gpt4all/` in your
|
||||
user directory, if necessary.
|
||||
|
||||
If you have already saved a model beforehand, specify its path with the `-m`/`--model` argument,
|
||||
for example:
|
||||
```shell
|
||||
python app.py repl --model /home/user/my-gpt4all-models/gpt4all-13b-snoozy-q4_0.gguf
|
||||
python app.py repl --model /home/user/my-gpt4all-models/mistral-7b-instruct-v0.1.Q4_0.gguf
|
||||
```
|
||||
|
@ -113,10 +113,7 @@ def _old_loop(gpt4all_instance):
|
||||
full_response = gpt4all_instance.chat_completion(
|
||||
MESSAGES,
|
||||
# preferential kwargs for chat ux
|
||||
logits_size=0,
|
||||
tokens_size=0,
|
||||
n_past=0,
|
||||
n_ctx=0,
|
||||
n_predict=200,
|
||||
top_k=40,
|
||||
top_p=0.9,
|
||||
|
75
gpt4all-bindings/python/CHANGELOG.md
Normal file
@ -0,0 +1,75 @@
|
||||
# Changelog
|
||||
|
||||
All notable changes to this project will be documented in this file.
|
||||
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
### Added
|
||||
- Warn on Windows if the Microsoft Visual C++ runtime libraries are not found ([#2920](https://github.com/nomic-ai/gpt4all/pull/2920))
|
||||
- Basic cache for faster prefill when the input shares a prefix with previous context ([#3073](https://github.com/nomic-ai/gpt4all/pull/3073))
|
||||
- Add ability to modify or replace the history of an active chat session ([#3147](https://github.com/nomic-ai/gpt4all/pull/3147))
|
||||
|
||||
### Changed
|
||||
- Rebase llama.cpp on latest upstream as of September 26th ([#2998](https://github.com/nomic-ai/gpt4all/pull/2998))
|
||||
- Change the error message when a message is too long ([#3004](https://github.com/nomic-ai/gpt4all/pull/3004))
|
||||
- Fix CalledProcessError on Intel Macs since v2.8.0 ([#3045](https://github.com/nomic-ai/gpt4all/pull/3045))
|
||||
- Use Jinja for chat templates instead of per-message QString.arg-style templates ([#3147](https://github.com/nomic-ai/gpt4all/pull/3147))
|
||||
|
||||
## [2.8.2] - 2024-08-14
|
||||
|
||||
### Fixed
|
||||
- Fixed incompatibility with Python 3.8 since v2.7.0 and Python <=3.11 since v2.8.1 ([#2871](https://github.com/nomic-ai/gpt4all/pull/2871))
|
||||
|
||||
## [2.8.1] - 2024-08-13
|
||||
|
||||
### Added
|
||||
- Use greedy sampling when temperature is set to zero ([#2854](https://github.com/nomic-ai/gpt4all/pull/2854))
|
||||
|
||||
### Changed
|
||||
- Search for pip-installed CUDA 11 as well as CUDA 12 ([#2802](https://github.com/nomic-ai/gpt4all/pull/2802))
|
||||
- Stop shipping CUBINs to reduce wheel size ([#2802](https://github.com/nomic-ai/gpt4all/pull/2802))
|
||||
- Use llama\_kv\_cache ops to shift context faster ([#2781](https://github.com/nomic-ai/gpt4all/pull/2781))
|
||||
- Don't stop generating at end of context ([#2781](https://github.com/nomic-ai/gpt4all/pull/2781))
|
||||
|
||||
### Fixed
|
||||
- Make reverse prompt detection work more reliably and prevent it from breaking output ([#2781](https://github.com/nomic-ai/gpt4all/pull/2781))
|
||||
- Explicitly target macOS 12.6 in CI to fix Metal compatibility on older macOS ([#2849](https://github.com/nomic-ai/gpt4all/pull/2849))
|
||||
- Do not initialize Vulkan driver when only using CPU ([#2843](https://github.com/nomic-ai/gpt4all/pull/2843))
|
||||
- Fix a segfault on exit when using CPU mode on Linux with NVIDIA and EGL ([#2843](https://github.com/nomic-ai/gpt4all/pull/2843))
|
||||
|
||||
## [2.8.0] - 2024-08-05
|
||||
|
||||
### Added
|
||||
- Support GPT-NeoX, Gemma 2, OpenELM, ChatGLM, and Jais architectures (all with Vulkan support) ([#2694](https://github.com/nomic-ai/gpt4all/pull/2694))
|
||||
- Enable Vulkan support for StarCoder2, XVERSE, Command R, and OLMo ([#2694](https://github.com/nomic-ai/gpt4all/pull/2694))
|
||||
- Support DeepSeek-V2 architecture (no Vulkan support) ([#2702](https://github.com/nomic-ai/gpt4all/pull/2702))
|
||||
- Add Llama 3.1 8B Instruct to models3.json (by [@3Simplex](https://github.com/3Simplex) in [#2731](https://github.com/nomic-ai/gpt4all/pull/2731) and [#2732](https://github.com/nomic-ai/gpt4all/pull/2732))
|
||||
- Support Llama 3.1 RoPE scaling ([#2758](https://github.com/nomic-ai/gpt4all/pull/2758))
|
||||
- Add Qwen2-1.5B-Instruct to models3.json (by [@ThiloteE](https://github.com/ThiloteE) in [#2759](https://github.com/nomic-ai/gpt4all/pull/2759))
|
||||
- Detect use of a Python interpreter under Rosetta for a clearer error message ([#2793](https://github.com/nomic-ai/gpt4all/pull/2793))
|
||||
|
||||
### Changed
|
||||
- Build against CUDA 11.8 instead of CUDA 12 for better compatibility with older drivers ([#2639](https://github.com/nomic-ai/gpt4all/pull/2639))
|
||||
- Update llama.cpp to commit 87e397d00 from July 19th ([#2694](https://github.com/nomic-ai/gpt4all/pull/2694))
|
||||
|
||||
### Removed
|
||||
- Remove unused internal llmodel\_has\_gpu\_device ([#2409](https://github.com/nomic-ai/gpt4all/pull/2409))
|
||||
- Remove support for GPT-J models ([#2676](https://github.com/nomic-ai/gpt4all/pull/2676), [#2693](https://github.com/nomic-ai/gpt4all/pull/2693))
|
||||
|
||||
### Fixed
|
||||
- Fix debug mode crash on Windows and undefined behavior in LLamaModel::embedInternal ([#2467](https://github.com/nomic-ai/gpt4all/pull/2467))
|
||||
- Fix CUDA PTX errors with some GPT4All builds ([#2421](https://github.com/nomic-ai/gpt4all/pull/2421))
|
||||
- Fix mishandling of inputs greater than n\_ctx tokens after [#1970](https://github.com/nomic-ai/gpt4all/pull/1970) ([#2498](https://github.com/nomic-ai/gpt4all/pull/2498))
|
||||
- Fix crash when Kompute falls back to CPU ([#2640](https://github.com/nomic-ai/gpt4all/pull/2640))
|
||||
- Fix several Kompute resource management issues ([#2694](https://github.com/nomic-ai/gpt4all/pull/2694))
|
||||
- Fix crash/hang when some models stop generating, by showing special tokens ([#2701](https://github.com/nomic-ai/gpt4all/pull/2701))
|
||||
- Fix several backend issues ([#2778](https://github.com/nomic-ai/gpt4all/pull/2778))
|
||||
- Restore leading space removal logic that was incorrectly removed in [#2694](https://github.com/nomic-ai/gpt4all/pull/2694)
|
||||
- CUDA: Cherry-pick llama.cpp DMMV cols requirement fix that caused a crash with long conversations since [#2694](https://github.com/nomic-ai/gpt4all/pull/2694)
|
||||
|
||||
[Unreleased]: https://github.com/nomic-ai/gpt4all/compare/python-v2.8.2...HEAD
|
||||
[2.8.2]: https://github.com/nomic-ai/gpt4all/compare/python-v2.8.1...python-v2.8.2
|
||||
[2.8.1]: https://github.com/nomic-ai/gpt4all/compare/python-v2.8.0...python-v2.8.1
|
||||
[2.8.0]: https://github.com/nomic-ai/gpt4all/compare/python-v2.7.0...python-v2.8.0
|
BIN: 39 new binary assets added under gpt4all-bindings/python/docs/assets/ (screenshots, icons, a GIF, and an MP4), plus three modified images:
add.png, add_model_gpt4.png, attach_spreadsheet.png, baelor.png, before_first_chat.png, chat_window.png, closed_chat_panel.png, configure_doc_collection.png, disney_spreadsheet.png, download.png, download_llama.png, explore.png, explore_models.png, good_tyrion.png, got_docs_ready.png, got_done.png, gpt4all_home.png, gpt4all_xlsx_attachment.mp4, installed_models.png, linux.png, local_embed.gif, mac.png, models_page_icon.png, new_docs_annotated.png, new_first_chat.png, no_docs.png, no_models.png, no_models_tiny.png, obsidian_docs.png, obsidian_response.png, obsidian_sources.png, open_chat_panel.png, open_local_docs.png, open_sources.png, search_mistral.png, search_settings.png, spreadsheet_chat.png, syrio_snippets.png, three_model_options.png
5
gpt4all-bindings/python/docs/assets/ubuntu.svg
Normal file
@ -0,0 +1,5 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<svg xmlns="http://www.w3.org/2000/svg" width="285" height="285" viewBox="-142.5 -142.5 285 285" xmlns:xlink="http://www.w3.org/1999/xlink">
|
||||
<circle fill="#FFFFFF" r="141.732"/><g id="U" fill="#DD4814"><circle cx="-96.3772" r="18.9215"/>
|
||||
<path d="M-45.6059,68.395C-62.1655,57.3316-74.4844,40.4175-79.6011,20.6065-73.623,15.7354-69.8047,8.3164-69.8047,0-69.8047-8.3164-73.623-15.7354-79.6011-20.6065-74.4844-40.4175-62.1655-57.3316-45.6059-68.395L-31.7715-45.2212C-45.9824-35.2197-55.2754-18.7026-55.2754,0-55.2754,18.7026-45.9824,35.2197-31.7715,45.2212Z"/></g>
|
||||
<use xlink:href="#U" transform="rotate(120)"/><use xlink:href="#U" transform="rotate(240)"/></svg>
|
BIN
gpt4all-bindings/python/docs/assets/windows.png
Normal file
@ -1,5 +1,5 @@
|
||||
/* Remove the `In` and `Out` block in rendered Jupyter notebooks */
|
||||
.md-container .jp-Cell-outputWrapper .jp-OutputPrompt.jp-OutputArea-prompt,
|
||||
.md-container .jp-Cell-inputWrapper .jp-InputPrompt.jp-InputArea-prompt {
|
||||
display: none !important;
|
||||
}
|
||||
.md-content h1,
|
||||
.md-content h2 {
|
||||
margin-top: 0.5em;
|
||||
margin-bottom: 0.5em;
|
||||
}
|
||||
|
86
gpt4all-bindings/python/docs/gpt4all_api_server/home.md
Normal file
@ -0,0 +1,86 @@
|
||||
# GPT4All API Server
|
||||
|
||||
GPT4All provides a local API server that allows you to run LLMs over an HTTP API.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Local Execution**: Run models on your own hardware for privacy and offline use.
|
||||
- **LocalDocs Integration**: Run the API with relevant text snippets provided to your LLM from a [LocalDocs collection](../gpt4all_desktop/localdocs.md).
|
||||
- **OpenAI API Compatibility**: Use existing OpenAI-compatible clients and tools with your local models.
|
||||
|
||||
## Activating the API Server
|
||||
|
||||
1. Open the GPT4All Chat Desktop Application.
|
||||
2. Go to `Settings` > `Application` and scroll down to `Advanced`.
|
||||
3. Check the box for the `"Enable Local API Server"` setting.
|
||||
4. The server listens on port 4891 by default. You can choose another port number in the `"API Server Port"` setting.
|
||||
|
||||
## Connecting to the API Server
|
||||
|
||||
The base URL used for the API server is `http://localhost:4891/v1` (or `http://localhost:<PORT_NUM>/v1` if you are using a different port number).
|
||||
|
||||
The server accepts only HTTP connections (not HTTPS) and listens only on localhost (127.0.0.1); it does not listen on the IPv6 loopback address `::1`.
|
||||
|
||||
## Examples
|
||||
|
||||
!!! note "Example GPT4All API calls"
|
||||
|
||||
=== "cURL"
|
||||
|
||||
```bash
|
||||
curl -X POST http://localhost:4891/v1/chat/completions -d '{
|
||||
"model": "Phi-3 Mini Instruct",
|
||||
"messages": [{"role":"user","content":"Who is Lionel Messi?"}],
|
||||
"max_tokens": 50,
|
||||
"temperature": 0.28
|
||||
}'
|
||||
```
|
||||
|
||||
=== "PowerShell"
|
||||
|
||||
```powershell
|
||||
Invoke-WebRequest -URI http://localhost:4891/v1/chat/completions -Method POST -ContentType application/json -Body '{
|
||||
"model": "Phi-3 Mini Instruct",
|
||||
"messages": [{"role":"user","content":"Who is Lionel Messi?"}],
|
||||
"max_tokens": 50,
|
||||
"temperature": 0.28
|
||||
}'
|
||||
```
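
Because the server is OpenAI-compatible, you can also call it from an OpenAI-style client library. The following is a minimal sketch, assuming the `openai` Python package (v1 or newer) is installed, the server is on the default port, and a model named "Phi-3 Mini Instruct" has been downloaded in GPT4All; adjust the model name and port to match your setup.

```python
from openai import OpenAI

# Point the client at the local GPT4All server; the API key is unused but the client requires one.
client = OpenAI(base_url="http://localhost:4891/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Phi-3 Mini Instruct",  # assumption: a model you have downloaded in GPT4All
    messages=[{"role": "user", "content": "Who is Lionel Messi?"}],
    max_tokens=50,
    temperature=0.28,
)
print(response.choices[0].message.content)
```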
|
||||
|
||||
## API Endpoints
|
||||
|
||||
| Method | Path | Description |
|
||||
|--------|------|-------------|
|
||||
| GET | `/v1/models` | List available models |
|
||||
| GET | `/v1/models/<name>` | Get details of a specific model |
|
||||
| POST | `/v1/completions` | Generate text completions |
|
||||
| POST | `/v1/chat/completions` | Generate chat completions |
|
||||
|
||||
## LocalDocs Integration
|
||||
|
||||
You can use LocalDocs with the API server:
|
||||
|
||||
1. Open the Chats view in the GPT4All application.
|
||||
2. Scroll to the bottom of the chat history sidebar.
|
||||
3. Select the server chat (it has a different background color).
|
||||
4. Activate LocalDocs collections in the right sidebar.
|
||||
|
||||
(Note: LocalDocs can currently only be activated through the GPT4All UI, not via the API itself).
|
||||
|
||||
Now, your API calls to your local LLM will have relevant references from your LocalDocs collection retrieved and placed in the input message for the LLM to respond to.
|
||||
|
||||
The references retrieved for your API call can be accessed in the API response object at
|
||||
|
||||
`response["choices"][0]["references"]`
|
||||
|
||||
Each entry in `references` includes:
|
||||
|
||||
- `text`: the actual text content from the snippet that was extracted from the reference document
|
||||
|
||||
- `author`: the author of the reference document (if available)
|
||||
|
||||
- `date`: the date of creation of the reference document (if available)
|
||||
|
||||
- `page`: the page number the snippet is from (only available for PDF documents for now)
|
||||
|
||||
- `title`: the title of the reference document (if available)
|
206
gpt4all-bindings/python/docs/gpt4all_desktop/chat_templates.md
Normal file
@ -0,0 +1,206 @@
|
||||
## What are chat templates?
|
||||
Natively, large language models only know how to complete plain text and do not know the difference between their input and their output. In order to support a chat with a person, LLMs are designed to use a template to convert the conversation to plain text using a specific format.
|
||||
|
||||
For a given model, it is important to use an appropriate chat template, as each model is designed to work best with a specific format. The chat templates included with the built-in models should be sufficient for most purposes.
|
||||
|
||||
There are two reasons you would want to alter the chat template:
|
||||
|
||||
- You are sideloading a model and there is no chat template available,
|
||||
- You would like to have greater control over the input to the LLM than a system message provides.
|
||||
|
||||
|
||||
## What is a system message?
|
||||
A system message is a message that controls the responses from the LLM in a way that affects the entire conversation. System messages can be short, such as "Speak like a pirate.", or they can be long and contain a lot of context for the LLM to keep in mind.
|
||||
|
||||
Not all models are designed to use a system message, so system messages work better with some models than with others.
|
||||
|
||||
|
||||
## How do I customize the chat template or system message?
|
||||
To customize the chat template or system message, go to Settings > Model. Make sure to select the correct model at the top. If you clone a model, you can use a different chat template or system message from the base model, enabling you to use different settings for each conversation.
|
||||
|
||||
These settings take effect immediately. After changing them, you can click "Redo last response" in the chat view, and the response will take the new settings into account.
|
||||
|
||||
|
||||
## Do I need to write a chat template?
|
||||
You typically do not need to write your own chat template. The exception is models that are not in the official model list and do not come with a chat template built-in. These will show a "Clear" option above the chat template field in the Model Settings page instead of a "Reset" option. See the section on [finding] or [creating] a chat template.
|
||||
|
||||
[finding]: #how-do-i-find-a-chat-template
|
||||
[creating]: #advanced-how-do-chat-templates-work
|
||||
|
||||
|
||||
## What changed in GPT4All v3.5?
|
||||
GPT4All v3.5 overhauled the chat template system. There are three crucial differences:
|
||||
|
||||
- The chat template now formats an entire conversation instead of a single pair of messages,
|
||||
- The chat template now uses Jinja syntax instead of `%1` and `%2` placeholders,
|
||||
- And the system message should no longer contain control tokens or trailing whitespace.
|
||||
|
||||
Any chat templates or system messages that were added or altered from the default before upgrading to GPT4All v3.5 or newer will no longer work. See below for how to solve common errors you may see after upgrading.
|
||||
|
||||
|
||||
## Error/Warning: System message is not plain text.
|
||||
This is easy to fix. Go to the model's settings and look at the system prompt. There are three things to look for:
|
||||
|
||||
- Control tokens such as `<|im_start|>`, `<|start_header_id|>`, or `<|system|>`
|
||||
- A prefix such as `### System` or `SYSTEM:`
|
||||
- Trailing whitespace, such as a space character or blank line.
|
||||
|
||||
If you see any of these things, remove them. For example, this legacy system prompt:
|
||||
```
|
||||
<|start_header_id|>system<|end_header_id|>
|
||||
You are a helpful assistant.<|eot_id|>
|
||||
```
|
||||
|
||||
Should become this:
|
||||
```
|
||||
You are a helpful assistant.
|
||||
```
|
||||
|
||||
If you do not see anything that needs to be changed, you can dismiss the error by making a minor modification to the message and then changing it back.
|
||||
|
||||
If you see a warning, your system message does not appear to be plain text. If you believe this warning is incorrect, it can be safely ignored. If in doubt, ask on the [Discord].
|
||||
|
||||
[Discord]: https://discord.gg/mGZE39AS3e
|
||||
|
||||
|
||||
## Error: Legacy system prompt needs to be updated in Settings.
|
||||
This is the same as [above][above-1], but appears on the chat page.
|
||||
|
||||
[above-1]: #errorwarning-system-message-is-not-plain-text
|
||||
|
||||
|
||||
## Error/Warning: Chat template is not in Jinja format.
|
||||
This is the result of attempting to use an old-style template (possibly from a previous version) in GPT4All 3.5+.
|
||||
|
||||
Go to the Model Settings page and select the affected model. If you see a "Reset" button, and you have not intentionally modified the prompt template, you can click "Reset". Otherwise, this is what you can do:
|
||||
|
||||
1. Back up your chat template by copying it safely to a text file and saving it. In the next step, it will be removed from GPT4All.
|
||||
2. Click "Reset" or "Clear".
|
||||
3. If you clicked "Clear", the chat template is now gone. Follow the steps to [find][finding] or [create][creating] a basic chat template for your model.
|
||||
4. Customize the chat template to suit your needs. For help, read the section about [creating] a chat template.
|
||||
|
||||
|
||||
## Error: Legacy prompt template needs to be updated in Settings.
|
||||
This is the same as [above][above-2], but appears on the chat page.
|
||||
|
||||
[above-2]: #errorwarning-chat-template-is-not-in-jinja-format
|
||||
|
||||
|
||||
## The chat template has a syntax error.
|
||||
If there is a syntax error while editing the chat template, the details will be displayed in an error message above the input box. This could be because the chat template is not actually in Jinja format (see [above][above-2]).
|
||||
|
||||
Otherwise, you have either typed something incorrectly, or the model comes with a template that is incompatible with GPT4All. See [the below section][creating] on creating chat templates and make sure that everything is correct. When in doubt, ask on the [Discord].
|
||||
|
||||
|
||||
## Error: No chat template configured.
|
||||
This may appear for models that are not from the official model list and do not include a chat template. Older versions of GPT4All picked a poor default in this case. You will get much better results if you follow the steps to [find][finding] or [create][creating] a chat template for your model.
|
||||
|
||||
|
||||
## Error: The chat template cannot be blank.
|
||||
If the button above the chat template on the Model Settings page says "Clear", see [above][above-3]. If you see "Reset", click that button to restore a reasonable default. Also see the section on [syntax errors][chat-syntax-error].
|
||||
|
||||
[above-3]: #error-no-chat-template-configured
|
||||
[chat-syntax-error]: #the-chat-template-has-a-syntax-error
|
||||
|
||||
|
||||
## How do I find a chat template?
|
||||
When in doubt, you can always ask the [Discord] community for help. Below are the instructions to find one on your own.
|
||||
|
||||
The authoritative source for a model's chat template is the HuggingFace repo that the original (non-GGUF) model came from. First, you should find this page. If you just have a model file, you can try a Google search for the model's name. If you know the page you downloaded the GGUF model from, its README usually links to the original non-GGUF model.
|
||||
|
||||
Once you have located the original model, there are two methods you can use to extract its chat template. Pick whichever one you are most comfortable with.
|
||||
|
||||
### Using the CLI (all models)
|
||||
1. Install `jq` using your preferred package manager - e.g. Chocolatey (Windows), Homebrew (macOS), or apt (Ubuntu).
|
||||
2. Download `tokenizer_config.json` from the model's "Files and versions" tab.
|
||||
3. Open a command prompt in the directory where you downloaded `tokenizer_config.json`.
|
||||
4. Run `jq -r ".chat_template" tokenizer_config.json`. This shows the chat template in a human-readable form. You can copy this and paste it into the settings page.
|
||||
5. (Optional) You can save the output to a text file like this: `jq -r ".chat_template" tokenizer_config.json >chat_template.txt`
|
||||
|
||||
If the output is "null", the model does not provide a chat template. See the [below instructions][creating] on creating a chat template.
|
||||
|
||||
### Python (open models)
|
||||
1. Install `transformers` using your preferred Python package manager, e.g. `pip install transformers`. Make sure it is at least version 4.43.0.
|
||||
2. Copy the ID of the HuggingFace model, using the clipboard icon next to the name. For example, if the URL is `https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B`, the ID is `NousResearch/Hermes-2-Pro-Llama-3-8B`.
|
||||
3. Open a Python interpreter (`python`) and run the following commands. Change the model ID in the example to the one you copied.
|
||||
```
|
||||
>>> from transformers import AutoTokenizer
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained('NousResearch/Hermes-2-Pro-Llama-3-8B')
|
||||
>>> print(tokenizer.get_chat_template())
|
||||
```
|
||||
You can copy the output and paste it into the settings page.
|
||||
4. (Optional) You can save the output to a text file like this:
|
||||
```
|
||||
>>> open('chat_template.txt', 'w').write(tokenizer.get_chat_template())
|
||||
```
|
||||
|
||||
If you get a ValueError exception, this model does not provide a chat template. See the [below instructions][creating] on creating a chat template.
|
||||
|
||||
|
||||
### Python (gated models)
|
||||
Some models, such as Llama and Mistral, do not allow public access to their chat template. You must either use the CLI method above or follow these instructions to use Python:
|
||||
|
||||
1. For these steps, you must have git and git-lfs installed.
|
||||
2. You must have a HuggingFace account and be logged in.
|
||||
3. You must already have access to the gated model. Otherwise, request access.
|
||||
4. You must have an SSH key configured for git access to HuggingFace.
|
||||
5. `git clone` the model's HuggingFace repo using the SSH clone URL. There is no need to download the entire model, which is very large. A good way to do this on Linux is:
|
||||
```console
|
||||
$ GIT_LFS_SKIP_SMUDGE=1 git clone git@hf.co:meta-llama/Llama-3.1-8B-Instruct.git
|
||||
$ cd Llama-3.1-8B-Instruct
|
||||
$ git lfs pull -I "tokenizer.*"
|
||||
```
|
||||
6. Follow the above instructions for open models, but replace the model ID with the path to the directory containing `tokenizer_config.json`:
|
||||
```
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained('.')
|
||||
```
|
||||
|
||||
|
||||
## Advanced: How do chat templates work?
|
||||
The chat template is applied to the entire conversation you see in the chat window. The template loops over the list of messages, each containing `role` and `content` fields. `role` is either `user`, `assistant`, or `system`.
|
||||
|
||||
GPT4All also supports the special variables `bos_token`, `eos_token`, and `add_generation_prompt`. See the [HuggingFace docs] for what those do.
|
||||
|
||||
[HuggingFace docs]: https://huggingface.co/docs/transformers/v4.46.3/en/chat_templating#special-variables
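
To get a feel for what a chat template produces, you can render one outside of GPT4All with the `transformers` tokenizer described in the sections above. This is only an illustrative sketch; it assumes `transformers` is installed and reuses the Hermes-2-Pro model ID from the earlier example, and GPT4All's own Jinja renderer may differ in minor details.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-2-Pro-Llama-3-8B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]

# Render the conversation to the plain text the model completes,
# including the header that prompts the assistant's next turn.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```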
|
||||
|
||||
|
||||
## Advanced: How do I make a chat template?
|
||||
The best way to create a chat template is to start by using an existing one as a reference. Then, modify it to use the format documented for the given model. Its README page may explicitly give an example of its template, or it may mention the name of a well-known standard template such as ChatML, Alpaca, or Vicuna. GPT4All does not yet include presets for these templates, so they will have to be found in other models or taken from the community.
|
||||
|
||||
For more information, see the very helpful [HuggingFace guide]. Some of this is not applicable, such as the information about tool calling and RAG, since GPT4All implements those features differently.
|
||||
|
||||
Some models use a prompt template that does not intuitively map to a multi-turn chat, because it is more intended for single instructions. The [FastChat] implementation of these templates is a useful reference for the correct way to extend them to multiple messages.
|
||||
|
||||
[HuggingFace guide]: https://huggingface.co/docs/transformers/v4.46.3/en/chat_templating#advanced-template-writing-tips
|
||||
[FastChat]: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py
|
||||
|
||||
|
||||
# Advanced: What are GPT4All v1 templates?
|
||||
GPT4All supports its own template syntax, which is nonstandard but provides complete control over the way LocalDocs sources and file attachments are inserted into the conversation. These templates begin with `{# gpt4all v1 #}` and look similar to the example below.
|
||||
|
||||
For standard templates, GPT4All combines the user message, sources, and attachments into the `content` field. For GPT4All v1 templates, this is not done, so they must be used directly in the template for those features to work correctly.
|
||||
|
||||
```jinja
|
||||
{# gpt4all v1 #}
|
||||
{%- for message in messages %}
|
||||
{{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
|
||||
{%- if message['role'] == 'user' %}
|
||||
{%- for source in message['sources'] %}
|
||||
{%- if loop.first %}
|
||||
{{- '### Context:\n' }}
|
||||
{%- endif %}
|
||||
{{- 'Collection: ' + source['collection'] + '\n' +
|
||||
'Path: ' + source['path'] + '\n' +
|
||||
'Excerpt: ' + source['text'] + '\n\n' }}
|
||||
{%- endfor %}
|
||||
{%- endif %}
|
||||
{%- for attachment in message['prompt_attachments'] %}
|
||||
{{- attachment['processed_content'] + '\n\n' }}
|
||||
{%- endfor %}
|
||||
{{- message['content'] | trim }}
|
||||
{{- '<|eot_id|>' }}
|
||||
{%- endfor %}
|
||||
{%- if add_generation_prompt %}
|
||||
{{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
|
||||
{%- endif %}
|
||||
```
|
124
gpt4all-bindings/python/docs/gpt4all_desktop/chats.md
Normal file
@ -0,0 +1,124 @@
|
||||
# Chats
|
||||
|
||||
Chats are conversations with [language models](models.md) that run locally on your device.
|
||||
|
||||
!!! note "Example Chats"
|
||||
|
||||
=== "Llama 3"
|
||||
Download `Llama 3` and prompt:
|
||||
```
|
||||
explain why the sky is blue in a way that is correct and makes sense to a child
|
||||
```
|
||||
|
||||
With the default sampling settings, you should see text resembling the following:
|
||||
|
||||
```
|
||||
What a great question!
|
||||
So, you know how we can see different colors like red, yellow, green, and orange? Well, when sunlight enters Earth's atmosphere, it starts to interact with tiny particles called molecules of gases like nitrogen (N2) and oxygen (02). These molecules are really small, but they're all around us!
|
||||
Now, here's the cool part: these molecules scatter, or bounce off, shorter wavelengths of light more than longer wavelengths. Think of it like a game of pool where the cue ball hits other balls with different sizes and speeds.
|
||||
The shortest wavelength of visible light is violet (about 400 nanometers), followed by blue (around 450-495 nanometers). These shorter wavelengths are scattered in all directions, kind of like how a basketball player might bounce the ball around on the court. This scattering effect makes the sky appear blue to our eyes!
|
||||
Here's why it doesn't look as bright or intense as a blue crayon: there are many more molecules in the atmosphere that scatter longer wavelengths (like red and orange) than shorter ones, so those colors get spread out and become less noticeable.
|
||||
To sum it up:
|
||||
1. Sunlight enters Earth's atmosphere.
|
||||
2. Tiny gas molecule particles interact with light, scattering shorter wavelengths like blue and violet.
|
||||
3. This scattered blue light reaches our eyes from all directions, making the sky appear blue!
|
||||
Now, go outside and enjoy that beautiful blue sky!
|
||||
```
|
||||
|
||||
=== "Nous Hermes 2 Mistral DPO"
|
||||
Download `Nous Hermes 2 Mistral DPO` and prompt:
|
||||
```
|
||||
write me a react app i can run from the command line to play a quick game
|
||||
```
|
||||
|
||||
With the default sampling settings, you should see text and code blocks resembling the following:
|
||||
|
||||
```
|
||||
I'll provide you with a simple React app that allows you to play a guessing game. This app will generate a random number between 1 and 10, and ask you to guess it. You can run this app locally using Create React App. Follow the steps below:
|
||||
Open your terminal or command line interface.
|
||||
Navigate to the directory where you want to create the project (e.g., cd Documents/Projects).
|
||||
Run the following commands:
|
||||
```
|
||||
|
||||
```bash
|
||||
npx create-react-app guessing-game
|
||||
cd guessing-game
|
||||
```
|
||||
|
||||
```
|
||||
Open the project in your preferred code editor by running code . (for VS Code) or any other command to open the folder with your text editor.
|
||||
Replace the content of the src/App.js file with the following code:
|
||||
```
|
||||
|
||||
```javascript
|
||||
import React, { useState } from 'react';
|
||||
|
||||
function App() {
|
||||
const [guess, setGuess] = useState('');
|
||||
const [message, setMessage] = useState('');
|
||||
const [numberToGuess, setNumberToGuess] = useState(Math.floor(Math.random() * 10) + 1);
|
||||
|
||||
function handleChange(e) {
|
||||
setGuess(e.target.value);
|
||||
}
|
||||
|
||||
function handleSubmit(e) {
|
||||
e.preventDefault();
|
||||
if (parseInt(guess) === numberToGuess) {
|
||||
setMessage('You guessed it right!');
|
||||
} else {
|
||||
setMessage(`Wrong guess, the correct number was ${numberToGuess}. Try again next time.`);
|
||||
}
|
||||
}
|
||||
|
||||
return (
|
||||
<div>
|
||||
<h1>Guess a Number</h1>
|
||||
<p>{message}</p>
|
||||
<form onSubmit={handleSubmit}>
|
||||
<input type="number" value={guess} onChange={handleChange} />
|
||||
<button type="submit">Submit</button>
|
||||
</form>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
|
||||
export default App;
|
||||
```
|
||||
|
||||
```
|
||||
Save the file and go back to your terminal or command line interface. Run npm start. This will start a local development server, and you can access the app in your browser at http://localhost:3000.
|
||||
Play the guessing game by entering a number between 1 and 10 into the input field and clicking "Submit". The app will tell you if your guess is correct or not.
|
||||
Remember that this is just a simple example, and you can expand upon it to make the game more interesting with additional features like high scores, multiple difficulty levels, etc.
|
||||
```
|
||||
|
||||
## New Chat
|
||||
|
||||
Choose a model with the dropdown at the top of the Chats page
|
||||
|
||||
If you don't have any models, [download one](models.md#download-models). Once you have models, you can start chats by loading your default model, which you can configure in [settings](settings.md#application-settings)
|
||||
|
||||

|
||||
|
||||
## LocalDocs
|
||||
|
||||
Open the [LocalDocs](localdocs.md) panel with the button in the top-right corner to bring your files into the chat. With LocalDocs, your chats are enhanced with semantically related snippets from your files included in the model's context.
|
||||
|
||||

|
||||
|
||||
## Chat History
|
||||
|
||||
View your chat history with the button in the top-left corner of the Chats page.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img src="../assets/closed_chat_panel.png" alt="Close chats" style="width:100%">
|
||||
</td>
|
||||
<td>
|
||||
<img src="../assets/open_chat_panel.png" alt="Open chats" style="width:100%">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
You can change a chat name or delete it from your chat history at any time.
|
@ -0,0 +1,109 @@
|
||||
# Using GPT4All to Privately Chat with your Obsidian Vault
|
||||
|
||||
Obsidian for Desktop is powerful note-taking and knowledge-management software for creating and organizing Markdown notes. This tutorial shows you how to sync and access your Obsidian note files directly on your computer. By connecting them to LocalDocs, you can integrate these files into your LLM chats for private access and enhanced context.
|
||||
|
||||
## Download Obsidian for Desktop
|
||||
|
||||
!!! note "Download Obsidian for Desktop"
|
||||
|
||||
1. **Download Obsidian for Desktop**:
|
||||
- Visit the [Obsidian website](https://obsidian.md) and create an account.
|
||||
- Click the Download button in the center of the homepage
|
||||
- For more help with installing Obsidian see [Getting Started with Obsidian](https://help.obsidian.md/Getting+started/Download+and+install+Obsidian)
|
||||
|
||||
2. **Set Up Obsidian**:
|
||||
- Launch Obsidian from your Applications folder (macOS), Start menu (Windows), or equivalent location (Linux).
|
||||
- On the welcome screen, you can either create a new vault (a collection of notes) or open an existing one.
|
||||
- To create a new vault, click Create a new vault, name your vault, choose a location on your computer, and click Create.
|
||||
|
||||
|
||||
3. **Sign in and Sync**:
|
||||
- Once installed, you can start adding and organizing notes.
|
||||
- Choose the folders you want to sync to your computer.
|
||||
|
||||
|
||||
|
||||
## Connect Obsidian to LocalDocs
|
||||
|
||||
!!! note "Connect Obsidian to LocalDocs"
|
||||
|
||||
1. **Open LocalDocs**:
|
||||
- Navigate to the LocalDocs feature within GPT4All.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of LocalDocs interface -->
|
||||
<img width="1348" alt="LocalDocs interface" src="https://github.com/nomic-ai/gpt4all/assets/132290469/d8fb2d79-2063-45d4-bcce-7299fb75b144">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
2. **Add Collection**:
|
||||
- Click on **+ Add Collection** to begin linking your Obsidian Vault.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of adding collection in LocalDocs -->
|
||||
<img width="1348" alt="Screenshot of adding collection" src="https://raw.githubusercontent.com/nomic-ai/gpt4all/124ef867a9d9afd9e14d3858cd77bce858f79773/gpt4all-bindings/python/docs/assets/obsidian_adding_collection.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
- Name your collection
|
||||
|
||||
|
||||
3. **Create Collection**:
|
||||
- Click **Create Collection** to initiate the embedding process. Progress will be displayed within the LocalDocs interface.
|
||||
|
||||
4. **Access Files in Chats**:
|
||||
- Load a model to chat with your files (Llama 3 Instruct is the fastest)
|
||||
- In your chat, open 'LocalDocs' with the button in the top-right corner to provide context from your synced Obsidian notes.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of accessing LocalDocs in chats -->
|
||||
<img width="1447" alt="Accessing LocalDocs in chats" src="https://raw.githubusercontent.com/nomic-ai/gpt4all/124ef867a9d9afd9e14d3858cd77bce858f79773/gpt4all-bindings/python/docs/assets/obsidian_docs.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
5. **Interact With Your Notes:**
|
||||
- Use the model to interact with your files
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of interacting sources -->
|
||||
<img width="662" alt="osbsidian user interaction" src="https://raw.githubusercontent.com/nomic-ai/gpt4all/124ef867a9d9afd9e14d3858cd77bce858f79773/gpt4all-bindings/python/docs/assets/osbsidian_user_interaction.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of viewing sources -->
|
||||
<img width="662" alt="osbsidian GPT4ALL response" src="https://raw.githubusercontent.com/nomic-ai/gpt4all/124ef867a9d9afd9e14d3858cd77bce858f79773/gpt4all-bindings/python/docs/assets/obsidian_response.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
6. **View Referenced Files**:
|
||||
- Click on **Sources** below LLM responses to see which Obsidian Notes were referenced.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Referenced Files -->
|
||||
<img width="643" alt="Referenced Files" src="https://raw.githubusercontent.com/nomic-ai/gpt4all/124ef867a9d9afd9e14d3858cd77bce858f79773/gpt4all-bindings/python/docs/assets/obsidian_sources.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## How It Works
|
||||
|
||||
Obsidian for Desktop syncs your Obsidian notes to your computer, while LocalDocs integrates these files into your LLM chats using embedding models. These models find semantically similar snippets from your files to enhance the context of your interactions.
|
||||
|
||||
To learn more about embedding models and explore further, refer to the [Nomic Python SDK documentation](https://docs.nomic.ai/atlas/capabilities/embeddings).
|
||||
|
@ -0,0 +1,112 @@
|
||||
# Using GPT4All to Privately Chat with your OneDrive Data
|
||||
|
||||
Local and Private AI Chat with your OneDrive Data
|
||||
|
||||
OneDrive for Desktop allows you to sync and access your OneDrive files directly on your computer. By connecting your synced directory to LocalDocs, you can start using GPT4All to privately chat with data stored in your OneDrive.
|
||||
|
||||
## Download OneDrive for Desktop
|
||||
|
||||
!!! note "Download OneDrive for Desktop"
|
||||
|
||||
1. **Download OneDrive for Desktop**:
|
||||
- Visit [Microsoft OneDrive](https://www.microsoft.com/en-us/microsoft-365/onedrive/download).
|
||||
- Press 'download' for your respective device type.
|
||||
- Download the OneDrive for Desktop application.
|
||||
|
||||
2. **Install OneDrive for Desktop**
|
||||
- Run the installer file you downloaded.
|
||||
- Follow the prompts to complete the installation process.
|
||||
|
||||
3. **Sign in and Sync**
|
||||
- Once installed, sign in to OneDrive for Desktop with your Microsoft account credentials.
|
||||
- Choose the folders you want to sync to your computer.
|
||||
|
||||
## Connect OneDrive to LocalDocs
|
||||
|
||||
!!! note "Connect OneDrive to LocalDocs"
|
||||
|
||||
1. **Install GPT4All and Open LocalDocs**:
|
||||
|
||||
- Go to [nomic.ai/gpt4all](https://nomic.ai/gpt4all) to install GPT4All for your operating system.
|
||||
|
||||
- Navigate to the LocalDocs feature within GPT4All to configure it to use your synced OneDrive directory.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Placeholder for screenshot of LocalDocs interface -->
|
||||
<img width="1348" alt="Screenshot 2024-07-10 at 10 55 41 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/54254bc0-d9a0-40c4-9fd1-5059abaad583">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
2. **Add Collection**:
|
||||
|
||||
- Click on **+ Add Collection** to begin linking your OneDrive folders.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Placeholder for screenshot of adding collection in LocalDocs -->
|
||||
<img width="1348" alt="Screenshot 2024-07-10 at 10 56 29 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/7f12969a-753a-4757-bb9e-9b607cf315ca">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
- Name the Collection and specify the OneDrive folder path.
|
||||
|
||||
3. **Create Collection**:
|
||||
|
||||
- Click **Create Collection** to initiate the embedding process. Progress will be displayed within the LocalDocs interface.
|
||||
|
||||
4. **Access Files in Chats**:
|
||||
|
||||
- Load a model within GPT4All to chat with your files.
|
||||
|
||||
- In your chat, open 'LocalDocs' using the button in the top-right corner to provide context from your synced OneDrive files.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Placeholder for screenshot of accessing LocalDocs in chats -->
|
||||
<img width="1447" alt="Screenshot 2024-07-10 at 10 58 55 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/b5a67fe6-0d6a-42ae-b3b8-cc0f91cbf5b1">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
5. **Interact With Your OneDrive**:
|
||||
|
||||
- Use the model to interact with your files directly from OneDrive.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Placeholder for screenshot of interacting with sources -->
|
||||
<img width="662" alt="Screenshot 2024-07-10 at 11 04 55 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/2c9815b8-3d1c-4179-bf76-3ddbafb193bf">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img width="662" alt="Screenshot 2024-07-11 at 11 21 46 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/ce8be292-b025-415a-bd54-f11868e0cd0a">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
6. **View Referenced Files**:
|
||||
|
||||
- Click on **Sources** below responses to see which OneDrive files were referenced.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img width="643" alt="Screenshot 2024-07-11 at 11 22 49 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/6fe3f10d-2791-4153-88a7-2198ab3ac945">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## How It Works
|
||||
|
||||
OneDrive for Desktop syncs your OneDrive files to your computer, while LocalDocs maintains a database of these synced files for use by your local GPT4All model. As your OneDrive updates, LocalDocs will automatically detect file changes and stay up to date. LocalDocs leverages [Nomic Embedding](https://docs.nomic.ai/atlas/capabilities/embeddings) models to find semantically similar snippets from your files, enhancing the context of your interactions.
|
@ -0,0 +1,113 @@
|
||||
# Using GPT4All to Privately Chat with your Google Drive Data
|
||||
Local and Private AI Chat with your Google Drive Data
|
||||
|
||||
Google Drive for Desktop allows you to sync and access your Google Drive files directly on your computer. By connecting your synced directory to LocalDocs, you can start using GPT4All to privately chat with data stored in your Google Drive.
|
||||
|
||||
## Download Google Drive for Desktop
|
||||
|
||||
!!! note "Download Google Drive for Desktop"
|
||||
|
||||
1. **Download Google Drive for Desktop**:
|
||||
- Visit [drive.google.com](https://drive.google.com) and sign in with your Google account.
|
||||
- Navigate to the **Settings** (gear icon) and select **Settings** from the dropdown menu.
|
||||
- Scroll down to **Google Drive for desktop** and click **Download**.
|
||||
|
||||
2. **Install Google Drive for Desktop**
|
||||
- Run the installer file you downloaded.
|
||||
- Follow the prompts to complete the installation process.
|
||||
|
||||
3. **Sign in and Sync**
|
||||
- Once installed, sign in to Google Drive for Desktop with your Google account credentials.
|
||||
- Choose the folders you want to sync to your computer.
|
||||
|
||||
For advanced help, see [Setting up Google Drive for Desktop](https://support.google.com/drive/answer/10838124?hl=en)
|
||||
## Connect Google Drive to LocalDocs
|
||||
|
||||
!!! note "Connect Google Drive to LocalDocs"
|
||||
|
||||
1. **Install GPT4All and Open LocalDocs**:
|
||||
|
||||
- Go to [nomic.ai/gpt4all](https://nomic.ai/gpt4all) to install GPT4All for your operating system.
|
||||
|
||||
- Navigate to the LocalDocs feature within GPT4All to configure it to use your synced directory.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of LocalDocs interface -->
|
||||
<img width="1348" alt="Screenshot 2024-07-09 at 3 15 35 PM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/d8fb2d79-2063-45d4-bcce-7299fb75b144">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
2. **Add Collection**:
|
||||
|
||||
- Click on **+ Add Collection** to begin linking your Google Drive folders.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of adding collection in LocalDocs -->
|
||||
<img width="1348" alt="Screenshot 2024-07-09 at 3 17 24 PM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/39063615-9eb6-4c47-bde7-c9f04f9b168b">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
- Name your collection
|
||||
|
||||
|
||||
3. **Create Collection**:
|
||||
|
||||
- Click **Create Collection** to initiate the embedding process. Progress will be displayed within the LocalDocs interface.
|
||||
|
||||
4. **Access Files in Chats**:
|
||||
|
||||
- Load a model to chat with your files (Llama 3 Instruct performs best)
|
||||
|
||||
- In your chat, open 'LocalDocs' with the button in the top-right corner to provide context from your synced Google Drive files.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of accessing LocalDocs in chats -->
|
||||
<img width="1447" alt="Screenshot 2024-07-09 at 3 20 53 PM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/ce68811f-9abd-451b-ac0a-fb941e185d7a">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
5. **Interact With Your Drive:**
|
||||
|
||||
- Use the model to interact with your files
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of interacting sources -->
|
||||
<img width="662" alt="Screenshot 2024-07-09 at 3 36 51 PM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/bc55bc36-e613-419d-a568-adb1cd993854">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img width="662" alt="Screenshot 2024-07-11 at 11 34 00 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/1c0fd19a-5a22-4726-a841-d26c1bea81fc">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
6. **View Referenced Files**:
|
||||
|
||||
- Click on **Sources** below LLM responses to see which Google Drive files were referenced.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img width="643" alt="Screenshot 2024-07-11 at 11 34 37 AM" src="https://github.com/nomic-ai/gpt4all/assets/132290469/78527d30-8d24-4b4c-8311-b611a2d66fcd">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## How It Works
|
||||
|
||||
Google Drive for Desktop syncs your Google Drive files to your computer, while LocalDocs maintains a database of these synced files for use by your local LLM. As your Google Drive updates, LocalDocs will automatically detect file changes and stay up to date. LocalDocs is powered by [Nomic Embedding](https://docs.nomic.ai/atlas/capabilities/embeddings) models which find semantically similar snippets from your files to enhance the context of your interactions.
|
@ -0,0 +1,85 @@
|
||||
# Using GPT4All to Privately Chat with your Microsoft Excel Spreadsheets
|
||||
Local and Private AI Chat with your Microsoft Excel Spreadsheets
|
||||
|
||||
Microsoft Excel allows you to create, manage, and analyze data in spreadsheet format. By attaching your spreadsheets directly to GPT4All, you can privately chat with the AI to query and explore the data, enabling you to summarize, generate reports, and glean insights from your files—all within your conversation.
|
||||
|
||||
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
|
||||
<iframe src="../../assets/gpt4all_xlsx_attachment.mp4" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video"></iframe>
|
||||
</div>
|
||||
|
||||
|
||||
## Attach Microsoft Excel to your GPT4All Conversation
|
||||
|
||||
!!! note "Attach Microsoft Excel to your GPT4All Conversation"
|
||||
|
||||
1. **Install GPT4All and Open Chats**:
|
||||
|
||||
- Go to [nomic.ai/gpt4all](https://nomic.ai/gpt4all) to install GPT4All for your operating system.
|
||||
|
||||
- Navigate to the Chats view within GPT4All.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of Chat view -->
|
||||
<img width="1348" alt="Chat view" src="../../assets/chat_window.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
2. **Example Spreadsheet**:
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of Spreadsheet view -->
|
||||
<img width="1348" alt="Spreadsheet view" src="../../assets/disney_spreadsheet.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
3. **Attach to GPT4All conversation**
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of Attach view -->
|
||||
<img width="1348" alt="Attach view" src="../../assets/attach_spreadsheet.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
4. **Have GPT4All Summarize and Generate a Report**
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<!-- Screenshot of Attach view -->
|
||||
<img width="1348" alt="Attach view" src="../../assets/spreadsheet_chat.png">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
|
||||
## How It Works
|
||||
|
||||
GPT4All parses your attached Excel spreadsheet into Markdown, a format understandable to LLMs, and adds the Markdown text to the context of your LLM chat. You can view the code that converts `.xlsx` to Markdown [here](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/src/xlsxtomd.cpp) in the GPT4All GitHub repo.
|
||||
|
||||
For example, the above spreadsheet titled `disney_income_stmt.xlsx` would be formatted the following way:
|
||||
|
||||
```markdown
|
||||
## disney_income_stmt
|
||||
|
||||
|Walt Disney Co.|||||||
|
||||
|---|---|---|---|---|---|---|
|
||||
|Consolidated Income Statement|||||||
|
||||
|||||||||
|
||||
|US$ in millions|||||||
|
||||
|12 months ended:|2023-09-30 00:00:00|2022-10-01 00:00:00|2021-10-02 00:00:00|2020-10-03 00:00:00|2019-09-28 00:00:00|2018-09-29 00:00:00|
|
||||
|Services|79562|74200|61768|59265|60542|50869|
|
||||
...
|
||||
...
|
||||
...
|
||||
```
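For illustration, here is a rough Python equivalent of the same idea using `openpyxl` (an assumption made for this sketch; it is not what GPT4All uses, since the application performs the conversion in C++ as linked above).

```python
# Illustrative sketch only: GPT4All performs this conversion in C++ (xlsxtomd.cpp).
# This rough Python equivalent uses openpyxl (not a GPT4All dependency) to show the
# idea of flattening each worksheet into a Markdown table for LLM context.
from openpyxl import load_workbook

def xlsx_to_markdown(path: str) -> str:
    wb = load_workbook(path, data_only=True)  # data_only=True reads cached cell values
    parts = []
    for ws in wb.worksheets:
        parts.append(f"## {ws.title}\n")
        rows = [[("" if c is None else str(c)) for c in row]
                for row in ws.iter_rows(values_only=True)]
        if not rows:
            continue
        width = max(len(r) for r in rows)
        rows = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows
        parts.append("|" + "|".join(rows[0]) + "|")          # first row becomes the header
        parts.append("|" + "|".join(["---"] * width) + "|")
        for r in rows[1:]:
            parts.append("|" + "|".join(r) + "|")
        parts.append("")
    return "\n".join(parts)

print(xlsx_to_markdown("disney_income_stmt.xlsx"))
```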
|
||||
|
||||
## Limitations
|
||||
|
||||
It is important to double-check the claims LLMs make about the spreadsheets you provide. LLMs can make mistakes about the data they are presented with, particularly with smaller-parameter-count LLMs (~8B) that fit within the memory of consumer hardware.
|
48
gpt4all-bindings/python/docs/gpt4all_desktop/localdocs.md
Normal file
@ -0,0 +1,48 @@
|
||||
# LocalDocs
|
||||
|
||||
LocalDocs brings the information you have from files on-device into your LLM chats - **privately**.
|
||||
|
||||
## Create LocalDocs
|
||||
|
||||
!!! note "Create LocalDocs"
|
||||
|
||||
1. Click `+ Add Collection`.
|
||||
|
||||
2. Name your collection and link it to a folder.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img src="../assets/new_docs_annotated.png" alt="new GOT Docs" style="width:100%">
|
||||
</td>
|
||||
<td>
|
||||
<img src="../assets/new_docs_annotated_filled.png" alt="new GOT Docs filled out" style="width:100%">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
3. Click `Create Collection`. Progress for the collection is displayed on the LocalDocs page.
|
||||
|
||||

|
||||
|
||||
You will see a green `Ready` indicator when the entire collection is ready.
|
||||
|
||||
Note: you can already chat with the files that are ready, even before the entire collection has finished indexing.
|
||||
|
||||

|
||||
|
||||
Later on, if you modify your LocalDocs settings, you can rebuild your collections with your new settings.
|
||||
|
||||
4. In your chats, open `LocalDocs` with the button in the top-right corner to give your LLM context from those files.
|
||||
|
||||

|
||||
|
||||
5. See which files were referenced by clicking `Sources` below the LLM responses.
|
||||
|
||||

|
||||
|
||||
## How It Works
|
||||
|
||||
A LocalDocs collection uses Nomic AI's free and fast on-device embedding models to index your folder into text snippets that each get an **embedding vector**. These vectors allow us to find snippets from your files that are semantically similar to the questions and prompts you enter in your chats. We then include those semantically similar snippets in the prompt to the LLM.
|
||||
|
||||
To try the embedding models yourself, we recommend using the [Nomic Python SDK](https://docs.nomic.ai/atlas/capabilities/embeddings)
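If you want a feel for the retrieval idea itself, here is a minimal sketch using the `Embed4All` class from the `gpt4all` Python package. It is not the application's actual LocalDocs code; the snippets and question are made up, and a simple cosine similarity stands in for the real index.

```python
# Minimal sketch of the idea behind LocalDocs retrieval (not the app's actual code):
# embed text snippets and a question, then pick the snippet most similar to the question.
import math
from gpt4all import Embed4All

embedder = Embed4All()  # downloads/loads a small on-device embedding model

snippets = [
    "LocalDocs indexes your folders into text snippets with embedding vectors.",
    "GPT4All runs large language models privately on everyday hardware.",
    "The quarterly report shows revenue grew in the services segment.",
]
question = "How does LocalDocs find relevant text from my files?"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

snippet_vecs = [embedder.embed(s) for s in snippets]
question_vec = embedder.embed(question)

ranked = sorted(zip(snippets, snippet_vecs),
                key=lambda pair: cosine(question_vec, pair[1]), reverse=True)
print("Most relevant snippet:", ranked[0][0])
```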
|
79
gpt4all-bindings/python/docs/gpt4all_desktop/models.md
Normal file
@ -0,0 +1,79 @@
|
||||
# Models
|
||||
|
||||
GPT4All is optimized to run LLMs in the 3-13B parameter range on consumer-grade hardware.
|
||||
|
||||
LLMs are downloaded to your device so you can run them locally and privately. With our backend anyone can interact with LLMs efficiently and securely on their own hardware.
|
||||
|
||||
## Download Models
|
||||
|
||||
!!! note "Download Models"
|
||||
|
||||
<div style="text-align: center; margin-top: 20px;">
|
||||
<table style="margin-left: auto; margin-right: auto;">
|
||||
<tr>
|
||||
<td style="text-align: right; padding-right: 10px;">1.</td>
|
||||
<td style="text-align: left;">Click `Models` in the menu on the left (below `Chats` and above `LocalDocs`)</td>
|
||||
<td><img src="../assets/models_page_icon.png" alt="Models Page Icon" style="width: 80px; height: auto;"></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: right; padding-right: 10px;">2.</td>
|
||||
<td style="text-align: left;">Click `+ Add Model` to navigate to the `Explore Models` page</td>
|
||||
<td><img src="../assets/add.png" alt="Add Model button" style="width: 100px; height: auto;"></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: right; padding-right: 10px;">3.</td>
|
||||
<td style="text-align: left;">Search for models available online</td>
|
||||
<td><img src="../assets/explore.png" alt="Explore Models search" style="width: 120px; height: auto;"></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: right; padding-right: 10px;">4.</td>
|
||||
<td style="text-align: left;">Hit `Download` to save a model to your device</td>
|
||||
<td><img src="../assets/download.png" alt="Download Models button" style="width: 120px; height: auto;"></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align: right; padding-right: 10px;">5.</td>
|
||||
<td style="text-align: left;">Once the model is downloaded you will see it in `Models`.</td>
|
||||
<td><img src="../assets/installed_models.png" alt="Download Models button" style="width: 120px; height: auto;"></td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
## Explore Models
|
||||
|
||||
GPT4All connects you with LLMs from HuggingFace with a [`llama.cpp`](https://github.com/ggerganov/llama.cpp) backend so that they will run efficiently on your hardware. Many of these models can be identified by the file type `.gguf`.
|
||||
|
||||

|
||||
|
||||
## Example Models
|
||||
|
||||
Many LLMs are available at various sizes, quantizations, and licenses.
|
||||
|
||||
- LLMs with more parameters tend to be better at coherently responding to instructions
|
||||
|
||||
- LLMs with a smaller quantization (e.g. 4-bit instead of 16-bit) are much faster and less memory intensive, and tend to have slightly worse performance (a rough memory estimate is sketched below)
|
||||
|
||||
- Licenses vary in their terms for personal and commercial use
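The memory impact of quantization can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope calculation that ignores context (KV cache) and runtime overhead, so real requirements are somewhat higher:

```python
# Rough back-of-the-envelope memory estimate (an illustration; it ignores the KV cache
# and runtime overhead, so real requirements are somewhat higher than shown).
def approx_model_gb(parameters_billion: float, bits_per_weight: int) -> float:
    bytes_total = parameters_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(approx_model_gb(8, 16))  # ~16 GB at 16-bit: too large for most laptops
print(approx_model_gb(8, 4))   # ~4 GB at 4-bit (q4_0): close to Llama 3 8B's 4.66 GB file
```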
|
||||
|
||||
Here are a few examples:
|
||||
|
||||
| Model| Filesize| RAM Required| Parameters| Quantization| Developer| License| MD5 Sum (Unique Hash)|
|
||||
|------|---------|-------------|-----------|-------------|----------|--------|----------------------|
|
||||
| Llama 3 Instruct | 4.66 GB| 8 GB| 8 Billion| q4_0| Meta| [Llama 3 License](https://llama.meta.com/llama3/license/)| c87ad09e1e4c8f9c35a5fcef52b6f1c9|
|
||||
| Nous Hermes 2 Mistral DPO| 4.11 GB| 8 GB| 7 Billion| q4_0| Mistral & Nous Research | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)| a5f6b4eabd3992da4d7fb7f020f921eb|
|
||||
| Phi-3 Mini Instruct | 2.18 GB| 4 GB| 4 billion| q4_0| Microsoft| [MIT](https://opensource.org/license/mit)| f8347badde9bfc2efbe89124d78ddaf5|
|
||||
| Mini Orca (Small)| 1.98 GB| 4 GB| 3 billion| q4_0| Microsoft | [CC-BY-NC-SA-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0)| 0e769317b90ac30d6e09486d61fefa26|
|
||||
| GPT4All Snoozy| 7.37 GB| 16 GB| 13 billion| q4_0| Nomic AI| [GPL](https://www.gnu.org/licenses/gpl-3.0.en.html)| 40388eb2f8d16bb5d08c96fdfaac6b2c|
|
||||
|
||||
### Search Results
|
||||
|
||||
You can click the gear icon in the search bar to sort search results by their # of likes, # of downloads, or date of upload (all from HuggingFace).
|
||||
|
||||

|
||||
|
||||
## Connect Model APIs
|
||||
|
||||
You can add your API key for remote model providers.
|
||||
|
||||
**Note**: this does not download a model file to your computer for local, private use. Instead, with this way of interacting with models, your prompts leave your computer, go to the API provider, and the responses are returned to your computer.
|
||||
|
||||

|
42
gpt4all-bindings/python/docs/gpt4all_desktop/quickstart.md
Normal file
@ -0,0 +1,42 @@
|
||||
# GPT4All Desktop
|
||||
|
||||
The GPT4All Desktop Application allows you to download and run large language models (LLMs) locally & privately on your device.
|
||||
|
||||
With GPT4All, you can chat with models, turn your local files into information sources for models [(LocalDocs)](localdocs.md), or browse models available online to download onto your device.
|
||||
|
||||
[Official Video Tutorial](https://www.youtube.com/watch?v=gQcZDXRVJok)
|
||||
|
||||
## Quickstart
|
||||
|
||||
!!! note "Quickstart"
|
||||
|
||||
1. Install GPT4All for your operating system and open the application.
|
||||
|
||||
<div style="text-align: center; margin-top: 20px;">
|
||||
[Download for Windows](https://gpt4all.io/installers/gpt4all-installer-win64.exe)
|
||||
[Download for Mac](https://gpt4all.io/installers/gpt4all-installer-darwin.dmg)
|
||||
[Download for Linux](https://gpt4all.io/installers/gpt4all-installer-linux.run)
|
||||
</div>
|
||||
|
||||
2. Hit `Start Chatting`. 
|
||||
|
||||
3. Click `+ Add Model`.
|
||||
|
||||
4. Download a model. We recommend starting with Llama 3, but you can [browse more models](models.md). 
|
||||
|
||||
5. Once downloaded, go to Chats (below Home and above Models in the menu on the left).
|
||||
|
||||
6. Click "Load Default Model" (will be Llama 3 or whichever model you downloaded).
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img src="../assets/before_first_chat.png" alt="Before first chat" style="width:100%">
|
||||
</td>
|
||||
<td>
|
||||
<img src="../assets/new_first_chat.png" alt="New first chat" style="width:100%">
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
7. Try the [example chats](chats.md) or your own prompts!
|
79
gpt4all-bindings/python/docs/gpt4all_desktop/settings.md
Normal file
@ -0,0 +1,79 @@
|
||||
# Settings
|
||||
|
||||
## Application Settings
|
||||
|
||||
!!! note "General Application Settings"
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
| --- | --- | --- |
|
||||
| **Theme** | Color theme for the application. Options are `Light`, `Dark`, and `LegacyDark` | `Light` |
|
||||
| **Font Size** | Font size setting for text throughout the application. Options are Small, Medium, and Large | Small |
|
||||
| **Language and Locale** | The language and locale you wish to use | System Locale |
|
||||
| **Device** | Device that will run your models. Options are `Auto` (GPT4All chooses), `Metal` (Apple Silicon M1+), `CPU`, and `GPU` | `Auto` |
|
||||
| **Default Model** | Choose your preferred LLM to load by default on startup| Auto |
|
||||
| **Suggestion Mode** | Generate suggested follow up questions at the end of responses | When chatting with LocalDocs |
|
||||
| **Download Path** | Select a destination on your device to save downloaded models | Windows: `C:\Users\{username}\AppData\Local\nomic.ai\GPT4All`<br><br>Mac: `/Users/{username}/Library/Application Support/nomic.ai/GPT4All/`<br><br>Linux: `/home/{username}/.local/share/nomic.ai/GPT4All` |
|
||||
| **Enable Datalake** | Opt-in to sharing interactions with GPT4All community (**anonymous** and **optional**) | Off |
|
||||
|
||||
!!! note "Advanced Application Settings"
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
| --- | --- | --- |
|
||||
| **CPU Threads** | Number of concurrently running CPU threads (more can speed up responses) | 4 |
|
||||
| **Enable System Tray** | The application will minimize to the system tray / taskbar when the window is closed | Off |
|
||||
| **Enable Local Server** | Allow any application on your device to use GPT4All via an OpenAI-compatible GPT4All API | Off |
|
||||
| **API Server Port** | Local HTTP port for the local API server | 4891 |
|
||||
|
||||
## Model Settings
|
||||
|
||||
!!! note "Model / Character Settings"
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
| --- | --- | --- |
|
||||
| **Name** | Unique name of this model / character| set by model uploader |
|
||||
| **Model File** | Filename (.gguf) of the model | set by model uploader |
|
||||
| **System Message** | General instructions for the chats this model will be used for | set by model uploader |
|
||||
| **Chat Template** | Format of user <-> assistant interactions for the chats this model will be used for | set by model uploader |
|
||||
| **Chat Name Prompt** | Prompt used to automatically generate chat names | Describe the above conversation in seven words or less. |
|
||||
| **Suggested FollowUp Prompt** | Prompt used to automatically generate follow up questions after a chat response | Suggest three very short factual follow-up questions that have not been answered yet or cannot be found inspired by the previous conversation and excerpts. |
|
||||
|
||||
### Clone
|
||||
|
||||
You can **clone** an existing model, which allows you to save a configuration of a model file with different prompt templates and sampling settings.
|
||||
|
||||
### Sampling Settings
|
||||
|
||||
!!! note "Model Sampling Settings"
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
|----------------------------|------------------------------------------|-----------|
|
||||
| **Context Length** | Maximum length of input sequence in tokens | 2048 |
|
||||
| **Max Length** | Maximum length of response in tokens | 4096 |
|
||||
| **Prompt Batch Size** | Token batch size for parallel processing | 128 |
|
||||
| **Temperature** | Lower temperature gives more likely generations | 0.7 |
|
||||
| **Top P** | Prevents choosing highly unlikely tokens | 0.4 |
|
||||
| **Top K** | Size of selection pool for tokens | 40 |
|
||||
| **Min P** | Minimum relative probability | 0 |
|
||||
| **Repeat Penalty Tokens** | Length to apply penalty | 64 |
|
||||
| **Repeat Penalty** | Penalize repetitiveness | 1.18 |
|
||||
| **GPU Layers** | How many model layers to load into VRAM | 32 |
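In the Python SDK, most of these sampling settings correspond to keyword arguments of `GPT4All.generate()`, while the context length is set when the model is loaded. The mapping below reflects recent versions of the `gpt4all` package; treat the exact parameter names as something to verify against the version you have installed.

```python
# The desktop sampling settings map onto GPT4All.generate() keyword arguments in the
# Python SDK (names per recent gpt4all versions; check your installed version's docs).
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf", n_ctx=2048)  # Context Length
with model.chat_session():
    print(model.generate(
        "Explain what temperature does to sampling.",
        max_tokens=4096,      # Max Length
        temp=0.7,             # Temperature
        top_p=0.4,            # Top P
        top_k=40,             # Top K
        min_p=0.0,            # Min P
        repeat_penalty=1.18,  # Repeat Penalty
        repeat_last_n=64,     # Repeat Penalty Tokens
        n_batch=128,          # Prompt Batch Size
    ))
```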
|
||||
|
||||
## LocalDocs Settings
|
||||
|
||||
!!! note "General LocalDocs Settings"
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
| --- | --- | --- |
|
||||
| **Allowed File Extensions** | Choose which file types will be indexed into LocalDocs collections as text snippets with embedding vectors | `.txt`, `.pdf`, `.md`, `.rst` |
|
||||
| **Use Nomic Embed API** | Use Nomic API to create LocalDocs collections fast and off-device; [Nomic API Key](https://atlas.nomic.ai/) required | Off |
|
||||
| **Embeddings Device** | Device that will run embedding models. Options are `Auto` (GPT4All chooses), `Metal` (Apple Silicon M1+), `CPU`, and `GPU` | `Auto` |
|
||||
| **Show Sources** | Titles of source files retrieved by LocalDocs will be displayed directly in your chats.| On |
|
||||
|
||||
!!! note "Advanced LocalDocs Settings"
|
||||
|
||||
Note that increasing these settings can increase the likelihood of factual responses, but may result in slower generation times.
|
||||
|
||||
| Setting | Description | Default Value |
|
||||
| --- | --- | --- |
|
||||
| **Document Snippet Size** | Number of string characters per document snippet | 512 |
|
||||
| **Maximum Document Snippets Per Prompt** | Upper limit for the number of snippets from your files LocalDocs can retrieve for LLM context | 3 |
|
43
gpt4all-bindings/python/docs/gpt4all_help/faq.md
Normal file
@ -0,0 +1,43 @@
|
||||
# Frequently Asked Questions
|
||||
|
||||
## Models
|
||||
|
||||
### Which language models are supported?
|
||||
|
||||
We support models with a `llama.cpp` implementation which have been uploaded to [HuggingFace](https://huggingface.co/).
|
||||
|
||||
### Which embedding models are supported?
|
||||
|
||||
We support SBert and Nomic Embed Text v1 & v1.5.
|
||||
|
||||
## Software
|
||||
|
||||
### What software do I need?
|
||||
|
||||
All you need is to [install GPT4All](../index.md) onto your Windows, Mac, or Linux computer.
|
||||
|
||||
### Which SDK languages are supported?
|
||||
|
||||
Our SDK is in Python for usability; it consists of light bindings around [`llama.cpp`](https://github.com/ggerganov/llama.cpp) implementations that we contribute to for efficiency and accessibility on everyday computers.
|
||||
|
||||
### Is there an API?
|
||||
|
||||
Yes, you can run your model in server-mode with our [OpenAI-compatible API](https://platform.openai.com/docs/api-reference/completions), which you can configure in [settings](../gpt4all_desktop/settings.md#application-settings)
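For example, once the local server is enabled in settings (default port 4891), a request along the following lines should work. The endpoint path and payload follow OpenAI's chat-completions convention and are an assumption to verify against your GPT4All version.

```python
# Sketch of calling the local server after enabling it in Settings (default port 4891).
# The OpenAI-style endpoint and payload are assumptions to verify against your version.
import requests

response = requests.post(
    "http://localhost:4891/v1/chat/completions",
    json={
        "model": "Llama 3 8B Instruct",  # name of a model you have downloaded
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```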
|
||||
|
||||
### Can I monitor a GPT4All deployment?
|
||||
|
||||
Yes, GPT4All [integrates](../gpt4all_python/monitoring.md) with [OpenLIT](https://github.com/openlit/openlit) so you can deploy LLMs with user interactions and hardware usage automatically monitored for full observability.
|
||||
|
||||
### Is there a command line interface (CLI)?
|
||||
|
||||
[Yes](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/cli), we have a lightweight use of the Python client as a CLI. We welcome further contributions!
|
||||
|
||||
## Hardware
|
||||
|
||||
### What hardware do I need?
|
||||
|
||||
GPT4All can run on CPU, Metal (Apple Silicon M1+), and GPU.
|
||||
|
||||
### What are the system requirements?
|
||||
|
||||
Your CPU needs to support [AVX or AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) and you need enough RAM to load a model into memory.
|
27
gpt4all-bindings/python/docs/gpt4all_help/troubleshooting.md
Normal file
@ -0,0 +1,27 @@
|
||||
# Troubleshooting
|
||||
|
||||
## Error Loading Models
|
||||
|
||||
It is possible you are trying to load a model from HuggingFace whose weights are not compatible with our [backend](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings).
|
||||
|
||||
Try downloading one of the officially supported models listed on the main models page in the application. If the problem persists, please share your experience on our [Discord](https://discord.com/channels/1076964370942267462).
|
||||
|
||||
## Bad Responses
|
||||
|
||||
Try the [example chats](../gpt4all_desktop/chats.md) to double check that your system is implementing models correctly.
|
||||
|
||||
### Responses Incoherent
|
||||
|
||||
If you are seeing something **not at all** resembling the [example chats](../gpt4all_desktop/chats.md) - for example, if the responses you are seeing look nonsensical - try [downloading a different model](../gpt4all_desktop/models.md), and please share your experience on our [Discord](https://discord.com/channels/1076964370942267462).
|
||||
|
||||
### Responses Incorrect
|
||||
|
||||
LLMs can be unreliable. It's helpful to know what their training data was - they are less likely to be correct when asked about data they were not trained on, unless you give the necessary information in the prompt as **context**.
|
||||
|
||||
Giving LLMs additional context, like chatting using [LocalDocs](../gpt4all_desktop/localdocs.md), can help merge the language model's ability to understand text with the files that you trust to contain the information you need.
|
||||
|
||||
Including information in a prompt is not a guarantee that it will be used correctly, but the clearer and more concise your prompts are, and the more relevant they are to your files, the better.
|
||||
|
||||
### LocalDocs Issues
|
||||
|
||||
Occasionally a model - particularly a smaller or overall weaker LLM - may not use the relevant text snippets from the files that were referenced via LocalDocs. If you are seeing this, it can help to use phrases like "in the docs" or "from the provided files" when prompting your model.
|
159
gpt4all-bindings/python/docs/gpt4all_python/home.md
Normal file
@ -0,0 +1,159 @@
|
||||
# GPT4All Python SDK
|
||||
|
||||
## Installation
|
||||
|
||||
To get started, pip-install the `gpt4all` package into your Python environment.
|
||||
|
||||
```bash
|
||||
pip install gpt4all
|
||||
```
|
||||
|
||||
We recommend installing `gpt4all` into its own virtual environment using `venv` or `conda`.
|
||||
|
||||
## Load LLM
|
||||
|
||||
Models are loaded by name via the `GPT4All` class. If it's your first time loading a model, it will be downloaded to your device and saved so it can be quickly reloaded next time you create a `GPT4All` model with the same name.
|
||||
|
||||
!!! note "Load LLM"
|
||||
|
||||
```python
|
||||
from gpt4all import GPT4All
|
||||
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM
|
||||
with model.chat_session():
|
||||
print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
|
||||
```
|
||||
|
||||
| `GPT4All` model name| Filesize| RAM Required| Parameters| Quantization| Developer| License| MD5 Sum (Unique Hash)|
|
||||
|------|---------|-------|-------|-----------|----------|--------|----------------------|
|
||||
| `Meta-Llama-3-8B-Instruct.Q4_0.gguf`| 4.66 GB| 8 GB| 8 Billion| q4_0| Meta| [Llama 3 License](https://llama.meta.com/llama3/license/)| c87ad09e1e4c8f9c35a5fcef52b6f1c9|
|
||||
| `Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf`| 4.11 GB| 8 GB| 7 Billion| q4_0| Mistral & Nous Research | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)| a5f6b4eabd3992da4d7fb7f020f921eb|
|
||||
| `Phi-3-mini-4k-instruct.Q4_0.gguf` | 2.18 GB| 4 GB| 3.8 billion| q4_0| Microsoft| [MIT](https://opensource.org/license/mit)| f8347badde9bfc2efbe89124d78ddaf5|
|
||||
| `orca-mini-3b-gguf2-q4_0.gguf`| 1.98 GB| 4 GB| 3 billion| q4_0| Microsoft | [CC-BY-NC-SA-4.0](https://spdx.org/licenses/CC-BY-NC-SA-4.0)| 0e769317b90ac30d6e09486d61fefa26|
|
||||
| `gpt4all-13b-snoozy-q4_0.gguf`| 7.37 GB| 16 GB| 13 billion| q4_0| Nomic AI| [GPL](https://www.gnu.org/licenses/gpl-3.0.en.html)| 40388eb2f8d16bb5d08c96fdfaac6b2c|
|
||||
|
||||
|
||||
## Chat Session Generation
|
||||
|
||||
Most of the language models you will be able to access from HuggingFace have been trained as assistants. This training guides language models to answer not just with relevant text, but with *helpful* text.
|
||||
|
||||
If you want your LLM's responses to be helpful in the typical sense, we recommend you apply the chat templates the models were finetuned with. Information about specific prompt templates is typically available on the official HuggingFace page for the model.
|
||||
|
||||
!!! note "Example LLM Chat Session Generation"
|
||||
|
||||
=== "Code"
|
||||
|
||||
Load `Llama 3` and enter the following prompt in a chat session:
|
||||
|
||||
```python
|
||||
from gpt4all import GPT4All
|
||||
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
|
||||
with model.chat_session():
|
||||
print(model.generate("quadratic formula"))
|
||||
```
|
||||
|
||||
=== "Output"
|
||||
|
||||
With the default sampling settings, you should see something resembling the following:
|
||||
```
|
||||
The quadratic formula!
|
||||
|
||||
The quadratic formula is a mathematical formula that provides the solutions to a quadratic equation of the form:
|
||||
|
||||
ax^2 + bx + c = 0
|
||||
|
||||
where a, b, and c are constants. The formula is:
|
||||
|
||||
x = (-b ± √(b^2 - 4ac)) / 2a
|
||||
|
||||
Let's break it down:
|
||||
|
||||
* x is the variable we're trying to solve for.
|
||||
* a, b, and c are the coefficients of the quadratic equation.
|
||||
* ± means "plus or minus".
|
||||
* √ denotes the square root.
|
||||
|
||||
To use the formula, simply plug in the values of a, b, and c into the expression above. The resulting value(s) will be the solutions to the original quadratic equation!
|
||||
|
||||
For example, let's say we have the quadratic equation:
|
||||
|
||||
x^2 + 5x + 6 = 0
|
||||
|
||||
We can plug these values into the formula as follows:
|
||||
|
||||
a = 1
|
||||
b = 5
|
||||
```
|
||||
|
||||
## Direct Generation
|
||||
|
||||
Directly calling `model.generate()` prompts the model without applying any templates.
|
||||
|
||||
Note: this can result in output that is less like a helpful response and more like a mirror of your prompt's tone. In general, a language model outside of a chat session is less of a helpful assistant and more of a lens into the distribution of the model's training data.
|
||||
|
||||
As an example, see how the model's response changes when we give the same prompt as above without applying a chat session:
|
||||
|
||||
!!! note "Example LLM Direct Generation"
|
||||
|
||||
=== "Code"
|
||||
|
||||
Load `Llama 3` and enter the following prompt:
|
||||
|
||||
```python
|
||||
from gpt4all import GPT4All
|
||||
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
|
||||
print(model.generate("quadratic formula"))
|
||||
```
|
||||
=== "Output"
|
||||
|
||||
With the default sampling settings, you should see something resembling the following:
|
||||
```
|
||||
. The equation is in the form of a + bx = c, where a and b are constants.
|
||||
The solution to this problem involves using the quadratic formula which states that for any quadratic equation ax^2+bx+c=0, its solutions can be found by:
|
||||
x = (-b ± √(b^2-4ac)) / 2a
|
||||
In your case, since you have a + bx = c, we need to rewrite it in the form of ax^2+bx+c=0. To do this, subtract both sides from c, so that:
|
||||
c - (a + bx) = 0
|
||||
Now, combine like terms on the left side and simplify:
|
||||
ax^2 + (-b)x + (c-a) = 0\n\nSo now we have a quadratic equation in standard form: ax^2+bx+c=0. We can use this to find its solutions using the quadratic formula:
|
||||
|
||||
x = ((-b ± √((-b)^2
|
||||
```
|
||||
|
||||
Why did it respond differently? Because language models, before being fine-tuned as assistants, are trained to mimic their training data rather than to act as helpful assistants. The response therefore ends up as a typical continuation of math-style text rather than a helpful answer in a dialog.
|
||||
|
||||
## Embeddings
|
||||
|
||||
Nomic trains and open-sources free embedding models that will run very fast on your hardware.
|
||||
|
||||
The easiest way to run the text embedding model locally uses the [`nomic`](https://github.com/nomic-ai/nomic) python library to interface with our fast [C/C++ implementations](ref.md#gpt4all.gpt4all.Embed4All).
|
||||
|
||||
!!! note "Example Embeddings Generation"
|
||||
|
||||
=== "Code"
|
||||
|
||||
Importing `embed` from the [`nomic`](https://github.com/nomic-ai/nomic) library, you can call `embed.text()` with `inference_mode="local"`. This downloads an embedding model and saves it for later.
|
||||
|
||||
```python
|
||||
from nomic import embed
|
||||
embeddings = embed.text(["String 1", "String 2"], inference_mode="local")['embeddings']
|
||||
print("Number of embeddings created:", len(embeddings))
|
||||
print("Number of dimensions per embedding:", len(embeddings[0]))
|
||||
```
|
||||
|
||||
=== "Output"
|
||||
|
||||
```
|
||||
Number of embeddings created: 2
|
||||
Number of dimensions per embedding: 768
|
||||
```
|
||||
|
||||

|
||||
|
||||
To learn more about making embeddings locally with `nomic`, visit our [embeddings guide](https://docs.nomic.ai/atlas/guides/embeddings#local-inference).
|
||||
|
||||
The following embedding models can be used within the application and with the `Embed4All` class from the `gpt4all` Python library. The default context length of the GGUF files is 2048 but can be [extended](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF#description).
|
||||
|
||||
| Name| Using with `nomic`| `Embed4All` model name| Context Length| # Embedding Dimensions| File Size|
|
||||
|--------------------|-|------------------------------------------------------|---------------:|-----------------:|----------:|
|
||||
| [Nomic Embed v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1-GGUF) | ```embed.text(strings, model="nomic-embed-text-v1", inference_mode="local")```| ```Embed4All("nomic-embed-text-v1.f16.gguf")```| 2048 | 768 | 262 MiB |
|
||||
| [Nomic Embed v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF) | ```embed.text(strings, model="nomic-embed-text-v1.5", inference_mode="local")```| ```Embed4All("nomic-embed-text-v1.5.f16.gguf")``` | 2048| 64-768 | 262 MiB |
|
||||
| [SBert](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)| n/a| ```Embed4All("all-MiniLM-L6-v2.gguf2.f16.gguf")```| 512 | 384 | 44 MiB |
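As a short sketch, `Embed4All` can be used with any model name from the table above; the model is downloaded on first use and `embed` returns the embedding vector for a string.

```python
from gpt4all import Embed4All

embedder = Embed4All("nomic-embed-text-v1.5.f16.gguf")  # model name from the table above
vector = embedder.embed("GPT4All runs LLMs privately on your device.")
print(len(vector))  # 768 dimensions by default for Nomic Embed v1.5
```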
|
49
gpt4all-bindings/python/docs/gpt4all_python/monitoring.md
Normal file
@ -0,0 +1,49 @@
|
||||
# GPT4All Monitoring
|
||||
|
||||
GPT4All integrates with [OpenLIT](https://github.com/openlit/openlit) OpenTelemetry auto-instrumentation to perform real-time monitoring of your LLM application and GPU hardware.
|
||||
|
||||
Monitoring can enhance your GPT4All deployment with auto-generated traces and metrics for
|
||||
|
||||
- **Performance Optimization:** Analyze latency, cost and token usage to ensure your LLM application runs efficiently, identifying and resolving performance bottlenecks swiftly.
|
||||
|
||||
- **User Interaction Insights:** Capture each prompt and response to understand user behavior and usage patterns better, improving user experience and engagement.
|
||||
|
||||
- **Detailed GPU Metrics:** Monitor essential GPU parameters such as utilization, memory consumption, temperature, and power usage to maintain optimal hardware performance and avert potential issues.
|
||||
|
||||
## Setup Monitoring
|
||||
|
||||
!!! note "Setup Monitoring"
|
||||
|
||||
With [OpenLIT](https://github.com/openlit/openlit), you can automatically monitor traces and metrics for your LLM deployment:
|
||||
|
||||
```shell
|
||||
pip install openlit
|
||||
```
|
||||
|
||||
```python
|
||||
from gpt4all import GPT4All
|
||||
import openlit
|
||||
|
||||
openlit.init() # start collecting traces and metrics
|
||||
# openlit.init(collect_gpu_stats=True) # Optional: To configure GPU monitoring
|
||||
|
||||
model = GPT4All(model_name='orca-mini-3b-gguf2-q4_0.gguf')
|
||||
|
||||
# Start a chat session and send queries
|
||||
with model.chat_session():
|
||||
response1 = model.generate(prompt='hello', temp=0)
|
||||
response2 = model.generate(prompt='write me a short poem', temp=0)
|
||||
response3 = model.generate(prompt='thank you', temp=0)
|
||||
|
||||
print(model.current_chat_session)
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
### OpenLIT UI
|
||||
|
||||
Connect to OpenLIT's UI to start exploring the collected LLM performance metrics and traces. Visit the OpenLIT [Quickstart Guide](https://docs.openlit.io/latest/quickstart) for step-by-step details.
|
||||
|
||||
### Grafana, DataDog, & Other Integrations
|
||||
|
||||
You can also send the data collected by OpenLIT to popular monitoring tools like Grafana and DataDog. For detailed instructions on setting up these connections, please refer to the OpenLIT [Connections Guide](https://docs.openlit.io/latest/connections/intro).
|
4
gpt4all-bindings/python/docs/gpt4all_python/ref.md
Normal file
@ -0,0 +1,4 @@
|
||||
# GPT4All Python SDK Reference
|
||||
::: gpt4all.gpt4all.GPT4All
|
||||
|
||||
::: gpt4all.gpt4all.Embed4All
|
@ -1,66 +1,28 @@
|
||||
# GPT4All
|
||||
Welcome to the GPT4All technical documentation.
|
||||
# GPT4All Documentation
|
||||
|
||||
GPT4All is an open-source software ecosystem that allows anyone to train and deploy **powerful** and **customized** large language models (LLMs) on **everyday hardware**.
|
||||
Nomic AI oversees contributions to the open-source ecosystem ensuring quality, security and maintainability.
|
||||
GPT4All runs large language models (LLMs) privately on everyday desktops & laptops.
|
||||
|
||||
GPT4All software is optimized to run inference of 3-13 billion parameter large language models on the CPUs of laptops, desktops and servers.
|
||||
No API calls or GPUs required - you can just download the application and [get started](gpt4all_desktop/quickstart.md#quickstart).
|
||||
|
||||
=== "GPT4All Example"
|
||||
``` py
|
||||
!!! note "Desktop Application"
|
||||
GPT4All runs LLMs as an application on your computer. Nomic's embedding models can bring information from your local documents and files into your chats. It's fast, on-device, and completely **private**.
|
||||
|
||||
<div style="text-align: center; margin-top: 20px;">
|
||||
[Download for Windows](https://gpt4all.io/installers/gpt4all-installer-win64.exe)
|
||||
[Download for Mac](https://gpt4all.io/installers/gpt4all-installer-darwin.dmg)
|
||||
[Download for Linux](https://gpt4all.io/installers/gpt4all-installer-linux.run)
|
||||
</div>
|
||||
|
||||
!!! note "Python SDK"
|
||||
Use GPT4All in Python to program with LLMs implemented with the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) backend and [Nomic's C backend](https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-backend). Nomic contributes to open source software like [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to make LLMs accessible and efficient **for all**.
|
||||
|
||||
```bash
|
||||
pip install gpt4all
|
||||
```
|
||||
|
||||
```python
|
||||
from gpt4all import GPT4All
|
||||
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
|
||||
output = model.generate("The capital of France is ", max_tokens=3)
|
||||
print(output)
|
||||
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf") # downloads / loads a 4.66GB LLM
|
||||
with model.chat_session():
|
||||
print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
|
||||
```
|
||||
=== "Output"
|
||||
```
|
||||
1. Paris
|
||||
```
|
||||
See [Python Bindings](gpt4all_python.md) to use GPT4All.
|
||||
|
||||
### Navigating the Documentation
|
||||
In an effort to ensure cross-operating-system and cross-language compatibility, the [GPT4All software ecosystem](https://github.com/nomic-ai/gpt4all)
|
||||
is organized as a monorepo with the following structure:
|
||||
|
||||
- **gpt4all-backend**: The GPT4All backend maintains and exposes a universal, performance optimized C API for running inference with multi-billion parameter Transformer Decoders.
|
||||
This C API is then bound to any higher level programming language such as C++, Python, Go, etc.
|
||||
- **gpt4all-bindings**: GPT4All bindings contain a variety of high-level programming languages that implement the C API. Each directory is a bound programming language. The [CLI](gpt4all_cli.md) is included here, as well.
|
||||
- **gpt4all-chat**: GPT4All Chat is an OS native chat application that runs on macOS, Windows and Linux. It is the easiest way to run local, privacy aware chat assistants on everyday hardware. You can download it on the [GPT4All Website](https://gpt4all.io) and read its source code in the monorepo.
|
||||
|
||||
Explore detailed documentation for the backend, bindings and chat client in the sidebar.
|
||||
## Models
|
||||
The GPT4All software ecosystem is compatible with the following Transformer architectures:
|
||||
|
||||
- `Falcon`
|
||||
- `LLaMA` (including `OpenLLaMA`)
|
||||
- `MPT` (including `Replit`)
|
||||
- `GPT-J`
|
||||
|
||||
You can find an exhaustive list of supported models on the [website](https://gpt4all.io) or in the [models directory](https://raw.githubusercontent.com/nomic-ai/gpt4all/main/gpt4all-chat/metadata/models3.json)
|
||||
|
||||
|
||||
GPT4All models are artifacts produced through a process known as neural network quantization.
|
||||
A multi-billion parameter Transformer Decoder usually takes 30+ GB of VRAM to execute a forward pass.
|
||||
Most people do not have such a powerful computer or access to GPU hardware. By running trained LLMs through quantization algorithms,
|
||||
some GPT4All models can run on your laptop using only 4-8 GB of RAM, enabling their widespread usage.
|
||||
Bigger models might still require more RAM, however.
|
||||
|
||||
Any model trained with one of these architectures can be quantized and run locally with all GPT4All bindings and in the
|
||||
chat client. You can add new variants by contributing to the gpt4all-backend.
|
||||
|
||||
## Frequently Asked Questions
|
||||
Find answers to frequently asked questions by searching the [Github issues](https://github.com/nomic-ai/gpt4all/issues) or in the [documentation FAQ](gpt4all_faq.md).
|
||||
|
||||
## Getting the most of your local LLM
|
||||
|
||||
**Inference Speed**
|
||||
of a local LLM depends on two factors: model size and the number of tokens given as input.
|
||||
It is not advised to prompt local LLMs with large chunks of context as their inference speed will heavily degrade.
|
||||
You will likely want to run GPT4All models on GPU if you would like to utilize context windows larger than 750 tokens. Native GPU support for GPT4All models is planned.
|
||||
|
||||
**Inference Performance:**
|
||||
Which model is best? That question depends on your use-case. The ability of an LLM to faithfully follow instructions is conditioned
|
||||
on the quantity and diversity of the pre-training data it trained on and the diversity, quality and factuality of the data the LLM
|
||||
was fine-tuned on. A goal of GPT4All is to bring the most powerful local assistant model to your desktop and Nomic AI is actively
|
||||
working on efforts to improve their performance and quality.
|
||||
|