Up Your Game: Connect Your Fine-Tuned Custom GPT to a RAG Pipeline
Retrieval augmented generation (RAG) gives your custom GPTs access to your complete enterprise data.
RAG works by integrating with databases external to the custom GPT. When a user asks a question (prompts the custom GPT), the RAG pipeline first converts the prompt into a search query and retrieves relevant documents. In best-practice deployments, the enterprise already has hybrid enterprise search, combining vector embeddings (for semantic search) with keyword matching against an index.
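To make hybrid retrieval concrete, here is a minimal sketch in Python, assuming you already have two ranked lists of document IDs (one from the vector index, one from the keyword index). Reciprocal rank fusion is one common way to merge them; the names and data here are illustrative, not any particular vendor's API.

```python
# Merge ranked result lists from semantic and keyword search.
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document IDs into a single ranking.
    Documents ranked high in either list accumulate more score."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_17", "doc_03", "doc_42"]  # from the vector index
keyword_hits = ["doc_03", "doc_99", "doc_17"]   # from the keyword index
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# doc_03 and doc_17 rise to the top because both searches found them
```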
The RAG pipeline ranks the retrieved documents and then extracts the relevant structured and unstructured passages from them (a step called chunking).
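As a rough illustration of chunking, the sketch below splits text into overlapping word-based windows. Production pipelines more often chunk by tokens, sentences, or document headings; the sizes here are arbitrary assumptions.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks that overlap slightly,
    so sentences straddling a boundary are not lost."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[start:start + chunk_size])
            for start in range(0, len(words), step)]
```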
Next, the RAG pipeline provides these extracts (chunks) to the custom GPT as an augmentation to the prompt (query or question). Now the custom GPT can generate answers based on this augmented information without having been fine-tuned on this specific data. And the custom GPT can cite the sources used to generate its response.
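Here is a sketch of what the augmented prompt itself might look like. The instruction wording and the bracketed citation markers are conventions chosen for illustration, not a required format.

```python
def build_augmented_prompt(question, chunks):
    """Package retrieved chunks, with their sources, around the user's question."""
    context = "\n\n".join(
        f"[{i + 1}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources by their bracketed numbers. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```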
This is a meaningful step up from fine-tuned GPTs because:
· With RAG, your GPTs (chatbots or AI agents, assuming you have several, each fine-tuned for a specific need and expertise) now have access to all your enterprise information. RAG-connected GPTs are even more expert than fine-tuned GPTs (note: fine-tuning and RAG are not mutually exclusive; the most powerful AI agents combine both)
· RAG-connected AI agents are more trustworthy because their answers are always grounded in your enterprise knowledge
· RAG-connected GPTs always use up-to-date information. Without RAG, the GPT’s knowledge is limited to its most recent retraining
· RAG-connected GPTs face no limit on the size of the accessible knowledge base (index, embedded vectors)
· RAG-connected AI agents provide citations to the original source material they reference in formulating their answers
Fine-tuning by itself is a powerful upgrade over a generic GPT. However, fine-tuning teaches the underlying model how to use language grounded in your domain, whereas RAG adds comprehensive, detailed knowledge to the mix.
AI agents can be fine-tuned without RAG, or they can connect to a RAG pipeline without being fine-tuned. However, the most powerful solution is without doubt to connect fine-tuned GPTs to a RAG pipeline. And given the ease and low cost of fine-tuning (as discussed in the article “Fine-Tune a Custom GPT on OpenAI”, available on this blog), the approach laid out in this article assumes you are working with fine-tuned GPTs. However, it is also perfectly applicable for connecting generic GPTs to a RAG pipeline.
Custom GPT and RAG Integration
A Quick Review - What is a Custom GPT? - A custom GPT has been given instructions on how to behave (depending on whom it is interacting with) and domain knowledge used to adjust the weights of its underlying model, making it more expert on a specific topic or function. Custom instructions and fine-tuning are the foundations of custom GPTs. As powerful as customization is, it does have limitations. For example, the model can’t be retrained on the entirety of your enterprise’s knowledge and data. So, it may still have blind spots when answering product-specific technical questions or company-specific policy questions.
Integrating RAG (Retrieval-Augmented Generation) with GPTs means combining the generative abilities of the GPT model with a retrieval mechanism that pulls in external (to the GPT, but internal to the enterprise) real-time information from specific document sources. The goal is to enhance the GPT’s ability to provide answers with up-to-date, document-backed information, including citing sources from the external database.
In practical terms, integration means that when a user asks a question, the RAG pipeline first retrieves relevant passages or documents from enterprise data sources, enabling the GPT/chatbot/AI agent to generate a response that includes or references that information.
How RAG Integration Works:
Once built, a RAG pipeline performs two main steps: retrieval and generation.
· Retrieval Component:
o Set up a retrieval system that indexes the enterprise’s documents and data, creates vector embeddings, and links via connectors to your SaaS vendors for functions like CRM, HR, and Finance.
o This retrieval system uses various techniques like keyword search or more advanced semantic search (using embeddings) to find the most relevant documents or snippets based on the user's query.
o The retrieval step converts the user’s original question or prompt into a format useful for search. This can include techniques like semantic expansion to increase recall from the data sources.
· Generation Component (GPT):
o Once the relevant documents or text snippets are retrieved, they are ranked for relevance
o Next, snippets (or chunks) are extracted from the most relevant documents to augment the prompt given to the GPT
o The “chunks” of information are packaged with a new prompt. The prompt is based on the user’s original query but contains instructions telling the GPT how to use the augmented information. These instructions also contain any relevant information about the user that will help the GPT format its response (see the sketch after this list)
o The GPT uses the augmented information to generate a response grounded in truth, with deep expertise, and with citations
o The GPT combines the augmented content with its internal knowledge (including anything it’s learned through fine-tuning) to give a richer, more accurate answer.
o Citations let the user validate where the information came from, and allow the user to immediately drill down if desired
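Pulling the two components together, the sketch below wires the full retrieve-then-generate loop using the OpenAI Python SDK. The expand_query, hybrid_search, and rank functions are hypothetical stand-ins for whatever your retrieval system exposes; build_augmented_prompt is the earlier sketch, and only the final chat-completion call is a real API.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question, user_profile):
    query = expand_query(question)     # hypothetical: e.g., add synonyms for recall
    candidates = hybrid_search(query)  # hypothetical: vector + keyword retrieval
    chunks = rank(candidates)[:5]      # hypothetical: keep the top-ranked chunks
    prompt = build_augmented_prompt(question, chunks)  # from the earlier sketch
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # The system message carries behavioral instructions, including
            # anything known about the user that should shape the answer.
            {"role": "system",
             "content": f"You answer for a {user_profile} audience. "
                        "Ground every claim in the supplied context and cite sources."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```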
A Few Technical Features of a RAG Pipeline:
1. Embedding-Based Retrieval (RAG Process):
- Document Processing: First, all documents need to be processed (converted to text, cleaned up, etc.).
- Embedding Creation: Each document (or paragraph/sentence) is converted into a vector embedding, a mathematical representation of the text that captures its meaning. GPT or a separate model creates these embeddings.
- Indexing: These embeddings are stored in a specialized database (like Pinecone, Weaviate, or Elasticsearch) that allows for fast semantic search.
- Retrieval: When a user asks a question, the system converts the query into an embedding and searches for the most relevant document embeddings in the database.
- Augmentation: The retrieved text is then passed to GPT, which uses it to generate a final, informed response, potentially citing the document.
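The sketch below compresses embedding creation, indexing, and retrieval into a few lines, using OpenAI’s embeddings endpoint and an in-memory NumPy matrix in place of a real vector database. The model name and toy corpus are assumptions for illustration; at enterprise scale the matrix is replaced by Pinecone, Weaviate, or Elasticsearch.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Convert a list of texts into embedding vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Toy corpus; in practice these are your processed, chunked documents.
docs = [
    "The standard warranty covers Model X controllers for 24 months.",
    "Remote employees may expense home-office equipment up to $500 per year.",
]
doc_vecs = embed(docs)

def retrieve(query, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:top_k]]

print(retrieve("How long is the Model X warranty?", top_k=1))
```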
2. OpenAI and RAG:
- OpenAI doesn’t currently have a direct, out-of-the-box RAG integration within the standard custom GPT feature. To build DIY RAG, you would need to use external tools (like LangChain, Pinecone, or Elasticsearch) to manage the document retrieval and then pass those results to GPT.
- You can still build a custom GPT and then connect it to a RAG pipeline, essentially creating a system where GPT is augmented with external search results.
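As a sketch of that combination: below, retrieved snippets are passed to a fine-tuned model through an ordinary chat-completion call. The ft:... model ID is a placeholder for your own fine-tuned model, and retrieve() is the function from the embedding sketch above. Fine-tuning shapes how the model speaks; retrieval supplies what it knows.

```python
question = "How long is the Model X warranty?"
snippets = retrieve(question)  # from the embedding sketch above

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "Use only the provided context and cite it."},
        {"role": "user",
         "content": "Context:\n" + "\n".join(snippets) + f"\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```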
3. Technologies/Tools for RAG:
- LangChain: A framework designed to integrate language models (like GPT) with external knowledge sources. It helps manage the RAG process and can be used to retrieve documents from your research database.
- Pinecone/Weaviate: Vector databases that store embeddings for fast retrieval of semantically similar text. These help power the document retrieval side of RAG.
- Elasticsearch: A more traditional search engine that can be configured to support keyword and semantic search, useful for building a document retrieval system.
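For a sense of how a framework shortens the plumbing, here is a sketch using LangChain with an in-memory FAISS vector store (it assumes the langchain-openai, langchain-community, and faiss-cpu packages). LangChain’s import paths shift between releases, so treat the exact module names as assumptions; the pattern of embed, store, retrieve, generate is the point.

```python
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# Index the (already chunked) enterprise documents.
texts = ["The standard warranty covers Model X controllers for 24 months."]
store = FAISS.from_texts(texts, OpenAIEmbeddings())
retriever = store.as_retriever(search_kwargs={"k": 4})

# Retrieve, then generate with the context stuffed into the prompt.
question = "How long is the Model X warranty?"
docs = retriever.invoke(question)
context = "\n\n".join(d.page_content for d in docs)
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(f"Using this context:\n{context}\n\nAnswer: {question}")
print(answer.content)
```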
SaaS Vendors for RAG
Most SMEs won’t want to undertake DIY RAG. Fortunately, there are several very good SaaS vendors more than capable of building, hosting, and maintaining your RAG pipeline. Some of these vendors are fully integrated, offering enterprise search, RAG, and connections to multiple LLMs (where your GPTs/chatbots/AI agents reside). Additionally, there are SaaS vendors focused solely on the RAG component, but you would then need enterprise search already delivered by another SaaS vendor. A separate blog post discusses the players in the enterprise search and RAG sector.
Integrating Internet Search
Some RAG pipelines are designed not only to access a local or custom vector database and index but also to perform internet searches to gather the most up-to-date information from the web. This allows responses that combine information from your custom documents (like the research papers you've stored) with fresh, real-time data available online. The web search component uses APIs (such as the Google Search API or other search engines) to fetch the most relevant documents, web pages, or research articles available online at that moment. The information retrieved from a web search is re-ranked and chunked just like local content in preparation for augmenting a GPT prompt, and citations are likewise available when the GPT leans on web search results.
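A rough sketch of that blended flow, reusing helpers from the earlier sketches: web_search is a hypothetical wrapper around whichever search API you license, and the point is simply that web hits pass through the same chunk, rank, and cite machinery as local documents.

```python
def answer_with_web(question):
    # Local chunks come from the enterprise index (embedding sketch above).
    local_chunks = [{"source": "enterprise-index", "text": t}
                    for t in retrieve(question)]
    # Web chunks come from a hypothetical wrapper around a licensed search API.
    web_chunks = [{"source": page["url"], "text": chunk}
                  for page in web_search(question)        # hypothetical function
                  for chunk in chunk_text(page["text"])]  # chunking sketch above
    # From here, generation proceeds exactly as before; the citations now mix
    # URLs and internal documents.
    return build_augmented_prompt(question, local_chunks + web_chunks)
```

Contact us for more information.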