Skip to main content

Build a Local RAG Application

The popularity of projects like PrivateGPT, llama.cpp, GPT4All, and llamafile underscore the importance of running LLMs locally.

LangChain has integrations with many open-source LLMs that can be run locally.

For example, here we show how to run OllamaEmbeddings or LLaMA2 locally (e.g., on your laptop) using local embeddings and a local LLM.

Document Loading​

First, install packages needed for local embeddings and vector storage.

Setup​

Dependencies​

We’ll use the following packages:

npm install --save langchain @langchain/community cheerio

LangSmith​

Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls. As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

Note that LangSmith is not needed, but it is helpful. If you do want to use LangSmith, after you sign up at the link above, make sure to set your environment variables to start logging traces:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=YOUR_KEY

Initial setup​

Load and split an example document.

We’ll use a blog post on agents as an example.

import "cheerio";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader(
"https://lilianweng.github.io/posts/2023-06-23-agent/"
);
const docs = await loader.load();

const textSplitter = new RecursiveCharacterTextSplitter({
chunkSize: 500,
chunkOverlap: 0,
});
const allSplits = await textSplitter.splitDocuments(docs);
console.log(allSplits.length);
146

Next, we’ll use OllamaEmbeddings for our local embeddings. Follow these instructions to set up and run a local Ollama instance.

import { OllamaEmbeddings } from "@langchain/community/embeddings/ollama";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const embeddings = new OllamaEmbeddings();
const vectorStore = await MemoryVectorStore.fromDocuments(
allSplits,
embeddings
);

Test similarity search is working with our local embeddings.

const question = "What are the approaches to Task Decomposition?";
const docs = await vectorStore.similaritySearch(question);
console.log(docs.length);
4

Model​

LLaMA2​

For local LLMs we’ll use also use ollama.

import { ChatOllama } from "@langchain/community/chat_models/ollama";

const ollamaLlm = new ChatOllama({
baseUrl: "http://localhost:11434", // Default value
model: "llama2", // Default value
});
const response = await ollamaLlm.invoke(
"Simulate a rap battle between Stephen Colbert and John Oliver"
);
console.log(response.content);

[The stage is set for a fierce rap battle between two of the funniest men on television. Stephen Colbert and John Oliver are standing face to face, each with their own microphone and confident smirk on their face.]

Stephen Colbert:
Yo, John Oliver, I heard you've been talking smack
About my show and my satire, saying it's all fake
But let me tell you something, brother, I'm the real deal
I've been making fun of politicians for years, with no conceal

John Oliver:
Oh, Stephen, you think you're so clever and smart
But your jokes are stale and your delivery's a work of art
You're just a pale imitation of the real deal, Jon Stewart
I'm the one who's really making waves, while you're just a little bird

Stephen Colbert:
Well, John, I may not be as loud as you, but I'm smarter
My satire is more subtle, and it goes right over their heads
I'm the one who's been exposing the truth for years
While you're just a British interloper, trying to steal the cheers

John Oliver:
Oh, Stephen, you may have your fans, but I've got the brains
My show is more than just slapstick and silly jokes, it's got depth and gains
I'm the one who's really making a difference, while you're just a clown
My satire is more than just a joke, it's a call to action, and I've got the crown

[The crowd cheers and chants as the two comedians continue their rap battle.]

Stephen Colbert:
You may have your fans, John, but I'm the king of satire
I've been making fun of politicians for years, and I'm still standing tall
My jokes are clever and smart, while yours are just plain dumb
I'm the one who's really in control, and you're just a pretender to the throne.

John Oliver:
Oh, Stephen, you may have your moment in the sun
But I'm the one who's really shining bright, and my star is just beginning to rise
My satire is more than just a joke, it's a call to action, and I've got the power
I'm the one who's really making a difference, and you're just a fleeting flower.

[The crowd continues to cheer and chant as the two comedians continue their rap battle.]

See the LangSmith trace here

Using in a chain​

We can create a summarization chain with either model by passing in the retrieved docs and a simple prompt.

It formats the prompt template using the input key values provided and passes the formatted string to LLama-V2, or another specified LLM.

import { RunnableSequence } from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const prompt = PromptTemplate.fromTemplate(
"Summarize the main themes in these retrieved docs: {context}"
);

const chain = await createStuffDocumentsChain({
llm: ollamaLlm,
outputParser: new StringOutputParser(),
prompt,
});
const question = "What are the approaches to Task Decomposition?";
const docs = await vectorStore.similaritySearch(question);
await chain.invoke({
context: docs,
});
"The main themes retrieved from the provided documents are:\n" +
"\n" +
"1. Sensory Memory: The ability to retain"... 1117 more characters

See the LangSmith trace here

Q&A​

We can also use the LangChain Prompt Hub to store and fetch prompts that are model-specific.

Let’s try with a default RAG prompt, here.

import { pull } from "langchain/hub";
import { ChatPromptTemplate } from "@langchain/core/prompts";

const ragPrompt = await pull<ChatPromptTemplate>("rlm/rag-prompt");

const chain = await createStuffDocumentsChain({
llm: ollamaLlm,
outputParser: new StringOutputParser(),
prompt: ragPrompt,
});

Let’s see what this prompt actually looks like:

console.log(
ragPrompt.promptMessages.map((msg) => msg.prompt.template).join("\n")
);
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
await chain.invoke({ context: docs, question });
"Task decomposition is a crucial step in breaking down complex problems into manageable parts for eff"... 1095 more characters

See the LangSmith trace here

Q&A with retrieval​

Instead of manually passing in docs, we can automatically retrieve them from our vector store based on the user question.

This will use a QA default prompt and will retrieve from the vectorDB.

import {
RunnablePassthrough,
RunnableSequence,
} from "@langchain/core/runnables";
import { formatDocumentsAsString } from "langchain/util/document";

const retriever = vectorStore.asRetriever();

const qaChain = RunnableSequence.from([
{
context: (input: { question: string }, callbacks) => {
const retrieverAndFormatter = retriever.pipe(formatDocumentsAsString);
return retrieverAndFormatter.invoke(input.question, callbacks);
},
question: new RunnablePassthrough(),
},
ragPrompt,
ollamaLlm,
new StringOutputParser(),
]);

await qaChain.invoke({ question });
"Based on the context provided, I understand that you are asking me to answer a question related to m"... 948 more characters

See the LangSmith trace here


Was this page helpful?


You can also leave detailed feedback on GitHub.