Retrieval Augmented Generation

Introduction to Retrieval Augmented Generation (RAG) with LangChain and Ollama

1. Why are we talking about Retrieval Augmented Generation (RAG) today?

Generative AI models (like GPT, Claude, or LLaMA) are capable of producing fluent and relevant text. But they have a major limitation: they can only answer based on what they learned during training. Therefore, they can:

  • Provide outdated information.
  • Invent nonexistent facts (hallucinations).
  • Ignore data specific to a company or domain.

This is where RAG (Retrieval Augmented Generation) comes in: it combines the power of generative models with an external knowledge base (documents, databases, internal files, etc.).

In short: before generating an answer, the model retrieves the relevant information from your data, then uses it to provide an accurate response.

Note: some tools like ChatGPT with web browsing or Perplexity give the impression that the model knows the Internet in real time. In reality, they combine a generative model with an external online search mechanism. This approach is already a form of RAG, but applied to the public web. The value of custom RAGs is that you can apply the same principle… to your own private data (internal documents, client databases, reports, etc.).


2. The principle of Retrieval Augmented Generation

A Retrieval Augmented Generation pipeline works in three steps:

  1. Document indexing
    Text is split into small pieces (chunks), then transformed into numerical vectors (embeddings) so they can be efficiently searched.
  2. Contextual search
    When a question is asked, the system searches your documents for the most relevant passages using a vector database.
  3. Augmented generation
    The language model takes these passages as context and generates a more reliable response, possibly including the sources used.

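To make these three steps concrete before introducing any library, here is a toy, dependency-free sketch. The keyword-overlap retrieval and the prompt-building function are simplified stand-ins for real embeddings and a real LLM; the actual implementation comes in section 4.

# Step 1 (toy indexing): split a tiny knowledge base into chunks, without real embeddings
knowledge = """Clara Mendoza lives in Barcelona, Spain.
She is fluent in Spanish, English, and French.
She works as an environmental policy analyst."""
chunks = [line.strip() for line in knowledge.splitlines() if line.strip()]

# Step 2 (toy retrieval): keep the chunks that share the most words with the question
def retrieve(question, chunks, k=2):
    q_words = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)[:k]

# Step 3 (augmented generation): build the prompt a real LLM would receive
def build_prompt(question, context_chunks):
    return "Answer using only this context:\n" + "\n".join(context_chunks) + "\n\nQuestion: " + question

question = "Where does Clara Mendoza live?"
print(build_prompt(question, retrieve(question, chunks)))
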
3. The tools we will use

For our example, we will build a small RAG with:

  • LangChain: a Python framework that makes it easier to build AI chains (LLM + retrieval + memory…).
  • Ollama: a tool that lets you easily download and run language models locally (for example LLaMA, Mistral, Gemma). In our code, we use gemma3:1b, a small local model.
  • FAISS: a library from Meta for efficient vector similarity search.
  • HuggingFace Embeddings: to transform our texts into numerical vectors.

ℹ️ It is not mandatory to use Ollama or open models; you can choose any model you like: open source (LLaMA, Mistral, Gemma…) or closed (GPT, Claude, etc.).
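
For example, swapping the local Ollama model for a closed one can be as simple as changing the LLM object passed to the chain. The sketch below assumes the langchain-openai package is installed and an OPENAI_API_KEY is set in your environment; the model name is only an illustration:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # any chat model supported by your provider works here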


4. Step-by-step implementation

We will start from a file doc.txt (our knowledge base). Here is the file we use for this example:

Bio of a Fictional Person

Name: Clara Mendoza
Age: 34
Location: Barcelona, Spain
Profession: Environmental Policy Analyst

Clara Mendoza is a passionate environmental policy analyst dedicated to shaping sustainable urban development. With over a decade of experience in climate policy and renewable energy initiatives, she has worked with NGOs, municipal governments, and international organizations to design strategies that balance economic growth with environmental preservation.

She holds a master’s degree in Environmental Policy and Governance from the London School of Economics, where her thesis focused on community-driven renewable energy projects in Southern Europe. Fluent in Spanish, English, and French, Clara has presented her work at global conferences, advocating for greener cities and stronger cross-border collaborations.

Beyond her professional life, Clara is an avid traveler and amateur photographer. Her weekends often involve hiking in the Pyrenees, experimenting with plant-based recipes, or capturing urban landscapes through her camera lens. She also volunteers as a mentor for young women entering the field of environmental science and policy.

a) Load the document

from langchain.docstore.document import Document

# Opening the document
with open("./doc.txt", "r", encoding="utf-8") as f:
    input_doc = f.read()

document = Document(page_content=input_doc)

Here, we read our text file and convert it into a Document object usable by LangChain.
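
Alternatively, LangChain can do the reading for you with its TextLoader, which returns a list of Document objects directly (this assumes the langchain-community package is installed):

from langchain_community.document_loaders import TextLoader

documents = TextLoader("./doc.txt", encoding="utf-8").load()  # a list containing one Document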


b) Split the text into chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=40)
chunks = splitter.split_documents([document])

We segment the document into small pieces of 150 characters, with an overlap of 40 characters to avoid cutting an idea in half.
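
You can quickly check what the splitter produced by printing the number of chunks and the first one:

print(f"{len(chunks)} chunks created")
print(chunks[0].page_content)  # first chunk, at most 150 characters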


c) Create the vector database

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

Each chunk is transformed into a numerical vector using a HuggingFace model, then stored in a FAISS vector database.
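
Before plugging in any LLM, you can already query the index on its own: similarity_search returns the chunks closest to the question in the embedding space.

hits = vector_store.similarity_search("What languages does Clara speak?", k=2)
for hit in hits:
    print(hit.page_content)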


d) Build the RAG chain

from langchain_ollama.llms import OllamaLLM
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=OllamaLLM(model="gemma3:1b"),  # local model executed with Ollama
    retriever=vector_store.as_retriever(),  # vector search over our FAISS index
    return_source_documents=True,           # also return the chunks used as context
    chain_type="stuff",                     # "stuff" = insert all retrieved chunks directly into the prompt
)

Here, we put everything together:

  • The LLM (gemma3:1b via Ollama).
  • The vector search engine.
  • A chain that sends the retrieved passages to the model.

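The retriever can also be tuned when it is created, for example to change how many chunks are passed to the model (k=3 below is an arbitrary choice):

retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # pass the 3 most relevant chunks to the model
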
e) Ask a question

result = qa_chain.invoke({"query": "What languages does Clara speak?"})
print("Answer:", result['result'])

print("\nSource Documents:")
for doc in result['source_documents']:
    print(f"- {doc.page_content}")

When we ask the question “What languages does Clara speak?”, the system searches our doc.txt and returns an answer along these lines:

Answer: Clara speaks Spanish, English, and French.
Source Documents:
- Clara Mendoza is an environmental policy analyst [...]
  She is fluent in Spanish, English, and French.

5. Complete code

Here is the full Python module you can run directly:

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain_ollama.llms import OllamaLLM
from langchain.chains import RetrievalQA

# Opening the document (knowledge base for the RAG)
with open("./doc.txt", "r", encoding="utf-8") as f:
    input_doc = f.read()

document = Document(page_content=input_doc)

# Automatic split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=40)
chunks = splitter.split_documents([document])

# Transforming chunks into vectors (embeddings)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeddings)

qa_chain = RetrievalQA.from_chain_type(
    llm=OllamaLLM(model="gemma3:1b"),  # We use the gemma3:1b model installed locally
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type="stuff",
)

# Prompt and retrieval of the answer (and sources)
result = qa_chain.invoke({"query": "What languages does Clara speak?"})
print("Answer:", result['result'])
print("\nSource Documents:")
for doc in result['source_documents']:
    print(f"- {doc.page_content}")

6. Conclusion

Retrieval Augmented Generation (RAG) makes it possible to transform a generic language model into a specialized assistant, capable of relying on your own data.

  • Applications are numerous: enterprise chatbots, research assistants, educational tools…
  • Thanks to frameworks like LangChain and tools like Ollama, it becomes easy to build a working prototype.

And most importantly, you stay in control: you can choose your own data and your own models (open source or closed).

 

To go further, check out our article on automation and agentic AI.

