LangChain on the InterSystems PDF documentation

Today I bring you another example of applying LangChain.

Initially I wanted to build a "chain" to run dynamic searches against the HTML documentation, but in the end it turned out to be simpler to use the PDF version of the documentation.

Create a new virtual environment

mkdir chainpdf

cd chainpdf

python -m venv .


pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf

set OPENAI_API_KEY=[ Your OpenAI Key ]
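
Every later step calls the OpenAI API, so it can be handy to fail early when the key is missing. This is a minimal sketch and not part of the original setup; `require_api_key` is a hypothetical helper name of my own.

```python
import os

def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    # Hypothetical helper: 'env' can be any mapping, which keeps it easy to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running the chain")
    return key
```

Calling `require_api_key()` at the top of the script turns a late authentication failure into an immediate, readable error.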


Prepare the documents

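import glob
import zipfile
import wget

# extract docs
# The archive path is left empty here; set it to the downloaded documentation zip
with zipfile.ZipFile('','r') as zip_ref:
  # Assumption: extracting into the current directory yields ./pdfs/pdfs/
  zip_ref.extractall('.')

# get a list of files
pdfFiles=[file for file in glob.glob("./pdfs/pdfs/*")]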
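
The extract-and-list step can also be wrapped in one small function. The sketch below uses only the standard library; `extract_and_list_pdfs` and its parameters are hypothetical names, and it searches recursively for `*.pdf` instead of assuming a fixed `./pdfs/pdfs/` layout.

```python
import glob
import os
import zipfile

def extract_and_list_pdfs(zip_path, target_dir="."):
    """Unpack the archive into target_dir and return the sorted PDF paths below it."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    # '**' with recursive=True matches PDFs at any depth under target_dir
    pattern = os.path.join(target_dir, "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))
```

Filtering on the `.pdf` extension also keeps any stray non-PDF files in the archive away from the PDF loader.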

Load the documents into the vector store

import lancedb
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import LanceDB
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts.prompt import PromptTemplate
from langchain import OpenAI
from langchain.chains import LLMChain

embeddings = OpenAIEmbeddings()
db = lancedb.connect('lancedb')
table = db.create_table("my_table", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")

# Collect the chunks from every PDF in a single list
documentsAll=[]
pdfFiles=[file for file in glob.glob("./pdfs/pdfs/*")]
for file_name in pdfFiles:
  loader = PyPDFLoader(file_name)
  pages = loader.load_and_split()
  # Strip unwanted padding
  for page in pages:
    del page.lc_kwargs
  documents = CharacterTextSplitter().split_documents(pages)
  # Ignore the cover pages
  for document in documents[2:]:
    documentsAll.append(document)

# This will take a couple of minutes to complete
docsearch = LanceDB.from_documents(documentsAll, embeddings, connection=table)
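
To make the idea behind a vector store lookup more concrete, here is a toy sketch using only the standard library. It is not how LanceDB or the OpenAI embeddings work internally: word counts stand in for real embeddings, a linear scan stands in for the index, and cosine similarity is my choice of metric (a real store may use a different one). All function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(query, texts, k=2):
    """Return the k texts whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)
    return ranked[:k]
```

The real store follows the same shape: embed the query, rank the stored chunks by closeness, and return the top few.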

Prepare the search template

_GetDocWords_TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

PROMPT = PromptTemplate(
     input_variables=["docs","question"], template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)

chain = LLMChain(llm=llm, prompt=PROMPT)
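
To see what the model actually receives, the template can be filled in by hand. This is a rough sketch with plain `str.format`, which I am assuming is close enough to the default placeholder substitution for illustration; `render_prompt` is a hypothetical helper, and joining the chunks with blank lines is my own choice.

```python
TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

def render_prompt(question, docs):
    """Fill the two placeholders; docs is a list of text chunks."""
    return TEMPLATE.format(question=question, docs="\n\n".join(docs))

print(render_prompt("What is a File adapter", ["chunk one", "chunk two"]))
```

Printing the rendered prompt is a quick way to check how much text is being sent before paying for a real call.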

Are you sitting down?... Let's talk to the documentation

"What is a file adapter?"

# Ask the question
# First query the vector store for matching content
query = "What is a File adapter"
docs = docsearch.similarity_search(query)
# Only using the first two documents to reduce token search size on openai
chain.run(docs=docs[:2],question=query)


'\nA file adapter is a type of software that enables the transfer of data between two different systems. It is typically used to move data from one system to another, such as from a database to a file system, or from a file system to a database. It can also be used to move data between different types of systems, such as from a web server to a database.'

"What is a lock table?"

# Ask the question
# First query the vector store for matching content
query = "What is a lock table"
docs = docsearch.similarity_search(query)
# Only using the first two documents to reduce token search size on openai
chain.run(docs=docs[:2],question=query)


'\nA lock table is a system-wide, in-memory table maintained by InterSystems IRIS that records all current locks and the processes that have owned them. It is accessible via the Management Portal, where you can view the locks and (in rare cases, if needed) remove them.'
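
Both questions repeat the same two steps, so they can be folded into one helper. This is only a sketch: `ask` is a hypothetical name, and the search and chain functions are passed in as parameters so the helper does not depend on any particular library.

```python
def ask(question, search, run_chain, k=2):
    """Retrieve matching chunks, keep the first k, and pass them to the chain."""
    docs = search(question)
    # Only the first k chunks are sent, to keep the prompt small
    return run_chain(docs=docs[:k], question=question)

# Possible usage with the objects built earlier (assumed to be in scope):
# ask("What is a lock table", docsearch.similarity_search, chain.run)
```

Raising `k` gives the model more context at the cost of a larger prompt.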


I'll leave building a user interface on top of this functionality as a future exercise.

