Article
· Oct 18, 2023 · 3 min read

LangChain on the InterSystems PDF documentation

Today I bring you another example of a LangChain application.

Initially I wanted to build a "chain" to run dynamic searches over the HTML documentation, but in the end it turned out to be simpler to use the PDF version of the documentation.

Create a new virtual environment

mkdir chainpdf

cd chainpdf

python -m venv .

scripts\activate 

pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf

set OPENAI_API_KEY=[ Your OpenAI Key ]

python
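Before going any further, it can help to confirm from inside the Python session that the key set above is actually visible to the process. The following is a minimal sketch; the helper name `require_api_key` is hypothetical and not part of any library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key found in env, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before starting Python")
    return key
```

Calling `require_api_key()` at the start of the session fails early with a clear message instead of failing later, when the first embedding request is made.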

Prepare the documents

import glob
import zipfile

import wget

# Download the zipped PDF documentation
url = 'https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)

# Extract the documents
with zipfile.ZipFile('pdfs.zip', 'r') as zip_ref:
  zip_ref.extractall('.')

# Get a list of the extracted files
pdfFiles = glob.glob("./pdfs/pdfs/*")
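The glob pattern above picks up every entry in the extracted folder. If the archive ever contained something other than PDFs, a stricter listing could be sketched as follows. The helper name `list_pdf_files` is hypothetical, and the sketch assumes the PDFs sit directly inside the given folder.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the sorted paths of the .pdf files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

For example, `pdfFiles = list_pdf_files("./pdfs/pdfs")` would be a drop-in alternative to the glob call, with a stable ordering between runs.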

Load the documents into the Vector Store

import lancedb
from langchain import OpenAI
from langchain.chains import LLMChain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB


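# Create the embedding function and a local LanceDB database
embeddings = OpenAIEmbeddings()
db = lancedb.connect('lancedb')
# Seed the table with a single sample row (vector, text, id)
table = db.create_table("my_table", data=[
    {"vector": embeddings.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")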

documentsAll=[]
pdfFiles=[file for file in glob.glob("./pdfs/pdfs/*")]
for file_name in pdfFiles:
  loader = PyPDFLoader(file_name)
  pages = loader.load_and_split()
  # Strip unwanted padding (non-breaking spaces)
  for page in pages:
    del page.lc_kwargs
    page.page_content = page.page_content.replace('\xa0', '')
  documents = CharacterTextSplitter().split_documents(pages)
  # Skip the first two chunks (the cover pages)
  for document in documents[2:]:
    documentsAll.append(document)

# This will take a couple of minutes to complete
docsearch = LanceDB.from_documents(documentsAll, embeddings, connection=table)
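Conceptually, the vector store keeps one embedding per chunk and answers a query by ranking the chunks by how close their vectors are to the query vector. The toy sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. It is an illustration only: the function names and sample rows are hypothetical, and the indexing and distance metric that LanceDB actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=2):
    """Return the texts of the k rows whose vectors are closest to the query."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return [row["text"] for row in ranked[:k]]


# Toy rows shaped like the seed row: a vector plus the text it represents.
rows = [
    {"text": "file adapters", "vector": [1.0, 0.0, 0.0]},
    {"text": "lock table", "vector": [0.0, 1.0, 0.0]},
    {"text": "file services", "vector": [0.9, 0.1, 0.0]},
]
```

With these toy vectors, `top_k([1.0, 0.05, 0.0], rows)` ranks the two file-related texts ahead of the lock table entry. Real embeddings behave the same way in principle, just with far more dimensions.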

Prepare the search template

_GetDocWords_TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

PROMPT = PromptTemplate(
     input_variables=["docs","question"], template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)

chain = LLMChain(llm=llm, prompt=PROMPT)
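To see what text the model ends up receiving, the template can be filled in by hand. The sketch below assumes the template is substituted much like plain `str.format`, and it uses a hypothetical `Doc` class as a stand-in for the loaded document chunks. It also joins only the text of each chunk, which is one possible way to keep metadata out of the prompt and not necessarily what the chain does internally.

```python
from dataclasses import dataclass

TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document chunk."""
    page_content: str


def render_prompt(question, docs):
    """Fill the template, joining the chunk texts with blank lines."""
    joined = "\n\n".join(doc.page_content for doc in docs)
    return TEMPLATE.format(question=question, docs=joined)
```

Printing `render_prompt("What is a File adapter", [Doc("first chunk"), Doc("second chunk")])` shows the question on the first line, followed by the two chunk texts under the "By considering the following documents:" line.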

Are you sitting down?... Let's talk to the documentation

"What is a file adapter?"

# Ask the question
# First query the vector store for matching content
query = "What is a File adapter"
docs = docsearch.similarity_search(query)
# Only use the first two documents to reduce the number of tokens sent to OpenAI
chain.run(docs=docs[:2],question=query)

Answer:

'\nA file adapter is a type of software that enables the transfer of data between two different systems. It is typically used to move data from one system to another, such as from a database to a file system, or from a file system to a database. It can also be used to move data between different types of systems, such as from a web server to a database.'

"What is a lock table?"

# Ask the question
# First query the vector store for matching content
query = "What is a lock table"
docs = docsearch.similarity_search(query)
# Only use the first two documents to reduce the number of tokens sent to OpenAI
chain.run(docs=docs[:2],question=query)

Answer:

'\nA lock table is a system-wide, in-memory table maintained by InterSystems IRIS that records all current locks and the processes that have owned them. It is accessible via the Management Portal, where you can view the locks and (in rare cases, if needed) remove them.'
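Both questions follow the same two steps: search the vector store, then run the chain over the first few matches. That pattern could be wrapped in a small helper, sketched below. The name `ask` and the `max_docs` parameter are hypothetical. The helper only relies on the two calls already used above, `similarity_search` and `run`, on whatever objects are passed in.

```python
def ask(docsearch, chain, question, max_docs=2):
    """Retrieve matching chunks, then pass the first max_docs to the chain."""
    docs = docsearch.similarity_search(question)
    return chain.run(docs=docs[:max_docs], question=question)
```

With the objects built earlier, a question then becomes a single call such as `ask(docsearch, chain, "What is a lock table")`.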

 

I will leave building a user interface on top of this functionality as a future exercise.
