Question
· Jul 31, 2024

Newbie issues with XEP

Hello,

I'm brand new to InterSystems and their products.

Doing some research for my job, I've found myself going through the learning path Connecting Java Applications to InterSystems (https://learning.intersystems.com/course/view.php?id=879).

Working through the proposed demos (https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls... and https://github.com/intersystems/quickstarts-java), I cannot get the deleteExtent, importSchemaFull, and getEvent methods to work on the EventPersister.

I keep getting this error:
I've found that the tag <PROTECT> refers to this:

I'm creating a connection with SuperUser; I've also tried _SYSTEM. The IRIS instance is installed on the same machine as the Java app. Interacting with the database via JDBC and the Native API works fine.

If anyone has any helpful debugging tips, it would be greatly appreciated. If I can provide any other info, please let me know.

Thanks,

Alex

Article
· Jul 31, 2024 · 5 min read

IRIS-RAG-Gen: Personalizing ChatGPT RAG Application Powered by IRIS Vector Search


Hi Community,

In this article, I will introduce my application, iris-RAG-Gen.

iris-RAG-Gen is a generative AI Retrieval-Augmented Generation (RAG) application that leverages IRIS Vector Search to personalize ChatGPT, built with the Streamlit web framework, LangChain, and OpenAI. The application uses IRIS as a vector store.
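
Under the hood, the application connects to IRIS through a connection string. As a rough, hedged sketch of what that configuration can look like (the username, password, host, port, and namespace below are placeholder values, not the application's actual settings), a langchain_iris / SQLAlchemy connection string is typically built like this:

# Placeholder values; adjust to your own IRIS instance
username = "demo"
password = "demo"
hostname = "localhost"
port = 1972
namespace = "USER"

# Connection string used both by SQLAlchemy and by the IRISVector store
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"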

Application Features

  • Ingest Documents (PDF or TXT) into IRIS
  • Chat with the selected Ingested document
  • Delete Ingested Documents
  • OpenAI ChatGPT

Ingest Documents (PDF or TXT) into IRIS

Follow the steps below to ingest a document:

  • Enter OpenAI Key
  • Select Document (PDF or TXT)
  • Enter Document Description
  • Click on the Ingest Document Button


The Ingest Document functionality inserts the document details into the rag_documents table and creates a 'rag_document' + id table (where id is the new rag_documents row ID) to store the vector data.


The Python code below splits the selected document into chunks, converts the chunks into vectors, and saves them into IRIS:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain_iris import IRISVector
from langchain_openai import OpenAIEmbeddings
from sqlalchemy import create_engine,text

class RagOpr:
    #Ingest a document. Parameters: file path, description, and file type
    def ingestDoc(self,filePath,fileDesc,fileType):
        embeddings = OpenAIEmbeddings()	
        #Load the document based on the file type
        if fileType == "text/plain":
            loader = TextLoader(filePath)       
        elif fileType == "application/pdf":
            loader = PyPDFLoader(filePath)       
        
        #load data into documents
        documents = loader.load()        
        
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)
        #Split text into chunks
        texts = text_splitter.split_documents(documents)
        
        #Get the collection name from the rag_documents table
        COLLECTION_NAME = self.get_collection_name(fileDesc,fileType)
               
        # function to create collection_name table and store vector data in it.
        db = IRISVector.from_documents(
            embedding=embeddings,
            documents=texts,
            collection_name = COLLECTION_NAME,
            connection_string=self.CONNECTION_STRING,
        )

    #Get collection name
    def get_collection_name(self,fileDesc,fileType):
        # check if rag_documents table exists, if not then create it 
        with self.engine.connect() as conn:
            with conn.begin():     
                sql = text("""
                    SELECT *
                    FROM INFORMATION_SCHEMA.TABLES
                    WHERE TABLE_SCHEMA = 'SQLUser'
                    AND TABLE_NAME = 'rag_documents';
                    """)
                result = []
                try:
                    result = conn.execute(sql).fetchall()
                except Exception as err:
                    print("An exception occurred:", err)               
                    return ''
                #if table is not created, then create rag_documents table first
                if len(result) == 0:
                    sql = text("""
                        CREATE TABLE rag_documents (
                        description VARCHAR(255),
                        docType VARCHAR(50) )
                        """)
                    try:    
                        result = conn.execute(sql) 
                    except Exception as err:
                        print("An exception occurred:", err)                
                        return ''
        #Insert description value 
        with self.engine.connect() as conn:
            with conn.begin():     
                sql = text("""
                    INSERT INTO rag_documents 
                    (description,docType) 
                    VALUES (:desc,:ftype)
                    """)
                try:    
                    result = conn.execute(sql, {'desc':fileDesc,'ftype':fileType})
                except Exception as err:
                    print("An exception occurred:", err)                
                    return ''
                #select ID of last inserted record
                sql = text("""
                    SELECT LAST_IDENTITY()
                """)
                try:
                    result = conn.execute(sql).fetchall()
                except Exception as err:
                    print("An exception occurred:", err)
                    return ''
        return "rag_document"+str(result[0][0])
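
The snippets above reference self.engine and self.CONNECTION_STRING without showing how they are set up. A minimal, hypothetical sketch of that initialization and of a call to ingestDoc could look like this (the constructor and the file path below are illustrative, not the application's actual code):

from sqlalchemy import create_engine

class RagOpr:
    # Hypothetical constructor: the real application may configure this differently
    def __init__(self, connection_string):
        self.CONNECTION_STRING = connection_string
        # SQLAlchemy engine used by get_collection_name for the rag_documents table
        self.engine = create_engine(connection_string)

    # ... ingestDoc and get_collection_name as shown above ...

# Requires the OPENAI_API_KEY environment variable for OpenAIEmbeddings
rag = RagOpr("iris://demo:demo@localhost:1972/USER")
rag.ingestDoc("./docs/sample.pdf", "Sample PDF document", "application/pdf")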

 

Run the SQL command below in the Management Portal to retrieve the vector data:

SELECT TOP 5
id, embedding, document, metadata
FROM SQLUser.rag_document2

Chat with the selected Ingested document

Select the document from the select chat option section and type a question. The application will read the vector data and return the relevant answer.

The Python code below reads the stored vectors for the selected document and answers the question:

from langchain_iris import IRISVector
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryMemory


class RagOpr:
    def ragSearch(self,prompt,id):
        #Concatenate the document id with rag_document to get the collection name
        COLLECTION_NAME = "rag_document"+str(id)
        embeddings = OpenAIEmbeddings()	
        #Get vector store reference
        db2 = IRISVector(
            embedding_function=embeddings,    
            collection_name=COLLECTION_NAME,
            connection_string=self.CONNECTION_STRING,
        )
        #Similarity search
        docs_with_score = db2.similarity_search_with_score(prompt)
        #Prepare the retrieved documents to pass to the LLM
        relevant_docs = ["".join(str(doc.page_content)) + " " for doc, _ in docs_with_score]
        #init LLM
        llm = ChatOpenAI(
            temperature=0,    
            model_name="gpt-3.5-turbo"
        )
        #manage and handle LangChain multi-turn conversations
        conversation_sum = ConversationChain(
            llm=llm,
            memory= ConversationSummaryMemory(llm=llm),
            verbose=False
        )
        #Create prompt
        template = f"""
        Prompt: {prompt}
        Relevant Documents: {relevant_docs}
        """
        #Return the answer
        resp = conversation_sum(template)
        return resp['response']
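
And here is an illustrative example of how ragSearch might be called, assuming the same hypothetical constructor sketched earlier (the document id and the question are placeholders):

# Requires the OPENAI_API_KEY environment variable for OpenAIEmbeddings and ChatOpenAI
rag = RagOpr("iris://demo:demo@localhost:1972/USER")

# Ask a question against the vectors stored in the rag_document2 collection (id = 2)
answer = rag.ragSearch("What topics does this document cover?", 2)
print(answer)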

    


For more details, please visit the iris-RAG-Gen Open Exchange application page.

Thanks

Question
· Jul 31, 2024

Class Reference Local Access

As the online documentation is currently offline, is there a way to obtain at least the class reference locally?

Article
· Jul 31, 2024 · 4 min read

d[IA]gnosis: Vectorizing Diagnostics with Embedded Python and LLM Models

In the previous article, we presented the d[IA]gnosis application, developed to support the coding of diagnoses in ICD-10. In this article, we will see how InterSystems IRIS for Health provides the tools we need to generate vectors from the ICD-10 code list using a pre-trained language model, store those vectors, and then search for similarities across all of them.

Introduction

One of the main techniques that has emerged with the development of AI models is what we know as RAG (Retrieval-Augmented Generation), which allows us to improve the results of LLM models by giving the model additional context. In our example, the context is the set of ICD-10 diagnoses, and to use them we must first vectorize them.

How to vectorize our list of diagnoses?

SentenceTransformers and Embedded Python

To generate the vectors, we used the Python library SentenceTransformers, which greatly simplifies vectorizing free text with pre-trained models. From their own website:

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models (quickstart) or to calculate similarity scores using Cross-Encoder models (quickstart). This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.

Among all the models developed by the SentenceTransformers community, we found BioLORD-2023-M, a pre-trained model that generates 768-dimensional vectors.

This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.

State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD-2023 establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).

As you can see from its description, this model is pre-trained on medical concepts, which will be useful when vectorizing both our ICD-10 codes and free text.

For our project, we will download this model locally to speed up vector creation:

import os
import sentence_transformers

if not os.path.isdir('/shared/model/'):
    model = sentence_transformers.SentenceTransformer('FremyCompany/BioLORD-2023-M')
    model.save('/shared/model/')
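
As a quick sanity check (the description below is just an illustrative example), we can load the saved model and confirm the dimensionality of the vectors it produces:

import sentence_transformers

model = sentence_transformers.SentenceTransformer('/shared/model/')
# Encode a sample ICD-10-style description and inspect the embedding size
vector = model.encode("Cholera due to Vibrio cholerae 01, biovar cholerae", normalize_embeddings=True)
print(len(vector))  # 768 dimensions for BioLORD-2023-M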

Once the model is saved on our machine, we can pass the texts to vectorize as lists to speed up the process. Let's see how we vectorize the ICD-10 codes that we previously recorded in our ENCODER.Object.Codes class.

import iris
import sentence_transformers

# Fetch the next batch of ICD-10 codes that have not been vectorized yet
st = iris.sql.prepare("SELECT TOP 50 CodeId, Description FROM ENCODER_Object.Codes WHERE VectorDescription is null ORDER BY ID ASC ")
resultSet = st.execute()
df = resultSet.dataframe()

if (df.size > 0):
    # Load the locally saved BioLORD model and encode the descriptions in one batch
    model = sentence_transformers.SentenceTransformer("/shared/model/")
    embeddings = model.encode(df['description'].tolist(), normalize_embeddings=True)

    df['vectordescription'] = embeddings.tolist()

    # Store each embedding back into the VectorDescription column using TO_VECTOR
    stmt = iris.sql.prepare("UPDATE ENCODER_Object.Codes SET VectorDescription = TO_VECTOR(?,DECIMAL) WHERE CodeId = ?")
    for index, row in df.iterrows():
        rs = stmt.execute(str(row['vectordescription']), row['codeid'])
else:
    # No rows left to vectorize; flagLoop (defined in the surrounding loop) stops further batches
    flagLoop = False

As you can see, we first extract the codes stored in our ICD-10 code table that have not yet been vectorized (they were recorded in a previous step, after being extracted from the CSV file). We then take the list of descriptions to vectorize and, using the Python sentence_transformers library, load our model and generate the associated embeddings.

Finally, we update each ICD-10 code with its vectorized description by executing the UPDATE. As you can see, the command used to store the embedding returned by the model is the IRIS SQL function TO_VECTOR.

Using it in IRIS

Okay, we have our Python code, so we just need to wrap it in a class that extends Ens.BusinessProcess, include it in our production, and connect it to the Business Service in charge of retrieving the CSV file. That's it!

Let's take a look at what this code will look like in our production:

As you can see, we have our Business Service with the EnsLib.File.InboundAdapter adapter, which allows us to collect the code file and redirect it to our Business Process, where we perform all the vectorization and storage operations, giving us a set of records like the following:

Now our application would be ready to start looking for possible matches with the texts we send it!
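
The full similarity search is covered in the next article, but as a rough, hedged sketch (not the application's actual query), a cosine-similarity lookup over the vectors we just stored could look like this, using the IRIS SQL function VECTOR_COSINE and the same Embedded Python pattern (the free-text diagnosis below is just an illustration):

import iris
import sentence_transformers

# Vectorize an incoming free-text diagnosis with the same BioLORD model
model = sentence_transformers.SentenceTransformer("/shared/model/")
text_vector = model.encode("Fracture of the distal end of the radius", normalize_embeddings=True)

# Rank ICD-10 codes by cosine similarity between their stored vectors and the query vector
stmt = iris.sql.prepare(
    "SELECT TOP 5 CodeId, Description, "
    "VECTOR_COSINE(VectorDescription, TO_VECTOR(?, DECIMAL)) AS Similarity "
    "FROM ENCODER_Object.Codes ORDER BY Similarity DESC"
)
resultSet = stmt.execute(str(text_vector.tolist()))
print(resultSet.dataframe())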

In the following article...

In the next article we will show how the application front-end developed in Angular 17 is integrated with our production in IRIS for Health and how IRIS receives the texts to be analyzed, vectorizes them and searches for similarities in the ICD-10 code table.

Don't miss it!
