How do I calculate the difference between two texts using iKnow?

iKnow

I'm in the process of acquiring a corpus of documents on educational courses.

For example, there is an educational course called "OOP", and it can have documents from 2008, 2009, ... 2016, etc.
There are a lot of these courses, each one with programs from different years (hopefully).

So one document is the program of one course for one year.

I want to calculate how much a course changes per year.

Here's an example of information I want to get:

Can I get it via iKnow? How?


Answers

Hi Edward,

The thing that comes closest here would be the %iKnow.Queries.SourceAPI:GetSimilar() query, which, for a given seed document, looks for the most similar ones in a domain, optionally constrained by a filter object. The results of that query include a figure like the one you're looking for, expressing how many entities were new in the seed document vs. the corpus it's comparing against. Although that particular calculation isn't available as an atomic function, a simple way to get what you want would be to use the %iKnow.Filters.SourceIdFilter and just compare against an individual document.

If you prefer to write more code :o), you can just look up the entities in the one document and compare them against those in the others through the %iKnow.Objects.EntityInSourceDetails SQL projection.
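To illustrate the "look up the entities and compare them" idea, here is a minimal sketch in Python rather than iKnow code. It assumes the entities of each yearly program have already been retrieved somehow (for instance from the SQL projection mentioned above) and loaded into plain sets; the function name `yearly_change` and the sample data are hypothetical.

```python
def yearly_change(entities_by_year):
    """Map (previous_year, year) to the share of entities in `year`
    that did not occur in `previous_year`."""
    years = sorted(entities_by_year)
    changes = {}
    for prev, cur in zip(years, years[1:]):
        cur_entities = entities_by_year[cur]
        new_entities = cur_entities - entities_by_year[prev]
        # Guard against an empty document to avoid dividing by zero.
        changes[(prev, cur)] = (
            len(new_entities) / len(cur_entities) if cur_entities else 0.0
        )
    return changes


# Hypothetical entity sets for one course; real ones would come from iKnow.
oop = {
    2008: {"class", "object", "inheritance", "pascal"},
    2009: {"class", "object", "inheritance", "java"},
    2010: {"class", "object", "interface", "java", "generics"},
}
changes = yearly_change(oop)
for (prev, cur), share in changes.items():
    print(f"{prev} -> {cur}: {share:.0%} new entities")
```

This only counts distinct entities; weighting them by frequency or dominance would be a separate choice.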

Regards,

benjamin

Hello.

Thank you for this information. I started testing it, and %iKnow.Queries.SourceAPI:GetSimilar() returned the following in the result local:

result(1)=$lb(890,":SQL:2002:20020308X00320",.4737,.9606,57,27,686,.4737)

The list is formed from these values:

$lb(srcId, externalId, percentageMatched, percentageNew, nbOfTgtsInRefSrc, nbOfTgtsInCommon, nbOfTgtsInSimSrc, matchScore)

What does that mean?

  • srcId - source ID of similar document
  • externalId - external source id of similar document
  • percentageMatched - number of targets common between source and similar documents divided by number of targets in source document
  • percentageNew - number of targets in similar document that are not present in source document divided by total number of targets in similar document
  • nbOfTgtsInRefSrc - number of targets in source document
  • nbOfTgtsInCommon - number of targets common between source and similar documents
  • nbOfTgtsInSimSrc - number of targets in similar document
  • matchScore - seems equal to percentageMatched
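The sample row above is consistent with this reading. A quick arithmetic check in Python (not iKnow code; the function name `similarity_figures` is my own):

```python
def similarity_figures(nb_ref, nb_common, nb_sim):
    """Recompute the two percentages from the three target counts."""
    # Common targets relative to the reference (source) document.
    percentage_matched = nb_common / nb_ref
    # Targets only in the similar document, relative to that document.
    percentage_new = (nb_sim - nb_common) / nb_sim
    return round(percentage_matched, 4), round(percentage_new, 4)


# Counts taken from result(1): 57 in the reference source, 27 in common,
# 686 in the similar source.
figures = similarity_figures(57, 27, 686)
print(figures)  # (0.4737, 0.9606), matching the values in result(1)
```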

Is that correct? Is there documentation on that?

You are entirely correct.

The separate matchScore column is there to accommodate methods where the score is more refined than the pure count-based one of $$$SIMSRCSIMPLE. With $$$SIMSRCDOMENTS, dominance is accounted for in this metric, and you'll see that it differs from percentageMatched.

you are entirely correct. 

Good to hear.

$$$SIMSRCDOMENTS

If I change the algorithm to $$$SIMSRCDOMENTS, I don't receive any results (the Results local is undefined).

If I choose $$$SIMSRCEQUIVS or $$$SIMSRCSIMPLE, I get the Results local as expected.

What could be the reason? I didn't modify the domain between runs.

The method returns $$$OK and %objlasterror is empty with every algorithm.

The $$$SIMSRCDOMENTS algorithm is much more restrictive and may not yield any results if your domain is small and the sources are too far apart. I see results when trying it on the Aviation demo dataset. Note that you can loosen it by setting the "strict" parameter to 0, as described in the class reference.

That third alternative you quoted has been deprecated and does not add anything to the regular $$$SIMSRCSIMPLE option. You dug too deep in the code ;o)

 

Regards,
benjamin

I would look into some kind of string distance measure, for example Levenshtein distance.
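As a rough illustration, here is a minimal Python sketch of Levenshtein distance using the standard two-row dynamic programming approach. It is independent of iKnow, and the function name `levenshtein` is just for this example. Because it works on any sequence, it can compare two programs character by character or word by word.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence `a` into sequence `b`."""
    # Keep the shorter sequence in `b` so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
# Word-level comparison of two hypothetical program titles:
print(levenshtein("intro to oop".split(), "intro to java oop".split()))  # 1
```

For long documents a word-level distance normalised by the longer length is usually more readable than a raw character count, but that normalisation is a design choice, not part of the algorithm.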