Posts and Tags problem for InterSystems IRIS AI contest

Announcement

Evgeny Shvarov · Jun 30, 2020

#AI #Contest #Machine Learning #InterSystems IRIS #Open Exchange

Hi Developers!

Here in the Developer Community, posts are categorized by tags. Tags are specific topics that relate to InterSystems products, InterSystems services, or any concept related to software development, deployment, maintenance, etc.

Tags are helpful: you can follow or subscribe to a tag, filter search results by tag, see how popular or unpopular a topic is, and more.

And we have a problem!

Actually, two problems. The tags for a post are selected by its author, which leads to two issues: the author chooses the wrong tags for a post, or the post lacks the proper tags.

We think this problem could be solved with an AI/ML approach, so we suggest you solve it during the InterSystems IRIS AI Contest.

Here is the posts-and-tags repository. It is based on the Python Gateway template and contains two classes: Community.Post and Community.Tag.

Clone or fork it and run:

$ docker compose up -d

and it will build an InterSystems IRIS image and will load these two classes along with data from Post and Tag globals.

The Community.Post class contains the data on all the Developer Community posts, with the following fields:

Name - for the post title,
Text - for the post text,
Tags - for the comma-separated list of tags.

You can get the data with the following SQL query:

SELECT TOP 20 * FROM Community.Post ORDER BY ID DESC

And you can get the posts that have a particular tag with this query:

SELECT * FROM Community.Post WHERE ($LISTFIND($ListfromString(Tags,','),'Contest')>0) ORDER BY ID DESC
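Once the rows are on the Python side, the same exact-match filter on the comma-separated Tags field can be sketched as below. This is only an illustration: the sample rows are made up (they are not real posts), and the helper names split_tags and posts_with_tag are hypothetical, not part of the repository.

```python
from collections import Counter

# Hypothetical rows shaped like Community.Post (Name, Text, Tags); not real data.
posts = [
    {"Name": "Contest kickoff", "Text": "The AI contest starts today.", "Tags": "Contest,AI"},
    {"Name": "Docker tips", "Text": "Running IRIS in a container.", "Tags": "Docker"},
    {"Name": "Contest results", "Text": "Here are the winners.", "Tags": "Contest, Open Exchange"},
]


def split_tags(tags):
    """Split a comma-separated Tags value into a list, dropping blanks and padding."""
    return [t.strip() for t in (tags or "").split(",") if t.strip()]


def posts_with_tag(rows, tag):
    """Return the rows whose tag list contains `tag` as a whole element."""
    return [row for row in rows if tag in split_tags(row["Tags"])]


# How often each tag is used across all rows.
tag_counts = Counter(t for row in posts for t in split_tags(row["Tags"]))

for row in posts_with_tag(posts, "Contest"):
    print(row["Name"])
print(tag_counts.most_common())
```

Matching whole list elements, rather than searching for a substring, keeps a tag such as "AI" from matching inside a longer tag name.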

The Community.Tag class contains the tags and their descriptions.

The task

For every post, find the optimal set of tags that matches the text of the post.

Two hypotheses on how this could be solved:

1. Find matching tags for a post based on the tag descriptions. Every tag has a description, which could be matched against the title and content of the post.

2. Find proper tags assuming that the majority of the authors' choices are right. So if the text of a post is similar to the text of another post, it can have similar tags.
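The first hypothesis could be sketched roughly as follows: turn the post text and every tag description into TF-IDF vectors and rank the tags by cosine similarity. This is a minimal illustration using only the Python standard library, not a recommended solution. The tag descriptions and the post text are invented for the example, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_tags(post_text, tag_descriptions):
    """Rank tag names by similarity between the post and each description."""
    names = list(tag_descriptions)
    vectors = tfidf_vectors([post_text] + [tag_descriptions[n] for n in names])
    post_vec, tag_vecs = vectors[0], vectors[1:]
    scores = [(name, cosine(post_vec, vec)) for name, vec in zip(names, tag_vecs)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented descriptions standing in for Community.Tag rows.
tags = {
    "Docker": "Running InterSystems IRIS in Docker containers and images",
    "SQL": "Writing SQL queries against InterSystems IRIS tables",
    "Machine Learning": "Training and evaluating machine learning models",
}
post = "How do I build a Docker image and start the container with docker compose?"
ranking = rank_tags(post, tags)
print(ranking)
```

In practice a score threshold or a top-N cut-off would decide which of the ranked tags to keep. Short descriptions share few words with a post, so stemming or embeddings may work better than raw tokens.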

Looks like a typical data categorization problem, right?
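Seen as a categorization problem, the second hypothesis could be sketched as nearest-neighbour voting: find the already tagged posts whose text is most similar to a new post, and keep the tags that enough of those neighbours agree on. The sketch below uses Jaccard similarity on word sets, which is a simplifying choice made for the example. The sample posts, the function names, and the k and min_votes defaults are all illustrative assumptions.

```python
import re
from collections import Counter


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_tags(new_text, tagged_posts, k=2, min_votes=2):
    """Suggest tags used by at least `min_votes` of the `k` most similar posts."""
    new_tokens = tokens(new_text)
    neighbours = sorted(
        tagged_posts,
        key=lambda p: jaccard(new_tokens, tokens(p["Text"])),
        reverse=True,
    )[:k]
    votes = Counter(
        tag.strip()
        for p in neighbours
        for tag in p["Tags"].split(",")
        if tag.strip()
    )
    return [tag for tag, count in votes.most_common() if count >= min_votes]


# Invented posts standing in for Community.Post rows.
tagged_posts = [
    {"Text": "Deploy IRIS with docker compose and a custom image", "Tags": "Docker,Deployment"},
    {"Text": "Build a docker image for IRIS and run the container", "Tags": "Docker,Containers"},
    {"Text": "Tune a slow SQL query with indexes", "Tags": "SQL,Performance"},
]
new_text = "How to run IRIS in a docker container image"
print(suggest_tags(new_text, tagged_posts))
```

Because the approach trusts the majority of the authors' choices, a tag that only one neighbour uses is dropped. Raising k and min_votes together makes the suggestions more conservative.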

Also, it would be great to introduce new tags that we have probably missed, but that could represent posts we already have.
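One naive way to look for missing tags is to list the terms that appear in several posts but are not covered by any existing tag. The sketch below does this with single words only, so it will not discover multi-word tags. The stopword list is a tiny illustrative one, and the sample texts and the candidate_tags name are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "with", "how", "is", "on"}


def candidate_tags(post_texts, existing_tags, min_posts=2):
    """Return terms found in at least `min_posts` posts that are not tags yet."""
    known = {tag.lower() for tag in existing_tags}
    doc_freq = Counter()
    for text in post_texts:
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(terms - STOPWORDS - known)
    return sorted(term for term, count in doc_freq.items() if count >= min_posts)


# Invented post texts and tag names for the example.
texts = [
    "Deploy IRIS to Kubernetes with Docker",
    "Kubernetes operator for IRIS",
    "SQL tips for Docker users",
]
existing = ["Docker", "SQL", "IRIS"]
print(candidate_tags(texts, existing))
```

The output is only a list of candidates for a human to review, since frequent words are not necessarily good topics.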

I'm not a data scientist, so there is probably a more professional approach to this problem. Maybe iKnow, the InterSystems NLP engine, can be used here too.

Anyway: we have the problem, we have the data, and possibly we could find a solution using InterSystems IRIS.

We are looking forward to seeing your solutions!