Week 11

The meeting with Professor Min-Yen was very interesting. He first heard us out on how the Blogwall and its poetry generation are currently implemented, and then suggested some ideas that could be used to improve poetry generation in the Blogwall.

Not an AI problem

Prof. Min-Yen started off by saying he believed the problem was not really an AI or NLP problem. As he put it, it is an information retrieval problem: a better way of retrieving information would improve the poetry generation in the system.

Term Frequency (TF) and Inverse Document Frequency (IDF)

He proposed a way to improve the relevance of the results (poems) that are returned for a particular word in an SMS. It consists of multiplying the term frequency (TF) by the inverse document frequency (IDF) to determine how important a particular word in the received text message is.

Term Frequency (TF)

Term Frequency is a measure of how often a term occurs in a single document, in this case a poem.

Inverse Document Frequency (IDF)

Inverse Document Frequency is effectively a measure of how rare a particular term is. It is usually calculated as the logarithm of the total collection size divided by the number of documents containing the term. Very common terms ("the", "and", etc.) have a very low IDF and are therefore often excluded from search results.

TF-IDF weight

The TF multiplied by the IDF then gives a statistical weight of how important a particular word is to a poem within the set of poems.

en.wikipedia.org/wiki/Tf-idf
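
The weighting described above can be sketched in a few lines of Python. This is only a minimal illustration, not how the Blogwall is implemented: the three poems are made up, the function names are hypothetical, TF is taken as a raw count within one poem, and IDF uses the common logarithmic form (both are assumptions; other variants exist).

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tf(term, tokens):
    """Raw count of `term` in one tokenized poem (assumed TF variant)."""
    return Counter(tokens)[term]


def idf(term, corpus):
    """log(N / df), where df is the number of poems containing `term`."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, tokens, corpus):
    """Weight of `term` for one poem within the whole collection."""
    return tf(term, tokens) * idf(term, corpus)


def rank_poems(term, corpus):
    """Return (score, poem index) pairs, highest score first."""
    scores = [(tf_idf(term, tokens, corpus), i) for i, tokens in enumerate(corpus)]
    return sorted(scores, reverse=True)


# Hypothetical three-poem collection, for illustration only.
poems = [
    "the rose is red and the night is long",
    "the sea and the sky",
    "a red red rose in the rain",
]
corpus = [tokenize(p) for p in poems]

# Rank the poems for each word of an incoming message.
for word in tokenize("the red rose"):
    print(word, rank_poems(word, corpus))
```

In this toy collection "the" occurs in every poem, so its IDF is log(1) = 0 and it contributes nothing to the ranking, while "red" favours the poem that repeats it.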

Evaluating the emotional content of a message

He also suggested a method to evaluate the mood of a message. The idea is to build up a list of what are called 'words of polarity', such as happy, sad and so forth. These words are then identified in the message and, from each one, the text is scanned left and right for qualifiers until a full stop is reached on both sides.
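
A rough sketch of this scan, in Python, might look as follows. The word lists, the numeric scores and the function name are all hypothetical, and treating negators and intensifiers as the "qualifiers" is an assumption made for illustration; the meeting did not specify them. Splitting the message on full stops is what bounds the left and right scan.

```python
import re

# Hypothetical word lists; a real system would need much larger ones.
POLARITY = {"happy": 1, "glad": 1, "sad": -1, "angry": -1}
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 2.0, "really": 2.0, "slightly": 0.5}


def message_mood(message):
    """Sum the polarity words of a message, adjusted by nearby qualifiers."""
    total = 0.0
    # Each chunk between full stops limits how far the scan can reach.
    for sentence in message.lower().split("."):
        tokens = re.findall(r"[a-z']+", sentence)
        for i, token in enumerate(tokens):
            if token not in POLARITY:
                continue
            score = float(POLARITY[token])
            # Look at every other word on both sides, up to the full stops.
            for j, other in enumerate(tokens):
                if j == i:
                    continue
                if other in NEGATORS:
                    score = -score
                elif other in INTENSIFIERS:
                    score *= INTENSIFIERS[other]
            total += score
    return total


print(message_mood("I am not happy. The day is very sad."))
```

A positive result suggests a happy message and a negative one a sad message. Because the scan relies only on word lists, it needs no POS tagger, although misspelled polarity words in an SMS would still be missed.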

Other ideas/pointers

He suggested trying this first and staying clear of POS (part-of-speech) taggers, since these can bring more complications into the system. In addition, he mentioned that text messages from a phone usually contain shorthand forms or spelling mistakes, which would greatly reduce the utility of such tools.

Resources

  • Poetry: Bartleby.com
  • Search engine to index text files: Apache Lucene
  • Website to identify/break down words: WordNet
  • WebBase Term Frequency