Progress Up To CA2

## Overview

The blogwall aims to open up new avenues to public artistic expression. The system will display text and pictures on a screen in public. My task is concerned with the poetry generation and display system. Keywords from incoming text messages from users will be analyzed and used to create poetry.

To this end, a database and matching algorithm is needed to make the generated poetry fun, relevant and pleasantly surprising.

## Poetry Generation

### Current System

Simple 'dumb' system. Algorithmically, it works as follows:

• Pick longest words from user message
• Lookup synonyms from an online dictionary
• Randomly select three lines containing the words/synonyms from poem database

As it can be expected, the quality of the results in the current system can vary considerably. A better and more accurate algorithm, which can select poems of greater relevance is required.

### Possible Solutions

I'll briefly go through the solutions that were considered first before elaborating on the final chosen method. In order to make the system more appealing and intelligent in the minds of users, I'd mentioned a number of ideas concerned with both the UI and aesthetics as well as the poetry generation algorithm. This page only deals with the latter. A more detailed overview of the process and train of thought can be found in the weekly updates.

#### Natural Language Parser

For a while, we were concerned with understanding the context of the user text message. For this approach to have any benefit, a large amount of external information on the context is required. Obtaining this information is neither easy nor simple.

Furthermore, in the context of text messages, when keywords alone are sent (say a user sends “mountain morning”), the issue becomes even more complicated. For best results, limiting the input stimuli was considered. For instance, users would send messages of a specific structure containing clearly identifiable objects such as a subject, verb and object.

#### Part of Speech tagging

At this stage, part of speech tagging was considered. I spent some time identifying and selecting a suitable form of poetry, from which patterns can be extracted and also start work on a simple parser.

However, it soon became clear that this was not the optimal solution. Text messages are usually written in shorthand with abbreviations for common words and prepositions. These could be stumbling blocks to a system that functions well and also gives users freedom of expression at the same time.

### A Data Mining Problem

To get more help and feedback, I thought it'd be good to meet an experienced professor in this field. Professor Kan Min-Yen's research interests include natural language processing, information retrieval and human computer interaction among others. I felt he could give us valuable advice on how best to proceed.

His feedback was extremely valuable. The problem, he said, was about data mining and information retrieval. A better way to find and retrieve relevant poems from the database would improve the performance of the system. I will now proceed to outline the chosen solution.

## New Poetry Generation Algorithm

The poems are considered a set of records or documents. The task is to pick the most relevant document by looking at the SMS message. This message is a query to find the most relevant document/poem.

The algorithm is as follows:

• Receive incoming text message from user
• Identify keywords and obtain a weighted score for each to find the relative importance of the different terms
• Pick the most relevant poem using the scores

While the process may look simple, there is a lot of processing done before and during each step. I will elaborate upon this.

### Importance of words

This is where the scores come from. Not all words in the query have the same level of importance. Identifying the most significant words, and using a weighted scoring system to select poems will improve the performance of the system.

The score is a product of the term frequency (tf) and the inverse document frequency (idf).

### Term Frequency (TF)

Term Frequency is a measure of how often a term is found in a collection of documents, in this case poems.

### Inverse Document Frequency (IDF)

Inverse Document Frequency is effectively a measure of how rare a particular term is. It is calculated by total collection size divided by the number of documents containing the term. Very common terms ("the", "and" etc.) will have a very low IDF and are therefore often excluded from search results.

### TF-IDF weight

Then TF divided by the IDF is a statistical weight of how important a particular word in the set of poems.

Given a query of i words, the end result is to calculate this weight (w) for each word in every poem.

(1)
\begin{align} w _{i,d} = tf _{i,d} \times \log (n / df _{i}) \end{align}

The system then returns the poem(s) such that $\sum_{i=0} w _{i,d}$ is maximized.

## Poetry Database

The poems will have to be indexed and $w _{i,d}$ calculated before users input text messages. This will require a substantial change in the current database structure.

The database has been migrated to MySQL, a preliminary design has been completed. This is illustrated in the following diagram.

A few remarks from the diagram:

• tf and idf denote term frequency and inverse term frequency respectively.
• A keywords table is used to store the inverse document frequency. This parameter will have to be updated everytime a new poem is added.
• Term frequency is specific to each poem, and is therefore stored in an intermediate table which links the poem id and keyword id.
• Before the system is in use, a round of indexing will be required. During this process, the system will populate the keywords table, and the term and inverse document frequencies for each entry. The associated tables are also populated accordingly.
• The poemlines and keywords table links up the keywords and poemlines tables. In other words, it shows the lines in which a particular keyword is found.

In the context of this new system, the following has been completed:

• Design of database. This is illustrated in the previous diagram.
• Program to calculate the term frequency given a line of poetry. I have a procedure with analyzes a line of poetry, splits it into its component words and calculates the frequency of each.
• Interfacing with new MySQL database backend. I have installed the tools, created/tested procedures to read and write from tables using the MySQL backend.

The following list shows the tasks to be done:

• Implementation of the database
• Program to calculate the inverse document frequency
• Populating the database with poetry

I will be working on the project during the December vacations and I hope to have the database, as well as the procedures to index poetry up and running during this period.

## Possible Problems

The last point deserves some elaboration since it could be considered one of the potential bottlenecks in the realization of the new poetry generation system. For the system to perform optimally, a large poetry database is required. During the discussion with Prof. Min-Yen, he mentioned Bartleby.com as a possible source of poetry. The work is available under the public domain in the form of text files. However, I have found that parsing these files to populate the database is not a trivial task and there might be some difficulties in that.

page revision: 10, last edited: 29 Nov 2007 11:27