Sunday, 5 February 2017

Novice Algorithm design for Opinion Mining.

A Beginner’s guide to Opinion Mining.

Today, any newcomer to the world of Data Analysis will come across the term “Sentiment Analysis”.
Question: What is Sentiment Analysis?
Definition: Analysis of the customer/user feedback about a particular situation/product is called sentiment analysis.
Now, there are projects and papers that do this task with many advanced techniques, such as those from NLP.
I chose to do it using a very simple method: basically, I wanted a semi-supervised, search-based, supervised-updating algorithm.
The pseudocode of the algorithm:

1. Initialise:
    negative[n] = [bad, worse, worst, fucked, shit]
    positive[n] = [good, awesome, better, best, cool]
2. Input Reviews.txt
3. Read Reviews.txt
4. Clean Reviews.txt (remove special characters and punctuation)
5. Search Reviews.txt for the words in negative[n] and positive[n]
6. For each hit of a word from negative[n], we add -1 to the score.
7. For each hit of a word from positive[n], we add +1 to the score.

The code for this is:

# Word lists on which the analysis is based.
Good = ['nice', 'great', 'good', 'awesome', 'growth', 'bought', 'buy']
Bad = ['jerk', 'hate', 'change', 'privacy', 'problem', 'apple']

# Read the text file containing the tweets.
with open("Reviews.txt", "r") as f:
    raw = f.read()

# Lower-case every word and strip the surrounding punctuation.
words = [word.strip(",.!?;:\"'()") for word in raw.lower().split()]

positivity = 0
negativity = 0
no_significance = 0

for word in words:
    if word in Good:
        print("found " + word + " ++")
        positivity = positivity + 1
    elif word in Bad:
        print("found " + word + " --")
        negativity = negativity - 1
    else:
        # The word is in neither list, so it carries no sentiment.
        no_significance = no_significance + 1

print("\nthe input text has a positivity rating of : " + str(positivity))
print("\nthe input text has a negativity rating of : " + str(negativity))
print("\nUseless words: " + str(no_significance))

# Useless words carry no sentiment, so they do not enter the score.
total = positivity + negativity

print("\nTotal Score: " + str(total))

# negativity is never above zero, so compare the net score against zero.
if total > 0:
    print("\nRecommended Product")
else:
    print("\nNot recommended")

That is it, that is what I did.
The algorithm in itself is not that cool.
And you all know that I always want the cool :D
To make it a little more awesome, I used word-severity-based scoring.
So, in this case, instead of scoring all the words from negative[n] and positive[n] as $-1$ and $+1$ respectively, we give varied scores on the basis of the severity of the word.
Writing the words of negative[n] in the order of increasing severity, we get,
bad $>$ worse $>$ worst $>$ shit $>$ fucked,
the scores will be,
bad = $-1$
worse = $-2$
worst = $-3$
shit = $-4$ and
fucked = $-5$
Now, we do the same for the words in positive[n], the scores are,
good = $1$
awesome = $5$
better = $2$
best = $4$
cool = $3$
The implementation of this part is still under development.
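Until then, here is a minimal sketch of what severity-based scoring could look like. It is not the final implementation: the dictionary SEVERITY and the function severity_score are names made up for this sketch, and only the scores themselves come from the lists above.

```python
import re

# Severity scores copied from the lists above. The dictionary layout and
# the function name are assumptions made for this sketch.
SEVERITY = {
    "bad": -1, "worse": -2, "worst": -3, "shit": -4, "fucked": -5,
    "good": 1, "better": 2, "cool": 3, "best": 4, "awesome": 5,
}


def severity_score(text):
    """Return the sum of the severity scores of all known words in text."""
    # Lower-case the text and keep only runs of letters, which drops
    # punctuation and special characters in one step.
    words = re.findall(r"[a-z]+", text.lower())
    # Words missing from SEVERITY contribute 0 to the total.
    return sum(SEVERITY.get(word, 0) for word in words)


# awesome (+5) and bad (-1) are the only scored words here.
print(severity_score("The camera is awesome, but the battery is bad."))  # prints 4
```

A dictionary lookup replaces the two separate list searches, so adding a new word only means adding one entry with its score.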

Cheers!

Friday, 3 February 2017

Multiplication Algorithms.

I am presently enrolled in a course named Computational Techniques in Control Engineering. The syllabus is math-intensive and we are expected to be pretty fluent in programming too, which is a great combination.
The course is handled by Professor A Ramakalyan.
More details about the syllabus can be found at the website of National Institute of Technology, Trichy | Academic Curriculum | Instrumentation and Control Engineering.
So, there was one class which he concluded with a detailed derivation of the Gram-Schmidt Orthogonalization process.
And then he asked us for the product of $a + ib$ and $c + id$ . With the basic grade-school polynomial multiplication skill that we all have, we can easily work it out as follows,
$(a+ib) \times (c+id)$
$= a(c+id) + ib(c+id)$
$=ac + iad + ibc + i^{2}bd$
$=(ac-bd) + i(ad + bc)$

Note : $i^{2} = -1$

It was simple and straightforward, which was highly unlikely to happen in a course taken by that particular teacher.
He asked us to find the number of multiplications involved in computing the above product.
It is clear that there are 4 multiplications. He then asked us to find the same product using only 3 multiplications!
This is where the level of awesomeness hits so high that it almost looks CRAZY!
As soon as I got back to the room, I started Googling and looked at some random research papers and stuff.
Hence this blog post.
In this blog post, we will briefly study the essence of Multiplication Algorithms and look at the Karatsuba Algorithm in detail. We will also see the answer to the mind-blowing question by Ramakalyan Sir.

Karatsuba Algorithm.

The usual grade-school style of multiplying two $n$ -digit numbers takes around $\Theta(n^{2})$ time. Karatsuba, in the paper “A. Karatsuba and Yu. Ofman (1962). “Multiplication of Many-Digital Numbers by Automatic Computers”. Proceedings of the USSR Academy of Sciences. 145: 293–294. Translation in the academic journal Physics-Doklady, 7 (1963), pp. 595–596”, showed that this multiplication can be done faster, in $\Theta(n^{\log_{2}3}) \approx \Theta(n^{1.584963})$ time.
The saving in the exponent might seem pretty insignificant, but in cases where we tackle very large multiplication problems, this works wonders.
I was looking on Stack Overflow for a proper explanation when I found this answer. Let us look briefly into it,
$(a+ib) \times (c+id) = (ac-bd) + i(ad + bc)$
Let, $m = ac-bd$ and $n = ad+bc$
We are concerned with two numbers basically, $(ac-bd)$ and $(ad+bc)$ .
Now, we compute three multiplications, $x = ac$ , $y = bd$ , and $z= (a+b)(c+d)$ .
Now,
$m = x -y$
and, $n = z - x - y$ .
This is how the Karatsuba Algorithm works: it applies the same three-product trick recursively to the two halves of each number.
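The three products $x$ , $y$ and $z$ can be checked with a few lines of Python. This is only a sketch: the function name complex_product_3mult is made up here, and the numbers $a, b, c, d$ are passed as plain arguments.

```python
def complex_product_3mult(a, b, c, d):
    """Return (real, imag) of (a + ib)(c + id) using three multiplications."""
    x = a * c              # first multiplication
    y = b * d              # second multiplication
    z = (a + b) * (c + d)  # third multiplication, equals ac + ad + bc + bd
    m = x - y              # real part: ac - bd
    n = z - x - y          # imaginary part: ad + bc
    return m, n


# (1 + 2i)(3 + 4i) = 3 + 4i + 6i + 8i^2 = -5 + 10i
print(complex_product_3mult(1, 2, 3, 4))  # prints (-5, 10)
```

The result agrees with Python's built-in complex arithmetic, where complex(1, 2) * complex(3, 4) evaluates to (-5+10j).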
To keep stuff in perspective, let us consider an example where two $1024$ -digit numbers are multiplied [ $n = 1024 = 2^{10}$ ].
As per the classical multiplication algorithm, it will require $n^{2}$ single-digit multiplications, that means $2^{20}$ multiplications; however, by the Karatsuba Algorithm, this number can be reduced to $n^{\log_{2}3} = 2^{10\log_{2}3} = 2^{\log_{2}3^{10}} = 3^{10}$ single-digit multiplications. And, if you look into the values of $3^{10} = 59049$ and $2^{20} = 1048576$ , you will see that the latter is much greater than the former.
The Karatsuba algorithm was the first algorithm to be asymptotically faster than the traditional multiplication algorithm, which takes quadratic time.
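For integers, the same idea can be sketched as a short recursive function. This is a minimal illustration and not a tuned implementation: it splits on decimal digits, the name karatsuba and the variables half, base, high, low and cross are choices made for this sketch, and real libraries split on machine words instead.

```python
def karatsuba(x, y):
    """Multiply two non-negative integers with Karatsuba's three-product split."""
    # Base case: a single-digit operand is handled by one ordinary product.
    if x < 10 or y < 10:
        return x * y

    # Split both numbers around half the digit count of the longer one.
    half = max(len(str(x)), len(str(y))) // 2
    base = 10 ** half
    a, b = divmod(x, base)  # x = a * base + b
    c, d = divmod(y, base)  # y = c * base + d

    # Three recursive products instead of four.
    high = karatsuba(a, c)
    low = karatsuba(b, d)
    cross = karatsuba(a + b, c + d) - high - low  # equals a*d + b*c

    return high * base * base + cross * base + low


print(karatsuba(1234, 5678))  # prints 7006652
print(3 ** 10, 2 ** 20)       # prints 59049 1048576
```

The last line prints the two multiplication counts from the $1024$ -digit example side by side.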

Side note:

We must also consider the Gauss Complex Multiplication algorithm. It speaks precisely of the $(a+ib) \times (c+id)$ problem.
This also works by decreasing the number of multiplications and increasing the number of additions and subtractions.
For the product $(a+ib) \times (c+id)$ ,
we find,
$x = c \cdot (a+b)$
$y = a \cdot (d-c)$
$z = b \cdot (c+d)$
Finally, the real and imaginary parts are,
$Re((a+ib) \times (c+id)) = x-z$
$Im((a+ib) \times (c+id)) = x + y$
If we look at the traditional method, we will see that we use $4$ multiplications, $1$ subtraction, and $1$ addition. However, in this method, we use $3$ multiplications, $3$ additions, and $2$ subtractions.
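Gauss's three products can be verified the same way. Again, this is just a sketch, and the function name gauss_complex_product is made up for it.

```python
def gauss_complex_product(a, b, c, d):
    """Return (real, imag) of (a + ib)(c + id) using Gauss's three products."""
    x = c * (a + b)  # ac + bc
    y = a * (d - c)  # ad - ac
    z = b * (c + d)  # bc + bd
    real = x - z     # ac - bd
    imag = x + y     # ad + bc
    return real, imag


# (1 + 2i)(3 + 4i) = -5 + 10i
print(gauss_complex_product(1, 2, 3, 4))  # prints (-5, 10)
```

Counting the operations in the function body gives three multiplications, three additions ( $a+b$ , $c+d$ , $x+y$ ) and two subtractions ( $d-c$ , $x-z$ ).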

Cheers!