Saturday, July 30, 2016

Authorship Attribution of WhatsApp Users through Lexical Analysis 


Imagine you’re being messaged by a stranger. Your gut says it’s someone you already know. How do you find out which one of your diabolical friends is pulling this off?

 Problem

Let’s generalize the problem. Given chat histories of 2 users and a large enough corpus from an anonymous user, we want to find which user has a writing style which is similar to that of the anonymous user.
It’s a good ol’ binary classification problem.
To recognize the person, we can analyze different features of their writing style.

 Possible approaches

Lexical features
Syntactic features
Bag of words model
Lexical analysis is what we’ll be using here. We can analyze certain properties like sentence length variation, number of words per sentence, use of punctuation, emoticons, etc.
Syntactically, we can analyze the use of nouns, pronouns, adverbs, singular/plural words, and so on.
With the bag of words model, we disregard the structure of the sentences and look only at which words are used.
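To make the bag-of-words idea concrete, here’s a tiny sketch using scikit-learn’s CountVectorizer (not something we use below; the sample sentences are made up):
from sklearn.feature_extraction.text import CountVectorizer

# two made-up "documents"; word order is thrown away, only counts remain
docs = ["machan I will come tomorrow",
        "machan machan come today"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print vectorizer.get_feature_names()  # the vocabulary
print counts.toarray()                # one word-count row per document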

 Solution

So here’s what we’re going to do:
  1. Extract text from chat histories
  2. Analyze the text
  3. Prepare the model
  4. Classify
Step 1:
Following is the code I used to extract the messages of each user.
import glob
import os

source = os.path.expanduser("~/projects/python/nlp/attribution/corpus/chats")
files = sorted(glob.glob(os.path.join(source, "archive.txt")))

hasanga = []
pasan = []

def extractMessage(user, msg):
    global hasanga, pasan
    msg = msg.replace('\n', ' ')
    msg = msg.replace('<Media omitted>', '')
    if user == "Pasan Missaka Ucsc":
        pasan.append(msg)
    elif user == "Hasanga":
        hasanga.append(msg)


for fn in files:
    with open(fn) as f:
        for line in f:
            # exported lines look like "7/30/16, 9:15 PM - User: message";
            # skip continuation lines that don't carry a timestamp
            if 'M -' not in line:
                continue
            time, msg = line.split('M -', 1)
            if msg.count(':') == 1:
                user, msg = msg.split(':', 1)
                user = user.strip()
                extractMessage(user, msg)


Step 2 and 3:
Then we extract the features from the text. We’ll look at the average number of words per sentence, the standard deviation of sentence length, and the diversity of words. Finding them is pretty easy with an NLP library like NLTK.
import nltk
import numpy as np
from scipy.cluster.vq import whiten

sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
word_tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# unknown holds the anonymous user's messages, extracted the same way as above
txt = [' '.join(pasan), ' '.join(hasanga), ' '.join(unknown)]
feature_matrix = np.zeros((len(txt), 3), np.float64)
for i, ch_text in enumerate(txt):
    ch_text = ch_text.decode('utf-8')
    lowercased = ch_text.lower()

    # tokenize the words and sentences
    words = word_tokenizer.tokenize(lowercased)
    sentences = sentence_tokenizer.tokenize(ch_text)

    # get the unique words
    vocab = set(words)

    # words per sentence
    wps = np.array([len(word_tokenizer.tokenize(s)) for s in sentences])

    # average number of words per sentence
    feature_matrix[i, 0] = wps.mean()
    # standard deviation of sentence length
    feature_matrix[i, 1] = wps.std()
    # lexical diversity: unique words / total words
    feature_matrix[i, 2] = len(vocab) / float(len(words))

# normalize the features
feature_matrix = whiten(feature_matrix)
Step 4:
Let’s use scikit-learn’s implementation of the k-means algorithm to do the binary classification (strictly speaking, clustering into 2 groups).
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, init='k-means++', n_init=10, verbose=1)
print km.fit(feature_matrix).labels_

And you’ll get something like 1 1 0.
So the first 2 texts belong to the same user.

 Limitations

For the above approach to give good results, the corpus should be fairly large. WhatsApp being a short-message service, chances are we only have a limited number of messages.
This is where things get hazy. We’ll probably have to consider multiple features, including sample-specific ones (say, emoticons), rather than relying on lexical features alone; a rough sketch of such a feature follows.
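For instance (the regex here is a toy covering only a few classic emoticons):
import re

# matches a few classic emoticons like :) :-( ;D =P
EMOTICON = re.compile(r'[:;=]-?[()DPp]')

def emoticon_rate(text, words):
    # emoticons per word, so long and short corpora stay comparable
    return len(EMOTICON.findall(text)) / float(max(len(words), 1))
Such a value could simply be appended as an extra column of the feature matrix before whitening.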

Wednesday, February 3, 2016

The Michael Jordan in me


Ever heard of Michael B. Jordan? (Hint: He is not a basketball player)

He is the actor behind “that guy on fire” in Fantastic Four. Guess what? He looks 22% similar to me, and here’s how I found out.
So I was reading about PCA and wanted to find which Hollywood actor looks closest to me. Well, who wouldn’t want to?
Eigenfaces are basically an approach that captures the variations (AKA principal components) of a set of images and uses that information to compare a new image against them. How eigenfaces work is outside the scope of this post.
There’s a distinction between face detection and face recognition, which many places on the interweb seem to get wrong. Face detection is finding a face in an image; it’s usually done with a haar classifier, and training one is detailed in one of my previous posts. Face recognition, on the other hand, is finding the similarity of a given image to the images in a set, which is pretty hard compared to detection.
Back to the topic. The process involves 3 steps:
  1. Collecting the faces of actors
  2. Training the classifier
  3. Recognizing a new image
I ran a Google search on ‘Hollywood actors’ and saved the results page, then copied all the images from its resources directory. That’s a pretty old-school way of doing it. Initially I wrote a little script to scrape the images, but it returned only 20, since Google lazy-loads the rest as you scroll. Images straight out of Google can’t be used as-is, because we want the faces and nothing else, so I used a haar classifier to detect and crop the faces out of the images. Here’s the code.
import cv2
import glob

# load the cascade once instead of once per image
cascade = cv2.CascadeClassifier("haar/haarcascade_frontalface_alt.xml")

def detect(path):
    img = cv2.imread(path)
    rects = cascade.detectMultiScale(img, 1.3, 4, cv2.cv.CV_HAAR_SCALE_IMAGE, (20, 20))

    if len(rects) == 0:
        return [], img
    rects[:, 2:] += rects[:, :2]    # (x, y, w, h) -> (x1, y1, x2, y2)
    return rects, img

def box(rects, img, file_name):
    for idx, (x1, y1, x2, y2) in enumerate(rects):
        cut = img[y1:y2, x1:x2]     # the rectangle containing a face
        out_name = '%s_%s.jpg' % (file_name, idx)   # one file per face
        print 'Writing ' + out_name
        cv2.imwrite('cropped/' + out_name, cut)     # write the cropped face

def main():
    imglist = glob.glob("raw/*.jpg")

    counter = 1     # file name
    for image in imglist:
        rects, img = detect(image)
        box(rects, img, counter)
        counter += 1

if __name__ == "__main__":
    main()
Then I realized the cropped images are of different dimensions. We need images of the same dimensions to train the classifier, so I wrote another script to resize the cropped faces. (In hindsight, I should have just edited the previous script to save the resized images instead of writing a new one :/)
import sys
import glob
import cv2

def resize(img, size):
    return cv2.resize(img, size, interpolation=cv2.INTER_AREA)

def store(img, file_name):
    cv2.imwrite('cropped2/' + str(file_name) + '.jpg', img)

def main():
    path = sys.argv[1] + "/*.jpg"

    imglist = glob.glob(path)
    imglist = [cv2.imread(x) for x in imglist]
    resizedlist = [resize(x, (25, 25)) for x in imglist]
    counter = 0
    for i in resizedlist:
        store(i, counter)
        counter += 1

if __name__ == "__main__":
    if len(sys.argv) < 2:   # need the image directory as an argument
        print "USAGE: resizer.py </path/to/images>"
        sys.exit()
    main()
I used pyfaces, a pretty old implementation of eigenfaces, as the classifier. It works on top of OpenCV and numpy (as always), and the code looks neat: https://code.google.com/archive/p/pyface
Feed the images into pyfaces along with the image you want to compare. It will create eigenfaces from the set of images and compare them against the image you feed in. The result is the closest image.
python pyfacesdemo yourimage dirname numofeigenfaces threshold
[Image: facerec.png]
And there you have it: I look 22% similar to Michael B. Jordan, compared to the rest of the actors.

Tuesday, August 18, 2015

WhatsApp tailored notifications

I always mute WhatsApp group chat notifications (don’t we all?). But I still wanted to get a notification when someone mentions my name.
I did a little bit of googling and found Yowsup, a Python API for WhatsApp. You can install it with pip install yowsup2.
Then register it with a new number. Keep in mind that you can’t use your existing WhatsApp number. I used Ammamma’s (a.k.a. grandma’s) number (pretty sure she won’t come to WhatsApp).
Once that was done, I modified the sample app given in Yowsup’s GitHub repo to track some keywords.
I also used Pushbullet, as mentioned in one of my previous posts, to issue push notifications.
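The Push helper comes from that earlier post and isn’t reproduced here; a minimal stand-in built on the pushbullet.py package might look like this (an assumption on my part, not the exact code I used):
from pushbullet import Pushbullet

class Push:
    # minimal stand-in for the Push helper used below
    def __init__(self, api_key):
        self.pb = Pushbullet(api_key)

    def notify(self, body, title):
        # push a note to every device on the account
        self.pb.push_note(title, body)
And the modified Yowsup layer: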
from yowsup.layers.interface                           import YowInterfaceLayer, ProtocolEntityCallback
from yowsup.layers.protocol_messages.protocolentities  import TextMessageProtocolEntity
from yowsup.layers.protocol_receipts.protocolentities  import OutgoingReceiptProtocolEntity
from yowsup.layers.protocol_acks.protocolentities      import OutgoingAckProtocolEntity
from Push import Push

pb = Push('KEY')

KEYWORDS = ['list', 'of', 'keywords']

class EchoLayer(YowInterfaceLayer):

    @ProtocolEntityCallback("message")
    def onMessage(self, messageProtocolEntity):
        # send a read receipt, otherwise we keep receiving the same message over and over
        receipt = OutgoingReceiptProtocolEntity(messageProtocolEntity.getId(), messageProtocolEntity.getFrom(), 'read', messageProtocolEntity.getParticipant())
        self.toLower(receipt)

        body = messageProtocolEntity.getBody()
        if any(kw in body.lower() for kw in KEYWORDS):
            pb.notify(body, 'WhatsApp')

    @ProtocolEntityCallback("receipt")
    def onReceipt(self, entity):
        ack = OutgoingAckProtocolEntity(entity.getId(), "receipt", "delivery", entity.getFrom())
        self.toLower(ack)
I hosted the scripts on the Raspberry Pi, and now it sends me notifications whenever someone mentions my name :)
[Image: whatsappbot2.png]
p.s. Okay that’s a bad example and it’s not what it means.

Monday, August 10, 2015

Degree of separation of Sri Lankan Twitter community


I recently moved to Arch Linux :D While backing up old project files, I came across some of the experiments [1] [2] I’ve done on SNA (social network analysis). Thought I’d document at least some of the stuff before I forget everything :D
Inspired by the Erdős number and the Bacon number, I wanted to calculate the degree of separation of the Sri Lankan Twitter community.
The toughest and most tiresome part was the data collection phase. What I wanted was a graph of Sri Lankans, with users as nodes and friendships as edges. Using Twecoll on the Bestatlk Twitter account, its followers and the followers of those followers were scraped. The whole process spanned an entire week due to Twitter’s API limitations. At the end, a GML file was generated.
I then ran the GML file through a Python script to extract the Twitter IDs and generate a dictionary mapping each user’s ID to the list of Twitter IDs of that user’s friends.
The next step was to find the shortest paths between all pairs of nodes in the graph and take the maximum. That maximum is the degree of separation.
My first attempt was to apply breadth-first search to every pair of nodes. Needless to say, that was extremely slow. Then I found out about the Floyd-Warshall algorithm, which can find the shortest paths between all pairs of nodes in a graph. It also turned out that networkx has a built-in method to calculate this :D So, as a perfectly normal lazy human being, I ended up using networkx instead of implementing the algorithm myself.
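Roughly, the networkx route looks like this (the file name is a placeholder, and I treat the follower graph as undirected):
import networkx as nx

g = nx.read_gml('lk_twitter.gml').to_undirected()   # placeholder file name

# restrict to the largest connected component; isolated users would make the diameter infinite
giant = max(nx.connected_component_subgraphs(g), key=len)

# the diameter is the longest shortest path, i.e. the degree of separation
print nx.diameter(giant)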
The degree of separation was found to be 5.
In other words, if you’re a Sri Lankan on Twitter, any other Sri Lankan on Twitter is either a friend of a friend of a friend of a friend of a friend, or closer.
Disclaimer
  • Data were collected in January 2014; the community should have grown since then.
  • Bestatlk doesn’t follow all the Sri Lankans on Twitter, so the results are not based on the entire community.
P.S. I also fed the data to some other algorithms in an attempt to find the most influential Sri Lankans on Twitter. I guess I’ll save that for another post. Or maybe not.

Saturday, May 30, 2015

How To Train Your Haar Classifier


I’ve been working on image processing over the last few months for a research project at the SCoRe lab at UCSC. A part of the work required me to train a haar classifier.
Haar cascades are used to detect objects in an image using so-called haar-like features. The detection algorithm is based on the work of Viola and Jones. Although it’s not the sharpest tool in the shed (compared to HOG+SVM), it was the first to work in real time. Haar classifiers are widely used in face detection, and most image processing libraries, like OpenCV and SimpleCV, support them. There are also lots of trained classifiers on the internet; you just have to download the XML file and use it in your code, as in the sketch below.
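For example, a downloaded frontal-face cascade can be used in just a few lines of OpenCV (the file names here are placeholders):
import cv2

cascade = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')
img = cv2.imread('people.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# returns one (x, y, w, h) rectangle per detected face
faces = cascade.detectMultiScale(gray, 1.3, 4)
print '%d face(s) found' % len(faces)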
But sometimes it really boils down to training your own classifier, maybe because the ones on the web are not good enough, or because there are no classifiers available for your case at all.
Training a classifier is no easy job. You need a lot of positive images (images containing the object you want to detect), a lot of negative images, and a hell of a lot of processing power. Fortunately, there’s a great tutorial on Naotoshi Seo’s site which explains pretty much everything you need to know about training haar classifiers.
I used “opencv_traincascade” instead of the now-obsolete “haartraining”. “convert_cascade” doesn’t work with “opencv_traincascade”, so I had to stop the training program and rerun it with one stage fewer. This tricked it into producing the final cascade file from the intermediate training files.
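For reference, a typical run of the two tools looks roughly like this; the counts, sizes, and paths are placeholders rather than my exact values:
opencv_createsamples -info positives.txt -vec samples.vec -num 4800 -w 20 -h 20
opencv_traincascade -data cascade/ -vec samples.vec -bg negatives.txt -numPos 4300 -numNeg 2000 -numStages 20 -w 20 -h 20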
The following result from my polkatu (coconut shell) detector took 4800+ positive images, 2000+ negative images, and 2 long months.

Sunday, April 12, 2015

Harry Potter, P versus NP and time travel


Sorry for the clickbait of a title.
Disclaimer: Might contain HPMOR spoilers.
In the last KopiJS session, I came to know about Harry Potter and the Methods of Rationality, a fanfic based on the HP universe. I’m not a fanfic kind of guy, but after seeing Thameera, Chanux and Gaveen fan-boying over it, I decided to put aside The Sound and the Fury for a while and start reading HPMOR.
It’s a page-turner: insanely fun to read and far better than the original series (I can already see die-hard HP fans waving pitchforks).
Harry is a rationalist. His father (uncle, rather) is a professor of biochemistry, and Harry is well versed in the scientific literature. He has read Gödel, Escher, Bach, Judgment Under Uncertainty: Heuristics and Biases, and volume one of The Feynman Lectures on Physics. He’s no ordinary 11-year-old.
In chapter 17 of the book (the one I just finished reading), he comes up with an algorithm to find the factors of a product of 2 prime numbers. Generating a product of two sufficiently large primes is an important step in the key generation of public-key cryptography; if you can decompose such a product into its factors in polynomial time, congratulations, you have broken RSA. (RSA is the least of your concerns when you can time travel, and the algorithm below is not to be taken seriously.)
Here goes:
1) Get two prime numbers of known digit length (3, for example), p and q, and get the product n at a time x. (p and q are unknown to you)
2) Write 101*101 on a paper.
3) Go back in time to time x and give the paper to yourself. (Using a time-turner, remember that thing Hermione wore around her neck in Prisoner of Azkaban?)
4) Your past self calculates the product.
5) If product == n, return p, q
6) If product != n, increment q by 2. Go back to time x and give the paper to your past self.
7) If product != n and q > 997, increment p by 2 and reset q to 101. Go back to time x and give the paper to yourself.
8) If p > 997 and product still != n, send an empty paper to your past self.
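Stripped of the time travel, the loop above is plain trial multiplication over 3-digit odd candidates. A toy Python version, minus the O(1) wall-clock magic:
def factor(n):
    # Harry's search without the time-turner
    for p in range(101, 998, 2):
        for q in range(101, 998, 2):
            if p * q == n:
                return p, q
    return None   # the "empty paper" case

print factor(101 * 997)   # (101, 997)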
With this algorithm, Harry has just shown that P=NP, given that you can time travel. P versus NP is probably the most important unsolved problem in computer science, and one of the seven Millennium Prize Problems. A proof that P=NP would have immense consequences and would probably change the internet as we know it. Our current implementations of public-key cryptography would have to be replaced, computer vision and voice recognition would be able to make giant leaps, and it would revolutionize mathematics, to name a few.
Check out the movie Travelling Salesman. It talks about the implications of proving P=NP.