Saturday, April 14, 2012

Django Frontend Edit

Django Frontend Edit was created to add "add" and "delete" frontend functions for Mezzanine apps, but it works with any type of model.

It provides a cleaner, neater way of adding frontend add and delete functionality. It uses Django's permissions model to determine whether a user can perform the frontend actions, and it is especially good for quickly prototyping an app.

You can quickly add "add" and "delete" functions to a todo app, for example, using the template code below:

{# This will render a nice ajax form for adding new items #}
{% can_add object_list text %}
    {% for todoitem in object_list %}
        {{ todoitem.text }}
        {% can_delete todoitem %}
    {% endfor %}
{% endcan_add %}
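For context, the tags above assume a queryset of objects with a text field. A minimal model this could be prototyped against (the TodoItem name and field are illustrative, not part of the library) would be:

# models.py -- hypothetical minimal model for the todo example above
from django.db import models

class TodoItem(models.Model):
    text = models.CharField(max_length=200)

# Presumably Django's standard add/delete permissions for the model
# (e.g. "add_todoitem" / "delete_todoitem") are what the tags check
# before rendering the frontend controls -- see the repo for details.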

Check out the example in the repo http://t.co/uvxlgnd3

Saturday, February 11, 2012

Eureka, I can mongo-doop with dumbo, oops?

Ok, so thanks to the "klaasy" guy working on dumbo and some uncomfortable netbook power coding on the train, I was able to keep using my two recent favorite tools, Python and MongoDB. I merged the current mongo-hadoop repo with a fork that had implemented typedbytes mongo input and output formats (and cleaned it up a tiny bit), and voila, you can do a simple dumbo wordcount as follows:

import ast

import dumbo
from dumbo.lib import sumreducer


class Mapper:
    def __call__(self, key, value):
        # value comes in dict form, and since we are not storing binary data
        # this should work: safely convert the string to a dict.
        value = ast.literal_eval(value)
        wordkey = value["key"]
        text = value["text"]
        text = " ".join(text.split("\n"))
        for word in text.split(" "):
            yield (str(wordkey), str(word)), 1


if __name__ == "__main__":
    job = dumbo.Job()
    job.additer(Mapper, sumreducer)
    job.run()
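The mapper above expects each input document to have "key" and "text" fields. A quick, hypothetical way to seed the texts.in collection used by the run command below (pymongo assumed; the field names are taken from the mapper):

# seed_texts.py -- hypothetical helper to put a couple of test documents
# into the texts.in collection that the wordcount job reads
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1")
collection = client["texts"]["in"]
collection.insert_one({"key": "doc1", "text": "hello world\nhello dumbo"})
collection.insert_one({"key": "doc2", "text": "mongo meets hadoop"})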

dumbo rm test.out -hadoop /usr/lib/hadoop

dumbo start ~/scratchbox/dumbo/wordcount.py -hadoop /usr/lib/hadoop \
    -libjar core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar \
    -libjar streaming/target/mongo-hadoop-streaming-1.0-SNAPSHOT.jar \
    -inputformat com.mongodb.hadoop.typedbytes.MongoInputFormat \
    -input mongodb://127.0.0.1/texts.in \
    -outputformat com.mongodb.hadoop.typedbytes.MongoOutputFormat \
    -output test.out \
    -conf examples/dumbo/mongo-wordcount.xml


The conf file used contains the actual mongo config options like the destination collection, output key type, etc.

Contact me for more info; here's the repo: http://bit.ly/wyPA5g


Thursday, February 9, 2012

Ramble Rumble for Languages

I'm tired of listening to programmers ramble about which framework or programming language is better. So let me get this straight: you want me to sacrifice speed (C++) for programming convenience and smoothness (Python)? No, I don't. If you believe your program needs to be faster than the speed of light, then by all means code in a fast language; you can write machine code for all I care. I like being able to quickly prototype programs on my netbook while on a train, and as a result I have tied the knot with Python. That is not to say that I don't understand the kind of relationship I'm getting into. Obviously, I cannot create the next Far Cry using Python (I mean I could, but it would not be for mass production). If I wanted to do that, I would use C/C++.

This logic scales up to higher-level platform disputes, like the usual "which one is better: Ruby on Rails, Django, or just plain PHP?" It's all about preference. I like big butts; my preference.

Tuesday, January 31, 2012

Verbose TFxIDF (Weighting) Example with Dumbo, The Beginning

Recently, I ventured into the world of information retrieval and data mining, because it's cool to learn something new and it is the future of the "InterWebbs". Over the past few weeks, I have buried my head in research papers, books, and source code, with my trusty netbook as my sidekick. One of the concepts I have picked up is the infamous TFxIDF, the super smart weighting scheme that everyone seems to rave about. It's a nifty algorithm that weighs a term in a text according to its relevance. I will not go on about what exactly it is because Google does a good job of explaining things.
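For reference though, the weight itself is simple; here is a minimal sketch in plain Python (using the usual log-scaled inverse document frequency; note the total number of documents is passed in, since the pipeline below does not compute it):

from math import log

def tfidf(token_count, doc_token_count, doc_freq, num_docs):
    """tf-idf weight of one term in one document.

    token_count     -- occurrences of the term in the document
    doc_token_count -- total tokens in the document
    doc_freq        -- number of documents containing the term
    num_docs        -- total number of documents in the collection
    """
    tf = float(token_count) / doc_token_count
    idf = log(float(num_docs) / doc_freq)
    return tf * idf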

I also came across this method called MapReduce, developed by my heroes at Google. As a fan of structured programming and methodologies, I like adopting tested, well-structured ways of solving problems. MapReduce helps you break large problems into simple tasks which can then be split across a cluster. Hadoop seems to be the best framework for executing MapReduce jobs.

Ok, so I know what TFxIDF and Hadoop are used for; how can I implement this in my own project written in Python? [Enter Dumbo]. In layman's terms, Dumbo lets you write MapReduce code in Python and run it on a Hadoop cluster.

Great, let's start coding. I went through the Dumbo TFxIDF example, the short tutorial, and the IR Book (a great introductory information retrieval book). I could not really follow the example because it seemed like the Dumbo creator writes his example code for experts and not dummies (Pun? Intended?). There is mapperA, mapperB, reducerC, reducerD, etc. So I give a more "verbose" example below. It also calculates the Euclidean length (to be used in calculating the Euclidean distance between two points, a query string and a text/document for example) for each document/text.
Over the coming weeks, I will explain parts of the code and clean it up a little.

import re
from math import pow, log, sqrt

from dumbo import *
from dumbo.lib import sumreducer


def tokenize(text):
    # Simple stand-in tokenizer (the original assumed a tokenize helper):
    # split a line into word tokens.
    return re.findall(r"\w+", text)


@opt("addpath", "yes")
class TokenCountMapper:
    def __call__(self, doc, line):
        """
        Generates a tokenized list which may contain repeated words.
        Tokens are grouped and counted in the reducer.
        """
        tokens = tokenize(line.lower()) 
        for token in tokens:
            yield (doc[1], token), 1 


class TokenCountReducer:
    #This is skipped and the Dumbo sumreducer helper is used instead
    pass


class DocumentTokenCountMapper:
    """
    Receives the total count of each (doc, token) pair and regroups
    it by document.
    """
    def __call__(self, key, tokenCount):
        doc, token = key
        yield doc, (token, tokenCount)


class DocumentTokenCountReducer:
    """
    Sums the total number of tokens in a doc and attaches it to each token's count.
    """
    def __call__(self, doc, value):
        values = list(value)
        #total number of tokens in doc n
        totalNumberOfTokens = sum(tokenCount for token, tokenCount in values)  
        #yield token info for current doc
        for token, tokenCount in values:
            yield (token, doc), (tokenCount, totalNumberOfTokens)


class TokenCountDocumentCountMapper:
    """
    Regroups by token, tagging each record with a document count of 1
    so the reducer can compute document frequency.
    """
    def __call__(self, key, value):
        token, doc = key
        tokenCount, totalNumberOfTokens = value
        #this token has this info and is in this document 
        yield token, (doc, tokenCount, totalNumberOfTokens, 1)


class TokenCountDocumentCountReducer:
    """
    Computes each token's document frequency (df) and its term
    frequency (tf) within each document.
    """
    def __call__(self, token, value):
        values = list(value)
        #count the number of docs for this token
        df = sum(docCount for doc, tokenCount, docTokCount, docCount in values)
        for doc, tokenCount, docTokCount in (value[:3] for value in values):
            yield (doc, token), (tokenCount, docTokCount, df, float(tokenCount)/docTokCount)


class EuclideanLengthMapper:
    """
    Regroups by token, carrying along each document's stats.
    """
    def __call__(self, key, value): 
        doc, token = key
        tokenCount, docTokCount, df, tf = value
        yield token, (doc, tokenCount, docTokCount, df, tf)
        
class EuclideanLengthReducer:
    """
    Appends the squared token count to each (doc, token) record.
    """
    def __call__(self, token, value): 
        values = list(value)
        for doc, tokenCount, docTokCount, df, tf in values:
            yield (doc, token), (tokenCount, docTokCount, df, tf, pow(float(tokenCount), 2))  


class EuclideanLengthSummerMapper:
    """
    Regroups by document so the squared counts can be summed per document.
    """
    def __call__(self, key, value):
        doc, token = key
        tokenCount, docTokCount, df, tf, poww = value
        yield doc, (token, tokenCount, docTokCount, df, tf, poww)


class EuclideanLengthSummerReducer:
    """
    Sums the squared token counts per document and emits each record with
    the document's Euclidean length (the square root of that sum).
    """
    def __call__(self, doc, value):
        values = list(value)
        totalDistances = sum(v[5] for v in values)
        for token, tokenCount, docTokCount, df, tf, poww in values:
            yield (doc, token), (docTokCount, df, tf, poww, sqrt(totalDistances))


if __name__ == "__main__":
    import dumbo
    job = dumbo.Job()
    job.additer(TokenCountMapper, sumreducer, sumreducer)
    job.additer(DocumentTokenCountMapper, DocumentTokenCountReducer)
    job.additer(TokenCountDocumentCountMapper, TokenCountDocumentCountReducer)
    job.additer(EuclideanLengthMapper, EuclideanLengthReducer)
    job.additer(EuclideanLengthSummerMapper, EuclideanLengthSummerReducer)
    job.run()
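As a side note on what the last two iterations compute: the Euclidean length of a document is just the square root of the sum of its squared token counts, which can later be used to normalize vectors when comparing a query against a document. A minimal standalone sketch (plain Python, no Hadoop):

from math import sqrt

def euclidean_length(token_counts):
    """Euclidean length of a document's term-count vector."""
    return sqrt(sum(count ** 2 for count in token_counts.values()))

# For a tiny document {"the": 2, "cat": 1, "sat": 1}
# the length is sqrt(2**2 + 1**2 + 1**2) = sqrt(6)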

Friday, January 20, 2012

Recipe for the Semantic Web - mongodb, hadoop, nltk, scrapy, django

Recently, I have been working on my dream project (5 years and counting), which I came up with during the first few months of my freshman year back in '06. It was supposed to be the best thing to happen to the internet (in my head), but I was never able to complete it.