So, how much is your network really worth – Experiments in data-mining, disambiguation & Natural Language processing [Part 1]

This problem is inspired by the baby problem in Matthew A. Russell's Mining the Social Web on mining LinkedIn data for profit & fun. For the faint-hearted, short-of-time, show-me-the-meat-quick folks: skip directly to the results, or explore my code on GitHub here.

Original Problem Statement & Use Case for Motivation

The original idea was to fetch the data using LinkedIn's RESTful APIs and score the connections using a bunch of interfaces across people, social media (news feeds etc.), groups, communications, companies (as a proxy for revenue) and other metrics. The idea was to know which people really add value to my network, then scale this up by the roles I want to grow into – and here's the fun part: since we are all startups in ourselves (Reid Hoffman's The Start-up of You), I could pivot on whatever interesting insights this data set might give. Now, you might think – why not hand-curate all of this, (hand) "analyze" it and see what works? Except that this doesn't scale: people keep transitioning all the time, and so do your plans; plus, why not build something beautiful instead? Besides, I really wanted to work on something end-to-end that lets me experiment with data mining techniques, especially the entity disambiguation and natural language processing that I am learning.

Challenges

The problem I faced here was multi-fold: some companies did not have official revenue numbers, so this data set had to be hand-curated. The bigger problem, though, was that my architecture was so messed up that I could no longer validate that what I see is what I "want" to get, since I was coding all of this for "scale" in map-reduce, on data I fetched via LinkedIn's API (JSON, and later the responses in the default XML format). So, to tackle it hands-on, I broke the problem down (divide & conquer) to work only on "companies" for now. On second thought, this made my life simpler: I no longer have to hand-curate revenue numbers (as a proxy for company reputation, amongst others) for "all" the companies – I can just focus on hand-curating, say, the top 20%, since after that things really thin out at the tails, as the frequency distribution of the data confirms.

What follows is the real hands-on work on the baby problem I attacked: disambiguating these company names and plotting their frequencies.

[Figure: frequency plot of company entities in my network, pruned to the top 20 by frequency]

Note that the frequency plot above is pruned to the top 20 entities (weighted by frequency) – as I mentioned, there are 761 entities in total, but plotting them all in R & Inkscape gets cluttered. [If you have smarter ways to visualize this, like text that appears on mouse hover rather than written labels, do drop a comment below – I am yet to experiment on the visualization front.]

Data set:

761 data points on contacts imported from LinkedIn – the companies that people in my immediate network work for. (You can import this manually, use my code on GitHub here, and see the wealth of your own network for yourself!)

 

Methodology & Algorithm:

The preliminary step involves looking at the data and doing frequency plots on the raw values. This is important because it gives you an intimate knowledge of where the 80-20 effort should go, which is especially critical when processing natural language data sets, and it gives an idea of the transformations to apply, including filtering for stop words. This matters because I do frequent itemset mining and don't want to give extra weight to stop words like "the" in "The Bank of America", "The Boston Consulting", "The Bank of New York Mellon". That would otherwise have basketed all these items under the same "key" (the "stem" of a unique entity): since the algorithm would no longer find similarity on "Bank" or "Boston", it would incorrectly interpret the name as "The". This trick, of course, is only possible because I had a good look at the frequency maps of the raw data set.
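To make this concrete, here is a minimal sketch of that preliminary pass – my own illustration, not the code from the repo; the sample names, the STOP_WORDS list and the meaningful_tokens helper are assumptions for the example:

from collections import Counter

# Hypothetical raw company names, as if pulled from the LinkedIn export
# (the real input file is on GitHub).
raw_companies = [
    "The Bank of America", "Bank of America", "The Boston Consulting",
    "The Bank of New York Mellon", "Thought Works India", "Thought Works US",
]

# Stop words that should never become the "key"/stem of an entity.
STOP_WORDS = {"the", "of", "and", "inc", "ltd"}

def meaningful_tokens(name):
    # Lower-case, split, and drop stop words.
    return [t for t in name.lower().split() if t not in STOP_WORDS]

# Frequency plot on the raw data, keyed on the first *meaningful* token,
# so "The Bank of America" counts under "bank" rather than "the".
freq = Counter(toks[0] for toks in map(meaningful_tokens, raw_companies) if toks)
print(freq.most_common(20))   # eyeball the 80-20 head of the distribution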

[On] Answering the WHY questions

Before I get to the algorithm, I will answer a few "why" questions. One: why did I choose frequent itemset mining over pure distance measures/semantic distance measures? The reason lies in the attributes of the data I had. In this case, with natural language constructs, semantic distance or a pure distance metric would have done significantly worse, and here's why:

We would want {Thought Works, Thought Works India, Thought Works US} all to land in the same basket – and we ALSO NEED a way to know the "stem" of this entity. A distance metric would fare average in this case, but it fares even worse in cases such as this:

{The Bank of America, The Bank of Boston, The Bank of New York} – here choosing an "intelligent" threshold would be the bottleneck, with a lot of manual hovering around the "best" distance at which to classify, had we chosen the distance metric over frequent itemset mining.
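To make that threshold problem concrete, here is a small hypothetical illustration using the standard library's difflib (not something from the repo): the score for a pair we want merged and the score for a pair we want kept apart can land close together, so no single cutoff works cleanly.

from difflib import SequenceMatcher

def similarity(a, b):
    # Character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A pair we WANT in the same basket (same entity):
print(similarity("Thought Works India", "Thought Works US"))
# A pair we do NOT want merged (different banks):
print(similarity("The Bank of America", "The Bank of Boston"))

# The two scores land uncomfortably close together, and exactly where they land
# depends on the data -- so any fixed cutoff needs constant manual tuning,
# which is the bottleneck described above.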

ALGORITHM

1. Transform the data: handle any known abbreviations (for example, [24]7 Inc is the same as 247), convert to lower case, and remove the stop words. This yields lists of similar companies. With this I build a dictionary of companies, with the key being the first (meaningful) word of each company name and the value being the similar entities. (A condensed sketch of steps 1–3 appears after this list.)

[The key is to think about reducing the working space, so that it is scannable by human eyes for discrepancies.]

2. Remove the encoding (very important, since I had internationalized characters in my input sample) and do frequent itemset mining on the transformed data set. In this step, I take the output of step 1 and use key-value pairs such as this:

{'Thought': ['Thought works ltd', 'Thought works India', 'Thought works US']}

which then transforms to

{'Thought': ['Thought works', 'Thought works', 'Thought works']} [post entity disambiguation using frequent itemset mining; note that I deliberately choose the "support" for the basket to be the total length of the basket, i.e. 3 in this case. Thus "Thought works" had to occur in every member of the basket to qualify as a (similar) unique disambiguated entity.]

This finally disambiguates and interprets Thought Works as ONE entity. Lastly, I flatten a list of lists, since some entities appear multiple times and are thus represented as lists of all similar entities.

3. Do data cosmetics (convert the first character of each word to upper case, except in stop words – recall that I had earlier converted everything to lower case) and plot the pretty table on the "standardized" data set to get the final frequency counts!

4. Use R to plot these frequencies as nodes in a graph to visualize the result, and edit it in Inkscape.
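Putting steps 1–3 together, here is a condensed sketch of the pipeline as I understand it – the abbreviation map, stop-word list and function names are illustrative assumptions, not the exact code on GitHub:

from collections import Counter, defaultdict

STOP_WORDS = {"the", "of", "and", "inc", "ltd"}
ABBREVIATIONS = {"[24]7 inc": "247"}   # hand-curated known aliases (assumed)

def normalize(name):
    # Steps 1 & 2 (transform): strip encoding, apply known abbreviations,
    # lower-case, drop stop words.
    name = name.encode("ascii", "ignore").decode()   # crude de-internationalization
    name = ABBREVIATIONS.get(name.lower(), name.lower())
    return [t for t in name.split() if t not in STOP_WORDS]

def disambiguate(companies):
    # Step 1: basket the token lists by their first meaningful token.
    baskets = defaultdict(list)
    for name in companies:
        toks = normalize(name)
        if toks:
            baskets[toks[0]].append(toks)

    # Step 2: frequent itemset mining with support == basket size, i.e. keep only
    # the tokens that occur in EVERY member of the basket -> the "stem" entity.
    counts = Counter()
    for key, token_lists in baskets.items():
        support = len(token_lists)
        token_counts = Counter(t for toks in token_lists for t in set(toks))
        stem = [t for t in token_lists[0] if token_counts[t] == support]
        stem_name = " ".join(stem) if stem else key
        # Step 3: data cosmetics -- title-case the stem and tally its frequency.
        counts[stem_name.title()] += support
    return counts

print(disambiguate(["Thought works ltd", "Thought works India", "Thought works US",
                    "The Bank of America", "The bank of America Corp"]).most_common())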

Data Structures & Design Paradigms

I use the greedy and divide-and-conquer design paradigms to approach this problem and to reduce the sample size of the "searches for similar companies" in step 2. Apart from this, I exploit dictionaries (hash-based lookups) for quick searches and for getting the "stem" entity; frozen sets for freezing the tuples/sets after the frequent itemset mining is done; a careful choice of "support" for each bucket; and sets for searching the "frequent items" in baskets that contain duplicates. [Recall that the main difference between a set and a list is that the former is unordered and cannot contain duplicates by construction – this saves lookup time and narrows down the working space when an "entire basket" consists of identical items.] Similar and identical items also tend to lie close together locally; more on this is in the Python (.py) file on GitHub.
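As a small illustration of the set/frozenset point (again my own sketch, not the repo's code): frozensets are hashable, so exact-duplicate basket members collapse into one entry, and intersecting every member's token set pulls out exactly the items whose support equals the basket size.

# Token sets for the members of one basket (assumed sample data).
members = [frozenset({"thought", "works", "ltd"}),
           frozenset({"thought", "works", "india"}),
           frozenset({"thought", "works", "us"})]

# frozensets are hashable, so exact duplicates cost only one entry when stored
# in a set (a plain list or mutable set of tokens would not be hashable).
unique_members = set(members)

# Items whose support equals the basket size: intersect every member's token set.
stem = frozenset.intersection(*unique_members)
print(stem)   # -> frozenset({'thought', 'works'})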

My Python code (steps 1–3) and the input data set I used are on GitHub here, and so is the code to plot the graph in R and modify it in Inkscape (step 4).

There are three additional secrets to solving this problem – more like a secret sauce of beauty:

1. I always challenge myself to simulate being in an "interview" where I am asked to solve this question. The natural temptation is to think hardER – which is where accounting for the right design paradigm (divide & conquer, greedy, dynamic programming: why re-solve that which can be "outsourced" once it has been solved the first time?) and testing for corner cases comes in.

2. I always try to build for scale and as generically as possible, so that I can re-use my code on a similar problem space. Why re-solve that which can be borrowed from your own problem-solving space?

3. Think 1970s, when both disk space and processing power were limited – this is how your algorithm ends up with more resourcefulness than you thought possible at first.

Together, these elements take care of the beauty and elegance of the core engineering problem. Happy solving!

CRITIQUE is always appreciated – do post your comments on what does not work and how I can make this better. Better still, if you are looking to collaborate on the use case from the opening of this post, reach out to me on LinkedIn or here.

Results & Resources

linkedin5_inkspace (PDF file)

Linkedin_processed_file.csv (Frequency maps on the Transformed file)

All of this on GitHub

Stay Tall,

Ekta

Regressions , Huge data sets & Google

Why does Google still trust its "manual" PageRank algorithm, i.e. why is it not trying to catch the wave of machine learning? Into the unknown, the unseen and the unexplored. The Black Swan ..

Read more here .

And .. the Big Data effect!

Ekta

