Finding symmetries in an unsymmetrical world ..

Archive for the ‘Experiments in Data mining & Natural Language Processing’ Category

Problem statement:

For hand collected /crafted data sets, as with manually collected datasets  from survey data and other data sets where the experience of the person keying in the data on a digital device, is short of an acceptable standard – the distance conventional metrics will fail. This is because these distance metrics “assume” that the data representation in the data sets proxies for the  intent of the end-user, which is not true, when intentional errors creep us.

This requires a way to model for the unintentional errors that have creeped up since, and requires to model for the very structure of the input keyboard that is used to key in the data. In this algorithm, I explicitly model the QWERTY keyboard as a graphical entity, where I model the 26 alphabets of the English language as the nodes of a graph & the distance between the nodes accounting for the

Scope : This algorithm, will give the similar entities amongst many in a hand-crafted data sets. For example, for two entities, “Weka” and “Wkea” it will identify them as one entity, and with a spell checker in the final disambiguating pass, also identify which of the two entities is the actual entity .

The limitation of this apparatus , though is for non-identified/no-auto suggested words in the English dictionary , such as people names like “Ekta “ and “Keta” – in which case though it will identify and mark the two entities as the same, after having “identified the typo” – it will be limited in its scope to mention which one of them is actually the right representation.

In that sense, this application is the intent prediction in a global scope,  but NOT the identification of which of the singular entities is the actual representation.


The QWERTY representation 

Algorithm & data structure representation  :

  1. The three rows of the English alphabets in the QWERTY keyboard are represented as nodes of the graph, with a distance between the adjacent nodes as 1 . The adjacent nodes in turn are picked up, accounting for the fact that typographical errors are more likely to occur in a neighboring zone.  For example, according to the schema above , Q is adjacent to Q,W,A and S – the adjacency being  in turn represented by the neighborhood of that node. For example, for D as the key, W,E,R,S,F,X,C,V are its immediate neighborhood.
  2. The concept of the neighborhood as above, also narrows down the what keys are acceptable typos’  in the immediate neighborhood, so that the distance metric does not penalize the user
  3. Extension of the adjacent nodes is the concept where the users inadvertently swap the left and right keys . Like instead of typing “weka” the user types “wkea” , since the fingers are not user’s synched up while typing. This can be due to inexperience in typing, or while pure errors while speed-typing (typical of surveys, call center chat data etc.) – in this scope the comparison, instead of being with the neighborhood/adjacent nodes as above, is against the left vs. right portion of the QWERTY keyboard-  with defined positions for what is left and right . This ensures that when two consecutive characters are deemed as “typos emanating from left and right hand asynchronization” – these characters indeed come from left and right part of the keyboard.
  4. As the next step, the threshold of what is acceptable difference between the words is defined. Currently this is set as one-third of the absolute difference in the length of the words being compared. This is because the difference in the lengths of the words being compared has to be within a defined length . for example a user instead of typing “asynchronous”  may type “asynchronus” (missing the “o” after “n”) – then given the threshold restrictions, the distance metric, should uniquely attribute this to asynchronous. (In this case, it will happen in the 1st step of auto-suggest itself, there by narrowing the scope – or reinforcing the uniqueness of the attribute)
  5. If the difference in the length of the words exceeds the threshold  then the algorithm places a higer belief that the words are more likely to be different and uses the Levenshtein distance metric instead.
  6. As a final step to identify which of the two keys is the rightful key (“Lengt” and “length” are both mapped to one entity, but the algorithm till step 4 does not know which is the right “word”) , the word being compared is passed over an “auto-suggest”  – which looks up predefined words in the English dictionary and outputs the candidate(s) with  the lowest distance metric.  

As a proposed extension to this algorithm – I intend try out different dictionary approaches for auto-suggest (together with Apache Solr and elastic search)

I used Python’s networkx to model the QWERTY keyboard, and built the QWERTY distance metric on top of it.


As in other distance metrics, this distance metric computes (n-1)comparisons for each word , thereby outputting the word with the closest distance, O(n2) complexity.  Post this the auto-suggest features narrows down the rightful- scope by candidate key(s). Overall the complexity being O(n2) .

Performance metrics & Benchmark tests

The way to measure the performance of this algorithm is by passing it through the corpus of a hand-collected data set and compare its performance (time complexity and results against other distance metrics – like pure Levenshtein distance application, Manhattan & Euclidean distance, thought by a rule of thumb Levenshtein  will do better on pure distance search among-st the other two (Manhattan & Euclidean distance)

While modeling this problem, I also found a version of Fuzzy string match as in here –  and I will contrast the performance results against this(pending)

Github codes are here

Suggestions, critique is as always welcome.

Stay Green & Growing,



So, how much is your network really worth – Experiments in data-mining, disambiguation & Natural Language processing [Part 1]

This problem is inspired from the baby problem mentioned in Matthew A. Russell ‘s Mining the Social Web on mining Linkedin data for profit& fun. For the Faint-hearted, Short of time, Show me the Meat Quick folks, skip to the results, directly , or explore my codes in Github here

Original Problem Statement & Use-case  for Motivation

The original idea was to fetch the data using Linkedin RESTful API’s and score the connections using a bunch of interfaces across people, social media (new feeds etc) , groups, communications, companies (Proxy for Revenue) and other metrics to score companies.  The idea was to know who are the people who really add value to my network. Then scale this up by the roles I want to grow in, and here’s the fun part of it – since We are all Startup’s in Ourselves (Reid Hofmannn’s The Startup of you) , I could pivot to these interesting insights that this data set might give. Now, you might think – why not hand curate this all, (hand) “analyze” it and see what works , except that this doesn’t not scale, people keep transitioning all the time and so do your plans, plus why not build something beautiful instead ? Besides, I really wanted to work on something that lets me experiment on end-to-end with Data mining techniques, especially on disambiguating entities and Natural language processing that I am learning.


The problem I faced here was multi-fold, some companies did not have official revenue numbers, plus this data-set had to be hand curated) – the problem though was that my architecture was so messed up that I could no longer validate what I see is what I “want” to get, and since I was coding all this data for “scale” in map-reduce on data I fetched via Linkedin’s APi’ (JSON & later the responses from the default XML format).  So, to tackle it hands-on,  I broke the problem(Divide & Conquer) to work only on “companies” for now . On a second thought, it made my life simple, now I don’t have to hand curate revenue number for “all” the companies(As a proxy for company reputation, amongst others) – I can just focus on hand-curate for top 20% say, since after that, things really thin out at the tails, as is confirmed by frequency distribution of the data.

What follows, is the real hands-on from the baby problem I attacked to disambiguate & do frequency plots on these company names.


Note the frequency plot above is pruned at Top 20 entities (weighted by frequency) – as I mentioned, the total entities are 761 , but plotting them in R & Inkspace gets cluttered [If you have smarter ways to visualize , like hovering text when pointed at by the mouse, but not written text, do drop a comment below- I am yet to experiment on the visualization front.

Data set :

761 data points on imported contacts from Linkedin – about companies that people work with, in my immediate network (You can import this manually, use my code in Github here – and see your personal wealth of Network for yourself !)


Methodology & Algorithm :

The prelim step involves looking up the data to do frequency plots on raw data . This is important because it gives you an intimate knowledge of 80-20 effort s that you should focus on, which si especially critical when doing processing with Natural language data sets and gives an idea of the transformations to do on it, including filtering for stop words. This is important since I  do Frequent item set mining – and don’t want to give more weight-age to stop words like “the”  in “The Bank of America” , “The Boston Consulting” , “ The Bank of New York Mellon”  (This would have otherwise basketed all these items with same “key”(or the “Stem” of a unique entity), meaning since it would no longer find similarity with Bank & Boston, it will incorrectly interpret the name as “The” .  This, trick, of course is only possible since I had a good look at the frequency maps of the raw data-set .

[On]Answering the WHY questions 

Before I get to the algorithm, I will answer a few “Why” questions , One, Why did I choose Frequency item set mining over Pure Distance measures/Semantic distance measures – the reason also lies in the attributes of the data that I had. In this case, with natural language constructs, Semantic Distance, or pure distance metric would have done significantly worser and here’s why -:

We would want – {Though Works , Thought Works , India , Thought Works, US } all to be in same basket – and we ALSO NEED a way to know the “stem” of this entity. Though distance metric would fare average in this case,  but so it will fare worser in cases such as this –

{The Bank of America, The bank of Boston, The Bank of New York” } and choosing an “intelligent” threshold would be the bottleneck – with a lot of manual hovering around the “best” distance to classify, had we chosen the Distance metric over frequency item set mining .


1. Transforming the data to handling any known abbreviations(example -[24]7 inc is the same as 247)converting to lower case, removing the stop words – This will create a list of similar companies – With this I create a list of companies with Key as the first character in each company, and value as the similar entities.

[Key is to think about reducing the working space, so that it is “scan’nable by human eyes” for discrepancies]

2. Remove the encoding (Very important, since I had internationalization in my input sample)- and do Frequency item set mining on the transformed dataset – In this, I fetch the output of the step 1 and use key, value pairs such as this -:

{‘Thought’: ‘Thought works ltd’ , ‘Thought works India’ , Thought works US’}

—- which then transforms to  —-

{‘Thought’: ‘Thought works’ , ‘Thought works’ , Thought works’}  [Post entity Disambiguation using Frequent item set mining, note that I carefully choose the “support” for the basket as the “total length of the basket, ie. 3 in this case. Thus, “Thought works” had to occur in all baskets to be qualified as a “(similar) unique diambiguated entity”

This will finally disambiguate and interpret Thought Works as ONE entity.  Finally, I use the concept of flattening a list of lists, since some entities appear multiple times – and will thus be transformed as lists of all similar entities.

3. Do data cosmetics (convert the 1st character of each word to upper case, except in stop words, Recall that I had first converted everything to lowercase ) and Plot the pretty table on the “standardized data-set”  – to get the final frequency counts !

4. Use R to plot these frequencies as a node in the graph to visualize this and edit it out in Inkspace.

Data structures, & Design Paradigm

I use Greedy & Divide and  Conquer  design paradigms  to approach this problem and reduce the sample size of “searches for similar companies” when in step 2 .  Apart from this I exploited the “dictionaries”   [and hence the dictionaries] for quick searches & lookups and getting the “Stem entity” “Frozen sets” for freezing the tuples/sets after having done the frequency item set mining, clever use of “support” for the “bucket” , use of sets for  searching the “frequent items” in baskets that have duplicates [Recall that the main difference between a set and  list is that the former is unordered and can not have duplicates by construct- so this saves us time for lookups and narrows down the work space, in case the “entire basket” is exactly with “identical items”] , that similar & identical items will lie locally, more on this is the python(py file at Github)

My code for Python (Step 1-3) ,and the input dataset I used is here at Github , and so is the code to plot the graph in R and modify in Inkspace (Step 4) .

There are three additional secrets to solving this problem, more like a secret sauce of beauty -:

1. I always challenge myself to simulate as if I am in an “interview” – and I am asked to solve this question. Since the natural temptation would be to think hardER – hence the optimal and accounting for design paradigm (Divide & conquer, Greedy, Dynamic Programming, why resolve that which can be “outsourced” after having solved at the 1st instance) and testing for corner cases approach comes in.

2. I always try to build for scale and as generic as possible, so that I can re-use my code on similar problem space. Why re-solve that which can be borrowed from your own problem-solving space ?

3. Think 1970′s when both the disk space & processing power was limited – and this is how your algorithm will have more resourcefulness than that you think you can do at first.

Together these elements take care of the beauty and elegant’ness part of the core engineering problem . Happy solving !

CRITIQUE is always appreciated – do post your comments, on what does not work & how I can make this better. Better still, looking to collaborate on the use case, as in the opening of the post above – reach out to me at Linkedin / or here . 

Results & Resources

linkedin5_inkspace (PDF file)

Linkedin_processed_file.csv (Frequency maps on the Transformed file)

All of this in Github

Stay Tall,


Hey there, would be nice to have you around !

That’s me

My live Rantings @Twitter

Blog Stats

  • 13,718 hits
July 2018
« Oct