Finding symmetries in an unsymmetrical world ..

Posts Tagged ‘Ekta Grover’

So, how much is your network really worth – Experiments in data-mining, disambiguation & Natural Language processing [Part 1]

This problem is inspired by the baby problem mentioned in Matthew A. Russell’s Mining the Social Web, on mining LinkedIn data for profit & fun. For the faint-hearted, the short of time, and the show-me-the-meat-quick folks: skip directly to the results, or explore my code on Github here.

Original Problem Statement & Use-case  for Motivation

The original idea was to fetch the data using LinkedIn’s RESTful APIs and score the connections using a bunch of interfaces across people, social media (news feeds etc.), groups, communications, companies (as a proxy for revenue) and other metrics to score companies. The idea was to know who the people are who really add value to my network, then scale this up by the roles I want to grow in. And here’s the fun part: since we are all startups in ourselves (Reid Hoffman’s The Start-up of You), I could pivot on the interesting insights that this data set might give. Now, you might think: why not hand curate all this, (hand) “analyze” it and see what works? Except that this does not scale, people keep transitioning all the time and so do your plans; plus, why not build something beautiful instead? Besides, I really wanted to work on something that lets me experiment end-to-end with data mining techniques, especially the entity disambiguation and natural language processing that I am learning.


The problem I faced here was multi-fold: some companies did not have official revenue numbers, and this data set had to be hand curated. The bigger problem, though, was that my architecture was so messed up that I could no longer validate that what I saw was what I “wanted” to get, since I was coding all this for “scale” in map-reduce on data I fetched via LinkedIn’s APIs (JSON & later the responses from the default XML format). So, to tackle it hands-on, I broke the problem down (Divide & Conquer) to work only on “companies” for now. On second thought, this made my life simpler: I no longer have to hand curate revenue numbers for “all” the companies (as a proxy for company reputation, amongst others). I can just focus on hand curating the top 20%, say, since after that things really thin out at the tails, as the frequency distribution of the data confirms.

What follows, is the real hands-on from the baby problem I attacked to disambiguate & do frequency plots on these company names.


Note the frequency plot above is pruned at the top 20 entities (weighted by frequency). As I mentioned, there are 761 entities in total, but plotting them all in R & Inkscape gets cluttered. [If you have smarter ways to visualize, like hovering text that appears when pointed at by the mouse instead of written text, do drop a comment below. I am yet to experiment on the visualization front.]

Data set :

761 data points on imported contacts from Linkedin – about companies that people work with, in my immediate network (You can import this manually, use my code in Github here – and see your personal wealth of Network for yourself !)


Methodology & Algorithm :

The preliminary step involves looking at the data and doing frequency plots on the raw data. This is important because it gives you an intimate knowledge of where the 80-20 effort should be focused, which is especially critical when processing natural language data sets, and it gives an idea of the transformations to apply, including filtering for stop words. The filtering matters since I do frequent item set mining and don’t want to give more weightage to stop words like “the” in “The Bank of America”, “The Boston Consulting”, “The Bank of New York Mellon”. Left in, the stop word would have basketed all these items under the same “key” (or the “stem” of a unique entity): since the miner would find no similarity between Bank & Boston, it would incorrectly interpret the name as “The”. This trick, of course, is only possible because I had a good look at the frequency maps of the raw data set.
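To make the stop-word point concrete, here is a minimal sketch in Python. It is not the code from my Github repository; the stop-word list and the function names are assumptions chosen for illustration (the post itself only names “the”). It counts the first token of each company name with and without stop-word filtering.

```python
from collections import Counter

# Assumed stop-word list for illustration; only "the" is named in the post.
STOP_WORDS = {"the", "of", "and"}


def strip_stop_words(name):
    """Lower-case a company name and drop the stop words."""
    return " ".join(w for w in name.lower().split() if w not in STOP_WORDS)


def first_token_counts(names, drop_stop_words):
    """Count how often each first token occurs across the names."""
    counts = Counter()
    for name in names:
        cleaned = strip_stop_words(name) if drop_stop_words else name.lower()
        tokens = cleaned.split()
        if tokens:  # skip names that were nothing but stop words
            counts[tokens[0]] += 1
    return counts


names = [
    "The Bank of America",
    "The Boston Consulting",
    "The Bank of New York Mellon",
]
raw = first_token_counts(names, drop_stop_words=False)
clean = first_token_counts(names, drop_stop_words=True)
```

On the raw names every entry lands under the key “the”; after filtering, the keys are “bank” and “boston”, which is the separation the frequency maps suggested.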

[On]Answering the WHY questions 

Before I get to the algorithm, I will answer a few “Why” questions. One: why did I choose frequent item set mining over pure distance measures/semantic distance measures? The reason lies in the attributes of the data that I had. In this case, with natural language constructs, a semantic distance or pure distance metric would have done significantly worse, and here’s why:

We would want {Though Works, Thought Works, India, Thought Works, US} all to be in the same basket, and we ALSO NEED a way to know the “stem” of this entity. A distance metric would fare average in this case, but it will fare worse in cases such as this:

{The Bank of America, The Bank of Boston, The Bank of New York}. Choosing an “intelligent” threshold would be the bottleneck, with a lot of manual hovering around the “best” distance to classify, had we chosen the distance metric over frequent item set mining.


1. Transform the data: handle any known abbreviations (for example, [24]7 inc is the same as 247), convert to lower case, and remove the stop words. This creates a list of similar companies. With this I create a list of companies with the key as the first character in each company, and the value as the similar entities.

[Key is to think about reducing the working space, so that it is “scan’nable by human eyes” for discrepancies]

2. Remove the encoding (Very important, since I had internationalization in my input sample)- and do Frequency item set mining on the transformed dataset – In this, I fetch the output of the step 1 and use key, value pairs such as this -:

{‘Thought’: ‘Thought works ltd’, ‘Thought works India’, ‘Thought works US’}

--- which then transforms to ---

{‘Thought’: ‘Thought works’, ‘Thought works’, ‘Thought works’}  [Post entity disambiguation using frequent item set mining. Note that I carefully choose the “support” for the basket as the total length of the basket, i.e. 3 in this case. Thus, “Thought works” had to occur in all baskets to qualify as a “(similar) unique disambiguated entity”.]

This will finally disambiguate and interpret Thought Works as ONE entity.  Finally, I use the concept of flattening a list of lists, since some entities appear multiple times – and will thus be transformed as lists of all similar entities.

3. Do data cosmetics (convert the 1st character of each word to upper case, except in stop words, Recall that I had first converted everything to lowercase ) and Plot the pretty table on the “standardized data-set”  – to get the final frequency counts !

4. Use R to plot these frequencies as nodes in a graph to visualize this, and edit it in Inkscape.
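Steps 1 to 3 above can be sketched in a few lines of Python. This is a simplified illustration and not the code in my Github repository: the stop-word list, the abbreviation map and all function names are assumptions, the baskets are keyed on the first word (as in the ‘Thought’ example) rather than the first character, and the “mining” is reduced to keeping the tokens whose support equals the basket length.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "of", "and"}   # assumed list, for illustration only
ABBREVIATIONS = {"[24]7": "247"}    # the one example given in step 1


def normalize(name):
    """Step 1: lower-case, expand known abbreviations, drop stop words."""
    tokens = [ABBREVIATIONS.get(t, t) for t in name.lower().split()]
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def bucket_by_first_word(names):
    """Step 1: group normalized names into baskets keyed on their first word."""
    buckets = defaultdict(list)
    for name in names:
        cleaned = normalize(name)
        if cleaned:
            buckets[cleaned.split()[0]].append(cleaned)
    return buckets


def stem_of(basket):
    """Step 2: keep the tokens that occur in every name of the basket,
    i.e. tokens whose support equals len(basket)."""
    support = Counter()
    for name in basket:
        support.update(set(name.split()))  # count a token once per name
    return " ".join(t for t in basket[0].split() if support[t] == len(basket))


def title_case(name):
    """Step 3: upper-case the first character of each non-stop word."""
    return " ".join(w if w in STOP_WORDS else w.capitalize() for w in name.split())


def disambiguated_counts(names):
    """Steps 1-3: frequency count per disambiguated entity."""
    counts = Counter()
    for basket in bucket_by_first_word(names).values():
        counts[title_case(stem_of(basket))] += len(basket)
    return counts


names = ["Thought Works Ltd", "Thought Works India", "Thought Works US",
         "[24]7 Inc", "247 Inc"]
counts = disambiguated_counts(names)
```

With this input the three Thought Works variants collapse into one entity with a count of 3, and the two spellings of 247 into one with a count of 2. One caveat of this particular sketch: distinct companies that share only their first word would collapse to that word, so the baskets still need the human scan mentioned in step 1.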

Data structures, & Design Paradigm

I use the Greedy & Divide and Conquer design paradigms to approach this problem and to reduce the sample size of “searches for similar companies” in step 2. Apart from this, I exploited “dictionaries” for quick searches & lookups and for getting the “stem entity”; “frozen sets” for freezing the tuples/sets after the frequent item set mining; a careful choice of “support” for the “bucket”; and sets for searching the “frequent items” in baskets that have duplicates. [Recall that the main difference between a set and a list is that the former is unordered and cannot have duplicates by construct, so this saves us time on lookups and narrows down the work space in case the “entire basket” consists of “identical items”.] I also rely on the fact that similar & identical items will lie locally. More on this is in the Python (py) file on Github.
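A small sketch of the set and frozenset ideas above, again with hypothetical names and not taken from the repository: a basket whose entries are all identical collapses to a one-element set, so no mining is needed for it, and a frozenset is hashable, so a mined item set can serve as a dictionary key for looking up its stem.

```python
def stem_if_identical(basket):
    """Return the single name if every entry in the basket is identical,
    otherwise None (meaning the basket still needs mining)."""
    unique = set(basket)  # duplicates vanish by construction
    return next(iter(unique)) if len(unique) == 1 else None


# A frozenset is immutable and hashable, so it can be a dictionary key.
stems = {}
stems[frozenset(["thought", "works"])] = "thought works"

same = stem_if_identical(["infosys", "infosys", "infosys"])
mixed = stem_if_identical(["thought works ltd", "thought works india"])
```

Because a frozenset ignores order, looking up `frozenset(["works", "thought"])` finds the same stem as the key that was stored.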

My Python code (Steps 1-3) and the input dataset I used are here on Github, and so is the code to plot the graph in R and modify it in Inkscape (Step 4).

There are three additional secrets to solving this problem, more like a secret sauce of beauty -:

1. I always challenge myself to simulate being in an “interview” where I am asked to solve this question. The natural temptation would be to just think hardER; this is where aiming for the optimal solution, accounting for the design paradigm (Divide & Conquer, Greedy, Dynamic Programming: why re-solve that which can be “outsourced” after having been solved the first time) and testing for corner cases come in.

2. I always try to build for scale and as generic as possible, so that I can re-use my code on similar problem space. Why re-solve that which can be borrowed from your own problem-solving space ?

3. Think 1970s, when both disk space & processing power were limited. This is how your algorithm will end up more resourceful than you first thought you could make it.

Together these elements take care of the beauty and elegant’ness part of the core engineering problem . Happy solving !

CRITIQUE is always appreciated – do post your comments, on what does not work & how I can make this better. Better still, looking to collaborate on the use case, as in the opening of the post above – reach out to me at Linkedin / or here . 

Results & Resources

linkedin5_inkspace (PDF file)

Linkedin_processed_file.csv (Frequency maps on the Transformed file)

All of this in Github

Stay Tall,



Tailored from my original post At Grace Hopper’s women in Computing, 2012 here .

Network, Network, Network – and do it right ! 

Networking is not “Foreplay” – and it isn’t creating Black holes either ..


And here’s why – :

Women haven’t traditionally had the benefit of “smoke-time” networking that seems to work so well for men (nor any womanized version of it). That said – surely there must be ways that amazing superwomen around the world are making it into the corner offices.

As a society, we have been forced to focus on getting business cards rather than building relationships that can be leveraged practically. This is equivalent of building black holes around us, in an attempt that someday we will find matter that will fill the hole.

The only problem with this approach is two fold: one, you are selling a bad product, and two, your self-promotional pitch was so unmemorable that the speaker decided not to set aside some processing power in her/his brain to remember you. No one remembers someone with an inflated sense of self-importance. There is a classical statistic from speed dating, which found that the men who hardly spoke but genuinely listened ended up getting dates more often than the other men who focused on speaking for a greater fraction of the speed dating.

At the other end of the spectrum, I observe, often to my utter dismay, the fawning or servile approach. Rather than feeding their own self, the focus reverses to the person whom they deem influential enough. It may not be a conscious effort, and may simply stem from a lack of awareness of one’s real do-ability. Specifically in Eastern cultures, including India, people are maneuvered to cushion their interests to the point of obscurity, which is why information and opportunities sometimes pass them by without their even noticing. So, if you have a strong point, tie it together without taking the detour.

I said earlier that the focus should be on forging mutually befitting relationships, and here’s why: when you focus on creating black holes, as I call them, you are missing out on accidental benefits – amazing support systems, sponsors who would endorse you or vouch for you, mentoring opportunities, or just plain unbiased viewpoints – when you need it. So share something phenomenal that people will remember: spin relevant stories, be inquisitive and keep re-learning, and most of all, stay true to who you really are.

Differentiate the position from the person, be assertive,  don’t downplay yourself, and yet don’t oversell. It is a classical balance problem.

“Network, and network with an end-goal in mind, which is two fold: one, to create spheres of influence, and two, to create information advantage.” – Saundarya Rajesh, AVTAR

Four quick thoughts that sum this up:

1. Focus on the end goal and then work backward: Accidents -> acquaintances -> associates -> advocates -> allies. The point is simple: if you want to move from “foreplay” conversations to forging lasting, mutually befitting relationships, you will invariably have to focus on quality.

2. Measure it: If you are not moving from acquaintances to allies over a period of time, the networks don’t serve a deeper meaning. You see, input alone doesn’t count; what you made of it does. The “measuring” bit doesn’t have to be as flashy as Excel, but it should suffice to give you a reinforcing feedback loop on your overall interpersonal and communication skills.

3. Have a compelling elevator pitch: An elevator pitch is something that gives a compelling introduction to who you are.  Craft one, and try to improvise depending on the listener and her/his background and Intellectual giftedness. Of course this means that you also change dynamically depending on the sophistication that the listener might have about your profession.

4. Demonstrate equality: You can’t have an enriching conversation when you do not consider yourself to be one among equals. Equality begets equality, and it begins with awareness of who you are and what you stand for. Reboot your operating system and get moving – and yes, practice, deliver and then practice again.

The 10,000-hour theory, echoed by different schools of thought, says that deliberate repetition is key. So go develop a lens for the world you want to live in, and grow. Then just reverse engineer what you want to achieve from any relationship that could help you towards it, and work your way through it. Of course, this assumes you focus on forging a mutually beneficial relationship and are not just trying to get into a parasitic one.

Guess what,  we all like to push compelling and competitive candidatures.
Now, Go Rock!


Also published here –

(Dear God, please tell my Mom I am doing very fine, and I still think that studying Quantitative Economics, and flying over 16 countries all the way was worth this Education)

Last week I was on my longest flight so far, from the far west to the far east, and on my way back I bumped into a gentleman who was coming back from Mongolia (the capital, Ulan Bator, to be precise). Talking with him made this one of my shortest trips ever. He was on the marketing side of construction, and I am a bottom-up economist and autodidact. So what could we talk about?

I had no clue about construction, or marketing in the construction business, or even Mongolia. All I knew about Mongolia was that it has huge deposits of coal. Some people might hit a stalemate at that (what can you really talk about?), while I was like: tell me more about it!

We talked about the Mongolian empire, how the Mongols built monasteries all the way to Hungary (we could not agree whether the Mongols came up to Hungary, but that is for some other time), its comparison with the Romans, and the influence of the Mongols on food. And seamlessly, we moved to the Chinese dynasties, the Han empire and the Qing dynasty, and how the emperors had tried to bring in Buddhism to bind people together. Then we moved to the dispute over Tibet and Taiwan, and what I had learnt at the Harvard Project for Asian & International Relations in Taipei, Taiwan, that I was just coming from.

 Come to think of it – it was a powerful  conversation(and I really learnt a lot).  Now that is the thing, finding CPI , or the common point of interest sets you apart from the rest. And come to think of it, it’s really an art. For once he did not tell me what a superwoman I was ( I like telling myself that I am one) – but I instantly knew that he enjoyed talking to me, as much as I did.

    This being able to talk to people “about them” has been my journey of the last 4-5 years. I have practiced and matured it on CEO**’s and Prof’s – but that is not to brag- the point is about learning. 

So my quick take-aways –

1. People are always willing to “Teach” you, once they see the intellect + energy + enthusiasm in you to learn. This is how we grow, and it is a more sustainable way of learning than feeding some decaying facts into your brain.

2. You never choose your mentor. Your mentor chooses you. Point. (Borrowed from Indra Nooyi)

3. Always push meritocratic people, even if you have to go out of THE way to get them some limelight, and then just keep passing the ball.

4. You don’t ask, you don’t get. Under –rated, over-said, but true.

5.  No matter what you want to learn or know, stop the foreplay, and get to the CPI , quick enough to hold them.

6.  What goes around comes around – Gratitude can not be taught, practice it.

7. There  is no single character that can beat your Genuineness – you will know it when it has arrived to you.

8. Push & stretch yourself, and then stretch yourself a little more.

9. You are beautiful, because of  your words and your self awareness and that is something that sets you apart and a Brand “YOU” in this dynamic very competitive world.

10 . We are more than the organizations , and schools we represent – and in some way the hierarchies of power are shifting bottom up . How YOU speak really defines your School, your community and your organisation and your people.

11. NEVER stop learning. 

12. Everything around you has always been there, but when you meet people who have been there, done that, it sets a context that you can build on, as if things, people, places and empires have suddenly started to exist in your consciousness. Beg, borrow, steal, but GET that context right in your head. You will be amazed at how much you can really learn, after all.

13. Stop being mediocre, and stop when you have made your point. And, hold your drink,  if you don’t have that super take-away 🙂 

(Emerging ideas from my book – 20’s is the new 30’s, planning the dots, you will want to connect backward. Dedicated to Annie Fan, a superwoman who knows how to listen, and listen well)



** John Kearon, CEO, Brainjuicer that even shaped 4 months of my work – My Master’s Thesis – On how we think we think, is not how we really think .  In some ways, that is THE bit about connecting the dots. 

Traditionally, the goal of e-governance projects across the world has been broadly, twofold – ensuring a faster service time to address the pain points and lowering the cost of delivering the public services. However, as the new technologies such as Social media and Big data emerge, Governments should re-think of e-governance not merely as technology centric – but as a tool for participatory public policy.

The possibilities that social media has opened up for governments are many. For starters consider the GovLab, an initiative in USA which works with senior government executives and thought leaders from across the globe – and runs controlled experiments for a better public policy, while also helping reduce the costs of providing services. Or consider @sweden initiative by the government of Sweden, which along with an advertising company runs a twitter account controlled by ordinary citizens for seven days, on a roll basis. This project aims at better governance, engagement of its citizens by amplifying their voice in a transparent manner- while also supporting the tourism industry in Sweden.

At the other end of the spectrum are projects aimed at reinforcing good citizenship based on behavioural economics: across health, reducing fecal pollution by dogs by inducing dog owners to clean up (Taiwan), and creating incentives to ask for receipts and beat corruption in tax compliance (mainland China). These are experiments the rest of the world is exploring to change behaviours in ways that stick.

       In the Indian context – now consider delivering mobile health care, where social media listening tools can offer countless opportunities to help deploy resources in an optimized manner – allowing for an efficient delivery to the “last mile”, and developing key infrastructure and utilities in the currently underserved districts. Or, consider supplementing the RBI’s latest e-governance application and online tracking system in its foreign exchange department with social media initiatives such as those developed by Clemson University, USA– that used social media listening tools to predict the direction of stock market and foreign exchange. Having similar projects can help triangulate the consumer and Foreign investor sentiment helping the central banks and governments to handle pressing monetary policy issues in a dynamic manner. The benefits are manifold –  allowing for a better planning and attracting the fleeting FDI by signalling the trends in consumer market thereby re-enforcing that India is indeed a key market from where future growth will emerge.


      Or look at the Aviation, FMCG or telecommunication sector – by bringing symmetry to the information in the markets – we can supplement existing institutions like the Competition Commission of India, CCI to extract useful information on consumer surplus to help decide on how best to allow the competitive landscape emerge, and help the businesses grow.  By supplementing their insight with an additional triangulation of relevant and contemporaneous data, this will allow us to do what is actually in the consumer interest, rather than what “they” think it is.  This is the helm of participatory governance.

Bringing in such innovative, disruptive projects will also help entice top talent into the public sector and help revive the competitive landscape by restricting brain drain to the much coveted private sector. Using the wisdom of crowds and crowd sourcing its problems will create a rich, vibrant pool of ideas in a fair, meritocratic manner, while keeping the overall costs of the project considerably low. We opened up our markets to the world in 1991, and now it is time to open up governance to our citizens and best brains, by making it relevant. Of course, like all sectors, in its bare bones it has manifold challenges. The first is cultivating the mindset, but the good news is the developing trend among today’s dynamic youth to lead parallel careers and consultative projects for the social good. Then there is a disheartening internet penetration of a mere 10.2% in India, and the need for regulating the “social listening” if at all such pilot programs are launched. Yet against all of this, we together are more resourceful than the resources that constrain us.

In conclusion, amplifying the voice of the citizens in a transparent manner can help governments do a cost-benefit analysis of which public goods to develop, allowing for better five-year plans, essentially outsourcing their public policy and developing a truly meritocratic governance. This, in true terms, is the largest (democracy): for the people, of the people and by the people.


The West is changing its traditional structures, and shifting the democracies bottom up – it’s time we caught up, too. Incredible India, after all.

“I agree with you, Well, the idea is that, I will show you in a moment, that the results hold true…”   the more the presenter stuttered the more the listeners shied away. Stuttering in itself is NOT such a blockade to selling your ideas,  but like a down-hill-trip  it gets you down – and before you know – The Show time is OVER, Thank You ! 

While it is clear that this is the last thing , you would want to EVER do to YOUR brainchild, Your IDEAS, why do people,  then Stutter  ? 

SIMPLE , they CHANGE THEIR MIND , while , before and during speaking .

[So , now, you get the background , hopefully .]

Ok, first let me admit that I am not an excellent communicator myself, but I have worked enough on the things that “can be fixed “

What we are not talking about is public speaking, we are talking about THEM , not YOU. And that is the difference . That is – grabbing your audience by their neck, and earning that attention, or better put “Permission to market”

I think in a pyramid manner, so here are some thoughts I will leave you with. Nothing mentioned here is something you might not know, but like a self-reinforcing loop, if we program ourselves to think about the audience, we have just “sold” what we set out to.

  1. When you don’t know the answer, please say so. Overstated, under-practiced. Point.
  2. Deck YOUR best CARDS , and play them FIRST .
  3. If people feel nothing, They do nothing .  You HAVE to MAKE them feel . (Idea credits , John Kearon CEO, Brain Juicer , @ TEDx RhineMain )
  4. The goal of persuasion is NOT the flowers in the PITCH but the GOAL is — that which sells.
  5. There is no TRUTH, just FACTS. Truth , is an interpretation of our view points .
  6. Getting emotionally attached to your Idea/pitch is the worst form of narcissism .
  7. LISTEN. CLARIFY . BUY TIME TO THINK , but do not open the mouth and remove all doubt .
  8. Re-learn, re-read, re-arrange , collect your thoughts. OFTEN .
  9.  HUMILITY wins you more ears than arrogant persistence , (almost) Always .
  10. MAKE a STRIKE when YOU have THEM.

It’s about THEM .POINT .


Regressions , Huge data sets & Google

Why Google still trusts its “Manual”  Page ranking algorithm ,ie. why is it not trying to catch the wave of machine learning ? Into the Unknown , unseen and the unexplored . The Black Swan ..

Read more here .

And .. The Big data effect !


This post began in my head quite some time back, when almost every other night I had this drowning feeling about “Doing something about IT”. After a while I thought, gosh, I could use some help! So the next day, I asked my Prof: “If computer grads “show off” their skills by coding and going to all these fancy competitions, there MUST be something the Economics grads must be doing. What is it?”

You have this lovely beautiful, almost flawless model and you look at it – and you are like – but its so obvious , how could anyone write that ? And as you ask this , you are (almost) marvelling at the sheer beauty of it .

As an Econ Grad ,look around – simple as it may sound – you will find loopholes which you CAN fix – so DO it with “the” MODELS 🙂

Seeing it before anyone else does – that’s the goal . And then finding the data that will fit it 🙂

And (please) keep it simple, stupid 🙂


*Wrote my first “model” this morning; actually it woke me up, almost, much like my “Copenhagen pool a pool”, the idea which I developed while at SAP Labs, India, and which I got after seeing a play about a meeting between the physicists Niels Bohr and Werner Heisenberg in Copenhagen in 1941.
