Finding symmetries in an unsymmetrical world ..

Handling typos on QWERTY keyboard in your datasets

Problem statement:

For hand-collected/hand-crafted data sets, such as manually keyed survey data and other data sets where the person keying in the data on a digital device is short of an acceptable typing standard, conventional distance metrics fail. This is because these metrics “assume” that the representation in the data set is a faithful proxy for the intent of the end user, which is not true once unintentional errors creep in.

This calls for a way to model the unintentional errors that have crept in, and that in turn requires modelling the very structure of the input keyboard used to key in the data. In this algorithm, I explicitly model the QWERTY keyboard as a graphical entity: the 26 letters of the English alphabet are the nodes of a graph, and the distance between nodes accounts for the physical proximity of the keys on the keyboard.
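As a rough, minimal sketch of this representation (a sketch only, not the repository code): the row layout and the unit weight between adjacent keys come from the description here, the width of the “adjacent” window is chosen to reproduce the D example given in the list further down, and the helper names are mine.

```python
# Minimal sketch of the QWERTY keyboard as a networkx graph.
# Nodes are the 26 letters; an edge of weight 1 joins keys that sit next to
# each other on the physical layout (same row, or the row below/above).
import networkx as nx

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_qwerty_graph():
    g = nx.Graph()
    g.add_nodes_from("abcdefghijklmnopqrstuvwxyz")
    for r, row in enumerate(ROWS):
        for c, key in enumerate(row):
            if c + 1 < len(row):                  # neighbour in the same row
                g.add_edge(key, row[c + 1], weight=1)
            if r + 1 < len(ROWS):                 # neighbours in the row below
                below = ROWS[r + 1]
                for offset in (c - 1, c, c + 1):  # 3-wide window (an assumption)
                    if 0 <= offset < len(below):
                        g.add_edge(key, below[offset], weight=1)
    return g

qwerty = build_qwerty_graph()
print(sorted(qwerty.neighbors("d")))  # ['c', 'e', 'f', 'r', 's', 'v', 'w', 'x']
```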

Scope: this algorithm will find the similar entities among many in a hand-crafted data set. For example, given the two entities “Weka” and “Wkea”, it will identify them as one entity, and, with a spell checker in the final disambiguating pass, also identify which of the two is the actual entity.

The limitation of this apparatus, though, is for words that are not identified/auto-suggested by the English dictionary, such as people’s names like “Ekta” and “Keta”. In that case it will still identify and mark the two entities as the same, having “identified the typo”, but it will be limited in its scope to say which of them is actually the right representation.

In that sense, this application is intent prediction in a global scope, but NOT the identification of which of the singular entities is the actual representation.

[Figure: The QWERTY keyboard representation]

Algorithm & data structure representation:

  1. The three rows of the English alphabet on the QWERTY keyboard are represented as nodes of the graph, with a distance of 1 between adjacent nodes. The adjacent nodes are in turn picked accounting for the fact that typographical errors are more likely to occur in a neighboring zone. For example, according to the schema above, Q is adjacent to W, A and S, the adjacency being represented by the neighborhood of that node. Similarly, for D as the key, W, E, R, S, F, X, C and V are its immediate neighborhood (as in the graph sketch above).
  2. The concept of the neighborhood above also narrows down which keys are acceptable “typos” in the immediate neighborhood, so that the distance metric does not penalize the user for such slips (see the sketch after this list).
  3. An extension of the adjacent nodes is the case where the user inadvertently swaps keys between the left and right hands: instead of typing “weka” the user types “wkea”, because the user’s fingers are not synched up while typing. This can be due to inexperience in typing, or pure errors while speed-typing (typical of surveys, call-center chat data, etc.). In this scope the comparison, instead of being against the neighborhood/adjacent nodes as above, is against the left vs. right portions of the QWERTY keyboard, with defined positions for what is left and what is right. This ensures that when two consecutive characters are deemed “typos emanating from left and right hand asynchronization”, these characters indeed come from the left and right parts of the keyboard.
  4. As the next step, the threshold for what is an acceptable difference between the words is defined; currently this is set as one-third of the absolute difference in the lengths of the words being compared, because the difference in lengths has to stay within a defined bound. For example, instead of typing “asynchronous” a user may type “asynchronus” (missing the “o” after the “n”); given the threshold restriction, the distance metric should uniquely attribute this to “asynchronous”. (In this case it will happen in the first auto-suggest step itself, thereby narrowing the scope, or reinforcing the uniqueness of the attribution.)
  5. If the difference in the lengths of the words exceeds the threshold, the algorithm places a higher belief that the words are genuinely different and uses the Levenshtein distance metric instead.
  6. As a final step, to identify which of the two keys is the rightful one (“lengt” and “length” are both mapped to one entity, but the algorithm up to the previous step does not know which is the right word), the word being compared is passed through an “auto-suggest”, which looks up predefined words in the English dictionary and outputs the candidate(s) with the lowest distance metric.
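To make steps 2 to 6 concrete, here is a rough sketch rather than the exact implementation in the repository. The left/right hand split, the reduced (0.5) penalty for an adjacent-key slip, the reading of the length threshold as one-third of the longer word’s length, and the toy auto-suggest pass are all assumptions; the Levenshtein fallback is the plain textbook version. It builds on the qwerty graph from the earlier sketch.

```python
# Sketch of the keyboard-aware comparison (steps 2-6), using the qwerty graph above.
LEFT_KEYS  = set("qwertasdfgzxcvb")   # assumed split of the keyboard into hands
RIGHT_KEYS = set("yuiophjklnm")

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (the fallback of step 5)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def qwerty_distance(w1, w2, graph):
    """Neighbouring-key slips and left/right-hand swaps of consecutive
    characters are penalised less than ordinary edits (steps 2-4)."""
    w1, w2 = w1.lower(), w2.lower()   # the graph nodes are lowercase letters
    # Step 5: if the length difference exceeds the threshold, fall back to Levenshtein.
    # (One-third of the longer word's length is one reading of the rule above.)
    if abs(len(w1) - len(w2)) > max(len(w1), len(w2)) // 3:
        return levenshtein(w1, w2)
    dist, i, n = 0.0, 0, min(len(w1), len(w2))
    while i < n:
        a, b = w1[i], w2[i]
        if a == b:
            i += 1
            continue
        # Step 3: two consecutive characters swapped between the two hands.
        if (i + 1 < n and w1[i] == w2[i + 1] and w1[i + 1] == w2[i]
                and (a in LEFT_KEYS) != (b in LEFT_KEYS)):
            i += 2
            continue
        # Step 2: a slip onto an adjacent key is a "cheap" typo; anything else costs 1.
        dist += 0.5 if graph.has_edge(a, b) else 1.0
        i += 1
    return dist + abs(len(w1) - len(w2))

def auto_suggest(word, dictionary, graph):
    """Step 6: return the dictionary candidate(s) with the lowest distance."""
    scored = [(qwerty_distance(word, cand, graph), cand) for cand in dictionary]
    best = min(score for score, _ in scored)
    return [cand for score, cand in scored if score == best]
```

Under this sketch, qwerty_distance("weka", "wkea", qwerty) comes out as 0, since the swapped “e”/“k” span the two hands, and auto_suggest("asynchronus", ["asynchronous", "synchronous"], qwerty) picks “asynchronous”.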

As a proposed extension to this algorithm, I intend to try out different dictionary approaches for auto-suggest (together with Apache Solr and Elasticsearch).

I used Python’s networkx to model the QWERTY keyboard, and built the QWERTY distance metric on top of it.

Complexity

As with other distance metrics, this metric computes (n−1) comparisons for each word, outputting the word with the closest distance, so the pairwise comparison is O(n²). After this, the auto-suggest feature narrows the scope down to the rightful candidate key(s). Overall the complexity is O(n²).

Performance metrics & Benchmark tests

The way to measure the performance of this algorithm is to pass it over the corpus of a hand-collected data set and compare its performance (time complexity and results) against other distance metrics, such as a pure Levenshtein distance, Manhattan and Euclidean distance, though as a rule of thumb Levenshtein will do better on a pure distance search than the other two (Manhattan and Euclidean).
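A rough sketch of such a benchmark, reusing the qwerty_distance and levenshtein functions from the sketch above (the corpus, the merge threshold, and the timing call are placeholders, not part of the original write-up):

```python
# Sketch of a benchmark: time each metric over all word pairs of a corpus and
# record which pairs it would merge as "the same entity".
import time
from itertools import combinations

def benchmark(corpus, graph, merge_at=1.0):
    metrics = {"qwerty": lambda a, b: qwerty_distance(a, b, graph),
               "levenshtein": levenshtein}
    results = {}
    for name, metric in metrics.items():
        start = time.perf_counter()
        merged = [(a, b) for a, b in combinations(corpus, 2)
                  if metric(a, b) <= merge_at]
        results[name] = {"seconds": time.perf_counter() - start, "merged": merged}
    return results
```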

While modeling this problem, I also found a version of fuzzy string match at http://search.cpan.org/~krburton/String-KeyboardDistance-1.01/KeyboardDistance.pm and I will contrast the performance results against this (pending).

GitHub code is here: https://github.com/ekta1007/Custom-Distance-function-for-typos-in-hand-generated-datasets-with-QWERY-Keyboard

Suggestions and critique are, as always, welcome.

Stay Green & Growing,

Ekta 
