Lexical Normalization


This project aims to predict the best match for each input token that is potentially misspelled, with respect to a reference dictionary. A pipeline consisting of exact string matching, approximate string matching and word frequency weighting is implemented. In the approximate matching part, methods like global edit distance and editex are contrasted.


The input dataset misspell.txt contains 10322 words while only 3755 are unique. Some input tokens correspond to multiple "correct" answers. For example, "it's" can be mapped to the contraction itself or the pronoun "its" depending on the specific context. The reference dictionary contains 370099 words while 1191 words in the benchmark correct.txt are still missing from the dictionary.

The original dataset is from \cite{baldwin2015shared}.

Related Work

Misspelled tokens are transformed into a consistent form during lexical normalization. \cite{han2011lexical} exploits lexical edit distance and phonemic edit distance to capture morphophonemic similarity for tokens in Twitter short messages. \cite{ahmed2015lexical} proposed a 5-step process which involves edit distance and refined Soundex.

\cite{zobel1996phonetic} demonstrates that Editex and Ipadist are superior to edit distance and N-gram and Soundex has the lowest performance in terms of precision and recall percentage. As a result, Editex is chosen for phonemic similarity and edit distance is picked from lexical distance based methods.


A prediction pipeline consisting of three steps is constructed.

  1. Exact String Matching
  2. Approximate String Matching (optional)
  3. Word Frequency (optional)

Exact String Match

If an input token already exists in the reference dictionary, the input token itself is simply determined as the best match; step 2 and step 3 will also be skipped.

These special cases can be found by intersecting the input dataset and the reference dictionary. $$\{\texttt{exact matches}\} = \{\texttt{input}\} \cap \{\texttt{dict}\}$$

2352 distinct input tokens can be directly matched in step 1 while 1403 tokens need further steps. This step demands least computing resources and finishes in seconds (Python implementation on a general laptop).

Approximate String Match

The approximate matching part has two candidates: Global Edit Distance from lexical distance perspective and Editex from phonemic similarity perspective.

Global Edit Distance

For each input token, computing the global edit distance against a dictionary of 370k words takes time. Instead, all neighbors with a single-character deletions, insertions and replacements are enumerated. The intersection between the neighbors and the reference dictionary becomes the candidate(s). If an empty intersection is found, the distance keeps incrementing over iterations until at least one candidate is found or the distance exceeds a pre-defined limit.


Python library py-stringmatching implements Editex distance described in \cite{zobel1996phonetic}. Specifically, 0 penalty is given to exact character match; 1 is given if two characters are in the same group (e.g. m-n / c-k) while 2 is given to dissimilar characters.

The normalized editex similarity between two non-empty strings is derived from the editex distance between them. $$score = 1 - \frac{distance}{2 \cdot \max(s1.length,\ s2.length)}$$

Each input token has a similarity to each word in the dictionary. The word(s) in the dictionary with the highest score become the candidates.

For each input token, it takes around 1-2 minutes to compute the similarity scores to all the words in the dictionary (Python implementation on an 1 vCPU - 4GB RAM virtual machine). The job was built in parallel on two identical instances on NeCTAR cloud. AWS DynamoDB was used to store the intermediate results.

Word Frequency

Step 2 produces a list of candidates with the same distance / similarity. For example, "booking", "cooking" and "looking" has the same distance to "xooking".

Input tokens are isolated terms and the context is stripped. It is difficult to determine the best match. As a result, word frequency is introduced to break the tie. For example, both "similar" and "similor" are the prediction candidates of "similiar". Word frequency makes "similar" prominent and enables successful prediction.

Python library wordfreq provides the frequency of words.

Evaluation & Analysis

As discussed in Dataset subsection, some input tokens correspond to multiple "correct" answers. As a result, a prediction is considered successful as long as it hits one of the accepted answers.

$$accuracy = \frac{\#\ of\ successful\ predictions}{total\ \#\ of\ predictions}$$

Method Accuracy Comment
Edit Distance 1 62.557% distance ≤ 1
Edit Distance 2 62.716% distance ≤ 2
Edit Distance 3 62.716% distance ≤ 3
Editex 64.288%


Performance Trade-off

Editex performs slightly more accurately than edit distance. This result is consistent with \cite{zobel1996phonetic}. In terms of execution performance, edit distance with immediate neighborhood takes seconds to finishwhile Editex similarity takes more than 30 hours. As a result, the utilization of Editex similarity is restricted in real-time system if a large reference dictionary is given. The execution performance of edit distance is independent on the dictionary size because neighbors are enumerated. Evaluating the intersection of two Hash Sets has linear time complexity: iterating over the smaller set is $\mathcal{O}(n)$ and checking the existence of each element in the bigger set is $\mathcal{O}(1)$.

When the distance limit increases from 1 to 2, the accuracy increases marginally and the execution time increases from 2 seconds to 75 seconds. When the distance limit is increased to 3, the build time surges to several hours but the accuracy is not improved at all. These results verify the statement in the lecture material "exhaustive search beyond the immediate neighborhood is at the upper limit of feasibility and usability".

Keep the Original Token

The predictor can also leave the input token unchanged if there is no (good) match. For example, a threshold can be put in-place for Editex similarity to find relatively "good" matches. If the similarity of the top candidates does not reach a certain level of confidence, the original input token is reserved. The following table shows the accuracy has been improved after input tokens with unconfident predictions are left unchanged. It also shows more "corrections" result in more incorrect predictions.

Method Accuracy Comment
Exact match 90.940% (distance = 0)
Edit Distance 1 74.274% distance ≤ 1
Edit Distance 2 66.338% distance ≤ 2
Edit Distance 3 64.447% distance ≤ 3


A prediction pipeline with 3 steps is proposed, implemented and verified. In terms of approximate matching, Editex performs more accurately than edit distance while the computation cost makes Editex at the edge of feasibility and usability. Leaving the input tokens unchanged is a good strategy if there is no satisfying match.


  title={Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition},
  author={Baldwin, Timothy and de Marneffe, Marie-Catherine and Han, Bo and Kim, Young-Bum and Ritter, Alan and Xu, Wei},
  booktitle={Proceedings of the Workshop on Noisy User-generated Text},

  title={Lexical normalisation of short text messages: Makn sens a\# twitter},
  author={Han, Bo and Baldwin, Timothy},
  booktitle={Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1},
  organization={Association for Computational Linguistics}

  title={Lexical normalisation of twitter data},
  author={Ahmed, Bilal},
  booktitle={2015 Science and Information Conference (SAI)},

  title={Phonetic string matching: Lessons from information retrieval},
  author={Zobel, Justin and Dart, Philip},
  booktitle={Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval},