Twitter engineering today explained its real-time "related queries" and "spelling corrections" process in Twitter search. "As you may have noticed, searches on twitter.com, Twitter for iOS, and Twitter for Android now have spelling corrections and related queries next to the search results," Twitter said.
The company notes that "within the first two weeks of launching related queries and spelling corrections in late April, Twitter has corrected 5 million queries and provided suggestions to 100 million more."
The company concludes that it is working on more ways to help find and discover the most relevant and engaging content in real time. Also, there are other "big improvements" Twitter will be rolling out to Twitter search over the coming weeks and months.
"At the core of our related queries and spelling correction service is a simple mechanism: if we see query A in some context, and then see query B in the same context, we think they're related. If A and B are similar, B may be a spell-corrected version of A; if they're not, it may be interesting to searchers who find A interesting. We use both query sessions and tweets for context; if we observe a user typing [justin beiber] and then, within the same session, typing [justin bieber], we'll consider the second query as a possible spelling correction to the first -- and if the same session will also contain [selena gomez], we may consider this as a related query to the previous queries. The data we process is anonymized -- we don't track which queries are issued by a given user, only that the same (unknown) user has issued several queries in a row, or continuously tweeted," Twitter said.
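The session-based mechanism Twitter describes can be illustrated with a small sketch. This is not Twitter's code: the function name `cooccurrence_counts`, the in-memory list of sessions, and the choice to count each ordered pair once per session are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(sessions):
    """Count how many sessions contain query `a` followed later by query `b`.

    `sessions` is a list of anonymous sessions, each an ordered list of
    query strings. No user identifier is needed, only the grouping.
    """
    counts = Counter()
    for session in sessions:
        pairs = set()  # count each ordered pair at most once per session
        for i, earlier in enumerate(session):
            for later in session[i + 1:]:
                if earlier != later:
                    pairs.add((earlier, later))
        counts.update(pairs)
    return counts

# Hypothetical sessions, modeled on the example in the quote above.
sessions = [
    ["justin beiber", "justin bieber", "selena gomez"],
    ["justin beiber", "justin bieber"],
    ["selena gomez", "justin bieber"],
]
counts = cooccurrence_counts(sessions)
print(counts[("justin beiber", "justin bieber")])
```

A pair that recurs across many sessions becomes a candidate; whether it is treated as a spelling correction or a related query would then depend on how similar the two strings are.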
"To measure the similarity between queries, we use a variant of Edit Distance tailored to Twitter queries; for example, in our variant we treat the beginning and end characters of a query differently from the inner characters, as spelling mistakes tend to be concentrated in those. Our variant also treats special Twitter characters (such as @ and #) differently from other characters, and has other differences from the vanilla Edit Distance. To measure the quality of the suggestions, we use a variety of signals including query frequencies (of the original query and the suggestion), statistical correlation measures such as log-likelihood, the quality of the search results for the suggestion, and others."
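Twitter does not publish its Edit Distance variant, but the general idea of position-sensitive and character-sensitive costs can be sketched as follows. Every number here is an assumption: the costs `edge_cost` and `special_cost`, and even the decision to make edits to the first and last characters and to `@` and `#` more expensive, are invented for illustration and not taken from the source.

```python
SPECIAL = {"@", "#"}  # Twitter-specific characters given their own cost

def char_cost(s, i, edge_cost=2.0, special_cost=3.0):
    """Cost of inserting, deleting, or replacing the character s[i].

    The weights are hypothetical; the source only says that edge and
    special characters are treated differently from inner characters.
    """
    if s[i] in SPECIAL:
        return special_cost
    if i == 0 or i == len(s) - 1:
        return edge_cost
    return 1.0

def weighted_edit_distance(a, b):
    """Levenshtein-style dynamic program with per-character costs."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + char_cost(a, i - 1)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + char_cost(b, j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                substitute = d[i - 1][j - 1]
            else:
                substitute = d[i - 1][j - 1] + max(
                    char_cost(a, i - 1), char_cost(b, j - 1)
                )
            delete = d[i - 1][j] + char_cost(a, i - 1)
            insert = d[i][j - 1] + char_cost(b, j - 1)
            d[i][j] = min(substitute, delete, insert)
    return d[m][n]

print(weighted_edit_distance("justin beiber", "justin bieber"))  # 2.0
```

With these made-up weights, the inner-character slip in [justin beiber] costs 2.0, while adding a `#` to a query costs 3.0, so a hashtag is less readily treated as a misspelling of the plain word.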
"Twitter's spelling correction has a number of unique challenges: searchers frequently type in usernames or hashtags that are not well-formed English words; there is a real-time constancy of new lingo and terms supplied by our own users; and we want to help people find those in order to join in the conversation," the company said.
"To address all of these issues, on top of our context-based mechanism, we also index dictionaries of trending queries and popular users that're likely to be misspelled, and use Lucene's built-in spelling correction library (tweaked to better serve our needs) to identify misspelling and retrieve corrections for queries," explained Twitter.
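As a rough illustration of dictionary-based correction, the sketch below looks up a query in a small list of known terms and proposes the closest match. It does not use Lucene; Python's standard `difflib.get_close_matches` stands in for the spelling library, and the `suggest` function, the sample dictionary, and the 0.8 cutoff are assumptions.

```python
import difflib

def suggest(query, dictionary, cutoff=0.8):
    """Return a corrected form of `query`, or None if no correction applies.

    `dictionary` is a list of known terms (for example trending queries or
    popular usernames) stored without their leading @ or #.
    """
    prefix = query[0] if query[:1] in ("@", "#") else ""
    term = query[len(prefix):].lower()
    if term in dictionary:
        return None  # already a known term, nothing to correct
    matches = difflib.get_close_matches(term, dictionary, n=1, cutoff=cutoff)
    return prefix + matches[0] if matches else None

# Hypothetical dictionary of popular usernames.
popular_users = ["justinbieber", "selenagomez", "ladygaga"]
print(suggest("@justinbeiber", popular_users))
```

Stripping and restoring the `@` or `#` prefix keeps the lookup focused on the name itself, which matters because usernames and hashtags are rarely dictionary words.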
Twitter says it started out computing related queries and spelling corrections in a batch service, but noticed the lag: the models would take several hours to adapt to new search trends. So the company rewrote the entire service as an online, real-time one.
"Queries and tweets are tracked as they come, and our models are continuously updated, just like the search results themselves. To account for the longer tail of queries that has less context from recent hours, we combine the real-time, up-to-date model with a background model computed in the same manner, but over several months of data (and updated daily)," explained Twitter.
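One common way to combine a fresh but sparse model with a stable long-term one is linear interpolation, sketched below. Twitter does not say how it merges the two models, so the formula, the function name `blended_score`, and the smoothing constant `k` are assumptions chosen only to show the idea.

```python
def blended_score(realtime_count, background_count,
                  realtime_total, background_total, k=50.0):
    """Blend real-time and background relative frequencies for a suggestion.

    The weight on the real-time model grows with the amount of recent
    evidence; with none, the score falls back to the background model.
    `k` is a hypothetical constant controlling how fast that shift happens.
    """
    realtime = realtime_count / realtime_total if realtime_total else 0.0
    background = background_count / background_total if background_total else 0.0
    weight = realtime_count / (realtime_count + k)
    return weight * realtime + (1.0 - weight) * background

# A long-tail pair with no recent observations relies on months of data.
print(blended_score(0, 10, 100, 1000))
# A newly trending pair is scored mostly from the last few hours.
print(blended_score(50, 0, 100, 1000))
```

Under this scheme a long-tail query still gets suggestions from the background model, while a term that started trending an hour ago is picked up before the daily background update runs.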