2009-12-17 16:26

A one-liner for finding spelling mistakes in code

I do a lot of programming, and I like writing one-liners to help me with things, so it’s perhaps not surprising that I’ve ended up writing a one-liner to help me with my programming. I should point out that the initial motivation was not a mistake in my own code; it was someone else’s code I was looking at which needed correction. Still, it would be hubris to assume I’m never going to make any mistakes myself, so I’m sure this script will be useful for my own code too.

Of course, nowadays editors will at least spell check the comments in your code for you, but it is also good to make sure your variable names don’t contain misspelled words, as that makes it harder for people (who know the correct spelling) to collaborate with you. This one-liner is rather crude and does produce a lot of noise in the output, but it is also interesting from a technical point of view, so I will discuss below how I came up with it and how it works.

Initial attempts

My first thoughts were to try a direct grep method. The command:

grep -x foo /etc/dictionaries-common/words
would output foo if that word was found, and nothing if it wasn’t. The requirement is to find spelling mistakes, though, not correctly spelled words, so we would want to reverse this logic. An obvious alternative to try is:

grep -xv foo /etc/dictionaries-common/words
since -v does reverse the logic, in some sense, but what will actually happen is that every word which doesn’t match foo is output, which is even less useful. Another way of using grep is:

grep -xc foo /etc/dictionaries-common/words
which outputs 1 if the word foo is found, and 0 if it isn’t. Having what looks like a binary output stream, however, isn’t very useful on its own. I had hoped that using the options -xcn would cause grep to output 0:foo if the word were not found, but sadly it doesn’t work like that.

I think I then went on to try some other clever methods, probably using xargs a lot, until I thought of an even cleverer method involving uniq. I realised I could cat the words from one file along with the dictionary file, then sort the combined list and run uniq -c on it. This would produce a list of words each prefixed by a count, and that count would be 1 if the word appeared exactly once in one of the files and not at all in the other. Unfortunately this wouldn’t tell you whether the unique occurrence was an uncommon word from the dictionary or a spelling mistake from the source file.

If I had been a bit cleverer still, I might have tried catting two copies of the dictionary, so that every dictionary word has a count of at least 2, and then dismissing words which appeared two or more times in the output. On its own this would not catch words which are spelled incorrectly several times in one file, but that could perhaps have been worked around by de-duplicating the test words first:

cat WordsToTest-OnePerLine.txt | sort | uniq | cat - /etc/dictionaries-common/words /etc/dictionaries-common/words | sort | uniq -c | grep '^ *1 '
but I didn’t think of that at the time.

Anyway, at this point I gave up and called my friend over. As he is an expert at using awk, and awk is Turing complete and thus “cheating” under my one-liners rule, I was hoping he would be unable to solve this challenge. He did manage to get very close, but “fortunately” he was unsuccessful, limited by the difficulty of looping over two input files simultaneously. Seeing how close he got to a solution, however, did inspire me to tackle the problem again from a different angle, which is why I looked into using diff.

Using diff

It turns out that diff can accept stdin as one of its files for comparison, so you can pipe in a list of words and compare them with the dictionary, like so:

echo -e "one\noneandahalf\ntwo\nthree" | sort | diff - /etc/dictionaries-common/words
This will produce a lot of output, though, as the diff also lists the words which are in the dictionary and not in the echoed text. To filter these unnecessary lines out, the command should be:

echo -e "one\noneandahalf\ntwo\nthree" | sort | diff - /etc/dictionaries-common/words | grep "<" | cut -c 3-
where the cut is to remove the “< ” at the beginning of each line, which was added by the diff.

The only remaining task was to generate a list of tokens that can be checked with this diff trick. If the files you’re interested in checking are JavaScript source files, for example, you might use a command like this to generate those tokens:

find . -name "*.js" | xargs cat | sed 's/[^a-zA-Z]/ /g' | tr ' ' '\n'
The find produces a list of eligible filenames, the xargs cat outputs their contents, the sed turns all non-alphabetic characters (digits included) into spaces, and the tr turns all spaces into new lines. This does produce a lot of blank lines, but those can be sorted and uniqed out later.

The more significant issue is how to deal with camelCase or CamelCase words. Each word in the concatenated string should be checked on its own, so strings like this should be split into their constituent parts. What is needed is a search in each word for a lower case letter followed by an upper case letter, with a space inserted between them. This is possible with sed, so the command above should be replaced with:

find . -name "*.js" | xargs cat | sed 's/[^a-zA-Z]/ /g' | sed 's/\([a-z]\)\([A-Z]\)/\1 \2/g' | tr ' ' '\n' | sort | uniq
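The tokenising steps above can be tried in isolation on a single line of made-up source code. This is only a sketch: the function name tokenise and the sample input are my own assumptions, the find and xargs cat stages are replaced by a printf so that nothing depends on the files in the current directory, and the trailing grep -v (which drops the leftover blank line) is an addition that is not part of the command above. The locale is pinned to C so that the sort order is the same everywhere.

```shell
#!/bin/sh
# Pin the locale so that sort orders the tokens reproducibly (an assumption
# made for this sketch; the commands in the article use the system locale).
LC_ALL=C; export LC_ALL

# Hypothetical helper: read source text on stdin, print one token per line.
tokenise() {
    sed 's/[^a-zA-Z]/ /g' |                 # every non-letter becomes a space
    sed 's/\([a-z]\)\([A-Z]\)/\1 \2/g' |    # split camelCase at lower/upper boundaries
    tr ' ' '\n' |                           # one token per line
    sort | uniq |                           # collapse duplicates
    grep -v '^$'                            # addition: drop the remaining blank line
}

# Made-up sample input containing one deliberate misspelling ("recieve").
printf 'var userName = getUserName(recieveCount);\n' | tokenise
```

With the C locale the capitalised tokens (Count, Name, User) are listed before the lower case ones (get, recieve, user, var), which is a first hint of the ordering problems discussed in the rest of this post.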
Case sensitivity

The combined result, then, is a command like this:

find . -name "*.js" | xargs cat | sed 's/[^a-zA-Z]/ /g' | sed 's/\([a-z]\)\([A-Z]\)/\1 \2/g' | tr ' ' '\n' | sort | uniq | diff - /etc/dictionaries-common/words | grep "<" | cut -c 3-
which will produce output like this:

BUTTON
bx
By
C
Cache
cadetblue
Calculate
Call
callback
Callback
callbacks

This isn’t right, though, as “Call” is not an incorrect spelling. What is needed is to make the checks less case sensitive, and the obvious choice would be to use the -i option to diff. However, if the source files contain both Call and call, the token list will have two lines where the words list has only one copy of that word, so diff will still report a difference. The trick, then, is to make sure that the list of words piped into the diff is further pre-filtered to remove these duplicate (modulo case) words. Annoyingly this means adding the -f flag to sort and the -i flag to uniq, along with the -i for diff. This produces output like:

bx
cadetblue
callback
callbacks
callee
cancelable
cb
cc
cd
cdaa
ced

Conclusion

So the final command should be:

find . -name "*.js" | xargs cat | sed 's/[^a-zA-Z]/ /g' | sed 's/\([a-z]\)\([A-Z]\)/\1 \2/g' | tr ' ' '\n' | sort -f | uniq -i | diff -i - /etc/dictionaries-common/words | grep "<" | cut -c 3-
One thing this command still gets wrong is listing things like “Internet” and “Java” (which are capitalised in the dictionary file) as spelling mistakes. This is because with case insensitive ordering of the source tokens, “Internet” is sorted after “igloo”, for example, whereas with the dictionary’s case sensitive ordering, “Internet” comes before “igloo”. If a word appears at a different place in the two lists, diff will assume it isn’t in the dictionary, despite the -i option being used.

So the conclusion could be that collation is hard, or perhaps that I should have used the cat | cat trick I mentioned above. Anyway, how much do you think your coworkers will enjoy hearing that their code has spelling mistakes in?
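For what it’s worth, here is a rough sketch of how the cat | cat trick might sidestep the collation problem. It is not the command I actually used, and it makes several assumptions: the function name not_in_dictionary and the tiny sample dictionary are made up for illustration, both inputs are lower-cased with tr before sorting so that case cannot affect the ordering, and uniq -u (print only lines that occur exactly once) stands in for the uniq -c plus grep combination.

```shell
#!/bin/sh
# Pin the locale so that sort and tr behave the same everywhere
# (an assumption made for this sketch).
LC_ALL=C; export LC_ALL

# Hypothetical helper: $1 is a word list with one word per line; candidate
# words arrive on stdin, one per line. Prints the candidates that do not
# appear in the word list, ignoring case.
not_in_dictionary() {
    dict=$(mktemp) || return 1
    tr 'A-Z' 'a-z' < "$1" > "$dict"     # lower-cased copy of the dictionary
    tr 'A-Z' 'a-z' | sort | uniq |      # lower-cased, de-duplicated candidates
    cat - "$dict" "$dict" |             # every dictionary word now occurs at least twice
    sort | uniq -u                      # keep only lines that occur exactly once
    rm -f "$dict"
}

# Made-up miniature dictionary and token list for demonstration.
words=$(mktemp)
printf 'Internet\nJava\ncall\nigloo\n' > "$words"
printf 'Call\ncall\nInternet\njava\ncallbak\n' | not_in_dictionary "$words"
rm -f "$words"
```

Here only callbak should be reported. The price of folding everything to lower case is that “java” is accepted as readily as “Java”, which may or may not be what you want from a spell checker.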