On typos and misspellings

This week I’m in a release mood, so I’m releasing several projects I’m involved with. If you missed the first two, check out dietsplash 0.3 and genslide 0.3 (though the announcement was in Portuguese).

After developing for several projects I’ve noticed that most of them contain typos and misspellings. Even if this does not directly affect the quality of the source (unless the misspellings are in documentation), a comment is left in the code for a reason: we want the reader to stop and read it. Correct spelling matters particularly when contributors come from several parts of the world and may not have English as their mother tongue (as I don’t). This way we can be more confident that the right message gets across in code, comments and documentation.

Thinking a bit about this, I wrote some bash and awk scripts to fix misspellings based on the list of common misspellings available on Wikipedia. I’ve successfully sent patches to projects like the Linux kernel, ConnMan, oFono and EFL. After some of them were accepted, I decided to run the scripts again and noticed how slow they were (if you are curious about what they did, search the oFono mailing list, where I explain the scripts). So I started a new, very short project: codespell. Measured against the Linux kernel tree, it runs roughly 20x faster than the previous scripts. Its current version is 1.0-rc1, and I’d like to have some more testers before I release the final 1.0.

Codespell is designed to fix misspellings in source code, but it can be applied to any type of text file. When possible, codespell fixes the misspelling automatically. Otherwise it suggests possible changes. For example, when run against the Linux kernel tree, it gives me several lines like the ones below:

drivers/target/target_core_transport.c:2528: competion  ==> competition, completion

drivers/edac/cpc925_edac.c:186: MEAR  ==> wear, mere, mare

WARNING: Decoding file drivers/hid/hid-pl.c

WARNING: using encoding=utf-8 failed.

WARNING: Trying next encoding: iso-8859-1

drivers/net/niu.c:3276: clas  ==> class  | disabled because of name clash in c++

FIXED: ../kernel/drivers/scsi/aacraid/aacraid.h

FIXED: ../kernel/drivers/scsi/lpfc/lpfc_sli.c

(This is all in beautiful colored lines! Test it to see the true output)

The first two lines illustrate changes that cannot be made automatically, because each misspelling is a common one for more than one word. In those cases codespell only reports the file and line where the misspelling occurs, along with the candidate corrections.
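
The distinction between an automatic fix and a suggestion can be sketched in a few lines of Python. This is only an illustration, not codespell's actual code: the `CANDIDATES` table and the `check_word` and `report` helpers are hypothetical names, and the entries are taken from the sample output and the kernel comment quoted in this post.

```python
# Hypothetical miniature dictionary: misspelling -> candidate corrections.
# The real tool ships a much larger list; these entries are for illustration.
CANDIDATES = {
    "competion": ["competition", "completion"],
    "mear": ["wear", "mere", "mare"],
    "compatability": ["compatibility"],
}


def check_word(word):
    """Return (action, candidates) for a single word.

    action is "ok" (not a known misspelling), "fix" (exactly one
    candidate, so it can be applied automatically) or "suggest"
    (several candidates, so the hit is only reported).
    """
    candidates = CANDIDATES.get(word.lower())
    if candidates is None:
        return "ok", []
    if len(candidates) == 1:
        return "fix", candidates
    return "suggest", candidates


def report(path, lineno, word):
    """Format an ambiguous hit in the style of the sample output."""
    _, candidates = check_word(word)
    return f"{path}:{lineno}: {word}  ==> {', '.join(candidates)}"
```

With this table, `check_word("compatability")` asks for an automatic fix, while `check_word("competion")` only yields suggestions, which `report` turns into a line like the first one in the sample output.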

The WARNINGs are related to the encoding of the file. Codespell defaults to parsing files as UTF-8, which handles plain ASCII as well. If it fails to decode any line, it tries the next available encoding, i.e. ISO-8859-1. Using these two encodings I have successfully run codespell on all the projects I care about.
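
A minimal sketch of this kind of encoding fallback, assuming the file is simply re-read with the next encoding when decoding fails. The `read_lines` function and the `ENCODINGS` tuple are hypothetical names for illustration; the warning strings copy the sample output.

```python
# Encodings are tried in order; UTF-8 also covers plain ASCII.
ENCODINGS = ("utf-8", "iso-8859-1")


def read_lines(path, encodings=ENCODINGS):
    """Return (lines, encoding_used, warnings) for a text file."""
    warnings = []
    for position, encoding in enumerate(encodings):
        try:
            with open(path, encoding=encoding) as handle:
                return handle.readlines(), encoding, warnings
        except UnicodeDecodeError:
            # Record what went wrong and fall through to the next encoding.
            warnings.append(f"WARNING: Decoding file {path}")
            warnings.append(f"WARNING: using encoding={encoding} failed.")
            if position + 1 < len(encodings):
                next_encoding = encodings[position + 1]
                warnings.append(f"WARNING: Trying next encoding: {next_encoding}")
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

Since ISO-8859-1 maps every byte to a character, the second attempt cannot fail to decode, which is why two encodings are enough to get through a whole tree. The price is that a file in some other encoding is read without complaint but with the wrong characters.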

Codespell allows some changes to be disabled. This is shown by the “clas => class” fix, which is not always safe to apply because of a name clash in C++ code.
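
One way to model a disabled change is to attach an optional reason to each rule and refuse to apply the rule when a reason is present. The sketch below is an assumption about how this could be structured, not a description of codespell's internals or of its dictionary file format: `Rule`, `RULES` and `apply_rule` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    misspelling: str
    correction: str
    # None means the fix may be applied automatically.
    disabled_reason: Optional[str] = None


# Illustrative rules only; the reason text copies the sample output.
RULES = {
    "clas": Rule("clas", "class", "disabled because of name clash in c++"),
    "compatability": Rule("compatability", "compatibility"),
}


def apply_rule(word):
    """Return (new_word, note). Disabled rules leave the word untouched."""
    rule = RULES.get(word)
    if rule is None:
        return word, None
    if rule.disabled_reason is not None:
        note = f"{rule.misspelling}  ==> {rule.correction}  | {rule.disabled_reason}"
        return word, note
    return rule.correction, None
```

Here `apply_rule("clas")` returns the word unchanged together with a note in the style of the sample output, so the hit is still visible to whoever reads the report.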

The lines prefixed with “FIXED” show the files that were automatically fixed. On Linus’ current master branch, this resulted in:

2545 files changed, 5007 insertions(+), 5007 deletions(-)

These were the automatic fixes, which may contain some false positives. The funniest one is the one found in Documentation/DocBook/kernel-hacking.tmpl:

/*
* Sun people can’t spell worth damn. “compatability” indeed.
* At least we *know* we can’t spell, and use a spell-checker.
*/
As can be seen by the number above, this is not really true ;-) .
So, there it is: codespell 1.0-rc1. Get it. Test it. Report problems. Tell me about projects that were successfully patched.

6 thoughts on “On typos and misspellings”

    1. Lucas De Marchi Post author

      Thanks, Mariana.

      No, it will find any misspelling, even if it’s part of the code like a string, a variable name, a function name, etc. At first I thought I’d have to parse the code to get only strings and comments, but after testing it in some projects I noticed that it’s much easier to just disable some automatic changes such as “rela => real” and the like.

      I forgot to say that the program comes with its own dictionary, based on the one available on Wikipedia, in which I disabled some rules to get better results when parsing C source code.

  1. Capi Etheriel

    When I first tried it out, without the --write-changes parameter, it showed all of the typos and their respective fixes. But wherever there was a Capitalized typo, it presented a lowercase fix. I thought it would break things, so I didn’t apply it.

    Later, I applied it and checked the file to manually capitalize the words, but it had been fixed just fine. I guess it already preserves the capitals, but the preview doesn’t.

    It made me postpone using codespell (but not that much); maybe the preview could be fixed.

    1. Lucas De Marchi Post author

      Yes, you are right.

      I’m correctly fixing the words, even if they are Capitalized or UPPERCASE. However the ‘preview’ is not showing them right.

      If you know python, you may send a patch to codespell. See my repository. Otherwise I’ll fix it later.

      Thanks for reporting this issue.

    2. Lucas De Marchi Post author

      It’s fixed now. 1.0-RC2 is out:

      commit 09b4baa68be088ed765c45303b3267e1a47c2062
      Author: Lucas De Marchi
      Date: Mon Feb 21 23:59:00 2011 -0300

      codespell 1.0-rc2

  2. Pingback: ANNOUNCE: codespell 1.0 @ Politreco
