2009-07-31 20:35The right way to split stringsThis month I found myself considering the problem of how to split a string representing some English text into substrings that contain only whole words, but where the substrings are as near as possible to a certain number of characters in length. Formalising this a bit, I wanted an algorithm which would take a string s of characters (including spaces but not as consecutive characters) and a number n, and return a substring of s starting at the first character (using one-based numbering) and going up to the mth character, where m ≤ n, character m+1 is a space, and any characters in s which are after the m+1th (exclusive) and before the n+1th (inclusive) are not spaces. That has probably made things sound more confusing, so imagine that n was 7, and s was “1234 6789 ABCD”, then m would be 4, and m+1 would be 5, meaning there are no spaces that are strictly after m+1 but before or including n+1. The main focus of this post, though, is showing the different ways it can be done in Groovy, and how beautiful those ways are, especially if you make it a one-liner. The Wrong WayOne way to attack the problem is to try generating a list of word lengths, then convert that into a running total list, so that the problem is simply a matter of finding the index of the first value in a list greater than or equal to n. Consider the following code: list = [1,2,3] list.plus(0).eachWithIndex{it, i -> list[i]=list[i-1] + it} println list The The Long WayA slight improvement is to use a threshold = 14 sum = 0 wordCount = "the quick brown fox".split(" ").collect{it -> it.size()+1}.collect{it -> sum+=it}.findAll{it<=threshold+1}.size - 1 "the quick brown fox".split(" ")[0..wordCount].join(" ") This method is “the long way” because apart from wasting a line on a temporary variable sum, it requires one chain of instructions to find how many words are allowed, and another chain to generate the desired string, in this case the quick. The Right WayAfter getting this far, it struck me that the problem was actually equivalent to taking a substring of the desired length, then removing any incomplete words from the end. The only problem with this method was determining whether the word at the end of the string was really incomplete, that is, whether taking a substring had cut at the end of a word or in the middle of one. These two cases could easily be dealt with by an threshold=15 // Returning threshold+1 characters means taking characters with indexes from 0 to threshold if ("the quick brown fox"[threshold] == " ") { println "the quick brown fox"[0..threshold] } else { println "the quick brown fox"[0..threshold].split(" ")[0..-2].join(" ") } It seems like this is “the right way”, then, especially as you could rewrite the above as a rather cumbersome one-liner using the ternary operator. The Cunning WayFrom a maintainability perspective, using an threshold=15 println "the quick brown fox"[0..threshold].plus("#").split(" ")[0..-2].join(" ") When threshold is 15, the ConclusionAgain, Groovy shows what a pretty language it is, and I really find that it reflects the way I naturally think about programming. Perhaps it would be more accurate to say that Groovy has taught me to think about programming in a way that is nicely expressible using Groovy’s syntax. In any case, it feels like when I am trying to solve these sorts of algorithmic problems using Groovy, the language is not as much of a hindrance as using other languages would be. Using the language may even have made me a better programmer, but I don’t think that coming up with a clever string splitting algorithm is a huge advance in the field of software. Actually, it sounds like the sort of thing Knuth was doing 30 years ago. Has anyone ever published a one-liner like this before? Trackbacks
Trackback specific URI for this entry
No Trackbacks
|
QuicksearchCategoriesSyndicate This BlogBlog Administration |