2010-02-28 22:47Splitting arbitrary length stringsA friend of mine was apparently inspired by my solution to the problem of finding the first 10 digit prime number in the digits of e, and he told me about a seemingly similar problem he faced while trying to manipulate a text file he had created in a custom format. While some techniques from solving the e problem may be applicable, his data format allowed for arbitrary length strings, rather than having this 10 digit limitation, which made the problem suddenly much harder. I will detail the exact problem below, as well as listing some of the methods we used to tackle it. The problemMy friend’s notes on the problem describe it as follows: Suppose our adversary gives us an input of the form OoblEck… And demands it be transformed into OoObOlEcEk by prefacing each lower-case letter with the most recent capital. For the purposes of this blog post, however, I will use the notation AbcdZbcdef for the input string and thus AbAcAdZbZcZdZeZf is the desired output string (although the examples will actually only focus on the section before the Z). This has the advantage of being easier to use in spoken conversations, at the expense of making the problem look slightly less general. Multiple copies of a stringBuilding on the solution to the e problem, I aimed to turn a string like Abcde into something like: Abcde Acde Ade Ae and then take the first two letters from each line, for instance. With a for-loop being against my rules for one-liners, the first step to generate a set of rows like this was to generate n copies of the string, for a string with n letters. When discussing the puzzle with me, my friend had mentioned an undocumented feature of printf which seemed to provide just the functionality I was looking for, and I discovered that even a command like this was possible: $ printf ‘HelloWorld\n%.0s’ {1..5} HelloWorld HelloWorld HelloWorld HelloWorld HelloWorld What I was really aiming for, though, was to have the line number appended to each line, so that I would know which characters to remove from which respective line, but that turned out to be too difficult. I did manage to get as far as: $ echo “foo” | sed “s/.*/’&\\n%.0s’/” | xargs -I@ printf @ {1..5} foo foo foo foo foo which, despite the multiple-backslashes, did seem to indicate I was making progress. However, this method still did not produce the line numbers I wanted (and nl wouldn’t work, as I needed the numbering to restart with each new capital), but I had a sudden brain-wave: instead of trying to add numbers to a command for outputting multiple strings, I could try adding strings to a command that outputs multiple numbers. The command that came to mind was seq and I discovered that I could run the command: $ echo foo%1.f 3 | xargs -n 2 seq -f foo1 foo2 foo3 which uses the -f formatting feature of seq Replacing place holdersUnfortunately in this example the number 3 is hardcoded, whereas in general the length of the string between capital letters is not fixed or known in advance, so it would have to be calculated. This made me look for console commands which find the length of a string but which (unlike wc) keep the string intact. It turns out there is such a command, or so I thought; it is called expr, and I found it thanks to an idea that my friend (the same friend who came up with this puzzle in the first place) had. His plan was to use the replace function of expr to swap out a place holder character for the necessary capital letter, something like this: Ab-c-d-e → AbAcAdAe This idea was really clever and I had never even used expr for a one-liner before, but unfortunately it was a little too clever, as the replace function doesn’t actually exist. Reading the man page for it, however, was helpful, and it inspired me to try: $ expr length foo 3 Encouraged by this, I tried something a little more complicated: $ expr (length foo) (foo) bash: syntax error near unexpected token `length’ but eventually gave up, deciding that expr was the wrong tool for the job. My personal favourite tool for these sorts of puzzles is sed so it is perhaps no wonder that I ended up resorting to it here. One attempt was: $ echo “-b-c-d A” | sed ‘s/-/A/g’ AbAcAd A which seemed hopeful, but had the problem of hardcoding the replacement letter in the sed rather than extracting it from the last character of the input string. Now, sed does allow capturing sub-expressions and assigning them to the special variables \1 to \9, but it proved too difficult to capture the final character of the input while also specifying that the hyphen is the character which should be replaced. If sed isn’t the answer to the problem, though, that just means you need to turn the problem into one for which sed is the answer – using sed! The “meta-sed” approach I came up with, then, went something like this: $ echo “-a-b-c A” | sed ‘s/\(.*\) \(.*\)/echo \1 | sed “s\/-\/\2\/g”/’ | xargs echo -a-b-c | sed s/-/A/g Actually that just spits out a command which, if run, would produce the desired string, and it was here I decided to re-evaluate my approach again. awkIt turns out I have another friend, a coworker, who had also read my blog post about e and is something of an expert with awk. On previous occasions when I had discussed such puzzles with him, I tended to find that his solutions either broke my rules for one-liners (as awk allows embedded loops) or the solutions were so trivial that I could replace them with a sed one-liner. Still, I was interested to see how an awkist would solve this problem. It took him a few minutes, but he came back with this set of commands: $ cat input | awk ‘BEGIN{FS=”“; string=”“} {for(i=1; i <NF; i++) if ($i==”-“) string=string”“$1; else string=string”“$i} END{print string}’ which does work and produces output like this: $ echo “Ab-c-d-” | awk ‘BEGIN{FS=”“; string=”“} {for(i=1; i <NF; i++) if ($i==”-“) string=string”“$1; else string=string”“$i} END{print string}’ AbAcAd This was admittedly quite impressive, but I think at this point I insisted that awk must surely have some feature to replace a given character with another (captured) character, and he replied something like “Of course! Wait a minute!” before returning moments later with something that looked a bit more like this: $ echo “Ysdf-sdf-sdf-sdf” | awk ‘BEGIN{FS=”“} {gsub(/-/,$1)} {print}’ YsdfYsdfYsdfYsdf He even said something about picking the syntax to deliberately avoid semi-colons, in order to not fall foul of my one-liner rules. At this I had to accept that awk was the right tool for the job and had won the day. ConclusionAll I had left to do was adjust the source string and incorporate this awk code into the final one-liner, and I got this: $ echo “OoblEck” | sed ‘s/[A-Z]/\n&/g’ | sed ‘s/[a-z]/-&/g’ | sed ‘s/\([A-Z]\)-/\1/’ | awk ‘BEGIN{FS=”“} {gsub(/-/,$1)} {print}’ | tr -d “\n” | xargs OoObOlEcEk Just to quickly explain how this works, the first sed splits the input into lines starting at each capital letter, the second puts a hyphen before each lower case letter, the third removes the hyphen that is immediately after the initial capital (otherwise the resultant string would start with two capitals), and then the awk copies the initial capital over each of these hyphens, as explained above. The tr after this is just to assemble the lines into a single string again, and the xargs effectively adds a newline character at the end of the string. Perhaps the moral of this story is that we cannot defeat our common adversary on our own, but together we are part of something greater. It might also mean that I should learn awk properly so that I can solve these sorts of problems without having to disturb my friends. Or is the moral that one should not create custom data formats which use individual letters to represent values? Trackbacks
Trackback specific URI for this entry
No Trackbacks
|
QuicksearchCategoriesSyndicate This BlogBlog Administration |