2006-08-31 21:15A nice bit of word count PHPOne of the things on my TODO list for some time has been writing some code to access the raw database of my blog posts. The most obvious reason is for generating stats, but I like to know I can ask questions about this body of data and get back meaningful answers. A particular question which came up almost 6 months ago was about the word count of my blog: I have been thinking about a word count plugin, or just a PHP SQL (BASH?) hack, to give me a more meaningful statistic, but I’ll save that for when I’m running out of things to say. I hope this doesn’t mean that I’m running out of things to say, and instead that I’m getting back to my core topics like programming. Anyway, I have produced a PHP page to generate a quite detailed analysis of word frequencies and counts, which I will discuss and present in this blog post. Firstly, to remind myself and to inform anyone less experienced than me how I made it, here are the basics (with dollar signs in front of words which must be replaced by values relevant to the individual setup) . I SSH’ed into my web host, and logged into the [MySQL] database server using Turning this query into an interesting set of results was the job of some serious PHP, involving a lot of regular expression replacement. In general, this meant stripping out all punctuation except hyphens and apostrophes (and I realised quite late that it was simpler to do this by whitelisting the characters I do want than blacklisting those I don’t). To be precise though, I then had to deal with words that began, ended or consisted solely of hyphens, apostrophes and spaces, and words that ended “apostrophe ess” so that “Linux’s” wasn’t counted as a separate word from “Linux” (although that does undervalue the ess slightly). Then came the use of arrays. There are some powerful array functions in PHP, and they are a natural choice for dealing with list- and table-based data. Using the Unfortunately, even after a lot of refactoring and commenting of the code, the actual table creation part was still only “not so ugly it needed a rewrite” instead of “so beautiful its optimality is unquestionable”. There is presumably some way of iterating over the whole-blog array and creating an array which groups words of the same frequency. One could possibly extract a unique list of frequencies and then pull out an array of words for each of those frequencies. Instead, I generated an array of words for each frequency (which could then be automatically sorted and converted into a comma-separated string) on-the-fly while iterating through the whole-blog array (using the fact that I can sort the whole-blog array in order of frequency). This method has problems when dealing with the first word (because the processing loop has nothing to compare the first value with) and with the last word (because the loop doesn’t know it has reached the last word until it is too late). Still, if you excuse two rather ill-fitting Here then is the page in action, and here is the code: <?php /** * blog-stats.php * * Copyright 2006 Hagfish * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License, * version 2, as published by the Free Software Foundation. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */ // Create the SQL query to get all the published content from the ‘posts’ table (for posts belonging to author 1) $sql = "SELECT post_content FROM wp_posts WHERE post_author=1 AND post_status=’publish’"; // Connect safely to the database and run the query mysql_connect($databaseHost, $username, $password); mysql_select_db($databaseName); $result = mysql_query($sql); // An array which stores the frequencies accumulated over all the posts $combinedArray = array(); // Keep getting the next blog post (database row) until there are no more to get while ($row = mysql_fetch_array($result)) { // Form a string, $text, from the result of the query, stripping any nasty HTML tags $text = strip_tags($row[‘post_content’]); // "This" and "this" shouldn’t be counted separately, so all words converted to lower case (including abbreviations) $lowerText = strtolower($text); // This still leaves entities like & which we strip with this line $noEntities = preg_replace(’/&[^;]*;/’, ”, $lowerText); // Have to strip out punctuation, either by blacklisting, or more simply by whitelisting //$noPunc = preg_replace(‘/[.,?()"\/’\[\]!{}$:;]/’,’ ‘,$noEntities); $noPunc = preg_replace(’/[^a-zA-Z-']/’, ‘ ’, $noEntities); // Hyphens and apostrophes have been allowed, but there can still be problems: // * apostrophes / hyphens on their own // * words starting / ending with an apostrophe, or words in the possessive "’s" form $cleanStart = preg_replace(‘/ [-']*/’, ’ ‘, $noPunc); $cleanEnd = preg_replace(‘/[-']s? /’, ’ ‘, $cleanStart); // Multiple spaces confuse the following "explode", so they must be removed $singleSpaced = preg_replace(‘/\s+/’, ’ ‘, $cleanEnd); // An initial or final space on the string confuses things too $delimited = preg_replace(‘/ $/’, ”, $singleSpaced); $delimited = preg_replace(‘/^ /’, ”, $delimited); // Now split the post into an array of individual words $wordArray = explode(’ ‘, $delimited); // Make an array whose keys are the words and whose values are the frequency of those words $uniqueWordArray = array_count_values($wordArray); // Add the values from the $uniqueWordArray to the values in $combinedArray, using the same keys foreach ($uniqueWordArray as $k => $v) { $combinedArray[$k] += $v; } } arsort($combinedArray); //print_r($combinedArray); ?><?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> <style type="text/css" media="all"> span.total { font-weight: bold; } td.total { text-align: right; } table, tr, td, th { border: 4px solid gray; border-style: ridge; border-collapse: collapse; } td.number { text-align: center; } </style> <title>Blog Stats</title> </head> <body> <table> <tr> <th>Frequency</th> <th>List of words with that frequency</th> <th>Size of that list of words</th> <th>Number of words</th> </tr> <?php // Pre-seed the following loop with the initial frequency from the array $numericalArray = array_values($combinedArray); $oldFrequency = $numericalArray[0]; // $cellArray will be used later to hold all of the words of a given frequency $cellArray = array(); // Need to work out when the loop is about to end so the final $cellArray gets a chance to be displayed $i = 0; $sizeOfArray = sizeof($combinedArray); // Store the grand total $product = 0; $grandTotalArray = array(); // Here is the main loop which either amasses words in $cellArray, or spits out a table row foreach ($combinedArray as $word => $frequency) { $i++; if ($frequency == $oldFrequency) { // Still in the same row, so just write the next word $cellArray[] = $word; } // If the frequency of this word is something new, or this is the last word, create the row from the data so far if ($frequency != $oldFrequency || $i == $sizeOfArray) { ?> <tr> <td class="number"><?php echo $oldFrequency; ?></td> <td><?php asort($cellArray); echo implode(’, ‘,$cellArray); ?></td> <td class="number"><?php echo sizeof($cellArray); ?></td> <td class="number"><?php $product = $oldFrequency * sizeof($cellArray); $grandTotalArray[] = $product; echo $product; ?></td> </tr> <?php // Set up the variables for the next row $oldFrequency = $frequency; $cellArray = array($word); } } ?> <tr> <td colspan="4" class="total"> <span class="total">Grand total:</span> <?php echo array_sum($grandTotalArray); ?> </td> </tr> </table> </body> </html> I should point out, however, that I have deliberately downgraded the security of this script (although the security implications are not serious), and if you wish to use this yourself, either don’t make it available publicly (you could use it to generate static HTML and host that) or review it to make sure you remove any unacceptable risks (mostly information disclosure). Of course, you are Free to add extra security and redistribute your improved version, following the GNU GPL. I include that link in the hope that it counts as “[receiving] a copy of the GNU General Public License along with this program”. How and why I chose to specify version 2 of the GPL will have to wait for another blog post. How many times have I written hapax legomenon? Trackbacks
Trackback specific URI for this entry
No Trackbacks
|
QuicksearchCategoriesSyndicate This BlogBlog Administration |