A compelling alternative to Hpricot

Posted by Jeremy Voorhis Tue, 10 Apr 2007 15:52:00 GMT

After re-reading Nat Pryce’s Scrapheap Challenge writeup, I tried to see how easily I could answer a simple question with only lynx, grep and friends. It turns out to be even simpler than I suspected. For example, the following tells me that my average blog post receives 2.2 comments.


# iterate through 17 pages
page=1; until [ "$page" -gt 17 ] ; do
  lynx -dump "http://www.jvoorhis.com/articles/page/$page" | \
    egrep "([[:digit:]]+|no) comments?" | \
    sed -e "s/\[.*\]//g" -e "s/no/0/g" | \
    awk '{print $1}' >> comments.txt
  page=$(( page + 1 ))
done
avg comments.txt 



Comments

  1. cypherpunk said about 3 hours later:

    “How could anything be better than Hpricot?!!”

    When I read the blog title, I was a little scared, because I’ve grown to like Hpricot a lot. The thought of Hpricot being marginalized made me feel almost… threatened. Irrational? Perhaps. But I am only human.

    With that said, I must say that that is some fine shell scripting you have there. It also looks very ruby-esque.

  2. JV said about 5 hours later:

    Thanks! But it’s not as Ruby-esque as I would like, since until, while and for were the only iterators I had to play with.

    Although this is a trivial example, I think the following ideas let me bang out a trivial implementation:

    1. Translating the problem to a simpler domain – HTML parsing is substituted by simple regular expressions.
    2. Composing small, focused components to build a solution.
    3. Exploiting the capabilities of a tool beyond its intended use – lynx is a browser, but I wanted a tool that would download a document, parse it into semi-structured text, and get out of the way.

    These are all good ideas that can be put to work while composing throw-away solutions for prototyping or problem solving.
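
The first two ideas can be seen by running the post's filter stages on a few lines of text. This is only a sketch: the sample input is made up to resemble what `lynx -dump` might print, and `comment_counts` is a hypothetical name, not something from the original script.

```shell
# comment_counts is a hypothetical helper name; it wraps the same
# grep/sed/awk stages as the loop in the post and reads from stdin.
comment_counts() {
  grep -E '([[:digit:]]+|no) comments?' |   # keep only lines that mention a comment count
    sed -e 's/\[.*\]//g' -e 's/no/0/g' |    # drop link markers such as [21]; "no" becomes 0
    awk '{ print $1 }'                      # the count is now the first field
}

# Made-up sample input; the real lynx output may look different.
printf '%s\n' '   [21]3 comments' '   Posted by JV' '   [22]no comments' '   [23]1 comment' |
  comment_counts
```

With this sample, the function prints 3, 0 and 1 on separate lines, which is the shape of data that avg expects.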

  3. amr said about 9 hours later:

    Jeremy, what shell did you use?

    I second the “Ruby-esque” comment :) I did a double take, lol!

  4. JV said about 11 hours later:

    I use zsh, but the above should run just fine in anything sh-compatible.

  5. cypherpunk said about 21 hours later:

    xargs is my secret iterating weapon.
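
For anyone who has not used xargs this way, here is a rough sketch of how it could replace the until loop. The `page_urls` name is made up, `seq` is assumed to be installed, and `echo` stands in for `lynx -dump` so the example runs without touching the network.

```shell
# page_urls is a hypothetical helper: it prints one article-page URL per page number.
# seq generates 1..N, and xargs runs the command once per input line, replacing {}.
page_urls() {
  seq 1 "$1" | xargs -I{} echo "http://www.jvoorhis.com/articles/page/{}"
}

page_urls 3
```

Swapping `echo` for `lynx -dump` would fetch each page instead of printing its URL, and the rest of the pipeline could follow the xargs stage unchanged.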

  6. amr said 1 day later:

    I see. “avg” isn’t a builtin in zsh, is it? Mine on OS X doesn’t seem to have it.

  7. JV said 1 day later:

    @amr

    I wondered if anyone would catch that ;) avg is this thing that I wrote a while ago that reads lists of numbers and, well, averages them. This is all:

    
    #!/usr/bin/env ruby
    
    module Enumerable
      def sum
        self.inject(0.0) { |s,e| s + e.to_f }
      end
      def avg
        sum / size.to_f
      end
    end
    
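    # Convert each line to a float up front so the code also works where Array has its own sum.
    data = ARGF.read.split($/).map { |line| line.to_f }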
    puts data.avg
    

    That is not the only utility of its kind I have. I am surprised there aren’t more command line tools for statistical processing.

  8. cypherpunk said 1 day later:

    I couldn’t find avg, either, so I decided to implement it as a shell function:

    function avg () {
      n=0 
      sum=0 
      while read i
      do
        n=$(( n + 1 )) 
        sum=$(( sum + i )) 
      done
      echo $sum / $n | bc -l
    }

  9. amr said 2 days later:

    Heh :) actually just that very day I was doing a one liner in ksh & friends on HP/UX to sum up the disk sizes on one of our boxes and I had to use AWK variables to sum the sizes up in the pipeline. I checked out bc quickly but couldn’t find anything that would sum/avg a stream of numbers in a pipeline.

    I’d really love to see bc -avg -i capability (or something like that), please point me to one if someone knows it. I looked up and down that man page but was too hurried to futz around too much.
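
One way to get that in a pipeline without bc is to let awk keep the running totals itself. A minimal sketch, with `stream_avg` as a made-up name; it assumes one number per line on stdin.

```shell
# stream_avg is a hypothetical helper: it averages the first field of every input line.
# Nothing is printed for empty input, which avoids a division by zero.
stream_avg() {
  awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }'
}

printf '%s\n' 2 4 6 | stream_avg   # prints 4
```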

  10. amr said 2 days later:

    Spoke too soon! I think this would do the average thingy:

    expr `cat comments.txt | (tr ’\n’ +; echo 0) | bc` / `cat comments.txt | wc -l `

  11. amr said 2 days later:

    expr `cat input.txt | (tr '\n' +; echo 0) | bc` / `cat input.txt | wc -l `

    I think my prev comment was eaten by textile.

  12. cypherpunk said 4 days later:

    Beware of Randal Schwartz. He’s prone to handing out Useless Use of Cat Awards.
