I program to program, but lately I am finding an interest in statistics. Social bookmarking sites have become more popular, and there is a rush to beat Google at its own game with improved vertical search. Paul Graham writes about using a Bayesian filter to classify spam and non-spam, but the same idea could be used to classify other kinds of text as well, as I'm sure any major search engine contender is already doing. Bayesian analysis could be used to separate, for example, the word Java in the context of coffee from Java in the context of computing. Instead of only spam/non-spam probabilities, it may be practical to determine probabilities for a whole set of classifications like humor, romance, or tutorials, as they apply to web content.
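
To make the Java example concrete, here is a minimal sketch of a multi-category naive Bayes classifier with add-one smoothing. It is not Paul Graham's filter and not part of the script further down; the class name `NaiveBayes`, the labels, and the tiny training sentences are all made up for illustration.

```ruby
# Minimal multinomial naive Bayes sketch. The class name, labels, and
# training text are illustrative assumptions, not part of the original script.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => word => count
    @doc_counts  = Hash.new(0)                            # label => documents seen
    @vocabulary  = {}                                     # every word seen in training
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest log-probability for the text.
  def classify(text)
    scores(text).max_by { |_label, score| score }.first
  end

  # log P(label) + sum of log P(word | label), with add-one smoothing so
  # unseen words do not zero out a category.
  def scores(text)
    total_docs = @doc_counts.values.sum
    words = tokenize(text)
    @doc_counts.each_with_object({}) do |(label, docs), result|
      label_total = @word_counts[label].values.sum
      score = Math.log(docs.to_f / total_docs)
      words.each do |word|
        count = @word_counts[label][word]
        score += Math.log((count + 1.0) / (label_total + @vocabulary.size))
      end
      result[label] = score
    end
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z]+/)
  end
end

filter = NaiveBayes.new
filter.train(:coffee,    "java beans roast espresso coffee cup")
filter.train(:coffee,    "dark roast coffee from java and sumatra")
filter.train(:computing, "java compiler class code virtual machine")
filter.train(:computing, "write java code and compile the class")

puts filter.classify("a cup of java roast")    # coffee
puts filter.classify("compile the java class") # computing
```

Adding a third or fourth category is just more calls to `train` with a new label, which is the "whole set of classifications" idea in miniature. A real filter would need far more training text than this.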

Searching the web is a complex endeavor and I am more curious than anything, so I wrote a simple script to scrape the popular pages of Digg and Reddit. The program output is just a list of words in order of popularity for each website. I present here some results of a scan on New Year's Eve, 2006. The list below is in no particular order; I hand-picked words from the output for how well they indicate whether a text leans political or technological. PPM is the number of times a word would appear in a corpus of one million words. % is the number of occurrences per 100 pages scanned, which is why a few Digg values exceed 100; it is not the chance that a word appears on a given page.

Word       Reddit PPM   Reddit %   Digg PPM   Digg %
Bush             1185         75        359       47
Saddam            359         47        453       60
oil                79          5        151       20
war               434         27        113       15
Blair              39          2          0        0
Wii               237         15        794      105
Linux             276         17       1928      255
Windows           118          7       2836      375
computer          158         10        151       20
AJAX                0          0        132       17
Sony               79          5        245       32
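
As a quick check on these numbers, the PPM columns can be averaged per word group. The sketch below copies the PPM values from the table above. Splitting the words into a political group and a technology group is my own reading of the list, and the constant and method names are made up for this example.

```ruby
# PPM values copied from the table above: word => [Reddit PPM, Digg PPM].
# The grouping into two sets is an assumption made for this example.
POLITICAL = {
  "Bush" => [1185, 359], "Saddam" => [359, 453], "oil" => [79, 151],
  "war" => [434, 113], "Blair" => [39, 0]
}
TECHNOLOGY = {
  "Wii" => [237, 794], "Linux" => [276, 1928], "Windows" => [118, 2836],
  "computer" => [158, 151], "AJAX" => [0, 132], "Sony" => [79, 245]
}

# Mean PPM per site for one word group; returns [reddit_mean, digg_mean].
def mean_ppm(group)
  reddit, digg = group.values.transpose
  [reddit.sum.to_f / reddit.size, digg.sum.to_f / digg.size]
end

{ "political" => POLITICAL, "technology" => TECHNOLOGY }.each do |name, group|
  reddit, digg = mean_ppm(group)
  printf("%-10s Reddit %7.1f  Digg %7.1f\n", name, reddit, digg)
end
```

With these values the political words average about 419 PPM on Reddit against 215 on Digg, and the technology words about 145 on Reddit against 1014 on Digg. A plain mean weights every word equally, so one very common word such as Windows dominates its group.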

These results suggest that a filter trained to separate political from technological discussion would tend to label Digg the more technological, in the sense of popular technology anyway. As for political words, Bush, Blair, and war were more frequent on Reddit, but oil and Saddam lean slightly toward Digg. An average of the PPM scores for the political words would place Reddit on top. A real statistician might kill me for these "conclusions", so in the spirit of learning something, or helping someone else learn something, here's the code I used to collect the data. Draw your own conclusions if you wish.


require 'uri'
require 'net/http'
require 'thread'

class WordList

  attr_reader :group_count
  attr_reader :total_count

  # Strip markup from an HTML string and return an array of words.
  def self.parse(text)
    text.
    gsub(/<!--.*?-->/m, " ").     # remove comments (may span lines)
    gsub(/<[^>]*>/, " ").         # remove tags
    gsub(/&#?\w+;/, " ").         # remove entities such as &amp; or &#39;
    gsub(/[^\w\s]/, " ").         # replace everything except word characters and whitespace
    gsub(/\b\d+\b/, " ").         # remove pure numbers
    split                         # split on whitespace, ignoring leading/trailing blanks
  end

  def initialize
    @words = Hash.new(0)
    @total_count = 0
    @group_count = 0
    @mutex = Mutex.new # created eagerly; a lazy `||=` could itself race between threads
  end

  def mutex
    @mutex
  end

  def synchronize
    mutex.synchronize { yield self }
  end

  def add(word)
    return unless word and word.length > 0
    @words[word.to_sym] += 1
    @total_count += 1
  end

  def add_each(words)
    words.each { |w| add w }
    @group_count += 1
  end

  def add_string(s)
    add_each self.class.parse(s)
  end

  def sort
    @words.to_a.sort_by { |x| -x[1] }
  end

  def output
    string = "Word, PPM of words, % of pages\n" 
    sort.each do |x|
      string << sprintf("%20s %07d %03d\n", x[0], (x[1] * 1000000 / @total_count), (x[1] * 100 / @group_count))
    end
    string
  end

end

class SiteScanner

  def initialize()
    @word_list = WordList.new
  end

  # Subclasses supply the site-specific pieces.
  def base_url
    raise NotImplementedError, 'override me'
  end

  def page_to_url(n)
    raise NotImplementedError, 'override me'
  end

  # Fetch page n of the site and return the raw HTML body.
  def get_page(n)
    url = base_url + page_to_url(n)
    response = Net::HTTP.get_response(URI.parse(url))
    response.body
  end

  def scan
    threads = []
    began_at = Time.now
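    (1..num_pages).each do |n|
      # Pass n into the thread so each one keeps its own copy of the page number;
      # a `for` loop shares a single n across every thread's block.
      threads << Thread.new(n) do |page|
        s = get_page(page)
        print "."; $stdout.flush
        @word_list.synchronize { |w| w.add_string s }
      end
    end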
    threads.each { |t| t.join }
    ended_at = Time.now
    File.open(filename, File::WRONLY|File::APPEND|File::CREAT, 0666) do |file|
      file << "Scan began:    #{began_at}\n" 
      file << "Scan ended:    #{ended_at}\n" 
      file << "Words scanned: #{@word_list.total_count}\n" 
      file << "Pages scanned: #{@word_list.group_count}\n" 
      file << @word_list.output
    end
  end

end

class DiggScanner < SiteScanner

  def base_url
    "http://digg.com/news/page" 
  end

  def page_to_url(n)
    n.to_s
  end

  def filename
    "digg.txt" 
  end

  def num_pages
    100
  end

end

class RedditScanner < SiteScanner

  def base_url
    "http://reddit.com/?offset=" 
  end

  def page_to_url(n)
    ((n - 1)  * 25).to_s
  end

  def filename
    "reddit.txt" 
  end

  def num_pages
    40
  end

end

RedditScanner.new.scan
DiggScanner.new.scan

This project also gave me a chance to try my hand at Ruby threads. I'm not sure whether the mutex on the word list is required, and I have no idea what would happen if two threads called WordList#add_string at the same time. Perhaps someone with more knowledge can comment on thread safety and on what happens when two threads call the same method on a shared object.
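
As a rough illustration of the question, here is a cut-down stand-in for the word list. `SharedCounter` is a hypothetical class written for this example. In Ruby, `@counts[word] += 1` is a read followed by a write, so two threads can in principle interleave between the two steps and lose an update. That is rare on the standard interpreter but more likely on implementations that run threads in parallel. Wrapping the update in a mutex rules it out.

```ruby
# A stripped-down stand-in for WordList: one shared Hash updated by several
# threads. The class and its method names are assumptions for this sketch.
class SharedCounter
  def initialize
    @counts = Hash.new(0)
    @mutex  = Mutex.new
  end

  # `@counts[word] += 1` reads the old value and then writes the new one;
  # the mutex keeps another thread from slipping in between the two steps.
  def add(word)
    @mutex.synchronize { @counts[word] += 1 }
  end

  def [](word)
    @mutex.synchronize { @counts[word] }
  end
end

counter = SharedCounter.new
threads = Array.new(8) do
  Thread.new { 1_000.times { counter.add(:ruby) } }
end
threads.each(&:join)
puts counter[:ruby] # 8 threads * 1000 increments each = 8000
```

The scanner above already follows this pattern in one respect: each thread downloads its page outside the lock and only takes the lock while updating the shared counts, so the slow network requests still overlap.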

It’s a long, long way from a useful application but the program output was interesting and encouraging. I would enjoy your comments.

1 Response Follows

  1. Justin says
    Sam, great article. You truly are an abstract thinker. Just one suggestion though: I think it would be good if you could include a time period in your scanner, e.g. 30 PPM over 35 hours 12 minutes and 33 seconds. You could even start your own site called socialwords.com. Hehe, seriously, great idea. With this scanner you could also determine the 'quality' level and the focus of the different sites.
