Deduping join data in Ruby based on multiple attributes

In the past I created an app using Rails' built-in HABTM and no unique index on the join tables. These days I'm more into has_many :through.

In order to migrate my data to a unique index, I had a bunch of de-duping to do. It was a bit trickier than usual because I had to de-dupe based on multiple attributes. I wrote a quick class in Ruby to take care of this and decided I should share it. Let me know if you have a better way. I sorta went with whatever was fastest to write because I'm up against a deadline right now.

class JoinDuplicateRemover
 
  def self.de_dup
    models_dedup_hash = {
      AccountEventType => [:account_id, :event_type_id],
      ClientEventType => [:client_id, :event_type_id],
      AccountProductPackage => [:account_id, :product_package_id],
      ClientProductPackage => [:client_id, :product_package_id],
      ClientEventCountry => [:client_id, :country_id],
      ClientResearchCountry => [:client_id, :country_id],
      AccountEventCountry => [:account_id, :country_id],
      AccountResearchCountry => [:account_id, :country_id]
    }
 
    models_dedup_hash.each do |klass,scope_attrs|
 
      puts "\nStart de-duping #{klass.to_s} - #{Time.now}"
      rows_to_del = []
      index = 0
 
      # find_each loads rows in batches instead of the whole table at once.
      klass.find_each do |instance|
 
        index += 1
        if index % 10 == 0
          print "."
          STDOUT.flush
        end
 
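        # Build the lookup conditions from this row's scoped attributes.
        conditions_hash = Hash.new
        scope_attrs.each do |sa|
          conditions_hash[sa] = instance.send(sa)
        end
        # Order by id so the same row is always kept; queue every later match for deletion.
        klass.where(conditions_hash).order(:id).each_with_index do |dupl,i|
          rows_to_del << dupl.id unless i == 0 || rows_to_del.include?(dupl.id)
        end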
      end
 
      rows_to_del.each do |dupl|
        instance = klass.find(dupl)
        puts "Destroying: #{instance.inspect}"
        instance.destroy
      end
 
      puts "\nFinished de-duping #{klass.to_s} - #{Time.now}"
    end
  end
 
end
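
The class above runs one query per row. A possible alternative, as a rough sketch only, is to group the rows in memory by their scoped attributes and keep the lowest id in each group. Row and duplicate_ids are hypothetical names invented for this illustration; Row is a plain Struct standing in for a join model so the snippet runs without Rails.

```ruby
# Hypothetical stand-in for a join-model row (not part of the original app).
Row = Struct.new(:id, :account_id, :event_type_id)

# Returns the ids of every row that repeats another row's values for
# scope_attrs, keeping the row with the lowest id in each group.
def duplicate_ids(rows, scope_attrs)
  rows
    .group_by { |row| scope_attrs.map { |attr| row.public_send(attr) } }
    .values
    .flat_map { |group| group.sort_by(&:id).drop(1).map(&:id) }
end

rows = [
  Row.new(1, 10, 100),
  Row.new(2, 10, 100), # same pair as id 1
  Row.new(3, 10, 200),
  Row.new(4, 11, 100),
  Row.new(5, 10, 100)  # same pair as id 1
]

p duplicate_ids(rows, [:account_id, :event_type_id]) # prints [2, 5]
```

With real models, the returned ids could then be handed to the same destroy loop used in the class above. This assumes each table fits comfortably in memory.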
