Manipulate Excel(.xlsx) by Ruby

04 Feb 2016 - by @黄野

1. Introduction

Recently, I got a small task from my boss that required me to find and eliminate repeat items based on specific columns from 5 Excel files by Ruby. To be more clear, details are listed as following:

One seed Excel file which contains all the items
Other Excel files will be used to retrieve and compare with the seed Excel file based on specific columns
If there is any repeat item, remove it from the seed file
Finally, export a new Excel file

Because it didn’t refer to a very complicated manipulation, I decided to use rubyXL to implement the task. Another good gem for manipulating Excel is axlsx.

2. Ruby scripts

2.1. Compare a single Excel file with the seed Excel file and remove repeat items from the seed file

#!/usr/bin/ruby

require 'rubyXL'

# Check arguments
if ARGV.length != 8 or not File.exist?(ARGV[3]) or not File.exists?(ARGV[7])
  puts "Usage: ./filter.rb seed_file_column_name seed_file_column_index seed_file_end_indicator path_to_seed_file exclude_file_column_name exclude_file_column_index exclude_file_end_indicator path_to_exclude_file"
  exit
end

# Parse arguments
seed_file_column_name = ARGV[0]
seed_file_column_index = ARGV[1]
seed_file_end_indicator = ARGV[2]
path_to_seed_file = ARGV[3]
exclude_file_column_name = ARGV[4]
exclude_file_column_index = ARGV[5] 
exclude_file_end_indicator = ARGV[6]
path_to_exclude_file = ARGV[7]

# Init variables, data structure for filtered entities and excel parsing
filtered_companies_index = []
filtered_companies = {}
exclude_companies_index = []
exclude_companies = {}
seed_wb = RubyXL::Parser.parse(path_to_seed_file)
exclude_wb = RubyXL::Parser.parse(path_to_exclude_file)
filtered_wb = RubyXL::Workbook.new

# Define a new function for structuring the workbook
def structure_workbook(wb, column_index, column_name, end_indicator)
  columns = []
  companies_index = []
  companies = {}
  worksheet = wb[0]
  lambda do
    worksheet.each do |row|
      if columns.empty?
        row.cells.each do |cell|
          columns << cell.value
        end
      else
        company = []
        row.cells.each do |cell|
          company << cell.value
          if columns[company.size - 1] == column_name
            return if cell.value == end_indicator
            companies[cell.value] = company
            companies_index << company[column_index.to_i]
          end
        end
      end
      if not columns.include? column_name
        columns.clear
      end
    end
  end.call
  return [companies, companies_index]
end

# Parse seed and exclude workbooks and populate data structures
filtered_companies, filtered_companies_index = structure_workbook(seed_wb, seed_file_column_index, seed_file_column_name, seed_file_end_indicator)
exclude_companies, exclude_companies_index = structure_workbook(exclude_wb, exclude_file_column_index, exclude_file_column_name, exclude_file_end_indicator)

# Find the differences
intersections_companies_index = exclude_companies_index & filtered_companies_index
diff_companies_index = exclude_companies_index - filtered_companies_index

# Export results to the new Excel
diff_companies_index.each_with_index { |val,index|
  exclude_companies[val].each_with_index { |val1, index1|
    filtered_wb.worksheets[0].insert_cell(index, index1, val1)
  }
}

# Output
filtered_wb.write('filtered.xlsx')
puts intersections_companies_index
puts '---A new Excel file called filtered.xlsx has been generated and repeat items are shown above!---'

2.2. Compare multiple Excel files with the seed Excel file and remove repeat items from the seed file

#!/usr/bin/ruby
​
require 'rubyXL'
​
# Check arguments
if ARGV.length < 6 or not File.exist?(ARGV[2])
  puts "Usage: ./filter.rb seed_file_column_name seed_file_end_indicator path_to_seed_file exclude_file_column_name exclude_file_end_indicator path_to_exclude_file path_to_exclude_file .."
  exit
end
exclude_file_paths = []
ARGV[5..-1].each do |exclude_file_path|
  if not File.exists?(exclude_file_path)
    puts "Usage: ./filter.rb seed_file_column_name seed_file_end_indicator path_to_seed_file exclude_file_column_name exclude_file_end_indicator path_to_exclude_file path_to_exclude_file .."
    exit
  end
  exclude_file_paths << exclude_file_path
end
​
# Parse arguments
seed_file_column_name = ARGV[0]
seed_file_end_indicator = ARGV[1]
path_to_seed_file = ARGV[2]
exclude_file_column_name = ARGV[3]
exclude_file_end_indicator = ARGV[4]
​
# Init variables, data structure for filtered entities and excel parsing
filtered_companies_index = []
filtered_companies = {}
exclude_companies_index = []
exclude_companies = {}
seed_wb = RubyXL::Parser.parse(path_to_seed_file)
filtered_wb = RubyXL::Workbook.new
​
def structure_workbook(wb, column_name, end_indicator)
  columns = []
  companies = {}
  worksheet = wb[0]
  lambda do
    worksheet.each do |row|
      if columns.empty?
        row.cells.each do |cell|
          columns << cell.value
        end
      else
        company = []
        row.cells.each do |cell|
          company << cell.value
          if columns[company.size - 1] == column_name
            return if end_indicator != "" and cell.value == end_indicator
            companies[cell.value] = company
          end
        end
      end
      if not columns.include? column_name
        columns.clear
      end
    end
  end.call
  return columns, companies
end
​
# Parse seed and exclude workbooks and populate data structures
companies_to_exclude = []
exclude_file_paths.each_with_index do |exclude_file_path, index|
  columns, companies_to_exclude[index] = structure_workbook(RubyXL::Parser.parse(exclude_file_path), exclude_file_column_name, exclude_file_end_indicator)
end
columns, seed_companies = structure_workbook(seed_wb, seed_file_column_name, seed_file_end_indicator)
​
# Find the differences
filtered_companies = seed_companies.dup
companies_to_exclude.each do |companies_to_exclude_tmp|
  filtered_companies = filtered_companies.dup.delete_if { |k,_| companies_to_exclude_tmp.key?(k) }
end
​
# Import results to the new Excel
columns.each_with_index do |column, index|
  filtered_wb.worksheets[0].insert_cell(0, index, column)
end
filtered_companies.each_with_index do |(key, value), row_index|
  value.each_with_index do |cell_value, cell_index|
    filtered_wb.worksheets[0].insert_cell(row_index + 1, cell_index, cell_value)
  end
end
​
filtered_wb.write('filtered.xlsx')
puts "Seed company excel: #{seed_companies.length}"
ARGV[5..-1].each_with_index do |exclude_file_path, index|
  puts "Exclude company excel #{exclude_file_path}: #{companies_to_exclude[index].length}"
end
puts "Filtered items: #{filtered.length}"
puts "--- The excel file filtered.xlsx has been generated ---"

3. Run and test

Open your terminal
Run gem install rubyXL to install the dependency rubyXL
Navigate to the directory where you put our Ruby scripts
Input ruby xxx.rb (xxx is the name of Ruby script)
There will be a usage message shown and you should follow the instruction
Example: ruby filter.rb seed_file_column_name seed_file_column_index seed_file_end_indicator path_to_seed_file exclude_file_column_name exclude_file_column_index exclude_file_end_indicator path_to_exclude_file
If the script is running correctly, a new Excel file called filtered.xlsx will be generated in the same directory and repeat items will be shown in your Terminal