Full text search for document attachments with Rails & ElasticSearch

by Giang, last updated 19 Jan 2018

I've started working on a project that requires full text search on uploaded documents using ElasticSearch. Luckily, ElasticSearch has the Mapper Attachments Type. It is a plugin and can be easily installed. There are a few important things to note here:

  • ES accepts an attachment as a base64-encoded string
  • By default only the first 100,000 characters are extracted from each attachment. You need to raise the limit (the plugin's indexed_chars setting) if you need more
  • It handles a lot of file types, not just documents. See here
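
To make the first point concrete, here is a minimal sketch of turning a file into the base64 string that the attachment type expects. The helper name encode_attachment is hypothetical and not part of Chewy or ElasticSearch; it only uses Ruby's standard library.

```ruby
require "base64"
require "tempfile"

# Hypothetical helper (an assumption for illustration): reads a file as raw
# bytes and returns its content encoded as base64.
def encode_attachment(path)
  Base64.encode64(File.binread(path))
end

# Demo with a throwaway file standing in for an uploaded document.
Tempfile.create(["sample", ".txt"]) do |file|
  file.write("hello attachment")
  file.flush

  encoded = encode_attachment(file.path)
  puts encoded                   # the string that would be sent to ES
  puts Base64.decode64(encoded)  # decoding gives back the original bytes
end
```

Reading with File.binread matters for binary formats such as PDF, where text-mode reading could mangle the bytes on some platforms.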

There are several gems that make it easy to work with ElasticSearch, such as Tire, Chewy, ElasticSearch Rails and Searchkick. Apart from Tire, which was retired a long time ago, I believe any of the other three will work well. I chose Chewy because it has a dedicated wiki page that gives an example configuration for attachment full text search.

CarrierWave is used to handle the upload process.

The following is sample code for a Product model with two fields, name and attachment:

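require "base64"

class ProductsIndex < Chewy::Index
  define_type Product do
    field :name
    # The attachment type expects the file content as a base64 string.
    field :attachment, type: "attachment", value: ->(product) {
      if product.attachment.present?
        # File.binread reads raw bytes and closes the file afterwards.
        Base64.encode64(File.binread(product.attachment.path))
      else
        ""
      end
    }
  end
end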

A shortcut for quick search, defined as a class method:

class << self
  def search keyword
    fields = %w[name attachment]
    ProductsIndex.query multi_match: {query: keyword, fields: fields}
  end
end

Demo: https://github.com/nguyenducgiang/chewy-demo

But it's not the only solution

What if we just extract the text content from the document ourselves before passing it to ES as a normal string? That is possible using a gem like Yomu.
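
A rough sketch of that idea follows. It assumes the yomu gem (and the Java runtime it depends on) is installed; Yomu.new(path).text is the gem's call for reading text content. The helper name extract_text, the injectable extractor argument and the 100,000-character default are assumptions made for illustration, the last one simply mirroring the plugin's default limit.

```ruby
# Hypothetical helper: returns the plain text of a document, truncated to
# max_chars. The extractor is injectable so the sketch can be tried without
# Yomu installed; by default it falls back to the yomu gem.
def extract_text(path, max_chars: 100_000, extractor: nil)
  extractor ||= lambda do |file_path|
    require "yomu"            # loaded lazily, only when actually needed
    Yomu.new(file_path).text  # extract the text content with Yomu
  end
  extractor.call(path).to_s.strip[0, max_chars]
end

# Demo with a stand-in extractor instead of a real document.
fake = ->(_path) { "  Quarterly report: sales are up.  " }
puts extract_text("report.pdf", extractor: fake)
```

With something like this in place, the attachment field in the index could become an ordinary string field whose value is the extracted text, so the Mapper Attachments plugin would no longer be needed.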