To edit pages or tickets please login with username/password: aaf/aaf

Ticket #202 (assigned enhancement)

Opened 8 months ago

Last modified 8 months ago

[PATCH] Large speed improvement on indexing (15x)

Reported by: aaf Assigned to: jk (accepted)
Priority: major Milestone: 0.5
Component: 0plugin Version:
Keywords: PATCH Large speed improvement indexing Cc: francois.lagunas@gmail.com, j@jjb.cc

Description

When adding acts_as_ferret to an existing application with a few thousand entries in the DB, the initial indexing time is prohibitive: on a MacBook, 15000 entries are indexed in about 215s. This may lead to the improper conclusion that ferret / acts_as_ferret is slow (but other experiments show that it is not!). In fact, the bulk indexer does not really perform bulk indexing, as the documents are basically indexed one by one. The solution I have found is to slightly change the index_records function in bulk_indexer.rb (revision 316) to:



def index_records(records, offset)
  docs = {}
  batch_time = measure_time {
    records.each { |rec| docs[rec.id] = rec.to_doc if rec.ferret_enabled?(true) }
    @index.update_batch(docs)
  }.to_f

  ...
end

This uses a new function, update_batch, that I added to index.rb in ferret 0.11.6:

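  # Same as update, but for a whole set of documents: docs maps id => document.
  def update_batch(docs)
    @dir.synchrolock do
      ensure_writer_open()
      commit = false
      # Delete the old version of every document first.
      docs.each do |id, value|
        delete(id)
        commit = true if id.is_a?(String) || id.is_a?(Symbol)
      end
      # Commit once for the whole batch if any id was a String or Symbol.
      @writer.commit if commit
      ensure_writer_open()
      # Then add all the new versions.
      docs.each do |id, new_doc|
        @writer << new_doc
      end
      flush() if @auto_flush
    end
  end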

This function performs the same operation as update, but on a set of documents instead of one document at a time. The result is that initial indexing took only 17s instead of 215s, a nice improvement. This of course would need some consistency validation on the ferret side. There may also be some subtler improvements to make, as the delete part of this patch is not totally "batched". I suppose this could also be used to speed up other operations in acts_as_ferret.
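To make the difference between the two strategies concrete, here is a minimal sketch using a toy in-memory index. ToyIndex is hypothetical and is not Ferret's API; it only counts commits, on the assumption (mine, not measured in this ticket) that the per-document commit/flush is the expensive part of one-by-one indexing.

```ruby
# Toy stand-in for an index; NOT Ferret's API. It only counts commits so the
# difference between the two update strategies is visible.
class ToyIndex
  attr_reader :commits, :docs

  def initialize
    @docs = {}    # id => document
    @commits = 0  # number of (assumed expensive) commit operations
  end

  # One document at a time: delete the old version, add the new one, commit.
  def update(id, doc)
    @docs.delete(id)
    @docs[id] = doc
    @commits += 1
  end

  # Whole batch: delete all old versions, add all new ones, commit once.
  def batch_update(docs)
    docs.each_key { |id| @docs.delete(id) }
    docs.each { |id, doc| @docs[id] = doc }
    @commits += 1
  end
end

# 1000 fake documents keyed by id.
docs = (1..1000).each_with_object({}) { |i, h| h[i] = { title: "entry #{i}" } }

one_by_one = ToyIndex.new
docs.each { |id, doc| one_by_one.update(id, doc) }

batched = ToyIndex.new
batched.batch_update(docs)

puts "one by one: #{one_by_one.commits} commits"
puts "batched:    #{batched.commits} commits"
```

Both indexes end up with the same contents; only the number of commits differs. How much of the 215s vs 17s gap this explains in real Ferret would have to be measured.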

Francois Lagunas francois.lagunas@gmail.com http://www.tourteaser.com

Change History

02/06/08 23:14:29 changed by aaf

  • keywords changed from Very Slow Batch indexing to PATCH Large speed improvement indexing.
  • type changed from defect to enhancement.
  • summary changed from Very slow initial indexing (patch provided) to [PATCH] Large speed improvement on indexing (15x).

This has a related ticket on ferret:

http://ferret.davebalmain.com/trac/ticket/340

In fact, when the patch on ferret is used, you only have to call the new batch_update function:

def index_records(records, offset)
  docs = {}
  batch_time = measure_time {
    records.each { |rec| docs[rec.id] = rec.to_doc if rec.ferret_enabled?(true) }
    @index.batch_update(docs)
  }.to_f

  ...
end
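For anyone who wants to try the shape of this loop outside a Rails app, here is a small sketch with stand-ins. FakeRecord and FakeIndex are hypothetical and belong to neither acts_as_ferret nor Ferret; a real to_doc would return a Ferret document rather than a plain hash. Only the way the docs hash is built and handed over in one call mirrors index_records.

```ruby
# Hypothetical stand-ins; neither class is part of acts_as_ferret or Ferret.
FakeRecord = Struct.new(:id, :title, :enabled) do
  # Mirrors the ferret_enabled?(true) call used in index_records.
  def ferret_enabled?(_is_bulk_index = false)
    enabled
  end

  # A plain hash stands in for the real document object.
  def to_doc
    { id: id, title: title }
  end
end

class FakeIndex
  attr_reader :received

  # Records what it was given instead of indexing anything.
  def batch_update(docs)
    @received = docs
  end
end

records = [
  FakeRecord.new(1, "first", true),
  FakeRecord.new(2, "second", false), # not ferret-enabled, so skipped
  FakeRecord.new(3, "third", true)
]

index = FakeIndex.new
docs = {}
records.each { |rec| docs[rec.id] = rec.to_doc if rec.ferret_enabled?(true) }
index.batch_update(docs) # one call for the whole batch

puts index.received.keys.inspect
```

Records that are not ferret-enabled never enter the hash, so the index receives a single call containing only the documents that should be (re)indexed.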

Francois Lagunas
Scientific Director, Dailymotion
francois.lagunas@gmail.com
http://www.tourteaser.com

02/07/08 09:25:13 changed by jk

  • status changed from new to assigned.
  • milestone set to 0.5.

cool stuff :-)

02/11/08 15:11:25 changed by aaf

  • cc changed from francois.lagunas@gmail.com to francois.lagunas@gmail.com, j@jjb.cc.

Looks like Francois' improvements were accepted into Ferret:

http://ferret.davebalmain.com/trac/ticket/340
http://ferret.davebalmain.com/trac/changeset/810

Nice!
