PROPUBLICA Expose Corruption. Defend Truth. Support Investigative Journalism.

New Open Source Project: Daybreak, a Simple Key/Value Database for Ruby

A couple of weeks ago, in an article about the science behind the Message Machine project, we mentioned the custom key-value store we built to store non-relational data. Today, we're open sourcing that library, which we're calling Daybreak.

Daybreak is a simple key-value store for Ruby that operates just like a Ruby hash. Commits to the database are stored in an append-only file and flushed asynchronously, though there are options to atomically commit a write. It is faster than pstore and simpler to use than dbm. Because your data is stored in an in-memory hash table, you also get Ruby conveniences like each, filter, map and reduce. You can install it by running gem install daybreak, and the code is over on GitHub.
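
To make that storage model concrete, here is a minimal sketch of the general idea in plain Ruby: every write is appended to a log file and mirrored in an in-memory hash, and the log is replayed when the file is reopened. This is not Daybreak's implementation. The TinyLogStore class, its JSON line format and the toy.db file name are all hypothetical, and unlike Daybreak the sketch writes from the calling thread instead of flushing in the background.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical toy store, NOT Daybreak's implementation: every write is
# appended to a log file and mirrored in an in-memory Hash.
class TinyLogStore
  include Enumerable

  def initialize(path)
    @table = {}
    # Replay the log, if any; later records overwrite earlier ones.
    if File.exist?(path)
      File.foreach(path) do |line|
        key, value = JSON.parse(line)
        @table[key] = value
      end
    end
    @log = File.open(path, 'a')
  end

  def [](key)
    @table[key.to_s]
  end

  def []=(key, value)
    @table[key.to_s] = value
    # Append a new record; existing records are never rewritten.
    @log.puts(JSON.generate([key.to_s, value]))
  end

  # Push buffered writes out to the operating system.
  def flush
    @log.flush
  end

  def close
    @log.close
  end

  # Enumerable (map, select, reduce, ...) is built on top of each.
  def each(&block)
    @table.each(&block)
  end
end

path = File.join(Dir.mktmpdir, 'toy.db')
store = TinyLogStore.new(path)
store[1] = 'one'
store[1] = 'uno'    # a second record is appended; the first stays in the log
store['two'] = 2
store.close

# Reopening replays the log and rebuilds the in-memory table.
reopened = TinyLogStore.new(path)
reopened_one = reopened[1]
reopened_keys = reopened.map { |key, _value| key }
log_lines = File.readlines(path).length
reopened.close
```

Because nothing is ever rewritten in place, the log keeps growing as keys are overwritten; the in-memory table is what keeps reads fast.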

Daybreak's API mirrors Ruby's hash interface, which makes it convenient to use. The docs have a simple walkthrough of the API, but let's create a simple search engine to showcase Daybreak's abilities. Here's the class we'll fill out in this post:

  class Search
    def initialize
      # tk
    end

    def query(query)
      # tk
    end

    def add(docs)
      # tk
    end
  end

First, let's store the documents in a database:

  require 'daybreak'

  class Search
    def initialize
      # create the storage database
      @docs_db = Daybreak::DB.new('./docs.db')
    end

    # ...

    # add some documents
    def add(docs)
      # Daybreak keys are strings so we'll want to convert them back to
      # integers to find the next key
      max = (@docs_db.keys.map(&:to_i).max || 0)
      docs.each do |doc|
        max += 1
        @docs_db[max] = doc
      end
      # we'll make sure our changes are flushed to disk
      @docs_db.flush!
    end
  end

To have an effective search engine, we'll also need an index that maps each word to the documents containing it. So let's add another database to hold that index:

  require 'daybreak'
  require 'set'

  class Search
    def initialize
      # create the storage database
      @docs_db = Daybreak::DB.new('./docs.db')
      # Ruby objects work too, this db will have a default value of an empty
      # set
      @index_db = Daybreak::DB.new('./index.db') { |k| Set.new }
    end

    # add some documents
    def add(docs)
      # Daybreak keys are strings so we'll want to convert them back to
      # integers to find the next key
      max = (@docs_db.keys.map(&:to_i).max || 0)
      docs.each do |doc|
        max += 1

        @docs_db[max] = doc
        tokens = doc.split(/ +/)
        # build a simple inverted index: each lowercased word maps to the
        # set of ids of the documents that contain it
        tokens.each { |t| @index_db[t.downcase] = @index_db[t.downcase] << max }
      end

      # we'll make sure our changes are flushed to disk
      @docs_db.flush!
      @index_db.flush!
    end
  end
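
If you want to experiment with the indexing step without creating any database files, the same structure can be sketched with a plain Ruby Hash standing in for @index_db. The index variable and the three sample documents below are made up for illustration; only the split/downcase/append logic is taken from the add method above.

```ruby
require 'set'

# Plain-Ruby stand-in for @index_db: a Hash whose default block returns an
# empty Set for words that have not been indexed yet.
index = Hash.new { |_hash, _word| Set.new }

# Hypothetical sample documents, keyed by id.
docs = {
  1 => 'The quick brown fox',
  2 => 'the lazy dog',
  3 => 'Quick quick dog'
}

docs.each do |id, text|
  text.split(/ +/).each do |token|
    word = token.downcase
    # The default block does not store anything by itself, so the grown
    # set is assigned back, mirroring the line in the add method.
    index[word] = index[word] << id
  end
end

quick_ids = index['quick']    # ids of documents containing "quick"
missing = index['cat']        # an unseen word yields an empty set
```

Using a Set means a word that appears twice in the same document, like "quick" in document 3, is still recorded only once.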

And finally, let's write the method that takes a query and returns the documents containing any of the query's words:

  class Search
    def query(query)
      tokens = query.split(/ +/)

      # Find the ids of documents that contain any of the query terms
      ids = tokens.reduce([]) { |m, t| m + @index_db[t.downcase].to_a }.uniq

      # Finally grab the text and return it.
      ids.map { |id| @docs_db[id] }
    end
  end
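
Note that this lookup has "or" semantics: a document is returned if it contains at least one of the query words. A small sketch with in-memory stand-ins shows the behavior. The lookup helper, the docs hash and the hand-built index here are hypothetical and exist only for this illustration.

```ruby
require 'set'

# Hypothetical in-memory stand-ins for @docs_db and @index_db.
docs = { 1 => 'red apple', 2 => 'green apple', 3 => 'green pear' }
index = Hash.new { |_hash, _word| Set.new }
index.update(
  'red'   => Set[1],
  'green' => Set[2, 3],
  'apple' => Set[1, 2],
  'pear'  => Set[3]
)

# Same steps as the query method: split, look up each lowercased token,
# concatenate the id lists, drop duplicates, then fetch the documents.
def lookup(query, index, docs)
  tokens = query.split(/ +/)
  ids = tokens.reduce([]) { |memo, token| memo + index[token.downcase].to_a }.uniq
  ids.map { |id| docs[id] }
end

either = lookup('Red pear', index, docs)   # matches "red" OR "pear"
none = lookup('banana', index, docs)       # unknown word matches nothing
```

Here the first call returns both 'red apple' and 'green pear', even though neither document contains both words.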

Here's an example of how to use the above class:

  searcher = Search.new

  searcher.add([
    "To define the reality of the human condition and to make our definitions public.",
    "To confront the new facts of history-making in our time, and their meaning for the problem of political responsibility.",
    "Continually to investigate the causes of war, and among them to locate the decisions and defaults of elite circles.",
    "To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction.",
    "To demand full information of relevance to human destiny and the end of decisions made in irresponsible secrecy.",
    "To cease being the intellectual dupes of political patrioteers."
  ])

  searcher.query("human condition")
  # => ["To define the reality of the human condition and to make our definitions public.",
  #     "To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction.",
  #     "To demand full information of relevance to human destiny and the end of decisions made in irresponsible secrecy."]

In another program we can reopen the database and perform another query:

  search2 = Search.new

  search2.query("explore")
  # => ["To release the human imagination, to explore all the alternatives now open to the human community by transcending both the mere exhortation of grand principle and the mere opportunist reaction."]

Those are the basics. If it sounds useful to you, head on over to GitHub and kick the tires. If you run into bugs, open up an issue on GitHub, and of course we're always happy to receive pull requests!

Let us know if you end up using it in a project by emailing us at [email protected].
