Disassembling/Reassembling the MongoDB ObjectId

Intro

MongoDB automatically adds an _id field to every inserted document if the user does not provide one. Its value must be unique within the collection and can be of almost any BSON type, but most commonly it is an ObjectId.

ObjectId specification

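The MongoDB docs are very detailed, and the ObjectId specification lets us break an automatically generated ObjectId down into its distinct parts. The value is 12 bytes long and consists of the following parts (consecutively):

  1. timestamp → Generation timestamp, in seconds since the Unix epoch (4 bytes).
  2. machine → First 3 bytes of the MD5 hash of the machine host name, or of the MAC/network address, or of the virtual machine id (3 bytes).
  3. pid → First 2 bytes of the process (or thread) ID generating the ObjectId (2 bytes).
  4. inc → Ever-incrementing integer value (3 bytes).

Note that more recent versions of the specification replace the machine and pid parts with a single 5-byte random value; everything in this post follows the older layout listed above.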
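Since every byte is written as two hexadecimal characters, the byte widths above translate directly into character ranges of the 24-character string form. Here is a minimal sketch of that arithmetic in plain Python (no pymongo needed; FIELD_WIDTHS and hex_ranges are names made up for this illustration):

```python
# Byte widths of the four parts, in order, as listed above.
FIELD_WIDTHS = [("timestamp", 4), ("machine", 3), ("pid", 2), ("inc", 3)]

def hex_ranges(widths):
    """Map each field name to its (start, end) slice in the hex string."""
    ranges = {}
    start = 0
    for name, nbytes in widths:
        end = start + 2 * nbytes  # one byte is two hex characters
        ranges[name] = (start, end)
        start = end
    return ranges

ranges = hex_ranges(FIELD_WIDTHS)
for name, _ in FIELD_WIDTHS:
    print("%-9s -> hex[%d:%d]" % ((name,) + ranges[name]))
```

The last range ends at 24, which is the full length of the string.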

Most drivers (including pymongo) include direct methods to return the creation timestamp.
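In pymongo this is, to the best of my knowledge, the generation_time property of an ObjectId. For illustration, the same value can be read straight from the first 8 hex characters with the standard library alone. The sketch below uses the ObjectId from the sample output further down and prints the time in UTC (creation_time_utc is a made-up helper name, not part of any driver):

```python
import time

def creation_time_utc(hex_id):
    """Return (seconds since epoch, formatted UTC time) for an ObjectId hex string."""
    seconds = int(hex_id[0:8], 16)  # first 4 bytes = 8 hex characters
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds))
    return seconds, stamp

seconds, stamp = creation_time_utc("4f9407d7ae243d04f8000000")
print(seconds)  # 1335101399
print(stamp)    # 2012-04-22 13:29:59
```

Using gmtime keeps the result independent of the server's time zone; localtime would give the local wall-clock time instead.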

For everything purpose there must be …

…as Yoda would probably add. Decomposing an ObjectId lets us store its individual properties as integer values, and later fetch a specific document by recomposing the _id value and using it as the query condition. I plan to write another post describing how to implement xmlpipe2 in order to use the Sphinx search engine with MongoDB collections.

Enough with the talk already!

The following example uses Python and the native MongoDB driver, pymongo, on a server running Ubuntu 10.04. pymongo is installed using easy_install:

apt-get install python-setuptools
easy_install pymongo

We start by importing the objectid module from the pymongo package and the datetime class from the datetime module, which we’ll use to verify the extracted timestamp.

#!/usr/bin/python
# on newer pymongo releases the module lives in bson: from bson import objectid
from pymongo import objectid
from datetime import datetime

Generate a new ObjectId and get its 24-character hexadecimal representation as a string:

o = objectid.ObjectId()
o_str = str(o)
print 'ObjectId : ', o_str

Select the string ranges that correspond to each property and convert them from hexadecimal to integer values:

id_time = int(o_str[0:8], 16)
machine = int(o_str[8:14], 16)
pid = int(o_str[14:18], 16)
inc = int(o_str[18:], 16)

print 'Decomposed:'
print 'creation timestamp:', id_time, o_str[0:8]
#verifying the timestamp is valid (datetime, unlike date, keeps the time of day)
print datetime.fromtimestamp(id_time).strftime('%Y-%m-%d %H:%M:%S')
print 'machine hash:', machine, o_str[8:14]
print 'PID:', pid, o_str[14:18]
print 'inc:', inc, o_str[18:]

To reassemble the ObjectId, we use the hex function to get each value's hexadecimal representation as a string, after stripping the ‘0x’ prefix. We also use the zfill string function to left-pad every part with zeros to its fixed width (8, 6, 4 and 6 characters); otherwise a part that starts with zeros, such as a small machine hash, would come out too short:

#recomposing to string
objectid_r = hex(id_time).replace('0x', '').zfill(8)
objectid_r += hex(machine).replace('0x', '').zfill(6)
objectid_r += hex(pid).replace('0x', '').zfill(4)
objectid_r += hex(inc).replace('0x', '').zfill(6)
print 'Recomposing ObjectId: ', objectid_r
print 'Comparison for string representations: ', objectid_r == o_str
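Padding matters for every part, not only for pid and inc. As a possible alternative to hex plus zfill, a format string can do the conversion and the padding in one step. The sketch below is plain Python; compose_hex is a made-up name, and the second machine value is invented purely to show a leading zero:

```python
def compose_hex(id_time, machine, pid, inc):
    """Rebuild the 24-character hex string from the four integer parts."""
    # %0Nx = lowercase hex, left-padded with zeros to N characters
    return "%08x%06x%04x%06x" % (id_time, machine, pid, inc)

# Values from the sample run: this round-trips to the original string.
full = compose_hex(1335101399, 11412541, 1272, 0)
print(full)

# Hypothetical machine hash 0x00243d: its leading zeros must be kept.
padded = compose_hex(1335101399, 0x00243d, 1272, 0)
unpadded = "%x%x%x%x" % (1335101399, 0x00243d, 1272, 0)
print(padded, len(padded))
print(unpadded, len(unpadded))
```

Without padding the string comes out shorter than 24 characters, so it no longer maps back to the same four parts.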

At this point, we observe that these two snippets could be turned into methods of the ObjectId class, so we create our own class by extending the base class and adding the compose and decompose methods:

class myobjectid(objectid.ObjectId):
  def decompose(self, s=None):
    #decompose the given hex string, or this instance if none is given
    if s is None:
      o_str = str(self)
    else:
      o_str = s
    id_time = int(o_str[0:8], 16)
    machine = int(o_str[8:14], 16)
    pid = int(o_str[14:18], 16)
    inc = int(o_str[18:], 16)
    return { "timestamp" : id_time, "machine" : machine, "pid" : pid, "inc" : inc }

  def compose(self, elements):
    #left-pad every part to its fixed width so that leading zeros survive
    objectid_r = hex(elements["timestamp"]).replace('0x', '').zfill(8)
    objectid_r += hex(elements["machine"]).replace('0x', '').zfill(6)
    objectid_r += hex(elements["pid"]).replace('0x', '').zfill(4)
    objectid_r += hex(elements["inc"]).replace('0x', '').zfill(6)
    return objectid_r

The decompose method returns a dictionary with the elements of the ObjectId instance. An explicit string can be given for decomposition by providing the value for the s argument. The compose method takes the dictionary returned from decompose and returns the string representation of the ObjectId.

#we observe that ObjectId generated values for the same session are sequential,
#as the driver emulates the database behavior
print 'Using the extended class:'
o = myobjectid()
print 'Hex representation:', str(o)
print 'Fetching decomposition dict:',o.decompose()
print 'Verifying recomposition method:',o.compose(o.decompose())

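Sample output (from a run that verified the timestamp with date.fromtimestamp, which drops the time of day, hence the 00:00:00):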

ObjectId :  4f9407d7ae243d04f8000000
Decomposed:
creation timestamp: 1335101399 4f9407d7
2012-04-22 00:00:00
machine hash: 11412541 ae243d
PID: 1272 04f8
inc: 0 000000
Recomposing ObjectId:  4f9407d7ae243d04f8000000
Comparison for string representations:  True

Using the extended class:
Hex representation: 4f9407d7ae243d04f8000001
Fetching decomposition dict: {'machine': 11412541, 'timestamp': 1335101399, 'pid': 1272, 'inc': 1}
Verifying recomposition method: 4f9407d7ae243d04f8000001
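The two ids in this output differ only in their last part, which is what we would expect from an incrementing counter. Here is a small sketch that checks this on the two strings above, using plain Python and a made-up split_id helper:

```python
def split_id(hex_id):
    """Split an ObjectId hex string into (timestamp, machine, pid, inc) integers."""
    return (int(hex_id[0:8], 16), int(hex_id[8:14], 16),
            int(hex_id[14:18], 16), int(hex_id[18:24], 16))

first = split_id("4f9407d7ae243d04f8000000")
second = split_id("4f9407d7ae243d04f8000001")

same_prefix = first[:3] == second[:3]  # timestamp, machine and pid match
inc_step = second[3] - first[3]        # how far the counter advanced
print(same_prefix, inc_step)
```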

Well, that was all, I hope I got it right… please feel free to comment on any mistakes!
The complete code:

#!/usr/bin/python
# on newer pymongo releases the module lives in bson: from bson import objectid
from pymongo import objectid
from datetime import datetime

class myobjectid(objectid.ObjectId):
  def decompose(self, s=None):
    #decompose the given hex string, or this instance if none is given
    if s is None:
      o_str = str(self)
    else:
      o_str = s
    id_time = int(o_str[0:8], 16)
    machine = int(o_str[8:14], 16)
    pid = int(o_str[14:18], 16)
    inc = int(o_str[18:], 16)
    return { "timestamp" : id_time, "machine" : machine, "pid" : pid, "inc" : inc }

  def compose(self, elements):
    #left-pad every part to its fixed width so that leading zeros survive
    objectid_r = hex(elements["timestamp"]).replace('0x', '').zfill(8)
    objectid_r += hex(elements["machine"]).replace('0x', '').zfill(6)
    objectid_r += hex(elements["pid"]).replace('0x', '').zfill(4)
    objectid_r += hex(elements["inc"]).replace('0x', '').zfill(6)
    return objectid_r


o = objectid.ObjectId()
o_str = str(o)
print 'ObjectId : ', o_str
#decomposing to integers
id_time = int(o_str[0:8], 16)
machine = int(o_str[8:14], 16)
pid = int(o_str[14:18], 16)
inc = int(o_str[18:], 16)

print 'Decomposed:'
print 'creation timestamp:', id_time, o_str[0:8]
#verifying the timestamp is valid (datetime, unlike date, keeps the time of day)
print datetime.fromtimestamp(id_time).strftime('%Y-%m-%d %H:%M:%S')
print 'machine hash:', machine, o_str[8:14]
print 'PID:', pid, o_str[14:18]
print 'inc:', inc, o_str[18:]

#recomposing to string
objectid_r = hex(id_time).replace('0x', '').zfill(8)
objectid_r += hex(machine).replace('0x', '').zfill(6)
objectid_r += hex(pid).replace('0x', '').zfill(4)
objectid_r += hex(inc).replace('0x', '').zfill(6)
print 'Recomposing ObjectId: ', objectid_r
print 'Comparison for string representations: ', objectid_r == o_str

#we observe that ObjectId generated values for the same session are sequential,
#as the driver emulates the database behavior
print ''
print 'Using the extended class:'
o = myobjectid()
print 'Hex representation:', str(o)
print 'Fetching decomposition dict:', o.decompose()
print 'Verifying recomposition method:', o.compose(o.decompose())

3 thoughts on “Disassembling/Reassembling the MongoDB ObjectId”

  1. There is a bug in this script… The date.fromtimestamp() call is throwing away the time information. Instead you should use something like time.localtime() to extract the date and the time from the timestamp.

    • Thanks for the remark. You are correct, but this line is for demonstrating purposes only (just checking that our timestamp corresponds to an actual date), it is not actually used in the composition/decomposition methods.
      I will replace it with datetime as soon as possible.
