Disassembling/Reassembling the MongoDB ObjectId

Introduction

MongoDB automatically adds an _id field to every inserted document if the user does not provide one. Its value must be unique within the collection and can be of any type, but it is most commonly an ObjectId. Here we examine the individual components of an ObjectId and how to retrieve them, and then show how to reconstruct the original ObjectId instance from those components.

ObjectId Specification

The MongoDB documentation is detailed, and the ObjectId specification lets us break the automatically generated ObjectId value (12 bytes in total) into distinct parts. In order, these are:

  1. timestamp → Generation timestamp, in seconds since the Unix epoch (4 bytes).
  2. machine → First 3 bytes of the MD5 hash of the machine host name, or of the MAC/network address, or of the virtual machine id (3 bytes).
  3. pid → First 2 bytes of the process (or thread) ID generating the ObjectId (2 bytes).
  4. inc → Ever-incrementing integer value (3 bytes).
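As a rough illustration of this layout, the sketch below slices a 12-byte value at the offsets listed above. It uses only the Python 3 standard library; the sample hex string and the function name split_object_id are made up for the example and do not come from pymongo.

```python
# Hypothetical sample value, chosen only for illustration.
SAMPLE_HEX = "5b6d7a3e1f2a3b4c5d6e7f80"


def split_object_id(hex_string):
    """Split a 24-character hex string into the four fields listed above."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes long")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # bytes 0-3
        "machine": int.from_bytes(raw[4:7], "big"),     # bytes 4-6
        "pid": int.from_bytes(raw[7:9], "big"),         # bytes 7-8
        "inc": int.from_bytes(raw[9:12], "big"),        # bytes 9-11
    }


parts = split_object_id(SAMPLE_HEX)
print(parts)
```

All fields are read as big-endian integers, which is why slicing the hex string two characters per byte gives the same numbers.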

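Most drivers (including pymongo) provide a direct way to read the creation timestamp; in pymongo it is the generation_time property of ObjectId. Note that newer versions of the ObjectId specification replace the machine and pid fields with a single 5-byte random value, so those two components are only meaningful for ObjectIds generated under the layout described above.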
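For readers who want to see what such a method boils down to, here is a minimal sketch that derives a timezone-aware UTC datetime from the first 8 hex characters. It is not pymongo's implementation; the helper name generation_time_utc and the sample value are assumptions made for the example, and it needs Python 3.

```python
from datetime import datetime, timezone

# Hypothetical sample value, chosen only for illustration.
SAMPLE_HEX = "5b6d7a3e1f2a3b4c5d6e7f80"


def generation_time_utc(hex_string):
    """Return the creation time encoded in the first 4 bytes, in UTC."""
    seconds = int(hex_string[:8], 16)  # seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = generation_time_utc(SAMPLE_HEX)
print(created.isoformat())
```

Working in UTC avoids the ambiguity of local time zones when the value is compared across machines.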

Use Cases

Decomposing an ObjectId lets us store its individual properties as separate values in a different store, and later fetch the specific document by reconstructing the autogenerated _id field value from them and using it as the query condition. It also tells us when the value was generated and on which node.

Enough with the talk already!

The following example uses Python and the native MongoDB driver, pymongo, which can be installed using pip:

pip install pymongo==3.7.1

We start by importing the ObjectId class from the bson package and the datetime class from the built-in datetime module, which we'll use to verify the extracted timestamp.

from bson.objectid import ObjectId
from datetime import datetime

Generate a new ObjectId and get its 24-character hexadecimal representation as a string:

oid = ObjectId()
print('ObjectId: {}'.format(oid))

Next, we select the string ranges that correspond to each property and convert them from hexadecimal to integer values:

oid_as_string = str(oid)
generation_time = int(oid_as_string[0:8], 16)
host = int(oid_as_string[8:14], 16)
process_id = int(oid_as_string[14:18], 16)
increment = int(oid_as_string[18:], 16)
print('''Decomposed Form: timestamp={}->{},host={}->{},\
process_id={}->{},increment={}->{}'''.format(
    oid_as_string[0:8],
    generation_time,
    oid_as_string[8:14],
    host,
    oid_as_string[14:18],
    process_id,
    oid_as_string[18:],
    increment
))

# Validate the timestamp (rendered in the local time zone)
print('Generation Timestamp: {}'.format(
    datetime.fromtimestamp(
        generation_time
    ).strftime('%Y-%m-%d %H:%M:%S')
))

To reconstruct the ObjectId, we use the hex function to get the hexadecimal string representation of each component and strip the '0x' prefix. We also use the zfill string method to left-pad every component with zeros to its fixed width (8, 6, 4 and 6 hexadecimal digits). Without the padding, a component with leading zeros would produce a string shorter than 24 characters, which ObjectId rejects:

# Reconstructing as string
def convert_to_hex(component, width):
    # Strip the '0x' prefix, then left-pad with zeros to the field width
    return hex(component).replace('0x', '').zfill(width)

oid_new = convert_to_hex(generation_time, 8)
oid_new += convert_to_hex(host, 6)
oid_new += convert_to_hex(process_id, 4)
oid_new += convert_to_hex(increment, 6)

print('Reconstructed ObjectId: {}'.format(oid_new))
print('Comparison: {}'.format(ObjectId(oid_new) == oid))
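To see why the width of each field matters during reconstruction, consider a host value whose hexadecimal form starts with zeros. The sketch below is standalone and uses made-up component values; the helper names to_fixed_hex and join_object_id_hex are hypothetical and only illustrate fixed-width formatting.

```python
def to_fixed_hex(value, width):
    """Format a non-negative integer as exactly `width` hex digits."""
    if value < 0 or value >= 16 ** width:
        raise ValueError("value does not fit in {} hex digits".format(width))
    return format(value, "0{}x".format(width))


def join_object_id_hex(timestamp, host, process_id, increment):
    """Concatenate the four components using their fixed field widths."""
    return (
        to_fixed_hex(timestamp, 8)
        + to_fixed_hex(host, 6)
        + to_fixed_hex(process_id, 4)
        + to_fixed_hex(increment, 6)
    )


# A host value with leading zeros loses digits when converted naively.
unpadded = hex(0x00AB12).replace("0x", "")
padded = to_fixed_hex(0x00AB12, 6)

# Made-up component values, chosen so that every field needs padding.
rebuilt = join_object_id_hex(0x5B6D7A3E, 0x00AB12, 0x0001, 0x000002)
print(unpadded, padded, rebuilt, len(rebuilt))
```

The range check is a design choice for the sketch: it turns an out-of-range component into an explicit error instead of a silently malformed identifier.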

At this point, we observe that the decomposition and reconstruction could be incorporated into methods of the ObjectId class, so we can create our own class by extending the base class and adding the decompose and reconstruct methods:

class ExtendedObjectId(ObjectId):
    def decompose(self):
        oid_as_string = str(self)
        generation_time = int(oid_as_string[0:8], 16)
        host = int(oid_as_string[8:14], 16)
        process_id = int(oid_as_string[14:18], 16)
        increment = int(oid_as_string[18:], 16)
        return {
            "timestamp" : generation_time,
            "host" : host,
            "process_id": process_id,
            "increment" : increment
        }

    @classmethod
    def from_decomposed_form(cls, properties):
        # Every field is padded to its fixed width: 8 + 6 + 4 + 6 = 24 digits
        oid = cls._convert_to_hex(properties['timestamp'], 8)
        oid += cls._convert_to_hex(properties['host'], 6)
        oid += cls._convert_to_hex(properties['process_id'], 4)
        oid += cls._convert_to_hex(properties['increment'], 6)
        return cls(oid)

    @staticmethod
    def _convert_to_hex(component, width):
        # Strip the '0x' prefix, then left-pad with zeros to the field width
        return hex(component).replace('0x', '').zfill(width)

The decompose method returns a dictionary with the components of the ObjectId instance. The from_decomposed_form class method takes the dictionary returned by decompose and returns a new instance built from it, so it serves as an alternative constructor.

print('Using the extended class:')
extended_oid = ExtendedObjectId()
print('Hex Representation: {}'.format(extended_oid))
print('Decomposed Form: {}'.format(extended_oid.decompose()))
extended_oid_reconstructed = ExtendedObjectId.from_decomposed_form(
    extended_oid.decompose()
)
print('Verifying Reconstruction: {}'.format(
    extended_oid_reconstructed == extended_oid
))
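To connect this back to the use case of keeping the components in a different store, here is a small sketch that writes the four integers to an in-memory SQLite table and rebuilds the hex string from the stored row. The table name, the column names, the helper functions and the sample value are all assumptions made for the example; with pymongo, the rebuilt string could then be passed to ObjectId to query on _id.

```python
import sqlite3

# Hypothetical sample value, chosen only for illustration.
SAMPLE_HEX = "5b6d7a3e00ab120001000002"


def decompose_hex(oid_hex):
    """Return (timestamp, host, process_id, increment) as integers."""
    return (
        int(oid_hex[0:8], 16),
        int(oid_hex[8:14], 16),
        int(oid_hex[14:18], 16),
        int(oid_hex[18:24], 16),
    )


def compose_hex(timestamp, host, process_id, increment):
    """Rebuild the 24-character hex string from the four integers."""
    return "{:08x}{:06x}{:04x}{:06x}".format(
        timestamp, host, process_id, increment
    )


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE oid_parts (ts INTEGER, host INTEGER, pid INTEGER, inc INTEGER)"
)
# Store the components as separate columns.
connection.execute(
    "INSERT INTO oid_parts VALUES (?, ?, ?, ?)", decompose_hex(SAMPLE_HEX)
)
# Later: read them back and rebuild the identifier.
row = connection.execute("SELECT ts, host, pid, inc FROM oid_parts").fetchone()
restored_hex = compose_hex(*row)
connection.close()

print(restored_hex == SAMPLE_HEX)
```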

 

Conclusion

After analyzing the structure of MongoDB ObjectId values, we managed to decompose and reconstruct them. The process helps us understand the logic behind the autogenerated document identifiers across MongoDB instances, and may prove useful in situations where we need to store individual identifier properties in other data stores.

4 thoughts on “Disassembling/Reassembling the MongoDB ObjectId”

  1. There is a bug in this script… The date.fromtimestamp() call is throwing away the time information. Instead you should use something like time.localtime() to extract the date and the time from the timestamp.

    1. Thanks for the remark. You are correct, but this line is for demonstrating purposes only (just checking that our timestamp corresponds to an actual date), it is not actually used in the composition/decomposition methods.
      I will replace it with datetime as soon as possible.
