Introduction
MongoDB automatically adds an _id field to every inserted document (if the user does not provide one). Its value must be unique within the collection and can be of almost any type, but most commonly it is an ObjectId. Here we examine the individual components of an ObjectId and how to retrieve them, as well as how to reconstruct the original ObjectId instance from them.
ObjectId Specification
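The MongoDB documentation is detailed, and its ObjectId specification lets us break an automatically generated ObjectId value into distinct parts. An ObjectId is 12 bytes long and, in the layout described here, consists of the following consecutive parts:
- timestamp → generation timestamp in seconds since the Unix epoch (4 bytes)
- machine → first 3 bytes of the MD5 hash of the machine host name, or of the MAC/network address, or of the virtual machine id (3 bytes)
- pid → first 2 bytes of the ID of the process (or thread) generating the ObjectId (2 bytes)
- inc → ever-incrementing integer value (3 bytes)
Newer revisions of the ObjectId specification replace the machine and pid parts with a single 5-byte random value. The byte offsets used below stay the same, but the two middle values then no longer identify a host or a process.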
Most drivers (including pymongo) include direct methods to return the creation timestamp.
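As a rough illustration of the layout above, the sketch below splits a 24-character hex string into the four parts using only the Python standard library. The helper name split_object_id and the sample value are illustrative assumptions, not part of pymongo. With pymongo itself, the generation_time property of an ObjectId instance returns the creation time directly.

```python
import struct

def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into (timestamp, machine, pid, inc)."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    # Big-endian, no padding: 4-byte unsigned int, 3 raw bytes,
    # 2-byte unsigned int, 3 raw bytes (4 + 3 + 2 + 3 = 12)
    timestamp, machine, pid, inc = struct.unpack(">I3sH3s", raw)
    return timestamp, machine.hex(), pid, int.from_bytes(inc, "big")

# Sample value chosen for illustration only
timestamp, machine, pid, inc = split_object_id("507f1f77bcf86cd799439011")
print(timestamp, machine, pid, inc)
```

Unpacking the raw bytes avoids any dependence on string offsets, at the cost of being slightly less readable than slicing the hex string.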
Use Cases
Decomposing an ObjectId enables us to store its individual properties as separate values in a different store, and later to fetch a specific document by reconstructing its autogenerated _id value from them and querying on it. It also gives us insight into which node generated the value and when.
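To make the first use case concrete, here is a minimal sketch that keeps the four components in a separate store and rebuilds the hex identifier on demand. It assumes an in-memory SQLite table as the "different store"; the table name oid_parts, its columns, and the sample value are hypothetical. The rebuilt string could then be passed to ObjectId and used to query on _id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE oid_parts (ts INTEGER, host INTEGER, pid INTEGER, inc INTEGER)"
)

# Components of a sample identifier (hypothetical value)
sample_hex = "507f1f77bcf86cd799439011"
parts = (
    int(sample_hex[0:8], 16),    # timestamp
    int(sample_hex[8:14], 16),   # machine
    int(sample_hex[14:18], 16),  # pid
    int(sample_hex[18:24], 16),  # inc
)
conn.execute("INSERT INTO oid_parts VALUES (?, ?, ?, ?)", parts)

# Later: read the parts back and rebuild the 24-character hex string,
# padding each field to its fixed width
ts, host, pid, inc = conn.execute(
    "SELECT ts, host, pid, inc FROM oid_parts"
).fetchone()
rebuilt_hex = "{:08x}{:06x}{:04x}{:06x}".format(ts, host, pid, inc)
print(rebuilt_hex == sample_hex)
```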
Enough with the talk already!
The following example uses Python and the native MongoDB driver, pymongo, which can be installed using pip:
pip install pymongo==3.7.1
We start by importing the ObjectId class from the bson package and the datetime class from the built-in datetime module, which we'll use to verify the extracted timestamp.
from bson.objectid import ObjectId
from datetime import datetime
Generate a new ObjectId and get its 24-character hexadecimal representation (12 bytes) as a string:
oid = ObjectId()
print('ObjectId: {}'.format(oid))
Select the string ranges that correspond to each property and convert them from hexadecimal to integer values:
oid_as_string = str(oid)
generation_time = int(oid_as_string[0:8], 16)
host = int(oid_as_string[8:14], 16)
process_id = int(oid_as_string[14:18], 16)
increment = int(oid_as_string[18:], 16)

print('Decomposed Form: timestamp={}->{},host={}->{},'
      'process_id={}->{},increment={}->{}'.format(
    oid_as_string[0:8], generation_time,
    oid_as_string[8:14], host,
    oid_as_string[14:18], process_id,
    oid_as_string[18:], increment
))

# Validate the timestamp (shown in the local time zone)
print('Generation Timestamp: {}'.format(
    datetime.fromtimestamp(
        generation_time
    ).strftime('%Y-%m-%d %H:%M:%S')
))
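One caveat worth illustrating: datetime.fromtimestamp without a time zone returns local time, so the printed value depends on the machine running the script. A possible variant, sketched here with the standard library only and a made-up sample string, converts the embedded seconds to an explicit UTC datetime instead. The function name generation_time_utc is an assumption of this sketch.

```python
from datetime import datetime, timezone

def generation_time_utc(hex_id):
    """Return the creation time embedded in a hex ObjectId as an aware UTC datetime."""
    seconds = int(hex_id[0:8], 16)  # first 4 bytes = seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Sample value chosen for illustration only
created = generation_time_utc("507f1f77bcf86cd799439011")
print(created.strftime('%Y-%m-%d %H:%M:%S %Z'))
```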
To reconstruct the ObjectId, we use the hex function to get the hexadecimal representation of each component as a string, stripping the '0x' prefix. We also use the zfill string method to left-pad the resulting strings with zeros, because hex drops leading zeros:
# Reconstructing as string
def convert_to_hex(component):
    return hex(component).replace('0x', '')

# Pad every field to its fixed width (8 + 6 + 4 + 6 = 24 hex characters)
oid_new = convert_to_hex(generation_time).zfill(8)
oid_new += convert_to_hex(host).zfill(6)
oid_new += convert_to_hex(process_id).zfill(4)
oid_new += convert_to_hex(increment).zfill(6)

print('Reconstructed ObjectId: {}'.format(oid_new))
print('Comparison: {}'.format(ObjectId(oid_new) == oid))
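The padding matters more than it may seem: hex drops leading zeros, so any component whose top hex digit is 0 comes back shorter than its fixed width, and the concatenated string is then no longer 24 characters long. The sketch below, using made-up component values, contrasts the unpadded result with fixed-width formatting via format specifiers, which is one possible alternative to hex plus zfill.

```python
# Hypothetical components; host, process_id and increment start with zero hex digits
generation_time, host, process_id, increment = 0x5b4f2a10, 0x00ab12, 0x0c3d, 0x000001

def strip_hex(component):
    # Same idea as in the article: hex() output without the '0x' prefix
    return hex(component).replace('0x', '')

# Without padding, leading zeros are silently lost
unpadded = ''.join(
    strip_hex(c) for c in (generation_time, host, process_id, increment)
)

# Fixed widths: 8 + 6 + 4 + 6 = 24 hex characters
padded = '{:08x}{:06x}{:04x}{:06x}'.format(
    generation_time, host, process_id, increment
)

print(len(unpadded), len(padded))
```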
At this point, we observe that the decomposition and reconstruction could be incorporated into methods of the ObjectId class, so we create our own class by extending the base class and adding the decompose and from_decomposed_form methods:
class ExtendedObjectId(ObjectId):

    def decompose(self):
        # Slice the 24-character hex string into its fixed-width fields
        oid_as_string = str(self)
        generation_time = int(oid_as_string[0:8], 16)
        host = int(oid_as_string[8:14], 16)
        process_id = int(oid_as_string[14:18], 16)
        increment = int(oid_as_string[18:], 16)
        return {
            "timestamp": generation_time,
            "host": host,
            "process_id": process_id,
            "increment": increment
        }

    @classmethod
    def from_decomposed_form(cls, properties):
        # Left-pad every field to its fixed width so leading zeros survive
        oid = cls._convert_to_hex(properties['timestamp']).zfill(8)
        oid += cls._convert_to_hex(properties['host']).zfill(6)
        oid += cls._convert_to_hex(properties['process_id']).zfill(4)
        oid += cls._convert_to_hex(properties['increment']).zfill(6)
        return cls(oid)

    @staticmethod
    def _convert_to_hex(component):
        return hex(component).replace('0x', '')
The decompose method returns a dictionary with the elements of the ObjectId instance. The from_decomposed_form class method takes the dictionary returned by decompose and returns a new ExtendedObjectId instance, so it serves as an alternative constructor.
print('Using the extended class:')
extended_oid = ExtendedObjectId()
print('Hex Representation: {}'.format(str(extended_oid)))
print('Decomposed Form: {}'.format(extended_oid.decompose()))

extended_oid_reconstructed = ExtendedObjectId.from_decomposed_form(
    extended_oid.decompose()
)
print('Verifying Reconstruction: {}'.format(
    extended_oid_reconstructed == extended_oid
))
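A quick way to gain confidence in this kind of round trip, without a running MongoDB or even pymongo, is to apply the same slicing and padding logic to random 12-byte values. The two functions below are standalone stand-ins for decompose and from_decomposed_form; they work on hex strings rather than ObjectId instances, and their names are an assumption of this sketch.

```python
import os

def decompose(hex_id):
    """Slice a 24-character hex string into its four integer components."""
    return {
        "timestamp": int(hex_id[0:8], 16),
        "host": int(hex_id[8:14], 16),
        "process_id": int(hex_id[14:18], 16),
        "increment": int(hex_id[18:24], 16),
    }

def recompose(parts):
    """Rebuild the hex string, zero-padding each field to its fixed width."""
    return "{timestamp:08x}{host:06x}{process_id:04x}{increment:06x}".format(**parts)

# Include an all-zero value, the worst case for lost leading zeros
samples = ["00" * 12] + [os.urandom(12).hex() for _ in range(1000)]
mismatches = [s for s in samples if recompose(decompose(s)) != s]
print("mismatches:", len(mismatches))
```

Random inputs exercise leading-zero cases that a single freshly generated ObjectId is unlikely to hit.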
Conclusion
After analyzing the structure of MongoDB ObjectId values, we managed to decompose and reconstruct them. The process can help us understand the logic behind the autogenerated document identifiers across MongoDB instances, and may prove helpful in situations where we need to store individual identifier properties in other data stores.
There is a bug in this script… The date.fromtimestamp() call is throwing away the time information. Instead you should use something like time.localtime() to extract the date and the time from the timestamp.
Thanks for the remark. You are correct, but this line is for demonstrating purposes only (just checking that our timestamp corresponds to an actual date), it is not actually used in the composition/decomposition methods.
I will replace it with datetime as soon as possible.