MongoDB TTL to shrink collection size and save costs

MongoDB TTL to shrink collection size and save costs
T. Schneider / Shutterstock.com

For my own project UNIVERSECLOUD (you will hear more on that in a few weeks in another blog post) i use a MongoDB collection for logs. (about 1mil entries, ~40MB Storage)

40MB Storage is not that much, but right now the table only contains logs of one power user* right now. So in the future when there are more users on UNIVERSECLOUD (hopefully :) ), the table will rise in storage significantly.

*power user = multiple projects with many serverless functions producing many logs

Costs

1 User = 40MB
1000 User = 40000MB

I use a managed MongoDB database on Digitalocean with 34GB storage for about 90€/month (primary + two standbys). In this case it is enough space for 850 user. (only logs)

costs per user = ~0.1059€

This is inacceptable, so i decided that i need to shrink the collection size. But which entries can be safely deleted?

If you think about logs, they give you information about activities. But you dont need a unlimited retention time for that, so i decided to adjust the retention time to 30 Days.

TTL

To achieve a retention time on a collection without needing a cronjob querying the collection and find old entries to delete them, MongoDB offers a nice index type called TTL. (time to live)

An example of a log entry looks like this:

{
  "_id": {
    "$oid": "63f102f07953e50a69d8ab82"
  },
  "userId": "foobar",
  "resource": "function:63e157427bcde81be189b7f9",
  "value": "log entry from function 63e157427bcde81be189b7f9",
  "createdAt": "2023-05-30T16:10:33Z"
}

What is a TTL index?

TTL indexes are special single-field indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time or at a specific clock time. Data expiration is useful for certain types of information like machine generated event data, logs, and session information that only need to persist in a database for a finite amount of time.
https://www.mongodb.com/docs/manual/core/index-ttl/

The caveat in my special case is that the createdAt field holds the date value in ISO string format, but a TTL Index needs a field holding a bson Date object.

What to do?

Transforming all entries

We need to transform all our entries inside the collection from the old ISO string field to a bson Date field.

There is a nice script for mongo shell we can use for that:

var dbName = 'universecloud';
// logs is our collection where we want to transform all entries
var collectionName = 'logs';
var connectionString = `mongodb+srv://foobar:supersecret@dbHost/${dbName}`;
var db = connect(connectionString);

// Find all documents with a createdAt field containing an ISO date string
var documents = db[collectionName].find({ createdAt: { $type: 'string' } });

// Iterate over the documents and update the createdAt field
documents.forEach(function (doc) {
  var isoDateString = doc.createdAt;
  var bsonDate = new Date(isoDateString);

  db[collectionName].updateOne(
    { _id: doc._id },
    { $set: { createdAt: bsonDate } }
  );
});

print('Transformation done.');

This script may take some time, depending on how much data you update.

Create TTL index

Now that the data is transformed, we can create the TTL index on the collection.

Since we have a retention time of 30 days, we use 2592000 seconds as expireAfterSeconds argument.

You can either create it via mongosh

use universecloud;
db.logs.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 2592000 } )

or use a MongoDB GUI (for example MongoDB Compass on mac):

Bildschirm-foto-2023-07-19-um-17.59.09

After you created the index, MongoDB needs some time to update the index data and to remove entries older than 30 days.

Conclusion

After data transformation and index creation, the result looks like this:

Bildschirm-foto-2023-07-19-um-21.23.05

As you can see logs now use 2.64MB (58k items with an average size of 207b)
Now we have enough log space for ~12879 user.

cost per user: 0.0070€

This looks insane and worth it for a little tweak on our collection.
We just saved about 93.39% cost per user.

If you have any questions or tips, dont hesitate to contact me via e-mail (mail@yannfaulhaber.dev) or https://github.com/yhc44

cheers.

btw: you can find any code that i use in blog posts in my public github repository that i created for this blog.

https://github.com/yhc44/blog