MongoDB User Group First Meetup: Learning MongoDB with Seniors

The first meetup organized by the local MongoDB user group was held at Hackerspace Singapore last Tuesday. During the meetup, Matias Cascallares, a consulting engineer from MongoDB, shared with us the importance of schema design in a “schemaless” MongoDB and the use cases of different schema design approaches, and ended his talk with the new features that will be released in MongoDB 2.6.

Documents and Collections

MongoDB is an open-source, document-oriented database, an example of a NoSQL database, which is non-relational and horizontally scalable. Many of the databases currently in use are based on the relational model, where we have tables and each table has columns and keys defined in the database schema.

In my job, I deal with relational databases. Sometimes there is a need to store two types of records in the same table. Even though they share some fields with one another, each of them also has its own unique elements. So, I simply combine all the fields and define them as the columns of the table, leaving unused fields empty or null.

Not all records have discount info in the table

MongoDB stores data in documents. Documents are stored on disk in BSON, a binary representation of JSON objects. A grouping of documents is called a collection, which is the equivalent of a table in the relational model, so documents in MongoDB are the equivalent of rows in a table. Because MongoDB is “schemaless”, collections do not enforce document structure, and documents within a collection can have different fields. Thus, the two records shown in the image above can be written as two documents as follows.

    {
        ID: 1,
        ItemCode: "T934",
        CreateUser: "User01",
        CreateDate: "2014-01-01"
    }
    {
        ID: 2,
        ItemCode: "T987",
        DiscountID: 4,
        DiscountAmount: 0.32,
        CreateUser: "User02",
        CreateDate: "2014-02-01"
    }
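To convince myself what “different fields in the same collection” means for application code, here is a minimal sketch in plain Python. It does not talk to MongoDB at all: the list `items` is only a stand-in for a collection, and the helpers `fields_used` and `discount_of` are names I made up for illustration.

```python
# A plain-Python stand-in for a collection: a list of dicts (no MongoDB involved).
items = [
    {"ID": 1, "ItemCode": "T934",
     "CreateUser": "User01", "CreateDate": "2014-01-01"},
    {"ID": 2, "ItemCode": "T987", "DiscountID": 4, "DiscountAmount": 0.32,
     "CreateUser": "User02", "CreateDate": "2014-02-01"},
]


def fields_used(collection):
    """Map each document's ID to the set of field names it actually has."""
    return {doc["ID"]: set(doc) for doc in collection}


def discount_of(doc):
    """Return the discount amount, or None when the document has no discount."""
    # The field is simply absent, so the code has to handle the missing case
    # itself instead of relying on a NULL column.
    return doc.get("DiscountAmount")
```

The point of the sketch is that the “empty or null” columns disappear from storage, but the application still has to decide what a missing field means.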
Schema design in “schemaless” MongoDB?

Different Approaches to Schema Design in MongoDB

I use relational databases in my work. I have noticed that as the tables grow bigger and bigger, it gets harder and harder to adjust the schema. I have always wished the columns could be dynamic. Hence, the concept of a schemaless database in MongoDB interests me a lot.

In a relational database, we can use joins to answer a query. In MongoDB, however, joins are not supported. The data is either denormalized or stored together with related data in the same document. In addition, we can store the _id field of one document in another document as a reference; the application then just needs to run a second query to fetch the referenced data. Hence, designing a schema in MongoDB feels like reverse engineering, because we now have to start by asking “What questions will I have?”.
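The difference between referencing and embedding can be sketched in plain Python. This is only an illustration, not MongoDB code: the lists stand in for collections, `find_one` is a hypothetical helper that mimics a lookup by `_id`, and the `Amount` and `Discount` field names are my own assumptions.

```python
# Referencing: the item stores only the _id of its discount.
discounts = [{"_id": 4, "Amount": 0.32}]
items = [
    {"_id": 1, "ItemCode": "T934"},
    {"_id": 2, "ItemCode": "T987", "DiscountID": 4},
]

# Embedding: the related data lives inside the item document itself.
items_embedded = [
    {"_id": 1, "ItemCode": "T934"},
    {"_id": 2, "ItemCode": "T987", "Discount": {"Amount": 0.32}},
]


def find_one(collection, doc_id):
    """Hypothetical stand-in for a lookup by _id; returns None if not found."""
    return next((doc for doc in collection if doc["_id"] == doc_id), None)


def discount_for_item(item_id):
    """Resolve a reference by hand: two lookups instead of one join."""
    item = find_one(items, item_id)                  # first query
    if item is None or "DiscountID" not in item:
        return None
    return find_one(discounts, item["DiscountID"])   # second query
```

With the embedded layout, a single lookup of the item already returns the discount, which is why the question “what will I ask?” ends up driving the schema.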

Also, due to the fact that documents can have different fields, we cannot describe a collection the way we describe a table in a relational database. Instead, we have to look into the code to find out the schema information of the collections. Even when we query with a key that does not exist in any document in the collection, no error is raised; the query just won’t return anything.
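Here is a small plain-Python sketch of that behaviour, again with a list of dicts standing in for a collection. The `find` helper only does simple equality matching and is my own simplification, not how MongoDB evaluates queries. `observed_fields` shows one way to get a rough “describe”: scan the documents and collect the field names that actually occur.

```python
_MISSING = object()  # sentinel so that an absent field never equals any value

items = [
    {"ID": 1, "ItemCode": "T934", "CreateUser": "User01"},
    {"ID": 2, "ItemCode": "T987", "DiscountID": 4, "CreateUser": "User02"},
]


def find(collection, query):
    """Return the documents whose fields equal every key/value pair in query."""
    return [
        doc for doc in collection
        if all(doc.get(key, _MISSING) == value for key, value in query.items())
    ]


def observed_fields(collection):
    """Return the sorted union of field names seen across all documents."""
    names = set()
    for doc in collection:
        names.update(doc)
    return sorted(names)
```

Querying on a misspelled or unknown key, such as `find(items, {"NoSuchField": 1})`, gives back an empty list rather than an error, so typos in field names fail silently.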

Shards and Shard Keys

Normally, we use the strongest servers we have to host our databases. We do vertical scaling by adding more CPU, RAM and storage resources to increase capacity. This gets expensive quickly. This is where horizontal scaling comes into play. Horizontal scaling, aka sharding, partitions a database horizontally. Instead of just increasing the capacity of one server, we add more servers. The data is then distributed across multiple database servers, aka shards.

MongoDB is built with horizontal scaling in mind. Sharding is implemented in MongoDB with the help of a sharded cluster, which consists of three components: the shards, the config server, and the router. The router is in charge of routing the reads and writes from applications to the shards by referring to the metadata stored in the config server.
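The division of labour between the three components can be sketched in plain Python. This is a toy model under my own assumptions, not MongoDB's implementation: `CHUNK_UPPER_BOUNDS` and `CHUNK_OWNERS` play the role of the config server's metadata, `route` plays the router, and the `user` field is a made-up shard key.

```python
import bisect

# "Config server" metadata (hypothetical): key ranges and the shard owning each.
# keys < "h" -> shard0, "h" <= keys < "p" -> shard1, everything else -> shard2
CHUNK_UPPER_BOUNDS = ["h", "p"]
CHUNK_OWNERS = ["shard0", "shard1", "shard2"]

# The shards themselves: each holds only its own slice of the data.
shards = {name: [] for name in CHUNK_OWNERS}


def route(shard_key_value):
    """Router: pick the shard whose key range contains the given value."""
    index = bisect.bisect_right(CHUNK_UPPER_BOUNDS, shard_key_value)
    return CHUNK_OWNERS[index]


def insert(doc, shard_key="user"):
    """Write a document to the one shard responsible for its shard key."""
    shards[route(doc[shard_key])].append(doc)


def find_by_key(value, shard_key="user"):
    """Read by shard key: only a single shard has to be consulted."""
    return [doc for doc in shards[route(value)] if doc[shard_key] == value]
```

A query that does not include the shard key could not use `route` and would have to ask every shard, which is one reason the choice of shard key matters so much.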

To shard a collection, a shard key is needed to divide the collection’s data across the cluster’s shards. During the talk, Matias analyzed three different schema design approaches for a social network application which has to support a large number of read and write operations. By just changing the schema and the shard key, we get three different solutions. One is good at write performance. One is good at read performance but incurs lots of random IO. The final one is good at read performance with no random IO, but requires more work on writes. For more details about these three schema design approaches, please read the blog post “Schema Design for Social Inboxes in MongoDB”.
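To get a feel for the write-versus-read trade-off, here is a rough plain-Python sketch of two ways to store messages for an inbox. It is my own simplification of the general idea and not the designs presented in the talk or in the linked post; all the names are made up, and the dicts merely stand in for collections.

```python
from collections import defaultdict

# Layout A: store each message once, under its sender (cheap write).
sent = defaultdict(list)      # sender -> list of messages


def send_store_once(sender, recipients, text):
    """One write per message, no matter how many recipients it has."""
    sent[sender].append({"from": sender, "to": list(recipients), "text": text})


def inbox_scan(user):
    """Reading an inbox has to look through every sender's messages."""
    return [m["text"] for msgs in sent.values() for m in msgs if user in m["to"]]


# Layout B: store one copy of the message per recipient (cheap read).
inboxes = defaultdict(list)   # recipient -> list of messages


def send_copy_per_recipient(sender, recipients, text):
    """One write per recipient, so sending costs more."""
    for recipient in recipients:
        inboxes[recipient].append({"from": sender, "text": text})


def inbox_direct(user):
    """Reading an inbox touches only that user's own list."""
    return [m["text"] for m in inboxes[user]]
```

If the sender were the shard key in layout A and the recipient in layout B, the scan in A would hit every shard while the read in B would hit exactly one, which is roughly the kind of balance the talk was about.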

Data Expiration

Data expiration was mentioned in the second half of the talk on schema design.

Data expiration is useful and important because some data stored in a database will no longer be used after a while. However, data won’t expire by itself.

In MS SQL Server, we create scheduled jobs with stored procedures to erase data which has been there for more than 2 months, for example.

In Matias’ talk, I came to know about a feature that was introduced back in MongoDB 2.2, the Time To Live (TTL) collection. By just specifying a value in the expireAfterSeconds index option, the documents in a TTL collection will be automatically removed after a specific period of time.
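What the TTL feature saves us from writing can be sketched in plain Python. The function below is only a simulation of the expiry rule as I understand it, not MongoDB's code; `purge_expired` and the `createdAt` field name are my own choices, and the 60-day value is just the “2 months” from my SQL Server example.

```python
from datetime import datetime, timedelta

EXPIRE_AFTER_SECONDS = 60 * 24 * 60 * 60  # 60 days, roughly "2 months"

# With the pymongo driver, the real thing would be a single index definition,
# e.g. collection.create_index("createdAt", expireAfterSeconds=EXPIRE_AFTER_SECONDS)
# (not executed here, since it needs a running MongoDB server).


def purge_expired(collection, now, date_field="createdAt",
                  expire_after_seconds=EXPIRE_AFTER_SECONDS):
    """Simulate one TTL pass: keep only the documents that have not expired."""
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return [
        doc for doc in collection
        # Documents without the date field are kept: they never expire.
        if date_field not in doc or doc[date_field] > cutoff
    ]
```

For example, running it with `now` set to 20 March 2014 would drop a document created on 1 January 2014 and keep one created on 1 March 2014.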

That’s All

Important slide, the summary. Photo Credit: Singapore MongoDB User Group

The talk was very interesting. I agree that schema design in MongoDB is not trivial. As mentioned during the talk, it takes a lot of practice to get really, really good at schema design in MongoDB and to know how to balance read performance against write performance. Thus, the talk is actually just a starting point for a beginner like me. It’s now up to me to find out more about MongoDB and other database technologies myself. Hopefully one day I’ll be so imba that I have the opportunity to give a talk about MongoDB as well.

Those who were there will have realized that I do not cover everything from the talk in this post. Firstly, I don’t want my post to be too lengthy. Most of the readers will just tl;dr. Secondly, I just started to learn MongoDB last month, so I try not to “act smart” here. Thirdly, it’s to encourage people to join our Singapore MongoDB User Group to find out more (such as the upcoming talks and other activities) instead of just reading my post. =P

I’d also like to take this opportunity to thank my friend, Laurence, for inviting me to attend this amazing talk. I’m already looking forward to the next meetup in April.

