[KOSD Series] Discussion about Cosmos DB Performance

KOSD, or Kopi-O Siew Dai, is a type of Singapore coffee that I enjoy. It is basically a cup of coffee with a little bit of sugar. This series is meant to blog about technical knowledge that I gained while having a small cup of Kopi-O Siew Dai.

kosd-cosmos-db.png

During a late dinner with my friend on 12 January last month, he commented that he encountered a very serious performance problem in retrieving data from Cosmos DB (pka DocumentDB). It’s quite strange because, in our IoT project which also stores millions of data in Cosmos DB, we never had this problem.

Two weeks later, on 27 January, he happily showed me his improved version of the code which could query the data in about one to two seconds.

Yesterday, after having a discussion, we further improved the code. Hence, I’d like to write down this learning experience here.

Preparation

Due to the fact that we couldn’t demonstrate using the real project code, I thus created a sample project getting data from database and collection on my personal Azure Cosmos DB account. The database contains one collection which has 23,967 records of Student data.

The Student class and the BaseEntity class that it inherits from are as follows.

public class Student : BaseEntity
{
    public string Name { get; set; }

    public int Age { get; set; }

    public string Description { get; set; }
}
public abstract class BaseEntity
{
    [JsonProperty(PropertyName = "id")]
    public string Id { get; set; }

    public string Type { get; set; }

    public DateTime CreatedAt { get; set; } = DateTime.Now;
}

You may wonder why I have Type defined.

Type and Cost Saving

The reason of having Type is that, before DocumentDB was rebranded as Cosmos DB in May 2017, the DocumentDB pricing is based on collections. Hence, the more collection we have in the database, the more we need to pay.

confused-about-documentdb-pricing.png

DocumentDB was billed per collection in the past. (Source: Stack Overflow)

To overcome that, we squeeze the different types of entities in the same collection. So, in the example above, let’s say we have three classes — Students, Classroom, Teacher that inherit from BaseEntity, then we will put the data of the three classes in the same collection.

Then here comes a problem: How do we know which document in the collection is Student, Classroom or Teacher? There is where the property Type will help us. So in our example above, the possible value for Type will be Student, Classroom, and Teacher.

Hence, when we add a new document through repository design pattern, we have the following method.

public async Task<T> AddAsync(T entity)
{
    ...

    entity.Type = typeof(T).Name;

    var resourceResponse = await _documentDbClient.CreateDocumentAsync(UriFactory.CreateDocumentCollectionUri(_databaseId, _collectionId), entity);

    return resourceResponse.StatusCode == HttpStatusCode.Created ? (dynamic)resourceResponse.Resource : null;
}

Original Version of Query

We used the following code to retrieve data of a class from the collection.

public async Task<IEnumerable<T>> GetAllAsync(Expression<Func<T, bool>> predicate = null)
{
    var query = _documentDbClient.CreateDocumentQuery<T>(UriFactory.CreateDocumentCollectionUri(_databaseId, _collectionId));

    var documentQuery = (predicate != null) ?
        (query.Where(predicate)).AsDocumentQuery():
        query.AsDocumentQuery();

    var results = new List<T>();
    while (documentQuery.HasMoreResults)
    {
        results.AddRange(await documentQuery.ExecuteNextAsync<T>());
    }

    return results.Where(x => x.Type == typeof(T).Name).ToList();
}

This query will run very slow because the line where it filters the class is after querying data from the collection. Hence, in the documentQuery, it may already contain data of three classes (Student, Classroom, and Teacher).

Improved Version of Query

So one obvious way is to move the line of filtering by Type above. The improved version of code now looks as such.

public async Task<IEnumerable<T>> GetAllAsync(Expression<Func<T, bool>> predicate = null)
{
    var query = _documentDbClient
        .CreateDocumentQuery<T>(UriFactory.CreateDocumentCollectionUri(_databaseId, _collectionId))
        .Where(x => x.Type == typeof(T).Name);

    var documentQuery = (predicate != null) ?
        (query.Where(predicate)).AsDocumentQuery():
        query.AsDocumentQuery();

    var results = new List<T>();
    while (documentQuery.HasMoreResults)
    {
        results.AddRange(await documentQuery.ExecuteNextAsync<T>());
    }

    return results;
}

By doing so, we managed to reduce the query time significantly because all the actual filtering will be done at Cosmos DB side. For example, there was one query I managed to reduce the query time of it from 1.38 minutes to 3.42 seconds using the 23,967 records of Student data.

Multiple Predicates

The code above however has a disadvantage. It cannot accept multiple predicates.

I thus changed it to be as follows so that it returns IQueryable.

public IQueryable<T> GetAll()
{
    return _documentDbClient
        .CreateDocumentQuery<T>(UriFactory.CreateDocumentCollectionUri(_databaseId, _collectionId))
        .Where(x => x.Type == typeof(T).Name);
}

This has another inconvenience is there whenever I call GetAll, I need to remember to load the data with HasMoreResults as shown in the code below.

var studentDocuments = _repoDocumentDb.GetAll()
    .Where(s => s.Age == 8)
    .Where(s => s.Name.Contains("Ahmad"))
    .AsDocumentQuery();

var results = new List<T>();
while (studentDocuments.HasMoreResults)
{
    results.AddRange(await studentDocuments.ExecuteNextAsync<T>());
}

Conclusion

This is just an after-dinner discussion about Cosmos DB between my friend and me. If you have any better idea of designing repository for Cosmos DB (pka DocumentDB), please let us know. =)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s