Today was a frustrating day for many Malaysians: two hours of endless mouse clicking, and still no spot in phase 2 of the AstraZeneca Vaccination Program. I personally sat in front of my laptop with multiple browsers, devices and Internet connections from 12pm till 2pm, kiasu-ness to the max, and still could not secure a spot. I had even prepared myself with the Chrome DevTools script written by our Googler, Jecelyn Yeen. Sorry to say, Jecelyn could not secure a spot either, but I think she was okay, seeing that she was happily eating a Hokkaido crab stick. Credit to her FB post.
After my dinner, I still felt dulan for wasting 3 hours of productive time (2 hours in front of the computer, and 1 hour reading all the funny comments/videos/articles by other frustrated people like me). So I decided to write this post to discuss what we, as tech developers, can do better when designing a high-traffic system like the vaccination booking site (https://www.vaksincovid.gov.my/). Imagine the system wasted 3 hours each for 1 million Malaysians today: that is 3 million productive hours, which could cost millions in lost GDP. At Agmo, we have implemented many high-traffic systems, such as Doc2Us (the only health advisory partner integrated into MySejahtera), the Malaysia 4D mobile app (a lottery result viewer with millions of downloads) and the Pos Malaysia PosRider platform using the Microsoft Azure Cloud Platform.
In general, three great ideas are being shared on the Internet:
1. Split the reservation by state, like KL on 26 May and Penang on 27 May. This naturally reduces the concurrent traffic; a simple yet workable idea.
2. Use the waiting room concept, offered for free by Cloudflare. The website is already using Cloudflare, so this seems to be a very intuitive solution. For more details, refer to Lowyat.net.
3. Improve the database design to handle highly concurrent transactional traffic. There is a great article from Timothy about this approach.
Timothy is absolutely right that many people wonder why the developers don't just scale out or scale up the servers (application layer), when in fact the bottleneck is at the database (data layer). The image below was extracted from Timothy’s post.
In general, the major complaint on the Internet is that users cannot select a state to view the available dates. The underlying API is “https://api.vaksincovid.gov.my/az/?action=listppv”.
Image credit: Internet
The second problematic API is a POST to “https://api.vaksincovid.gov.my/az/”, called when you select a date slot and click the “Submit” button.
The error handling is really bad: it seems to show the same error message, “Tempahan untuk tarikh dan tempat yang dipilih sudah penuh.”, regardless of the actual error (unless you are a developer and check the console, you will not know the real error). For instance, the API server is shut down now, yet I still see the same error message as in the afternoon, when the real cause was a CORS error due to rate limiting or a 503 because the server was unable to handle the request.
The CORS error is a check enforced on the frontend by the browser, and hence can be bypassed at the browser level, as pointed out by our partner, Mr. Andhie. The 503 error, however, is a genuine server limitation; there is not much you can do apart from trying your luck by clicking again and again.
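To illustrate the error-handling complaint, here is a minimal sketch of how a client could map each failure to its own message instead of always saying the slot is full. It is written in Python only for illustration (the real frontend would be JavaScript), and the function name, the 409 status with a `SLOT_FULL` code, and the English messages are all assumptions of mine, not the actual behaviour of the vaksincovid API.

```python
from typing import Optional

# The one message the site showed for every failure.
SLOT_FULL_MESSAGE = "Tempahan untuk tarikh dan tempat yang dipilih sudah penuh."


def message_for_failure(status: Optional[int], error_code: Optional[str] = None) -> str:
    """Map a failed booking request to a message that reflects the real cause.

    `status` is the HTTP status, or None when no response reached the page
    (for example a network failure or a request blocked by CORS).
    `error_code` is a hypothetical field from the response body.
    """
    if status is None:
        return "Could not reach the booking server. Please check your connection and retry."
    if status == 429:
        return "Too many requests. Please wait a moment before trying again."
    if status == 503:
        return "The booking server is overloaded. Please retry in a while."
    if status == 409 and error_code == "SLOT_FULL":
        # Only this case really means the slot is fully booked.
        return SLOT_FULL_MESSAGE
    return f"Unexpected error (HTTP {status}). Please try again later."
```

With something like this, a user who hits a 503 knows it is worth retrying, while a user who sees the "penuh" message knows to pick another slot.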
The submit API is transactional, so a lock is required to ensure that each booking is an atomic operation. Because every booking has to take that lock, a single database quickly becomes the bottleneck, and that is where database partitioning and horizontal scaling become important.
To explain partitioning in layman's terms, as I shared on my Facebook earlier, I would have tackled this problem by creating multiple Google Forms: one Google Form for each time slot of each PPV (vaccination center).
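As a rough sketch of what "atomic" means here, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table layout, the slot names and the capacity of 2 are made up for illustration; the real system's schema is unknown to me. The idea is that the check ("is there a seat left?") and the decrement happen in one conditional UPDATE, so the database's own locking prevents two users from taking the last seat.

```python
import sqlite3


def create_db() -> sqlite3.Connection:
    """Create a hypothetical slots table with one slot of capacity 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE slots (ppv TEXT, slot_date TEXT, remaining INTEGER, "
        "PRIMARY KEY (ppv, slot_date))"
    )
    conn.execute("INSERT INTO slots VALUES ('KL PWTC', '10 June', 2)")
    conn.commit()
    return conn


def reserve(conn: sqlite3.Connection, ppv: str, slot_date: str) -> bool:
    """Try to take one seat; return True only if a seat was actually taken."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE slots SET remaining = remaining - 1 "
            "WHERE ppv = ? AND slot_date = ? AND remaining > 0",
            (ppv, slot_date),
        )
        # rowcount is 1 when a seat was decremented, 0 when the slot was full.
        return cur.rowcount == 1
```

Calling `reserve` three times on this slot succeeds twice and then fails, and the counter never goes below zero.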
To put it in a technical context, imagine that each “City” in the database diagram below is substituted by a unique date slot at a PPV, say:
London = KL PWTC on 10 June
NYC = KL PWTC on 11 June
Paris = KL UM on 10 June
Rome = KL UM on 11 June
Image credit: Microsoft.com (link below)
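The mapping above can be sketched in a few lines of Python. This is only an in-memory stand-in to show the idea of a partition key; it is not Cosmos DB code, and the class and function names are hypothetical. Each (PPV, date) pair gets its own counter and its own lock, so people booking KL PWTC on 10 June never wait behind people booking KL UM on 11 June.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Partition:
    """One date slot at one PPV, with its own lock and seat counter."""
    remaining: int
    lock: threading.Lock = field(default_factory=threading.Lock)


def partition_key(ppv: str, date: str) -> str:
    """Hypothetical partition key: one partition per PPV per date."""
    return f"{ppv}|{date}"


class SlotStore:
    def __init__(self, capacity_by_slot: Dict[Tuple[str, str], int]) -> None:
        self._partitions = {
            partition_key(ppv, date): Partition(capacity)
            for (ppv, date), capacity in capacity_by_slot.items()
        }

    def reserve(self, ppv: str, date: str) -> bool:
        """Take one seat, locking only the partition for this PPV and date."""
        part = self._partitions[partition_key(ppv, date)]
        with part.lock:
            if part.remaining > 0:
                part.remaining -= 1
                return True
            return False
```

In a real partitioned database each partition can also live on a different machine, which is what makes horizontal scaling possible; the sketch only shows that contention is limited to one slot at a time.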
Once the partitioning is designed and implemented correctly, we can enjoy the automatic and instant scalability with guaranteed speed offered by Cosmos DB, which is not possible with a monolithic database. Nevertheless, it is still necessary to stress test all the critical APIs and provision the right infrastructure.
Disclaimer: This post is solely the personal opinion of Tan Aik Keong (AK); it does not represent Agmo Studio in any way. You can contact me on LinkedIn with any questions.