The primary goal of a startup is to grow as fast as possible. In that context, the hardest part is, as always, attracting and retaining customers. When a startup is growing, providing a stable and reliable service becomes absolutely crucial, and not every system can grow easily. Today I want to share with you the practices that can multiply the current capacity of your platform by 1,000 or more.
The single point of failure
“A chain is only as strong as its weakest link.” As with a chain, when we talk about a website there are several things that can fail once traffic increases. It could be:
– In/Out network traffic
– CPU load
– RAM load
– Database access (local or remote)
– Hard drive I/O (usually for the database)
– External services
Let’s analyze each one and see what improvement can be made in each area.
Network load optimization
The network limit can be reached easily if your service delivers high-bandwidth content like software, video, or audio. There are technical ways to improve this kind of performance issue.
- One way is to use a static storage service, which lets you drop your big content onto huge clusters and deliver it to your customers from there. Integrating this kind of service into your own application has become pretty simple, since an API SDK is almost always provided. The principle is straightforward: instead of sending the content (video, audio, or whatever else) to your server, your server creates an empty file on the static content system and returns a POST URL that allows the final client (browser, mobile app, etc.) to upload the content directly, without passing through your own server.
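To make the presigned-URL idea concrete, here is a minimal sketch of how such an upload URL can be generated. Everything here is hypothetical for illustration: the `storage.example.com` host, the `SECRET_KEY`, and the signing scheme are simplified stand-ins for what a real service (for instance Amazon S3's presigned URLs) does for you through its SDK.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"storage-service-secret"  # hypothetical shared secret with the storage service

def presigned_upload_url(bucket: str, filename: str, expires_in: int = 3600) -> str:
    """Build a time-limited URL the browser can upload to directly,
    bypassing our own web server entirely."""
    expires = int(time.time()) + expires_in
    # Sign the method, path, and expiry so the storage service can verify the request.
    payload = f"PUT\n{bucket}/{filename}\n{expires}".encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"https://storage.example.com/{bucket}/{filename}?{query}"

url = presigned_upload_url("uploads", "video.mp4")
```

Your server only computes and hands out the URL; the heavy bytes flow between the customer and the storage cluster.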
CPU load optimization
CPU load is the easiest performance issue to deal with, since server capacity can be increased easily. Nevertheless, when we talk about a website or an API server, there shouldn't be much CPU load at all. Web and API servers are built to get and set values in the database, change a few of them, and send the result back to the customer. Big PDF generation, video transcoding, and file conversion shouldn't be performed by a web server but by a dedicated app server. This is a quite classic issue that I see in growing start-ups.
In the beginning, it's quite normal to build an app as fast as possible, so we just put all the code into the web server. But at some point the traffic increases. If, for instance, a file conversion takes 10 seconds at 100% CPU load, then whenever we get 10 conversions in parallel (which is not that many, by the way), the server will take 100 seconds to deliver the content, and that is usually more than the classic 60-second timeout. The problems start from there. There are two ways to avoid them.
- Delegating CPU-hungry processing to an extra server by creating a queue. Each time you get a new file to convert, it goes into a pool of pending jobs handled by a separate server. When the pool gets new entries, a dedicated app processes the content and notifies the customer as soon as the file is ready. This is exactly what happens when you upload a video to YouTube or other video services.
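The enqueue-and-process flow above can be sketched with Python's standard library; in production you would use a real broker (Redis, RabbitMQ, etc.) and a separate worker machine, but the shape is the same. The file names and the sleep standing in for the conversion are placeholders.

```python
import queue
import threading
import time

jobs: "queue.Queue[str]" = queue.Queue()   # stands in for the shared job pool
results: dict = {}                         # where finished conversions land

def worker() -> None:
    """Dedicated conversion worker: pull a job, convert it, record the result."""
    while True:
        filename = jobs.get()
        time.sleep(0.01)  # placeholder for the 10-second CPU-bound conversion
        results[filename] = f"{filename}.converted"  # in production: notify the customer
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

for name in ("a.mov", "b.mov"):
    jobs.put(name)   # the web server only enqueues and returns immediately
jobs.join()          # here we wait for the demo; real customers are notified asynchronously
```

The key property: the web server's response time no longer depends on how long a conversion takes.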
- Sharing CPU load across several servers. This is where it becomes interesting, technically speaking. Most infrastructure and cloud providers offer a load-balancer feature, which spreads connections across several servers. There are actually several ways to do it, either through DNS (also called round-robin DNS) or at the network level with a load balancer.
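Round-robin distribution is simple enough to show in a few lines; the server addresses below are hypothetical, and a real balancer would also handle health checks and failover.

```python
import itertools

# Hypothetical pool of identical application servers behind the balancer.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
next_server = itertools.cycle(servers).__next__

# Each incoming connection is handed to the next server in the ring.
assigned = [next_server() for _ in range(6)]
```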
Note: before you start sharing connections between several servers, make sure you have a shared session system; otherwise customers will be constantly disconnected whenever a connection switches from one server to another. The best way is to keep sessions in a dedicated database. Any database will work, but it can be easier to use a NoSQL database or an in-memory store like Redis.
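A shared session store boils down to set-with-expiry and get, which is exactly what Redis offers with `SETEX`/`GET`. This is a minimal in-memory sketch of that contract, so the example is self-contained; in production every web server would talk to the same Redis instance instead.

```python
import time

class SharedSessionStore:
    """In-memory stand-in for a Redis-backed session store (SETEX/GET semantics)."""

    def __init__(self) -> None:
        self._data: dict = {}  # session_id -> (payload, expiry timestamp)

    def set(self, session_id: str, payload: str, ttl: int = 1800) -> None:
        self._data[session_id] = (payload, time.time() + ttl)

    def get(self, session_id: str):
        entry = self._data.get(session_id)
        if entry is None or entry[1] < time.time():
            self._data.pop(session_id, None)  # expired: behave like a missing key
            return None
        return entry[0]

store = SharedSessionStore()
store.set("sess-42", "user=alice")
```

Because every web server reads and writes the same store, a customer can bounce between servers without being logged out.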
RAM load optimization
RAM consumption is much like CPU consumption; in fact, a process mainly consumes CPU and RAM. If your server doesn't have enough RAM then, exactly as for the CPU issue, you should share the workload between several servers, and perhaps consider delegating the big RAM-consuming processes to a dedicated app server.
Database Access and Input/Output Data Access
This part, I think, is the trickiest one. The database is usually the heart of any information system, the single place where we store data. By that, I mean that a database system is a lot more complicated to share across several servers, since the data must stay consistent and it's difficult to keep it updated in real time on several servers at once. So the first thing to do when your app needs to scale is to identify which requests are used the most, and for which feature or function.

As we discussed previously, sessions shouldn't live on the same database system as the main application: sessions generate requests on almost every user action and are a huge database consumer. Likewise, if your application manages a huge quantity of simple records, like http://goo.gl links, you shouldn't store them in the main database; instead, use a dedicated database, typically a NoSQL one. Another option is a cache database like Redis: if your application needs to store a lot of temporary information, a Redis database is a great way to decrease your main database's workload.
Now, once you've done everything possible to reduce your database workload, it's time to think about a database cluster. There are many ways to share a database across several servers by splitting data or even tables between them. Whether you use MySQL, PostgreSQL, SQL Server or, even better, Oracle, you will find many options, and setting them up will usually require a database expert.
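The core idea behind splitting data across servers is a routing function that maps each key to a shard. Here is a minimal hash-based sketch; the shard names are hypothetical, and real setups also need resharding and replication, which is where the database expert comes in.

```python
import hashlib

# Hypothetical set of database shards; each one holds a slice of the rows.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Route a key to a shard by hashing it, so rows spread evenly across servers."""
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]
```

Every query for a given key always lands on the same shard, so each server only carries a fraction of the total load.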
The other, and last, way to increase your database capacity is obviously to increase the server capacity itself: bigger CPUs, more RAM, and faster input/output access to the hard drive.
External services
Today, when building a new website or application, I recommend using as many external services as possible. Whether it's for audio/video conversion, sending emails, etc., many sub-features require an entire area of expertise, and using someone else's expertise is easier than rebuilding a system from scratch, especially when the only thing that matters is serving your customers, not wasting time and resources building what someone else has already built. “Don't reinvent the wheel.”
Nevertheless, pay attention to the subcontractor's capacity. If a sub-service is unavailable, your entire service is usually impacted, so be sure to use reliable and well-known external services. This is one of the reasons why, for example, it's a good idea to use services like YouTube or Vimeo Pro.
This last part, protecting your service against overload and denial-of-service attacks, is an entire topic in itself, and experts have written whole books about it. You just need to know that it is unfortunately quite easy to overload a web server until it crashes. Be aware of it. In addition to increasing your service capacity, deploy a web application firewall and QoS rules to protect your service.
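One building block behind QoS rules of this kind is rate limiting, often implemented as a token bucket: clients may burst up to a fixed capacity, after which requests are rejected until tokens refill. A sketch, with the rate and capacity chosen arbitrarily for the demo:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: reject instead of letting the server drown

bucket = TokenBucket(rate=2.0, capacity=5.0)
decisions = [bucket.allow() for _ in range(8)]  # a burst of 8 requests at once
```

Rejected requests cost almost nothing, which keeps an overload (accidental or malicious) from taking the whole server down.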
What do you think about it?
First of all, thank you very much for taking the time to read this article. It would be amazing if you could let me know if this has been helpful to you and your startup.
What does it make you think of?
Are you currently facing some of these common issues?
Whatever the case might be, I'd be more than happy to answer any of your questions or comments, so don't hesitate to reach out, and I'll get back to you as soon as possible.