Why do we optimize?
I'm used to optimizing speed and memory consumption to improve the user experience, to enable features that would otherwise be prohibitively expensive, to do more with less, to avoid forcing customers to upgrade their hardware unnecessarily, and so on. But this rarely, if ever, has a direct effect on what the product costs you as the developer.
Enter cloud-based products and all of this changes: suddenly, when you optimize your code, you often do it both to save yourself money and to improve the user's experience. Case in point: replacing a local CouchDB with the API-compatible Cloudant, a DBaaS. I do this all the time, keeping behavior compatible between the local CouchDB and Cloudant so that we can scale easily.
(As a side note, developing against cloud-based services without servers running nearby is a large drag on your development speed, so if you can avoid it by keeping local compatibility, that's great.)
Cloud-based vs. hosted server
Going cloud-based instead of using a hosted server (virtual or not) is, like most things in engineering, a trade-off. On the cloud-based end you can scale easily and "infinitely" without worrying (too much), simply paying in money for the privilege. On the other hand, a hosted server may be cheaper and will definitely be more forgiving regarding performance (no additional hops between your service and the database), but scaling will involve additional work.
For Software Marble's latest product, Memoirs of a Future Simulation, we have decided to go with a cloud-based approach. Building it, however, has shown us that we needed to optimize our storage code to decrease costs, not only with Cloudant but also with Heroku, which we use to host the actual service.
It turns out that the same optimization solves both: limit the number of "heavy HTTP" requests (such as COPY) to Cloudant and you will need fewer instances (within the limits of available memory), since there is less processing overhead in sending one HTTP request than thousands (the asynchronous nature of Node.js notwithstanding). Since Cloudant charges per HTTP request and not per the size of the data you upload or download, the solution is to convert as many heavy HTTP requests as possible into a single bulk request. Cloudant doesn't limit the size of an upload, so with sufficient memory (and relaxed requirements) you could really optimize this into the ground.
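To make the idea concrete, here is a minimal sketch of the many-requests-to-one conversion. The helper name `toBulkPayload` and the database name are illustrative, not part of frugal-couchdb; the `db.bulk` call is nano's wrapper around CouchDB's `_bulk_docs` endpoint:

```javascript
// Naive approach: one heavy HTTP request per document, billed N times.
//   docs.forEach(doc => db.insert(doc));
//
// Frugal approach: wrap the whole batch in one _bulk_docs payload and
// send it with a single request via nano's db.bulk().
function toBulkPayload(docs) {
  // CouchDB's POST /{db}/_bulk_docs endpoint expects { "docs": [...] }
  return { docs: docs };
}

// Usage (hypothetical connection details):
//   const nano = require('nano')(process.env.CLOUDANT_URL);
//   const db = nano.db.use('measurements');
//   db.bulk(toBulkPayload(docs), callback); // one request instead of N
```

The payload shape is the whole trick: Cloudant bills the single bulk POST the same as one document insert, regardless of how many documents it carries.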
Since our service is based on gathering historical data (which is never edited within our service), we have sufficiently relaxed requirements and can afford to wait (and risk losing data, which would then simply be recollected) between the moment of gathering and the moment of putting the data into the database. This allows unifying many gathering requests into one (or two, as we will see) database-uploading request, and this HTTP frugality has a direct effect on our bottom line.
This frugality is, yet again, a trade-off between:
- How quickly the data must reach the database: relaxed requirements or not, we still need it there within some reasonable time frame.
- The amount of memory available for caching documents while they wait for upload.
- The desired decrease factor in the number of HTTP requests.
The exact sweet spot will depend on your constraints. In our implementation we know that our documents are rarely much larger than a few kilobytes, so we have reduced memory tracking to simply counting documents and capped the batch at 10,000 docs. Regarding time, we have settled, at least for now, on 1 minute between forced uploads.
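The two thresholds above (a document cap and a forced-upload interval) can be sketched as a small buffer; the class and its names are illustrative, not the frugal-couchdb API:

```javascript
// A batching buffer: collects docs and hands them to a flush callback
// either when the size cap is hit or when the time limit expires.
class DocBuffer {
  constructor(flush, maxDocs = 10000, maxWaitMs = 60000) {
    this.flush = flush;         // called with the accumulated docs
    this.maxDocs = maxDocs;     // memory proxy: count docs, not bytes
    this.maxWaitMs = maxWaitMs; // forced upload interval (1 minute here)
    this.docs = [];
    this.timer = null;
  }

  push(doc) {
    this.docs.push(doc);
    if (this.docs.length >= this.maxDocs) {
      this.drain(); // size threshold reached: upload immediately
    } else if (!this.timer) {
      // time threshold: force an upload even for a small batch
      this.timer = setTimeout(() => this.drain(), this.maxWaitMs);
    }
  }

  drain() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.docs.length === 0) return;
    const batch = this.docs;
    this.docs = [];
    this.flush(batch); // e.g. db.bulk({ docs: batch }, callback)
  }
}
```

A batch lost between `push` and `drain` (say, on a crash) is acceptable here precisely because the historical data can simply be recollected, as discussed above.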
Assuming that you have sufficiently relaxed requirements, that you use Node, that you use CouchDB/Cloudant, and that you access CouchDB through nano, we have just published the module frugal-couchdb, which incorporates some of the algorithms we have been creating to decrease the costs and improve the performance of our service. Its repository is hosted here.