How Many Users Can Our System Support?

This is a question we are frequently asked.

How many users can I support?

The answer is always complicated and conversations typically go something like:

Customer: We are getting ready to scale up; are we ready to handle 100 mega-users?

Edoceo: How many users are on the system now? What resources do you have?

C: We've got a million sign-ups and we're hosted at Rackspace.

E: No - how many servers? What is the environment? How much existing load, etc.?

C: We have two servers; it's built with RoR and uses a combination of MySQL and MongoDB (for web-scale). I think we have lots of room - we have 4 GB of RAM - so how many users can we handle?

E: We need a lot of data to make that prediction. Step one is to measure the current usage and resource consumption, then increase the usage and review the measurements again. That gives us two points on a grid from which we can make linear predictions - more samples are better, of course. What do you use for monitoring now?

C: Rackspace tells us when the server is down.

E: :(

This goes on for some time as we attempt to translate user-accounts into actual users. There is a world of difference between 400m user-accounts and 400m users active within a 12-hour period. The critical performance metric is active users.

Once you have a measurement of active users, the next step is to determine what resources those active users require. Naturally this necessitates measuring resources such as CPU, RAM, HDD and network, and services such as web servers, database servers, caching servers and many other components (we could go on and on). Without these metrics, capacity prediction is impossible.

  1. Decide on acceptable performance, e.g. 1 second TTLB on the homepage.
  2. Determine active users, e.g. recent session count.
  3. Measure resource consumption at the current level.
  4. Use gauges and equations to predict growth/consumption rates.

Finding Active Users

Some determination needs to be made about how to measure active users. How long is a session? How frequent are page requests? Some web-apps/sites have shorter sessions, some longer - make a determination for your environment. For example, we count active sessions that have been updated in the last four minutes. That becomes our basis for creating a base-line measurement.

# example script which queries a session database server and publishes metric
val=$(psql \
    --quiet \
    --no-align \
    --tuples-only  \
    --command "SELECT count(id) FROM session WHERE atime >= current_timestamp - '4 minutes'::interval;")

curl -d "stat=good" -d "metric=users=$val" 'http://custos/api/service/stat?s=...' >/dev/null
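To build a base-line, a script like the one above needs to run on a schedule. As a sketch, a cron entry matching the four-minute session window could look like this (the script path is a placeholder):

```
*/4 * * * * /usr/local/bin/publish-active-users
```

Each run publishes one sample; over days and weeks these samples become the base-line curve.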

Measure Resource Consumption

This is perhaps the most time-consuming part of performance monitoring: first configuring the collector system, then configuring the environment. There are many monitoring systems to choose from and thousands of scripts for collecting metrics.

The collection system will likely be one of Nagios, OpenNMS or MRTG+Cacti, but could also be a hosted tool from WatchMouse or CloudKick, or even our Custos system (currently beta). These systems use the scripts to collect the system metrics and, ideally, build nice charts and graphs that are used in the next phase.

What Metrics to Capture

Determining what to capture on the hardware/software side is a bit easier; we mostly already know what to measure. Generally we start with the "standard" things such as CPU, RAM, HDD and NIC - that is, processor, memory, disk and network.
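As a minimal sketch of collecting those "standard" metrics with stock tools (this assumes a Linux host with /proc; the metric choices are illustrative, not a complete collector):

```shell
# CPU: 1-minute load average from /proc/loadavg
cpu=$(cut -d ' ' -f 1 /proc/loadavg)

# RAM: available memory in kB from /proc/meminfo
ram=$(awk '/MemAvailable/ { print $2 }' /proc/meminfo)

# HDD: percent used on the root filesystem
hdd=$(df -P / | awk 'NR == 2 { sub("%", "", $5); print $5 }')

echo "cpu=$cpu ram=$ram hdd=$hdd"
```

In practice the monitoring agent gathers numbers like these on a schedule and publishes them alongside the active-user count, so both sides of the ratio come from the same time window.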

Making the Predictions

Once these metrics have been collected for some usable time period, we can start the game of predicting the future. For example, knowing how much RAM, CPU and other resources are used by 400 users, we can extrapolate how much is needed for 4000 users. The graphs can also show growth rates for data, which tells us when to upgrade/improve the database back-ends.
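The two-points-on-a-grid extrapolation is simple arithmetic. A sketch with made-up sample numbers (400 users at 900 MB, 800 users at 1500 MB): fit a line through the two measurements and read off the target.

```shell
# Two hypothetical samples: (active users, MB of RAM in use)
u1=400;  r1=900
u2=800;  r2=1500
target=4000

# slope = MB per additional user; base = fixed overhead at zero users
predicted=$(awk -v u1=$u1 -v r1=$r1 -v u2=$u2 -v r2=$r2 -v t=$target 'BEGIN {
    slope = (r2 - r1) / (u2 - u1)
    base  = r1 - slope * u1
    printf "%.0f\n", base + slope * t
}')

echo "predicted RAM at $target users: ${predicted}MB"
# -> predicted RAM at 4000 users: 6300MB
```

Real consumption is rarely perfectly linear - caches, connection pools and lock contention bend the curve - which is why more than two samples makes the prediction far more trustworthy.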

Usage predictions are difficult at best; collecting the metrics is the critical first step in planning for web-site growth.