How does BlueStacks App player determine what’s good for the product and what’s not? [Part 1]
Like every business out there, the BlueStacks app player strives to become better with every new version. However, that is not always as simple as making a few code changes or adding new features. In some cases, there could be a chance that the latest version doesn’t work as well as the previously available version.
So how do we ensure that every new version is better than the previous one?
The answer is simple. We do it using A/B testing or split testing.
What is A/B testing?
A/B testing is simply comparing two versions of an application or a webpage to determine which of the two versions is performing better. It is an experiment wherein two versions of an application or webpage are served to a set of randomly selected users.
In A/B testing, the version that is being compared is referred to as the “Variation”, while the version with which the performance of the Variation is being compared is called the “Control”. The results of an A/B experiment help determine which version is best for the business.
How exactly does BlueStacks perform A/B testing on the world’s #1 App Player?
When we talk about A/B testing with respect to the BlueStacks app player, we compare certain defined stages that are common to the two versions and determine whether the new version shows any improvement or decline in performance at any of those stages. If the improvement or decline is due to the differences between the two versions, we then proceed to find how much of it is due to those differences and what those differences are.
Let’s suppose a version of the BlueStacks app player, say a.b.c, is live for the users from the download buttons on the BlueStacks app player website; it will be our control version. A new version, d.e.f, is ready to be released to the users; this will be the variation version.
To release the variation to our users and determine whether it can replace the control version, we serve both versions, i.e. a.b.c and d.e.f, to the visitors of our website. In other words, we make both versions live from the website download button with 50–50 distribution logic.
So if 100 unique users visit the bluestacks.com website and click on the download button, about 50 of them should get a.b.c, the control, and about 50 should get d.e.f, the variation. This distribution is done randomly, and we do not control which user gets which version, so that the A/B testing is performed with clean data.
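The random 50–50 split can be sketched in a few lines of Python. This is only an illustration, not the actual BlueStacks download logic; the function name `assign_version` and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def assign_version(rng: random.Random) -> str:
    """Pick one of the two builds with equal probability (hypothetical helper)."""
    return rng.choice(["control", "variation"])


# A fixed seed is used only so that the sketch is repeatable.
rng = random.Random(42)

# Simulate 100 unique visitors clicking the download button.
counts = Counter(assign_version(rng) for _ in range(100))
print(counts)
```

Because each visitor is assigned independently, the split is close to 50–50 rather than exactly 50–50, and it evens out as the number of visitors grows.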
Once both versions are live, the next step is to compare their performance. The comparison is based on certain metrics: stages that each version passes through, from its download until the user installs an app in the app player.
Let’s name these stages Stage 0, Stage 1, Stage 2, Stage 3, Stage 4 and Stage 5. All six of these stages will be true for both versions of the BlueStacks app player regardless of the differences between the two versions.
Taking Stage 0 as the point of comparison, the number of users that reach each of the further stages for the two versions determines the version’s performance.
Below is a sample of randomly generated data used for further explanation.
Let’s suppose that 4000 users visited the BlueStacks app player website and clicked on download, out of which 2000 got the version a.b.c while the other 2000 users got the new version d.e.f which is the variation version.
We will consider the number of users receiving each version as Stage 0. Out of these 2000 users, 1720 people reached stage 1 for the control and 1791 for the variation. Similarly, out of the 2000 users, the number of users that reached stage 3 in control was 1665 and in variation was 1733 and so on.
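Hence, the percentage of Stage 0 users that reach each stage, which is called the conversion, can be calculated. For example, the Stage 1 conversion is 1720 / 2000 = 86.00% for the control and 1791 / 2000 = 89.55% for the variation. The conversions for all the stages will be as shown below,
Based on this data, we then calculate the percentage difference between the conversions of the two versions at each of the stages, which we call the “Improvement”.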
Improvement = (The conversion rate of variation / The conversion rate of control) - 1
So for stage 1, Improvement = (89.55 / 86.00) - 1 = 0.04128 = 4.13%
Similarly, we can calculate the improvement for all the stages,
Looking at the above data, we can say that the performance of Stage 1 and Stage 2 for the variation is better than the performance of Stage 1 and Stage 2 for the control. Since the improvement is positive, it can be said that the performance of the variation is better than that of the control.
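The Stage 1 arithmetic can be reproduced with a small Python sketch. The function names `conversion_rate` and `improvement` are chosen for this example only; the numbers are the sample figures quoted in the text.

```python
def conversion_rate(reached: int, stage0: int) -> float:
    """Percentage of Stage 0 users that reached a given stage."""
    return reached / stage0 * 100


def improvement(variation_rate: float, control_rate: float) -> float:
    """Relative uplift of the variation over the control, as a fraction."""
    return variation_rate / control_rate - 1


# Sample figures from the text: 2000 users per version at Stage 0.
control_stage1 = conversion_rate(1720, 2000)
variation_stage1 = conversion_rate(1791, 2000)
stage1_improvement = improvement(variation_stage1, control_stage1)

print(f"control: {control_stage1:.2f}%")
print(f"variation: {variation_stage1:.2f}%")
print(f"improvement: {stage1_improvement:.2%}")
```

A positive result means the variation converts better than the control at that stage; a negative result means it converts worse.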
However, we need to ascertain whether the improvement seen in the data above can be believed.
To determine this, we calculate the statistical significance, or confidence, of the improvement for each stage.
What is Statistical Significance?
Statistical significance, or confidence, is the likelihood that the Improvement observed at the different stages is not a random result. So, the greater the statistical significance, the lower the chance that the observed improvement is due to random variation, and the higher the possibility that it is due to the changes made in the code between the control and the variation.
Let’s take a small example to understand how the statistical significance is calculated.
Suppose we have two versions, A & B, of any app/website which has been shown to 8 users.
So, there can be 9 different combinations of possible scenarios for distribution of the two versions of the website among the 8 users, e.g. there can be 0 users that get the A version, and all the 8 users get the B version, or there can be 1 user that got the version A while 7 users got the version B and so on.
The possible combinations of each of the scenarios can be calculated using the formula,
nCr = n! / (r! * (n - r)!)
Where n! is n * (n - 1) * (n - 2) * … * 2 * 1
Let us take the first scenario into consideration. Using the formula nCr, the possible combinations for 0 users getting A and 8 users getting B can be calculated as:
8C0 = 8! / (0! * (8 - 0)!)
8C0 = (8*7*6*5*4*3*2*1) / (0! * 8!)
8C0 = (8*7*6*5*4*3*2*1) / (1 * 8*7*6*5*4*3*2*1) = 1
Hence, the possible combinations for 0A and 8B is 1. Similarly, the possible combinations for the scenario where 1 user gets A and 7 users get B can be calculated as:
8C1 = 8! / (1! * (8 - 1)!)
8C1 = (8*7*6*5*4*3*2*1) / (1! * 7!)
8C1 = (8*7*6*5*4*3*2*1) / (1 * 7*6*5*4*3*2*1) = 8
Note: The factorial of 0 is 1.
Using the above formula, we can calculate the possible combinations of all the possible scenarios.
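As a quick check, the same counts can be produced with Python’s standard `math.comb` function (available from Python 3.8). The variable names are chosen for this sketch only.

```python
from math import comb

USERS = 8  # number of users in the small example

# Number of ways in which exactly `a` of the 8 users can receive version A.
combinations = {a: comb(USERS, a) for a in range(USERS + 1)}

for a, ways in combinations.items():
    print(f"{a} users get A, {USERS - a} users get B: {ways} combinations")
```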
Once we have the possible combinations, we can go on to calculate the probability of occurrence of each of the possible scenarios using the formula,
P(A) = The number of possible outcomes or combinations / Total number of possible outcomes or combinations
Here, the total number of possible outcomes or combinations will be 2^8, as two versions are being shown to 8 users. So, 2^8 = 2*2*2*2*2*2*2*2, which is 256.
Now, if we take the first possible scenario, the probability of the occurrence of that scenario will be,
P(A) =1 / 256 = 0.00390625
Similarly, for the second scenario,
P(A) =8 / 256 = 0.03125
And so on. Using the above formulas, we get the following data,
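The probabilities for all nine scenarios follow from dividing each count by 256. A minimal Python sketch, with names chosen for this example only:

```python
from math import comb

USERS = 8
TOTAL_OUTCOMES = 2 ** USERS  # two possible versions for each of the 8 users

# Probability that exactly `a` of the 8 users receive version A.
probabilities = {a: comb(USERS, a) / TOTAL_OUTCOMES for a in range(USERS + 1)}

for a, p in probabilities.items():
    print(f"{a} users get A, {USERS - a} users get B: {p:.8f}")
```

The nine probabilities add up to 1, since the scenarios cover every possible way of distributing the two versions.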
From the above table, we can see that there can be two extreme cases, one where all conversions happened with A and one where all conversions happened with B, the probability of occurrence for which is very low. There is also a chance of something less extreme, i.e. 4 conversions because of A and 4 because of B.
Let’s focus on the case where 3 users got A and 5 users got B. The probability of occurrence for this scenario is 0.21875, which is 21.88%.
Now, the conversion rate of A and B can be calculated as follows,
The conversion rate of A = 3 / 8 * 100 = 37.5%
The conversion rate of B = 5 / 8 * 100 = 62.5%
Further, the improvement for version B will be calculated as
Improvement = (The conversion rate of B / The conversion rate of A) - 1
Improvement = (62.5 / 37.5) - 1 = 0.6667 = 66.67%
So, the improvement for version B is +66.7%. But since the sample size we’ve considered is very small, this 66.7% uplift could easily have happened by chance.
What does “happen by chance” mean here, though?
To understand this, let’s say we use a script to randomly assign versions A and B to 8 users, and repeat this 5000 times; we get the below data,
- Here, “A converts” means the number of users that got version A, and “B converts” means the number of users that got version B.
- “Occurrence” means the number of times that combination of A and B converts happened when versions A and B were randomly assigned to 8 users 5000 times.
So the first row can be read as: 0 people got version A and all 8 people got version B in 6 of the 5000 runs.
Note: This is a random assignment using a script.
If we represent this data graphically, we will get the following result,
From this graph, we can see that 5 people getting version B and 3 people getting version A is highly likely. So, we can conclude that the 66.7% improvement or uplift that we calculated previously is very likely random and a result of natural variance, since this uplift happened even with a random assignment using a script.
Hence, we can say that this uplift is not statistically significant. Had the scenario of 5 people getting version B and 3 people getting version A been rare, we would have concluded that it is highly unlikely to happen randomly or by chance, and hence that the uplift is statistically significant.
Considering all the cases where version B is at least 66.7% better than A in the above data, we will have the following data (highlighted in pink),
From this data, B is at least 66.7% better than A when:
- 0 people got A
- 1 person got A
- 2 people got A, and
- 3 people got A out of 8.
And the total occurrence of these scenarios out of 5000 was (6 + 49 + 338 + 1119), which is 1592.
Here, we can establish that B will be at least 66.7% better than A 1592 times out of 5000.
So, the probability of B being at least 66.7% better than A due to random/natural causes will be calculated as:
P(A) = The number of possible outcomes or combinations / Total number of possible outcomes or combinations
P(A) = 1592 / 5000 = 0.3184 = 31.84%
The value of 0.3184 is known as the p-value. Once the p-value is available, the statistical significance, or confidence, for this set of data can be calculated as 1 - p, with the formula below:
Statistical significance = 1 - p = 1 - 0.3184 = 0.6816 = 68.16%
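The random-assignment experiment and the p-value calculation can be sketched in Python as follows. This is not the script used for the article; the function name `simulate_assignments` and the seed are assumptions for this example, and because the assignment is random, the counts it produces will differ from the article’s table and from one seed to the next.

```python
import random

USERS = 8
RUNS = 5000


def simulate_assignments(runs: int, users: int, seed: int) -> list[int]:
    """Return occurrences[a] = number of runs in which exactly `a` users got version A."""
    rng = random.Random(seed)
    occurrences = [0] * (users + 1)
    for _ in range(runs):
        # Each user independently gets A or B with equal probability.
        got_a = sum(rng.random() < 0.5 for _ in range(users))
        occurrences[got_a] += 1
    return occurrences


occurrences = simulate_assignments(RUNS, USERS, seed=1)

# Runs in which at most 3 of the 8 users got version A,
# i.e. runs at least as lopsided in favour of B as the 3A/5B case.
extreme_runs = sum(occurrences[:4])
p_value = extreme_runs / RUNS
significance = 1 - p_value

print(occurrences)
print(f"p-value: {p_value:.4f}, statistical significance: {significance:.2%}")
```

The key design choice is that the p-value counts every scenario that is at least as extreme as the observed one, not just the observed scenario itself.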
Now that we have the statistical significance, the question is how much statistical significance is good enough, i.e. what value is needed before we can say that the improvement is unlikely to be due to natural variance or random causes.
We previously stated that the higher the statistical significance, the better. In the above example, we got a statistical significance of 68.16%, which means that there is a 31.84% chance of a false positive.
Probability of a false positive with 68.16% statistical significance = 100% - 68.16% = 31.84%
Similarly, in any case, 80% statistical significance will mean a 20% chance of a false positive, i.e. 20% chance of the result being random. This means 1 out of 5 chance of a false positive or 1 out of 5 times the result can be random.
90% statistical significance, in any case, will mean a 10% chance of a false positive, i.e. a 10% chance of the result being random. This means 1 out of 10 chance of a false positive or 1 out of 10 times the result can be random.
95% statistical significance, in any case, will mean a 5% chance of a false positive, i.e. a 5% chance of the result being random. This means 1 out of 20 chance of a false positive or 1 out of 20 times the result can be random.
And 99% statistical significance will mean a 1% chance of a false positive, i.e. only a 1% chance of a random result. This means 1 out of 100 chance of a false positive or 1 out of 100 times the result can be random.
BlueStacks app player aims for 99% statistical significance, which means that the p-value should be 0.01 or lower, i.e. very low.
A lower p-value or higher statistical significance will signify that the improvement present is more likely due to the changes between the two versions, e.g. changes in code or the website etc. In contrast, a higher p-value or lower statistical significance will signify that the improvement present is more likely due to the random causes.
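The decision rule can be written as a small Python check. The function name `meets_target` and its default threshold are assumptions for this sketch; the 0.01 threshold corresponds to the 99% target mentioned above.

```python
def meets_target(p_value: float, max_p_value: float = 0.01) -> bool:
    """True when the p-value is low enough for the chosen significance target."""
    return p_value <= max_p_value


# The worked example: a p-value of 0.3184 is far above the 0.01 threshold.
example_p_value = 0.3184
print(f"statistical significance: {1 - example_p_value:.2%}")
print(f"meets 99% target: {meets_target(example_p_value)}")
```

Comparing the p-value with the threshold directly, instead of comparing 1 - p with 0.99, avoids small floating-point rounding surprises at the boundary.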
Now that we understand what statistical significance is, it is essential to know how much data is required to run an A/B experiment so that results can be concluded.
How much data do we need in an A/B experiment?
Let’s take the example of a coin that is flipped 4 times. There are 2^4, i.e. 16, possible outcomes that can happen in this case, as shown below.
Converting this data into a graph w.r.t the number of heads vs the possible combinations, we will get the following,
If the coin is flipped 8 times, the graph will be as shown below,
If the coin is flipped 16 times, the graph will be as shown below,
If the coin is flipped 32 times, the graph will be as shown below,
If the coin is flipped 64 times, the graph will be as shown below,
If the coin is flipped 128 times, the graph will be as shown below,
Looking at all the graphs shown above for different numbers of flips, we notice that the peak, in every case, sits at half the number of flips, and the curve declines on both sides of it. The height of the peak, i.e. the number of combinations that give half heads and half tails, also rises steeply as the number of flips increases.
This is simply a consequence of many more possible ways of achieving half heads and half tails for a given number of flips compared to the ways where either heads or tails are a majority. As we make more and more flips, the graph of the probability of a given number of heads becomes smoother and approaches the “bell curve” or a “normal distribution curve” as shown below.
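The shape of these graphs can be reproduced numerically. The sketch below, with names chosen for this example only, finds the most likely number of heads and the number of combinations that produce it for each number of flips used above.

```python
from math import comb


def heads_distribution(flips: int) -> list[float]:
    """Probability of getting exactly h heads, for h = 0..flips, with a fair coin."""
    total = 2 ** flips
    return [comb(flips, h) / total for h in range(flips + 1)]


# For each number of flips, record (most likely number of heads, combinations at the peak).
peaks = {}
for flips in (4, 8, 16, 32, 64, 128):
    dist = heads_distribution(flips)
    peak_heads = max(range(flips + 1), key=dist.__getitem__)
    peaks[flips] = (peak_heads, comb(flips, peak_heads))
    print(f"{flips} flips: peak at {peak_heads} heads, {peaks[flips][1]} combinations")
```

Plotting `heads_distribution(flips)` for growing values of `flips` gives the progressively smoother, bell-shaped curves described in the text.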
The bell curve or the normal distribution curve is a continuous probability distribution that is symmetrical on both sides of the mean in a way that the right side of the centre is a mirror image of the left side.
In the graph below, the orange marked area represents the situations where chances of random occurrence or occurrence by chance are less, i.e. in these cases; the p-value will be low, which further means that the statistical significance in these cases will be high.
The area marked in yellow depicts the situations where chances of random occurrence or occurrence by chance are high, i.e. in these cases, the p-value will be high, which further means that the statistical significance in these cases will be lower.
So, in summation, it can be said that as we go farther from the mean, the chances of occurrence by chance decrease and the chances of occurrence due to our change increase, meaning increased statistical significance.
Hence, the greater the data, the more confidence we’ll have in the results.
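To see why more data gives more confidence, consider the same 5-in-8 share of heads at larger sample sizes. The sketch below is an illustration using a fair coin; the function name `tail_probability` is an assumption for this example.

```python
from math import comb


def tail_probability(flips: int, min_heads: int) -> float:
    """Chance of getting at least `min_heads` heads in `flips` fair coin flips."""
    favourable = sum(comb(flips, h) for h in range(min_heads, flips + 1))
    return favourable / 2 ** flips


# Keep the share of heads fixed at 5/8 (62.5%) and grow the sample.
tails = {}
for flips in (8, 32, 128):
    min_heads = flips * 5 // 8
    tails[flips] = tail_probability(flips, min_heads)
    print(f"{flips} flips, at least {min_heads} heads: {tails[flips]:.6f}")
```

The same lopsided share becomes less and less likely to arise by chance as the number of flips grows, which is why a result seen in a large sample is more trustworthy than the same result seen in a small one.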
What’s Next?
In the next part of this blog, we will see how statistical significance is calculated, what is expected, and the actual deviation.
Credits
Images-
Anubha Bansal