At trivago, we generate price-related statistics every day from billions of logged entries containing information provided by third parties. This can get quite messy: you can easily find hotel prices of one million Euro and above in such a huge dataset. To avoid these extreme outliers - caused by bugs in the implementation, invalid requests, or however else they may occur - we decided that in the future, we want to show the median price of a hotel instead of the mean, as we used to do when we still had a much smaller dataset that was much easier to clean and control. One valid question during the transition was:
“How much data do I need to have a reliable median information?”
Approach
For the mean, SQL and Hive already provide tools that help answer that question - e.g. stddev_samp and variance as part of the Hive UDAFs. Providing a confidence interval for your mean value is quite straightforward and also easy to understand for the uninitiated: a 95% confidence interval is the interval in which the "real" mean value lies with a confidence of 95%. I did some quick research (meaning, I googled) and found the following article discussing how to calculate the confidence interval of the median: https://epilab.ich.ucl.ac.uk/coursematerial/statistics/non_parametric/confidence_interval.html
Formula
The formula itself is quite simple. The upper and lower bounds of our confidence interval is the \(u\)th and \(l\)th element of our (sorted) list of size \(n\), with
$$
u = 1 + \frac{n}{2} + \left(N_{1-\alpha/2} * \frac{\sqrt{n}}{2}\right)
$$
and
$$
l = \frac{n}{2} - \left(N_{1-\alpha/2} * \frac{\sqrt{n}}{2}\right)
$$
both rounded to the nearest integer. Substitute \(N_{1-\alpha/2}\) with the value of the standard normal distribution at the \(1-\alpha/2\) quantile for a confidence level of \(1-\alpha\); you can use a table of the normal distribution for this (e.g. at Wolfram Alpha). I want a 95% confidence, so I am going to use \(N_{0.975} = 1.96\) in the examples below.
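As a quick worked example, this small Python helper evaluates the two formulas above and returns the 1-based positions of the lower and upper bound in the sorted list:

```python
import math

def median_ci_positions(n, z=1.96):
    """Return the 1-based positions (l, u) of the 95% confidence
    bounds of the median in a sorted list of n elements."""
    half = z * math.sqrt(n) / 2
    l = round(n / 2 - half)
    u = round(1 + n / 2 + half)
    return l, u

# For n = 100 sorted prices the interval runs from the
# 40th to the 61st element.
print(median_ci_positions(100))  # -> (40, 61)
```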
Implementation - old version
update
I later found a much simpler and more straightforward way to do this using the percentile method; it can be found in the section below this one.
Luckily, Hive - starting with version 0.14, which includes HIVE-7325 - already offers all the functions we need to implement this.
We need to collect all the elements, sort them, and then pick element \(l\) and \(u\). This can be done this way:
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    count(price)                    AS n,
    sort_array(collect_list(price)) AS price_array
  FROM hotel_prices
  GROUP BY hotel_id
)
SELECT
  hotel_id,
  -- simplified median: the middle element of the sorted array
  price_array[cast(n / 2 AS int)] AS median_price,
  -- positions from the formula are 1-based, Hive arrays are 0-based,
  -- and least/greatest keep the index within the array bounds
  price_array[greatest(cast(round(n / 2 - 1.96 * sqrt(n) / 2) AS int) - 1, 0)]                 AS ci_lower,
  price_array[least(cast(round(1 + n / 2 + 1.96 * sqrt(n) / 2) AS int) - 1, cast(n AS int) - 1)] AS ci_upper
FROM cte;
In the cte part we do the counting, collecting and sorting. Afterwards, we access our price_array at the position of the element we want.
Hive is actually able to execute this in a single MapReduce stage, but of course it needs to store and sort all your elements in one list, so the operation requires a lot of memory if your lists grow large. You might want to sample the data to reduce the size and get an approximation instead of an exact solution.
Implementation - update
Instead of going for the \(u\)th element, we are going for the percentile at which we should find element \(u\).
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    count(price) AS n
  FROM hotel_prices
  GROUP BY hotel_id
)
SELECT
  p.hotel_id,
  -- percentile() expects an integral price column;
  -- use percentile_approx() for doubles
  percentile(p.price, 0.5) AS median_price,
  -- clamp the positions to [1, n], then turn them into percentiles
  percentile(p.price,
             (greatest(round(cte.n / 2 - 1.96 * sqrt(cte.n) / 2), 1) - 1)
             / (cte.n - 1)) AS ci_lower,
  percentile(p.price,
             (least(round(1 + cte.n / 2 + 1.96 * sqrt(cte.n) / 2), cte.n) - 1)
             / (cte.n - 1)) AS ci_upper
FROM hotel_prices p
JOIN cte ON p.hotel_id = cte.hotel_id
GROUP BY p.hotel_id, cte.n;
This way we don’t need to keep the entire sorted array in memory and can instead use the - probably much more optimised - percentile function from Hive. Hive needs two MapReduce stages for this query: one to count the elements per dimension, and one to select the percentile depending on the previous count.
We need some casting so everything has the correct type, and an additional least/greatest so the returned value stays within the bounds of the array (for very small \(n\), \(l\) can be smaller than 1 and \(u\) greater than \(n\)). Since array positions in Hive start at 0, we also need to subtract 1.
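A small Python illustration of why both adjustments are needed - for small \(n\) the raw positions fall outside \(1..n\), and the final index has to be 0-based:

```python
import math

def ci_index_bounds(n, z=1.96):
    """0-based array indices of the median's confidence bounds."""
    half = z * math.sqrt(n) / 2
    l = round(n / 2 - half)      # can drop below 1 for small n
    u = round(1 + n / 2 + half)  # can exceed n for small n
    l = max(l, 1)                # greatest(l, 1) in Hive
    u = min(u, n)                # least(u, n) in Hive
    return l - 1, u - 1          # Hive arrays are 0-indexed

# For n = 4 the raw positions are l = 0 and u = 5; after clamping
# and shifting, the whole array is the interval.
print(ci_index_bounds(4))  # -> (0, 3)
```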
Impala
Unfortunately, Hive is not really fast enough for ad-hoc queries, and Impala does not support the percentile function. Since percentile is a Hive UDAF, we also cannot plug it into Impala the way we can with plain UDFs. There is one option left: we can rank the individual entries with an analytic window function and then pick the rows with the ranks we want:
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    price,
    row_number() OVER (PARTITION BY hotel_id ORDER BY price) AS pos,
    count(price) OVER (PARTITION BY hotel_id)                AS n
  FROM hotel_prices
)
SELECT
  hotel_id,
  -- the median is the average of the one or two middle elements
  avg(if(pos IN (floor((n + 1) / 2), ceil((n + 1) / 2)), price, NULL)) AS median_price,
  -- pick the rows whose rank matches the (clamped) bound positions
  max(if(pos = greatest(round(n / 2 - 1.96 * sqrt(n) / 2), 1), price, NULL)) AS ci_lower,
  max(if(pos = least(round(1 + n / 2 + 1.96 * sqrt(n) / 2), n), price, NULL)) AS ci_upper
FROM cte
GROUP BY hotel_id;
Instead of calculating the median via avg, we could also use Impala's appx_median function, which calculates an approximation of the median but is a lot faster. Again, this query will get tricky if you have a lot of values to sort through.
Result
The confidence interval now gives us valuable information about how reliable our median is: even with just one hundred prices, the confidence interval is often still very small, since hotel prices usually stay within a narrow range once all the outliers are excluded (as the median does by nature). As expected, the upper boundary is usually a bit further away from the median than the lower boundary.