At trivago, we generate price-related statistics every day from billions of logged entries containing information provided by third parties. This can get quite messy: you can easily find hotel prices of one million Euro and above in such a huge dataset. To avoid these extreme outliers - caused by bugs in the implementation, invalid requests, or however else they may occur - we decided that in the future, we want to show the median price of a hotel instead of the mean, as we used to do when we still had a much smaller dataset that was much easier to clean and control. One valid question during the transition was:
“How much data do I need to have a reliable median information?”
Approach
For the mean, SQL and Hive already provide tools that help answer that question - e.g. stddev_samp and variance as part of the Hive UDAFs. Providing a confidence interval for your mean value is quite straightforward and also easy to understand for the uninitiated: a 95% confidence interval is the interval in which the "real" mean value lies with a confidence of 95%. I did some quick research (meaning, I googled) and found the following article discussing how to calculate the confidence interval of the median: https://epilab.ich.ucl.ac.uk/coursematerial/statistics/non_parametric/confidence_interval.html
Formula
The formula itself is quite simple. The upper and lower bounds of our confidence interval is the \(u\)th and \(l\)th element of our (sorted) list of size \(n\), with
$$
u = 1 + \frac{n}{2} + \left(N_{1-\alpha/2} * \frac{\sqrt{n}}{2}\right)
$$
and
$$
l = \frac{n}{2} - \left(N_{1-\alpha/2} * \frac{\sqrt{n}}{2}\right)
$$
both rounded to the nearest integer. Substitute \(N_{1-\alpha/2}\) with the value of the standard normal distribution at the \(1-\alpha/2\) quantile for a confidence level of \(1-\alpha\); you can use a table of the normal distribution for this (e.g. at Wolfram Alpha). I want a 95% confidence, so I am going to use \(N_{0.975} = 1.96\) in the examples below.
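As a quick worked example, this small Python helper evaluates the two formulas above and returns the 1-based positions of the lower and upper bound in the sorted list:

```python
import math

def median_ci_positions(n, z=1.96):
    """Return the 1-based positions (l, u) of the 95% confidence
    bounds of the median in a sorted list of n elements."""
    half = z * math.sqrt(n) / 2
    l = round(n / 2 - half)
    u = round(1 + n / 2 + half)
    return l, u

# For n = 100 sorted prices the interval runs from the
# 40th to the 61st element.
print(median_ci_positions(100))  # -> (40, 61)
```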
Implementation - old version
update
I later found a much simpler and more straightforward way to do this using the percentile method; it can be found in the section below this one.
Luckily, Hive - starting with version 0.14, which includes HIVE-7325 - already offers all the functions we need to implement this.
We need to collect all the elements, sort them, and then pick element \(l\) and \(u\). This can be done this way:
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    count(price)                    AS n,
    sort_array(collect_list(price)) AS price_array
  FROM hotel_prices
  GROUP BY hotel_id
)
SELECT
  hotel_id,
  -- simplified median: the middle element of the sorted array
  price_array[cast(n / 2 AS int)] AS median_price,
  -- positions from the formula are 1-based, Hive arrays are 0-based,
  -- and least/greatest keep the index within the array bounds
  price_array[greatest(cast(round(n / 2 - 1.96 * sqrt(n) / 2) AS int) - 1, 0)]                 AS ci_lower,
  price_array[least(cast(round(1 + n / 2 + 1.96 * sqrt(n) / 2) AS int) - 1, cast(n AS int) - 1)] AS ci_upper
FROM cte;
In the cte part we do the counting, collecting and sorting. Afterwards, we access our price_array at the position of the element we want.
Hive is actually able to execute this in a single MapReduce stage, but of course it needs to store and sort all your elements in one list, so the operation requires a lot of memory if your lists grow large. You might want to sample the data to reduce the size and get an approximation instead of an exact solution.
Implementation - update
Instead of going for the \(u\)th element, we are going for the percentile at which we should find element \(u\).
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    count(price) AS n
  FROM hotel_prices
  GROUP BY hotel_id
)
SELECT
  p.hotel_id,
  -- percentile() expects an integral price column;
  -- use percentile_approx() for doubles
  percentile(p.price, 0.5) AS median_price,
  -- clamp the positions to [1, n], then turn them into percentiles
  percentile(p.price,
             (greatest(round(cte.n / 2 - 1.96 * sqrt(cte.n) / 2), 1) - 1)
             / (cte.n - 1)) AS ci_lower,
  percentile(p.price,
             (least(round(1 + cte.n / 2 + 1.96 * sqrt(cte.n) / 2), cte.n) - 1)
             / (cte.n - 1)) AS ci_upper
FROM hotel_prices p
JOIN cte ON p.hotel_id = cte.hotel_id
GROUP BY p.hotel_id, cte.n;
This way we don’t need to keep the entire sorted array in memory and can instead use the - probably much more optimised - percentile function from Hive. Hive needs two MapReduce stages for this query: one to count the elements per dimension, and one to select the percentile depending on the previous count.
We need some casting so everything has the correct type, and an additional least/greatest so the returned value stays within the bounds of the array (for very small \(n\), \(l\) can be smaller than 1 and \(u\) greater than \(n\)). Since array positions in Hive start at 0, we also need to subtract 1.
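A small Python illustration of why both adjustments are needed - for small \(n\) the raw positions fall outside \(1..n\), and the final index has to be 0-based:

```python
import math

def ci_index_bounds(n, z=1.96):
    """0-based array indices of the median's confidence bounds."""
    half = z * math.sqrt(n) / 2
    l = round(n / 2 - half)      # can drop below 1 for small n
    u = round(1 + n / 2 + half)  # can exceed n for small n
    l = max(l, 1)                # greatest(l, 1) in Hive
    u = min(u, n)                # least(u, n) in Hive
    return l - 1, u - 1          # Hive arrays are 0-indexed

# For n = 4 the raw positions are l = 0 and u = 5; after clamping
# and shifting, the whole array is the interval.
print(ci_index_bounds(4))  # -> (0, 3)
```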
Impala
Unfortunately, Hive is not really fast enough for ad-hoc queries, and Impala does not support the percentile function. Since percentile is a Hive UDAF, we also cannot plug it into Impala the way we can with plain UDFs. There is one option left: we can rank the individual entries with an analytic window function and then pick the rows with the ranks we want:
-- table and column names are illustrative
WITH cte AS (
  SELECT
    hotel_id,
    price,
    row_number() OVER (PARTITION BY hotel_id ORDER BY price) AS pos,
    count(price) OVER (PARTITION BY hotel_id)                AS n
  FROM hotel_prices
)
SELECT
  hotel_id,
  -- the median is the average of the one or two middle elements
  avg(if(pos IN (floor((n + 1) / 2), ceil((n + 1) / 2)), price, NULL)) AS median_price,
  -- pick the rows whose rank matches the (clamped) bound positions
  max(if(pos = greatest(round(n / 2 - 1.96 * sqrt(n) / 2), 1), price, NULL)) AS ci_lower,
  max(if(pos = least(round(1 + n / 2 + 1.96 * sqrt(n) / 2), n), price, NULL)) AS ci_upper
FROM cte
GROUP BY hotel_id;
Instead of calculating the median via avg, we could also use Impala's appx_median function, which calculates an approximation of the median but is a lot faster. Again, this query will get tricky if you have a lot of values to sort through.
Result
The confidence interval now gives us valuable information about how reliable our median is: even with just one hundred prices, the confidence interval is often still very small, since hotel prices usually stay within a narrow range once all the outliers are excluded (as the median does by nature). As expected, the upper boundary is usually a bit further away from the median than the lower boundary.