利用Hive中percentile_approx计算等频划分分位点

等频划分

等频划分:按照数据的分布情况,每个区间的数据数量一样,平均划分成k个区间

等比划分:按照数据的全部取值情况,平均划分成k个区间

Hive 中计算分位数的函数:percentile_approx

hive 中的percentile_approx函数可以确定等频划分的分位点

percentile_approx(col,array(0.2,0.4,0.6,0.8))
[0.0,4001.0,4061.0]

其中col为要划分的列,array中的数字代表划分的位置,比如(0.2,0.4,0.6,0.8)就是钱20%数量的样本被分到一个区间,然后20%-40%的样本被分到一个区间….

返回值是一个array

如果希望变成一列,可以用explode函数:

explode(percentile_approx(cast(col as double),array(0.05,0.5,0.95),9999))as percentile
percentile
0.0
4001.1
4061.0

代码

create table zhangsiyao.percentile_of_dt_fea_0818_0820_2 as select explode(percentile_approx(adloc_age_sex_ctr,array(0.25,0.5,0.75))) as adloc_age_sex_ctr_percentile,explode(percentile_approx(adloc_income_ctr,array(0.25,0.5,0.75))) as adloc_income_ctr_percentile,explode(percentile_approx(adloc_marry_ctr,array(0.25,0.5,0.75))) as adloc_marry_ctr_percentile,explode(percentile_approx(adloc_province_ctr,array(0.25,0.5,0.75))) as adloc_province_ctr_percentile,explode(percentile_approx(adloc_clinttype_ctr,array(0.25,0.5,0.75))) as adloc_clinttype_ctr_percentile,explode(percentile_approx(adloc_sponsorid_ctr,array(0.25,0.5,0.75))) as adloc_sponsorid_ctr_percentile,explode(percentile_approx(adloc_categoryid_ctr,array(0.25,0.5,0.75))) as adloc_categoryid_ctr_percentile,explode(percentile_approx(adloc_crtvid_ctr,array(0.25,0.5,0.75))) as adloc_crtvid_ctr_percentile,explode(percentile_approx(adloc_unet_ctr,array(0.25,0.5,0.75))) as adloc_unet_ctr_percentile,explode(percentile_approx(adloc_isp_ctr,array(0.25,0.5,0.75))) as adloc_isp_ctr_percentile,explode(percentile_approx(adloc_appver_ctr,array(0.25,0.5,0.75))) as adloc_appver_ctr_percentile,explode(percentile_approx(adloc_devicetype_ctr ,adloc_deviceos_ctr,array(0.25,0.5,0.75))) as adloc_devicetype_ctr ,adloc_deviceos_ctr_percentile,explode(percentile_approx(age_province_ctr,array(0.25,0.5,0.75))) as age_province_ctr_percentile,explode(percentile_approx(age_marry_ctr,array(0.25,0.5,0.75))) as age_marry_ctr_percentile,explode(percentile_approx(age_categoryid_ctr,array(0.25,0.5,0.75))) as age_categoryid_ctr_percentile,explode(percentile_approx(age_crtvid_ctr,array(0.25,0.5,0.75))) as age_crtvid_ctr_percentile,explode(percentile_approx(sex_marry_ctr,array(0.25,0.5,0.75))) as sex_marry_ctr_percentile,explode(percentile_approx(sex_categoryid_ctr,array(0.25,0.5,0.75))) as sex_categoryid_ctr_percentile,explode(percentile_approx(income_categoryid_ctr,array(0.25,0.5,0.75))) as income_categoryid_ctr_percentile,explode(percentile_approx(income_crtvid_ctr,array(0.25,0.5,0.75))) as income_crtvid_ctr_percentile,explode(percentile_approx(marry_categoryid_ctr,array(0.25,0.5,0.75))) as marry_categoryid_ctr_percentile,explode(percentile_approx(marry_crtvid_ctr,sponsorid_unet_ctr,array(0.25,0.5,0.75))) as marry_crtvid_ctr,sponsorid_unet_ctr_percentile from zhangsiyao.dt_fea_0818_0820;

由此就找到了将特征等频划分的分位点,可以按照得到的分位点队连续型特征进行分桶操作