Avito Product Demand Prediction

1. dataset description

* item_id - Ad id.
* user_id - User id.
* region - Ad region.
* city - Ad city.
* parent_category_name - Top level ad category as classified by Avito's ad model.
* category_name - Fine-grained ad category as classified by Avito's ad model.
* param_1 - Optional parameter from Avito's ad model.
* param_2 - Optional parameter from Avito's ad model.
* param_3 - Optional parameter from Avito's ad model.
* title - Ad title.
* description - Ad description.
* price - Ad price.
* item_seq_number - Ad sequential number for user.
* activation_date - Date ad was placed.
* user_type - User type.
* image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
* image_top_1 - Avito's classification code for the image.
* deal_probability - The target variable.

2. Lookup list of methods for preprocessing the following data types

1. text  
2. images  
3. time series    

2.1 text data

stemming
term frequency (TF)
TF-IDF
text statistics (number of words, number of characters, capital letters, alphanumeric characters)
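
The text statistics and TF-IDF weighting listed above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used for the competition: the function names `text_stats` and `tf_idf` are hypothetical, tokenization is a naive whitespace split, stemming is skipped, and the IDF uses the plain log(N / df) form (libraries such as scikit-learn apply a smoothed variant).

```python
import math
from collections import Counter


def text_stats(text):
    """Simple per-document statistics: word, character, capital and alphanumeric counts."""
    return {
        "n_words": len(text.split()),
        "n_chars": len(text),
        "n_caps": sum(ch.isupper() for ch in text),
        "n_alnum": sum(ch.isalnum() for ch in text),
    }


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up ad titles, used only to exercise the two functions.
titles = ["Red bike for sale", "Blue bike", "Sofa for sale"]
stats = text_stats(titles[0])
weights = tf_idf(titles)
```

A term that appears in only one title ("red") gets a higher weight than one shared across titles ("bike"), which is the point of the IDF factor.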

2.2 image data

reference

image features
- top predicted categories (percentage encoding)
- key point detection (similar idea to ROI)
image quality
- dullness
- whiteness
  percentage of dark (< dark_threshold) / bright (> light_threshold) pixels
- uniformity (edge presence percentage)
- dominant color / average color
  pixel frequency
- size
- blurriness (variance of the Laplacian)
- classification confidence (pipeline)
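
As a rough sketch of two of the quality measures above, the code below computes the share of dark and bright pixels (dullness and whiteness) and the variance of the Laplacian, a common blurriness score where lower variance suggests a blurrier image. It assumes a grayscale image given as a nested list of 0-255 intensities. The function names and the threshold values 50 and 205 are illustrative assumptions, and in practice a library such as NumPy or OpenCV would do this far faster.

```python
from statistics import pvariance


def dark_bright_share(gray, dark_threshold=50, light_threshold=205):
    """Return (share of dark pixels, share of bright pixels) for a grayscale image."""
    pixels = [p for row in gray for p in row]
    dark = sum(p < dark_threshold for p in pixels) / len(pixels)
    bright = sum(p > light_threshold for p in pixels) / len(pixels)
    return dark, bright


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


# Two synthetic 5x5 test images (not real ad photos).
flat = [[128] * 5 for _ in range(5)]  # uniform gray, no edges
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]  # sharp pattern
```

The uniform image has zero Laplacian variance, while the checkerboard, which is all edges, scores high.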

2.3 Time Series

day of week (Mon-Sun)
duration
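
A small sketch of both date features using only `datetime`. The ISO date format is an assumption about how `activation_date` is stored, and because the notes do not say what the duration is measured between, the `date_from` and `date_to` arguments are hypothetical names for the start and end of an ad's active period.

```python
from datetime import date


def day_of_week(activation_date):
    """Map an ISO date string to a weekday index: 0 = Monday ... 6 = Sunday."""
    return date.fromisoformat(activation_date).weekday()


def duration_days(date_from, date_to):
    """Number of days between two ISO date strings (hypothetical active period)."""
    return (date.fromisoformat(date_to) - date.fromisoformat(date_from)).days
```

The weekday index can be used directly as a categorical feature for tree-based models.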

Note:
An interesting way to encode numeric variables using percentage, like price and item_seq_number:
log1p(price) and log1p(item_seq_number)
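
For illustration, `log1p(x)` computes log(1 + x), which compresses a long right tail while mapping 0 to 0, so a zero price does not cause a math error. The sample prices below are made up.

```python
import math

prices = [0, 99, 9_999, 999_999]  # made-up values spanning several orders of magnitude
log_prices = [math.log1p(p) for p in prices]

# expm1 inverts the transform, recovering the original scale.
restored = [math.expm1(v) for v in log_prices]
```

After the transform the values are roughly evenly spaced, instead of the largest price dwarfing the rest.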

3. challenges

Random Forest takes too long to tune.
Solution: switch to LightGBM.
Image data exceeded available memory.
Solution: explore the image data in the training and test sets directly from the archives (zipfile, hashing);
run the Keras model on images read from the archive (zipfile).
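
One way to read the zipfile and hashing notes above is to stream images out of the archive one at a time rather than extracting everything, and to hash the raw bytes so identical images can be matched across the training and test sets. The sketch below builds a tiny in-memory archive with made-up file names and contents purely for demonstration. `iter_image_hashes` is a hypothetical helper, and SHA-256 is used only as a fingerprint for exact byte-level duplicates.

```python
import hashlib
import io
import zipfile


def iter_image_hashes(archive):
    """Yield (member name, SHA-256 hex digest), reading one image at a time."""
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".jpg"):
                # Only this member's bytes are held in memory.
                yield name, hashlib.sha256(zf.read(name)).hexdigest()


# Demo archive with placeholder bytes standing in for real JPEG data.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("a.jpg", b"same-bytes")
    zf.writestr("b.jpg", b"same-bytes")
    zf.writestr("c.jpg", b"other-bytes")
    zf.writestr("readme.txt", b"not an image")

hashes = dict(iter_image_hashes(demo))
```

Because the function is a generator, memory use stays bounded by the largest single image, regardless of the archive size.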

San Wang
