1. dataset description
* item_id - Ad id.
* user_id - User id.
* region - Ad region.
* city - Ad city.
* parent_category_name - Top level ad category as classified by Avito's ad model.
* category_name - Fine grain ad category as classified by Avito's ad model.
* param_1 - Optional parameter from Avito's ad model.
* param_2 - Optional parameter from Avito's ad model.
* param_3 - Optional parameter from Avito's ad model.
* title - Ad title.
* description - Ad description.
* price - Ad price.
* item_seq_number - Ad sequential number for user.
* activation_date - Date ad was placed.
* user_type - User type.
* image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
* image_top_1 - Avito's classification code for the image.
* deal_probability - The target variable.
2. Lookup list of preprocessing methods for the following data types
2.1. text
2.2. images
2.3. time series
2.1 text data
- stemming
- tf
- tf-idf
- text statistics (#words, #char, caps, alphanum)
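
The text statistics and tf / tf-idf items above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the function names are hypothetical, the idf is the unsmoothed textbook form log(N / df), and stemming is left out (for Russian ad text a Snowball-style stemmer would be one option).

```python
import math
import re
from collections import Counter


def text_stats(text):
    """Per-document counts: words, characters, capitals, alphanumerics."""
    return {
        "n_words": len(text.split()),
        "n_chars": len(text),
        "n_caps": sum(ch.isupper() for ch in text),
        "n_alnum": sum(ch.isalnum() for ch in text),
    }


def tokenize(text):
    # Lowercase, then keep runs of word characters (Unicode-aware,
    # so Cyrillic titles and descriptions are handled too).
    return re.findall(r"\w+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), with no smoothing (an assumption of this sketch)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight 0 under this idf, which is why library implementations usually add smoothing.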
2.2 image data
- image features
- top predicted categories (percentage encoding)
- key point detection (similar idea to ROI)
- image quality
  - dullness: percentage of dark (< dark_threshold) pixels
  - whiteness: percentage of bright (> light_threshold) pixels
  - uniformity (percentage of edge pixels present)
  - dominant color / average color: pixel frequency
  - size
  - blurriness (variance of the Laplacian)
- classification confidence (pipeline)
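
Two of the image quality measures above, the dark/bright pixel percentage and blurriness, can be sketched without any imaging library. The example assumes the image has already been decoded into a 2D list of grayscale values in 0..255; the function names and the default thresholds (50 and 200) are illustrative assumptions, not values from the project.

```python
def dark_bright_fraction(gray, dark_threshold=50, light_threshold=200):
    """Return (fraction of dark pixels, fraction of bright pixels).

    `gray` is a 2D list of grayscale values in 0..255.
    The default thresholds are arbitrary placeholders.
    """
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    dark = sum(p < dark_threshold for p in pixels) / n
    bright = sum(p > light_threshold for p in pixels) / n
    return dark, bright


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest a blurry image (few sharp edges).
    Requires an image of at least 3x3 pixels.
    """
    height, width = len(gray), len(gray[0])
    values = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            values.append(lap)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

On real images the same quantities are usually computed with an array or imaging library; the pure-Python loops here only show the definitions.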
2.3 Time Series
- day (Mon-Sun)
- duration
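
A small sketch of the two time features, using only the standard library. The day of week is derived from a date string such as activation_date; the notes do not say which columns define the duration, so `duration_days` simply assumes a start and an end date are available for each ad (both function names are hypothetical).

```python
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_of_week(iso_date):
    """Map an ISO date string (YYYY-MM-DD) to Mon..Sun."""
    return DAY_NAMES[date.fromisoformat(iso_date).weekday()]


def duration_days(start_iso, end_iso):
    """Whole days between two ISO dates (assumed start/end of an ad)."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days
```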
Note:
An interesting way to encode numeric variables using percentage, such as price and item_seq_number:
log1p(price) and log1p(item_seq_number)
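
As a minimal sketch, the log1p transform can be applied with the standard library. Passing missing values through as None is an assumption of this example, and the sample prices are made up.

```python
import math


def log1p_feature(value):
    """Return log(1 + value); None (missing) is passed through unchanged."""
    return None if value is None else math.log1p(value)


# Made-up prices spanning several orders of magnitude.
prices = [0, 99, 1500, None, 2_500_000]
encoded = [log1p_feature(p) for p in prices]
```

log1p compresses the long right tail of such variables and, unlike a plain log, is defined at zero.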
3. challenges
- Random Forest takes forever to tune parameters
  Solution: change to LightGBM
- Image size exceeded memory
  explore image data in training and testing (zipfile, hashing)
  Keras model on images (zipfile)
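
One way to work around the memory limit is to read images straight from the archive one at a time instead of extracting everything. The sketch below uses the standard zipfile and hashlib modules; the function names are hypothetical, and using content hashes to find images shared between the training and testing archives is an assumption about what "hashing" was used for.

```python
import hashlib
import zipfile


def iter_image_bytes(zip_path):
    """Yield (member name, raw bytes) for each .jpg, one at a time.

    Only one image is held in memory per iteration, so the archive
    never has to be extracted in full.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.filename.lower().endswith(".jpg"):
                yield info.filename, archive.read(info)


def image_hashes(zip_path):
    """Map each image name to a SHA-256 digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in iter_image_bytes(zip_path)
    }


def shared_images(train_zip, test_zip):
    """Digests of byte-identical images present in both archives."""
    train = set(image_hashes(train_zip).values())
    test = set(image_hashes(test_zip).values())
    return train & test
```

The same generator could feed batches of decoded images to a Keras model; the decoding step is omitted here because it depends on the imaging library in use.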