
San Wang



Data Source

1. dataset description

* item_id - Ad id.
* user_id - User id.
* region - Ad region.
* city - Ad city.
* parent_category_name - Top level ad category as classified by Avito's ad model.
* category_name - Fine grain ad category as classified by Avito's ad model.
* param_1 - Optional parameter from Avito's ad model.
* param_2 - Optional parameter from Avito's ad model.
* param_3 - Optional parameter from Avito's ad model.
* title - Ad title.
* description - Ad description.
* price - Ad price.
* item_seq_number - Ad sequential number for user.
* activation_date - Date ad was placed.
* user_type - User type.
* image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
* image_top_1 - Avito's classification code for the image.
* deal_probability - The target variable.

2. Lookup list of preprocessing methods for the following data types

2.1. text  
2.2. images  
2.3. time series    

2.1 text data

  • stemming
  • term frequency (tf)
  • tf-idf
  • text statistics (number of words, number of characters, capitals, alphanumerics)
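As a rough illustration of the last three items, here is a minimal, dependency-free sketch. The helper names (`text_stats`, `tokenize`, `tf_idf`) and the sample ads are made up for this example, and the tf-idf variant used (relative term frequency times `log(N / df)`) is only one of several common definitions. Stemming is left out because it usually relies on a language-specific library.

```python
import math
import re
from collections import Counter


def text_stats(text):
    """Simple surface statistics for one ad title or description."""
    return {
        "n_words": len(text.split()),
        "n_chars": len(text),
        "n_caps": sum(ch.isupper() for ch in text),
        "n_alnum": sum(ch.isalnum() for ch in text),
    }


def tokenize(text):
    # Lowercase and keep runs of word characters (no stemming here).
    return re.findall(r"\w+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the doc / number of tokens in the doc
    idf = log(N / df), where df is the number of docs containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical ad titles, for illustration only.
docs = ["Red bike for SALE", "Blue bike, almost new"]
stats = text_stats(docs[0])
weights = tf_idf(docs)
```

With this definition a term that occurs in every document (here "bike") gets weight 0, which is the point of the idf factor: it down-weights words that do not discriminate between ads.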

2.2 image data

reference

  • image features
  • image quality
    • dullness
      percentage of dark pixels (< dark_threshold)
    • whiteness
      percentage of bright pixels (> light_threshold)
    • uniformity (percentage of edge pixels)
    • dominant color / average color
      based on pixel frequency
    • size
    • blurriness (variance of the Laplacian)
    • classification confidence (pipeline)
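Two of the quality measures above, the dark/bright pixel percentages and the Laplacian-based blur score, can be sketched without any image library. This is only a toy version: it assumes a grayscale image given as a list of rows with values 0-255, the threshold values 50 and 205 are arbitrary placeholders for `dark_threshold` and `light_threshold`, and the function names are invented for this example. In practice a library such as NumPy or OpenCV would typically do this work.

```python
def dark_bright_fraction(gray, dark_threshold=50, light_threshold=205):
    """Fraction of dark and bright pixels in a grayscale image (values 0-255)."""
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    dark = sum(p < dark_threshold for p in pixels) / n
    bright = sum(p > light_threshold for p in pixels) / n
    return dark, bright


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest a blurry image; what counts as "low" depends on the data.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


# Two tiny synthetic 4x4 images: a flat gray patch and a sharp checkerboard.
flat = [[128] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]

flat_blur = laplacian_variance(flat)        # no edges at all
checker_blur = laplacian_variance(checker)  # strong edges everywhere
checker_dark, checker_bright = dark_bright_fraction(checker)
```

The flat patch has no edges, so its Laplacian variance is zero, while the checkerboard scores very high; real photos fall somewhere in between.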

2.3 Time Series

  • day of week (Mon-Sun)
  • duration
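Both features can be derived with the standard library. The sketch below assumes dates arrive as `YYYY-MM-DD` strings, which may not match the real format of activation_date. The end date used for the duration is hypothetical, since the column list above contains only a single date per ad.

```python
from datetime import date


def day_of_week(iso_date):
    """Return 0 (Mon) .. 6 (Sun) for a YYYY-MM-DD string."""
    return date.fromisoformat(iso_date).weekday()


def duration_days(start_iso, end_iso):
    """Number of whole days between two YYYY-MM-DD strings."""
    return (date.fromisoformat(end_iso) - date.fromisoformat(start_iso)).days


# Illustrative values only; the end date is a made-up field.
dow = day_of_week("2017-03-15")
duration = duration_days("2017-03-15", "2017-03-28")
```

The integer day of week can then be one-hot encoded or left as a categorical feature, depending on the model.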

Note:
An interesting way to encode numeric variables such as price and item_seq_number on a relative (percentage-like) scale is the log1p transform:
log1p(price) and log1p(item_seq_number)
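A minimal sketch of the transform, using `math.log1p` from the standard library. The helper name `log1p_encode` is made up, and passing missing values through as `None` is an assumption about how gaps would be handled, not something these notes specify.

```python
import math


def log1p_encode(value):
    """Return log(1 + value); missing values (None) are passed through."""
    if value is None:
        return None
    return math.log1p(value)


# Illustrative prices only.
prices = [0, 999, None, 1_000_000]
encoded = [log1p_encode(p) for p in prices]
```

Because log1p(0) is 0, zero prices need no special case, and large prices are compressed so that equal ratios map to roughly equal differences.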

3. challenges