It is the beginning of the year, and many of you are thinking about how to leverage Machine Learning to improve your products or services. Here at PredictionIO, we work with many companies on deploying their first ML systems and big data infrastructure. We have put together some good practices on data collection that we would like to share with you. If you are thinking about adopting ML down the road, collecting the right data in the right format will reduce your data-cleansing effort and minimize wasted data.

Data Collection

Collect Everything

It is important to collect all data. Until you actually train a predictive model it is very hard to know which attributes and information will have predictive value and provide the best results. If a piece of information is not collected, there is no way of retrieving it and it’s lost for eternity. The low cost of storage also enables you to collect everything related to your app, product, or service. Here are two examples:

  • In product recommendation, it is important to collect user identifiers, item (i.e., product) identifiers, and behavioral data including ratings. Other related attributes such as category, description, price, etc., can also be useful features for improving your recommendation model. Implicit behaviors, such as views, may prove more useful than explicit ratings.

  • When predicting survival of passengers on the Titanic, intuitively we know attributes such as passenger age and gender are relevant. Other attributes such as number of children aboard, fare, and cabin may or may not be useful information. It is hard to know which features will have the most predictive value until you start building a model.

Storing raw logs is a common solution; they can later be extracted, transformed, and loaded for training your machine learning models.
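As a sketch of this "collect everything" approach, the helper below appends each event as one JSON line to an append-only log file. The function name and file layout are our own illustration, not part of any PredictionIO API:

```python
import json
from datetime import datetime, timezone

def log_event(path, event, entity_type, entity_id, properties=None):
    """Append one event as a JSON line; raw logs can be ETL'd for training later."""
    record = {
        "event": event,
        "entityType": entity_type,
        "entityId": entity_id,
        "properties": properties or {},
        # Timestamp at collection time, in ISO 8601 (UTC).
        "eventTime": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Record an implicit behavior (a view), not just explicit ratings.
log_event("events.jsonl", "view", "user", "u-1", {"itemId": "i-42"})
```

Because every line is a self-describing JSON object, new attributes can be added at any time without changing the log format.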

Timestamp Every Event

It is important to timestamp every event, especially for user actions or behavioral data. Timestamps allow us to prevent look-ahead bias while constructing the machine learning model.

PredictionIO offers an Event Server that supports the best practice of collecting data in an "event-based style". This means everything is collected as an event with a timestamp, whether it is a user (e.g., "Sarah Connor"), an item (e.g., "The Terminator"), or a user-to-item action ("Sarah Connor views The Terminator").

For example, creating user Sarah Connor:

{
  "event" : "new_user",
  "entityType" : "user",
  "entityId" : "de305d54-75b4-431b-adb2-eb6b9e546013",
  "properties" : {
    "name" : "Sarah Connor",
    "age" : 19,
    "email" : "sarah.connor@sky.net",
    "gender" : "Female"
  },
  "eventTime" : "1984-10-26T21:39:45.618-07:00"
}

Notice that for entityId we use a universally unique identifier (UUID), and for eventTime we use the ISO 8601 format.
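In Python, for example, both can be generated directly from the standard library:

```python
import uuid
from datetime import datetime, timezone

# A random (version 4) UUID, e.g. "de305d54-75b4-431b-adb2-eb6b9e546013".
entity_id = str(uuid.uuid4())

# Current time in ISO 8601 with an explicit UTC offset.
event_time = datetime.now(timezone.utc).isoformat()
```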

Maintain Attribute Consistency

Use consistent attribute values. If "Female" is used as the gender value, it is better to keep the same notation over time rather than replace it with "F", "female", or "girl".

When you remove a feature, you should exclude it from your training set; you can clean the data related to the feature and re-import it. When you add a new feature, it is important to backfill the field with a default value for existing records.
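Both practices can be sketched in a few lines of Python. The normalization table and the field names here are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical normalization table: map every variant to one canonical value.
GENDER_CANONICAL = {"f": "Female", "female": "Female", "girl": "Female",
                    "m": "Male", "male": "Male", "boy": "Male"}

def normalize_gender(value):
    """Collapse inconsistent notations to a single canonical form."""
    return GENDER_CANONICAL.get(value.strip().lower(), "Unknown")

def backfill(records, field, default):
    """Add a newly introduced field to older records with a default value."""
    for r in records:
        r.setdefault(field, default)
    return records

users = [{"name": "Sarah Connor", "gender": "F"},
         {"name": "Kyle Reese", "gender": "male"}]
for u in users:
    u["gender"] = normalize_gender(u["gender"])

# A new feature added later gets a default value on pre-existing records.
backfill(users, "newsletter_opt_in", False)
```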

Avoid Serialized & Binary Fields

In the Event Server, the "properties" field accepts any free-form JSON object. For convenience, one may be tempted to store an escaped JSON string in one of the fields. However, serialization can obfuscate the data to the point where it becomes unusable. An example below:

Wrong:

{
  "event" : "new_user",
  "entityType" : "user",
  "entityId" : "de305d54-75b4-431b-adb2-eb6b9e546013",
  "properties" : {
    "name" : "Sarah Connor",
    "age" : 19,
    "email" : "sarah.connor@sky.net",
    "gender" : "Female",
    "car": "{\r\n \"make\": \"Honda\",\r\n \"model\": \"Fit\",\r\n \"trim\": \"Sport\",\r\n \"year\": 2015\r\n}"
  },
  "eventTime" : "1984-10-26T21:39:45.618-07:00"
}

Correct:

{
  "event" : "new_user",
  "entityType" : "user",
  "entityId" : "de305d54-75b4-431b-adb2-eb6b9e546013",
  "properties" : {
    "name" : "Sarah Connor",
    "age" : 19,
    "email" : "sarah.connor@sky.net",
    "gender" : "Female",
    "car": {
      "make": "Honda",
      "model": "Fit",
      "trim": "Sport",
      "year": 2015
    }
  },
  "eventTime" : "1984-10-26T21:39:45.618-07:00"
}
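The difference between the two shapes is easy to see programmatically. In the wrong form, consumers must parse the string a second time before they can query any field; in the correct form, the nested object is directly accessible:

```python
import json

car = {"make": "Honda", "model": "Fit", "trim": "Sport", "year": 2015}

# Wrong: double-encoding buries the structure inside an escaped string.
wrong = json.dumps({"car": json.dumps(car)})
# Correct: keep a nested object so consumers can query fields directly.
correct = json.dumps({"car": car})

wrong_car = json.loads(wrong)["car"]      # a str; needs a second json.loads
correct_car = json.loads(correct)["car"]  # a dict; fields are reachable
```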

A possible exception is when serialization dramatically reduces the amount of storage space. For example, you may want to store your data using Protocol Buffers and serialize it as a binary string. Doing so may save 5x the storage space, but it will make your data unparseable without the schema. Worse, if you lose your message definition file, the data is lost forever. Unless your data size is at Google or Amazon scale, it may not be worth it.

Lookup Time

Lookups can be time-consuming for large datasets. The PredictionIO Event Server indexes data by (entityId, entityType). If you want efficient lookups, choose "entityId" and "entityType" according to your access needs.

Use Queuing Service

It is recommended to use a message queuing service to pass your event data to the event store. If the event store is temporarily unavailable, messages will reside in the queue until they are processed, so no data is lost.
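The retry behavior of a queue can be sketched with Python's standard-library `queue` module. The `deliver` function below is a stand-in for a flaky event store (the failure counter is purely illustrative); events that fail delivery are re-queued rather than dropped:

```python
import queue

def deliver(event, store, failures):
    """Stand-in for an event store that is temporarily unavailable."""
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        return False  # store down; delivery failed
    store.append(event)
    return True

def drain(q, store, failures):
    """Deliver queued events; re-queue any that fail so nothing is lost."""
    while not q.empty():
        event = q.get()
        if not deliver(event, store, failures):
            q.put(event)  # keep it in the queue until the store recovers

q = queue.Queue()
for e in ["view:i1", "rate:i2", "buy:i3"]:
    q.put(e)

store, failures = [], {"remaining": 2}
drain(q, store, failures)  # first two attempts fail, yet all events arrive
```

A real deployment would use a durable broker (and a backoff policy) instead of an in-process queue, but the guarantee is the same: transient event-store outages do not lose events.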

We hope this is helpful. If you have other tips or other questions, please share them here with the community!

By Simon Chan