Resonate’s Data Science team has provided two options below to increase your match rate by extrapolating your matched records to the rest of your unmatched file. After you complete the Resonate Data Append process, your Data Science team can follow these instructions to append Resonate attributes to any unmatched records.
1. The Univariate Approach
For this method, you will take your private data and train a model to predict each of our appended attributes.Given the size of these datasets, we recommend using the pipeline API from spark with a random forest model. This Medium post has a good overview and useful code:
https://medium.com/rahasak/random-forest-classifier-with-apache-spark-c63b4a23a7cc
Alternatively, you could also use XGboost, or even a logistic regression.
The general framework here is:
- Vectorize and standardize your features for the whole dataset
- Encode the target Resonate attribute as a 1.0 or 0.0
- In the case of extremely imbalanced classes, you will want to use class weighting, up/down sampling to balance class samples, or SMOTE if your data allows it.
- Train a regularized classification algorithm with some flavor of cross validation on the hyperparameters (k-fold, hyperopt, or train-test splits)
- Apply results
2. Deep Learning Route
The more challenging path is the deep learning route, since this technology can learn to impute the entire stream of Resonate attributes at once. The Tensorflow user guide has a useful tutorial:
https://www.tensorflow.org/tutorials/generative/autoencoder#first_example_basic_autoencoder
For this approach:
- Vectorize and standardize your features for the whole dataset
- Vectorize Resonate attributes as binary features (0.0 or 1.0)
- Train a neural network with sigmoid outputs. Follow the basic autoencoder framework but do not worry about bottleneck layers.
- Apply results
Comments
0 comments
Please sign in to leave a comment.