We all want to ask as few questions as possible on our web forms. However, each question adds incremental value. How do we think through the trade-off between additional questions (which improve accuracy) and simplicity?
First, let’s illustrate the trade-off here.
We’re trying to find the optimal number of questions, where conversion is maximized, subject to some minimum level of information content.
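One way to write that objective down, as a rough sketch only. The notation is our own assumption, not something defined elsewhere in this post: Q is the full set of questions, S is the subset actually asked, C(S) is the conversion rate of a form that asks S, I(S) is the information content obtained from S (for example, the predictive lift of a scorecard fit on those answers), and I_min is the lowest information content you are willing to accept.

```latex
\max_{S \subseteq Q} \; C(S)
\quad \text{subject to} \quad
I(S) \ge I_{\min}
```

The rest of this post is about estimating I(S) for different subsets; C(S) has to come from testing the form itself.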
The first, perhaps obvious, observation is that third-party data is always a good idea: you get extra information content (for example, to customize offers and look and feel) without affecting the consumer experience.
Next, we need some way to test the information content of various subsets of the questions. Demyst.Data offers a way to do this – but the concept is pretty simple.
1. Upload your exhaustive questions, and a target variable
2. Fit some scorecard or segmentation that you’re happy with
Here’s ours. This can be thought of as the ‘taj mahal’ workflow (i.e. all questions are included).
3. Delete columns, rinse and repeat
The next step is to delete each column in turn, refit the entire scorecard, and plot the results side by side. Again, here’s one we prepared earlier.
The orange line, the baseline, is flat (clearly, if you don’t ask any questions, then predictive lift isn’t possible). The red line is what it looks like if no “Demyst” data is appended. All this means is that we’ve temporarily turned off the third-party data and refit. The “without demyst” line is almost as steep as the full ‘taj mahal’ line. In a real dataset, this might mean you wouldn’t bother buying third-party data. That’s not something we’d advocate: what’s actually happening here is that the emails are always joe or john, so it’s not surprising that it’s not adding much value.
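The delete-and-refit loop from step 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Demyst.Data implementation: the column names (`age_band`, `credit_band`, `email_name`) and the synthetic data are made up, a per-segment response rate stands in for the scorecard, and AUC on held-out rows stands in for predictive lift.

```python
import random
from collections import defaultdict


def make_rows(n, seed=0):
    """Synthetic applicants (hypothetical columns); the target depends mostly on age."""
    rng = random.Random(seed)
    base = {"18-30": 0.1, "31-50": 0.4, "51+": 0.7}
    rows = []
    for _ in range(n):
        age = rng.choice(sorted(base))
        credit = rng.choice(["low", "high"])
        email = rng.choice(["joe", "john"])  # carries no signal, as in the post
        p = base[age] + (0.1 if credit == "high" else 0.0)
        rows.append({"age_band": age, "credit_band": credit,
                     "email_name": email, "target": int(rng.random() < p)})
    return rows


def fit_segments(rows, target, columns):
    """Segmentation stand-in for a scorecard: mean target per combination of answers."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[c] for c in columns)
        totals[key][0] += row[target]
        totals[key][1] += 1
    overall = sum(r[target] for r in rows) / len(rows)
    rates = {k: s / n for k, (s, n) in totals.items()}
    # Unseen segments fall back to the overall rate.
    return lambda row: rates.get(tuple(row[c] for c in columns), overall)


def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ablation(train, test, target, columns):
    """Refit with all columns, with each column deleted in turn, and with none."""
    labels = [r[target] for r in test]

    def score(cols):
        model = fit_segments(train, target, cols)
        return auc([model(r) for r in test], labels)

    results = {"all questions": score(columns)}
    for col in columns:
        results["without " + col] = score([c for c in columns if c != col])
    results["no questions"] = score([])  # flat baseline
    return results


rows = make_rows(4000)
train, test = rows[:3000], rows[3000:]
for name, value in ablation(train, test, "target",
                            ["age_band", "credit_band", "email_name"]).items():
    print(f"{name:>22}: AUC {value:.3f}")
```

With no questions every applicant gets the same score, so the baseline AUC is exactly 0.5, which is the flat line in the chart. Any real scorecard or lift measure can be swapped in for `fit_segments` and `auc`; the loop in `ablation` is the part that matters.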
4. Keep going
There’s a near limitless number of permutations of this exercise.
Now we can see that credit and email on their own don’t add much value. Age is really the winner here, suggesting a radically simpler quoting process.
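Rather than trying the permutations from step 4 by hand, the search can be automated. The sketch below is one possible approach and rests on assumptions: `score` is a hypothetical callable that refits your model on a subset of columns and returns its lift, and the weights in the usage example are invented purely to make the snippet runnable.

```python
from itertools import combinations


def smallest_sufficient_subset(columns, score, min_score):
    """Return (score, subset) for the smallest subset reaching min_score, else None.

    Subsets are tried in order of size, so the first size with any passing
    subset wins; among those, the best-scoring one is returned.
    """
    for size in range(len(columns) + 1):
        passing = []
        for subset in combinations(columns, size):
            value = score(subset)
            if value >= min_score:
                passing.append((value, subset))
        if passing:
            return max(passing)
    return None


# Usage with invented numbers (not results from any real dataset):
toy_weights = {"age": 0.20, "credit": 0.03, "email": 0.01}


def toy_score(subset):
    # Pretend lift is a 0.5 baseline plus an additive contribution per question.
    return 0.5 + sum(toy_weights[c] for c in subset)


print(smallest_sufficient_subset(list(toy_weights), toy_score, min_score=0.68))
```

The number of subsets doubles with every extra question, so an exhaustive search like this only suits short forms. For longer ones, a greedy search that drops the least useful question at each round is a common compromise.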
We don’t have the full picture yet, since we don’t know if that reduction in lift is compensated by a corresponding lift in conversion thanks to a simpler workflow. That’s a topic for another post.