We can predict anything!

July 12, 2011

What?

We’re opening some private beta trials of the DeMyst API, a tool that draws on rich public data sources (think digital footprint, telecommunications usage, and much more) to predict risk or conversion when minimal consumer information is available. Initially targeted at lenders, a few client meetings quickly revealed that our tool had broad applications for anyone working with ‘thin file’ customers. So, like any good group of bootstrapping engineers, we iterated a bit, and the latest version of the DeMyst API, a tool capable of predicting anything, was born.

How does it work? 

The tool excels at predicting a ‘target’ with minimal inputs/identifiers. Of course, the more identifiers the better, but the reality is that there is a wealth of rich data in the public domain which, with a bit of aggregation, some pretty geeky analytics, and kickass technology, allows us to produce either a standalone prediction or complementary attributes for your own scorecards. The UX is painless: just upload a decent-sized sample containing whatever identifiers are available (it even works with just an email address!), we’ll append nifty third-party data, exert our ‘muscle’, and within minutes produce an API with a custom prediction. Yes, minutes.

Sounds cool, prove it: 

From a proud founder’s perspective, we’ve been pretty blown away by how much lift the toolkit creates. We can boast a >90% hit rate and significant growth improvements. To prove it, we’re offering a few private beta spots so you can test it for yourself.

What’s the catch? 

There are two:

  • You must be brutally honest and willing to provide us with feedback to help us refine the product.
  • You must have a genuine interest in commercially using the product if you like it (at a preferred rate of course).

To show our appreciation, we promise to provide attentive support and assistance, some free consulting help, and a certain level of exclusivity as a lighthouse customer.  This of course all comes free with your early access to a slick new tool that could massively increase your distribution without impacting your current risk level.

Early Adopter?

To reserve your spot, click here.


Rating agencies deserve more attention?

September 1, 2010

It seems Moody’s, for its part, is out of reach of the SEC


Without wanting to rehash years of analysis of the financial crisis, there’s something a bit odd about this discussion and the rating agencies’ role in the financial markets. They’re alleged to have made mistakes in the credit models that investors rely on … but why are investors relying on them? Can investors really shirk this responsibility?

It’s particularly related to their private information, and how debt investors use them as an information shortcut.

Consider equity investments … there are strict accounting and disclosure rules to (attempt to) ensure investors are on an equal footing, and management is very careful not to create opportunities for insider trading. The story is different when firms issue debt, as there are no such constraints. Management wants a bond issue. They woo the rating agencies with pitch books filled with inside information in order to achieve the highest possible rating. The rating agencies digest this and in turn produce a rating reflecting the estimated risk of default … and this is all perfectly legit.

Why, when we consider this from first principles, should they play this role? Are investors not able to digest this information themselves? Is the risk of debt really that much harder to judge than equity? Through this process the system may place too much faith in the modeling skills of three private firms … so it’s not surprising that CDOs were misclassified.

Accounting standards are there for a reason: to create information symmetry among market participants. Debt in general seems fraught with information asymmetry. Examples of firms with legitimate private information (versus public filings) include:

  • Rating agencies (who transmit information via ratings)
  • Credit default swap issuers (who transmit information via prices)
  • Junk bond investors (who don’t transmit the information)

This may not be a bad thing … from a market efficiency standpoint you could perhaps argue that any firm is free to participate in these markets … but in light of the regulatory pressure to push responsibility back to bank shareholders (and away from government), why not consider the same in debt markets?

Unbiased estimator variable selection

July 16, 2010

We’re continuing to test the tendency of modeling processes to overfit, per an earlier post.

The issue with this approach is that, in most practical settings, variables are either in or out, based on some variable selection process.


When a variable is “in”, the parameter is typically a best unbiased estimator … in plain speak this means that if the average in the sample data is 0.3, the parameter will be such that the model predicts 0.3.


This is why, with sparsely populated variables, there is such a risk of overfitting when including too many variables – the model will fit to the sample noise.
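To make the sparse-variable risk concrete, here is a small sketch in plain Python (the data and setup are hypothetical, not from the post): two levels of a categorical variable share the same true event rate, but the rarely populated level’s fitted cell mean, i.e. the unbiased estimator, simply reproduces whatever noise landed in its few observations.

```python
import random

random.seed(1)
true_rate = 0.3                        # both levels share the same true event rate
counts = {"common": 5000, "rare": 8}   # observations per categorical level

# The "best unbiased estimator" for a categorical cell is the cell's sample mean
fitted = {}
for level, n in counts.items():
    events = sum(random.random() < true_rate for _ in range(n))
    fitted[level] = events / n

print(fitted)
# The common level's estimate lands near 0.3; the rare level's estimate can
# only be a multiple of 1/8, so it may sit far from 0.3 -- the model has
# "fit" the noise in those 8 observations.
```

With only 8 observations, roughly a third of random samples put the rare level’s estimate outside 0.15–0.45, which is exactly the overfitting risk described above.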

Stepwise overfitting example

July 7, 2010

We’ve been generating data in order to test modeling techniques on sparse, high-dimensional data. In a simple example, in which categorical variables are created with levels sampled from a random uniform distribution, the effect of overfitting is significant. Here a stepwise AIC method was used to select variables, then ROC curves were produced to compare fit quality in and out of sample. Here is the result:

This is a well-known effect – stepwise models tend to overfit. Harrell is famous for his diatribe against this, a useful summary of which is here:

However, in practice these warnings are rarely heeded (whether through automated stepwise or “human stepwise” – i.e. not leveraging domain expertise when choosing between models, but just blindly using quality-of-fit measures).

This is serving as a useful benchmark against other related modeling techniques.
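The original simulation was run in R (stepwise AIC plus ROC curves); as a rough sketch of the same effect in Python/NumPy, with purely synthetic data and a setup of our own rather than the original experiment: generate fifty binary features with no relationship to the target, run greedy forward stepwise selection on Gaussian AIC, and compare fit in sample against a fresh holdout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.integers(0, 2, size=(n, p)).astype(float)  # random binary features
y = rng.normal(size=n)                             # target is pure noise, unrelated to X

def design(Xm, cols):
    # intercept plus the selected columns
    return np.column_stack([np.ones(len(Xm))] + [Xm[:, j] for j in cols])

def aic(cols):
    # Gaussian AIC up to a constant: n*log(RSS/n) + 2k
    A = design(X, cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * A.shape[1]

# greedy forward stepwise: keep adding whichever feature lowers AIC most
selected = []
current = aic(selected)
while True:
    scores = {j: aic(selected + [j]) for j in range(p) if j not in selected}
    if not scores:
        break
    best = min(scores, key=scores.get)
    if scores[best] >= current:
        break
    selected.append(best)
    current = scores[best]

# refit on the selected features, then score in sample and on a fresh holdout
beta, *_ = np.linalg.lstsq(design(X, selected), y, rcond=None)

def r_squared(Xm, ym):
    pred = design(Xm, selected) @ beta
    return 1 - np.sum((ym - pred) ** 2) / np.sum((ym - ym.mean()) ** 2)

r2_in = r_squared(X, y)
X_new = rng.integers(0, 2, size=(n, p)).astype(float)
y_new = rng.normal(size=n)
r2_out = r_squared(X_new, y_new)
print(f"selected {len(selected)} of {p} useless features; "
      f"in-sample R^2 = {r2_in:.3f}, holdout R^2 = {r2_out:.3f}")
```

Even though every feature is noise, stepwise selects several of them and reports a positive in-sample fit, while the holdout fit collapses – the same in/out-of-sample gap the ROC comparison above illustrates.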

Growing data volumes

June 25, 2010

It’s self-evident that data volumes are growing exponentially … see this chart from The Economist a while back:

What’s less frequently discussed is the enabling software that is making it practical to incorporate all this data into business decisions. This includes Excel 2007+, which raised the row limit from 2^16 to 2^20 rows; faster machines, including 64-bit systems free of the 4 GB Windows memory limitation; and growing comfort with tools like SAS/SQL/Emblem that can tackle large models. As a result the “average” user within a corporation can now crunch these larger datasets.

It’s human nature to seek explanations of things; however, just because the technology can handle the data and the modeling tool says “fit converged”, one must retain some statistical caution and common sense. I’ve anecdotally seen many occasions on which users massively overfit simply because the tools allow them to – put another way, Occam’s razor used to be enforced not by good sense but by technical limitation. I wonder what will limit us in the future?

Common R modeling commands

June 22, 2010

A short post on basic R syntax (the formula and data frame below are placeholders – substitute your own, e.g. y ~ x1 + x2 and df):

Logistic regression

fit <- glm(y ~ x1 + x2, data = df, family = binomial)

y_hat <- predict(fit, type = "response")

Penalized logistic regression (lrm from the rms package)

library(rms)
fit <- lrm(y ~ x1 + x2, data = df, penalty = 1)

Classification tree (ctree from the party package)

library(party)
fit <- ctree(y ~ x1 + x2, data = df)

Bias-reduction logistic regression (Firth’s method, from the logistf package)

library(logistf)
fit <- logistf(y ~ x1 + x2, data = df)