Hi all – we’ve switched to our new blog URL at:
This site will be closed in a few days.
If you’ve ever read our blog or navigated our site, you’ve likely seen the phrase, ‘removing information asymmetries’. If you’ve sat through a meeting with us, you’ve been lectured on how data transparency can benefit the consumer. Let me try to connect the dots.
Asymmetric information refers to a situation in which one party to a transaction has more, or better, information than the other. Economist George Akerlof publicized the problem in his 1970 paper on the ‘market for lemons’ in the used car industry. Because a buyer generally cannot ascertain the value of any given vehicle, he or she is only willing to pay an average price for it. The owners of good cars reject that average price and withdraw from the market, the mix of cars on offer worsens, and buyers adjust the price they are willing to pay downward again. (As a quick illustration: if half the cars are worth $10,000 and half are worth $4,000, buyers offer around $7,000; owners of the $10,000 cars walk away, and the offer price collapses toward $4,000.) In the end the average price is never even offered; only the ‘lemon’ price is. Effectively, the ‘bad’ drive the ‘good’ out of the market.
A similar situation occurs in the credit markets. Consider a lender faced with uncertainty about the creditworthiness of a group of borrowers. Having to account for the bad risks, lenders are pushed to charge artificially high interest rates to cross-subsidize that risk. Recognizing this, and unwilling to borrow at usurious rates, the creditworthy subset of borrowers removes itself from the credit markets. As above, the ‘bad’ have driven out the ‘good’.
This inefficient cross-subsidization of risk affects a large portion of the multi-trillion-dollar financial services market, and removing it will yield enormous value in the coming years. The availability of information is paramount to realizing this value. Fortunately, data today is being created at an unprecedented rate.
At Demyst.Data, we are constructing the infrastructure and mechanisms to aggregate and analyze this data. Our clients are working to engage consumers to share their information and to educate them on the benefits of transparency. Together, we are removing the asymmetries in order to draw the ‘good’ borrowers back to the market and to help lenders make educated lending decisions. We believe we’re engaged in a win/win game; hence our passion, excitement, and enthusiasm about the potential value of improved information.
As we’ve added hundreds of interesting online attributes, we’ve been hitting some performance bottlenecks when processing larger batch datasets. Thankfully this hasn’t been an issue for customers, and it doesn’t affect our real-time APIs, but it’s still frustrating. I had a spare day, so it felt like time for a performance boost.
Here’s the executive summary:
To start, I spun up a deliberately small, cut-down test server, set up a reasonably complex API, and used the great tools at http://blitz.io to rush the API with hundreds of concurrent requests.
That spike at the start was a concern, even though it is on a small server.
All the CPU usage was in Rack/Passenger, so I dusted off the profiler. Thread contention was getting in the way. We need threads because we integrate with so many third-party APIs. We were still on REE for its memory management, but REE uses green threads, so it was time to bite the bullet and update to Ruby 1.9.x.
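For context, here is a minimal sketch of the kind of concurrent third-party lookups we mean; the URLs are placeholders rather than our actual data sources:

require 'net/http'
require 'uri'

# Placeholder third-party endpoints; in reality these are the external data APIs we enrich with.
urls = %w[http://example.com/api/a http://example.com/api/b http://example.com/api/c]

# Each lookup runs in its own thread so one slow responder doesn't block the others.
threads = urls.map do |url|
  Thread.new { Net::HTTP.get_response(URI.parse(url)).body }
end

responses = threads.map(&:value)   # wait for every thread and collect the response bodies
puts responses.map(&:length).inspect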
The Ruby upgrade helped a fair amount, but we were still getting timeouts.
So we re-ran the profile and noticed a surprising amount of time spent in ActiveRecord associations, in one particular ActiveRecord query, and in a separate MongoDB query. This led to a few things …
2. We didn’t dig into why, but mymodel.relatedmodel.create :param => X was causing some painful slowness in the association code. The syntactic sugar wasn’t that important to keep; switching to Relatedmodel.create :mymodel_id => mymodel.id, :param => X saved a bunch.
3. We added a couple of database indexes through ActiveRecord migrations, which helped a bit. The MongoDB indexes were working a charm, but there was one particular group of three independent indexes that were always used in conjunction, and the Mongo profiler was revealing nscanned values of >10000 for some queries. Creating a combined index helped a lot (sketched below). Another couple of examples that remind us that, while ORMs are nice, you can never forget there’s a database sitting under all of this.
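Here are the two changes as rough fragments; the model, table, and field names are illustrative rather than our actual schema, and the compound-index call uses the 1.x Ruby Mongo driver syntax:

# 1. Create directly on the related model, passing the foreign key,
#    instead of creating through the association:
#      mymodel.relatedmodel.create :param => X        # before: slow association code
Relatedmodel.create :mymodel_id => mymodel.id, :param => X

# 2. In an ActiveRecord migration, index the columns actually used for lookups:
add_index :related_models, [:mymodel_id, :created_at]

# 3. In MongoDB, replace three independent indexes with one compound index
#    over the fields that are always queried together:
collection.create_index([['customer_id', Mongo::ASCENDING],
                         ['source',      Mongo::ASCENDING],
                         ['created_at',  Mongo::ASCENDING]])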
The result?
And no timeouts until about 150 concurrent hits.
The performance was already plenty for our production system (we automatically scale horizontally as needed), but this helped improve things by about 2-3x.
That’s enough for today. We’ll share some more details on performance benchmarks in the coming weeks.
Any other thoughts from the community? Please email me (mhookey at demystdata dot com).
Facebook’s documentation on authentication and the Graph API is very comprehensive … but sometimes a worked example still helps. Here is how you can add a “Connect with Facebook” button with minimal effort, using Rails, CoffeeScript, and Ruby.
You need to register your app with Facebook if you haven’t already.
From here, look under authentication and copy/paste into application.js and/or /layouts/application.html.erb. Add this line to the script to make sure the async loading works:
window.setup();
Then add the login button to your view:
<div class="field">
  <fb:login-button size="large">Connect to Facebook</fb:login-button>
</div>
For example, if you want to access the logged-in customer’s profile to customize the page, you might do something like this in CoffeeScript:
$ ->
  window.setup()

window.setup = ->
  window.FB.Event.subscribe('auth.login', -> do_something()) if window.FB?

do_something = ->
  console.log "doing something ..."
  window.FB.getLoginStatus (authtoken) ->
    if authtoken.authResponse
      window.FB.api '/me', (fbdata) ->
        console.log "FB name : #{fbdata['name']}"
        # Add interesting personalization logic here
… and you’re ready to go
We have a white-labelled offering where we can host this for you and return the data through painless APIs, in case you’re looking to get up and running even faster. Email us and let us know what you’re working on.
Just a technical FYI; this took a little digging to find in the documentation.
If you’re using Curl::Easy in Ruby to download HTML (or results from our API), be aware that the default is NOT to follow redirects. If you want to follow redirects and download the page contents of the target, you’ll need to set the option easy.follow_location to true.
Here’s a code snippet:
require 'curb'
require 'iconv'

def download_url(url)
  res = ""
  tries = 0
  begin
    tries += 1
    easy = Curl::Easy.new
    easy.timeout = 30
    easy.follow_location = true   # follow redirects to the final target
    easy.url = url
    easy.perform
    res = easy.body_str
  rescue Exception => e
    retry unless tries > 2        # retry a couple of times before giving up
    puts "#{url} failed, returning empty string, #{e.message}"
  end
  # Strip any invalid UTF-8 byte sequences before returning
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  res = ic.iconv(res + ' ')[0..-2]
  res.downcase
end
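Usage is then a one-liner (the URL is just a placeholder):

page = download_url("http://www.example.com/some/redirecting/path")
puts page[0..200]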
The simplest starting point for any user is the basic Aggregation API, where you can pull together the best available customer data based on minimal inputs. It is aimed at more advanced users who want to build their own analytics.
Are you looking for Yahoo or Google data? Geolocation data? Or demographics by email? This Aggregation API may be a great way to start.
This was always available, but we’ve now given it the pride of place that it deserves and a permanent, static endpoint (/engine/raw).
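As a rough illustration, a call might look something like the sketch below; the host, the api_key parameter, and the input names are assumptions for the sake of the example, so check the API documentation for the real contract:

require 'curb'
require 'json'

# Hypothetical request: parameter names and host are placeholders, not the documented interface.
easy = Curl::Easy.new
easy.follow_location = true
easy.url = "https://beta.demystdata.com/engine/raw?api_key=YOUR_KEY&email=jane%40example.com&ip_address=203.0.113.10"
easy.perform

attributes = JSON.parse(easy.body_str)   # the appended third-party attributes
puts attributes.keys.first(10)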
Something we’re proud of here at Demyst.Data is the ability to create APIs based on minimal input variables. The most common use case we come across is offer targeting. In short, this means you can guide your visiting customers towards the most relevant products by predicting something (such as conversion likelihood) based solely on their IP.
To learn more, see our ‘how-to’ : https://beta.demystdata.com/info/use_offer
Based on client requests, we’ve been hard at work tapping into additional useful variables. A quick update on highlighted additions:
We’re always open to special requests, so please let us know if you have an upcoming project that requires integrating with better web data, segmentation, or predictive analytics but you haven’t quite figured out how to apply the demyst.data toolkit.
We’re pleased to announce the release of what-if (scenario testing) functionality for each API, all included within the base package.
This allows you to perform scenario testing on your underlying API. For example, if you build a conversion API where product offered is a variable, it can be nice to test the impact of changes to product offers. This is now possible:
Be warned, though: if you want to draw strong conclusions from this analysis, you are predicting a counterfactual scenario. To do that with the most confidence, statistical purists would argue that you need a randomized experiment (in this case, randomized in the product offer variable). Even if you don’t have one, our modeling approach brings in as much third-party data as possible to remove the biases inherent in a historical analysis, so it can still suggest where the low-hanging fruit might be.
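To make the idea concrete, here is a rough sketch of a what-if sweep; the score_api helper, the offer names, and the field names are hypothetical stand-ins, not the actual product interface:

# score_api is a stub standing in for a call to your deployed conversion API.
def score_api(inputs)
  { "basic" => 0.04, "plus" => 0.06, "premium" => 0.05 }[inputs[:product_offered]]
end

# Hold the other inputs fixed and vary the product offer to see how the prediction moves.
["basic", "plus", "premium"].each do |offer|
  score = score_api(:email => "jane@example.com", :product_offered => offer)
  puts "#{offer}: predicted conversion #{score}"
end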
Try it out, under ‘what-if’ on the left hand side.
In our continued effort to demystify data, we’ve recently published our available attributes, which clarifies which inputs are required for each attribute. We’re continually updating this list, so please let us know if you have any suggested additions.
Occasionally it can be nice to avoid using (or even seeing) particular attributes. We’ve recently added support for this too, within the Account page. Just enter a comma-separated list of values, and when third-party data is being appended, any fields with names including this text will be skipped.
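Conceptually, the filtering works something like the sketch below; the blacklist and the field names are made up, and the real logic lives server-side:

# The comma-separated list as entered on the Account page (values are illustrative).
blacklist = "ssn, income".split(",").map(&:strip)

appended = { "email_age" => 4, "household_income" => 55000, "twitter_followers" => 120 }

# Skip any appended field whose name includes a blacklisted term.
filtered = appended.reject { |name, _| blacklist.any? { |term| name.include?(term) } }
puts filtered.keys.inspect    # => ["email_age", "twitter_followers"]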
If you have any questions or suggestions, please let us know.
We all want to ask as few questions as possible on our web forms. However, each question adds incremental value. How do we think through the trade-off between additional questions (and the accuracy they bring) and simplicity?
First, let’s illustrate the tradeoff here.
We’re trying to find the optimal number of questions, where conversion is maximized, subject to some minimum level of information content.
The first, perhaps obvious, observation is that third party data is always a good idea. You get extra information content, for example to customize offers and look and feel, without impacting the consumer experience.
Next, we need some way to test the information content of various subsets of the questions. Demyst.Data offers a way to do this – but the concept is pretty simple.
1. Upload your exhaustive questions, and a target variable
2. Fit some scorecard or segmentation that you’re happy with
Here’s ours. This can be thought of as the ‘taj mahal’ workflow (i.e. all questions are included).
3. Delete columns, rinse and repeat
The next step is to delete each column in turn, refit the entire scorecard, and plot the results side by side (the loop is sketched in code further below). Again, here’s one we prepared earlier.
The orange line, the baseline, is flat (clearly, if you don’t ask any questions then predictive lift isn’t possible). The red line is what it looks like if no “Demyst” data is appended; all this means is that we’ve temporarily turned off the third party data and refit. The “without Demyst” line is almost as steep as the full ‘taj mahal’ line. In a real dataset, this might mean you wouldn’t bother buying third party data (not something we’d advocate – what’s actually happening here is that the emails are always joe or john, so it’s not surprising they add little value).
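In pseudo-Ruby, the refit loop behind these comparisons looks something like this; fit_scorecard and predictive_lift are hypothetical stand-ins for your real modeling calls, stubbed out so the sketch runs end to end:

# Hypothetical stubs; swap in your actual modeling tooling.
def fit_scorecard(data, columns)
  { :columns => columns }                      # stub "model"
end

def predictive_lift(model)
  model[:columns].size * 0.1                   # stub lift metric
end

data     = []                                  # your uploaded sample
columns  = [:email, :age, :credit_score, :income]
baseline = predictive_lift(fit_scorecard(data, columns))

# Drop one column at a time, refit, and compare lift to the full model.
columns.each do |col|
  lift = predictive_lift(fit_scorecard(data, columns - [col]))
  puts "without #{col}: lift drops by #{(baseline - lift).round(3)}"
end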
4. Keep going
There’s a near limitless number of permutations of this exercise.
Now we can see that credit and email on their own don’t add much value. Age is really the winner here, suggesting a radically simpler quoting process.
We don’t have the full picture yet, since we don’t know if that reduction in lift is compensated by a corresponding lift in conversion thanks to a simpler workflow. That’s a topic for another post.
Thank you to all who have registered for a private beta trial. We’re thrilled with the number of requests and will continue to open up spots daily. For those of you who have already signed up and tested the tool, please, don’t be shy, send us your reactions. We need real user feedback so we can perfect the experience and continue to meet our clients’ needs better.
In line with some of the input from early adopters, we’re excited to announce a new addition to our team, Bryan Connor, a UX and data visualization expert who is putting in countless hours to make the product as user friendly and intuitive as possible. You can sample some of Bryan’s prior work at http://dribbble.com/bryanconnor, and I’ve included an initial iteration of the tool below (or sign up here for a beta trial – the new design is in place!).
In other updates, our engineers continue to enhance the modeling techniques, access new data sources, and perfect the outputs to improve on some of the results of our early pilots. And of course, we’re working on our 7-minute demo for Finovate and getting our travel plans in place. Hit us up, we’re seeing up to 40% lift in predicting default versus the status quo. Let us help you grow!
“We’re really excited to have DeMyst Data demoing their innovative new solution at FinovateFall. We think the audience will find their new solution for helping lenders with segmentation and offer customization via alternative data sources very interesting.” ~Eric Mattson, CEO of Finovate
For those of you unfamiliar with Finovate, it is “the conference” for showcasing innovations in the fields of banking and financial technology. On stage, we’ll publicly launch the tool and demo some of our initial results with real client data. Our focus will be on exposing lenders to the rich segmentation we are able to create with minimal customer inputs and illustrating how the outputs can be used to customize offers for thin file consumers.
We’ll be in NY (and traveling around the US) for a few weeks leading up to the conference and look forward to re-connecting with many of you and meeting others for the first time. Drop us a note, we’d love to share some results and discuss how the product can help you grow!
We’re just at the tail end of shifting our infrastructure from a static VPS to Amazon EC2. Someone much smarter than I am designed a server self-configuration process. Here’s what we came up with.
The final product
Here’s all it takes for us to spin up a fully functioning web node now:
How it works:
We’re opening some private beta trials of the DeMyst API, a tool that leverages rich public data sources (think digital footprint, telecommunications usage, and much more) to predict risk or conversion when there is minimal consumer information available. The tool was initially targeted at lenders, but a few client meetings quickly revealed that it had broad applications to anyone working with ‘thin file’ customers. So, like any good group of bootstrapping engineers, we iterated a bit, and hence the latest version of the DeMyst API, a tool capable of predicting anything, was born.
How does it work?
The tool excels at predicting a ‘target’ with minimal inputs/identifiers. Of course, the more identifiers the better, but the reality is that there is a wealth of rich data out there in the public domain that, with a bit of aggregation, some pretty geeky analytics, and kickass technology, allows us to produce either a standalone prediction or complementary attributes for your own scorecards. The UX is painless: just upload a decent-sized sample containing whatever identifiers are available (it even works with just email!), we’ll append nifty third party data, exert our ‘muscle’, and within minutes produce an API with a custom prediction. Yes, minutes.
Sounds cool, prove it:
From a proud founder’s perspective, we’ve been pretty blown away by how much lift is being created by the toolkit. We can boast a >90% hit rate and significant growth improvements. To prove it, we’re offering a few private beta spots so you can test for yourself.
What’s the catch?
There are two:
To show our appreciation, we promise to provide attentive support and assistance, some free consulting help, and a certain level of exclusivity as a lighthouse customer. This of course all comes free with your early access to a slick new tool that could massively increase your distribution without impacting your current risk level.
Early Adopter?
To reserve your spot, click here.
It was Mark Twain who popularised it, but the original authorship of the oft-quoted phrase “lies, damned lies and statistics” is widely contested. One to whom it is frequently attributed is Benjamin Disraeli, a distinguished Conservative politician and literary figure whose business ventures are deservedly less celebrated.
His speculative investments in South American mining companies in the early nineteenth century proved calamitous and almost ruined him. One wonders whether, had he taken a more considered view of the power of information than is implied by the phrase with which he is sometimes associated, he might have avoided the pitfalls of reckless investment.
While the world, and particularly the ethereal world, is awash with data (and indeed statistics), it is alarming how infrequently that data is converted to useful information. At a time when data is generated and captured at an unprecedented rate and indeed has become inordinately accessible, it is ironic that we remain so beholden to the spinmeisters and their political masters. The power of information has never been more readily, tantalisingly, at our fingertips but somehow we don’t reach out and grasp it.
At Silne we have a healthy disregard for what we call information asymmetries. In equity markets information asymmetries are said to be removed through the trading activities of arbitrageurs. When I trade on the basis of closely held information I essentially expose that information to the world. In the meantime of course, I make money. Information asymmetries then confer power on the holder of information, or serve to diminish the interests of those without access to it. That’s not fair and we don’t like it.
We define information asymmetries rather broadly … information is available but is not being used; you have information but I don’t; information exists but I don’t know it does. Finding relevant, predictive data, sifting and analysing it, and using it to solve problems and improve decision making is not easy, but it can be a route to the truth, not the damned lies which Disraeli so lamented.
In Rails, it’s all too easy to forget that ActiveRecord models sit on top of a database. Don’t.
We tackle big data problems, and queries are usually the performance bottleneck. Here are a couple of simple tips for optimizing Rails code without resorting to custom SQL, with a few illustrative examples sketched below (apologies if some of these seem too obvious to mention, but they can be quite common issues when tackling thousands of records):
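The examples below are generic illustrations of this kind of optimization rather than a definitive list; the Customer model and its columns are hypothetical:

# Eager-load associations to avoid N+1 queries.
Customer.includes(:orders).where(:region => "US").each do |customer|
  puts customer.orders.size
end

# Iterate large tables in batches rather than loading every row into memory at once.
Customer.find_each(:batch_size => 1000) do |customer|
  customer.update_attribute(:segment, "thin_file") if customer.email.blank?
end

# Push simple bulk updates down to the database in a single statement.
Customer.where(:segment => nil).update_all(:segment => "unknown")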
These kinds of changes retain the syntactic sugar of Rails, which helps with maintenance, readability, and security. They also mean less overhead when switching backend databases. Finally, it’s more fun.
That said, here at Silne we’re crunching some large data on the backend, so there are times when ActiveRecord just won’t cut it and you need to optimize your data manipulations directly … but don’t discount the flexibility of ActiveRecord.