Deprecated blog

June 9, 2012

Hi all – we’ve switched to our new blog URL:

http://demystdata.wordpress.com

This site will be closed in a few days.


Singapore bound…

April 20, 2012

 

Innotribe Startup Challenge

Matteo Rizzi, Innovation Manager, Innotribe, says “I’m delighted to announce DemystData as a semi-finalist and look forward to discovering more about the business. This year’s semi-finalists have assessed the developments and trends in the region and have identified opportunities in the market. The entrants have each demonstrated a forward-thinking and innovative approach to the financial sector and have developed start-up businesses which could have profound impacts on the future of the industry. I’m extremely excited to give DemystData the opportunity to pitch its ideas to some of the top decision makers in the industry”.

Enough said! Looking forward to seeing everyone in Singapore on April 24th.

About Innotribe

Launched in 2009, Innotribe is SWIFT’s initiative to enable collaborative innovation in financial services. Innotribe presents an energising mix of education, new perspectives, collaboration, facilitation and incubation to professionals and entrepreneurs who are willing to drive change within their industry. It fosters creative thinking in financial services by debating the options (at Innotribe events) and supporting the creation of innovative new solutions (through the Incubator, the Startup Challenge and Proofs of Concept (POCs)). Through this approach, the Innotribe team at SWIFT provides a platform that enables innovation across SWIFT and the financial community. For more information, please visit http://www.innotribe.com/.


Your data…Your asset…

March 7, 2012


A month ago, the New York Times published an opinion piece entitled ‘Facebook is Using You.’ Effectively, it argued that the use of aggregated online data is an invasion of privacy and that a person’s online profile and/or behavior potentially paints an inaccurate picture of who they actually are. At one extreme, yes, I agree: there is much room in today’s society for marketers, health care providers, financial service firms, insurers, etc. to misuse a person’s data based on their search habits or the types of websites they visit. On the flip side, I would suspect that nine times out of ten there is some correlation between a user’s ‘web data’ and who they actually are. In fact, I’d be willing to wager that for a large portion of the world, someone’s online profile is actually a more holistic representation of their character than may be found in more antiquated reputation databases. I also think it’s important to distinguish between data that is self-reported, e.g. that which a user enters or provides on sites such as Facebook or LinkedIn when creating a profile, or on Foursquare when checking in at a location, and that which is ‘mined’ online through the use of cookies and/or other tracking mechanisms. For context, at Demyst.Data we focus on the former, and only that which is publicly available, and on the application of such data solely for the benefit of the consumer.

It is our opinion that the ability to effectively access, analyze and deploy a person’s data creates an invariably better customer experience for the ‘goods’ of the world. Online data provides many who otherwise would be considered ‘off the grid’ (think youths, immigrants, the un-banked and under-banked) with a mechanism to establish an asset and a dossier from which reputation-laden industries can make informed decisions about them. Without this profile, they are essentially invisible, with no access to relevant offers, no access to fair credit, and, probably most importantly, no mechanism to transform and transition to being ‘on the grid’.

Curious about the information that is publicly available on you?  Look yourself up for a sampling.  If you don’t like what you see or feel as if your online footprint is not actually representative of the information you have provided to some of our partner sites, you can always opt out of our database by clicking here.


Why data transparency is good for ‘rejected’ customers.

January 8, 2012

 

If you’ve ever read our blog or navigated our site, you’ve likely seen the phrase, ‘removing information asymmetries’.  If you’ve sat through a meeting with us, you’ve been lectured on how data transparency can benefit the consumer.  Let me try to connect the dots.

Asymmetric information refers to a situation in which one party in a transaction has more or better information than the other. Economist George Akerlof publicized the problems of asymmetric information in his 1970 paper on the ‘market for lemons’ in the used car industry. He explained that because a buyer cannot generally ascertain the value of a vehicle accurately, he or she is willing to pay only an average price for it. Knowing in advance that the ‘good’ sellers will reject this average price, the buyer adjusts downward the price they are willing to pay to account for the lemons that remain. In the end, the average price isn’t even offered; only the ‘lemon’ price is. Effectively, the ‘bad’ drive the ‘good’ out of the market.


A similar situation occurs in the credit markets. Consider a lender faced with uncertainty about the creditworthiness of a group of borrowers. Having to account for the bad risks, lenders are pushed to charge artificially high interest rates to cross-subsidize that risk. Recognizing this, and unwilling to borrow at usurious rates, the creditworthy subset of borrowers remove themselves from the credit markets. As above, the ‘bad’ have driven out the ‘good.’

This inefficient cross-subsidization of risk affects a large portion of the multi-trillion-dollar financial services market, and removing it will yield huge value in the coming years. The availability of information is paramount to realizing this value. Fortunately, data today is being created at an unprecedented rate.

At Demyst.Data, we are constructing the infrastructure and mechanisms to aggregate and analyze this data. Our clients are working to engage consumers to share their information and to educate them on the benefits of transparency. Together, we are removing the asymmetries needed to draw the ‘goods’ back to the market and to help lenders make educated lending decisions. We believe we’re engaged in a win/win game; hence our passion, excitement, and enthusiasm about the potential value of improved information.


Performance tuning mongodb on ruby

November 26, 2011

As we’ve added hundreds of interesting online attributes, we’ve been hitting some performance bottlenecks when processing larger, batch datasets. This hasn’t been an issue for customers thankfully, and it doesn’t affect our realtime APIs, but it’s still frustrating. I had a spare day, so it felt like time for a performance boost.

Here’s the executive summary:

  1. If you use threads, upgrade to ruby 1.9.X
  2. Index index index
  3. Don’t forget that rails sits on a database

To start, I spun up a deliberately small, cut-down test server, set up a reasonably complex API, and used the great tools at http://blitz.io to rush the API with hundreds of concurrent requests.

That spike at the start was a concern, even though it is on a small server.

All the CPU usage was in rack/passenger, so I dusted off the profiler. Thread contention was getting in the way. We need threads because we integrate with so many third party APIs. We were still on REE for its memory management, but REE uses green threads, so it was time to bite the bullet and (1) upgrade to ruby 1.9.X.
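
For a flavour of why threads matter so much here, consider a minimal sketch of the fan-out pattern we rely on (the provider URLs are hypothetical placeholders): each lookup blocks on network I/O, and ruby 1.9’s native threads played much more nicely with those blocking calls than REE’s green threads did.

  require 'net/http'
  require 'uri'

  # Hypothetical third-party endpoints; in practice these are our data providers.
  urls = %w[
    http://api.provider-one.example/lookup?email=joe@example.com
    http://api.provider-two.example/lookup?email=joe@example.com
  ]

  # Fan out one thread per lookup; each blocks on network I/O, not CPU.
  threads = urls.map do |u|
    Thread.new { Net::HTTP.get_response(URI.parse(u)) }
  end

  # Thread#value joins the thread and returns the block's result.
  responses = threads.map(&:value)
  responses.each { |r| puts "#{r.code} #{r.body.to_s[0, 60]}" }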

That helped a fair amount, but we were still getting the timeouts.

So we re-ran the profiler and noticed a suspicious amount of time in activerecord associations, one particular activerecord query, and a different mongodb query. This led to a few things …

2. We didn’t dig into why, but mymodel.relatedmodel.create :param => X was causing some painful slowness in the association code. It wasn’t that important to keep the syntactic sugar; switching to Relatedmodel.create :mymodel_id => mymodel.id, :param => X saved a bunch (a quick sketch of the change follows).
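
A minimal sketch of that change, with hypothetical model names standing in for ours:

  # Before: create through the association proxy (this is what showed up in the profile).
  # my_model.related_models.create(:param => x)

  # After: create the child directly and set the foreign key explicitly.
  # The same row is inserted, but the association bookkeeping is skipped.
  RelatedModel.create(:my_model_id => my_model.id, :param => x)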

3. We added a couple of activerecord indexes, which helped a bit. MongoDB indexes were working a charm, but there was one particular group of 3 independent indexes that were always used in conjunction, and the mongo profiler was revealing nscanned of >10000 for some queries. Creating a combined (compound) index helped a lot (a sketch follows). Another couple of examples that remind us that, while ORMs are nice, you can never forget there’s a database sitting under all of this.
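
For reference, here’s roughly what creating a compound index looks like with the 1.x-era Ruby mongo driver we were using at the time; the collection and field names are hypothetical:

  require 'mongo'

  db   = Mongo::Connection.new('localhost', 27017).db('demyst_example')
  coll = db.collection('profiles')

  # Replace three separate single-field indexes with one compound index
  # that matches how the queries actually filter.
  coll.ensure_index([
    ['email',      Mongo::ASCENDING],
    ['source',     Mongo::ASCENDING],
    ['created_at', Mongo::DESCENDING]
  ])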

The result?

And no timeouts until about 150 concurrent hits.

Performance was already plenty in our production system (we automatically scale horizontally as needed), but this improved things by about 2-3x.

That’s enough for today. We’ll share some more details on performance benchmarks in the coming weeks.

Any other thoughts from the community? Please email me (mhookey at demystdata dot com).


Integrate with Facebook connect with rails and coffeescript

October 26, 2011

Facebook’s documentation on authentication via Facebook and the graph API is very comprehensive … but sometimes a worked example still helps. Here is how you can add a “Connect with Facebook” button with minimal effort, using rails, coffeescript, and ruby.

Get your APP ID

You need to register your app with Facebook if you haven’t already

Add standard facebook loading code

From here, look under authentication, and copy/paste to application.js and/or /layouts/application.html.erb. Add this line to the script so that the async loading hooks into the setup code below:

window.setup();

Add the div to your page

	<div class="field">;
		<fb:login-button size="large">
		  Connect to Facebook
		</fb:login-button>
	</div>

Do something with your newly logged in customers

For example if you want to access the logged in customer’s profile after they have logged in, to customize the page, you might do something like this in coffeescript:

# Once the DOM is ready, wire up the Facebook event handlers.
$ ->
    window.setup()

# Called both on DOM ready and from the async SDK loader (see above).
window.setup = ->
    window.FB.Event.subscribe('auth.login', -> do_something()) if window.FB?

do_something = ->
    console.log "doing something ..."

    # getLoginStatus yields a response containing authResponse when logged in.
    window.FB.getLoginStatus (response) ->
        if response.authResponse
            window.FB.api '/me', (fbdata) ->
                console.log "FB name : #{fbdata['name']}"
                # Add interesting personalization logic here

… and you’re ready to go


We have a white labelled offering where we can host this for you and return the data through painless APIs, in case you’re looking to get up and running even faster. Email us and let us know what you’re working on.


Foursquare API integration

October 24, 2011

Foursquare’s API changes quickly, so this post may be out of date before you get started.

However, they offer up-to-the-minute, user-generated, location-based venue data that makes it well worth the effort, especially if you’re cross-referencing it with other location-based data sources.

As with many social APIs, there are requests which need oauth (i.e. the end user opts in) and those which don’t. This post gives an example of integrating with the public (non-oauth) data using ruby.

A quick example:

  # latlon is a "lat,lon" string, e.g. "38.898717,-77.035974"
  def foursquare latlon
    apikey = get_my_foursquare_api_key # sign up as a developer, hardcode your key here
    apisecret = get_my_foursquare_api_secret  # ... and your secret key here
    url = "https://api.foursquare.com/v2/venues/search?ll=#{latlon}&client_id=#{apikey}&client_secret=#{apisecret}"
    hsh = download(url) # use Curl or some other method to fetch and parse the json
                        # (assumed to return an object supporting method access, e.g. a Hashie::Mash)
    results = []
    results = hsh.response.groups.first.items if hsh.response && hsh.response.groups && hsh.response.groups.first && hsh.response.groups.first.items
    results
  end

Hopefully this is quite self explanatory. The result is an array of all nearby venues.

For example, if you wanted to find popular venues near DC:

  x = foursquare("38.898717,-77.035974")
  pp x
  puts x.first.name # West Wing

Or, if you’d like to save some time and avoid this work altogether, we integrate and aggregate a range of interesting data and deliver it through one simple API, so you don’t have to.


Ruby curl with follow redirects

October 23, 2011

Just a technical FYI; this took a little digging to find in the documentation.

If you’re using Curl::Easy in ruby to download html (or results from our API), then FYI the default is to NOT follow redirects. If you want to follow redirects and download the page contents of the target, you’ll need to set the option

easy.follow_location

to true.

Here’s a code snippet:

  require 'curb'   # provides Curl::Easy
  require 'iconv'  # used below to strip invalid UTF-8 bytes

  def download_url url
    res = ""

    tries = 0
    begin
      tries += 1
      easy = Curl::Easy.new
      easy.timeout = 30
      easy.follow_location = true   # follow 3xx redirects to the final target
      easy.url = url
      easy.perform
      res = easy.body_str
    rescue StandardError => e
      retry unless tries > 2        # retry twice before giving up
      puts "#{url} failed, returning empty string, #{e.message}"
    end

    # Drop any invalid UTF-8 byte sequences; the trailing space + [0..-2]
    # trick works around Iconv discarding a trailing partial character.
    ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
    res = ic.iconv(res + ' ')[0..-2]

    return res.downcase
  end

An Aggregation API for all

October 21, 2011

The simplest starting point is the basic Aggregation API, where you can pull together all of the best customer data based on minimal inputs. It’s aimed at more advanced users who want to build their own analytics.

Are you looking for Yahoo or Google data? Geolocation data? Or demographics by email? This Aggregation API may be a great way to start.

This was always available, but we’ve now given it the pride of place that it deserves and a permanent, static endpoint (/engine/raw).
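
As a rough sketch, calling the endpoint from ruby might look something like the following; the parameter names and key handling here are hypothetical placeholders, so check the docs for the real ones.

  require 'curb'
  require 'json'

  # Hypothetical credentials and inputs; substitute your own.
  api_key = 'YOUR_API_KEY'
  email   = 'joe@example.com'

  easy = Curl::Easy.new("https://beta.demystdata.com/engine/raw?email=#{email}&api_key=#{api_key}")
  easy.follow_location = true
  easy.perform

  # Assuming the endpoint returns a JSON object of attribute => value pairs.
  attributes = JSON.parse(easy.body_str)
  attributes.each { |name, value| puts "#{name}: #{value}" }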


Offer targeting with Demyst.Data

October 17, 2011

Something we’re proud of here at Demyst.Data is the ability to create APIs based on minimal input variables. The most common use case we come across is offer targeting. In short, this means that you can guide your visiting customers towards the most relevant products by predicting something (such as conversion likelihood) based solely on their IP (a rough sketch follows).
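
For a flavour of what that looks like inside a rails app, here is a minimal sketch. The endpoint, parameter names and response fields are hypothetical placeholders; see the how-to below for the real details.

  require 'curb'
  require 'json'

  class OffersController < ApplicationController
    def index
      # Often the visitor's IP is the only input available this early in the funnel.
      ip = request.remote_ip

      # Hypothetical prediction endpoint and response fields; substitute your own.
      easy = Curl::Easy.new("https://beta.demystdata.com/your_offer_api?ip=#{ip}&api_key=YOUR_API_KEY")
      easy.perform
      prediction = JSON.parse(easy.body_str)

      # Route the visitor to an offer based on the predicted conversion likelihood.
      @offer = prediction['score'].to_f > 0.5 ? :premium_product : :starter_product
    end
  end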

To learn more, see our ‘how-to’: https://beta.demystdata.com/info/use_offer


Adding twitter and other social data

October 13, 2011

Based on client requests, we’ve been hard at work tapping into additional useful variables. A quick update on highlighted additions:

  1. A connectedness measure for an email
  2. Twitter mentions – we previously had presence only, but now we’re tapping into the twitter API to handle keywords and hashtags. E.g. if you query on twitter=mhookey (or @mhookey or #mhookey) you get the number of times people mention hate/fail or love alongside me (see the sketch after this list)
  3. Twitter velocity – i.e. the rate at which tweets are arriving for a given keyword
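
To give a flavour of the kind of thing happening under the hood, here’s a minimal sketch against Twitter’s public search API as it existed at the time (that unauthenticated v1 endpoint has since been retired, and the sentiment word lists are simplified placeholders):

  require 'curb'
  require 'json'
  require 'cgi'
  require 'time'

  # Pull up to 100 recent tweets matching a keyword from the (now retired) v1 search API.
  def twitter_mentions keyword
    easy = Curl::Easy.new("http://search.twitter.com/search.json?q=#{CGI.escape(keyword)}&rpp=100")
    easy.perform
    JSON.parse(easy.body_str)['results'] || []
  end

  tweets = twitter_mentions('@mhookey')

  # Crude keyword-based sentiment counts (placeholder word lists).
  negative = tweets.count { |t| t['text'] =~ /hate|fail/i }
  positive = tweets.count { |t| t['text'] =~ /love/i }

  # Velocity: tweets per hour across the sample we just pulled.
  times    = tweets.map { |t| Time.parse(t['created_at']) }
  hours    = times.size > 1 ? (times.max - times.min) / 3600.0 : 1.0
  velocity = hours > 0 ? tweets.size / hours : tweets.size

  puts "negative=#{negative} positive=#{positive} tweets/hour=#{velocity.round(1)}"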

We’re always open to special requests, so please let us know if you have an upcoming project that requires integrating with better web data, segmentation, or predictive analytics but haven’t quite figured out how to apply the demyst.data toolkit.


What-if (we threw an analytics party and everyone came)?

October 11, 2011

We’re pleased to announce the release of what-if (scenario testing) functionality for each API, all included within the base package.

This allows you to perform scenario testing on your underlying API. For example, if you build a conversion API where the product offered is a variable, it can be nice to test the impact of changes to product offers. This is now possible:

Be warned though: if you want to draw strong conclusions from this analysis, you’re predicting a counterfactual scenario. To do that with the most confidence, statistical purists would strongly suggest you need a randomized experiment (in this case, randomized in the product offer variable). Even if you don’t have this, our modeling approach brings in as much third party data as possible to remove the biases inherent in a historical analysis, so it can still suggest where the low hanging fruit might be.

Try it out, under ‘what-if’ on the left hand side.


Clearer data attributes

September 25, 2011

In our continued effort to demystify data, we’ve recently published our available attributes, which clarifies which inputs are required for each attribute. We’re continually updating this list, so please let us know if you have any suggested additions.

Occasionally it can be nice to avoid using (or even seeing) particular attributes. We’ve recently added support for this too … within the Account page. Just enter a comma separated list of values, and when third party data is being appended, any fields with names including this text will be skipped.

If you have any questions or suggestions, please let us know.


Optimize your web forms: the conversion rate vs accuracy trade-off

September 8, 2011

We all want to ask as few questions as possible on our web forms. However, each question adds incremental value. How do we think through the trade-off between additional questions (and the accuracy they bring) and simplicity?

First, let’s illustrate the tradeoff here.

We’re trying to find the optimal number of questions, where conversion is maximized, subject to some minimum level of information content.

The first, perhaps obvious, observation here is that third party data is always a good idea. You get extra information content, for example to customize offers and look and feel, without impacting the consumer experience.

Next, we need some way to test the information content of various subsets of the questions. Demyst.Data offers a way to do this – but the concept is pretty simple.

1. Upload your exhaustive questions, and a target variable

2. Fit some scorecard or segmentation that you’re happy with

Here’s ours. This can be thought of as the ‘taj mahal’ workflow (i.e. all questions are included).

3. Delete columns, rinse and repeat

The next step is to delete each column in turn, refit the entire scorecard, and plot the results side by side. Again, here’s one we prepared earlier.

The orange line, the baseline, is flat (clearly, if you don’t ask any questions then predictive lift isn’t possible). The red line is what it looks like if no “Demyst” data is appended. All this means is that we’ve temporarily turned off the third party data and refit. The “without demyst” line is almost as steep as the full ‘taj mahal’ line. In a real dataset, this might mean you wouldn’t bother buying third party data (not something we’d advocate – actually, what’s happening here is that the emails are always joe or john, so it’s not surprising that they’re not adding much value).
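
In code, the delete-one-column-and-refit loop behind these charts amounts to something like the following sketch. The column names are hypothetical, and the simple correlation-based score is just a stand-in for fitting a real scorecard:

  # Hypothetical sample: one hash per applicant, including the target column.
  rows = [
    { :age => 23, :credit_band => 2, :email_domain_score => 0.4, :converted => 0 },
    { :age => 41, :credit_band => 4, :email_domain_score => 0.9, :converted => 1 },
    { :age => 35, :credit_band => 3, :email_domain_score => 0.7, :converted => 1 },
    { :age => 19, :credit_band => 1, :email_domain_score => 0.2, :converted => 0 },
  ]

  # Stand-in for fitting a real scorecard: absolute correlation with the target.
  def lift_proxy(rows, column, target)
    xs = rows.map { |r| r[column].to_f }
    ys = rows.map { |r| r[target].to_f }
    n  = rows.size.to_f
    mx = xs.inject(:+) / n
    my = ys.inject(:+) / n
    cov = xs.zip(ys).map { |x, y| (x - mx) * (y - my) }.inject(:+) / n
    sx  = Math.sqrt(xs.map { |x| (x - mx)**2 }.inject(:+) / n)
    sy  = Math.sqrt(ys.map { |y| (y - my)**2 }.inject(:+) / n)
    sx * sy == 0 ? 0.0 : (cov / (sx * sy)).abs
  end

  questions = [:age, :credit_band, :email_domain_score]
  full = questions.map { |q| lift_proxy(rows, q, :converted) }.inject(:+)

  # Drop one question at a time, "refit", and compare against the full model.
  questions.each do |dropped|
    score = (questions - [dropped]).map { |q| lift_proxy(rows, q, :converted) }.inject(:+)
    puts "without #{dropped}: #{(100.0 * score / full).round}% of full lift"
  end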

4. Keep going

There’s a near limitless number of permutations of this exercise.

Now we can see that credit and email on their own don’t add much value. Age is really the winner here, suggesting a radically simpler quoting process.

We don’t have the full picture yet, since we don’t know if that reduction in lift is compensated by a corresponding lift in conversion thanks to a simpler workflow. That’s a topic for another post.


Beta invites, UX, & stuff

August 20, 2011

Thank you to all who have registered for a private beta trial. We’re thrilled with the number of requests and will continue to open up spots daily. For those of you who have already signed up and tested the tool, please don’t be shy – send us your reactions. We need real user feedback so we can perfect the experience and continue to meet our clients’ needs better.

In line with some of the input from early adopters, we’re excited to announce a new addition to our team, Bryan Connor, a UX and data visualization expert who is putting in countless hours to make the product as user friendly and intuitive as possible. You can sample some of Bryan’s prior work at http://dribbble.com/bryanconnor and I’ve included an initial iteration of the tool below (or, sign up here for a beta trial – the new design is in place!).

In other updates, our engineers continue to enhance the modeling techniques, access new data sources, and perfect the outputs to improve on some of the results of our early pilots.  And of course, we’re working on our 7-minute demo for Finovate and getting our travel plans in place.  Hit us up, we’re seeing up to 40% lift in predicting default versus the status quo.  Let us help you grow!


Finnovating!

August 11, 2011

“We’re really excited to have DeMyst Data demoing their innovative new solution at FinovateFall. We think the audience will find their new solution for helping lenders with segmentation and offer customization via alternative data sources very interesting.” ~Eric Mattson, CEO of Finovate

For those of you unfamiliar with Finovate, it is “the conference” for showcasing innovations in the fields of banking and financial technology. On stage, we’ll publicly launch the tool and demo some of our initial results with real client data.  Our focus will be on exposing lenders to the rich segmentation we are able to create with minimal customer inputs and illustrating how the outputs can be used to customize offers for thin file consumers.

We’ll be in NY (and traveling around the US) for a few weeks leading up to the conference and look forward to re-connecting with many of you and meeting others for the first time.   Drop us a note, we’d love to share some results and discuss how the product can help you grow!


Consider EC2 auto config instead of Capistrano

August 5, 2011

We’re just at the tail end of shifting our infrastructure from a static VPS to Amazon EC2. Someone much smarter than I designed a server self-configuration process. Here’s what we came up with.

The final product

Here’s all it takes for us to spin up a fully functioning web node now:

 
How it works:

  1. Take an existing AMI and launch it with the default security groups (web, worker, etc.) and the default size.
  2. The AMI includes a startup script that does a git clone of the repository.
  3. The startup script runs exactly one script within that repository, which can be used for anything. We use it for a bundle install, installing a couple of server dependencies that aren’t in the AMI (yet), mounting the EBS block if it’s a db role, and assigning the elastic IP (or load balancer) if it’s a web role (a rough sketch follows this list).
  4. That script also launches god for each role, which is how we make sure our web servers, background processing tasks, and database engines are running.
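
To make that concrete, here’s a minimal sketch of what such a bootstrap script might look like. It is illustrative rather than our actual script: reading the security groups from the EC2 metadata service is real, but the paths, device names, rake task and god config layout are hypothetical.

  #!/usr/bin/env ruby
  # bootstrap.rb - run once at instance startup, straight after the git clone.

  require 'net/http'
  require 'uri'

  # EC2 exposes instance metadata, including security groups, at this address;
  # we use the group name as the node's role.
  role = Net::HTTP.get(URI('http://169.254.169.254/latest/meta-data/security-groups')).strip

  Dir.chdir('/srv/app') do                        # hypothetical checkout path
    system('bundle install --deployment') or abort 'bundle install failed'

    case role
    when /db/
      system('mount /dev/xvdf /data')             # hypothetical EBS device and mount point
    when /web/
      system('rake deploy:assign_elastic_ip')     # hypothetical rake task for the elastic IP
    end

    # god keeps the role's processes (web server, workers, db) running.
    system("god -c config/god/#{role}.god")
  end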

Does it work?

Sure does, we’re pretty happy with the final outcome. Now, to deploy new code we just reboot all running instances. We switched from capistrano, so there are some pros & cons:

Benefits of auto-launch

  • Less maintenance overhead: deploying a new server == deploying new code. There is only one set of scripts to maintain, and no way to end up with the “oops, I forgot to tell you I would always have to run rake blah blah on each node after reboot” problem.
  • Auto scaling: using tools like scalr or amazon auto-scale, we can trigger new nodes to spin up or shut down without human intervention.
  • Identical environments for test, deploy, staging, and production.
  • Everything is in git: everything is a rake task or config script in the one repository / project for now. Bringing a new engineer on board is less painful than it used to be. As we grow, we may need to split this out so that the environment and application can be managed independently, but for now this is a positive.

Problems

Our prior process had its advantages which, in hindsight, we undervalued:

  • Capistrano deploys via symlink switching. We do our best for now and don’t reboot all servers at the same time, but it’s still riskier than the capistrano approach, which deploys and then switches the symlink at the last moment. We need to be careful not to suffer downtime during a deploy. We may end up using capistrano, or our own process, to deploy minor tweaks without a reboot (by dynamically querying all running instances to get the :roles).
  • Capistrano rollback. We can’t easily roll back if a deploy was a bad idea. We love TDD so hope this isn’t a problem, but issues always slip through, so we need to be careful to deploy via branches.
  • Painful linux startup scripts. Writing code in a fully loaded ruby environment is easier than shell scripting, especially when the shell script is run at startup (without the environment configured). This took a while to get right. As we add complexity I hope this doesn’t slow us down.

Recommendation

If you want the best of heroku but still need your own custom environment, then I’d look into this. Above all else, it created a discipline for us to feel comfortable that when demand spikes occur, we’ll be ready.

We can predict anything!

July 12, 2011

What?

We’re opening some private beta trials to the DeMyst API, a tool that leverages rich public data sources (think digital footprint, telecommunications usage, and much more) to predict risk or conversion when there is minimal consumer information available. Initially targeted at lenders, a few client meetings quickly revealed that our tool had broad applications for anyone working with ‘thin file’ customers. So, like any good group of bootstrapping engineers, we iterated a bit, and hence the latest version of the DeMyst API, a tool capable of predicting anything, was born.

How does it work? 

The tool excels at predicting a ‘target’ with minimal inputs/identifiers. Of course, the more identifiers the better, but the reality is that there is a bunch of rich data out there in the public domain that, with a bit of aggregation, some pretty geeky analytics, and kickass technology, allows us to produce either a standalone prediction or complementary attributes for your own scorecards. The UX is painless: just upload a decent-sized sample containing whatever identifiers are available (it even works with just email!), we’ll append nifty third party data, exert our ‘muscle’, and within minutes produce an API with a custom prediction. Yes, minutes.

Sounds cool, prove it: 

From a proud founder’s perspective, we’ve been pretty blown away by how much lift is being created by the toolkit.   We can boast a  >90% hit rate and significant growth improvements.  To prove it, we’re offering a few private beta spots so you can test for yourself.

What’s the catch? 

There are two:

  • You must be brutally honest and willing to provide us with feedback to help us refine the product.
  • You must have a genuine interest in commercially using the product if you like it (at a preferred rate of course).

To show our appreciation, we promise to provide attentive support and assistance, some free consulting help, and a certain level of exclusivity as a lighthouse customer.  This of course all comes free with your early access to a slick new tool that could massively increase your distribution without impacting your current risk level.

Early Adopter?

To reserve your spot, click here.


Lies, Damn Lies and Statistics

June 24, 2011

  

It was Mark Twain who popularised it, but the original authorship of the oft quoted phrase “lies, damn lies and statistics” is widely contested. One to whom it is frequently attributed is Benjamin Disraeli. A distinguished conservative politician and literary figure, Disraeli’s business ventures are deservedly less celebrated.

His speculative investments in South American mining companies in the early nineteenth century proved calamitous and almost ruined him. One wonders whether, had he held a more considered view of the power of information than is implied by the phrase with which he is sometimes associated, he might have avoided the pitfalls of reckless investment.

While the world, and particularly the ethereal world, is awash with data (and indeed statistics), it is alarming how infrequently that data is converted to useful information. At a time when data is generated and captured at an unprecedented rate and indeed has become inordinately accessible, it is ironic that we remain so beholden to the spinmeisters and their political masters. The power of information has never been more readily, tantalisingly, at our fingertips but somehow we don’t reach out and grasp it.

At Silne we have a healthy disregard for what we call information asymmetries. In equity markets information asymmetries are said to be removed through the trading activities of arbitrageurs. When I trade on the basis of closely held information I essentially expose that information to the world. In the meantime of course, I make money. Information asymmetries then confer power on the holder of information, or serve to diminish the interests of those without access to it. That’s not fair and we don’t like it.

We define information asymmetries rather broadly … information is available but is not being used; you have information but I don’t; information exists but I don’t know it does. Finding relevant, predictive data, sifting and analysing it, and using it to solve problems and improve decision making  is not easy but it can be a route to the truth, not the damn lies which Disraeli so lamented.


Activerecord SQL optimization tips

June 20, 2011

In Rails, it’s all too easy to forget that activerecord models sit on top of a database. Don’t.

We tackle big data problems, and queries are usually the performance bottleneck. Here are a couple of simple tips for optimizing rails code without resorting to custom SQL (apologies if some of these seem too obvious to mention, but they can be quite common issues when tackling thousands of records):

  1. Understand ActiveRecord.new vs create. New just instantiates the object, whereas create actually inserts it. Use your .save!’s wisely (see the sketch after this list).
  2. Select only what you need from the database. E.g. Mdl.where("#{attr} is not null").select(attr).each { |o| do_something(o[attr]) } is SO much better than Mdl.all.each { |o| do_something(o[attr]) if o[attr] }
  3. Use transactions when saving many records in a batch to save on COMMIT time, e.g. Mdl.transaction { mdls.each { |o| o.save! } }.
  4. Use update_all … e.g. Mdl.update_all(:field => nil) is a lot better than mdls.each { |m| m.field = nil; m.save! }
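
To make tips 1 and 3 concrete, here’s a minimal sketch (the model and attribute names are hypothetical): build the objects in memory with .new, then persist them inside a single transaction so the database only pays for one COMMIT.

  # rows is an array of attribute hashes, e.g. loaded from a CSV.
  rows = [{ :email => 'joe@example.com' }, { :email => 'john@example.com' }]

  # .new only instantiates; nothing hits the database yet.
  records = rows.map { |attrs| Profile.new(attrs) }

  # One transaction wraps all the INSERTs, so there is a single COMMIT
  # instead of one per record.
  Profile.transaction do
    records.each(&:save!)
  end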

These kinds of changes retain the syntactic sugar of rails, which helps with maintenance, readability, and security. They also lead to less overhead when switching backend databases. Finally, it’s more fun.

That said, here at Silne we’re crunching some large data on the backend, so there are times when activerecord just won’t cut it, and you need to optimize your data manipulations directly… but don’t discount the flexibility of activerecord.