
Thursday, February 23, 2017

IDOL-Powered Appliance Delivers Better Decisions Via Comprehensive Business Information Searches

Transcript of a discussion on how HPE's platform and data solutions have been combined by SEC 1.01 for an appliance approach to index and deliver comprehensive business information results.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the Hewlett Packard Enterprise (HPE) Voice of the Customer podcast series. I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on digital transformation. Stay with us now to learn how agile businesses are fending off disruption in favor of innovation.

Our next case study highlights how a Swiss engineering firm created an appliance that quickly deploys to index and deliver comprehensive business information. It indexes content across thousands of formats and hundreds of languages, and then provides, via a simple search interface, unprecedented access to trends, leads, and the makings of highly informed business decisions.

We will now explore how SEC 1.01 AG delivers a true intelligence services solution -- one that returns new information to ongoing queries and combines internal and external information from all sorts of sources to produce a 360-degree view of end users’ areas of intense interest.

Join us as we learn how finding and using the best available information can be done in about half the usual time. We're here with our guest David Meyer, Chief Technology Officer at SEC 1.01 AG in Switzerland.
 
Welcome, David.

David Meyer: Thank you.

Gardner: What are some of the trends that are driving the need for what you've developed -- the i5 appliance?

Meyer: The most important thing is that we can provide instant access to company-relevant information. This is one of today’s biggest challenges that we address with our i5 appliance.

Decisions are only as good as the information bases they are made on. The i5 provides the ability to access more complete information bases to make substantiated decisions. Also, you don’t want to search all the time; you want to be proactively informed. We do that with our agents and our automated programs that are searching for new information that you're interested in.

Gardner: As an organization, you've been around for quite a while and have been involved with large packaged applications -- SAP R/3, for example -- but over time, more data sources and the ability to gather information came on board, and you saw a need in the market for this appliance. Tell us a little bit about what led you to create it.

Accelerating the journey

Meyer: We started to dive into big data around the time that HPE acquired Autonomy in 2011, and we saw that it’s very hard for companies to start to become data-driven organizations. With the i5 appliance, we would like to help companies accelerate that journey.

Gardner: Tell us what you mean by a 360-degree view? What does that really mean in terms of getting the right information to the right people at the right time?

Meyer: In a company's information scope, you don’t just talk about internal information; you also have external information like news feeds, social media feeds, or even governmental or legal information that you need and don’t have time to search for every day.

So, you need to have a search appliance that can proactively inform you about things that happen outside. For example, if there's a legal issue with your customer or if you're in a contract discussion and your partner loses his signature authority to sign that contract, how would you get this information if you don't have support from your search engine?
Gardner: And search has become such a popular paradigm for acquiring information, asking a question, and getting great results. Those results are only as good as the data and content they can access. Tell us a little bit about your company SEC 1.01 AG, your size and your scope or your market. Give us a little bit of background about your company.

Meyer: We've been an HPE partner for 26 years, and we build business-critical platforms based on HPE hardware and the HPE operating system, HP-UX. Since HPE acquired Autonomy in 2011, we have built solutions based on HPE's big-data software, particularly IDOL and Vertica.

Gardner: What was it about the environment that prevented people from doing this on their own? Why wouldn't you go and just do this yourself in your own IT shop?

Meyer: The HPE IDOL software ecosystem is really an ecosystem of different software components, and these parts need to be packaged together into something that can be installed very quickly and that can provide very quick results. That’s what we did with the i5 appliance.

We put all this good HPE IDOL software together into one appliance, which is simple to install. We want to shorten the time it takes to get started with big data, to get results from it, to begin the analytical part of using your data, and to gain value from it.

Multiple formats

Gardner: As we mentioned earlier, getting the best access to the best data is essential. There are a lot of APIs and a lot of tools that come with the IDOL ecosystem, as you described it, but you were able to dive into a thousand or more file formats, support 150 languages, and connect to 400 data sources. That's very impressive. Tell us how that came about.

Meyer: When you start to work with unstructured data, you need some important functionality. For example, you need to have support for a lot of languages. Imagine all these social media feeds in different languages. How do you track that if you don't support sentiment analysis on these messages?

On the other hand, you also need to understand any unstructured format. For example, if you have video broadcasts or radio broadcasts and you want to search for the content inside these broadcasts, you need to have a tool to translate the speech to text. HPE IDOL brings all the functionality that is needed to work with unstructured data, and we packed that together in our i5 appliance.

Gardner: That includes digging into PDFs and using OCR. It's quite impressive how deep and comprehensive you can be in terms of all the types of content within your organization.
How do you physically do this? It's an appliance, so you're installing it on-premises, and you're able to access data sources from outside your organization if you choose to do that. But how do you actually implement this and then get at those data sources internally? How would an IT person think about deploying this?

Meyer: We've prepared installable packages. Mainly, you need connectors to connect to repositories and data sources. For example, if you have a Microsoft Exchange server, you have a connector that understands very well how to communicate with that Exchange server. So, you have the ability to connect to that data source and get any content, including the metadata.

Take the metadata of an e-mail, for example: the “From,” “To,” and “Subject” fields. You have the ability to put all that content and metadata into a centralized index, and then you're able to search and refine that information. Then, you have a reference back to your original document.

When you want to enrich the information that you have in your company with external information, we've developed the so-called SECWebConnector, which can capture any information from the Internet. For example, you just enter an RSS feed or a webpage, and then you can capture the content and the metadata that you want to search for or that is important for your company.
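
To make the idea concrete, here is a minimal sketch in Python, using only the standard library, of what such a web-capture step might look like. The function and field names are illustrative assumptions; the real SECWebConnector is proprietary, and its interfaces are not public.

    import urllib.request
    import xml.etree.ElementTree as ET

    def capture_rss(feed_url):
        """Fetch an RSS feed and return one document per item.

        Illustrative only -- a stand-in for the proprietary
        SECWebConnector capture step."""
        with urllib.request.urlopen(feed_url) as resp:
            tree = ET.parse(resp)
        documents = []
        for item in tree.iterfind("./channel/item"):
            documents.append({
                "title": item.findtext("title", default=""),
                "reference": item.findtext("link", default=""),
                "content": item.findtext("description", default=""),
                "published": item.findtext("pubDate", default=""),
            })
        return documents

    # Each captured document, content plus metadata, would then be
    # handed on to the appliance's central index.
    for doc in capture_rss("https://example.com/news.rss"):
        print(doc["title"], doc["reference"])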

Gardner: So, it’s actually quite easy to tailor this specifically to an industry focus, if you wish, to a geographic focus. It’s quite easy to develop an index that’s specific to your organization, your needs, and your people.

Informational scope

Meyer: Exactly. In the crowded informational system that we have with the Internet and everything else, it’s important that companies can choose where they want to get the information that matters to them. Do I need legal information, news information, social media information, or broadcast information? It’s very important to build your own informational scope -- the topics you want to be informed about and the news you want to be able to search for.

Gardner: And because of the way you structured and engineered this appliance, you're not only able to proactively go out and request things, but you also get a programmatic benefit, where you can tell it to deliver results to you when they arise or when they're discovered. Tell us a little bit about how that works.

Meyer: We call them agents. You define which topics you're interested in, and when new documents are found for a topic, you're informed with an e-mail or a push notification in the mobile app.
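
The agent pattern Meyer describes can be sketched in a few lines of Python. The `search` and `notify` callables are hypothetical stand-ins for the appliance's query engine and e-mail/push services, which are not publicly documented.

    import time

    def run_agent(search, notify, topic, interval_sec=3600):
        """Poll a saved topic and alert the user to unseen documents.

        `search(topic)` and `notify(message)` stand in for the real
        query and notification services of the appliance."""
        seen_ids = set()
        while True:
            for doc in search(topic):
                if doc["id"] not in seen_ids:
                    seen_ids.add(doc["id"])
                    notify(f"New result for '{topic}': {doc['title']}")
            time.sleep(interval_sec)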

Gardner: Let’s dig into this concept of an appliance a little. You're using IDOL and you're using Vertica, the column-based, high-performance analytics engine -- also part of HPE, but soon to be part of Micro Focus. You're also using 3PAR StoreServ storage and ProLiant DL380 servers. Tell us how that integration happened and why you call this an appliance, rather than some other name?

Meyer: Appliance means that all the software is packaged together. Every component can talk to the others, speaks the same language, and can be configured the same way. We preconfigure a lot and we standardize a lot; that’s the appliance part.

And it’s not bound to particular hardware, so it doesn’t need to be this DL380. It depends on how big your environment will be; it could also be a c7000 blade chassis, for example.

When we install an appliance, it takes one or two days until it’s set up, and then the initial indexing run starts; it takes a while until you have all the data in the index. So, the initial load is big, but after two or three days, you're able to search for information.

You mentioned the HPE Vertica part. We use Vertica to log every action that happens on the appliance. On one hand, this is a security feature: you need to be able to prove that nobody has found the salary list, for example, and for that you need the log.

On the other hand, you can analyze what users are doing. For example, if people keep searching for the same thing in the company and can't find it, perhaps there's some information you need to add to the appliance.
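
As a sketch of that kind of usage analysis, the query below finds the most common searches that returned nothing in the last 30 days, using the open-source vertica-python client. The `search_log` table and its columns are hypothetical; the i5's actual logging schema is not public.

    import vertica_python

    conn_info = {"host": "i5-appliance", "port": 5433, "user": "auditor",
                 "password": "...", "database": "i5"}

    # Hypothetical audit table: search_log(username, query_text,
    # result_count, ts) -- one row per search action on the appliance.
    sql = """
        SELECT query_text, COUNT(*) AS attempts
        FROM search_log
        WHERE result_count = 0
          AND ts > CURRENT_DATE - 30
        GROUP BY query_text
        ORDER BY attempts DESC
        LIMIT 20
    """

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        cur.execute(sql)
        for query_text, attempts in cur.fetchall():
            print(f"{attempts:>5}x no results: {query_text}")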

Gardner: You mentioned security and privileges. How does the IT organization allow the right people to access the right information? Are you going to use some other policy engine? How does that work?

Mapped security

Meyer: It's included; it's called mapped security. The connector captures the security information along with the document and indexes that security information within the index. So, you will never be able to find a document that you don't have access to in your environment. It's important that this security is enforced by default.
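
The principle is simple to sketch: the ACL captured at index time travels with the document, and every result is checked against the searcher's identity before it is returned. The structures below are illustrative; IDOL's actual mapped-security implementation evaluates the ACLs inside the engine itself.

    def allowed(doc_acl, user, user_groups):
        """True if the user may see the document.

        doc_acl is the security descriptor the connector captured at
        index time, e.g. {"allow_users": [...], "allow_groups": [...]}."""
        return (user in doc_acl.get("allow_users", ())
                or any(g in doc_acl.get("allow_groups", ())
                       for g in user_groups))

    def secure_search(index, query, user, user_groups):
        # Forbidden documents are filtered out before results are
        # returned, so they can never appear in a hit list.
        return [doc for doc in index.search(query)
                if allowed(doc["acl"], user, user_groups)]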

Gardner: It sounds to me, David, like you're, in a sense, democratizing big data. By gathering and indexing all the unstructured data that you could possibly want to point at and connect to, you're allowing anybody in a company to run queries without having to go through a data scientist or a SQL query author. It seems to me that you're really opening up the power of data analysis to many more people on their terms, which are basic search queries. What does that get an organization? Do you have any examples of the ways that people are benefiting from this democratization -- this larger pool of people able to use these very powerful tools?

Meyer: Everything is more data-driven. The i5 appliance can give you access to all of that information. The appliance is here to simplify the beginning of becoming a data-driven organization and to find out what power is in the organization's data.
For example, we enabled a Swiss company called Smartinfo to become a proactive news provider. They put lots of public information -- newspapers, online newspapers, TV broadcasts, radio broadcasts -- into that index. Their customers then define the topics they're interested in and are proactively informed about new articles matching those interests.

Gardner: In what other ways do you think this will become popular? I'm guessing that a marketing organization would really benefit from finding relationships within their internal organization -- between product and service, go-to-market, and research and development. The parts of a large distributed organization don't always know what the other parts are doing -- the unknown unknowns, if you will. Any other examples of how this is a business benefit?

Meyer: You mentioned the marketing organization. How could a marketing organization listen to what customers are saying? For example, customers are communicating on social media, and when you have an engine like the i5, you can capture those social media feeds, run sentiment analysis on them, and get an analyzed view of what's being said about your products, your company, or your competitors.

You can detect, for example, a shitstorm about your company or about your competitor. You need an analytics platform to see that and to visualize it, and this is a big benefit.

On the other hand, there's also the proactive information you get from it. You can see that your competitor has a new campaign, and you get that information right away because you have an agent on that name. You can see that something is happening and you can act on that information.

Gardner: When you think about future capabilities, are there other aspects that you can add on? It seems extensible to me. What would we be talking about a year from now, for example?

Very extensible

Meyer: It's very extensible. Think about all the different verticals: you can expand it for the health sector, for the transportation sector, and so on. It doesn't really matter which.

We do network analysis. That means that when you prepare to visit a company, you can get a network picture: what relationships the company has, which employees work there, who its shareholders are, and which contracts it has with other companies.

This is a new way to get a holistic image of a company, a person, or anything else you want to know about. It's about thinking through how to visualize things, how to visualize information, and that's the main part we're focusing on: how can we bring new visualizations to the customer?

Gardner: In the marketplace, because it's an ecosystem, we're seeing new APIs coming online all the time. Many of them are very low cost and, in many cases, open source or free. We're also seeing the ability to connect more adequately to LinkedIn and Salesforce, if you have your license for that of course. So, this really seems to me a focal point, a single pane of glass to get a single view of a customer, a market, or a competitor, and at the same time, at an affordable price.

Let's focus on that for a moment. When you have an appliance approach, what we're talking about used to be only possible at very high cost, and many people would need to be involved -- labor, resources, customization. Now, we've eliminated a lot of the labor, a lot of the customization, and the component costs have come down.
We've talked about all the great qualitative benefits, but can we talk about the cost differential between what used to be possible five years ago with data analysis, unstructured data gathering, and indexing, and what you can do now with the i5?

Meyer: You mentioned the price. We have an OEM contract, and that's something that makes us competitive in the market. Companies can build their own intelligence service. It's affordable even for small and medium businesses; it doesn't take a huge company with its own engineering and IT staff. It's affordable, it's automated, it's packaged together, and it's simple to install.

Companies can increase workplace performance and shorten their processes. Everybody has access to all the information they need in their daily work, and they can focus more on their core business. They don't lose time searching for information and not finding it.

Gardner: For those folks who have been listening or reading, are intrigued by this, and want to learn more, where would you point them? How can they get more information on the i5 appliance and some of the concepts we have been discussing?

Meyer: That's our company website, sec101.ch. There you can find any information you would like to have.

Gardner: And this is available now.

Meyer: This is available now.

Gardner: Well, great, I'm afraid we will have to leave it there. We have been exploring how SEC 1.01 AG delivers a true intelligence services solution -- one that returns new information to ongoing queries and combines internal and external information from all sorts of sources to produce a 360-degree view of whatever areas of interest users choose.

We've learned how HPE's platform and data solutions have also been uniquely combined by SEC 1.01 for an appliance approach that quickly deploys to index and deliver these comprehensive business information results.

Please join me in thanking our guest, David Meyer, Chief Technology Officer at SEC 1.01 AG in Switzerland. Thank you so much, David.

Meyer: Thank you, Dana.

Gardner: And thanks to our audience as well for joining us for this Hewlett Packard Enterprise Voice of the Customer Digital Transformation discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored interviews. Thanks again for listening, and please come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how HPE's platform and data solutions have been combined by SEC 1.01 for an appliance approach to index and deliver comprehensive business information results. Copyright Interarbor Solutions, LLC, 2005-2017. All rights reserved.


Thursday, June 09, 2016

Alation Centralizes Enterprise Data Knowledge by Employing Machine Learning and Crowdsourcing

Transcript of a discussion on how Alation makes data actionable by keeping it up-to-date and accessible using innovative means.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the Hewlett Packard Enterprise (HPE) Voice of the Customer podcast series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation -- and how it’s making an impact on people’s lives.

Our next big-data case study discussion focuses on the Tower of Babel problem for disparate data, and explores how Alation manages multiple data types by employing machine learning and crowdsourcing.

We'll explore how Alation makes data more actionable via such innovative means as combining human experts and technology systems.

To learn more about how enterprises and small companies alike can access more data for better analytics, please join me in welcoming Stephanie McReynolds, Vice-President of Marketing at Alation in Redwood City, California. Welcome.
Embed the HPE
Big Data
OEM Software
Stephanie McReynolds: Thank you, Dana. Glad to be here.

Gardner: I've heard of crowdsourcing for many things, and machine learning is more and more prominent in big-data activities, but I haven't necessarily seen them together. How did that come about? How do you, and why do you need to, employ both machine learning and experts in crowdsourcing?

McReynolds: Traditionally, we've looked at data as a technology problem. At least over the last 5 to 10 years, we’ve been pretty focused on new systems like Hadoop for storing and processing larger volumes of data at a lower cost than databases could traditionally support. But what we’ve overlooked in the focus on technology is the real challenge of helping organizations use the data that they have to make decisions. If you look at what happens when organizations go to apply data, there's often a gap between the data that's available and what decision-makers are actually using to make their decisions.

A study that came out within the last couple of years showed that about 56 percent of managers have data available to them, but they're not using it. So, there's a human gap there. Data is available, but managers aren't successfully applying it to business decisions, and that’s where real return on investment (ROI) always comes from. Storing the data is just an insurance policy for future use.

The concept of crowdsourcing data, or tapping into experts around the data, gives us an opportunity to bring humans into the equation of establishing trust in data. Machine-learning techniques can be used to find patterns and clean the data, but to really trust data as a foundation for decision-making, human experts are needed to add business context and show how data can be applied to solving real business problems.

Gardner: Usually, when you're employing people like that, it can be expensive and doesn't scale very well. How do you manage a fit-for-purpose approach to crowdsourcing, where you're doing a service for the experts in getting them the information they need while also evaluating their contributions? How do you balance that?

Using human experts

McReynolds: The term "crowdsourcing" can be interpreted in many ways. The approach that we’ve taken at Alation is that machine learning actually provides a foundation for tapping into human experts.

We go out and look at all of the log data in an organization -- in particular, what queries are being used to access data in databases or Hadoop file structures. That creates a foundation of knowledge, so the machine can learn to identify what data would be most useful to catalog or to enrich with the help of human experts in the organization. That's essentially a way to prioritize how to tap into the limited number of humans you have available to help create context around that data.
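
A toy version of that prioritization step might look like the following, assuming plain SQL text logs. Alation's real log parser is far more sophisticated; the regex and sample log here are only illustrative.

    import re
    from collections import Counter

    TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)

    def rank_tables(query_log):
        """Count how often each table appears in FROM/JOIN clauses.

        The most-queried tables are the ones worth cataloging and
        enriching with human context first."""
        usage = Counter()
        for sql in query_log:
            usage.update(t.lower() for t in TABLE_REF.findall(sql))
        return usage.most_common()

    log = ["SELECT * FROM sales.orders o JOIN sales.customers c ON ...",
           "SELECT COUNT(*) FROM sales.orders"]
    print(rank_tables(log))  # [('sales.orders', 2), ('sales.customers', 1)]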

That’s a great way to partner with machines: use humans for what they're good at, which is establishing context and business perspective, and use machines for what they're good at, which is cataloging the raw bits and bytes and showing folks where to add value.

Gardner: What are some of the business trends that are driving your customers to seek you out to accomplish this? What's happening in their environments that requires this unique approach of the best of machine and crowdsourcing and experts?

McReynolds: There are two broader industry trends that have converged and created a space for a company like Alation. The first is the immense volume and variety of data that we have in our organizations. If we weren't adding more and more data storage systems to our enterprises, the groundwork wouldn't be laid for Alation. The second, and perhaps more interesting, trend is around self-service business intelligence (BI).

As we increase the number of systems that we use to store and access data, we're also putting more weight on typical business users to find value in that data, and we're trying to make that as self-service a process as possible. That's created a perfect storm for a system like Alation, which helps catalog all the data in the organization and makes it more accessible for humans to interpret in accurate ways.

Gardner: And we often hear in the big data space the need to scale up to massive amounts, but it appears that Alation is able to scale down. You can apply these benefits to quite small companies. How does that work when you're able to help a very small organization with some typical use cases in that size organization?

McReynolds: Even smaller organizations, or younger organizations, are beginning to drive their business based on data. Take an organization like Square, which is a great brand name in the financial services industry, but it’s not a huge organization in and of itself, or Inflection or Invoice2go, which are also Alation customers.

We have many customers that have data analyst teams that maybe start with five people or 20 people. We also have customers like eBay that have closer to a thousand analysts on staff. What Alation provides to both of those very different sizes of organizations is a centralized place, where all of the information around their data is stored and made accessible.

Even if you're only collaborating with three to five analysts, you need the ability to share your queries and to communicate about which queries addressed which business problems, which tables from your HPE Vertica database were appropriate for them, and maybe which Hive tables on your Hadoop implementation you could easily join to those Vertica tables. That type of conversation is just as relevant in a five-person analytics team as it is in a 1,000-person analytics team.

Gardner: Stephanie, if I understand it correctly, you have a fairly horizontal capability that could apply to almost any company and almost any industry. Is that fair, or is there more specialization or customization that you apply to make it more valuable, given the type of company or type of industry?

Generalized technology

McReynolds: The technology itself is a generalized technology. Our founders come from backgrounds at Google and Apple, companies that have developed very generalized computing platforms to address big problems. So the way the technology is structured is general.

The organizations that are going to get the most value out of an Alation implementation are those that are data-driven organizations that have made a strategic investment to use analytics to make business decisions and incorporate that in the strategic vision for the company.

So even if we're working with very small organizations, they are organizations that make data and the analysis of data a priority. Today, it’s not every organization out there. Not every mom-and-pop shop is going to have an Alation instance in their IT organization.

Gardner: Fair enough. Given those organizations that are data-driven, have a real benefit to gain by doing this well, they also, as I understand it, want to get as much data involved as possible, regardless of its repository, its type, the silo, the platform, and so forth. What is it that you've had to do to be able to satisfy that need for disparity and variety across these data types? What was the challenge for being able to get to all the types of data that you can then apply your value to?
McReynolds: At Alation, we see the variety of data as a huge asset, rather than a challenge. If you're going to segment the customers in your organization, every event and every interaction with those customers becomes relevant to understanding who that individual is and how you might be able to personalize offerings, marketing campaigns, or product development to those individuals.

That does put some burden on our organization, as a technology organization, to be able to connect to lots of different types of databases, file structures, and places where data sits in an organization.

So we focus on being able to crawl those source systems, whether they're places where data is stored or BI applications that use that data to execute queries. A third important data source for us, one that may be a bit hidden in some organizations, is all the human information that’s created -- the metadata that’s often stored in wiki pages, business glossaries, or other documents that describe the data being stored in various locations.

We crawl all of those sources and provide an easy way for individuals to use that information about the data in their daily interactions. Typically, our customers are analysts who are writing SQL queries. All of that context about how to use the data is surfaced to them automatically by Alation within their query-writing interface, so they can save anywhere from 20 percent to 50 percent of the time it takes to write a new query in their day-to-day jobs.

Gardner: How is your solution architected? Do you take advantage of cloud when appropriate? Are you mostly on-premises, using your own data centers, some combination, and where might that head to in the future?

Agnostic system

McReynolds: We're a young company. We were founded about three years ago and we designed the system to be agnostic as to where you want to run Alation. We have customers who are running Alation in concert with Redshift in the public cloud. We have customers that are financial services organizations that have a lot of personally identifiable information (PII) data and privacy and security concerns, and they are typically running an on-premise Alation instance.

We architected the system to be able to operate in different environments and have an ability to catalog data that is both in the cloud and on-premise at the same time.

The way that we do that, from an architectural perspective, is that we don’t replicate or store data within Alation systems. We use metadata to point to the location of that data. For any analyst who runs a query from our recommendations, that query is pushed down to the source system, on-premise or in the cloud, wherever that data is stored.
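
The pointer-not-copy idea can be pictured like this, with made-up catalog entries; Alation's real catalog and connectors are, of course, far richer.

    # Catalog entries hold metadata and connection details, never the data.
    CATALOG = {
        "dw.page_views": {"system": "vertica", "dsn": "vertica://dw-prod/..."},
        "lake.raw_events": {"system": "hive", "dsn": "hive://lake/..."},
    }

    def run_recommended_query(logical_name, sql, connect):
        """Push the query down to whichever system owns the data.

        `connect` stands in for a driver factory (vertica-python,
        PyHive, and so on); the catalog only resolves where the query
        should execute."""
        entry = CATALOG[logical_name]
        with connect(entry["dsn"]) as conn:
            cur = conn.cursor()
            cur.execute(sql)       # runs on-premise or in the cloud,
            return cur.fetchall()  # wherever the source system lives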

Gardner: And how did HPE Vertica come to play in that architecture? Did it play a role in the ability to be agnostic as you describe it?

McReynolds: We use HPE Vertica in one portion of our product that allows us to provide, essentially, BI on the BI that’s happening. Vertica is a fundamental component of our reporting capability, called Alation Forensics, which is used by IT teams to find out how queries are actually being run on data source systems, which back-end database tables are being hit most often, and what that says about the organization and those physical systems.

It gives the IT department insight. Day-to-day, Alation is typically more of a business person’s tool for interacting with data.

Gardner: We've heard from HPE that they expect that IT-department-specific ops-efficiency role and use case to grow. Do you have any sense of the benefits an IT organization gets from that sort of analysis? What's the ROI?

McReynolds: The benefits of an approach like Alation include getting insight into the behaviors of individuals in the organization. What we’ve seen at some of our larger customers is that they may have dedicated themselves to a data-governance program where they want to document every database and every table in their system, hundreds of millions of data elements.

Using the Alation system, they were able to identify within days a rank-ordered priority list of what they actually needed to document, versus what they thought they had to document. The cost savings come from taking a very data-driven, realistic look at which projects are going to produce value for a majority of the business audience, and which projects can be put on hold so resources are spent more wisely.

One team that we were working with found that about 80 percent of their tables hadn't been used by more than one person in the last two years. In that case, if only one or two people are using those systems, you don't really need to document them; that individual or those two individuals probably know what's there. Spend your time documenting the 10 percent of the system that everybody is using and that everyone is going to receive value from.

Where to go next

Gardner: Before we close out, any sense of where Alation could go next? Is there another use case or application for this combination of crowdsourcing and machine learning, tapping into all the disparate data that you can and information including the human and tribal knowledge? Where might you go next in terms of where this is applicable and useful?

McReynolds: What Alation is doing is very similar to what Google did for the Internet, in terms of cataloging all of the webpages that were available to individuals and serving them up in meaningful ways. That's a huge vision for Alation, and to be honest, we're just in the early part of that journey. We'll continue to move in that direction: cataloging data for an enterprise and making all of the information stored in that organization easily searchable, findable, and usable.

Gardner: Well, very good. I'm afraid we will have to leave it there. We've been examining how Alation maps across disparate data while employing machine learning and crowdsourcing to help centralize and identify data knowledge. And we've learned how Alation makes data actionable by keeping it up-to-date and accessible using innovative means.
So a big thank you to our guest, Stephanie McReynolds, Vice-President of Marketing at Alation in Redwood City, California. Thank you so much, Stephanie.

McReynolds: Thank you. It was a pleasure to be here.

Gardner: And a big thank you as well to our audience for joining us for this big data innovation case study discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a sponsored discussion on how Alation makes data actionable by keeping it up-to-date and accessible using innovative means. Copyright Interarbor Solutions, LLC, 2005-2015. All rights reserved.

Wednesday, November 18, 2015

Big Data Enables Top User Experiences and Extreme Personalization for Intuit TurboTax

Transcript of a BriefingsDirect discussion on how TurboTax uses big data analytics to improve performance despite high data volumes during peak usage.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big data innovation case study highlights how Intuit uses deep-data analytics to gain a 360-degree view of its TurboTax application users’ behavior and preferences. Such visibility allows for rapid application improvements and enables the TurboTax user experience to be tailored to a highly detailed degree.

Here to share how analytics paves the way to better understanding of end-user needs and wants, we're joined by Joel Minton, Director of Data Science and Engineering for TurboTax at Intuit in San Diego. Welcome to Briefings Direct, Joel.

Joel Minton: Thanks, Dana, it’s great to be here.
Gardner: Let’s start at a high-level, Joel, and understand what’s driving the need for greater analytics, greater understanding of your end-users. What is the big deal about big-data capabilities for your TurboTax applications?

Minton: There were several things, Dana. We were looking to see a full end-to-end view of our customers. We wanted to see what our customers were doing across our application and across all the various touch points that they have with us to make sure that we could fully understand where they were and how we can make their lives better.

We also wanted to be able to take that data and then give more personalized experiences, so we could understand where they were, how they were leveraging our offerings, but then also give them a much more personalized application that would allow them to get through the application even faster than they already could with TurboTax.

And last but not least, there was the explosion of available technologies to ingest and store data and to gain insights from it -- things that were not even possible two or three years ago. All of those technologies have made leaps and bounds over the last several years, and we’ve been able to put them together to garner the business benefits I spoke about earlier.

Gardner: So many of our listeners might be aware of TurboTax, but it’s a very complex tax return preparation application that has a great deal of variability across regions, states, localities. That must be quite a daunting task to be able to make it granular and address all the variables in such a complex application.

Minton: Our goal is to remove all of that complexity for our users and for us to do all of that hard work behind the scenes. Data is absolutely central to our understanding that full end-to-end process, and leveraging our great knowledge of the tax code and other financial situations to make all of those hard things easier for our customers, and to do all of those things for our customers behind the scenes, so our customers do not have to worry about it.

Gardner: In the process of tax preparation, how do you actually get context within the process?

Always looking

Minton: We're always looking at all of those customer touch points, as I mentioned earlier. Those things all feed into where our customer is and what their state of mind might be as they are going through the application.

To give you an example, as a customer goes though our application, they may ask us a question about a certain tax situation.

When they ask that question, we know a lot more later on down the line about whether that specific issue is causing them grief. If we can bring all of those data sets together -- so that we know they asked the question three screens back and are now spending more time on a later screen -- we can try to make that experience better, especially in the context of the specific questions they have.

As I said earlier, it's all about bringing all the data together and making sure that we leverage that when we're making the application as easy as we can.
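
One way to picture that correlation is the sketch below, which walks a user's time-ordered event stream and flags screens where a long pause follows an unresolved help question. The event schema is invented for illustration; Intuit's actual pipeline is naturally much larger.

    def screens_needing_help(events, dwell_threshold_sec=120):
        """Correlate help questions with long dwell times later on.

        `events` is one user's time-ordered list of dicts such as
        {"type": "help_question", "topic": ...} and
        {"type": "screen_view", "screen": ..., "dwell_sec": ...}."""
        flagged = []
        open_topics = []
        for ev in events:
            if ev["type"] == "help_question":
                open_topics.append(ev["topic"])
            elif (ev["type"] == "screen_view"
                  and ev["dwell_sec"] > dwell_threshold_sec):
                # A long pause after an earlier question marks a screen
                # whose experience is worth improving for that topic.
                flagged.append((ev["screen"], list(open_topics)))
        return flagged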

Gardner: And that's what you mean by a 360-degree view of the user: where they are in time, where they are in a process, where they are in their particular individual tax requirements?

Minton: And all the touch points that they have, not only with things on our website, but also with things across the Internet, with our customer-care employees, and with all the other touch points we use to try to solve our customers’ needs.

Gardner: This might be a difficult question, but how much data are we talking about? Obviously you're in sort of a peak-use scenario where many people are in a tax-preparation mode in the weeks and months leading up to April 15 in the United States. How much data and how rapidly is that coming into you?

Minton: We have a tremendous amount of data. I'm not going to go into the specifics of the complete size of our database because it is proprietary, but during our peak times of the year during tax season, we have billions and billions of transactions.

We have all of those touch points being logged in real-time, and we basically have all of that data flowing through to our applications that we then use to get insights and to be able to help our customers even more than we could before. So we're talking about billions of events over a small number of days.

Gardner: So clearly for those of us that define big data by velocity, by volume, and by variety, you certainly meet the criteria and then some.

Unique challenges

Minton: The challenges are unique for TurboTax because we're such a peaky business. We have two peaks that drive the majority of our experiences: the first when people get their W-2s and are looking for their refunds, and the second on tax day, April 15. At both of those times, we're ingesting a tremendous amount of data and trying to get insights as quickly as we can, so we can help our customers as quickly as we can.

Gardner: Let’s go back to this concept of the user-experience improvement process. It's not just something for tax-preparation applications, but also for retail, healthcare, and many other areas where user expectations are getting higher and higher. People expect more. They expect anticipation of their needs and then delivery on that.

This is probably only going to increase over time, Joel. Tell me a little bit about how you're solving this issue of getting to know your user and then being responsive to the entire user experience and perception.

Minton: Every customer is unique, Dana. We have millions of customers who have slightly different needs based on their unique situations. What we do is try to give them a unique experience that closely matches their background and preferences, and we try to use all of that information that we have to create a streamlined interaction where they can feel like the experience itself is tailored for them.

It’s very easy to say, “We can’t personalize the product because there are so many touch points and there are so many different variables.” But we can, in fact, make the product much more simplified and easy to use for each one of those customers. Data is a huge part of that.

Specifically, our customers at times may have problems in the product finding the right place to enter a certain tax situation. They get stuck and don't know what to enter. When they're in those situations, they will frequently ask us for help and ask how to do a certain task. We can then build code and algorithms to handle all those situations proactively and solve that for our customers in the future as well.

So the most important thing is taking all of that data and then providing super-personalized experience based on the experience we see for that user and for other users like them.

Gardner: In a sense, you're a poster child for many of the elements other organizations are dealing with, but at a scale well above the norm: the peaky nature of tax preparation, the desire to be highly personalized down to the granular level for each user, and the vast amount and velocity of data.

What were some of your chief requirements at the architecture level to be able to accommodate some of this? Tell us a little bit, Joel, about the journey you’ve been on to improve that architecture over the past couple of years.

Lot of detail

Minton: There's a lot of detail behind the scenes here, and I'll start by saying it's not an easy journey. It’s a journey that you have to be on for a long time and you really have to understand where you want to place your investment to make sure that you can do this well.

One area where we've invested heavily is our big-data infrastructure, being able to ingest all of the data in order to track it all. We've also invested a lot in being able to get insights out of the data, using Hewlett Packard Enterprise (HPE) Vertica as our big-data platform and querying that data as close to real time as possible to actually get those insights. I see those as the meat and potatoes that you have to have in order to be successful in this area.

On top of that, you then need to have an infrastructure that allows you to build personalization on the fly. You need to be able to make decisions in real time for the customers and you need to be able to do that in a very streamlined way where you can continuously improve.

We use a lot of tactics, including machine learning and other predictive models, to build that personalization on the fly as people are going through the application. That is some of our secret sauce and I won't go into more detail, but that’s what we're doing at a high level.

Gardner: It might be off the track of our discussion a bit, but being able to glean information through analytics and then create a feedback loop into development can be very challenging for a lot of organizations. Is DevOps a cultural parallel path along with your data-science architecture?
HP Vertica Community Edition
Start Your Free Trial Now
Gain Access to All Features and Functionality
I don’t want to go down the development path too much, but it sounds like you're already there in terms of understanding the importance of applying big-data analytics to the compression of the cycle between development and production.

Minton: There are two different aspects there, Dana. Number one is making sure that we understand the traffic patterns of our customers and that, from an operations perspective, we understand how our users are traversing our application, so that we're able to serve them and their performance is just amazing every single time they come to our website. That’s number one.

Number two, and I believe more important, is the need to actually put the data in the hands of all of our employees across the board. We need to be able to tell our employees the areas where users are getting stuck in our application. This is high-level information. This isn't anybody's financial information at all -- just a high-level clickstream of data saying that these people went through the application and got stuck on this specific area of the product.

We want to put that type of information in our developers’ hands, so that as a developer is building a part of the product, she can say, “I'm seeing that these types of users get stuck at this part of the product. How can I improve the experience as I'm developing it, taking all of that data into account?”

We have an analyst team that does great work on the analytics, but in addition to that, we want to give that data to the product managers and the developers as well, so they can improve the application as they're building it. To me, a 360-degree view of the customer is number one. Number two is getting that data out to as broad an audience as possible, to make sure they can act on it and help our customers.

Major areas

Gardner: Joel, I speak with HPE Vertica users quite often, and there are two major areas where I hear them praise the product. The first has to do with the ability to assimilate data -- dealing with the variety issue by bringing data into an environment where it can be used for analytics. The second is performance: running queries of great complexity, with many parameters, at speed and scale.

Your TurboTax applications run across a variety of platforms. There's a shrink-wrapped product from the legacy perspective, and then there are the mobile, web, and SaaS versions. Is Vertica something that you're using to help bring the data from a variety of different application environments together, and/or across different networks or environments?

Minton: I don't see different devices that someone might use as a different solution in the customer journey. To me, every device that somebody uses is a touch point into Intuit and into TurboTax. We need to make sure that all of those touch points have the same level of understanding, the same level of tracking, and the same ability to help our customers.

Whether somebody is using TurboTax on their computer or they're using TurboTax on their mobile device, we need to be able to track all of those things as first-class citizens in the ecosystem. We have a fully-functional mobile application that’s just amazing on the phone, if you haven’t used it. It's just a great experience for our customers.

From all those devices, we bring all of that data back to our big-data platform. All of that data can then be queried, because you want to answer many questions, such as: When do users flow across different devices, and what experience are they getting on each device? When are they able to just snap a picture of their W-2, import it really quickly on their phone, and then jump right back to their computer and finish their taxes with great ease?

We need to have that level of tracking across all of those devices. The key, from a technology perspective, is creating APIs that are generic across all of those devices, and then having those APIs feed all of that data back to our massive back-end infrastructure, so we can get those insights through reporting and other methods as well.
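
A sketch of such a device-agnostic tracking call appears below. The endpoint URL, event names, and payload fields are illustrative assumptions, not Intuit's actual API.

    import json
    import time
    import urllib.request

    def track(device, user_id, event, properties):
        """Post one interaction to a shared collection endpoint.

        Web, mobile, and desktop clients all send the same shape, so
        the back-end can join one user's journey across devices."""
        payload = {
            "user_id": user_id,   # same identity on every device
            "device": device,     # "web", "ios", "android", "desktop"
            "event": event,       # e.g. "w2_photo_import_completed"
            "properties": properties,
            "ts": time.time(),
        }
        req = urllib.request.Request(
            "https://collect.example.com/v1/events",  # illustrative URL
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)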

Gardner: We've talked quite a bit about what's working for you: a column-store database, and the ability to get volume, variety, and velocity managed in your massive data environment. But what didn't work? Where were you before, and what needed to change in order for you to accommodate your ongoing requirements in your architecture?

Minton: Previously we were using a different data platform, and it was good for getting insights for a small number of users. We had an analyst team of 8 to 10 people, and they were able to do reports and get insights as a small group.

But when you talk about moving to what we just discussed, a huge view of the customer end-to-end, hundreds of users accessing the data, you need to be able to have a system that can handle that concurrency and can handle the performance that’s going to be required by that many more people doing queries against the system.

Concurrency problems

So we moved away from our previous vendor, which had some concurrency problems, and we moved to HPE Vertica, because it handles concurrency much better, handles workload management much better, and allows us to pull in all this data.

The other thing that we've done is expand our use of Tableau, which is a great platform for pulling data out of Vertica and then using those extracts in multiple front-end reports that serve our business needs as well.

So in terms of using technology to be able to get data into the hands of hundreds of users, we use a multi-pronged approach that allows us to disseminate that information to all of these employees as quickly as possible and to do it at scale, which we were not able to do before.

Gardner: Of course, getting all your performance requirements met is super important, but also in any business environment, we need to be concerned about costs.

Is there anything about the way that you were able to deploy Vertica, perhaps using commodity hardware, perhaps a different approach to storage, that allowed you to both accomplish your requirements, goals in performance, and capabilities, but also at a price point that may have been even better than your previous approach?

Minton: From a price perspective, we've been able to really make the numbers work and get great insights for the level of investment that we've made.

How do we handle just the massive cost of the data? That's a huge challenge that every company is going to have in this space, because there's always going to be more data that you want to track than you have hardware or software licenses to support.

So we've been very aggressive in looking at each and every piece of data that we want to ingest. We want to make sure that we ingest it at the right granularity.

Vertica is a high-performance system, but you don't need absolutely every detail that you’ve ever logged for every customer in that platform. We do keep a lot of detailed information in Vertica, but we're also really smart about what we move in there from a storage perspective and what we keep outside in our Hadoop cluster.

Hadoop cluster

We have a Hadoop cluster that stores all of our data and we consider that our data lake that basically takes all of our customer interactions top to bottom at the granular detail level.

We then take data out of there and move it over to Vertica, in both aggregate and detail form, where it makes sense. We've been able to spend the right amount of money on each of our solutions to get the insights we need, without overwhelming either the licensing cost or the hardware cost of our Vertica cluster.
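
The detail-versus-aggregate split can be pictured with this simplified sketch: granular events stay in the data lake, and only a much smaller roll-up is loaded into Vertica. The table and field names are hypothetical.

    from collections import Counter

    def daily_screen_counts(raw_events):
        """Roll clickstream detail up to one count per screen per day.

        The full detail stays in the Hadoop data lake; only this much
        smaller aggregate is loaded into Vertica for fast reporting."""
        counts = Counter()
        for ev in raw_events:  # iterate over data-lake records
            counts[(ev["date"], ev["screen"])] += 1
        return counts

    # The aggregate could then be bulk-loaded with Vertica's COPY, e.g.:
    #   COPY analytics.daily_screen_counts (event_date, screen, views)
    #   FROM LOCAL '/tmp/daily_screen_counts.csv' DELIMITER ',';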

The combination of those things has really allowed us to be successful in matching the business benefit to the investment level for both Hadoop and Vertica.

Gardner: Measuring success quantitatively at the platform level, as you have been, is important, but there's also a qualitative benefit that needs to be examined, and even measured, when you're talking about things like process improvements, eliminating bottlenecks in the user experience, or eliminating anomalies in certain types of individual personalized activities -- a bit more qualitative than quantitative.

Do you have any insight, either anecdotal or examples, where being able to apply this data analytics architecture and capability has delivered some positive benefits, some value to your business?

Minton: We basically use data to try to measure ourselves as much as possible. So we do have qualitative, but we also have quantitative.

Just to give you a few examples: the total aggregate number of insights that we've been able to garner from the new system versus the old system is a 271 percent increase. We're able to run a lot more queries and get a lot more insights out of the platform now than we ever could on the old system. We have also had a 41 percent decrease in query time. Employees who were previously pulling data and waiting twice as long had a really frustrating experience.

Now, we're actually performing much better and we're able to delight our internal customers to make sure that they're getting the answers they need as quickly as possible.

We've also increased the size of our data mart in general by 400 percent. We've massively grown the platform while improving performance. All of those quantitative numbers tell a great story about the success that we've had.

From a qualitative perspective, I've talked to a lot of our analysts and a lot of our employees, and they've all said that the solution we have now is head and shoulders above what we had previously. Mostly that’s because, during those peak times when we're running a lot of traffic through our systems, it’s very easy for all the users to hit the platform at the same time; on the old system, nobody got any work done because of the concurrency issues.

Better tracking

Because we have much better tracking of that now with Vertica and our new platform, we're able to handle that concurrency, get the highest-priority workloads through quickly, and then follow along with the lower-priority workloads, running them all in parallel.

The key is being able to run, especially at those peak loads, and be able to get a lot more insights than we were ever able to get last year.
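
Vertica exposes this kind of workload prioritization through resource pools. As a hedged sketch only, the pool names, sizes, and service accounts below are invented; the DDL shows one way the high-priority versus low-priority split Minton describes could be configured:

```python
# Hypothetical sketch: Vertica resource pools give high-priority reporting
# queries resources first at peak load, while lower-priority batch work
# still runs in parallel. Pool names and sizing values are invented.
import vertica_python

DDL = [
    # A higher PRIORITY value wins when pools compete for resources.
    """CREATE RESOURCE POOL exec_reporting
           MEMORYSIZE '8G' PRIORITY 80 MAXCONCURRENCY 20""",
    """CREATE RESOURCE POOL batch_etl
           MEMORYSIZE '4G' PRIORITY 10 MAXCONCURRENCY 5""",
    # Route each service account to the pool that matches its workload.
    "ALTER USER analyst_svc RESOURCE POOL exec_reporting",
    "ALTER USER etl_svc RESOURCE POOL batch_etl",
]

conn = vertica_python.connect(host='vertica.example.com', port=5433,
                              user='dbadmin', password='...',
                              database='analytics')
cur = conn.cursor()
for stmt in DDL:
    cur.execute(stmt)
conn.commit()
conn.close()
```
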
Gardner: And that peak-load issue is so prominent for you. Another quick aside: are you using cloud or hybrid cloud to support any of these workloads, given their peaky nature, rather than keeping all that infrastructure running 24x7, 365 days a year? Is that something you've been doing, or something you're considering?

Minton: Sure. For a lot of our data-warehousing solutions, we do use cloud at points in our systems. A lot of our large-scale serving activities, as well as our large-scale ingestion, leverage cloud technologies.

We don't do that for our core data warehouse; we want to make sure that we have all of that data in-house in our own data centers. But we do ingest a lot of the data as pass-throughs in the cloud, just to give us the peak scalability that we wouldn't have otherwise.

Gardner: We're coming up toward the end of our discussion time. Let's look at what comes next, Joel, in terms of where you can take this. You mentioned some really impressive qualitative and quantitative returns and improvements. We can always expect more data, more need for feedback loops, and higher levels of user expectation and experience. Where would you like to go next? How do you focus even more on this issue of extreme personalization?

Minton: There are a few things that we're doing. We've built the infrastructure we need to really knock it out of the park over the next couple of years. The next level of innovation for us is going to be, number one, increasing our use of personalization and making it much easier for our customers to get what they need when they need it.

So doubling down on that and increasing the number of use cases where our data scientists are actually building models that serve our customers throughout the entire experience is going to be one huge area of focus.

Another big area of focus is getting the data even more in real time. As I discussed earlier, Dana, we're a very peaky business, and the faster we can get data into our systems, the faster we're going to be able to report on that data and get insights that will help our customers.

Our goal is to have even more real-time streams of that data and be able to get that data in so we can get insights from it and act on it as quickly as possible.
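
The transcript doesn't say how those streams are built; as one hedged illustration, a micro-batch loader could drain events from an upstream (cloud) collector and COPY them into Vertica every few seconds, trading a little latency for load efficiency. Every name below is invented:

```python
# Hypothetical micro-batch loader: drain a queue of events fed by an
# upstream collector (assumed) and COPY them into Vertica in small
# batches. The queue, table, and thresholds are all invented.
import io
import queue
import vertica_python

events = queue.Queue()  # filled elsewhere with CSV lines, e.g. "uid,action,ts\n"

def flush_batch(cur, batch):
    """COPY one micro-batch of CSV rows into Vertica."""
    payload = io.BytesIO(''.join(batch).encode('utf-8'))
    cur.copy("COPY user_events FROM STDIN DELIMITER ','", payload)

conn = vertica_python.connect(host='vertica.example.com', port=5433,
                              user='loader', password='...',
                              database='analytics')
cur = conn.cursor()
batch = []
while True:
    try:
        batch.append(events.get(timeout=5))
    except queue.Empty:
        pass
    # Flush on size, or whenever the queue goes quiet with rows pending.
    if len(batch) >= 1000 or (batch and events.empty()):
        flush_batch(cur, batch)
        conn.commit()
        batch = []
```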

The other side is continuing to invest in our multi-platform approach, allowing customers to do their taxes and manage their finances on whatever platform they're on, whether that's mobile, web, TV, or whatever device they might use. We need to make sure we can serve those data needs and give users great personalized experiences no matter what platform they're on. Those are some of the big areas where we'll be focused over the coming years.

Recommendations

Gardner: Now you've had some 20/20 hindsight into moving from one data environment to another, which I suppose is the equivalent of keeping the airplane flying while changing the wings. Do you have any words of wisdom for those who might be having concurrency, scale, velocity, or variety issues with their big data when it comes to moving from one architecture platform to another? Any recommendations that might help them in ways you didn't necessarily get the benefit of?

Minton: To start, focus on the real business needs and the competitive advantage your business is trying to build, and invest in data to enable those things. It's very easy to say you're going to replace your entire data platform and build everything soup to nuts in one year, but I have seen those types of projects tried, and fail, over and over again. I find it works better to put the platform in place at a high level and then look for a few key business use cases where you can leverage that platform for real business benefit.

When you're able to do that two, three, or four times on a smaller scale, then it makes it a lot easier to make that bigger investment to revamp the whole platform top to bottom. My number one suggestion is start small and focus on the business capabilities.

Number two, be really smart about where your biggest pain points are. Don't try to solve world hunger when it comes to data. If you're having a concurrency issue, look at the platform you're using and ask whether there's a way to solve it in your current platform without going big.

Frequently, what I find in data is that it's not always the platform's fault that things aren't performing. It could be the way things are implemented, so it could be a software problem as opposed to a hardware or platform problem.
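
One cheap way to test that before buying hardware is to look at the query plan. As a hedged sketch (the query and names are invented), Vertica's EXPLAIN often reveals an implementation problem that no extra capacity would fix:

```python
# Hypothetical sketch: inspect a slow query's plan with EXPLAIN before
# concluding that the cluster needs to grow. The query is invented.
import vertica_python

SLOW_QUERY = """
    SELECT feature, COUNT(*)
    FROM user_events
    WHERE event_date >= '2015-01-01'
    GROUP BY feature
"""

conn = vertica_python.connect(host='vertica.example.com', port=5433,
                              user='analyst', password='...',
                              database='analytics')
cur = conn.cursor()
cur.execute("EXPLAIN " + SLOW_QUERY)
for row in cur.fetchall():
    # The plan text can expose a missing projection or a bad join order,
    # i.e., a software problem rather than a platform problem.
    print(row[0])
conn.close()
```
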
So again, I would have folks focus on the real problem and the different methods that you could use to actually solve those problems. It’s kind of making sure that you're solving the right problem with the right technology and not just assuming that your platform is the problem. That’s on the hardware front.

As I mentioned earlier, looking at the business use cases and making sure that you're solving those first is the other big area of focus I would have.

Gardner: I'm afraid we'll have to leave it there. We've been learning about how Intuit uses deep data analytics to gain a 360-degree view of its TurboTax application users' behavior and preferences. And we've heard how such visibility allows for rapid application improvements, providing an extreme level of personalization and enabling TurboTax users to experience a higher degree of customization, something tailored directly to their situation.

So join me in thanking Joel Minton, Director of Data Science and Engineering for TurboTax at Intuit in San Diego. Thanks so much, Joel.

Minton: Thank you, Dana. I really enjoyed it.

Gardner: And I'd also like to thank our audience for joining this big-data innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how TurboTax uses big data analytics to improve performance despite high data volumes during peak usage. Copyright Interarbor Solutions, LLC, 2005-2015. All rights reserved.
