
Tuesday, August 23, 2016

How HudsonAlpha Innovates on IT for Research-Driven Education, Genomic Medicine and Entrepreneurship

Transcript of a discussion on how HudsonAlpha leverages modern IT infrastructure and big data analytics to power research projects as well as pioneering genomic medicine findings.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the Hewlett Packard Enterprise (HPE) Voice of the Customer podcast series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation -- and how it's making an impact on people's lives.

Our next IT infrastructure thought leadership case study explores how the HudsonAlpha Institute for Biotechnology engages in digital transformation for genomic research and healthcare paybacks.

We'll now hear how HudsonAlpha leverages modern IT infrastructure and big-data analytics to power a pioneering research project incubator and genomic medicine innovator.

To describe new possibilities for exploiting cutting-edge IT infrastructure and big data analytics for healthcare innovation, we're joined by Dr. Liz Worthey, Director of Software Development and Informatics at the HudsonAlpha Institute for Biotechnology in Huntsville, Alabama. Welcome, Liz.

Dr. Liz Worthey: Thanks for inviting me.

Gardner: It seems to me that genomics research and IT have a lot in common. There's not much daylight between them -- two different types of technology, but highly interdependent. Have I got that right?
Worthey: Absolutely. It used to be that the IT infrastructure was fairly far away from the clinic or the research, but now they're so deeply intertwined that it necessitates many meetings a week between the leadership of both in order to make sure we get it right.

Gardner: And you have background in both. Maybe you can tell us a little bit about that.

Worthey: My background is primarily on the biology side, although I'm Director of Informatics and I've spent about 20 years working in the software-development and informatics side. I'm not IT Director, but I'm pretty IT savvy, because I've had to develop that skill set over the years. My undergraduate degree was in immunology, and since then, my focus has really been on genetics informatics and bioinformatics.

Gardner: Please describe what genetic informatics or genomic informatics is for our audience.

Worthey: Since 2003, when we received the first version of a human reference genome, there's been a large field involved in the task of extracting knowledge that can be used for society and health from genomic data.

A [human] genome is 3.2 billion nucleotides in length, and in there, there's a lot of really useful information. There's information about which diseases that individual may be more likely to get and which diseases they will get.

There's also information about which drugs they should and shouldn't take, and about which types of surveillance procedures -- colonoscopies, for example -- they should have. And so, the clinical side of genomics is really about developing the analytical capabilities to extract that data in real time, so that we can use it to help an individual patient.

On top of that, there's also a lot of research. A lot of that is in large-scale studies across hundreds of thousands of individuals to look for signals that are more difficult to extract from a single genome. Genomics, clinical genomics, is all of that together.

Parallel trajectory

Gardner: Where is the societal change potential in terms of what we can do with this information and these technologies?

Worthey: Genomics has existed for maybe 20 years, but the vast majority of that was the first step. Over the last six years, we've taken maybe the second or third step in a journey that’s thousands of steps long.

We're right on the edge. We didn't use to be able to do this, because we didn't have any data. We didn't have the capability to sequence a genome cheaply enough to sequence lots of them. We also didn't have the storage capabilities to store that data even if we could produce it, and we certainly didn't have enough compute, infrastructure-wise, to do the analysis. On top of that, we didn't actually have the analytical know-how or capabilities either. All of that is really coalescing at the same time.

As genomics and the sequencing technology have come up, the compute and the computing technologies have come up at the same time. They're feeding each other, and genomics is now driving IT to think about things in a very different way.

Gardner: Let's dive into that a little bit. What are the hurdles technologically for getting to where you want to be, and how do you customize that or need to customize that, for your particular requirements?

Worthey: There are a number of hurdles. Certainly, there are simpler hurdles that we have to get past, like storage, storage tied with compression. How do you compress that data to where you can store millions of genomes at a price that's affordable?
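
To put that scale in rough numbers, here is a back-of-envelope sketch in Python. Everything beyond the 3.2 billion bases mentioned above -- the 2-bit encoding, the assumed compression ratio, the genome counts -- is illustrative, not HudsonAlpha's actual pipeline figures, and real sequencing output (with quality scores and redundant reads) is far larger than the bare sequence.

```python
# Back-of-envelope estimate of genome storage needs.
# Assumes a 2-bit encoding per base and an illustrative 4:1 further
# compression ratio -- hypothetical figures, not HudsonAlpha's pipeline.

BASES_PER_GENOME = 3_200_000_000   # ~3.2 billion nucleotides
BITS_PER_BASE = 2                  # A, C, G, T fit in 2 bits
COMPRESSION_RATIO = 4              # assumed additional compression

raw_bytes = BASES_PER_GENOME * BITS_PER_BASE / 8
compressed_bytes = raw_bytes / COMPRESSION_RATIO

def petabytes(n_genomes: int) -> float:
    """Storage in petabytes for n compressed genomes (sequence only)."""
    return n_genomes * compressed_bytes / 1e15

print(f"One genome, 2-bit packed: {raw_bytes / 1e6:.0f} MB")
print(f"One million genomes, compressed: {petabytes(1_000_000):.1f} PB")
```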

A bigger hurdle is the ability to query information at a lot of disparate sites. When we think about genomic medicine, one of the things that we really want to do is share data between institutions that are geographically diverse. And the data that we want to share is millions of data points, each of which has hundreds or thousands of annotations or curations.

Those are fairly complex queries, even when you're doing it in one site, but in order to really change the practice of medicine, we have to be able to do that regionally, nationally, and globally. So, the analytics questions there are large.

We have 3.2 billion data points for each individual. The data is quite broad, but it’s also pretty deep. One of the big problems is that we don’t have all the data that we need to do genomic medicine. There's going to be data mining -- generate the data, form a hypothesis, look at the data, see what you get, come back with a new hypothesis, and so on.

Finally, one of the problems that we have is that a lot of the algorithms that you might use exist only in the brains of MDs, other clinical folks, or researchers. There is really a lot of human-computer interaction work to be done, so that we can extract that knowledge.

There are lots of problems. Another big problem is that we really want to put this knowledge in the hands of the doctor while they have seven minutes to see the patient. So, it’s also delivery of answers at that point in time, and the ability to query the data by the person who is doing the analysis, which ideally will be an MD.

Cloud technology

Gardner: Interestingly, the emergence of cloud methods and technology over the past five or 10 years would address some of those issues about distributing the data effectively -- and also perhaps getting actionable intelligence to a physician in an actual critical-care environment. How important is cloud to this process and what sort of infrastructure would be optimal for the types of tasks that you have in mind?

Worthey: If you had asked me that question two years ago, on the genomic medicine side, I would have said that cloud isn't really part of the picture. It wasn't part of the picture for anything other than business reasons. There were a lot of questions around privacy and sharing of healthcare information, and hospitals didn’t like the idea.

They were very reluctant to move to the cloud. Over the last two years, that has started to change. Enough of them had to decide to do it before everybody would view it as something permissible.

Cloud is absolutely necessary in many ways, because we have periods where lots of data has to be computed and analytics have to be run. Then, we have periods where new information is coming off the sequencer. So, it's that perfect crest and trough.

If you don't have the ability to deal with that sort of fluctuation, if you buy a certain amount of hardware and you only have it available in-house, your pipeline becomes impacted by the crests and then often sits idle for a long time.
But it’s also important to have stuff in-house, because sometimes, you want to do things in a different way. Sometimes, you want to do things in a more secure manner.

Genomics is kind of a poster child for many of the new technologies coming out that address both of those -- technologies that allow you to run things in-house and then also run the same jobs on the same data in the cloud. So, it's key.

Gardner: That brings me to the next question about this concept of genomics as a service or a platform to support genomics as a service. How do you envision that and how might that come about?

Worthey: When we think about the infrastructure to support that, it has to be something flexible and it has to be provided by organizations that are able to move rapidly, because the field is moving really quickly.

It has to be infrastructure that supports this hypothesis-driven research, and it has to be infrastructure that can deal with these huge datasets. Much of the data is ordered, organized, and well-structured, but because it's healthcare, a lot of the information that we use as part of the interpretation phase of genomic medicine is completely unstructured. There needs to be support for extraction of data from silos.

My dream is that the people who provide these technologies will also help us deal with some of these boundaries, the policy boundaries, to sharing data, because that’s what we need to do for this to become routine.

Data and policy

Gardner: We've seen some of that when it comes to other forms of data, perhaps in the financial sector. More and more, we're seeing tokenization, authentication, and encryption, where data can exist for a period of time with a certain policy attached to it, and then something happens to the data as a result of that policy. Is that what you're referring to?

Worthey: Absolutely. It's really interesting to come to a meeting like HPE Discover because you get to see what everybody else is doing in different fields. Much of the things that people in my field have regarded as very difficult are actually not that hard at all; they happen all the time in other industries.

A lot of this -- the encryption, the encrypted data sharing, the ability to set those access controls in a particular way that only lasts for a certain amount of time for a particular set of users -- seems complex, but it happens all the time in other fields. A big part of this is talking to people who have a lot of experience in a regulated environment -- just not this particular regulated environment -- learning the language they use to talk to the people who set policy there, transferring that to our policy makers, and ideally getting them together to talk to one another.
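
What such a time-limited, user-scoped sharing policy might look like in code is sketched below. This is a minimal illustration only; the class, field names, and users are invented, not any system Worthey describes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Set

@dataclass
class SharingPolicy:
    """Hypothetical time-limited data-sharing policy for a genomic dataset."""
    dataset_id: str
    allowed_users: Set[str]
    expires_at: datetime

def may_access(policy: SharingPolicy, user: str, now: Optional[datetime] = None) -> bool:
    """Grant access only to named users, and only before the policy expires."""
    now = now or datetime.now(timezone.utc)
    return user in policy.allowed_users and now < policy.expires_at

policy = SharingPolicy(
    dataset_id="cohort-42",
    allowed_users={"dr_smith", "dr_jones"},
    expires_at=datetime(2017, 1, 1, tzinfo=timezone.utc),
)
query_time = datetime(2016, 9, 1, tzinfo=timezone.utc)
print(may_access(policy, "dr_smith", query_time))   # True: named user, not yet expired
print(may_access(policy, "outsider", query_time))   # False: not on the access list
```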

Gardner: Liz, you mentioned the interest in getting your requirements to the technology vendors, cloud providers, and network providers. Is that under way? Is that something that's yet to happen? Where is the synergy between the genomic research community and the technology-vendor and platform-provider community?

Worthey: This is happening fast. For genomics, there's been a shift in the volume of genomic data that we can produce with some new sequencing technology that's coming. If you're a provider of hardware, services, or solutions to deal with big data, you're looking at genomics, because the people here are probably going to overtake many of those other industries in terms of the volume and complexity of the data that we have.

That's really interesting because then you get invited to come and talk at forums where there are lots of technology companies; you can make them aware of the work that has to be done in the fields of medicine and genomic research, and then you can start having those discussions.

A lot of the things that those companies are already doing, the use cases, are similar and maybe need some refinement, but a lot of that capability is already there.

Gardner: It's interesting that you’ve become sort of the “New York” of use cases. If you can make it there, you can make it anywhere. In other words, if we can solve this genomic data issue and use the cloud fruitfully to distribute and gather -- and then control and monitor the data as to where it should be under what circumstances -- we can do just about anything.

Correct me if I am wrong, though. We're using data in the genomic sense for population groups. We're winnowing those groups down into particular diseases. How farfetched is it to think about individuals having their own genomic database that would follow them like an authenticated human design? Is that completely out of the bounds? How far would that possibly be?

Technology is there

Worthey: I’ve had my genome sequenced, and it’s accessible. I could pick it up and look at it on the tools that I developed through my phone sitting here on the table. In terms of the ability to do that, a lot of that technology is already here.

The number of people that are being sequenced is increasing rapidly. We're already using genomics to make diagnosis in patients and to understand their drug interactions. So, we are here.

One of the things that we're talking about just now is at what point in a person's life you should sequence their genome. I and a number of other people in the field believe that it should be earlier rather than later, before they get sick. Then, we have that information to use when they get those first symptoms. You aren't waiting until they're really ill before you do that.

I can’t imagine a future where that's not what's going to happen, and I don’t think that future is too far away. We're going to see it in our lifetimes, and our children are definitely going to see it in theirs.

Gardner: The inhibitors, though, would be more of an ethical nature, not a technological nature.

Worthey: And policy, and society; the society impact of this is huge.

The data that we already have, clinical information, is really for that one person, but your genome is shared among your family, even distant relatives that you’ve never met. So, when we think about this, there are many very hard ethical questions that we have to think about. There are lots of experts that are working on that, but we can’t let that get in the way of progress. We have to do it. We just have to make sure we do it right.

Gardner: To come back down a little bit toward the technology side of things, seeing as so much progress has been made and that there is the tight relationship between information technology and some of the fantastic things that can happen with the proper knowledge around genomic information, can you describe the infrastructure you have in place? What’s working? What do you use for big-data infrastructure, and cloud or hybrid cloud as well?

Worthey: I'm not on the IT side, but I can tell you about the other side and I can talk a little bit on the IT side as well. In terms of the technologies that we use to store all of that varying information, we're currently using Hadoop and MongoDB. We finished our proof of concept with HPE, looking at their Vertica solution.

We have to work out what the next steps might be for our proof of concept. Certainly, we're very interested in looking at the solutions that they have here. They fit with our needs. The issue being addressed on that side is lots of variants and complex queries that you need to answer really fast.
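
To give a flavor of that "lots of variants, complex queries, fast" workload that a columnar analytics engine such as Vertica targets, here is an illustrative query held in a Python string. The table and column names are invented for the example, not HudsonAlpha's schema.

```python
# Illustrative only: a hypothetical variant table and the kind of analytic
# query a columnar engine is built to answer quickly over billions of rows.
variant_query = """
SELECT v.gene, v.consequence, COUNT(DISTINCT v.sample_id) AS carriers
FROM   variants v
JOIN   annotations a ON a.variant_id = v.variant_id
WHERE  a.population_frequency < 0.01                 -- rare variants only
  AND  a.predicted_impact IN ('HIGH', 'MODERATE')
GROUP  BY v.gene, v.consequence
ORDER  BY carriers DESC
LIMIT  20;
"""
print(variant_query)   # would be executed through whatever database client is in use
```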

On the other side, one of the technological hurdles that we have to meet is the unstructured data. We have electronic health record (EHR) information that’s coming in. We want to hook up to those EHRs and we want to use systems to process that data to make it organized, so that we can use it for the interpretation part.

In-house solution

We developed in-house solutions that we're using right now that allow humans to come in, look at that data, and select the terms from it. So, you'd select disease terms. And then, we have in-house solutions to map them to the genomic side. We're looking at things like HPE's IDOL as a proof-of-concept (POC) on that side. We're talking to some EHR companies about how to hook up the EHR, to those solutions, and to our software to make it a seamless product that would give us all of that.
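
The term-selection and mapping step might look something like the sketch below. The term-to-gene table is a stand-in for illustration; a production pipeline would draw on curated ontologies and gene-disease databases rather than a hard-coded dictionary.

```python
# A minimal sketch of mapping clinician-selected disease/phenotype terms to
# candidate genes. The mapping table here is a toy stand-in.

TERM_TO_GENES = {
    "cardiomyopathy": {"MYH7", "TNNT2"},
    "retinitis pigmentosa": {"RHO", "USH2A"},
}

def candidate_genes(selected_terms):
    """Union of candidate genes for all phenotype terms selected from the EHR."""
    genes = set()
    for term in selected_terms:
        genes |= TERM_TO_GENES.get(term.lower(), set())
    return genes

print(sorted(candidate_genes(["Cardiomyopathy"])))   # ['MYH7', 'TNNT2']
```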

In terms of hardware, we do have HPE hardware in-house. I think we have 12 petabytes of their storage. We also have DataDirect Networks hardware with a General Parallel File System (GPFS) solution. We even have things down to graphics processors (GPUs) for some of the analysis that we do. We have a large bank of GPUs, because in some cases they're much faster for certain types of problems that we have to solve. So we're pretty IT-rich, with a lot of heavy investment on the IT side.

Gardner: And cloud -- any preference to the topology that works for you architecturally for cloud, or is that still something you are toying with?

Worthey: We're currently looking at three different solutions that are all cloud solutions. We not only do the research and the clinical, but we also have a lab that produces lots of data for other customers, a lab that produces genomic data as a service.

They have a challenge of getting that amount of data returned to customers in a timely fashion. So, there are solutions that we're looking at there. There are also, as we talked at the start, solutions to help us with that in-flow of the data coming off the sequencers and the compute -- and so we're looking at a number of different solutions that are cloud-based to solve some of those challenges.

Gardner: Before we close -- we've talked about healthcare and population impacts, but I should think there's also a commercial aspect to this. That kind of information will lend itself to entrepreneurial activities, products, and services that are in great demand in the marketplace. Is that something you're involved with as well, and wouldn't that help foot the bill for some of these many costly IT infrastructure investments?

Worthey: One of the ways that HudsonAlpha Institute was set up was just that model. We have a research, not-for-profit side, but we also have a number of affiliate companies that are for-profit, where intellectual property and ideas can go across to that side and be used to generate revenue that funds the research and keeps us moving and on the cutting edge.

We do have a services lab that does genomic sequencing and analytics. You can order that from them. We also serve a lot of people who have government contracts for this type of work. And then, we have an entity called Envision Genomics. For disclosure, I'm one of the founders of that entity. It's focused on empowering people to do genomic medicine and working with lots of different solution providers to get genomic medicine being done everywhere it's applicable.

Gardner: Well, it's been a fascinating discussion. Thank you for sharing that, and I look forward to tracking the close relationship between IT and genomics, because as you say, they are self-supporting, reinforcing, and both very powerful in their impacts on society.
We've been learning how the HudsonAlpha Institute for Biotechnology engages in digital transformation for genomic research and for healthcare paybacks. We’ve heard how HudsonAlpha leverages modern IT and cloud infrastructures and big data analytics to power research projects as well as pioneering medicine uses and companies.

So please join me in thanking our guest, Dr. Liz Worthey, Director of Software Development and Informatics at the HudsonAlpha Institute for Biotechnology in Huntsville, Alabama. Thank you so much, Liz.

Worthey: Thank you very much. Thanks for inviting me.

Gardner: And I will also thank our audience as well for joining us for this Hewlett Packard Enterprise Voice of the Customer Podcast. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and do come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how HudsonAlpha leverages modern IT infrastructure and big data analytics to power research projects as well as pioneering genomic medicine findings. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Friday, June 24, 2016

Here's How Two Part-Time DBAs Maintain Mobile App Ad Platform Tapjoy’s Massive Data Needs

Transcript of a discussion on how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next BriefingsDirect Voice of the Customer innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT transformation, and how it’s making an impact on people’s lives.

Our next big-data case study discussion examines how mobile app advertising platform Tapjoy handles fast and massive data -- some two dozen terabytes per day -- with just two part-time database administrators (DBAs).

We'll examine how Tapjoy's data-driven business of serving 500 million global mobile users -- more than 1.5 million ad engagements per day and a data volume of 120 terabytes -- runs on HPE Vertica.

We'll learn more about how high scale and complexity meets minimal labor for building user and advertiser loyalty from David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco.
Welcome, David.

David Abercrombie: Thank you so much for having me.

Gardner: Mobile advertising has really been a major growth area, perhaps more than any other type of advertising. We hear a lot about advertising waning, but not mobile app advertising. How does Tapjoy and its platform help contribute to the success of what we're seeing in the mobile app ad space?

Abercrombie: The key to Tapjoy's success is engaging the users and rewarding them for engaging with an ad. Our advertising model is that you engage with an ad and then you typically get some sort of reward: virtual currency in the game you're playing or some sort of discount.

We actually have the kind of ads that lead users to seek us out to engage with the ads and get their rewards.

Gardner: So this is quite a bit different than a static presented ad. This is something that has a two-way street, maybe multiple directions of information coming and going. Why the analysis? Why is that so important? And why the speed of analysis?

Abercrombie: We have basically three types of customers. We have the app publishers who want to monetize and get money from displaying ads. We have the advertisers who need to get their message out and pay for that. Then, of course, we have the users who want to engage with the ads and get their rewards.

The key to Tapjoy’s success is being able to balance the needs of all of these disparate uses. We can’t charge the advertisers too much for their ads, even though the monetizers would like that. It’s a delicate balancing act, and that can only be done through big-data analysis, careful optimization, and careful monitoring of the ad network assets and operation.

Gardner: Before we learn more about the analytics, tell us a bit more about what role Tapjoy plays specifically in what looks like an ecosystem play for placing, evaluating, and monetizing app ads? What is it specifically that you do in this bigger app ad function?

Ad engagement model

Abercrombie: Specifically what Tapjoy does is enable this rewarded ad engagement model, so that the advertisers know that people are going to be paying attention to their ads and so that the publishers know that the ads we're displaying are compatible with their app and are not going to produce a jarring experience. We want everybody to be happy -- the publishers, the advertisers, and the users. That’s a delicate compromise that’s Tapjoy’s strength.

Gardner: And when you get an end user to do something, to take an action, that’s very powerful, not only because you're getting them to do what you wanted, but you can evaluate what they did under what circumstances and so forth. Tell us about the model of the end user specifically. What is it about engaging with them that leads to the data -- which we will get to in a moment?

Abercrombie: In our model of the user, we talk about long-term value. So even though it may be a new user who has just started with us, maybe their first engagement, we like to look at them in terms of their long-term value, both to the publishers and the advertiser.

We don’t want people who are just engaging with the ad and going away, getting what they want and not really caring about it. Rather, we want good users who will continue their engagement and continue this process. Once again, that takes some fairly sophisticated machine-learning algorithms and very powerful inferences to be able to assess the long-term value.

As an example, we have our publishers who are also advertisers. They're advertising their app within our platform and for them the conversion event, what they are looking for, is a download. What we're trying to do is to offer them users who will not only download the game once to get that initial payoff reward, but will value the download and continue to use it again and again.

So all of our models are designed with that end in mind -- to look at the long-term value of the user, not just the immediate conversion at this instant in time.
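
A toy illustration of ranking users by expected long-term value rather than by the immediate conversion alone is shown below. The formula, features, and numbers are invented; Tapjoy's actual models are machine-learned, as Abercrombie notes.

```python
# Toy illustration of ranking users by expected long-term value (LTV)
# rather than by immediate conversion alone. All figures are invented.

def expected_ltv(p_convert: float, avg_payout: float,
                 p_retained_30d: float, repeat_value: float) -> float:
    """Immediate expected revenue plus retention-weighted future value."""
    return p_convert * avg_payout + p_retained_30d * repeat_value

users = {
    "user_a": expected_ltv(0.30, 2.00, 0.10, 1.50),  # converts often, churns fast
    "user_b": expected_ltv(0.15, 2.00, 0.60, 1.50),  # converts less, sticks around
}
best_first = sorted(users, key=users.get, reverse=True)
print(best_first)  # ['user_b', 'user_a'] -- long-term value beats raw conversion here
```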

Gardner: So perhaps it’s a bit of a misnomer to talk about ads in apps. We're really talking about a value-add function in the app itself.

Abercrombie: Right. The people who are advertising don’t want people to just see their ads. They want people to follow up with whatever it is they're advertising. If it’s another app, they want good users for whom that app is relevant and useful.

That’s really the way we look at it. That’s the way to enhance the overall experience in the long-term. We're not just in it for the short-term. We're looking at developing a good solid user base, a good set of users who engage thoroughly.

Gardner: And as I said in my set-up, there's nothing hotter in all of advertising than mobile apps and how to do this right. It’s early innings, but clearly the stakes are very high.

A tough business

Abercrombie: And it’s a tough business. People are saturated. Many people don’t want ads. Some of the business models are difficult to master.

For instance, there may be a sequence of multiple ad units. There may be a video followed by another ad to download something. It becomes a very tricky thing to balance the financing here. If it was just a simple pass-through and we take a cut, that would be trivial, but that doesn't work in today's market. There are more sophisticated approaches, which do involve business risk.

If we reward the user, based on the fact that they're watching the video, but then they don't download the app, then we don't get money. So we have to look very carefully at the complexity of the whole interaction to make it as smooth and rewarding as possible, so that the thing works. That's difficult to do.
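
The margin risk Abercrombie describes can be sketched as simple expected-value arithmetic. All probabilities and prices below are illustrative, not Tapjoy figures.

```python
# Sketch of the margin risk in a multi-step rewarded ad unit: the user is
# rewarded for watching the video, but revenue arrives only if the later
# download happens. All numbers are illustrative.

def expected_margin(p_watch: float, p_download_given_watch: float,
                    advertiser_payout: float, reward_cost: float) -> float:
    """Expected profit per impression when the reward is granted at watch time."""
    revenue = p_watch * p_download_given_watch * advertiser_payout
    cost = p_watch * reward_cost          # reward is paid even without a download
    return revenue - cost

print(expected_margin(0.8, 0.25, 1.00, 0.10))  # 0.8*0.25*1.00 - 0.8*0.10 = 0.12
```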

Gardner: So we're in a dynamic, fast-growing, fairly fresh, new industry. Knowing what's going to happen before it happens is always fun in almost any industry, but in this case, it seems with those high stakes and to make that monetization happen, it’s particularly important.
Tell me now about gathering such large amounts of data, being able to work with it, and then allowing analysis to happen very swiftly. How do you go about making that possible?

Abercrombie: Our data architecture is relatively standard for this type of clickstream operation. There is some data that can be put directly into a transactional database in real time, but typically, that's only when you get to the very bottom of the funnel, the conversion stuff. But all the clickstream stuff gets written as JSON-formatted log files, gets swept up by a queuing system, and then put into our data systems.

Our legacy system involved a homegrown queuing system, dumping data into HDFS. From there, we would extract and load CSVs into Vertica. As with so many other organizations, we're moving to more real-time operations. Our queuing system has evolved from a couple of different homegrown applications, and now we're implementing Apache Kafka.

We use Spark as part of our infrastructure, as sort of a hub, if you will, where data is farmed out to other systems, including a real-time, in-memory SQL database, which is fairly new to us this year. Then, we're still putting data in HDFS, and that's where the machine learning occurs. From there, we're bringing it into Vertica.

In Vertica -- and our Vertica cluster has two main purposes -- there is the operational data store, which has the raw, flat tables that are one row for every event, with the millisecond timestamps and the IDs of all the different entities involved.

From that operational data store, we do a pure SQL ETL extract into kind of an old-school star schema within Vertica, the same database.

Pure SQL

So our business intelligence (BI) ETL is pure SQL and goes into a full-fledged snowflake schema, moderately denormalized with all the old-school bells and whistles, the type 1, type 2, slowly changing dimensions. With Vertica, we're able to denormalize that data warehouse to a large degree.
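
As a purely illustrative sketch of what a pure-SQL ETL step from an operational data store into a dimensional schema can look like, including a simplified type 2 slowly-changing-dimension update: every table and column name below is invented, not Tapjoy's actual schema.

```python
# Illustrative pure-SQL ETL from an operational data store into a dimensional
# schema, with a simplified type 2 slowly-changing-dimension step.
# A scheduler would run these statements through the usual database client.

close_changed_dimension_rows = """
UPDATE dim_app
SET    valid_to = CURRENT_TIMESTAMP, is_current = FALSE
WHERE  is_current
  AND  app_id IN (SELECT app_id FROM ods_app_changes);
"""

load_fact_engagements = """
INSERT INTO fact_ad_engagement (date_key, app_key, advertiser_key, engagements, revenue)
SELECT d.date_key, a.app_key, adv.advertiser_key,
       COUNT(*)          AS engagements,
       SUM(e.payout_usd) AS revenue
FROM   ods_events e
JOIN   dim_date d         ON d.calendar_date = CAST(e.event_ts AS DATE)
JOIN   dim_app a          ON a.app_id = e.app_id AND a.is_current
JOIN   dim_advertiser adv ON adv.advertiser_id = e.advertiser_id
GROUP  BY d.date_key, a.app_key, adv.advertiser_key;
"""
```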

Sitting on top of that we have a BI tool. We use MicroStrategy, for which we have defined our various metrics and our various attributes, and it’s very adept at knowing exactly which fact table and which dimensions to join.

So we have sort of a hybrid architecture. I'd say that we have all the way from real-time, in-memory SQL, Hadoop and all of its machine learning and our algorithmic pipelines, and then we have kind of the old-school data warehouse with the operational data store and the star schema.

Gardner: So a complex, innovative, custom architectural approach to this and yet I'm astonished that you are running and using Vertica in multiple ways with two part-time DBAs. How is it possible that you have minimal labor, given this topology that you just described?

Abercrombie: Well, we found Vertica very easy to manage. It has been very well-behaved, very stable.

For instance, we don’t even really use the Management Console, because there is not enough to manage. Our cluster is about 120 terabytes. It’s only on eight nodes and it’s pretty much trouble free.

One of the part-time DBAs deals with more operating-system-level stuff -- patches, cluster recovery, those sorts of issues. And the other part-time DBA is me. I deal more with data structure design, SQL tuning, and Vertica training for our staff.

In terms of ad-hoc users of our Vertica database, we have well over 100 people who have the ability to run any query they want at any time into the Vertica database.

When we first started out, we tried running Vertica in Amazon EC2. Mind you, this was four or five years ago. Amazon EC2 was not where it is today. It failed. It was very difficult to manage. There were perplexing problems that we couldn’t solve. So we moved our Vertica and essentially all of our big-data data systems out of the cloud onto dedicated hardware, where they are much easier to manage and much easier to bring the proper resources.

Then, at one time in our history, when we built a dedicated hardware cluster for Vertica, we failed to heed properly the hardware planning guide and did not provision enough disk I/O bandwidth. In those situations, Vertica is unstable, and we had a lot of problems.

But once we got the proper disk I/O, it has been smooth sailing. I can’t even remember the last time we even had a node drop out. It has been rock solid. I was able to go on a vacation for three weeks recently and know that there would be no problem, and there was no problem.

Gardner: The ultimate key performance indicator (KPI), "I was able to go on vacation."

Fairly resilient

Abercrombie: Exactly. And with the proper hardware design, HPE Vertica is fairly resilient against out-of-control queries. There was a time when half my time was spent monitoring for slow queries, but again, with the proper hardware, it's smooth sailing. I don’t even bother with that stuff anymore.

Our MicroStrategy BI tool writes very good SQL. Part of the key to our success with this BI portion is designing the Vertica schema and the MicroStrategy metadata layer to take advantage of each other’s strengths and avoid each other’s weaknesses. So that really was key to the stable, exceptional performance we get. I basically get no complaints of slow queries from my BI tool. No problem.

Gardner: The right kind of problem to have.

Abercrombie: Yes.

Gardner: Okay, now that we have heard quite a bit about how you are doing this, I'd like to learn, if I could, about some of the paybacks when you do this properly, when it is running well, in terms of SQL queries, ETL load times reduction, the ability for you to monetize and help your customers create better advertising programs that are acceptable and popular. What are the paybacks technically and then in business terms?

Abercrombie: In order to get those paybacks, a key element was confidence in the data, the results that we were shipping out. The only way to get that confidence was by having highly accurate data and extensive quality control (QC) in the ETL.

What that also means is that when a product is under development and the instrumentation isn't ready yet, that data doesn't make it into our BI tool. You can only get that data from ad-hoc queries.

So the benefit has been a very clear understanding of the day-to-day operations of our ad network, both for our internal monitoring to know when things are behaving properly, when the instrumentation is working as expected, and when the queues are running, but also for our customers.

Because of the flexibility we get from a traditional BI system with 500 metrics over a couple of dozen dimensions, our customers -- the publishers and the advertisers -- get incredible detail, customized exactly the way they need it for ingestion into their systems or to help them understand how Tapjoy is serving them. Again, that comes from confidence in the data.
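
One common form of such an ETL quality-control gate is a reconciliation check between source and warehouse before a load is published. The sketch below is illustrative; the tolerance and counts are invented.

```python
# A minimal sketch of an ETL quality-control gate: the day's totals in the
# warehouse must reconcile with the operational data store before the load
# is published to the BI tool. Thresholds are illustrative.

def counts_reconcile(ods_count: int, warehouse_count: int, tolerance: float = 0.001) -> bool:
    """True if the warehouse row count is within a small tolerance of the source."""
    if ods_count == 0:
        return warehouse_count == 0
    return abs(warehouse_count - ods_count) / ods_count <= tolerance

assert counts_reconcile(1_500_000, 1_500_000)
assert not counts_reconcile(1_500_000, 1_490_000)   # off by ~0.7 percent -- fail the load
```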

Gardner: When you have more data and better analytics, you can create better products. Where might we look next to where you take this? I don’t expect you to pre-announce anything, but where can you now take these capabilities as a business and maybe even expand into other activities on a mobile endpoint?

Flexibility in algorithms

Abercrombie: As we expand our business and move into new areas, what we really need is flexibility in our algorithms and the way we deal with some of our real-time decision making.

So one area that's new to us this year is the in-memory SQL database, like MemSQL. Some of our old real-time ad optimization was based on pre-calculating data and serving it up through HBase key-value lookups, but now we can do real-time aggregation queries using SQL that are easy to understand, easy to modify, very expressive, and very transparent. It gives us more flexibility in terms of fine-tuning our real-time decision-making algorithms, which is absolutely necessary.
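
The contrast Abercrombie draws -- precomputed key-value lookups versus on-the-fly SQL aggregation -- might look roughly like this. The schema, column names, and time window are invented for the example.

```python
# Illustrative contrast between a precomputed key-value lookup and the
# on-the-fly SQL aggregation an in-memory SQL store makes practical.

# Old style: the answer had to be precomputed and fetched by key, e.g.
#   value = kv_store.get("ctr:campaign_123:US:android")

# New style: aggregate recent raw events at decision time, and adjust the
# logic freely by editing the SQL.
realtime_ctr_query = """
SELECT campaign_id,
       SUM(clicks)      / NULLIF(SUM(impressions), 0) AS ctr,
       SUM(conversions) / NULLIF(SUM(clicks), 0)      AS cvr
FROM   recent_events
WHERE  country = 'US'
  AND  platform = 'android'
  AND  event_minute >= NOW() - INTERVAL 15 MINUTE
GROUP  BY campaign_id;
"""
```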

As an example, we acquired a company in Korea called 5Rocks that does app analytics, tracking the users within the app -- what level they're on, what activities they're doing, and what they enjoy -- with an eye toward in-app purchase optimization.

And so we're blending the in-app purchase optimization along with traditional ad network optimization, and the two have different rules and different constraints. So we really need the flexibility and expressiveness of our real-time decision making systems.

Gardner: One last question. You mentioned machine learning earlier. Do you see that becoming more prominent in what you do and how you're working with data scientists, and how might that expand in terms of where you employ it?

Abercrombie: Tapjoy started with machine learning. Our data scientists do machine learning. Our predictive-algorithm team is about six times larger than our traditional Vertica BI team. Mostly what we do at Tapjoy is predictive analytics and various machine-learning things. So we wouldn't be alive without it. And we're expanding. We're not shifting in one direction or another. It's apples and oranges, and there's a place for both.
Gardner: I'm afraid we will have to leave it there. We've been examining how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators. And we've learned how Tapjoy’s data-driven business of serving 500 million global mobile users, more than 1.5 million ad engagements per day, runs on HPE Vertica.

Please join me in thanking our guest. We've been here with David Abercrombie, Principal Data Analytics Engineer at Tapjoy in San Francisco. Thank you so much, David.

Abercrombie: Thank you.

Gardner: And I'd like to thank our audience as well for joining us for this big data innovation case study discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how mobile app advertising platform Tapjoy handles fast and massive data flows with just two part-time database administrators. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Thursday, February 11, 2016

How New York Genome Center Manages the Massive Data Generated from DNA Sequencing

Transcript of a discussion on how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big-data use case leadership discussion examines how the non-profit New York Genome Center manages and analyzes up to 12 terabytes of data generated each day from its genome sequence appliances. We’ll learn how the drive to better diagnose disease and develop more effective treatments is aided by swift, cost efficient, and accessible big-data analytics.

To hear how genome analysis pioneers exploit vast data outputs to then speedily correlate for time-sensitive reporting, please join me in welcoming our guest.
We're here with Toby Bloom, Deputy Scientific Director for Informatics at the New York Genome Center in New York. Welcome, Toby.

Toby Bloom: Hi. Thank you.

Gardner: First, tell us a little bit about your organization. It seems like it’s a unique institute, with a large variety of backers, consortium members. Tell us about it.

Bloom: New York Genome Center is about two-and-a-half years old. It was formed initially as a collaboration among 12 of the large medical institutions in New York: Cornell, Columbia, NewYork-Presbyterian Hospital, Mount Sinai, NYU, Einstein Montefiore, and Stony Brook University. All of the big hospitals in New York decided that it would be better to have one genome center than have to build 12 of them. So we were formed initially to be the center of genomics in New York.

Gardner: And what does one do at a center of genomics?

Bloom: We're a biomedical research facility that has a large capacity to sequence genomes and use the resulting data output to analyze the genomes, find the causes of disease, and hopefully treatments of disease, and have a big impact on healthcare and on how medicine works now.

Gardner: When it comes to doing this well, it sounds like you are generating an awesome amount of data. What sort of data is that and where does it come from?

Bloom: Right now, we have a number of genome sequencing instruments that produce about 12 terabytes of raw data per day. That raw data is basically lots of strings of As, Cs, Ts and Gs -- the DNA data from genomes from patients who we're sequencing. Those can be patients who are sick and we are looking for specific treatment. They can be patients in large research studies, where we're trying to use and correlate a large number of genomes to find the similarities that show us the cause of the disease.

Gardner: When we look at a typical big data environment such as in a corporation, it’s often transactional information. It might also be outputs from sensors or machines. How is this a different data problem when you are dealing with DNA sequences?

Lots of data

Bloom: Some of it's the same problem, and some of it's different. We're bringing in lots of data. The raw data, I said, is probably about 12 terabytes a day right now. That could easily double in the next year. But then we analyze the data, and I probably store three to four times that much data in a day.

In a lot of environments, you start with the raw data, you analyze it, and you cook it down to your answers. In our environment, it just gets bigger and bigger for a long time, before we get the answers and can make it smaller. So we're dealing with very large amounts of data.
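
A quick back-of-envelope on what those figures imply: the 12 TB/day is quoted above, while the multiplier and the rest are illustrative assumptions.

```python
# Back-of-envelope for the storage pattern described: raw output grows
# several-fold during analysis before it can be reduced.

RAW_TB_PER_DAY = 12
ANALYSIS_MULTIPLIER = 4        # "three to four times that much data in a day"

daily_tb = RAW_TB_PER_DAY * ANALYSIS_MULTIPLIER
yearly_pb = daily_tb * 365 / 1000

print(f"~{daily_tb} TB written per day during analysis")
print(f"~{yearly_pb:.1f} PB per year before any reduction or deletion")
```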

We do have one research project now that is taking in streaming data from devices, and we think over time we'll likely be taking in data from things like cardiac monitors, glucose monitors, and other kinds of wearable medical devices. Right now, we are taking in data off apps on smartphones that are tracking movement for some patients in a rheumatoid arthritis study we're doing.

We have to analyze a bunch of different kinds of data together. We’d like to bring in full medical records for those patients and integrate it with the genomic data. So we do have a wide variety of data that we have to integrate, and a lot of it is quite large.

Gardner: When you were looking for the technological platforms and solutions to accommodate your specific needs, how did that pan out? What works? What doesn’t work? And where are you in terms of putting in place the needed infrastructure?

Bloom: The data that comes off the machines is in large files, and a lot of the complex analysis we do, we do initially on those large files. I am talking about files that are from 150 to 500 gigabytes or maybe a terabyte each, and we do a lot of machine-learning analysis on those. We do a bunch of Bayesian statistical analyses. There are a large number of methods we use to try to extract the information from that raw data.
When we've figured out the variants and mutations in the DNA that we think are correlated with the disease and that we're interested in looking at, we then want to load all of that into a database with all of the other data we have, to make it easy for researchers to use in a number of different ways. We want to let them find more data like the data they have, so that they can get statistical validation of their hypotheses.

We want them to be able to find more patients for cohorts, so they can sequence more and get enough data. We need to be able to ask questions about how likely it is, if you have a given genomic variant, you get a given disease. Or, if you have the disease, how likely it is that you have this variant. You can only do that if it’s easy to find all of that data together in one place in an organized way.

So we really need to load that data into a database and connect it to the medical records or the symptoms and disease information we have about the patients and connect DNA data with RNA data with epigenetic data with microbiome data. We needed a database to do that.

We looked at a number of different databases, but we had some very hard requirements to meet. We were looking for one that could handle trillions of rows in a table -- tens of trillions of rows -- without falling over, and that could answer queries fast across multiple tables with tens of trillions of rows. We also need to be able to easily change and add new kinds of data, because we're always finding new kinds of data we want to correlate. So there are things like that.

Simple answer

We need to be able to load terabytes of data a day. But more than anything, I had a lot of conversations with statisticians about why they don’t like databases, about why they keep asking me for all of the data in comma-delimited files instead of databases. And the answer, when you boiled it down, was pretty simple.

When you have statisticians who are looking at data with huge numbers of attributes and huge numbers of patients, the kinds of statistical analysis they're doing means they want to look at some much smaller combinations of the attributes for all of the patients and see if they can find correlations, and then change that and look at different subsets. That absolutely requires a column-oriented database. A row-oriented relational database will bring in the whole database to get you that data. It takes forever, and it’s too slow for them.
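
The arithmetic behind that column-store argument is easy to sketch. The column counts, row counts, and value sizes below are illustrative, not the Genome Center's actual schema.

```python
# Rough arithmetic behind the column-store argument: selecting a handful of
# attributes for every patient touches a tiny fraction of the bytes a
# row-oriented scan would read. All sizes are illustrative.

TOTAL_COLUMNS = 5000          # attributes per record (illustrative)
SELECTED_COLUMNS = 5          # what one statistical pass actually needs
ROWS = 100_000_000            # rows scanned (illustrative)
BYTES_PER_VALUE = 8

row_store_bytes = ROWS * TOTAL_COLUMNS * BYTES_PER_VALUE        # reads whole rows
column_store_bytes = ROWS * SELECTED_COLUMNS * BYTES_PER_VALUE  # reads only needed columns

print(f"Row store scan:    ~{row_store_bytes / 1e12:.1f} TB")
print(f"Column store scan: ~{column_store_bytes / 1e9:.1f} GB")
```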

So, we started from that. We must have looked at four or five different databases. Hewlett Packard Enterprise (HPE) Vertica was the one that could handle the scale and the speed and was robust and reliable enough, and is our platform now. We're still loading in the first round of our data. We're still in the tens of billions of rows, as opposed to trillions of rows, but we'll get there.

Gardner: You're also in the healthcare field. So there are considerations around privacy, governance, auditing, and, of course, price sensitivity, because you're a non-profit. How did that factor into your decision? Is the use of off-the-shelf hardware a consideration, or off-the-shelf storage? Are you looking at converged infrastructure? How did you manage some of those cost and regulatory issues?

Bloom: Regulatory issues are enormous. There are regulations on clinical data that we have to deal with. There are regulations on research data that overlap and are not fully consistent with the regulations on clinical data. We do have to be very careful about who has access to which sets of data, and we have all of this data in one database, but that doesn’t mean any one person can actually have access to all of that data.

We want it in one place, because over time, scientists integrate more and more data and get permission to integrate larger and larger datasets, and we need that. There are studies we're doing that are going to need over 100,000 patients in them to get statistical validity on the hypotheses. So we want it all in one place.

What we're doing right now is keeping all of the access-control information about who can access which datasets as data in the database, and we basically append clauses to every query to filter down the data to the data that any particular user can use. Then we'll tell them the answers for the datasets they have and how much data that’s there that they couldn’t look at, and if they needed the information, how to go try to get access to that.
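
The query-rewriting pattern Bloom describes -- appending an access-control clause to every query -- might look roughly like the sketch below. Table and column names are invented, and a real system would bind parameters rather than build SQL by string concatenation.

```python
# Sketch of the access-control pattern described: every query is rewritten to
# include a predicate restricting results to datasets the user may see.
# (Illustrative only; use parameter binding in real code, not string formatting.)

def scope_query(base_query: str, user_id: str) -> str:
    """Append a clause limiting results to datasets this user is permitted to access."""
    access_filter = (
        " AND dataset_id IN ("
        "SELECT dataset_id FROM dataset_permissions "
        f"WHERE user_id = '{user_id}')"
    )
    return base_query.rstrip(";") + access_filter + ";"

q = "SELECT variant_id, disease_code FROM variant_disease WHERE gene = 'APOE';"
print(scope_query(q, "researcher_17"))
```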

Gardner: So you're able to manage some of those very stringent requirements around access control. How about that infrastructure cost equation?

Bloom: Infrastructure cost is a real issue, but essentially, what we're dealing with is, if we're going to do the work we need to do and deal with the data we have to deal with, there are two options. We spend it on capital equipment or we spend it on operating costs to build it ourselves.

In this case, not all cases, it seemed to make much more sense to take advantage of the equipment and software, rather than trying to reproduce it and use our time and our personnel's time on other things that we couldn’t as easily get.

A lot of work went into HPE Vertica. We're not going to reproduce it very easily. The open-source tools that are out there don’t match it yet. They may eventually, but they don’t now.

Getting it right

Gardner: When we think about paybacks or determining return on investment (ROI) in a business setting, there's a fairly simple, straightforward formula. For you, how do you know you've got this right? What do you watch -- what we might refer to in the business world as service-level agreements (SLAs) or key performance indicators (KPIs)? What are you looking for to know that you've got it right and that you're getting the job done, based on all of its requirements and all of these different constituencies?

Bloom: There’s a set of different things. The thing I am looking for first is whether the scientists who we work with most closely, who will use this first, will be able to frame the questions they want to ask in terms of the interface and infrastructure we’ve provided.

I want to know that we can answer the scientific questions that people have with the data we have and that we’ve made it accessible in the right way. That we’ve integrated, connected and aggregated the data in the right ways, so they can find what they are looking for. There's no easy metric for that. There’s going to be a lot of beta testing.

The second thing is, are we are hitting the performance standards we want? How much data can I load how fast? How much data can I retrieve from a query? Those statisticians who don’t want to use relational databases, still want to pull out all those columns and they want to do their sophisticated analysis outside the database.

Eventually, I may convince them that they can leave the data in the database and run their R-scripts there, but right now they want to pull it out. I need to know that I can pull it out fast for them, and that they're not going to object that this is organized so they can get their data out.

Gardner: Let's step back to the big picture of what we can accomplish in a health-level payback. When you’ve got the data managed, when you’ve got the input and output at a speed that’s acceptable, when you’re able to manage all these different level studies, what sort of paybacks do we get in terms of people’s health? How do we know we are succeeding when it comes to disease, treatment, and understanding more about people and their health?

Bloom: The place where this database is going to be the most useful, not by any means the only way it will be used, is in our investigations of common and complex diseases, and how we find the causes of them and how we can get from causes to treatments.

I'm talking about looking at diseases like Alzheimer's, asthma, diabetes, Parkinson's, and ALS, which is not so common, but certainly falls in the complex disease category. These are diseases that are caused by some combinations of genomic variants, not by a single gene gone wrong. There are a lot of complex questions we need to ask in finding those. It takes a lot of patience and a lot of genomes to answer those questions.

The payoff comes when we can use this data to collect enough information about enough diseases to ask questions like: it looks like this genomic variant is correlated with this disease, so how many people in your database have this variant, and of those, how many actually have the disease? And of the ones who have the disease, how many have this variant? I need to ask both of those questions, because a lot of these variants confer risk, but they don't absolutely give you the disease.
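
Those two questions translate naturally into SQL over a hypothetical patient/variant/diagnosis schema. Everything below -- table names, the variant ID, the disease code -- is a placeholder for illustration, not the Genome Center's actual data model.

```python
# The two questions in SQL form, over an invented schema: of the carriers of a
# variant, how many have the disease; and of those with the disease, how many
# carry the variant.

penetrance_query = """
SELECT COUNT(DISTINCT d.patient_id) * 1.0 / COUNT(DISTINCT c.patient_id)
       AS p_disease_given_variant
FROM   variant_carriers c
LEFT JOIN diagnoses d
       ON d.patient_id = c.patient_id AND d.disease_code = 'G30'   -- placeholder code
WHERE  c.variant_id = 'VAR123';                                    -- placeholder variant
"""

prevalence_of_variant_query = """
SELECT COUNT(DISTINCT c.patient_id) * 1.0 / COUNT(DISTINCT d.patient_id)
       AS p_variant_given_disease
FROM   diagnoses d
LEFT JOIN variant_carriers c
       ON c.patient_id = d.patient_id AND c.variant_id = 'VAR123'
WHERE  d.disease_code = 'G30';
"""
```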

If I'm going to find the answers, I need to be able to ask those questions, and those are the things that are really hard to do with the raw data in files. If I can do just that, think about the impact on all of us. If we can find the molecular causes of Alzheimer's, that could lead to treatments or prevention, and the same goes for all of those other diseases as well.

Gardner: It’s a very compelling and interesting big data use case, one of the best I’ve heard.

I am afraid we’ll have to leave it there. We've been examining how the New York Genome Center manages and analyzes vast data outputs to speedily correlate for time-sensitive reporting, and we’ve learned how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure.
So, join me in thanking our guest, Toby Bloom, Deputy Scientific Director for Informatics at the New York Genome Center. Thank you so much, Toby.

Bloom: Thank you, and thanks for inviting me.

Gardner: Thank you also to our audience for joining us for this big data innovation case study discussion. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of HPE-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a discussion on how the drive to better diagnose diseases and develop more effective treatments is aided by swift, cost efficient, and accessible big data analytics infrastructure. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.


Thursday, January 14, 2016

How SKYPAD and HPE Vertica Enable Luxury Retail Brands to Gain Rapid Insight into Consumer Sales Trends

Transcript of a BriefingsDirect discussion on how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for big data to make better insights available to retail users.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Dana Gardner: Hello, and welcome to the next edition of the HPE Discover Podcast Series. I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for this ongoing discussion on IT innovation and how it’s making an impact on people’s lives.

Our next big-data use case leadership discussion explores how retail luxury goods market analysis provider Sky I.T. Group has upped its game to provide more buyer behavior analysis faster and with more depth. We will see how Sky I.T. changed its data analysis platform infrastructure and why that has helped solve its challenges around data variety, velocity, and volume to make better insights available to its retail users.

Here to share how retail intelligence just got a whole lot smarter, we are joined by Jay Hakami, President of Sky I.T. Group in New York. Welcome, Jay.

Jay Hakami: Thank you very much. Thank you for having us.

Gardner: We're also here with Dane Adcock, Vice President of Business Development at Sky I.T. Group. Welcome Dane.

Dane Adcock: Thank you very much.

Gardner: And we're here with Stephen Czetty, Vice President and Chief Technology Officer at Sky I.T. Group. Welcome to BriefingsDirect, Stephen.
Sky I.T. Group Retail Business-Intelligence Solutions: Get More Information
Stephen Czetty: Thank you, Dana, and I'm looking forward to the chance.

Gardner: What are the top trends that are driving the need for greater and better big-data analysis for retailers? Why do they need to know more, better, faster?

Adcock: Well, customers have more choices. As a result, businesses need to be more agile and responsive and fill the customer's needs more completely or lose the business. That's driving the entire industry into practices that mean shorter times from design to shelf in order to be more responsive.

It has created a great deal of gross-margin pressure, because there's simply more competition and more selections that a consumer can make with their dollar today.

Gardner: Is there anything specific to the retail process around luxury goods that is even more pressing when it comes to this additional speed? Are there more choices and  higher expectations of the end user?

Greater penalty

Adcock: Yes. The downside to making mistakes in terms of designing a product and allocating it in the right amounts to locations at the store level carries a much greater penalty, because it has to be liquidated. There's not a chance to simply cut back on the supply chain side, and so margins are more at risk in terms of making the mistake.

Ten years ago, from a fashion perspective, it was about optimizing the return and focusing on winners. Today, you also have to plan to manage and optimize the margins on your losers as well. So, it's a total package.

Gardner: So, clearly, the more you know about what those users are doing or what they have done is going to be essential. It seems to me, though, that we're talking about a market-wide look rather than just one store, one retailer, or one brand.

How does that work, Jay? How do we get to the point where we've been able to gather information at a fairly comprehensive level, rather than cherry-picking or maybe getting a non-representative look based on only one organization’s view into the market?

Hakami: With SKYPAD, what we're doing is collecting data from the supplier, from the wholesaler, as well as from their retail stores, their wholesale business, and their dot-com, meaning the whole omnichannel. When we collect that data, we cleanse it to make sure it's meaningful to the user.

Hakami
Now, we're dealing with a connected world where the retailer, wholesalers, and suppliers have to talk to one another and plan together for the buying season. So the partnerships and the insight that they get into product performance are extremely important, as Dane mentioned, in terms of the gross margin and in terms of the sell-through information. SKYPAD basically provides that intelligence, that insight, into this retail/wholesale world.

Gardner: Correct me if I'm wrong, but isn’t this also a case where people are opening up their information and making it available for the benefit of a community or recognizing that the more data and the more analysis that’s available, the better it is for all the participants, even if there's an element of competition at some point?

Hakami: Dana, that's correct. The retail business likes to share the information with their suppliers, but they're not sharing it across all the suppliers. They're sharing it with each individual supplier. Then, you have the market research companies who come in and give you aggregation of trends and so on. But the retailers are interested in sell-through. They're interested in telling X supplier, "This is how your products are performing in my stores."

If they're not performing, then there's going to be a mark down. There's going to be less of a margin for you and for us. So, there's a very strong interest between the retailer and a specific supplier to improve the performance of the product and the sell-through of those products on the floor.

Gardner: Before we learn more about the data science and dealing with the technology and business case issues, tell us a little bit more about Sky I.T. Group, how you came about, and what you're doing with SKYPAD to solve some of these issues across this entire supply chain and retail market spot.

Complex history

Hakami: I'll take the beginning. I'll give you a little bit of the history, Dana, and then maybe Dane and Stephen can jump in and tell you what we are doing today, which is extremely complex and interesting at the same time.

We started with SKYPAD about eight years ago. We found a pain point among our customers: they were dealing with so many retailers, as well as their own retail stores, and not getting the information they needed to make sound business decisions on a timely basis.

We started with one customer, which was Theory. We came to them and we said, "We can give you a solution where we're going to take some data from your retailers, from your retail stores, from your dot-com, and bring it all into one dashboard, so you can actually see what’s selling and what’s not selling."

Fast forward, we've been able to take not only EDI transactions, but also retail portals. We're taking information from any format you can imagine -- from Excel, PDF, merchant spreadsheets -- bringing that wealth of data into our data warehouse, cleansing it, and then populating the dashboard.

So today, SKYPAD is giving a wealth of information to the users by the sheer fact that they don’t have to go out by retailer and get the information. That’s what we do, and we give them, on a Monday morning, the information they need to make decisions.

Dane, can you elaborate more on this as well?

Adcock: This process has evolved from a time when EDI was easy, because it was structured, but it was also limited in the number of metrics that were provided by the mainstream. As these business intelligence (BI) tools have become more popular, the distribution of data coming from the retailers has gotten more ubiquitous and broader in terms of the metrics.

But the challenge has moved from reporting to identification of all these data sources and communication methodologies and different formats. These can change from week to week, because they're being launched by individuals, rather than systems, in terms of Excel spreadsheets and PDF files. Sometimes, they come from multiple sources from the same retailer.

One of our accounts would like to see all of their data together, so they can see trends across categories and different geographies and markets. The challenge is to bring all those data sources together and align them to their own item master file, rather than the retailer’s item master file, and then be able to understand trends, which accounts are generating the most profits, and what strategies are the most profitable.
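As a concrete illustration of that alignment step, here is a minimal, hypothetical sketch: rows arrive keyed however each retailer reports them, get re-keyed to the client's own item master (here by UPC), and only then are rolled up for trend analysis. None of the field names or sample records come from SKYPAD; they are assumptions made for the example.

from collections import defaultdict

# Client's item master: UPC -> the client's own style and color identifiers.
item_master = {
    "040000000013": {"style": "JKT-100", "color": "Black"},
    "040000000020": {"style": "JKT-100", "color": "Navy"},
}

# Weekly sell-through rows as reported by two different (hypothetical) retailers.
retailer_feed = [
    {"retailer": "RetailerA", "upc": "040000000013", "units_sold": 12, "week": "2015-W52"},
    {"retailer": "RetailerB", "upc": "040000000013", "units_sold": 7,  "week": "2015-W52"},
    {"retailer": "RetailerB", "upc": "040000000020", "units_sold": 3,  "week": "2015-W52"},
]

# Aggregate by the client's style, not by each retailer's own item numbering.
units_by_style = defaultdict(int)
unmatched = []
for row in retailer_feed:
    item = item_master.get(row["upc"])
    if item is None:
        unmatched.append(row)   # flagged for the data-audit step described later
        continue
    units_by_style[item["style"]] += row["units_sold"]

print(dict(units_by_style))                       # {'JKT-100': 22}
print(f"{len(unmatched)} rows still need manual mapping")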
Visit the Sky BI Team at the 2016 NRF Big Show! Get More Information
It's been a shifting model from the challenge of reporting all this data together, to data collection. And there's a lot more of it today, because more retailers report at the UPC level, size level, and the store level. They're broadcasting some of this data by day. The data pours in, and the quicker they can make a decision, the more money they can make. So, there's a lot of pressure to turn it around.

Gardner: Let me understand, Dane. When you're putting out those reports on Monday morning, do you get queries back? Is this a sort of a conversation, if you will, where not only are you presenting your findings, but people have specific questions about specific things? Do you allow for them to do that, and is the data therefore something that’s subject to query?

Subject to queries

Adcock: It’s subject to queries in the sense that they're able to do their own discovery within the data. In other words, we put it in a BI tool, it’s on the web, and they're doing their own analysis. They're probing to see what their best styles are. They're trying to understand how colors are moving, and they're looking to see where they're low on stock, where they may be able to backfill in the marketplace, and trying to understand what attributes are really driving sales.

But of course, they always have questions about completeness of the data. When things don’t look correct, they have questions about it. That drives us to be able to do analysis on the fly, on-demand, and deliver some responses, "All your stores are there, all of your locations, everything looks normal." Or perhaps there seems to be some flaws or things in the data that don’t actually look correct.

Not only do we need to organize it and provide it to them so that they can do their own broad, flexible analysis, but they're coming back to us with questions about how their data was audited. And they're looking for us to do the analysis on the spot and provide them with satisfactory answers.

Gardner: Stephen Czetty, we've heard about the use case, the business case, and how this data challenge has grown in terms of variety as well as volume. What do you need to bring to the table from the architecture and the data platform to sustain this growth and provide for the agility that these market decision makers are demanding?

Czetty: We started out with an abacus, in a sense, but today we collect information from thousands of sources literally every single week. Close to 9,000 files come across to us, and we process them correctly and sort them out -- what client they belong to and so forth -- but the challenge is forever growing.

Czetty
We needed to go from older technology to newer technology, because our volumes of data are increasing, while the window of time we have to consume that data stays static.

So we're quite aware that we have a time limit. We found Vertica to be a platform that lets us collect the data into a coherent structure very rapidly, as opposed to our legacy systems.

It allows us to treat the data in a truly vertical way, although that has nothing to do with the application or the database itself. In the past we had to deal with each client separately. Now we can deal with each retailer separately and just collect their data for every single client that we have. That makes our processes much more pipelined and far faster in performance.

The secret sauce behind that is the ability in our Vertica environment to rapidly sort out the data -- where it belongs, who it belongs to -- calculate it out correctly, put it into the database tables that we need to, and then serve it back to the front end that we're using to represent it.

That's why we've shifted from a traditional database model to a Vertica-type model. It's 100 percent SQL for us, so it looks the same for everybody who is querying it, but under the covers we get tremendous performance and compression and lots of cost savings.

Gardner: For some organizations that are dealing with the different sources and  different types of data, cleansing is one problem. Then, the ability to warehouse that and make it available for queries is a separate problem. You've been able to tackle those both at the same time with the same platform. Is that right?

Proprietary parsers

Czetty: That's correct. We get the data, and we have proprietary parsers for every single data type that we get. There are a couple of hundred of them at this point. But all of that data, after parsing, goes into Vertica. From there, we can very rapidly figure out what is going where and what is not going anywhere, because it’s incomplete or it’s not ours, which happens, or it’s not relevant to our processes, which happens.
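The parsers themselves are proprietary, but the dispatch pattern Czetty describes can be sketched in a few lines. Everything below is an illustrative assumption, not SKYPAD code: each retailer-and-format pair registers its own parse function, every parser emits rows in one common shape, and a single downstream step can then bulk-load them and route them to the right clients.

import csv, io

PARSERS = {}

def parser(retailer, fmt):
    """Register a parse function for one (retailer, file format) pair."""
    def register(fn):
        PARSERS[(retailer, fmt)] = fn
        return fn
    return register

@parser("RetailerA", "csv")
def parse_retailer_a(raw_text):
    for row in csv.DictReader(io.StringIO(raw_text)):
        yield {"upc": row["UPC"], "store": row["Store #"],
               "units_sold": int(row["Qty Sold"]), "week": row["Week"]}

@parser("RetailerB", "csv")
def parse_retailer_b(raw_text):
    # Same target shape, different source column names.
    for row in csv.DictReader(io.StringIO(raw_text)):
        yield {"upc": row["item_upc"], "store": row["location"],
               "units_sold": int(row["st_units"]), "week": row["fiscal_week"]}

def normalize(retailer, fmt, raw_text):
    """Parse one incoming file into the common row shape used for loading."""
    # In the real pipeline these rows would be bulk-loaded into the warehouse
    # and matched against each client's item master inside the database.
    return list(PARSERS[(retailer, fmt)](raw_text))

sample = "UPC,Store #,Qty Sold,Week\n040000000013,0042,5,2015-W52\n"
print(normalize("RetailerA", "csv", sample))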

We can sort out what we've collected very rapidly and then integrate it with the information we already have, or insert new information if it's brand-new. Prior to this, we'd been doing much of this by hand, and that's not effective any longer with our number of clients growing.

Gardner: I'd like to hear more about what your actual deployment is, but before we do that, let’s go back to the business case. Dane and Jay, when Vertica came online, when Steve was able to give you some of these more pronounced capabilities, how did that translate into a benefit for your business? How did you bring that out to the market, and what's been the response?

Hakami: I think the first response was "wow." And I think the second response was "Wow, how can we do this fast and move quickly to this platform?"

Let me give you some examples. When Steve did the proof of concept (POC) with the folks from HP, we were very impressed with the statistics we had seen. In other words, going from a processing time of eight or nine hours to minutes was a huge advantage that we saw from the business side, showing our customers that we can load data much faster.

The ability to use less hardware and infrastructure as a result of the architecture of Vertica allowed us to reduce, and to continue to reduce, the cost of infrastructure. These two are the major benefits that I've seen in the evolution of us moving from our legacy to Vertica.

From the business perspective, if we're able to deliver faster and more reliably to the customer, we accomplished one of the major goals that we set for ourselves with SKYPAD.

Adcock: Let me add something there. Jay is exactly right. The real impact, as it translates into the business, is that we have to stop collecting data at a certain point in the morning and start processing it in order to make our service-level agreements (SLAs) on reporting for our clients, because that's when they start their analysis. The retail data comes in staggered over the morning, and it may not all be in by the time we need to cut collection off.

One of the things that moving to Vertica has allowed us to do is to cut that time off later, and when we cut it off later, we have more data, as a rule, for a customer earlier in the morning to do their analysis. They don’t have to wait until the afternoon. That’s a big benefit. They get a much better view of their business.

Driving more metrics

The other thing it has enabled us to do is drive more metrics into the database and do some processing in the database, rather than in the user tool, which makes the user tool faster and provides more value.

For example, for a metric like age on the floor, we can do the calculation in the background, in the database, and it doesn't impede the response in the front-end engine. We get more metrics calculated in the database rather than in our user tool, and the tool becomes more flexible and more valuable.
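Here is a toy sketch of that idea, pushing a metric such as age on the floor into the database so every report gets it for free rather than recomputing it in the BI tool. SQLite stands in so the example runs anywhere; on a platform like Vertica the same calculation would typically use its own date functions. The schema, dates, and column names are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE receipts (upc TEXT, store TEXT, first_receipt_date TEXT);
    INSERT INTO receipts VALUES
        ('040000000013', '0042', '2015-11-20'),
        ('040000000020', '0042', '2016-01-04');
""")

# Age on the floor, computed once in the database and served to every report.
# A fixed reporting date keeps the example deterministic.
rows = conn.execute("""
    SELECT upc,
           store,
           CAST(julianday('2016-01-11') - julianday(first_receipt_date) AS INTEGER)
               AS age_on_floor_days
    FROM receipts
""").fetchall()

for upc, store, age in rows:
    print(upc, store, age)   # 52 and 7 days on the floor, respectively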
Sky I.T. Group Retail Business-Intelligence Solutions: Get More Information
Gardner: So not only are you doing what you used to do faster, better, cheaper, but you're able to now do things you couldn't have done before in terms of your quality of data and analysis. Is there anything else that is of a business nature that you're able to do vis-à-vis analytics that just wasn't possible before, and might, in fact, be equivalent of a new product line or a new service for you?

Czetty: In the old model, when we got a new client, we had to essentially recreate the processes that we'd built for other clients to match that new client, because we were collecting that data just for that client at that moment.

So 99 percent of it is the same as any other client, but one percent is always different, and it had to be built out. On-boarding a client, as we call it, took us a considerable amount of time -- we are talking weeks.

In the current model, where we're centered on retailers, the only thing that will take us a long time to do in this particular situation is if there's a new retailer that we've never collected data from. We have to understand their methodology of delivery, how it comes, how complex it is and so forth, and then create the logic to load that into the database correctly to match up with what we are collecting for others.

In this scenario, since we've got so many clients, very few new stores or new retailers show up; typically a new client sells through retail chains we already collect from. That simplifies our on-boarding, because if we're getting Nordstrom's data for client A, we're getting the same exact data for clients B, C, D, E, and F.

Now, it comes through a single funnel and it's the Nordstrom funnel. It’s just a lot easier to deal with, and on-boarding comes naturally.

Hakami: In addition to that, since we're adding more significant clients, the ability to increase variety, velocity, and volume is very important to us. We couldn't scale without having Vertica as a foundation for us. We'd be standing still, rather than moving forward and being innovative, if we stayed where we were. So this is a monumental change and a very instrumental change for us going forward.

Gardner: Steve, tell us a little bit about your actual deployment. Is this a single tenant environment? Are you on a single database? What’s your server or data center environment? What's been the impact of that on your storage and compression and costs associated with some of the ancillary issues?

Multi-tenant environment

Czetty: To begin with, we're coming from a multi-tenant environment. Every client had its own private database in the past, because in DB2, we couldn't add all these clients into one database and get the job done. There was not enough horsepower to do the queries and the loads.

We ran a number of databases on a farm of servers, on Rackspace as our hosting system. When we brought in Vertica, we put up a minimal configuration with three nodes, and we're still living with that minimal configuration with three nodes.

We haven't exhausted our capacity on the license by any means whatsoever in loading up this data. The compression is obscenely high for us, because at the end of the day, our data absolutely lends itself to being compressed.

Everything repeats over and over again every single week. In the world of Vertica, that means it only appears once wherever it lives in the database, and the rest of it is magic. Not to get into the technology underneath it at this point, but from our perspective, it's just very effective in that scenario.
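A toy illustration of why that kind of repetition compresses so well in a column store: stored column by column, long runs of identical values collapse to a handful of value-and-run-length pairs. This is only the idea in miniature; Vertica's actual encodings are more sophisticated, and the retailer names and counts below are invented for the example.

from itertools import groupby

# One week's worth of a hypothetical, heavily repetitive "retailer" column.
retailer_column = ["Nordstrom"] * 50000 + ["Saks"] * 30000 + ["Bloomingdale's"] * 20000

# Run-length encode: each run of identical values becomes (value, count).
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(retailer_column)]

print(f"raw values stored row-style: {len(retailer_column):,}")
print(f"runs stored column-style:    {encoded}")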

Also in our DB2 world, we're using quite costly large SAN configurations with lots of spindles, so that we can have the data distributed all across the spindles for performance on DB2, and that does improve the performance of that product.

However, in Vertica, we have 600 GB drives and we can just pop more in if we need to expand our capacity. With the three nodes, we've had zero problems with performance. It hasn't been an issue at all. We're just looking back and saying that we wish we had this a little sooner.

Vertica came in and did the install for us initially. Then, we ended up taking those servers down and reinstalling it ourselves. With a little information from the guide, we were able to do it. We wanted to learn it for ourselves. That took us probably a day and a half to two days, as opposed to Vertica doing it in two hours. But other than that, everything is just fine. We’ve had a little training, we’ve gone to the Vertica event to learn how other people are dealing with things, and it's been quite a bit of fun.

Now there is a lot of work we have to do at the back end to transform our processes to this new methodology. There are some restrictions on how we can do things, updates and so forth. So, we had to reengineer that into this new technology, but other than that, no changes. The biggest change is that we went vertical on the retail silos. That's just a big win for us.

Gardner: As you know, Vertica is cloud ready. Is there any benefit to that further down the road where maybe it’s around issues of a spike demand in holiday season, for example, or for backup recovery or business continuity? Any thoughts about where you might leverage that cloud readiness in the future?

Dedicated servers

Czetty: We're already sort of in the cloud with the use of dedicated servers, but in our business, the volume increase in the stores around the holidays doesn't double the volume. It adds 10 percent, 15 percent, maybe 20 percent for the holiday season. It hasn't been that big a problem in DB2, so it's certainly not going to be a problem in Vertica.

We've looked at virtualization in the cloud, but with the size of the hardware that we actually want to run, we want to take advantage of the speed and the memory and everything else. We put up pretty robust servers ourselves, and it turns out that in secure cloud environments like the one we're using right now at Rackspace, it's simply less expensive to do it as dedicated equipment. Spinning up another node for us at Rackspace takes about the same time it would take to set up and configure a virtual system, a day or so. They can give us another node just like this on our rack.

We looked at the cloud financially every single time that somebody came around and said there was a better cloud deal, but so far, owning it seems to be a better financial approach.

Gardner: Before we close out, looking to the future, I suppose the retailers are only going to face more competition. They're going to be getting more demand from their end users or customers for user experience for information.

We're going to see more mobile devices that will be used in a dot-com world or even a retail world. We are going to start to see geolocation data brought to bear. We're going to expect the Internet of Things (IoT) to kick in at some point where there might be more sensors involved either in a retail environment or across the supply chain.

Clearly, there's going to be more demand for more data doing more things faster. Do you feel like you're in a good position to do that? Where do you see your next challenges from the data-architecture perspective?

Czetty: Not to disparage the luxury industry too much, but at this point, they're not on the bleeding edge on the data-collection and analysis side, whereas they are on the bleeding edge with social media and so forth. We've anticipated that. We've got some clients who have been collecting information about their web activities, and we've done analysis to identify customers who present different personas through the different methods they use to contact the company.

We're dabbling in that area, and it's going to grow as interfaces become more tablet- and phone-oriented. A lot of sales are potentially going to go through social media, and not just the official websites, in the future.

We'll be capturing that information as well. We've got some experience working with that kind of data in the past. So, this is something I'm looking forward to doing more of, but as of today, we're only doing it for a few clients.

Well positioned

Hakami: In terms of planning, we're very well-positioned as a hub between the wholesaler and the retailer, the wholesaler and their own retail stores, as well as the wholesaler and their dot-coms. One of the things that we are looking into, and this is going to probably get more oxygen next year, is also taking a look at the relationships and the data between the retailer and the consumer.

As you mentioned, this is a growing area, and the retailers are looking to capture more of the consumer information so they can target-market to them, not based on segment but based on individual preferences. This is again a huge amount of data that needs to be cleansed, populated, and then presented to the CMOs of companies to be able to sell more, market more, and be in front of their customers much more than ever before.

Gardner: That's a big trend we're seeing in many different sectors of the economy -- that drive for personalization -- and it's these data technologies that allow it to happen.

Last word to you, Dane. Any other thoughts about where the intersection of computer science capabilities and market intelligence demands are coming together in new and interesting ways?

Adcock: I'm excited about the whole approach to leveraging some predictive capabilities alongside the great inventory of data that we've put together for our clients. It's not just about creating better forecasts of demand, but about optimizing different metrics: using this data to understand when product should be marked down, which product attributes seem to be favored by stores that are alike in terms of their shopper profiles, and how to allocate better quantities, breadth, and depth of product to individual locations to drive a higher percentage of full-price selling and fewer markdowns for our clients.

So it’s a predictive side, rather than discovery using a BI tool.

Czetty: Just to add to that, there's the margin. When we talked to CEOs and CFOs five or six years ago and told them we could improve business by two, three, or four percent, they were laughing at us, saying it was meaningless to them. Now, three, four, or five percent, even in the luxury market, is a huge improvement to business. The companies like Michael Kors, Tory Burch, Marc Jacobs, Giorgio Armani, and Prada are all looking for those margins.

So, how do we become more efficient with product assortment, how do we become more efficient with distributing all of these products to different sales channels, and then how do we increase our margins? How do we avoid over-manufacturing, so that we don't make those blue shirts for Florida, where they aren't selling, and instead make them for Detroit, where they're selling like hotcakes?

These are the things that customers are looking at and they must have that tool or tools in place to be able to manage their merchandising and by doing so become a lot more agile and a lot more profitable.

Gardner: Well, great. I'm afraid we will have to leave it there. We've been discussing how providers of retail luxury and fashion goods use analysis from Sky I.T. Group, and how Sky I.T. Group ups its game by using HPE Vertica to provide more buyer-behavior analysis faster, better, and cheaper.

And we’ve seen how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for that data to make those better insights available to those retail users, allowing them to become more data-driven across their entire market.

So please join me in thanking our guests. We have been talking with Jay Hakami,  President of Sky I.T. Group in New York. Thank you so much, Jay.
Visit the Sky BI Team at the 2016 NRF Big Show! Get More Information
Hakami: Thank you, Dana. I appreciate it very much.

Gardner: And we've also been talking with Dane Adcock, Vice President of Business Development at Sky I.T. Group. Thank you, Dane.

Adcock: It’s great to have the conversation. Thank you.

Gardner: I've enjoyed it myself. And lastly, a big thank you to Stephen Czetty, Vice President and Chief Technology Officer there at Sky I.T. Group. Thank you, Stephen.

Czetty: You're very welcome, and I enjoyed the conversation. Thank you.

Gardner: And I’d also like to thank our audience as well for joining us for this big-data use case leadership discussion.

I'm Dana Gardner, Principal Analyst at Interarbor Solutions, your host for this ongoing series of Hewlett Packard Enterprise-sponsored discussions. Thanks again for listening, and come back next time.

Listen to the podcast. Find it on iTunes. Get the mobile app. Download the transcript. Sponsor: Hewlett Packard Enterprise.

Transcript of a BriefingsDirect discussion on how Sky I.T. has changed its platform and solved the challenges around variety, velocity, and volume for big data to make better insights available to retail users. Copyright Interarbor Solutions, LLC, 2005-2016. All rights reserved.

You may also be interested in: