Showing posts with label Bloor. Show all posts
Showing posts with label Bloor. Show all posts

Wednesday, January 07, 2009

Webinar: Analysts Plumb Desktop as a Service as the Catalyst for Cloud Computing Value for Enterprises and Telcos

Transcript of a recent webinar, produced by Desktone, on the future of cloud-hosted PC desktops and their role in enterprises.

Listen to the webinar here.

Jeff Fisher: Hello and welcome, everyone. Thanks so much for attending this Desktone Webinar Series entitled, “Desktops as a Service, The Evolution of Corporate Computing.” I’m Jeff Fisher, senior director of strategic development at Desktone, and I will also be the host and moderator of the events in this series.

We are really excited to kick off this series of webinars with one focused on cloud-hosted desktops and are equally as excited and privileged to have just a wonderful panel with us starting with Rachel Chalmers from the 451 Group, Dana Gardner from Interarbor Solutions, and Robin Bloor from Hurwitz and Associates.

For those of you who don’t know, Rachel, Dana and Robin are really three of the top minds in this emerging cloud-hosted desktop space. It’s going to be great to see just what they have to say about the topic and we’ll talk to them just a little bit later on.

Before we do that, I want to spend a little bit of time talking about Desktone’s vision and definition of cloud-hosted desktops and, most importantly, about why we believe that virtual desktops, as opposed to virtual servers, are really going to kick-start adoption of cloud computing within the enterprise.

Desktone is a venture-backed software company. We’re based outside of Boston in a town called Chelmsford. We raised $17 million in series A in a round of funding in summer 2007. Highland Capital and Softbank Capital led that round. We also got an investment at that time from Citrix Systems.

We’re currently about 35 full-time employees and have 25 full-time outsourced software developers. The executive team has experience leading desktop virtualization vendors such as Citrix, Microsoft and Softricity, and also experience running Fortune 500 IT organizations at Schwab and Staples.

We have a number of technology partners in the area of virtualization software, servers, storage and thin clients, and some key service provider partnerships with HP, IBM, Verizon and Softbank. What’s really important to note here is that Desktone actually goes to market through these service provider partners.

We don’t host virtual desktops ourselves, but rather the desktops as a service (DaaS) offering that we enable is provided through service provider partners. The only services that we host ourselves, or are offered directly, are trial and pilot services.

We built a platform called the Desktone Virtual-D Platform. It’s the industry’s first virtual desktop hosting platform specifically designed to enable desktops to be delivered as an outsourced subscription service.

And what’s important to understand is that this platform is designed specifically for service providers to be able to offer desktop hosting in the same way that they offer Web hosting or e-mail hosting.

We architected the Virtual-D Platform from the ground-up with that mission in mind. It’s a solution for running virtualized, yet genuine, Windows client environments, whether XP or Vista, in a service provider cloud. We’ll talk more about how we define a service provider cloud in a bit.

It leverages a core virtual desktop infrastructure (VDI) architecture, that is, server-hosted desktop virtual machines, which are accessed by users through PC remoting technologies like remote desktop protocol (RDP), for example. The Virtual-D Platform enables cloud-scale and multitenancy, which are two of the key things that a service provider needs to have to be able to be in this business.

Without getting into too much detail, it’s not really viable to take an enterprise VDI architecture or a product that’s been architected to deliver enterprise VDI and just port it over for service provider use.

It’s not viable for service providers to manage individual instances of VDI products. They really need a platform to manage this efficiently and effectively. The other key thing that the Virtual-D Platform does is separate the responsibilities of the user, the enterprise desktop administrator, and the service provider hosting operator, so that each of these constituents has their own view into the system, through a Web-based interface of course, and can do what they need to do without seeing functions and capabilities that are only really required by some of the other groups.

So, it’s a very, very different technology approach, although the net result appears in certain ways to be similar to some of the VDI platforms that you probably know pretty well.

So, that’s Desktone in a nutshell.

Promise of the cloud

Let’s get to the promise of the cloud. Clearly, everyone is talking about cloud computing. You can’t look anywhere within IT and not hear about it. It’s amazing to see it surpassing even the frenzy around virtualization. In fact, most of the conversations people are having today are around virtualization and how it can take place in the cloud. Everyone wants to focus on all the benefits, including anytime/anywhere access and subscription economics.

However, like any other major trend that unfolds in IT, there are a number of challenges with the cloud. When people talk about cloud computing with respect to the enterprise, in most cases they’re talking about virtualizing server workloads and moving those workloads into a service provider cloud.

Clearly, that shift introduces a number of challenges. Most notable is the challenge of data security. Because server workloads are very tightly-coupled with their data tier, when you move the server or the server instance, you have to move the data. Most IT folks are not really comfortable with having their data reside in a service provider or external data center.

For that reason Desktone believes that it’s actually going to be virtual desktops, not servers, that are the better place to start and are going to be what jump starts this whole enterprise adoption of cloud computing.

The reason is pretty simple. Most fixed corporate desktop environments -- those are desktops that have a permanent home within your enterprise – already probably have their application and user data abstracted away from the actual desktop. The data is not stored locally. It’s stored somewhere on the network, whether it’s security credentials within the active directory (AD), whether it’s home drives that store user data, or it’s the back end of client-server applications. All the back end systems run within your data center.

When you shift that kind of environment to the cloud, although the desktop instance has moved, the data is still stored in the enterprise data center. Now, what you are left with are virtual desktops running in a highly secure virtual branch office of the enterprise. That’s how we like to refer to our service-provider partners’ data centers, as secure virtual branch offices of your enterprise.

In addition, if you virtualize and centralize physical PCs, which used to reside in remote branch offices that have limited or no physical security, you’ve actually increased the security of the environment and the reasons are clear. The PC can no longer walk off, because it doesn’t have a physical manifestation.

Because users are interacting with their virtual desktops through PC remoting technology, a la Remote Desktop Protocol (RDP), you have control as an administrator as to whether or not they can print to their access device or through their access device, whether or not they can get access to USB devices, USB key fobs that they put into that device.

So, you can control the downstream movement of data from the virtual desktop to the edge. You can also control the upstream movement of data from the edge to the virtual desktop and stop malware and viruses from being introduced through USB keys as well.

Those are some really nice benefits. We have an animation that illustrates this, showing a physical desktop PC, which accesses its data from the enterprise data center – again whether it’s AD or user data or the actual apps themselves. Then, what happens is the actual PC is virtualized and centralized into a service-provider cloud, at which point it’s accessed from an access device, whether that’s a thin client or a thick client, a PC that’s been repurposed to act like a thin client, or a dumb terminal and/or laptops as well.

The key message here is that, although the instance of the desktop is moved, the data does not have to move along with it. Through private conductivity between the enterprise and the service provider, it’s possible to access the data from the same source.

Service-provider cloud

The other interesting thing is this notion of the service-provider cloud, which is that it actually can traverse both the enterprise and the service provider data centers.

So, depending on the use case, service providers can either keep the virtual infrastructure and the racks powering that virtual infrastructure in their data center or they can, in certain cases, put the physical infrastructure within the enterprise data center, what we call the customer premise equipment model. The most important thing is that it doesn’t break the model.

There is flexibility in the location of the actual hosting infrastructure. Yet, no matter where it resides, whether it’s in a service provider data center or an enterprise data center, the service provider still owns and operates it and the enterprise still pays for it as a subscription.

Let’s touch on just a couple of other benefits and then we’ll jump into talking with our panel. The Desktone DaaS cloud vision preserves the rich Windows client experience in the cloud. This is true, blue Windows -- XP or Vista -- not another form of Windows computing whether it’s shared service in the form of Terminal Services and/or browser-based solutions like Web OSes or webtops.

That’s important, because most enterprises have the Windows apps that they need to run and they don’t want to have to re-architect and re-engineer the packages to run within a multi-user environment. They certainly don’t want a browser-based environment where they can’t run those apps.

In the same vein, this sustains the existing enterprise IT operating model, while introducing cloud-like properties so that IT desktop administrators can continue to use the same tools and processes and procedures to support the virtual desktops in the cloud as they have done and as they will continue to do with their physical desktops.

We talked about the notion of separating service provider enterprise responsibilities. It’s really important to be able to draw a line and say that the service provider is responsible for the hosting infrastructure, and yet the enterprise is still ultimately responsible for the virtual desktops themselves, the OS images, the patching, the licensing, the applications, application licensing, etc.

And then, finally, this notion we mentioned of combining both on- and off- premise hosting models is important. I think most of the leading analyst firms agree that in order for enterprises to be able to adopt cloud computing they are not going to be able to go from a fully enterprise data center model to a full cloud model. There’s got to be some sort of common ground in between and, again, the fact this model supports both is important.

Now, let’s turn to our panel and see what they have to say. We’ll get started with Rachel Chalmers who is research director of infrastructure management at the 451 Group. She’s led the infrastructure software practice for the 451 Group since its debut in April 2000.

She’s pioneered coverage on services-oriented architecture (SOA), distributed application management, utility computing and open-source software, and today she focuses on data center automation server, desktop, and application virtualization. Rachel, thank you so much for being with us today.

Rachel Chalmers: You’re very welcome. It’s good to be here.

Fisher: Rachel, I actually credit you with being the first analyst to really put cloud-hosted desktop virtualization on the map and the reason is because you’ve written two really expansive and excellent reports on desktop virtualization. The first one you released in the summer of 2007. The follow-up one was released this past summer of ’08.

What I’ve really found interesting was that in the updated version you actually modified your desktop virtualization taxonomy to include cloud-hosted desktops as a first-class citizen, so to speak, alongside client-hosted desktop virtualization and server-hosted desktop virtualization. Of course that begs the question, what was so compelling about the opportunity that made you do that?

Taxonomy is key


Chalmers: Taxonomy is the key word. For those who aren’t familiar with The 451 Group, we focus very heavily on emerging and innovative technology. We do a ton of work with start-ups and when we work with public companies, it’s from the point of view of how change is going to affect their portfolio, where the gaps are, who they should buy. So we’re very much the 18th century naturalists of the analyst industry. We’re sailing around the Galapagos Islands and noting intriguing differences between finches.

I know we described cloud-hosted desktop virtualization as one of these very constructive differences between finches. When I sat down and tried to get my arms around desktop virtualization, it was just at the tail end of 2007 and, as you’ll recall, just as it’s illegal for a vendor to issue a press release now without describing their product as a cloud-enablement product. In 2007, it was illegal to issue a press release without describing a product as virtualization of some kind.

I was tracking conservatively 40 to 50 companies that were doing what they described as desktop virtualization and they were all doing more or less completely different things. So, the first job as a taxonomist is to sit down and try and figure out some of the broad differences between companies that claim to be doing identical things and claim to deliver identical functionality. One of the easiest ways to categorize the true desktop virtualization guys, as opposed to the terminal services or application streaming vendors, was to figure out exactly where the virtual machine (VM) was running.

So I split it three ways. There are three sensible places to run a desktop virtual machine. One is on the physical client, which gives you a whole bunch of benefits around the ability to encrypt and lock down a laptop and manage it remotely. One is to run it on the server, which is the very similarly tried-and-tested VM with VDI or Citrix XenDesktop method. That’s appropriate for a lot of these cases, but when you run out of server capacity or storage in the server-hosted desktop virtualization model a lot of companies would like elastic access to off-site resources.

This is particularly appropriate, for example, for retailers who see a big balloon in staffing – short-term and temporary staffing around the holiday seasons, although possibly not this year -- or for companies that are doing things off-shore and want to provide developer desktops in a very flexible way, or in education where companies get big summer classes, for example, and want to fire up a whole bunch of desktops for their students.

This kind of elastic provisioning is exactly what we see on the server virtualization side around cloud bursting. On the desktop side, you might want to do cloud bursting. You might even want to permanently host those desktops up in the cloud with a hosting provider and you want exactly the same things that you want from a server cloud deployment. You want a very, very clean interface between the cloud resources and the enterprise resources and you want a very, very granular charge back in billing.

And so, we see cloud-hosted desktop virtualization as a special case of server-hosted desktop virtualization. Really, Desktone has been the pioneer in defining what that interface should look like, where the enterprise data should reside, where AD, with its authentication and authorization functions, should reside, and what gets handled by the service provider and how that gets handled by the service provider.

Desktone isn’t the only company in cloud-hosted desktop virtualization, but it’s certainly the best-known and it’s certainly done the best job of articulating what the pieces will look like and how they’ll work together.

Fisher: Great.

Chalmers: It’s a very impressive finch.

Fisher: Always appreciate it. Dana and Robin, do you have any additional comments on what Rachel had to say?

New era in compute resources

Dana Gardner: Yes, I think we’re entering a new era in how people conceive of compute resources. To borrow on Rachel’s analogy, a lot of these finches have been around, but there hasn’t been a lot of interest in terms of an environment where they could thrive. What’s happening now is that organizations are starting to re-evaluate the notion that a one-size-fits-all PC paradigm makes sense.

We have lots of different slices of different types of productivity workers. As Rachel mentioned, some come and go on a seasonal basis, some come and go on a project basis. We’re really looking at a slice-and-dice productivity in a new way, and that forces the organization to really re-evaluate the whole notion of application delivery.

If we look at the cost pressures that organizations are under, recognizing that it’s maintenance and support, and risk management and patch management that end up being the lion’s share of the cost of these systems, we’re really at a compelling point where the cost and the availability of different alternatives has really sparked sort of a re-thinking.

And a lot of general controlled-management security risk avoidance issues require organizations to increasingly bring more of their resources back into a server environment.

But, if you take that step in virtualization and you look at different ways of slicing and dicing your workers, your users, if you can virtualize internally – well then we might as well make the next step and say, “What should we virtualize externally?” “Who could do this better than we can at a scale that brings the cost down even higher?”

This is particularly relevant if they’re commodity level types of applications and services. It could be communications and messaging, it could be certain accounting or back office functions. It just makes a lot of sense to start re-evaluating. What we haven’t seen, unfortunately, is some clear methodologies about how to make these decisions and boundaries inside of organizations with any sort of common framework or approach.

It’s still a one-off company by company approach -- which workers should we keep on a full-fledged PC? Who should we put on a mobile Internet device, for example? Who could go into a cloud-based applications hosting type of scenario that you’ve been describing?

It’s still up in the air and I’m hoping that professional services and systems integrators over the next months and years will actually come up with some standard methodologies for going in and examining the cost-benefit analysis, what types of users and what types of functions and what types of applications it makes sense to put into these different finch environments.

Fisher: Absolutely. I couldn’t agree more and I’ve always been one who talks about use cases. It all comes down to the use cases.

The technology is great, the innovation is great but especially in the case of desktop usage you really have to figure out what people are doing, what they need to do, what they don’t need to be doing at work but are currently doing, and that’s the whole notion of this how the consumerization piece fits in and personal life melds in with business life. You can say, well, this person doesn’t need to do that but if they are today you need to figure out how to make that work and how to take that into account.

So, I agree with you. It’d be fantastic to get to a world where there was just a better way to have better knowledge around use cases and which ones fit with which delivery models.

Chalmers: I think that’s a really crucial point. Just as server and work cloud virtualization have transformed the way we can move desktops and servers around, I see a lot of really fascinating work being done around user virtualization.

Jeff, you talked a lot about the issue of having user data stored separately from the dynamic run-time data. I know you’ve done a lot of work with AppSense within Merrill Lynch. There’s a group of companies -- AppSense, RTO Software, RES, and Sansa -- that are all doing really interesting work around maintaining that user data in a stateful way, but also enabling IT operators to be able to identify groups of users who may need different form factors for their desktop usage and for their work profile.

Buy side perspective

Gardner: We’ve been looking at this from the perspective on the buy side where it makes a lot of sense, but there’s also some significant momentum on the sell side. These organizations that are perhaps traditional telcos, co-location, or hosting organizations, cloud providers or some ecology of providers that actually run on someone else’s cloud but have a value-added services capability of some sort.

These are on the sell side and they’re looking for opportunities to increase their value, not just to small to medium-sized businesses but to those larger enterprises. They’re going to be looking and trying to define certain classes of users, certain classes of productivity and work and workflow, and packaging things in a new and interesting way.

That’s the next shoe to fall in all of this: the type of customer that you have there at Desktone. It’s incumbent upon them now to start doing some packaging and factoring the cost savings, not just on an application-on-application basis but more on a category of workflow business process work and do the integration on the back-end across.

Perhaps that will involve multiple cloud providers, multiple value-added services providers, and they then take that as a solution sell back into the enterprise, where they can come up with a compelling cost-per-user-per-month formula. It’s recurring revenue. It’s predictable. It will probably even go down over time, as we see more efficiency driven into these cloud-based provisioning and delivery systems.

So, there’s a whole new opportunity for the sellers of services to package, integrate, add value, and then to take that on a single-solution basis into a large Fortune 1000 organization, make a single sale, and perhaps have a customer for 10 or 15 years as a result.

Chalmers: It is a tremendously exciting opportunity for our managed-hosting provider clients. It’s the dominating topic of conversation at a lot of the events that we run for that group. Traditionally, a really, really great managed hoster that delivers an absolutely fantastic service will become the beloved number one vendor of choice of the IT operator.

If that managed hosting provider can deliver the same quality of service on the desktop, then they will be the beloved number one vendor of everybody up to and including the CIO and the CEO. It’s a level of exposure they’ve just never been able to aspire towards before.

Robin Bloor: I think that’s probably right. One of the things that is really important about what’s happening here with the virtualization of the desktop is just the very simple fact that desktop costs have never been well under control. The interesting thing is that with end users that we’ve been talking to earlier this year, when they look at their user populations, they normally come to the conclusion that something like 70 or 80 percent of PC users are actually using the PC in a really simple way. The virtualization of those particular units is an awful lot easier to contemplate than the sophisticated population of heavy workstation use and so on.

With the trend that’s actually in operation here, and especially with the cloud option where you no longer need to be concerned about whether your data center actually has the capacity to do that kind of thing, there’s an opportunity with a simple investment of time to make a real big difference in the way the desktop is managed.

Fisher: I totally agree. Thanks, Robin. All right. Let’s shift gears and talk to Dana Gardner. Dana is the president of Interarbor Solutions, and is known for identifying software and enterprise infrastructure trends and new IT business growth opportunities.

During the last 18 years he’s refined his insights as an industry analyst and news editor, and lately he’s been focused on application development and deployment strategies and cloud computing.

So Dana, you’ve been covering us for a while on your blog. For those of you who don’t know, Dana’s blog is called BriefingsDirect. It’s a ZDNet blog. You’ve covered our funding and platform launch, and some of our partner announcements, and we’ve had some time to sit down as well and chat.

In a posting this summer you wrote about Pike County– which is a school district in Kentucky, where IBM has successfully sold a 1,400-seat DaaS deployment. That’s something that we’re going to dive deeper into on a couple of the webinars in the series.

You’ve stated a broad affection for the term “cloud computing,” and all that sticks to that nowadays will mean broad affection, too, for DaaS. Can you elaborate on that?

Entering transitional period

Gardner: Well, sure. As I said, we’re entering a transitional period, where people are really re-thinking how they go about the whole compute and IT resources equation. There’s almost this catalyst effect or the little Dutch boy taking his finger out of the hole in the dike, where the whole thing comes tumbling down.

When you start moving toward virtualization and you start re-thinking about infrastructure, you start re-thinking the relationship between hardware and software. You start re-thinking the relationship between tools and the deployment platform, as you elevate the virtualization and isolate applications away from the platform, and you start re-thinking about delivery.

If you take the step toward terminal services and delivering some applications across the wire from a server-based host, that continues to tip this a little bit toward, “Okay, if I could do it with a couple of apps, why not look at more? If I could do it with apps, why not with desktop? If I can do it with one desktop, why not with a mobile tier?”

If I’m doing some web apps, and I have traditional client-server apps and I want to integrate them, isn’t it better to integrate them in the back-end and then deliver them in a common method out to the client side?

So we’re really going through this period of transformation, and I think that virtualization has been a catalyst to VDI and that VDI is therefore a catalyst into cloud. If you can do it through your servers, somebody else can do it through theirs.

If we’ve managed the wide-area network issues, if we have performance that’s acceptable at most of the application performance criteria for the bell curve of users, the productivity workers, we just go down this domino line of one effect after another.

When we start really seeing total costs tip as a result, the delta between doing it yourself and then doing it through some of these newer approaches is just super-compelling. Now that we’re entering into an economic period, where we’re challenged with top-line and bottom-line growth, people are not going to take baby steps. They’re going to be looking for transformative, real game-changing types of steps. If you can identify a class of users and use that as a pilot, if you can find the right partners for the hosting and perhaps even a larger value-added services portfolio approach, you start gaining the trust, you start seeing that you can do IT at some level but others can do it even better.

The cloud providers are in the business of reducing their costs, increasing their utilization, exploiting the newer technologies, and building data centers primarily with a focus on this level of virtualization and delivery of services at scale with performance criteria. Then, it really becomes psychology and we’re looking at, as you said earlier, the trust level about where to keep your data and that’s really all that’s preventing us now from moving quite rapidly into some of these newer paradigms.

The cost makes sense. The technology makes sense. It’s really now an issue of trust, and so it’s not going to happen overnight, but with baby steps, the domino effect, as you work toward VDI internally. If you work towards cloud with a couple of apps, certain classes of users, before long that whole dike is coming down and you might see only a minority of your workers actually doing things in the conventional client-server, full PC local run-time and data storage mode.

I think we’re really just now entering into a fairly transformative period, but it’s psychologically gaining ground rapidly.

Fisher: Yes, definitely. Rachel, Robin, any thoughts on Dana’s comments?

Psychological issues


Chalmers: I think that’s exactly right and I think the psychological issues are really important, as Dana has described them. One of the huge barriers to adoption of earlier models of this kind of remote desktop-like terminal services has been just that they’re different from having a full, rich Windows user experience in front of you.

The example people keep returning to is the ability to have a picture of your kids as your desktop wallpaper. It seems so trivial from an IT point of view, but just the ability to personalize your own environment in that way turned out to be a major obstacle to adoption of the presentation servers in that model.

You can do that in a virtual desktop environment. You can serve that exact same desktop environment to the same employee, whether she’s working from San Francisco or London. Because the VDI deployment model also has the same, yet better, features to that employee, it becomes much easier to persuade organizations to adopt this model and the cost savings that come along with it.

So, we underplay the psychological aspects at our peril. People are human beings and they have human foibles, and technology needs to work around that rather than assuming that it doesn’t exist.

Bloor: Yes, I’d go along with that. What you’ve actually got here is a technology where the ultimate user won’t necessarily know whether they’ve got a local PC. Nowadays, you can buy devices where the PC itself is buried in the screen.

So, it’s like they may psychologically, in one way or another, have some kind of feeling of ownership for their environment, but if they get the same environment virtually that they would adopt physically, they’re not going to object. Certainly some of the earlier experiences that users have had is that problems go away. The number of desk-side visits required for support, when all you’ve got is a thin client device on the desktop, diminishes dramatically.

The user suddenly has a responsibility for various things that they would do within their own environment lifted completely from them. So, although you don’t advertise it as this there’s actually a win for the user in this.

Chalmers: That’s exactly right and fewer desktop visits, fewer IT guys coming around to restart your blue screen or desktop, that translates directly into increased productivity.

Fisher: Yes, and what we like to talk about at Desktone is just this notion of anytime, anywhere. It’s one thing to get certain limited apps and services. It’s another thing to be able to get your PC environment, your corporate persona everywhere you go.

If you need to work from home for a couple of days a week, or in emergency situations, it’s great to be able to have that level of mobility and flexibility. So, we totally agree.

Now let’s move over now to Robin Bloor. He's a partner at Hurwitz and Associates. He’s got over 20 years experience in IT analysis and consultancy and is an influential and respected commentator on many corporate IT issues. His recent research is focused on virtualization, desktop management, and cloud computing.

Robin, in your post about Desktone on your blog -- “have Mac will blog” -- a title I love -- you mentioned that you were surprised to see the DaaS or cloud value prop for client virtualization emerge this early on. You mentioned that you found our platform architecture diagram to be extremely helpful in explaining the value prop, and I just would like you to provide some more color around those comments.

Tracking virtualization

Bloor: Sure. I really came into this late last year and, in one way or another, I was looking at the various things that were happening in terms of virtualization. I’d been tracking the escalating power of PC CPUs and the fact that, by and large, in a lot of environments the PC is hard to use.

If you do an analysis of what is happening in terms of CPU usage, then the most active thing that happens on a PC is that somebody waves their mouse around or possibly somebody is running video, in which case the CPU is very active. But it became obvious that you could put a virtualized environment on a PC.

When I realized that people were doing that, I got interested in the way that people were actually doing it and, there are a lot of things out there, if you actually look at it. It absolutely stunned me that a cloud offering became available earlier this year because that meant that somebody would actually have had to have been thinking about this two years ago in order to put together the technology that would enable such an offering like that.

So just look at the diagram and you certainly see why, from the corporate point of view, if you’re somebody that’s running a thousand desktops or more it’s a problem. It is a problem in terms of an awful lot of things but mostly it’s a support issue and it’s a management issue. When you get an implementation that involves changing the desktop from a PC to a thin client and you don’t put anything into the data center, it improves.

You’ve now got a situation where you don’t need cages in the data center running PC blades or running virtualized blades to actually provide the service. You don’t need to implement the networking stuff, the brokering capability, boost the networking in case it’s clashing with anything else, or re-engineer networks.

All you do is you go straight into the cloud and you have control of the cloud from the cloud. It’s not going to be completely pain free obviously, but it’s a fairly pain-free implementation. If I were in the situation of making a buying decision right now, I would investigate this very, very closely before deciding against it, because this has got to be the least disruptive solution. And if the apparent cost of ownership turns out to be the same or less than any other solution, you’re going to take it very seriously.

Fisher: Absolutely. Rachel, Dana, thoughts and comments?

Chalmers: I agree and I love this diagram. It’s the one that really conveyed to me how cloud-hosted desktop virtualization might work, and what the value prop is to the IT department, because they get to keep all the stuff they care about, all the user data, all the authentication authorization, all of the business apps. All they push out is support for those desktops, which frankly had been pushed out anyway.

There’s always one guy or gal in the IT organization who is hiking around from desktop to desktop installing antivirus or rebooting machines. Now, instead of that person being hiking around the offices, they are employed by the service provider and sitting in a comfy chair and being ergonomically correct.

Rational architecture

Gardner: Yes, I would say that this is a much more lucid and rational architecture. We’ve found ourselves, over the past 15 or 20 years, sort of the victim of a disjointed market roll out. We really didn’t anticipate the role of the Internet, when client-server came about. Client-server came about quickly just after local area networks (LANs) were established.

We really hadn’t even rationalized how a LAN should work properly, before we were off and running into bringing browsers in TCP/IP stacks. So, in a sense, we've been tripping over and bouncing around from one very rapid shift in technology to another. I think we’re finally starting to think back and say, “Okay, what’s the real rational, proper architectural approach to this?”

We recognize that it’s not just going to be a PC on every desktop. It’s going to be a broadband Internet connection in every coat pocket, regardless of where you are. That fundamentally changes things. We’re still catching up to that shift.

When I look at a diagram like Desktone’s, I say, “Ah-ha!” You know now that we fairly well understand the major shifts that have occurred in the past 20-25 years. If we could start from a real computer-science perspective, if we could look at it rationally from a business and cost perspective, how would we properly architect how we deploy and distribute IT resources? We’re really starting to get to a much more sensible approach, and that’s important.

Bloor: Yes, I would completely go with Dana on that. From an architect’s point of view, if nobody had influenced you in any way and you were just asked to draw out a sense of a virtualization of services to end users, you would probably head in this direction. I have no doubt about it. I’ve been an architect in my time, and it’s just very appealing. It looks like what Desktone DaaS has here is resources under control, and we’ve never had that with a PC.

Fisher: Well, that was great, and I really appreciate you guys taking the time to answer my questions. With the remaining 10 minutes, I’d like to turn it over for some Q&A.

The question coming in has to do with server-based computing app delivery with respect to this model.

This is something that comes up all the time. People say, “We’re currently using terminal services or presentation server,” which is obviously what we use for app deployment. How does that application deployment model fit into this world? And to kick off the discussion I’ll tell you that at Desktone we view what we’re doing very much as the virtualization of the underlying environment of the actual PC itself and the core OS.

That doesn’t change the fact that there are still going to be numerous ways to deploy applications. There’s local installation. There’s local app virtualization. There’s the streaming piece of app virtualization. And, of course, there’s server-based computing which is, by far, the most widely used form of virtualized application delivery.

And, not to mention the fact that in our model there is a private LAN connection between the enterprise and the service provider. In some cases, the latency around that connection is going to warrant having particularly chatty applications still hosted back in the enterprise data center on either Citrix and/or Microsoft terminal servers. So, I don’t view this as being really a solution that cannibalizes traditional server-based computing. What do you guys think?

Chalmers: I think that’s exactly accurate. You mentioned right at the front of the call that Citrix is an investor in Desktone. Clearly the VDI model itself is one that extends the application of terminal services from traditional task workers to all knowledge workers –those people who are invested in having a picture of their kids on the desktop wallpaper.

I think cloud-hosted desktop virtualization extends that again, so that, for example, if you’re running a very successful terminal services application and you don’t want to rip that out—very sensible, because ripping and replacing is much more expensive than just maintaining a legacy deployment of something like that—you can drop in XenDesktop. XenDesktop can talk quite happily to what is now XenApp, the presentation server deployment.

It can talk quite happily to a Desktone back-end and have all of its VDI virtual desktops hosted on a hosting provider. If you’ve got a desk full of Wall Street traders, it can also connect them up to blade PCs, dedicated resources that are running inside the data center.

So XenDesktop is an example of the kinds of desktop connection brokers you’re going to see, but as happy supporting traditional server-based computing or the blade PC model as they are being the front end for a true VDI deployment.

Bloor: Yes, I’d go along with that. One of the things that’s interesting in this space is that there are a number of server-based computing implementations that have been, what I’ll call, early attempts to virtualize the PC, and you may get adrift from some of those implementations. I know in certain banks that they did this purely for security reasons.

You know the virtualized PC is aa secure a server as a computer is. So you may get some drift from one kind of implementation to another, but in general, what’s going to happen is that the virtual PC is just the same as a physical PC. So, you just continue to do what you did before.

Fisher: Absolutely. I do agree that there definitely will be a shift and that again – back to the use cases -- people are going to have to say, “Okay, here are the four reasons we did server-based computing,” not, “We did server-based computing because we thought it was cool.”

Maybe in the area of security, as Robin mentioned, or some other areas, those reasons for deployment go away. But certainly, dealing with latency over LAN, depending on where the enterprise data center sits, where the user sits and where the hosting provider sits, there very well may still be a compelling need to use server-based computing.

Okay. We’ve got about five minutes left. There was an interesting question about disaster recovery (DR), using cloud-hosted desktops as DR for VDI, and this is a subject that’s close to my heart. It will be interesting to hear what you guys have to say about it. There is actually already the notion of some of our service-provider partners looking at providing desktop disaster recovery as a service. It’s almost like a baby step to full-blown cloud-hosted desktops.

Maybe you don’t feel comfortable having your users’ primary desktop hosted in the cloud, but what about a disaster recovery instance in case their PC blue screens and is not recoverable and they’re in some kind of time-critical role and they need to get back and up and running.

Or, as is probably more commonly thought of, what if they’re the victims of some sort of natural disaster and need to get access to an instance of the corporate desktop. What do you guys think about that concept?

Bloor: There are going to be a number of instances where people just go to this, particularly banks where, because of the kind of regulatory or even local standards they operate, they have to have a completely dual capability. It’s a lot easier to have dual capability if you’re going virtual, and I’m not sure that you would necessarily have the disaster recovery service virtual and the real service physical. You might have them both virtual, because you can do that.

This is just a matter of buying capacity, and the disaster-recovery stuff is only required at the point in time where you actually have the disaster. So, it’s got to be less expensive. Certainly, when you’re thinking about configuration in changed management for those environments, when you’ve got specifically completely dual environments, this makes the problem a lot easier.

Gardner: I think there are literally dozens of different security and risk-avoidance benefits to this model. There’s the business continuity issue, the fact that cloud providers will have redundancy across data centers, across geographies, the fact that there is also intellectual property risk management where you can control what property is distributed and how, and it keeps it centrally managed and check-in and check-out can be rigorously managed. And, then there’s also an audit trail as to who is there, so there’s compliance and regulatory benefits.

There’s also control over access to privileges, so that when someone changes a job it’s much easier to track what applications they would and wouldn’t get, in that you’ve basically re-factored their desktop from scratch that next day that they start the new job. So, the risk-compliance and avoidance issues are huge here, and for those types of companies or public organizations where that risk and avoidance issue is huge, we’ll see more of this.

I think that the Department of Defense and some of the intelligence communities have already moved very rapidly towards all server-side control and, for the same reasons that would make sense for a lot of businesses too.

Chalmers: Disaster recovery is always top of mind this time of year, because the hurricanes come around just in time for the new financial round of budgeting. But really it’s a no-brainer for a small business. For the companies that I talked to that are only running one data center, the only thing that they’re looking at the cloud for right now is disaster recovery, and that applies as much to their desktop resources as to their server resources.

Fisher: Great. Well, we are just about out of time so I want to close out. First, lots of information about what we’re doing at Desktone is up on our website, including an analyst coverage page under the News and Events section where you can find more information about Robin and Dana and Rachel’s thinking and – as well as other analysts.

We do maintain a blog at www.desktopsasaservice.com. We have a number of webinars up and coming to round out the series. We’ll be talking to Pike County, a customer of IBM’s and a user of the Desktone DaaS solution, and we’ll be speaking with our partner IBM. We’ll also have the opportunity to have Paul Gaffney, our COO, on a couple of our webinars as well.

So with that I will thank our terrific panel. Rachel, Dana and Robin, thank you so much for joining and for a fantastic conversation on the subject, and thank you so much everyone out there for attending.

View the webinar here.

Transcript of a recent webinar, produced by Desktone, on the future of cloud-hosted PC desktops and their role in enterprises.

Monday, January 05, 2009

A Technical Look at How Parallel Processing Brings Vast New Capabilities to Large-Scale Data Analysis

Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Dana Gardner: Hi, this is Dana Gardner, principal analyst at Interarbor Solutions, and you're listening to BriefingsDirect. Today we present a sponsored podcast discussion on new data-crunching architectures and approaches, ones designed with petabyte data sizes in their sights.

It's now clear that the Internet-size data gathering, swarms of sensors, and inputs from the mobile device fabric, as well as enterprises piling up ever more kinds of metadata to analyze, have stretched traditional data-management models to the breaking point.

In response, advances in parallel processing, using multi-core chipsets have prompted new software approaches such as MapReduce that can handle these data sets at surprisingly low total cost.

We'll examine the technical underpinnings that support the new demands being placed on, and by, extreme data sets. We'll also uncover the means by which powerful new insights are being derived from massive data compilations in near real time.

Here to provide an in-depth look at parallelism, modern data architectures, MapReduce technologies, and how they are coming together, is Joe Hellerstein, professor of computer science at UC Berkeley. Welcome, Joe.

Joe Hellerstein: Good to be here, Dana.

Gardner: Also Robin Bloor, analyst and partner at Hurwitz & Associates. Thanks for joining, Robin.

Robin Bloor: It's good to be here.

Gardner: We're also joined by Luke Lonergan, CTO and co-founder of Greenplum. Welcome to the show, Luke.

Luke Lonergan: Hi, Dana, glad to be here.

Gardner: The technical response to oceans of data is something that has been building for some time. Multi-core processing has also been something in the works for a number of years. Let's go to Joe Hellerstein first. What's different now? What is in the current confluence of events that is making this a good mixture of parallelism, multi-core, and the need to crunch ever more data?

Hellerstein: It's an interesting question, because it's not necessarily a good thing. It's a thing that's emerged that seems to work. One thing you can look at is data growth. Data growth has been following and exceeding Moore's Law over time. What we've been seeing is that the data sets that people are gathering and storing over time have been doubling at a rate of even faster than every 18 months.

That used to track Moore's Law well enough. Processors would get faster about every 18 months. Disk storage densities would go up about every 18 months. RAM sizes would go up by factor of two about every 18 months.

What's changed in the last few years is that clock speeds on processors have stopped doubling every 18 months. They're growing very slowly, and chip manufacturers like Intel have moved instead to utilizing Moore's Law to put twice as many transistors on a chip every 18 months, but not to make those transistors run your CPU faster.

Instead, what they are doing is putting more processing cores on every chip. You can expect the number of processors on your chip to double every 18 months, but they're not going to get any faster.

So data is growing faster, and we have chips basically standing still, but you're getting more of them. If you want to take advantage of that data, you're going to have to program in parallel to make use of all those processors on the chips. That's the confluence that's happening. It's the slowdown in clock speed growth against the continued growth in data.

Effects on mainstream compute problems

Gardner: Joe, where do you expect that this is going to crop up first? I mentioned a few examples of large data sets from the Internet, such as with Google and what it's doing. We're concerned about the mobile tier and how much data that's providing to the carriers. Is this something that's only going to affect a select few problems in computing, or do you expect this to actually migrate down into what we consider mainstream computing issues?

Hellerstein: We tend to think of Google as having the biggest data sets around. The amazing thing about the Web is the amount of data there that was typed in by people. It's just phenomenal to think about the amount of typing that's gone on to generate that many petabytes of data.

What we're going to see over time is that data production is going to be mechanized and follow Moore's Law as well. We'll have devices generating data. You mentioned sensors. Software logs are big today, and there will be other sources of data ... camera feeds and so on, where automated generation is going to pump out lots of data.

That data doesn't naturally go to Web search, per se. That's data that manufacturers will have, based on their manufacturing processes. There is security data that people who have large physical plants will have coming from video cameras. All the retail data that we are already capturing with things like Universal Product Code (UPC) and radio-frequency identification (RFID) is going to increase as we get finer-grain monitoring of objects, both in the supply chain and in retail.

We're going to see all kinds of large organizations gathering data from all sorts of automated sources. The only reason not to gather that data is when you run out of affordable processing and storage. Anybody with the budget will have as much data as they can budget for and will try to monetize that. It's going to be pervasive.

Gardner: Robin Bloor, you've been writing about these issues for some time. Now, we have had multi-core silicon, and we've had virtualization for some time, but there seems to be a lag between how the software can take advantage of what's going on on the metal. What's behind this discrepancy, and where do you expect that to go?

Bloor: There are different strands to this, because if we talk about parallelization, then with large database products, to a certain extent, we have already moved into the parallelization.

It's an elastic lag that comes from the fact that, when a chip maker does something new on the chip, unless its just a speed -- which was a great thing about clock speed -- you have to change your operating system to some degree to take advantage of what's new on the chip. Then, you may have to change the compilers and the way you write code in order to take advantage of what's on the chip.

It immediately throws a lag into the progress of software, even if the software can take advantage of it. With multi-core, we don't have specific tools to write parallel software, except in one or two circumstances, where people have gone through the trouble to do that. They are not pervasive.

You don't have operating systems naturally built for sharing the workload of multi-core. We have applications like virtualization, for example, that can take advantage of multi-core to some degree, but even those were not specifically written for multi-core. They were written for single-core processes.

So, you have a whole lag in the works here. That, to a certain extent, makes multi-core compelling for where you have parallel software, because it can attack those problems very, very well and can deliver benefit immediately. But you run into a paradox when Intel comes out with a four-way or an eight-way or a 16-way chip set. Then the question is how are you going to use that?

Multi-core becomes the killer app

Gardner: You've written recently Robin that the killer app, so to speak, for multi-core is data query. Why do you feel that's the case?

Bloor: There are a lot of reasons for it. First of all, it parallelizes extremely well. Basically, you have a commanding node that's looking after a data query. You can divide the data and the resources in such a way that you just basically run everything in parallel.

The other thing that's really neat about this application is it's a complete batch application, in the sense that you just keep pushing the data through an engine that keeps doing the queries. So, you're making pretty effective use of all the processes that are available to you. It's very high usage.

If you run an operating system that's based upon intervals, you're waiting for things to happen. At various times, the operating system is idle. It doesn't seem like they're very long times, but mostly on a PC the operating system is never doing anything. Even when you're running applications on a PC, it's rarely doing very much, even in a single-CPU situation. In a multiple-CPU situation, it's very hard to divide the workload.

So that's the situation. You've got this problem that we have with very large heaps of data. They've been growing roughly at a factor of about 1,000 every six years. It's an awesome growth rate. At the same time, we have the technology where we can take a very good dash at this and use the CPU power we've got very effectively.

Gardner: Luke Lonergan, we now have a data problem, and we have some shifts and changes in the hardware and infrastructure. What now needs to be brought into this to create a solution among these disparate variables?

Longergan: Well, it's interesting. As I listened to Joe and Robin talk about the problem, what comes to mind is a transition in computing that happened in the 1970s and 1980s. What we've done at Greenplum is to make a parallel operating system for data analysis.

If you look back on super computing, there were times when people were tackling larger and larger problems of compute. We had to invent different kinds of computers that could tackle that kind of problem with a greater amount of parallelism than people had seen before -- the Connection Machine with 64,000 processors and others.

What we've done with data analysis is to make what Robin brings forward happen -- have all units available within a group of commodity computers, which is the popular computing platform. It's really required for cost-efficient analysis to bring that to bear automatically on structured query language (SQL) queries and a number of different data-intensive computing problems.

The combination of the software-switch interconnect, which Greenplum built into the Greenplum product, and the underlying use of commodity parallel computers, is brought together in this database system that makes it possible to do SQL query and languages like MapReduce with automatic parallelism. We're already handling problems that involve thousands of individual cores on petabytes of data.

The problem is very much real. As Joe indicated, there are very many people storing and analyzing more data. We're very encouraged that most of our customers are finding new uses for data that are earning them more money. Consequently, the driver to analyze more and more data continues to grow. As our customers get more successful, this use of data is becoming really important.

Gardner: Back to Joe. This seem to be a bright spot in computer science, tackling these issues, particularly in regard to massive data sets, not just relational data, of course, but a multitude of different types of content and data. What's being done at the research level that backs up this direction or supports this new solution direction?

Data-centered approach has huge power

Hellerstein: It's an interesting question, because the research goes back a ways. We talked about how database systems and relational query, like SQL, can parallelize neatly. That comes straight out of the research literature projects, like the Gamma Research Project at Wisconsin in 1980s, and the Bubba Project at MCC. What's happened with that work over time is that it has matured in products like Greenplum, but it's been kind of cornered in the SQL world.

Along came Google and borrowed, reused, and reapplied a lot of that technology to text- and Web-processing with their MapReduce framework. The excitement that comes from such a successful company as Google tackling such a present problem as we have today with the Web, has begun to get the rest of computer science to wake up to the notion that a data-centric approach to parallelism has enormous power.

The traditional approach to parallelism and research in the 1980s was to think about taking algorithms -- typically complicated scientific algorithms that physicists might want to use -- and trying to very cleverly figure out how to run them on lots of cores.

Instead, what you're seeing today is people say, "Wow, well, let's get a lot of data. It's easy to parallelize the data. You break it up into little chunks and you throw it out to different machines. What can we do cleverly in computing with that kind of a framework?" There are a lot of ideas for how to move forward in machine learning and computer vision, and a variety of problems, not just databases now, where you are taking this massively parallel data-flow approach.

Gardner: I've heard this term "shared nothing architecture," and I have to admit I don't know anything about what it means. Robin, do you have a sense of what that means, and how that relates to what we are discussing?

Bloor: Yeah, I do. The first time I ran into this was not in respect to this at all. I did some work for the Hong Kong Jockey Club in the 1990s. What they do is take all the gambling on all the horse racing that goes on in Hong Kong. It's a huge operation, much, much bigger than its name sounds.

In those days, they got, I think, the largest transaction rate in the world, or at least it was amongst the top ten. They were getting 3,000 bets in the last second before a race, and they lose the money from the bet if the bet doesn't go on.

The law in Hong Kong was that the bet has to be registered on disk, before it was actually a real bet. So, if in any way, anything fell over or broke during the minute leading up to a race, a lot of money could be lost.

Basically they had an architecture that was a shared nothing architecture. They had a router in front of an awful lot of servers, which were doing nothing but taking bets and writing them to disk. It was server, after server, after server. If at any point, there was any indication that the volume was going up, they would just add servers, and it would divide the workload into smaller and smaller chunks, so it could do it.

You can think of almost being like a supermarket in the sense of lots and lots of different tools and lots of queues for people, but each tool is a resource on its own, and it shares nothing with anything else. Therefore, no bottlenecks can build up around any particular line.

If you have somebody directing the traffic, you can make sure that the flow goes through. So you go from that, straight into a query on a very large heap of data, if you manage to divide the data up in an efficient way.

A lot of these very big databases consist of nothing more than one big fact table -- a little bit more, but not much more than one big fact table. You split that over 100 machines, and you have a query against a whole fact table. Then, you just actually have 100 queries against 100 different data sets, and you bring the answer back together again.

You can even do fault tolerance in terms of the router for all this. So, with that, you can end up with nothing being shared, and you just have the speed. Basically, any device that's out there is doing a bit of query for you. If you've got 1,000 of them, you go 1,000 times faster. This scales extraordinarily well, because nothing is shared.

Gardner: Luke, tell me how these concepts of being able to scale relate to what the developers need to do. It seems to me that we've got some infrastructure benefits, but if we don't have the connection between how these business analysts and others that are seeking the results can use these infrastructure benefits, we're not capitalizing. What needs to happen now in terms of the logic as that relates to the data?

The net effects on users

Longergan: It's a good question, because, in the end, it's about users being able to gain access to all that power. What really turned the corner for general data analysis using SQL is the ability for a user to not to have to worry about what kind of table structure they have. They can have lots of small tables joining to lots of big tables, and big tables joining to each other.

These are things they do to make the business map better to the data analysis they're doing. That throws a monkey wrench in the beautiful picture of just subdividing the data and then running individual queries.

What the developer needs is an engine that doesn't care how the data is distributed, per se, just being able to use all of that parallelism on the problems of interest. The core problem we've solved is the ability for our engine to redistribute the data and the computation on the fly, as these queries and analysis are being performed.

It's the combination, as Robin put it earlier, of a compiler technology, which is our parallelizing optimizer, and a software interconnect, which we call a soft switch technology. The combination of those two things enables a developer of business logic and business analysis to not to have to worry about what is underneath them.

The physical model of how the database is distributed in a shared nothing architecture in a Greenplum system is not visible to the developer. That is where the SQL-focused data analytics realm has gone by necessity. It really has made it possible to continue to grow the amount of data, and continue to be able to run SQL analysis against that data. It's the ability to express arbitrarily constructed business rules against a large-scale data store.

Gardner: We did one of these podcasts not too long ago with Tim O'Reilly. He mentioned that he'd heard from Joe Hellerstein that every freshman now at UC Berkeley studying computer science is being taught Hadoop, which is related on an open-source development level and community to MapReduce. SQL is now an elective for seniors.

It seems that maybe we've crossed a threshold here in terms of how people are preparing themselves for this new era. Joe, how does that relate to how this new logic and ability to derive queries from these larger data sets is unfolding?

Hellerstein: What you're seeing there is three things happening at once. The first is that we have a real desire on the educational side to teach the next generation of programmers something about parallelism. It's really sticking your head in the sand to teach programming the way we have always taught it and not address the fact that every efficient program over the next ... forever is going to have to be a parallel program. That's the first issue.

The second issue is what's the simplest thing you can teach to computer science students to give them a tangible feeling for parallelism, to actually get them running code on lots of machines and get it going? The answer is data parallelism -- not a complicated scientific algorithm that's been carefully untangled, but simple data parallelism in a language that doesn't really require them to learn any new conceptual ideas that they wouldn't have learned in a high school AP course where they learned say Python or Java.

When you look at those requirements, you come up with the Google MapReduce model as instantiated in the open-source code of Hadoop. They can write simple straight-line programs that are procedural. They look just like "For" loops and "If-Then" statements. The students can see how that spreads out over a lot of data on a lot of machines. It's a very approachable way to get students thinking about parallelism.

The third piece of this, which you can't discount, is the fact that Google is very interested in making sure that they have a pipeline of programmers coming in. They very aggressively have been providing useful pedagogical tools, curriculum, and software projects, to universities to ramp this up.

So it's a win-win for the students, for the university, and frankly for Google, Yahoo, and IBM, who have been pushing this stuff. It's an interesting thing, an academic-industrial collaboration for education.

At the business level

Gardner: Let's bring this from a slightly abstract level down to a business level. We seem to be focusing more on purpose-built databases, appliances, packaging these things a little differently than we had in a distributed environment. Luke, what's going on in terms of how we package some of these technologies, so that businesses can start using them, perhaps at a crawl, walk, run type of a ramp up?

Longergan: Businesses have invested a tremendous amount of their time over the last 15 to 25 years in SQL, and some of the more traditional kinds of business analysis that pay off very well are ensconced in that programming model. So, packaging a system that can do transactional, mixed workloads with large amounts of concurrency, with applications that use the SQL paradigm, is very important.

Second, the ability to leverage the trends in microprocessors and inexpensive servers, and combine those with this kind of software model that scales and takes advantage of very high degree of parallelism, requires a certain amount of integration expertise.

Packaging this together as software plus hardware, making that available as a reference architecture for customers, has been very important and has been very successful in our accounts at New York Stock Exchange, Fox, MySpace, and many others.

Finally, as Joe and you were hinting at, there are changes in the programming paradigm. In being able to crawl, walk, and then run, you have to support the legacy, but then give people a way to get to the future. The MapReduce paradigm is very interesting, because it bridges the gap between traditional data-intensive programming with SQL and the procedural world of unstructured text analysis.

This set of technologies, put together into a single operating system-like formulation and package, has been our approach, and it's been very popular.

Gardner: Robin Bloor, this whole notion of legacy integration is pretty important. A lot of enterprises don't have the luxury of starting out "green field," don't have the luxury of hiring the best and brightest new computer scientists, and working on architecture from a pure requirements-based perspective. They have to deal with what they have in place. Increasingly, they want to relate more of what they have in place into an analytic engine of some kind.

What's being done from your perspective vis-à-vis parallelization and things like MapReduce that allow for backward compatibility, as well as setting yourself up to be positioned to expand and to take advantage of some of these advancements?

Bloor: The problem you have with what is fondly called legacy by everybody is that it really is impossible. The kind of things that were done in the past, very strongly bound the software to the data, to the environment it ran in. Therefore, unhooking that, other than starting again from scratch, is a very difficult thing to do.

Certainly, a lot of work is going on in this area. One thing that you can do is to create something -- I don't know if there is an official title to it, but everybody seems to use the word data fabric. The idea being that you actually just siphon data off from all of the data pools that you have throughout an organization, and use the newer technology in one way or another to apply to the whole data resource, as it exists.

This isn't a trivial thing to do, by the way. There are a lot of things involved, but it's certainly a direction in which things are actually going to move. It's possibly not as well acknowledged as it should be, but most of the things that we call data warehouses out there, the implementations have been done in the area of business intelligence (BI), actually don't run very well.

You have situations where people post queries, and it may take hours to answer a query. Because it takes hours to answer a query, and you have a whole scheme, a reason why you are actually mining the data for something, if every step takes a couple of hours, it's very difficult to carry out an analysis like that in a particularly effective way.

A 100-to-1 value improvement

If you take something like the Greenplum technology, and you point to the same problem, even though you are not dealing with petabytes of data, you can still have this parallel effect. You can get answers back that used to take 100 minutes, and you will get 100 to 1 out of this. You may get more, but you will certainly get 100 to 1 out of this, and it changes the way that you do the job that you have.

One thing that's kind of invisible is that there is a lot of data out there that's not being analyzed fast enough to be analyzed effectively. That's something that I think parallelism is going to address.

The other thing where it is going to play a part is that organizations are going to build data fabrics. In one way or another, they will siphon the data off and just handle it in a parallel manner. There is a lot you can do with that, basically.

Gardner: Joe Hellerstein, is there more being brought to this from the data architecture perspective, jibing the old with the new, and then providing yet better performance when it comes to these massive analytic chores?

Hellerstein: What I'm excited about, and I see this at Greenplum -- there's another company called Aster Data that's doing this, and I wouldn't be surprised if we see more of this in the market over time -- is the combination of SQL and MapReduce in a unified way in programming environments. This is short-term step, but it's a very pragmatic one that can help with people's ability to get their hands on data in an organization.

The idea is that, first of all, you want to have the same access to all your data via either an SQL interface or a MapReduce programming interface. When I say all the data, I mean the stuff you used to get with SQL, the database data, and the stuff you might currently be getting with MapReduce, which might be text files or log files in a distributed-file system. You ought to be able to access those with whatever language suits you, mix and match.

So, you can take your raw log files, which are raw text, and use SQL to join those against a customer table. Or, if you're a MapReduce programmer who does analytics and doesn't know SQL, say you're a statistician, you can write a MapReduce program that does some fancy statistical analysis. You can point it at text fields in a database full of user comments, or at purchase records that you used to have to dump out of the database into text formats to get your hands on. So, part of this is getting more access to more people who have programing paradigms at their fingertips.

Another piece of this is that some things are easier to do in MapReduce, and some things are easier to do in SQL, even when you know both. Good programmers have a lot of tools in their tool belt. They like to be able to use whatever tool is appropriate for the task. Having both of these things interleaved is really quite helpful.

Gardner: Luke, to what degree are they interleaved now, and to what degree can we expect to see more?

Longergan: It's been very gratifying that just making some of those pragmatic capabilities available and helping customers to use them has so far yielded some pretty impressive results. We have customers who have solved core business problems, in ways they couldn't have before, by unifying the unstructured text-file data sources with the data that was previously locked up inside the database.

As Joe points out, it's a good programmer who knows how to use all of the various tools that they have at their disposal. Being able to pull one that's just right for the task off the shelf is a great thing to do. With the Greenplum system we've made this available as a simple extension and just another language that one can use with the same parallel data engine, and that's been very successful so far.

Impact on cloud computing

Gardner: Let's look at how this impacts one of the hot topics of the day, and that's cloud computing, the idea that sourcing of resources can come from a variety of organizations. You're not just going to get applications as a service or even Web services, but increasingly infrastructure functionality as a service.

Does this parallelization, some of these new approaches to programming, and the ability to scale have an impact on how well organizations can start taking advantage of what's loosely defined as cloud computing? Let's start with you Joe.

Hellerstein: I'm not quite sure how this is going to play out. There are a couple of questions about how an individual organization's data will end up in the cloud. Inevitably it will, but in the short-term, people like to keep their data close, particularly database data that's traditionally been in the warehouses, very carefully managed. Those resources are very carefully protected by people in the organization.

It's going to be some time until we really see everybody's data warehouses up in the cloud. That said, as services move into the cloud, the data that those services spit out and generate, their log files, as well as the data that they're actually managing, are going to be up in the cloud anyway.

So, there is this question of, how long will it be until you really get big volumes of data in the cloud. The answer is that certainly new applications will be up there. We may start to see old data getting uploaded in the cloud as well.

There's another class of data that's already becoming available in the cloud. There is this recent announcement from Amazon that they are going to make some large data sets available on their platform for public access. I think we'll see more of this, of data as a utility that's provided by third parties, by governments, by corporations, by whomever has data that they want to share.

We'll start to see big data sets up there that don't necessarily belong to anyone, and they are going to be big. In that environment, you can imagine big data analytics will have to run in the cloud, because that's where the data will be.

One of the fun things about the cloud that's really exciting is the elasticity of the resources. You don't buy yourself a data center full of machines, but you rent as many machines as you need for a task.

If you have a task that's going to look at a lot of data, you would rent a lot of machines for a few hours, and then you would shrink your pool. What this is going to allow people to do is that even small organizations may, for a short period of time, look at an enormous amount of data, which perhaps doesn't originate in their own data production environment, but is something that they want to utilize for their purposes.

There is going to be a democratization of the ability to take advantage of information, and it comes from this ability to share these resources that compute, as well as the actual content to share them in a temporary way.

Gardner: Let's go to Robin on that. It seems that there is a huge potential payoff if, as Joe mentioned, you can gather data from a variety of sources, perhaps not in your own applications, not your own infrastructure and/or legacy, but go out and rent or borrow some data, but then do some very interesting things with it. That requires joins, that requires us to relate data from one cloud to another or to suck it into one cloud, do some wonderful magic-dust pixie sprinkling on it, and then move along.

How do you view this problem of managing boundaries of clouds, given that there is such a potential, if we could do it well, with data?

Looking at networks

Bloor: There would have to be, because you are looking at a technical problem, and you really are going to have to have specific interfaces for doing that, especially if you are joining data across clouds. Let's drop the word "cloud" and just think large network, because everything that is representative of the cloud ultimately comes down to being somewhat of a larger network.

When you've got something very large, like what Google and Amazon have, then you have this incredible flexibility of resources. You can push resources in or redeploy these resources very, very effectively. But you're not going to be able to do joins across data heaps in one cloud and another cloud, and in perhaps a particular network without there being interfaces that allow you to do that, and without query agents sitting in those particular clouds that are going to go off and do the work. You're going to care very much as to how fast they do that work as well.

This is going to be a job for big engines like Greenplum, rather than your average relational database, because your average relational database is going to be very slow.

Also, you have to master the join. In other words, the result has to arrive somewhere, and be brought together. There are a number of technical issues that are going to have to be addressed, if we're going to do this effectively, but I don't see anything that stops it being done. We have the fast networks to enable this. So, I think it can be done.

Gardner: Luke, last word goes to you. I don't expect you to pre-announce necessarily, but how do you, from Greenplum's perspective, address this need for joining, but recognizing it's a difficult technical problem?

Longergan: Well, the cloud really manifests itself as a few different things to us. When Joe was talking about how people are going to be putting, and are already putting, a lot of services up in the cloud that are generating a lot of new data, then it requires that the kinds of data analysis, as Robin was hinting at, scale to meet that demand.

We already have the engine that implements those kinds of join in between networks abilities. So we are cloud capable. The real action is going to be when people start to do business that counts on public clouds to function properly, and are generating enormous amounts of very valuable data that requires the kind of parallel compute that we provide.

Joining inside clouds, using cloud resources to do the kind of data analysis work, this is all happening as we speak, and this is another aspect of what's forcing the change from an earlier paradigm of database to the modern massively parallel one.

Gardner: I just want to wrap up quickly now. Thank you. Joe Hellerstein, you mentioned earlier on Moore's Law and how it stalled a bit on the silicon. Are we going to see a similar track, however, to what we did with processing over the last 15 years -- a rapid decrease in the total cost associated with these tasks? Even if we don't necessarily see the same effect in terms of the computing, are we going to be able to do what we've been describing here today at an accelerating decreased total cost?

Hellerstein: Absolutely. The only barrier to this is the productivity of programmers.

Just think about storage. I have a terabyte disk in my basement that holds videos, and it costs $100 or so dollars at Amazon.com. Ten years ago a terabyte was referred to by the experts in the field as a "terror byte." That's how worried people were about data volumes like that.

We'll see that again. Disk densities show no signs of slowing down. So, data is going to be essentially no cost. The data-gathering infrastructure is also going to be mechanized. We're going through what I call the industrial revolution of data production. We're just going to build machines to generate data, because we think we can get value out of that data, and we can store it essentially for free.

The compute cost of multi-core with parallelism is going to continue Moore's Law. It's just going to continue it in a parallel programming environment. If we can get all those cores looking at all that data, it won't cost much to do that, and the cost of that will continue to shrink by half.

The only real barrier to the process is to make those systems easy to program and manageable. Cloud helps somewhat with manageability, and programming environments like SQL and MapReduce are well-suited to parallelism. We're going to just see an enormous use of data analysis over time. It's just going to grow, because it gets cheaper and cheaper and bigger and bigger.

Gardner: Well, great, that's very exciting. We've been discussing advances in parallel processing using multi-core chipsets and how that's prompted new software approaches such as MapReduce that can handle these large data sets, as we have just pointed out, at surprisingly low total cost.

I want to thank our panel for today. We have been joined by Joe Hellerstein, professor of computer science at UC Berkeley, and I should point out also an adviser at Greenplum. Thank you for joining, Joe.

Hellerstein: It was a pleasure.

Gardner: Robin Bloor, analyst and partner at Hurwitz & Associates. I appreciate your input, Robin.

Bloor: Yeah, it was fun.

Gardner: Luke Lonergan, CTO and co-founder at Greenplum. Thank you, sir.

Longergan: Thanks, Dana.

Gardner: This is Dana Gardner, principal analyst at Interarbor Solutions, you have been listening to a sponsored podcast from BriefingsDirect. Thanks and come back next time.

Listen to the podcast. Download the podcast. Find it on iTunes/iPod. Learn more. Sponsor: Greenplum.

Transcript of BriefingsDirect podcast on new technical approaches to managing massive data problems using parallel processing and MapReduce technologies. Copyright Interarbor Solutions, LLC, 2005-2009. All rights reserved.