Robert Kahn | A different kind of Internet
Internet pioneer still envisions an Internet that manages information, instead of just moving data
It’s been just over 40 years since Robert E. Kahn took a leave of absence from the MIT faculty and helped create a computer networking project whose effects are still being felt today. His design ideas led to the creation of ARPAnet, the world’s first packet-switched network, and ultimately to the Internet. His subsequent work at the Defense Advanced Research Projects Agency, where he headed the Information Processing Techniques Office, would later lead to the largest computer research and development program ever undertaken by the federal government.
As chairman and chief executive of the non-profit Corporation for National Research Initiatives, Dr. Kahn continues to be a global advocate for long-range infrastructure research and open-architecture networking. He still harbors a vision for how the Internet could be used to manage, not just move, information, using Digital Object Architecture and Knowbots. He shares that vision in a recent interview with GCN chief editor Wyatt Kash, and explains how it could help make electronic health records and other data more secure and permanently accessible.
(An abridged version of this interview, as it appeared in GCN’s May 18, 2009 issue, is available here.)
GCN: You started CNRI to foster experimental research projects that are national and infrastructural in nature. Where are you focusing your efforts these days?
ROBERT KAHN: Interestingly, in this very difficult economic situation that we’re dealing with today, a lot of the stimulus focus is on infrastructure creation and improvement. Some of it is of the traditional kind, where you need jackhammers and concrete. But some of it is informational--in areas like getting medical records online. We have witnessed explosive growth over the last two decades in a variety of computation capabilities and the Internet in particular. Those are the kinds of things that CNRI was set up to deal with, where we try to come up with good system and architectural ideas such that those kinds of projects can be carried out successfully on a national scale.
Many of the advances that we need to make in this country’s infrastructure are going to be learning experiences. Development of the Internet was a learning experience, and I think most of the informational infrastructure ones in the future are going to be of that form. We’re actually an ideal organization to assist or lead such efforts because we can work with the research community and we can work with the industry and government to make those things come to fruition. Of course, it all depends on the necessary resources being available.
GCN: Do you see government taking a larger role in these kinds of projects?
KAHN: The government’s role in things infrastructural is absolutely essential. It really is very difficult--I would say almost impossible--to create national infrastructure without at least the imprimatur of the government. It doesn’t always require government funding, although sometimes it’s essential. But if the government is opposed to the creation of infrastructure, it isn’t likely to happen. I think the Obama administration is rightly seeing an opportunity here for making investments in this country which are needed, but which have just not been economically feasible for the private sector to make all by itself in the past.
There are many areas where we could have made a lot more progress if there had been an organized effort to create new types of infrastructure. Some of the most vigorous opposition actually comes from the private sector - not usually in terms of overt actions, but by just not wanting to see change happen in areas that would dramatically affect them.
I ran into this phenomenon when I was first starting in the network business many years ago. There were very significant revenue streams being generated from telephony. You can argue whether packet switching (and DARPA) played a role in changing that, but the combination of the technology and the public acceptance of the Internet were clearly getting to the point where it was going to make a major difference in how communications were taking place in the country.
But if you owned a business and somebody said, here’s a great new idea but the business potential for it might be 20 years in the future, would you invest your resources in that right now, especially when it might reduce or even eliminate revenue streams that are steady producers for the company? It might not be in the shareholders’ interest.
So this is a conundrum that we face about the best way to make long-term, really long-term, investments. We have many resources that can be tapped for things that will produce relatively quick results. But what if you’re talking about 20, 30, or 40 years in the future and you know it’s going to take steady progress to get there? Whose job should it be to enable that kind of effort?
When it’s infrastructure, it’s very unclear. In the Reagan administration, they thought infrastructure development should be the responsibility of the private sector. The term they used here was “industrial policy.” In other words, the government should stay out of that. The current administration has a different view, perhaps due to the current adverse economic situation.
I don’t think it necessarily works well when the federal government is the builder of the infrastructure either. They can be a (or even the) funder of it, initially, but ultimately, you need a plan that’s going to work for the long haul. We don’t have good solutions today in this country for how to build infrastructure when it’s going to take many years to get there.
Now you could argue: Do it all faster. But I don’t think that’s really an answer. If somebody said, create the Internet and make it all happen in a year or two--what possible mechanism could you have used to take the effects of the last 10 or 20 years and compress it into one or two years? You would have needed an immense personal-computer industry to emerge overnight, with all the hardware and software that went along with it; you’d have to get multiple organizations formed around the globe to provide carrier services. When you’re dealing with infrastructure development and evolution, in most cases it simply takes time. I thought CNRI could play a key role in this area, but here we are some 22 years later, still trying to do our part.
GCN: Do you see your company as catalyst or laboratory?
KAHN: Maybe both. There are things that we can research, develop and/or deploy. There are things that we can catalyze by working with the rest of the research community, government and industry – and then there are things that we actually can do in kick-starting operational services.
We began our activities at CNRI with substantial contributions from a few large U.S. companies and government -- to which we’re still grateful -- that we could leverage on infrastructure research in any way we saw fit. So a lot of the work on the Digital Object Architecture really was a result of some of those early investments.
GCN: What has evolved from your work around Digital Object Architecture—and what you call the Handle System?
KAHN: Well, the Handle System is one component of the more comprehensive architecture that I call the Digital Object Architecture that was intended to reflect what the Internet might have looked like if its main goal was to manage information, as opposed to just moving bits or packets from place to place.
The key element of the architecture is the “digital object,” or structured information that incorporates a unique identifier, and which can be parsed by any machine that knows how digital objects are structured. This makes it essentially machine-independent. So I can take a digital object and store it on this machine, move it somewhere else, or preserve it for a long time by porting it from place to place. The digital object itself would be understood as a structural unit.
A digital object doesn’t become a digital object any more than a file becomes a file if it doesn’t have the equivalent of a name and an ability to access it. The fact that you type a few keys on your keyboard doesn’t make the typed characters a file. You have to give it a file name and put the characters into the file system. So these digital objects have identifiers, which we sometimes call digital object identifiers. We also use the term “handle” as the shorthand name for the identifier.
So say the identifier for this object is 1234/XJ493267. That doesn’t necessarily mean anything. What do you do with it? There needs to be a resolution system that you can ask about the digital object that has this identifier. For example, what are the one or more IP addresses where the digital object can be accessed? So the Handle System resolves these handles, once submitted over the net, into handle records--and it gives your computer the handle record for that identifier almost instantly.
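The resolution step described here can be sketched roughly as follows. This is only an illustration, not the actual Handle System protocol: the in-memory `HANDLE_RECORDS` table, the `resolve` function, and the IP addresses (taken from ranges reserved for documentation) are all hypothetical stand-ins for a distributed resolution service reached over the net.

```python
# Hypothetical in-memory stand-in for a handle resolution service.
# A real resolver is a distributed network service; this only shows the mapping
# from an opaque identifier to a handle record.
HANDLE_RECORDS = {
    "1234/XJ493267": {
        # One or more places where the digital object can be accessed.
        "locations": ["192.0.2.10", "198.51.100.7"],
    },
}


def resolve(handle: str) -> dict:
    """Return the handle record for an identifier, or raise KeyError."""
    if handle not in HANDLE_RECORDS:
        raise KeyError(f"no handle record for {handle!r}")
    return HANDLE_RECORDS[handle]


record = resolve("1234/XJ493267")
print(record["locations"])
```

The identifier itself carries no meaning; everything useful comes back in the record.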
Now, what can this rapid resolution capability do for managing information? It depends on what the user puts into the handle record. For example, as just described, he could put in the many places on the Internet that this object is stored; it doesn’t have to be just one. He could put in authentication information for that object, so that you could verify that the object is what it purports to be and that it wasn’t corrupted along the way. It could be public keys for access. It could provide various terms and conditions for use. There are a variety of things that you could put in that would give you some sense of what you might do with this object. Or, for users with only a browser, it could be one or more URLs to actually download and display the object.
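To make the idea of a handle record with several kinds of entries concrete, here is a minimal sketch. The entry type names (`LOCATION`, `URL`, `SHA256`, `TERMS`), the example values, and the helper functions are assumptions made for illustration; the interview does not specify how records are laid out. The sketch uses a SHA-256 digest as one possible form of "authentication information" for checking that an object was not corrupted along the way.

```python
import hashlib

object_bytes = b"example digital object content"

# Hypothetical handle record: a list of typed entries chosen by the owner.
handle_record = [
    {"type": "LOCATION", "value": "192.0.2.10"},
    {"type": "LOCATION", "value": "198.51.100.7"},
    {"type": "URL", "value": "https://repository.example.org/objects/XJ493267"},
    {"type": "SHA256", "value": hashlib.sha256(object_bytes).hexdigest()},
    {"type": "TERMS", "value": "research use only"},
]


def values_of_type(record, entry_type):
    """Collect every value in the record that has the given entry type."""
    return [entry["value"] for entry in record if entry["type"] == entry_type]


def is_authentic(record, data):
    """Check retrieved bytes against the digest stored in the handle record."""
    return hashlib.sha256(data).hexdigest() in values_of_type(record, "SHA256")


print(values_of_type(handle_record, "LOCATION"))
print(is_authentic(handle_record, object_bytes))
```

A browser-only client would look just at the `URL` entries, while a richer client could also check the digest and the terms of use.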
Another part of the architecture is the notion of a repository, where digital objects may be deposited and from which they may be accessed later on. Repositories make use of existing storage systems.
The Handle System might provide the IP address of the repository. You then go to that repository and say, I’d like to access the object whose identifier is X. With many existing database systems, you’ve got to know a lot of the details to get in. But the repository software can totally insulate you from all of these details. It makes it easy to store a digital object, easy to access a digital object (or parts thereof), and you can use the technology to preserve the object for a long period of time, if you like. You never have to worry about redoing the digital object when you change from one system to another, whether it’s the same manufacturer or a different one. You just port the digital object from place to place.
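The insulation that a repository provides might look something like the sketch below. The `Repository` class, its `deposit` and `access` methods, and the two storage backends are hypothetical names invented for this example; they are not CNRI's repository software. The point is only that callers use the identifier and never see how the underlying storage system is organized.

```python
class DictStorage:
    """One hypothetical storage backend: a plain dictionary."""

    def __init__(self):
        self._items = {}

    def put(self, key, data):
        self._items[key] = data

    def get(self, key):
        return self._items[key]


class ListStorage:
    """A second hypothetical backend with a different internal layout."""

    def __init__(self):
        self._rows = []

    def put(self, key, data):
        # Drop any earlier row for this key, then append the new one.
        self._rows = [row for row in self._rows if row[0] != key]
        self._rows.append((key, data))

    def get(self, key):
        for stored_key, data in self._rows:
            if stored_key == key:
                return data
        raise KeyError(key)


class Repository:
    """Uniform deposit/access interface over any storage backend."""

    def __init__(self, storage):
        self._storage = storage

    def deposit(self, identifier, data):
        self._storage.put(identifier, data)

    def access(self, identifier):
        return self._storage.get(identifier)


old_repository = Repository(DictStorage())
new_repository = Repository(ListStorage())
old_repository.deposit("1234/XJ493267", b"digital object bytes")

# Porting: the object moves between systems unchanged, under the same identifier.
new_repository.deposit("1234/XJ493267", old_repository.access("1234/XJ493267"))
```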
Now, if you’ve encrypted that digital object, you may not be able to find it if you didn’t keep very good records of its identifier. So enter the notion of the metadata registry, which is a system that you can interact with in some natural way that will help you locate the digital objects that you’re looking for when you don’t know the identifiers to start with. It will also let you build collections and identify collections of digital objects that might span across many different repositories. So it’s a really powerful kind of capability.
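A metadata registry of this kind can be pictured as an index from descriptive fields to identifiers. The sketch below is an assumption-laden toy: the `MetadataRegistry` class, the field names (`author`, `year`), the repository names, and the identifiers are all invented for illustration.

```python
class MetadataRegistry:
    """Toy registry mapping descriptive metadata to object identifiers."""

    def __init__(self):
        self._entries = {}

    def register(self, identifier, repository, **metadata):
        # Record where the object lives along with its descriptive fields.
        self._entries[identifier] = {"repository": repository, **metadata}

    def find(self, **criteria):
        """Return (identifier, repository) pairs matching every criterion."""
        matches = []
        for identifier, entry in self._entries.items():
            if all(entry.get(key) == value for key, value in criteria.items()):
                matches.append((identifier, entry["repository"]))
        return sorted(matches)


registry = MetadataRegistry()
registry.register("1234/a1", "repo-east", author="Lee", year=2008)
registry.register("1234/b2", "repo-west", author="Lee", year=2009)
registry.register("5678/c3", "repo-west", author="Ortiz", year=2009)

# A collection that spans two repositories, found without knowing any identifier.
lee_collection = registry.find(author="Lee")
print(lee_collection)
```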
GCN: Where are these systems being used today?
KAHN: Parts of the Digital Object Architecture are widely used by the publishing industry. Almost all of the electronic journals that I’m familiar with use the Handle System for identifier resolution, because if you put an electronic journal on the electronic bookshelf, you’d like to be able to pull it off in 50 or 100 years and still know that the “clickables” work. Without something like a unique persistent identifier, it’s not clear how you’d find these (referenced materials) in the future, especially if they have moved from place to place over the years. You don’t want to go back and have to change an electronic journal issued years ago every time a digital object is moved from one machine to another machine.
GCN: So how does the architecture allow the registry to stay current?
KAHN: Remember, the Handle System is a big distributed system. It’s not a single server in a single location. It consists of many servers (or services, actually) in lots of different places running local handle services, each of which is itself potentially distributed.
Suppose there are lots of different users that have copies of these electronic journals. Let’s say you don’t even know who they are. When one of these users tries to access a reference cited in some other electronic journal and clicks on it, his system first goes to the Handle System to pull back the handle record. If the Handle System is updated when a change occurs, every one of the users everywhere suddenly will have access to the current information.
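The effect of this indirection can be shown in a few lines. In this sketch, the `handle_records` table, the `follow_citation` helper, and the example URLs are hypothetical; the point is that the published citation stores only the identifier, so a single update to the handle record is seen by every reader.

```python
# Hypothetical handle records, keyed by identifier.
handle_records = {
    "1234/XJ493267": {"url": "https://old-host.example.org/article.pdf"},
}

# A citation in an already-published journal stores only the identifier,
# never the current location of the cited object.
journal_citation = {"title": "Cited article", "handle": "1234/XJ493267"}


def follow_citation(citation, records):
    """Resolve the citation's handle to wherever the object lives right now."""
    return records[citation["handle"]]["url"]


before_move = follow_citation(journal_citation, handle_records)

# The object moves to another machine; only the handle record is updated.
handle_records["1234/XJ493267"]["url"] = "https://new-host.example.org/article.pdf"

after_move = follow_citation(journal_citation, handle_records)
print(before_move, after_move)
```

The journal itself is never edited, which is what keeps old references working.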
The system uses a set of procedures that define handles as having a prefix, a slash and a suffix. The suffix can be anything you like: an existing name, a numerical sequence, a driver’s license or Social Security number – it could even be a cryptographic sequence. The prefix is given to an organization or individual so that when it creates the suffix, the handle is guaranteed to be unique. You could operate a local handle service yourself or use a service provided by someone else.
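The prefix, slash, suffix structure can be sketched as a small parsing function. The function name `parse_handle` is hypothetical, and splitting at the first slash (so that the suffix may itself contain slashes) is an assumption of this example rather than something stated in the interview.

```python
def parse_handle(handle: str):
    """Split a handle into (prefix, suffix) at the first slash."""
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return prefix, suffix


print(parse_handle("1234/XJ493267"))
```

Because the prefix is allotted to exactly one organization or individual, any suffix that is unique within that prefix yields a globally unique handle.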
When you install a local handle service, it first creates (locally) something called a site bundle, which contains information like the IP address where it’s located and a public-private key pair. You keep the private key, and send the public key to the administrator of the Global Handle Registry. That administrator will then allot a prefix to the local handle service, enter it into the global registry on your behalf and enable you to then change the handle record entry in the Global Handle Registry using your public-private key pair. You keep the only copy of the private key. The administrator will let you know the allotted prefix.
For example, a university might have an allotted prefix, and then they can generate other ones derived from their allotted prefix themselves. So if a university has the prefix 1500, they could allot 1500.1 to some entity within the university, which, in turn, can create 1500.1.A or 1500.1.B. The university can also create a prefix with semantics, say, 1500.headquarters, and so forth. And they can create these derived prefixes all on their own; they don’t have to work through anybody else. And every one of these new prefixes can be separately administered under its own public-private key pair. It’s really pretty powerful.
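Derived prefixes of this kind behave like a dotted hierarchy, which the following sketch illustrates. The helper names `derive_prefix` and `is_derived_from` are invented for this example, and the check is purely a string comparison; it says nothing about how the real system records who administers each prefix.

```python
def derive_prefix(parent: str, label: str) -> str:
    """Create a derived prefix by appending a dot and a new label."""
    if not label or "." in label or "/" in label:
        raise ValueError(f"invalid label: {label!r}")
    return f"{parent}.{label}"


def is_derived_from(candidate: str, parent: str) -> bool:
    """True if candidate sits somewhere beneath parent in the hierarchy."""
    return candidate.startswith(parent + ".")


department = derive_prefix("1500", "1")            # allotted by the university
lab_a = derive_prefix(department, "A")             # allotted by the department
headquarters = derive_prefix("1500", "headquarters")
print(department, lab_a, headquarters)
```

Requiring the dot in `is_derived_from` keeps an unrelated prefix such as 15000 from being mistaken for a child of 1500.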
GCN: So it’s similar to managing sub-domains?
KAHN: A little like that, with a few key distinctions. The domain-name system has been in such wide use that it’s very hard to make changes to it. They’ve been struggling with DNS security for over 10 years. The DNS basically produces one IP address per domain name (along with the naming authority for that domain name). In other words, it maps a domain name into an IP address, which might be one of several from a list, along with the naming authority.
But when you look at what can be done in the Handle System, it provides a much larger collection of things that users can define themselves. And it’s able to sign the information that results from requests to the system; the whole system is self-certifying at several different levels.
GCN: Why hasn’t the architecture gained greater traction?
KAHN: The original idea for the digital object architecture was mine; it grew out of some work that I had done with Vint Cerf back in the mid 1980s on what we called “knowledge robots,” or “Knowbot programs,” which are mobile programs in the Internet. The details of the original Handle System implementation were worked out by one of our best programmers at the time, David Ely. And over the years, many other people have played a role in helping to create parts of the digital object architecture. We continue work on evolving the components of the technology and applying it in different contexts.
But it’s not sufficient to just have good technical components of an infrastructure. You need to work with people on applying it. The example I like to give is the following. If I were to show you two really great pieces of new technology we’ve just created - a CPU chip, which you’ve never seen before, and a memory chip - and told you that by putting these two components together, you can do wonderful things with the result, probably nothing would happen, because you wouldn’t know what to do. Suppose somebody else then says, well, I’ve got an idea, why don’t I just make a box with a power supply and a few interface cards that’s got those things in it, and give that assembled system to you. Most likely, you still wouldn’t know what to do, because there’s no software. So, then you would have to figure out how to program those components, and generate computer languages and operating systems to use. Maybe after a while, this process becomes more turnkey and therefore more understandable.
We’re now working on applications of this technology, and injecting it into different application areas. We also haven’t really done much marketing of the technology, but this could change.
GCN: Where might these applications evolve—especially for use in government?
KAHN: Well, one place where this is clearly an excellent application is the whole area of archiving. Just about every organization in government has the need to retain information. The problem is that most parts of the government deal with archiving in a more limited context using readily available commercial technology and/or services. So if you retain information on a big external disk and have something you need, you retrieve the disk and plug it in. Sooner or later, you’re probably going to have lots of these disks, and maybe you need to access a piece of information when it’s not online, and you can’t really get your hands on it.
We have experimented with some archiving capabilities on the net, some developed by CNRI and some by others, that are intended to serve long-term archival storage needs.
My hope is that we can make the digital object technology, which operates in the Internet environment, available as we did with the original Internet technology, and get a lot of people in the public and private sectors to understand its power and capability. Because it’s an open architecture, it has the potential to grow organically as did the nascent Internet.
Another application area that I’m very optimistic about is the medical informatics area. Digital Object Architecture is almost ideal for what government and the health sector need for handling medical records online. The big issue there is not whether you can do that technically–we could have found a way to make records available electronically 50 years ago. The question today is how can we do it in a way that satisfies all of the societal concerns that people have about moving to online medical records and assimilating them into the fabric of society?
We should find a way to link this approach to the banking infrastructure, and to the insurance infrastructure, because they all play together. We’re having enough issues with those infrastructures today that it may seem a little premature. But the bigger issue here, as it has been for years, is privacy and how you deal with that requirement across different organizational boundaries.
The Web as originally deployed assumed that everything would be publicly available. But over time, Web sites have cordoned off parts of that public place. For example, you may need passwords to get into this web page and thus to access that information. And, moreover, you have to know exactly where to look for certain information. That’s where search engines come into play, assuming they can index the information you want from public spaces, or by private arrangements.
The Digital Object Architecture was designed, at least at the level of repositories, from the opposite point of view. Namely, it assumed that what you’re storing is your own information, and it’s not available to anybody else.
Now, there’s a variety of ways that repositories can work. But if you use unique identifiers to reference material on the net, you need a resolution system to map the identifiers to locations on the net where the material is stored. If, for example, the identifiers were lengthy cryptographically-generated strings, there would be no semantic information in the strings. Whatever length of identifier you decide to use, nothing is going to be accessible except to people who know the identifier and can resolve it to meaningful state information about the information it references.
In some cases, protecting the identifiers themselves might be important, but you can also put in other protections elsewhere, such as in the repositories. You can have particular objects available only to particular individuals - only the owner of the object; only those on specifically-designated lists; only people who have authenticated roles and so forth. Now, not only do you have to know the identifier for the information, but you’ve got to be designated for access to that information. So the protection occurs at the object level, rather than with protecting the identifier or by providing only a password at the boundary.
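Protection at the object level, as opposed to a single password at the boundary, could be sketched as a per-object policy check. The `AccessPolicy` and `User` classes, the `can_access` function, and the identifiers and role names are all hypothetical; they only mirror the three cases mentioned here (the owner, a designated list, and authenticated roles).

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Hypothetical per-object policy kept alongside the object."""
    owner: str
    allowed_users: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)


@dataclass
class User:
    identifier: str
    roles: set = field(default_factory=set)


def can_access(policy: AccessPolicy, user: User) -> bool:
    """Grant access to the owner, to listed users, or to holders of a listed role."""
    if user.identifier == policy.owner:
        return True
    if user.identifier in policy.allowed_users:
        return True
    return bool(policy.allowed_roles & user.roles)


record_policy = AccessPolicy(
    owner="5555/patient-17",
    allowed_users={"5555/dr-gomez"},
    allowed_roles={"emergency-physician"},
)

print(can_access(record_policy, User("5555/dr-gomez")))
print(can_access(record_policy, User("5555/stranger")))
```

Knowing the object's identifier is not enough here; the requester must also satisfy the policy attached to that particular object.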
GCN: How do you deal with the need to authenticate that someone is who he says he is?
KAHN: Let’s distinguish the mechanisms that the technical architecture might use from the non-technical techniques used by people to determine who an individual is and to give that individual whatever credentials they may require, such as a public-private key pair. By putting this information in a public-key infrastructure, such as the Handle System enables, the technical means is available to verify that the user is the person who holds the private key corresponding to the public key held in the system. People have been trying very hard to get PKI deployed. The digital object architecture incorporates it intrinsically. Specifically, we use it as a PKI system for identity management in the repository world.
In that world, every individual would have a unique persistent identifier. That identifier has a representation in the Handle System which you can use to verify that they are who they purport to be. So during the initial transaction with a repository, there would be an exchange of the following kind. The repository would ask the user to encrypt a random string with his private key and send it back so the repository can verify the user. This same technique can be used (in reverse) by the user to verify that the repository is who it purports to be. Of course, if a private key is lost, that information needs to be made known to the system as soon as possible so that public-private key pair may be revoked.
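The exchange described here is a challenge-response check, which the following sketch walks through. It is a toy under loud assumptions: it uses textbook RSA built from two well-known Mersenne primes with no padding, which is far too weak for real use, and the `HANDLE_RECORDS` table, the identifier `5555/alice`, and the function names are invented. A real deployment would rely on a vetted cryptographic library and on whatever key format the system actually specifies.

```python
import hashlib
import secrets

# TOY KEY PAIR, ILLUSTRATION ONLY: textbook RSA from two known Mersenne primes.
P = 2**61 - 1
Q = 2**89 - 1
MODULUS = P * Q
PUBLIC_EXPONENT = 65537
# Modular inverse via three-argument pow (requires Python 3.8 or later).
PRIVATE_EXPONENT = pow(PUBLIC_EXPONENT, -1, (P - 1) * (Q - 1))

# Hypothetical handle record holding the user's public key.
HANDLE_RECORDS = {"5555/alice": {"public_key": (PUBLIC_EXPONENT, MODULUS)}}


def digest_as_int(message: bytes, modulus: int) -> int:
    """Hash the message and reduce it to an integer below the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % modulus


def sign(message: bytes, private_exponent: int, modulus: int) -> int:
    """User side: transform the challenge digest with the private key."""
    return pow(digest_as_int(message, modulus), private_exponent, modulus)


def verify(message: bytes, signature: int, public_key) -> bool:
    """Repository side: undo the transform with the public key and compare."""
    public_exponent, modulus = public_key
    return pow(signature, public_exponent, modulus) == digest_as_int(message, modulus)


# 1. The repository issues a fresh random challenge.
challenge = secrets.token_bytes(32)
# 2. The user responds using the private key that only the user holds.
response = sign(challenge, PRIVATE_EXPONENT, MODULUS)
# 3. The repository resolves the user's identifier and checks the response.
user_verified = verify(challenge, response, HANDLE_RECORDS["5555/alice"]["public_key"])
print(user_verified)
```

Running the same steps with the roles swapped would let the user check the repository, and revoking a lost key would amount to removing or replacing the public key in the handle record.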
This is only a start; it’s the infrastructure part. Ultimately what it takes to manage information is not only a good infrastructure underneath it – that’s necessary but not sufficient. Most of the effort is going to go into deciding what information you want to keep and what you don’t want to keep; if you keep it, how do you want to characterize it; when you identify things, how do you want to do the identification; who do you want to have on what lists and who is going to manage the movement of the information when you eventually decide to move it from one place to another. And when you create new types of information, can the metadata for that information be extracted automatically, or will individuals continue to play a role in generating and maintaining it?
If I told you this handle record contained an entry of type URL, today you’d know what that is and plug it directly into your browser. But if this were 20 years from now and the corresponding entry indicated that this was type XYZ - the first question would probably be what does type XYZ mean? The system allows types to be separately resolved in the system. Initially, we turned types into identifiers that represented them in the Handle System. Today, we still retain that capability, but generally identify new types directly with handles. By this approach, all types are resolvable in the Handle System. Maintaining this information, as with all metadata, can be a big job. But the bigger job is not about the technology. Rather, it’s the actual decision-making that has to take place, organization by organization, about what they want to keep, how they want to keep it and the rest of it – the infrastructure can handle it.
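The idea that types are themselves resolvable can be sketched by giving each type its own handle record. Every identifier below (including the `9999/type-...` handles), the record layout, and the `describe_type` helper are hypothetical choices made for this example, not the real type registry.

```python
# Hypothetical handle records. Entry types are identified by handles,
# and each type handle has its own record describing the type.
HANDLE_RECORDS = {
    "1234/XJ493267": [
        {"type": "9999/type-url", "value": "https://repository.example.org/XJ493267"},
        {"type": "9999/type-xyz", "value": "opaque-payload"},
    ],
    "9999/type-url": [
        {"type": "9999/type-description", "value": "A web address usable in a browser"},
    ],
    "9999/type-xyz": [
        {"type": "9999/type-description", "value": "A format defined long after URLs"},
    ],
}


def describe_type(type_handle: str):
    """Resolve a type handle and return its description, if one is recorded."""
    for entry in HANDLE_RECORDS.get(type_handle, []):
        if entry["type"] == "9999/type-description":
            return entry["value"]
    return None


# A client that meets an unfamiliar type can look it up instead of giving up.
for entry in HANDLE_RECORDS["1234/XJ493267"]:
    print(entry["type"], "->", describe_type(entry["type"]))
```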
GCN: What has become of your work with Knowbots?
KAHN: Well, we use them in a number of different contexts and in many applications. But I think the reason that it didn’t get as much adoption as I thought it would was because there was a concern on the part of many organizations and their IT staff that these mobile programs should be treated like viruses. That is, if somebody’s going to send a program to your machine that can run on your machine and you’re not in charge of that program, then the concern is that they can do anything they want on your machine --and therefore this is something to be avoided.
A lot of people didn’t want to go down that path. I thought this was unfortunate because it should have been possible for them to validate the kind of operations that a Knowbot Service Station, which runs the Knowbot programs, could take and therefore to delineate a set of actions that could be invoked in a Knowbot operating environment.
It would be the equivalent of the old automats in New York City. It’s not like you could turn on the oven temperature and overcook their hamburgers. You could plug one or more nickels in the appropriate slot and pull out your prepackaged sandwich.
Say there were only a small set of things that you could do that were all circumscribed. Maybe you provide external users with certain programs they can run on your machine because you created the programs and knew they were “safe”. In this case, the external user would just tell you which of your programs to run on your machine.
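The automat idea, where an outside party can only pick from operations the host has already vetted, might be sketched like this. The operation names, the `LOCAL_DATA` list, and the `run_request` function are hypothetical and are not the Knowbot Service Station interface.

```python
# Data held on the host machine (illustrative values only).
LOCAL_DATA = ["reading-1", "reading-2", "reading-3"]


def count_records():
    """A vetted operation: report how many records the host holds."""
    return len(LOCAL_DATA)


def latest_record():
    """A vetted operation: report the most recent record."""
    return LOCAL_DATA[-1]


# The host decides in advance which of its own programs outsiders may run.
ALLOWED_OPERATIONS = {
    "count_records": count_records,
    "latest_record": latest_record,
}


def run_request(operation_name: str):
    """Run a named operation only if the host has placed it on the menu."""
    if operation_name not in ALLOWED_OPERATIONS:
        raise PermissionError(f"operation not offered: {operation_name!r}")
    return ALLOWED_OPERATIONS[operation_name]()


print(run_request("count_records"))
```

The external user supplies only a name, never code, so nothing outside the menu can execute on the host.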
We tend to use mobile programs for purposes of data collection and reporting. We’ll collect some information on one of our systems and send a Knowbot program that we’ve created to another one of our system locations, where it might collect additional information and send it along to another such location and eventually return to us a single consolidated view of all the collected information. We undertook experiments in advanced networking, where we were trying to track the utilization of a program running on a PDA in your jacket pocket while you’re moving around--and the PDA, with your concurrence, is to be a server on the net. As you moved around, your server would keep showing up in different places because you had different network options each with different communication capabilities. So we used Knowbot programs to report in, based on identifiers that were unique for these capabilities. It was a pretty interesting demonstration.
But the use of mobile programs hasn’t yet taken off in a big way. The infrastructural capabilities that we’ve created are available to the research community to turn them into systems that they can use. They come ready to enable somebody to program them to do a specific task, but they don’t come fully programmed for any specific task.
GCN: What other projects is CNRI working on that might be of interest to the government IT sector?
KAHN: Well, we’re involved in a number of other efforts. Clearly, we’re interested in the whole field of networking. I’ve been involved in a lot of the international deliberations concerning the future of the Internet. We’re very interested in how people manage collections of information, so we’ve been doing some work with a Pentagon project called Advanced Distributed Learning. Finally, we’re also involved in fostering research and providing prototyping services in the fields of Micro-Electro-Mechanical Systems and Nanotechnology, using an approach similar in some ways, although different in many others, to one that we developed with the research community when I was working at DARPA. (See GCN Interview-Extra).
And we constantly are looking for ways in which we can help with new national and even global infrastructure initiatives.