Introducing Structured Data
The purpose of this research project, Structured Data Types into Internet-scale Information Systems to Utilize the Internet as an Effective Business Tool, is to better understand the Internet and its actual applications. Enormous amounts of composite information have been accumulated within corporations, government organizations, and universities. Such information continues to grow at an ever-increasing rate. It ranges from software artifacts to engineering and financial databases, and comes in different types (e.g., source code, e-mail messages, bitmaps) and representations (e.g., plain text, binary). This information has to be accessed through a variety of vendor tools and locally developed applications. It is becoming increasingly easy to create new information, thanks to a generation of sophisticated commercial authoring and office automation software. It is also becoming easier and cheaper to provide access to this information, thanks to increasingly pervasive and robust data communication technology. However, knowing what information exists, where it is located, and how to retrieve it has become bewildering and confusing to many users.

Managing the increasingly large volume of information on computer networks is rapidly becoming an important problem in computing. The Internet, the largest wide-area computer network, is growing exponentially in terms of hosts, users, and traffic. The Internet backbone carried 14 Terabytes of data in March 1994; about half of that was due to information services such as FTP, Gopher, and WWW. It is clear that both a large supply of information and a large demand for it exist.

Statement of the Problem

The form in which information is dispersed on the Internet leaves much to be desired. Most information has some sort of semantic structure to it. It could be a text broken up into chapters and paragraphs, a bus schedule showing routes and times, a city map displaying streets and elevations, or a complex medical database.
But while Internet information systems may be able to transmit the data involved with these pieces of information, they give little assistance in telling how the data is structured.

The semantic structure of information makes a large body of information much more manageable. Knowing the meaning of a type of information helps one extract, derive, compile, and condense useful information from a larger set of raw data. It helps in searching for relevant information, and in intelligently filtering out information that is irrelevant to a query. In these tasks, it is not enough to simply know that a piece of information is composed of several components; ideally, one wants to know the meaning of the components, and what one can do with the parts. A search of card catalog entries, for instance, may need to know how to extract the author of an entry and compare the author’s name against a search term.

On the Internet, there is little support for semantically structured information. A particular application, such as a library catalog, may define a certain format for its book entries, which may be semantically rich, but is meaningful only to programs specifically written to understand that format. A client program written to read University A’s card catalog may be unable to make any sense of University B’s card catalog, even though both are available on the Internet.

Applications that want to share their information widely, in contrast, are generally forced to use a lowest-common-denominator approach. The most common such denominator is plain unstructured text. Frequently used applications may, over a long period of time, settle on higher-level common denominators. But higher-level formats still lack much of the semantic structure many applications need, and the process of finding a usable common standard even for these formats can take years. (Then, in a few years more, these formats are often replaced by other, incompatible formats.)
The rate at which new data types can be introduced and used in an Internet context is far too slow, and cannot be made much faster with current standards procedures.

The Setting of the Problem

With a computer, a modem, and a telephone line, every person can become a publisher, and every desktop can be a broadcasting station. Large groups of people, scattered across large geographic areas, can use public “bulletin-board” media and private electronic mail to communicate interactively with each other, to publish statements, conduct debates, seek understanding, and organize action. Each node of a computer-mediated communication network is both a consumer and a producer of information. Every desktop is a printing press, an electronic soapbox, a multimedia broadcasting center. Each node is potentially connected to every other node. Current estimates are that 100 million people had Internet access in January 1997; this number is predicted to exceed 200 million by the year 2000.

The explosion of the Internet has delivered to our desktops the virtual equivalent of a vast warehouse of data and facts. A major handicap is the time it consumes: in the computer field, human time and judgment are extremely scarce and costly resources, much too expensive to be wasted on repeated search efforts. Too much time can be spent searching for a subject and wading through related titles in the search results that may or may not be what you are looking for.

The battle for the shape of the Internet is joined. Part of the battle is a battle of money and power, but the great venture is understanding its capabilities. I believe that each of us can have an active role in its future. Whether we will live in a surveillance state or a democratic society ten years from now may depend on what we know and do now. The outcome remains uncertain. What the Internet will become is still, in large part, up to the users.
History and Background of the Problem

The story dates back to 1962, when the US Air Force commissioned a design for command and control systems that could withstand and recover from a strategic nuclear attack. Today the Internet represents a tremendous power shift. This power shift is about modern technology, and our ability to utilize its capabilities in practical ways. The reason for the rapid new growth is that you no longer need to be a computer wizard to take advantage of these services, thanks to recent software breakthroughs that bring the Internet to life through easy-to-use point-and-click graphic interfaces. Every day, new services enter the on-line world. Internet users exchange information through electronic mail, file transfers, information browsing, social communication, topical discussion groups, electronic news services, and a host of other services.

A ten-year-old kid with a few hundred dollars can plug a computer and a telephone line together today and have access to every major university library on earth, a pulpit, and a world full of affiliates at their fingertips. Another source of power in this new medium is that it enables different classes of people to connect with each other in innovative ways. A high school student in Taiwan, a grandmother in Alaska, and a businessman in Silicon Valley or Huntsville, Alabama can “meet” to discuss ecology or astronomy, politics or parenting, high technology or antiques. From those public connections they can create personal relationships that cut through traditional barriers of gender, age, race, class, nationality, and physical location. New concepts in industries, marketplaces, and communities can emerge from this web of connections and relationships.
Problem Statement

How can information be provided on the Internet at a higher semantic level, while remaining usable by a large number of information clients?

Two observations are relevant here.

First, the concept of abstract data types, or of “objects”, provides a solution to many of the complexities of data formats and operations. Abstract data types provide a well-defined interface of operations and attributes, so that a client can use a complex reference without having to know how it is formatted, or how operations are implemented. Indeed, a number of systems already attempt to implement an object-oriented system distributed over the network. None have yet been able to cope with the scale and diversity of the Internet. This is in part because they are designed for general-purpose computing, which includes both reads and writes. They therefore have to worry about issues like consistency of data updates, fault tolerance, and a fairly uniform semantic model for references and meta-data. These problems are much less relevant (and sometimes impossible to solve) in a system designed for disseminating information widely, rather than mutating it.

Second, a very large body of knowledge and computing power is already available in the information agents (clients, servers, and mediators) that exist on the Internet. At present, most agents dealing with information are set up in a few standard ways; most commonly, a client operated by a user will contact a server maintaining a database, and retrieve a reference from the database directly. In occasional variations, a “server” may act as a gateway to a database that another server maintains, or a fixed data type conversion program may be run off-line by a client or a server. These types of interactions are useful but limited.
Human “agents” commonly use richer techniques to discover information: they collaborate with “experts” in a particular domain in order to find relevant initial information in that domain, and for assistance in gathering and understanding that information. Similar techniques could be quite useful for computerized agents as well, in particular “mediators”, the third-party experts suggested by Wiederhold in “Mediators in the Architecture of Future Information Systems”.

Description of the Intervention

I propose to make an explicit object-based level of abstract data usable in Internet information systems. Widespread use of such abstract data requires that new types be definable anywhere on the network, and not simply by some central standards authority. Furthermore, in order for these types to be used, information about these new types, and operations on those types, must be available to other agents which request it. This requires not only support specifically for abstract types, but also a well-defined interface for agents to talk to each other about types and operations, and some standard method to provide information about types, their operations, and their relations.

I claim that these requirements can be satisfied with a two-level software architecture. The upper level focuses on the data being shared, and the abstract operations being carried out on it. At this level, methods are invoked, object references are resolved, and new data types and operations are defined. (See figure 1b.) The lower level focuses on the agents supporting these operations. Here, agents ask other agents for data objects or references, carry out abstract data operations on behalf of other agents, and encode and decode concrete representations of abstract objects so that they can be passed through the network. (See figure 1a.)
This level abstractly describes what is already carried out (in a domain-specific manner) by the protocols of many existing Internet agents, such as HTTP [BL93] servers, Domain Name Service [Moc87] resolvers, or WAIS [Kah91] indexes. (Information from these existing systems can also be incorporated into the higher-level information system through the use of “wrapper” or “gateway” agents, which provide explicit abstract types for the implicit data abstractions these systems support.)

Figure 1. The two levels of abstraction in an information system.

To bridge these two levels of abstraction, the agents need to know about the types of objects they are manipulating. For this purpose, I propose a special mediator agent that can give information about types of information in the network. A client or a server can contact this agent (which I call a type oracle) to find an agent to carry out a defined operation on a data type, or to find out how information of one type or encoding can be converted into another type or encoding. Someone who wishes to define a new data type or encoding can register it (and its operations) with a type oracle, which can then share this information with other agents, including other type oracles. Oracles can also use their knowledge of the lattice of types and encodings to derive new transformations not provided by any single agent (such as a conversion from type A to type C that uses a converter from A to B followed by a converter from B to C).

A few questions arise at this point: Can a coherent information system be built to this design? Will the design really give widely distributed information systems more semantic power? Will it be useful for real applications, or will it introduce too much overhead (either in response time to queries, or in the amount of work a client or provider is expected to do) to be feasible? Will it be able to inter-operate with existing information systems?
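The way a type oracle might derive a conversion that no single agent provides can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TypeOracle` class, its `register`, `find_path`, and `convert` methods, and the placeholder converters are hypothetical names, not part of any existing system or of the proposed design. The sketch treats registered converters as edges in a graph of types and uses a breadth-first search to find the shortest chain from one type to another.

```python
from collections import deque


class TypeOracle:
    """Toy registry of converters between named types (hypothetical API)."""

    def __init__(self):
        # Maps a source type name to {target type name: converter function}.
        self._converters = {}

    def register(self, source, target, func):
        """Record that func converts a value of type source into type target."""
        self._converters.setdefault(source, {})[target] = func

    def find_path(self, source, target):
        """Return the shortest chain of type names from source to target, or None."""
        queue = deque([[source]])
        seen = {source}
        while queue:
            path = queue.popleft()
            if path[-1] == target:
                return path
            for nxt in self._converters.get(path[-1], {}):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None

    def convert(self, value, source, target):
        """Apply each registered converter along the derived chain in turn."""
        path = self.find_path(source, target)
        if path is None:
            raise LookupError(f"no conversion from {source} to {target}")
        for a, b in zip(path, path[1:]):
            value = self._converters[a][b](value)
        return value


# Placeholder converters that only record which step was applied.
oracle = TypeOracle()
oracle.register("A", "B", lambda steps: steps + ["A->B"])
oracle.register("B", "C", lambda steps: steps + ["B->C"])

print(oracle.find_path("A", "C"))
print(oracle.convert([], "A", "C"))
```

Here no converter from A to C was registered, yet the oracle can answer the request by chaining the two it knows about. A real oracle would also have to weigh cost, information loss, and which agent hosts each converter; the sketch ignores all of these.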
I propose the following course of action to answer these questions:

First, analyze existing information systems that are already in use over the Internet, such as Gopher, the World Wide Web, and the Domain Name System. The objective of the analysis will be to show the common features of these systems, to show how their data and agent abstractions can be interpreted (at least implicitly) in terms of the architectural model given above, and to determine why they have gained widespread use in the large-scale, heterogeneous environment of the Internet. Some analogs in distributed systems and object-oriented databases will be considered as well.

Second, produce a detailed design of an information system based on the architecture I proposed, and build a prototype implementation. This will include a number of information agents using a common toolkit; a type oracle; a collection of data types and encodings supported by the agents and the oracle; and protocols to allow the agents to work together and operate on the types. This will demonstrate that the design is feasible, and that it can handle a reasonably sized repertoire of common types.

Third, test the implementation. This will involve one or more case studies, in which I choose a particular information gathering problem and show that my system makes it significantly easier than existing systems do to build agents that help clients find useful information. It will also involve observation of a less controlled test: releasing the design and implementation to users on the Internet. This will allow me to see whether disinterested users find value in my approach, and also to find where difficulties arise in practice with the system.
Key Concepts

The key concepts of the thesis, then, are these:

An information system architecture using typed, replicating objects to model information, with an underlying agent communication protocol.

Use of mediator agents (“type oracles”) to maintain information about an ever-growing lattice of types, and to assist agents that want to use these types.

Encapsulating existing data on the Internet with structured types and encodings, allowing it to be used in higher-level architectures.

Internet Information Systems: Uses and Problems

As noted in the introduction, the Internet is rapidly becoming a widely used medium for exchanging information. Many applications proposed for networked information systems imply a rich structure to this information. For example, a medical researcher may want to examine blood pressure readings from a clinical sampling and correlate them to heart attack occurrences, using the structure of patient medical histories. A scientist may want to find books in several libraries about plate tectonics, using catalog entries and search indexes. A software engineer may wish to find and examine C++ modules for processing SQL queries, using the structure of program archives and descriptions.

In an ideal world, such tasks would be simple to carry out effectively. But they remain difficult or infeasible in today’s Internet, due in part to limitations of the net’s current model of information space. Among these limitations:

The conceptual structure of information space is hazy: Even experienced users often have trouble not only with finding information they are interested in, but even with finding out what information exists on a subject they are interested in.
The software engineer, for instance, may not know where to search for C++ modules, let alone ones that have anything to do with SQL queries. Various indexing schemes have been proposed to make a better conceptual map of cyberspace, but there is no clear consensus yet on what kinds of indexes to use. With no regulated common formats, common semantic bases for indexing, or general mechanisms for relating one indexing scheme to another, indexing schemes will remain primitive and incomplete.

The structure and encoding of information objects is often inappropriate for applications: A large corpus of information, even an explicitly structured one, may still be useless to someone who lacks the knowledge or computing power to sift through the information to find relevant facts, or to derive or synthesize needed knowledge. Theoretically, one form of information may have information content equal to (or even greater than) another, but still be much less useful in a practical sense. If, for instance, the only interface to medical records returns plain text in various formats, it can be prohibitively difficult to extract appropriate information about blood pressure and heart attacks. Current practices encourage information providers to provide information either in a lowest-common-denominator form, or in a form specifically tuned for a single application. Both of these inhibit useful information sharing.

Maintaining useful information sources is difficult: It is relatively easy in many cases to put some information on-line and offer it to the world. It is much more difficult to keep the data current and the format relevant. Part of this problem is related to the previous one: it is extremely difficult to define new formats and types of data without having client applications explicitly reprogrammed, or maintaining a number of gateways or alternate repositories for different formats understood by clients.
Mediators can conceivably be used to update data automatically and provide gateway services, but they require well-understood interfaces to work in a general context.

Computation models: The need for abstract types

A number of the problems above can be solved in part by better computation models for Internet information systems, in particular, abstract types. Some benefits of abstract type systems:

They provide an appropriate level of abstraction for data manipulation. Client programs can be written in data-driven terms like ‘search this index using these attributes’ or ‘retrieve the object referred to by this attribute’, without needing to know the full details of the data implementation.

They provide a useful model for taking advantage of the expertise of a network of agents. In today’s information systems, the burden of computation and type decoding falls entirely on the client, or on the server providing data. But in models using abstract data objects, operations manipulating information are associated with the data types, rather than with any particular agent. Knowledge about how to operate on the types can be delegated to sites that define the type, or that have been told the type definition.

They provide a vocabulary for information about new data types and formats that is independent of representation or implementation concerns. Thus, information systems do not have to settle for a lowest-common-denominator approach to information exchange, nor do they need to settle for a fixed repertoire of types and operations. The structure and semantics of different search indexes, for instance, can be described and related via different abstract types.

In the next section, I will look briefly at two communities working towards usable wide-area data types: the distributed computing community and the community of developers of existing Internet information systems.
By examining the strengths and weaknesses of their approaches, I will lay the groundwork for an architecture combining features from both communities.

Distributed Computing Perspectives

The distributed computing community has already proposed or implemented a number of systems for distributed objects. If abstract data types are so useful for distributed information systems, then, why hasn’t one of these object systems taken over cyberspace? While the immaturity of these systems may be one reason, another important reason is that the applications these systems are designed for differ in important ways from information dissemination applications.

Why existing distributed computing models aren’t sufficient

Distributed computing researchers have long been aware that computing over multiple machines introduces many new problems not present in a single address space:

Data-related problems. An arbitrary distributed process may need strong guarantees about the consistency of the data it manipulates. But in a large-scale heterogeneous system, it can be very difficult to keep data consistent without locking up arbitrary servers for long durations, which is unacceptable in a wide-area information system.

Operation-related problems. In an undistributed application, a request for an operation can be made with a simple procedure call. In a small-scale distributed application, the operation may involve a remote procedure call, with some conventions for encoding parameters, carrying out the operation, and returning results. In a large-scale distributed world, not all agents are known, and communication channels and agents, and their semantics, are out of the control of any one person or project. So even more complications arise. It may now be relevant, for instance, for a server to know who invokes an operation, or for a client to know the cost of an operation.
New failure modes and recovery strategies may be called for (since permission to carry out an operation may be denied, a remote server may be unavailable, or the return type may be unexpected). The role of meta-data in evaluating the results of an operation becomes more critical.

These are symptoms of the greater level of heterogeneity introduced by scaling up a distributed application to the Internet world. Languages, operating systems, data types, and ways of organizing data and software vary (both over different servers and over time) more widely than most distributed systems are designed to handle.

How existing Internet info-systems are different

Fortunately, because their application domain is limited, Internet information applications do not have to solve all the problems inherent in distributed computing. In particular, the information delivery task can be simplified by the following domain assumptions:

The predominant flow of information is one-way. Information is originally provided by certain sources acting in server roles, and then retrieved, transformed, and used by other agents acting in client roles.

Read access to information is widely available, but write access is not available, or is severely limited. (Clients can transform the values they receive, but cannot mutate the source data themselves. Some information systems allow clients to send back requests to change or add to the data a server provides, but these changes, if made at all, are done locally by the server, outside the scope of the information retrieval application.) This assumption avoids many of the complications of general-purpose wide-area database systems.

Also, in many Internet applications, changes in information do not have to be propagated immediately. Conventional database systems take great pains to make sure that query responses use the latest available version of a set of data, and that the set of data given is internally consistent.
In many wide-area information applications, these kinds of guarantees are either prohibitively costly, or flatly impossible. Fortunately, many applications do not need these sorts of guarantees, or can make do with simply knowing roughly the consistency or currency of information. And in many systems, mutations of information occur significantly less frequently than accesses to information. (And with some types of meta-information, such as information about types and resources, information tends to accumulate but not mutate.)

Relaxing currency and consistency requirements gives third-party agents a useful role in an Internet information system. An agent can provide information originally supplied by another agent without necessarily having to verify that the original agent’s information has not changed. It can synthesize information based on data from several agents. It can derive or transform the information for a client in ways the original server might not be able or willing to do.

How existing Internet info-systems handle data types

Many Internet information systems have found it useful to define their own semantic types. Gopher, for instance, uses menus and bookmarks to let users navigate. The World Wide Web (WWW) uses simple structured hypertext documents to navigate through the system, and defines a data type (HTML) for these documents. While these types are more useful than the simple ASCII text used to encode them, users of these systems soon want more structured types. For example, a number of WWW sites have “What’s New” pages in HTML, which invariably consist of a list of dates, resource descriptions, and links to the resources, in reverse date order (and sometimes spread out over several documents). This format convention presents a new ‘abstract type’ to the human client.
But this type cannot be easily used by programs (though it might be convenient for some of them) because the information system provides no way to describe the new type in a well-defined way. A standards body might incorporate it into a later revision of the information system, but if this occurs at all, it will take a very long time.

An example. Even with a relatively small, simple set of types, agents may have difficulty exchanging information. Suppose that a client program on a Macintosh has a reference to an image it wishes to display. Retrieval of the image is simple enough in many Internet info-systems: the client examines the reference to see what server it should contact, talks to the server with the appropriate protocol, and gets the image shipped to it for display. The World Wide Web, Gopher, and even anonymous FTP are all capable of doing this.

But can the client do anything with the information it retrieves? Suppose that the image is stored on a Unix-based server at a remote university. The image is saved there in X bitmap format (XBM), and has been compressed with GNU zip to make it easier to store and quicker to ship. This format and encoding make sense for the Unix environment where the picture is stored, but may not be useful to the client. The Mac client, for instance, may know how to display GIF images, but not know anything about XBM images (a similar type, but with a different color model and encoding). And the GNU uncompressor may not be available on the Macintosh.

The conflict in data types must be resolved if the two agents are going to interact meaningfully. First of all, at least one agent must realize the nature of the conflict. (A naive client program might blithely assume everything is going well, and display the unknown-format image as gibberish, or worse, crash when it tries to display the image.)
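The first step, realizing the nature of the conflict, can be sketched in a few lines. This is only an illustration under stated assumptions: the labels `"gif"`, `"xbm"`, `"gzip"`, and `"identity"`, the two capability sets, and the `check_response` function are hypothetical names chosen for the Macintosh example above, and the sketch assumes the server declares a type and a list of encodings alongside the data.

```python
# Hypothetical capabilities of the Macintosh client in the example:
# it can display GIF images and cannot undo any compression.
SUPPORTED_TYPES = {"gif"}
SUPPORTED_ENCODINGS = {"identity"}


def check_response(declared_type, declared_encodings):
    """Return a list of problems; an empty list means the data can be displayed."""
    problems = []
    # Encodings are checked first because they must be undone
    # before the underlying type can be interpreted at all.
    for encoding in declared_encodings:
        if encoding not in SUPPORTED_ENCODINGS:
            problems.append(f"cannot decode encoding: {encoding}")
    if declared_type not in SUPPORTED_TYPES:
        problems.append(f"cannot display type: {declared_type}")
    return problems


# The Unix server ships a gzip-compressed XBM image.
for problem in check_response("xbm", ["gzip"]):
    print(problem)
```

A client that runs such a check before handing the bytes to its display routine can report the mismatch, or look for a conversion, instead of showing gibberish or crashing.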
If the client can tell what kind of information the server is sending it, it can detect a problem, and possibly convert the data to a form it can operate on. Or the client may tell the server up front what data formats it can deal with, and the server can convert the data appropriately.

Existing systems have these capabilities, but only to a limited extent. When Gopher and WWW servers ship data, they also send meta-data identifying the type of the data they ship. The World Wide Web’s HTTP servers also allow a client to send a list of types it will accept. But the vocabulary of types one can talk about is limited: in Gopher’s case, to a set of single-character codes set by the Gopher developers; and in the Web’s case, to the MIME type set. MIME’s type repertoire allows people to use their own ‘experimental’ type names outside the standard type repertoire, but all parties in a transaction must have a common understanding of the experimental types used. Also, MIME’s encoding repertoire is small and fixed, so that ‘GIF’ and ‘compressed GIF’ need to be expressed as two different types in the MIME system. (Web developers stretched the MIME convention to add new encoding types, so as to avoid the combinatorial type-expansion problem arising with different data types having different compressions. But the problem resurfaces with two or more levels of encoding, which is not uncommon.)

Why third-party agents are useful

But there is a more fundamental problem with these systems than limited vocabularies for types and encodings. Even if the client knows the kind of data it gets, and the server knows what kind of data the client wants, one of the agents has to know how to adapt to the other. In the image-fetching example, one of the parties has to know how to convert the data from th