My project is an attempt to visually model the most important tasks of an operating system, allowing the user to view, in a metaphorical way, the processes, services, and resources of a system. This includes their statistical information, such as CPU usage, memory usage, network usage, file usage, and device usage. The main goal of this project is to allow the user to view the above-mentioned things in a way that is intuitive, simple, and pleasing to the eye. The purpose of this project is to allow people to monitor their computer systems and resolve problems with them in an easier and more efficient manner. There is also the potential to use this as a tool to teach some basic fundamentals of an operating system. I will begin by discussing the research that I have done. Then I will discuss the operating system, followed by its limitations. I will then give an overview of the design methodology and the source code. Finally, I will explain the user interface.
![Chart of Napoleon's march on Moscow]()
My original research fell outside the area of computers. I started out by reading Edward R. Tufte's three books on the visual display of quantitative information. The goal was to find the best way to display as much information about the system as possible without causing the user any confusion. This research led me to two main considerations: data density and multi-dimensionality.
Data density is the amount of information that can be packed into each piece of a particular medium. In the case of computers, this manifests itself in how much information can be packed into each pixel of the screen. Using a high data density poses two problems here. The first is that the nature of this project revolves around discrete information; compressing that information into a small area would only add confusion. The second is that there is not enough information present to use a high data density effectively.
Multi-dimensionality is the idea of displaying information using more dimensions than are physically possible in a particular medium. The image above is an example of this: it is a chart of Napoleon's march on Moscow. Within this two-dimensional image, six dimensions are charted. The colors indicate whether the army is marching toward or away from Moscow, the thickness of the lines indicates the army's size, the location of the lines indicates the army's geographic position, and the graph at the bottom indicates date and temperature. This chart served as the inspiration for the user interface. I experimented with the idea of using textures, shapes, varying sizes, colors, locations, and speed of rotation. Again, the relatively small amount of information that needed to be displayed allowed me to remove a few of these, drastically reducing the overall confusion level. By making the choice to allow the use of text, the level of complexity dropped to a usable level.

The final aspect of my research was deciding on a visual metaphor to give the user an inherent sense of the information being displayed. A number of metaphors were considered, including representing the system as a desktop or office, a house with different rooms, the internal human body, a car, and so on. Each one of these fell short in at least one area, most in many more. What does it mean to the average person when their desk is on fire, the house plants have all attached themselves to the walls, and their brain has been replaced by an enlarged spleen?
The operating system that I chose to use for this project is Linux. The reasons for choosing it over others are as follows. Linux is an Open Source operating system, meaning that all of its source code is publicly available and modifiable. This gives me easy access to all of the operating system's internal functions and structures, and it allows me to easily find out what different unlabeled system variables represent and how they are structured. Linux is a variant of the Unix operating system. This means that, unlike Windows 95/NT and MS-DOS, it is a complete operating system: it allows multiple users and multi-tasking, and it also functions as a network server, giving me more information worth collecting. Unlike other flavors of Unix (e.g. Solaris, SCO, FreeBSD, and Ultrix), all of the relevant system information can be found in readable text format under a single directory called "/proc." Along with a great deal of other real-time system information, the proc directory contains the following entries relevant to this project: the "PID" directories, "cpuinfo", "meminfo", "loadavg", "stat", "interrupts", and "net." I will now discuss the value of each to this project, as well as their internal structures. Understanding these is where the majority of my time on this project was spent.
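Every one of these entries is plain text, so collecting a statistic amounts to opening a file and splitting a line into fields. As a minimal sketch of that pattern (this fragment is illustrative and not part of my source code), the following Java class reads the "loadavg" entry and prints the one, five, and fifteen minute load averages:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LoadAvg {
    public static void main(String[] args) throws IOException {
        // /proc/loadavg holds one line, e.g. "0.42 0.36 0.30 2/87 15201"
        BufferedReader in = new BufferedReader(new FileReader("/proc/loadavg"));
        String line = in.readLine();
        in.close();

        String[] fields = line.trim().split("\\s+");
        System.out.println("1 min:  " + fields[0]);
        System.out.println("5 min:  " + fields[1]);
        System.out.println("15 min: " + fields[2]);
    }
}
```

The same open, read, and split pattern applies to every entry discussed below.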
The "PID" entries, where PID stands for Process Identification, are actually a set of separate directories, one for each process currently running on the system. Each directory's name is the process' ID number. Located in each of these directories are three relevant files and one relevant directory. The files are "stat", "statm", and "status." The "stat" file contains all of the information about a process' activities. A typical entry looks like this:
15201 (netscape) S 12369 12353 6104 1025 12344 0 6111 101 8013 606 27354 3712 6 6 0 0 0 5 76949498 29741056 3550 2147483647 134512640 145472877 3221224272 1342250200 143381530 8192 8192 4230 91137 3222454524 2448 85 17
Each field represents the following:
process ID
simple name of executable
state of process (only in 2.1 kernels)
parent process ID
process group ID
session ID
minor device number of tty
controlling tty process group ID
flags, as in the long format F field
number of minor page faults
cumulative number of minor page faults
number of major page faults
cumulative number of major page faults
user time
system time
cumulative user time
cumulative system time
priority, time in HZ of process's possible time slice
nice value
time-out
it_real_value (interval timer value)
time process was started
total virtual memory size
resident set size
resident pages
starting code address in memory
ending code address in memory
address of stack
segment pointer
instruction pointer
signal
blocked
ignore signal set
catch signal set
kernel function where the process is sleeping
Kilobytes on swap device
cumulative Kilobytes on swap device
exit signalThe "statm" file contains information on a process' memory usage. It typically looks like the following:
5316 3584 2029 1444 0 2140 1589
Each field represents the following:
Kb in total memory
Kb resident in memory
Kb in shared memory
text resident size
library resident size
stack resident size
dirtyThe "status" file contains a condensed combination of the two above files. Not only does it contain a combination of the above information, it also displays it in a human readable form. A typical one looks like the following:
Name: netscape
State: S (sleeping)
Pid: 15201
PPid: 12369
Uid: 501 501 501 501
Gid: 501 501 501 501
Groups: 0 4 10 100 501
VmSize: 28756 kB
VmLck: 0 kB
VmRSS: 14184 kB
VmData: 14684 kB
VmStk: 28 kB
VmExe: 10704 kB
VmLib: 2048 kB
SigPnd: 0000000000002000
SigBlk: 0000000000002000
SigIgn: 0000000080001086
SigCgt: 0000000000016401
CapInh: 00000000fffffeff
CapPrm: 0000000000000000
CapEff: 0000000000000000
The "status" file changes on occasion, depending on wither or not a process is actually a kernel thread or a real process. If it is a kernel thread the "Vm", "Sig", and "Cap" section will not appear. The layout for these files, and there meanings, could only be found in the kernel source code, in the file called "linux/fs/proc/array.c." This file contains all the information for any file in the "/proc" directory. The last entry we care about in a PID directory is the "fd" sub directory. In each "fd" directory are located symbolic links to all files, sockets, pipes, and devices currently opened by the process. A typical one of these looks similar to this:
0 -> /dev/null
1 -> pipe:[47569]
10 -> pipe:[47557]
11 -> pipe:[47558]
12 -> pipe:[47558]
13 -> socket:[47559]
14 -> pipe:[47561]
15 -> pipe:[47561]
16 -> pipe:[47562]
17 -> pipe:[47562]
18 -> /home/jeremiah/.netscape/cert7.db
19 -> /home/jeremiah/.netscape/key3.db
2 -> pipe:[47569]
20 -> /home/jeremiah/.netscape/cache/index.db
21 -> /home/jeremiah/.netscape/history.dat
22 -> pipe:[47569]
23 -> pipe:[47569]
24 -> /dev/tty1
25 -> /dev/tty1
30 -> socket:[48648]
35 -> socket:[48641]
36 -> socket:[48644]
37 -> socket:[48645]
38 -> socket:[48646]
39 -> socket:[48647]
4 -> pipe:[10634]
5 -> pipe:[10634]
6 -> socket:[10635]
7 -> socket:[10637]
8 -> pipe:[47556]
9 -> socket:[47555]

Each pipe and socket entry shows a number, which represents the inode of that socket or pipe. The inode number allows me to easily find a socket or pipe, along with the statistics about it. You will notice that some pipes share the same inode number. This is because Linux uses pipes for two different purposes. The first is inter-process communication: IPC allows two distinct processes to pass information back and forth to each other. The other way that pipes get used is for inter-thread communication: ITC allows two threads of the same process to pass information between themselves.
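As a rough sketch of how these links can be collected (the class below is illustrative and not part of my collector), a process' open sockets and pipes can be found by listing its "fd" directory and pulling the inode number out of each "socket:[...]" or "pipe:[...]" target. Reading another user's "fd" directory normally requires root privileges.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OpenFiles {
    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        Path fdDir = Paths.get("/proc", pid, "fd");

        // Each entry is a symbolic link; its target is either a real path
        // or a pseudo-name such as "socket:[47559]" or "pipe:[47558]".
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                String target = Files.readSymbolicLink(fd).toString();
                if (target.startsWith("socket:[") || target.startsWith("pipe:[")) {
                    String inode = target.substring(target.indexOf('[') + 1,
                                                    target.indexOf(']'));
                    System.out.println(fd.getFileName() + " -> " + target
                                       + "  (inode " + inode + ")");
                } else {
                    System.out.println(fd.getFileName() + " -> " + target);
                }
            }
        }
    }
}
```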
The next major entry that we care about in the "/proc" directory is the "net" directory. In this directory are four files that contain all of the relevant information about the system's current sockets. The four files that we care about are "tcp", "unix", "udp", and "raw." The "tcp" file looks like this:
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 18331C9F:0B30 02331C9F:0050 08 00000000:00000001 00:00000000 00000000 501 0 48648
1: 18331C9F:0B2F 02331C9F:0050 08 00000000:00000001 00:00000000 00000000 501 0 48647
The last number in each line is the inode of the socket. As I mentioned before, this is how I relate a process to a socket and its information. The majority of sockets on a system are TCP sockets, since most of the internet uses TCP/IP. By reading this file, I can tell who is connected to the operating system, as well as to whom a process on the operating system is connected. UDP sockets are broadcast sockets. Here is what a typical UDP table looks like:
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:0025 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 128
1: 00000000:0206 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 122
Raw sockets are used to connect a process to a device without the overhead or interference of the operating system. There are usually very few of these; databases typically use this method so that they can maintain their own file system. Here is an example of the "raw" file:
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:0001 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 0
1: 00000000:0006 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 0
Because the inode numbers are 0, the operating system itself is the only thing with open raw sockets. Finally, the "unix" file contains Unix stream sockets. These are sockets open to processes or devices that handle streams of data; examples are the system logger, as well as most X Windows applications. Here is what they look like:
Num RefCount Protocol Flags Type St Inode Path
c35e1380: 00000001 00000000 00000000 0001 03 30826 @0000046e
c30ee000: 00000001 00000000 00000000 0001 03 10592 @000002aa
c1cffcc0: 00000000 00000000 00010000 0001 01 10635 /tmp/kio_501_12369_0.0
c10f1c20: 00000000 00000000 00010000 0001 01 164 /tmp/.s.PGSQL.5432
c10f1020: 00000000 00000000 00010000 0001 01 147 /var/run/gpmctl
c2a6ec80: 00000001 00000000 00000000 0001 03 48750 @000005bc
c18bc0e0: 00000001 00000000 00000000 0001 03 46520 @00000594
c26b6340: 00000000 00000000 00010000 0001 01 10637 /tmp/kfm_501_12369_0.0
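Relating these tables back to the "fd" links is just a matter of indexing them by their inode column. Below is a sketch of that lookup for the "tcp" file; the class is illustrative, it assumes the column layout shown above, and the address decoding assumes a little-endian (x86) machine, where the kernel prints the IP as one hexadecimal word in host byte order and the port as plain hexadecimal.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TcpTable {
    // Turn "18331C9F:0050" into "159.28.51.24:80".  On a little-endian machine
    // the four IP bytes are read back to front; the port is plain hex.
    static String decode(String hexAddr) {
        String[] parts = hexAddr.split(":");
        long ip = Long.parseLong(parts[0], 16);
        int port = Integer.parseInt(parts[1], 16);
        return ((ip)       & 0xff) + "." + ((ip >> 8)  & 0xff) + "."
             + ((ip >> 16) & 0xff) + "." + ((ip >> 24) & 0xff) + ":" + port;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> byInode = new HashMap<String, String>();
        BufferedReader in = new BufferedReader(new FileReader("/proc/net/tcp"));
        in.readLine();                               // skip the header line
        for (String line; (line = in.readLine()) != null; ) {
            String[] f = line.trim().split("\\s+");
            // f[1] = local_address, f[2] = rem_address, f[9] = inode
            byInode.put(f[9], decode(f[1]) + " -> " + decode(f[2]));
        }
        in.close();
        System.out.println(byInode);
    }
}
```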
The other files in the "/proc" directory are relatively self-explanatory. The "meminfo" entry contains the system's memory usage. The "loadavg" file holds the system-wide CPU load, both current and averaged. The "stat" file contains interrupt statistics and the amount of data written to and from the various block devices on the system. Here is what it looks like:
cpu 4664321 26347685 9630506 40271909
disk 1292708 166719 8055 2695
disk_rio 633374 166716 8055 2693
disk_wio 659334 3 0 2
disk_rblk 1272543 166719 8058 2695
disk_wblk 1515789 3 0 2
page 2048545 1059956
swap 39523 70644
intr 98567541 80914421 324739 0 6882508 7680297 1 7 80 0 0 0 0 1468182 1 1292283 5022 0 0 0 0 0
ctxt 189133333
btime 912978323
processes 49420
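The first line, "cpu", is a running count of clock ticks spent in user, nice, system, and idle modes since boot, so system-wide CPU usage has to be computed from the difference between two samples. Here is a small sketch of that calculation (the one second sampling interval is arbitrary, and the field order assumes the 2.x layout shown above):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CpuUsage {
    // Read the "cpu" line of /proc/stat: user, nice, system, idle ticks.
    static long[] sample() throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("/proc/stat"));
        String[] f = in.readLine().trim().split("\\s+");   // "cpu u n s i"
        in.close();
        return new long[] { Long.parseLong(f[1]), Long.parseLong(f[2]),
                            Long.parseLong(f[3]), Long.parseLong(f[4]) };
    }

    public static void main(String[] args) throws Exception {
        long[] a = sample();
        Thread.sleep(1000);                  // wait one second between samples
        long[] b = sample();

        long busy = (b[0] - a[0]) + (b[1] - a[1]) + (b[2] - a[2]);
        long idle = b[3] - a[3];
        double pct = 100.0 * busy / (busy + idle);
        System.out.println("CPU usage: " + pct + "%");
    }
}
```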
The limitations of the Linux operating system are few, but frustrating nonetheless. The information above is the limit of the data that I can get about the system; the only way around this would be to write kernel code adding the additional functionality that I need. Linux falls short in the following areas. Process disk usage: how much data a particular process has written to or read from a given block device. Process socket usage: how many packets a process has sent or received over a socket, and what the possible bandwidth of that socket is. I spent a while experimenting with firewalls; they keep information about how much data has been sent over a socket, but only on a per-port basis, and in the end this did not prove to be a fruitful endeavor. When I sent a question about these limitations to the head network programmer for Linux, his reply was that "no one has ever needed to collect that information." I also believe that the overhead associated with generating these statistics would drastically slow down the system.
I used the Java programming language to write my program. I chose Java for three reasons. The first is its strict adherence to object orientation. Java's adherence to OO allows me to treat each abstract group of information as a single object and to reuse it for similar groups of information. In my source code, each process and socket is a separate object; each one is unique, but each group shares the same traits. For example, each process has information on its memory usage, each TCP socket has a local and remote address associated with it, and so on. Each object is also given methods describing how to display itself. By simply telling an object where and when to display itself, the programming of the user interface is made much simpler. The second reason I chose Java is its ability to be run from internet browsers. This is by far one of its most important features: it allows me to monitor any system that has my program on it from any remote system with a Java-enabled browser. The third reason for my choice is my familiarity with the language.
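To give a flavor of that structure (the class, field, and method names below are illustrative rather than copied from my source), each group of information becomes a small class that carries its statistics and knows how to draw itself when the display tells it where:

```java
import java.awt.Color;
import java.awt.Graphics;

// Illustrative sketch of a per-process object: it carries statistics read
// from its /proc/<pid>/ files and knows how to draw itself as a sphere.
public class ProcessObject {
    int pid;            // process ID, taken from the directory name
    String name;        // simple name of the executable, from "stat"
    long memKb;         // resident memory in kilobytes, from "statm"
    Color color;        // set according to whatever bound the user has defined

    ProcessObject(int pid, String name) {
        this.pid = pid;
        this.name = name;
        this.color = Color.green;
    }

    // Called by the display object, which decides position and radius.
    void display(Graphics g, int x, int y, int radius) {
        g.setColor(color);
        g.fillOval(x - radius, y - radius, 2 * radius, 2 * radius);
        g.setColor(Color.black);
        g.drawString(name + " (" + pid + ")", x - radius, y);
    }
}
```

A socket object would look much the same, carrying its local and remote addresses instead of memory figures.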
My program is structured in the following way. The class file called sysview is invoked as either an applet or an application from the browser. Sysview calls the control class. Control initializes the display, process, socket, and system objects. It initializes them by calling the RMI server on the desired host and maintains an open connection to it. It then sorts these into the appropriate parent objects; some processes are parents of other processes and may contain open sockets as well, while the system object only holds information about the system itself, such as the load average. Once display is initialized, it begins monitoring user events such as mouse clicks. These events are passed back to control for it to deal with. The display information for a particular object is passed over the network connection to display. Currently my source code only reflects the RMI side.
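A sketch of the shape of that connection is shown below. The interface and method names are hypothetical; they only illustrate how control might look up the collector on the monitored host over RMI and pull "/proc" data across the wire.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical remote interface for the collector that runs on the
// monitored host and reads /proc on behalf of the viewer.
interface SystemMonitor extends Remote {
    int[] getProcessIds() throws RemoteException;
    String getProcessStat(int pid) throws RemoteException;     // /proc/<pid>/stat
    String getLoadAverage() throws RemoteException;            // /proc/loadavg
    String getTcpTable() throws RemoteException;                // /proc/net/tcp
}

// On the viewer side, control would look the server up by host name.
class LookupExample {
    public static void main(String[] args) throws Exception {
        SystemMonitor monitor =
            (SystemMonitor) Naming.lookup("rmi://" + args[0] + "/SystemMonitor");
        System.out.println(monitor.getLoadAverage());
    }
}
```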
The user interface consists of different colored spheres ranging from green to red along the natural spectrum. Color is used to indicate the proximity of the represented system object to a defined system bound. For example, the closer a process gets to 100% CPU usage, the closer to red the color of its representative sphere becomes. The reason for using color to represent bound proximity is color's ability to easily represent data in a qualitative way, playing on the user's natural interpretation of various colors.
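One way to compute such a color (a sketch of the idea, not necessarily the exact mapping my display uses) is to interpolate the hue from green to red as an object's usage approaches its bound:

```java
import java.awt.Color;

public class BoundColor {
    // fraction = current value / defined bound, clamped to [0, 1].
    // 0.0 maps to green (hue 1/3), 1.0 maps to red (hue 0).
    static Color forFraction(double fraction) {
        double f = Math.max(0.0, Math.min(1.0, fraction));
        float hue = (float) ((1.0 - f) / 3.0);
        return Color.getHSBColor(hue, 1.0f, 1.0f);
    }

    public static void main(String[] args) {
        System.out.println(forFraction(0.10));   // well below the bound: green
        System.out.println(forFraction(0.95));   // near the bound: nearly red
    }
}
```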
The next aspect of the interface is the size of the spheres. The size of a sphere is determined by the system object's usage of a particular resource, relative to the other system objects using that same resource. If there is only one object using a resource, its size is determined by its percent usage of that resource. The reasoning behind varying the size of the system objects is to allow the user to easily compare the relative usage between system objects: the larger the sphere, the more of a resource it is using. The size and the color of a system object may represent the same resource in certain instances, if that resource is used to determine the object's bounds.
![]()
The third aspect of the user interface is the bars that connect the different system objects. These are used to show the relation of system objects between levels of abstraction. The user navigates through these levels by clicking on a system object, which displays a detailed view of its components; these components can also be system objects, creating a tree effect.
![]()
The fourth aspect of the user interface is the textual identification. Text is overlaid onto a system object to allow the user to distinguish one object from another. Although this violates the ideal of the project, my tests indicated that this is the easiest and clearest way for the user to identify an object. Using different icons constrained what could be displayed too much, while different shapes and textures caused too much confusion and required the user to learn and remember too much additional information.
![]()
The user will have the ability to change their point of view according to three basic models: the process model, the resource model, and the service model. These will be the three main branches of the display tree. If the user clicks on the process model, they will get a view of all processes currently running on the system, each identified by its PID (process ID). Clicking on a process will reveal its specific details, such as memory usage, open sockets and pipes, file descriptors, etc. The user will be able to define a universal bound for all processes; the default bound will be CPU usage. The resource model will display relevant system resources, each with a default bound. Clicking on a resource will display both the services and the processes using that resource. Potential resources include bandwidth usage, disk space, I/O usage, memory usage, CPU usage, etc. The service model will display user-defined services, shown according to user-defined bounds. Each service object will contain all of the processes and resources being used by that service.

![]()
The user will at any time be able to define bounds on an object by right-clicking on that object. By right-clicking on the service sphere, a user will be able to add a service to monitor; the user will type in the name that they wish to call the service, along with the name of the executable. Service objects will implement a polling architecture by default. All other system objects will not continually update unless they or their parent is clicked upon, or the user sets a particular system layer to poll.
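A sketch of what such polling could look like (the class and method names here are hypothetical): a daemon timer periodically re-reads the relevant "/proc" data over the RMI connection and asks the display to redraw the service object.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical polling loop for a service object: every few seconds it runs
// a refresh action that re-reads statistics and redraws the display.
public class ServicePoller {
    public static void poll(final Runnable refresh, long periodMs) {
        Timer timer = new Timer(true);               // daemon timer thread
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                refresh.run();                       // re-read /proc and redraw
            }
        }, 0, periodMs);
    }

    public static void main(String[] args) throws Exception {
        poll(new Runnable() {
            public void run() {
                System.out.println("refreshing service statistics...");
            }
        }, 5000);
        Thread.sleep(20000);                         // keep the demo alive
    }
}
```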