Benjamin H. Miller
Senior Seminar Project
PARALLEL PROGRAMMING:
Parallel Programming uses multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer. It also offers the opportunity to tackle larger problems; that is, problems with more computational steps or more memory requirements. The latter because multiple computers and multiprocessor systems often have more total memory than a single computer. The computers must be interconnected by a network and a software environment must be present for intercomputer message passing. Several software tools are available for message-passing parallel programming, including PVM and several implementations of MPI, which are all freely available.
There are a couple of types of parallel computer, shared memory multiprocessor system and message passing multicomputer. Shared-Memory Programming consists of multiple processors connected to multiple memory modules, such that each processor can access any memory module. A shared memory multiprocessor system employs a single address space which means that each process has its own address block. Processes communicate through shared variables. Message Passing exists when computers communicate with each other by sending data back and forth across a connection. A process can only access the memory on the local machine. The message-passing multicomputer carries data from one process to the another as indicated by the program. The message passing multicomputer will physically scale easier than a shared memory multiprocessor.
** This section on classifications might stay or go depending on length. **
Classifications of Parallel Processing
WHAT IS PVM?
PVM is a software system that enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. It is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous concurrent computing framework on interconnected computers of varied architecture. The overall objective of the PVM system is to enable such a collection of computers to be used cooperatively for concurrent or parallel computation. The individual computers may be shared or local-memory multiprocessors, vector supercomputers, specialized graphics engines, or scalar workstations, that may be interconnected by a variety of networks, such as Ethernet, FDDI.
HISTORY OF PVM:
The PVM project began in the summer of 1989 at Oak Ridge National Laboratory. The prototype system, PVM 1.0, was constructed by Vaidy Sunderam and Al Geist; this version of the system was used internally at the Lab and was not released to the outside. Version 2 of PVM was written at the University of Tennessee and released in March 1991. During the following year, PVM began to be used in many scientific applications. After user feedback and a number of changes (PVM 2.1 Ð 2.4), a complete rewrite was undertaken, and version 3 was completed in February 1993. It is PVM version 3.3.
PVM DETAILS:
The PVM system is composed of two parts. The first part is a daemon , called pvmd3 and sometimes abbreviated pvmd , that resides on all the computers making up the virtual machine. (An example of a daemon program is the mail program that runs in the background and handles all the incoming and outgoing electronic mail on a computer.) Pvmd3 is designed so any user with a valid login can install this daemon on a machine. When a user wishes to run a PVM application, he first creates a virtual machine by starting up PVM. The PVM application can then be started from a Unix prompt on any of the hosts. Multiple users can configure overlapping virtual machines, and each user can execute several PVM applications simultaneously. The second part of the system is a library of PVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between tasks of an application. This library contains user-callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.
The PVM computing model is based on the notion that an application consists of several tasks. Each task is responsible for a part of the application's computational workload. Sometimes an application is parallelized along its functions; that is, each task performs a different function, for example, input, problem setup, solution, output, and display. This process is often called functional parallelism . A more common method of parallelizing an application is called data parallelism. In this method all the tasks are the same, but each one only knows and solves a small part of the data. This is also referred to as the SPMD (single-program multiple-data) model of computing. PVM supports either or a mixture of these methods. Depending on their functions, tasks may execute in parallel and may need to synchronize or exchange data, although this is not always the case.
All PVM tasks are identified by an integer task identifier (TID) . Messages are sent to and received from tids. Since tids must be unique across the entire virtual machine, they are supplied by the local pvmd and are not user chosen. Although PVM encodes information into each TID the user is expected to treat the tids as opaque integer identifiers. PVM contains several routines that return TID values so that the user application can identify other tasks in the system.
PURPOSE OF MY PROJECT:
Some clusters are designed so that not all nodes are visible to the external world. In these cases, there is often a "front end" node which connects to the world, and the rest of the nodes are on an internal network. This has the disadvantage for PVM users running on an outside computer who want to add one of these computers to their configuration: the PVM code on their computer will be unable to communicate with the nodes on the internal network. The only communication the internal network has with the outside world is through a gateway or portal node. PVM3.4.3 handles this problem in a little different manner than was available in previous versions. It has implemented the BEOLIN port which solves this problem by making the parallelism of the computer setup transparent to the user.
The original way of passing messages from one cluster to the other was by using the low level TCP/IP protocol to pass the packets back and forth through the gateway. Now there exists a second method which still uses TCP/IP but now all messages must pass through the PVM daemon (pvmd) running on the gateway machine.
PVM ARCHITECTURE:
This BEOLIN port generates a single copy of the PVM daemon (pvmd). The daemon spawns tasks onto other processors which are selected based on the contents of the PROC_LIST environment variable, with each task going onto a separate processor. Initially each task communicates with the PVM daemon using what is assumed to be a shared files system, in the /tmps directory. Each task then sets up sockets to communicate with the daemon and with each other. To avoid communication bottlenecks through the daemon, it is strongly recommended that the PvmRoute option be set to PvmRouteDirect in your PVM applications using the pvm_setopt() library call.
For each task the daemon initiates, it forks a process which in turn executes an rsh to initiate the
task on a cluster host. Each forked process shows up in the task list, as
INSTALLATION NOTES:
** This section will be summarized in the final draft for an easier understanding **
To select this Beowulf port when compiling PVM, the PVM_ARCH flag should first be manually set to
"BEOLIN" in your $HOME/.cshrc (or equivalent) shell startup file.
The /tmps directory must be a shared file system among all the Beowulf processors which will be
involved in running PVM tasks. If a different shared directory is desired, this can be set using the
new PVM_TMP environment variable, as is now possible with all standard PVM architectures.
This distribution is shipped by default with the compile flag -DNOPROT added to the ARCHCFLAGS define
in the pvm3/conf/BEOLIN.def configuration file. This turns off security checking between the pvmd
and a task when the task initiates the connection process. It has been turned off because of NFS
performance problems over a shared file system. If you want to turn authentication protection back
on, and don't mind the wait (approximately 15 seconds per check), then the -DNOPROT flag should be
removed from the ARCHCFLAGS define and PVM should be recompiled.
The front-end node of a Beowulf system which provides access to the internal network will have two IP addresses: one registered address visible to the outside world and one internal address visible only to the other nodes in the cluster. The gethostname() function, when run on this front-end node, must return the host name which is recognized by the other nodes in the cluster, in order for them to communicate with the pvmd daemon on the front end. External hosts wishing to add the Beowulf cluster to a virtual machine will use the host name that maps to the registered external IP address. (PVM resolves multiple network (IP) addresses using the host names associated with each address on a given system, as typically specified in the /etc/hosts file.)
RUNNING:
** This section will be summarized in the final draft for an easier understanding **
The desired processors on the internal network should appear in the environment variable PROC_LIST, as a list of host names separated by colons. For example, on our internal network, the processors are named beginning with the letter 'h' followed by the processor number. So in my $HOME/.cshrc file I might have the statement:
This would allow PVM tasks to be spawned onto h11-14, a total of 4 processors. Any additional spawns
beyond the first 4 would result in an error message, unless the earlier tasks completed before the
new spawns occurred.
Remember, when you use this version of PVM you don't add the nodes individually--just do a spawn,
and PVM will automatically spawn to the nodes in your PROC_LIST variable.
** Some of the info that will be removed is from a file that is apart of the distribution of the
PVM source. I am going to just summarize and provide a pointer to the file in case someone wants
more details. **
TEST ARCHITECTURE:
** This will be moved out of the main body to an appendix at the end of this paper. **
** I will only be using in as a reference. **
I will be testing this architecture using sixteen machines that will be fully connected on a private
network. These machines have been given the names AthenaA through AthenaP. I will refer to them
this way throughout the rest of the paper. They will have the ability to communicate with the
outside world through a router, Noether. The router then has two Ethernet cards which are both 10MB
cards. One is connected to the Athena machines and the other is connected to four other machines
which have the names Acl1 through Acl4. These machines will also be fully connected to each other
but not with the any of the Athena machines. The Athena machines are connected in a hyper cube
configuration to provide a shorter pass between nodes. 3Com is the manufacturer of the network
cards. Each machine has four 3C905B 100MB fast Ethernet PCI cards for the hyper cube and one
3C509B-TPO 10Mbps Ethernet ISA card to communicate with a switch. The Athena machines are each
running with American Micro Devices (AMD), CPUÕs. They are AMD K6 300Mhz processors. Western Digital,
our hard disk drive manufacturer.
Most of them has PentiumII 400 with 128 MB RAM and about 4.5 GB hdd. They all have inbuilt 10/100
ethernet capability. There are about 2-3 machines that are radically different (old faculty machines
that got converted into ACL boxes) as well as some with different parts (like IDE zip disk instead
of SCSI or different video card)
Details that need to be included some of which I do not know yet.
TYPE OF TESTS:
I will be using three programs to test the BEOLIN configuration. Two of the types of programs have
been mentioned before in this paper. These were the embarrassingly parallel technique and the data
partitioning technique. The third program will use a parallel programming technique called
pipelining.
I will have two different tests that I will be running on each program. The first test will be used
to figure out the comparison of the fully connected cluster configuration versus the BEOLIN
configuration. To do this I will be configuring PVM and my programs in a way so that each of them
uses exactly the same number of nodes and the same number of processes.
The second test will address the scaling issue in computer programming. Does one of these types
of parallel programming techniques favor the BEOLIN configuration better than the other? How do each
one of them scale meaning how does the resource consumption curve grow over 2, 4, 8, 16 nodes?
EMBARRASSINGLY PARALLEL:
A computation that can be divided into a number of completely independent parts which can be executed
by a separate processor. A truly embarrassingly parallel computation suggest no communication between
separate processes. Results from the slaves do not need to be combined to obtain results. Each
process requires different (or the same) data and produces results from its input without any need
for results from other processes. This situation will give the maximum possible speedup if all the
available processors can be assigned processes for the total duration of the computation. The only
requirement for the master is to distribute the data and to start the processes. If it is represented
as a graph it leads to a completely disconnected computational graph.
PVMPOV will be the implement technique.
** Several paragraphs about why pvmpov is a good benchmark, what it is, how it works and so on. **
PVMPOV = PVM + POV-Ray
PVM is a message passing system that enables a network of computers to be used as a single distributed
memory parallel computer. This network is referred to as the Parallel Virtual Machine. POV-Ray is a
3-dimensional raytracing engine. It takes information you supply and simulates the way light interacts
with the objects you've defined to create stunning 3d pictures and animation. This process is called
rendering. PVMPOV has the ability to distribute a rendering across multiple heterogeneous systems.
Parallel execution is only active if the user gives the "+N" option to PVMPOV. Otherwise, PVMPOV
behaves the same as regular POV-Ray and runs a single task only on the local machine. Using the PVM
model, there is one master and many slave tasks. The master has the responsibility of dividing the
image up into small blocks, which are assigned to the slaves. When the slaves have finished rendering
the blocks, they are sent back to the master, which combines them to form the final image. The master
does not render anything by itself, although there is usually a slave running on the same machine as
the master, since the master doesn't use very much CPU power. If one or more slaves fail, it is
usually possible for PVMPOV to complete the rendering. PVMPOV starts the slaves at a reduced priority
by default, to avoid annoying the users on the other machines. The slave tasks will also automatically
time out if the master fails, to avoid having lots of lingering slave tasks if you kill the master.
DATA PARTITIONING:
Similar to embarrassingly parallel except that the results have to be combined to obtain the results.
Partitioning can either be applied to the data by operating upon the divided data concurrently or can
be applied to the functions of a program. The former is called data decomposition and the latter is
called functional decomposition. Functional decomposition is not as trivial as the data decomposition
so I am going to implement the data decomposition.
Finding the Billionth digit of Pi will be the implementation here.
** Several paragraphs about why pvmpov is a good benchmark, what it is, how it works and so on. **
PIPELINED COMPUTATIONS:
The problem is divided into a series of tasks that have to be completed one after the other. I have
also read that this is similar or the basis of sequential programming. Each task will be executed by
a separate process or processor. Each stage will contribute to the overall problem and pass on
information that is needed for subsequent stages.
Finding Prime Numbers will be the implementation here.
** Several paragraphs about why pvmpov is a good benchmark, what it is, how it works and so on. **
WHAT RESULTS WILL BE MONITORED:
XPVM:
Monitoring Software
XPVM provides a graphical interface to the PVM console commands and information, along with several
animated views to monitor the execution of PVM programs. These views provide information about the
interactions among tasks in a parallel PVM program, to assist in debugging and performance tuning.
XPVM provides "point-and-click" access to the PVM console commands. A pull-down menu allows users
to add or delete hosts to configure the virtual machine. Tasks can be spawned using a dialog box that
prompts for all spawn options, including the trace mask to determine which PVM routines to trace for
XPVM.
SCHEDULE:
RESULTS:
This section will be filled up during the final week of the project. It will include graphs and
tables of data. I will also include a good size write up on the whole testing process.
Other Links:
PVM Homepage - http://www.epm.ornl.gov/pvm/
Homepage - http://www.netlib.org/utk/icl/xpvm/xpvm.html
My Project Homepage - http://www.earlham.edu/~millebe/CS80/