                        ParGCL/MPI

Comments to:  Gene Cooperman (gene@ccs.neu.edu)

ParGCL provides three parallel programming interfaces, each described
individually, below.  The three interfaces are:

  slave-listener - master sends LISP commands to a slave, and then
        receives replies.
  MPI - Simplified LISP interface to MPI.  At this time, it primarily
        implements the point-to-point layer of MPI.  See the MPI standard
        on the web for a full manual of the MPI functions.
  master-slave - simple, easy-to-use master-worker model of parallelism
        based on TOP-C (http://www.ccs.neu.edu/home/gene/topc.html)
        It also supports non-trivial parallelism, in which any slave
        process can cause the "shared data" to be updated, and so made
        visible to other slaves.

To begin, make sure that there is a procgroup file in the directory where
you start pargcl, or else call:
  pargcl -p4pg FULL_PATH_OF_PROCGROUP_FILE
The procgroup file specifies the remote "slave" machines and the path to
pargcl on those machines.  pargcl/bin/procgroup is a sample file.

Most people will want to begin with the slave-listener interface.
Note that (help 'COMMAND) works for most commands.
There are example parallel programs in pargcl/examples/.

Here is a sample usage of some simple commands:

(send-receive-message '(1+ 5)) ; send message, recv answer (default: slave 1)
(send-message '(+ 3 4) 2)      ; send to slave 2 for evaluation
(send-message "(+ 5 6)" 1)     ; send to slave 1 for evaluation
(send-message "(+ 7 8)" 1)     ; send to slave 1 for evaluation
(receive-message 2)
(receive-message 1)
(probe-message 1)              ; Slave 1 has message pending
(flush-all-messages)
(probe-message 1)              ; Slave 1 now has no messages
(broadcast-message '(setq myrank (MPI-Comm-rank))) ; rank (id) of slave
(par-eval '(defun myfnc (x) (sqrt x)))
(send-message '(dotimes (i 1000000000) (myfnc i)) 1)
(par-reset)                    ; useful if slave not responding.
(par-load "/home/joe/myfile.lsp") ; Load same file in all processes

(defun par-mapcar (fnc list &aux (i -1) result)
  (par-funcall #'par-mapcar-helper fnc list))
(defun par-mapcar-helper (fnc list &aux (i -1) result)
  (if (master-p) (setq result (make-array (length list))))
  (master-slave
    :generate-task-input
      #'(lambda () (incf i) (if (< i (length list)) i nil))
    :do-task #'(lambda (j) (funcall fnc (nth j list)))
    :check-task-result
      #'(lambda (task-in task-out)
          (setf (aref result task-in) task-out)
          'NO-ACTION)
    :trace t)
  (if (master-p) (coerce result 'list) nil)
)
(par-mapcar #'sqrt '(1 2 3 4 5 6 7 8 9 10))
(help '*master-slave-time*)
*master-slave-time*
(send-receive-message '*master-slave-time*) ; time spent on slave 1

;; The definition of par-mapcar above demonstrates an easy way to
;;   generate master-worker parallelism.
;; In :generate-task-input, the variable i is the next task input;
;;   this is generated on the master.
;; In :do-task, the task is to call fnc on (nth i list);
;;   Note that the argument "list" is shared on all processes.
;;   The task is then executed on the next available slave.
;; In :check-task-result, the master gets a pair:  (i, task-out),
;;   where task-out is (funcall fnc (nth i list)).
;; Here, we return 'NO-ACTION, but more general patterns are also
;;   possible.  See "Overview" in
;;   http://www.ccs.neu.edu/home/gene/topc.html for other patterns.

;;;;USAGE:
;;;;  A remote process is set up for each one specified in your procgroup file.
;;;;  Available commands:
;;;;    (send-message )
;;;;    (broadcast-message ) ; no reply message
;;;;    (receive-message )
;;;;    (defun send-receive-message )
;;;;    (send-receive-message )
;;;;    (free-msg-buffer ) ; optimization for greater efficiency, only
;;;;                       ; frees message buffer for re-use in next MPI-recv
;;;;    (probe-message )
;;;;    (flush-all-messages) ; flush all commands pending by slaves
;;;;    (bye) modified to also delete remote processes
;;;;    (get-last-source) ; not needed, but can be useful in special cases
;;;;    (get-last-tag) ; not needed, but can be useful in special cases
;;;;  ---
;;;;  It works to call (send-message ... 1) n times, and then call
;;;;  (receive-message 1) n times, although this may be less efficient.
;;;;  Commands sent to the same processor are evaluated in sequence.
;;;;  ---
;;;;  Send/receive/probe commands also accept optional tag with obvious default
;;;;  CAUTION:  If you add the optional tag parameter to send-message,
;;;;    note that the slave-listener does not reply if tag = broadcast-tag.
;;;;    Also, tags larger than broadcast-tag are interpreted as (vector fixnum)
;;;;    or (vector float), and use the corresponding MPI data types.

====================
The current implementation has a built-in MPI subset (MPINU).
Note, however, that one can invoke:
  ./configure --with-mpicc=XXX
where XXX is an mpicc script that compiles an application to use MPI.
This allows you to use other dialects of MPI.

Note that messages are converted to (vector fixnum), (vector float),
or (by default) strings (general print representations) before being sent.
Conversion from a general object to its print representation as a string
implies significant overhead for very large messages.
Also, a timer will kill any process that has not received a message for
some time (60 minutes by default).

Following is a summary of the master-slave layer, slave-listener layer,
and MPI layer.
=========================================================================
[ The master-slave layer uses the TOP-C model.  For a detailed description
  of the TOP-C model, see http://www.ccs.neu.edu/home/gene/topc.html ]

MASTER SLAVE LAYER
(master-slave  - see related article for full details.
                 The user is responsible for writing the FNC's:
  :generate-task-input FNC - (generate-task-input) returns TASK-IN,
        an arbitrary user data structure
  :do-task FNC - (do-task TASK-IN) returns TASK-OUT,
        an arbitrary user data structure
  :check-task-result FNC - (check-task-result TASK-IN TASK-OUT) must return
        T, NIL, ?, or (CONTINUATION . *)
        Equivalently, can return symbols:
          UPDATE, NO-ACTION, REDO, (CONTINUATION . *)
        T means call (update-shared-data TASK-IN TASK-OUT)
        NIL means do nothing more with this task
        ? means re-send TASK for additional computation by do-task
        CONTINUATION is like ?, but it allows a user parameter for *.
          (CONTINUATION . *) becomes the next TASK for the slave.
  :update-shared-data FNC - (update-shared-data TASK-IN TASK-OUT)
        executed on master and slaves, used to modify global variables
  :trace T-OR-NIL) - turns tracing on (T) or off (NIL)
*master-slave-trace* - global variable, default value for :trace keyword
  EXAMPLE:  see file, myfactor.lsp
*master-slave-time* - variable set after each call to master-slave;
  see (help '*master-slave-time*);  On each processor, shows time spent.
(up-to-date-p) - utility to test whether update-shared-data was called
  between the time when the task for the most recent result was generated,
  using generate-task-input, and the time when the corresponding result
  was obtained, using check-task-result.  Returns true if no such update
  occurred.
  EXAMPLE:  (defun check-task-result-fnc (task-in task-out)
              (if (up-to-date-p)
                  t         ;data is consistent, update all processors
                  'REDO))   ;data structures changed, re-do computation
(result-task) - utility returning the task corresponding to the most recent
  result returned;  Used only in check-task-result
  EXAMPLE:  (defun check-task-result-fnc (task-in task-out)
              (case (first (result-task))
                (TASK-A ...)
                (TASK-B ...)
                ...))
(par-eval 'EXP) - EXP evaluated on all processors
  EXAMPLE:  (par-eval '(defun foo (x) (1+ x)))
(par-funcall FNC ARG1 ...) - like (funcall FNC ARG1 ...) with value of ARG's
  taken from lexical environment on master
  EXAMPLE:  (par-funcall #'load "file.lsp")
(par-load FILE) - like (load FILE), loads FILE on all processors
(par-reset) - resets slave processors if wedged (deadlocked)
  EXAMPLE:  (par-reset)
(master-p) - returns t when executed on master, and nil when on slave.
  EXAMPLE:  (par-eval '(if (master-p)
                           (set-up-large-master-data-structure)
                           (set-up-special-slave-stuff)))

=========================================================================
SLAVE LISTENER LAYER - master distributes commands to slaves;
  These commands should be executed on the master ONLY.
  The slave-listener layer sets the initial slave directory
    to the same as the master.
  The master can arbitrarily intermix sending commands and getting results.

(send-message LISP-EXPR OPT-MPI-DEST-ID OPT-TAG) - send arbitrary expression
  for evaluation;  send to OPT-MPI-DEST-ID with TAG,
  default is OPT-MPI-DEST-ID = 1, TAG = 0
  EXAMPLE:  (send-message '(+ 3 4))
  NOTE:  if LISP-EXPR is a string, it assumes this is a print
    representation of a LISP object, and not a raw LISP string
(receive-message OPTIONAL-MPI-ID OPT-TAG) - return result from next message
  from MPI-ID that was sent with TAG,
  default is next available slave, any tag
  EXAMPLE:  (receive-message)  ->  returns 7 for above case
(broadcast-message LISP-EXPR) - send arbitrary expression to all slaves.
  No result is returned.
(free-msg-buffer BUF) - optimization for greater efficiency;
  Allows end-user to free message buffer for re-use by slave-listener,
  saving the slave-listener from generating a new buffer.
(mpi-comm-rank) - returns MPI process ID, 0 = master, > 0 for slaves
  EXAMPLE:  (par-eval '(if (= (mpi-comm-rank) 2) (print x)))
(get-last-source), (get-last-tag), (get-last-count), (get-last-datatype)
  all exist.
  These last should be avoided, but get-last-source can be useful for
  matching the initial response of a slave with a later continuation by
  the same slave.

=========================================================================
MPI LAYER - not all commands are implemented.  The MPI manual is also
  a good source for what the commands do.  The calling sequences can be
  found via (help 'mpi-command) or (describe 'mpi-command)

(mpi-iprobe) - non-blocking probe for remaining messages.  Returns t or nil.
  EXAMPLE:  (if (null (mpi-iprobe)) (print "READY"))
(mpi-probe) - blocking probe for remaining messages.
  Always returns t, eventually.
  EXAMPLE:  (progn (mpi-probe) (print "READY"))
(autologout X) - if no message is received in X minutes, the process is
  aborted.  Returns old value of X.
  Can be called with 0 arguments, leaving old value unchanged.
(nice X) - nice value between -19 and 20 (higher is lower priority);
  returns old value
(limit-rss X) - limit Resident Set Size, when others also want RAM;
  returns old value
(chdir "string") - change current directory
(getcwd) - get current working directory as string
Many other MPI commands.  Not recommended, except for special cases.
One could load the MPI layer alone, for special requirements.

DEBUGGING -
  It is recommended, when debugging, to first run the master and all
  slaves on a single processor.  (The procgroup file can easily be set up
  this way.)  The :trace keyword in the function, master-slave, is also
  useful.
  In more difficult debugging situations, one may run the user code with
  master-slave-seq.lsp in a single process (e.g.: in gcl, not gcl-mpi).
  In this case, one can use (step ...) to debug routines that would
  reside on the slave.  However, the user code should first be modified
  to remove any uses of "(master-p)" or other constructs explicitly
  referring to a distinct master and slave.
  (par-eval '(si::use-fast-links nil)) is a good precaution during
  development, to find at what function any crashes occur.
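
  To make the single-process debugging idea concrete, here is a minimal
  sketch of a sequential stand-in for master-slave.  It is NOT the contents
  of master-slave-seq.lsp;  the names seq-master-slave and seq-mapcar are
  hypothetical, and the sketch assumes only what is described above:  a NIL
  task input means no more tasks, and check-task-result returns T/UPDATE,
  NIL/NO-ACTION, REDO, or (CONTINUATION . *).

```lisp
;; Hypothetical single-process stand-in for master-slave (not part of ParGCL).
;; Every task runs in the current process, so (step ...) and (trace ...) work.
(defun seq-master-slave (&key generate-task-input do-task
                              check-task-result update-shared-data)
  (loop
    (let ((task-in (funcall generate-task-input)))
      (when (null task-in) (return))       ; assumed: NIL means no more tasks
      (loop
        (let* ((task-out (funcall do-task task-in))
               (action (funcall check-task-result task-in task-out)))
          (cond ((member action '(t update))
                 ;; UPDATE: apply the change to the (single) copy of the data.
                 (when update-shared-data
                   (funcall update-shared-data task-in task-out))
                 (return))
                ((member action '(nil no-action))
                 (return))
                ;; REDO: loop again and re-run do-task on the same input.
                ((eq action 'redo))
                ;; (CONTINUATION . x): x becomes the next task for do-task.
                ((and (consp action) (eq (car action) 'continuation))
                 (setq task-in (cdr action)))
                (t (error "Unknown action: ~S" action))))))))

;; Usage: a sequential analogue of the par-mapcar example.
(defun seq-mapcar (fnc list &aux (i -1) result)
  (setq result (make-array (length list)))
  (seq-master-slave
    :generate-task-input
      #'(lambda () (incf i) (if (< i (length list)) i nil))
    :do-task #'(lambda (j) (funcall fnc (nth j list)))
    :check-task-result
      #'(lambda (task-in task-out)
          (setf (aref result task-in) task-out)
          'no-action))
  (coerce result 'list))
```

  In a single process nothing else can change the shared data between
  do-task and check-task-result, so a check-task-result that returns REDO
  unconditionally would loop forever in this sketch.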
  If communication time is excessive, then you should break up any very
  large messages into elementary MPI datatypes.  So, consider breaking up
  messages of over several thousand bytes into several messages, where
  each large message is of LISP type:  (vector fixnum) or (vector float).
  The variable *master-slave-time* may also be useful for seeing where all
  the time is going.

PERFORMANCE - Try:
  1. Bundling several tasks into one to reduce communication overhead
  2. Setting up multiple slave processes on the same processor to gain
     greater CPU utilization on the processor
  3. Using free-msg-buffer in check-task-result to save on garbage
     collections:  Important if do-task is returning long vectors of
     fixnum or of float.
  4. The file, master-slave.lsp, currently has a throttling mechanism in
     case a slave is very slow, perhaps due to paging against other users.
     Any slave running CYCLES-BEFORE-DEAD times slower (default = 20) will
     stop being used.  After CYCLES-BEFORE-UNDEAD (default = 200) task
     times, it will be resuscitated and tried again.  These constants in
     master-slave.lsp can be changed, after which STAR/MPI should be
     re-made.
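
  As an illustration of item 1 (bundling), the following sketch groups
  task inputs into bundles so that each message to a slave carries several
  tasks.  The helper names bundle-tasks and make-bundled-do-task are
  hypothetical and not part of ParGCL;  the bundle size is something you
  would tune for your own application.

```lisp
;; Hypothetical helpers (not part of ParGCL) for bundling tasks.
(defun bundle-tasks (items bundle-size)
  "Split ITEMS into a list of sublists of at most BUNDLE-SIZE elements."
  (check-type bundle-size (integer 1))
  (loop while items
        collect (loop repeat bundle-size
                      while items
                      collect (pop items))))

(defun make-bundled-do-task (fnc)
  "Return a do-task function that applies FNC to each element of a bundle."
  #'(lambda (bundle) (mapcar fnc bundle)))

;; Sketch of how these might be used with master-slave (assumed usage):
;;   (let ((bundles (bundle-tasks inputs 100)))
;;     (master-slave
;;       :generate-task-input #'(lambda () (pop bundles))
;;       :do-task (make-bundled-do-task #'my-fnc)
;;       :check-task-result #'(lambda (task-in task-out) ... 'NO-ACTION)))
```

  Each TASK-IN is then a list of inputs and each TASK-OUT the list of
  corresponding results, so check-task-result must unpack the bundle.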