Scheduler: A Distributed Computing Environment (text version)

Slide 1: Cover Page
  * Scheduler: A Distributed Computing Environment
  * <Graphic>: showing a web of interconnected computers


Slide 2: Agenda
  * History
  * Overview
  * Usage Summary
  * Future Direction
  * Underlying Programs


Slide 3: History
  * System required to speed-up the yearly reprocessing of large collections.
  * Tried Parallel Virtual Machine (PVM) and Condor (7/98).
  * Moved to our own custom built program starting in 5/99.
  * Reasons for Move:
     - Problems with Condor not working during system backups
     - Wanted ability to run more than 1 job at a time
     - Wanted more control over how things were run


Slide 4: Design Goals
  * Provide easy to use web-based access to the facility
  * Use unused computing cycles on workstations
  * Allow users to schedule access to their workstation
  * Provide fault tolerant system to reduce potential for data loss
  * Allow multiple jobs to run at the same time versus having to sequence jobs
  * Ensure user and data security
  * Has to be extensible
  * Has to be easy to administer and do much automatically


Slide 5: Technologies Used
  * "The Simple Sockets Library" (SSL) from NASA Goddard Space Flight Center's
    Intelligent Robotics Laboratory (IRL). SSL (written in C) "C-like" wrapper
    to TCP/IP sockets
  * Main programs written in C
  * CGI scripts, HTML, JavaScript, and Perl for web-based interface
  * Apache HTTP server with Server Side Includes and security (user
    authentication and data access)


Slide 6: Highlights
  * User's can schedule when their workstation is used in the pool
  * Single non-root user (condor) has access to all workstations within pool
  * Users are emailed upon completion of job and told status of job and how to
    retrieve results
  * There is a behind the scenes "scrubber" program to remove jobs after 15 days
  * Easy to add new programs as long as they conform to our standard input and
    output API.
  * Currently have 18 workstations in our pool


Slide 7: Overview Diagram


Slide 8: Diagram Showing Scheduler Flow
   1) The scheduler is started by the condor user on nls9 and immediately
      starts the tagger_monitor on skr1.  A run_clt is also started by the
      scheduler program at this time on each of the workstations in the
      distributed pool.

   2) The tagger_monitor is ran on skr1 to provide a heartbeat verification of
      the workstation and the tagger_server process for the scheduler program.

   3) User submits a job or status request via the SKR web page.

   4) The user request causes the "foo_batch_meta" cgi-script to be run which
      establishes a TCP/IP socket connection to the scheduler program and
      transmits the request to the scheduler.

   5) When a new job is ready to be processed, the scheduler handles the
      distribution, status checking, error verification, and coordination of the
      processing for the job.  A "file_cnt" program is forked off at job start
      to count the number of citations within the input file.  Once the file
      count is completed, the results are returned to the scheduler.

   6) The run_clt runs the specified application (SKR, MTI, MetaMap, Edgar,
      Arbiter, or a Generic application) on the individual citation that was
      given to it by the scheduler program and returns the results to the
      scheduler program.


Slide 9: Fault Tolerance
  * Tagger monitor serves as a monitor checking that the tagger_server is still
    available. Scheduler pauses if tagger_server goes down
  * If workstation goes down, jobs are resubmitted to other workstations.
    Workstations automatically added back in.
  * If problems with item, retried twice, then marked as error
  * Email notification to admin if any problems
  * Results file check-pointed whenever proper sequenced items are returned
  * Queue check-pointed frequently to allow resume on restart


Slide 10: User Jobs
  * Suite of known programs (SKR, MetaMap, MTI, Edgar, Arbiter, and PhraseX)
    with specific logic and interfaces.
  * Two Generic Modes
     - With or Without EOT Validation
     - Must reside in /nfsvol/nls/bin for security reasons
  * Remotely view interim/final results and status of job
  * Suspend, Resume, or Rerun jobs (user owned)
  * Priority Scheme (Normal, Medium (10:1), and High (100% of queue).  Medium 
    and High are reserved.


Slide 11: Batch Job Status Window
  * Shows real-time status of all jobs in the queue
  * Provides "estimates" of completion time
  * Shows snapshot of performance
  * <Graphic>: showing picture of the Batch Job Status Window


Slide 12: Workstation Scheduling
  * Only allowed by designated responsible user/admin
  * Tells us when we can and can't use a workstation
  * We allow the last running item to complete before removing ourselves from
    the workstation
  * <Graphic>: showing picture of the Workstation Scheduling Window


Slide 13: Resource Allocation
  * The Scheduler uses resource utilization (actual + projected) within a
    sliding window to determine which job receives the next available resource.
  * Job with lowest total score receives next resource.
  * Takes into account both actual usage + estimated usage based on items
    running but not completed yet.
  * Also takes into consideration Medium priority by dividing overall usage by
    10 to provide 10:1 ratio of usage.
  * <Graphic>: showing picture of the Resource Allocation part of Load Balance
    Window.


Slide 14: Resource Allocation (continued)
  * Each window represents 10 seconds
  * Currently based allocation on two windows
     - More windows penalizes running jobs versus new jobs
     - Less windows insufficient data to properly decide
  * Still working out best number of windows to use
  * We want to allow faster running jobs more access to the queue to balance
    their utilization over a job that takes more time to process an item.


Slide 15: Load Balance Window
  * Access by Administrator only
  * Provides real-time display of load balancing algorithm
  * Green (active), White (not active), Grey (not available)
  * <Graphic>: showing picture of the Load Balance Window - shows which jobs
    are allocated to which computer, how many computers each job has, and
    the third table shows the resource allocation or time used by each job.


Slide 16: Administrator
  * Start/Stop/Pause Queue
  * Reprioritize Jobs
  * Suspend/Resume/Rerun Jobs
  * Modify all Workstation Schedules
  * View Load Balance Information
  * Modify Load Balance Sliding Window
  * Modify Individual Item Timeout 


Slide 17: Usage Summary
  * 58 Unique Users
  * 4,218 Unique Jobs since April 25, 2000
  * Used every night for overnight DCMS processing automatically resubmitting
    same job with new data.
  * Jobs range from 1 item to well over a million items.  The only limit is
    disk space.
  * Has cut yearly processing time from weeks to days
  * System has evolved into a general purpose tool capable of assisting
    researchers in processing large datasets.


Slide 18: Future Direction
  * Move towards Linux for servers
     - New multi-cpu Linux server for Scheduler
     - New multi-cpu Linux server for Textool server
     - Looking to move tagger_servers to Linux as well
  * Integrate Textool_monitor into Scheduler heartbeat
  * Further refinement of resource allocation algorithm
  * Look at ways to improve efficiency


Slide 19: Underlying Programs
  * run_clt -- Simply waits for the Scheduler to give it a job to run, runs it,
        and returns the results.
  * tagger_monitor -- Runs on skr1 and runs ps to see if a tagger_server process
        is running, sleeps 5 seconds, and repeats.  When prompted by the
        Scheduler, it responds with a heartbeat.
  * Textool_monitor -- Runs on mti1 and  doesn't communicate with the Scheduler.
        Monitors the Textool server via ps command and kills if > 95% CPU usage.
        The server is restarted automatically.
  * force_error -- Allows admin to force an error on a job which forces a
        checkpoint write. 


Slide 19: Underlying Programs (continued)
  * fix_error -- Allows us to manually rerun any errors that occur in batch and
        splice the correct results into the results file in the proper location.
  * clean_hist -- Handles the cleaning of batch directories and Interactive
        files that are more than 15 days old and updates the list of available
        batches for rerun/resume.