Performance Tuning: Condor for Intergraph 2013 — Best Practices

Overview

  • Goal: Improve throughput, reduce job latency, and increase reliability when running Intergraph 2013 workloads under HTCondor (Condor).
  • Scope: Scheduling, resource configuration, job submission patterns, storage and network considerations, monitoring, and troubleshooting.

1. Understand the Workload

  • Classify jobs: short interactive, medium batch, long-running simulations, I/O-heavy vs CPU-bound.
  • Profile resource use: measure CPU, memory, disk I/O, network, and license usage per job type.
  • Set targets: desired queue wait time, throughput (jobs/hour), and acceptable failure/retry rate.

2. Condor Cluster Topology & Daemons

  • Run the key daemons: condor_master (on every node), condor_schedd (per submit host), condor_collector and condor_negotiator (on the central manager), and condor_startd (on execute nodes).
  • Dedicated roles: separate collector/negotiator on robust hosts to reduce load on submit/execute machines.
  • High availability: use multiple collectors, and HTCondor's high-availability daemon (condor_had) for the central manager, to avoid single points of failure.
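
  The role separation above is expressed through DAEMON_LIST in each host's condor_config. A minimal sketch (hostnames are placeholders):

```
# condor_config on the dedicated central manager
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# condor_config on a submit host
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, SCHEDD

# condor_config on an execute node
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, STARTD
```

  Keeping the collector/negotiator off busy submit and execute machines prevents scheduling cycles from competing with job load.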

3. Resource ClassAds and Machine Attributes

  • Define machine ClassAds that reflect actual capacities: Cpus, Memory, Disk, LoadAvg, Arch, OpSys, and custom tags (e.g., GPU, SSD, license_pool).
  • Use accurate Slot configuration: prefer contiguous multi-core slots for multi-threaded Intergraph tasks; or dynamic slots for mixed workloads.
  • Example attributes to include:
    • Memory = 32768
    • Cpus = 16
    • Disk = 500000
    • HasSSD = True
    • LicensePools = "Intergraph2013:10"
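
  These attributes can be advertised from an execute node's condor_config; the sketch below also shows a partitionable (dynamic) slot covering the whole machine, with illustrative capacities:

```
# condor_config on an execute node — values are illustrative
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=16, memory=32768, disk=500000
SLOT_TYPE_1_PARTITIONABLE = True

# Custom attributes published in the machine ClassAd
HasSSD = True
LicensePools = "Intergraph2013:10"
STARTD_ATTRS = $(STARTD_ATTRS), HasSSD, LicensePools
```

  A partitionable slot lets the startd carve out exactly what each job requests, which suits mixed single-core and multi-core Intergraph workloads.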

4. Job Submission Best Practices

  • Use submit files with explicit requirements and rank expressions:
    • requirements = (OpSys == "LINUX") && (Memory >= 8192) && (Cpus >= 4)
    • rank = (HasSSD ? 100 : 0) + (Memory/1024)
  • Limit preemption for long-running or license-sensitive jobs:
    • request_cpus, request_memory in submit description
    • set job ClassAd attributes like JobLeaseDuration, job_machine_attrs
  • Batch small short jobs into job arrays or DAGs to reduce schedd overhead.
  • For I/O-heavy Intergraph tasks, request nodes tagged HasSSD and prefer local scratch via rank.
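
  Putting these pieces together, a submit description for an I/O-heavy task might look like the following sketch (executable name and input files are placeholders):

```
# submit file for an I/O-heavy Intergraph task
universe        = vanilla
executable      = run_intergraph.sh
request_cpus    = 4
request_memory  = 8192
requirements    = (OpSys == "LINUX") && (HasSSD =?= True)
rank            = Memory / 1024

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = project.dgn, settings.cfg
queue
```

  The =?= operator matches only machines that actually define HasSSD, so nodes without the attribute are excluded rather than silently matched.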

5. I/O and Storage Optimization

  • Prefer local scratch on execution nodes for temporary working files; stage in/out via HTCondor's built-in file transfer or fast parallel transfer (rsync, scp, or GridFTP, depending on environment).
  • If using shared storage, ensure it’s not a bottleneck: use multiple NAS targets, enable NFS tuning (rsize/wsize), or use clustered filesystems (Lustre/GPFS).
  • Schedule heavy I/O jobs during off-peak hours; apply a priority penalty for jobs hitting shared storage.
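
  A job wrapper can stage work into the node-local scratch directory that the condor_starter provides via $_CONDOR_SCRATCH_DIR. A minimal sketch (file names are placeholders; the real Intergraph invocation replaces the echo):

```shell
# Stage into node-local scratch, run, stage results back.
# Falls back to a temp dir when run outside HTCondor.
SCRATCH="${_CONDOR_SCRATCH_DIR:-$(mktemp -d)}"
SUBMIT_DIR="$PWD"

[ -f input.dat ] && cp input.dat "$SCRATCH/"   # stage in (if present)
cd "$SCRATCH" || exit 1

echo "processed" > output.dat                  # placeholder for the real Intergraph run

cp output.dat "$SUBMIT_DIR/"                   # stage out
cd "$SUBMIT_DIR"
```

  Running against local scratch keeps heavy temporary I/O off shared storage entirely.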

6. Network and License Management

  • Monitor and throttle license usage: define a LicensePool attribute and requirements so jobs only run if a license is available.
  • Use condor_status and custom ClassAd attributes (or HTCondor's built-in concurrency limits) to track license counts and ensure licenses are acquired and released properly.
  • Optimize network topology: colocate busiest submit hosts close to collectors and shared storage; ensure low-latency links for file staging.
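
  HTCondor's concurrency limits can enforce a license pool without custom accounting: the negotiator never matches more jobs than the limit allows. A sketch (the limit name and count are assumptions):

```
# Central-manager config: at most 10 concurrent Intergraph 2013 licenses
INTERGRAPH2013_LIMIT = 10

# Submit file: each running job consumes one unit of the limit
concurrency_limits = intergraph2013
```

  Jobs over the limit simply stay idle in the queue until a license-holding job finishes, which avoids launch-then-fail churn against the license server.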

7. Scheduling Policies and Fairshare

  • Configure negotiator policy to reflect organizational priorities:
    • set preemption and fairness policy with PREEMPTION_REQUIREMENTS, PREEMPTION_RANK, and NEGOTIATOR_PRE_JOB_RANK / NEGOTIATOR_POST_JOB_RANK.
  • Use group quotas and fairshare to guarantee resources for high-priority Intergraph users while preventing starvation.
  • Implement preemption policies sparingly; better to use checkpointing for long jobs if supported.
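
  Group quotas are configured on the central manager and referenced from submit files via an accounting-group attribute. A sketch with illustrative group names and shares:

```
# Central-manager config — groups and quotas are illustrative
GROUP_NAMES = group_intergraph, group_other
GROUP_QUOTA_DYNAMIC_group_intergraph = 0.6
GROUP_QUOTA_DYNAMIC_group_other      = 0.4
GROUP_ACCEPT_SURPLUS = True

# Submit file: place the job in a group (username is a placeholder)
+AccountingGroup = "group_intergraph.username"
```

  GROUP_ACCEPT_SURPLUS lets an under-loaded group's unused share flow to busy groups instead of sitting idle.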

8. Checkpointing and Fault Tolerance

  • Enable checkpointing where possible to resume long simulations after preemption or node failure; Condor's standard-universe checkpointing requires relinking the binary, so commercial Intergraph executables will usually need application-level checkpoint/restart instead.
  • Use job retries with exponential backoff for transient failures.
  • Maintain clean shutdown/startup scripts on execution nodes to avoid orphaned filesystem locks.
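
  Automatic retries with a growing delay can be expressed directly in the submit description using hold/release policy expressions; thresholds below are illustrative:

```
# Hold jobs that exit non-zero, then release them with a delay
# that grows with each attempt, up to 5 attempts.
on_exit_hold     = (ExitCode =!= 0)
periodic_release = (NumJobStarts < 5) && \
                   ((time() - EnteredCurrentStatus) > (60 * NumJobStarts))
```

  NumJobStarts and EnteredCurrentStatus are maintained by HTCondor itself, so no wrapper-script bookkeeping is needed for transient-failure retries.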

9. Monitoring, Metrics, and Alerts

  • Collect metrics: job wait time, run time, slot utilization, I/O wait, network throughput, license usage, failure rates.
  • Use condor_history, condor_q, condor_status plus external monitoring (Prometheus + Grafana, Nagios) for dashboards and alerts.
  • Alert on sustained high load, low free licenses, repeated job failures, or collector/negotiator unavailability.
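
  Wait-time and run-time metrics can be computed from job records such as those printed by `condor_history -af QDate JobStartDate CompletionDate`. A small sketch with synthetic sample data:

```python
# Compute queue-wait and run-time metrics from (QDate, JobStartDate,
# CompletionDate) tuples, all in epoch seconds.

def job_metrics(records):
    """records: iterable of (qdate, start, completion) epoch seconds."""
    waits = [start - qdate for qdate, start, _ in records]
    runs = [done - start for _, start, done in records]
    return {
        "avg_wait_s": sum(waits) / len(waits),
        "avg_run_s": sum(runs) / len(runs),
        "max_wait_s": max(waits),
    }

# Synthetic sample: two completed jobs
sample = [
    (1000, 1060, 1660),   # waited 60 s, ran 600 s
    (1000, 1120, 2320),   # waited 120 s, ran 1200 s
]
print(job_metrics(sample))  # → {'avg_wait_s': 90.0, 'avg_run_s': 900.0, 'max_wait_s': 120}
```

  Feeding these numbers into Prometheus or Grafana gives trend lines for the alert thresholds listed above.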

10. Tuning Parameters to Consider

  • SCHEDD:
    • MAX_JOBS_PER_OWNER, MAX_JOBS_RUNNING, JobRouter settings (if routing jobs), and SCHEDD_INTERVAL adjustments to reduce churn.
  • NEGOTIATOR:
    • NEGOTIATOR_INTERVAL, NEGOTIATOR_MAX_TIME_PER_CYCLE, and policy config for match frequency and fairness.
  • STARTD:
    • STARTD_CRON for maintenance, and dynamic slot policy to balance single/multi-core jobs.
  • Daemon communication:
    • increase timeouts (e.g., NEGOTIATOR_TIMEOUT, SEC_TCP_SESSION_TIMEOUT) in high-latency networks.
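
  A consolidated config fragment covering the knobs above might look like this; every value is illustrative and should be tuned against measured pool behavior:

```
# Illustrative tuning values only — benchmark before and after changing
SCHEDD_INTERVAL               = 60
MAX_JOBS_PER_OWNER            = 5000
NEGOTIATOR_INTERVAL           = 120
NEGOTIATOR_MAX_TIME_PER_CYCLE = 600
SEC_TCP_SESSION_TIMEOUT       = 60
```

  Per the testing guidance below, change one value at a time and re-run the benchmark workload before touching the next.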

11. Testing and Validation

  • Run representative workload benchmarks after tuning: measure throughput, average wait time, and job success rate.
  • Use synthetic loads to validate preemption, checkpointing, and license-pool behavior.
  • Iterate: change one parameter at a time and measure impact.

12. Common Troubleshooting Tips

  • If jobs stall at transfer: verify permissions, network, and storage performance.
  • If low utilization: check mismatched requirements (e.g., overly restrictive Memory or OpSys checks).
  • If many preemptions: lower preemption aggressiveness or reserve nodes for long jobs.
  • If licenses exhausted: verify license pool accounting and reduce greedy acquisitions.
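
  Reserving nodes for long jobs can be done with a startd START expression keyed on a custom submit-file attribute; +LongJob here is an assumption, not a built-in:

```
# Execute-node config: this node only accepts jobs flagged long-running
START = (TARGET.LongJob =?= True)

# Matching line in the submit file of a long-running job
+LongJob = True
```

  For the other symptoms, `condor_q -better-analyze <jobid>` reports exactly which requirement clauses excluded which machines, which is the fastest way to find an overly restrictive Memory or OpSys check.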

Conclusion

  • Focus on matching Condor configuration to Intergraph job characteristics: accurate ClassAds, appropriate slot topology, storage-aware scheduling, license-aware constraints, and continuous monitoring. Small, measured changes followed by benchmarking yield the best improvements.
