Performance Tuning: Condor for Intergraph 2013 — Best Practices
Overview
- Goal: Improve throughput, reduce job latency, and increase reliability when running Intergraph 2013 workloads under HTCondor (Condor).
- Scope: Scheduling, resource configuration, job submission patterns, storage and network considerations, monitoring, and troubleshooting.
1. Understand the Workload
- Classify jobs: short interactive, medium batch, long-running simulations, I/O-heavy vs CPU-bound.
- Profile resource use: measure CPU, memory, disk I/O, network, and license usage per job type.
- Set targets: desired queue wait time, throughput (jobs/hour), and acceptable failure/retry rate.
2. Condor Cluster Topology & Daemons
- Run the key daemons: condor_master, condor_schedd (per submit host), condor_collector, condor_negotiator, and condor_startd (on execute nodes).
- Dedicated roles: separate collector/negotiator on robust hosts to reduce load on submit/execute machines.
- High availability: use multiple collectors, and the condor_had daemon for central-manager failover where possible, to avoid single points of failure.
3. Resource ClassAds and Machine Attributes
- Define machine ClassAds that reflect actual capacities: Cpus, Memory, Disk, LoadAvg, Arch, OpSys, and custom tags (e.g., GPU, SSD, license_pool).
- Use accurate Slot configuration: prefer contiguous multi-core slots for multi-threaded Intergraph tasks; or dynamic slots for mixed workloads.
- Example attributes to include:
- Memory = 32768
- Cpus = 16
- Disk = 500000
- HasSSD = True
- LicensePools = "Intergraph2013:10"
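The attributes above can be advertised from an execute node's local configuration. A minimal sketch — HasSSD and LicensePools are illustrative custom names, not standard ClassAd attributes, and the partitionable-slot lines assume you want dynamic slots:

```
# condor_config.local on an execute node (illustrative values)
HasSSD = True
LicensePools = "Intergraph2013:10"
STARTD_ATTRS = $(STARTD_ATTRS) HasSSD, LicensePools

# One partitionable slot so single- and multi-core jobs can share the node
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Jobs can then match on these names in their requirements and rank expressions.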
4. Job Submission Best Practices
- Use submit files with explicit requirements and rank expressions:
- requirements = (OpSys == "LINUX") && (Memory >= 8192) && (Cpus >= 4)
- rank = ifThenElse(HasSSD =?= True, 100, 0) + (Memory/1024)
- Limit preemption for long-running or license-sensitive jobs:
- request_cpus, request_memory in submit description
- set job ClassAd attributes like JobLeaseDuration, job_machine_attrs
- Batch small short jobs into job arrays or DAGs to reduce schedd overhead.
- For I/O-heavy Intergraph tasks, request nodes tagged HasSSD and prefer local scratch via rank.
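Putting these pieces together, a hedged example submit description for an I/O-heavy task — the executable name, input file, and the HasSSD tag are placeholders, not Intergraph-specific values:

```
# intergraph_job.sub -- illustrative submit description
universe        = vanilla
executable      = run_intergraph.sh
arguments       = model.dat
request_cpus    = 4
request_memory  = 8192
requirements    = (OpSys == "LINUX") && (HasSSD =?= True)
rank            = ifThenElse(HasSSD =?= True, 100, 0) + (Memory/1024)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = model.dat
job_lease_duration      = 2400
queue 50
```

The `queue 50` line batches 50 similar jobs through one submit file, which keeps schedd overhead down compared with 50 separate submissions.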
5. I/O and Storage Optimization
- Prefer local scratch on execution nodes for temporary working files; stage in/out via HTCondor's built-in file transfer or fast external tools (rsync, scp, or GridFTP, depending on environment).
- If using shared storage, ensure it’s not a bottleneck: use multiple NAS targets, enable NFS tuning (rsize/wsize), or use clustered filesystems (Lustre/GPFS).
- Schedule heavy I/O jobs during off-peak hours; apply a priority penalty for jobs hitting shared storage.
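If shared NFS storage is on the staging path, larger read/write buffers often help bulk transfers. A hedged /etc/fstab sketch — the server name, export path, and buffer sizes are illustrative and should be validated against your NFS server and kernel:

```
# /etc/fstab on execute nodes -- tune rsize/wsize for bulk staging
nas01:/export/intergraph  /mnt/intergraph  nfs  rw,rsize=1048576,wsize=1048576,hard,noatime  0 0
```

Measure with a representative staging workload before and after; oversized buffers can hurt on congested links.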
6. Network and License Management
- Monitor and throttle license usage: define a LicensePool attribute and requirements so jobs only run if a license is available.
- Use condor_status and custom ClassAd attributes to track license counts, and make sure jobs acquire and release licenses cleanly.
- Optimize network topology: colocate busiest submit hosts close to collectors and shared storage; ensure low-latency links for file staging.
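HTCondor's built-in concurrency limits handle the acquire/release accounting automatically and are a natural fit here. A sketch assuming a pool of 10 Intergraph 2013 seats — the limit name is arbitrary:

```
# Central manager (negotiator) configuration
INTERGRAPH2013_LIMIT = 10

# In each license-consuming job's submit description
concurrency_limits = intergraph2013
```

The negotiator then refuses to start an eleventh job tagged with this limit, regardless of free slots, which is simpler and more robust than hand-rolled ClassAd bookkeeping.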
7. Scheduling Policies and Fairshare
- Configure negotiator policy to reflect organizational priorities:
- set fairness and preemption behavior with PREEMPTION_REQUIREMENTS, PREEMPTION_RANK, and the startd RANK and WANT_SUSPEND/WANT_VACATE expressions.
- Use group quotas and fairshare to guarantee resources for high-priority Intergraph users while preventing starvation.
- Implement preemption policies sparingly; better to use checkpointing for long jobs if supported.
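Group quotas can be sketched as below — group names, the 60% share, and the user name are illustrative assumptions, not recommendations:

```
# Negotiator configuration: reserve 60% of the pool for the
# Intergraph group; let groups borrow idle surplus
GROUP_NAMES = group_intergraph, group_other
GROUP_QUOTA_DYNAMIC_group_intergraph = 0.6
GROUP_ACCEPT_SURPLUS = TRUE

# In each submit description: attach the job to its group
+AccountingGroup = "group_intergraph.alice"
```

With surplus sharing enabled, the guarantee only bites under contention, so idle capacity is not wasted.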
8. Checkpointing and Fault Tolerance
- Enable Condor checkpointing (if Intergraph jobs support it) to resume long simulations after preemption or node failure.
- Use job retries with exponential backoff for transient failures.
- Maintain clean shutdown/startup scripts on execution nodes to avoid orphaned filesystem locks.
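The retry-with-backoff idea can be approximated in a submit description with hold/release expressions. A hedged sketch — the thresholds are illustrative, and a linear delay stands in for exponential backoff, which plain ClassAd expressions do not express as directly:

```
# Put a failed job on hold, then release it after a delay that
# grows with each attempt; give up after 5 starts.
on_exit_hold     = (ExitCode =!= 0)
periodic_release = (JobStatus == 5) && (NumJobStarts < 5) && ((time() - EnteredCurrentStatus) > (300 * NumJobStarts))
```

JobStatus 5 is "held"; EnteredCurrentStatus timestamps the hold, so each release waits 5 minutes longer than the last.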
9. Monitoring, Metrics, and Alerts
- Collect metrics: job wait time, run time, slot utilization, I/O wait, network throughput, license usage, failure rates.
- Use condor_history, condor_q, condor_status plus external monitoring (Prometheus + Grafana, Nagios) for dashboards and alerts.
- Alert on sustained high load, low free licenses, repeated job failures, or collector/negotiator unavailability.
10. Tuning Parameters to Consider
- SCHEDD:
- MAX_JOBS_PER_OWNER, MAX_JOBS_RUNNING, and SCHEDD_INTERVAL adjustments to reduce churn.
- NEGOTIATOR:
- NEGOTIATOR_INTERVAL, NEGOTIATOR_CYCLE_DELAY, and PREEMPTION_REQUIREMENTS for match frequency and fairness.
- STARTD:
- STARTD_CRON for maintenance, and dynamic slot policy to balance single/multi-core jobs.
- DAEMON_COMMUNICATION:
- increase timeouts (e.g., NEGOTIATOR_TIMEOUT) in high-latency networks.
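Collected in one place, a hedged configuration fragment — the values are starting points to benchmark against your own workload, not recommendations:

```
# condor_config fragment -- illustrative starting values
SCHEDD_INTERVAL        = 120   # seconds between schedd housekeeping passes
NEGOTIATOR_INTERVAL    = 300   # seconds between negotiation cycles
NEGOTIATOR_CYCLE_DELAY = 20    # minimum seconds between consecutive cycles
NEGOTIATOR_TIMEOUT     = 120   # widen for high-latency schedd links
MAX_JOBS_PER_OWNER     = 5000
```

Change one value at a time and re-measure, per the testing guidance below.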
11. Testing and Validation
- Run representative workload benchmarks after tuning: measure throughput, average wait time, and job success rate.
- Use synthetic loads to validate preemption, checkpointing, and license-pool behavior.
- Iterate: change one parameter at a time and measure impact.
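A cheap way to generate synthetic load is a sleep-job array; a minimal sketch (adjust the count and duration to your pool size):

```
# sleep_load.sub -- 200 placeholder jobs of 5 minutes each
universe       = vanilla
executable     = /bin/sleep
arguments      = 300
request_cpus   = 1
request_memory = 512
queue 200
```

Add the same requirements, rank, and concurrency_limits lines as your real jobs to exercise matching, preemption, and license accounting without consuming Intergraph licenses.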
12. Common Troubleshooting Tips
- If jobs stall at transfer: verify permissions, network, and storage performance.
- If low utilization: check mismatched requirements (e.g., overly restrictive Memory or OpSys checks).
- If many preemptions: lower preemption aggressiveness or reserve nodes for long jobs.
- If licenses exhausted: verify license pool accounting and reduce greedy acquisitions.
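For most of the symptoms above, HTCondor's analysis tools are the first stop; the job ID and user name below are placeholders:

```shell
# Why is job 1234.0 not matching? Reports which machines fail which clause.
condor_q -better-analyze 1234.0

# Which slots are unclaimed, and what do they advertise?
condor_status -avail -af Name Cpus Memory

# Recent failures for one user
condor_history -constraint 'Owner == "alice" && ExitCode != 0' -limit 20
```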
Conclusion
- Focus on matching Condor configuration to Intergraph job characteristics: accurate ClassAds, appropriate slot topology, storage-aware scheduling, license-aware constraints, and continuous monitoring. Small, measured changes followed by benchmarking yield the best improvements.