Performance Tuning: Condor for Intergraph 2013 — Best Practices

Overview

  • Goal: Improve throughput, reduce job latency, and increase reliability when running Intergraph 2013 workloads under HTCondor (Condor).
  • Scope: Scheduling, resource configuration, job submission patterns, storage and network considerations, monitoring, and troubleshooting.

1. Understand the Workload

  • Classify jobs: short interactive, medium batch, long-running simulations, I/O-heavy vs CPU-bound.
  • Profile resource use: measure CPU, memory, disk I/O, network, and license usage per job type.
  • Set targets: desired queue wait time, throughput (jobs/hour), and acceptable failure/retry rate.

2. Condor Cluster Topology & Daemons

  • Run the key daemons: condor_master (on every node), condor_schedd (per submit host), condor_collector and condor_negotiator (on the central manager), and condor_startd (on execute nodes).
  • Dedicated roles: separate collector/negotiator on robust hosts to reduce load on submit/execute machines.
  • High availability: use multiple collectors, and HTCondor's high-availability daemon (condor_had) for the central manager, to avoid single points of failure.
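
  The role separation above is expressed through DAEMON_LIST in each host's condor_config. A minimal sketch (hostnames are placeholders):

```
# condor_config on the dedicated central manager
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

# condor_config on a submit host
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, SCHEDD

# condor_config on an execute node
CONDOR_HOST = cm.example.com
DAEMON_LIST = MASTER, STARTD
```

  Keeping the collector/negotiator off busy submit and execute machines prevents scheduling cycles from competing with job load.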

3. Resource ClassAds and Machine Attributes

  • Define machine ClassAds that reflect actual capacities: Cpus, Memory, Disk, LoadAvg, Arch, OpSys, and custom tags (e.g., GPU, SSD, license_pool).
  • Use accurate Slot configuration: prefer contiguous multi-core slots for multi-threaded Intergraph tasks; or dynamic slots for mixed workloads.
  • Example attributes to include:
    • Memory = 32768
    • Cpus = 16
    • Disk = 500000
    • HasSSD = True
    • LicensePools = "Intergraph2013:10"
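
  These attributes can be advertised from an execute node's condor_config; the sketch below also shows a partitionable (dynamic) slot covering the whole machine, with illustrative capacities:

```
# condor_config on an execute node — values are illustrative
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=16, memory=32768, disk=500000
SLOT_TYPE_1_PARTITIONABLE = True

# Custom attributes published in the machine ClassAd
HasSSD = True
LicensePools = "Intergraph2013:10"
STARTD_ATTRS = $(STARTD_ATTRS), HasSSD, LicensePools
```

  A partitionable slot lets the startd carve out exactly what each job requests, which suits mixed single-core and multi-core Intergraph workloads.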

4. Job Submission Best Practices

  • Use submit files with explicit requirements and rank expressions:
    • requirements = (OpSys == "LINUX") && (Memory >= 8192) && (Cpus >= 4)
    • rank = (HasSSD ? 100 : 0) + (Memory/1024)
  • Limit preemption for long-running or license-sensitive jobs:
    • request_cpus, request_memory in submit description
    • set job ClassAd attributes like JobLeaseDuration, job_machine_attrs
  • Batch small short jobs into job arrays or DAGs to reduce schedd overhead.
  • For I/O-heavy Intergraph tasks, request nodes tagged HasSSD and prefer local scratch via rank.
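
  Putting these pieces together, a submit description for an I/O-heavy task might look like the following sketch (executable name and input files are placeholders):

```
# submit file for an I/O-heavy Intergraph task
universe        = vanilla
executable      = run_intergraph.sh
request_cpus    = 4
request_memory  = 8192
requirements    = (OpSys == "LINUX") && (HasSSD =?= True)
rank            = Memory / 1024

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = project.dgn, settings.cfg
queue
```

  The =?= operator matches only machines that actually define HasSSD, so nodes without the attribute are excluded rather than silently matched.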

5. I/O and Storage Optimization

  • Prefer local scratch on execution nodes for temporary working files; stage in/out via HTCondor's built-in file transfer or fast parallel transfer (rsync, scp, or GridFTP, depending on environment).
  • If using shared storage, ensure it’s not a bottleneck: use multiple NAS targets, enable NFS tuning (rsize/wsize), or use clustered filesystems (Lustre/GPFS).
  • Schedule heavy I/O jobs during off-peak hours; apply a priority penalty for jobs hitting shared storage.
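
  A job wrapper can stage work into the node-local scratch directory that the condor_starter provides via $_CONDOR_SCRATCH_DIR. A minimal sketch (file names are placeholders; the real Intergraph invocation replaces the echo):

```shell
# Stage into node-local scratch, run, stage results back.
# Falls back to a temp dir when run outside HTCondor.
SCRATCH="${_CONDOR_SCRATCH_DIR:-$(mktemp -d)}"
SUBMIT_DIR="$PWD"

[ -f input.dat ] && cp input.dat "$SCRATCH/"   # stage in (if present)
cd "$SCRATCH" || exit 1

echo "processed" > output.dat                  # placeholder for the real Intergraph run

cp output.dat "$SUBMIT_DIR/"                   # stage out
cd "$SUBMIT_DIR"
```

  Running against local scratch keeps heavy temporary I/O off shared storage entirely.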

6. Network and License Management

  • Monitor and throttle license usage: define a LicensePool attribute and requirements so jobs only run if a license is available.
  • Use condor_status and custom ClassAd attributes (or HTCondor's built-in concurrency limits) to track license counts and ensure licenses are acquired and released properly.
  • Optimize network topology: colocate busiest submit hosts close to collectors and shared storage; ensure low-latency links for file staging.
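
  HTCondor's concurrency limits can enforce a license pool without custom accounting: the negotiator never matches more jobs than the limit allows. A sketch (the limit name and count are assumptions):

```
# Central-manager config: at most 10 concurrent Intergraph 2013 licenses
INTERGRAPH2013_LIMIT = 10

# Submit file: each running job consumes one unit of the limit
concurrency_limits = intergraph2013
```

  Jobs over the limit simply stay idle in the queue until a license-holding job finishes, which avoids launch-then-fail churn against the license server.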

7. Scheduling Policies and Fairshare

  • Configure negotiator policy to reflect organizational priorities:
    • set preemption and fairness policy with PREEMPTION_REQUIREMENTS, PREEMPTION_RANK, and NEGOTIATOR_PRE_JOB_RANK / NEGOTIATOR_POST_JOB_RANK.
  • Use group quotas and fairshare to guarantee resources for high-priority Intergraph users while preventing starvation.
  • Implement preemption policies sparingly; better to use checkpointing for long jobs if supported.
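
  Group quotas are configured on the central manager and referenced from submit files via an accounting-group attribute. A sketch with illustrative group names and shares:

```
# Central-manager config — groups and quotas are illustrative
GROUP_NAMES = group_intergraph, group_other
GROUP_QUOTA_DYNAMIC_group_intergraph = 0.6
GROUP_QUOTA_DYNAMIC_group_other      = 0.4
GROUP_ACCEPT_SURPLUS = True

# Submit file: place the job in a group (username is a placeholder)
+AccountingGroup = "group_intergraph.username"
```

  GROUP_ACCEPT_SURPLUS lets an under-loaded group's unused share flow to busy groups instead of sitting idle.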

8. Checkpointing and Fault Tolerance

  • Enable checkpointing where possible to resume long simulations after preemption or node failure; Condor's standard-universe checkpointing requires relinking the binary, so commercial Intergraph executables will usually need application-level checkpoint/restart instead.
  • Use job retries with exponential backoff for transient failures.
  • Maintain clean shutdown/startup scripts on execution nodes to avoid orphaned filesystem locks.
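
  Automatic retries with a growing delay can be expressed directly in the submit description using hold/release policy expressions; thresholds below are illustrative:

```
# Hold jobs that exit non-zero, then release them with a delay
# that grows with each attempt, up to 5 attempts.
on_exit_hold     = (ExitCode =!= 0)
periodic_release = (NumJobStarts < 5) && \
                   ((time() - EnteredCurrentStatus) > (60 * NumJobStarts))
```

  NumJobStarts and EnteredCurrentStatus are maintained by HTCondor itself, so no wrapper-script bookkeeping is needed for transient-failure retries.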

9. Monitoring, Metrics, and Alerts

  • Collect metrics: job wait time, run time, slot utilization, I/O wait, network throughput, license usage, failure rates.
  • Use condor_history, condor_q, condor_status plus external monitoring (Prometheus + Grafana, Nagios) for dashboards and alerts.
  • Alert on sustained high load, low free licenses, repeated job failures, or collector/negotiator unavailability.
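
  Wait-time and run-time metrics can be computed from job records such as those printed by `condor_history -af QDate JobStartDate CompletionDate`. A small sketch with synthetic sample data:

```python
# Compute queue-wait and run-time metrics from (QDate, JobStartDate,
# CompletionDate) tuples, all in epoch seconds.

def job_metrics(records):
    """records: iterable of (qdate, start, completion) epoch seconds."""
    waits = [start - qdate for qdate, start, _ in records]
    runs = [done - start for _, start, done in records]
    return {
        "avg_wait_s": sum(waits) / len(waits),
        "avg_run_s": sum(runs) / len(runs),
        "max_wait_s": max(waits),
    }

# Synthetic sample: two completed jobs
sample = [
    (1000, 1060, 1660),   # waited 60 s, ran 600 s
    (1000, 1120, 2320),   # waited 120 s, ran 1200 s
]
print(job_metrics(sample))  # → {'avg_wait_s': 90.0, 'avg_run_s': 900.0, 'max_wait_s': 120}
```

  Feeding these numbers into Prometheus or Grafana gives trend lines for the alert thresholds listed above.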

10. Tuning Parameters to Consider

  • SCHEDD:
    • MAX_JOBS_PER_OWNER, MAX_JOBS_RUNNING, JobRouter settings (if routing jobs), and SCHEDD_INTERVAL adjustments to reduce churn.
  • NEGOTIATOR:
    • NEGOTIATOR_INTERVAL, NEGOTIATOR_MAX_TIME_PER_CYCLE, and policy config for match frequency and fairness.
  • STARTD:
    • STARTD_CRON for maintenance, and dynamic slot policy to balance single/multi-core jobs.
  • Daemon communication:
    • increase timeouts (e.g., NEGOTIATOR_TIMEOUT, SEC_TCP_SESSION_TIMEOUT) in high-latency networks.
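
  A consolidated config fragment covering the knobs above might look like this; every value is illustrative and should be tuned against measured pool behavior:

```
# Illustrative tuning values only — benchmark before and after changing
SCHEDD_INTERVAL               = 60
MAX_JOBS_PER_OWNER            = 5000
NEGOTIATOR_INTERVAL           = 120
NEGOTIATOR_MAX_TIME_PER_CYCLE = 600
SEC_TCP_SESSION_TIMEOUT       = 60
```

  Per the testing guidance below, change one value at a time and re-run the benchmark workload before touching the next.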

11. Testing and Validation

  • Run representative workload benchmarks after tuning: measure throughput, average wait time, and job success rate.
  • Use synthetic loads to validate preemption, checkpointing, and license-pool behavior.
  • Iterate: change one parameter at a time and measure impact.

12. Common Troubleshooting Tips

  • If jobs stall at transfer: verify permissions, network, and storage performance.
  • If low utilization: check mismatched requirements (e.g., overly restrictive Memory or OpSys checks).
  • If many preemptions: lower preemption aggressiveness or reserve nodes for long jobs.
  • If licenses exhausted: verify license pool accounting and reduce greedy acquisitions.
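
  Reserving nodes for long jobs can be done with a startd START expression keyed on a custom submit-file attribute; +LongJob here is an assumption, not a built-in:

```
# Execute-node config: this node only accepts jobs flagged long-running
START = (TARGET.LongJob =?= True)

# Matching line in the submit file of a long-running job
+LongJob = True
```

  For the other symptoms, `condor_q -better-analyze <jobid>` reports exactly which requirement clauses excluded which machines, which is the fastest way to find an overly restrictive Memory or OpSys check.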

Conclusion

  • Focus on matching Condor configuration to Intergraph job characteristics: accurate ClassAds, appropriate slot topology, storage-aware scheduling, license-aware constraints, and continuous monitoring. Small, measured changes followed by benchmarking yield the best improvements.
