Connect to the submit host
ssh cerhouse1
Prepare your SSH environment
Create your SSH key pair (do not set a passphrase; just press ENTER when prompted):
ssh-keygen -t rsa -f ~/.ssh/id_rsa_cercloud
Add the generated public key to your authorized_keys2 file to allow connections without a password:
cat ~/.ssh/id_rsa_cercloud.pub >> ~/.ssh/authorized_keys2
chmod 600 ~/.ssh/authorized_keys2
Add these lines to your ~/.ssh/config (create the file if needed):
Host cerhouse* cercloud* 134.246.156.* br156-* 10.0.0.*
    IdentityFile ~/.ssh/id_rsa_cercloud
    StrictHostKeyChecking no
    UserKnownHostsFile=/dev/null
Set the correct file permissions if needed:
chmod 600 ~/.ssh/config
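To confirm that the permissions set above are in place, a small check along these lines may help. This is only a sketch: check_mode is a hypothetical helper name, and `stat -c` assumes GNU coreutils (Linux).

```shell
# Sketch: report whether a file has the expected octal permission mode.
# check_mode is a hypothetical helper; `stat -c` assumes GNU coreutils (Linux).
check_mode() {
    file=$1
    expected=$2
    if [ ! -e "$file" ]; then
        echo "MISSING $file"
        return 1
    fi
    actual=$(stat -c '%a' "$file")
    if [ "$actual" = "$expected" ]; then
        echo "OK $file ($actual)"
    else
        echo "BAD $file (mode $actual, expected $expected)"
        return 1
    fi
}

# Usage (uncomment to run against your own files):
# check_mode ~/.ssh/config 600
# check_mode ~/.ssh/authorized_keys2 600
# check_mode ~/.ssh/id_rsa_cercloud 600
```

Each call prints one line per file, so a wrong mode is easy to spot before trying to connect.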
To check that everything is set up correctly, run ssh cerhouse1 from cerhouse1 itself. It should not ask for a password:
yourusername@cerhouse1:~> ssh cerhouse1
Warning: Permanently added 'cerhouse1,134.246.158.137' (RSA) to the list of known hosts.
...
...
yourusername@cerhouse1:~>
If you get the prompt back without being asked for a password, the setup works.
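The same check can be done non-interactively. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password; check_passwordless_ssh is a hypothetical helper name and the 5-second timeout is an arbitrary choice.

```shell
# Sketch: non-interactive check that key-based login works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# check_passwordless_ssh is a hypothetical helper name.
check_passwordless_ssh() {
    host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "passwordless login to $host: OK"
    else
        echo "passwordless login to $host: FAILED"
        return 1
    fi
}

# Usage, from cerhouse1:
# check_passwordless_ssh cerhouse1
```

A FAILED result usually means the key, the authorized_keys2 entry, or the file permissions need another look.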
Run a distributed job and monitor it
Let’s say that you want to run a script that takes a single number as its argument.
In the following example, we run the “sleep” command with a number of seconds as its argument (it does nothing but wait n seconds before exiting) on the cloudphys cluster, reserving 500 MB of RAM for each execution of “sleep NUMBER”:
yourusername@cerhouse1:~> seq 30 90 | /home5/begmeil/tools/gogolist/bin/gogolist.py \
    --stdin --workspace ./workspace \
    --execute 'sleep' --qsub-options='-l nodes=1:cloudphys,mem=500mb' \
    --split-max-lines=1 \
    --reporting
That’s all! If everything is running correctly, a status report is printed every minute:
Job workspace : ./workspace/20120802/000002
Job successfully registered in monitor. Go to : http://cercloudweb/jobsmonitor/job/29111/
Batch Manager : torque
Job id : 54908[].cerhouse1.ifremer.fr
job name:sleep id:54908[].cerhouse1.ifremer.fr (Q:0 / R:0 / C:0 / E:0 / H:0 / W:0 / X:0) )
No running jobs.
Remaining Jobs to process : 61
(... 60 seconds later...)
job name:sleep id:54908[].cerhouse1.ifremer.fr (Q:0 / R:35 / C:26 / E:0 / H:0 / W:0 / X:0) )
Remaining Jobs to process (including currently running) : 35
[2012-08-02T17:35:55Z] Jobs launched: 61/61 (running: 35 terminated: 26)
Exit OK = 26 | Exit ERROR = 0 | Lines submitted = 26/61 (42.62%)
exec time : mean=0:00:42.576923, sum=0:18:27
(... 60 seconds later...)
job name:sleep id:54908[].cerhouse1.ifremer.fr (Q:0 / R:0 / C:61 / E:0 / H:0 / W:0 / X:0) )
job completed (id=54908[].cerhouse1.ifremer.fr) workspace: ./workspace/20120802/000002
Remaining Jobs to process (including currently running) : 0
[2012-08-02T17:36:55Z] Jobs launched: 61/61 (running: 0 terminated: 61)
Exit OK = 61 | Exit ERROR = 0 | Lines submitted = 61/61 (100.00%)
exec time : mean=0:01:00.098360, sum=1:01:06
Job workspace: ./workspace/20120802/000002
Job is TERMINATED
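The figures in the report can be cross-checked locally. The sketch below assumes, as the output suggests, that each input line becomes one “sleep NUMBER” task; it submits nothing to the cluster and only counts the lines produced by seq 30 90 and sums the requested sleep times.

```shell
# Sketch: reproduce the job count and the expected total run time locally.
# Nothing is submitted to the cluster here.
count=$(seq 30 90 | wc -l | tr -d ' ')
total=$(seq 30 90 | awk '{ s += $1 } END { print s }')
echo "tasks: $count"
echo "total sleep time: $total seconds"

# Preview the first three commands implied by --execute 'sleep'
seq 30 90 | head -n 3 | while read -r n; do
    echo "sleep $n"
done
```

This gives 61 tasks and 3660 seconds (1:01:00) of sleeping in total, which matches the 61/61 jobs in the report and is close to its sum=1:01:06; the few extra seconds are presumably per-task overhead.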
Notes
While the job is running, you can also monitor it through the Jobsmonitor web interface, at the URL given in the output. In this example: http://cercloudweb/jobsmonitor/job/29111/
Once the job is launched, you can interrupt the reporting with CTRL-C; the job keeps running. To restart the reporting, run:
/home5/begmeil/tools/gogolist/bin/gogolist.py report \
    --loop-time=10 \
    ./workspace/20120802/000002
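If you do not remember the workspace path, a helper like the one below can look it up. It is only a sketch: latest_workspace is a hypothetical function, not part of gogolist, and it assumes the BASE/YYYYMMDD/NNNNNN directory layout seen in the example output.

```shell
# Sketch: print the most recent job workspace under a base directory.
# Assumes the BASE/YYYYMMDD/NNNNNN layout seen in the example output;
# latest_workspace is a hypothetical helper, not part of gogolist.
latest_workspace() {
    base=$1
    # Zero-padded names sort chronologically as plain strings.
    for dir in "$base"/*/*/; do
        [ -d "$dir" ] && echo "${dir%/}"
    done | sort | tail -n 1
}

# Usage:
# /home5/begmeil/tools/gogolist/bin/gogolist.py report --loop-time=10 "$(latest_workspace ./workspace)"
```

The function prints nothing when no workspace exists, so check its output before passing it to the report command.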