Friday, 10 December 2010

Rocks Cluster Config

Shutdown Cluster

/opt/rocks/sbin/cluster-fork shutdown


/opt/rocks/sbin/cluster-fork poweroff (if kernel and bios agree)

Compute node removal
rocks remove host compute-0-14
insert-ethers --remove=compute-0-14
insert-ethers --update
rocks sync config

Add/remove Nodes
Remove node with
rocks remove host compute-0-14
followed by
rocks sync config
and then run
insert-ethers --cabinet=0 --rank=14
and then pxe boot it?
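
The cabinet and rank passed to insert-ethers match the two numbers in the node name, so the sequence above can be generated from the host name alone. This is only a sketch: the function names insert_ethers_cmd and readd_node_cmds are made up for illustration, it assumes nodes follow the compute-<cabinet>-<rank> naming used in these notes, and it only prints the commands instead of running them.

```shell
# Print the insert-ethers command for a node named compute-<cabinet>-<rank>.
# Hypothetical helper; it prints the command and does not run it.
insert_ethers_cmd() {
    # Reject anything that is not compute-<digits>-<digits>
    if ! printf '%s\n' "$1" | grep -Eq '^compute-[0-9]+-[0-9]+$'; then
        echo "unexpected host name: $1" >&2
        return 1
    fi
    ie_rest=${1#compute-}      # e.g. "0-14"
    ie_cabinet=${ie_rest%-*}   # part before the dash, e.g. "0"
    ie_rank=${ie_rest#*-}      # part after the dash, e.g. "14"
    echo "insert-ethers --cabinet=$ie_cabinet --rank=$ie_rank"
}

# Print the remove / sync / re-insert sequence for one node.
readd_node_cmds() {
    ie_cmd=$(insert_ethers_cmd "$1") || return 1
    echo "rocks remove host $1"
    echo "rocks sync config"
    echo "$ie_cmd"
}
```

Running readd_node_cmds compute-0-14 prints the three commands in order so they can be checked before being pasted into a root shell on the head node.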

Watch the /var/log/daemon log file for a DHCPREQUEST from the MAC
address of that node. Once you see the request and the offer of the IP address,
insert-ethers should show that it has found the new node. Then check whether
anything appears in /var/log/httpd/ssl_request_log from that IP
address. A fresh node should ask for kickstart.cgi.
Check for dhcp requests etc
tail -f /var/log/messages
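
The log is easier to read if it is filtered down to the DHCP lines for the node being installed. This is a rough sketch: the function name dhcp_lines_for_mac is made up for illustration, and it assumes the DHCP server logs the usual DHCPDISCOVER, DHCPOFFER, DHCPREQUEST and DHCPACK keywords to the file being searched.

```shell
# Print DHCP log lines that mention a given MAC address (case-insensitive).
# $1 = MAC address of the node, $2 = log file (defaults to /var/log/messages)
# Hypothetical helper for illustration.
dhcp_lines_for_mac() {
    grep -E 'DHCP(DISCOVER|OFFER|REQUEST|ACK)' "${2:-/var/log/messages}" |
        grep -i -F -- "$1"
}
```

For a live view, the same filter can be applied to the tail output, for example tail -f /var/log/messages | grep -i followed by the node's MAC address.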

Check Kickstart file is being correctly generated
rocks list host profile compute-0-0 > /tmp/ks.cfg

Check you can download kickstart file
wget --no-check-certificate https://localhost/install/sbin/public/kickstart.cgi

Sync Config
rocks sync config

Set node to be OS rescued or reinstalled
rocks set host pxeboot compute-x-y action=rescue/install

List all hosts On Cluster
cat /etc/hosts

IP Address for Node
host compute-0-3

New Node Install No IP address received
A new node sometimes doesn't get an IP address via DHCP during PXE boot. A look at the head node's messages log shows no leases available. To fix this, run:

/etc/init.d/syslog restart

Problem with Ganglia Webpage
/etc/init.d/gmetad restart

/etc/init.d/gmond restart

Reinstall Node Problem


Then we tried to insert it:
insert-ethers --cabinet=0 --rank=14

It still failed at "choose a language".

It didn't show # symbol when .
Kickstart file not loading on compute node.

ls -ld /root

Gives … drwx------ 21 root root 4096 Jun 24 12:01 /root
ls -ld /root/.my.cnf

Gives … -r--r----- 1 root apache 28 Nov 25 2008 /root/.my.cnf

The problem downloading the kickstart file was caused by the permissions on /root.

It was fixed with chmod o+r /root and chmod o+x /root.

After these two commands, the permissions on /root were:

drwx---r-x 21 root root 4096 Jun 24 12:01 /root

This cured the install problem.
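
The same check can be scripted so it is quick to repeat the next time a kickstart download fails. This is a sketch only: the function name others_can_traverse is made up for illustration, and it assumes GNU stat (the -c option), as found on the Linux head node.

```shell
# Succeed if "others" have both read and execute permission on a directory.
# Assumes GNU stat; hypothetical helper for illustration.
others_can_traverse() {
    oct_mode=$(stat -c %A "$1") || return 1
    # %A looks like drwx---r-x; drop the first seven characters
    # to leave the three "other" permission bits.
    oct_other=${oct_mode#???????}
    case "$oct_other" in
        r?[xt]) return 0 ;;   # read plus execute (or execute with sticky bit)
        *)      return 1 ;;
    esac
}
```

On the head node, others_can_traverse /root returns success once both chmod commands have been applied, and failure with the original drwx------ mode.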

Install in Nodes

Now copy the file from the head node to the compute node.

scp /root/.ssh/ root@compute-0-45:/root/.ssh/

Now log in to the compute node.

ssh compute-0-45

Copy contents of file and append them to the authorized_keys file.

cat /root/.ssh/ >> /root/.ssh/authorized_keys

Restart Ganglia

Sometimes the Ganglia web page on the head node shows all nodes as down, even though they can be pinged and reached over ssh from the console and are clearly alive. Restarting the Ganglia daemons fixes this:

service gmond restart

service gmetad restart

Run a command on all nodes
This runs the cat command on all nodes, collects the results on the head node, and redirects the output to a file. This gives a list of hostnames and MAC addresses in a text file.

[root@blub~]#cluster-fork cat /etc/sysconfig/network-scripts/ifcfg-eth0:0 | egrep "compute|HWADDR" > HostHWaddr.txt
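
The resulting file has the hostname and the MAC address on separate lines, so a small awk filter can join them into one line per node. This is a sketch under assumptions: the function name pair_hosts_macs is made up, and it assumes each node's output is preceded by a line starting with the node name (optionally followed by a colon) and that the MAC appears as HWADDR=... at the start of a line.

```shell
# Turn alternating "compute-x-y:" and "HWADDR=..." lines into "host MAC" pairs.
# Reads the named files, or standard input if none are given.
# Hypothetical helper for illustration.
pair_hosts_macs() {
    awk '
        # Host header line: remember the name, minus any trailing colon
        /^compute/ { host = $1; sub(/:$/, "", host); next }
        # MAC line: strip the key and any quotes, then print the pair
        /^HWADDR=/ {
            mac = $0
            sub(/^HWADDR=/, "", mac)
            gsub(/"/, "", mac)
            print host, mac
        }
    ' "$@"
}
```

For example, pair_hosts_macs HostHWaddr.txt prints one line per node with the hostname followed by its MAC address.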

Debug Commands Installation

Console Use Keystroke
Shell prompt
Installation log
System messages
Other messages
