Friday 10 December 2010

Rocks Cluster Config

Shutdown Cluster

/opt/rocks/sbin/cluster-fork shutdown

or

/opt/rocks/sbin/cluster-fork poweroff (if the kernel and BIOS support it)

Compute node removal
rocks remove host compute-0-14
insert-ethers --remove=compute-0-14
insert-ethers --update
rocks sync config


Add/remove Nodes
Remove node with
rocks remove host compute-0-14
followed by
rocks sync config
and then run
insert-ethers --cabinet=0 --rank=14
and then PXE boot it.

Watch the /var/log/daemon log file for a DHCPREQUEST from the MAC
address of that node. Once you see the request and the offer of the IP address,
insert-ethers should show that it has found the new node. Then check whether
anything appears in /var/log/httpd/ssl_request_log from that IP
address; a fresh node should request kickstart.cgi.
Check for DHCP requests etc
tail -f /var/log/messages
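
One way to watch for all of this at once (my own sketch, not from the original notes):

tail -f /var/log/messages /var/log/httpd/ssl_request_log | egrep -i 'dhcp|kickstart'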

Check Kickstart file is being correctly generated
rocks list host profile compute-0-0 > /tmp/ks.cfg

Check you can download kickstart file
wget --no-check-certificate https://localhost/install/sbin/public/kickstart.cgi

Sync Config
rocks sync config

Set node to be OS rescued or reinstalled
rocks set host pxeboot compute-x-y action=install (or action=rescue)

List all hosts On Cluster
cat /etc/hosts

IP Address for Node
host compute-0-3



New Node Install: No IP Address Received
The new node sometimes doesn't get an IP address via DHCP during PXE boot. A look in the head node's messages log shows 'no leases available'. To fix this do:-

/etc/init.d/syslog restart

Problem with Ganglia Webpage
/etc/init.d/gmetad restart

/etc/init.d/gmond restart



Reinstall Node Problem

24/6/09

Then we tried to insert it:
insert-ethers --cabinet=0 --rank=14

It still failed at "choose a language".

It didn't show the # symbol when …
The kickstart file was not loading on the compute node.


ls -ld /root

Gives … drwx------ 21 root root 4096 Jun 24 12:01 /root
ls -ld /root/.my.cnf

Gives … -r--r----- 1 root apache 28 Nov 25 2008 /root/.my.cnf

The problem with the download of the kickstart file turned out to be the /root permissions.

It was fixed with chmod o+r /root and chmod o+x /root

After the above two commands were run, the /root permissions were:-

drwx---r-x 21 root root 4096 Jun 24 12:01 /root

This cured the install problem.

Install id_rsa.pub in Nodes

Now copy the id_rsa.pub file from the head node to the compute node.

scp /root/.ssh/id_rsa.pub root@compute-0-45:/root/.ssh/linux.pub

Now log in to the compute node.

ssh compute-0-45

Append the contents of the linux.pub file to the authorized_keys file.

cat /root/.ssh/linux.pub >> /root/.ssh/authorized_keys
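
The same copy-and-append can be done in one step from the head node; this is my own shorthand rather than the procedure above:

cat /root/.ssh/id_rsa.pub | ssh root@compute-0-45 'cat >> /root/.ssh/authorized_keys'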


Restart Ganglia

Sometimes the Ganglia web page on the head node shows all nodes as down, even though they can be sshed into and pinged from the console and are very much alive.

service gmond restart

service gmetad restart

Run a command on all nodes
This runs the cat command on all nodes, collects the results on the head node and redirects the output to a file, giving a list of hostnames and MAC addresses in a text file.

[root@blub ~]# cluster-fork cat /etc/sysconfig/network-scripts/ifcfg-eth0:0 | egrep "compute|HWADDR" > HostHWaddr.txt



Debug Commands Installation

Console   Use               Keystroke
1         Installation      Ctrl-Alt-F1
2         Shell prompt      Ctrl-Alt-F2
3         Installation log  Ctrl-Alt-F3
4         System messages   Ctrl-Alt-F4
5         Other messages    Ctrl-Alt-F5

Cluster Head Node Overnight Temperature

Using the Temperature sensors on the Motherboard

The command sensors-detect was used to set up the sensors. The sensors command was then used in a script, with its output piped to the cut command to extract the motherboard temperature, which was redirected to a data file temp.txt along with a comma to delimit the data. Data collection was performed every 5 minutes in an infinite loop running overnight. The data file temp.txt was then imported into a spreadsheet as a CSV (comma-separated values) file.

The Cheap and nasty Script used


#!/bin/bash
# append the motherboard temperature to temp.txt every 5 minutes, with a comma as delimiter
while true
do
    # pull the board temperature line from the sensors output
    temp=$(sensors | grep low | grep -v Temp | cut -d\( -f1)
    echo $temp >> temp.txt
    echo , >> temp.txt
    sleep 300
done


It's not the best script, but it does what I wanted it to do, IMHO.


Wednesday 8 September 2010

IPTables Example

Head Node IPTables example to open port 1099 and save the rules.

Add input rule
iptables -A INPUT -p tcp --dport 1099 -j ACCEPT


Add output rule
iptables -A OUTPUT -p tcp --dport 1099 -j ACCEPT


Save the rules so they survive a reboot
/sbin/service iptables save

Run a command on all nodes

This runs the cat command on all nodes, collects the results on the head node and redirects the output to a file, giving a list of hostnames and MAC addresses in a text file.

[root@HOST ~]# cluster-fork cat /etc/sysconfig/network-scripts/ifcfg-eth0:0 | egrep "compute|HWADDR" > HostHWaddr.txt

Monday 30 August 2010

SSH Tunnel Example

Tunnel SSH from the local machine to the remote machine, and from the remote
machine to a machine on the remote machine's network.

Local to remote machine with 5900 tunnel
ssh -L 5900:127.0.0.1:5900 -l username -p 22 theactualurl.net

Remote to machine on remote network with 5900 to 443 tunnel
sudo ssh -L 5900:127.0.0.1:443 -l username -p 22 remotemachine.local

This allowed the local machine to connect to a webserver on the remote machine's network using https. The address on the local machine is https://localhost:5900 or https://127.0.0.1:5900; the connection is tunnelled through port 5900 but the actual server uses port 443.

Wednesday 25 August 2010

NIC IP Aliases

I needed to add aliases for some IP addresses to the Ethernet card on the Rocks Cluster to allow individual nodes to have their own public IP addresses.
I copied /etc/sysconfig/network-scripts/ifcfg-eth0 to ifcfg-eth0:0 and edited the new file with vi to contain

DEVICE=eth0:0
HWADDR=xx:xx:xx:xx:xx:xx
IPADDR=xxx.xxx.xxx.xx
NETMASK=255.255.255.0
BOOTPROTO=static
ONBOOT=yes


Run /sbin/service network restart to apply the changes.

/sbin/ifconfig shows the new configuration.

The change was tested by connecting the node switch to the public 211 subnet; the IP address could then be pinged from that subnet, showing that the changes were working.
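
Putting the steps above together, roughly (the interface name and addresses are placeholders):

cd /etc/sysconfig/network-scripts
cp ifcfg-eth0 ifcfg-eth0:0
vi ifcfg-eth0:0                  # set DEVICE=eth0:0, HWADDR, IPADDR, NETMASK as above
/sbin/service network restart    # activate the alias
/sbin/ifconfig eth0:0            # confirm the new address is up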
Info On Binding IP Addresses

Thursday 29 July 2010

Rocks Cluster NAT

NAT was not enabled on the cluster, so the nodes had no access to the public network. This was fixed by editing /etc/sysconfig/iptables with vim and adding the following lines:-

*nat
-A POSTROUTING -o eth1 -j MASQUERADE
COMMIT


to the beginning of the file.

Save the file using :wq

Then restart the service using:-

/sbin/service iptables restart

Then sync the config using :-

rocks sync config

NAT is now working, allowing internet access from the compute nodes on the private network.
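
As a sanity check (my addition, not part of the original note), the MASQUERADE rule only does anything if IP forwarding is enabled on the head node:

cat /proc/sys/net/ipv4/ip_forward    # should print 1; if it prints 0, forwarding is off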

Rocks Cluster iptables

I needed to modify the iptables to allow connections on some extra tcp ports.
I edited the /etc/sysconfig/iptables file using vim. The extra lines added are shown below:-

# http and https is allowed for all nodes on the public subnet
-A INPUT -m state --state NEW -p tcp --dport 9618 -j ACCEPT
-A INPUT -m state --state NEW -p tcp --dport 9614 -j ACCEPT

The iptables service was restarted using the command /sbin/service iptables restart.
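
A quick way to confirm the new rules are loaded (my own check, not from the original note):

/sbin/iptables -L INPUT -n | egrep '9618|9614'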

Monday 5 July 2010

NFS not working on some nodes

NFS not working

Some nodes were not working, and issuing the rpcinfo -p command showed that there
was an NFS mounting problem (only portmapper and status were registered, with no
nfs, mountd or nlockmgr services):
jimp@compute-2:~$ rpcinfo -p
program vers proto port
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 52121 status
100024 1 tcp 59769 status


The /etc/hosts file was changed to match the other nodes;
the rpcinfo command now gives the following.

jimp@compute-5:~$ rpcinfo -p hadoop
program vers proto port
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100024 1 udp 56797 status
100024 1 tcp 54606 status
100021 1 udp 53628 nlockmgr
100021 3 udp 53628 nlockmgr
100021 4 udp 53628 nlockmgr
100021 1 tcp 53694 nlockmgr
100021 3 tcp 53694 nlockmgr
100021 4 tcp 53694 nlockmgr
100003 2 udp 2049 nfs
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100003 2 tcp 2049 nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100005 1 udp 40126 mountd
100005 1 tcp 37582 mountd
100005 2 udp 40126 mountd
100005 2 tcp 37582 mountd
100005 3 udp 40126 mountd
100005 3 tcp 37582 mountd



Thursday 1 July 2010

Add User to Sudo users

Add User to Sudo users file

To give a user sudo rights, use the command sudo adduser some_user admin, which adds them to the admin group. This will allow some_user to issue sudo commands as well as the original sudo user.
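
A minimal sketch of the command plus a follow-up check (some_user is a placeholder name):

sudo adduser some_user admin   # adds some_user to the admin group on Ubuntu
sudo -l -U some_user           # lists which commands some_user may now run via sudo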

Monday 28 June 2010

IPTables Lost After Reboot of Head Node

IPTables Lost After Reboot of Head Node

The iptables rules were lost every time the machine was rebooted, so I found I needed to save them to a file, /etc/iptables.rules.
Use the following command to set up NAT for the compute nodes: sudo iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE
Then, on Ubuntu, become root using sudo su and use the command iptables-save > /etc/iptables.rules
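
Saving the rules alone isn't enough; they still need to be reloaded at boot. One common approach (an assumption on my part, not something recorded above) is a pre-up line in /etc/network/interfaces:

# /etc/network/interfaces (the eth1 stanza is illustrative only)
auto eth1
iface eth1 inet dhcp
    pre-up iptables-restore < /etc/iptables.rules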

Wednesday 16 June 2010

Wake on LAN

WakeonLAN

I have been working with the wakeonlan command and it does not work with just the MAC address, e.g. wakeonlan 00:30:48:70:45:8c.
I have, however, got it to work with wakeonlan -i 192.168.1.255 00:30:48:70:45:8c,
i.e. using the broadcast address for the private network the node is on.
I have tried it a good number of times and it consistently switches on the compute-4 node.
I think the main problem was the wakeonlan command not knowing where to send the magic packet: the bridge network or the private network.
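
A small sketch for waking several nodes at once, assuming a hypothetical file nodes_macs.txt with one MAC address per line:

#!/bin/bash
# send a magic packet to every MAC listed in nodes_macs.txt (hypothetical file)
while read -r mac; do
    wakeonlan -i 192.168.1.255 "$mac"
done < nodes_macs.txt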

Monday 7 June 2010

Mac OS X Command Line Software Update

Command line Software Update

I often want to do a software update on Mac OS X from the command line. I use the following command.

sudo softwareupdate -i -a

Thursday 3 June 2010

SSH Connection Problems

Ssh-agent Problem

I attempted to use the command 'ssh-agent' and then 'ssh-add' to enter my RSA passphrase and allow me to log in to a host without typing my passphrase in all the time. The error I got after entering 'ssh-add' was 'Could not open a connection to your authentication agent'.
The solution was to enter the command 'exec ssh-agent bash', then the command 'ssh-add', then the passphrase. It worked after this; I'm not sure why, as it used to work before. It could have been an upgrade in Ubuntu 9.10.
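
The working sequence, with an optional check of my own added at the end:

exec ssh-agent bash   # replace the current shell with one attached to a fresh agent
ssh-add               # add the RSA key; prompts for the passphrase once
ssh-add -l            # optional: list the keys the agent now holds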

Hadoop Cluster Problems

Network Instability

The cluster had intermittent network availability: it would only sometimes accept ssh connections, and the connections only seemed to last about 10 minutes before they were disconnected.
It was initially thought that it might have been something to do with the DHCP server, but it turned out that this was not the cause.
One of the symptoms was that during a ping to the DNS server one would get a few returns, then a few Destination Host Unreachable errors.
It was eventually traced to having two gateways set up in '/etc/network/interfaces', one on the public network and one on the private network; correcting this to a single public gateway fixed the problem.
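
A sketch of the sort of /etc/network/interfaces layout that fixed it; the addresses are placeholders I have made up, only the single-gateway idea comes from the note above:

# public interface carries the one and only default gateway
auto eth1
iface eth1 inet static
    address 192.0.2.10
    netmask 255.255.255.0
    gateway 192.0.2.1

# private cluster interface: no gateway line here
auto eth0
iface eth0 inet static
    address 10.1.1.1
    netmask 255.255.0.0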

Setting up Open-SSH in Ubuntu

OpenSSH

I set up Open-SSH by installing the ssh package using the command 'sudo apt-get -y install ssh'.
I then edited the config file /etc/ssh/sshd_config to include a reference to a banner txt file, change the SSH port to 11000 and add a 'UseDNS no' entry, which cures some login delays.

The banner.txt file is displayed when you first log in to the machine remotely; it gives warnings about acceptable use policy etc. I chose to listen on port 11000, which has prevented a lot of login attempts on my home server.
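
The relevant sshd_config lines look roughly like this (the banner path is an assumption on my part):

# /etc/ssh/sshd_config
Port 11000
Banner /etc/ssh/banner.txt
UseDNS no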

To test the server I used a command 'scp -P 11000 file.txt jimp@mumetal:/home/jimp' from a remote machine this transferred a file.txt from the remote machine to the home directory on my machine using ssh.

Wednesday 2 June 2010

Ubuntu Enterprise Cloud

Hadoop Cluster

I had problems with ssh passwordless logins taking ages on the hadoop Ubuntu cluster I was working on. This was fixed by adding a 'UseDNS no' entry to /etc/ssh/sshd_config file and restarting the ssh daemon using the command sudo /etc/init.d/ssh restart.

The compute nodes on the cluster could not access the internet so NAT had to be setup on the head hadoop node.

NAT was set up using the command 'sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE', where eth0 is the network card with access to the private network, and the '/proc/sys/net/ipv4/ip_forward' file entry should be 1.
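
The note doesn't record how the ip_forward flag was set; one way to do it (a sketch, not necessarily what was done) is:

echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward   # enables forwarding until the next reboot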

Restart dnsmasq using the following command 'sudo /etc/init.d/dnsmasq restart'.

Configure the NAT with the following command 'iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE', where eth1 is the network card with access to the internet.

I tested the cluster by pinging a URL on the internet with the command 'ping -c 4 google.com'; the result was as follows:

PING google.com (209.85.227.104) 56(84) bytes of data.
64 bytes from wy-in-f104.1e100.net (209.85.227.104): icmp_seq=1 ttl=51 time=18.8 ms
64 bytes from wy-in-f104.1e100.net (209.85.227.104): icmp_seq=2 ttl=51 time=18.8 ms
64 bytes from wy-in-f104.1e100.net (209.85.227.104): icmp_seq=3 ttl=51 time=18.9 ms
64 bytes from wy-in-f104.1e100.net (209.85.227.104): icmp_seq=4 ttl=51 time=18.8 ms

--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 18.800/18.840/18.912/0.106 ms


thus proving that the NAT settings had worked.

Information for this blog post came from http://www.ubuntugeek.com/sharing-internet-connection-in-ubuntu.html

Friday 14 May 2010

Multiple User Login on Ubuntu

There was a problem with multiple users not being able to log in to the nodes on a cluster. The solution was to copy the entries for each user from /etc/passwd and /etc/shadow on the head node to the corresponding files on the nodes, and also to create the user's home directory on each node. This seems to have corrected the problem.
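
A rough sketch of that copy, with made-up user and node names standing in for the real ones:

# append one user's passwd and shadow entries from the head node to a compute node
grep '^someuser:' /etc/passwd | ssh node01 'cat >> /etc/passwd'
grep '^someuser:' /etc/shadow | ssh node01 'cat >> /etc/shadow'
# create the home directory on the node as well
ssh node01 'mkdir -p /home/someuser && chown someuser:someuser /home/someuser'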
