[R-sig-hpc] snow, socket cluster: problem with path to rscript
luke at stat.uiowa.edu
Mon Apr 20 16:19:40 CEST 2009
Glad it is working now.
By default snow uses Sys.info()["nodename"] on the master to determine
the name of the master that is used for the back connection. If you
supply an alternative as master="123...." with the IP address then that
should work around not having the master name known on the worker.
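
A minimal sketch of that workaround in R, assuming the snow package is installed. The helper name start_cluster_by_ip is hypothetical, and the two addresses in the usage comment are the ones from the messages quoted below; master is the snow option described above.

```r
# Sketch only: start_cluster_by_ip is a hypothetical helper, not part of snow.
# It passes the master's IP address to makeCluster() so that the workers do
# not need to resolve the master's host name.
start_cluster_by_ip <- function(workers, master_ip) {
  snow::makeCluster(workers, type = "SOCK", master = master_ip)
}

# The name snow would use by default for the back connection:
default_master <- Sys.info()[["nodename"]]
print(default_master)

# Usage with the addresses from this thread (needs ssh access to the worker):
# cl <- start_cluster_by_ip("192.100.100.210", master_ip = "192.100.100.212")
# snow::clusterCall(cl, function() Sys.info()[["nodename"]])
# snow::stopCluster(cl)
```

If the worker can reach that address, the back connection should no longer depend on /etc/hosts on the worker.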
luke
On Mon, 20 Apr 2009, Matthieu Stigler wrote:
> I solved the problem ;-)
>
> I think the issue comes from snow expecting the name of the master to be
> listed in /etc/hosts (typically as a line "IP master_name") on the
> workers. In my case, it wasn't working before, and since I added it, it
> is working!
>
> My mistake was to check only with
>
> #open from the master (212):
> socketConnection(port = 10187, server = TRUE)
>
> #and from the slave
> socketConnection("192.100.100.212", port = 10187) #from 210
>
> #Even if this works, it is not a sufficient condition for snow to work.
> #Indeed,
> socketConnection("master_name", port = 10187) #from 210
> #has to work as well.
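
In other words, the worker must be able to resolve the master's host name, not just reach its IP address. A small sketch of that check in R, to be run on the worker. name_resolves is a hypothetical helper, and utils::nsl() is only available on Unix-alikes.

```r
# Hypothetical helper: TRUE if 'name' resolves to an IP address on this
# machine. utils::nsl() returns the address as a string, or NULL if the
# lookup fails.
name_resolves <- function(name) {
  ip <- tryCatch(suppressWarnings(utils::nsl(name)), error = function(e) NULL)
  !is.null(ip)
}

# Sanity check with a name that is normally present in /etc/hosts:
print(name_resolves("localhost"))

# On a worker, check the name the master reports via Sys.info()["nodename"],
# for example (host name taken from this thread):
# name_resolves("ubuntu")
```

If this returns FALSE for the master's name, adding the name to /etc/hosts on the worker is one way to fix it.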
>
>
>
> Many thanks to Dirk, Luke and Steve, who helped me a lot in
> finding this!!
>
> Matthieu
>
> luke at stat.uiowa.edu wrote:
>> On Fri, 17 Apr 2009, Matthieu Stigler wrote:
>>
>>> Steve Weston wrote:
>>>> I just noticed that you're running R 2.7.1 on your 192.100.100.212
>>>> machine. I believe there are known socketConnection issues
>>>> with that version of R that Luke fixed as of R 2.7.2. So I strongly
>>>> suggest that you upgrade your version of R.
>>>>
>>> I upgraded to R 2.8 but unfortunately this doesn't change anything: port
>>> 10187 is still reported as closed...
>>>
>>> I obviously have a problem opening the port; maybe I should rather post
>>> on the Debian list or on other forums? I use nc -l -p 10187, so that telnet
>>
>> According to my man page that argument combination is not legal so I
>> don't know what you actually did.
>>
>>> xxx.212 10187 is working (I did it on both machines), but I still have that
>>> issue when running makeCluster, and also when running from the worker:
>>>
>>> socketConnection("ubuntu", port = 10187)
>>> 192.100.100.212:10187 cannot be opened
>>>
>>>
>>> and with:
>>>
>>> socketConnection(port = 10187, server = TRUE)
>>>
>>> nothing happens, what is actually the expected output?
>>
>> The server call waits until a connection occurs and then returns an R
>> connection object. The client socketConnection call returns a socket
>> connection if successful and gives an error message if not.
>>
>> So on the master do
>>
>> s <- socketConnection(port = 10187, server = TRUE)
>>
>> and this will wait for a connection and return to the prompt when a
>> connection occurs. On the worker machine
>>
>> telnet master 10187
>>
>> will either succeed and wait until the server socket is closed or fail
>> with an error message about not being able to open the port. If I use
>>
>> nc master 10187
>>
>> then on a successful connection nc waits (for input) until the server
>> closes the socket with close(s) and then returns to the shell prompt.
>> Failure for me is an immediate return to the shell prompt, no error
>> message (and the server side continues to wait).
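
The worker-side half of this test can also be scripted in R. The sketch below wraps the client socketConnection call so that it returns TRUE or FALSE instead of stopping with an error. can_connect is a hypothetical helper name, and the 5 second timeout is an arbitrary choice.

```r
# Hypothetical helper: try to open a client socket to host:port and report
# whether the connection could be opened.
can_connect <- function(host, port, timeout = 5) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host, port = port, blocking = TRUE,
                       open = "a+b", timeout = timeout)
    ),
    error = function(e) NULL   # "cannot open the connection" ends up here
  )
  if (is.null(con)) return(FALSE)
  close(con)   # we only wanted to know whether it opens
  TRUE
}

# Usage: first start the server side on the master,
#   s <- socketConnection(port = 10187, server = TRUE)
# then on the worker, with "master" standing for the master's name or IP:
#   can_connect("master", 10187)
```

A FALSE result while the server call is waiting on the master points to a name resolution or firewall problem rather than to snow itself.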
>>
>> luke
>>
>>>
>>> Thanks a lot for your help and advice!!!
>>>
>>> Mat
>>>> --
>>>> Steve Weston
>>>> REvolution Computing
>>>> One Century Tower | 265 Church Street, Suite 1006
>>>> New Haven, CT 06510
>>>> P: 203-777-7442 x266 | www.revolution-computing.com
>>>>
>>>>
>>>> On Thu, Apr 16, 2009 at 4:52 AM, Matthieu Stigler
>>>> <matthieu.stigler at gmail.com> wrote:
>>>>
>>>>> luke at stat.uiowa.edu wrote:
>>>>>
>>>>>> On Wed, 15 Apr 2009, Matthieu Stigler wrote:
>>>>>>
>>>>>>
>>>>>>> Steve Weston wrote:
>>>>>>>
>>>>>>>> On Tue, Apr 14, 2009 at 5:29 AM, Matthieu Stigler
>>>>>>>> <matthieu.stigler at gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> So it is now working for the local computer. However, when trying to
>>>>>>>>> use the external computer, it seems to be working but nothing happens
>>>>>>>>> after it asks for the last password...
>>>>>>>>>
>>>>>>>>>
>>>>>>>> This tells you that "something went wrong". The basic strategy in this
>>>>>>>> case is to use the "outfile" option to hopefully capture an error
>>>>>>>> message. You might need to set outfile differently for different slaves,
>>>>>>>> particularly if you're starting more than one on the same machine, but I
>>>>>>>> suggest just starting one slave on 210 to avoid the issue. So do
>>>>>>>> something like:
>>>>>>>>
>>>>>>>> > host210 <- list(host = "mat@192.100.100.210",
>>>>>>>> +                 rscript = "/usr/bin/Rscript",
>>>>>>>> +                 outfile = "/tmp/log.txt")
>>>>>>>> > cl2 <- makeCluster(list(host210), type = "SOCK")
>>>>>>>>
>>>>>>> Ok, thanks for pointing out this method.
>>>>>>>
>>>>>>> I tried it and got the following error message. This does not seem to be
>>>>>>> computer specific (I tried it to another host, 213, and from that host
>>>>>>> 213 to 212, always with the same error message):
>>>>>>>
>>>>>>> starting worker for ubuntu:10187
>>>>>>> Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
>>>>>>>   unable to open connection
>>>>>>> Calls: local ... slaveLoop -> recvData -> makeSOCKmaster -> socketConnection
>>>>>>> In addition: Warning message:
>>>>>>> In socketConnection(master, port = port, blocking = TRUE, open = "a+b") :
>>>>>>>   ubuntu:10187 cannot be opened
>>>>>>> Execution halted
>>>>>>>
>>>>>>>
>>>>>>> Is it related to ssh or snow? I did not find any reference to that
>>>>>>> problem when googling for it...
>>>>>>>
>>>>>> It is an issue with your ability to make a socket connection to the
>>>>>> master. Most likely the master computer has a firewall that is
>>>>>> blocking connections to the port snow uses. Try turning the firewall
>>>>>> off or at least enabling the port in the error message.
>>>>>> A simple test is to do
>>>>>>
>>>>>> socketConnection(port = 10187, server = TRUE)
>>>>>>
>>>>>> in an R session on the master and
>>>>>>
>>>>>> telnet ubuntu 10187
>>>>>>
>>>>>> in a shell on your worker machine (assuming your master is called
>>>>>> ubuntu) (or you can use R and
>>>>>>
>>>>>> socketConnection("ubuntu", port = 10187)
>>>>>>
>>>>>> in an R session on the worker).
>>>>>>
>>>>>> luke
>>>>>>
>>>>>>
>>>>> Thanks Luke and Dirk for your help!
>>>>>
>>>>> I don't think it is a firewall error, as both machines have all ports
>>>>> open (the iptables default, as I understand it), and the network admin
>>>>> even opened port 10187 explicitly.
>>>>>
>>>>> I first tried the three solutions suggested; none of them seems to give
>>>>> good results:
>>>>>
>>>>> $ telnet 192.100.100.212 10187
>>>>> Trying 192.100.100.212...
>>>>> telnet: Unable to connect to remote host: Connection refused
>>>>>
>>>>> R> socketConnection(port = 10187, server = TRUE)
>>>>> #nothing happens... is that right?
>>>>>
>>>>> R> socketConnection("192.100.100.212", port = 10187)
>>>>> Error in socketConnection("192.100.100.212", port = 10187) :
>>>>>   cannot open the connection
>>>>> In addition: Warning message:
>>>>> In socketConnection("192.100.100.212", port = 10187) :
>>>>>   192.100.100.212:10187 cannot be opened
>>>>>
>>>>> Same error message when using "ubuntu", dsge@192.100.100.212, etc.
>>>>>
>>>>> On an Ubuntu forum, someone told me that one has to start a server on the
>>>>> port (sorry, my explanations are not good, as I don't understand the
>>>>> subject very well :-( ).
>>>>> So after launching on the master (212):
>>>>>
>>>>> $ nc -l -p 10187
>>>>>
>>>>> one is then able to get on 210:
>>>>>
>>>>> $ telnet 192.100.100.212 10187
>>>>> Trying 192.100.100.212...
>>>>> Connected to 192.100.100.212.
>>>>> Escape character is '^]'.
>>>>>
>>>>> So it seems that it is working, but this has no effect on the previous
>>>>> commands: socketConnection and makeCluster still claim that 10187 can't
>>>>> be opened.
>>>>>
>>>>> With this information, do you see things more clearly, or is it even
>>>>> murkier? Thanks a lot for your help!
>>>>>
>>>>> Matthieu
>>>>>
>>>>>
>>>>>>> Thanks a lot for your help!!
>>>>>>>
>>>>>>>> If it hangs, go to another terminal, ssh to 192.100.100.210, and look
>>>>>>>> at the contents of /tmp/log.txt, and hopefully that will provide a
>>>>>>>> clue to the problem.
>>>>>>>>
>>>>>>>> Another approach is to use the "manual" option. That will print the
>>>>>>>> command that you should use to manually start each of the slaves.
>>>>>>>> You just ssh to that machine from another terminal, and cut and paste
>>>>>>>> the printed command to start the slave. If you set "outfile" to an
>>>>>>>> empty string, then output messages will go right to that terminal.
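
A sketch of that manual start in R, assuming the snow package is installed. start_manual_worker is a hypothetical wrapper; manual and outfile are the snow options mentioned above, and the host in the usage comment is the one from this thread.

```r
# Sketch only: start_manual_worker is a hypothetical helper, not part of snow.
start_manual_worker <- function(host, rscript = "/usr/bin/Rscript") {
  spec <- list(host = host, rscript = rscript)
  # manual = TRUE: snow prints the command to start the worker instead of
  #                launching it over ssh, then waits for the worker to connect.
  # outfile = "":  worker output is not redirected, so messages appear in the
  #                terminal where the worker is started.
  snow::makeCluster(list(spec), type = "SOCK", manual = TRUE, outfile = "")
}

# Usage (interactive; paste the printed command into a shell on the worker):
# cl <- start_manual_worker("192.100.100.210")
# snow::stopCluster(cl)
```

Any socketConnection error from the worker then shows up directly in the terminal where the command was pasted.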
>>>>>>>>
--
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu