Can I compile SparkNet without using CUDA (I want to run on CPU)? #138

Open
prateekarora-git opened this issue Jun 29, 2016 · 11 comments

@prateekarora-git

Hi,
I compiled SparkNet successfully with CUDA 7.0, but when I tried to run the "Train Cifar using SparkNet" application, it shows me:

F0628 17:53:57.325634 29332 cudnn_conv_layer.cpp:52] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
*** Check failure stack trace: ***

My OS is Ubuntu 14.04 running in a virtual machine, and I don't have GPU support right now. Can I test the application on CPU without using CUDA?

If possible, please give me the steps to compile and run the application on CPU.

Regards
Prateek

@robertnishihara
Member

Hi Prateek, take a look at the instructions in #110.

@prateekarora-git
Author

Thanks.
I compiled SparkNet for a CPU cluster.
Then I tried again to run the "Train Cifar using SparkNet" application. This time I got an error in the native library libcaffe.so.1.0.0 at the "sum at CifarApp.scala" stage.

Log Contents:
A fatal error has been detected by the Java Runtime Environment:
[thread 140091143587584 also had an error]
SIGILL (0x4) at pc=0x00007f696f1b1221, pid=10808, tid=140090651174656

JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)

Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)

Problematic frame:

C [libcaffe.so.1.0.0-rc3+0x786221] sgemm_kernel+0x21

Failed to write core dump. Core dumps have been disabled.

An error report file with more information is saved as:

/yarn/nm/usercache/ubuntu/appcache/application_1467152377093_0019/container_1467152377093_0019_01_000002/hs_err_pid10808.log

If you would like to submit a bug report, please visit:

http://bugreport.sun.com/bugreport/crash.jsp

@pcmoritz
Collaborator

Did you create the JARs yourself or did you use the JARs we provide? There seems to be an error in the BLAS library. If you compiled it yourself, which BLAS are you using?
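For readers hitting the same crash: SIGILL is the signal raised for an illegal instruction, and the problematic frame in the log above is `sgemm_kernel` inside `libcaffe.so` itself. One possible cause (an assumption, not something confirmed in this thread) is a native library built with instruction-set extensions that the virtual machine's CPU does not expose. A minimal Python sketch for checking which extensions the host advertises follows; the list in `FLAGS_OF_INTEREST` is illustrative only.

```python
from pathlib import Path

# Instruction-set extensions that optimized BLAS kernels often rely on.
# This list is an illustrative assumption, not taken from the thread.
FLAGS_OF_INTEREST = ("sse4_2", "avx", "avx2", "fma")


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=FLAGS_OF_INTEREST):
    """List the wanted flags that this CPU (or VM) does not advertise."""
    present = parse_cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in present]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("missing:", missing_flags(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found (not a Linux host?)")
```

Running this on both a working node and a failing node and comparing the output would show whether the two machines differ in the extensions they expose.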

@prateekarora-git
Author

Yes, I built my own JAR file.
I checked out the SparkNet code with git clone https://github.com/amplab/SparkNet.git

Then I modified build.sbt (changed the URL to snapshot-2016-03-16-CPU and changed all instances of SPARKNET to SPARKNETCPU).

I have my own Spark 1.6 cluster running on Cloudera 5.7.0.

Then I tried to run the application using:

spark-submit --master yarn-cluster --num-executors 3 --driver-memory 4G --executor-memory 4G --conf spark.akka.frameSize=300 --class apps.CifarApp target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 3

@pcmoritz
Collaborator

Oh I see, when I said "create the JAR yourself" I meant to ask if you followed this procedure: https://github.com/amplab/SparkNet/blob/master/doc/creating-jars.md

If you are running on Cloudera with Ubuntu 14.04 (if that is possible), the procedure you described should work out of the box. If your cluster uses a different distribution, you might have to follow the above procedure to make sure it works.

@prateekarora-git
Author

Hi, thanks for the information.
Yes, I am using Cloudera 5.7.0 with Ubuntu 14.04. The Cloudera distribution ships the Spark 1.6.0 JAR files, and I am running Spark on a YARN cluster. I have tested many Spark applications on my cluster.

So, as I understand it, you are saying that the procedure I used to compile SparkNet and run "Train Cifar using SparkNet" should work.

Is there any hint for solving my problem?

Regards
Prateek

@prateekarora-git
Author

One more thing: I am using Java version "1.7.0_101".

@pcmoritz
Collaborator

On EC2 it works on Ubuntu 14.04. Did you start from a fresh image or might it be that there is another version of BLAS that causes problems? I'm happy to have a quick look at the log hs_err_pid10808.log to see if it contains more information, if you are willing to share that.
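To check whether more than one BLAS is installed on a node, one option is to look at the dynamic linker cache. The sketch below parses the output of `ldconfig -p` and keeps entries whose names look like a BLAS implementation; the name pattern (`blas`, `atlas`, `mkl`) is an assumption about how such libraries are usually named, not something stated in this thread.

```python
import re
import subprocess

# Substrings that usually identify a BLAS implementation (illustrative assumption).
BLAS_NAME = re.compile(r"blas|atlas|mkl", re.IGNORECASE)


def blas_entries(ldconfig_output):
    """Return (soname, path) pairs for BLAS-like libraries in `ldconfig -p` output."""
    found = []
    for line in ldconfig_output.splitlines():
        if "=>" not in line:
            continue  # skip the header line
        left, _, path = line.partition("=>")
        parts = left.split()
        if not parts:
            continue
        soname = parts[0]
        if BLAS_NAME.search(soname):
            found.append((soname, path.strip()))
    return found


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["ldconfig", "-p"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run ldconfig:", exc)
    else:
        for soname, path in blas_entries(result.stdout):
            print(soname, "->", path)
```

Comparing the output from the fresh image with the output from the existing cluster would show any extra BLAS installations.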

@prateekarora-git
Author

Attached log file.
hs_err_pid7119.docx

@prateekarora-git
Author

Hi,
I tried with a fresh image and it works. But I want to run this on my existing cluster, where the issue occurs. I can't move all my previous work to a new cluster.

Regards
Prateek

@pcmoritz
Collaborator

pcmoritz commented Jul 1, 2016

Are the software versions the same on both the fresh image and your existing cluster? Do you have any other BLAS libraries installed on your existing cluster? My guess right now is that a different BLAS is loaded at runtime. Unfortunately, the log is not very helpful.
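One way to test the guess that a different BLAS is loaded at runtime is to inspect the memory map of the running executor process. The sketch below reads `/proc/<pid>/maps` on Linux and lists the mapped files whose names look like BLAS or Caffe libraries. The name pattern and the choice of which PID to inspect are assumptions for illustration; on a YARN cluster the PID would be that of the executor JVM on the node.

```python
import re
import sys

# File-name substrings of interest (illustrative assumption).
LIB_NAME = re.compile(r"blas|atlas|mkl|caffe", re.IGNORECASE)


def loaded_libraries(maps_text, pattern=LIB_NAME):
    """Return sorted, de-duplicated mapped file paths whose file name matches."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if path.startswith("/") and pattern.search(path.rsplit("/", 1)[-1]):
            paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # Usage: python loaded_libs.py <pid>   (defaults to this process)
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    try:
        with open(f"/proc/{pid}/maps") as fh:
            for path in loaded_libraries(fh.read()):
                print(path)
    except OSError as exc:
        print("could not read process maps:", exc)
```

If the list on the existing cluster shows a BLAS path that does not appear on the fresh image, that would support the guess above.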
