Move SparkContext -> SparkSession #291

Open
etrain opened this issue Feb 24, 2017 · 2 comments

etrain commented Feb 24, 2017

We use SparkContext throughout the loaders and example pipelines. Since we now rely on Spark 2.0, it makes sense to move these to SparkSession.

@etrain etrain mentioned this issue Feb 24, 2017

etrain commented Feb 24, 2017

Thinking through this change today, I'm not so sure it's necessary at the moment. SparkSession is part of the Spark SQL namespace and is primarily designed to support Dataset access. We need it in the Amazon pipeline because we use Spark SQL's JSON decoding to load JSON files, but we then immediately convert the result to an RDD.
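
For context, the load-then-convert pattern described above looks roughly like the following. This is a minimal sketch, not the actual Amazon loader: the object and method names are hypothetical, and it assumes line-delimited JSON input.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical helper, for illustration only.
object JsonToRddSketch {
  // Decode JSON with Spark SQL, then drop straight back to the RDD API.
  def loadJsonAsRdd(spark: SparkSession, path: String): RDD[Row] = {
    // read.json infers a schema and returns a DataFrame (Dataset[Row]);
    // .rdd exposes the same rows as an RDD[Row].
    spark.read.json(path).rdd
  }
}
```

The session is only needed for the `read.json` call; everything downstream sees a plain RDD.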

To really jump on the Spark 2.0 train, I would recommend the following:

  1. Update all loaders to take a SparkSession and return a Dataset.
  2. Modify the pipeline, transformer, and estimator interfaces to accept Dataset[T] as well as RDD[T], and do so in a way that takes advantage of the codegen features of Spark 2.
  3. Benchmark and make sure we're not giving anything up with this approach, particularly when it comes to cache management and dealing with dense numerical data, a common use case for us.
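
As a rough illustration of item 1, a loader that takes a SparkSession and returns a typed Dataset might look like this. The `Review` case class, its field names, and the `DatasetLoaderSketch` object are assumptions made for the example, not the real loader or schema.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; the field names are assumptions for illustration.
case class Review(reviewText: String, overall: Double)

object DatasetLoaderSketch {
  def load(spark: SparkSession, path: String): Dataset[Review] = {
    // Brings Encoder[Review] and the $"col" syntax into scope.
    import spark.implicits._
    spark.read.json(path)
      // Cast explicitly so an integer-valued rating in the JSON still fits the Double field.
      .select($"reviewText", $"overall".cast("double").as("overall"))
      .as[Review]
  }
}
```

Existing RDD-based pipeline stages could still consume the result through `DatasetLoaderSketch.load(spark, path).rdd`, which yields an `RDD[Review]`. Whether that round trip costs anything is exactly what item 3 would need to measure.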

For the sake of consistency, it would be nice to have the Amazon Loader/Pipeline deal with SparkContexts rather than SparkSessions. Unfortunately, this can't easily happen internally to the loader because there is no public interface for creating a SparkSession given a SparkContext.
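
One possible workaround, offered as something to verify rather than a confirmed fix: although no public method takes a SparkContext directly, `SparkSession.builder().getOrCreate()` reuses a SparkContext that is already active in the JVM instead of starting a second one. A minimal sketch, assuming the caller's context is the active one (the object and method names are hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical helper, for illustration only.
object SessionFromContextSketch {
  // Assumes `sc` is the active SparkContext in this JVM. The builder then
  // attaches the session to it rather than creating a new context.
  def sessionFor(sc: SparkContext): SparkSession =
    SparkSession.builder().config(sc.getConf).getOrCreate()
}
```

This relies on global state (the active context) rather than on the argument itself, so it is less explicit than a real `SparkContext => SparkSession` constructor would be.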

I'm happy to leave this issue open, but will probably assign a 0.5.0 milestone to it, since I'd rather see items 2 and 3 get handled along with it.

Let me know what you think @tomerk @shivaram

@etrain etrain added this to the 0.5.0 milestone Feb 24, 2017
@shivaram

Yeah I think that sounds reasonable.
