SPARK-A-TON
Hacking A Ton of Spark
This one-day Spark-a-ton is an excellent opportunity to have fun while learning new things and contributing to an open source project with one of the leading Spark 2.0 experts – for free.
Duration (1 day/8 hours)
24.11.2016: 09:00 – 18:00 @ Poligon
Development Activities
Structured Streaming
- Developing a custom StreamSourceProvider
- Migrating TextSocketStream to SparkSession (currently uses SQLContext)
- Developing Sink and Source for Apache Kafka
- JDBC support (with PostgreSQL as the database)
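For the custom StreamSourceProvider item above, here is a minimal sketch of a streaming source that emits an ever-growing counter. It assumes the Spark 2.0 StreamSourceProvider and Source contracts; the package and the CounterSourceProvider / CounterSource names are invented for illustration, not part of Spark.

package com.example.streaming  // hypothetical package

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Looked up by spark.readStream.format("com.example.streaming.CounterSourceProvider")
class CounterSourceProvider extends StreamSourceProvider {

  private val counterSchema = StructType(StructField("value", LongType) :: Nil)

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("counter", counterSchema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new CounterSource(sqlContext, counterSchema)
}

// Every trigger advances the offset by one; a batch contains the numbers added since the last batch.
class CounterSource(sqlContext: SQLContext, override val schema: StructType) extends Source {

  private var current = 0L

  override def getOffset: Option[Offset] = {
    current += 1
    Some(LongOffset(current))
  }

  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.collect { case LongOffset(v) => v }.getOrElse(0L)
    val to = end.asInstanceOf[LongOffset].offset
    import sqlContext.implicits._
    ((from + 1) to to).toDF("value")
  }

  override def stop(): Unit = ()
}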
- Creating custom Encoder
- SPARK-17668 Support representing structs with case classes and tuples in spark sql udf inputs
- Create an encoder between your custom domain object of type T and JSON or CSV – see Encoders for the available encoders.
- Read Encoders – Internal Row Converters
- (advanced/integration) Create an encoder for Apache Arrow (esp. after the arrow-0.1.0 RC0 release candidate has recently been announced) and ARROW-288 Implement Arrow adapter for Spark Datasets.
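As a starting point for the encoder items above, a small sketch of the encoders Spark derives for case classes and how the resulting Dataset is written out as JSON or CSV. The Person case class and the /tmp paths are made up for the example; the calls assume the Spark 2.0 Encoders and DataFrameWriter APIs.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// A custom domain object (hypothetical)
case class Person(id: Long, name: String)

val spark = SparkSession.builder.master("local[*]").appName("encoders-demo").getOrCreate()
import spark.implicits._

// Encoders.product derives an encoder (and a schema) from the case class fields
val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema.treeString)

// The same encoder is picked up implicitly when creating a Dataset
val people = Seq(Person(1, "Agata"), Person(2, "Jacek")).toDS()

// The encoder-derived schema drives the JSON and CSV writers
people.write.json("/tmp/people-json")
people.write.csv("/tmp/people-csv")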
- Custom format, i.e. spark.read.format(...) or spark.write.format(...)
- Multiline JSON reader / writer
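For the custom format item, one possible shape of a batch data source that spark.read.format(...) can pick up, assuming the Spark 2.0 RelationProvider / DataSourceRegister / TableScan APIs. The com.example.demo package, the "demo" short name, and DemoRelation are invented for the sketch.

package com.example.demo  // spark.read.format("com.example.demo") resolves <package>.DefaultSource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The short name "demo" only works if the class is listed in
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister; otherwise use the package or class name.
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "demo"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}

// A read-only relation that produces three hard-coded rows
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", IntegerType) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2), Row(3)))
}

// Usage: spark.read.format("com.example.demo").load.show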
- SQLQueryTestSuite – a brand-new way in Spark 2.0 to write tests for Spark SQL
- http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
- ExecutionListenerManager
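The ExecutionListenerManager item refers to spark.listenerManager; here is a minimal sketch of a QueryExecutionListener that logs how long each action took. The TimingListener name is made up; the callback signatures are the Spark 2.0 QueryExecutionListener contract.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Logs the physical plan and duration of every successful action, and any failure
class TimingListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName took ${durationNs / 1e6} ms\n${qe.executedPlan}")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: $exception")
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
// ExecutionListenerManager is available as spark.listenerManager
spark.listenerManager.register(new TimingListener)
spark.range(5).count()  // triggers onSuccess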
- (done) Developing a custom RuleExecutor and enabling it in Spark
- Answering Extending Spark Catalyst optimizer with own rules on StackOverflow
- Sparkathon – Developing Spark Extensions in Scala on Sep 28th
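For the RuleExecutor / Catalyst items above, a toy sketch of a custom optimization rule plugged in through spark.experimental.extraOptimizations, which lets you add rules to the optimizer without forking Spark. The RemoveMultiplyByOne rule is invented for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy optimization: rewrite `expr * 1` to `expr`
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(left, Literal(1 | 1L, _)) => left
    case Multiply(Literal(1 | 1L, _), right) => right
  }
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
// Register the rule; it runs as an extra optimizer batch
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)
// Compare the analyzed vs optimized plans with e.g. spark.range(10).selectExpr("id * 1 as id").explain(true)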
- Creating custom Transformer
- Example: Tokenizer
- Jonatan + Kuba + the ladies (Justyna + Magda)
- The challenge is to save a Pipeline that contains the Transformer, then load it back and use it.
- Spark MLlib 2.0 Activator
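For the custom Transformer item (and the Pipeline-persistence challenge noted above), a bare-bones sketch in the spirit of Tokenizer, assuming the Spark 2.0 UnaryTransformer API. The UpperCaser name is made up, and it deliberately does not implement MLWritable, which is exactly where the save/load pain shows up.

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// A toy Transformer that upper-cases a String column, analogous to Tokenizer
class UpperCaser(override val uid: String)
  extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // The per-row transformation applied to the input column
  override protected def createTransformFunc: String => String = _.toUpperCase

  // Reject non-String input columns early
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  // Type of the generated output column
  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}

// Usage: new UpperCaser().setInputCol("text").setOutputCol("shouting").transform(df)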
- Monitoring executors (metrics, e.g. memory usage) using SparkListener.onExecutorMetricsUpdate.
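A small sketch for the executor-monitoring item, assuming the Spark 2.0 SparkListener API; the ExecutorMetricsMonitor name is made up.

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

// Called on every executor heartbeat that carries metric/accumulator updates
class ExecutorMetricsMonitor extends SparkListener {
  override def onExecutorMetricsUpdate(update: SparkListenerExecutorMetricsUpdate): Unit = {
    val tasksReporting = update.accumUpdates.size
    println(s"Executor ${update.execId}: $tasksReporting running task(s) reported metrics")
  }
}

// Register it on an existing SparkContext, e.g. in spark-shell:
// sc.addSparkListener(new ExecutorMetricsMonitor)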
- Develop a new Scala-only TCP-based Apache Kafka client
- A Guide To The Kafka Protocol
- KAFKA-3360 Add a protocol page/section to the official Kafka documentation
- See Scala Kafka Client for inspiration, though it is just “a thin Scala wrapper over the official Apache Kafka Java Driver”
- Working on issues reported in TensorFrames.
- Review open issues in Spark’s JIRA and pick one to work on.
Trainer: Jacek Laskowski
An independent consultant who is passionate about software development and teaching people the effective use of Apache Spark, Scala, sbt, and Apache Kafka (with a bit of Hadoop YARN, Apache Mesos, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw, Poland.
WORKSHOP II
Leapfrog your competition with Spark 2.0! - REGISTRATIONS CLOSED!
This two-day course is designed to teach developers how to implement data processing pipelines and analytics using Apache Spark. Developers will use hands-on exercises to learn the Spark Core, SQL/DataFrame, Streaming, and MLlib (machine learning) APIs. Developers will also learn about Spark internals and tips for improving application performance.
Duration (2 days/2 x 8 hours)
22.11.2016: 09:00 – 19:00 @ Hotel Slon (1st floor, room Club I)
23.11.2016: 09:00 – 19:00 @ Hotel Slon (1st floor, room Club I)
Objectives
After participating in this course you should:
- Understand how to use the Spark Scala APIs to implement various data analytics algorithms for offline (batch-mode) and event-streaming applications
- Understand Spark internals
- Understand Spark performance considerations
- Understand how to test and deploy Spark applications
- Understand the basics of integrating Spark with Mesos, Hadoop, and Akka
Agenda
Day 1
Spark SQL – 4h
- Dataset / SparkSession / Encoders / Schema / InternalRow
- Aggregations, Window and Join Operators
- Catalyst Query Optimizer
- Thrift JDBC/ODBC Server — Spark Thrift Server (STS)
Spark MLlib – 4h
- ML Pipeline API
Day 2
Spark MLlib – 1h
- ML Pipeline API
Spark Streaming – 5h
- Streaming Operators
- Stateful Operators using mapWithState
- Kafka Integration using Direct API
Structured Streaming – 2h
- Kafka Integration
Prerequisites
- Experience with Scala and sbt
- Knowledge of Spark basics — RDDs, spark-shell, spark-submit
- Experience with the entire lifecycle of a Spark application from development (including sbt-assembly) to spark-submit
- Know how to run Spark Standalone
Trainer: Jacek Laskowski
An independent consultant who is passionate about software development and teaching people the effective use of Apache Spark, Scala, sbt, and Apache Kafka (with a bit of Hadoop YARN, Apache Mesos, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw, Poland.