Spark Release 2.0.0

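Original link        Translator: 小村长

Spark 2.0 was released on July 26, 2016. I use it constantly at work, so I have followed it closely, and since I happened to get off work "early" today, I took the time to translate the Spark 2.0 release overview, which briefly introduces the new features and changes in Spark 2.0. So let 村长 (the translator) lead you into the mysterious halls of Spark 2.0. I also hope more people will join in: knowledge only becomes meaningful and valuable when it is shared.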

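Spark 2.0.0 is the first release on the 2.x line. The major updates are API usability, SQL 2003 support, performance improvements, Structured Streaming, R UDF support, and operational improvements. In addition, this release includes over 2500 patches from over 300 contributors.

To download Spark 2.0, visit the downloads page. You can also consult the detailed changes for specifics. Below we present the changes grouped by module.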

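API Stability

Spark 2.0.0 is the first release on the Spark 2.x line. Spark guarantees the stability of its non-experimental APIs for all 2.x releases. Although the APIs remain largely similar to 1.x, Spark 2.0.0 does include breaking API changes. The API removals, behavior changes, and deprecations are documented later in these notes.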

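Core and Spark SQL

Programming APIs

One of the largest changes in Spark 2.0 is the newly updated APIs:

  • Unified DataFrame and Dataset: in Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
  • SparkSession: a new entry point that replaces the old SQLContext and HiveContext for the DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
  • A new, streamlined configuration API for SparkSession
  • A simpler, more performant accumulator API
  • A new, improved Aggregator API for typed aggregation in Datasets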

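SQL

Spark 2.0 substantially improves SQL functionality with SQL2003 support. Spark SQL can now run all 99 TPC-DS queries. In more detail, the improvements are:

  • A native SQL parser that supports both ANSI SQL and Hive QL
  • Native DDL command implementations
  • Subquery support, including
    • Uncorrelated scalar subqueries
    • Correlated scalar subqueries
    • NOT IN predicate subqueries (in WHERE/HAVING clauses)
    • IN predicate subqueries (in WHERE/HAVING clauses)
    • (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
  • View canonicalization support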
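
To make the subquery categories above concrete, here is a minimal sketch of each query shape. It runs against an in-memory SQLite database from Python's standard library purely for illustration: it is not Spark, and the `dept` and `emp` tables and their columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only to illustrate the query shapes;
# the tables and columns are hypothetical, and this is not Spark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp  (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'empty');
    INSERT INTO emp  VALUES ('ann', 1, 120), ('bob', 1, 100),
                            ('cat', 2, 90),  ('dan', 2, 70);
""")


def run(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated scalar subquery: the inner query never references the outer row.
above_avg = run("""
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
""")

# Correlated scalar subquery: the inner query references e.dept_id of the outer row.
top_per_dept = run("""
    SELECT name FROM emp e
    WHERE salary = (SELECT MAX(salary) FROM emp WHERE dept_id = e.dept_id)
    ORDER BY name
""")

# IN and NOT IN predicate subqueries in a WHERE clause.
staffed = run(
    "SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name")
unstaffed = run(
    "SELECT name FROM dept WHERE id NOT IN (SELECT dept_id FROM emp) ORDER BY name")

# EXISTS predicate subquery in a WHERE clause.
has_high_earner = run("""
    SELECT name FROM dept d
    WHERE EXISTS (SELECT 1 FROM emp WHERE dept_id = d.id AND salary >= 100)
    ORDER BY name
""")
```

The SQL strings are the part that carries over; in Spark they would be submitted through its own SQL entry point rather than through `sqlite3`.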

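In addition, when built without Hive support, Spark SQL should have almost all of the functionality it has when built with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.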

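New Features

  • Native CSV data source, based on Databricks’ spark-csv module
  • Off-heap memory management for both caching and runtime execution
  • Hive-style bucketing support
  • Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.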
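
As a rough illustration of what a sketch-based statistic is, the following is a minimal count-min sketch in plain Python. It is a teaching sketch under assumed parameters (the `width`, `depth`, and SHA-256-based hashing are choices made here), not Spark's implementation or API.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters that may over-estimate item frequencies."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


sketch = CountMinSketch()
words = ["spark"] * 50 + ["hive"] * 7 + ["parquet"] * 3
for w in words:
    sketch.add(w)
```

The estimate never undercounts: `sketch.estimate("spark")` is at least 50, and memory stays fixed at `width * depth` counters no matter how many distinct items are added.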

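Performance and Runtime

  • Substantial (2-10X) performance speedups for SQL and DataFrame operations via a new technique called whole-stage code generation.
  • Improved Parquet scan throughput through vectorization
  • Improved ORC performance
  • Many improvements in the Catalyst query optimizer for common workloads
  • Improved window function performance via native implementations of the window functions
  • Automatic file coalescing for native data sources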
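
The notes only name whole-stage code generation. As generally described, the idea is to fuse the operators of a query stage into one function instead of passing rows through a chain of per-operator iterators. The sketch below contrasts the two styles in plain Python with a hypothetical query and hand-written functions. Nothing here is generated code or Spark internals, and it makes no claim about the speedup.

```python
# Hypothetical query: SELECT SUM(price * qty) FROM rows WHERE qty > 1
rows = [(10.0, 1), (2.5, 4), (3.0, 2), (8.0, 0)]


# Style 1: one small iterator per operator, chained together.
def scan(data):
    for row in data:
        yield row


def filter_qty(child):
    for price, qty in child:
        if qty > 1:
            yield price, qty


def project_total(child):
    for price, qty in child:
        yield price * qty


def iterator_plan(data):
    return sum(project_total(filter_qty(scan(data))))


# Style 2: the same operators fused by hand into a single loop, which is the
# shape that code generated for a whole stage aims for.
def fused_plan(data):
    total = 0.0
    for price, qty in data:
        if qty > 1:
            total += price * qty
    return total
```

Both plans compute the same answer for `rows`. The fused version simply avoids handing each row from one operator object to the next.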

MLlib

The DataFrame-based API is now the primary MLlib API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.

New features

  • ML persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R. See this blog post for details. (SPARK-6725, SPARK-11939, SPARK-14311)
  • MLlib in R: SparkR now offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression. See this talk to learn more.
  • Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.
  • Algorithms added to DataFrames-based API: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer.

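The items above cover many of the new features.

Speed/scaling

Vectors and matrices stored in DataFrames now use a much more efficient serialization, which reduces the overhead of calling MLlib algorithms. (SPARK-14850)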

SparkR

The largest improvement to SparkR in Spark 2.0 is support for user-defined functions. There are three of them: dapply, gapply, and lapply. The first two can be used for partition-based UDFs, e.g. partitioned model learning. The last can be used for hyper-parameter tuning.

In addition, the following new features were added:

  • Improved algorithm coverage for machine learning in R, including naive Bayes, k-means clustering, and survival regression.
  • Generalized linear models support more families and link functions.
  • Save and load for all ML models.
  • More DataFrame functionality: Window functions API, reader, writer support for JDBC, CSV, SparkSession

Streaming

Spark 2.0 ships the initial experimental release for Structured Streaming, a high level streaming API built on top of Spark SQL and the Catalyst optimizer. Structured Streaming enables users to program against streaming sources and sinks using the same DataFrame/Dataset API as in static data sources, leveraging the Catalyst optimizer to automatically incrementalize the query plans.

For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.

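Dependency, Packaging, and Operations

This release changes Spark's operations and packaging in several ways:

  • Spark 2.0 no longer requires bundling all dependencies into a single fat assembly jar.
  • The Akka dependency has been removed, so user applications can program against any version of Akka.
  • The Kryo version is bumped to 3.0.
  • The default build now uses Scala 2.11 rather than Scala 2.10.

Removals, Behavior Changes and Deprecations

Removals

The following features have been removed in Spark 2.0:

  • Bagel
  • Support for Hadoop 2.1 and earlier
  • The ability to configure the closure serializer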
  • HTTPBroadcast
  • TTL-based metadata cleaning
  • Semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
  • SparkContext.metricsSystem
  • Block-oriented integration with Tachyon (subsumed by file system integration)
  • Methods deprecated in Spark 1.x
  • Methods on Python DataFrame that returned RDDs (map, flatMap, mapPartitions, etc). They are still available in dataframe.rdd field, e.g. dataframe.rdd.map.
  • Less frequently used streaming connectors, including Twitter, Akka, MQTT, ZeroMQ
  • Hash-based shuffle manager
  • History serving functionality from standalone Master
  • For Java and Scala, DataFrame no longer exists as a class. As a result, data sources would need to be updated.
  • Spark EC2 script has been fully moved to an external repository hosted by the UC Berkeley AMPLab

Behavior Changes

The following changes might require updating existing applications that depend on the old behavior or API.

  • The default build is now using Scala 2.11 rather than Scala 2.10.
  • In SQL, floating literals are now parsed as decimal data type rather than double data type.
  • Kryo version is bumped to 3.0.
  • Java RDD’s flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java Iterator, so the functions do not need to materialize all the data.
  • Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
  • When writing Parquet files, the summary files are not written by default. To re-enable it, users must set “parquet.enable.summary-metadata” to true.
  • The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.
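
To see why the switch from double to decimal for floating literals can matter, here is a small analogy using Python's `float` and `decimal.Decimal` as stand-ins for the SQL double and decimal types. It only illustrates the arithmetic difference between the two representations, not Spark's parser or its type-coercion rules.

```python
from decimal import Decimal

# Binary doubles cannot represent 0.1 or 0.2 exactly, so the sum carries
# a small rounding error.
as_double = 0.1 + 0.2

# Decimals keep the base-10 digits exactly as written in the literal.
as_decimal = Decimal("0.1") + Decimal("0.2")

double_matches = (as_double == 0.3)               # False
decimal_matches = (as_decimal == Decimal("0.3"))  # True
```

Queries that compared or aggregated literals under double semantics may therefore produce differently typed, and occasionally differently valued, results once the literals are parsed as decimals.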

For a more complete list, please see SPARK-11806 for deprecations and removals.

Deprecations

The following features have been deprecated in Spark 2.0 and might be removed in future versions of Spark 2.x:

  • Fine-grained mode in Mesos
  • Support for Java 7
  • Support for Python 2.6

Known Issues

  • The behavior of Lead and Lag has changed from respecting nulls (the 1.6 behavior) to ignoring nulls. This behavioral change will be fixed in 2.0.1 (SPARK-16721).
  • Lead and Lag functions using constant input values do not return the default value when the offset row does not exist (SPARK-16633).
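
The difference between respecting and ignoring nulls can be sketched in plain Python, assuming the usual SQL meaning of the two modes (the notes do not spell it out). Both function names are hypothetical helpers written for this illustration and are not part of Spark.

```python
def lag_respect_nulls(values, offset=1, default=None):
    """Return the value exactly `offset` rows back, even if it is None."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lag_ignore_nulls(values, offset=1, default=None):
    """Return the value `offset` non-None entries back, skipping None rows."""
    result = []
    seen = []  # non-None values encountered before the current row
    for v in values:
        result.append(seen[-offset] if len(seen) >= offset else default)
        if v is not None:
            seen.append(v)
    return result


column = [1, None, 3, 4]
respecting = lag_respect_nulls(column)  # [None, 1, None, 3]
ignoring = lag_ignore_nulls(column)     # [None, 1, 1, 3]
```

The third row is where the two modes diverge: respecting nulls returns the null stored in the previous row, while ignoring nulls reaches back to the last non-null value.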

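Credits

Translator's note: I do not know these people or who they are, but I am grateful for their hard work in giving the open-source community such a good distributed framework. Let us glance over their names as a sign of respect.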

Last but not least, this release would not have been possible without the following contributors: Aaron Tokhy, Abhinav Gupta, Abou Haydar Elias, Adam Budde, Adam Roberts, Ahmed Kamal, Ahmed Mahran, Alex Bozarth, Alexander Ulanov, Allen, Anatoliy Plastinin, Andrew, Andrew Ash, Andrew Or, Andrew Ray, Anthony Truchet, Antonio Murgia, Arun Allamsetty, Azeem Jiva, Ben McCann, BenFradet, Bertrand Bossy, Bill Chambers, Bjorn Jonsson, Bo Meng, Brandon Bradley, Brian O’Neill, BrianLondon, Bryan Cutler, Burak Köse, Burak Yavuz, Carson Wang, Cazen, Charles Allen, Cheng Hao, Cheng Lian, Claes Redestad, CodingCat, DB Tsai, DLucky, Daniel Jalova, Daoyuan Wang, Darek Blasiak, David Tolpin, Davies Liu, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Erastov, Dominik Jastrzębski, Dongjoon Hyun, Earthson Lu, Egor Pakhomov, Ehsan M.Kermani, Ergin Seyfe, Eric Liang, Ernest, Felix Cheung, Feynman Liang, Fokko Driesprong, Franklyn D’souza, François Garillot, Gabriele Nizzoli, Gary King, GayathriMurali, Gio Borje, Grace, Grzegorz Chilkiewicz, Guillaume Poulin, Gábor Lipták, Hemant Bhanawat, Herman van Hovell, Herman van Hövell tot Westerflier, Hiroshi Inoue, Holden Karau, Hossein, Huaxin Gao, Imran Rashid, Imran Younus, Ioana Delaney, Iulian Dragos, Jacek Laskowski, Jacek Lewandowski, Jakob Odersky, James Lohse, James Thomas, Jason Lee, Jason Moore, Jason White, Jean-Baptiste Onofré, Jeff L, Jeff Zhang, Jeremy Derr, JeremyNixon, Jo Voordeckers, Joan, Jon Maurer, Joseph K. 
Bradley, Josh Howes, Josh Rosen, Joshi, Juarez Bochi, Julien Baley, Junyang, Junyang Qian, Jurriaan Pruis, Kai Jiang, KaiXinXiaoLei, Kay Ousterhout, Kazuaki Ishizaki, Kevin Yu, Koert Kuipers, Kousuke Saruta, Koyo Yoshida, Krishna Kalyan, Lewuathe, Liang-Chi Hsieh, Lianhui Wang, Lin Zhao, Lining Sun, Liu Xiang, Liwei Lin, Luc Bourlier, Luciano Resende, Lukasz, Maciej Brynski, Malte, Marcelo Vanzin, Marcin Tustin, Mark Grover, Martin Menestret, Masayoshi TSUZUKI, Matei Zaharia, Matthew Wise, Michael Allman, Michael Armbrust, Michael Gummelt, Michel Lemay, Mike Dusenberry, Mortada Mehyar, Nakul Jindal, Nam Pham, Narine Kokhlikyan, NarineK, Neelesh Srinivas Salian, Nezih Yigitbasi, Nicholas Chammas, Nicholas Tietz, Nick Pentreath, Nilanjan Raychaudhuri, Nirman Narang, Nishkam Ravi, Nong, Nong Li, Oleg Danilov, Oliver Pierson, Oscar D. Lara Yejas, Parth Brahmbhatt, Patrick Wendell, Pete Robbins, Peter Ableda, Prajwal Tuladhar, Prashant Sharma, Pravin Gadakh, QiangCai, Qifan Pu, Raafat Akkad, Rahul Tanwani, Rajesh Balamohan, Rekha Joshi, Reynold Xin, Richard W. Eggert II, Robert Dodier, Robert Kruszewski, Robin East, Ruifeng Zheng, Ryan Blue, Sameer Agarwal, Sandeep Singh, Sanket, Sasaki Toru, Sean Owen, Sean Zhong, Sebastien Rainville, Sebastián Ramírez, Sela, Sergiusz Urbaniak, Shally Sangal, Sheamus K. 
Parkes, Shivaram Venkataraman, Shixiong Zhu, Shuai Lin, Shubhanshu Mishra, Sital Kedia, Stavros Kontopoulos, Stephan Kessler, Steve Loughran, Subhobrata Dey, Subroto Sanyal, Sumedh Mungee, Sun Rui, Sunitha Kambhampati, Takahashi Hiroshi, Takeshi YAMAMURO, Takuya Kuwahara, Takuya UESHIN, Tathagata Das, Tejas Patil, Terence Yim, Thomas Graves, Timothy Chen, Timothy Hunter, Tom Graves, Tom Magrino, Tommy YU, Travis Crawford, Tristan Reid, Victor Chima, Villu Ruusmann, Wayne Song, WeichenXu, Weiqing Yang, Wenchen Fan, Wesley Tang, Wilson Wu, Wojciech Jurczyk, Xiangrui Meng, Xin Ren, Xin Wu, Xinh Huynh, Xiu Guo, Xusen Yin, Yadong Qi, Yanbo Liang, Yash Datta, Yin Huai, Yonathan Randolph, Yong Gang Cao, Yong Tang, Yu ISHIKAWA, Yucai Yu, Yuhao Yang, Yury Liavitski, Zhang, Liye, Zheng RuiFeng, Zheng Tan, aokolnychyi, bomeng, catapan, cody koeninger, dding3, depend, echo2mei, felixcheung, frreiss, fwang1, gatorsmile, guoxu1231, huangzhaowei, hushan, hyukjinkwon, jayadevanmurali, jeanlyn, jerryshao, jliwork, junhao, kaklakariada, krishnakalyan3, lfzCarlosC, lgieron, mark800, mathieu longtin, mcheah, meiyoula, movelikeriver, mwws, nfraison, oraviv, peng.zhang, petermaxlee, pierre-borckmans, poolis, prabs, proflin, pshearer, rotems, sachin aggarwal, sandy, scwf, seddonm1, sethah, sharkd, shijinkui, sureshthalamati, tedyu, thomastechs, tmnd1991, vijaykiran, wangfei, wangyang, wm624@hotmail.com, wujian, xin Wu, yzhou2001, zero323, zhonghaihua, zhuol, zlpmichelle, Örjan Lundberg, Yang Bo.

Original article. When reposting, please credit the source: 并发编程网 (ifeve.com). Link to this article: Spark Release 2.0.0
