v0.9.5
Release date: 2022-01-13 06:42:34
Latest microsoft/SynapseML release: v1.0.5 (2024-08-30 10:16:51)
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.9.5 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, and Java.
Highlights
Geospatial Intelligence | Multivariate Anomaly Detection | Responsible AI at Scale | Text To Speech | Healthcare Analytics |
Large-scale map and geocoding operations | Build custom time series anomaly detection systems | Distributed Conditional Expectation and Partial Dependence Analysis | Easy-to-use Neural Text to Speech for large datasets | Quickly understand entities and relationships in corpora of medical text. |
New Features
Geospatial Intelligence 🗺️
- Added support for distributed geospatial queries backed by the Azure Maps API
- Added the geospatial usage overview (#1339)
- Explore how to use the geospatial intelligence services to analyze flood risks. (#1339)
- Added the AddressGeocoder transformer to map informal addresses to standardized addresses with latitude and longitude (#1294)
- Added the ReverseGeocoder transformer to map latitude and longitude measurements to standardized addresses (#1339)
- Added the CheckPointInPolygon transformer to detect whether latitude and longitude queries lie inside regions of interest (#1339)
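To make the point-in-polygon idea concrete, here is a minimal local sketch of the geometric test in plain Python. It is not the Azure Maps service or the CheckPointInPolygon transformer itself: the function name, the sample region, and the flat-plane ray-casting approach are all assumptions made for illustration, and the sketch ignores Earth curvature and polygons that cross the antimeridian.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude), hypothetical convention


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Even-odd ray casting on a flat plane (illustrative only)."""
    lat, lon = point
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            # Count the crossing if it lies to the east of the point.
            if lon < cross_lon:
                inside = not inside
    return inside


# Hypothetical rectangular region of interest and two query points.
region = [(47.0, -123.0), (47.0, -121.0), (48.0, -121.0), (48.0, -123.0)]
seattle = (47.6062, -122.3321)
portland = (45.5152, -122.6784)

print(point_in_polygon(seattle, region))
print(point_in_polygon(portland, region))
```

In a distributed setting the same test would be applied to every row of a DataFrame; the release delegates that work to the Azure Maps API rather than computing it locally.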
Azure Cognitive Services for Big Data 🧠
- Added the Healthcare Analytics Transformer for extracting medical information, entities, and relationships from text. [Example Usage] (#1329)
- Added the FitMultivariateAnomaly estimator for training custom anomaly detection models on DataFrames of multivariate time series data (#1272)
- Added example notebook for Multivariate Anomaly Detector
- See how to train a custom Multivariate Anomaly detector in the Estimators reference docs (#1323)
- Added simplified Text Analytics transformers that support auto-batching (#1329)
- Added the TextToSpeech transformer for transforming DataFrames of text to audio files with neural voice synthesis (#1320)
- Added the TextAnalyze transformer to support executing multiple text analytics workloads within a single API call (#1267, #1312)
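The auto-batching mentioned above amounts to grouping rows so that one service call carries several documents, then restoring one result per row afterwards. The sketch below shows that idea in plain Python; it is not SynapseML's implementation, and the helper names and sample documents are hypothetical.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def fixed_size_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group rows into lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def flatten(batches: Iterable[List[T]]) -> Iterator[T]:
    """Undo batching so each input row lines up with one output row again."""
    for batch in batches:
        yield from batch


# Hypothetical documents that would each be sent to a text analytics service.
docs = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
batches = list(fixed_size_batches(docs, batch_size=2))
print(batches)
print(list(flatten(batches)))
```

Batching trades a little latency per row for far fewer network round trips, which is usually the dominant cost when calling a remote service from every Spark partition.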
Responsible AI at Scale 😇
- Added Individual Conditional Expectation explanations and Partial Dependence Plots with the ICETransformer. This tool gives detailed explanations of how features in opaque-box models affect the model prediction. (#1284)
- Learn about how to use the ICETransformer through an example with the Adult Census dataset
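For readers new to these explanations, the following single-machine sketch shows what an Individual Conditional Expectation (ICE) curve and a partial dependence curve compute. It does not use the ICETransformer API: the toy model, feature names, and helper functions are assumptions chosen only to keep the arithmetic easy to follow.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def ice_curves(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[float],
) -> List[List[float]]:
    """One curve per row: the prediction as `feature` sweeps over `grid`
    while every other feature keeps that row's original value."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)  # copy so the input row is not mutated
            modified[feature] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(curves: Sequence[Sequence[float]]) -> List[float]:
    """The partial dependence curve is the pointwise mean of the ICE curves."""
    n = len(curves)
    return [sum(curve[j] for curve in curves) / n for j in range(len(curves[0]))]


def toy_model(row: Row) -> float:
    # Hypothetical opaque-box model with an interaction between two features.
    return row["age"] * row["hours"]


rows = [{"age": 30.0, "hours": 1.0}, {"age": 50.0, "hours": 3.0}]
grid = [0.0, 10.0, 20.0]

curves = ice_curves(toy_model, rows, "age", grid)
print(curves)                      # [[0.0, 10.0, 20.0], [0.0, 30.0, 60.0]]
print(partial_dependence(curves))  # [0.0, 20.0, 40.0]
```

The two ICE curves have different slopes because of the interaction with `hours`; the averaged partial dependence curve hides that difference, which is why the two views are typically examined together.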
MLFlow 🔃
LightGBM on Spark 🌳
- Improved LightGBM training performance 4x-10x by setting num_threads to be cores-1 (#1282)
- Added the predict_disable_shape_check parameter to LightGBM (#1273)
- Reduced temporary file bloat by creating the LightGBM native temp directory lazily (#1326)
- Added logging for the number of columns and rows when creating datasets, and set useSingleDatasetMode=True by default (#1222)
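As a rough illustration of the cores-1 rule for num_threads, the sketch below derives a thread count from a core count. The helper name and the floor of one thread are assumptions for this example; in SynapseML the relevant core count comes from the Spark executor configuration, not from the local machine as shown here.

```python
import os


def default_num_threads(cores: int) -> int:
    """Leave one core free for other work, but never go below one thread."""
    return max(1, cores - 1)


# Local stand-in for the executor core count (an assumption for this sketch).
local_cores = os.cpu_count() or 1
num_threads = default_num_threads(local_cores)
print(local_cores, num_threads)
```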
Infrastructure 🏭
- SynapseML is now installable from Maven Central!
- SynapseML now supports Spark v3.2.x
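With the package on Maven Central, a Spark session can typically pull it in through `spark.jars.packages` without an extra repository setting. The sketch below assumes the Maven coordinate `com.microsoft.azure:synapseml_2.12:0.9.5`, which is taken from the project's install instructions rather than from these notes, so verify it against the official documentation. The helper names are hypothetical, and `build_session` requires a working PySpark installation.

```python
def synapseml_coordinate(version: str = "0.9.5", scala_version: str = "2.12") -> str:
    """Build the Maven coordinate (assumed group and artifact naming)."""
    return f"com.microsoft.azure:synapseml_{scala_version}:{version}"


def build_session(app_name: str = "synapseml-demo"):
    """Create a Spark session that resolves SynapseML from Maven Central."""
    # Imported lazily so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars.packages", synapseml_coordinate())
        .getOrCreate()
    )


print(synapseml_coordinate())
```

Calling `build_session()` on a machine with Spark 3.2.x and network access would download the package and its dependencies on first use.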
Additional Updates
Bug Fixes 🐞
- Allowed FlattenBatch to propagate non-array values (#1286)
- Fixed flaky tests (#1342)
- Fixed website bugs and migrated docSearch (#1331)
- Fixed issue where IsolationForestModel does not properly exchange params with the inner model (#1330)
- Corrected the objective param when using fobj (#1292)
- Fixed issue where broadcasted sum in breeze 1.0 breaks in Spark 3.2.0 (#1299)
- Hotfixes for R test runners (#1283)
- Fixed installation instructions (#1268)
- Removed broadcast hint (#1255)
- Fixed install instructions (#1259)
Build 🏭
- Bumped algoliasearch-helper from 3.6.1 to 3.6.2 in /website (#1270)
- Removed some dependencies that cause security issues (#1264)
Documentation 📘
- Fixed broken link to CyberML notebook (#1322)
- Added website announcement bar (#1263)
- Updated and improved README (#1262)
- Removed references to runme in contributing.md
- Supported Math expressions in website markdown (#1278)
- Corrected Synapse typo in website (#1335)
Maintenance 🔧
- Stopped LightGBM tests from timing out (#1315)
- Fixed R test flakiness (#1314)
- Updated VerifyLightGBMClassifier.scala (#1313)
- Updated speech SDK test results
- Added missing tests to the build (#1300)
- Fixed flaky build steps (#1298)
- Fixed website telemetry (#1261)
- Added website telemetry (#1260)
- Added missing test classes to pipeline
Contributor Spotlight
We are excited to highlight the contributions of the following SynapseML contributors:
Serena Ruan | Ilya Matiach | Sudhindra Kovalam |
Serena is an engineer on the Azure Synapse team in Beijing. In this release, Serena has continued her unbelievable speed of contributions with support for Multivariate Anomaly Detection, MLFlow, and installation from Maven Central. These contributions are just a few of the many projects Serena has contributed to since she joined just a few months ago! | Ilya is a prolific engineer on the Azure Machine Learning Boston team working on responsible AI. Ilya contributed LightGBM on Spark and worked tirelessly to improve and support this feature. Ilya has been an active contributor to the SynapseML project for 5 years and has built many of the tools in the library. | Sudhindra is an engineer on the Microsoft Maps team and has contributed intelligent geospatial APIs to SynapseML v0.9.5. Sudhindra developed new ways to automate generation of Spark code from Swagger files, allowing him to contribute a large suite of features rapidly. |
Elena Zherdeva | The Text Analytics Explorer Interns | Stuart Leeks |
Elena is an engineer on the CSX Data team working on building scalable responsible AI tools. In Elena's first contribution to SynapseML, she added Individual Conditional Expectation plots at scale. She also contributed a detailed sample notebook that does a fantastic job of explaining key concepts in Responsible AI. | Samantha Konigsberg (top left), Preeti Pidatala (top right), and Victoria Johnston (bottom) were summer explorer interns on the text analytics team. They collaborated to build new simplified APIs for the text analytics service using the Java SDK layer. One of these contributions was the new Healthcare Analytics API in Spark. This was the interns' first Scala project, making this contribution all the more impressive! | Stuart is an engineer on the Commercial Software Engineering team. Stuart not only uses SynapseML to power customer engagements, but also directly contributes features needed to make his customers succeed. Stuart contributed support for the new Analyze Text API, which allows users to perform multiple intelligent text tasks with a single API call. Stuart also added features to SynapseML’s mini-batchers to improve their generality. |
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:
Jason Wang @memoryz, Serena Ruan @serena-ruan, Ilya Matiach @imatiach-msft, Stuart Leeks @stuartleeks, Sudhindra Kovalam @SudhindraKovalam, Elena Zherdeva @ezherdeva, Preeti Pidatala @preetipidatala, Samantha Konigsberg @skonigs, Victoria Johnston @victoriajmicrosoft, Markus Cozowicz @eisber, Yazeed Alaudah @yalaudah, Suhas Mehta @suhas92, Kashyap Patel @ms-kashyap, Wenqing Xu @xuwq1993, Markus Weimer, Jeff Zheng, James Verbus @jverbus, Misha Desai, Nellie Gustafsson, Ruixin Xu, Eric Dettinger, Martha Laguna, Louise Han @jr-MS, Rashid Monin, Ali Emami, Clemens Schotte, Edward Un, Johannes Kebeck, Han Li, Assaf Israel @assafi, Tom Finley, Tomas Talius, Mitrabhanu Mohanty, Anand Raman, William T. Freeman, Ryan Hurey, Jarno Ensio, Brian Mouncer, Sharath Chandra, Beverly Kodhek, Nisheet Jain, Akshaya Annavajhala (AK), Euan Garden, Lev Novik, Guolin Ke, Tara Grumm, Ismaël Mejía, Keunhyun Oh, @martin0258, @sinnfashen, Dung Nguyen @nhymxu, @elswork, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team
Learn More
Visit our new website for the latest docs, demos, and examples | Read more about SynapseML's GA release in the Microsoft Research Blog | SynapseML is now generally available on Azure Synapse! Get started here. |
Learn more about Multivariate Anomaly Detection in SynapseML | Read our Paper from IEEE Big Data '21 | Sign up for the Private Preview of Explainable Boosting Machines in SynapseML |