v0.10.0
Release date: 2022-07-18 10:50:31
Latest microsoft/SynapseML release: v1.0.5 (2024-08-30 10:16:51)
Building production-ready distributed machine learning pipelines can be a challenge for even the most seasoned researcher or engineer. We are excited to announce the release of SynapseML v0.10.0 (previously MMLSpark), an open-source library that aims to simplify the creation of massively scalable machine learning pipelines. SynapseML unifies several existing ML frameworks and new MSFT algorithms in a single, scalable API that’s usable across Python, R, Scala, Java, .NET, C#, and F#.
Highlights
| OpenAI Language Models | .NET, C#, and F# Support | Full MLFlow Support | Live Demos in Browser |
| --- | --- | --- | --- |
| Embed 175-billion parameter models into your databases with ease | Use or train any SynapseML model from .NET | Quick and easy MLOps, model management, and autologging | Explore the SynapseML library with zero setup |
| Learn More | Getting Started Guide | Explore the Docs | Run in Browser |
New Features
General ✨
- SynapseML now supports .NET, C#, F#, and other .NET ecosystem languages in addition to Scala, Python, and R. Please see our Setup Guide and LightGBM from .NET example for more details. (#1539, #1156, #1443)
- SynapseML is now usable from your browser with zero setup using Binder. Quickly explore our demos in Binder. (#1487, #1493)
Azure Cognitive Services for Big Data 🧠
- Added OpenAI GPT-3 Sentence Completion Transformer. Use this feature to embed 175-billion parameter language models into distributed pipelines and databases to solve a variety of general-purpose NLP tasks across natural language and code. (#1495, #1541)
- Added an example of Sentence Completion with GPT-3 (#1564)
- Added support for Form Recognizer V3.0 (#1269)
- Improved MVAD usability with async training and better data validation (#1477)
- Upgraded the univariate anomaly detection version to v1.1-preview (#1440)
- Added a multivariate anomaly detection sample notebook (#1365)
- Added a Text to Speech example to the Cognitive Services overview (#1350)
- Added opinion mining to TextSentiment Models (#1449)
- Fixed Azure Maps schemas (#1553)
- Removed modelID param validators in FormRecognizerV3 (#1551)
- Fixed form recognizer and form ontology learner issues (#1506)
- Fixed `setServiceName` Python method in OpenAI (#1498)
- Fixed error in Text Analytics Analyze schema
- Improved error handling for MVAD (#1448, #1391)
- Removed unused concurrency parameter for MVAD (#1383)
- Improved robustness of flood risk notebook by adding polling (#1427)
Responsible AI at Scale 😇
- Added partial dependence plots (PDP) to allow for understanding how independent variables affect a model's prediction (#1426)
- Updated ICE/PDP documentation with PDP-based feature importance and additional examples (#1441, #1352)
- Added a notebook for ICE and PDP feature explainers (#1318)
- Updated data balance documentation to better describe how it can be used to ensure model fairness (#1540)
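To make the relationship between ICE and PDP concrete, here is a minimal pure-Python sketch. It does not use the SynapseML explainer classes; `ice_curves`, `partial_dependence`, and `toy_model` are hypothetical names chosen for illustration. An ICE curve tracks one row's prediction as a single feature sweeps over a grid, and the partial dependence curve is the pointwise average of those per-row curves.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]
Model = Callable[[Row], float]


def ice_curves(predict: Model, rows: Sequence[Row], feature: str,
               grid: Sequence[float]) -> List[List[float]]:
    """Return one curve per row: the prediction as `feature` takes each grid value."""
    return [[predict({**row, feature: value}) for value in grid] for row in rows]


def partial_dependence(predict: Model, rows: Sequence[Row], feature: str,
                       grid: Sequence[float]) -> List[float]:
    """Average the ICE curves pointwise to get the partial dependence curve."""
    curves = ice_curves(predict, rows, feature, grid)
    return [mean(column) for column in zip(*curves)]


# Hypothetical toy model, standing in for any fitted model's predict function.
def toy_model(row: Row) -> float:
    return 2.0 * row["x1"] + row["x2"]


rows = [{"x1": 0.5, "x2": 1.0}, {"x1": 0.9, "x2": 3.0}]
grid = [0.0, 1.0, 2.0]

ice = ice_curves(toy_model, rows, "x1", grid)
pdp = partial_dependence(toy_model, rows, "x1", grid)
print(ice)  # per-row curves
print(pdp)  # their pointwise average
```

Because the toy model is additive in `x1`, every ICE curve has the same slope and the PDP is simply shifted by the average of `x2`. With interacting features the ICE curves would diverge, which is the case where looking at them alongside the PDP is most informative.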
MLFlow 🔃
- Added documentation for MLFlow autologging (#1508)
- Added documentation on the SynapseML-MLFlow integration (#1428)
LightGBM on Spark 🌳
- Added the ability to pass in generic argument strings to LightGBM, enabling many complex parameterizations (#1444)
- Added seed parameters to LightGBM (#1387)
- Added a method to get LightGBM native model string directly (#1515)
- Fixed issue with validation data creation during `useSingleDataset` mode (#1527)
- Fixed multiclass training with initial scores (#1526)
- Fixed saving LightGBM model iterations with early stopping (#1497)
- Fixed issue where chunk size parameter was incorrectly specified during data copy (#1490)
- Fixed issue where an empty partition is chosen as the main worker in `singleDatasetMode` (#1458)
- Fixed bug with data repartitioning in `LightGBMRanker` (#1368)
- Fixed outdated docs for `useSingleDatasetMode` (#1562)
- Refactored LightGBM class structure to improve logging and debugging (#1557)
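As a rough illustration of the generic argument string idea, the sketch below parses a LightGBM-style `key=value` string and merges it with explicitly set parameters. This is plain Python, not the SynapseML implementation: `parse_arg_string` and `merge_params` are hypothetical helpers, and the precedence shown (values from the argument string override explicit ones) is an assumption that should be checked against the LightGBM on Spark documentation.

```python
from typing import Dict


def parse_arg_string(args: str) -> Dict[str, str]:
    """Split a whitespace-separated 'key=value' string into a dict.

    If a key appears more than once, the last occurrence wins.
    """
    params: Dict[str, str] = {}
    for token in args.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        params[key] = value
    return params


def merge_params(explicit: Dict[str, str], arg_string: str) -> Dict[str, str]:
    """Merge explicit params with an argument string.

    Assumed precedence: values from the argument string override explicit ones.
    """
    return {**explicit, **parse_arg_string(arg_string)}


explicit = {"num_leaves": "31", "seed": "42"}
merged = merge_params(explicit, "num_leaves=63 force_row_wise=true")
print(merged)
```

A free-form string like this lets a caller reach native parameters that have no dedicated setter, at the cost of losing type checking on those values.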
Vowpal Wabbit 🐇
- Fixed issues with `saveNativeModel` for the VWRegressionModel #1364 (#1366)
- Fixed issues with building quadratic interaction terms (#1460)
Isolation Forests 🌲
Additional Updates
Maintenance 🔧
- Removed unused debugging code (#1546)
- Removed Synapse test exclusion for Explanation Dashboard notebook (#1531)
- Made python style checks verbose (#1532)
- Fixed library checking while installing library on Databricks cluster (#1488)
- Upgraded and fixed Dockerfiles (#1472)
- Added Developer Docker Image build to pipeline (#1480)
- Fixed ADO area path in Issue Linker (#1464)
- Fixed master version badge display
- Improved Databricks error reporting
- Updated Azure CLI to stop build errors
- Fixed SSL handshake flakiness
- Added `itsdangerous` as a dependency to ADB tests (#1412)
- Turned on debug for PR-to-work-item workflow
- Pointed PR linker to official implementation
- Changed GitHub action trigger from pull_request_target to pull_request (#1413)
- Fixed issue where Unit Tests were not executing (#1409)
- Added Azure DevOps PR linker (#1394)
- Updated GH PAT name (#1389)
- Re-enabled Synapse E2E Tests (#1517)
- Updated SynapseE2E Tests to Spark 3.2 (#1362)
- Fixed ADO issue/PR linking (#1463)
- Cleaned up extra MVAD models and improved network resiliency (#1457)
- Updated Azure Blob client version (#1563)
- Fixed docker security vulnerability (#1561)
- Streamlined scalastyle hook (#1530)
- Updated CODEOWNERS (#1523)
- Updated OpenAI resource info (#1525)
- Fixed semantic PR checking (#1503)
- Updated docker images to remain compliant (#1500)
- Added component governance explicitly to build so timeout variable works (#1489)
- Fixed path for notebook test files in gitignore (#1485)
- Increased component governance timeout (#1482)
- Added conda caching to build
- Stopped build from failing after 1 hour
- Fixed flaking MVAD test
- Refactored build pipeline definitions
- Split Synapse tests into multiple tests (#1377)
- Moved from ADO Pipelines to GitHub Workflows (#1406)
Website Improvements 💻
- Fixed MathJax expressions rendering (#1343)
- Fixed Google Analytics gtags (#1434)
- Corrected placement of BingSiteAuth.xml config (#1445, #1439)
- Fixed website security issues and upgraded Docusaurus (#1545)
- Moved Geospatial Services to its own folder (#1345)
- Bumped minimist from 1.2.5 to 1.2.6 in /website (#1455)
- Bumped node-forge from 1.2.1 to 1.3.0 in /website (#1451)
- Bumped prismjs from 1.25.0 to 1.27.0 in /website (#1430)
- Bumped follow-redirects from 1.14.7 to 1.14.8 in /website (#1402)
- Bumped nanoid from 3.1.23 to 3.2.0 in /website (#1355)
- Bumped shelljs from 0.8.4 to 0.8.5 in /website (#1347)
- Bumped follow-redirects from 1.14.1 to 1.14.7 in /website (#1348)
- Bumped cross-fetch from 3.1.4 to 3.1.5 in /website (#1496)
- Bumped async from 2.6.3 to 2.6.4 in /website (#1481)
- Pinned onnxmltools to a specific version (#1524)
Bug Fixes 🐞
- Fixed twitter sentiment detection notebook (#1544)
- Fixed issue with `DataConversion` serialization (#1505)
- Fixed typos in `TestBase` (#1501)
- Fixed issue in `GridSpace` Python API (#1470)
- Fixed reflective class loading in IntelliJ (#1456)
- Removed verbose `ComputeModelStatistics` output and converted `scoredLabelsCol` to DoubleType (#1361)
- Fixed flaking in geospatial notebooks
Code Style 🎶
- Improved style checks using pre-commit (#1538, #1528, #1535)
- Formatted code and notebooks with Black style checker (#1522, #1520)
Documentation 📘
- Tabularized badges for readability (#1486)
- Added a PR template (#1418)
- Improved installation readme (#1369, #1422)
- Added a Security readme (#1511)
- Updated the Azure Synapse readme (#1372)
- Removed reference to custom Maven resolver
- Added pointer to docs on Synapse pool configuration
- Fixed typos in readme (#1516)
Contributor Spotlight
We are excited to highlight the contributions of the following SynapseML contributors:
| Serena Ruan | Ric Serradas | Puneet Pruthi |
| --- | --- | --- |
| Serena is a Software Engineer II on the Synapse team in Beijing and a force of nature. In this release, Serena has continued her prolific contribution streak by adding language support for .NET, C#, and F# and integrating SynapseML with MLFlow. Additionally, Serena has contributed several features to the MLFlow and Spark.NET open-source communities so that these systems can work better for every user. These contributions are just some of the many amazing things Serena has accomplished during this release, and her devotion and craft are pivotal to the ecosystem. | Ric is a Senior Engineering Manager on the OneNote team with a shining personality and drive to collaborate. In just a few weeks, Ric hit the ground running by setting up an automated link between GitHub and Azure DevOps, building the first working version of SynapseE2E tests, and rewriting our entire build in GH Actions. Furthermore, Ric worked tirelessly through nights and weekends to land his contributions. | Puneet is a Senior Engineer on the SynapseML team with a knack for engineering systems and dockerization. Puneet's contributions to the library include architecting the new Binder integration, driving our Synapse E2E tests to completion, and improving SynapseML's infrastructure around community engagement. Puneet is constantly thinking of ways to improve the community and we value his effort. |

| Mark Niehaus | Keerthi Yanda | Yagna Oruganti |
| --- | --- | --- |
| Mark is a Senior Software Engineer on the SynapseML team with a deep knowledge of the .NET ecosystem and infrastructure development. In this release, Mark architected SynapseML's .NET binding blob publishing strategy, drove the OpenAI GPT-3 bindings to completion, and wrote a detailed GPT-3 walkthrough. Mark completed these projects while supporting the Time Series Insights service, speaking to his ability to keep multiple plates spinning at once. | Keerthi is a Software Engineer II on the SynapseML team. Despite joining Microsoft just a few months ago, Keerthi has quickly learned the SynapseML ropes to take command of our integration with the Azure Synapse platform. Huge kudos to her for braving long build times and daunting error messages to make sure SynapseML works out of the box on Synapse Analytics clusters. | Yagna is a Senior Data and Applied Scientist on the Industry AI team with a talent for building solutions that integrate many community tools to solve customer challenges. Yagna's first contribution to SynapseML was a masterpiece of a demo showing how to use Isolation Forests, MLFlow, Tabular SHAP, and the interpret-ml explanation dashboard in a single anomaly detection example. |
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of SynapseML:
Serena Ruan @serena-ruan, Eric Dettinger, Scott Votaw @svotaw, Puneet Pruthi @ppruthi, Ric Serradas @riserrad, Mark Niehaus @niehaus59, Kyle Rush @k-rush, Keerthi Yanda @KeerthiYandaOS, Yagna Oruganti @YagnaDeepika, Jason Wang @memoryz, Ilya Matiach @imatiach-msft, Yazeed Alaudah @yalaudah, Elena Zherdeva @ezherdeva, Kashyap Patel @ms-kashyap, Martha Laguna @martthalch @marthalc, Alex Li @liyzcj, Maria Guirguis @maguir, Alexandra Savelieva @alsavelv, @netang, Sudhindra Kovalam @SudhindraKovalam, Markus Cozowicz @eisber, Tom Finley, Markus Weimer, Jeff Zheng, James Verbus @jverbus, Chris Hoder, Misha Desai, Nellie Gustafsson, Eren Orbey, Beverly Kodhek, Louise Han @jr-MS, Justyna Lucznik, Kim Manis, Mitrabhanu Mohanty, Bogdan Crivat, Anand Raman, William T. Freeman, James Montemagno, Luis Quintanilla, Dennis Kennedy, Ryan Hurey, Jarno Ensio, Brian Mouncer, Steve Suh @suhsteve, Akshaya Annavajhala (AK), Guolin Ke, Tara Grumm, Niharika Dutta @Niharikadutta, Andrew Fogarty, Juanyong Duan, Weichen Xu @WeichenXu123, Spark.NET Team, ONNX Team, Azure Global, Vowpal Wabbit Team, LightGBM Team, MSFT Garage Team, MSR Outreach Team, Speech SDK Team, MLflow Team
Learn More
Visit our website for the latest docs, demos, and examples | Read more about SynapseML's GA release in the Microsoft Research Blog | Learn more about our .NET bindings and code generation system |
Watch a demonstration of SynapseML to create a multilingual search engine | Read our paper from IEEE Big Data '21 | Explore our integration with the Azure OpenAI Service |