v0.0.22
Release date: 2023-09-27 20:30:15
Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details (a usage sketch follows this list)
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention` (see the second sketch after this list)
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE (rotary position embedding) implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control over which activations to keep and which to recompute (a concept sketch follows this list)
- `xformers.info` now indicates the Flash-Attention version used
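
A minimal sketch of the 5-dimensional input layout for Grouped-Query Attention, assuming the shape convention `[batch, seq_len, kv_groups, heads_per_group, head_dim]` described in the `memory_efficient_attention` documentation; all sizes below are illustrative:

```python
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 64  # batch, sequence length, head dim (illustrative)
G, H = 4, 2            # 4 key/value groups, 2 query heads per group (GQA)

# 5-D inputs: [batch, seq_len, kv_groups, heads_per_group, head_dim]
q = torch.randn([B, M, G, H, K], device="cuda", dtype=torch.float16)
# One key/value head per group, shared across that group's H query heads
# via expand() -- a stride trick, so no extra memory is allocated.
k = torch.randn([B, M, G, 1, K], device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn([B, M, G, 1, K], device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)  # -> [B, M, G, H, K]
```

Multi-Query Attention is the `G = 1` case: a single key/value head shared by every query head.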
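
For the Local Attention biases, a sketch using the `LocalAttentionFromBottomRightMask` bias from `xformers.ops.fmha.attn_bias`; the class name and its `window_left`/`window_right` parameters are my reading of the attn_bias module, so verify the exact names against the documentation for your version:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

B, M, H, K = 2, 1024, 8, 64  # illustrative sizes
q = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)
k = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)
v = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)

# Each query attends only to a sliding window: up to 128 positions to the
# left and none to the right (causal local attention). Parameter names are
# assumptions; check the attn_bias documentation.
bias = fmha.attn_bias.LocalAttentionFromBottomRightMask(window_left=128, window_right=0)
out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```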
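
For selective activation checkpointing, a concept-only sketch built on plain `torch.utils.checkpoint` rather than xformers' own API (which is not reproduced here): the attention sub-block's activations are recomputed during the backward pass while the MLP's are kept. The `Block` module and all sizes are hypothetical:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # Hypothetical transformer block illustrating fine-grained control:
    # recompute the attention activations, keep the MLP activations.
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Attention activations are NOT stored; they are recomputed in the
        # backward pass (trades compute for memory).
        attn = lambda t: self.attn(t, t, t, need_weights=False)[0]
        x = x + checkpoint(attn, x, use_reentrant=False)
        # MLP activations are kept as usual (faster backward, more memory).
        return x + self.mlp(x)

x = torch.randn(2, 16, 256, requires_grad=True)
Block()(x).sum().backward()
```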
Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue facebookresearch/xformers#848)