v0.0.22
Release date: 2023-09-27 20:30:15
Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details (a usage sketch follows this list)
- fMHA: Added experimental support for Local Attention biases to `memory_efficient_attention` (see the second sketch after this list)
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE (rotary position embedding) implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control over which activations to keep and which to recompute (a concept sketch follows this list)
- `xformers.info` now indicates the Flash-Attention version used
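
A minimal sketch of the 5-dimensional input layout for Grouped-Query Attention, assuming the shape convention `[batch, seq_len, kv_groups, heads_per_group, head_dim]` described in the `memory_efficient_attention` documentation; all sizes below are illustrative:

```python
import torch
import xformers.ops as xops

B, M, K = 2, 1024, 64  # batch, sequence length, head dim (illustrative)
G, H = 4, 2            # 4 key/value groups, 2 query heads per group (GQA)

# 5-D inputs: [batch, seq_len, kv_groups, heads_per_group, head_dim]
q = torch.randn([B, M, G, H, K], device="cuda", dtype=torch.float16)
# One key/value head per group, shared across that group's H query heads
# via expand() -- a stride trick, so no extra memory is allocated.
k = torch.randn([B, M, G, 1, K], device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn([B, M, G, 1, K], device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)  # -> [B, M, G, H, K]
```

Multi-Query Attention is the `G = 1` case: a single key/value head shared by every query head.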
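
For the Local Attention biases, a sketch using the `LocalAttentionFromBottomRightMask` bias from `xformers.ops.fmha.attn_bias`; the class name and its `window_left`/`window_right` parameters are my reading of the attn_bias module, so verify the exact names against the documentation for your version:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha

B, M, H, K = 2, 1024, 8, 64  # illustrative sizes
q = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)
k = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)
v = torch.randn([B, M, H, K], device="cuda", dtype=torch.float16)

# Each query attends only to a sliding window: up to 128 positions to the
# left and none to the right (causal local attention). Parameter names are
# assumptions; check the attn_bias documentation.
bias = fmha.attn_bias.LocalAttentionFromBottomRightMask(window_left=128, window_right=0)
out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```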
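
For selective activation checkpointing, a concept-only sketch built on plain `torch.utils.checkpoint` rather than xformers' own API (which is not reproduced here): the attention sub-block's activations are recomputed during the backward pass while the MLP's are kept. The `Block` module and all sizes are hypothetical:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # Hypothetical transformer block illustrating fine-grained control:
    # recompute the attention activations, keep the MLP activations.
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # Attention activations are NOT stored; they are recomputed in the
        # backward pass (trades compute for memory).
        attn = lambda t: self.attn(t, t, t, need_weights=False)[0]
        x = x + checkpoint(attn, x, use_reentrant=False)
        # MLP activations are kept as usual (faster backward, more memory).
        return x + self.mlp(x)

x = torch.randn(2, 16, 256, requires_grad=True)
Block()(x).sum().backward()
```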
Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue facebookresearch/xformers#848)