Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own
🧩No GPU but wanna create your own LLM on your laptop?
🎁Here is a gift for you: QLoRA on CPU, making LLM fine-tuning on a client CPU possible! Just give it a try.
📔Blog: Kudos to ITREX team!
🎯Code:
#IAmIntel
#intelai
@intel
@huggingface
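The core QLoRA idea behind this release, a frozen low-bit base weight plus a small trainable low-rank adapter, can be sketched in plain Python. This is a toy illustration only, not the ITREX API; real QLoRA uses NF4 quantization and trains the adapter with backprop:

```python
# Toy QLoRA forward pass: frozen 4-bit-quantized base weight + low-rank adapter.
# Illustrative only; real QLoRA uses NF4 quantization and trains A/B by backprop.

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = max(abs(x) for x in w) / qmax or 1.0
    q = [round(x / scale) for x in w]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Frozen base weight (1x4 for simplicity) and a rank-1 adapter B (1x1) @ A (1x4).
w = [0.12, -0.53, 0.98, -0.07]
q, s = quantize_rtn(w)
w_hat = dequantize(q, s)

b = [0.5]                                # trainable "down" projection
a = [0.02, -0.01, 0.0, 0.03]             # trainable "up" projection
adapter = [b[0] * x for x in a]          # rank-1 update B @ A

# Effective weight seen by the model: dequantized base + low-rank update.
w_eff = [wq + d for wq, d in zip(w_hat, adapter)]

err = max(abs(x - y) for x, y in zip(w, w_hat))
print(f"max 4-bit reconstruction error: {err:.3f}")
print("effective weight:", [round(x, 3) for x in w_eff])
```

Only the tiny A/B matrices get gradients, which is why this fits in laptop memory.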
📢Just change the model name, and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization!
🎯Code: , supporting Mistral, Llama2, Mixtral-MOE, Phi2, Solar, and other recent LLMs.
🤗
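The low-bit quantization mentioned here is, at its core, weight-only group-wise quantization: each small group of weights gets its own scale, keeping error local. A minimal round-to-nearest sketch (toy code, not the actual Intel Extension for Transformers kernels):

```python
# Toy group-wise INT4 weight-only quantization (round-to-nearest).
# A simplified view of what low-bit LLM runtimes do; not the ITREX implementation.

def quantize_groupwise(weights, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        quantized.extend(round(w / scale) for w in group)
    return quantized, scales

def dequantize_groupwise(quantized, scales, group_size=4):
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

weights = [0.1, -0.2, 0.05, 0.15, 2.0, -1.5, 0.8, 1.1]  # two groups, very different ranges
q, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("scales:", [round(s, 3) for s in scales])
print(f"max error: {max_err:.4f}")
```

Per-group scales are what keep the small first group from being drowned out by the large second one.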
🎯We released GPT-J-6B INT8 ONNX models (first INT8 ONNX LLM❓) with ~4x model size reduction while preserving ~99.9% of the FP32 baseline accuracy.
🔥GPT-J-6B INT8 models are now publicly available at Hugging Face model hub!
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x speedup on Intel i7-12900 over llama.cpp!
🎯Code:
📢Chatbot demo on PC: ; Hugging Face space demo locally:
#oneapi
@intel
@huggingface
@_akhaliq
@Gradio
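The INT8 scheme used for models like these maps each float tensor to 8-bit integers with a scale and zero-point, which is where the ~4x size reduction comes from (FP32 is 4 bytes per value, INT8 is 1). A minimal sketch under simplified assumptions, not the Neural Compressor code:

```python
# Toy asymmetric INT8 quantization with a zero-point.
# Sketch only; real INT8 ONNX export also handles per-channel scales, calibration, etc.

def quantize_int8(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)            # maps lo -> 0, hi -> 255
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

acts = [-1.2, -0.4, 0.0, 0.7, 2.3]
q, scale, zp = quantize_int8(acts)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(acts, restored))
print(f"scale={scale:.4f} zero_point={zp} max_err={max_err:.4f}")
```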
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms! Start from CPUs🚀
🙌Don't hesitate to give it a try. Sample code below👇
🎯Project:
#iamintel
#intelai
@intel
@huggingface
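For context, Mixtral-8x7B is a sparse mixture-of-experts model: each token is routed to only 2 of its 8 experts, which keeps per-token compute near a much smaller dense model and helps make CPU inference feasible. A toy sketch of top-2 routing (random gate scores stand in for a learned router):

```python
# Toy top-2 mixture-of-experts routing in the style of Mixtral-8x7B.
# Illustrative sketch only: softmax gate over random expert scores.

import math, random

random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2

def route(token_scores, top_k=TOP_K):
    """Pick the top-k experts and renormalize their gate weights with softmax."""
    top = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:top_k]
    exp = [math.exp(token_scores[i]) for i in top]
    total = sum(exp)
    return [(i, e / total) for i, e in zip(top, exp)]

scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selected = route(scores)
print("selected (expert, weight):", [(i, round(w, 3)) for i, w in selected])
```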
⚡️AutoRound, a new SOTA LLM low-bit quantization approach developed by the Intel Neural Compressor team ()
🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details:
@huggingface
#IAmIntel
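AutoRound's key departure from plain round-to-nearest is that it learns the rounding decisions so as to minimize the layer's output error. A toy sketch of that idea; the real method uses SignSGD over calibration data rather than this naive coordinate search:

```python
# Sketch of AutoRound's core idea: learn per-weight rounding offsets that
# minimize layer output error, instead of always rounding to nearest.
# Toy coordinate-wise sign search; the real method uses SignSGD on calibration data.

def fake_quant(w, scale, offsets):
    # an offset in [-0.5, 0.5] nudges each weight's rounding decision
    return [round(wi / scale + oi) * scale for wi, oi in zip(w, offsets)]

def output_err(w, x, scale, offsets):
    wq = fake_quant(w, scale, offsets)
    return abs(sum(a * b for a, b in zip(w, x)) - sum(a * b for a, b in zip(wq, x)))

w = [0.34, -0.71, 0.52, -0.18]
x = [1.0, 0.5, -0.8, 2.0]            # one "calibration" activation
scale = max(abs(v) for v in w) / 7   # symmetric 4-bit

offsets = [0.0] * len(w)
lr = 0.1
err = output_err(w, x, scale, offsets)
for _ in range(50):                  # sign-style coordinate descent
    for i in range(len(w)):
        for step in (lr, -lr):
            trial = offsets[:]
            trial[i] = max(-0.5, min(0.5, trial[i] + step))
            e = output_err(w, x, scale, trial)
            if e < err:
                offsets, err = trial, e

rtn_err = output_err(w, x, scale, [0.0] * len(w))
print(f"RTN output error:     {rtn_err:.4f}")
print(f"learned-offset error: {err:.4f}")
```

The learned offsets can only match or beat plain RTN on the calibration input, which is the intuition behind the accuracy gains over GPTQ/AWQ-style methods.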
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbots, co-pilot assistants, ...)
📷Location: Shanghai (or working remotely in PRC)
🎯Project:
If you are interested, DM me with your resume.😀
🔥Want to quantize a 100B+ model on your laptop with 16GB memory? Hmmm, GPTQ does not work...
🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓
📕Blog:
#oneapi
@intel
@huggingface
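Layer-wise quantization keeps only one layer's FP32 weights in memory at a time, which is what lets a 16GB laptop handle models far larger than its RAM. A toy sketch (the loading function is a stand-in, not the Neural Compressor API):

```python
# Sketch of layer-wise quantization: only one layer's FP32 weights live in
# memory at a time; quantize it, keep the compact result, free the FP32 copy.
# Toy code, not the Neural Compressor implementation.

def load_layer(name):
    """Stand-in for reading one layer's FP32 weights from disk."""
    return [0.01 * i for i in range(1000)]   # pretend tensor

def quantize_layer(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax or 1.0
    return [round(v / scale) for v in w], scale

layer_names = [f"model.layers.{i}" for i in range(4)]
quantized_model = {}
peak_fp32_layers = 0

for name in layer_names:
    fp32 = load_layer(name)                  # load ONE layer
    peak_fp32_layers = max(peak_fp32_layers, 1)
    quantized_model[name] = quantize_layer(fp32)
    del fp32                                 # free before the next layer

print(f"layers quantized: {len(quantized_model)}, "
      f"peak FP32 layers in memory: {peak_fp32_layers}")
```

Peak memory is bounded by the largest single layer rather than the whole model, so model size stops being the limit.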
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, collaborators! Love you all!!
🔥We released QLoRA for CPU, to help you fine-tune LLMs on your laptop! See below👇
📢Code:
#deeplearning
#intelai
#GenAI
@intel
@huggingface
🚀Share with you a nice blog "llama.cpp + Intel GPUs". Congrats to the awesome team especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to
@ggerganov
for your great support.
📢Check out the blog:
🎯WIP with ollama now
#iamintel
#llama
@ollama
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs!
🔥Code:
📕Doc:
#oneapi
@intel
@huggingface
@Guangxuan_Xiao
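StreamingLLM's cache policy, keeping a few initial "attention sink" tokens plus a sliding window of recent tokens, is what bounds memory for arbitrarily long generation. A toy sketch (the real implementation also handles the re-evaluation and RoPE key shifting mentioned above):

```python
# Sketch of StreamingLLM's KV-cache policy: always keep the first few
# "attention sink" tokens plus a sliding window of recent tokens, so the
# cache stays bounded no matter how long generation runs. Toy version.

N_SINK, WINDOW = 4, 8

def evict(cache):
    """Keep sink tokens + the most recent WINDOW tokens."""
    if len(cache) <= N_SINK + WINDOW:
        return cache
    return cache[:N_SINK] + cache[-WINDOW:]

cache = []
for token_id in range(100):      # "infinite" generation
    cache.append(token_id)
    cache = evict(cache)

print("cache size:", len(cache))
print("kept:", cache)
```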
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc 770!
PR:
Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks
@ggerganov
for the review! Transformer-like API soon in
@RajaXg
🤗Intel Extension for Transformers enables running microsoft/phi-2 smoothly on a laptop (faster than human reading speed🚀). Sample code👇
🎯Code: . Try it and have fun!
🎁DM me your favorite LLM. Next will be Solar :)
#iamintel
#intelai
@intel
@huggingface
@murilocurti
📢Just created an open-source project dedicated to speeding up LLMs
🌟Project:
🤗Looking forward to your suggestions; let me know which topics interest you and what you'd like to see.
#LLM
@intel
@huggingface
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB ~ 10TB (though not open-sourced yet), even on consumer HW!
📕Instruction:
🌟Project:
#oneapi
@intel
@huggingface
#LLM
🚀Thrilled to announce that NeuralSpeed v1.0 alpha is released! Highly optimized INT4 kernels and blazing fast LLM inference on CPUs!
🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ
@casper_hansen_
and AutoGPTQ
📔 Blog:
🔥
🚀Highly-efficient x86 INT4 kernels are now available in ONNX Runtime. Use Intel Neural Compressor to quantize LLMs and run efficiently with ONNX Runtime on Intel CPUs!
📔PR:
🎯Source of INT4 kernels:
#intelai
@intelai
@huggingface
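INT4 kernels typically store two 4-bit values per byte and unpack (and sign-extend) them on the fly. A toy sketch of that packing; the actual ONNX Runtime kernel layout with its blocking and scales is more involved:

```python
# Sketch of INT4 weight storage: two signed 4-bit values packed per byte.
# Illustrative only, not the actual ONNX Runtime kernel layout.

def pack_int4(values):
    """Pack signed 4-bit ints (-8..7), two per byte."""
    packed = []
    for i in range(0, len(values), 2):
        lo = values[i] & 0x0F
        hi = (values[i + 1] & 0x0F) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    out = []
    for b in packed:
        for nibble in (b & 0x0F, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:count]

q = [3, -2, 7, -8, 0, 5]
blob = pack_int4(q)
print(f"{len(q)} int4 values -> {len(blob)} bytes")
assert unpack_int4(blob, len(q)) == q
```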
🎁Thrilled to share Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for model quantization/compression for LLMs, helping to bring AI everywhere.
👨💻Release notes:
🚀Code:
#iamintel
#intelai
#oneapi
🚀Embedding is super fast on SPR! Just ~500 seconds for 1M samples (512 seq len/sample) using the Intel-optimized BGE model with INC and ITREX, making RAG more accessible!
📷Quick guide:
🎯
#iamintel
#intelai
@intelai
@huggingface
🔥Excited to share new BGE-base-v1.5 INT8 models with <1% accuracy loss from the FP32 baseline on the STS dataset (previously SST2)! BGE for RAG!!
🤗Model-1:
🤗Model-2:
🚀Code:
#oneapi
@IntelSoftware
@huggingface
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day in 2023!
📔Release notes:
🎯Code:
🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME).
📔Paper:
🎯Sample code:
💣The technology behind it looks interesting and useful, and it is supposed to work with SFT and RAG to reduce hallucination!
📢Happy to share Intel Extension for Transformers v1.0 released:
🎯 NeuralChat, a custom chatbot on domain knowledge through Hugging Face PEFT. Now you can create your own chatbot within 1 hour on CPUs.
@humaabidi
@MosheWasserblat
@jeffboudier
🎯When DeepSpeed meets Intel AI SWs, the performance magic happens!
🚀Accelerate Llama 2 inference on Xeon SPR by up to ~1.7x!
📔Blog:
🎁Intel AI SWs:
IPEX:
INC:
and
#oneapi
@intelai
@AIatMeta
@MSFTDeepSpeed
🤗NeuralChat beats GPT-4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by
@vectara
.
📢RL/DPO is becoming so important for improving model quality, particularly for responsible AI.
🎯Code to fine-tune NeuralChat:
📢Happy to share INT4 inference on
@intel
GPUs (e.g., PVC & Arc) is available in Intel Ext. for Transformers as an experimental feature (powered by IPEX)! More are coming!!
🎯Release notes:
🚀Code:
#intelai
#intelgpu
@huggingface
📢NeuralChat, an open chat framework created by
@intel
, now supports the
@huggingface
assisted generation to make chatbots more efficient on Intel platforms!
🎯Guide to deploy a chatbot:
🚀Code:
#iamintel
#intelai
Go, ITREX!
🔥All you need is Intel Neural Compressor (INC) for INT4 LLMs. INC v2.5 released with SOTA INT4 LLM quantization (AutoRound) across platforms, incl. Intel Gaudi2, Xeon, and GPU.
🎯Models: Llama2, Mistral, Mixtral-MOE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
🎯The embedding model is super important for a RAG system. Here is a tutorial showing how to tune BAAI/bge-base for high performance.
📔
💣 Extended LangChain to load the optimized embedding model and improved inference on Intel platforms.
📢More Intel NeuralChat-v3 7B LLMs are released, and more technical details are published in the blog👇
🎯Blog:
🙌Welcome to use
@intel
NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SWs.
#iamintel
#intelai
@huggingface
🎯High-performance INT4 Mistral-7B model available on
@huggingface
, quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and run efficiently by Intel Extension for Transformers!
🤗 Model:
🌟,
🎯Meta launched Llama3. See how it works well across Gaudi, Xeon, GPU, and AIPC! Check out the blog:
🔥Happy to share that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
🔥MLPerf Inference v4.0 results are out!
1⃣The only CPU able to achieve 99.9% accuracy
2⃣1.8x perf speedup over the last submission
3⃣Summarizes a news article per second in real time
📘Blog:
🎯Code for MLPerf GPT-J:
#MLPerf
#IAmIntel
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel.
🧨5,000+ Transformer models quantized automatically
🔥Neural Coder demo on Hugging Face Spaces: .
⭐️Check it out and give it a try!
@ellacharlaix
@jeffboudier
@_akhaliq
❓Fine-tuning or RAG? Not sure how to choose?
🎯Fine-tuning is not the only way to make your LLM smarter! You can also try RAG. Here are the recommendations and examples:
📢Reproducible through Intel Extension for Transformers: 🚀
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
📢Slimmed BGE embedding models are coming, shortly after the quantized ones. More importantly, slimming and quantization can be combined!
🎁 Private RAG-based chatbots on client devices are more accessible!
👨💻
🎯
#intelai
#NeuralChat
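A private RAG chatbot boils down to: embed your documents once, retrieve the closest match for each query, and prepend it to the prompt. A toy sketch where bag-of-words counts stand in for a real embedding model like BGE:

```python
# Minimal sketch of the RAG loop behind a private chatbot: embed documents,
# retrieve the closest one for a query, prepend it to the prompt.
# Toy bag-of-words "embeddings" stand in for a real model like BGE.

import math
from collections import Counter

def embed(text):
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Intel Neural Compressor supports INT4 quantization",
    "The cafeteria opens at nine in the morning",
]
index = [(d, embed(d)) for d in docs]

query = "which tool supports INT4 quantization?"
qv = embed(query)
best_doc, _ = max(index, key=lambda item: cosine(qv, item[1]))

prompt = f"Context: {best_doc}\nQuestion: {query}"
print(prompt)
```

Everything runs locally, which is the "private" part; the retrieved context is what grounds the chatbot's answer.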
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators!
🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
🎯Happy to announce the source code and examples of "Fast DistilBERT on CPU" (a NeurIPS'22 paper) were released:
🧨Included in Top NLP Papers Nov'22 by
@CohereAI
and highlighted as "Fast Transformers on CPUs with SOTA performance" by
@Synced_Global
!
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! A lot of great partners👍
🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, and LangChain, and by technology from Intel Extension for Transformers.
👨💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video.
🎯Features: 1) runs on your PC for copilot chat, so it's 100% free and safe; 2) runs on a server for code generation, so it may generate better code; 3) smart model switching. VS plugin is coming🚀
#intelai
@intel
💣Happy to announce INT4 NeuralChat-7B models available on
@huggingface
, powered by the SOTA INT4 algorithm developed by Intel, yet compatible with AutoGPTQ!
🤗
🤗
📔Paper:
🎯Sample code:
📢INT4 GPTQ and RTN landed in ONNX Runtime through Intel Neural Compressor. AI on PC is coming!
📔PR: Thanks to Yuwen, Mengni, and Yufeng!
🌟Code:
#intelai
#onnxruntime
#neuralcompressor
🎁Happy to announce Intel Extension for Transformers supports INT8 quantization for MSFT Phi, making Phi inference more efficient and accessible than ever!
📔Quick guide:
🎯Code available:
#iamintel
#intelai
@intel
@huggingface
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see
@intelai
created LLMs & datasets being welcomed, used, discussed, and improved. Go, Intel LLMs!
Congrats to Intel team members Haihao Shen and Kaokao Lv, whose fine-tuned version of Mistral 7B hit the top of the list on the
@huggingface
LLM leaderboard last week:
Fine-tuned on 8x Intel Gaudi2 Accelerators.
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice:
🎯Sharing our MLSys'24 camera-ready paper: "Efficient Post-Training Quantization with FP8 Formats"
🤗
@_akhaliq
@navikm
@huggingface
#IAmIntel
🎯Want to enable audio in your chatbot? It takes just a few minutes.
📕Here is a guide for you, including ASR, TTS, audio processing, audio streaming, and multi-lang EN & CN support:
📢Optimized code: with 🤗 models
#iamintel
#intelai
@intel
@huggingface
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor, with the HW architecture and data types proposed by MSFT and defined by OCP
📢Here is a tutorial: with source code publicly available in
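MX data types give each small block of elements a shared power-of-two scale plus a narrow per-element format. A toy sketch with plain 4-bit integer elements; per the OCP MX spec, real MXFP4 uses an E2M1 element format and 32-element blocks:

```python
# Sketch of an MX-style block data type: an 8-element block shares one
# power-of-two scale; each element is stored in a narrow 4-bit format.
# Toy version; real MX formats (per the OCP spec) use E2M1-style elements.

import math

def mx_encode(block):
    amax = max(abs(v) for v in block) or 1.0
    # pick the smallest power-of-2 scale such that |q| <= 7
    shared_exp = math.ceil(math.log2(amax / 7))
    scale = 2.0 ** shared_exp
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return shared_exp, q

def mx_decode(shared_exp, q):
    return [v * 2.0 ** shared_exp for v in q]

values = [0.5, -1.25, 3.0, 0.0, 2.25, -0.75, 1.5, -2.0]
exp, q = mx_encode(values)
restored = mx_decode(exp, q)
err = max(abs(a - b) for a, b in zip(values, restored))
print(f"shared exponent: {exp}, max error: {err:.4f}")
```

Storing one tiny exponent per block instead of a full float scale is what makes the format cheap in hardware.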
🌟Happy to announce Intel Extension for Transformers v1.4 released, with many improvements for building GenAI applications on Intel platforms!
🎯Check out the release notes:
🤗
@intel
+
@huggingface
= one of the best GenAI platforms
📢Intel Extension for Transformers () supports INT4 and low-bit inference on both CPUs and GPUs!
📔Simple usage guide:
🔥All you need is to get an Intel GPU and run LLMs
@huggingface
🤗
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers!
@upstageai
@hunkims
. INT4 inference is available with a one-parameter change from "load_in_8bit" to "load_in_4bit".
📢Next one will be Zephyr🙌
👇Check out the sample code and give it a try!
📢When AI meets cybersecurity, see how the Intel NeuralChat LLM helps. Happy to share a nice blog, "Harnessing the Intel NeuralChat 7B Model for Advanced Fraud Detection". Congrats
@Saminusalisu
!
🎯Check out the details:
#intelai
#iamintel
@humaabidi
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details:
🤗Code: . Star the project if you find it useful.
🌟Happy Chinese New Year! 🎇
👨💻2023 was the year of open LLMs. Is it time to predict 2024? DM me your thoughts.
📢Re-sharing the blog from
@clefourrier
: , incl. Intel NeuralChat-7B and DPO dataset😀
🤗We hope to contribute more to open-source LLM community in 2024!
#iamintel
@huggingface
🔥Happy to announce Intel Extension for Transformers v1.3.2 released
📔Release notes:
🎯Highlights: enables popular serving frameworks, e.g.,
@huggingface
TGI, vLLM, and Triton, to build highly efficient chatbots on Intel platforms such as Gaudi2 with a few lines of code
🥳Happy to share with you the Intel optimizations for Diffusers textual inversion and the fine-tuning demo of Stable Diffusion on Spaces!
👉 Intel optimizations:
🎯Spaces:
🤗Thanks to Patrick,
@anton_lozhkov
@_akhaliq
from HF!
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!!
🎯The copilot is built on top of . The code/ext will be released soon. Stay tuned!🚀
🤗Want to build an enterprise-grade RAG system? Efficient embedding is what you want. Here is a nice blog from Intel and
@huggingface
friends on "Intel Fast Embedding" with and
#IAmIntel
@MosheWasserblat
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and accelerate LLM inference further🚀
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of
@Intel
Gaudi 2 accelerators versus
@Nvidia
's A100 and H100 for two of our models. (1/3)
🎯SmoothQuant is now available in ONNX Runtime through Intel Neural Compressor:
👉Start with the example in and quantize your favorite LLM!
👍Thanks to Mengni, Tianlei, Yihong, Yufeng, and the team!
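SmoothQuant's trick is to migrate activation outliers into the weights with per-channel scales, leaving the layer output mathematically unchanged since y = (X / s) @ (s * W). A toy sketch with made-up per-channel ranges, not the Neural Compressor implementation:

```python
# Sketch of SmoothQuant: per-channel scales move quantization difficulty
# from outlier activation channels into the weights. Toy numbers only.

alpha = 0.5
act_max = [8.0, 0.5, 4.0]        # per-channel |activation| maxima (one outlier)
wgt_max = [0.4, 0.6, 0.5]        # per-channel |weight| maxima

# s_j = act_max_j^alpha / wgt_max_j^(1 - alpha)
s = [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, wgt_max)]

smoothed_act = [a / sj for a, sj in zip(act_max, s)]   # X / s
smoothed_wgt = [w * sj for w, sj in zip(wgt_max, s)]   # s * W

print("scales:", [round(v, 3) for v in s])
print("activation range after smoothing:", [round(v, 3) for v in smoothed_act])
print("weight range after smoothing:", [round(v, 3) for v in smoothed_wgt])
```

With alpha = 0.5 the smoothed activation and weight ranges become equal per channel (both sqrt(a*w)), so neither side dominates the quantization error.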
📢We are hiring full-time interns for efficient LLM inference.
🔥Group: Intel/DCAI/AISE
🎯Location: Shanghai, Zizhu
😀Working projects:
* INC:
* ITREX:
If you are interested in LLM compression and inference, DM me with your resume.😀
Thanks
@martin_casado
!
@Intel
has been making AI more accessible through a rich SW portfolio and diverse Intel HW! We also released high-perf LLMs and high-quality datasets for LLM training! People can easily create their own chatbot through Intel Extension for Transformers!
🎯Intel optimizations meeting LangChain make RAG system more efficient!
⚡️Here is an optimized embedding using Intel Extension for Transformers ().
It has been integrated into
@LangChainAI
: More are coming. 3⃣2⃣1⃣
#iamintel
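Structurally, "extending LangChain" here means providing an object with the embed_documents()/embed_query() interface LangChain expects from an embeddings backend. A duck-typed toy sketch, with hash-based vectors standing in for the optimized model:

```python
# Sketch of a LangChain-style embeddings backend: any object exposing
# embed_documents() and embed_query() fits the interface. The hash-based
# vectors below are a toy stand-in for an optimized embedding model.

class OptimizedEmbeddings:
    """Duck-typed LangChain-style Embeddings class (toy backend)."""

    def __init__(self, dim=8):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            vec[hash(token) % self.dim] += 1.0
        return vec

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)

emb = OptimizedEmbeddings()
vectors = emb.embed_documents(["hello world", "intel extension for transformers"])
print(len(vectors), "docs ->", len(vectors[0]), "dims each")
```

Because only this small interface is required, the optimized model can be swapped in without touching the rest of the RAG pipeline.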
🔥MLPerf Inference: Intel Extension for Transformers showed 1.8x performance speedup on GPT-J using INT4 inference on 5th Gen Xeon (vs. 4th). Congrats to the team: Yi, Zhentao, Hengyu, Yu, and Kevin! Blazing fast on CPUs, even on clients!!
🎯Blog:
#IAmIntel
Happy New Year! It was my honor to be invited as the first guest speaker of 2024 by
@CohereForAI
. Enjoyed sharing the work that the teams have been doing to make LLMs more efficient on Intel platforms. Thanks to the outstanding event organizer
@AhmadMustafaAn1
!
#iamintel
@intelai
Happy New Year! Our first guest speaker of 2024 is tomorrow, Wednesday, January 3rd, as our Geo Regional Asia Group welcomes
@HaihaoShen
, Senior AI Architect in DCAI/AISE at Intel Corporation, to present "Efficient LLM Inference on CPUs"
Learn more: