Haihao Shen Profile Banner
Haihao Shen Profile
Haihao Shen

@HaihaoShen

2,757
Followers
2,595
Following
36
Media
401
Statuses

Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own

Shanghai
Joined September 2021
Pinned Tweet
@HaihaoShen
Haihao Shen
14 hours
🔥Want to get the best low-bit LLM? Yes, we released a dedicated low-bit open LLM leaderboard for AIPC: , inspired by the @huggingface LLM leaderboard! #intelai #inc #GPTQ #AWQ #GGUF @humaabidi @lvkaokao @_akhaliq @ollama @martin_casado @jeremyphoward
5
44
166
@HaihaoShen
Haihao Shen
6 months
🔥Excited to share our NeurIPS'23 paper on efficient LLM inference on CPUs! Compatible with GGML, yet up to 1.5x better performance than llama.cpp! 📢Paper: 📕Code: #oneapi @intel @huggingface @_akhaliq @MosheWasserblat
Tweet media one
7
109
664
@HaihaoShen
Haihao Shen
5 months
📢Just change the model name and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization! 🎯Code: , supporting Mistral, Llama2, Mixtral-MOE, Phi2, Solar, and the most recent LLMs. 🤗
4
58
322
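The low-bit quantization behind these speedups can be illustrated with a minimal sketch: round-to-nearest (RTN) group-wise INT4 quantization, where each group of weights shares one FP scale. This is only a toy illustration under stated assumptions, not the actual ITREX kernels, and `quantize_int4` is a hypothetical helper:

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """RTN group-wise INT4 quantization: each group of weights stores
    INT4 codes plus one FP scale, giving a large size reduction."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # symmetric INT4 range
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate FP weights from codes and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(dequantize(q, scale) - w).max()
print(err)  # small per-group reconstruction error
```

The per-group scale bounds the error to half a quantization step per group, which is why group-wise schemes preserve accuracy much better than one scale per tensor.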
@HaihaoShen
Haihao Shen
1 year
🎯We released GPT-J-6B INT8 ONNX models (first time for INT8 ONNX LLM❓) with ~4x model size reduction while preserving ~99.9% accuracy of FP32 baseline. 🔥GPT-J-6B INT8 models are now publicly available at Hugging Face model hub!
6
48
281
@HaihaoShen
Haihao Shen
6 months
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x over llama.cpp on Intel i7-12900! 🎯Code: 📢Chatbot demo on PC: ; Hugging Face Space demo locally: #oneapi @intel @huggingface @_akhaliq @Gradio
Tweet media one
9
56
274
@HaihaoShen
Haihao Shen
5 months
📢Thrilled to announce Intel Extension for Transformers v1.3 is released, featuring 1) efficient low-bit inference and fine-tuning, and 2) an improved open-source chatbot framework, Neural Chat. 👨‍💻Notes: 🤗Code: Merry X'mas and Happy New Year!
2
52
194
@HaihaoShen
Haihao Shen
5 months
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms, starting from CPUs🚀 🙌Don't hesitate to give it a try. Sample code below👇 🎯Project: #iamintel #intelai @intel @huggingface
Tweet media one
5
41
233
@HaihaoShen
Haihao Shen
1 month
⚡️AutoRound, a new SOTA low-bit LLM quantization approach developed by the Intel Neural Compressor team () 🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details: @huggingface #IAmIntel
4
53
217
@HaihaoShen
Haihao Shen
7 months
🔥INT4 Whisper family models are out! Powered by Intel Extension for Transformers and INC! @YuwenZhou167648 @mengniwa @huggingface
8
41
209
@HaihaoShen
Haihao Shen
6 months
🚀NeuralChat-7B-v3-1 continues to rank #1 on the @huggingface open 7B LLM leaderboard! Even the INT8 model ranks #3!! 🤗Check out the leaderboard: 🥇Model: #iamintel #intelai #oneapi @intelai @lvkaokao
Tweet media one
4
33
204
@HaihaoShen
Haihao Shen
6 months
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbots, co-pilot assistants, ...) 📷Location: Shanghai (or working remotely in PRC) 🎯Project: If you are interested, DM me with your resume.😀
4
31
201
@HaihaoShen
Haihao Shen
7 months
🔥Want to quantize a 100B+ model on your laptop with 16GB memory? Hmmm, GPTQ does not work... 🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓ 📕Blog: #oneapi @intel @huggingface
8
36
199
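Layer-wise quantization keeps peak memory bounded by processing one layer at a time: load a layer's weights, quantize them, and free the full-precision copy before the next layer loads. A toy sketch of that loop (`quantize_layerwise` is a hypothetical helper, random arrays stand in for weights loaded from disk, and this is not the Neural Compressor API):

```python
import numpy as np

def quantize_layerwise(layer_shapes, seed=0):
    """Quantize a model layer by layer so that peak memory is one
    layer's FP32 weights, not the whole model."""
    rng = np.random.default_rng(seed)
    quantized = []
    for shape in layer_shapes:
        w = rng.standard_normal(shape).astype(np.float32)  # stand-in for a disk load
        scale = np.abs(w).max() / 127.0
        q = np.round(w / scale).astype(np.int8)
        quantized.append((q, scale))
        del w  # FP32 weights released before the next layer is loaded
    return quantized

layers = quantize_layerwise([(64, 64), (64, 128), (128, 64)])
print(len(layers))  # 3
```

With this pattern the working set is one layer plus its INT8 output, which is how a laptop can quantize a model far larger than its RAM.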
@HaihaoShen
Haihao Shen
6 months
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, and collaborators! Love you all!! 🔥We released QLoRA for CPU, to help you fine-tune LLMs on your laptop! See below👇 📢Code: #deeplearning #intelai #GenAI @intel @huggingface
3
44
195
@HaihaoShen
Haihao Shen
2 months
🚀Sharing a nice blog, "llama.cpp + Intel GPUs". Congrats to the awesome team, especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to @ggerganov for the great support. 📢Check out the blog: 🎯WIP with ollama now #iamintel #llama @ollama
2
47
191
@HaihaoShen
Haihao Shen
6 months
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs! 🔥Code: 📕Doc: #oneapi @intel @huggingface @Guangxuan_Xiao
Tweet media one
1
39
184
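StreamingLLM's core trick is a KV-cache eviction policy: always keep the first few "attention sink" positions plus a sliding window of recent tokens, so the cache stays bounded however long the stream runs. A minimal sketch of that policy (`streaming_kv_keep` is a hypothetical helper, not the actual implementation):

```python
def streaming_kv_keep(cache_len, n_sink=4, window=8):
    """Return the token positions to keep in the KV cache: the first
    n_sink 'attention sink' tokens plus the most recent `window` tokens."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))  # everything still fits
    return list(range(n_sink)) + list(range(cache_len - window, cache_len))

print(streaming_kv_keep(6))   # short stream: keep all positions
print(streaming_kv_keep(20))  # long stream: sinks + recent window only
```

The re-evaluation and shift-RoPE-K pieces mentioned in the tweet then adjust positional encodings for the evicted gap; the cache size itself never grows past `n_sink + window`.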
@HaihaoShen
Haihao Shen
3 months
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc 770! PR: Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks @ggerganov for the review! Transformer-like API soon in @RajaXg
5
39
183
@HaihaoShen
Haihao Shen
5 months
🤗Intel Extension for Transformers runs microsoft/phi-2 smoothly on your laptop (faster than human speed🚀). Sample code👇 🎯Code: . Try it and have fun! 🎁DM your favorite LLM. Next will be Solar :) #iamintel #intelai @intel @huggingface @murilocurti
Tweet media one
4
26
178
@HaihaoShen
Haihao Shen
6 months
🔥Excited to share a nice blog from @andysingal about the top-performing 7B LLM NeuralChat-v3-1 from Intel: . Check out the blog and give this model a try! ⚡️ #IAmIntel #intelai @intel @huggingface
5
25
169
@HaihaoShen
Haihao Shen
6 months
📢Just created an open-source project dedicated to speeding up LLMs 🌟Project: 🤗Looking forward to your suggestions; let me know which topics you are interested in and want to see. #LLM @intel @huggingface
4
27
169
@HaihaoShen
Haihao Shen
6 months
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB~10TB (though not open-sourced yet), even on consumer HW! 📕Instructions: 🌟Project: #oneapi @intel @huggingface #LLM
1
26
165
@HaihaoShen
Haihao Shen
4 months
🚀Intel Extension for Transformers accelerates GGUF models now! GGUF is the new format introduced by llama.cpp🎆 🤗Project: #intelai #itrex #inc #gguf @intel @huggingface
Tweet media one
1
35
164
@HaihaoShen
Haihao Shen
2 months
🚀Thrilled to announce that NeuralSpeed v1.0 alpha is released! Highly optimized INT4 kernels and blazing fast LLM inference on CPUs! 🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ @casper_hansen_ and AutoGPTQ 📔 Blog: 🔥
6
32
160
@HaihaoShen
Haihao Shen
5 months
🎯Excited to share another NeurIPS'23 paper titled "Effective Quantization for Diffusion Models on CPUs"! Congrats to all the collaborators! 🚀Code: 📜Paper: #iamintel #intelai @intel @huggingface @_akhaliq
1
35
152
@HaihaoShen
Haihao Shen
5 months
🎁Thrilled to share Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for model quantization/compression for LLMs, helping to bring AI everywhere. 👨‍💻Release notes: 🚀Code: #iamintel #intelai #oneapi
1
30
150
@HaihaoShen
Haihao Shen
3 months
🎯 #1 INT4 LLM algorithm: AutoRound, invented by @intel , showing SOTA accuracy on Mixtral-8x7B, Phi2, NeuralChat ... 🚀 #1 INT4 LLM inference: Intel Extension for Transformers, running efficiently on Intel devices 🌟 🤗
6
26
147
@HaihaoShen
Haihao Shen
4 months
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day of 2023! 📔Release notes: 🎯Code: 🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
1
23
140
@HaihaoShen
Haihao Shen
3 months
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME). 📔Paper: 🎯Sample code: 💣The technology behind it looks interesting and useful, and it should work together with SFT and RAG to reduce hallucination!
3
26
127
@HaihaoShen
Haihao Shen
3 months
🎯Quantization + speculative decoding shows significant speedups, up to 7.3x on Xeon, using Intel AI SWs: 📢IPEX: ITREX: 🤗Blog: Congrats to the @IntelAI and @huggingface teams! @MosheWasserblat @humaabidi
2
26
128
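The speedup comes from speculative decoding: a cheap draft model proposes several tokens, and the large target model verifies them, keeping the longest agreeing prefix plus its own correction. A greedy toy version (both "models" are stand-in next-token functions, and a real implementation batches the target's verification into one forward pass):

```python
def speculative_step(draft_next, target_next, seq, k=4):
    """One greedy speculative-decoding step: draft k tokens, then let
    the target verify them, stopping at the first disagreement."""
    proposal, s = [], list(seq)
    for _ in range(k):                    # cheap draft proposes k tokens
        t = draft_next(s)
        proposal.append(t)
        s.append(t)
    accepted, s = [], list(seq)
    for t in proposal:                    # target verifies the proposal
        expect = target_next(s)
        if t == expect:
            accepted.append(t)
            s.append(t)
        else:
            accepted.append(expect)       # target's correction ends the step
            break
    else:
        accepted.append(target_next(s))   # bonus token when all drafts accepted
    return accepted

# Toy models: target emits last+1; draft agrees except on multiples of 3.
target = lambda s: s[-1] + 1
draft = lambda s: s[-1] + 1 if (s[-1] + 1) % 3 else s[-1] + 2
print(speculative_step(draft, target, [0]))  # [1, 2, 3]
```

Each step emits at least one target-verified token and up to k+1, so output quality matches the target model while most tokens cost only a draft-model call plus a shared verification pass.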
@HaihaoShen
Haihao Shen
2 months
🤗NeuralChat beats GPT-4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by @vectara . 📢RL/DPO is becoming so important for improving model quality, particularly for responsible AI. 🎯Code to fine-tune NeuralChat:
Tweet media one
5
23
124
@HaihaoShen
Haihao Shen
4 months
🤗Neural Speed now supports GGUF (used in llama.cpp)! 📢Neural Speed is an innovation library and a sibling project of Intel Neural Compressor. 🎯Neural Compressor🔚Algorithm + Accuracy 🚀Neural Speed 🔚 Kernel + Performance 🌟
3
21
119
@HaihaoShen
Haihao Shen
7 months
🔥Want Intel-enhanced llama.cpp? Yes, up to 15x on first-token gen and 1.5x on subsequent token gen on Intel's latest Xeon Scalable Processor (SPR) 📕Blog: Code: #oneapi @intel @huggingface @_akhaliq @llama @llama_index
3
29
115
@HaihaoShen
Haihao Shen
2 months
🔥All you need is Intel Neural Compressor (INC) for INT4 LLMs. INC v2.5 is released with SOTA INT4 LLM quantization (AutoRound) across platforms incl. Intel Gaudi2, Xeon, and GPU. 🎯Models: Llama2, Mistral, Mixtral-MOE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
2
17
116
@HaihaoShen
Haihao Shen
3 months
🎯The embedding model is super important for a RAG system. Here is a tutorial showing how to tune BAAI/bge-base for high performance. 📔 💣We extended LangChain to load the optimized embedding model and improved inference on Intel platforms.
1
17
114
@HaihaoShen
Haihao Shen
3 months
㊗️Our paper on "FP8 recipes" has been accepted by MLSys'24. Congrats to all the collaborators @navikm Xin, Qun, Chang, and Mengni! 🤗Paper: 🎯Code:
Tweet media one
4
17
111
@HaihaoShen
Haihao Shen
5 months
📢More Intel NeuralChat-v3 7B LLMs are released, and more technical details are published in the blog👇 🎯Blog: 🙌Welcome to use @intel NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SWs. #iamintel #intelai @huggingface
7
16
108
@HaihaoShen
Haihao Shen
3 months
🎯High-performance INT4 Mistral-7B model available on @huggingface , quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and served efficiently by Intel Extension for Transformers! 🤗 Model: 🌟,
6
25
105
@HaihaoShen
Haihao Shen
23 days
🎯Meta launched Llama3. See how well it works across Gaudi, Xeon, GPU, and AIPC! Check out the blog: 🔥Happy to share that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
3
28
106
@HaihaoShen
Haihao Shen
1 month
🔥MLPerf Inference v4.0 results are out! 1⃣The only CPU able to achieve 99.9% accuracy 2⃣1.8x perf speedup over the last submission 3⃣Summarizes a news article per second in real time 📘Blog: 🎯Code for MLPerf GPT-J: #MLPerf #IAmIntel
0
18
100
@HaihaoShen
Haihao Shen
1 year
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel. 🧨5,000+ Transformer models quantized automatically 🔥Neural Coder demo on Hugging Face Spaces: . ⭐️Check it out for a try! @ellacharlaix @jeffboudier @_akhaliq
1
25
101
@HaihaoShen
Haihao Shen
6 months
❓Fine-tuning or RAG? Don't know which to choose? 🎯Fine-tuning is not the only way to make your LLM smarter! You can also try RAG. Here are the recommendations and examples: 📢Reproducible through Intel Extension for Transformers: 🚀
4
15
101
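A RAG pipeline boils down to three steps: embed the query, retrieve the most similar documents, and prepend them to the prompt, so the model answers from fresh context instead of fine-tuned weights. A toy retrieval step with hand-made embeddings (`retrieve` is a hypothetical helper and no real embedding model is used):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=1):
    """Rank documents by cosine similarity to the query embedding."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]

docs = ["Intel Neural Compressor quantizes models",
        "NeuralChat is a 7B chat model",
        "Gaudi2 is an AI accelerator"]
# Hand-made 3-d embeddings standing in for a real embedding model.
doc_vecs = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]])
query_vec = np.array([0.9, 0.2, 0.1])  # "how do I quantize a model?"

top = retrieve(query_vec, doc_vecs)[0]
prompt = f"Context: {docs[top]}\nQuestion: how do I quantize a model?"
print(top)  # 0 (the quantization doc is retrieved)
```

Fine-tuning bakes knowledge into the weights; retrieval keeps it in an updatable index, which is why the two are complementary rather than competing.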
@HaihaoShen
Haihao Shen
1 year
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
3
24
96
@HaihaoShen
Haihao Shen
5 months
📢Slimmed BGE embedding models are coming, shortly after the quantized ones. More importantly, slimming and quantization can be combined! 🎁 Private RAG-based chatbots on client devices are more accessible! 👨‍💻 🎯 #intelai #NeuralChat
0
18
96
@HaihaoShen
Haihao Shen
8 months
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators! 🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
Tweet media one
1
23
94
@HaihaoShen
Haihao Shen
1 year
🎯Happy to announce the source code and examples of "Fast DistilBERT on CPU" (accepted as a NeurIPS'22 paper) have been released: 🧨Included in Top NLP Papers Nov'22 by @CohereAI and highlighted as "Fast Transformers on CPUs with SOTA performance" by @Synced_Global !
0
9
94
@HaihaoShen
Haihao Shen
1 month
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! Lots of great partners👍 🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, and LangChain, and the technology from Intel Extension for Transformers.
1
19
92
@HaihaoShen
Haihao Shen
4 months
👨‍💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video. 🎯Features: 1) runs on your PC for copilot chat, so it's 100% free and safe; 2) runs on a server for code generation, so it may generate better code; 3) smart model switching. VS plugin is coming🚀 #intelai @intel
2
16
86
@HaihaoShen
Haihao Shen
3 months
🩷A memorable day: Intel Neural Compressor and Intel Extension for Transformers crossed! A baby Neural Speed is on board!!🌟
Tweet media one
0
6
80
@HaihaoShen
Haihao Shen
7 months
🔥Happy to publish the code of SignRound (a leading INT4 quantization method): 📕Paper: 👉Code: 📢Leave a star if you find it useful.
Tweet media one
0
22
78
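SignRound's core observation is that round-to-nearest is not always the best rounding for a layer's output: choosing round-up vs round-down per weight against calibration activations can reduce the output error. A brute-force toy version of that search (the real method learns the choices with signed gradients rather than enumerating them; `best_rounding` is a hypothetical helper):

```python
import itertools
import numpy as np

def best_rounding(x, w, scale):
    """Enumerate round-up/round-down per weight and keep the choice that
    minimizes the quantized layer's output error on calibration data x."""
    base = np.floor(w / scale)
    best, best_err = None, np.inf
    for bits in itertools.product([0.0, 1.0], repeat=w.size):
        q = (base + np.array(bits).reshape(w.shape)) * scale
        err = np.linalg.norm(x @ q - x @ w)
        if err < best_err:
            best, best_err = q, err
    return best, best_err

rng = np.random.default_rng(3)
x = rng.standard_normal((16, 4))   # calibration activations
w = rng.standard_normal((4, 2))    # tiny weight matrix
scale = 0.25

q_rtn = np.round(w / scale) * scale
rtn_err = np.linalg.norm(x @ q_rtn - x @ w)
_, tuned_err = best_rounding(x, w, scale)
print(tuned_err <= rtn_err)  # True: RTN is in the search space, so tuned never loses
```

Enumeration is exponential in the number of weights, which is exactly why SignRound optimizes the rounding decisions with gradients instead.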
@HaihaoShen
Haihao Shen
5 months
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see that @intelai -created LLMs & datasets are welcomed and being used/discussed/improved. Go, Intel LLMs!
@IntelAI
Intel AI
5 months
Congrats to Intel team members Haihao Shen and Kaokao Lv for their fine-tuned version of Mistral 7B having hit the top of the list on the @huggingface LLM leaderboard last week: Fine-tuned on 8x Intel Gaudi2 Accelerators.
Tweet media one
2
12
120
3
14
73
@HaihaoShen
Haihao Shen
1 month
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice: 🎯Sharing our MLSys'24 camera-ready paper: Efficient Post-Training Quantization with FP8 Formats 🤗 @_akhaliq @navikm @huggingface #IAmIntel
Tweet media one
0
18
76
@HaihaoShen
Haihao Shen
2 months
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor; the HW architecture and data types were proposed by MSFT and defined by OCP 📢Here is a tutorial: with source code publicly available in
0
11
71
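The MX (microscaling) formats share one power-of-two scale across each small block of elements, with the elements themselves stored in a narrow type. A toy sketch of that block structure using INT8 elements (real MXFP4/MXFP6 store tiny floats, and `mx_int8_quantize` is a hypothetical helper, not the OCP-defined encoding):

```python
import numpy as np

def mx_int8_quantize(x, block=32):
    """MX-style block quantization: each block of 32 values shares one
    power-of-two scale (like E8M0); elements are stored in a narrow type."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Round the per-block scale up to the next power of two.
    scale = 2.0 ** np.ceil(np.log2(amax / 127.0 + 1e-30))
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
x = rng.standard_normal(128).astype(np.float32)
q, scale = mx_int8_quantize(x)
recon = (q * scale).reshape(-1)
print(np.abs(recon - x).max())  # small: the shared scale tracks each block's range
```

Power-of-two scales make the rescaling a cheap exponent shift in hardware, which is the point of the E8M0 shared-scale design.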
@HaihaoShen
Haihao Shen
1 month
🌟Happy to announce Intel Extension for Transformers v1.4 released with a lot of improvements in building GenAI applications on Intel platforms! 🎯Check out the release notes: 🤗 @intel + @huggingface = one of the best GenAI platforms
0
10
74
@HaihaoShen
Haihao Shen
5 months
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers! @upstageai @hunkims . INT4 inference is available with one parameter change from "load_in_8bit" to "load_in_4bit". 📢Next one will be Zephyr🙌 👇Check out the sample code and give a try!
Tweet media one
0
13
72
@HaihaoShen
Haihao Shen
3 months
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details: 🤗Code: . Star the project if you find it useful. 🌟Happy Chinese New Year! 🎇
0
14
68
@HaihaoShen
Haihao Shen
2 months
👨‍💻2023 was the year of open LLMs. Is it time to make predictions for 2024? DM your thoughts. 📢Re-sharing the blog from @clefourrier : , incl. Intel NeuralChat-7B and the DPO dataset😀 🤗We hope to contribute more to the open-source LLM community in 2024! #iamintel @huggingface
4
12
66
@HaihaoShen
Haihao Shen
2 months
🔥Happy to announce Intel Extension for Transformers v1.3.2 is released 📔Release notes: 🎯Highlights: enabling popular serving frameworks, e.g., @huggingface TGI, vLLM, and Triton, to build highly efficient chatbots on Intel platforms such as Gaudi2 with a few lines of code
0
9
65
@HaihaoShen
Haihao Shen
4 months
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!! 🎯The copilot is built on top of . The code/ext will be released soon. Stay tuned!🚀
0
8
64
@HaihaoShen
Haihao Shen
2 months
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and further accelerate LLM inference🚀
@StabilityAI
Stability AI
2 months
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of @Intel Gaudi 2 accelerators versus @Nvidia 's A100 and H100 for two of our models. (1/3)
Tweet media one
17
62
306
1
12
60
@HaihaoShen
Haihao Shen
10 months
🎯SmoothQuant is now available in ONNX Runtime through Intel Neural Compressor: 👉Start with the example in and quantize your favorite LLM! 👍Thanks to Mengni, Tianlei, Yihong, Yufeng, and the team!
0
13
53
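SmoothQuant's idea is to migrate activation outliers into the weights with per-channel scales: the matrix product is unchanged, but the rescaled activations have a much flatter range and quantize cleanly. A minimal sketch of the identity it relies on (toy data, not the ONNX Runtime integration):

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """Per-channel smoothing: X @ W == (X / s) @ (diag(s) @ W).
    alpha balances how much of the outlier magnitude moves into W."""
    s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=1) ** (1 - alpha))
    return x / s, w * s[:, None]

rng = np.random.default_rng(1)
# Channel 2 carries large outliers, the classic LLM activation pattern.
x = rng.standard_normal((4, 8)) * np.array([1, 1, 50, 1, 1, 1, 1, 1])
w = rng.standard_normal((8, 3))

xs, ws = smooth(x, w)
print(np.allclose(x @ w, xs @ ws))  # True: the product is mathematically unchanged
```

After smoothing, both `xs` and `ws` are quantized (e.g., to INT8); because the transform is exact in FP, all remaining error comes from quantization of the now better-conditioned tensors.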
@HaihaoShen
Haihao Shen
7 months
📢We are hiring full-time interns for efficient LLM inference. 🔥Group: Intel/DCAI/AISE 🎯Location: Shanghai, Zizhu 😀Working projects: * INC: * ITREX: If you are interested in LLM compression and inference, DM me with your resume.😀
3
11
52
@HaihaoShen
Haihao Shen
6 months
Thanks @martin_casado ! @Intel has been making AI more accessible through a rich SW portfolio and diverse Intel HWs! We also released high-perf LLMs and high-quality datasets for LLM training! People can easily create their own chatbot through Intel Extension for Transformers!
@martin_casado
martin_casado
6 months
Amazing to see Intel getting into the open source AI game. Well done!
4
16
139
1
8
52
@HaihaoShen
Haihao Shen
7 months
📢Excited to share our new paper () on LLM INT4 quantization with comparable or better results than GPTQ! 🎯Code is available in #oneapi @huggingface @_akhaliq @MosheWasserblat @ellacharlaix
Tweet media one
1
14
52
@HaihaoShen
Haihao Shen
24 days
🔥MLPerf Inference: Intel Extension for Transformers showed a 1.8x performance speedup on GPT-J using INT4 inference on 5th Gen Xeon (vs. 4th Gen). Congrats to the team: Yi, Zhentao, Hengyu, Yu, and Kevin! Blazing fast on CPUs, even clients!! 🎯Blog: #IAmIntel
1
11
50
@HaihaoShen
Haihao Shen
4 months
Happy New Year! It was my honor to be invited as the first guest speaker of 2024 by @CohereForAI . Enjoyed sharing the work that the teams have been doing to make LLMs more efficient on Intel platforms. Thanks to the outstanding event organizer @AhmadMustafaAn1 ! #iamintel @intelai
@CohereForAI
Cohere For AI
4 months
Happy New Year! Our first guest speaker of 2024 is tomorrow, Wednesday, January 3rd as our Geo Regional Asia Group welcomes @HaihaoShen , Senior AI architect in DCAI/AISE at Intel Corporation to present "Efficient LLM Inference on CPUs" Learn more:
0
5
10
4
5
43