Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own
🧩No GPU but wanna create your own LLM on your laptop?
🎁Here is a gift for you: QLoRA on CPU, making LLM fine-tuning on a client CPU possible! Just give it a try.
📔Blog: Kudos to ITREX team!
🎯Code:
#IAmIntel
#intelai
@intel
@huggingface
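The core QLoRA idea behind this release, a frozen low-bit base weight plus a small trainable low-rank adapter, can be sketched in plain Python. This is a toy illustration only, not the ITREX API; real QLoRA uses NF4 quantization and trains the adapter with backprop:

```python
# Toy QLoRA forward pass: frozen 4-bit-quantized base weight + low-rank adapter.
# Illustrative only; real QLoRA uses NF4 quantization and trains A/B by backprop.

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization of a list of floats."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    scale = max(abs(x) for x in w) / qmax or 1.0
    q = [round(x / scale) for x in w]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Frozen base weight (1x4 for simplicity) and a rank-1 adapter B (1x1) @ A (1x4).
w = [0.12, -0.53, 0.98, -0.07]
q, s = quantize_rtn(w)
w_hat = dequantize(q, s)

b = [0.5]                                # trainable "down" projection
a = [0.02, -0.01, 0.0, 0.03]             # trainable "up" projection
adapter = [b[0] * x for x in a]          # rank-1 update B @ A

# Effective weight seen by the model: dequantized base + low-rank update.
w_eff = [wq + d for wq, d in zip(w_hat, adapter)]

err = max(abs(x - y) for x, y in zip(w, w_hat))
print(f"max 4-bit reconstruction error: {err:.3f}")
print("effective weight:", [round(x, 3) for x in w_eff])
```

Only the tiny A/B matrices get gradients, which is why this fits in laptop memory.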
📢Just change the model name, and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization!
🎯Code: , supporting Mistral, Llama2, Mixtral-MOE, Phi2, Solar, and other recent LLMs.
🤗
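The low-bit quantization mentioned here is, at its core, weight-only group-wise quantization: each small group of weights gets its own scale, keeping error local. A minimal round-to-nearest sketch (toy code, not the actual Intel Extension for Transformers kernels):

```python
# Toy group-wise INT4 weight-only quantization (round-to-nearest).
# A simplified view of what low-bit LLM runtimes do; not the ITREX implementation.

def quantize_groupwise(weights, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        quantized.extend(round(w / scale) for w in group)
    return quantized, scales

def dequantize_groupwise(quantized, scales, group_size=4):
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]

weights = [0.1, -0.2, 0.05, 0.15, 2.0, -1.5, 0.8, 1.1]  # two groups, very different ranges
q, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(q, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("scales:", [round(s, 3) for s in scales])
print(f"max error: {max_err:.4f}")
```

Per-group scales are what keep the small first group from being drowned out by the large second one.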
🎯We released GPT-J-6B INT8 ONNX models (first INT8 ONNX LLM❓) with ~4x model size reduction while preserving ~99.9% of the FP32 baseline accuracy.
🔥GPT-J-6B INT8 models are now publicly available at Hugging Face model hub!
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x speedup on Intel i7-12900 over llama.cpp!
🎯Code:
📢Chatbot demo on PC: ; Hugging Face space demo locally:
#oneapi
@intel
@huggingface
@_akhaliq
@Gradio
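The INT8 scheme used for models like these maps each float tensor to 8-bit integers with a scale and zero-point, which is where the ~4x size reduction comes from (FP32 is 4 bytes per value, INT8 is 1). A minimal sketch under simplified assumptions, not the Neural Compressor code:

```python
# Toy asymmetric INT8 quantization with a zero-point.
# Sketch only; real INT8 ONNX export also handles per-channel scales, calibration, etc.

def quantize_int8(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)            # maps lo -> 0, hi -> 255
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return [(x - zero_point) * scale for x in q]

acts = [-1.2, -0.4, 0.0, 0.7, 2.3]
q, scale, zp = quantize_int8(acts)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(acts, restored))
print(f"scale={scale:.4f} zero_point={zp} max_err={max_err:.4f}")
```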
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms! Start from CPUs🚀
🙌Don't hesitate to give it a try. Sample code below👇
🎯Project:
#iamintel
#intelai
@intel
@huggingface
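For context, Mixtral-8x7B is a sparse mixture-of-experts model: each token is routed to only 2 of its 8 experts, which keeps per-token compute near a much smaller dense model and helps make CPU inference feasible. A toy sketch of top-2 routing (random gate scores stand in for a learned router):

```python
# Toy top-2 mixture-of-experts routing in the style of Mixtral-8x7B.
# Illustrative sketch only: softmax gate over random expert scores.

import math, random

random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2

def route(token_scores, top_k=TOP_K):
    """Pick the top-k experts and renormalize their gate weights with softmax."""
    top = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])[:top_k]
    exp = [math.exp(token_scores[i]) for i in top]
    total = sum(exp)
    return [(i, e / total) for i, e in zip(top, exp)]

scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selected = route(scores)
print("selected (expert, weight):", [(i, round(w, 3)) for i, w in selected])
```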
⚡️AutoRound, a new SOTA LLM low-bit quantization approach developed by the Intel Neural Compressor team ()
🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details:
@huggingface
#IAmIntel
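AutoRound's key departure from plain round-to-nearest is that it learns the rounding decisions so as to minimize the layer's output error. A toy sketch of that idea; the real method uses SignSGD over calibration data rather than this naive coordinate search:

```python
# Sketch of AutoRound's core idea: learn per-weight rounding offsets that
# minimize layer output error, instead of always rounding to nearest.
# Toy coordinate-wise sign search; the real method uses SignSGD on calibration data.

def fake_quant(w, scale, offsets):
    # an offset in [-0.5, 0.5] nudges each weight's rounding decision
    return [round(wi / scale + oi) * scale for wi, oi in zip(w, offsets)]

def output_err(w, x, scale, offsets):
    wq = fake_quant(w, scale, offsets)
    return abs(sum(a * b for a, b in zip(w, x)) - sum(a * b for a, b in zip(wq, x)))

w = [0.34, -0.71, 0.52, -0.18]
x = [1.0, 0.5, -0.8, 2.0]            # one "calibration" activation
scale = max(abs(v) for v in w) / 7   # symmetric 4-bit

offsets = [0.0] * len(w)
lr = 0.1
err = output_err(w, x, scale, offsets)
for _ in range(50):                  # sign-style coordinate descent
    for i in range(len(w)):
        for step in (lr, -lr):
            trial = offsets[:]
            trial[i] = max(-0.5, min(0.5, trial[i] + step))
            e = output_err(w, x, scale, trial)
            if e < err:
                offsets, err = trial, e

rtn_err = output_err(w, x, scale, [0.0] * len(w))
print(f"RTN output error:     {rtn_err:.4f}")
print(f"learned-offset error: {err:.4f}")
```

The learned offsets can only match or beat plain RTN on the calibration input, which is the intuition behind the accuracy gains over GPTQ/AWQ-style methods.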
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbots, co-pilot assistants, ...)
📷Location: Shanghai (or working remotely in PRC)
🎯Project:
If you are interested, DM me with your resume.😀
🔥Want to quantize a 100B+ model on your laptop with 16GB memory? Hmmm, GPTQ does not work...
🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓
📕Blog:
#oneapi
@intel
@huggingface
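Layer-wise quantization keeps only one layer's FP32 weights in memory at a time, which is what lets a 16GB laptop handle models far larger than its RAM. A toy sketch (the loading function is a stand-in, not the Neural Compressor API):

```python
# Sketch of layer-wise quantization: only one layer's FP32 weights live in
# memory at a time; quantize it, keep the compact result, free the FP32 copy.
# Toy code, not the Neural Compressor implementation.

def load_layer(name):
    """Stand-in for reading one layer's FP32 weights from disk."""
    return [0.01 * i for i in range(1000)]   # pretend tensor

def quantize_layer(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax or 1.0
    return [round(v / scale) for v in w], scale

layer_names = [f"model.layers.{i}" for i in range(4)]
quantized_model = {}
peak_fp32_layers = 0

for name in layer_names:
    fp32 = load_layer(name)                  # load ONE layer
    peak_fp32_layers = max(peak_fp32_layers, 1)
    quantized_model[name] = quantize_layer(fp32)
    del fp32                                 # free before the next layer

print(f"layers quantized: {len(quantized_model)}, "
      f"peak FP32 layers in memory: {peak_fp32_layers}")
```

Peak memory is bounded by the largest single layer rather than the whole model, so model size stops being the limit.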
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, collaborators! Love you all!!
🔥We released QLoRA for CPU, to help you fine-tune LLMs on your laptop! See below👇
📢Code:
#deeplearning
#intelai
#GenAI
@intel
@huggingface
🚀Share with you a nice blog "llama.cpp + Intel GPUs". Congrats to the awesome team especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to
@ggerganov
for your great support.
📢Check out the blog:
🎯WIP with ollama now
#iamintel
#llama
@ollama
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs!
🔥Code:
📕Doc:
#oneapi
@intel
@huggingface
@Guangxuan_Xiao
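StreamingLLM's cache policy, keeping a few initial "attention sink" tokens plus a sliding window of recent tokens, is what bounds memory for arbitrarily long generation. A toy sketch (the real implementation also handles the re-evaluation and RoPE key shifting mentioned above):

```python
# Sketch of StreamingLLM's KV-cache policy: always keep the first few
# "attention sink" tokens plus a sliding window of recent tokens, so the
# cache stays bounded no matter how long generation runs. Toy version.

N_SINK, WINDOW = 4, 8

def evict(cache):
    """Keep sink tokens + the most recent WINDOW tokens."""
    if len(cache) <= N_SINK + WINDOW:
        return cache
    return cache[:N_SINK] + cache[-WINDOW:]

cache = []
for token_id in range(100):      # "infinite" generation
    cache.append(token_id)
    cache = evict(cache)

print("cache size:", len(cache))
print("kept:", cache)
```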
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc 770!
PR:
Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks
@ggerganov
for the review! Transformer-like API soon in
@RajaXg
🤗Intel Extension for Transformers enables running microsoft/phi-2 smoothly on a laptop (faster than human reading speed🚀). Sample code👇
🎯Code: . Try it and have fun!
🎁DM me your favorite LLM. Next will be Solar :)
#iamintel
#intelai
@intel
@huggingface
@murilocurti
📢Just created an open-source project dedicated to speeding up LLMs
🌟Project:
🤗Looking forward to your suggestions; let me know which topics interest you and what you'd like to see.
#LLM
@intel
@huggingface
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB ~ 10TB (though not open-sourced yet), even on consumer HW!
📕Instruction:
🌟Project:
#oneapi
@intel
@huggingface
#LLM
🚀Thrilled to announce that NeuralSpeed v1.0 alpha is released! Highly optimized INT4 kernels and blazing fast LLM inference on CPUs!
🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ
@casper_hansen_
and AutoGPTQ
📔 Blog:
🔥
🚀Highly-efficient x86 INT4 kernels are now available in ONNX Runtime. Use Intel Neural Compressor to quantize LLMs and run efficiently with ONNX Runtime on Intel CPUs!
📔PR:
🎯Source of INT4 kernels:
#intelai
@intelai
@huggingface
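INT4 kernels typically store two 4-bit values per byte and unpack (and sign-extend) them on the fly. A toy sketch of that packing; the actual ONNX Runtime kernel layout with its blocking and scales is more involved:

```python
# Sketch of INT4 weight storage: two signed 4-bit values packed per byte.
# Illustrative only, not the actual ONNX Runtime kernel layout.

def pack_int4(values):
    """Pack signed 4-bit ints (-8..7), two per byte."""
    packed = []
    for i in range(0, len(values), 2):
        lo = values[i] & 0x0F
        hi = (values[i + 1] & 0x0F) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    out = []
    for b in packed:
        for nibble in (b & 0x0F, b >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return out[:count]

q = [3, -2, 7, -8, 0, 5]
blob = pack_int4(q)
print(f"{len(q)} int4 values -> {len(blob)} bytes")
assert unpack_int4(blob, len(q)) == q
```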
🎁Thrilled to share Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for model quantization/compression for LLMs, helping to bring AI everywhere.
👨💻Release notes:
🚀Code:
#iamintel
#intelai
#oneapi
🚀Embedding is super fast on SPR! Just ~500 seconds for 1M samples (512 seq len/sample) using the Intel-optimized BGE model with INC and ITREX, making RAG more accessible!
📷Quick guide:
🎯
#iamintel
#intelai
@intelai
@huggingface
🔥Excited to share new BGE-base-v1.5 INT8 models with <1% accuracy loss from the FP32 baseline on the STS dataset (previously SST2)! BGE for RAG!!
🤗Model-1:
🤗Model-2:
🚀Code:
#oneapi
@IntelSoftware
@huggingface
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day in 2023!
📔Release notes:
🎯Code:
🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME).
📔Paper:
🎯Sample code:
💣The technology behind it looks interesting and useful, and it is supposed to work with SFT and RAG to reduce hallucination!
📢Happy to share Intel Extension for Transformers v1.0 released:
🎯 NeuralChat, a custom chatbot on domain knowledge through Hugging Face PEFT. Now you can create your own chatbot within 1 hour on CPUs.
@humaabidi
@MosheWasserblat
@jeffboudier
🎯When DeepSpeed meets Intel AI SWs, the performance magic happens!
🚀Accelerate Llama 2 inference on Xeon SPR by up to ~1.7x!
📔Blog:
🎁Intel AI SWs:
IPEX:
INC:
and
#oneapi
@intelai
@AIatMeta
@MSFTDeepSpeed
🤗NeuralChat beats GPT-4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by
@vectara
.
📢RL/DPO is becoming so important for improving model quality, particularly for responsible AI.
🎯Code to fine-tune NeuralChat:
📢Happy to share INT4 inference on
@intel
GPUs (e.g., PVC & Arc) is available in Intel Ext. for Transformers as an experimental feature (powered by IPEX)! More are coming!!
🎯Release notes:
🚀Code:
#intelai
#intelgpu
@huggingface
📢NeuralChat, an open chat framework created by
@intel
, now supports the
@huggingface
assisted generation to make chatbots more efficient on Intel platforms!
🎯Guide to deploy a chatbot:
🚀Code:
#iamintel
#intelai
Go, ITREX!
🔥All you need is Intel Neural Compressor (INC) for INT4 LLMs. INC v2.5 released with SOTA INT4 LLM quantization (AutoRound) across platforms, incl. Intel Gaudi2, Xeon, and GPU.
🎯Models: Llama2, Mistral, Mixtral-MOE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
🎯The embedding model is super important for a RAG system. Here is a tutorial showing how to tune BAAI/bge-base for high performance.
📔
💣 Extended LangChain to load the optimized embedding model and improved inference on Intel platforms.
📢More Intel NeuralChat-v3 7B LLMs are released, and more technical details are published in the blog👇
🎯Blog:
🙌Welcome to use
@intel
NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SWs.
#iamintel
#intelai
@huggingface
🎯High-performance INT4 Mistral-7B model available on
@huggingface
, quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and run efficiently by Intel Extension for Transformers!
🤗 Model:
🌟,
🎯Meta launched Llama3. See how it works well across Gaudi, Xeon, GPU, and AIPC! Check out the blog:
🔥Happy to share that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
🔥MLPerf Inference v4.0 results are out!
1⃣The only CPU able to achieve 99.9% accuracy
2⃣1.8x perf speedup over the last submission
3⃣Summarizes a news article per second in real time
📘Blog:
🎯Code for MLPerf GPT-J:
#MLPerf
#IAmIntel
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel.
🧨5,000+ Transformer models quantized automatically
🔥Neural Coder demo on Hugging Face Spaces: .
⭐️Check it out and give it a try!
@ellacharlaix
@jeffboudier
@_akhaliq
❓Fine-tuning or RAG? Not sure how to choose?
🎯Fine-tuning is not the only way to make your LLM smarter! You can also try RAG. Here are the recommendations and examples:
📢Reproducible through Intel Extension for Transformers: 🚀
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
📢Slimmed BGE embedding models are coming, shortly after the quantized ones. More importantly, slimming and quantization can be combined!
🎁 Private RAG-based chatbots on client devices are more accessible!
👨💻
🎯
#intelai
#NeuralChat
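A private RAG chatbot boils down to: embed your documents once, retrieve the closest match for each query, and prepend it to the prompt. A toy sketch where bag-of-words counts stand in for a real embedding model like BGE:

```python
# Minimal sketch of the RAG loop behind a private chatbot: embed documents,
# retrieve the closest one for a query, prepend it to the prompt.
# Toy bag-of-words "embeddings" stand in for a real model like BGE.

import math
from collections import Counter

def embed(text):
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Intel Neural Compressor supports INT4 quantization",
    "The cafeteria opens at nine in the morning",
]
index = [(d, embed(d)) for d in docs]

query = "which tool supports INT4 quantization?"
qv = embed(query)
best_doc, _ = max(index, key=lambda item: cosine(qv, item[1]))

prompt = f"Context: {best_doc}\nQuestion: {query}"
print(prompt)
```

Everything runs locally, which is the "private" part; the retrieved context is what grounds the chatbot's answer.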
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators!
🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
🎯Happy to announce the source code and examples of "Fast DistilBERT on CPU" (a NeurIPS'22 paper) were released:
🧨Included in Top NLP Papers Nov'22 by
@CohereAI
and highlighted as "Fast Transformers on CPUs with SOTA performance" by
@Synced_Global
!
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! A lot of great partners👍
🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, and LangChain, and by technology from Intel Extension for Transformers.
👨💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video.
🎯Features: 1) runs on your PC for copilot chat, so it's 100% free and safe; 2) runs on a server for code generation, so it may generate better code; 3) smart model switching. VS plugin is coming🚀
#intelai
@intel
💣Happy to announce INT4 NeuralChat-7B models available on
@huggingface
, powered by the SOTA INT4 algorithm developed by Intel, yet compatible with AutoGPTQ!
🤗
🤗
📔Paper:
🎯Sample code:
📢INT4 GPTQ and RTN landed in ONNX Runtime through Intel Neural Compressor. AI on PC is coming!
📔PR: Thanks to Yuwen, Mengni, and Yufeng!
🌟Code:
#intelai
#onnxruntime
#neuralcompressor
🎁Happy to announce Intel Extension for Transformers supports INT8 quantization for MSFT Phi, making Phi inference more efficient and accessible than ever!
📔Quick guide:
🎯Code available:
#iamintel
#intelai
@intel
@huggingface
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see
@intelai
created LLMs & datasets being welcomed, used, discussed, and improved. Go, Intel LLMs!
Congrats to Intel team members Haihao Shen and Kaokao Lv, whose fine-tuned version of Mistral 7B hit the top of the list on the
@huggingface
LLM leaderboard last week:
Fine-tuned on 8x Intel Gaudi2 Accelerators.
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice:
🎯Sharing our MLSys'24 camera-ready paper: "Efficient Post-Training Quantization with FP8 Formats"
🤗
@_akhaliq
@navikm
@huggingface
#IAmIntel
🎯Want to enable audio in your chatbot? It takes just a few minutes.
📕Here is a guide for you, including ASR, TTS, audio processing, audio streaming, and multi-lang EN & CN support:
📢Optimized code: with 🤗 models
#iamintel
#intelai
@intel
@huggingface
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor, with the HW architecture and data types proposed by MSFT and defined by OCP
📢Here is a tutorial: with source code publicly available in
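MX data types give each small block of elements a shared power-of-two scale plus a narrow per-element format. A toy sketch with plain 4-bit integer elements; per the OCP MX spec, real MXFP4 uses an E2M1 element format and 32-element blocks:

```python
# Sketch of an MX-style block data type: an 8-element block shares one
# power-of-two scale; each element is stored in a narrow 4-bit format.
# Toy version; real MX formats (per the OCP spec) use E2M1-style elements.

import math

def mx_encode(block):
    amax = max(abs(v) for v in block) or 1.0
    # pick the smallest power-of-2 scale such that |q| <= 7
    shared_exp = math.ceil(math.log2(amax / 7))
    scale = 2.0 ** shared_exp
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return shared_exp, q

def mx_decode(shared_exp, q):
    return [v * 2.0 ** shared_exp for v in q]

values = [0.5, -1.25, 3.0, 0.0, 2.25, -0.75, 1.5, -2.0]
exp, q = mx_encode(values)
restored = mx_decode(exp, q)
err = max(abs(a - b) for a, b in zip(values, restored))
print(f"shared exponent: {exp}, max error: {err:.4f}")
```

Storing one tiny exponent per block instead of a full float scale is what makes the format cheap in hardware.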
🌟Happy to announce Intel Extension for Transformers v1.4 released, with many improvements for building GenAI applications on Intel platforms!
🎯Check out the release notes:
🤗
@intel
+
@huggingface
= one of the best GenAI platforms
📢Intel Extension for Transformers () supports INT4 and low-bit inference on both CPUs and GPUs!
📔Simple usage guide:
🔥All you need is to get an Intel GPU and run LLMs
@huggingface
🤗
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers!
@upstageai
@hunkims
. INT4 inference is available with a one-parameter change from "load_in_8bit" to "load_in_4bit".
📢Next one will be Zephyr🙌
👇Check out the sample code and give it a try!
📢When AI meets cybersecurity, see how the Intel NeuralChat LLM helps. Happy to share a nice blog, "Harnessing the Intel NeuralChat 7B Model for Advanced Fraud Detection". Congrats
@Saminusalisu
!
🎯Check out the details:
#intelai
#iamintel
@humaabidi
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details:
🤗Code: . Star the project if you find it useful.
🌟Happy Chinese New Year! 🎇
👨💻2023 was the year of open LLMs. Is it time to predict 2024? DM me your thoughts.
📢Re-sharing the blog from
@clefourrier
: , incl. Intel NeuralChat-7B and DPO dataset😀
🤗We hope to contribute more to open-source LLM community in 2024!
#iamintel
@huggingface
🔥Happy to announce Intel Extension for Transformers v1.3.2 released
📔Release notes:
🎯Highlights: enables popular serving frameworks, e.g.,
@huggingface
TGI, vLLM, and Triton, to build highly efficient chatbots on Intel platforms such as Gaudi2 with a few lines of code
🥳Happy to share with you the Intel optimizations for Diffusers textual inversion and the fine-tuning demo of Stable Diffusion on Spaces!
👉 Intel optimizations:
🎯Spaces:
🤗Thanks to Patrick,
@anton_lozhkov
@_akhaliq
from HF!
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!!
🎯The copilot is built on top of . The code/ext will be released soon. Stay tuned!🚀
🤗Want to build an enterprise-grade RAG system? Efficient embedding is what you want. Here is a nice blog from Intel and
@huggingface
friends on "Intel Fast Embedding" with and
#IAmIntel
@MosheWasserblat
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and accelerate LLM inference further🚀
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of
@Intel
Gaudi 2 accelerators versus
@Nvidia
's A100 and H100 for two of our models. (1/3)
🎯SmoothQuant is now available in ONNX Runtime through Intel Neural Compressor:
👉Start with the example in and quantize your favorite LLM!
👍Thanks to Mengni, Tianlei, Yihong, Yufeng, and the team!
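SmoothQuant's trick is to migrate activation outliers into the weights with per-channel scales, leaving the layer output mathematically unchanged since y = (X / s) @ (s * W). A toy sketch with made-up per-channel ranges, not the Neural Compressor implementation:

```python
# Sketch of SmoothQuant: per-channel scales move quantization difficulty
# from outlier activation channels into the weights. Toy numbers only.

alpha = 0.5
act_max = [8.0, 0.5, 4.0]        # per-channel |activation| maxima (one outlier)
wgt_max = [0.4, 0.6, 0.5]        # per-channel |weight| maxima

# s_j = act_max_j^alpha / wgt_max_j^(1 - alpha)
s = [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, wgt_max)]

smoothed_act = [a / sj for a, sj in zip(act_max, s)]   # X / s
smoothed_wgt = [w * sj for w, sj in zip(wgt_max, s)]   # s * W

print("scales:", [round(v, 3) for v in s])
print("activation range after smoothing:", [round(v, 3) for v in smoothed_act])
print("weight range after smoothing:", [round(v, 3) for v in smoothed_wgt])
```

With alpha = 0.5 the smoothed activation and weight ranges become equal per channel (both sqrt(a*w)), so neither side dominates the quantization error.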
📢We are hiring full-time interns for efficient LLM inference.
🔥Group: Intel/DCAI/AISE
🎯Location: Shanghai, Zizhu
😀Working projects:
* INC:
* ITREX:
If you are interested in LLM compression and inference, DM me with your resume.😀
Thanks
@martin_casado
!
@Intel
has been making AI more accessible through a rich SW portfolio and diverse Intel HW! We also released high-perf LLMs and high-quality datasets for LLM training! People can easily create their own chatbot through Intel Extension for Transformers!
🎯Intel optimizations meeting LangChain make RAG system more efficient!
⚡️Here is an optimized embedding using Intel Extension for Transformers ().
It has been integrated into
@LangChainAI
: More are coming. 3⃣2⃣1⃣
#iamintel
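Structurally, "extending LangChain" here means providing an object with the embed_documents()/embed_query() interface LangChain expects from an embeddings backend. A duck-typed toy sketch, with hash-based vectors standing in for the optimized model:

```python
# Sketch of a LangChain-style embeddings backend: any object exposing
# embed_documents() and embed_query() fits the interface. The hash-based
# vectors below are a toy stand-in for an optimized embedding model.

class OptimizedEmbeddings:
    """Duck-typed LangChain-style Embeddings class (toy backend)."""

    def __init__(self, dim=8):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            vec[hash(token) % self.dim] += 1.0
        return vec

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)

emb = OptimizedEmbeddings()
vectors = emb.embed_documents(["hello world", "intel extension for transformers"])
print(len(vectors), "docs ->", len(vectors[0]), "dims each")
```

Because only this small interface is required, the optimized model can be swapped in without touching the rest of the RAG pipeline.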
🔥MLPerf Inference: Intel Extension for Transformers showed 1.8x performance speedup on GPT-J using INT4 inference on 5th Gen Xeon (vs. 4th). Congrats to the team: Yi, Zhentao, Hengyu, Yu, and Kevin! Blazing fast on CPUs, even on clients!!
🎯Blog:
#IAmIntel
Happy New Year! It was my honor to be invited as the first guest speaker of 2024 by
@CohereForAI
. Enjoyed sharing the work that the teams have been doing to make LLMs more efficient on Intel platforms. Thanks to the outstanding event organizer
@AhmadMustafaAn1
!
#iamintel
@intelai
Happy New Year! Our first guest speaker of 2024 is tomorrow, Wednesday, January 3rd, as our Geo Regional Asia Group welcomes
@HaihaoShen
, Senior AI Architect in DCAI/AISE at Intel Corporation, to present "Efficient LLM Inference on CPUs"
Learn more: