2024-10-22

Agenda

  1. LLMs on Edge with AI Accelerators, @Chen Lai, Meta, 20 minutes

 

Recording

https://bytedance.us.larkoffice.com/minutes/obushm383935yto9t2f21255

Participants

@Wilson Wang @Tina Tsou @Ying Wang @Ashwanth Gnanavelu @Caleb J @HaruhisaFukano @Kris @李博睿 @Tom Qin @Lai Chen, Qi Wang, @Akram Sheriff

 

Summary

The meeting discussed the work progress and plans of the InfiniEdge AI TSC and its sub-teams. The main topics included:

  • TSC introduction: Tina Tsou introduced the TSC and InfiniEdge AI, and participants introduced themselves.

  • Edge computing: Chen Lai presented on deploying PyTorch models on edge devices with AI accelerators.

  • Workstream 3: Wilson Wang provided an update and overview of Workstream 3.

  • Workstream 5: Tom Qin discussed the potential integration of Chen Lai's presentation with Workstream 5.

  • Edge database: Qi Tang shared ideas on using a generic data manager for LLM services.

  • Next steps: Tina Tsou announced the next meeting and invited participants to a ByteDance open source event.

Chapters

00:08 Roundtable introductions in the InfiniEdge AI TSC meeting before starting the agenda

This section is a round-table introduction at the start of the meeting. Tina Tsou, the InfiniEdge AI TSC chair, opens it. Participants including Caleb, Chen Lai, Wilson Wang, Tom Qin, Akram Sheriff, He Jianxin, Li Borui, and Qi Tang introduce themselves briefly, covering their names, workplaces, and the areas or projects they work on. After the introductions, the meeting moves to the first agenda item, LLMs on Edge with AI Accelerators.

06:22 Deploying PyTorch Models on Edge via ExecuTorch

This section is about deploying PyTorch models on edge devices via ExecuTorch. The speaker, from the PyTorch Edge team at Meta, describes the challenges of edge deployment arising from device constraints. ExecuTorch is a unified open-source solution for running PyTorch models on edge devices, built in collaboration with various hardware companies; its basic export flow is sketched below. The timeline from the preview to this week's official beta release is shared, along with beta goals such as stability, performance, and coverage.
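As a reference for the export flow mentioned above, here is a minimal sketch following the ExecuTorch beta documentation; the module paths and APIs are assumptions based on that documentation and may differ in later releases.

```python
# Minimal ExecuTorch export sketch (API paths per the beta docs;
# verify against the current release).
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# 1. Capture the model graph with torch.export.
exported = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, then to an ExecuTorch program.
et_program = to_edge(exported).to_executorch()

# 3. Serialize to a .pte file that the on-device runtime loads.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```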

12:51 CPU Performance Optimization and On-device AI with Accelerators

This section focuses on CPU performance optimization, mainly on Arm CPUs, including 4-bit GEMM kernel development and collaboration with Google. It also identifies the NPU as key for on-device AI, supported through delegate APIs in ExecuTorch. For on-device LLMs on accelerators, the team works with partners on optimization. Quantization, memory optimization, and similar techniques are key to enabling on-edge LLMs. There are also updates on benchmarking and on experimental on-device training.
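To make the delegate-API idea concrete, here is a hedged sketch of lowering a model to the XNNPACK backend; NPU vendors plug into the same partitioner/to_backend path. The import path follows the ExecuTorch beta tutorials and is an assumption.

```python
# Sketch of backend delegation in ExecuTorch (XNNPACK shown here).
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
exported = torch.export.export(model, (torch.randn(1, 16),))

# Partition supported subgraphs and delegate them to XNNPACK;
# anything unsupported falls back to the portable CPU kernels.
edge_program = to_edge(exported).to_backend(XnnpackPartitioner())

with open("model_xnnpack.pte", "wb") as f:
    f.write(edge_program.to_executorch().buffer)
```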

19:07 ExecuTorch: A PyTorch Platform for Deploying LLMs to Edge Devices and Its Adoption

This section mainly focuses on ExecuTorch. As a proof of concept, it shows the loss converging when fine-tuning a Phi-3 model with LoRA (a minimal sketch of the LoRA idea follows below). There are ongoing works and planned features, supported by torchao and related APIs. ExecuTorch has seen active adoption, including in Meta products and in various apps, and it is portable and target-agnostic across platforms. There are also some questions regarding its relation to the workstreams.
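Since LoRA is the fine-tuning method named above, here is a minimal, self-contained sketch of the idea (not the presenter's code): the pretrained weight is frozen and only a low-rank update B·A is trained, which is what makes on-device fine-tuning feasible.

```python
# Minimal LoRA sketch: freeze W, train only the low-rank update B @ A.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        # Trainable params: r * (in + out) instead of in * out.
        self.A = torch.nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + scale * x (B A)^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```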

25:14 Comparison between ExecuTorch and ONNX in ML Model Deployment

This section mainly discusses ExecuTorch from the perspective of a machine learning researcher. It aims to give users with limited knowledge of target devices a simple way to run models efficiently on them. The participants also discuss how multiple models work together in production and how the model-connection logic sits on the client side. Additionally, ExecuTorch is compared with ONNX, highlighting ExecuTorch's advantages.
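For orientation, here is a sketch of the two export paths being compared. The advantage framing above is the speaker's; this snippet only illustrates that both paths start from the same eager model and diverge in serialized format and runtime.

```python
# The two export paths side by side (illustrative only).
import torch
from executorch.exir import to_edge

model = torch.nn.Linear(4, 4).eval()
example = (torch.randn(1, 4),)

# ONNX path: convert to the ONNX opset, then run with ONNX Runtime
# or another ONNX-compatible engine.
torch.onnx.export(model, example, "model.onnx")

# ExecuTorch path: torch.export keeps ATen-level PyTorch semantics
# end to end, so no opset conversion step is involved.
et_program = to_edge(torch.export.export(model, example)).to_executorch()
with open("model.pte", "wb") as f:
    f.write(et_program.buffer)
```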

31:37 Discussion on Python code linkage, model fine-tuning, and issues deploying on an Android phone

This section mainly focuses on discussions of the development environment and code linkage. There are also questions about model fine-tuning with LoRA, whether it can be done on-device or offline; support for both cases is mentioned. Additionally, a participant shares issues faced while trying to run the Llama 3 8B-parameter model on a specific cell phone, along with the related dependency problems, and there is an offer to help resolve the issue.

37:27 Discussion on Model Support and Workstream Collaboration

This section begins with a discussion about the 8B-parameter Llama model being too large to run on a cellphone. It is noted that a 1-billion-parameter model is already supported, which is expected to reduce RAM usage, as the rough estimate below illustrates. There is also talk about the synergy between the workstream and the edge gateway. Tina Tsou suggests that Chen Lai use Lark for the discussion, and finally asks Wilson to give an update on Workstream 3.
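A back-of-envelope estimate (assumed precisions, not numbers from the meeting) of weight memory alone, counting weights only and excluding KV cache and activations, shows why the smaller model fits phone RAM far more comfortably:

```python
# Rough weight-memory estimate: params * bits / 8 bytes.
# Assumes 4-bit quantized weights; KV cache and activations excluded.
def weight_gb(params_billion: float, bits: int = 4) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

print(weight_gb(8))  # Llama 3 8B -> ~4.0 GB of weights
print(weight_gb(1))  # 1B model   -> ~0.5 GB of weights
```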

41:25 Discussion on the Spear Platform in Workstream 3 and Related Technical Details

This section mainly focuses on the InfiniEdge AI overview, especially Workstream 3 in the context of an AI agent platform. Wilson Wang updated some content, and there were discussions about integration with other workstreams. Li Borui then gave a brief on the Spear platform, which has two parts: an agent generator, and a runtime for running agents in distributed edge environments. There were also some questions about routing in the Spear platform, which Wilson Wang answered.

53:22 Discussion on data management for LLM services and related announcements

This section mainly focuses on aligning with agent scenarios and frameworks. It discusses the data format for prompts, including preparing data for efficient retrieval, and the interface work between the original database and LLM services. It also mentions collecting user feedback, properly labeling data for future use such as fine-tuning, and making a generic data-management layer work for LLM services; a speculative sketch of that interface follows below. Additionally, there are announcements about events and meetings.
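Purely as a speculative illustration of the "generic data manager" idea discussed above (every name here is hypothetical, not from an existing project): one interface that ingests data for retrieval at prompt time and logs labeled feedback for later fine-tuning.

```python
# Hypothetical sketch of a generic data manager for LLM services.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    label: str  # e.g. "helpful" / "unhelpful", kept for fine-tuning data

class LLMDataManager:
    def __init__(self) -> None:
        self.documents: list[str] = []
        self.feedback: list[FeedbackRecord] = []

    def add_document(self, text: str) -> None:
        """Ingest data so it is available for retrieval at prompt time."""
        self.documents.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Naive keyword match standing in for a real vector index."""
        words = query.lower().split()
        scored = [(sum(w in d.lower() for w in words), d) for d in self.documents]
        return [d for score, d in sorted(scored, reverse=True)[:k] if score > 0]

    def log_feedback(self, record: FeedbackRecord) -> None:
        """Collect labeled interactions for future fine-tuning datasets."""
        self.feedback.append(record)
```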