LLM Engine
Software that can load and run LLM models.
Open WebUI
A Web UI Tool for Ollama
URLs
- https://openwebui.com/
- GitHub: https://github.com/open-webui/open-webui
- Docs: https://docs.openwebui.com/
Installation
Installing Both Open WebUI and Ollama Together:
# With GPU Support
docker run -d -p 3000:8080 --gpus=all \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
# For CPU only
docker run -d -p 3000:8080 \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
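Once the container is running, Open WebUI is served at http://localhost:3000 and the bundled Ollama instance runs inside the same container. A minimal smoke-test sketch in Python, assuming the container name open-webui from the commands above and a hypothetical model tag llama3:
import subprocess
import urllib.request

# Pull a model into the bundled Ollama instance (the model tag is an example)
subprocess.run(["docker", "exec", "open-webui", "ollama", "pull", "llama3"], check=True)

# Confirm the web UI answers on the published port
with urllib.request.urlopen("http://localhost:3000") as resp:
    print("Open WebUI is up, HTTP status:", resp.status)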
Kuwa Gen AI OS
A free, open, secure, and privacy-focused generative AI service system, including a friendly user interface for large language models and a new GenAI kernel that can support generative AI applications.
- 🌐 Provides an end-to-end solution for multilingual GenAI development and deployment, supporting Windows and Linux
- 💬 Offers friendly features such as group chat, citations, and import/export/sharing of complete prompt lists
- 🔄 Flexibly combines prompts x RAGs x bots x models x hardware/GPUs to meet application needs
- 💻 Runs in environments ranging from virtual machines, laptops, and PCs to on-premises servers and public/private clouds
- 🔓 Open source, allowing developers to contribute and build customized systems for their own needs
URLs
AnythingLLM
The ultimate AI business intelligence tool. Any LLM, any document, full control, full privacy.
AnythingLLM is a "single-player" (standalone, single-user) application you can install on any Mac, Windows, or Linux operating system and get local LLMs, RAG, and Agents with little to zero configuration and full privacy.
AnythingLLM also has a self-hosted web version; see the links at the bottom of this article.
You can install AnythingLLM as a desktop application, self-host it locally using Docker, or host it in the cloud (AWS, Google Cloud, Railway, etc.) using Docker.
You want AnythingLLM Desktop if...
- You want a one-click installable app to use local LLMs, RAG, and Agents locally
- You do not need multi-user support
- Everything needs to stay only on your device
- You do not need to "publish" anything to the public internet, e.g., a chat widget for a website
URLs
Ollama
Run Llama 3, Phi 3, Mistral, Gemma, and other models. Customize and create your own.
- https://ollama.com/
- GitHub: https://github.com/ollama/ollama
- Doc: https://github.com/ollama/ollama/tree/main/docs
- Video: 離線不怕隱私外洩!免費開源 AI 助手 Ollama 從安裝到微調,一支影片通通搞定! - YouTube
Installation
ollama + open webui
mkdir ollama-data download open-webui-data
docker-compose.yml:
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    volumes:
      - ./ollama-data:/root/.ollama
      - ./download:/download
    container_name: ollama
    pull_policy: always
    tty: true
    restart: always
    networks:
      - ollama-docker

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - ./open-webui-data:/app/backend/data
    depends_on:
      - ollama
    ports:
      - 3000:8080
    environment:
      - 'OLLAMA_BASE_URL=http://ollama:11434'
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - ollama-docker

networks:
  ollama-docker:
    external: false
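Start the stack with docker compose up -d. Open WebUI is then reachable at http://localhost:3000 and the Ollama API at http://localhost:11434, matching the port mappings above.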
ollama
mkdir ollama-data download
docker run --name ollama -d --rm \
-v $PWD/ollama-data:/root/.ollama \
-v $PWD/download:/download \
-p 11434:11434 \
ollama/ollama
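The container exposes Ollama's OpenAI-compatible endpoint on port 11434 (the same endpoint the benchmark script later on this page targets). A minimal sketch using the openai client, assuming a model such as llama3 has already been pulled with ollama pull llama3:
from openai import OpenAI

# The API key is required by the client but ignored by Ollama
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
resp = client.chat.completions.create(
    model='llama3',  # assumes this model has been pulled
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
)
print(resp.choices[0].message.content)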
Models
List installed models
ollama list
Load a GGUF model manually
ollama create <my-model-name> -f <modelfile>
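A minimal sketch of the GGUF workflow, assuming a local file ./my-model.gguf (a placeholder name): write a Modelfile pointing at the file, register it with ollama create, then confirm it shows up in ollama list:
import pathlib
import subprocess

# FROM points at the local GGUF file; PARAMETER lines are optional defaults
pathlib.Path("Modelfile").write_text(
    "FROM ./my-model.gguf\n"
    "PARAMETER temperature 0.7\n"
)

subprocess.run(["ollama", "create", "my-model", "-f", "Modelfile"], check=True)
subprocess.run(["ollama", "list"], check=True)  # the new model should now appear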
Page Assist
Page Assist is an open-source Chrome Extension that provides a Sidebar and Web UI for your Local AI model.
LM Studio
Discover, download, and run local LLMs.
With LM Studio, you can ...
URLs
OpenLLM
OpenLLM helps developers run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud, optimized for serving throughput and production deployment.
- GitHub: https://github.com/bentoml/OpenLLM
- CoLab: https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb
Install
Using a Python virtual environment is recommended.
pip install openllm
Start a LLM Server
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
To interact with the server, you can visit the web UI at http://localhost:3000/ or send a request using curl. You can also use OpenLLM’s built-in Python client to interact with the server:
import openllm
client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')
OpenAI Compatible Endpoints
import openai

client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on 0.0.0.0:3000
completions = client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Write me a tag line for an ice cream shop.'}],
    model='microsoft/Phi-3-mini-4k-instruct',
    max_tokens=64,
)
print(completions.choices[0].message.content)
LangChain
from langchain.llms import OpenLLMAPI
llm = OpenLLMAPI(server_url='http://44.23.123.1:3000')
llm.invoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
# streaming
for it in llm.stream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
    print(it, flush=True, end='')

# async context
await llm.ainvoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')

# async streaming
async for it in llm.astream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
    print(it, flush=True, end='')
Benchmark
Benchmark for LLM engines
bench.py
- ollama 支持并发之后和 vllm 相比性能如何?我们测测看_ollama vllm-CSDN博客
- YT: ollama vs vllm - 开启并发之后的 ollama 和 vllm 相比怎么样? - YouTube
import aiohttp
import asyncio
import time
from tqdm import tqdm
import random
questions = [
"Why is the sky blue?", "Why do we dream?", "Why is the ocean salty?", "Why do leaves change color?",
"Why do birds sing?", "Why do we have seasons?", "Why do stars twinkle?", "Why do we yawn?",
"Why is the sun hot?", "Why do cats purr?", "Why do dogs bark?", "Why do fish swim?",
"Why do we have fingerprints?", "Why do we sneeze?", "Why do we have eyebrows?", "Why do we have hair?",
"Why do we have nails?", "Why do we have teeth?", "Why do we have bones?", "Why do we have muscles?",
"Why do we have blood?", "Why do we have a heart?", "Why do we have lungs?", "Why do we have a brain?",
"Why do we have skin?", "Why do we have ears?", "Why do we have eyes?", "Why do we have a nose?",
"Why do we have a mouth?", "Why do we have a tongue?", "Why do we have a stomach?", "Why do we have intestines?",
"Why do we have a liver?", "Why do we have kidneys?", "Why do we have a bladder?", "Why do we have a pancreas?",
"Why do we have a spleen?", "Why do we have a gallbladder?", "Why do we have a thyroid?", "Why do we have adrenal glands?",
"Why do we have a pituitary gland?", "Why do we have a hypothalamus?", "Why do we have a thymus?", "Why do we have lymph nodes?",
"Why do we have a spinal cord?", "Why do we have nerves?", "Why do we have a circulatory system?", "Why do we have a respiratory system?",
"Why do we have a digestive system?", "Why do we have an immune system?"
]
async def fetch(session, url):
    """
    Args:
        session (aiohttp.ClientSession): session used for the request.
        url (str): URL to send the request to.

    Returns:
        tuple: number of completion tokens and the request time.
    """
    start_time = time.time()
    # Pick a random question
    question = random.choice(questions)  # <--- keep exactly one of these two lines active
    # Fixed question
    # question = questions[0]  # <--- keep exactly one of these two lines active
    # Request payload
    json_payload = {
        "model": "llama3:8b-instruct-fp16",
        "messages": [{"role": "user", "content": question}],
        "stream": False,
        "temperature": 0.7  # 0.7 so that responses differ slightly between runs
    }
    async with session.post(url, json=json_payload) as response:
        response_json = await response.json()
        end_time = time.time()
        request_time = end_time - start_time
        completion_tokens = response_json['usage']['completion_tokens']  # number of generated tokens reported by the server
        return completion_tokens, request_time

async def bound_fetch(sem, session, url, pbar):
    # Use the semaphore to cap the number of concurrent requests
    async with sem:
        result = await fetch(session, url)
        pbar.update(1)
        return result

async def run(load_url, max_concurrent_requests, total_requests):
    """
    Run the benchmark by sending many concurrent requests.

    Args:
        load_url (str): URL to send requests to.
        max_concurrent_requests (int): maximum number of concurrent requests.
        total_requests (int): total number of requests to send.

    Returns:
        tuple: total number of completion tokens and the list of response times.
    """
    # Semaphore that limits the number of concurrent requests
    sem = asyncio.Semaphore(max_concurrent_requests)
    # Create an asynchronous HTTP session
    async with aiohttp.ClientSession() as session:
        tasks = []
        # Progress bar to visualize request progress
        with tqdm(total=total_requests) as pbar:
            # Create tasks until the total number of requests is reached
            for _ in range(total_requests):
                # Each task must respect the semaphore limit
                task = asyncio.ensure_future(bound_fetch(sem, session, load_url, pbar))
                tasks.append(task)  # add the task to the task list
            # Wait for all tasks to finish and collect their results
            results = await asyncio.gather(*tasks)
            # Total number of completion tokens across all results
            completion_tokens = sum(result[0] for result in results)
            # Response time of every request
            response_times = [result[1] for result in results]
            # Return the total completion tokens and the list of response times
            return completion_tokens, response_times

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print("Usage: python bench.py <C> <N>")
        sys.exit(1)
    C = int(sys.argv[1])  # maximum concurrency
    N = int(sys.argv[2])  # total number of requests
    # Both vllm and ollama expose an OpenAI-compatible API, which makes testing easier
    url = 'http://localhost:11434/v1/chat/completions'
    start_time = time.time()
    completion_tokens, response_times = asyncio.run(run(url, C, N))
    end_time = time.time()
    # Total wall-clock time
    total_time = end_time - start_time
    # Average time per request
    avg_time_per_request = sum(response_times) / len(response_times)
    # Tokens generated per second
    tokens_per_second = completion_tokens / total_time
    print(f'Performance Results:')
    print(f'  Total requests            : {N}')
    print(f'  Max concurrent requests   : {C}')
    print(f'  Total time                : {total_time:.2f} seconds')
    print(f'  Average time per request  : {avg_time_per_request:.2f} seconds')
    print(f'  Tokens per second         : {tokens_per_second:.2f}')
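For example, python bench.py 16 64 sends 64 requests with at most 16 in flight against the URL above and prints the total time, the average latency per request, and the tokens generated per second.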
More
LocalAI
LocalAI is the free, open-source OpenAI alternative. It acts as a drop-in replacement REST API compatible with the OpenAI API specifications for local inferencing. It lets you run LLMs and generate images, audio, and more, locally or on-prem with consumer-grade hardware, supporting multiple model families and architectures.
OpenAI Proxy
A proxy server to call 100+ LLMs through a unified interface, track spend, and set budgets per virtual key/user.
Features:
- Unified interface: call 100+ LLMs (Hugging Face, Bedrock, TogetherAI, etc.) in the OpenAI ChatCompletions & Completions format
- Cost tracking: authentication, spend tracking, and budgets per virtual key
- Load balancing: across multiple models and multiple deployments of the same model; the LiteLLM proxy can handle 1.5k+ requests/second during load tests
When an enterprise adopts LLMs, it may end up using many different models, spanning commercial and open-source licenses and coming from different providers. To manage these diverse models and build applications on them in a unified way, a platform such as OpenAI Proxy is recommended, in order to achieve the following (see the sketch after this list):
- A single API entry point and a unified request format
- Cost tracking
- Load balancing
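A minimal client-side sketch: the proxy speaks the OpenAI protocol, so only base_url and the virtual key change per deployment. The port 4000 and the model alias gpt-4o below are assumptions; use whatever your proxy configuration defines.
from openai import OpenAI

# base_url points at the LiteLLM proxy; the key is a virtual key issued by the proxy
client = OpenAI(base_url='http://localhost:4000', api_key='sk-my-virtual-key')
resp = client.chat.completions.create(
    model='gpt-4o',  # an alias that the proxy routes to the configured provider
    messages=[{'role': 'user', 'content': 'Say hello from behind the proxy.'}],
)
print(resp.choices[0].message.content)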
Xinference
Xorbits Inference (Xinference) is an open-source platform to streamline the operation and integration of a wide array of AI models. With Xinference, you’re empowered to run inference using any open-source LLMs, embedding models, and multimodal models either in the cloud or on your own premises, and create robust AI-driven applications.
NVIDIA NIM
Explore the latest community-built AI models with an API optimized and accelerated by NVIDIA, then deploy anywhere with NVIDIA NIM inference microservices.
- NVIDIA NIM for Deploying Generative AI | NVIDIA
- Doc: Introduction - NVIDIA Docs
- Models: google / gemma-7b
- YT: Self-Host and Deploy Local LLAMA-3 with NIMs - YouTube
text-generation-webui
A Gradio web UI for Large Language Models.
It can only run local models; it does not support external model APIs.
An AI platform that supports the following features:
- Chat
- Fine-Tune Model
- Multiple model backends: Transformers, llama.cpp (through llama-cpp-python), ExLlamaV2, AutoGPTQ, AutoAWQ, GPTQ-for-LLaMa, QuIP#.
- OpenAI-compatible API server with Chat and Completions endpoints
Tutorials
- GitHub: https://github.com/oobabooga/text-generation-webui
- GitHub: https://github.com/Atinoda/text-generation-webui-docker
- 大型語言模型LLMs課程教學 課程大綱 (三) - HackMD
- YOUTUBE [Launching TextGen]
- YOUTUBE [Loading a large language model]
- YOUTUBE [Assigning an AI persona]
- YOUTUBE [Fine-tuning a model]
- YOUTUBE [Loading the fine-tuned model]
- Code: Z01_TextGen_Colab.ipynb
- The default credentials are in the code (account: nchc, password: nchc); change them yourself.