Google Kubernetes Engine (GKE) provides a wide range of options for deploying and serving Gemma models with high performance and low latency using your preferred development framework. Check out the following deployment guides for Hugging Face TGI, vLLM, and TensorRT-LLM on GPUs, and for JetStream on TPUs, plus the data analysis and fine-tuning guides:
Deploy and serve
- Serve Gemma on GPUs with Hugging Face TGI: Deploy Gemma models on GKE using GPUs and the Hugging Face Text Generation Inference (TGI) framework (see the TGI client sketch after this list).
- Serve Gemma on GPUs with vLLM: Deploy Gemma with vLLM for convenient model load management and high throughput (vLLM sketch below).
- Serve Gemma on GPUs with TensorRT-LLM: Deploy Gemma with NVIDIA TensorRT-LLM to maximize GPU inference efficiency (TensorRT-LLM sketch below).
- Serve Gemma on TPUs with JetStream: Deploy Gemma with JetStream on TPUs for high performance and low latency (JetStream sketch below).
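Once the TGI deployment from the first guide is running, you can smoke-test the endpoint from any machine with cluster access. The sketch below is a minimal client, assuming the Service has been forwarded locally with `kubectl port-forward` (the service name `llm-service` and the port are placeholders from your own manifest); the request shape follows TGI's `/generate` REST API.

```python
import requests

# Placeholder: forward your TGI Service locally first, e.g.
#   kubectl port-forward service/llm-service 8080:8080
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Why run LLM inference on Kubernetes?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

resp = requests.post(TGI_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```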
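For the vLLM deployment, a similar sketch works against the OpenAI-compatible completions endpoint that vLLM's server exposes; the service name and the served model id (`google/gemma-7b` here) are placeholders for whatever your manifest specifies.

```python
import requests

# Placeholder: kubectl port-forward service/vllm-service 8000:8000
VLLM_URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "google/gemma-7b",  # must match the model id vLLM was started with
    "prompt": "Explain continuous batching in one paragraph.",
    "max_tokens": 128,
    "temperature": 0.7,
}

resp = requests.post(VLLM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```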
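TensorRT-LLM models are typically served behind Triton Inference Server, and Triton's generate endpoint can be exercised the same way. This sketch assumes that setup; the model name `ensemble` and the request fields follow a common tensorrt_llm backend model repository and may differ in your configuration.

```python
import requests

# Placeholder: kubectl port-forward service/triton-server 8000:8000
# "ensemble" and the field names below follow a common tensorrt_llm
# backend model repository; adjust them to match your configuration.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": "What does TensorRT-LLM optimize?",
    "max_tokens": 128,
}

resp = requests.post(TRITON_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])
```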
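Finally, for the JetStream-on-TPU deployment, a minimal sketch against the guide's HTTP frontend looks like the following; the service name, endpoint path, and payload fields are assumptions drawn from the tutorial setup, so adjust them to your deployment.

```python
import requests

# Placeholder: kubectl port-forward service/jetstream-http-svc 8000:8000
JETSTREAM_URL = "http://localhost:8000/generate"

payload = {"prompt": "What workloads are TPUs good at?", "max_tokens": 128}

resp = requests.post(JETSTREAM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # JSON body containing the generated text
```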
Analyze data
- Analyze data on GKE using BigQuery, Cloud Run, and Gemma: Build a data analysis pipeline with BigQuery and Gemma.
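The core loop of such a pipeline is short: read rows from BigQuery, prompt Gemma with each one, and collect the responses. The sketch below uses hypothetical table and endpoint names and assumes a Gemma server speaking TGI's `/generate` API; the guide itself wires this up through Cloud Run.

```python
import requests
from google.cloud import bigquery

# Hypothetical names for illustration only.
TABLE = "my-project.reviews.customer_feedback"
GEMMA_URL = "http://localhost:8080/generate"  # e.g. a TGI service on GKE

client = bigquery.Client()
rows = client.query(f"SELECT text FROM `{TABLE}` LIMIT 10").result()

for row in rows:
    payload = {
        "inputs": f"Summarize the sentiment of this review: {row.text}",
        "parameters": {"max_new_tokens": 64},
    }
    resp = requests.post(GEMMA_URL, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["generated_text"])
```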
Fine-tune
- Fine-tune Gemma open models using multiple GPUs: Customize Gemma's behavior with your own dataset.
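As a rough picture of what the tuning job runs, here is a minimal LoRA fine-tuning sketch using Hugging Face `transformers` and `peft`; it assumes you have access to the Gemma weights on Hugging Face, and the quotes dataset is a stand-in for your own data. Launched with `torchrun --nproc_per_node=<num_gpus> train.py`, the `Trainer` replicates the job across GPUs automatically.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "google/gemma-7b"  # or another Gemma checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Attach LoRA adapters so only a small set of weights is trained.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
))

# Placeholder dataset; substitute your own corpus with a text column.
dataset = load_dataset("Abirate/english_quotes", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["quote"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma-finetuned",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=4,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    # Causal LM collation: pads batches and copies inputs to labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```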