Commit f116016

Docs - Add release notes (#92)
__Description__ Add LTP release notes.
__Major Revision__
- Add release notes for v1.3, v1.2, v1.1, and v1.0
1 parent 9b5a988 commit f116016

6 files changed: +152 -0 lines changed
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+---
+slug: release-ltp-v1.0
+title: Releasing Lucia Training Platform v1.0
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.0.0**!
+
+## Lucia Training Platform v1.0.0 Release Notes
+
+This inaugural release establishes Lucia Training Platform as a comprehensive AI platform solution, built on the foundation of OpenPAI with significant enhancements and customizations for enterprise AI workloads.
+
+## Platform Features & Stability
+- Updated Virtual Machine Scale Set deployment scripts with MI300 GPU support and kubelet bug fixes
+- Fixed launch order issues between AMD device plugin and AMDGPU module loading
+- Fixed local disk mounting into containers for high-speed data loading
+- Implemented priority restrictions for production jobs to ensure resource allocation
+- Automated daily backup of user logs to blob storage with cordon trigger functionality
+- Updated OpenPAI-runtime image to resolve SSH crashes in large-scale training jobs
+- Added refresh API to clean storage cache when new Persistent Volumes (PV) or Persistent Volume Claims (PVC) are added
+- Implemented automated email notifications for production jobs to specific user groups
+
+## Job Reliability & Monitoring
+- Implemented automatic detection metrics and rules for AMD GPU issues during runtime
+- Enabled job execution on specific cordoned nodes for admin management
+- Automated node cordoning and uncordoning with single node validation
+- Added support for monitoring count of per-VC available/used nodes in Prometheus
+
+## User Experience
+- Complete revision of the homepage with acknowledgment of OpenPAI's great contribution
+- Updated all titles and references from OpenPAI to Lucia Training Platform (LTP) throughout the web portal
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+---
+slug: release-ltp-v1.1
+title: Releasing Lucia Training Platform v1.1
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.1.0**!
+
+## Lucia Training Platform v1.1.0 Release Notes
+
+This release introduces new inference capabilities, stability improvements, comprehensive monitoring systems, and significant security enhancements.
+
+## Platform Features & Stability
+- Added support for inference job submission
+- Added prototype user interface with webportal plugin
+
+## Job Reliability & Monitoring
+- Automated Azure VM recycling and validation processing workflows
+- Automated pipeline for submitting ICM tickets for unhealthy Azure VMs
+- Kusto database implementation for action status tracking, node status monitoring, and job status analytics
+
+## User Experience
+- Enhanced dashboard with comprehensive platform performance metrics
+
+## Security
+- Forced upgrades of operating system, Linux, and Python packages to address security vulnerabilities
+- Updated Golang and Node.js packages to latest secure versions
+- Disabled and replaced unapproved registries (non-ACR/MCR) on LTP platform
+- Disabled SSH access for all users to enhance security posture
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+---
+slug: release-ltp-v1.2
+title: Releasing Lucia Training Platform v1.2
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.2.0**!
+
+## Lucia Training Platform v1.2.0 Release Notes
+
+This release introduces significant new features, enhanced reliability monitoring, improved user experience, and strengthened security measures.
+
+## Platform Features & Stability
+- Virtual Cluster administrators can now stop jobs in their own VC
+- Enhanced inference job interface with external IP gateway support
+- Portal displays only active clusters for improved user experience
+- Enhanced job execution capabilities with Docker support within jobs
+- Resolved CUDA version mismatch issues causing job-exporter crashes
+- Fixed configuration refresh issues when updating user settings
+- Resolved blob mount failures and Azure copy token issues
+
+## Local Storage
+- Local storage service with user API interface implementation
+- Integration with node recycling processes
+
+## Job Reliability & Monitoring
+- Initial automatic node failure detection module design and implementation
+- Enhanced job monitoring Kusto data pipeline with summary and reaction time tracking
+- Proactive alerting email for certificate expiration management
+
+## User Experience
+- Added webportal plugin integration for Copilot functionality
+- Initial backend support for Copilot features
+- Enhanced dashboard with comprehensive platform metrics
+- Added Mean Time Between Incidents (MTBI) tracking for virtual machines and nodes in dashboard
+
+## Security
+- Updates to address security vulnerabilities in container images
+- Kubernetes version upgrade for enhanced security and performance
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+---
+slug: release-ltp-v1.3
+title: Releasing Lucia Training Platform v1.3
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.3.0**!
+
+## Lucia Training Platform v1.3.0 Release Notes
+
+This release brings significant improvements across platform stability, inference capabilities, user experience, and security.
+
+## Platform Features & Stability
+- Migrated PostgreSQL database to Azure disk for improved performance
+- Upgraded blobfuse version to support non-empty cache folder
+- Integrated ssh-proxy and utilization reporter services
+- Added local disk support for Prometheus deployment
+- Added H200 GPU support for VMSS provisioning scripts
+- Updated Kubernetes deployment scripts for Kubespray compatibility on Bare Metal Machines
+- CI/CD: Added GitHub workflow to build and deploy changed services
+- Fixed GPU and default eth detection in job exporter
+- Updated /etc/hosts for DNS records
+- Added alert logging webhook in alert manager
+
+## Inference Plugin
+- Added support for output streaming in web portal
+- Implemented long-context thinking output folding support
+- Deployed inference model proxy service for improved model serving
+- Messages now display as markdown and copy as plain text
+
+## Tools
+- Added comprehensive administration tools for LTP service management
+
+## Security
+- Upgraded Dockerfiles with latest system updates
+- Added Docker-in-Docker (dind) support for Webportal
+
+## User Experience
+- Code refactoring for improved maintainability
+- Integrated Copilot SGLang/OpenAI Interface backend support
+- Added Copilot support for Dashboard Metrics visualization
+- Implemented User Feedback Loop design and functionality
+- Added VC-based user group membership authentication
+
+## License
+- Added Microsoft license headers to examples directory
+- Added Microsoft license headers to contrib directory
+- Added Microsoft license headers to deployment directory
+- Added Microsoft license headers to src directory

docs/ltp/cert_update_readme.md renamed to docs/LuciaTrainingPlatform/manual/admin/cert_update_readme.md

File renamed without changes.

docs/ltp/service_setup_readme.md renamed to docs/LuciaTrainingPlatform/manual/admin/service_setup_readme.md

File renamed without changes.
