概述 Overview
HTMLLite 是一种为人工智能(AI)和大型语言模型(LLM)设计的、严格的HTML轻量级子集。它旨在将人类编写的复杂文档,转化为机器可以无歧义地解析、理解和推理的高质量结构化数据。
HTMLLite is a strict, lightweight subset of HTML designed for Artificial Intelligence (AI) and Large Language Models (LLMs). It aims to transform complex human-written documents into high-quality structured data that machines can parse, understand, and reason about unambiguously.
通过剥离标准HTML中所有与表现层(样式、脚本)相关的元素,并增强其描述内容结构与关联的能力,HTMLLite 解决了非结构化数据在AI应用中的三个核心痛点:
By stripping all presentation layer elements (styles, scripts) from standard HTML and enhancing its ability to describe content structure and relationships, HTMLLite solves three core pain points of unstructured data in AI applications:
上下文完整 Complete Context
确保每个知识区块都包含完整的上下文信息,避免信息碎片化。 Ensures each knowledge block contains complete contextual information, avoiding information fragmentation.
结构清晰 Clear Structure
使用严格的标签集合,消除结构歧义,提供清晰的文档层级。 Uses a strict set of tags to eliminate structural ambiguity and provide clear document hierarchy.
关系明确 Explicit Relationships
通过语义链接机制,让AI理解不同内容块之间的逻辑关系。 Through semantic linking mechanisms, allows AI to understand logical relationships between different content blocks.
v1.2 升级亮点 v1.2 Upgrade Highlights
🚀 本次升级的核心改进 🚀 Core Improvements in This Upgrade
-
[新增]
[New]
在"允许的内容元素"中正式加入了定义列表 (
<dl>
,<dt>
,<dd>
) 元素 Officially added definition list elements (<dl>
,<dt>
,<dd>
) to "Allowed Content Elements" - [推荐] [Recommended] 明确推荐使用定义列表作为表示键值对数据的最佳实践 Explicitly recommends using definition lists as best practice for representing key-value pair data
-
[澄清]
[Clarified]
再次强调了
<strong>
和<em>
标签作为可选的语义强调元素 Re-emphasized<strong>
and<em>
tags as optional semantic emphasis elements
这次升级的核心是通过引入定义列表(Description List)元素,极大地增强了HTMLLite对键值对(Key-Value)数据的语义表示能力。 The core of this upgrade is the introduction of Description List elements, which greatly enhances HTMLLite's semantic representation capabilities for key-value pair data.
设计原则 Design Principles
轻量纯粹 Lightweight & Pure
只包含用于定义内容结构的HTML标签,剔除所有表现层元素。 Contains only HTML tags for defining content structure, removing all presentation layer elements.
结构保真 Structure Preservation
无损保留原始文档的关键结构,特别是表格、列表和多级标题。 Losslessly preserves key structures of original documents, especially tables, lists, and multi-level headings.
原子化区块 Atomic Blocks
将文档拆分为独立的、上下文完整的"知识区块"。 Splits documents into independent, contextually complete "knowledge blocks".
语义链接 Semantic Linking
提供结构化的链接机制,使机器能够理解区块之间的逻辑关系。 Provides structured linking mechanisms that enable machines to understand logical relationships between blocks.
精确语义 Precise Semantics
鼓励使用最能准确描述数据内在关系的HTML标签。 Encourages using HTML tags that most accurately describe inherent data relationships.
AI优先 AI-First
每个设计决策都以提升AI理解能力为首要目标。 Every design decision prioritizes enhancing AI comprehension capabilities.
格式规范 Format Specification
文档结构 Document Structure
根容器 Root Container
所有 HTMLLite 内容必须被包含在一个唯一的根 <div>
容器内,并使用 class="htmllite-document"
进行标识。
All HTMLLite content must be contained within a unique root <div>
container, identified with class="htmllite-document"
.
<div class="htmllite-document">
<!-- 所有内容都在这里 -->
<!-- All content goes here -->
</div>
知识区块 Knowledge Blocks
一个知识区块由一个带有 data-htmllite-block="true"
属性的 <div>
元素定义。区块之间不可嵌套,并应以标题开始。
A knowledge block is defined by a <div>
element with the data-htmllite-block="true"
attribute. Blocks cannot be nested and should start with a heading.
<div data-htmllite-block="true">
<h2>区块标题Block Title</h2>
<!-- 区块内容 -->
<!-- Block content -->
</div>
允许的内容元素 Allowed Content Elements
在每个 data-htmllite-block
内部,只允许使用以下HTML标签:
Within each data-htmllite-block
, only the following HTML tags are allowed:
元素 Element | 用途 Purpose | 备注 Notes |
---|---|---|
<h2> - <h6> |
标题:定义区块内的主题和子主题层级 Headings: Define topic and subtopic hierarchy within blocks | 必需 Required |
<p> |
段落:包含常规文本内容 Paragraph: Contains regular text content | - |
<ul> , <ol> , <li> |
列表:用于表示项目列表或有序步骤 Lists: For representing item lists or ordered steps | - |
<table> , <thead> , <tbody> , <tr> , <th> , <td> |
表格:用于结构化数据展示 Tables: For structured data display | 保留 rowspan 和 colspan Preserve rowspan and colspan |
<dl> , <dt> , <dd> |
定义列表:用于表示键值对数据 Definition Lists: For representing key-value pair data | v1.2 新增 New in v1.2 |
<sup> , <sub> |
上/下标:用于脚注标记、化学式等 Super/subscript: For footnotes, chemical formulas, etc. | - |
<strong> , <em> |
强调:用于标记重点词汇 Emphasis: For marking important words | 可选,语义强调 Optional, semantic emphasis |
<br> |
换行:在段落内强制换行 Line break: Force line break within paragraphs | - |
关联链接 Related Links
一个关联链接由 data-htmllite-link="true"
的 <div>
定义,包含 data-href
属性和描述子元素。
A related link is defined by a <div>
with data-htmllite-link="true"
, containing a data-href
attribute and description sub-elements.
<div data-htmllite-link="true" data-href="/path/to/resource.htmllite">
<strong class="htmllite-link-title">相关资料标题Related Resource Title</strong>
<p class="htmllite-link-description">
资料的简要描述...
Brief description of the resource...
</p>
</div>
<style>
, <script>
, <link>
, <iframe>
, <span>
(在区块内), <form>
等。
Any tags unrelated to content structure are strictly forbidden, such as <style>
, <script>
, <link>
, <iframe>
, <span>
(within blocks), <form>
, etc.
应用 v1.2 规范的示例 Examples Using v1.2 Specification
现在,我们使用 HTMLLite v1.2 规范来转换数据,正式采用 <dl>
标签作为最佳实践:
Now, let's use the HTMLLite v1.2 specification to transform data, officially adopting <dl>
tags as best practice:
<div class="htmllite-document">
<div data-htmllite-block="true">
<h2>144: 南四湖以东水源涵养、生物多样性维护生态保护红线区144: Ecological Protection Red Line Area for Water Conservation and Biodiversity Maintenance East of Nansi Lake</h2>
<dl>
<dt>代码Code</dt>
<dd>SD-04-B1-01</dd>
<dt>生态功能Ecological Function</dt>
<dd>水源涵养、生物多样性维护Water conservation, biodiversity maintenance</dd>
<dt>类型Type</dt>
<dd>湿地、湖泊、农田Wetland, lake, farmland</dd>
</dl>
<h3>所在行政区域Administrative Region</h3>
<dl>
<dt>市City</dt>
<dd>枣庄市Zaozhuang City</dd>
<dt>县(区、市)County (District, City)</dt>
<dd>滕州市Tengzhou City</dd>
</dl>
<h3>外边界Outer Boundary</h3>
<dl>
<dt>面积(km²)Area (km²)</dt>
<dd>53.69</dd>
<dt>边界描述Boundary Description</dt>
<dd>滕州市西部的滨湖镇境内。Within Binhu Town in western Tengzhou City.</dd>
<dt>拐点坐标Corner Coordinates</dt>
<dd>1:116 51'04"E, 35 06'01"N; 2:116 °50'50"E, 35 °05'01"N; ...</dd>
</dl>
<h3>I类红线区Type I Red Line Area</h3>
<dl>
<dt>面积(km²)Area (km²)</dt>
<dd>N/A</dd>
<dt>边界描述Boundary Description</dt>
<dd>/</dd>
<dt>拐点坐标Corner Coordinates</dt>
<dd>/</dd>
</dl>
<h3>备注与关联Notes and Associations</h3>
<p>此区域包含滕州红荷湿地省级地质公园、滕州滨湖国家湿地公园、部分滕州市公益林。This area includes Tengzhou Red Lotus Wetland Provincial Geopark, Tengzhou Binhu National Wetland Park, and part of Tengzhou public welfare forest.</p>
<div data-htmllite-link="true" data-href="/parks/tengzhou-red-lotus-geopark.htmllite">
<strong class="htmllite-link-title">相关资料:滕州红荷湿地省级地质公园Related Information: Tengzhou Red Lotus Wetland Provincial Geopark</strong>
<p class="htmllite-link-description">
查看关于"滕州红荷湿地省级地质公园"的详细信息,包括其具体范围、保护目标和管理规定。
View detailed information about "Tengzhou Red Lotus Wetland Provincial Geopark", including its specific scope, conservation objectives, and management regulations.
</p>
</div>
</div>
</div>
v1.2 升级的价值 Value of v1.2 Upgrade
-
对机器更友好:
More Machine-Friendly:
<dl>
列表为AI提供了关于"键"和"值"关系的强信号<dl>
lists provide AI with strong signals about "key" and "value" relationships - 语义更精确: More Precise Semantics: 准确反映了原始JSON数据的内在结构 Accurately reflects the inherent structure of original JSON data
- 结构更清晰: Clearer Structure: 使得键值对的层级关系一目了然 Makes the hierarchical relationships of key-value pairs clear at a glance
DL、DT、DD 标签 DL, DT, DD Tags
定义列表(Description List)是一个强大但经常被低估的HTML工具,让我们深入了解它的结构和用法。
The Description List is a powerful but often underestimated HTML tool. Let's dive deep into its structure and usage.
什么是定义列表? What is a Definition List?
定义列表是一种用于展示"名/值"对(或"术语/描述"对)的HTML结构。它天生包含两个层次:
A definition list is an HTML structure for displaying "name/value" pairs (or "term/description" pairs). It naturally contains two levels:
- 术语/名称 (Term): Term/Name: 你想要定义或描述的主体 The subject you want to define or describe
- 描述/值 (Description): Description/Value: 对该术语的具体解释或其对应的值 The specific explanation of the term or its corresponding value
三个核心标签详解 Detailed Explanation of Three Core Tags
Description List - 定义列表Definition List
含义:Meaning: "定义列表"的缩写Abbreviation for "description list"
角色:Role: 容器标签,包裹整个名值对列表Container tag that wraps the entire name-value pair list
类比:Analogy: 如果把整个列表看作字典,<dl>
就是这本字典本身If the list is a dictionary, <dl>
is the dictionary itself
Definition Term - 定义术语Definition Term
含义:Meaning: "定义术语"的缩写Abbreviation for "definition term"
角色:Role: 名称/键 (Name/Key) 标签Name/Key tag
类比:Analogy: 在字典里,<dt>
就是要查的词条In a dictionary, <dt>
is the word you're looking up
Definition Description - 定义描述Definition Description
含义:Meaning: "定义描述"的缩写Abbreviation for "definition description"
角色:Role: 描述/值 (Description/Value) 标签Description/Value tag
类比:Analogy: 在字典里,<dd>
就是词条的解释In a dictionary, <dd>
is the definition of the word
为什么在 HTMLLite 中使用它如此重要? Why is it so Important to Use in HTMLLite?
在 HTMLLite 的语境下,使用 <dl>
远比使用 <p>
或 <strong>
更好,原因在于其无与伦比的语义精确性。
In the context of HTMLLite, using <dl>
is far better than using <p>
or <strong>
due to its unparalleled semantic precision.
使用 <p> 标签 Using <p> Tags
<p>代码: SD-04-B1-01Code: SD-04-B1-01</p>
机器的理解: Machine Understanding: "这是一个段落,内容是'代码: SD-04-B1-01'。" "This is a paragraph with content 'Code: SD-04-B1-01'."
机器需要通过NLP识别冒号,并猜测"代码"是键,"SD-04-B1-01"是值。这个过程可能出错,且计算成本更高。
The machine needs to use NLP to identify the colon and guess that "Code" is the key and "SD-04-B1-01" is the value. This process may fail and has higher computational cost.
使用 <dl> 标签 Using <dl> Tags
<dl>
<dt>代码Code</dt>
<dd>SD-04-B1-01</dd>
</dl>
机器的理解: Machine Understanding: "这是一个定义列表。术语是'代码',描述是'SD-04-B1-01'。" "This is a definition list. The term is 'Code', the description is 'SD-04-B1-01'."
这种结构化的关系是明确的、无歧义的。AI可以直接抓取内容,无需猜测。
This structured relationship is clear and unambiguous. AI can directly extract content without guessing.
最佳实践 Best Practices
保持简洁 Keep It Simple
每个知识区块应该聚焦于单一主题,避免内容过于庞大。 Each knowledge block should focus on a single topic, avoiding overly large content.
使用语义标签 Use Semantic Tags
优先使用最能准确描述内容关系的标签,如键值对使用 <dl>
。
Prioritize tags that most accurately describe content relationships, such as <dl>
for key-value pairs.
建立链接 Establish Links
通过关联链接建立知识区块之间的关系网络。 Build relationship networks between knowledge blocks through related links.
验证结构 Validate Structure
使用验证工具确保文档符合 HTMLLite 规范。 Use validation tools to ensure documents comply with HTMLLite specifications.
保留元数据 Preserve Metadata
保留对AI理解有帮助的关键元数据信息。 Preserve key metadata information that helps AI understanding.
目标明确 Clear Goals
始终以提升AI理解能力为首要目标进行文档设计。 Always design documents with the primary goal of enhancing AI comprehension capabilities.