datalyes committed
Commit e47925f · verified · 1 Parent(s): 6b28ed9

Upload PatentTEB model: patembed-large

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
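This pooling configuration selects plain mean pooling over token embeddings (no CLS, max, or last-token pooling). As a hedged sketch of what it corresponds to in code, these keys map onto the constructor arguments of `sentence_transformers.models.Pooling`; the released model loads this module automatically, so the explicit construction below is for illustration only:

```python
from sentence_transformers import models

# Mean pooling over 1024-dim token embeddings, mirroring 1_Pooling/config.json
pooling = models.Pooling(
    word_embedding_dimension=1024,
    pooling_mode_cls_token=False,
    pooling_mode_mean_tokens=True,
    pooling_mode_max_tokens=False,
)
```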
README.md ADDED
@@ -0,0 +1,168 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - patent
+ - embeddings
+ - mteb
+ language:
+ - en
+ pipeline_tag: sentence-similarity
+ ---
+
+ # patembed-large
+
+ This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.
+
+ **Note:** This model uses task-specific instruction prompts during inference for optimal performance.
+
+ ## Model Details
+
+ - **Model Type**: Sentence Transformer
+ - **Base Architecture**: bert-for-patents (344M params, domain-pretrained on patent corpora)
+ - **Parameters**: 344M
+ - **Number of Layers**: 24
+ - **Hidden Size**: 1024
+ - **Embedding Dimension**: 1024
+ - **Max Sequence Length**: 512 tokens
+ - **Language**: English
+ - **License**: CC BY-NC-SA 4.0
+
+ ## Model Description
+
+ patembed-large is the flagship encoder of the family: a 24-layer transformer initialized from BERT-for-Patents.
+
+ This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
+
+ ## Usage
+
+ ### Using Sentence Transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ # Load the model
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # Encode patent texts
+ patent_texts = [
+     "A method for manufacturing semiconductor devices...",
+     "An apparatus for processing chemical compounds...",
+ ]
+ embeddings = model.encode(patent_texts)
+
+ # Compute similarity
+ similarity = util.cos_sim(embeddings[0], embeddings[1])
+ print(f"Similarity: {similarity.item():.4f}")
+ ```
+
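+ ### Using Task-Specific Prompts
+
+ The prompts mentioned in the note above are shipped in `config_sentence_transformers.json`, keyed by PatenTEB task name, with separate strings for queries and documents in asymmetric tasks. A minimal sketch of applying them (the prompt strings are copied from that file; prepending them manually keeps the example independent of the installed sentence-transformers version):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # "retrieval_IN" prompts from config_sentence_transformers.json
+ query_prompt = "encode query for same document retrieval: "
+ doc_prompt = "encode document for same retrieval: "
+
+ query_emb = model.encode(query_prompt + "Method for reducing power consumption in mobile devices")
+ doc_emb = model.encode(doc_prompt + "A power management system for portable electronic devices...")
+ print(f"Similarity: {util.cos_sim(query_emb, doc_emb).item():.4f}")
+ ```
+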
+ ### Using Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import torch.nn.functional as F
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-large')
+ model = AutoModel.from_pretrained('datalyes/patembed-large')
+
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ # Tokenize and encode
+ texts = ["A method for manufacturing semiconductor devices..."]
+ encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded)
+ embeddings = mean_pooling(model_output, encoded['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ ```
+
+ ### Patent Retrieval Example
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # Query patent
+ query = "Method for reducing power consumption in mobile devices"
+
+ # Candidate patents
+ candidates = [
+     "A power management system for portable electronic devices...",
+     "Chemical composition for battery manufacturing...",
+     "Method for wireless data transmission in mobile networks...",
+ ]
+
+ # Encode and retrieve
+ query_emb = model.encode(query)
+ candidate_embs = model.encode(candidates)
+
+ # Compute similarities
+ scores = util.cos_sim(query_emb, candidate_embs)[0]
+
+ # Get ranked results
+ results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
+ results.sort(key=lambda x: x[1], reverse=True)
+
+ for patent, score in results:
+     print(f"Score: {score:.4f} - {patent[:100]}...")
+ ```
+
+ ## Intended Use
+
+ This model is designed for patent-specific tasks including:
+ - Patent search and retrieval
+ - Prior art search
+ - Patent classification and clustering (see the sketch below)
+ - Technology landscape analysis
+
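+ As a rough illustration of the clustering use case (not part of the original evaluation setup), embeddings produced with the clustering prompt from `config_sentence_transformers.json` can be grouped with any off-the-shelf algorithm; the scikit-learn dependency below is an assumption about your environment:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sklearn.cluster import KMeans
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # "clusters_ext_full_ipc" prompt from config_sentence_transformers.json
+ prompt = "encode document for same ipc clustering: "
+ docs = [
+     "A method for manufacturing semiconductor devices...",
+     "An apparatus for processing chemical compounds...",
+     "Method for wireless data transmission in mobile networks...",
+ ]
+ embeddings = model.encode([prompt + d for d in docs])
+
+ # Toy example: group the documents into two clusters
+ labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
+ print(labels)
+ ```
+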
+ For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.
+
+ ## Citation
+
+ If you use this model, please cite our paper:
+
+ ```bibtex
+ @misc{ayaou2025patentebcomprehensivebenchmarkmodel,
+   title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
+   author={Iliass Ayaou and Denis Cavallucci},
+   year={2025},
+   eprint={2510.22264},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.22264}
+ }
+ ```
+
+ **Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)
+
+ ## License
+
+ This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.
+
+ **Key Terms:**
+ - ✅ You can use, share, and adapt the model
+ - ✅ You must give appropriate credit
+ - ❌ You may not use the model for commercial purposes
+ - ⚠️ If you adapt or build upon this model, you must distribute it under the same license
+
+ For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/
+
+ ## Contact
+
+ - **Authors**: Iliass Ayaou, Denis Cavallucci
+ - **Institution**: ICUBE Laboratory, INSA Strasbourg
+ - **GitHub**: [PatentTEB/PatentTEB](https://github.com/iliass-y/patenteb)
+ - **HuggingFace**: [datalyes](https://huggingface.co/datalyes)
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 39859
+ }
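This is the standard Hugging Face BERT configuration for the underlying encoder (24 layers, hidden size 1024, patent-specific vocabulary of 39,859 tokens). A quick, hedged sanity check of these values that fetches only the config, not the weights:

```python
from transformers import AutoConfig

# Downloads only config.json from the hub
config = AutoConfig.from_pretrained("datalyes/patembed-large")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)  # expected: 24 1024 39859
```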
config_sentence_transformers.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.55.2",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "retrieval_IN": {
+       "q_text": "encode query for same document retrieval: ",
+       "pos_text": "encode document for same retrieval: "
+     },
+     "retrieval_OUT": {
+       "q_text": "encode query for different document retrieval: ",
+       "pos_text": "encode document for different retrieval: "
+     },
+     "retrieval_MIXED": {
+       "q_text": "encode query for mixed document retrieval: ",
+       "pos_text": "encode document for mixed retrieval: "
+     },
+     "retrieval_inventor": {
+       "q_text": "encode query for same inventor document retrieval: ",
+       "pos_text": "encode document for same inventor retrieval: "
+     },
+     "title2full": {
+       "title": "encode title query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "problem2full": {
+       "problem": "encode problem query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "effect2full": {
+       "effect": "encode effect query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "effect2substance": {
+       "effect": "encode effect query for substance retrieval: ",
+       "substance": "encode substance for retrieval: "
+     },
+     "problem2solution": {
+       "problem": "encode problem query for solution retrieval: ",
+       "solution": "encode solution for retrieval: "
+     },
+     "para_problem": {
+       "text1": "encode problem for problem paraphrase: ",
+       "text2": "encode problem for problem paraphrase: "
+     },
+     "para_solution": {
+       "text1": "encode solution for solution paraphrase: ",
+       "text2": "encode solution for solution paraphrase: "
+     },
+     "class_text2ipc3": "encode document for ipc classification: ",
+     "class_bloom": "encode document for bloom prediction classification: ",
+     "class_nli_oldnew": {
+       "q_text": "encode citing document for pair classification: ",
+       "t_text": "encode cited document for pair classification: "
+     },
+     "clusters_ext_full_ipc": "encode document for same ipc clustering: ",
+     "clusters_inventor": "encode document for same inventors clustering: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
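The prompt table above pairs each PatenTEB task with its instruction string, with one string per side for asymmetric tasks (query vs. document). A hedged sketch of reading the table straight from the hub and applying it, using the standard `huggingface_hub` download helper; the chosen task and example texts are illustrative:

```python
import json

from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Fetch the prompt table shipped with the model
path = hf_hub_download("datalyes/patembed-large", "config_sentence_transformers.json")
with open(path) as f:
    prompts = json.load(f)["prompts"]

model = SentenceTransformer("datalyes/patembed-large")

# Asymmetric problem -> solution retrieval prompts
task = prompts["problem2solution"]
problem_emb = model.encode(task["problem"] + "Excessive heat buildup in compact electronic enclosures")
solution_emb = model.encode(task["solution"] + "A phase-change material layer bonded to the housing...")
```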
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:234ea36a876fe5d5c416c1cbaad6f7221e17861fadd6481f0b96588fdc1ca482
+ size 1378856808
model_info.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "display_name": "patembed-large",
+   "source_folder": "train_v4/runs/bert-for-patents-20250821_085708-360417_asym_prompt_all_1e5_bs_32_ga4_bs_nd",
+   "source_path": "/media/iayaou01/Extreme SSD/patembed_artifacts/patembed_release_bundle/train_v4/runs/bert-for-patents-20250821_085708-360417_asym_prompt_all_1e5_bs_32_ga4_bs_nd",
+   "output_path": "/media/iayaou01/Extreme SSD/patembed_artifacts/patembed_release_bundle/models_for_release/patembed-large",
+   "specifications": {
+     "params": "344M",
+     "layers": 24,
+     "hidden_size": 1024,
+     "embedding_dim": 1024,
+     "base_model": "bert-for-patents (344M params, domain-pretrained on patent corpora)",
+     "max_seq_length": 512,
+     "description": "Flagship encoder initialized from Bert-for-Patents with 24-layer transformer architecture."
+   }
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
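modules.json declares the two-stage sentence-transformers pipeline: module 0 is the Transformer encoder at the repository root and module 1 is the mean-pooling module stored under `1_Pooling`. For illustration, a hedged sketch of assembling the equivalent pipeline by hand (normally `SentenceTransformer('datalyes/patembed-large')` does this for you):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: Transformer encoder; Module 1: mean pooling (see 1_Pooling/config.json)
transformer = models.Transformer("datalyes/patembed-large", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[transformer, pooling])
```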
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff