datalyes committed
Commit e47925f · verified · 1 Parent(s): 6b28ed9

Upload PatentTEB model: patembed-large

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
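This pooling configuration selects plain mean pooling over token embeddings (no CLS, max, or last-token pooling). As a hedged sketch of what it corresponds to in code, these keys map onto the constructor arguments of `sentence_transformers.models.Pooling`; the released model loads this module automatically, so the explicit construction below is for illustration only:

```python
from sentence_transformers import models

# Mean pooling over 1024-dim token embeddings, mirroring 1_Pooling/config.json
pooling = models.Pooling(
    word_embedding_dimension=1024,
    pooling_mode_cls_token=False,
    pooling_mode_mean_tokens=True,
    pooling_mode_max_tokens=False,
)
```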
README.md ADDED
@@ -0,0 +1,168 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - patent
+ - embeddings
+ - mteb
+ language:
+ - en
+ pipeline_tag: sentence-similarity
+ ---
+
+ # patembed-large
+
+ This is a **sentence-transformers** model trained specifically for **patent text embeddings**. It is part of the **PatenTEB** project, which provides state-of-the-art models for patent document understanding and retrieval.
+
+ **Note:** This model uses task-specific instruction prompts during inference for optimal performance.
+
+ ## Model Details
+
+ - **Model Type**: Sentence Transformer
+ - **Base Architecture**: bert-for-patents (344M params, domain-pretrained on patent corpora)
+ - **Parameters**: 344M
+ - **Number of Layers**: 24
+ - **Hidden Size**: 1024
+ - **Embedding Dimension**: 1024
+ - **Max Sequence Length**: 512 tokens
+ - **Language**: English
+ - **License**: CC BY-NC-SA 4.0
+
+ ## Model Description
+
+ patembed-large is the flagship encoder of the family: a 24-layer transformer initialized from BERT-for-Patents.
+
+ This model is part of the **patembed family**, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
+
+ ## Usage
+
+ ### Using Sentence Transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ # Load the model
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # Encode patent texts
+ patent_texts = [
+     "A method for manufacturing semiconductor devices...",
+     "An apparatus for processing chemical compounds...",
+ ]
+ embeddings = model.encode(patent_texts)
+
+ # Compute similarity
+ similarity = util.cos_sim(embeddings[0], embeddings[1])
+ print(f"Similarity: {similarity.item():.4f}")
+ ```
+
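+ ### Using Task-Specific Prompts
+
+ The prompts mentioned in the note above are shipped in `config_sentence_transformers.json`, keyed by PatenTEB task name, with separate strings for queries and documents in asymmetric tasks. A minimal sketch of applying them (the prompt strings are copied from that file; prepending them manually keeps the example independent of the installed sentence-transformers version):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # "retrieval_IN" prompts from config_sentence_transformers.json
+ query_prompt = "encode query for same document retrieval: "
+ doc_prompt = "encode document for same retrieval: "
+
+ query_emb = model.encode(query_prompt + "Method for reducing power consumption in mobile devices")
+ doc_emb = model.encode(doc_prompt + "A power management system for portable electronic devices...")
+ print(f"Similarity: {util.cos_sim(query_emb, doc_emb).item():.4f}")
+ ```
+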
+ ### Using Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import torch.nn.functional as F
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-large')
+ model = AutoModel.from_pretrained('datalyes/patembed-large')
+
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+ # Tokenize and encode
+ texts = ["A method for manufacturing semiconductor devices..."]
+ encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+ with torch.no_grad():
+     model_output = model(**encoded)
+ embeddings = mean_pooling(model_output, encoded['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
+ ```
+
+ ### Patent Retrieval Example
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # Query patent
+ query = "Method for reducing power consumption in mobile devices"
+
+ # Candidate patents
+ candidates = [
+     "A power management system for portable electronic devices...",
+     "Chemical composition for battery manufacturing...",
+     "Method for wireless data transmission in mobile networks...",
+ ]
+
+ # Encode and retrieve
+ query_emb = model.encode(query)
+ candidate_embs = model.encode(candidates)
+
+ # Compute similarities
+ scores = util.cos_sim(query_emb, candidate_embs)[0]
+
+ # Get ranked results
+ results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
+ results.sort(key=lambda x: x[1], reverse=True)
+
+ for patent, score in results:
+     print(f"Score: {score:.4f} - {patent[:100]}...")
+ ```
+
+ ## Intended Use
+
+ This model is designed for patent-specific tasks including:
+ - Patent search and retrieval
+ - Prior art search
+ - Patent classification and clustering (see the sketch below)
+ - Technology landscape analysis
+
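+ As a rough illustration of the clustering use case (not part of the original evaluation setup), embeddings produced with the clustering prompt from `config_sentence_transformers.json` can be grouped with any off-the-shelf algorithm; the scikit-learn dependency below is an assumption about your environment:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ from sklearn.cluster import KMeans
+
+ model = SentenceTransformer('datalyes/patembed-large')
+
+ # "clusters_ext_full_ipc" prompt from config_sentence_transformers.json
+ prompt = "encode document for same ipc clustering: "
+ docs = [
+     "A method for manufacturing semiconductor devices...",
+     "An apparatus for processing chemical compounds...",
+     "Method for wireless data transmission in mobile networks...",
+ ]
+ embeddings = model.encode([prompt + d for d in docs])
+
+ # Toy example: group the documents into two clusters
+ labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
+ print(labels)
+ ```
+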
+ For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.
+
+ ## Citation
+
+ If you use this model, please cite our paper:
+
+ ```bibtex
+ @misc{ayaou2025patentebcomprehensivebenchmarkmodel,
+   title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding},
+   author={Iliass Ayaou and Denis Cavallucci},
+   year={2025},
+   eprint={2510.22264},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.22264}
+ }
+ ```
+
+ **Paper**: [PatenTEB on arXiv](https://arxiv.org/abs/2510.22264)
+
+ ## License
+
+ This model is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license.
+
+ **Key Terms:**
+ - ✅ You can use, share, and adapt the model
+ - ✅ You must give appropriate credit
+ - ❌ You may not use the model for commercial purposes
+ - ⚠️ If you adapt or build upon this model, you must distribute it under the same license
+
+ For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/
+
+ ## Contact
+
+ - **Authors**: Iliass Ayaou, Denis Cavallucci
+ - **Institution**: ICUBE Laboratory, INSA Strasbourg
+ - **GitHub**: [PatentTEB/PatentTEB](https://github.com/iliass-y/patenteb)
+ - **HuggingFace**: [datalyes](https://huggingface.co/datalyes)
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 39859
+ }
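This is the standard Hugging Face BERT configuration for the underlying encoder (24 layers, hidden size 1024, patent-specific vocabulary of 39,859 tokens). A quick, hedged sanity check of these values that fetches only the config, not the weights:

```python
from transformers import AutoConfig

# Downloads only config.json from the hub
config = AutoConfig.from_pretrained("datalyes/patembed-large")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)  # expected: 24 1024 39859
```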
config_sentence_transformers.json ADDED
@@ -0,0 +1,64 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.55.2",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "retrieval_IN": {
+       "q_text": "encode query for same document retrieval: ",
+       "pos_text": "encode document for same retrieval: "
+     },
+     "retrieval_OUT": {
+       "q_text": "encode query for different document retrieval: ",
+       "pos_text": "encode document for different retrieval: "
+     },
+     "retrieval_MIXED": {
+       "q_text": "encode query for mixed document retrieval: ",
+       "pos_text": "encode document for mixed retrieval: "
+     },
+     "retrieval_inventor": {
+       "q_text": "encode query for same inventor document retrieval: ",
+       "pos_text": "encode document for same inventor retrieval: "
+     },
+     "title2full": {
+       "title": "encode title query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "problem2full": {
+       "problem": "encode problem query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "effect2full": {
+       "effect": "encode effect query for document retrieval: ",
+       "full_text": "encode document for retrieval: "
+     },
+     "effect2substance": {
+       "effect": "encode effect query for substance retrieval: ",
+       "substance": "encode substance for retrieval: "
+     },
+     "problem2solution": {
+       "problem": "encode problem query for solution retrieval: ",
+       "solution": "encode solution for retrieval: "
+     },
+     "para_problem": {
+       "text1": "encode problem for problem paraphrase: ",
+       "text2": "encode problem for problem paraphrase: "
+     },
+     "para_solution": {
+       "text1": "encode solution for solution paraphrase: ",
+       "text2": "encode solution for solution paraphrase: "
+     },
+     "class_text2ipc3": "encode document for ipc classification: ",
+     "class_bloom": "encode document for bloom prediction classification: ",
+     "class_nli_oldnew": {
+       "q_text": "encode citing document for pair classification: ",
+       "t_text": "encode cited document for pair classification: "
+     },
+     "clusters_ext_full_ipc": "encode document for same ipc clustering: ",
+     "clusters_inventor": "encode document for same inventors clustering: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
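The prompt table above pairs each PatenTEB task with its instruction string, with one string per side for asymmetric tasks (query vs. document). A hedged sketch of reading the table straight from the hub and applying it, using the standard `huggingface_hub` download helper; the chosen task and example texts are illustrative:

```python
import json

from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Fetch the prompt table shipped with the model
path = hf_hub_download("datalyes/patembed-large", "config_sentence_transformers.json")
with open(path) as f:
    prompts = json.load(f)["prompts"]

model = SentenceTransformer("datalyes/patembed-large")

# Asymmetric problem -> solution retrieval prompts
task = prompts["problem2solution"]
problem_emb = model.encode(task["problem"] + "Excessive heat buildup in compact electronic enclosures")
solution_emb = model.encode(task["solution"] + "A phase-change material layer bonded to the housing...")
```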
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:234ea36a876fe5d5c416c1cbaad6f7221e17861fadd6481f0b96588fdc1ca482
+ size 1378856808
model_info.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "display_name": "patembed-large",
+   "source_folder": "train_v4/runs/bert-for-patents-20250821_085708-360417_asym_prompt_all_1e5_bs_32_ga4_bs_nd",
+   "source_path": "/media/iayaou01/Extreme SSD/patembed_artifacts/patembed_release_bundle/train_v4/runs/bert-for-patents-20250821_085708-360417_asym_prompt_all_1e5_bs_32_ga4_bs_nd",
+   "output_path": "/media/iayaou01/Extreme SSD/patembed_artifacts/patembed_release_bundle/models_for_release/patembed-large",
+   "specifications": {
+     "params": "344M",
+     "layers": 24,
+     "hidden_size": 1024,
+     "embedding_dim": 1024,
+     "base_model": "bert-for-patents (344M params, domain-pretrained on patent corpora)",
+     "max_seq_length": 512,
+     "description": "Flagship encoder initialized from Bert-for-Patents with 24-layer transformer architecture."
+   }
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
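modules.json declares the two-stage sentence-transformers pipeline: module 0 is the Transformer encoder at the repository root and module 1 is the mean-pooling module stored under `1_Pooling`. For illustration, a hedged sketch of assembling the equivalent pipeline by hand (normally `SentenceTransformer('datalyes/patembed-large')` does this for you):

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: Transformer encoder; Module 1: mean pooling (see 1_Pooling/config.json)
transformer = models.Transformer("datalyes/patembed-large", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[transformer, pooling])
```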
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff