Usage
API
General overview
SciAssist provides APIs that make it simple to run inference with any provided model on a range of tasks. Each API automatically loads a default model suited to the task. To run inference on a task:
Start by creating a task-specific parser. Taking reference string parsing as an example:
>>> from SciAssist import ReferenceStringParsing
>>> ref_parser = ReferenceStringParsing()
Pass your input string to the parser:
>>> ref_parser.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
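The returned tokens and tags are parallel lists, so consecutive tokens with the same tag can be merged into whole fields. The following is a minimal sketch of that post-processing step; group_fields is a hypothetical helper and not part of SciAssist, and the sample dictionary is a shortened stand-in with the same shape as the output above.

```python
from itertools import groupby


def group_fields(result):
    """Merge consecutive tokens that share a tag into (tag, text) pairs.

    `result` is one dictionary from the list returned by predict().
    Hypothetical helper for illustration; not part of SciAssist.
    """
    # zip() stops at the shorter list, so a stray trailing tag is ignored.
    pairs = zip(result["tags"], result["tokens"])
    return [
        (tag, " ".join(token for _, token in group))
        for tag, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# A shortened result in the same shape as the output shown above.
sample = {
    "tokens": ["Russell", "Power.", "2017.", "The", "ai2", "system", "In", "ACL"],
    "tags": ["author", "author", "date", "title", "title", "title",
             "booktitle", "booktitle"],
}

fields = group_fields(sample)
print(fields)
```

Because groupby only merges adjacent items, a tag that reappears later in the string produces a separate entry rather than being folded into the earlier one.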
If you have more than one string, use predict() and pass your input as a list:
>>> ref_parser.predict(
... [
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... "Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval)."
... ],
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']},
{'tagged_text': '<author>Isabelle</author> <author>Augenstein,</author> <author>Mrinal</author> <author>Das,</author> <author>Sebastian</author> <author>Riedel,</author> <author>Lakshmi</author> <author>Vikraman,</author> <author>and</author> <author>Andrew</author> <author>D.</author> <author>McCallum.</author> <date>2017.</date> <title>Semeval</title> <title>2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>Extracting</title> <title>keyphrases</title> <title>and</title> <title>relations</title> <title>from</title> <title>scientific</title> <title>publications.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Isabelle', 'Augenstein,', 'Mrinal', 'Das,', 'Sebastian', 'Riedel,', 'Lakshmi', 'Vikraman,', 'and', 'Andrew', 'D.', 'McCallum.', '2017.', 'Semeval', '2017', 'task', '10', '(scienceie):', 'Extracting', 'keyphrases', 'and', 'relations', 'from', 'scientific', 'publications.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle', 'date']}]
Any additional parameters for the specific task can also be included in predict(...).
For example, the reference string parsing task has a dehyphen parameter.
To remove hyphens from the raw text, set dehyphen=True:
>>> ref_parser.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str",
... dehyphen=True
... )
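The output of the dehyphen call is not shown here. As a rough illustration of the idea only, the sketch below joins words that were split by a hyphen followed by whitespace, as in "Bhagavat- ula". It is an assumption about what dehyphenation means in general, not SciAssist's implementation, and join_broken_hyphens is a hypothetical name.

```python
import re


def join_broken_hyphens(text):
    """Join words split as 'Bhagavat- ula' into 'Bhagavatula'.

    Illustrative only: the actual behaviour of dehyphen=True may differ.
    Hyphens with no following whitespace (e.g. 'semi-supervised') are kept.
    """
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)


raw = "Chandra Bhagavat- ula. semi-supervised end-to-end extraction."
cleaned = join_broken_hyphens(raw)
print(cleaned)
```

A rule this simple cannot tell a line-break hyphen from a genuine compound that happens to be followed by a space, which is one reason to prefer the library's own option over ad hoc cleaning.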
Choose a model and tokenizer
You can choose a model you’d like to use for your task. All provided models are shown in Models.
For example, create a parser to summarize a document and specify a model and tokenizer:
>>> from SciAssist import Summarization
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
>>> summarizer = Summarization(model_name="bart-cnn-on-mup", tokenizer=tokenizer)
The task-specific parsers
Reference string parsing
- class SciAssist.ReferenceStringParsing(model_name='default', device='gpu', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='allenai/scibert_scivocab_uncased', model_max_length=512, os_name=None)
The pipeline for reference string parsing.
- Parameters
model_name (str, optional) – A string, the model name of a pretrained model provided for this task.
device (str, optional) – A string, cpu or gpu.
cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.
tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.
checkpoint (str or os.PathLike, optional) –
A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like allenai/scibert_scivocab_uncased.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the PreTrainedTokenizer.save_pretrained method, e.g., ./my_model_directory/.
A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)
model_max_length (int, optional) – The max sequence length the model accepts.
- predict(input, type='pdf', dehyphen=False, output_dir=None, temp_dir=None, save_results=True)
- Parameters
input (str or List[str] or os.PathLike) –
Can be either:
A string, the reference string to be parsed.
A list of strings to be parsed.
A path to a .txt file to be parsed. Each line of the source file contains a reference string.
A path to a .pdf file to be parsed, a raw scientific document without processing. The pipeline will automatically extract the reference strings from the pdf.
type (str, default to pdf) –
The type of input, can be either:
str or string.
text or txt for a .txt file.
pdf for a pdf file. This is the default value.
dehyphen (bool, default to False) – Whether to remove hyphens in raw text.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.
save_results (bool, default to True) – Whether to save the results in a .json file. Note: This is invalid when type is set to str or string.
- Returns
[{"tagged_text": tagged_text, "tokens": tokens_list, "tags": tags_list}, …]
- Return type
List[Dict]
Examples
>>> from SciAssist import ReferenceStringParsing
>>> pipeline = ReferenceStringParsing()
>>> pipeline.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
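Since predict() defaults to type='pdf', a plain string passed without type="str" would be treated as a PDF path. One way to avoid that mistake is to derive the type value from the input before calling predict(). The sketch below assumes that file inputs can be recognized by their extension; infer_input_type is a hypothetical helper, not part of SciAssist.

```python
import os


def infer_input_type(value):
    """Pick a `type` argument for predict() from the input itself.

    Hypothetical helper: lists and plain text map to "str", while paths
    ending in .pdf or .txt map to "pdf" and "txt" respectively.
    """
    if isinstance(value, (list, tuple)):
        return "str"
    # os.fspath accepts both str and os.PathLike objects.
    name = os.fspath(value).lower()
    if name.endswith(".pdf"):
        return "pdf"
    if name.endswith(".txt"):
        return "txt"
    return "str"


reference = "Waleed Ammar. 2017. The ai2 system. In ACL workshop (SemEval)."
for candidate in ["paper.pdf", "references.txt", reference, [reference]]:
    print(infer_input_type(candidate))
```

The result could then be passed along as pipeline.predict(candidate, type=infer_input_type(candidate)).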
Controlled Summarization (CoCoSciSum)
- class SciAssist.Summarization(model_name='default', device='gpu', task_name='controlled-summarization', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='google/flan-t5-base', model_max_length=1024, max_source_length=1024, max_target_length=500, os_name=None)
The pipeline for single document summarization.
- Parameters
model_name (str, optional) – A string, the model name of a pretrained model provided for this task.
device (str, optional) – A string, cpu or gpu.
cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.
tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.
checkpoint (str or os.PathLike, optional) –
A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like facebook/bart-large-cnn.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the PreTrainedTokenizer.save_pretrained method, e.g., ./my_model_directory/.
A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)
model_max_length (int, optional) – The max sequence length the model accepts.
max_source_length (int, optional) – The max length of the input text.
max_target_length (int, optional) – The max length of the generated summary.
- predict(input, type='pdf', output_dir=None, temp_dir=None, num_beams=1, num_return_sequences=1, save_results=True, length=100, keywords=None)
- Parameters
input (str or List[str] or os.PathLike) –
Can be either:
A string, the text to be summarized.
A list of strings to be summarized.
A path to a .txt file to be summarized.
A path to a .pdf file to be summarized, a raw scientific document without processing. The pipeline will automatically extract the body text from the pdf.
type (str, default to pdf) –
The type of input, can be either:
str or string.
text or txt for a .txt file.
pdf for a pdf file. This is the default value.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.
num_beams (int, optional) – Number of beams for beam search. 1 means no beam search. num_beams should be divisible by num_return_sequences for group beam search.
num_return_sequences (int, optional) – The number of independently computed returned sequences for each element in the batch.
save_results (bool, default to True) – Whether to save the results in a .json file. Note: This is invalid when type is set to str or string.
length (int, default to 100) – The expected number of words in the summary. The value should be in [50, 100, 150, 200, 250] to ensure the controllability.
keywords (List[str], default to None) – The keywords you want to appear in the summary.
- Returns
{"summary": [summary1, summary2, …], "raw_text": raw_text}
- Return type
Dict
Examples
>>> from SciAssist import Summarization
>>> summarizer = Summarization()
>>> res = summarizer.predict('Bert_paper.pdf', type="pdf", length=50, keywords=["Cloze task"])
>>> res["summary"]
['This paper proposes a bidirectional pre-training method for language representations. The method is inspired by the Cloze task. The method is evaluated on a large suite of sentence-level and token-level tasks.']
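The length parameter is documented as working best with one of 50, 100, 150, 200 or 250, and num_beams as needing to be divisible by num_return_sequences. A small guard can enforce both before predict() is called. This is a sketch under those documented constraints only; nearest_supported_length and build_predict_kwargs are hypothetical helpers, not part of SciAssist.

```python
SUPPORTED_LENGTHS = (50, 100, 150, 200, 250)


def nearest_supported_length(words):
    """Return the supported length closest to the requested word count.

    Ties resolve to the smaller value because min() keeps the first match.
    """
    return min(SUPPORTED_LENGTHS, key=lambda n: abs(n - words))


def build_predict_kwargs(words, num_beams=1, num_return_sequences=1):
    """Assemble keyword arguments for predict() after basic checks."""
    if num_beams % num_return_sequences != 0:
        raise ValueError("num_beams should be divisible by num_return_sequences")
    return {
        "length": nearest_supported_length(words),
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
    }


kwargs = build_predict_kwargs(120, num_beams=4, num_return_sequences=2)
print(kwargs)
```

The resulting dictionary could be unpacked into the call, for example summarizer.predict('Bert_paper.pdf', type="pdf", **kwargs).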