Usage
API
General overview
SciAssist provides APIs that make it simple to use any provided model for inference on a variety of tasks. Each API automatically loads a default model capable of inference for its task. To run inference on a task:
Start by creating a task-specific parser. Taking reference string parsing as an example:
>>> from SciAssist import ReferenceStringParsing
>>> ref_parser = ReferenceStringParsing()
Pass your input string to the parser:
>>> ref_parser.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
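The tokens and tags lists in the result are parallel, so consecutive tokens that share a tag can be merged into fields. The following is a minimal sketch that relies only on the output structure shown above; group_fields is a hypothetical helper and not part of SciAssist.

```python
from itertools import groupby


def group_fields(result):
    """Merge consecutive tokens that share a tag into (tag, text) pairs."""
    pairs = zip(result["tags"], result["tokens"])
    return [
        (tag, " ".join(token for _, token in group))
        for tag, group in groupby(pairs, key=lambda pair: pair[0])
    ]


# A shortened result in the same shape as one element returned by predict().
example = {
    "tokens": ["Waleed", "Ammar,", "2017.", "The", "ai2", "system", "In", "ACL"],
    "tags": ["author", "author", "date", "title", "title", "title", "booktitle", "booktitle"],
}
fields = group_fields(example)
print(fields)
```

Because groupby only merges adjacent items, a tag that reappears later in the reference produces a second entry rather than being folded into the first.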
If you have more than one string, use predict() and pass your input as a list:
>>> ref_parser.predict(
... [
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... "Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval)."
... ],
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']},
{'tagged_text': '<author>Isabelle</author> <author>Augenstein,</author> <author>Mrinal</author> <author>Das,</author> <author>Sebastian</author> <author>Riedel,</author> <author>Lakshmi</author> <author>Vikraman,</author> <author>and</author> <author>Andrew</author> <author>D.</author> <author>McCallum.</author> <date>2017.</date> <title>Semeval</title> <title>2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>Extracting</title> <title>keyphrases</title> <title>and</title> <title>relations</title> <title>from</title> <title>scientific</title> <title>publications.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Isabelle', 'Augenstein,', 'Mrinal', 'Das,', 'Sebastian', 'Riedel,', 'Lakshmi', 'Vikraman,', 'and', 'Andrew', 'D.', 'McCallum.', '2017.', 'Semeval', '2017', 'task', '10', '(scienceie):', 'Extracting', 'keyphrases', 'and', 'relations', 'from', 'scientific', 'publications.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle', 'date']}]
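When a list is passed, the result holds one dictionary per input string, as in the output above. A small sketch of pulling a single field out of every result follows; extract_field is a hypothetical helper, and the sample data is shortened by hand rather than produced by the model.

```python
def extract_field(result, tag):
    """Join every token labelled `tag` into one string."""
    return " ".join(
        token
        for token, label in zip(result["tokens"], result["tags"])
        if label == tag
    )


# Two shortened results in the same shape as the list returned by predict().
results = [
    {
        "tokens": ["Waleed", "Ammar.", "2017.", "The", "ai2", "system."],
        "tags": ["author", "author", "date", "title", "title", "title"],
    },
    {
        "tokens": ["Isabelle", "Augenstein.", "2017.", "Semeval", "2017", "task", "10."],
        "tags": ["author", "author", "date", "title", "title", "title", "title"],
    },
]
titles = [extract_field(r, "title") for r in results]
print(titles)
```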
Any additional parameters for the specific task can also be included in predict(...).
For example, the reference string parsing task has a dehyphen parameter.
If you want to remove hyphens in the raw text, set dehyphen=True:
>>> ref_parser.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str",
... dehyphen=True
... )
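The example reference contains the author name "Bhagavat- ula", which was split by a line-break hyphen. As a rough illustration of what removing such hyphens means, the sketch below joins a fragment ending in a hyphen with the lowercase fragment that follows the space. This is an assumption about the intent of dehyphen, not SciAssist's actual implementation, and join_broken_words is a hypothetical helper.

```python
import re


def join_broken_words(text):
    """Join fragments such as 'Bhagavat- ula' into 'Bhagavatula'.

    Only a hyphen followed by whitespace and a lowercase letter is treated
    as a line-break hyphen, so 'end-to-end' and 'semeval-2017' are kept.
    """
    return re.sub(r"(\w)-\s+([a-z])", r"\1\2", text)


raw = "Chandra Bhagavat- ula, and Russell Power. semi-supervised end-to-end"
cleaned = join_broken_words(raw)
print(cleaned)
```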
Choose a model and tokenizer
You can choose a model you’d like to use for your task. All provided models are shown in Models.
For example, create a parser to summarize a document and specify a model and tokenizer:
>>> from SciAssist import Summarization
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
>>> summarizer = Summarization(model_name="bart-cnn-on-mup", tokenizer=tokenizer)
The task-specific parsers
Reference string parsing
- class SciAssist.ReferenceStringParsing(model_name='default', device='gpu', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='allenai/scibert_scivocab_uncased', model_max_length=512, os_name=None)
The pipeline for reference string parsing.
- Parameters
model_name (str, optional) – A string, the model name of a pretrained model provided for this task.
device (str, optional) – A string, cpu or gpu.
cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.
tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.
checkpoint (str or os.PathLike, optional) –
A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like allenai/scibert_scivocab_uncased.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the PreTrainedTokenizer.save_pretrained method, e.g., ./my_model_directory/.
A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)
model_max_length (int, optional) – The max sequence length the model accepts.
- predict(input, type='pdf', dehyphen=False, output_dir=None, temp_dir=None, save_results=True)
- Parameters
input (str or List[str] or os.PathLike) –
Can be either:
A string, the reference string to be parsed.
A list of strings to be parsed.
A path to a .txt file to be parsed. Each line of the source file contains a reference string.
A path to a .pdf file to be parsed, a raw scientific document without processing. The pipeline will automatically extract the reference strings from the pdf.
type (str, default to pdf) –
The type of input, can be either:
str or string.
text or txt for a .txt file.
pdf for a pdf file. This is the default value.
dehyphen (bool, default to False) – Whether to remove hyphens in raw text.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.
save_results (bool, default to True) – Whether to save the results in a .json file. Note: This option has no effect when type is set to str or string.
- Returns
[{"tagged_text": tagged_text, "tokens": tokens_list, "tags": tags_list}, …]
- Return type
List[Dict]
Examples
>>> from SciAssist import ReferenceStringParsing
>>> pipeline = ReferenceStringParsing()
>>> pipeline.predict(
... "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
... type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
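For the txt input type, each line of the source file holds one reference string. A minimal sketch of preparing such a file with the standard library follows; the file name references.txt and the temporary directory are arbitrary choices made for illustration.

```python
import tempfile
from pathlib import Path

references = [
    "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
    "Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval).",
]

# Write one reference string per line, as the txt input type expects.
txt_path = Path(tempfile.mkdtemp()) / "references.txt"
txt_path.write_text("\n".join(references) + "\n", encoding="utf-8")

# Read the file back to confirm the one-reference-per-line layout.
lines = txt_path.read_text(encoding="utf-8").splitlines()
print(len(lines))

# With SciAssist installed, the file could then be parsed with:
#   ReferenceStringParsing().predict(str(txt_path), type="txt")
```

With save_results left at its default, the predictions for a file input are also written to a .json file in output_dir.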
Single document summarization
- class SciAssist.Summarization(model_name='default', device='gpu', task_name='controlled-summarization', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='google/flan-t5-base', model_max_length=1024, max_source_length=1024, max_target_length=500, os_name=None)
The pipeline for single document summarization.
- Parameters
model_name (str, optional) – A string, the model name of a pretrained model provided for this task.
device (str, optional) – A string, cpu or gpu.
cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.
tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.
checkpoint (str or os.PathLike, optional) –
A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:
A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like facebook/bart-large-cnn.
A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the PreTrainedTokenizer.save_pretrained method, e.g., ./my_model_directory/.
A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)
model_max_length (int, optional) – The max sequence length the model accepts.
max_source_length (int, optional) – The max length of the input text.
max_target_length (int, optional) – The max length of the generated summary.
- predict(input, type='pdf', output_dir=None, temp_dir=None, num_beams=1, num_return_sequences=1, save_results=True, length=None, keywords=None)
- Parameters
input (str or List[str] or os.PathLike) –
Can be either:
A string, the text to be summarized.
A list of strings to be summarized.
A path to a .txt file to be summarized.
A path to a .pdf file to be summarized, a raw scientific document without processing. The pipeline will automatically extract the body text from the pdf.
type (str, default to pdf) –
The type of input, can be either:
str or string.
text or txt for a .txt file.
pdf for a pdf file. This is the default value.
output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.
temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.
num_beams (int, optional) – Number of beams for beam search. 1 means no beam search. num_beams should be divisible by num_return_sequences for group beam search.
num_return_sequences (int) – The number of independently computed returned sequences for each element in the batch.
save_results (bool, default to True) – Whether to save the results in a .json file. Note: This option has no effect when type is set to str or string.
- Returns
{"summary": [summary1, summary2, …], "raw_text": raw_text}
- Return type
Dict
Examples
>>> from SciAssist import Summarization
>>> pipeline = Summarization()
>>> res = pipeline.predict('N18-3011.pdf', type="pdf", num_beams=4, num_return_sequences=2)
>>> res["summary"]
['The paper proposes a method for extracting structured information from scientific documents into the literature graph. The paper describes the attributes associated with nodes and edges of different types in the graph, and describes how to extract the entities mentioned in paper text. The method is evaluated on three tasks: sequence labeling, entity linking and relation extraction. ',
'The paper proposes a method for extracting structured information from scientific documents into the literature graph. The paper describes the attributes associated with nodes and edges of different types in the graph, and describes how to extract the entities mentioned in paper text. The method is evaluated on three tasks: sequence labeling, entity linking and relation extraction. ']
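In the example above, both returned summaries are identical. The sketch below shows one way to check the documented num_beams constraint before calling predict() and to drop duplicate summaries afterwards. It relies only on the parameter description and return structure given in this section; check_beam_args and unique_summaries are hypothetical helpers, not part of SciAssist, and the sample result is made up.

```python
def check_beam_args(num_beams, num_return_sequences):
    """Raise ValueError if num_beams is not divisible by num_return_sequences."""
    if num_beams % num_return_sequences != 0:
        raise ValueError(
            f"num_beams ({num_beams}) should be divisible by "
            f"num_return_sequences ({num_return_sequences})"
        )


def unique_summaries(result):
    """Return the summaries without duplicates, keeping their original order."""
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(s.strip() for s in result["summary"]))


check_beam_args(num_beams=4, num_return_sequences=2)  # passes silently

# A made-up result in the documented return shape.
res = {
    "summary": ["First summary. ", "First summary. ", "Second summary."],
    "raw_text": "body text of the document",
}
deduplicated = unique_summaries(res)
print(deduplicated)
```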