Usage

API

General overview

SciAssist provides APIs that make it simple to run inference with any provided model on a range of tasks. Each API automatically loads a default model suited to its task. To run inference on a task:

  1. Start by creating a task-specific parser. Taking reference string parsing as example:

>>> from SciAssist import ReferenceStringParsing

>>> ref_parser = ReferenceStringParsing()

  2. Pass your input string to the parser:

>>> ref_parser.predict(
...     "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
...     type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
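
In the output above, `tokens` and `tags` are aligned by position, so consecutive tokens that share a tag can be merged into fields. The following is a minimal sketch under that assumption; `group_fields` is a hypothetical helper and not part of SciAssist, and the sample dictionary is a shortened excerpt of the result shown above.

```python
from itertools import groupby


def group_fields(result):
    """Merge consecutive tokens that share a tag into (tag, text) pairs."""
    pairs = zip(result["tags"], result["tokens"])
    return [
        (tag, " ".join(token for _, token in run))
        for tag, run in groupby(pairs, key=lambda pair: pair[0])
    ]


# Shortened excerpt in the same shape as one element of predict()'s output.
sample = {
    "tokens": ["Waleed", "Ammar,", "2017.", "The", "ai2", "system"],
    "tags": ["author", "author", "date", "title", "title", "title"],
}

for tag, text in group_fields(sample):
    print(f"{tag}: {text}")
```

Because grouping is by runs rather than by tag name, a tag that reappears later in the reference yields a separate field instead of being merged with the earlier one.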

To parse more than one string at once, pass your input to predict() as a list:

>>> ref_parser.predict(
...     [
...         "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
...         "Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and relations from scientific publications. In ACL workshop (SemEval)."
...     ],
...     type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']},
{'tagged_text': '<author>Isabelle</author> <author>Augenstein,</author> <author>Mrinal</author> <author>Das,</author> <author>Sebastian</author> <author>Riedel,</author> <author>Lakshmi</author> <author>Vikraman,</author> <author>and</author> <author>Andrew</author> <author>D.</author> <author>McCallum.</author> <date>2017.</date> <title>Semeval</title> <title>2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>Extracting</title> <title>keyphrases</title> <title>and</title> <title>relations</title> <title>from</title> <title>scientific</title> <title>publications.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>', 'tokens': ['Isabelle', 'Augenstein,', 'Mrinal', 'Das,', 'Sebastian', 'Riedel,', 'Lakshmi', 'Vikraman,', 'and', 'Andrew', 'D.', 'McCallum.', '2017.', 'Semeval', '2017', 'task', '10', '(scienceie):', 'Extracting', 'keyphrases', 'and', 'relations', 'from', 'scientific', 'publications.', 'In', 'ACL', 'workshop', '(SemEval).'], 'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle', 'date']}]

Any additional task-specific parameters can also be passed to predict(...). For example, the reference string parsing task has a dehyphen parameter. To remove hyphens from the raw text, set dehyphen=True:

>>> ref_parser.predict(
...     "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
...     type="str",
...     dehyphen=True
... )
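
This page does not show the output of the call above or how dehyphen works internally. Purely as an illustration of the line-break hyphenation found in the input ("Bhagavat- ula"), the sketch below joins such fragments with a regular expression. `join_hyphenated` is a hypothetical helper and should not be read as SciAssist's implementation.

```python
import re


def join_hyphenated(text: str) -> str:
    """Join fragments split as 'frag- ment' (a hyphen followed by whitespace)."""
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)


print(join_hyphenated("Chandra Bhagavat- ula, and Russell Power."))
print(join_hyphenated("semi-supervised end-to-end entity and relation extraction."))
```

Hyphens inside compound words such as "semi-supervised" are left alone because no whitespace follows them.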

Choose a model and tokenizer

You can choose a model you’d like to use for your task. All provided models are shown in Models.

For example, create a parser to summarize a document and specify a model and tokenizer:

>>> from SciAssist import Summarization
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
>>> summarizer = Summarization(model_name="bart-cnn-on-mup", tokenizer=tokenizer)

The task-specific parsers

Reference string parsing

class SciAssist.ReferenceStringParsing(model_name='default', device='gpu', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='allenai/scibert_scivocab_uncased', model_max_length=512, os_name=None)[source]

The pipeline for reference string parsing.

Parameters
  • model_name (str, optional) – A string, the model name of a pretrained model provided for this task.

  • device (str, optional) – A string, cpu or gpu.

  • cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.

  • output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.

  • temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.

  • tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.

  • checkpoint (str or os.PathLike, optional) –

    A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:

    • A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like allenai/scibert_scivocab_uncased.

    • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the [~PreTrainedTokenizer.save_pretrained] method, e.g., ./my_model_directory/.

    • A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)

  • model_max_length (int, optional) – The max sequence length the model accepts.
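
The device parameter takes the string cpu or gpu. Whether the pipeline falls back to the CPU on its own when no GPU is present is not stated here, so one option is to choose the value explicitly. The sketch below assumes PyTorch is the backend; `pick_device` is a hypothetical helper, not part of SciAssist.

```python
import importlib.util


def pick_device() -> str:
    """Return 'gpu' if PyTorch is installed and sees a CUDA device, else 'cpu'."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch  # imported lazily so the sketch also runs without PyTorch

    return "gpu" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# With SciAssist installed, the value could be passed as:
# ReferenceStringParsing(device=device)
```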

predict(input, type='pdf', dehyphen=False, output_dir=None, temp_dir=None, save_results=True)[source]
Parameters
  • input (str or List[str] or os.PathLike) –

    Can be either:

    • A string, the reference string to be parsed.

    • A list of strings to be parsed.

    • A path to a .txt file to be parsed. Each line of the source file contains a reference string.

    • A path to a .pdf file to be parsed, a raw scientific document without processing. The pipeline will automatically extract the reference strings from the pdf.

  • type (str, default to pdf) –

    The type of input, can be either:

    • str or string.

    • text or txt for a .txt file.

    • pdf for a pdf file. This is the default value.

  • dehyphen (bool, default to False) – Whether to remove hyphens in raw text.

  • output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.

  • temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.

  • save_results (bool, default to True) – Whether to save the results in a .json file. Note: This option has no effect when type is set to str or string.
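
For the .txt input type, each line of the file must hold one reference string. The sketch below writes such a file from a list of strings; `write_reference_file` is a hypothetical helper, and the temporary directory and file name are arbitrary choices for illustration.

```python
import tempfile
from pathlib import Path


def write_reference_file(references, path):
    """Write one reference per line, collapsing any whitespace runs (including newlines)."""
    lines = [" ".join(ref.split()) for ref in references]
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


references = [
    "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. "
    "The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end "
    "entity and relation extraction. In ACL workshop (SemEval).",
    "Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew D. "
    "McCallum. 2017. Semeval 2017 task 10 (scienceie): Extracting keyphrases and "
    "relations from scientific publications. In ACL workshop (SemEval).",
]

refs_path = write_reference_file(references, Path(tempfile.mkdtemp()) / "references.txt")
print(refs_path)
# With SciAssist installed, the file could then be parsed with:
# ReferenceStringParsing().predict(str(refs_path), type="txt")
```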

Returns

[{"tagged_text": tagged_text, "tokens": tokens_list, "tags": tags_list}, …]

Return type

List[Dict]

Examples

>>> from SciAssist import ReferenceStringParsing
>>> pipeline = ReferenceStringParsing()
>>> pipeline.predict(
...     "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
...     type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
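
If only the `tagged_text` string has been kept, for example after storing it in a file, the token and tag lists can be recovered from it. The sketch below assumes the `<tag>token</tag>` layout shown in the example output and that tokens contain no angle brackets; `parse_tagged_text` is a hypothetical helper, not part of SciAssist.

```python
import re

# One <tag>token</tag> unit; the backreference requires matching open and close tags.
TAGGED_TOKEN = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_tagged_text(tagged_text):
    """Return (tokens, tags) recovered from a tagged_text string."""
    matches = TAGGED_TOKEN.findall(tagged_text)
    tokens = [token for _, token in matches]
    tags = [tag for tag, _ in matches]
    return tokens, tags


# Shortened excerpt of the tagged_text shown above.
sample = "<author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title>"
tokens, tags = parse_tagged_text(sample)
print(tokens)
print(tags)
```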

Controlled Summarization (CoCoSciSum)

class SciAssist.Summarization(model_name='default', device='gpu', task_name='controlled-summarization', cache_dir=None, output_dir=None, temp_dir=None, tokenizer=None, checkpoint='google/flan-t5-base', model_max_length=1024, max_source_length=1024, max_target_length=500, os_name=None)[source]

The pipeline for single document summarization.

Parameters
  • model_name (str, optional) – A string, the model name of a pretrained model provided for this task.

  • device (str, optional) – A string, cpu or gpu.

  • cache_dir (str or os.PathLike, optional) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.

  • output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored.

  • temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml.

  • tokenizer (PreTrainedTokenizer, optional) – A specific tokenizer.

  • checkpoint (str or os.PathLike, optional) –

    A checkpoint for the tokenizer. You can also specify the checkpoint while using the default tokenizer. Can be either:

    • A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like facebook/bart-large-cnn.

    • A path to a directory containing vocabulary files required by the tokenizer, for instance saved using the [~PreTrainedTokenizer.save_pretrained] method, e.g., ./my_model_directory/.

    • A path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (like Bert or XLNet), e.g.: ./my_model_directory/vocab.txt. (Not applicable to all derived classes)

  • model_max_length (int, optional) – The max sequence length the model accepts.

  • max_source_length (int, optional) – The max length of the input text.

  • max_target_length (int, optional) – The max length of the generated summary.

predict(input, type='pdf', output_dir=None, temp_dir=None, num_beams=1, num_return_sequences=1, save_results=True, length=100, keywords=None)[source]
Parameters
  • input (str or List[str] or os.PathLike) –

    Can be either:

    • A string, the text to be summarized.

    • A list of strings to be summarized.

    • A path to a .txt file to be summarized.

    • A path to a .pdf file to be summarized, a raw scientific document without processing. The pipeline will automatically extract the body text from the pdf.

  • type (str, default to pdf) –

    The type of input, can be either:

    • str or string.

    • text or txt for a .txt file.

    • pdf for a pdf file. This is the default value.

  • output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.

  • temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.

  • num_beams (int, optional) – Number of beams for beam search. 1 means no beam search. num_beams should be divisible by num_return_sequences for group beam search.

  • num_return_sequences (int, optional) – The number of independently computed returned sequences for each element in the batch.

  • save_results (bool, default to True) – Whether to save the results in a .json file. Note: This option has no effect when type is set to str or string.

  • length (int, default to 100) – The expected number of words in the summary. The value should be in [50, 100, 150, 200, 250] to ensure the controllability.

  • keywords (List[str], default to None) – The keywords you want to appear in the summary.
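
Since length is expected to be one of 50, 100, 150, 200 or 250, a caller holding an arbitrary word budget may want to map it to the closest supported value before calling predict(). A minimal sketch follows; `nearest_supported_length` is a hypothetical helper, and breaking ties toward the shorter length is an arbitrary choice.

```python
SUPPORTED_LENGTHS = (50, 100, 150, 200, 250)


def nearest_supported_length(requested: int) -> int:
    """Return the supported length closest to `requested`; ties go to the smaller value."""
    return min(SUPPORTED_LENGTHS, key=lambda value: (abs(value - requested), value))


for budget in (30, 75, 120, 1000):
    print(budget, "->", nearest_supported_length(budget))
# With SciAssist installed, the result could be passed as:
# summarizer.predict("Bert_paper.pdf", type="pdf", length=nearest_supported_length(120))
```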

Returns

{"summary": [summary1, summary2, …], "raw_text": raw_text}

Return type

Dict

Examples

>>> from SciAssist import Summarization
>>> summarizer = Summarization()
>>> res = summarizer.predict('Bert_paper.pdf', type="pdf", length=50, keywords=["Cloze task"])
>>> res["summary"]
['This paper proposes a bidirectional pre-training method for language representations. The method is inspired by the Cloze task. The method is evaluated on a large suite of sentence-level and token-level tasks.']
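
The keywords argument states which terms you want in the summary, so it can be worth checking afterwards whether they actually appear. The sketch below does a case-insensitive substring check on the summary shown above; `missing_keywords` is a hypothetical helper, not part of SciAssist.

```python
def missing_keywords(summary, keywords):
    """Return the keywords that do not occur in the summary (case-insensitive)."""
    lowered = summary.lower()
    return [keyword for keyword in keywords if keyword.lower() not in lowered]


# The summary list copied from the example output above.
summaries = [
    "This paper proposes a bidirectional pre-training method for language "
    "representations. The method is inspired by the Cloze task. The method is "
    "evaluated on a large suite of sentence-level and token-level tasks."
]

for summary in summaries:
    print(missing_keywords(summary, ["Cloze task"]))
```

A substring check is deliberately simple and will not catch inflected or paraphrased forms of a keyword.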