Contributing Guide

To contribute a new task, follow this step-by-step guide:

  1. Fork the SciAssist repository

  2. Install Grobid Server for pdf processing

  3. Train a model on a new task

  4. Build a pipeline for the new task

1. Fork the SciAssist repository

First, make a fork of the SciAssist repository. Clone the forked repository and install the package dependencies in requirements.txt into your virtual environment.

# clone project
git clone https://github.com/WING-NUS/SciAssist
cd SciAssist

# [OPTIONAL] create conda environment
conda create -n myenv python=3.8
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

2. Install Grobid Server for pdf processing

The current doc2json tool uses Grobid to first process each PDF into XML, and then extracts paper components from the XML. You will need Java installed on your machine. You can then install your own copy of Grobid and get it running. To set up Grobid conveniently, use the CLIs provided by SciAssist.

# setup CLIs
python setup.py develop

# setup Grobid
setup_grobid

# run Grobid
run_grobid
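
Before processing PDFs, it can help to confirm that the server is actually reachable. The following is a minimal sketch, assuming the server started by run_grobid listens on Grobid's default address http://localhost:8070 and exposes Grobid's /api/isalive health check; the helper name grobid_is_alive is hypothetical and not part of SciAssist.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 3.0) -> bool:
    """Return True if a Grobid server answers its health-check endpoint.

    Assumption: the server uses Grobid's default port (8070).
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            # Grobid answers the health check with the plain-text body "true".
            return resp.status == 200 and resp.read().strip() == b"true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, etc.
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```

If this prints False, check that Java is installed and that run_grobid is still running in another terminal.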

3. Train a model on a new task

General overview

Generally, to train on a new task, you will need to:

  • Enter src/ directory.

  • Choose or create a datautils class in src/SciAssist/utils/data_utils.py.

  • Prepare a datamodule (LightningDataModule) in src/SciAssist/datamodules and create a datamodule config in src/SciAssist/configs/datamodule.

  • Prepare a model (LightningModule) in src/SciAssist/models and create a model config in src/SciAssist/configs/model.

  • Specify the configs and train your model in command line:

python SciAssist/train.py datamodule=dataconfig model=modelconfig

The config files are Hydra configs, so reading the Hydra documentation will help you understand them better.

Step-by-step recipe to train on a new task

The src/ directory is considered the root of SciAssist’s source code, so enter src/ before the next steps.

cd src

Create datautils

A new task usually needs some custom functions for data processing. Put them in a DataUtils class so that they can be reused in both training and inference. For example, this is the DataUtils class for a Seq2Seq task:

class SciAssist.utils.data_utils.DataUtilsForSeq2Seq(tokenizer=None, model_class=<class 'SciAssist.models.components.bart_summarization.BartForSummarization'>, checkpoint='facebook/bart-large-cnn', model_max_length=1024, max_source_length=1024, max_target_length=128)[source]
Parameters
  • tokenizer (PretrainedTokenizer, default to None) – The tokenizer for tokenization.

  • checkpoint (str) – The checkpoint from which the tokenizer is loaded.

  • model_max_length (int, optional) – The max sequence length the model accepts.

  • max_source_length (int, optional) – The max length of the input text.

  • max_target_length (int, optional) – The max length of the generated summary.

tokenize_and_align_labels(examples, inputs_column='text', labels_column='summary')[source]

Process the dataset for model input, for example, do tokenization and prepare label_ids.

Parameters
  • examples (Dataset) – { “text”: [s1, s2, …], “summary”: [l1, l2, …]}

  • inputs_column (str) – The name of the input column.

  • labels_column (str) – The name of the target column.

Returns

{“input_ids”: input_ids, “attention_mask”: attention_mask, “labels”: label_ids }

Return type

Dict
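
To make the input and output shapes concrete, here is a minimal sketch of the same batched contract. It assumes a hypothetical whitespace ToyTokenizer in place of a real pretrained tokenizer, and the default lengths are illustrative only; the actual SciAssist method is a class method and relies on the tokenizer loaded from the checkpoint.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a pretrained one."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {"<pad>": 0}

    def encode(self, text: str, max_length: int) -> List[int]:
        # Assign the next free id to unseen tokens, then truncate.
        ids = [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]
        return ids[:max_length]


def tokenize_and_align_labels(
    examples: Dict[str, List[str]],
    tokenizer: ToyTokenizer,
    inputs_column: str = "text",
    labels_column: str = "summary",
    max_source_length: int = 8,
    max_target_length: int = 4,
) -> Dict[str, List[List[int]]]:
    """Turn a batch of columns into model inputs, one list entry per example."""
    input_ids = [tokenizer.encode(t, max_source_length) for t in examples[inputs_column]]
    labels = [tokenizer.encode(t, max_target_length) for t in examples[labels_column]]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


if __name__ == "__main__":
    batch = {"text": ["a b c", "a d"], "summary": ["a", "d b"]}
    print(tokenize_and_align_labels(batch, ToyTokenizer()))
```

Because the function receives whole columns rather than single rows, it can be passed to a dataset's map call with batched=True, as the datamodule example later in this guide does.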

collator()[source]

The collating function.

Returns

A collating function.

For example, DataCollatorForSeq2Seq(…).

You can also write a custom collating function, but remember that collator() must return a function.

Return type

function
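
The "return a function" requirement can be met with a closure. The sketch below is a hand-written padding collator in plain Python, not the SciAssist or Hugging Face implementation; the parameter names pad_token_id and label_pad_id are assumptions, and -100 follows the common convention for label positions that the loss should ignore.

```python
from typing import Callable, Dict, List

Features = List[Dict[str, List[int]]]
Batch = Dict[str, List[List[int]]]


def collator(pad_token_id: int = 0, label_pad_id: int = -100) -> Callable[[Features], Batch]:
    """Return a collating function that pads every example to the batch maximum."""

    def collate(features: Features) -> Batch:
        max_len = max(len(f["input_ids"]) for f in features)
        max_lab = max(len(f["labels"]) for f in features)
        batch: Batch = {"input_ids": [], "attention_mask": [], "labels": []}
        for f in features:
            pad = max_len - len(f["input_ids"])
            batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
            batch["attention_mask"].append(f["attention_mask"] + [0] * pad)
            batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - len(f["labels"])))
        return batch

    return collate


if __name__ == "__main__":
    collate_fn = collator()
    print(collate_fn([
        {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1]},
        {"input_ids": [1, 4], "attention_mask": [1, 1], "labels": [4, 2]},
    ]))
```

The returned collate_fn is what a DataLoader expects for its collate_fn argument; a real collator would also convert the padded lists to tensors.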

postprocess(preds, labels)[source]

Process model’s outputs and get the final results rather than simple ids.

Parameters
  • preds (Tensor) – Prediction labels, the output of the model.

  • labels (Tensor) – True labels

Returns

decoded_preds, decoded_labels

Return type

(LongTensor, LongTensor)

get_dataloader(dataset, inputs_column='text', labels_column='summary')[source]

Generate DataLoader for a dataset.

Parameters
  • dataset (Dataset) – The raw dataset.

  • inputs_column (str) – Column name of the inputs.

  • labels_column (str) – Column name of the labels.

Returns

A dataloader for the dataset. Will be used for inference.

Return type

DataLoader

Prepare your datamodule

Basically, you can create a DataModule in datamodules/ to prepare your dataloader. For example, we have cora_datamodule.py for Cora dataset. In datamodules/components, you can save some fixed properties such as the label set.

A DataModule standardizes the training, val, test splits, data preparation and transforms. A DataModule looks like this:

from typing import Optional

import datasets
from datasets import Dataset
from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader

from SciAssist.datamodules.components.cora_label import label2id
from SciAssist.utils.data_utils import DataUtilsForTokenClassification

class MyDataModule(LightningDataModule):

    def __init__(self, data_repo: str = "myvision/cora-dataset-final", data_utils=DataUtilsForTokenClassification):
        super().__init__()

        # use parameters by self.hparams.xx
        self.save_hyperparameters(logger=False)

        self.data_utils = self.hparams.data_utils
        self.data_train: Optional[Dataset] = None
        self.data_val: Optional[Dataset] = None
        self.data_test: Optional[Dataset] = None

    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        raw_datasets = datasets.load_dataset(
            self.hparams.data_repo,
        )
        return raw_datasets


    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        if not self.data_train and not self.data_val and not self.data_test:
            processed_datasets = self.prepare_data()
            tokenized_datasets = processed_datasets.map(
                lambda x: self.data_utils.tokenize_and_align_labels(x, label2id),
                batched=True,
                remove_columns=processed_datasets["train"].column_names,
                load_from_cache_file=True
            )
            self.data_train = tokenized_datasets["train"]
            self.data_val = tokenized_datasets["val"]
            self.data_test = tokenized_datasets["test"]

    def train_dataloader(self):
        # wrap the split prepared in setup(); add batch_size, collate_fn, etc. as needed
        return DataLoader(self.data_train)

    def val_dataloader(self):
        return DataLoader(self.data_val)

    def test_dataloader(self):
        return DataLoader(self.data_test)

    # def teardown(self, stage):
        # clean up after fit or test
        # called on every process in DDP

These methods are hooks that Lightning calls for you, so you only need to fill them in as your task requires.
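
As an illustration of what "hook" means here, the following sketch mimics the order in which a trainer calls the datamodule methods when fitting: prepare_data, then setup, then the dataloader methods. ToyDataModule and run_fit_hooks are hypothetical names used only for this example, and the real Trainer does considerably more (distributed setup, validation loaders, teardown).

```python
from typing import List, Optional


class ToyDataModule:
    """Hypothetical stand-in that only records which hooks were called."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.data_train: Optional[List[int]] = None

    def prepare_data(self) -> None:
        self.calls.append("prepare_data")

    def setup(self, stage: str) -> None:
        self.calls.append(f"setup({stage})")
        self.data_train = [1, 2, 3]  # splits are assigned here

    def train_dataloader(self) -> List[int]:
        self.calls.append("train_dataloader")
        return self.data_train or []


def run_fit_hooks(dm: ToyDataModule) -> List[str]:
    """Call the hooks in the order used when fitting; return the call log."""
    dm.prepare_data()
    dm.setup("fit")
    dm.train_dataloader()
    return dm.calls


if __name__ == "__main__":
    print(run_fit_hooks(ToyDataModule()))
```

The point of the sketch is that you never call these methods yourself during training; you only define what each one does.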

Then, create a .yaml in configs/datamodule to instantiate your datamodule. A data config file looks like this:

# The target class of the following configs
_target_: SciAssist.datamodules.my_datamodule.MyDataModule

# Pass constructor parameters to the target class
data_repo: "myvision/cora-dataset-final"
data_utils:
  _target_: SciAssist.utils.data_utils.DataUtilsForTokenClassification

For more details about DataModule, refer to the LightningDataModule documentation.

Prepare your model

All the components of a model should be included in SciAssist/models/components, including model structure, tokenizers and so on.

Next, define the training, validation and test logic for your model in a LightningModule. Like a LightningDataModule, a LightningModule provides hook functions that simplify the procedure. For example:

from typing import Any

import torch
from pytorch_lightning import LightningModule
from torchmetrics.classification.accuracy import Accuracy


from SciAssist.datamodules.components.cora_label import num_labels, LABEL_NAMES
from SciAssist.models.components.bert_token_classifier import BertForTokenClassifier
from SciAssist.utils.data_utils import DataUtilsForTokenClassification


class LitModel(LightningModule):
    def __init__(self, model: BertForTokenClassifier,
                data_utils: DataUtilsForTokenClassification,
                lr: float = 2e-5):

        # Define computations here
        # You can easily use multiple components in `models/components`
        super().__init__()
        self.save_hyperparameters(logger=False)
        self.data_utils = data_utils # or self.hparams.data_utils
        self.model = model # or self.hparams.model

        # num_classes + 1 to account for the extra class used for padding
        self.val_acc = Accuracy(num_classes=num_labels+1, ignore_index=num_labels)
        self.test_acc = Accuracy(num_classes=num_labels+1, ignore_index=num_labels)


    def forward(self, inputs):
        # Use for inference only (separate from training_step)
        return self.hparams.model(**inputs)


    def step(self, batch):
        # shared by the training, validation and test steps
        inputs, labels = batch, batch["labels"]
        outputs = self.forward(inputs)
        loss = outputs.loss
        preds = outputs.logits.argmax(dim=-1)
        return loss, preds, labels


    def training_step(self, batch, batch_idx):
        # the complete training loop
        loss, preds, labels = self.step(batch)
        self.log("train/loss", loss, on_step=False, on_epoch=True, prog_bar=False)
        return {"loss": loss}

    def validation_step(self, batch: Any, batch_idx: int):
        # the complete validation loop
        # (post-process preds/labels here if your metrics need it)
        loss, preds, labels = self.step(batch)
        return {"loss": loss, "preds": preds, "labels": labels}

    def test_step(self, batch: Any, batch_idx: int):
        # the complete test loop
        loss, preds, labels = self.step(batch)
        return {"loss": loss, "preds": preds, "labels": labels}

    def configure_optimizers(self):
        # define optimizers and LR schedulers
        return torch.optim.AdamW(
            params=self.hparams.model.parameters(), lr=self.hparams.lr
        )

The LightningModule has many convenience methods; the ones shown here are the core ones. See the LightningModule documentation <https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html> for further information.
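
The accuracy metrics in the example reserve one extra class index for padding and pass it as ignore_index. The effect of that argument can be sketched in plain Python as follows; masked_accuracy is a hypothetical helper written for illustration and is not how torchmetrics computes the metric internally.

```python
from typing import Sequence


def masked_accuracy(preds: Sequence[int], labels: Sequence[int], ignore_index: int) -> float:
    """Accuracy over the positions whose label is not the padding class."""
    kept = [(p, l) for p, l in zip(preds, labels) if l != ignore_index]
    if not kept:
        return 0.0  # nothing to score; this sketch returns 0.0 by choice
    return sum(p == l for p, l in kept) / len(kept)


if __name__ == "__main__":
    num_labels = 13  # label ids 0..12, so id 13 is the extra padding class
    preds = [1, 2, 3, 0]
    labels = [1, 2, 0, num_labels]  # the last position is padding
    print(masked_accuracy(preds, labels, ignore_index=num_labels))
```

Without the mask, padded positions would count as errors (or as trivially correct predictions) and distort the reported accuracy for short sequences.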

Also, create a config file in configs/model:

# The target Class
_target_: SciAssist.models.cora_module.CoraLitModule
lr: 2e-5
data_utils:
    _target_: SciAssist.utils.data_utils.DataUtilsForTokenClassification


# Parameters can be nested
# When instantiating the LitModule, the following model will be automatically constructed.
model:
    _target_: SciAssist.models.components.bert_token_classifier.BertForTokenClassifier
    model_checkpoint: "allenai/scibert_scivocab_uncased"
    output_size: 13
    cache_dir: ${paths.root_dir}/.cache/
    save_name: ${model_name}
    model_dir: ${paths.model_dir}
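
To see how a _target_ key turns a config into an object, including nested entries such as model above, here is a simplified stand-in written with the standard library only. It is a sketch of the idea and not Hydra's implementation: hydra.utils.instantiate additionally handles interpolations such as ${paths.root_dir}, partial instantiation, and more. The standard-library classes in the demo config are placeholders chosen only so the example runs anywhere.

```python
import importlib
from typing import Any


def instantiate(config: Any) -> Any:
    """Recursively build objects from dicts that carry a `_target_` key."""
    if not isinstance(config, dict):
        return config  # plain values are passed through unchanged
    # Build nested objects first so they can be passed as constructor arguments.
    kwargs = {k: instantiate(v) for k, v in config.items() if k != "_target_"}
    if "_target_" not in config:
        return kwargs
    module_name, _, attr = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**kwargs)


if __name__ == "__main__":
    cfg = {
        "_target_": "types.SimpleNamespace",  # placeholder for a LitModule class
        "lr": 2e-5,
        "model": {  # nested target, built before the outer object
            "_target_": "fractions.Fraction",
            "numerator": 1,
            "denominator": 2,
        },
    }
    module = instantiate(cfg)
    print(module.lr, module.model)
```

In the model config above, the same mechanism means the BertForTokenClassifier is constructed first and then handed to the LitModule as its model argument.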

Create a Trainer and start training

Create a trainer

Note

SciAssist already provides a complete train_pipeline.py, so there’s no need to write a training pipeline yourself. After preparing the LightningDataModule and LightningModule, you can train the model with SciAssist/train.py as shown above. The rest of this part introduces the procedure in case you run into problems. If you are not interested, skip to the next part, Start training.

The last step before starting training is to prepare a trainer config:

_target_: pytorch_lightning.Trainer

accelerator: 'gpu'
devices: 1
min_epochs: 1
max_epochs: 5

# ckpt path
resume_from_checkpoint: null

You can then create a PyTorch Lightning Trainer to manage the whole training process:

import hydra
from omegaconf import DictConfig
from pytorch_lightning import (
    LightningDataModule,
    LightningModule,
    Trainer,
)

# To introduce hydra config files
@hydra.main(version_base="1.2", config_path="configs/", config_name="train.yaml")
def train(config: DictConfig):
    # Init datamodule
    datamodule: LightningDataModule = hydra.utils.instantiate(config.datamodule)

    # Init lightning model
    model: LightningModule = hydra.utils.instantiate(config.model)

    # Callbacks and logger passed to the Trainer (placeholders here;
    # e.g. add ModelCheckpoint or EarlyStopping to the list)
    callbacks = []
    logger = True  # True selects Lightning's default logger

    # Init Trainer
    trainer: Trainer = hydra.utils.instantiate(
        config.trainer, callbacks=callbacks, logger=logger, _convert_="partial"
    )

    # To train the model
    trainer.fit(model=model, datamodule=datamodule)


if __name__ == "__main__":
    train()

More details are provided in the Trainer documentation.

Start training

Finally, you can choose your config files and train your model with the command line:

python SciAssist/train.py trainer=gpu datamodule=dataconfig model=modelconfig

You can change other configs in this way too. For example:

Train model with default configuration:

# train on CPU
python train.py trainer=cpu

# train on GPU
python train.py trainer=gpu

Train model with chosen experiment configuration from configs/experiment/:

python train.py experiment=experiment_name.yaml

You can override any parameter from command line like this:

python train.py trainer.max_epochs=20 datamodule.batch_size=64
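
Conceptually, each dotted override addresses one leaf of the nested config. The sketch below shows that mapping for key=value pairs only; apply_overrides is a hypothetical helper, and Hydra's real override grammar is richer (for example, trainer=gpu selects a whole config group rather than setting a leaf, and values are typed more carefully).

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Interpret integers as int and leave everything else as a string."""
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply dotted `a.b=value` overrides to a nested dict in place."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # create missing levels
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    cfg = {"trainer": {"max_epochs": 5}, "datamodule": {"batch_size": 32}}
    print(apply_overrides(cfg, ["trainer.max_epochs=20", "datamodule.batch_size=64"]))
```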

To show the full stack trace for errors that occur during training or testing:

HYDRA_FULL_ERROR=1 python train.py

4. Build a pipeline for the new task

General overview

As SciAssist aims to serve users, you will need to write a pipeline that is easy to use. The pipelines are stored in SciAssist/pipelines. Generally, you need to:

  • Add the model configs in SciAssist/pipelines/__init__.py.

  • Create a task-specific pipeline class that inherits from the Pipeline class.

  • Implement the predict() function in the class.

  • Import the pipeline class in SciAssist/__init__.py.

Step-by-step recipe to add a pipeline

Add the model configs

After you have a new model, add its corresponding configs to the dict TASKS in SciAssist/pipelines/__init__.py.

  • model: A ModelClass in models/components.

  • model_dict_url: URL to download the model dict. Pipeline will load model weights from the .pt according to the URL.

  • data_utils: A DataUtils Class.

TASKS = {
    "new-task": {
        "new-model": {
            "model": ModelClass,
            "model_dict_url": url,
            "data_utils": DataUtilsForNewTask,
        },
        "default": {
            "model": ModelClass,
            "model_dict_url": url,
            "data_utils": DataUtilsForNewTask,
        },
    },
}

Create a task-specific pipeline class

A task-specific pipeline class should inherit from the Pipeline class, which loads a model according to task_name and model_name.

class SciAssist.pipelines.pipeline.Pipeline(task_name, model_name='default', device='gpu', cache_dir=None, output_dir=None, temp_dir=None)[source]
Parameters
  • task_name (str) – The task name, which is used to load model configs.

  • model_name (str, optional) – A string, the model name of a pretrained model provided for this task.

  • device (str, optional) – A string, cpu or gpu.

  • cache_dir (str or os.PathLike, optional, default to “~/.cache/sciassist”) – Path to a directory in which a downloaded pretrained model should be cached if the standard cache should not be used.

  • output_dir (str or os.PathLike, optional, default to “output/result” from current work directory) – Path to a directory in which the predicted results files should be stored.

  • temp_dir (str or os.PathLike, optional, default to “output/.temp” from current work directory) – Path to a directory which holds temporary files such as .tei.xml.
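
The lookup by task_name and model_name can be pictured as two dictionary accesses into TASKS. The following sketch is an assumption about the shape of that lookup, not the actual Pipeline code; resolve_model_config is a hypothetical helper, and the string values and example.com URLs are placeholders for the real classes and download links.

```python
from typing import Any, Dict

# Placeholder registry with the same nesting as TASKS: task -> model name -> config.
TASKS: Dict[str, Dict[str, Dict[str, Any]]] = {
    "new-task": {
        "new-model": {
            "model": "ModelClass",
            "model_dict_url": "https://example.com/new-model.pt",
            "data_utils": "DataUtilsForNewTask",
        },
        "default": {
            "model": "ModelClass",
            "model_dict_url": "https://example.com/default.pt",
            "data_utils": "DataUtilsForNewTask",
        },
    },
}


def resolve_model_config(task_name: str, model_name: str = "default") -> Dict[str, Any]:
    """Return the config registered for a task/model pair, with clear errors."""
    if task_name not in TASKS:
        raise ValueError(f"Unknown task {task_name!r}; known tasks: {sorted(TASKS)}")
    models = TASKS[task_name]
    if model_name not in models:
        raise ValueError(f"Unknown model {model_name!r} for task {task_name!r}; known models: {sorted(models)}")
    return models[model_name]


if __name__ == "__main__":
    print(resolve_model_config("new-task")["model_dict_url"])
```

This is why every task should register a "default" entry: it is the one used when a user constructs the pipeline without naming a model.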

In your new pipeline class, specify a default model_name to choose a model and instantiate a DataUtils.

from typing import Optional

from transformers import PreTrainedTokenizer

from SciAssist.pipelines.pipeline import Pipeline
from SciAssist.utils.pdf2text import process_pdf_file, get_reference

class MyPipeline(Pipeline):

    def __init__(
            self, model_name: Optional[str] = "new-model", device: Optional[str] = "gpu",
            cache_dir = None,
            output_dir = None,
            temp_dir = None,
            tokenizer: PreTrainedTokenizer = None,
            checkpoint="allenai/scibert_scivocab_uncased",
            model_max_length=512,
    ):

        # task_name must match a key of TASKS, and model_name a key under that task
        super().__init__(task_name="new-task", model_name=model_name, device=device,
                        cache_dir=cache_dir, output_dir=output_dir, temp_dir=temp_dir)

        # Instantiate a datautils
        self.data_utils = self.data_utils(
            tokenizer=tokenizer,
            checkpoint=checkpoint,
            model_max_length=model_max_length
        )

Implement the predict() function

Next, fill in the predict() function. This function is the pipeline’s entry point, so its input should be a path to a file, a string, or a list of strings, and it should return the results users expect. You can use self.model to get your model for inference.

import os

class MyPipeline(Pipeline):

    def predict(self, input, type="pdf", output_dir=None, temp_dir=None, save_results=True):
        # Get results from the input

        # Fall back to the directories configured on the pipeline
        if output_dir is None:
            output_dir = self.output_dir
        if temp_dir is None:
            temp_dir = self.temp_dir

        # Dispatch on the declared input type
        if type in ["str", "string"]:
            results = self._predict_for_string(example=input)
        elif type in ["txt", "text"]:
            results = self._predict_for_text(filename=input)
        elif type == "pdf":
            results = self._predict_for_pdf(filename=input)
        else:
            raise ValueError(f"Unsupported input type: {type}")

        # Save predicted results as a text file
        if save_results and type not in ["str", "string"]:
            os.makedirs(output_dir, exist_ok=True)
            output_file = os.path.basename(input)

            # The output filename defaults to the input filename with the task-name abbreviation appended.
            with open(os.path.join(output_dir, f"{output_file[:-4]}_rsp.txt"), "w") as output:
                for res in results:
                    output.write(res["tagged_text"] + "\n")


        return results
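
Note that the slice output_file[:-4] assumes a three-letter extension such as .pdf or .txt. If your task accepts other file types, a helper along the following lines may be more robust; result_path and its task_abbr parameter are hypothetical names used for illustration, not part of SciAssist.

```python
import os


def result_path(input_path: str, output_dir: str, task_abbr: str = "rsp") -> str:
    """Build `<output_dir>/<input stem>_<task_abbr>.txt` for any input extension."""
    stem, _ext = os.path.splitext(os.path.basename(input_path))
    return os.path.join(output_dir, f"{stem}_{task_abbr}.txt")


if __name__ == "__main__":
    print(result_path("papers/acl2017.pdf", "output/result"))
    print(result_path("refs.jsonl", "output/result"))
```

os.path.splitext removes only the final extension, so a name such as a.b.pdf keeps its a.b stem and a file without an extension is left unchanged.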

An example of a task-specific pipeline and the predict() function:

SciAssist.ReferenceStringParsing.predict(self, input, type='pdf', dehyphen=False, output_dir=None, temp_dir=None, save_results=True)
Parameters
  • input (str or List[str] or os.PathLike) –

    Can be either:

    • A string, the reference string to be parsed.

    • A list of strings to be parsed.

    • A path to a .txt file to be parsed. Each line of the source file contains a reference string.

    • A path to a .pdf file to be parsed, a raw scientific document without processing. The pipeline will automatically extract the reference strings from the pdf.

  • type (str, default to pdf) –

    The type of input, can be either:

    • str or string.

    • text or txt for a .txt file.

    • pdf for a pdf file. This is the default value.

  • dehyphen (bool, default to False) – Whether to remove hyphens in raw text.

  • output_dir (str or os.PathLike, optional) – Path to a directory in which the predicted results files should be stored. If not provided, it will use the output_dir set for the pipeline.

  • temp_dir (str or os.PathLike, optional) – Path to a directory which holds temporary files such as .tei.xml. If not provided, it will use the temp_dir set for the pipeline.

  • save_results (bool, default to True) – Whether to save the results in a .json file. Note: This has no effect when type is set to str or string.

Returns

[{“tagged_text”: tagged_text, “tokens”: tokens_list, “tags”: tags_list}, …]

Return type

List[Dict]

Examples

>>> from SciAssist import ReferenceStringParsing
>>> pipeline = ReferenceStringParsing()
>>> pipeline.predict(
...     "Waleed Ammar, Matthew E. Peters, Chandra Bhagavat- ula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In ACL workshop (SemEval).",
...     type="str"
... )
[{'tagged_text': '<author>Waleed</author> <author>Ammar,</author> <author>Matthew</author> <author>E.</author> <author>Peters,</author> <author>Chandra</author> <author>Bhagavat-</author> <author>ula,</author> <author>and</author> <author>Russell</author> <author>Power.</author> <date>2017.</date> <title>The</title> <title>ai2</title> <title>system</title> <title>at</title> <title>semeval-2017</title> <title>task</title> <title>10</title> <title>(scienceie):</title> <title>semi-supervised</title> <title>end-to-end</title> <title>entity</title> <title>and</title> <title>relation</title> <title>extraction.</title> <booktitle>In</booktitle> <booktitle>ACL</booktitle> <booktitle>workshop</booktitle> <booktitle>(SemEval).</booktitle>',
'tokens': ['Waleed', 'Ammar,', 'Matthew', 'E.', 'Peters,', 'Chandra', 'Bhagavat-', 'ula,', 'and', 'Russell', 'Power.', '2017.', 'The', 'ai2', 'system', 'at', 'semeval-2017', 'task', '10', '(scienceie):', 'semi-supervised', 'end-to-end', 'entity', 'and', 'relation', 'extraction.', 'In', 'ACL', 'workshop', '(SemEval).'],
'tags': ['author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'author', 'date', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'title', 'booktitle', 'booktitle', 'booktitle', 'booktitle']}]
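
Since tokens and tags are parallel lists, consecutive tokens with the same tag can be merged into fields. The sketch below shows one way a user might post-process a single result dict; group_fields is a hypothetical helper and not part of the SciAssist API, and the sample input is a shortened version of the output above.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def group_fields(result: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive tokens that share a tag into (tag, text) pairs."""
    pairs = zip(result["tags"], result["tokens"])
    return [
        (tag, " ".join(token for _, token in run))
        for tag, run in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    sample = {
        "tokens": ["Waleed", "Ammar,", "2017.", "The", "ai2", "system"],
        "tags": ["author", "author", "date", "title", "title", "title"],
    }
    for tag, text in group_fields(sample):
        print(f"{tag}: {text}")
```

Grouping by runs rather than by tag name keeps the original order and preserves repeated fields, for example two separate author spans in the same reference.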

Make the new pipeline easy to import

After you get a new pipeline, import it in SciAssist/__init__.py.

from SciAssist.pipelines.new_task import MyPipeline

Finally, users can import it directly from SciAssist.

from SciAssist import MyPipeline

pipeline = MyPipeline()
res = pipeline.predict(input)