Formatting data as a PyTorch Dataset object for fine-tuning BERT

Question 1

I am using existing code from Data Science to fine-tune a BERT model. The problem I am facing is in this part of the code, which tries to format our data as a PyTorch data.Dataset object:

class MeditationsDataset(torch.utils.data.Dataset):
    def _init_(self, encodings, *args, **kwargs):
        self.encodings = encodings
    def _getitem_(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def _len_(self):
        return len(self.encodings.input_ids)


dataset = MeditationsDataset(inputs)

When I run the code, I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-144-41fc3213bc25> in <module>()
----> 1 dataset = MeditationsDataset(inputs)

/usr/lib/python3.7/typing.py in __new__(cls, *args, **kwds)
    819             obj = super().__new__(cls)
    820         else:
--> 821             obj = super().__new__(cls, *args, **kwds)
    822         return obj
    823 

TypeError: object.__new__() takes exactly one argument (the type to instantiate)

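I have searched for this error, but unfortunately I am not familiar with either PyTorch or OOP, so I cannot fix it myself. Could you tell me what I should add to or remove from this code so that it runs? Thanks a lot in advance.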

In case it is needed, our data looks like this:

{'input_ids': tensor([[   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      ...,
                      [   2, 1021, 1005,  ...,    0,    0,    0],
                      [   2,  103, 1005,  ...,    0,    0,    0],
                      [   2,    4,    0,  ...,    0,    0,    0]]), 
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           ...,
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0],
                           [0, 0, 0,  ..., 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           ...,
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 1,  ..., 0, 0, 0],
                           [1, 1, 0,  ..., 0, 0, 0]]), 
 'labels': tensor([[   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   ...,
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2, 1021, 1005,  ...,    0,    0,    0],
                   [   2,    4,    0,  ...,    0,    0,    0]])}

Answer 1

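Special methods in Python use a double-underscore prefix and suffix. With single underscores, _init_ is just an ordinary method, so the class has no constructor that accepts inputs and the argument falls through to object.__new__(), which raises the TypeError. In your case, to implement data.Dataset you must define __init__, __getitem__, and __len__: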

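import torch

class MeditationsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, *args, **kwargs):
        # Store the encodings (input_ids, token_type_ids, attention_mask, labels).
        self.encodings = encodings
    def __getitem__(self, idx):
        # Return one example: row idx of every field.
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        # Key lookup also works when encodings is a plain dict.
        return len(self.encodings["input_ids"])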
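
To see the difference without installing PyTorch, here is a minimal sketch in plain Python. The class names BrokenDataset and WorkingDataset and the toy encodings are hypothetical, made up for illustration, and plain lists stand in for the real tensors. The wording of the TypeError may differ from the traceback in the question, because there the call goes through torch's Dataset base class, but the cause is the same: no __init__ accepts the argument.

```python
class BrokenDataset:
    # Single underscores: Python treats this as an ordinary method,
    # so nothing here acts as a constructor.
    def _init_(self, encodings):
        self.encodings = encodings


class WorkingDataset:
    # Double underscores: these are the special methods Python calls
    # for construction, indexing, and len().
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # One example: element idx of every field.
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])


# Hypothetical toy data; plain lists stand in for the real tensors.
toy = {
    "input_ids": [[2, 1021, 0], [2, 4, 0]],
    "attention_mask": [[1, 1, 0], [1, 1, 0]],
}

try:
    BrokenDataset(toy)  # the argument has nowhere to go
    broken_error = None
except TypeError as exc:
    broken_error = exc

dataset = WorkingDataset(toy)
print(type(broken_error).__name__)  # TypeError
print(len(dataset))                 # 2
print(dataset[1])                   # {'input_ids': [2, 4, 0], 'attention_mask': [1, 1, 0]}
```

The same three methods are what a DataLoader relies on when it asks the dataset for its length and for individual items.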
