This content originally appeared on DEV Community and was authored by Super Kai (Kazuya Ito)
*Memos:
- My post explains Transformer layer.
- My post explains RNN().
- My post explains LSTM().
- My post explains GRU().
- My post explains manual_seed().
- My post explains requires_grad.
Transformer() can get the 2D or 3D tensor of one or more elements computed by Transformer from the 2D or 3D tensor of one or more elements as shown below:
*Memos:
- The 1st argument for initialization is d_model (Optional-Default:512-Type:int): *Memos:
  - It must be 1 <= x.
  - It must be the same as the number of the elements of the deepest dimension of src and tgt.
  - It must be divisible by nhead.
- The 2nd argument for initialization is nhead (Optional-Default:8-Type:int). *It must be 1 <= x.
- The 3rd argument for initialization is num_encoder_layers (Optional-Default:6-Type:int). *It must be 1 <= x.
- The 4th argument for initialization is num_decoder_layers (Optional-Default:6-Type:int). *It must be 1 <= x.
- The 5th argument for initialization is dim_feedforward (Optional-Default:2048-Type:int): *Memos:
  - It must be 0 <= x.
  - 0 does nothing.
- The 6th argument for initialization is dropout (Optional-Default:0.1-Type:int or float). *It must be 0 <= x <= 1.
- The 7th argument for initialization is activation (Optional-Default:'relu'-Type:str or activation function): *Memos:
  - 'relu' or 'gelu' can be set for str.
  - An activation function can be directly set. *Not just ReLU() or GELU() but also LeakyReLU(), Sigmoid(), Softmax(), etc. can be set.
- The 8th argument for initialization is custom_encoder (Optional-Default:None-Type:transformer encoder). *TransformerEncoder() can be set.
- The 9th argument for initialization is custom_decoder (Optional-Default:None-Type:transformer decoder). *TransformerDecoder() can be set.
- The 10th argument for initialization is layer_norm_eps (Optional-Default:1e-05-Type:int or float).
- The 11th argument for initialization is batch_first (Optional-Default:False-Type:bool). *A sketch using batch_first=True is shown right after the argument list.
- The 12th argument for initialization is norm_first (Optional-Default:False-Type:bool).
- The 13th argument for initialization is bias (Optional-Default:True-Type:bool). *My post explains bias argument.
- The 14th argument for initialization is device (Optional-Default:None-Type:str, int or device()): *Memos:
  - If it's None, get_default_device() is used. *My post explains get_default_device() and set_default_device().
  - device= can be omitted.
  - My post explains device argument.
- The 15th argument for initialization is dtype (Optional-Default:None-Type:dtype): *Memos:
  - If it's None, get_default_dtype() is used. *My post explains get_default_dtype() and set_default_dtype().
  - dtype= can be omitted.
  - My post explains dtype argument.
- The 1st argument is src (Required-Type:tensor of float): *Memos:
  - It must be the 2D or 3D tensor of one or more elements.
  - Its D must be the same as tgt's.
  - The number of the elements of the deepest dimension must be the same as d_model and tgt's.
  - Its device and dtype must be the same as tgt's and Transformer()'s.
  - The output tensor's requires_grad is set to True by Transformer() even though the input tensor's requires_grad is False by default.
- The 2nd argument is tgt (Required-Type:tensor of float): *Memos:
  - It must be the 2D or 3D tensor of one or more elements.
  - Its D must be the same as src's.
  - The number of the elements of the deepest dimension must be the same as d_model and src's.
  - Its device and dtype must be the same as src's and Transformer()'s.
  - The output tensor's requires_grad is set to True by Transformer() even though the input tensor's requires_grad is False by default.
- The 3rd argument is src_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 2D or 3D tensor of one or more elements.
- The 4th argument is tgt_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 2D or 3D tensor of one or more elements.
- The 5th argument is memory_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 2D or 3D tensor of one or more elements.
- The 6th argument is src_key_padding_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 1D tensor of one or more elements. *A sketch using the key padding masks is shown after the code examples below.
- The 7th argument is tgt_key_padding_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 1D tensor of one or more elements.
- The 8th argument is memory_key_padding_mask (Optional-Default:None-Type:tensor of float or bool). *It must be the 1D tensor of one or more elements.
- The 9th argument is src_is_causal (Optional-Default:None-Type:bool).
- The 10th argument is tgt_is_causal (Optional-Default:None-Type:bool).
- The 11th argument is memory_is_causal (Optional-Default:False-Type:bool).
- The device and dtype (float) of src_mask, tgt_mask, memory_mask, src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask must be the same as Transformer()'s, d_model's, src's and tgt's.
- The dtype (bool) of src_mask, tgt_mask, memory_mask, src_key_padding_mask, tgt_key_padding_mask and memory_key_padding_mask must be the same.
- tran1.device and tran1.dtype don't work.
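As a quick illustration of the initialization arguments above, before the full examples below, here is a minimal sketch (not from the original examples; the shapes and values are assumptions chosen only for the demonstration) that sets batch_first=True and directly sets an activation function:

import torch
from torch import nn

torch.manual_seed(42)

# With batch_first=True, src and tgt are laid out as
# (batch size, sequence length, d_model) instead of
# (sequence length, batch size, d_model). An activation function
# such as GELU() can be set directly for activation.
tran = nn.Transformer(d_model=4, nhead=2, batch_first=True,
                      activation=nn.GELU())

src = torch.randn(1, 3, 4) # (batch size=1, source length=3, d_model=4)
tgt = torch.randn(1, 2, 4) # (batch size=1, target length=2, d_model=4)

tran(src=src, tgt=tgt).shape
# torch.Size([1, 2, 4])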
import torch
from torch import nn
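# 2D inputs with 4 elements in the deepest dimension, so d_model=4: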
tensor1 = torch.tensor([[8., -3., 0., 1.]])
tensor2 = torch.tensor([[5., 9., -4., 8.],
[-2., 7., 3., 6.]])
tensor1.requires_grad
tensor2.requires_grad
# False
torch.manual_seed(42)
tran1 = nn.Transformer(d_model=4, nhead=2)
tensor3 = tran1(src=tensor1, tgt=tensor2)
tensor3
# tensor([[1.5608, 0.1450, -0.6434, -1.0624],
# [0.8815, 1.0994, -1.1523, -0.8286]],
# grad_fn=<NativeLayerNormBackward0>)
tensor3.requires_grad
# True
tran1
# Transformer(
# (encoder): TransformerEncoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerEncoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
# (decoder): TransformerDecoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerDecoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (multihead_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm3): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# (dropout3): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
# )
tran1.encoder
# TransformerEncoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerEncoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
tran1.decoder
# TransformerDecoder(
# (layers): ModuleList(
# (0-5): 6 x TransformerDecoderLayer(
# (self_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (multihead_attn): MultiheadAttention(
# (out_proj): NonDynamicallyQuantizableLinear(
# in_features=4, out_features=4, bias=True
# )
# )
# (linear1): Linear(in_features=4, out_features=2048, bias=True)
# (dropout): Dropout(p=0.1, inplace=False)
# (linear2): Linear(in_features=2048, out_features=4, bias=True)
# (norm1): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm2): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (norm3): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# (dropout1): Dropout(p=0.1, inplace=False)
# (dropout2): Dropout(p=0.1, inplace=False)
# (dropout3): Dropout(p=0.1, inplace=False)
# )
# )
# (norm): LayerNorm((4,), eps=1e-05, elementwise_affine=True)
# )
tran1.d_model
# 4
tran1.nhead
# 2
tran1.batch_first
# False
torch.manual_seed(42)
tran2 = nn.Transformer(d_model=4, nhead=2)
tran2(src=tensor2, tgt=tensor3)
# tensor([[-0.8631, 1.6747, -0.6517, -0.1599],
# [-0.0919, 1.6377, -0.5336, -1.0122]],
# grad_fn=<NativeLayerNormBackward0>)
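# The call below spells out every remaining argument with its default value,
# so it returns the same result as tensor3 = tran1(src=tensor1, tgt=tensor2) above.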
torch.manual_seed(42)
tran = nn.Transformer(d_model=4, nhead=2, num_encoder_layers=6,
num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
activation='relu', custom_encoder=None, custom_decoder=None,
layer_norm_eps=1e-05, batch_first=False, norm_first=False,
bias=True, device=None, dtype=None)
tran(src=tensor1, tgt=tensor2, src_mask=None, tgt_mask=None,
memory_mask=None, src_key_padding_mask=None,
tgt_key_padding_mask=None, memory_key_padding_mask=None,
src_is_causal=None, tgt_is_causal=None, memory_is_causal=False)
# tensor([[1.5608, 0.1450, -0.6434, -1.0624],
# [0.8815, 1.0994, -1.1523, -0.8286]],
# grad_fn=<NativeLayerNormBackward0>)
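# 2D inputs with 2 elements in the deepest dimension, so d_model=2: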
tensor1 = torch.tensor([[8., -3.], [0., 1.]])
tensor2 = torch.tensor([[5., 9.], [-4., 8.],
[-2., 7.], [3., 6.]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=2, nhead=2)
tran(src=tensor1, tgt=tensor2)
# tensor([[1.0000, -1.0000],
# [-1.0000, 1.0000],
# [-1.0000, 1.0000],
# [-1.0000, 1.0000]], grad_fn=<NativeLayerNormBackward0>)
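# 2D inputs with 1 element in the deepest dimension, so d_model=1: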
tensor1 = torch.tensor([[8.], [-3.], [0.], [1.]])
tensor2 = torch.tensor([[5.], [9.], [-4.], [8.],
[-2.], [7.], [3.], [6.]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=1, nhead=1)
tran(src=tensor1, tgt=tensor2)
# tensor([[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]],
# grad_fn=<NativeLayerNormBackward0>)
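# 3D inputs with d_model=1; with the default batch_first=False the layout is
# (sequence length, batch size, d_model):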
tensor1 = torch.tensor([[[8.], [-3.], [0.], [1.]]])
tensor2 = torch.tensor([[[5.], [9.], [-4.], [8.]],
[[-2.], [7.], [3.], [6.]]])
torch.manual_seed(42)
tran = nn.Transformer(d_model=1, nhead=1)
tran(src=tensor1, tgt=tensor2)
# tensor([[[0.], [0.], [0.], [0.]],
# [[0.], [0.], [0.], [0.]]], grad_fn=<NativeLayerNormBackward0>)
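The key padding mask arguments are not used above, so here is a minimal sketch (not from the original examples; the shapes, values and mask contents are assumptions) of src_key_padding_mask and tgt_key_padding_mask with 2D tensors, where True marks a position to be ignored:

import torch
from torch import nn

torch.manual_seed(42)
tran = nn.Transformer(d_model=4, nhead=2)

src = torch.randn(3, 4) # (source length=3, d_model=4)
tgt = torch.randn(2, 4) # (target length=2, d_model=4)

# The key padding masks are 1D for 2D inputs; True means "ignore this position".
src_key_padding_mask = torch.tensor([False, False, True])
tgt_key_padding_mask = torch.tensor([False, False])

tran(src=src, tgt=tgt,
     src_key_padding_mask=src_key_padding_mask,
     tgt_key_padding_mask=tgt_key_padding_mask).shape
# torch.Size([2, 4])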
Transformer().generate_square_subsequent_mask() can get the 2D tensor of zero or more 0. (Default), 0.+0.j or False and -inf (Default), -inf+0.j or True as shown below. *A sketch that passes the generated mask to tgt_mask follows the examples.
*Memos:
- The 1st argument is sz (Required-Type:int). *It must be 0 <= x.
- The 2nd argument is device (Optional-Default:None-Type:str, int or device()): *Memos:
  - If it's None, cpu is set.
  - device= can be omitted.
  - My post explains device argument.
- The 3rd argument is dtype (Optional-Default:None-Type:dtype): *Memos:
  - If it's None, float32 is set.
  - dtype= can be omitted.
  - My post explains dtype argument.
import torch
from torch import nn
tran = nn.Transformer()
tran.generate_square_subsequent_mask(sz=3)
tran.generate_square_subsequent_mask(sz=3, device=None, dtype=None)
# tensor([[0., -inf, -inf],
# [0., 0., -inf],
# [0., 0., 0.]])
tran.generate_square_subsequent_mask(sz=5)
# tensor([[0., -inf, -inf, -inf, -inf],
# [0., 0., -inf, -inf, -inf],
# [0., 0., 0., -inf, -inf],
# [0., 0., 0., 0., -inf],
# [0., 0., 0., 0., 0.]])
tran.generate_square_subsequent_mask(sz=5, dtype=torch.complex64)
# tensor([[0.+0.j, -inf+0.j, -inf+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, -inf+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, -inf+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, -inf+0.j],
# [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]])
tran.generate_square_subsequent_mask(sz=5, dtype=torch.bool)
# tensor([[False, True, True, True, True],
# [False, False, True, True, True],
# [False, False, False, True, True],
# [False, False, False, False, True],
# [False, False, False, False, False]])
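The generated mask is typically passed to tgt_mask so that each target position only attends to itself and the earlier target positions. A minimal sketch (not from the original examples; the shapes and values are assumptions):

import torch
from torch import nn

torch.manual_seed(42)
tran = nn.Transformer(d_model=4, nhead=2)

src = torch.randn(3, 4) # (source length=3, d_model=4)
tgt = torch.randn(5, 4) # (target length=5, d_model=4)

tgt_mask = tran.generate_square_subsequent_mask(sz=5) # (5, 5) causal mask

tran(src=src, tgt=tgt, tgt_mask=tgt_mask).shape
# torch.Size([5, 4])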