Understanding the YoloV5 Annotation Format

By Tom O’Donnell (tkodonne@purdue.edu)

To train YoloV5, each image needs a matching annotation file. The image itself is left unchanged, and the annotation file is a plain-text file with one line per labeled object, which looks something like:

Image.txt contents:

5 0.209316 0.241667 0.116745 0.091667
4 0.715802 0.295833 0.073113 0.091667

Each line of the annotation file follows the format:

class x_center y_center width height

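Here, class is the integer class index, and the four box values are normalized to the dimensions of the input image, so they take values between 0 and 1: x_center and width are divided by the image width, and y_center and height are divided by the image height. For example, in a 416x416 image, the pixel (0, 3) would be normalized as (0, 0.007212), where 0.007212 is simply 3/416.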
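
To make the normalization concrete, here is a minimal sketch that parses one annotation line and converts it back to pixel corners. The function names parse_yolo_line and to_pixel_corners are hypothetical helpers written for this example (they are not part of YoloV5), and the 416x416 image size is an assumption carried over from the example above.

```python
def parse_yolo_line(line):
    """Split one annotation line into (class_id, x_center, y_center, width, height)."""
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    return class_id, x_center, y_center, width, height


def to_pixel_corners(box, img_w, img_h):
    """Convert a normalized box to pixel corners (x_min, y_min, x_max, y_max)."""
    _, x_center, y_center, width, height = box
    # x values scale with the image width, y values with the image height.
    x_min = (x_center - width / 2) * img_w
    y_min = (y_center - height / 2) * img_h
    x_max = (x_center + width / 2) * img_w
    y_max = (y_center + height / 2) * img_h
    return x_min, y_min, x_max, y_max


# First line of the sample annotation file; 416x416 is an assumed image size.
box = parse_yolo_line("5 0.209316 0.241667 0.116745 0.091667")
corners = to_pixel_corners(box, img_w=416, img_h=416)
print(box)
print(corners)
```

Multiplying back by the image size like this is a quick way to sanity-check an annotation file by drawing the boxes on the image.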

Using this knowledge, we can calculate how the annotations change when the image is translated, which is a very common procedure in data augmentation and diversification. For example, let’s say we want to shift the whole image right 15 pixels and down 25 pixels. First, note that the positive y direction is down, because image coordinates start at the top-left corner. This means that the content originally at the origin (0, 0) is now at (15, 25).

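To apply this translation to the labels of a 416x416 image, you’d add (15/416, 25/416), roughly (0.036058, 0.060096), to the x_center and y_center values on every line of the annotation file. The width and height values stay the same, since a translation does not resize the boxes.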
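
The label update described above can be sketched as follows. The helper name shift_annotation is hypothetical, and the sketch assumes the shifted box stays fully inside the image; boxes pushed past the border would need to be clipped or dropped, which is not handled here.

```python
def shift_annotation(line, shift_x_px, shift_y_px, img_w, img_h):
    """Return an annotation line with its box center moved by a pixel offset."""
    parts = line.split()
    class_id = parts[0]
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    # Normalize the pixel offsets before adding them to the normalized centers.
    x_center += shift_x_px / img_w
    y_center += shift_y_px / img_h
    # Width and height are unchanged by a pure translation.
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Shift right 15 pixels and down 25 pixels in an assumed 416x416 image.
original = "5 0.209316 0.241667 0.116745 0.091667"
shifted = shift_annotation(original, 15, 25, img_w=416, img_h=416)
print(shifted)
```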

To shift the image itself, I personally think it’s easiest to build a translation matrix and apply it via OpenCV:

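import numpy as np
trans_x, trans_y = 15, 25  # pixels to shift right and down
# 2x3 affine matrix: identity (no rotation or scaling) plus a translation column
trans_mat = np.float32([
    [1, 0, trans_x],
    [0, 1, trans_y]])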

You should now understand the annotation file format used in YoloV5 and know how to calculate simple translations for image augmentation and data diversification.