一、struct模块
struct模块可以用来处理二进制,把整数、浮点数、字符串等生成字节流,或者由字节流生成浮点数、字符串等。
常用函数
1
2
3
4
5
6
7
8
9
struct.pack(format, v1, v2, ...)
#Return a bytes object containing the values v1, v2, … packed according to the format string format. The arguments must match the values required by the format exactly.
#按照format格式,从v1、v2、...生成bytes,返回bytes
struct.unpack(format, buffer)
#Unpack from the buffer buffer (presumably packed by pack(format, ...)) according to the format string format. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().
#按照format,从bytes生成tuple,返回tuple
struct.calcsize(format)
#Return the size of the struct (and hence of the bytes object produced by pack(format, ...)) corresponding to the format string format.
#返回format格式对应的字节数目
format格式
1
2
[order][format]
#order可选,定义字节对齐方式
Character | Byte order | Size | Alignment |
---|---|---|---|
@ | native | native | native |
= | native | standard | none |
< | little-endian | standard | none |
> | big-endian | standard | none |
! | network (= big-endian) | standard | none |
[order]:可选,用于表示字节对齐方式。
Format | C Type | Python type | Standard size |
---|---|---|---|
x | pad byte | no value | |
c | char | bytes of length 1 | 1 |
b | signed char | integer | 1 |
B | unsigned char | integer | 1 |
? | _Bool | bool | 1 |
h | short | integer | 2 |
H | unsigned short | integer | 2 |
i | int | integer | 4 |
I | unsigned int | integer | 4 |
l | long | integer | 4 |
L | unsigned long | integer | 4 |
q | long long | integer | 8 |
Q | unsigned long long | integer | 8 |
n | ssize_t | integer | |
N | size_t | integer | |
e | float | 2 | |
f | float | float | 4 |
d | double | float | 8 |
s | char[] | bytes | |
p | char[] | bytes | |
P | void * | integer |
[format]:数字+格式,数字表示该格式的字节数目。
示例
1
2
3
4
5
6
7
8
9
10
11
12
13
14
import struct
a = 1
b = 0.0
c = 'hello'
print(struct.calcsize('>if5s'))
#13:>表示大端,i表示4字节integer,f表示4字节float,5s表示5字节bytes
s_bytes = struct.pack('>if5s', a, b, c.encode('utf-8'))
print(s_bytes)
#b'\x00\x00\x00\x01\x00\x00\x00\x00hello'
s_tuple = struct.unpack('>if5s', s_bytes)
print(s_tuple)
#(1, 0.0, b'hello')
二、读取MNIST数据
MNIST是一个手写数字图像数据集,包含60000个训练图像和10000个测试图像,图像以字节形式存储在文件中。
train-images-idx3-ubyte.gz: training set images (9912422 bytes),训练图像集
train-labels-idx1-ubyte.gz: training set labels (28881 bytes),训练图像标签集
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes) ,测试图像集
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes),测试图像标签集
文件格式
offset:偏置,可以理解为起始字节。type:类型,数据占用的字节数。value:对应的值。
譬如,offset为0004,type为32 bit integer,value为60000,表示从文件的第0004个字节开始读4个字节得到的值应为60000。
[offset] | [type] | [value] | [description] |
---|---|---|---|
0000 | 32 bit integer | 0x00000801(2049) | magic number (MSB first) |
0004 | 32 bit integer | 60000 | number of items |
0008 | unsigned byte | ?? | label |
0009 | unsigned byte | ?? | label |
…… | |||
xxxx | unsigned byte | ?? | label |
TRAINING SET LABEL FILE (train-labels-idx1-ubyte), labels values are 0 to 9.
[offset] | [type] | [value] | [description] |
---|---|---|---|
0000 | 32 bit integer | 0x00000803(2051) | magic number |
0004 | 32 bit integer | 60000 | number of images |
0008 | 32 bit integer | 28 | number of rows |
00012 | 32 bit integer | 28 | number of columns |
0016 | unsigned byte | ?? | pixel |
0017 | unsigned byte | ?? | pixel |
…… | |||
xxxx | unsigned byte | ?? | pixel |
TRAINING SET IMAGE FILE (train-images-idx3-ubyte), pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
读取数据
load_mnist,从MNIST加载数据;show_zero2nine,显示数字0-9的图像;show_different_number,显示不同数字的图像。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import os
import struct
import numpy as np
import matplotlib.pyplot as plt
def load_mnist(path, kind='train'):
labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
with open(labels_path, 'rb') as lbpath:
# 读magic number和number of items,MSB
label_magic, label_num = struct.unpack('>2I', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.uint8)
with open(images_path, 'rb') as imgpath:
# 读magic number和number of images、rows、cols,MSB
image_magic, image_num, rows, cols = struct.unpack('>4I', imgpath.read(16))
images = np.fromfile(imgpath, dtype=np.uint8).reshape(image_num, rows*cols)
return label_magic, label_num, labels, image_magic, image_num, images
def show_zero2nine(images, labels):
fig, axes = plt.subplots(nrows=2, ncols=5)
axes = axes.flatten()
for ax, i in zip(axes, range(10)):
image = images[labels == i][0].reshape(28, 28)
ax.set_axis_off()
ax.imshow(image, cmap='Greys')
plt.show()
def show_different_number(images, labels, number=0):
fig, axes = plt.subplots(nrows=5, ncols=5)
axes = axes.flatten()
for ax, i in zip(axes, range(25)):
image = images[labels == number][i].reshape(28, 28)
ax.set_axis_off()
ax.imshow(image, cmap='Greys')
plt.show()
if __name__ == '__main__':
data_path = '/Users/cy/Coding/git/mnist/data/'
number = 1
label_magic, label_num, labels, image_magic, image_num, images = load_mnist(data_path)
print('label_magic number: %d, label number: %d.' % (label_magic, label_num))
print('image magic number: %d, image number: %d.' % (image_magic, image_num))
show_zero2nine(images, labels)
show_different_number(images, labels, number)