How do I preprocess my MapDataset to fit my model's input?

I am using a MapDataset made up of labels stored as text and float vectors stored as strings. This is how I read the contents of my tfrecord:

import tensorflow as tf

def extract_data(tfrecord_ds):
    # Both features are stored as scalar byte strings.
    feature_description = {
        'classes_text': tf.io.FixedLenFeature((), tf.string),
        'data': tf.io.FixedLenFeature([], tf.string)
    }

    def _parse_data_function(example_proto):
        return tf.compat.v1.parse_single_example(example_proto, feature_description)

    parsed_dataset = tfrecord_ds.map(_parse_data_function)

    dataset = parsed_dataset.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
    return dataset

I want to convert label_text to an int according to a label.txt file, and convert the data string to a float vector.

I want to use this data to train a custom model like this:

my_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024,), dtype=tf.float32, name='input_embedding'),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(num_classes)
], name='audio_detector')

How can I process my MapDataset from (string, string) to (int, float_array) so that I can train my model?

Edit:

This is how I encode my data:

features = {}
features['classes_text'] = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[audio_data_generator.label.encode()]))
# Serialize the embedding as raw bytes; dtype and shape are not stored with them.
embedding_bytes = embedding.numpy().tobytes()
features['data'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[embedding_bytes]))
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())

Solution

It is easier to encode the embedding using tf.train.FloatList.

When writing to the tfrecords, use:

features = {
    'classes_text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.encode()])),
    'data': tf.train.Feature(float_list=tf.train.FloatList(value=embedding))
}
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())

When reading, specify the embedding size in tf.io.FixedLenFeature, for example:

embedding_size = 10
feature_description = {
    'classes_text': tf.io.FixedLenFeature((), tf.string),
    'data': tf.io.FixedLenFeature([embedding_size], tf.float32)
}

To convert label_text to an int, you can use tf.lookup.StaticVocabularyTable:

# Assuming label.txt contains a single label per line.
with open('label.txt', 'r') as fin:
    categories = [line.strip() for line in fin.readlines()]

init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(categories),
    values=tf.constant(list(range(len(categories))), dtype=tf.int64))
label_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

feature_description = {
    'classes_text': tf.io.FixedLenFeature((), tf.string),
    'data': tf.io.FixedLenFeature([embedding_size], tf.float32)
}

def _parse_data_function(example_proto):
    example = tf.compat.v1.parse_single_example(example_proto, feature_description)
    # Apply the label lookup.
    example['classes_text'] = label_table.lookup(example['classes_text'])
    return example

parsed_dataset = tfrecord_ds.map(_parse_data_function)
dataset = parsed_dataset.cache().shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)

Edit

If you want to keep the way you save the data, you can use np.frombuffer to convert the binary string back into a numpy vector. However, you have to wrap this code in tf.function and tf.py_function.
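The np.frombuffer route could look roughly like the sketch below. It is only an illustration under a few assumptions: the embedding was written as float32 (the dtype passed to np.frombuffer must match whatever tobytes() was called on), EMBEDDING_SIZE and the helper names are made up for the example, and a small in-memory dataset stands in for the parsed tfrecord. It also returns (embedding, label) pairs, since Keras expects tuples rather than a dict holding both.

```python
import numpy as np
import tensorflow as tf

EMBEDDING_SIZE = 4  # illustrative only; the model in the question expects 1024


def _decode_embedding(embedding_bytes):
    # Runs eagerly inside tf.py_function, so .numpy() returns the raw bytes.
    # Assumption: the bytes were produced from a float32 array.
    return np.frombuffer(embedding_bytes.numpy(), dtype=np.float32)


def tf_decode_embedding(embedding_bytes):
    vector = tf.py_function(_decode_embedding, inp=[embedding_bytes], Tout=tf.float32)
    # py_function drops static shape information; restore it for the model.
    vector.set_shape([EMBEDDING_SIZE])
    return vector


def to_training_pair(example):
    # Turn the parsed dict into the (features, label) tuple Keras expects.
    return tf_decode_embedding(example['data']), example['classes_text']


# Stand-in for the parsed dataset: labels already looked up as ints,
# embeddings still stored as raw float32 bytes.
vectors = np.array([[0.5, -1.25, 3.0, 7.5],
                    [1.0, 2.0, 3.0, 4.0]], dtype=np.float32)
parsed = tf.data.Dataset.from_tensor_slices({
    'classes_text': tf.constant([2, 0], dtype=tf.int64),
    'data': tf.constant([v.tobytes() for v in vectors]),
})

pairs = parsed.map(to_training_pair)
decoded = np.stack([x.numpy() for x, _ in pairs])
labels = [int(y.numpy()) for _, y in pairs]
```

With a real tfrecord, to_training_pair would be mapped over the parsed dataset before cache(), shuffle() and batch(). If staying inside the graph matters, tf.io.decode_raw(example['data'], tf.float32) should decode the same bytes without needing tf.py_function.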