gst*_*low 5 java spring multithreading partitioning spring-batch
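I have read about partitioning in spring-batch and found an example that demonstrates it. The example reads persons from CSV files, does some processing, and inserts the data into a database. In this example 1 partition = 1 file, so the partitioner implementation looks like this: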
public class MultiResourcePartitioner implements Partitioner {

    private final Logger logger = LoggerFactory.getLogger(MultiResourcePartitioner.class);

    public static final String FILE_PATH = "filePath";
    private static final String PARTITION_KEY = "partition";

    private final Collection<Resource> resources;

    public MultiResourcePartitioner(Collection<Resource> resources) {
        this.resources = resources;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> map = new HashMap<>(gridSize);
        int i = 0;
        // One partition per resource; each context carries the path of its file.
        for (Resource resource : resources) {
            ExecutionContext context = new ExecutionContext();
            context.putString(FILE_PATH, getPath(resource)); // Depends on what logic you want to use to split
            map.put(PARTITION_KEY + i++, context);
        }
        return map;
    }

    private String getPath(Resource resource) {
        try {
            return resource.getFile().getPath();
        } catch (IOException e) {
            logger.warn("Can't get file from resource {}", resource);
            throw new RuntimeException(e);
        }
    }
}
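But what if I have a single 10 TB file? Does Spring Batch allow partitioning it in some way?
I tried the following approach to achieve what I want:
Two steps: the first step splits the file into pieces, and the second step processes the pieces produced by the first: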
@Configuration
public class SingleFilePartitionedJob {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;
    @Autowired
    private StepBuilderFactory stepBuilderFactory;
    @Autowired
    private ToLowerCasePersonProcessor toLowerCasePersonProcessor;
    @Autowired
    private DbPersonWriter dbPersonWriter;
    @Autowired
    private ResourcePatternResolver resourcePatternResolver;

    @Value("${app.file-to-split}")
    private Resource resource;

    @Bean
    public Job splitFileProcessingJob() throws IOException {
        return jobBuilderFactory.get("splitFileProcessingJob")
                .incrementer(new RunIdIncrementer())
                .flow(splitFileIntoPiecesStep())
                .next(csvToDbLowercaseMasterStep())
                .end()
                .build();
    }

    private Step splitFileIntoPiecesStep() throws IOException {
        return stepBuilderFactory.get("splitFile")
                .tasklet(new FileSplitterTasklet(resource.getFile()))
                .build();
    }

    @Bean
    public Step csvToDbLowercaseMasterStep() throws IOException {
        // Spring Batch's own MultiResourcePartitioner (no-arg constructor plus setResources),
        // not the custom class shown above.
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        // Note: the pattern is resolved here, while the bean is being created,
        // which is before the splitFile step has run.
        partitioner.setResources(resourcePatternResolver.getResources("split/*.csv"));
        return stepBuilderFactory.get("csvReaderMasterStep")
                .partitioner("csvReaderMasterStep", partitioner)
                .gridSize(10)
                .step(csvToDataBaseSlaveStep())
                .taskExecutor(jobTaskExecutorSplitted())
                .build();
    }

    @Bean
    public Step csvToDataBaseSlaveStep() throws MalformedURLException {
        return stepBuilderFactory.get("csvToDatabaseStep")
                .<Person, Person>chunk(50)
                .reader(csvPersonReaderSplitted(null)) // null is replaced by the step-scoped value at runtime
                .processor(toLowerCasePersonProcessor)
                .writer(dbPersonWriter)
                .build();
    }

    @Bean
    @StepScope
    public FlatFileItemReader<Person> csvPersonReaderSplitted(
            @Value("#{stepExecutionContext[fileName]}") String fileName) throws MalformedURLException {
        return new FlatFileItemReaderBuilder<Person>()
                .name("csvPersonReaderSplitted")
                .resource(new UrlResource(fileName))
                .delimited()
                .names(new String[]{"firstName", "lastName"})
                .fieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
                    setTargetType(Person.class);
                }})
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutorSplitted() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setMaxPoolSize(30);
        taskExecutor.setCorePoolSize(25);
        taskExecutor.setThreadNamePrefix("cust-job-exec2-");
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }
}
The tasklet:
public class FileSplitterTasklet implements Tasklet {

    private final Logger logger = LoggerFactory.getLogger(FileSplitterTasklet.class);

    private final File file;

    public FileSplitterTasklet(File file) {
        this.file = file;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Split the input into pieces of at most 100 rows each.
        int count = FileSplitter.splitTextFiles(file, 100);
        logger.info("File was split into {} files", count);
        return RepeatStatus.FINISHED;
    }
}
The file-splitting logic:
public static int splitTextFiles(File bigFile, int maxRows) throws IOException {
    // Every piece goes to the same "split" directory and uses the .csv extension,
    // so that the "split/*.csv" pattern used by the job matches all of them.
    Path splitDir = Paths.get("split");
    Files.createDirectories(splitDir);
    int fileCount = 1;
    try (BufferedReader reader = Files.newBufferedReader(bigFile.toPath())) {
        String line;
        int lineNum = 1;
        BufferedWriter writer = Files.newBufferedWriter(splitDir.resolve(fileCount + "split.csv"));
        try {
            while ((line = reader.readLine()) != null) {
                if (lineNum > maxRows) {
                    // The current piece is full: close it and start the next one.
                    writer.close();
                    lineNum = 1;
                    fileCount++;
                    writer = Files.newBufferedWriter(splitDir.resolve(fileCount + "split.csv"));
                }
                writer.append(line);
                writer.newLine();
                lineNum++;
            }
        } finally {
            writer.close();
        }
    }
    return fileCount;
}
So I put all the file pieces into a dedicated directory.
But this doesn't work, because the /split folder does not exist yet when the context is initialized.
I came up with a workaround that works:
public class MultiResourcePartitionerWrapper implements Partitioner {

    private final MultiResourcePartitioner multiResourcePartitioner = new MultiResourcePartitioner();
    private final ResourcePatternResolver resourcePatternResolver;
    private final String pathPattern;

    public MultiResourcePartitionerWrapper(ResourcePatternResolver resourcePatternResolver, String pathPattern) {
        this.resourcePatternResolver = resourcePatternResolver;
        this.pathPattern = pathPattern;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        try {
            // Resolve the pattern only now, when the partitioned step actually starts.
            Resource[] resources = resourcePatternResolver.getResources(pathPattern);
            multiResourcePartitioner.setResources(resources);
            return multiResourcePartitioner.partition(gridSize);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
But it looks ugly. Is this a correct solution?
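The deferral that the wrapper performs can be illustrated outside Spring. The following is only a sketch using plain java.nio; `LazyFilePartitions`, `partitionsFor` and the `partitionN` key format are hypothetical names chosen for illustration, not Spring Batch API. The point it shows is that the directory is listed when the method is called, so a directory created by an earlier step is seen, and a directory that does not exist yet simply yields no partitions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each file matching a glob to one partition key.
final class LazyFilePartitions {

    private LazyFilePartitions() {}

    // Lists `dir` at call time (not at construction time) and returns
    // partition key -> absolute file path, one entry per matching file.
    static Map<String, String> partitionsFor(Path dir, String glob) throws IOException {
        List<Path> files = new ArrayList<>();
        if (Files.isDirectory(dir)) {
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
                for (Path p : stream) {
                    files.add(p);
                }
            }
        }
        Collections.sort(files); // stable partition numbering across runs

        Map<String, String> partitions = new LinkedHashMap<>();
        int i = 0;
        for (Path file : files) {
            partitions.put("partition" + i++, file.toAbsolutePath().toString());
        }
        return partitions;
    }
}
```

Calling `LazyFilePartitions.partitionsFor(Paths.get("split"), "*.csv")` from inside a `partition(int gridSize)` implementation would mirror what the wrapper does with `ResourcePatternResolver`.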
小智 1
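Spring Batch lets you partition, but how exactly you do it is up to you.
You can simply split the 10 TB file inside the partitioner class (by number of pieces or by a maximum number of rows), and each partition then reads one split file. You can find many examples of how to split a large file in Java, for example "Splitting a very large text file by max rows".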
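As a rough illustration of the "split by a maximum number of rows" option, here is a minimal sketch in plain Java. The class name `LineCountSplitter`, the `split` method, the UTF-8 charset and the `<n>split.csv` naming are assumptions made for this example. It streams the input line by line, so memory use does not depend on the file size, and it returns the paths of the pieces so that a partitioner could put one path into each partition's `ExecutionContext`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a text file into pieces of at most maxRows lines.
final class LineCountSplitter {

    private LineCountSplitter() {}

    static List<Path> split(Path bigFile, Path targetDir, int maxRows) throws IOException {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        Files.createDirectories(targetDir);
        List<Path> pieces = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(bigFile, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                int rowsInPiece = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    // Open a new piece for the first line and whenever the current one is full.
                    if (writer == null || rowsInPiece == maxRows) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path piece = targetDir.resolve((pieces.size() + 1) + "split.csv");
                        pieces.add(piece);
                        writer = Files.newBufferedWriter(piece, StandardCharsets.UTF_8);
                        rowsInPiece = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    rowsInPiece++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return pieces; // empty when the input file has no lines
    }
}
```

With this shape, a partitioner could call `split(...)` at the start of `partition(int gridSize)` and create one `ExecutionContext` per returned path. Whether physically splitting a file of that size is acceptable depends on the available disk space and I/O budget.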