Mar*_*ark 209 java line-numbers large-files
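I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open them and read them line by line until I reach the end of the file.
I was wondering whether there is a smarter way to do that.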
mar*_*nus 234
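This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.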
public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
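Edit, 9 1/2 years later: I have practically no Java experience, but I tried to benchmark this code against the LineNumberReader solution below anyway, since it bothered me that nobody had done it. Especially for large files, my solution seems to be faster, although it appears to take a few runs before the optimizer does a decent job. I have played with the code a bit and produced a new version that is consistently the fastest: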
public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i = 0; i < 1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}
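Benchmark results for a 1.3GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none, and while it is only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.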
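For anyone who wants to repeat this kind of measurement, here is a minimal timing sketch. It is not the harness that produced the numbers above; the class `LineCountBenchmark` and its methods `timeRuns` and `median` are hypothetical names made up for illustration. It only shows the general pattern of timing repeated runs with System.nanoTime() and summarizing them with a value that is less sensitive to outliers than the mean.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

/** Hypothetical timing helper; an illustration, not the original benchmark code. */
final class LineCountBenchmark {

    /** Runs the task the given number of times and returns each run's elapsed nanoseconds. */
    static long[] timeRuns(Callable<?> task, int runs) throws Exception {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Median of the samples; for an even count, the (integer) mean of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

A call could look like `LineCountBenchmark.timeRuns(() -> countLinesNew("big.txt"), 100)`, where the file name is a placeholder. Since the JIT needs a few runs to warm up, as noted above, it is probably wise to discard the first samples or to use a dedicated tool such as JMH for serious comparisons.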
小智 198
I have implemented another solution to the problem, and I found it more efficient at counting lines:
int result; // declared outside the try block so it stays usable afterwards
try (
    FileReader input = new FileReader("input.txt");
    LineNumberReader count = new LineNumberReader(input)
) {
    while (count.skip(Long.MAX_VALUE) > 0) {
        // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
    }
    result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
DMu*_*gan 28
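The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file ending with a newline returns 1, but a two-line file without a trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work for every read except the last one, but its cost should be trivial compared to the overall function.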
public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if (endsWithoutNewLine) {
            ++count;
        }
        return count;
    } finally {
        is.close();
    }
}
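To make the trailing-newline rule concrete, here is a small sketch of the same idea applied to an arbitrary InputStream, so it can be tried on in-memory data. The class name `TrailingLineCounter` and the 8 KiB buffer size are assumptions for illustration only and are not part of the answer above.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper illustrating the trailing-newline rule on any byte stream. */
final class TrailingLineCounter {

    /** Counts '\n' bytes, plus one more line if the stream ends without a '\n'. */
    static long countLines(InputStream in) throws IOException {
        byte[] buffer = new byte[8192]; // assumed size; any non-empty buffer works
        long count = 0;
        boolean endsWithoutNewline = false;
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buffer[i] == '\n') {
                    count++;
                }
            }
            // Only the value from the final chunk matters once the loop ends.
            endsWithoutNewline = buffer[read - 1] != '\n';
        }
        return endsWithoutNewline ? count + 1 : count;
    }
}
```

With this rule an empty stream yields 0, "a" and "a\n" both yield 1, and "a\nb" and "a\nb\n" both yield 2. Note that it still treats only '\n' as a terminator, so a file using bare '\r' line endings would count as a single line.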
msa*_*yag 21
With Java 8, you can use streams:
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
    long numOfLines = lines.count();
    ...
}
小智 12
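The answer with the count() method above gave me wrong line counts when a file did not have a newline at its end: it failed to count the last line of the file.
This method works better for me: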
public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}
    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}
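One thing worth noting about the method above is that the reader is not closed if readLine() throws. A possible variant, sketched here under the assumption that the caller supplies the Reader (and therefore chooses the character encoding), uses try-with-resources. The class name `ReaderLineCounter` is hypothetical.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;

/** Hypothetical variant of the LineNumberReader approach using try-with-resources. */
final class ReaderLineCounter {

    /** Reads the source to the end, closes it, and returns the number of lines seen. */
    static int countLines(Reader source) throws IOException {
        try (LineNumberReader reader = new LineNumberReader(source)) {
            while (reader.readLine() != null) {
                // Nothing to do: readLine() advances the reader's own line counter.
            }
            return reader.getLineNumber();
        }
    }
}
```

For a file, the Reader could come from `Files.newBufferedReader(path, charset)`. Because readLine() treats "\n", "\r" and "\r\n" as terminators and also returns a final unterminated line, a last line without a newline is counted as well.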
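I know this is an old question, but the accepted solution did not quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):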
public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}
This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}
Tested on JDK8_u31. But its performance is indeed slow compared to this method:
/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }

        return count;
    }
}
Tested, and it is very fast.
小智 5
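I tested the above methods for counting lines; here are my observations for the different methods as tested on my system.
File size: 1.6 GB. Methods:
Also, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]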
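A caveat on the one-liner: the stream returned by Files.lines keeps the file open until the stream is closed, so it is usually wrapped in try-with-resources. The following is a minimal sketch of that; the class name `StreamLineCounter` and the explicit charset parameter are assumptions added for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical wrapper around Files.lines that always closes the underlying file. */
final class StreamLineCounter {

    /**
     * Counts the lines of the file, decoding it with the given charset.
     * I/O errors that occur while the stream is being consumed surface as
     * UncheckedIOException rather than IOException.
     */
    static long countLines(Path path, Charset charset) throws IOException {
        try (Stream<String> lines = Files.lines(path, charset)) {
            return lines.count();
        }
    }
}
```

A call such as `StreamLineCounter.countLines(Paths.get(filePath), StandardCharsets.UTF_8)` would then return the count as a long. Bytes that are not valid in the chosen charset cause an exception here instead of being silently replaced, so the charset should match the file.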