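Tag: file-processing

Randomly pick lines from a file without slurping it, using Unix

I have a file with 10^7 lines, from which I want to randomly choose 1/100 of the lines. This is the AWK code I have, but it slurps the whole file content beforehand, and my PC's memory cannot handle that. Is there another approach?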

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {  
  for ( i=1;i<=c ;i++ )  { 
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
 }' file

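One way to avoid holding the file in memory is selection sampling: when the total line count is known in advance (10^7 here, or obtained with a first counting pass), each line can be kept with probability needed/remaining, which yields exactly the requested number of lines in a single pass. The sketch below is written in Python rather than AWK; the function name `sample_exact` and its parameters are illustrative assumptions, and unlike the AWK code it does not skip empty lines.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_exact(lines: Iterable[str], total: int, k: int,
                 rng: Optional[random.Random] = None) -> Iterator[str]:
    """Yield exactly k of the `total` input lines, in file order, in one pass.

    Assumes `total` is the true number of lines and k <= total.
    Only one line is held in memory at a time.
    """
    rng = rng or random.Random()
    remaining = total  # lines not yet examined
    needed = k         # lines still to be picked
    for line in lines:
        if needed == 0:
            break
        # Keep this line with probability needed / remaining.
        if rng.random() * remaining < needed:
            yield line
            needed -= 1
        remaining -= 1


# Hypothetical usage for the file in the question:
#
#     with open("file") as f:
#         for line in sample_exact(f, total=10**7, k=10**5):
#             print(line, end="")
```

If the line count is not known, a first pass that only counts lines is still cheap in memory.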

unix linux awk random-sample file-processing

nev*_*int

2009 03-28

51
Score

7
Answers

40k
Views

Splitting command-line args with GNU parallel

Using GNU parallel: http://www.gnu.org/software/parallel/

I have a program that takes two arguments, e.g.

$ ./prog file1 file2
$ ./prog file2 file3
...
$ ./prog file23456 file23457


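I'm using a script that generates the pairs of file names, but this causes a problem because each result of the script is a single string, not a pair. Like: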

$ ./prog "file1 file2"


GNU parallel seems to have a whole slew of tricks, and I wonder if there is one for splitting text around a separator:

$ generate_file_pairs | parallel ./prog ?  
  # where ? is text under consideration, like "file1 file2"


A simple workaround is to split the args manually in prog, but I'd like to know whether it is possible in GNU parallel.
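Within GNU parallel itself, the `--colsep` option is documented for splitting each input line into columns that can be referenced as `{1}`, `{2}`; it is worth checking the man page of the installed version. As an illustration of the splitting step only, here is a minimal Python sketch of a wrapper that turns each "file1 file2" line into two separate arguments and runs the program concurrently. The names `split_pair` and `run_pairs`, the default `./prog`, and the job count are assumptions made for the example.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def split_pair(line: str) -> List[str]:
    """Turn 'file1 file2' into ['file1', 'file2'], honouring shell-style quotes."""
    return shlex.split(line)


def run_pairs(lines: Iterable[str], prog: str = "./prog", jobs: int = 4) -> List[int]:
    """Run `prog arg1 arg2` once per non-empty line; return exit codes in input order."""
    def run(line: str) -> int:
        # Each field becomes its own argv entry, so prog never sees "file1 file2".
        return subprocess.run([prog, *split_pair(line)]).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, (l for l in lines if l.strip())))
```

Fed with the output of `generate_file_pairs` on standard input, `run_pairs(sys.stdin)` would start one process per pair, at most `jobs` at a time.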

bash file-processing gnu-parallel

drh*_*des

2015 12-19

37
Score

1
Answers

10k
Views

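Processing a very large text file with lazy Text and ByteString

I'm trying to process a very large Unicode text file (6GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of each word's count as I traverse the file. The process takes too much time and too much memory (20GB+). I suspect the map is huge, but I'm not sure it should reach 5 times the size of the file! The code is shown below. Note that I tried the following:

Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better, in that its memory consumption grows more slowly.

Reading the file with a lazy ByteString instead of a lazy Text. I then decode it to Text for some processing and encode it back to ByteString for IO.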

import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main = do
    [file,out] <- getArgs
    h <- openFile file ReadMode
    hO <- openFile out WriteMode …

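For comparison, the counting step described above can be sketched as a streaming loop that never holds more than one line plus the table of distinct words. This is a Python illustration of the idea, not a fix for the Haskell program; the function name `count_words` and whitespace tokenisation are assumptions. Note that Python integers do not overflow, whereas a `Word16` counter as in the excerpt wraps around at 65535.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated words, reading the input one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        # Memory use is bounded by the number of distinct words, not by file size.
        counts.update(line.split())
    return counts


# Hypothetical usage:
#
#     with open("big.txt", encoding="utf-8") as f:
#         for word, n in count_words(f).most_common(10):
#             print(word, n)
```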

text haskell hashmap file-processing bigdata

has*_*ine

2015 08-04

10
Score

1
Answers

642
Views

Can I write a file to a folder on the server machine from a Web API app running on it?

I'm using this code in my Web API app to write to a CSV file:

private void SaveToCSV(InventoryItem invItem, string dbContext)
{
    string csvHeader = "id,pack_size,description,vendor_id,department,subdepartment,unit_cost,unit_list,open_qty,UPC_code,UPC_pack_size,vendor_item,crv_id";

    int dbContextAsInt = 0;
    int.TryParse(dbContext, out dbContextAsInt);
    string csvFilename = string.Format("Platypus{0}.csv", dbContextAsInt);

    string csv = string.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12}", invItem.ID,
        invItem.pksize, invItem.Description, invItem.vendor_id, invItem.dept, invItem.subdept, invItem.UnitCost,
        invItem.UnitList, invItem.OpenQty, invItem.UPC, invItem.upc_pack_size, invItem.vendor_item, invItem.crv_id);

    string existingContents;
    using (StreamReader sr = new StreamReader(csvFilename))
    {
        existingContents = sr.ReadToEnd();
    }

    using (StreamWriter writetext = File.AppendText(csvFilename))
    {
        if (!existingContents.Contains(csvHeader))
        {
            writetext.WriteLine(csvHeader);
        }
        writetext.WriteLine(csv);
    }
}


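On the dev machine, the csv file is saved to "C:\Program Files (x86)\IIS Express" by default. In preparation for deploying it to its final resting/working place, what do I need to do to save the file to, for example, a "Platypi" folder on the server? Is anything special required? Do I have to set specific folder permissions to allow writing to "Platypi"?

Is it just a matter of changing this line: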

string csvFilename = …

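Two points in the code above are independent of the platform: a relative file name resolves against the process's working directory (hence the IIS Express folder), and reading the whole file back just to detect the header fails when the file does not exist yet. A minimal sketch of the alternative, in Python rather than C#, builds an explicit absolute path and writes the header only when the file is new or empty. The function name `append_row` and the example folder are hypothetical; on a real server the account the application runs under still needs write permission on the target folder.

```python
import csv
from pathlib import Path
from typing import Sequence


def append_row(csv_path: Path, header: Sequence[str], row: Sequence[object]) -> None:
    """Append `row`, writing `header` first only when the file is new or empty."""
    # Create the target folder if it is missing (requires write permission).
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    needs_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # quotes fields that contain commas
        if needs_header:
            writer.writerow(header)
        writer.writerow(row)


# Hypothetical usage with an explicit absolute folder:
#
#     append_row(Path("/srv/Platypi") / "Platypus1.csv",
#                ["id", "description"], [42, "duck-billed, large"])
```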

c# file-permissions streamwriter file-processing asp.net-web-api

B. *_*non

2014 02-08

9
Score

1
Answers

20k
Views

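How can I do an SQL-like join in Perl?

I have to process some data by combining two different files. Both have two columns that together form a primary key, which I can use to match them side by side. The files in question are huge (around 5GB, with 20 million rows), so I need efficient code. How would I do this in Perl?

Let me give an example:

If file A contains the columns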

id, name, lastname, dob, school


and file B contains the columns

address, id, postcode, dob, email


I need to join the two files by matching id and dob in both, to get an output file with the columns:

 id, name, lastname, dob, school, address, postcode, email

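The join described above is commonly done as a hash join: index one file by the composite key (id, dob), then stream the other file and look each row up. The sketch below shows the shape of it in Python rather than Perl; the same structure maps onto a Perl hash keyed by the two values. It assumes comma-separated files with a header row, that the (id, dob) pair is unique in file B, and that file B's index fits in memory; the function names are illustrative. If neither file fits in memory, sorting both files by the key and merging them is the usual fallback.

```python
import csv
from typing import Dict, Iterable, Iterator, List, Tuple


def read_rows(path: str) -> Iterator[Dict[str, str]]:
    """Stream a CSV file with a header row as dicts (spaces after commas ignored)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f, skipinitialspace=True)


def join_on_id_dob(rows_a: Iterable[Dict[str, str]],
                   rows_b: Iterable[Dict[str, str]]) -> Iterator[List[str]]:
    """Inner join of A and B on (id, dob); A is streamed, B is indexed in memory."""
    index: Dict[Tuple[str, str], Dict[str, str]] = {}
    for b in rows_b:
        index[(b["id"], b["dob"])] = b
    for a in rows_a:
        b = index.get((a["id"], a["dob"]))
        if b is not None:
            yield [a["id"], a["name"], a["lastname"], a["dob"], a["school"],
                   b["address"], b["postcode"], b["email"]]


# Hypothetical usage:
#
#     for row in join_on_id_dob(read_rows("A.csv"), read_rows("B.csv")):
#         print(",".join(row))
```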

perl filemerge file-processing

sfa*_*tor

2012 01-03

8
Score

1
Answers

698
Views

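C: best way to go to a known line of a file

I have a file that I'd like to iterate over without processing the current line in any way. What I'm looking for is the best way to go to a given line of a text file. For example, storing each line in a variable seems useless until I reach the predetermined line.

Example:

file.txt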

foo
fooo
fo
here


Usually, in order to get to here, I would do something like:

FILE* file = fopen("file.txt", "r");
if (file == NULL)
    perror("Error when opening file ");
char currentLine[100];
while(fgets(currentLine, 100, file))
{
    if(strstr(currentLine, "here") != NULL)
         return currentLine;
}


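But fgets has to fully read three lines, and currentLine has to store foo, fooo and fo.

Is there a better way to do this, knowing that here is on line 4? Something like a go to, but for files?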
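Because lines in a text file have variable length, the byte offset of line 4 cannot be computed without scanning the earlier bytes at least once (fixed-width records are the exception). What can be avoided is repeating that scan: one pass records where each line starts, and later reads seek directly to the stored offset. The sketch below shows the idea in Python; in C the same pattern would use ftell and fseek. The function names are illustrative assumptions.

```python
from typing import BinaryIO, List


def build_line_index(f: BinaryIO) -> List[int]:
    """Scan the file once and return the byte offset at which each line starts."""
    offsets: List[int] = []
    f.seek(0)
    pos = 0
    for line in f:
        offsets.append(pos)
        pos += len(line)
    return offsets


def read_line(f: BinaryIO, offsets: List[int], line_no: int) -> bytes:
    """Jump straight to 1-based line `line_no` using a prebuilt index."""
    f.seek(offsets[line_no - 1])
    return f.readline()


# Hypothetical usage (binary mode keeps offsets exact):
#
#     with open("file.txt", "rb") as f:
#         index = build_line_index(f)
#         print(read_line(f, index, 4))
```

The index costs one integer per line and pays off only when the same file is accessed by line number more than once.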

c io file fgets file-processing

Bad*_*dda

2017 05-29

8
Score

3
Answers

823
Views

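Node.js: read a very large file (~10GB), process it line by line, then write to another file

I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node for this.

Although this approach works, it takes a very long time. In Java I was able to do this within 30-45 minutes, but in Node the same job takes more than 160 minutes. The code follows.

Below is the initiation code, which reads each line from the input.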

var path = '../10GB_input_file.txt';
var output_file = '../output.txt';

function fileopsmain(){
    fs.exists(output_file, function(exists){
        if(exists) {
            fs.unlink(output_file, function (err) {
                if (err) throw err;
                console.log('successfully deleted ' + output_file);
            });
        }
    });

    new lazy(fs.createReadStream(path, {bufferSize: 128 * 4096}))
        .lines
        .forEach(function(line){
            var line_arr = line.toString().split(';');
            perform_line_ops(line_arr, line_arr[6], line_arr[7], line_arr[10]);
        }
    );
}

This is the method that performs some operations on the line and passes the result to the write method, which writes it to the output file.

function perform_line_ops(line_arr, range_start, range_end, daynums){ var _new_lines = ''; …

file-io file-handling file-processing large-files node.js

HVT*_*VT7

lucky-day

7
Score

1
Answers

10k
Views

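How do I pick n random lines from a file in Perl?

Following up on this question, I need to get n complete lines at random from a file (or stdin). This would be similar to head or tail, except that I want some from the middle.

Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run?

For reference, I tried this: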

#!/usr/bin/perl -w
use strict;

my $ratio = shift;
print $ratio, "\n";

while (<>) {
    print if ((int rand $ratio) == 1);
}

where $ratio is the rough percentage of lines I want. For instance, if I want 1 in 10 lines:

random_select 10 a.list

However, this doesn't give me an exact amount:

aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
foreach? random_select 10 a.list | wc -l
foreach? end
4739
4865
4739
4889
4934
4809
4712
4842
4814
4817

Another idea I had was slurping the input file and then choosing n lines at random from the array, but that's a problem if I have a really big file. …
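A standard technique for getting exactly n lines in one pass over input of unknown length (including stdin) is reservoir sampling: keep the first n lines, then replace a random slot with the i-th line with probability n/i. The sketch below is in Python rather than Perl, and the function name `reservoir_sample` is an illustrative assumption. Memory use is n lines regardless of file size, and the result is not in file order.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], n: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return n items chosen uniformly at random in one pass (all items if fewer than n)."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Item number `seen` replaces a random slot with probability n / seen.
            j = rng.randrange(seen)
            if j < n:
                reservoir[j] = item
    return reservoir


# Hypothetical usage on standard input:
#
#     import sys
#     sys.stdout.writelines(reservoir_sample(sys.stdin, 10))
```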

perl random-sample file-processing

Nat*_*man

2017 05-23

6
Score

1
Answers

4143
Views

Parallel version of Files.walkFileTree (Java or Scala)

Does anyone know of a parallel equivalent of Java's Files.walkFileTree, or something similar? It can be a Java or Scala library.

java io multithreading scala file-processing

mat*_*att

lucky-day

6
Score

2
Answers

5042
Views

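Apache Commons IO file monitoring vs. JDK WatchService

I need to develop an application that processes csv files as soon as they are created in a predefined directory. A huge number of incoming files is expected.

I have seen applications using Apache Commons IO file monitoring in production. It works pretty well; I have seen it process as many as 21 million files in a day. It seems that Apache Commons IO file monitoring polls the directory and does a listFiles to process the files.

My question: is JDK WatchService as good as Apache Commons IO file monitoring? Does anyone know of any pros and cons?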
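The polling approach described in the question (list the directory, compare with what has been seen before) can be sketched in a few lines. This is a Python illustration of one polling step, not the Apache Commons IO implementation; the function name `new_files` and the use of a plain set as state are assumptions. Its cost per step grows with the number of files in the directory, which is the main trade-off against an event-based API.

```python
import os
from typing import Set


def new_files(directory: str, seen: Set[str]) -> Set[str]:
    """One polling step: return names of files not seen before and remember them."""
    with os.scandir(directory) as entries:
        current = {e.name for e in entries if e.is_file()}
    fresh = current - seen
    seen |= fresh  # update the caller's state in place
    return fresh


# Hypothetical polling loop:
#
#     import time
#     seen: Set[str] = set()
#     while True:
#         for name in sorted(new_files("/data/incoming", seen)):
#             print("process", name)
#         time.sleep(1.0)
```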

java file-processing watchservice apache-commons-io

Sap*_*asu

2015 10-04

5
Score

1
Answers

3140
Views

Tag statistics

file-processing ×10

io ×2

java ×2

perl ×2

random-sample ×2

apache-commons-io ×1

asp.net-web-api ×1

awk ×1

bash ×1

bigdata ×1

c ×1

c# ×1

fgets ×1

file ×1

file-handling ×1

file-io ×1

file-permissions ×1

filemerge ×1

gnu-parallel ×1

hashmap ×1

haskell ×1

large-files ×1

linux ×1

multithreading ×1

node.js ×1

scala ×1

streamwriter ×1

text ×1

unix ×1

watchservice ×1
